How to query S3 data from Athena using SQL | AWS Athena Hands On Tutorial | Create Athena Tables
Summary
TLDRThis video tutorial demonstrates how to use Amazon Athena to query data stored in an S3 bucket with SQL. The presenter uploads a Netflix TV shows and movies dataset to S3, creates a table using AWS Glue crawler, and then performs sample queries in Athena. The video highlights Athena's capability to analyze data directly in S3 without the need to move it into a traditional database, showcasing its efficiency and ease of use.
Takeaways
- 💻 Amazon Athena allows you to query data stored in S3 using simple SQL, making it an interactive and flexible service.
- 🗂️ The example dataset used in the video is a Netflix TV shows and movies dataset downloaded from Kaggle, containing information like show ID, title, and director.
- 📂 The first step is uploading the dataset to an S3 bucket, where a folder named 'Netflix data' is created for this purpose.
- 🔍 An AWS Glue crawler is used to automatically scan the S3 file, infer the schema, and create a corresponding table that Athena can query.
- 🔧 The video demonstrates how to create a Glue Crawler, configure it, and run it to generate the table needed for querying.
- 🛠️ If you don't have the necessary IAM role, Glue can automatically create one with the required permissions for scanning and table creation.
- 🏛️ A new database, named 'Netflix DB', is created in Athena to store the table generated by the Glue Crawler.
- 📊 The video shows how to query the newly created table using SQL, including setting up the query output location in S3.
- 🎬 You can run specific queries in Athena, such as finding movies directed by a particular person or filtering content based on country.
- 📈 Amazon Athena allows you to query data directly from S3 without needing to move it to a traditional database, making data analysis more efficient.
Q & A
What is Amazon Athena and how does it work?
-Amazon Athena is an interactive query service that allows users to run SQL queries on data stored in Amazon S3. It works by creating a table catalog for the data and then enabling users to query the data using standard SQL syntax.
Where is the sample dataset for the video from?
-The sample dataset used in the video is from Kaggle and it contains information about Netflix TV shows and movies.
What is the structure of the Netflix dataset?
-The Netflix dataset has a simple schema that includes fields such as show ID, type, title, director, and other general information about TV shows and movies.
How do you upload the dataset to an S3 bucket?
-To upload the dataset to an S3 bucket, you select the bucket, create a folder (e.g., 'Netflix data'), select the CSV file, and then click on upload.
What is the purpose of creating a table catalog in Athena?
-Creating a table catalog in Athena allows you to define the schema of the data and makes it easier to query the data stored in S3 using SQL.
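As an alternative to running a Glue crawler, the same catalog table can be declared by hand with Athena DDL. The sketch below is illustrative only: the bucket name (my-bucket), the folder path, and the trimmed column list are assumptions based on the dataset shown in the video, not values confirmed by it.

```sql
-- Hypothetical bucket/path and a trimmed column list; adjust to your CSV.
-- OpenCSVSerDe handles quoted fields containing commas.
CREATE EXTERNAL TABLE IF NOT EXISTS netflixdb.netflix_data (
  show_id      string,
  type         string,
  title        string,
  director     string,
  country      string,
  release_year bigint   -- the crawler inferred bigint for this column in the video
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
LOCATION 's3://my-bucket/Netflix data/'
TBLPROPERTIES ('skip.header.line.count' = '1');
```

Writing the DDL yourself is useful when you already know the schema; the crawler approach shown in the video is more convenient when you want the schema inferred for you.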
What is a glue crawler in AWS Glue?
-A glue crawler in AWS Glue is a tool that automatically scans data stored in a data store, infers the schema, and creates a metadata catalog table for the data.
How does AWS Glue help in creating a table for the data?
-AWS Glue helps by using a Glue crawler to scan through the data, infer the schema, and create a table in the metadata catalog, which can then be queried using Athena.
What is an IAM role in AWS and why is it needed for the crawler?
-An IAM role in AWS is a set of permissions that defines what actions a user or service can perform. It is needed for the crawler to grant the necessary permissions to scan the S3 folder, infer the schema, and create the table.
How do you run a crawler in AWS Glue?
-To run a crawler in AWS Glue, you create a crawler, specify the source data store, set up the IAM role, define the database to store the results, and then click on 'Run Crawler'.
What is the significance of configuring the query output location in Athena?
-Configuring the query output location in Athena specifies where the results of the queries will be stored in S3, making it easier to access and analyze the query results.
Can you provide an example of a SQL query that could be run on the Netflix dataset?
-An example SQL query could be: SELECT * FROM netflixdb.netflix_data WHERE director = 'Vikram Bhatt'; — this finds all movies directed by Vikram Bhatt.
Outlines
📚 Introduction to Querying S3 Data with Amazon Athena
This paragraph introduces the video's main topic: how to use Amazon Athena to query data stored in an S3 bucket using SQL. The presenter explains that Athena is an interactive query service and demonstrates the process with a Netflix TV shows and movies dataset from Kaggle. The initial steps involve uploading the dataset to an S3 bucket, creating a folder named 'Netflix data', and then using Athena to analyze the data. The presenter guides viewers through launching the Athena query editor and creating a catalog table for the uploaded file using an AWS Glue crawler, which automatically scans the file, infers the schema, and creates a table.
🔍 Querying and Analyzing Data with Amazon Athena
In this paragraph, the presenter continues the tutorial by explaining how to verify the uploaded data and use Athena to query it. The process includes creating a crawler in AWS Glue to scan the S3 folder and create a table. The presenter details the steps to configure an IAM role with necessary permissions for the crawler to access the S3 bucket, infer the schema, and create the table. After the crawler has completed its task, the presenter shows how to verify the creation of the table in both Athena and Glue. The paragraph concludes with a demonstration of how to run SQL queries on the newly created table in Athena, including configuring the query output location in S3 and executing sample queries to filter data based on specific criteria such as director and country.
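The queries described in this paragraph look roughly like the following. The table and database names follow the crawler output shown in the video; the director value is one example drawn from the transcript, so substitute your own filter values.

```sql
-- Preview the first 10 rows (what Athena's "Preview table" action generates):
SELECT * FROM netflixdb.netflix_data LIMIT 10;

-- All titles by a specific director:
SELECT * FROM netflixdb.netflix_data WHERE director = 'Vikram Bhatt';

-- All TV shows and movies from a specific country:
SELECT * FROM netflixdb.netflix_data WHERE country = 'India';
```

Because the table is just metadata over the CSV in S3, each query scans the file in place; no data is loaded into a database first.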
Keywords
💡Amazon Athena
💡S3 Bucket
💡SQL
💡Kaggle
💡Schema
💡CSV File
💡AWS Glue Crawler
💡IAM Role
💡Database
💡Query Output Location
💡Data Analysis
Highlights
Introduction to querying data in an S3 bucket using SQL with Amazon Athena.
Amazon Athena is an interactive query service that allows SQL queries on S3 data.
Demonstration of uploading a Netflix TV shows and movies dataset to S3.
Explanation of creating a folder in S3 for organizing the dataset.
Instructions on uploading a CSV file to the designated S3 folder.
Verification of successful file upload to the S3 bucket.
Accessing the Athena console to start querying the uploaded file.
Creating a table catalog in Athena for the uploaded file.
Utilizing AWS Glue Crawler to scan and infer schema for the dataset.
Description of the Glue Crawler process and its role in table creation.
Steps to create a new IAM role for Glue Crawler with necessary permissions.
Running the Glue Crawler to scan the S3 folder and create a table.
Verification of the table creation in both Athena and AWS Glue.
Configuring the query output location in Athena for storing results.
Previewing the table in Athena to see the inferred schema.
Executing SQL queries to analyze specific data within the dataset.
Example query to find movies directed by a specific director.
Example query to filter TV shows and movies from a specific country.
Emphasizing the ability to analyze data in S3 without moving it to a database.
Conclusion highlighting the benefits of using Athena for SQL querying on S3 data.
Transcripts
hello everyone in this video we are
going to see how to query data that is
sitting in your S3 bucket using SQL from
Amazon Athena Amazon Athena is an
interactive query service that lets you
query the data in S3 using simple SQL
okay so let's see how to do that so to
that I have downloaded the sample data
set from kaggle it's a Netflix TV shows
and movies data set and this is how it
looks like it has you know simple schema
show ID type title director you know
general information about the TV shows
and movies okay so uh first thing let's
upload this data set into our S3 and
then see how to analyze this data using
SQL from Athena
okay I'm going to upload this data into
my S3 bucket I'm going to select this
bucket
I'm going to create a folder and call it
Netflix
data
and I will upload the data into this
folder
click on add files select this CSV file
and click on upload
okay uh so once the file is uploaded
just verify if it is uploaded
successfully and once it is done let's
go to Athena and see how to query this
uh file from Athena using SQL okay so
once you are in Athena console this is
how it looks like click on launch query
editor and uh so this is how it looks
okay so first thing is we need to create
a catalog table for this file that
we just uploaded and then we will start
querying that file so to do that you can
click on create here okay and select
glue crawler okay so what glue crawler
does is it automatically like scans
through this file and infers the schema
and creates a table for you
okay
okay so I just clicked on AWS glue
crawler here and it took me into this
screen I'm going to create a crawler
here and call it as Netflix uh
data
crawler okay select next and the source
type is data store because the catalog
table is not existing already and I will
select crawl all folders and click on next
and the data store is S3 the connection
is not required here and we will select
the path in which we want to crawl okay
going to expand this and I'm going to
select this entire folder and we need to
add a slash in the end okay so that
it crawls through all the files under
this folder okay I'm going to click on
next here add another data store now
click on next here and here it is asking
uh if we
we want to create a new IAM role so if
you uh you can type in any role that you
want so what it does is it automatically
creates an IAM role on your behalf and
grants all the required permissions like
the permission to scan through this S3
folder to infer the schema and
everything and also create table and all
those things okay so if you already have
an IAM role with all those permissions
you can click this choose an existing IAM
role and select the role that you want
or if you don't have it you can just
create a new role here I am going to
call it as
Netflix
data crawler
IAM role okay so what it does is it
will create an IAM role with
this name and attach all the required
permissions to that okay but to do this
make sure that you have all these three
permissions create role create policy
and attach role policy only then uh Glue
will be able to create this IAM role on
your behalf okay so now let's click on
next here
and this is the frequency with which you
want to run your crawler I'm going to
select run on demand because I only run
it I want to run it only when I need it
let's click on next here so this is a
database to which the table uh will be
added so if you want to create a new
database click on add database here and
you can type in the database name here
I'm going to call it Netflix DB and all
other things are optional I'm going to
click on create so it will create a
database called Netflix DB and then it
will add the table under the database
Okay click on next and then click on
finish
okay so here it has created that crawler
here so select that crawler and click on
run crawler here okay so it says that
the crawler is now running
you can keep refreshing here and once
that is done it will automatically
create a table for you
okay it has completed running and now
it's in stopping state so let's wait for
this crawler to come back to uh ready
state
okay so it says that crawler is
completed and it has created one table
so now let's go to Athena and see if you
you can also verify it in Glue itself if
you click on tables here
it actually shows that it has created
this Netflix data table okay so let's go
back to Athena here and I'll just
refresh this
and if I if you see here it has created
that Netflix DB and you can see the table
here okay so if you expand that this is
the schema that it has inferred for us
so show ID string and uh title is string
I think
release year here is a bigint so yeah it has
automatically inferred the schema and
created this table for us now let's see
how to query this table so before you do
that if this is the first time you are
running a query in Athena you need to
configure your query output location so
query output location is basically where
Athena will store the results of your
queries in S3 you can go to settings and
you can click on you know manage here
and you can configure your query output
location to some S3 path okay so if you
have already done that that's fine you can
start querying this table so you can
click on this three dots here I can just
do a preview table
okay so if you see uh it just ran select
star from
netflixdb.netflix_data limit 10 so it's
just showing uh some 10 rows of the data
okay so you can run any uh like
any query that you want to analyze this
data so if you want to see all the
movies that are directed by let's say
Vikram but okay so I can say select star
from this
where
director is equal to
Vikram Bhatt and click on run
okay so there is only one movie which is
uh
directed by Vikram Bhatt so uh yeah you
can do I mean pretty much anything that
you would do in a SQL database to uh
like
analyze the data
so if you want to see something like
where country is equal to
India so there's a country option here
so let's see uh all the movies and TV
shows from India click on run
okay it looks like maybe there is
India yeah okay let's run this
okay yeah so if you see we can see a lot
of uh results here so these are all the
shows and you know TV shows and movies
from India so yeah you can use Athena uh
like just to run your SQL queries just
like any other SQL database so but the
beauty of Athena is that you need not
move the data from your S3 into any
database your data is still sitting in
uh S3 but you have created a table on
top of it and interactively querying the
data so yeah that's how you can analyze
your S3 data using SQL without
moving the data into a database I hope
you found this video helpful if you did
please uh like this video and uh also
subscribe to my channel and I'll see you
in the next video thank you