How to query S3 data from Athena using SQL | AWS Athena Hands On Tutorial | Create Athena Tables

AWS Made Easy

4 Feb 2023 | 08:11

Summary

TLDR: This video tutorial demonstrates how to use Amazon Athena to query data stored in an S3 bucket with SQL. The presenter uploads a Netflix TV shows and movies dataset to S3, creates a table using an AWS Glue crawler, and then runs sample queries in Athena. The video highlights Athena's ability to analyze data directly in S3 without first moving it into a traditional database.

Takeaways

💻 Amazon Athena is an interactive query service that lets you query data stored in S3 using standard SQL.
🗂️ The example dataset used in the video is a Netflix TV shows and movies dataset downloaded from Kaggle, containing information like show ID, title, and director.
📂 The first step is uploading the dataset to an S3 bucket, where a folder named 'Netflix data' is created for this purpose.
🔍 An AWS Glue crawler is used to automatically scan the S3 file, infer the schema, and create a corresponding table in the Glue Data Catalog, which Athena can then query.
🔧 The video demonstrates how to create a Glue Crawler, configure it, and run it to generate the table needed for querying.
🛠️ If you don't have the necessary IAM role, Glue can automatically create one with the required permissions for scanning and table creation.
🏛️ A new database, named 'Netflix DB', is created in Athena to store the table generated by the Glue Crawler.
📊 The video shows how to query the newly created table using SQL, including setting up the query output location in S3.
🎬 You can run specific queries in Athena, such as finding movies directed by a particular person or filtering content based on country.
📈 Amazon Athena allows you to query data directly from S3 without needing to move it to a traditional database, making data analysis more efficient.

Q & A

What is Amazon Athena and how does it work?
-Amazon Athena is an interactive query service that allows users to run SQL queries on data stored in Amazon S3. It relies on a table definition for the data, registered in a catalog, and then lets users query that data using standard SQL syntax.
Where is the sample dataset for the video from?
-The sample dataset used in the video is from Kaggle and it contains information about Netflix TV shows and movies.
What is the structure of the Netflix dataset?
-The Netflix dataset has a simple schema that includes fields such as show ID, type, title, director, and other general information about TV shows and movies.
How do you upload the dataset to an S3 bucket?
-To upload the dataset to an S3 bucket, you select the bucket, create a folder (e.g., 'Netflix data'), select the CSV file, and then click on upload.
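The same upload can also be scripted. The following is a minimal sketch, assuming the boto3 SDK is used and that `s3_client` is created with `boto3.client("s3")`. The bucket name, the `netflix-data` prefix, and the file name `netflix_titles.csv` are hypothetical placeholders, not values confirmed by the video.

```python
def build_s3_key(folder: str, local_path: str) -> str:
    """Return the object key for a local file placed under an S3 'folder' prefix."""
    # Keep only the file name, whichever path separator the local path uses.
    filename = local_path.replace("\\", "/").rsplit("/", 1)[-1]
    return f"{folder.strip('/')}/{filename}"


def upload_dataset(s3_client, bucket: str, folder: str, local_path: str) -> str:
    """Upload the CSV file and return its s3:// URI.

    s3_client is assumed to be boto3.client("s3"); it is passed in so the
    function can be exercised without real AWS credentials.
    """
    key = build_s3_key(folder, local_path)
    s3_client.upload_file(local_path, bucket, key)  # Filename, Bucket, Key
    return f"s3://{bucket}/{key}"


# Example call (hypothetical names):
# import boto3
# upload_dataset(boto3.client("s3"), "my-example-bucket", "netflix-data", "netflix_titles.csv")
```

S3 has no real folders; the "folder" is just a key prefix, which is why the sketch builds the key by joining the prefix and the file name.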
What is the purpose of creating a table catalog in Athena?
-Creating a table catalog in Athena allows you to define the schema of the data and makes it easier to query the data stored in S3 using SQL.
What is a Glue crawler in AWS Glue?
-A Glue crawler is an AWS Glue tool that automatically scans data in a data store, infers the schema, and creates a table for that data in the metadata catalog.
How does AWS Glue help in creating a table for the data?
-AWS Glue helps by using a Glue crawler to scan the data, infer the schema, and create a table in the metadata catalog, which can then be queried using Athena.
What is an IAM role in AWS and why is it needed for the crawler?
-An IAM role in AWS is an identity with a set of permissions that defines what actions the user or service assuming it can perform. The crawler needs one so that it is allowed to scan the S3 folder, infer the schema, and create the table.
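For readers who prefer to create the role themselves rather than let Glue generate it, the sketch below shows roughly what such a role could contain. It assumes boto3 and an `iam_client` created with `boto3.client("iam")`. The role name, bucket, prefix, and the inline policy name `crawler-s3-read` are hypothetical, and the role Glue creates in the console may differ in its details.

```python
import json

# AWS managed policy with the baseline permissions for Glue service roles.
GLUE_SERVICE_POLICY_ARN = "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole"


def glue_trust_policy() -> dict:
    """Trust policy that lets the AWS Glue service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "glue.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def s3_read_policy(bucket: str, prefix: str) -> dict:
    """Inline policy granting read access to objects under one prefix."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/{prefix.strip('/')}/*"],
            },
            {
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
            },
        ],
    }


def create_crawler_role(iam_client, role_name: str, bucket: str, prefix: str) -> None:
    """Create the role, attach the Glue managed policy, and add S3 read access."""
    iam_client.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(glue_trust_policy()),
    )
    iam_client.attach_role_policy(RoleName=role_name, PolicyArn=GLUE_SERVICE_POLICY_ARN)
    iam_client.put_role_policy(
        RoleName=role_name,
        PolicyName="crawler-s3-read",  # hypothetical inline policy name
        PolicyDocument=json.dumps(s3_read_policy(bucket, prefix)),
    )
```

Scoping the S3 statement to a single prefix keeps the crawler from reading unrelated objects in the same bucket.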
How do you run a crawler in AWS Glue?
-To run a crawler in AWS Glue, you create a crawler, specify the source data store, set up the IAM role, define the database to store the results, and then click on 'Run Crawler'.
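The console steps above have a rough programmatic equivalent. This is a minimal sketch, assuming boto3 and a `glue_client` created with `boto3.client("glue")`. The crawler name, role name, and S3 path in the example call are hypothetical, and the polling loop is one simple way to wait for completion, not the method shown in the video.

```python
import time


def crawler_definition(name: str, role: str, database: str, s3_path: str) -> dict:
    """Build the keyword arguments for glue_client.create_crawler."""
    return {
        "Name": name,
        "Role": role,                # IAM role name or ARN the crawler assumes
        "DatabaseName": database,    # catalog database that will hold the table
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def run_crawler(glue_client, definition: dict,
                poll_seconds: float = 10.0, timeout_seconds: float = 600.0) -> None:
    """Create the crawler, start it, and wait until it reports READY again."""
    name = definition["Name"]
    glue_client.create_crawler(**definition)
    glue_client.start_crawler(Name=name)

    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        # The crawler state is one of READY, RUNNING, or STOPPING.
        state = glue_client.get_crawler(Name=name)["Crawler"]["State"]
        if state == "READY":
            return
        time.sleep(poll_seconds)
    raise TimeoutError(f"crawler {name!r} did not finish within {timeout_seconds} seconds")


# Example call (hypothetical names):
# import boto3
# run_crawler(boto3.client("glue"), crawler_definition(
#     "netflix-crawler", "AWSGlueServiceRole-demo", "netflixdb",
#     "s3://my-example-bucket/netflix-data/"))
```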
What is the significance of configuring the query output location in Athena?
-Configuring the query output location in Athena specifies where the results of the queries will be stored in S3, making it easier to access and analyze the query results.
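When queries are submitted through the API instead of the console, the output location is passed with each query. The sketch below assumes boto3 and an `athena_client` created with `boto3.client("athena")`. The results bucket in the example call is a hypothetical placeholder.

```python
import time

FINAL_STATES = ("SUCCEEDED", "FAILED", "CANCELLED")


def run_query(athena_client, sql: str, database: str, output_location: str,
              poll_seconds: float = 1.0) -> str:
    """Submit a query, wait for it to finish, and return its execution ID."""
    started = athena_client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        # S3 location where Athena writes the result files for this query.
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    while True:
        execution = athena_client.get_query_execution(QueryExecutionId=query_id)
        status = execution["QueryExecution"]["Status"]
        if status["State"] in FINAL_STATES:
            break
        time.sleep(poll_seconds)

    if status["State"] != "SUCCEEDED":
        reason = status.get("StateChangeReason", "no reason reported")
        raise RuntimeError(f"query {query_id} ended as {status['State']}: {reason}")
    return query_id


# Example call (hypothetical bucket):
# import boto3
# run_query(boto3.client("athena"), "SELECT 1", "netflixdb",
#           "s3://my-example-bucket/athena-results/")
```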
Can you provide an example of a SQL query that could be run on the Netflix dataset?
-An example SQL query could be "SELECT * FROM netflixdb.netflix_data WHERE director = 'Vikram';", which returns every title in the dataset whose director field is exactly 'Vikram'.
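A query like this can be built and its results read back in code. The following is a minimal sketch: the helper names are hypothetical, the table name comes from the example above, and `rows_to_dicts` assumes the response shape returned by boto3's `get_query_results` for a SELECT query, where the first row of the first page holds the column names.

```python
def sql_string_literal(value: str) -> str:
    """Quote a value as a SQL string literal, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def titles_by_director_sql(director: str, table: str = "netflixdb.netflix_data") -> str:
    """Build the example query for an arbitrary director name."""
    return f"SELECT * FROM {table} WHERE director = {sql_string_literal(director)}"


def rows_to_dicts(results: dict) -> list:
    """Turn one page of get_query_results output into a list of row dicts."""
    rows = results["ResultSet"]["Rows"]
    if not rows:
        return []
    header = [cell.get("VarCharValue") for cell in rows[0]["Data"]]
    # NULL cells come back without a VarCharValue key, so .get() yields None.
    return [
        dict(zip(header, (cell.get("VarCharValue") for cell in row["Data"])))
        for row in rows[1:]
    ]


# Example use with a real client (query_id from a finished query):
# page = athena_client.get_query_results(QueryExecutionId=query_id)
# for row in rows_to_dicts(page):
#     print(row["title"], row["director"])
```

The sketch handles a single page only; larger result sets are paginated, and only the first page carries the header row.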