How to query S3 data from Athena using SQL | AWS Athena Hands On Tutorial | Create Athena Tables

AWS Made Easy
4 Feb 2023 · 08:11

Summary

TLDR: This video tutorial demonstrates how to use Amazon Athena to query data stored in an S3 bucket with SQL. The presenter uploads a Netflix TV shows and movies dataset to S3, creates a table using an AWS Glue crawler, and then runs sample queries in Athena. The video highlights Athena's ability to analyze data directly in S3 without first moving it into a traditional database.

Takeaways

  • 💻 Amazon Athena allows you to query data stored in S3 using simple SQL, making it an interactive and flexible service.
  • 🗂️ The example dataset used in the video is a Netflix TV shows and movies dataset downloaded from Kaggle, containing information like show ID, title, and director.
  • 📂 The first step is uploading the dataset to an S3 bucket, where a folder named 'Netflix data' is created for this purpose.
  • 🔍 An AWS Glue crawler is used to automatically scan the S3 file, infer the schema, and create a corresponding table in the Glue Data Catalog, which Athena then queries.
  • 🔧 The video demonstrates how to create a Glue Crawler, configure it, and run it to generate the table needed for querying.
  • 🛠️ If you don't have a suitable IAM role, the Glue console can create one for you with the permissions the crawler needs for scanning and table creation.
  • 🏛️ A new database, named 'Netflix DB', is created in Athena to store the table generated by the Glue Crawler.
  • 📊 The video shows how to query the newly created table using SQL, including setting up the query output location in S3.
  • 🎬 You can run specific queries in Athena, such as finding movies directed by a particular person or filtering content based on country.
  • 📈 Amazon Athena allows you to query data directly from S3 without needing to move it to a traditional database, making data analysis more efficient.
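
The upload and crawler steps in the takeaways can also be scripted. The following is a minimal sketch, not the presenter's method (the video uses the AWS console). It assumes boto3-style S3 and Glue clients are passed in, for example `boto3.client("s3")` and `boto3.client("glue")`, that the target Glue database already exists, and that the IAM role already has the required permissions. The function name and every argument value are hypothetical placeholders.

```python
import os
import time


def upload_and_crawl(s3, glue, *, bucket, local_csv, prefix, crawler, role,
                     database, poll_seconds=10):
    """Upload a CSV to S3, then create and run a Glue crawler over its folder.

    `s3` and `glue` are assumed to be boto3 clients (or objects exposing the
    same methods). Every name passed in is a caller-supplied placeholder.
    Returns the S3 key the file was uploaded to.
    """
    folder = prefix.strip("/")
    key = f"{folder}/{os.path.basename(local_csv)}"
    s3.upload_file(local_csv, bucket, key)

    # Point the crawler at the folder rather than the single file, so the
    # table it creates covers everything stored under that prefix.
    glue.create_crawler(
        Name=crawler,
        Role=role,
        DatabaseName=database,
        Targets={"S3Targets": [{"Path": f"s3://{bucket}/{folder}/"}]},
    )
    glue.start_crawler(Name=crawler)

    # A crawler reports the READY state again once its run has finished.
    while glue.get_crawler(Name=crawler)["Crawler"]["State"] != "READY":
        time.sleep(poll_seconds)
    return key
```

Passing the clients in as arguments keeps the sketch independent of credentials and region configuration, and makes it easy to try out with stand-in objects before pointing it at a real account.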

Q & A

  • What is Amazon Athena and how does it work?

    -Amazon Athena is an interactive query service that allows users to run SQL queries on data stored in Amazon S3. It works by creating a table catalog for the data and then enabling users to query the data using standard SQL syntax.

  • Where is the sample dataset for the video from?

    -The sample dataset used in the video is from Kaggle and it contains information about Netflix TV shows and movies.

  • What is the structure of the Netflix dataset?

    -The Netflix dataset has a simple schema that includes fields such as show ID, type, title, director, and other general information about TV shows and movies.

  • How do you upload the dataset to an S3 bucket?

    -To upload the dataset to an S3 bucket, you select the bucket, create a folder (e.g., 'Netflix data'), select the CSV file, and then click on upload.

  • What is the purpose of creating a table catalog in Athena?

    -Creating a table catalog in Athena allows you to define the schema of the data and makes it easier to query the data stored in S3 using SQL.

  • What is a crawler in AWS Glue?

    -A Glue crawler is a tool that automatically scans data in a data store, infers its schema, and creates a table for it in the metadata catalog.

  • How does AWS Glue help in creating a table for the data?

    -AWS Glue helps by using a crawler to scan through the data, infer the schema, and create a table in the metadata catalog, which can then be queried using Athena.

  • What is an IAM role in AWS and why is it needed for the crawler?

    -An IAM role is an AWS identity with attached permissions that a user or service can assume. The crawler needs one so that it is allowed to scan the S3 folder, infer the schema, and create the table.

  • How do you run a crawler in AWS Glue?

    -To run a crawler in AWS Glue, you create a crawler, specify the source data store, set up the IAM role, define the database to store the results, and then click on 'Run Crawler'.

  • What is the significance of configuring the query output location in Athena?

    -Configuring the query output location in Athena specifies where the results of the queries will be stored in S3, making it easier to access and analyze the query results.

  • Can you provide an example of a SQL query that could be run on the Netflix dataset?

    -An example SQL query could be: SELECT * FROM netflixdb.netflix_data WHERE director = 'Vikram'; This returns every row whose director field is exactly 'Vikram'.
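
The query step, including the output location discussed above, can be scripted as well. This is a minimal sketch rather than anything shown in the video. It assumes a boto3-style Athena client is passed in, for example `boto3.client("athena")`. The function name and the output location are hypothetical, while the database and table names come from the example query in the Q&A.

```python
import time

# Example query from the Q&A; the database and table names come from there.
EXAMPLE_SQL = "SELECT * FROM netflixdb.netflix_data WHERE director = 'Vikram'"


def run_athena_query(athena, sql, *, database, output_location, poll_seconds=1):
    """Start an Athena query and wait until it reaches a terminal state.

    `athena` is assumed to be a boto3 Athena client or an object exposing the
    same two methods. Returns a (query_execution_id, final_state) tuple.
    """
    started = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        # Athena writes the result files under this S3 location.
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return query_id, state
        time.sleep(poll_seconds)  # still QUEUED or RUNNING
```

A call might look like `run_athena_query(client, EXAMPLE_SQL, database="netflixdb", output_location="s3://my-bucket/athena-results/")`, where the bucket path is a placeholder. Once the state is SUCCEEDED, the rows can be fetched with the client's `get_query_results` call or read from the result file in the output location.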
