How to query S3 data from Athena using SQL | AWS Athena Hands On Tutorial | Create Athena Tables

AWS Made Easy
4 Feb 202308:11

Summary

TLDRThis video tutorial demonstrates how to use Amazon Athena to query data stored in an S3 bucket with SQL. The presenter uploads a Netflix TV shows and movies dataset to S3, creates a table using AWS Glue crawler, and then performs sample queries in Athena. The video highlights Athena's capability to analyze data directly in S3 without the need to move it into a traditional database, showcasing its efficiency and ease of use.

Takeaways

  • 💻 Amazon Athena allows you to query data stored in S3 using simple SQL, making it an interactive and flexible service.
  • 🗂️ The example dataset used in the video is a Netflix TV shows and movies dataset downloaded from Kaggle, containing information like show ID, title, and director.
  • 📂 The first step is uploading the dataset to an S3 bucket, where a folder named 'Netflix data' is created for this purpose.
  • 🔍 Amazon Glue Crawler is used to automatically scan the S3 file, infer the schema, and create a corresponding table in Athena.
  • 🔧 The video demonstrates how to create a Glue Crawler, configure it, and run it to generate the table needed for querying.
  • 🛠️ If you don't have the necessary IAM role, Glue can automatically create one with the required permissions for scanning and table creation.
  • 🏛️ A new database, named 'Netflix DB', is created in Athena to store the table generated by the Glue Crawler.
  • 📊 The video shows how to query the newly created table using SQL, including setting up the query output location in S3.
  • 🎬 You can run specific queries in Athena, such as finding movies directed by a particular person or filtering content based on country.
  • 📈 Amazon Athena allows you to query data directly from S3 without needing to move it to a traditional database, making data analysis more efficient.

Q & A

  • What is Amazon Athena and how does it work?

    -Amazon Athena is an interactive query service that allows users to run SQL queries on data stored in Amazon S3. It works by creating a table catalog for the data and then enabling users to query the data using standard SQL syntax.

  • Where is the sample dataset for the video from?

    -The sample dataset used in the video is from Kaggle and it contains information about Netflix TV shows and movies.

  • What is the structure of the Netflix dataset?

    -The Netflix dataset has a simple schema that includes fields such as show ID, type, title, director, and other general information about TV shows and movies.

  • How do you upload the dataset to an S3 bucket?

    -To upload the dataset to an S3 bucket, you select the bucket, create a folder (e.g., 'Netflix data'), select the CSV file, and then click on upload.

  • What is the purpose of creating a table catalog in Athena?

    -Creating a table catalog in Athena allows you to define the schema of the data and makes it easier to query the data stored in S3 using SQL.

  • What is a glue crawler in AWS Glue?

    -A glue crawler in AWS Glue is a tool that automatically scans data stored in a data store, infers the schema, and creates a metadata catalog table for the data.

  • How does AWS Glue help in creating a table for the data?

    -AWS Glue helps by using a glue crawler to scan through the data, enforce the schema, and create a table in the metadata catalog, which can then be queried using Athena.

  • What is an IAM role in AWS and why is it needed for the crawler?

    -An IAM role in AWS is a set of permissions that defines what actions a user or service can perform. It is needed for the crawler to grant the necessary permissions to scan the S3 folder, infer the schema, and create the table.

  • How do you run a crawler in AWS Glue?

    -To run a crawler in AWS Glue, you create a crawler, specify the source data store, set up the IAM role, define the database to store the results, and then click on 'Run Crawler'.

  • What is the significance of configuring the query output location in Athena?

    -Configuring the query output location in Athena specifies where the results of the queries will be stored in S3, making it easier to access and analyze the query results.

  • Can you provide an example of a SQL query that could be run on the Netflix dataset?

    -An example SQL query could be 'SELECT * FROM netflixdb.netflix_data WHERE director = 'Vikram';' to find all movies directed by Vikram.

Outlines

00:00

📚 Introduction to Querying S3 Data with Amazon Athena

This paragraph introduces the video's main topic: how to use Amazon Athena to query data stored in an S3 bucket using SQL. The presenter explains that Athena is an interactive query service and demonstrates the process with a sample dataset from Netflix, which includes information about TV shows and movies. The initial steps involve uploading the dataset to an S3 bucket, creating a folder named 'Netflix data', and then using Athena to analyze the data. The presenter guides viewers through launching the Athena query editor and creating a table catalog for the uploaded file using AWS Glue crawler, which automatically scans the file, infers the schema, and creates a table.

05:00

🔍 Querying and Analyzing Data with Amazon Athena

In this paragraph, the presenter continues the tutorial by explaining how to verify the uploaded data and use Athena to query it. The process includes creating a crawler in AWS Glue to scan the S3 folder and create a table. The presenter details the steps to configure an IAM role with necessary permissions for the crawler to access the S3 bucket, infer the schema, and create the table. After the crawler has completed its task, the presenter shows how to verify the creation of the table in both Athena and Glue. The paragraph concludes with a demonstration of how to run SQL queries on the newly created table in Athena, including configuring the query output location in S3 and executing sample queries to filter data based on specific criteria such as director and country.

Mindmap

Keywords

💡Amazon Athena

Amazon Athena is an interactive query service provided by Amazon Web Services (AWS) that allows users to perform SQL-like queries on data stored in S3, an object storage service. In the video, Athena is the main tool used to demonstrate how to analyze data from an S3 bucket using SQL without the need to move the data into a traditional database.

💡S3 Bucket

An S3 bucket is a basic container in Amazon's Simple Storage Service (S3) used to store data. In the context of the video, the S3 bucket is where the sample Netflix dataset is uploaded and stored, and it's the source of the data being queried using Amazon Athena.

💡SQL

SQL, which stands for Structured Query Language, is a standard language for managing and manipulating databases. In the video, SQL is used to query the data in the S3 bucket, demonstrating how Athena enables traditional database querying on cloud-stored data.

💡Kaggle

Kaggle is an online community for data scientists and machine learners that provides datasets, tools, and competitions. In the script, the presenter mentions downloading a sample dataset from Kaggle, specifically the Netflix TV shows and movies dataset, which is used for the demonstration in the video.

💡Schema

In database terminology, a schema is the organization of data as a blueprint of how the database is structured. The video script describes the simple schema of the Netflix dataset, which includes fields like show ID, type, title, and director.

💡CSV File

A CSV file, short for Comma-Separated Values, is a file format used to store tabular data, where each line represents a row and fields are separated by commas. The script mentions uploading a CSV file of the Netflix dataset to the S3 bucket.

💡AWS Glue Crawler

AWS Glue is a service that helps in data discovery, data preparation, and has the capability to create a metadata catalog. The 'Glue Crawler' mentioned in the script is used to scan the uploaded file and create a table in the AWS Glue Data Catalog, which Athena can then query.

💡IAM Role

An IAM (Identity and Access Management) role in AWS is a set of permissions that defines what actions a user or service can perform. In the video, creating an IAM role is part of setting up the Glue Crawler, which needs permissions to scan the S3 bucket and infer the schema of the data.

💡Database

In the context of databases, a database is an organized collection of data. The script describes the creation of a 'Netflix DB' in Athena where the table created by the Glue Crawler will reside.

💡Query Output Location

The query output location in Athena is the S3 location where the results of the queries are stored. The script mentions configuring this location as a prerequisite for running queries in Athena.

💡Data Analysis

Data analysis is the process of inspecting, cleaning, transforming, and modeling data with the goal of discovering useful information, suggesting conclusions, and supporting decision-making. The video demonstrates data analysis by querying the Netflix dataset to find specific information, such as movies directed by a certain director or shows from a specific country.

Highlights

Introduction to querying data in an S3 bucket using SQL with Amazon Athena.

Amazon Athena is an interactive query service that allows SQL queries on S3 data.

Demonstration of uploading a Netflix TV shows and movies dataset to S3.

Explanation of creating a folder in S3 for organizing the dataset.

Instructions on uploading a CSV file to the designated S3 folder.

Verification of successful file upload to the S3 bucket.

Accessing the Athena console to start querying the uploaded file.

Creating a table catalog in Athena for the uploaded file.

Utilizing AWS Glue Crawler to scan and infer schema for the dataset.

Description of the Glue Crawler process and its role in table creation.

Steps to create a new IAM role for Glue Crawler with necessary permissions.

Running the Glue Crawler to scan the S3 folder and create a table.

Verification of the table creation in both Athena and AWS Glue.

Configuring the query output location in Athena for storing results.

Previewing the table in Athena to see the inferred schema.

Executing SQL queries to analyze specific data within the dataset.

Example query to find movies directed by a specific director.

Example query to filter TV shows and movies from a specific country.

Emphasizing the ability to analyze data in S3 without moving it to a database.

Conclusion highlighting the benefits of using Athena for SQL querying on S3 data.

Transcripts

play00:00

hello everyone in this video we are

play00:01

going to see how to query a data that is

play00:04

sitting in your S3 bucket using SQL from

play00:06

Amazon Athena Amazon Athena is an

play00:09

interactive query service that lets you

play00:11

query the data in S3 using simple SQL

play00:15

okay so let's see how to do that so to

play00:18

that I have downloaded the sample data

play00:20

set from kaggle it's a Netflix TV shows

play00:22

and movies data set and this is how it

play00:25

looks like it has you know simple schema

play00:27

show ID type title director you know

play00:30

general information about the TV shows

play00:32

and movies okay so uh first thing let's

play00:35

upload this data set into our S3 and

play00:38

then see how to analyze this data using

play00:40

SQL from Athena

play00:42

okay I'm going to upload this data into

play00:45

my S3 bucket I'm going to select this

play00:47

bucket

play00:49

I'm going to create a folder and call it

play00:53

Netflix

play00:55

data

play00:59

and I will upload the data into this

play01:01

folder

play01:04

click on add files select this CSV file

play01:08

and click on upload

play01:11

okay uh so once the file is uploaded

play01:14

just verify if it is uploaded

play01:16

successfully and once it is done let's

play01:18

go to Athena and see how to query this

play01:20

uh file from Athena using SQL okay so

play01:24

once you are in Athena console this is

play01:26

how it looks like click on launch query

play01:28

editor and uh so this is how it looks

play01:31

okay so first thing is we need to create

play01:33

a table catalog table for this file that

play01:36

we just uploaded and then we will start

play01:38

querying that file so to do that you can

play01:41

click on create here okay and select

play01:43

glue crawler okay so what glue crawler

play01:46

does is it automatically like scans

play01:48

through this file and enforce the schema

play01:50

and creates a table for you

play01:52

okay

play01:55

okay so I just clicked on AWS glue

play01:58

crawler here and it took me into this

play02:00

screen I'm going to create a crawler

play02:03

here and call it as Netflix uh

play02:06

data

play02:08

crawler okay select next and the source

play02:12

type is data store because the catalog

play02:15

table is not existing already and I will

play02:17

select all our folders and click on next

play02:19

and the data store is S3 the connection

play02:22

is not required here and we will select

play02:26

the path in which we want to crawl okay

play02:29

going to expand this and I'm going to

play02:31

select this entire folder and we need to

play02:34

add a slash in the end okay so that

play02:38

it Scrolls through all the files under

play02:40

this folder okay I'm going to click on

play02:42

next here add another data store now

play02:44

click on next here and here it is asking

play02:47

uh if we

play02:49

we want to create a new IAM role so if

play02:51

you uh you can type in any role that you

play02:53

want so what it does is it automatically

play02:55

creates an IAM role on your behalf and

play02:58

grants all the required permissions like

play02:59

the permission to scan through this S3

play03:02

folder to infer the schema and

play03:03

everything and also create table and all

play03:06

those things okay so if you already have

play03:08

an IAM role with all those permissions

play03:09

you can click this choose an existing IM

play03:12

role and select the role that you want

play03:14

or if you don't have it you can just

play03:16

create a new role here I am going to

play03:18

call it as

play03:19

Netflix

play03:21

data crawler

play03:26

I am roll okay so what it does it it

play03:30

does is it will create a I am role with

play03:32

this name and attach all the required

play03:34

permissions to that okay but to do this

play03:36

make sure that you have all these three

play03:37

permissions create role create policy

play03:39

and attachable policy only then uh glue

play03:42

will be able to create this I am role on

play03:45

your behalf okay so now let's click on

play03:47

next here

play03:49

and this is the frequency with which you

play03:51

want to run your crawler I'm going to

play03:53

select run on demand because I only run

play03:55

it I want to run it only when I need it

play03:57

let's click on next here so this is a

play03:59

database to which the table uh will be

play04:02

added so if you want to create a new

play04:04

database click on ADD database here and

play04:07

you can type in the database name here

play04:10

I'm going to call it Netflix DB and all

play04:13

other things are optional I'm going to

play04:15

click on create so it will create a

play04:17

database called Netflix DB and then it

play04:20

will add the table under the database

play04:22

Okay click on next and then click on

play04:25

finish

play04:27

okay so here it has created that crawler

play04:31

here so select that crawler and click on

play04:33

run crawler here okay so it says that

play04:36

the crawler is now running

play04:38

you can keep refreshing here and once

play04:41

that is done it will automatically

play04:43

create a table for you

play04:48

okay it has completed running and now

play04:51

it's in stopping state so let's wait for

play04:54

this crawler to come back to uh ready

play04:56

state

play05:00

okay so it says that crawler is

play05:02

completed and it has created one table

play05:04

so now let's go to Athena and see if you

play05:07

you can also verify it in glow itself if

play05:09

you click on tables here

play05:12

it actually shows that it has created

play05:14

this Netflix data table okay so let's go

play05:16

back to Athena here and I'll just

play05:18

refresh this

play05:20

and if I if you see here it has created

play05:23

that reflex DB and you can see the table

play05:26

here okay so if you expand that this is

play05:29

the schema that it has inferred for us

play05:31

so show ID string and uh title is string

play05:34

I think

play05:36

release Here is a bigint so yeah it has

play05:39

automatically inferred the schema and

play05:40

created this table for us now let's see

play05:43

how to query this table so before you do

play05:46

that if this is the first time you are

play05:47

running a query in Athena you need to

play05:49

configure your query output location so

play05:51

query output location is basically where

play05:53

Athena will store the results of your

play05:55

queries in S3 you can go to settings and

play05:58

you can click on you know manage here

play06:00

and you can configure your query output

play06:02

location to some S3 path okay so if you

play06:06

already done that that's fine you can

play06:07

start querying this table so you can

play06:09

click on this three dots here I can just

play06:11

do a preview table

play06:13

okay so if you see uh it just ran select

play06:16

start from

play06:18

netflixdb.netflix data limit10 so it's

play06:20

just showing uh some 10 rows of the data

play06:24

okay so you can run any uh like

play06:28

any query that you want to analyze this

play06:31

data so if you want to see all the

play06:33

movies that are directed by let's say

play06:35

Vikram but okay so I can say select star

play06:38

from this

play06:39

where

play06:42

director is equal to

play06:46

chromebot and click on run

play06:51

okay so there is only one movie which is

play06:54

uh

play06:55

directed by Vikram but so uh yeah you

play06:57

can do I mean pretty much anything that

play06:59

you would do in a SQL database to uh

play07:02

like

play07:03

analyze the data

play07:05

so if you want to see something like

play07:07

where country is equal to

play07:10

India so there's a country option here

play07:12

so let's see uh all the movies and TV

play07:16

shows from India click on run

play07:19

okay it looks like maybe there is

play07:23

India yeah okay let's run this

play07:27

okay yeah so if you see we can see a lot

play07:30

of uh results here so these are all the

play07:33

shows and you know TV shows and movies

play07:35

from India so yeah you can use Athena uh

play07:38

like just to run your SQL queries just

play07:40

like any other SQL database so but the

play07:43

beauty of Athena is that you need not

play07:44

move the data from your S3 into any

play07:47

database your data is still sitting in

play07:49

uh S3 but you have created a table on

play07:52

top of it and interactively querying the

play07:54

data so yeah that's how you can analyze

play07:57

your history data using SQL without

play07:59

moving the data into a database I hope

play08:01

you found this video helpful if you did

play08:03

please uh like this video and uh also

play08:06

subscribe to my channel and I'll see you

play08:08

in the next video thank you

Rate This

5.0 / 5 (0 votes)

الوسوم ذات الصلة
Amazon AthenaS3 BucketSQL QueryData AnalysisNetflix DataCSV UploadAWS GlueIAM RoleDatabase TableInteractive Query
هل تحتاج إلى تلخيص باللغة الإنجليزية؟