Building a Serverless Data Lake (SDLF) with AWS from scratch

Knowledge Amplifier
6 Mar 2024 · 23:01

Summary

TLDR: This video from the 'Knowledge Amplifier' channel dives into the AWS Serverless Data Lake Framework (SDLF), an open-source project that streamlines the setup of data lake systems. It outlines the core AWS services integral to SDLF, such as S3, Lambda, Glue, and Step Functions, and discusses their roles in creating a reusable, serverless architecture. The script covers the framework's architecture, detailing the data flow from raw ingestion to processed analytics, and highlights the differences between near real-time processing in Stage A and batch processing in Stage B. It also touches on CI/CD practices for data pipeline development and provides references to related tutorial videos for practical implementation guidance.

Takeaways

  • 📚 The video introduces the AWS Serverless Data Lake Framework (SDLF), a framework designed to handle large volumes of structured, semi-structured, and unstructured data.
  • 🛠️ The framework is built using core AWS serverless services including AWS S3 for storage, DynamoDB for cataloging data, AWS Lambda for light data transformations, AWS Glue for heavy data transformations, and AWS Step Functions for orchestration.
  • 🏢 Companies like Formula 1 Motorsports, Amazon Retail Ireland, and Naranja Finance utilize the SDLF to implement data lakes within their organizations, highlighting its industry adoption.
  • 🌐 The framework supports both near real-time data processing in Stage A and batch processing in Stage B, catering to different data processing needs.
  • 🔄 Stage A focuses on light transformations and is triggered by events landing in S3, making it suitable for immediate data processing tasks.
  • 📈 Stage B is designed for heavy transformations using AWS Glue and is optimized for processing large volumes of data in batches, making it efficient for periodic data processing tasks.
  • 🔧 The video script explains the architecture of SDLF, detailing the flow from raw data ingestion to processed data ready for analytics.
  • 🔒 Data quality checks are emphasized as crucial for ensuring the reliability of data used in business decisions, with a dedicated Lambda function suggested for this purpose.
  • 🔄 The script outlines the use of AWS services for data transformation, including the use of AWS Step Functions to manage workflows and AWS Lambda for executing tasks.
  • 🔧 The importance of reusability in a framework is highlighted, with the SDLF being an open-source project that can be adapted and reused by different organizations.
  • 🔄 CI/CD pipelines are discussed for managing project-specific code changes, emphasizing the need to implement continuous integration and delivery for variable components of the framework.

Q & A

  • What is the AWS Serverless Data Lake Framework (SDLF)?

    -The AWS Serverless Data Lake Framework (SDLF) is an open-source project that provides a data platform to accelerate the delivery of enterprise data lakes. It utilizes various AWS serverless services to create a reusable framework for data storage, processing, and security.

  • What are the core AWS services used in the SDLF?

    -The core AWS services used in the SDLF include AWS S3 for storage, DynamoDB for cataloging data, AWS Lambda and AWS Glue for compute, and AWS Step Functions for orchestration.

  • How does the SDLF handle data ingestion from various sources?

    -Data from various sources is ingested into the raw layer of the SDLF, which is an S3 location. The data can come in various formats, including structured, semi-structured, and unstructured data.

  • What is the purpose of the Lambda function in the data ingestion process?

    -The Lambda function acts as a router, receiving event notifications from S3 and forwarding the event to a team-specific SQS queue based on the file's landing location or filename.
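
As a rough sketch of how such a router could look (not taken from the SDLF code base), the Python/boto3 handler below reads S3 event notifications delivered through SQS and forwards each one to a team queue chosen by key prefix. The queue URLs, environment variable names, and prefixes are made-up placeholders.

```python
import json
import os
import boto3

sqs = boto3.client("sqs")

# Hypothetical team queues, resolved from environment variables.
TEAM_QUEUES = {
    "team-a/": os.environ.get("TEAM_A_QUEUE_URL"),
    "team-b/": os.environ.get("TEAM_B_QUEUE_URL"),
}

def lambda_handler(event, context):
    """Route S3 event notifications (delivered via SQS) to a team-specific queue."""
    routed = 0
    for record in event.get("Records", []):
        body = json.loads(record["body"])            # SQS message body holds the S3 event
        for s3_record in body.get("Records", []):
            key = s3_record["s3"]["object"]["key"]   # e.g. "team-a/orders/2024-03-06.csv"
            queue_url = next(
                (url for prefix, url in TEAM_QUEUES.items() if key.startswith(prefix)),
                None,
            )
            if queue_url:
                sqs.send_message(QueueUrl=queue_url, MessageBody=json.dumps(s3_record))
                routed += 1
    return {"routed": routed}
```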

  • Can you explain the difference between the raw, staging, and processed layers in the SDLF architecture?

    -The raw layer contains the ingested data in its original format. The staging layer stores data after light transformations, such as data type checks or duplicate removal. The processed layer holds the data after heavy transformations, such as joins, filters, and aggregations, making it ready for analytics.
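
To make the "light transformation" idea concrete, here is a hedged Python sketch of a Lambda that reads a raw CSV, drops duplicate rows, and writes the result to a staging bucket. The event shape and bucket name are assumptions for illustration, not the framework's actual contract.

```python
import csv
import io
import boto3

s3 = boto3.client("s3")
STAGING_BUCKET = "my-datalake-staging"  # hypothetical bucket name

def lambda_handler(event, context):
    """Light transformation: read a raw CSV, remove duplicate rows, write to staging."""
    bucket = event["s3"]["bucket"]["name"]   # assumed input shape (one S3 record)
    key = event["s3"]["object"]["key"]

    obj = s3.get_object(Bucket=bucket, Key=key)
    rows = list(csv.DictReader(io.StringIO(obj["Body"].read().decode("utf-8"))))

    seen, deduped = set(), []
    for row in rows:
        fingerprint = tuple(row.values())
        if fingerprint not in seen:          # simple duplicate removal
            seen.add(fingerprint)
            deduped.append(row)

    if not deduped:
        return {"rows_in": 0, "rows_out": 0}

    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(deduped[0].keys()))
    writer.writeheader()
    writer.writerows(deduped)

    s3.put_object(Bucket=STAGING_BUCKET, Key=key, Body=out.getvalue().encode("utf-8"))
    return {"rows_in": len(rows), "rows_out": len(deduped)}
```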

  • How does the SDLF ensure data quality in the ETL pipeline?

    -The SDLF uses a Lambda function to perform data quality validation. This function can implement data quality frameworks to ensure the data generated by the ETL pipeline is of good quality before it is used for analytics.
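
A minimal illustration of such a data quality Lambda might look like the following. It uses plain Python checks (row count and null columns) rather than a full framework, and the input keys (`bucket`, `key`, `required_columns`) are assumed for the example.

```python
import csv
import io
import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    """Minimal data quality gate: row count and null checks on a processed CSV."""
    bucket = event["bucket"]                               # hypothetical input shape
    key = event["key"]
    required_columns = event.get("required_columns", [])

    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
    rows = list(csv.DictReader(io.StringIO(body)))

    failures = []
    if not rows:
        failures.append("dataset is empty")
    for column in required_columns:
        nulls = sum(1 for row in rows if not row.get(column))
        if nulls:
            failures.append(f"{column}: {nulls} null values")

    if failures:
        # Fail the Step Functions task so the pipeline stops before bad data is consumed.
        raise ValueError("Data quality check failed: " + "; ".join(failures))
    return {"status": "PASSED", "row_count": len(rows)}
```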

  • What is the role of AWS Step Functions in the SDLF?

    -AWS Step Functions are used for orchestration in the SDLF. They manage the workflow of data processing, starting from light transformations in the staging layer to heavy transformations in the processed layer.
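
For the Stage A hand-off, a Lambda consuming the team queue can start the state machine roughly like this (a sketch only; the state machine ARN comes from a hypothetical environment variable).

```python
import json
import os
import boto3

sfn = boto3.client("stepfunctions")
STATE_MACHINE_ARN = os.environ.get("STAGE_A_STATE_MACHINE_ARN")  # hypothetical

def lambda_handler(event, context):
    """Start the Stage A state machine for each message pulled from the team queue."""
    started = 0
    for record in event.get("Records", []):
        s3_event = json.loads(record["body"])     # the routed S3 record
        sfn.start_execution(
            stateMachineArn=STATE_MACHINE_ARN,
            input=json.dumps(s3_event),
        )
        started += 1
    return {"started": started}
```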

  • How does the SDLF differentiate between Stage A and Stage B in terms of data processing?

    -Stage A is near real-time, processing data as soon as it lands in S3 and triggers a Lambda function. Stage B, on the other hand, is for batch processing, where data is accumulated over a period and then processed together using AWS Glue.
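
A hedged sketch of the Stage B batching Lambda is shown below: it drains the staging queue and submits one Glue run for the accumulated keys. For brevity it starts the Glue job directly instead of going through a Step Functions state machine as described in the video; the queue URL and job name are placeholders.

```python
import json
import boto3

sqs = boto3.client("sqs")
glue = boto3.client("glue")

STAGING_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/team-a-staging"  # hypothetical
GLUE_JOB_NAME = "team-a-heavy-transform"                                               # hypothetical

def lambda_handler(event, context):
    """Every 5 minutes: drain the staging queue and submit one Glue run for the batch."""
    keys = []
    while True:
        resp = sqs.receive_message(
            QueueUrl=STAGING_QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=1
        )
        messages = resp.get("Messages", [])
        if not messages:
            break
        for msg in messages:
            s3_event = json.loads(msg["Body"])
            for record in s3_event.get("Records", []):
                keys.append(record["s3"]["object"]["key"])
            sqs.delete_message(QueueUrl=STAGING_QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])

    if not keys:
        return {"submitted": False}

    run = glue.start_job_run(
        JobName=GLUE_JOB_NAME,
        Arguments={"--input_keys": json.dumps(keys)},   # consumed by the Glue script
    )
    return {"submitted": True, "JobRunId": run["JobRunId"]}
```

A companion wait step could then poll `glue.get_job_run(JobName=..., RunId=...)` until the returned `JobRunState` reaches `SUCCEEDED`, mirroring the wait block described in the architecture.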

  • What is the significance of using CI/CD pipelines in the SDLF?

    -CI/CD pipelines are used to manage the deployment of project-specific code for light transformations and AWS Glue scripts. They ensure that only the variable parts of the project are updated, streamlining the development and deployment process.

  • How can one implement the SDLF using their own AWS services?

    -To implement the SDLF, one can refer to the provided reference videos that cover creating event-based projects using S3, Lambda, and SQS, triggering AWS Step Functions from Lambda, and interacting between Step Functions and AWS Glue, among other topics.

Outlines

00:00

🚀 Introduction to AWS Serverless Data Lake Framework (SDLF)

This paragraph introduces the AWS Serverless Data Lake Framework (SDLF), emphasizing its importance and the core AWS services involved. The framework is designed to address complex business problems by leveraging a centralized repository for storing, processing, and securing various data types. It mentions ingestion via AWS Transfer Family (SFTP) and ingestion frameworks such as Sqoop, Talend, or Spark. The paragraph also touches on the reusability aspect of a framework and introduces the open-source nature of SDLF, highlighting its adoption by large organizations like Formula 1 Motorsports, Amazon Retail Ireland, and Naranja Finance.

05:01

🌟 Core AWS Services in SDLF and Data System Layers

The second paragraph delves into the core AWS services utilized in the SDLF, such as AWS S3 for storage, DynamoDB for cataloging data, AWS Lambda and Glue for compute tasks, and AWS Step Functions for orchestration. It explains the three major layers of a data system: the raw or landing layer, the staging layer, and the processed or analytical layer. The paragraph outlines the process flow from data ingestion to transformation and storage across these layers, culminating in a detailed explanation of the architecture and the use of AWS services within SDLF.

10:03

🔍 Data Flow and Processing in SDLF Architecture

This paragraph explores the data flow within the SDLF architecture, starting from the raw data landing in S3 to the triggering of Lambda functions and the use of Amazon SQS for event handling. It describes how data is processed through light transformations in the staging layer and then moved to the processed or analytical layer for heavy transformations using AWS Glue. The paragraph also explains the role of AWS Step Functions in initiating the processing workflow, the use of Lambda for routing events to team-specific queues, and the importance of metadata updates in DynamoDB for audit and logging purposes.
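
The "pre-update" and "post-update" catalog writes mentioned above could be implemented along these lines (a sketch only; the table name and key schema are assumptions for illustration).

```python
import time
import boto3

dynamodb = boto3.resource("dynamodb")
audit_table = dynamodb.Table("sdlf-pipeline-audit")  # hypothetical table name

def record_stage_start(object_key, stage):
    """Pre-update entry so the run can be audited and reconciled later."""
    item = {
        "object_key": object_key,   # assumed partition key
        "stage": stage,             # assumed sort key, e.g. "stage-a" or "stage-b"
        "status": "STARTED",
        "started_at": int(time.time()),
    }
    audit_table.put_item(Item=item)
    return item

def record_stage_end(object_key, stage, status="COMPLETED"):
    """Post-update entry with the final status and end time."""
    audit_table.update_item(
        Key={"object_key": object_key, "stage": stage},
        UpdateExpression="SET #s = :status, ended_at = :ts",
        ExpressionAttributeNames={"#s": "status"},
        ExpressionAttributeValues={":status": status, ":ts": int(time.time())},
    )
```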

15:05

🛠️ Batch Processing and CI/CD Integration in SDLF

The fourth paragraph discusses the distinction between Stage A and Stage B in the SDLF architecture, highlighting Stage A as a near real-time system and Stage B as a batch processing system. It explains the use of a CloudWatch rule to trigger a Lambda function for batch processing every five minutes. The paragraph also addresses the CI/CD aspect, emphasizing that pipelines should be implemented only for project-specific code changes in Lambda and AWS Glue, using AWS developer tools such as CodeCommit and CodePipeline.

20:08

🔗 Implementing SDLF and Reference to Related Videos

The final paragraph provides guidance on implementing the SDLF flow using AWS services, referencing previous videos that cover various components of the framework. It suggests videos on creating event-based projects with S3, Lambda, and SQS, triggering Step Functions from Lambda, and using CloudWatch rules for periodic Lambda invocations. The paragraph also mentions the interaction between Step Functions and AWS Glue, as well as the importance of data quality checks and the use of DynamoDB for audit and logging. It concludes with an invitation to like, share, comment, and subscribe to the channel for more informative content.


Keywords

💡AWS Serverless Data Lake Framework (SDLF)

The AWS Serverless Data Lake Framework (SDLF) is an open-source project that provides a data platform designed to expedite the delivery of enterprise data lakes. It is a core concept in the video, which discusses its components, benefits, and how it facilitates complex business problem-solving using serverless AWS services. The SDLF is highlighted as a reusable framework that organizations can implement to manage large volumes of structured, semi-structured, and unstructured data.

💡AWS Services

AWS Services are the building blocks of the SDLF, each serving a specific purpose within the framework. The video mentions several services such as AWS S3 for storage, AWS Lambda for compute tasks, AWS Glue for data transformation, and AWS Step Functions for orchestration. These services are integral to creating a serverless architecture that is both scalable and cost-effective.

💡Data Lake

A Data Lake is a centralized repository that can store large volumes of diverse data in its native format until it is needed. In the context of the video, the Data Lake is the foundational concept for the SDLF, where structured, semi-structured, and unstructured data is ingested and processed. The Data Lake allows for the storage and processing of data from various sources, making it a critical component for big data analytics.

💡Structured Data

Structured data refers to information that is organized in a specific format, such as CSV or TSV files. The video mentions that structured data can be ingested into the SDLF through AWS Transfer Family or other mechanisms, where it can then be processed and stored in a structured manner for easy querying and analysis.

💡Semi-structured Data

Semi-structured data is data that has some organizational structure but does not adhere to a strict schema like structured data. Examples given in the video include JSON and XML formats. The SDLF is designed to handle this type of data, allowing for flexibility in processing and analyzing information that does not fit neatly into traditional database structures.

💡Unstructured Data

Unstructured data is data that does not have a pre-defined structure or schema and can include text, images, and other formats. The video discusses how the SDLF can manage unstructured data, which is an important capability for modern data systems that need to process a wide variety of information sources.

💡ETL (Extract, Transform, Load)

ETL is a process in data warehousing that involves extracting data from various sources, transforming it to fit operational needs, and loading it into a target system for further analysis. The video describes how the SDLF uses ETL processes to prepare data for analytics, with AWS Glue being a key service for heavy data transformation tasks.
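
As an illustration of the heavy-transformation side, a Glue PySpark job along the lines described might look like this. The datasets, column names, bucket names, and the `--input_keys` argument are invented for the example and simply mirror the hypothetical batch Lambda shown earlier.

```python
import sys
import json

from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from pyspark.context import SparkContext
from pyspark.sql import functions as F

# Job arguments; "--input_keys" matches the hypothetical batch Lambda above.
args = getResolvedOptions(sys.argv, ["JOB_NAME", "input_keys"])
keys = json.loads(args["input_keys"])

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

STAGING_BUCKET = "my-datalake-staging"      # assumed bucket names
PROCESSED_BUCKET = "my-datalake-processed"

# Read the batch of staged files plus a reference dataset, then join and aggregate.
orders = spark.read.option("header", True).csv([f"s3://{STAGING_BUCKET}/{k}" for k in keys])
customers = spark.read.option("header", True).csv(f"s3://{STAGING_BUCKET}/reference/customers/")

daily_revenue = (
    orders.join(customers, on="customer_id", how="inner")
          .groupBy("order_date", "country")
          .agg(F.sum(F.col("amount").cast("double")).alias("total_amount"))
)

# Write curated output to the processed/analytics layer in a columnar format.
daily_revenue.write.mode("overwrite").parquet(f"s3://{PROCESSED_BUCKET}/daily_revenue/")
```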

💡Orchestration

Orchestration in the context of the SDLF refers to the coordination of various AWS services to ensure that data flows smoothly through the system from ingestion to analysis. AWS Step Functions is highlighted as the orchestration tool used in the SDLF to manage the workflow of data processing tasks.

💡Lambda Functions

Lambda Functions are serverless compute services that run code in response to events and automatically manage the underlying compute resources. The video explains how Lambda functions are used in the SDLF for tasks such as routing events to specific queues, performing light data transformations, and triggering other AWS services.

💡S3 Raw Layer

The S3 Raw Layer is the initial stage in the SDLF where raw data files are first ingested into AWS S3. The video describes this layer as the starting point for data within the framework, where files are stored in their original form before any transformation occurs.
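
Wiring the raw bucket to a queue can be done with a bucket notification configuration, sketched below with made-up bucket and queue names (the queue's access policy must separately allow S3 to send messages to it).

```python
import boto3

s3 = boto3.client("s3")

RAW_BUCKET = "my-datalake-raw"                                          # assumed names
QUEUE_ARN = "arn:aws:sqs:us-east-1:123456789012:sdlf-raw-object-queue"

# Emit an SQS event whenever a new object lands under the raw/ prefix.
s3.put_bucket_notification_configuration(
    Bucket=RAW_BUCKET,
    NotificationConfiguration={
        "QueueConfigurations": [
            {
                "QueueArn": QUEUE_ARN,
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {"Key": {"FilterRules": [{"Name": "prefix", "Value": "raw/"}]}},
            }
        ]
    },
)
```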

💡CloudWatch Rules

CloudWatch Rules in AWS are used to trigger actions in response to specific events or conditions. In the video, a CloudWatch rule is set up to trigger a Lambda function every five minutes, which checks for new data in the staging layer and initiates heavy transformation processes using AWS Glue.
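
Creating such a five-minute schedule with boto3 could look roughly like this; the rule, function, and statement names are placeholders.

```python
import boto3

events = boto3.client("events")
lambda_client = boto3.client("lambda")

RULE_NAME = "sdlf-stage-b-every-5-minutes"                                 # assumed names
FUNCTION_ARN = "arn:aws:lambda:us-east-1:123456789012:function:stage-b-batcher"

# Schedule rule that fires every five minutes.
rule = events.put_rule(Name=RULE_NAME, ScheduleExpression="rate(5 minutes)", State="ENABLED")

# Point the rule at the batching Lambda...
events.put_targets(Rule=RULE_NAME, Targets=[{"Id": "stage-b-batcher", "Arn": FUNCTION_ARN}])

# ...and allow EventBridge to invoke it.
lambda_client.add_permission(
    FunctionName=FUNCTION_ARN,
    StatementId="allow-eventbridge-5min-rule",
    Action="lambda:InvokeFunction",
    Principal="events.amazonaws.com",
    SourceArn=rule["RuleArn"],
)
```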

💡Data Quality

Ensuring data quality is critical for making accurate business decisions. The video mentions implementing Lambda functions for data quality checks using frameworks such as Deequ (referred to as "DQ" in the video) to validate the integrity and reliability of the data processed through the SDLF.

💡CI/CD Pipeline

CI/CD (Continuous Integration/Continuous Deployment) is a practice in software development where code changes are automatically built, tested, and prepared for deployment. The video discusses the importance of CI/CD pipelines in the SDLF for managing project-specific code for light transformations and AWS Glue jobs, allowing for efficient updates and deployments.
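
One hedged way to deploy "only the variable parts" from a pipeline's build stage is sketched below: it updates the project-specific Lambda in place and uploads the new Glue script to the S3 location the job definition already points at. All resource names and paths are illustrative.

```python
import boto3

lambda_client = boto3.client("lambda")
s3 = boto3.client("s3")

LIGHT_TRANSFORM_FUNCTION = "team-a-light-transform"        # assumed resource names
GLUE_SCRIPT_BUCKET = "my-datalake-artifacts"
GLUE_SCRIPT_KEY = "glue/team_a_heavy_transform.py"

def deploy(lambda_zip_path: str, glue_script_path: str) -> None:
    """Deploy step of a CI/CD pipeline: push only the project-specific pieces."""
    # Update the light-transformation Lambda in place.
    with open(lambda_zip_path, "rb") as f:
        lambda_client.update_function_code(
            FunctionName=LIGHT_TRANSFORM_FUNCTION, ZipFile=f.read()
        )

    # Upload the new Glue script to the location the Glue job already references.
    s3.upload_file(glue_script_path, GLUE_SCRIPT_BUCKET, GLUE_SCRIPT_KEY)

if __name__ == "__main__":
    deploy("build/light_transform.zip", "src/team_a_heavy_transform.py")
```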

Highlights

Introduction to AWS Serverless Data Lake Framework (SDLF).

Core AWS Services used in SDLF: S3, DynamoDB, Lambda, Glue, and Step Functions.

Advantages of using SDLF for complex business problems.

Data Lake as a centralized repository for various data formats.

Reusability as a key feature of a framework and its importance in SDLF.

Three options for implementing Data Lake: building from scratch, purchasing, or using open-source projects like SDLF.

Popularity of SDLF among large organizations like Formula 1 Motorsports and Amazon Retail Ireland.

Explanation of the architecture of SDLF with Stage A and Stage B.

Role of AWS S3 in storing raw data and its serverless nature.

Use of AWS Lambda for light data transformation within the 15-minute execution limit.

AWS Glue's role in heavy data transformation for batch processing.

Orchestration of data workflows using AWS Step Functions in a serverless manner.

Data flow from raw to staging to processed layers in SDLF.

Event-driven architecture using AWS S3 notifications and SQS for data ingestion.

Lambda functions routing events to team-specific SQS queues for organized data processing.

Step Functions workflow for light transformation, auditing, and logging with DynamoDB.

Difference between Stage A for near real-time processing and Stage B for batch processing.

Implementation of CI/CD pipelines for project-specific code in SDLF.

Data quality checks as an essential part of the ETL pipeline to ensure data reliability.

Reference to additional videos for implementing specific parts of the SDLF workflow.

Transcripts

00:00
Hello friends, welcome to our channel Knowledge Amplifier. Today in this particular video we are going to explore a very important framework, and that is the AWS Serverless Data Lake Framework, in short called SDLF. First we are going to explore what it is, what the core AWS services used in this framework are, what the advantages are, and which companies are using it to solve complex business problems. All of this we are going to explore in detail in this discussion.

00:29
Before going ahead with the actual framework, let us try to recall what a data lake is. As we know, it is nothing but a centralized repository designed to store, process, and secure large amounts of structured, semi-structured, and unstructured data. Maybe from your vendor company, using AWS Transfer Family (that is, SFTP), some files are coming — that can be structured data in CSV or TSV format, or it can be images or JSON, that kind of semi-structured or unstructured data. Or maybe from some on-premise system, using different ingestion frameworks like Sqoop, Talend, or PySpark, the data is coming into our centralized repository. That particular location where all these various formats and kinds of data are accumulated, that centralized repo, is nothing but a data lake. And whenever we use the word "framework", one particular thing directly gets attached to that system, and that is reusability: the system should be reusable, and only then can we call it a framework. So now let us try to explore how, using different AWS serverless services, we can create a reusable framework for our data lake system.

01:41
Whenever any organization needs to implement a data lake for solving a big data or OLAP-related use case, they generally get two options: either they can build that data lake system from scratch within their organization, or they can buy a data lake framework that follows industry standards from some third-party vendor company. These are the two general, conventional approaches that most organizations follow. But we have a third option, and that is a popular open-source project — some developer or organization might have created a very well-architected system and published it for free for other organizations or individual users. This Serverless Data Lake Framework is one such open-source project, which provides a data platform that accelerates the delivery of enterprise data lakes. That is, using this open-source project we can quickly build a data lake system within our organization. This open-source project is used by large organizations like Formula 1 Motorsports — very popular for racing, as we all know — and apart from that, Amazon Retail Ireland also uses it, as does Naranja Finance from Argentina. These are some popular, big organizations that follow this SDLF framework to implement a data lake in their organization.

03:11
Now let us try to understand which core AWS services are used to implement this data lake framework. Because this framework is serverless, whatever AWS services we use to implement it are also, obviously, serverless. For example, if you consider the storage system, we have AWS S3, which has virtually unlimited storage capacity — we don't need to think about servers and back ends, AWS has taken care of all those things on our behalf. Apart from that, if we need to catalog the data, DynamoDB can also help us. That covers storage. Now we need a compute system for our OLAP workloads, and there AWS Lambda and Glue can help us: using Lambda we can process our data whenever there is a light transformation that can be completed within 15 minutes, and if there are heavy ETL workloads, we can use Glue, which is again a serverless service. We also need an orchestration tool — very popular orchestration tools are Autosys or Airflow — but because this is a serverless system we are going to build, for orchestration purposes we can use AWS Step Functions, which is a serverless orchestration service provided by AWS. So these are the major services we are using in this Serverless Data Lake Framework, or SDLF.

04:29
Now we will jump directly into the architecture part, so let me just zoom into this particular part a little bit. In this SDLF architecture we have two stages: Stage A, and here in the lower part, Stage B. First we will explore what Stage A is doing. From some vendor organization or from an on-premise system we are getting raw data; that raw file is ingested into the S3 raw location. As we know, in our data system we have multiple layers: the first layer is called the raw or landing layer, then we have the staging layer, and after that the processed or analytical layer. Let me give a quick recap on that, and then we will go back to this architecture.

05:16
Here I have written some important differences between the layers we generally encounter in a data system. First is the raw or landing layer, and this layer basically contains the ingested data that has not been transformed — it is exactly in the raw format in which it landed from the source system in a particular S3 location, and that is called the raw layer. From that raw layer we consume the data and apply very light transformations, for example a data type check or duplicate removal; we apply this kind of light transformation and then store the result in the staging layer. Once data is available in the staging layer, we read it and apply heavy transformations to it — maybe joining the data from various sources, performing different filters or aggregations based on the business requirement — and after doing all this heavy processing, we write the data to the processed zone. So these are the three major layers we generally encounter in a data system, and we are going to observe the same thing in this framework as well; I just thought to give you a quick recap. Now let's go back to the architecture.

06:26
So here, from the on-premise system or from some vendor company, using some ingestion framework, the data lands in the raw layer, where it is available in raw format. What happens is that as soon as the data lands in S3, we send an event notification to AWS SQS, and from that SQS queue we trigger one Lambda function. That Lambda function sends the event to another SQS queue, and these SQS queues are separated team-wise: for example, for team A we have one SQS queue, for team B another, and for team C yet another. Based on where the S3 file landed, or maybe based on the file name, this Lambda forwards the SQS event to one of these team-wise queues — if the message is for team A it goes to team A's queue, if it is for team B it goes to team B's queue, and so on. So these SQS queues are created per team.

07:33
I hope you got this point; now let me just erase this particular part. So the Lambda has sent the event to SQS, and this SQS queue is team-specific. Now let us try to understand, within a team, what kind of data flow happens with those events. Till now, this SQS queue only contains the information about the file that landed in our raw layer; it is not yet processed — the first Lambda has not processed the event, it just acted as a router across the different SQS queues. Now this SQS queue holds the information that a particular file has landed in the S3 raw layer, and based on this queue another Lambda gets triggered. Basically, this Lambda continuously polls at a certain time interval to check whether any new message has been published to the SQS queue, and once the Lambda gets the message, it starts an AWS Step Functions execution — and here our first layer of processing starts. The step function gets started, and the first Lambda function updates the metadata information about our job in DynamoDB, for audit or reconciliation purposes — that is the "pre-update comprehensive catalog" step: it records that this particular file has started processing, the timestamp at which it started, and some other metadata for future reference, for audit or logging purposes, in DynamoDB. Then it executes the light transformation: as I told you, from the raw layer we consume the data, apply light transformations such as a data type check or duplicate removal, and store the result in the staging layer. This particular Lambda does that light transformation — and because it is a light transformation, we can obviously expect it to be completed within 15 minutes, so Lambda is a good choice at this stage — and it writes the data to another S3 location. What is this S3 location? It is nothing but the staging layer: from the raw layer we consume the data and write it to the staging layer after applying simple or light transformations. Once this Lambda has applied the light transformation and it has executed successfully, the next Lambda takes the responsibility of updating the DynamoDB table to record that the job is completed, the end time of the job, and all the other metadata. So this is a typical AWS Step Functions workflow.

09:57
Now let's see what happens next, once the data is available in the staging layer. This particular event goes to another SQS queue — here, if you see, this line goes like this, and that same staging location is there. Whenever a file is written by the Lambda from that step function into the staging layer, an event is emitted, and it is written to another SQS queue. So this is yet another SQS queue; we have encountered multiple SQS queues so far in this architecture: initially we had a generic SQS queue, then the first team-specific SQS queue, and now, after this processing, when the staging data is written, that event is also placed in this particular SQS queue. Here we have another Lambda, and this Lambda is triggered by a CloudWatch rule — as we know, using Amazon EventBridge we can schedule a particular Lambda or Glue job at a certain time interval. Here the CloudWatch rule fires every 5 minutes, and every 5 minutes the Lambda checks whether any event has arrived in this SQS queue. When will an event arrive in this queue? When some data has been written to the staging location. If the Lambda finds that data has been written to the staging location — that is, if messages are available in this SQS queue — then this Lambda starts another step function workflow.

11:29
And now we are going from staging to the processed or analytical zone, which means this time we need to apply heavy transformations, not light ones: we might need to apply joins with various data sources, filters, and aggregations, whatever we have to do based on the business requirement. These are heavy transformations on our big data, so there is a high chance it will take more than 15 minutes. This time we cannot take the risk of running the job in Lambda; rather, we should run it in Glue. So this particular step performs that heavy transformation: if you see here, we start the workflow, and then this Lambda triggers a Glue job with all those SQS events — these are the files that landed in the staging layer, which the Glue job should read using PySpark or Spark with Scala, etc., and then process. Once the Glue job has processed the data, where will it write it? Obviously it writes to the post-staging zone, which we call the processed or analytical area — this is the final layer of our data lake. Once the Lambda has triggered our Glue job, the Glue job may take 30 minutes, or 1 hour, or 2 hours, so the Lambda will not wait: it just invokes the Glue job and then sends the job ID to another Lambda. Here we have a wait block, and this Lambda continuously polls at some interval — maybe every 15 seconds, every 20 seconds, or every 2 minutes — checking whether the Glue job is completed or not based on the job ID. If the job is finished it goes to the next block, and if the job is not finished it waits for a few more minutes based on the configured value and then polls for the Glue job status again. That's how it works; we have already discussed this particular pattern as well.

13:15
Now, once our Glue job finishes, what happens? In the analytical or processed layer, the curated data is written. Now the analytics team, the data analyst team, needs to query that data. How can they query the data from S3? Obviously one of the best options is Athena. For that, they first run a Glue crawler; the crawler updates the catalog table, and using that catalog the data analyst or data scientist team can easily query in Athena. And whenever this whole job is running — that is, Lambda is triggering Glue processing — this complete information, following the same pattern, is tracked in DynamoDB for audit or logging purposes. Whenever this Lambda submits the Glue job, it makes an entry in DynamoDB that this particular Glue job started with this job ID, and once the whole step is finished — here you can see the Lambda for the "post-update comprehensive catalog" — it updates in DynamoDB that the job is completed, along with all that information. Once our crawler is also ready, and the audit or reconciliation is captured in DynamoDB, the last step is the data quality check — because with bad data the business might take wrong decisions, and that might lead to losses for the company. So we always need to make sure that the data we have generated using our ETL pipeline is of good quality, and for that we can implement another Lambda function which performs the data quality validation using Deequ or any similar data quality framework.

14:44
So this is our complete flow; again, let me repeat the complete flow from scratch. First, from the raw layer the event is published to an SQS queue; that SQS queue triggers a Lambda; the Lambda routes the event to a team-specific SQS queue; and we have another Lambda for each team that continuously polls the team-specific SQS queue. If it gets a message, it immediately starts a step function; that step function performs the light transformation, the complete audit and logging are captured using Amazon DynamoDB, and after processing it writes the data to the staging layer. From the staging layer the event is sent to an SQS queue, indicating that a new file has been written in the staging layer, and we have another Lambda that is triggered every five minutes based on the CloudWatch rule; if that Lambda finds any message in the SQS queue, it triggers the heavy transformation using AWS Glue. So this way, initially we have raw data, in the middle we get staging data, and finally we get processed data which is ready for analytics workloads. This is our Serverless Data Lake Framework, which is open source; I'll be providing the complete project link in the description box, and you can go through the code base.

16:01
Now I would like to draw your attention to a major difference between Stage A and Stage B, and that is: Stage A is a kind of near real-time system. Here, if you see, as soon as a file lands in S3, the event goes from the generic SQS queue via Lambda to the team-specific SQS queue, and as soon as a message is available, in near real time the Lambda consumes it and triggers the step function to apply the light transformation. It is not doing much batching, because these are very light transformations we are applying in this step function — we can apply this process on individual small files as well. But if you consider Stage B, in this particular place we are using AWS Glue for the transformation, and although Glue is serverless, it should mostly be used for a good volume of data processing — it is not meant to be triggered for every single file. So here, instead of near real-time, we should do batching in Stage B. And how are we implementing that? It is implemented using this CloudWatch rule. If you observe, when a file lands in the staging layer the event is published to an SQS queue, but a Lambda is not polling this queue to trigger the step function; rather, the CloudWatch rule triggers the Lambda, and the Lambda checks whether any message is available in the SQS queue. That means that in this 5-minute window it is essentially waiting to accumulate all the files that have landed in the staging location, and then the Lambda triggers the step function to process all those files together. So if you see this part, it is clearly labeled "batch", and here is a very vital point: Stage B is for batch processing, and Stage A is almost a near real-time system.

18:05
Another part of this architecture is the CI/CD component. Here, data engineers need to write the code for the light transformation, as well as the code for applying the business logic in AWS Glue. So how do we push this kind of project-specific code? For that we use a CI/CD pipeline with CodeCommit and CodePipeline, in Stage B as well as in Stage A. Mostly we need to work on the part where the transformation is applied: for example, it is clearly indicated that the upper part, where a Lambda records in DynamoDB that event processing has started, is generic, and the last Lambda, which updates in DynamoDB that the event has been processed, is also generic; it is the middle part, where we apply the light transformation using Lambda, that may vary from project to project. So if you see, CodeCommit and CodePipeline are only changing that particular Lambda. Instead of these CI/CD tools you can also use GitHub Actions, but the main focus should be on changing only the part that varies from project to project — in Stage A this part is variable, and similarly for Stage B the Glue part is variable. So here, if you see, CodeCommit goes to CodePipeline, and from there to this particular Lambda, which will trigger different Glue jobs for different purposes. So only implement the CI/CD pipeline for those variable places, not for the whole pipeline — that's what I'm trying to say. These are two very important points: one is the CI/CD pipeline, and the other is the nature of Stage A and Stage B — Stage A is near real-time, but Stage B is batch processing.

19:42
I hope you understood this. Now, if you want to implement this whole flow in your own AWS account from scratch, you can refer to some of my videos which will surely help you implement the complete flow. Let me explain. If you see the initial part — let me just highlight it with a different color — this particular flow from S3 to SQS to Lambda I have already covered in the video "Create event based projects using S3, Lambda and SQS"; exactly the same pattern is covered there, and the link will be in the description box. Then, here you can see this Lambda publishes the event to SQS, and from SQS the next Lambda is triggered — so we should know how to publish a message to an SQS queue using Python code, and that part I have covered in the video on sending messages to an SQS queue from Python Boto3. For the next step, if you observe, this Lambda triggers our AWS Step Functions state machine, so we should know how to trigger AWS Step Functions from Lambda — that particular thing I have already covered in another video, in exactly the same format: messages come to SQS, from there to Lambda, and from there to AWS Step Functions; you can check that video to understand the complete flow of this part. And on Step Functions itself I have already covered many videos in my AWS Step Functions playlist. Now, from S3 the message goes to SQS — that part is simple — and here we have a CloudWatch rule triggering a Lambda every 5 minutes; that concept is also covered in the video on how to configure a CloudWatch event rule that calls an AWS Lambda function periodically. And if you observe this Lambda, it is triggering our Glue job using the step function — here you can either use a Lambda function within the step function to trigger the Glue job, or use AWS Step Functions' direct integration with AWS Glue, which I covered in the video on building an ETL pipeline using AWS Glue and Step Functions, where I explain in detail how to start a crawler, how to wait for the crawler to reach the completed state, how to start a Glue job, and how to wait for it to complete. That video will give you an idea of the interaction between Step Functions and Glue. Lastly, if you observe, everywhere we are using DynamoDB for audit or logging purposes — in Stage B and also in Stage A the Lambda writes data to DynamoDB — so how to insert data into DynamoDB from Lambda is also covered in another video, which you can check for that particular implementation. So, in this way, all these reference videos can surely help you implement this SDLF, the Serverless Data Lake Framework.

22:47
I hope you understood this. That is all for this video. If you find this video interesting, then please like, share, and comment. Subscribe to our channel if you have not subscribed till now, and don't forget to press the bell icon to get notified of our latest videos. Thank you for watching.


Related Tags
AWS Serverless, Data Framework, ETL Pipeline, Cloud Computing, Data Analytics, Lambda Functions, S3 Storage, DynamoDB, Glue Processing, CI/CD Pipeline, Data Quality