Part 1 - End-to-End Azure Data Engineering Project | Project Overview

Mr. K Talks Tech
2 May 2023 · 11:02

Summary

TL;DR: This video tutorial offers a comprehensive guide to a real-time data engineering project using Azure technologies. It covers the end-to-end process, from data ingestion with Azure Data Factory to transformation with Azure Databricks, and storage in Azure Data Lake. The project demonstrates the Lake House architecture, including bronze, silver, and gold data layers, culminating in analysis with Azure Synapse Analytics and reporting with Power BI. It also addresses security and governance with Azure Active Directory and Azure Key Vault, providing a complete understanding of building and automating a data platform solution.

Takeaways

  • 📊 This video covers a complete end-to-end data engineering project using Azure technologies.
  • 🔧 The project demonstrates how to use various Azure resources like Azure Data Factory, Azure Synapse Analytics, Azure Databricks, Azure Data Lake, Azure Active Directory, Azure Key Vault, and Power BI.
  • 💻 The use case involves migrating data from an on-premise SQL Server database to the cloud using Azure services.
  • 🏞️ The project implements the lakehouse architecture, which includes organizing data into bronze, silver, and gold layers in Azure Data Lake.
  • 🚀 Azure Data Factory is used to connect to the on-premise SQL Server and copy data into Azure Data Lake Gen2.
  • 🔄 Azure Databricks is utilized for data transformation tasks, converting raw data into curated formats stored in different layers.
  • 📚 Azure Synapse Analytics is employed to replicate the database and tables from the on-premise SQL Server and load the curated data for further analysis.
  • 📈 Power BI is used for creating reports and visualizations from the data stored in Azure Synapse Analytics.
  • 🔒 Security and governance are managed using Azure Active Directory and Azure Key Vault for identity management and storing sensitive information.
  • 🧩 The project is structured into multiple parts: environment setup, data ingestion, data transformation, data loading, data reporting, and end-to-end pipeline testing.

Q & A

  • What is the main focus of the video?

    -The main focus of the video is to demonstrate a complete end-to-end data engineering project using Azure technologies.

  • What is the purpose of using Azure Data Factory in this project?

    -Azure Data Factory is used for data ingestion, connecting to the on-premise SQL Server database, copying tables, and moving the data to the cloud.
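For a concrete sense of what that ingestion step works against, here is a minimal Python sketch (not from the video) of the kind of metadata query a Data Factory Lookup activity typically runs on the on-premise SQL Server to enumerate the tables to copy; the server, database, and credentials shown are placeholders.

```python
# Illustrative only: the sort of metadata query a Data Factory Lookup activity
# might run against the on-premise SQL Server before copying each table.
# Server name, database name, and credentials below are placeholders.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost;DATABASE=AdventureWorksLT;"
    "UID=sql_user;PWD=example_password"
)

cursor = conn.cursor()
# List every user table with its schema, mirroring what the copy pipeline iterates over.
cursor.execute("""
    SELECT s.name AS schema_name, t.name AS table_name
    FROM sys.tables t
    JOIN sys.schemas s ON t.schema_id = s.schema_id
""")
for schema_name, table_name in cursor.fetchall():
    print(f"{schema_name}.{table_name}")

conn.close()
```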

  • What is the role of Azure Data Lake in the project?

    -Azure Data Lake is used as a storage solution to store the data copied from the on-premise SQL Server database by Azure Data Factory.

  • How does Azure Databricks contribute to the project?

    -Azure Databricks is used for data transformation, allowing data engineers to write code in SQL, PySpark, or Python to transform the raw data into a more curated form.

  • What is the concept of lake house architecture mentioned in the script?

    -Lake house architecture refers to the organization of data in layers within the data lake, such as bronze, silver, and gold layers, each representing different levels of data transformation.

  • What happens to the data in the bronze layer?

    -The bronze layer in the data lake holds an exact copy of the data from the data source without any changes in format or data types, serving as the source of truth.

  • What is the purpose of the silver and gold layers in the data lake?

    -The silver layer is for the first level of data transformation, such as changing column names or data types, while the gold layer is for the final, most curated form of data after all transformations are completed.
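As a rough illustration of a bronze-to-silver step, here is a minimal PySpark sketch of the kind of notebook code described above, assuming the tables were landed as Parquet and mount points such as /mnt/bronze and /mnt/silver already exist; the table and column names are hypothetical and the exact transformations in the video may differ.

```python
# A minimal bronze-to-silver sketch (runs in a Databricks notebook, where `spark` is predefined).
from pyspark.sql.functions import col, to_date

bronze_df = spark.read.parquet("/mnt/bronze/SalesLT/Customer/")

silver_df = (
    bronze_df
    .withColumnRenamed("CustomerID", "customer_id")             # standardize column names
    .withColumn("ModifiedDate", to_date(col("ModifiedDate")))   # cast timestamp to date
)

# Delta format keeps the layer queryable and supports incremental reloads.
silver_df.write.format("delta").mode("overwrite").save("/mnt/silver/SalesLT/Customer/")
```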

  • How does Azure Synapse Analytics relate to the on-premise SQL Server database?

    -Azure Synapse Analytics serves a similar purpose to the on-premise SQL Server database, allowing the creation of databases and tables to store and manage the transformed data.
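One possible way to load gold-layer data into Synapse from Databricks is the built-in Azure Synapse connector; the sketch below is an assumption about the loading mechanism (the video may use a different approach), and the JDBC URL, staging container, and table names are placeholders.

```python
# Hedged sketch: writing gold-layer data into a dedicated SQL pool table via the
# Azure Synapse connector that ships with Databricks. All names are placeholders.
gold_df = spark.read.format("delta").load("/mnt/gold/SalesLT/Customer/")

(gold_df.write
    .format("com.databricks.spark.sqldw")
    .option("url", "jdbc:sqlserver://myworkspace.sql.azuresynapse.net:1433;database=gold_db")
    .option("forwardSparkAzureStorageCredentials", "true")  # reuse the notebook's storage auth
    .option("tempDir", "abfss://staging@mydatalake.dfs.core.windows.net/tmp")  # staging area
    .option("dbTable", "SalesLT.Customer")
    .mode("overwrite")
    .save())
```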

  • What is the role of Power BI in the project?

    -Power BI is used for data reporting, allowing data analysts to create various types of reports and visualizations based on the data loaded into Azure Synapse Analytics.

  • What security and governance tools are mentioned in the script?

    -Azure Active Directory for identity and access management, and Azure Key Vault for securely storing and retrieving secrets like usernames and passwords are mentioned as security and governance tools.
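Inside a Databricks notebook, secrets stored in Azure Key Vault are typically read through a Key Vault-backed secret scope; the sketch below assumes a scope called "project-scope" and secret names that are purely illustrative.

```python
# Minimal sketch of retrieving credentials from Azure Key Vault in Databricks,
# assuming a Key Vault-backed secret scope named "project-scope" already exists.
sql_username = dbutils.secrets.get(scope="project-scope", key="sql-username")
sql_password = dbutils.secrets.get(scope="project-scope", key="sql-password")

# Secret values are redacted in notebook output, so they can be passed into
# connection options without appearing in logs.
```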

  • What is the main task of data engineers in automating the data platform solution?

    -The main task of data engineers is to automate the entire data platform solution through pipelines, ensuring that any new data added to the source is automatically processed and reflected in the end reports.

Outlines

00:00

🚀 Introduction to End-to-End Data Engineering with Azure

The video script introduces an end-to-end data engineering project built with Azure technologies. The presenter guarantees that viewers will come away with a clear understanding of how to use various Azure resources for data engineering. The project demonstrates a common use case in companies: migrating an on-premise SQL Server database to the cloud. The presenter covers the use of Azure Data Factory, Azure Synapse Analytics, Azure Databricks, Azure Data Lake, Azure Active Directory, Azure Key Vault, and Power BI. The project can be included in viewers' resumes and will help them prepare for data engineering interviews.

05:01

🔧 Data Ingestion and Transformation with Azure Tools

This paragraph delves into the specifics of data ingestion and transformation using Azure tools. It starts with the use of Azure Data Factory for extracting data from an on-premise SQL Server database and storing it in Azure Data Lake Gen 2. The script then explains the use of Azure Databricks for data transformation, adhering to the lake house architecture, which involves organizing data into bronze, silver, and gold layers. The bronze layer acts as the source of truth, while the silver and gold layers represent different levels of data transformation. The paragraph also touches on the use of Azure Synapse Analytics for creating a cloud-based database similar to the on-premise model and Power BI for data reporting.

10:03

🛠️ Setting Up the Data Engineering Environment with Azure

The final paragraph outlines the agenda for the data engineering project, which includes environment setup, data ingestion, transformation, loading, reporting, and end-to-end pipeline testing. It emphasizes the importance of automating the data platform solution and mentions the use of Azure Active Directory for security and Azure Key Vault for storing secrets. The paragraph concludes by stating that the project will begin with setting up the environment in the Azure portal, indicating the start of the practical implementation of the discussed concepts.

Keywords

💡Data Engineering

Data Engineering refers to the practice of designing, building, and maintaining the systems and processes that enable the collection, storage, and transformation of raw data into more usable forms for analytics. In the video, the theme revolves around a complete end-to-end data engineering project, demonstrating how to use various Azure tools to build a data engineering solution.

💡Azure

Azure is a cloud computing service created by Microsoft, offering a range of cloud services, including analytics, computing, database, and networking. The video focuses on using Azure technologies to complete a data engineering project, showcasing how different Azure resources are utilized in the process.

💡Azure Data Factory

Azure Data Factory is a cloud-based data integration service that allows you to create, schedule, and manage data workflows. In the script, it is mentioned as an ETL (Extract, Transform, Load) tool used for data ingestion, connecting to an on-premise SQL Server database and moving data to the cloud.

💡Azure Synapse Analytics

Azure Synapse Analytics is an analytics service that brings together enterprise data warehousing and big data analytics. It is used in the video to create a database similar to an on-premise SQL Server and to load the transformed data into it, providing a familiar data warehousing model in the cloud.

💡Azure Databricks

Azure Databricks is a big data analytics platform optimized for Azure. It's used in the video for high-end data analytics workloads, allowing data engineers to write code for data transformation using SQL, PySpark, or Python within the platform.

💡Azure Data Lake

Azure Data Lake is a hyperscale, secure data lake service that allows you to store vast amounts of unstructured and structured data at a low cost. The script describes using Azure Data Lake Gen 2 to store the raw data copied from the on-premise SQL Server database before transformation.
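One common way (not necessarily the exact method shown in the video) to expose an ADLS Gen2 container to Databricks is an OAuth mount backed by an Azure AD service principal; every account, container, scope, and secret name below is a placeholder.

```python
# Hedged sketch: mounting the bronze container of an ADLS Gen2 account in Databricks
# with a service principal. All names and the tenant ID are placeholders.
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": dbutils.secrets.get("project-scope", "sp-client-id"),
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get("project-scope", "sp-client-secret"),
    "fs.azure.account.oauth2.client.endpoint":
        "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}

dbutils.fs.mount(
    source="abfss://bronze@mydatalake.dfs.core.windows.net/",  # bronze container
    mount_point="/mnt/bronze",
    extra_configs=configs,
)
```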

💡Lake House Architecture

The Lake House Architecture is a concept that combines the best of data lakes and data warehouses, organizing data into layers for different transformation stages. In the video, the architecture is explained with bronze, silver, and gold layers, where data is progressively transformed and cleaned before being used for analytics.

💡Power BI

Power BI is a business analytics service by Microsoft that provides interactive visualizations and business intelligence capabilities. In the context of the video, Power BI is used to create reports and visualizations from the data loaded into Azure Synapse Analytics, allowing for data analysis and reporting.

💡ETL

ETL stands for Extract, Transform, Load, which is a process in database usage that involves pulling data from different sources, converting it into a suitable format, and loading it into a database. The video script discusses the use of Azure Data Factory as an ETL tool for data ingestion and transformation.

💡Security and Governance

Security and governance in the context of the video refer to the practices and tools used to ensure data protection, access control, and compliance with regulations. Azure Active Directory and Azure Key Vault are mentioned as tools for identity access management and secure storage of secrets, respectively.

💡Automation

Automation in the video refers to the process of setting up the data engineering solution in such a way that it can run without manual intervention. This includes the use of pipelines to automatically handle new data from the source, transform it, and update the reports in Power BI.
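In practice the video drives automation through Data Factory triggers and pipelines; purely as an illustration, the hedged sketch below shows how a pipeline run could also be started programmatically with the azure-mgmt-datafactory SDK, with every resource name being a placeholder.

```python
# Hedged sketch: starting an ADF pipeline run from Python. Subscription, resource
# group, factory, and pipeline names are placeholders, not values from the video.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

credential = DefaultAzureCredential()
adf_client = DataFactoryManagementClient(credential, "<subscription-id>")

run = adf_client.pipelines.create_run(
    resource_group_name="rg-data-engineering",
    factory_name="adf-project",
    pipeline_name="copy_all_tables",
)
print(f"Started pipeline run: {run.run_id}")
```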

Highlights

Introduction to a complete end-to-end data engineering project using Azure technologies.

Guarantee of a clear understanding of using Azure resources for data engineering after the video.

Use case of migrating an on-premise SQL Server database to the cloud, a common scenario for companies.

Overview of tools used: Azure Data Factory, Synapse Analytics, Databricks, Data Lake, Azure Active Directory, Key Vault, and Power BI.

Explanation of Azure Data Factory as an ETL tool for data ingestion from on-premise databases.

Utilization of Azure Data Lake Gen 2 for storing data in the cloud at a low cost.

Role of Azure Databricks in transforming raw data into curated data for analytics.

Introduction to the Lake House architecture with bronze, silver, and gold data layers.

Details on the bronze layer as the source of truth with an exact copy of the data source.

Transformation process from bronze to silver layer for basic changes like column names and data types.

Final transformation to the gold layer for the cleanest form of data using Azure Databricks.

Use of Azure Synapse Analytics to create databases and tables similar to on-premise SQL Server.

Loading transformed data into Azure Synapse Analytics for analysis and reporting.

Application of Power BI for creating reports and visualizations from the data in Synapse Analytics.

Importance of automating the data platform solution through pipelines for real-time data updates.

Security and governance aspects covered with Azure Active Directory and Azure Key Vault.

Agenda for the project video, split into multiple sections for comprehensive learning.

Starting with environment setup in Azure Portal for the data engineering project.

Transcripts

00:04

Hello everyone, welcome back to my channel. In today's video we are going to see a complete end-to-end, real-time data engineering project. This data engineering project is built entirely using Azure technologies, so I can pretty much guarantee that after watching this complete end-to-end project video, you will have a clear understanding of how to use different Azure resources to build your own data engineering project. This is going to be a complete demo of the different resources. The use case covered in this video is a very popular one, and most companies use it when building their data engineering projects. I'm pretty sure it will be really useful; you can also include this project in your resume, and it will help you clear Azure data engineering interviews. So without wasting further time, let's get started.

00:55

I'd like to start by introducing the different tools that are going to be used in this project. We will be using Azure Data Factory, Azure Synapse Analytics, Azure Databricks, Azure Data Lake, Azure Active Directory, Azure Key Vault, and finally Power BI. As you can see, there are a lot of different resources; these are the resources most commonly used by data engineers to build any kind of data engineering project. Now let's see how we can use these tools to build this project, with an example architecture.

01:35

As I mentioned before, the use case we are going to work through in this project is a pretty common one. What I mean by a "common use case": once I introduce the data source, you will get a clear idea of what I'm talking about. The data source we are going to use in this particular use case is an on-premise SQL Server database. We all know that one of the main reasons companies move to the cloud is to migrate their traditional on-premise database to the cloud, so this is one of the most common use cases, and I thought we should take the same use case for this project so that you can really understand the whole process. Inside the database we have six or seven tables, and we are going to migrate this database completely to the cloud.

02:19

The first step uses a tool called Azure Data Factory. Azure Data Factory is an ETL tool. I have created a separate playlist for Azure Data Factory where I cover all the basic concepts, such as what Azure Data Factory is in general and the different things we can do with it. If you want a deeper understanding, you can check that out, but if you ask me whether it is mandatory to watch those videos before understanding this project, I would say no, because all the concepts will be covered in this video as well. As I mentioned, Azure Data Factory is an ETL tool mainly used for data ingestion. Azure Data Factory will connect to this on-premise SQL Server database, copy all the tables from the database, and move them to the cloud. Where will the tables be stored in the cloud? For that we are using Azure Data Lake Gen2. Gen2 is the storage solution in Azure, and data stored in Azure Data Lake is pretty cheap. So we will use Azure Data Factory to connect to the on-premise SQL database and put all the data into Azure Data Lake Gen2.

03:24

Once the data has been added to Azure Data Lake, we will use a tool called Azure Databricks to transform all the raw data into the most curated data. Azure Databricks is a big data analytics tool mainly used for high-end data analytics workloads. In simpler words, using Azure Databricks we can write code, whether in SQL, PySpark, or Python; all the actual development work will be done inside Azure Databricks, which is mainly used for transforming the data.

03:57

The next thing, and one of the most important topics covered in this project, is a concept called lakehouse architecture, which spans Azure Databricks and Azure Data Lake. What is meant by lakehouse architecture? Azure Data Lake is divided into multiple layers: a bronze layer, a silver layer, and a gold layer. What is the difference between these layers? Let me start with the bronze layer. As discussed before, Azure Data Factory connects to the on-premise SQL Server database, copies the data, and puts it into Azure Data Lake. Inside Azure Data Lake, Azure Data Factory puts the data first into the bronze layer, which means the bronze layer has an exact copy of what the data looks like in the data source. In this project we are not going to touch any data inside the bronze layer, and we are not going to change the format or anything else inside it, so it is going to be the source of truth. One of the main advantages of this is that, in the subsequent data transformations, if something goes wrong you can come back to the bronze layer and get all the data again, because it is the same as the data source.

05:11

Once the data has been copied to the bronze layer by Azure Data Factory, we then use Azure Databricks to connect to this bronze layer, do some data transformation, and load the transformed data into the silver layer. The silver layer and gold layer represent different levels of transformation: the first level of transformation produces the silver layer, and the next level produces the gold layer. From bronze to silver, the transformation might be fairly simple, like changing column names or changing data types, because a lot of the time the on-premise data types and the cloud data types are not really compatible, so we may need to change a few things based on what the cloud supports. Those kinds of minimal transformations are done in the silver layer. Once the data has been transformed using Azure Databricks and the refined data has been loaded into the silver layer, we then do another set of transformations, again using Azure Databricks, and that data is finally loaded into the gold layer. The gold layer is the final, cleanest form of the data. We all need clean data, right? One of the main tasks of data engineers is to turn the raw data into the most curated data, which means all these data transformations are done by data engineers using Azure Databricks across the different zones. That is what is called lakehouse architecture.

06:31

Once the data has been loaded into the gold layer, we use another tool called Azure Synapse Analytics. You can think of Azure Synapse Analytics as similar to the on-premise SQL Server database: we can create the same kind of databases and tables, similar to how we create them using the on-premise SQL database. Once these databases and tables have been set up in Azure Synapse Analytics, all the data present in the gold layer is loaded into the tables we have created there. At the end of this step, Azure Synapse Analytics will have a data warehousing model similar to how it looks in the on-premise SQL database, so you can consider the data completely migrated.

07:21

We are not going to stop there; we are going to do further analysis on the data stored inside Azure Synapse Analytics using a tool called Power BI. Power BI will pull all the data that has been loaded into Azure Synapse Analytics, and we can create different kinds of reports, such as charts and bar charts, or whatever reports we want, in Power BI. Basically, a data analyst is the one who creates reports in Power BI, but in many companies the data engineers also do it, so it is better to understand the full end-to-end flow of a data engineering project. That is the reason I have covered Power BI as part of this project.

08:01

Apart from this, we are also using security and governance tools. We are using Azure Active Directory, which is an identity and access management tool; all the security-related tasks, such as creating the service principal, can be done in Azure Active Directory, and the other details will be covered in the subsequent sections. We will also be using Azure Key Vault to store all the secrets; for example, the username and password can be stored in Azure Key Vault and safely retrieved from there, which is the safest way data engineers handle this in real-time projects. I wanted to include this so you get a complete understanding of the overall data engineering project.

08:43

Cool, so now I think you have an understanding of the use case we are going to implement in this data engineering project. This complete end-to-end project starts with data ingestion; once the data has been ingested and stored in Azure Data Lake, we use Azure Databricks to transform all the data into its cleanest form. Once the data has been transformed and loaded into the gold layer in Azure Data Lake, we use Azure Synapse Analytics to create the database and tables and load the final gold-layer data into those tables. Once the data has been loaded into Azure Synapse Analytics, we use Power BI to create reports. One of the main tasks of data engineers is to automate the entire data platform solution, which is pretty much done via pipelines. What I mean by this is: once we have configured this end-to-end data engineering solution, say a new row is added to any of the tables in the on-premise SQL database; when we run the pipeline, it should pick up that latest row from the on-premise SQL database, do all the data transformations, load the data into the database, and finally Power BI should reflect the new row that was added at the source.

09:58

Cool, I think you have pretty much understood the use case of this project, and it will be really helpful for you to understand how all these data resources work together and how they can be brought together to build this data engineering project. Since there are a lot of different things to cover in this video, I have split it into multiple sections. The agenda for this project: I will start with part one, which is the environment setup; once that is done, we move to part two, which is the data ingestion we just discussed; part three is data transformation using Databricks; part four is data loading; part five is data reporting using Power BI; and the final part six is end-to-end pipeline testing. This is the agenda for this project, and we will start with part one, the environment setup. For that, let's go to the Azure portal and see which resources we need to create and what other environment setup we have to do. Let's check it out.


Related Tags
Data Engineering, Azure Technologies, ETL Tool, Data Migration, Cloud Database, Data Transformation, Data Lake, Databricks, Synapse Analytics, Power BI, Security Governance