ETL - Extract Transform Load | Summary of all the key concepts in building ETL Pipeline

ETL-SQL
6 Jul 2022 · 24:16

Summary

TL;DR: This video delves into the crucial concept of ETL (Extract, Transform, Load) pipelines, essential for data warehousing. It covers the extraction of data from various sources, transformation processes involving mapping, enrichment, and aggregation, and the final loading into data warehouses. The video is a valuable resource for both SQL beginners and experienced professionals.

Takeaways

  • πŸ“š ETL stands for Extract, Transform, and Load, which are the three main phases of a data pipeline used in data warehousing.
  • πŸ” In the Extract phase, data is gathered from various sources like databases, flat files, or real-time streaming platforms like Kafka.
  • 🚫 Avoid complex logic during the extraction phase; simple transformations, like calculating age from the date of birth, are acceptable (see the sketch after this list).
  • πŸ”‘ Ensure data format consistency across multiple sources to maintain uniformity in the data warehouse.
  • πŸ›‘ Apply data quality rules during extraction to ensure the integrity and relevance of the incoming data, such as filtering out records from before the business started.
  • πŸ—‚ The staging area is a temporary holding place for data where basic transformations and quality checks occur before the data moves to the data warehouse.
  • πŸ”„ Common load strategies include full loads for small tables and delta loads for larger tables to manage changes efficiently.
  • πŸ—Ί The Transform phase involves converting raw data into meaningful information through mapping, enrichment, joining, filtering, and aggregation.
  • πŸ” Mapping in the Transform phase can include direct column mappings, renaming, or deriving new columns from existing data.
  • πŸ“Š Fact tables in the data warehouse contain measures like total sales and are often linked to dimension tables via foreign keys.
  • 🏒 The Enterprise Data Warehouse (EDW) serves as the main business layer, storing processed data for reporting and analysis, and can feed downstream applications or data marts.
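
    To illustrate the kind of lightweight logic that is acceptable during extraction, here is a minimal sketch that derives an age column from a date of birth while staging the data. The table and column names (landing_customer, stg_customer, date_of_birth) are illustrative, and the date arithmetic shown is PostgreSQL-style; other databases spell it differently.

        -- Hypothetical staging load: only a trivial derivation, no business logic.
        INSERT INTO stg_customer (customer_id, full_name, date_of_birth, age)
        SELECT
            customer_id,
            full_name,
            date_of_birth,
            DATE_PART('year', AGE(CURRENT_DATE, date_of_birth)) AS age  -- completed years
        FROM landing_customer;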

Q & A

  • What does ETL stand for in the context of data warehousing?

    -ETL stands for Extract, Transform, and Load, which are the three main steps involved in the process of integrating data from different sources into a data warehouse.

  • Why is understanding ETL important for SQL beginners?

    -Understanding ETL is important for SQL beginners because it is a fundamental concept in data warehousing and data integration, which are essential skills for working with databases and managing data flows.

  • What are the different sources from which data can be extracted?

    -Data can be extracted from various sources such as OLTP systems, flat files, hand-filled surveys, and real-time streaming sources like Kafka.

  • What is the purpose of the extract phase in ETL?

    -The purpose of the extract phase is to get data from the source as quickly as possible and prepare it for the subsequent transformation phase.

  • What is the significance of data format consistency in the extraction phase?

    -Data format consistency ensures that the same data is represented in the same manner across different sources, simplifying the integration process and reducing errors during data transformation.
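
    A minimal sketch of this normalization, assuming a hypothetical staging table stg_customer_src1 whose source spells gender as Male/Female/Others and sends dates as DD-MM-YYYY; the target conventions chosen here (single-letter gender codes and proper DATE values) and all object names are illustrative, and TO_DATE is Oracle/PostgreSQL-style syntax.

        -- Normalize gender codes and date formats on the way out of staging.
        SELECT
            customer_id,
            CASE gender_raw
                WHEN 'Male'   THEN 'M'
                WHEN 'Female' THEN 'F'
                WHEN 'Others' THEN 'O'
                ELSE gender_raw                      -- already a single-letter code
            END                                      AS gender,
            TO_DATE(birth_date_raw, 'DD-MM-YYYY')    AS birth_date
        FROM stg_customer_src1;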

  • What are some examples of data quality rules that can be applied during the extraction phase?

    -Examples of data quality rules include checking that sales data is from the correct time period (e.g., after the business started), ensuring that related columns have corresponding values, and limiting the length of description columns to save storage space.
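
    These rules translate into simple predicates while moving data out of the landing area. The sketch below routes offending rows to an error table and truncates the description column; the 2015 cutoff and the sales_date/sales_id pairing come from the video's examples, while the table and column names are illustrative.

        -- Capture rows that violate the quality rules so they can be reported back to the source.
        INSERT INTO err_sales
        SELECT *
        FROM landing_sales
        WHERE sales_date < DATE '2015-01-01'                  -- business did not exist before 2015
           OR (sales_date IS NOT NULL AND sales_id IS NULL);  -- paired columns must both be present

        -- Load only the rows that pass, keeping just the first 500 characters of the description.
        INSERT INTO stg_sales (sales_id, sales_date, amount, description)
        SELECT sales_id, sales_date, amount, LEFT(description, 500)
        FROM landing_sales
        WHERE sales_date >= DATE '2015-01-01'
          AND sales_id IS NOT NULL;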

  • What are the two popular load strategies for the extract phase?

    -The two popular load strategies for the extract phase are full load, where the entire table is sent every time, and delta load, where only changes to the table are sent.
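
    A rough sketch of both strategies with illustrative table names. The delta branch uses MERGE, which is one common way to apply inserts and updates by comparing on the primary key; the exact MERGE syntax (and its availability) varies by database.

        -- Full load: small table, truncate and reload every run.
        TRUNCATE TABLE dim_country;
        INSERT INTO dim_country SELECT * FROM stg_country;

        -- Delta load: apply only the changed rows, matched on the primary key.
        MERGE INTO sales_target AS t
        USING stg_sales_delta AS s
           ON t.sales_id = s.sales_id
        WHEN MATCHED THEN
            UPDATE SET sales_date = s.sales_date, amount = s.amount
        WHEN NOT MATCHED THEN
            INSERT (sales_id, sales_date, amount)
            VALUES (s.sales_id, s.sales_date, s.amount);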

  • What is the main purpose of the transform phase in ETL?

    -The main purpose of the transform phase is to apply various data transformations and mappings to convert raw data into meaningful information that can be used for business analysis and reporting.

  • What are some common transformation steps involved in the transform phase?

    -Common transformation steps include mapping, enrichment, joining, filtering, removing duplicates, and aggregation.
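
    A short sketch combining several of these steps: renaming and deriving columns (mapping), a lookup join that turns a zip code into a city (enrichment), and a region filter. All table and column names are illustrative, and the || concatenation is ANSI/PostgreSQL-style.

        SELECT
            s.emp_id                           AS employee_id,   -- rename
            s.first_name || ' ' || s.last_name AS full_name,     -- derived column
            z.city                             AS city,          -- enrichment via lookup
            s.sales_amount
        FROM stg_sales s
        LEFT JOIN lkp_zip_city z
               ON z.zip_code = s.zip_code
        WHERE s.region = 'NORTH AMERICA';                        -- filter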

  • What is the difference between a dimension table and a fact table in a data warehouse?

    -A dimension table typically contains descriptive information about the data (e.g., employee details) and has a primary key, while a fact table contains quantitative measures (e.g., sales figures) and includes foreign keys that reference the primary keys of dimension tables.
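
    A minimal sketch of the two table shapes, with a surrogate key on the dimension and a foreign key plus an additive measure on the fact. Names and data types are illustrative, and the identity-column syntax varies by database.

        CREATE TABLE dim_employee (
            employee_key  INT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,  -- surrogate key
            employee_id   VARCHAR(20) NOT NULL,                          -- natural / functional key
            employee_name VARCHAR(100),
            office_city   VARCHAR(50)
        );

        CREATE TABLE fact_sales (
            sales_key     INT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
            employee_key  INT NOT NULL REFERENCES dim_employee (employee_key),  -- foreign key
            sales_date    DATE,
            total_sales   DECIMAL(18,2)                                  -- additive measure
        );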

  • What is the role of the load phase in the ETL process?

    -The load phase is responsible for loading the transformed data into the appropriate tables in the data warehouse, such as dimension tables, fact tables, and enterprise data warehouse (EDW) tables, and making it available for business intelligence and reporting.

  • What is the purpose of data marts in the context of ETL?

    -Data marts are subject-specific areas derived from the enterprise data warehouse (EDW), used for focused analysis and reporting. They contain data that is specific to a particular business area or department.
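
    In practice a data mart is often just a subject-specific extract of the EDW; a minimal sketch, assuming hypothetical edw_sales and mart_sales_na tables and a database that supports CREATE TABLE ... AS SELECT.

        -- Build a North America sales mart from the enterprise warehouse layer.
        CREATE TABLE mart_sales_na AS
        SELECT *
        FROM edw_sales
        WHERE region = 'NORTH AMERICA';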

Outlines

00:00

πŸ“š Introduction to ETL Pipelines

The video introduces the concept of Extract, Transform, Load (ETL) pipelines, emphasizing their importance for building data warehouses. It targets both SQL beginners and experienced professionals, with the aim of providing a comprehensive understanding of ETL. The speaker outlines the key points every ETL developer should know, noting that the list is not exhaustive but reflects what the presenter considers essential. The first topic is the extraction phase, which involves obtaining data from various sources in different formats. Extraction methods include flat files, JDBC/ODBC connections, SFTP pushes to a landing area (used when data governance or security rules prevent direct connections), and real-time streaming solutions like Kafka. The frequency of extraction and ingestion is determined by business needs and is not always real time. The section also touches on applying only simple logic and enforcing data format consistency during the extraction phase.

05:00

πŸ” Data Quality and Load Strategies in ETL

This paragraph delves into maintaining data quality during the ETL process, focusing on the extraction phase. It discusses applying data quality rules, such as ensuring data pertains to the period the business has been operational, and handling records that fail these checks by either ignoring them or routing them to an error table for review. The importance of data format consistency across multiple sources is highlighted, with examples for gender representation and date formats. The paragraph also covers strategies for loading data into the staging area, such as full loads for small tables and delta loads for larger, frequently updated datasets. Source-supplied flags for inserts, updates, and deletes simplify the loading process; when they are not provided, the staging table must be compared against the target table on the primary key. The paragraph concludes by noting that staging tables are typically truncate-and-load and are not intended for running business queries.

10:00

πŸš€ Transform Phase: Enhancing Data for Business Insights

The transform phase of the ETL process is the focus of this paragraph, where raw data from staging tables is converted into meaningful information through various data transformations. The speaker outlines several key transformation steps, including mapping (source to target mapping, renaming, and deriving new columns), enriching data through lookups, joining multiple tables into a single one for clearer reporting, filtering data for specific business needs, and removing duplicates to ensure data quality. Aggregation is also discussed as a critical step, where measures are calculated to support business decisions, such as total sales or revenue. The paragraph emphasizes the importance of these transformation steps in preparing data for the load phase, making it ready for business consumption and analytics.
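
For the duplicate-handling step mentioned above, one common pattern is to keep a single row per business key with a window function; a minimal sketch, assuming a hypothetical stg_sales table with a load_ts column recording when each row arrived.

    -- Keep only the most recent copy of each sales_id (e.g. after a file was sent twice).
    SELECT sales_id, sales_date, amount
    FROM (
        SELECT s.*,
               ROW_NUMBER() OVER (PARTITION BY sales_id ORDER BY load_ts DESC) AS rn
        FROM stg_sales s
    ) ranked
    WHERE rn = 1;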

15:01

πŸ“ˆ Load Phase: Dimension and Fact Tables in Data Warehousing

The load phase of the ETL process is detailed in this paragraph, which involves loading data into intermediate, dimension, and fact tables. Dimension tables are highlighted as having primary keys (surrogate keys) and functional identifiers (natural keys), with attributes that describe the information stored. The paragraph discusses different load strategies for dimension tables, including SCD Type 1, Type 2, and hybrid models, and the importance of defining granularity for accurate reporting. Fact tables are explained as having their own primary keys, foreign keys referencing dimension tables, and measures that can be aggregated. These measures are generally additive, although semi-additive and non-additive measures also exist. The paragraph concludes with a brief mention of the Enterprise Data Warehouse (EDW) and data marts, which are used for storing processed data and for subject-specific analysis, respectively.
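
For the Type 2 strategy mentioned above, the usual pattern is to insert a new row version and expire the old one when a tracked attribute changes. The sketch below assumes the dimension carries start_date, end_date, and is_current housekeeping columns and uses the video's employee-city example; the UPDATE ... FROM form shown is PostgreSQL-style, and both statements would normally run in one transaction.

    -- Insert a new current version for employees whose city changed.
    INSERT INTO dim_employee (employee_id, employee_name, office_city, start_date, end_date, is_current)
    SELECT s.employee_id, s.employee_name, s.office_city, CURRENT_DATE, DATE '9999-12-31', 'Y'
    FROM stg_employee s
    JOIN dim_employee d
      ON d.employee_id = s.employee_id
     AND d.is_current  = 'Y'
     AND d.office_city <> s.office_city;

    -- Expire the previous current version.
    UPDATE dim_employee d
    SET    end_date   = CURRENT_DATE,
           is_current = 'N'
    FROM   stg_employee s
    WHERE  d.employee_id = s.employee_id
      AND  d.is_current  = 'Y'
      AND  d.office_city <> s.office_city;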

20:02

🌟 Wrapping Up the ETL Process

In the final paragraph, the speaker summarizes the ETL process, from extracting data from various sources to transforming and loading it into the appropriate tables for business intelligence and reporting. The video aims to be educational for beginners and a refresher for experienced professionals. The speaker invites viewers to comment if any points were missed and expresses gratitude for watching. The paragraph reinforces the importance of understanding ETL for anyone involved in data warehousing and business analytics, ensuring that the audience is equipped with the knowledge to handle ETL pipelines effectively.

Keywords

πŸ’‘ETL

ETL stands for Extract, Transform, and Load, which is a fundamental process in data warehousing. It involves extracting data from various sources, transforming it into a suitable format, and loading it into a data warehouse. In the video, ETL is the central theme, with each phase being discussed in detail to highlight its importance in building data pipelines.

πŸ’‘Data Warehouse

A data warehouse is a large, centralized repository of data designed to support business intelligence activities. In the context of the video, the data warehouse is the ultimate destination for the extracted and transformed data, where it is stored and made available for analysis and reporting.

πŸ’‘OLTP

OLTP, or Online Transaction Processing, refers to a class of systems that manage transaction-oriented processes. In the script, OLTP sources are mentioned as one of the potential sources from which data can be extracted, indicating that they are databases that handle day-to-day transactions.

πŸ’‘Flat File

A flat file is a simple file used to store data in plain, human-readable text. In the video, flat files are mentioned as a method of data extraction where the source system extracts data and sends it in this format, which can then be loaded into the data warehouse.

πŸ’‘JDBC and ODBC

JDBC (Java Database Connectivity) and ODBC (Open Database Connectivity) are standards for connecting to databases. In the script, they are mentioned as methods to connect to OLTP sources for data extraction, allowing for direct database queries to retrieve data.

πŸ’‘Data Governance

Data governance involves the overall management of the availability, usability, integrity, and security of data in an organization. The video discusses how data governance and security issues might restrict direct connections to servers, leading to the use of other methods like SFTP for data transfer.

πŸ’‘Real-time Streaming

Real-time streaming refers to the processing and analysis of data as it is generated or received, without significant delay. The video uses real-time streaming as an example of how businesses can respond immediately to customer actions, such as website visits, to enhance customer engagement and conversion rates.

πŸ’‘Data Quality Rules

Data quality rules are criteria used to ensure that the data meets certain standards of accuracy, completeness, and consistency. In the video, these rules are used during the extraction phase to validate data, such as checking that sales dates are within the business's operational period, ensuring data reliability.

πŸ’‘Staging Area

A staging area is a temporary storage location for data that is being prepared for loading into a database or data warehouse. The script describes the staging area as a place where basic logic and data format consistency are applied before the data moves further in the ETL process.

πŸ’‘Batch Processing

Batch processing is a method of processing large amounts of data at one time, rather than processing it continuously in real-time. The video mentions batch processing as a typical pipeline strategy where data is ingested into the environment at scheduled intervals for processing.

πŸ’‘Dimension Table

In a data warehouse, a dimension table is a type of table that contains descriptive information about the data. The video explains that dimension tables should have a primary key, often a surrogate key, and attributes that describe the data, and they are crucial for understanding the context of the measures in fact tables.

πŸ’‘Fact Table

A fact table is a table in a data warehouse that contains measurable facts or quantities, such as sales figures or quantities produced. The script describes fact tables as having foreign keys that reference dimension tables and measures that can be aggregated to provide business insights.

πŸ’‘Semi-Additive and Non-Additive Facts

Semi-additive and non-additive facts refer to measures in a fact table that do not aggregate in a straightforward manner across all dimensions. The video points out that not all facts in a fact table are purely additive, and some measures may require special handling during aggregation.

πŸ’‘EDW (Enterprise Data Warehouse)

An Enterprise Data Warehouse (EDW) is a large data repository that integrates data from multiple sources and serves the analytical needs of an entire organization. The video describes the EDW as the main business layer where processed data is stored and shared with business teams for reporting and decision-making.

πŸ’‘Data Mart

A data mart is a subset of an organization's data warehouse that is focused on a specific subject area or department. The script mentions data marts as useful for creating subject-specific areas for analysis, derived from the EDW, and used for reporting and visualization.

Highlights

ETL (Extract, Transform, Load) is essential for building data warehouses and is a crucial topic for both SQL beginners and experienced professionals.

The Extract phase involves getting data from various sources as quickly as possible, including databases, flat files, and real-time streaming solutions.

Data can be extracted using flat files, JDBC/ODBC connections, or by pushing files via SFTP to a landing area.

Real-time streaming, supported by solutions like Kafka, is crucial for businesses requiring low latency, such as credit card inquiries on websites.

The Transform phase is where raw data is converted into meaningful information through various data transformations and mappings.

Mapping in the Transform phase involves source to target mapping, which can be direct, column-level, or involve creating new derived columns.

Data enrichment in the Transform phase can involve looking up additional information to make the data more meaningful for business reports.

Join transformations are used to combine data from multiple tables into a single table for easier reporting and analysis.

Filter transformations allow for the selection of specific data subsets, such as filtering data for a specific region like North America.

Duplicate records can be a common issue, and the ETL process should be designed to handle them effectively.

Aggregation is a key transformation step, often involving group by clauses and aggregation functions like SUM, MAX, MIN to calculate business measures.
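
A minimal sketch of the store-per-quarter example, with illustrative names; DATE_TRUNC is PostgreSQL-style, and other databases derive the quarter differently.

    -- Total and peak sales per store per quarter.
    SELECT store_id,
           DATE_TRUNC('quarter', sales_date) AS sales_quarter,
           SUM(amount) AS total_sales,
           MAX(amount) AS max_single_sale,
           MIN(amount) AS min_single_sale
    FROM fact_sales
    GROUP BY store_id, DATE_TRUNC('quarter', sales_date);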

The Load phase involves loading data into intermediate tables, dimensions, facts, and creating data marts for business intelligence.

Dimension tables should have a primary key (surrogate key) and a functional identifier or natural key for uniqueness and business relevance.

Fact tables, in addition to a primary key, also contain foreign keys sourced from dimension tables and include measures for aggregation.

The EDW (Enterprise Data Warehouse) is the main business layer that supports BI and contains processed data for decision-making.

Data marts can be derived from the EDW for subject-specific analysis and are used for reporting and visualization.

ETL pipelines should be designed to handle various data formats and quality rules, ensuring data consistency and accuracy.

Load strategies for ETL can include full loads for small tables or delta loads for larger tables to manage changes efficiently.

Transcripts

play00:00

hello everyone my name is nitin welcome

play00:03

back to my channel in today's video

play00:05

we'll talk about extract transform load

play00:08

or in short etl

play00:10

so if you are building any data

play00:11

warehouse you must be aware of etl

play00:13

pipelines

play00:14

if you are a sql beginner i must tell

play00:16

you that this is one of the most

play00:18

important topic which you should

play00:19

understand thoroughly

play00:21

if you are an experienced professional i

play00:23

am pretty sure this video will work as a

play00:25

very good refresher for you

play00:28

as part of this video i have covered the

play00:30

important points which everybody must

play00:33

know as part of etl

play00:35

also want to mention that these are not

play00:37

the only points so don't think that it

play00:40

is a complete exhaustive list but from

play00:42

my side as per my understanding whatever

play00:44

i feel is very important and are a must

play00:47

for any etl developer i have listed that

play00:50

in this video

play00:51

so without wasting any time let's start

play00:53

with the very first topic which is

play00:55

extract

play00:57

extract based basically means that you

play00:59

have to get data from your source so

play01:02

whatever the source could be it could be

play01:04

an oltp sources running on some database

play01:07

solutions it could be a flat file it

play01:10

could be a hand filled surveys it could

play01:13

be anything so for your data warehouse

play01:16

there can be n number of multiple

play01:18

sources and those sources can send you

play01:20

data in different formats extract phase

play01:23

means you have to get data as quickly as

play01:26

possible from your source and how can we

play01:28

do that

play01:29

so there are multiple ways by which you

play01:31

can achieve that the most common way is

play01:34

using the flat files so the source will

play01:36

extract data from their environment and

play01:39

they will send you it in flat file

play01:42

formats if the source is running some

play01:44

typical database solutions as oltp

play01:47

sources they may even let you connect

play01:49

via jdbc and odbc and then you can

play01:53

connect to those sources and extract the

play01:55

data yourself

play01:56

however in some of the cases because of

play01:58

data governance and security issues the

play02:01

source may not allow you to connect

play02:03

directly to their servers so in that

play02:05

case they may extract the files they

play02:07

will push files by sftp to your

play02:09

environment and we call it a landing

play02:11

area

play02:12

and from there you can consume in in

play02:14

your tables

play02:15

now with or the near real time streaming

play02:18

or the real time streaming has come into

play02:20

the picture so now solutions like kafka

play02:23

kinesis all these supports real-time

play02:25

streaming so if you are supporting a

play02:27

business where latency or the delay in

play02:29

consuming the data or

play02:32

say for example if you're working for a

play02:34

credit card business and there is a

play02:36

person who is visiting your website and

play02:39

who is inquiring about your different

play02:41

credit cards so as a business partner

play02:43

right what you may want to do is reach

play02:45

out to that customer immediately as

play02:48

as and when he's surfing your website

play02:50

because in a typical data warehouse if

play02:52

there is a 24 hour delay

play02:54

and say next day you reach out to that

play02:57

person that yesterday you were on our

play02:59

website and you were interested in

play03:01

different credit cards we have some

play03:02

offer for you would you like to take it

play03:05

now

play03:06

because of this delay you may end up

play03:08

losing a customer but at that moment

play03:10

when he's actually on your website and

play03:13

you have real time streaming enabled and

play03:15

you are streaming the click stream data

play03:16

into your systems and and supporting

play03:19

your business in real time that is one

play03:22

of the use cases i can think of in that

play03:24

case

play03:25

turning your potential customers into

play03:27

actual customers it's very high so

play03:29

depending on the business you are

play03:30

supporting you may go for the real-time

play03:32

streaming solutions as well for

play03:34

extractions

play03:35

and ingestion and then there is a

play03:37

typical pipeline batch where

play03:40

on decided frequency once a day or twice

play03:42

a day once a week

play03:44

the data will come to your environment

play03:46

and you will ingest it into your tables

play03:48

so extraction basically means you you

play03:51

have to get data done from source and

play03:53

you have to do it as quickly as possible

play03:55

that does not mean that it should be

play03:57

real time streaming all the time there

play03:59

are other factors which determines that

play04:01

what will be the frequency of extraction

play04:03

and ingestion of the data

play04:05

now moving on to the next topic which is

play04:08

no complex logic so in extraction phase

play04:12

which typically happens in the staging

play04:14

area we do not apply any complex logic

play04:17

so simple logic like if you have date of

play04:20

birth from there if you want to

play04:22

calculate the age of the customer maybe

play04:24

that kind of logic you can implement in

play04:27

in this staging area

play04:29

which comes in the extraction phase

play04:31

where you can apply very very basic

play04:33

logic like determining the age so no

play04:36

complex logic in the extraction phase

play04:38

next point is data format consistency so

play04:41

what i mean by that is you may have

play04:43

multiple sources sending you data and

play04:45

most of the time you may feel that there

play04:48

is a relationship between data sent by

play04:51

different sources so here say example

play04:54

i have a gender column and source one is

play04:57

sending me male female and others right

play05:00

source two might be sending the same

play05:02

information which is gender but they may

play05:04

be using the convention mfo

play05:07

however source three could be sending it

play05:09

like 0 1 and 2.

play05:11

now i cannot ingest all these data as is

play05:14

in my data warehouse right i have to

play05:16

have a data format consistency so that

play05:19

everywhere the same data is represented

play05:22

in the same manner so i may choose in my

play05:25

target that i'll represent for gender

play05:27

column whenever any source is sending me

play05:29

gender values

play05:31

i will apply some logic and i will keep

play05:33

it as mfo the other example could be a

play05:37

it's a very basic example where you

play05:39

apply some consistency into your date

play05:41

formats so source one may be sending you

play05:44

data in this format yyyy hyphen mm

play05:46

hyphen dd similarly source two could be

play05:49

sending data in their own format and

play05:51

source 3 and source 4. so now in your

play05:53

target you can decide that okay for any

play05:56

incoming date column by default i'll

play05:58

have this format so you can apply some

play06:00

basic

play06:01

data format transformations on your

play06:04

incoming data in this extraction phase

play06:06

the next point is the data quality rules

play06:08

so in the extraction phase you can also

play06:10

apply some data quality rules

play06:12

like say your business has started from

play06:14

2015 onward so any sales you have made

play06:18

you know that it should be after 2015

play06:21

right it should not be 2014 or before

play06:24

that because your business was not even

play06:26

in the existence at that time so you can

play06:28

apply this data quality rule that while

play06:30

ingesting any data the source should be

play06:33

sending you data from 2015 onwards so if

play06:36

it is not following this data quality

play06:38

rule you can either push it to an error

play06:40

table so that you can report it to the

play06:42

source or you can simply ignore such

play06:44

records

play06:46

so that is one example the other example

play06:48

could be that there are sometimes

play06:49

description columns and we never use

play06:52

this description column in our warehouse

play06:54

data warehouse for any business or

play06:56

analytical purpose we are just storing

play06:59

it so in that case probably just to save

play07:01

some storage space right and speed up

play07:03

the process you may restrict it to the

play07:06

first 500 characters only so you can

play07:08

apply that kind of rule also say if you

play07:11

if you are getting a value for column

play07:12

one

play07:13

then column two should also have a value

play07:16

it cannot be null so that data quality

play07:18

rules also you can apply so for example

play07:22

if the source is sending you some data

play07:24

and if the source has sales date in it

play07:27

then probably you can check for the

play07:28

sales id column also that if sales has

play07:31

been made and you are getting a sales

play07:33

date value from the source you should

play07:35

also get the sales id for that else in

play07:38

that case it's an error record and you

play07:40

can move it to their tables these are

play07:43

some of the basic data quality rules

play07:44

also you can apply in this phase

play07:47

generally the tables who are involved in

play07:49

the extraction phase typical the staging

play07:51

tables are truncate and load so today

play07:54

file will come you will truncate the

play07:55

table you will load the today file you

play07:58

will process the data tomorrow when the

play08:00

new file will come you will delete the

play08:01

previous table completely and you will

play08:04

load the fresh file and then you will

play08:05

proceed with your etl pipeline so

play08:07

generally these are truncate and load

play08:10

there it is not supposed to run any

play08:13

business queries so in the extraction

play08:15

phase right the staging tables which are

play08:17

involved you are not supposed to run any

play08:19

business queries on top of that it is

play08:22

strictly for technical purpose only it

play08:24

is primarily built to support your etl

play08:27

pipelines and you are not supposed to

play08:29

run any business queries in fact the

play08:31

business team should not be even given

play08:33

access to the staging area

play08:35

now what can be the popular load

play08:37

strategies right for the extract so it

play08:39

could be a full or it could be a delta

play08:42

so if a table is pretty small say a few

play08:44

hundred or even thousands of rows the

play08:46

source may send you full data every time

play08:49

so this actually reduces the overhead or

play08:51

the operations overhead of maintaining

play08:53

the incremental flow so in that case

play08:56

it's a small table the source may end up

play08:57

sending you full table all the time so

play09:00

in that case you will simply truncate

play09:02

and load your final table

play09:04

the other popular strategy is the delta

play09:06

this is specially very good for the

play09:08

bigger tables where on daily basis you

play09:11

are getting hundreds and thousands of

play09:12

rows and and with time it becomes a very

play09:15

big table so in such cases what happens

play09:18

source generally identify

play09:20

what what are the updated records

play09:22

deleted records or the new records and

play09:25

they send only the changes happening to

play09:27

their table to you and then you have to

play09:29

load this data into your staging area

play09:31

and then you have to flag the records in

play09:33

some cases the source may also send you

play09:35

the flag that i for insert and default

play09:38

delete you for update like that so this

play09:41

makes your life easy but if the source

play09:43

is not sending you any flag then in that

play09:45

case you have to load this data into

play09:47

staging table and compare it with your

play09:49

target table on the basis of primary key

play09:51

to determine whether it's a new record

play09:53

or it's an old record with some updates

play09:56

and you have to apply update on the

play09:57

table

play09:58

so those are two very popular load

play10:00

strategies in the staging areas and the

play10:03

load approach obviously when you are

play10:05

starting it for the first time right you

play10:07

will go for the historical load so in

play10:09

that case you have to bring in all the

play10:11

data till date and then you will load it

play10:12

into your tables and after that you will

play10:14

run the incremental loads on daily basis

play10:16

or whatever is the frequency

play10:18

so that is about the extract phase guys

play10:20

i hope you are clear that when you are

play10:23

working on the extract phase primarily

play10:25

you are getting data from source and you

play10:27

are ingesting data into your staging

play10:29

areas these are some of the very

play10:31

critical points you should be aware of

play10:33

and these are the steps generally taken

play10:35

in the etl pipeline for the extract one

play10:38

now moving on to the transform phase

play10:40

right transform phase what does this

play10:41

mean actually so in transform phase

play10:44

whatever data you have ingested in your

play10:45

staging tables you will apply

play10:48

different

play10:49

data transformations or different data

play10:51

mapping rules on it to make it more

play10:54

meaningful so in this phase we will

play10:56

start converting a raw data into

play10:59

meaningful information

play11:00

and in transform phase you can either

play11:02

have predefined set of steps that for

play11:06

any incoming data you will follow these

play11:08

steps one by one in sequence and at the

play11:11

end of those steps you will have a

play11:13

enriched data

play11:14

which makes more more sense to your

play11:16

business so some of the common

play11:18

transformation steps i have mentioned in

play11:20

this video it is not an exhaustive list

play11:23

but as per my understanding these are

play11:25

the most important transformation steps

play11:27

which should be there

play11:29

typically in all the etl pipeline

play11:32

the very first step is the mapping so

play11:34

mapping basically means source to target

play11:36

mapping so when you're getting data from

play11:38

the source right they may have they may

play11:40

be sending you data for 10 columns and

play11:43

for every 10 incoming 10 columns there

play11:46

will be a mapping for those columns in

play11:47

the target table so it could be like it

play11:50

could be as is mapping where whatever

play11:52

the source is sending you data you just

play11:54

move that data as is into your table

play11:56

without any transformation or it could

play11:59

be a column level mapping that or it

play12:01

could be a new derived columns also so

play12:03

say source is sending you first name and

play12:05

last name and now in your business you

play12:08

want full name of the customer so you

play12:10

may create a new derived column

play12:11

concatenating first name and last name

play12:14

and coming up with the full name

play12:16

similarly you can apply you can rename

play12:18

the columns like source may be sending

play12:20

you emp underscore id for the employee

play12:22

id column and in in your case you want

play12:25

the full name so you can change it to

play12:27

employee id so you can rename the

play12:28

columns

play12:29

so that's a typical example of mapping

play12:31

transformations you can enrich the data

play12:34

so what will happen sometimes the source

play12:36

will send you some data which may not be

play12:38

very meaningful

play12:40

from your business perspective so in

play12:42

that case you can do a quick look up

play12:45

into some other tables and you can add

play12:47

some more information to it so that the

play12:50

output is more meaningful for your

play12:52

reports right so one example could be

play12:55

the source could be sending you zip code

play12:57

but the promotions your marketing team

play12:59

is running is at the city level so what

play13:02

you may want to do is rather than just

play13:03

storing the zip code you may want to

play13:06

convert that zip code into a city level

play13:08

by doing a lookup into the table and

play13:10

then storing city value into your tables

play13:12

right similarly the third type of

play13:15

transformation is the joint where you

play13:17

have data into multiple tables and then

play13:19

you

play13:20

join those multiple tables and you load

play13:22

a single table by joining these tables

play13:25

so this makes more sense because

play13:26

sometimes it happens that

play13:28

source are sending you data source may

play13:31

have different

play13:32

tables at their own

play13:34

environment also and they are extracting

play13:36

it one by one and they are just sending

play13:38

you data but for you it makes more sense

play13:41

to club those data into a single table

play13:43

and then use it for reporting so you may

play13:45

go for the join transformations

play13:48

similarly you can apply a filter

play13:50

transformation also so say uh it could

play13:53

your source could be sending you a

play13:55

global data like it it spread across

play13:57

multiple countries and the source is

play13:59

sending you data for all the countries

play14:01

in one file now you are going to

play14:04

support a promotion or a marketing

play14:06

scheme which is applicable only for

play14:08

north america so in that case if you're

play14:10

creating a data mart or if you're

play14:12

creating if you are bringing data

play14:14

specifically for your north america

play14:16

continent you may want to apply a filter

play14:18

in incoming data so that you select only

play14:21

the north america data and then you move

play14:23

it to your tables right so filter is

play14:26

another very important transformation

play14:28

step

play14:29

during transform phase the other is a

play14:32

very common problem actually but i am

play14:35

pointing it here as a step because if

play14:37

required right if you are frequently

play14:38

facing this problem which is of remove

play14:40

duplicates then probably you should add

play14:43

this as a typical step for all incoming

play14:45

data so you should your ideally your etl

play14:49

process should handle duplicate records

play14:51

although it's not mandatory for you but

play14:54

i think it is

play14:55

a good practice to have this check now

play14:58

how duplicates can end up in your

play15:00

environment

play15:01

one very common reason is the source has

play15:03

sent you the same file again right so

play15:06

like today is say 20th

play15:09

and

play15:10

source has sent you a 19th file again

play15:13

right so you will process the same file

play15:15

assuming that it is of 20th but when you

play15:18

process the same data and if your etl

play15:20

pipelines are not properly built you may

play15:22

process the same data again so if you

play15:24

consume the same data again you will end

play15:26

up having a duplicate records in your

play15:28

tables right which may lead to data

play15:31

quality issues

play15:32

uh the other reason could be that you

play15:35

might have run the same job again so for

play15:37

19th you might have run the same job

play15:39

again and now you have duplicate data in

play15:42

your tables so your etl pipeline should

play15:44

be designed in a way that you should be

play15:46

able to handle your duplicate records

play15:49

now aggregation when you're building

play15:50

your data warehouse there will be times

play15:52

when you are calculating the measures

play15:55

and these measures actually support your

play15:57

businesses that okay a business may want

play16:00

to know that how many sales have

play16:01

happened in a particular store in a

play16:03

given quarter you may want to do some

play16:06

aggregation which is equivalent to

play16:08

basically a group by clause in your sql

play16:11

and then you may want to come up with

play16:13

some numbers by adding those coming up

play16:15

with sum or average max and min

play16:18

and you may want to make your data more

play16:20

meaningful right so aggregation is one

play16:23

another very important transformation

play16:26

which you must be very well aware of

play16:28

right so this is about the transform i

play16:30

hope

play16:31

i have covered all the important

play16:33

critical steps which happens during the

play16:35

transformation phase

play16:37

if you feel i missed any

play16:39

feel free to leave a comment below and

play16:41

probably i'll try to add that as well

play16:44

now moving on to the last step which is

play16:45

the load phase so in the load phase what

play16:47

happens is you have the data you have

play16:50

extracted data from source ingested into

play16:52

your staging tables

play16:54

you have read data from your staging

play16:55

tables apply a lot of transformation

play16:58

steps into it and load it into

play16:59

intermediate tables now what happens is

play17:02

once you have data into your

play17:04

intermediate tables these intermediate

play17:05

tables typically what happens is in many

play17:08

scenarios enterprises don't want to have

play17:11

a separate data layers every time so

play17:14

these intermediate

play17:16

steps right many times it happens it

play17:18

will happen in the temporary table or

play17:20

the volatile table which directly loads

play17:22

the dimensions and fact right so that's

play17:25

this is also a possibility now coming to

play17:27

the load phase basically you will be

play17:29

loading your dimension tables you will

play17:31

be loading your fact tables you will be

play17:33

loading your edw tables and you will be

play17:35

creating data marts out of your data

play17:37

warehouse as well

play17:38

so what are dimension tables so this is

play17:41

not a specific dimension table video but

play17:43

i'll tell you the key points as per me

play17:46

that when you're loading a dimension

play17:47

table these are the key points you

play17:49

should consider

play17:50

so one is obviously the dimension table

play17:53

typically should have a primary key and

play17:55

which which are generally

play17:57

created using auto increment column or

play18:00

the identity columns in some rdbms

play18:02

solutions or you can call it as a

play18:04

surrogate key so these are meaningless

play18:06

these are just like the number one two

play18:08

three four five

play18:10

and you basically it is used to create

play18:13

uniqueness for each record right however

play18:16

business wise it does not make much

play18:18

sense because these are purely technical

play18:20

columns and these are like a sequence

play18:22

now a dimension table must also have a

play18:24

functional identifier or a natural key

play18:26

so to explain it more let me use some

play18:29

examples right so it does not mean that

play18:31

a functional identifier is always

play18:34

one row in each dimension table

play18:36

depending on the load type of the

play18:38

dimension right like in scd2 where you

play18:40

will maintain the history

play18:42

the primary key or the auto increment or

play18:45

the surrogate key will always be unique

play18:47

but the functional identifiers may

play18:49

repeat right so say

play18:50

employee who have moved his city from

play18:53

bangalore to hyderabad the employee will

play18:56

remain the same in your employee table

play18:57

right though that information will

play18:59

remain the same

play19:00

like his date of joining

play19:02

right his name

play19:05

and his pan card all these details will

play19:07

remain the same only his

play19:09

city office city will change from

play19:10

bangalore to hyderabad so the functional

play19:13

identifier may exist multiple times

play19:16

however the surrogate key associated

play19:18

with those function identifier will

play19:19

still be unique

play19:21

right and this is unique for each row

play19:24

but it may have multiple occurrences

play19:26

depending on whether you are maintaining

play19:28

scd2 table or a scd1 table

play19:31

the other thing dimension tables has the

play19:33

attributes attributes basically tells

play19:36

you that what information this dimension

play19:38

actually is storing so it's like for

play19:41

employee is the dimension table

play19:43

if you have an employee dimension table

play19:44

it may have employee name employee cd

play19:47

employee date of birth so these are the

play19:48

basically the attributes of your

play19:51

employees

play19:53

as i said earlier dimension has the load

play19:56

strategy it could be scd 1 2 3 or the

play19:59

hybrid scd model which is used to load

play20:02

any dimension table whether you want to

play20:04

maintain history or you don't want to

play20:06

maintain history

play20:07

and if you if your dimension is very

play20:10

small and kind of a static not much

play20:12

changes are coming into that dimension

play20:13

table you may even want to go for a

play20:15

truncate and load option for that

play20:17

dimension table

play20:19

and granularity whenever you're

play20:21

designing a data model you have to make

play20:22

sure that your dimension granularity is

play20:24

properly defined

play20:26

because if that is not well defined then

play20:29

in that case reporting when you are

play20:30

creating a reporting and visualization

play20:33

on top of those dimension tables you may

play20:36

not be able to reflect data in a proper

play20:38

manner to your business right so grain

play20:41

is one thing you should consider while

play20:43

loading any dimension tables

play20:46

next two dimension is the fact tables so

play20:49

fact tables

play20:50

if you know that it should have a

play20:52

primary key which is similar to the

play20:55

surrogate keys which we see in the

play20:57

dimension so even the fact tables have

play20:59

the primary key or the auto increment

play21:01

identity columns or the surrogate keys

play21:04

however in addition to that they also

play21:06

have a foreign key so foreign key

play21:08

basically are the the primary keys which

play21:10

are sourced from the source from the

play21:12

source dimension table into the facts so

play21:15

typically the dimension tables are

play21:16

always loaded first

play21:18

and the fact table source the primary

play21:20

key of dimension table into it

play21:23

as and reference them as a foreign key

play21:26

so there is a foreign key relationship

play21:28

between primary key foreign key

play21:29

relationship between dimensions and fact

play21:32

and fact table generally

play21:34

have the measures like total sales total

play21:36

revenue where you will apply some group

play21:38

by you will apply some aggregation

play21:40

functions like sum max min to calculate

play21:43

some measures that being said all facts

play21:45

does not have the additive property

play21:47

there are some semi-additive and

play21:49

factorious facts also and non-additive

play21:52

facts also but in general fact tables

play21:54

generally have uh the measures and we do

play21:58

aggregation those measures could be

play22:00

additive uh across all the attributes or

play22:03

across some of the attributes

play22:06

now that's about fact table

play22:08

next is the edw so this is your main

play22:11

business layer where you are storing all

play22:12

the edw tables so your edw tables has the

play22:16

process data which is very important for

play22:19

your business team to make any decisions

play22:22

so these tables are generally exposed to

play22:24

that business or the reporting team or

play22:26

the bi engineers in your team who will

play22:28

read data from these

play22:30

process data from these tables and will

play22:33

create reports on top of that so edw

play22:35

basically means it's a main data layer

play22:37

which is so which supports the bi

play22:40

and it has all the process data which is

play22:43

kind of end of the etl pipeline so all

play22:46

the process data is also shared with the

play22:48

downstream applications which could be

play22:50

reporting or it could be some other team

play22:53

where you will export data from these

play22:55

tables into flat files and push it into

play22:58

there so now rather than acting as the

play23:01

consumer or the target you will act as a

play23:03

source and you will export that process

play23:06

data and you will push it to some other

play23:08

team which will now start their etl

play23:10

pipeline or their consumption process

play23:13

similar to edw you can have data marts

play23:16

also so sometimes you may want a very

play23:19

subject specific area where you can do

play23:22

your analysis so in that case you can

play23:24

derive data marts from your edw you will

play23:27

create separate tables in data marts and

play23:29

you will source only subject specific

play23:31

data from your edw and create data marts

play23:34

and

play23:35

data marts are also used for reporting

play23:37

and visualization purpose

play23:39

so that's it guys

play23:40

i wanted to cover this in today's video

play23:44

that what is etl extract transform and

play23:48

load and what all the key steps in each

play23:52

phase i hope this video was helpful you

play23:55

if you are a fresher i'm pretty sure you

play23:56

might have learned some new things today

play23:59

and if you are a experienced

play24:01

professional as i said earlier i am i

play24:03

believe that this video was a good

play24:05

refresher for you

play24:07

if you feel i missed any point here feel

play24:10

free to drop a comment and i'll try to

play24:12

correct it thank you very much for

play24:14

watching the video thanks


Related Tags
ETL, Data Warehousing, Extract, Transform, Load, SQL, Data Quality, Data Governance, Real-Time Streaming, Batch Processing, Data Mapping, Data Enrichment, Dimension Tables, Fact Tables, Data Marts, Business Intelligence, Data Transformation