Data Warehouse System Processes | Lecture #5 | Data Warehouse Tutorial for beginners

AmpCode
20 Mar 2021 | 08:03

Summary

TLDR: This tutorial video delves into the processes involved in building a data warehouse. It covers the essential steps including data extraction and loading, cleaning and transforming, backup and archiving, and query management. The video emphasizes the importance of these processes for optimizing query performance, managing data efficiently, and ensuring data integrity. It also highlights the need for data warehouses to adapt as businesses grow, making it a valuable resource for anyone interested in data warehousing solutions.

Takeaways

  • 📚 The video introduces various processes involved in a data warehouse and their significance with examples.
  • 🔄 The first process discussed is 'Extract and Load', which involves taking data from source systems and loading it into the data warehouse after reconstruction.
  • 🕒 The 'Controlling Process' is crucial for determining when to start data extraction and ensuring data consistency.
  • 🔍 'Cleaning and Transforming' is the second major process, which includes making the data consistent and structuring it to improve query performance.
  • 📊 'Partitioning' the data is part of the cleaning and transforming process, which optimizes hardware performance and simplifies data warehouse management.
  • 🔢 'Aggregation' is performed to speed up common queries by analyzing subsets or aggregations of detailed data.
  • 🛡️ 'Backup and Archiving' is essential for data recovery in case of data loss, software failure, or hardware failure, and for keeping old data accessible for restoration.
  • 🔎 The 'Query Management Process' is vital for managing queries, speeding up their execution, directing them to effective data sources, and monitoring query profiles.
  • 📈 Query management also helps in determining which aggregations to generate based on the information from query profiles, thus improving efficiency.
  • 🌐 The tutorial aims to show how to build data warehousing solutions on open system technologies like UNIX and relational databases.
  • 🎥 The video is a tutorial that provides an introductory overview of the processes involved in a data warehouse, including extraction, cleaning, backup, archiving, and query management.

Q & A

  • What are the main processes involved in a data warehouse?

    -The main processes involved in a data warehouse are extract and load, cleaning and transforming data, backup and archiving data, and query management.

  • What does the extract and load process involve?

    -The extract and load process involves taking data from source systems and loading it into the data warehouse, ensuring the data is reconstructed in a way that is suitable for the data warehouse to store.

  • Why is it important to control the process during data extraction?

    -Controlling the process is important to determine when to start the data extraction and to check the consistency of the data. It ensures that the tools, logic modules, and programs are executed in the correct sequence and at the right time.

  • What should be considered when initiating the data extraction?

    -When initiating the data extraction, it is important to ensure that the data is in a consistent state and represents a single, consistent version of the information to the user.

  • What is the purpose of cleaning and transforming the data in a data warehouse?

    -Cleaning and transforming the data helps to speed up queries by making the data consistent and converting the source data into a structure that increases query performance and decreases operational cost.

  • What is partitioning in the context of data warehousing?

    -Partitioning is the process of dividing each fact table into multiple separate partitions to optimize hardware performance and simplify the management of the data warehouse.

  • Why is aggregation important in data warehousing?

    -Aggregation is important to speed up common queries by relying on the fact that most common queries will analyze a subset or an aggregation of the detailed data.

  • What is the significance of backup and archiving in a data warehouse?

    -Backup and archiving are crucial for recovering data in the event of data loss, software failure, or hardware failure. Archiving also allows for the removal of old data in a format that can be quickly restored when required.

  • What functions does the query management process perform?

    -The query management process manages and speeds up the execution time of queries, directs queries to their most effective data sources, ensures optimal use of system resources, and monitors actual query profiles.

  • How does the query management process help in improving the efficiency of the data warehouse?

    -The query management process improves efficiency by lowering operational costs and ensuring that all system resources are used in the most effective way, as well as by monitoring query profiles to determine which aggregations to generate.

  • What is the role of open system technologies in building data warehousing solutions?

    -Open system technologies like UNIX and relational databases provide the foundation for building scalable and flexible data warehousing solutions that can evolve as the business grows.

Outlines

00:00

📊 Introduction to Data Warehouse Processes

This paragraph introduces the topic of the video, which is the various processes involved in a data warehouse. It contrasts the fixed operations and techniques used in operational databases with the evolving nature of data warehouses, which must adapt as business grows. The tutorial will cover building data warehousing solutions using open system technologies and relational databases. Four major processes are highlighted: data extraction and loading, data cleaning and transforming, data backup and archiving, and query management. The paragraph emphasizes the importance of these processes in the context of decision support systems and the need for data warehouses to evolve over time.

05:01

🔄 Detailed Explanation of Data Warehouse Processes

The second paragraph delves into the specifics of each major process in a data warehouse. It starts with the extraction and loading process, explaining the importance of data consistency and the need to reconstruct information for the data warehouse. The paragraph then discusses the cleaning and transforming process, which includes steps like making data consistent, structuring it to increase query performance, and partitioning data to optimize hardware performance. The backup and archiving process is crucial for data recovery in case of failures, and it involves keeping regular backups and archiving old data for quick restoration. Lastly, the query management process is described, which includes managing queries, speeding up execution time, directing queries to effective data sources, ensuring efficient use of system resources, and monitoring query profiles. The paragraph concludes by summarizing the processes discussed and encouraging viewers to subscribe for updates.

Keywords

💡Data Warehouse

A data warehouse is a large, centralized repository of data designed for query and analysis rather than for transaction processing. It relates to the video's theme as it is the main subject being discussed. In the script, the data warehouse is described as evolving with the business and requiring different processes compared to operational databases.

💡Operational Database

An operational database is a database that is used for the day-to-day operations of a business. It is mentioned in the script to contrast with a data warehouse, highlighting that operational databases use techniques suitable for transaction processing but not for decision support systems like a data warehouse.

💡Extract and Load

Extract and load refers to the process of extracting data from external sources and loading it into a data warehouse; together with the transformation step it makes up what is commonly abbreviated as ETL (extract, transform, load). In the script, this process is the first major step in the data warehouse operations, where data is taken from source systems and prepared for storage in the data warehouse.
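
The video describes loading extracted data into a temporary data store and running consistency checks only once every source has been loaded. The following is a minimal sketch of that idea, not the video's own implementation. The source names, table names (`staging`, `warehouse_fact`), sample rows, and the specific check (no missing dates or amounts) are all assumptions made for illustration.

```python
import sqlite3

# Hypothetical extracts; in practice these would come from the source systems.
SOURCES = {
    "general_ledger": [("2021-03-01", "GL", 1200.0), ("2021-03-02", "GL", -300.0)],
    "accounts_payable": [("2021-03-01", "AP", 450.0)],
}


def load_staging(conn, sources):
    """Load every source extract into a temporary staging table."""
    conn.execute("CREATE TABLE staging (entry_date TEXT, source TEXT, amount REAL)")
    loaded = set()
    for name, rows in sources.items():
        conn.executemany("INSERT INTO staging VALUES (?, ?, ?)", rows)
        loaded.add(name)
    return loaded


def staging_is_consistent(conn, loaded, expected):
    """Run the checks only when all expected sources have been loaded."""
    if loaded != set(expected):
        return False
    bad_rows = conn.execute(
        "SELECT COUNT(*) FROM staging WHERE entry_date IS NULL OR amount IS NULL"
    ).fetchone()[0]
    return bad_rows == 0


def publish(conn):
    """Copy the checked staging rows into the (assumed) warehouse table."""
    conn.execute("CREATE TABLE warehouse_fact AS SELECT * FROM staging")
    return conn.execute("SELECT COUNT(*) FROM warehouse_fact").fetchone()[0]


conn = sqlite3.connect(":memory:")
loaded = load_staging(conn, SOURCES)
row_count = publish(conn) if staging_is_consistent(conn, loaded, SOURCES.keys()) else 0
```

Keeping the check separate from the load makes it easy for a controlling process to decide when the warehouse table may be refreshed.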

💡Cleaning and Transforming

Cleaning and transforming data involves making the extracted data consistent and suitable for the data warehouse's structure. The script explains that this process is crucial for speeding up queries and reducing operational costs by structuring the data to support performance requirements.
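
As a rough illustration of "making the data consistent", the sketch below normalizes a raw record into one agreed structure. The field names, the `COUNTRY_CODES` mapping, and the cleaning rules are hypothetical; the video does not specify any of them.

```python
# Hypothetical mapping from the spellings seen in source systems to one code.
COUNTRY_CODES = {
    "usa": "US",
    "united states": "US",
    "u.s.": "US",
    "germany": "DE",
}


def clean_record(raw):
    """Return a record with trimmed names, one country code, and a numeric amount."""
    country_key = raw["country"].strip().lower()
    return {
        "customer": raw["customer"].strip().title(),
        "country": COUNTRY_CODES.get(country_key, "UNKNOWN"),
        "amount": round(float(raw["amount"]), 2),
    }


cleaned = clean_record({"customer": "  acme corp ", "country": " USA", "amount": "12.50"})
```

Once every source uses the same spelling and types, queries can group and filter on these fields without per-source special cases.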

💡Partitioning

Partitioning in the context of a data warehouse is the process of dividing a large fact table into multiple separate partitions. The script mentions partitioning as a method to optimize hardware performance and simplify data warehouse management, making queries faster and more efficient.
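
To make the idea concrete, here is a small sketch that splits a fact table into one partition per month, so a month-scoped query scans only the rows it needs. Partitioning by month, the row layout, and the sample data are assumptions; real warehouses usually rely on the database's own partitioning features.

```python
from collections import defaultdict

# Hypothetical fact rows: (sale_date, product, amount).
FACT_ROWS = [
    ("2021-01-15", "widget", 100.0),
    ("2021-01-20", "gadget", 250.0),
    ("2021-02-03", "widget", 80.0),
    ("2021-03-11", "gadget", 40.0),
]


def partition_by_month(rows):
    """Split one large fact table into per-month partitions keyed by 'YYYY-MM'."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[0][:7]].append(row)
    return dict(partitions)


def monthly_total(partitions, month):
    """Answer a month-scoped query by scanning only that month's partition."""
    return sum(amount for _, _, amount in partitions.get(month, []))


partitions = partition_by_month(FACT_ROWS)
january_total = monthly_total(partitions, "2021-01")
```

The same layout also simplifies management: an old month can be archived or dropped as a unit without touching the other partitions.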

💡Aggregation

Aggregation in data warehousing is the process of summarizing detailed data to speed up common queries. The script describes aggregation as relying on the fact that most queries analyze a subset or summary of the detailed data, thus improving query performance.
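
A minimal sketch of a precomputed aggregation follows. The grouping by month and region and the sample rows are illustrative assumptions; the point is only that a common query can read a small summary instead of scanning the detailed data.

```python
from collections import defaultdict

# Hypothetical detail rows: (month, region, amount).
DETAIL_ROWS = [
    ("2021-01", "north", 100.0),
    ("2021-01", "north", 50.0),
    ("2021-01", "south", 70.0),
    ("2021-02", "north", 30.0),
]


def build_aggregate(rows):
    """Precompute total sales per (month, region) from the detail rows."""
    totals = defaultdict(float)
    for month, region, amount in rows:
        totals[(month, region)] += amount
    return dict(totals)


sales_by_month_region = build_aggregate(DETAIL_ROWS)
# A common query becomes a single lookup in the summary.
north_january = sales_by_month_region[("2021-01", "north")]
```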

💡Backup and Archiving

Backup and archiving are processes that ensure data recovery in case of data loss or system failure. The script explains that regular backups are necessary, and archiving involves storing old data in a way that allows for quick restoration when needed.
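
The sketch below illustrates archiving under simple assumptions: rows older than a cutoff month are moved out of the online data into a CSV file, and one month is restored later for a month-on-month comparison. The CSV format, file name, cutoff, and sample figures are hypothetical choices, not something the video prescribes.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical monthly sales totals: (month, total).
ONLINE_ROWS = [
    ("2020-03", 900.0),
    ("2020-04", 950.0),
    ("2021-03", 1100.0),
    ("2021-04", 1200.0),
]


def archive_old_rows(rows, cutoff_month, archive_path):
    """Write rows older than cutoff_month to a CSV archive; return rows kept online."""
    old = [r for r in rows if r[0] < cutoff_month]
    keep = [r for r in rows if r[0] >= cutoff_month]
    with open(archive_path, "w", newline="") as f:
        csv.writer(f).writerows(old)
    return keep


def restore_month(archive_path, month):
    """Read one month's rows back from the archive."""
    with open(archive_path, newline="") as f:
        return [(m, float(total)) for m, total in csv.reader(f) if m == month]


archive_file = Path(tempfile.mkdtemp()) / "sales_archive.csv"
online = archive_old_rows(ONLINE_ROWS, "2021-01", archive_file)
last_year = restore_month(archive_file, "2020-03")
# Compare March of this year with March of last year.
growth = online[0][1] - last_year[0][1]
```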

💡Query Management

Query management is the process of managing, optimizing, and directing queries to their most effective data sources. The script discusses this as a critical process for speeding up query execution time, ensuring efficient use of system resources, and monitoring query profiles.
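
One way to picture query routing and profile monitoring is the sketch below. It sends a query to a precomputed aggregate when one matches the requested grouping, falls back to the detail data otherwise, and counts how often each grouping is requested so that frequently used groupings can be proposed as new aggregations. The function names, the `AVAILABLE_AGGREGATES` set, and the `min_hits` threshold are assumptions for illustration.

```python
from collections import Counter

# Hypothetical set of groupings for which an aggregate already exists.
AVAILABLE_AGGREGATES = {("month", "region")}

# Query profile: how often each grouping has been requested.
query_profile = Counter()


def route_query(group_by):
    """Record the query in the profile and pick the most effective data source."""
    key = tuple(sorted(group_by))
    query_profile[key] += 1
    return "aggregate" if key in AVAILABLE_AGGREGATES else "detail"


def suggest_aggregates(profile, min_hits):
    """Propose groupings that are queried often but have no aggregate yet."""
    return sorted(
        key
        for key, hits in profile.items()
        if hits >= min_hits and key not in AVAILABLE_AGGREGATES
    )


routes = [
    route_query(["region", "month"]),
    route_query(["product"]),
    route_query(["product"]),
]
suggestions = suggest_aggregates(query_profile, min_hits=2)
```

In this sketch the suggestions would be handed to whatever process builds aggregations, which mirrors the video's point that query profiles inform which aggregations to generate.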

💡Open System Technologies

Open system technologies refer to systems that are based on open standards and can be used with a variety of software and hardware. The script mentions building data warehousing solutions on open system technologies like UNIX and relational databases, emphasizing flexibility and interoperability.

💡Normalization

Normalization in databases is the process of organizing data to minimize redundancy and dependency. The script refers to normalization as a technique used in operational databases to keep tables small and efficient, but notes that it may not be suitable for data warehouses.

💡Consistency

Consistency in the context of data warehousing refers to the state where the data represents a single, accurate version of the truth. The script discusses the importance of data being in a consistent state when extracted for the data warehouse, ensuring reliability for users and query accuracy.

Highlights

Introduction to various processes involved in a data warehouse.

Explanation of the difference between operational databases and data warehouses in terms of data handling and query execution.

The significance of data warehouse processes for decision support systems that require flexibility for future queries.

Building data warehousing solutions on open system technologies like UNIX and relational databases.

Identification of four major processes in a data warehouse: extract and load, clean and transform, backup and archive, and manage queries.

Details on the extract and load process, emphasizing the importance of data reconstruction for the data warehouse.

The role of the controlling process in determining when to start data extraction and ensuring data consistency.

Importance of data extraction consistency for representing a single version of information to the user.

The process of loading data into a temporary data store for cleaning and consistency checks.

Steps involved in cleaning and transforming data to improve query performance and reduce operational costs.

The transformation of source data into a structured format to support performance requirements.

Partitioning data to optimize hardware performance and simplify data warehouse management.

Aggregation of data to speed up common queries by analyzing subsets or aggregations of detailed data.

The necessity of backup and archiving data for recovery in case of data loss or system failure.

Archiving data in a format that allows quick restoration for scenarios like month-on-month sales analysis.

Query management process functions, including managing queries, speeding up execution time, and directing queries to effective data sources.

Importance of query management in ensuring system resources are used effectively, lowering operational costs, and improving process efficiency.

Monitoring query profiles to inform warehouse management on which aggregations to generate.

Summary of the introductory part of data warehouse processes, including extract and load, clean and transform, backup and archive, and query management.

Transcripts

00:03

[Music]

00:04

Hello everyone, welcome to my channel. In this tutorial we are going to see the various processes involved in a data warehouse. In the previous lecture we saw the delivery process of a data warehouse. In this lecture we are going to see the different processes involved in a data warehouse and their significance, with some simple examples. So, without further ado, let's get into it.

00:30

With an operational database, we have a fixed number of operations to apply, and we have well-defined techniques, such as using normalized data and keeping the tables small. These techniques are suitable for delivering a solution. But in the case of decision support systems, we do not know which queries and operations will need to be executed in the future. Therefore the techniques applied to the operational database are not suitable for the data warehouse, as the data warehouse should evolve as the business grows. In this tutorial we will discuss how to build data warehousing solutions on open system technologies like UNIX and relational databases.

01:14

In a data warehouse there are four major processes. The first is extracting and loading the data. The next is cleaning and transforming the data, then backing up and archiving the data, and finally managing the queries and directing them to the appropriate data sources. These are the major processes, which we'll see in detail.

01:34

The first process is extracting and loading the data. Data extraction takes the data from the source systems, and data load takes the extracted data and loads it into the data warehouse. You have to remember one thing: before the data is loaded into the data warehouse, the information extracted from the external sources must be reconstructed into a form that the data warehouse can store. Here we have to consider three points.

02:10

The first one is controlling the process. Controlling the process involves determining when to start the data extraction and running the consistency checks on the data. The controlling process ensures that the tools, the logic modules, and the programs are executed in the correct sequence and at the correct time. It is a very important process.

02:33

The next one is when to initiate the extract. The data needs to be in a consistent state when it is extracted; that is, the data warehouse should represent a single, consistent version of the information to the user. For example, in a financial data warehouse that stores financial data such as general ledger, accounts payable, and accounts receivable, it would be illogical to merge the consolidated data while the quarterly reports are being generated. This would mean that the latest data is not refreshed as per the user's requirements.

03:14

The next point you have to consider is loading the data. After the data is extracted, it is loaded into a temporary data store, where it is cleaned up and made consistent. The consistency checks are executed only when all the data sources have been loaded into the temporary data store. So this is our first process, extract and load.

03:39

Our next process is cleaning and transforming the data. Once the data is extracted and loaded into the temporary data store, it is time to clean and transform it. Here you can see the list of steps involved in the cleaning and transforming stage. The first one is to clean and transform the loaded data into a structure. Cleaning and transforming helps to speed up the queries, and it can be done by making the data consistent. Transforming involves converting the source data into a structure. Structuring the data increases the query performance and decreases the operational cost. The data contained in a data warehouse must be transformed to support the performance requirements and to control the ongoing operational cost, so this step is crucial.

04:33

The next step is partitioning the data. It optimizes the hardware performance and simplifies the management of the data warehouse. Here we partition each fact table into multiple separate partitions. A huge table that contains billions of records is partitioned so that queries take a shorter time, which allows quick analysis and lowers the operational cost. It also optimizes the hardware performance and avoids long-running queries on the platform.

05:08

The next step is aggregation. Aggregation is required to speed up the common queries. It relies on the fact that most common queries will analyze a subset or an aggregation of the detailed data.

05:22

The next process involved in data warehousing is backing up and archiving the data. It is also a very important process. In order to recover the data in the event of data loss, software failure, or hardware failure, it is necessary to keep regular backups. Archiving involves removing the old data from the system in a format that allows it to be quickly restored whenever required. For example, in a sales analysis data warehouse for XYZ company, it may be required to keep the data for at least four years, with the latest one year of data kept online. In such scenarios there is often a requirement to do a month-on-month comparison between this year and last year. In this case we require some data to be restored from the archive.

06:18

The last process is the query management process. This process performs the following functions. First, it manages the queries. Next, it helps to speed up the execution time of the queries. This is important because, when you need quick analysis over the stored data, the execution time of the queries should be as low as possible. Next, it directs the queries to their most effective data sources. Next, it ensures that all the system resources are used in the most effective way, which lowers the operational cost and improves the efficiency of the process. Last, it monitors the actual query profiles.

07:06

These functions make up the query management process. The information generated by this process is used by the warehouse management process to determine which aggregations to generate. This process does not generally operate during the regular load of transformed data into the data warehouse.

07:27

So these are all the processes involved in a data warehouse, which we have discussed in brief with some simple examples. Here we have seen the introductory part of the processes involved in a data warehouse: the extract and load process, the clean and transform process, backing up and archiving the data, and the query management process. If you like this video, please consider subscribing and ring the notification bell to get the latest updates. Thanks for watching.
