Data Lakehouse: An Introduction

Bryan Cafferky
30 Nov 2022 · 25:00

Summary

TL;DR: In this video, Bryan introduces the concept of the Data Lakehouse, a convergence of data lakes and data warehouses. He traces the evolution from the data lake, which often degenerated into a "data swamp" for lack of governance, to the structured environment of the traditional data warehouse. Bryan explores the challenges of implementing data warehouse features on a distributed data platform, highlighting technologies like Delta Lake that add transactional support and ACID properties. He also covers the architectural differences between relational databases and data lakes, and the importance of features like schema evolution and metadata governance in the Data Lakehouse.

Takeaways

  • 📚 The Data Lakehouse is an emerging concept that combines the best of data lakes and data warehouses, aiming to provide a unified platform for data storage and analytics.
  • 🔄 The Data Lakehouse introduces transactional support to data lakes with technologies like Delta Lake, which adds transaction logs to Parquet files, enabling ACID properties.
  • 🚀 The evolution from data lake to data swamp highlighted the lack of data governance, leading to the need for a more structured approach to handle big data effectively.
  • 🛠 Relational databases offer robust features like structured query language (SQL), ACID transactions, and various constraints that ensure data integrity and security.
  • 🔑 Traditional data warehouses are built on top of relational databases and are optimized for reporting and decision-making through data aggregation and fast querying.
  • 🔍 The architectural differences between data lakes and relational databases present challenges in implementing data warehouse features on a distributed data platform.
  • 🔄 Schema evolution is a feature of the Data Lakehouse that allows for dynamic changes to the data schema without disrupting existing systems, accommodating the fast pace of data changes.
  • 🔒 Security in the Data Lakehouse context relies on cloud platform security measures, as opposed to the encapsulated security features of traditional relational databases.
  • 🔄 The Data Lakehouse aims to support a wide range of data types beyond structured data, including images, videos, and other multimedia formats, which is essential for modern data analytics.
  • 🤖 Support for machine learning and AI is a significant aspect of the Data Lakehouse, expanding its capabilities beyond traditional data warehousing to include advanced analytics.
  • 🔗 The Data Lakehouse concept is continuously evolving, with features like referential integrity and other constraints still in development to enhance data management and governance.

Q & A

  • What is the main topic of the video?

    -The main topic of the video is the concept of the data lake house, its introduction, and how it combines elements of data lakes and traditional data warehouses.

  • What was the initial problem with data lakes?

    -The initial problem with data lakes was the lack of data governance, which led to a situation where a lot of data was stored without any structure or thought about how it should be used, eventually turning into a 'data swamp'.

  • What are the core features of relational databases that support a data warehouse?

    -The core features of relational databases that support a data warehouse include support for structured query language (SQL), built on set theory, ACID transactions for data integrity, constraints for data validity, and transaction logs for recoverability.

  • What is the difference between OLTP and data warehouse workloads in relational databases?

    -OLTP (Online Transaction Processing) workloads focus on transactional processing systems for daily operations, while data warehouse workloads focus on reporting and decision-making with an emphasis on querying large datasets for aggregation.

  • What is the significance of ACID in relational databases?

    -ACID in relational databases stands for Atomicity, Consistency, Isolation, and Durability. It ensures that database transactions are processed reliably, maintaining data integrity and allowing for complete or no changes at all.

  • What is the Delta Lake and how does it relate to data lake houses?

    -Delta Lake is a technology that adds transactional support to data lakes, providing ACID properties. It is based on the Parquet file format and includes transaction logging, which is crucial for the development of data lake houses.

  • What challenges arise when trying to implement relational database features in a data lake house environment?

    -Challenges include the distributed nature of data storage in a data lake house, which requires additional overhead and network traffic to perform operations like checking for unique keys or referential integrity across multiple nodes.

  • What is schema evolution and why is it important in data lake houses?

    -Schema evolution is the ability of a system to adapt to changes in the schema, such as adding new columns, without breaking existing processes. It is important in data lake houses to accommodate the fast-paced and dynamic nature of data storage and analysis.

  • How does the data lake house approach differ from traditional data warehouses in terms of security?

    -In data lake houses, security is often managed through the cloud platform's architecture rather than being encapsulated within the database service itself. This means that security measures need to be implemented at the infrastructure level, where the data is stored.

  • What are the new capabilities that data lake houses bring to the table compared to traditional data warehouses?

    -Data lake houses bring capabilities such as support for a variety of file structures, not just structured data, and the ability to handle unstructured data like images, videos, and sounds. They also support machine learning and AI, which are not traditionally part of data warehouse functionalities.

  • What is the final message or conclusion of the video?

    -The final message of the video is that while data lake houses have made significant progress in emulating the functionality of traditional data warehouses, they still face unique challenges due to their architectural differences. However, they offer new capabilities that are essential for modern data processing and analysis.

Outlines

00:00

📚 Introduction to Data Lakehouse Concept

In this introductory video, Bryan Cafferky discusses the concept of the data lakehouse. He emphasizes the importance of understanding the conceptual background before diving into technical details. Bryan introduces the data lakehouse as a blend of data lake and data warehouse, revisiting the traditional data warehouse and the hype around Hadoop roughly a decade ago. He explains how the initial excitement over using Hadoop for massive data processing led to the creation of data lakes, which eventually turned into data swamps due to a lack of governance and data management. The video sets the stage for a deeper exploration of the data lakehouse, highlighting the need for a structured approach to managing large volumes of data.

05:01

🔒 Importance of Data Integrity and Transactions in Relational Databases

Bryan delves into the features of relational databases that are crucial for maintaining data integrity. He discusses the role of Structured Query Language (SQL) and the principles of set theory in organizing data into discrete tables. The video highlights the significance of transactions, which support operations like insert, update, and delete, and ensure data consistency through the ACID properties (Atomicity, Consistency, Isolation, Durability). Bryan also covers the various types of constraints, such as referential integrity, domain constraints, key constraints, and check constraints, that keep stored values valid. He further explains the role of transaction logs in recording changes and enabling data recovery, and the importance of database backups for resilience and recoverability.
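
To make the all-or-nothing behavior and the constraints concrete, here is a minimal sketch using Python's built-in sqlite3 module (the tables and values are invented for illustration; the video shows no code, and any relational database would behave similarly):

```python
import sqlite3

# In-memory database; referential integrity must be switched on in SQLite.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
CREATE TABLE customer (
    customer_key INTEGER PRIMARY KEY,   -- entity integrity: the key cannot be NULL
    name         TEXT NOT NULL
);
CREATE TABLE sales (
    sales_key    INTEGER PRIMARY KEY,
    customer_key INTEGER NOT NULL
                 REFERENCES customer(customer_key),  -- referential integrity
    amount       REAL CHECK (amount > 0),            -- column value (check) constraint
    sale_date    TEXT NOT NULL                       -- null check constraint
);
""")

try:
    with conn:  # one transaction: commit on success, roll back on any error
        conn.execute("INSERT INTO customer VALUES (1, 'Customer A')")
        # Customer 99 does not exist, so this violates referential integrity
        # and the whole transaction is rolled back -- all or nothing.
        conn.execute("INSERT INTO sales VALUES (10, 99, 25.0, '2022-11-30')")
except sqlite3.IntegrityError as err:
    print("Rolled back:", err)

# Neither insert survived; the database is back in its original state.
print(conn.execute("SELECT COUNT(*) FROM customer").fetchone()[0])  # prints 0
```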

10:02

🏦 Understanding OLTP and Data Warehouse Workloads in Relational Databases

This section covers the two primary types of workloads supported by relational databases: transactional processing (OLTP) and data warehousing. Bryan explains that transactional systems are critical for daily operations, handling tasks like sales, accounting, and banking transactions, and that these systems prioritize fast and efficient data maintenance. Data warehouses, by contrast, are designed for reporting, decision-making, and planning, focusing on query performance and data aggregation. Bryan also discusses the architectural differences between the two, such as the need for high availability and fault tolerance in transactional systems, and the emphasis on integration and large-data-set aggregation in data warehouses. He closes with the modeling techniques used in each: entity-relationship modeling for transactional databases and dimensional modeling for data warehouses.
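
The contrast in query shape can be sketched in a few lines of Python; the table and rows below are made up purely for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, customer_key INT,
                    product TEXT, amount REAL, sale_date TEXT);
INSERT INTO sales VALUES (1, 1, 'widget', 10.0, '2022-11-01'),
                         (2, 2, 'widget', 12.5, '2022-11-02'),
                         (3, 1, 'gadget',  7.5, '2022-11-02');
""")

# OLTP shape: touch a single row, keep it fast, wrap it in a transaction.
with conn:
    conn.execute("UPDATE sales SET amount = 11.0 WHERE sale_id = 1")

# Warehouse shape: scan and aggregate many rows to answer a planning question.
query = """
    SELECT product, COUNT(*) AS num_sales, SUM(amount) AS revenue
    FROM sales
    GROUP BY product
    ORDER BY revenue DESC
"""
for product, num_sales, revenue in conn.execute(query):
    print(product, num_sales, revenue)
```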

15:04

🌐 Challenges in Implementing Data Warehouse Features in Data Lakehouses

Bryan explores the challenges of implementing data warehouse features in a data lakehouse environment. He contrasts the simple architecture of relational databases, where operations like checking for unique keys or referential integrity are local and efficient, with the complexities of a scaled-out data platform. Because data lakehouses are built on files in Parquet and Delta format spread across many nodes, the same operations incur extra overhead and network traffic. Bryan traces the evolution of data lakehouses from basic query support toward transactional support with the introduction of Delta Lake, which adds transaction logs and ACID compliance. He also mentions the ongoing development of constraints and the difficulty of enforcing referential integrity in a distributed system.
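
As a sketch of why such checks get expensive, the snippet below performs a referential-integrity check by hand in PySpark (pyspark is assumed to be installed; on Databricks a session already exists as spark, and the data here is invented). The lookup that a single-node database resolves with a local index probe becomes a join that may shuffle both tables across the cluster:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ri-check-sketch").getOrCreate()

customers = spark.createDataFrame(
    [(1, "Customer A"), (2, "Customer B")], ["customer_key", "name"])
sales = spark.createDataFrame(
    [(10, 1, 25.0), (11, 99, 40.0)], ["sales_key", "customer_key", "amount"])

# A single-box relational database answers "does this customer exist?" with a
# local index lookup. Here the same question becomes a distributed anti-join,
# which may shuffle both tables so matching keys land on the same node.
orphans = sales.join(customers, on="customer_key", how="left_anti")

orphans.show()  # sales rows whose customer_key has no match (customer_key 99)
```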

20:05

🛡️ Data Lakehouse: Emulating Data Warehouse Features and Beyond

In this final section, Bryan wraps up the discussion by highlighting the progress data lakehouses have made in emulating traditional data warehouse features. He mentions the implementation of SQL language support, transactions, and constraints, while noting that some features are still evolving. Bryan also addresses the differences in security, backup, and recovery strategies between relational databases and data lakehouses, emphasizing the need for careful management of the underlying files. Additionally, he discusses schema evolution, which allows flexibility in handling changes to data structures. Bryan concludes by acknowledging the unique capabilities of data lakehouses, such as handling diverse data types and supporting machine learning and AI, which are not traditionally found in data warehouses.

Keywords

💡Data Lake House

The term 'Data Lake House' refers to a modern architectural approach that combines the best features of a traditional data warehouse with the flexibility and scalability of a data lake. In the video, it is presented as a solution to the problems encountered with data lakes, such as the lack of data governance and structure, which often led to 'data swamps.' The concept is central to the video's theme, illustrating an evolution in data management practices.

💡Data Swamp

A 'Data Swamp' is a term used to describe a disorganized and unstructured data lake that has become cluttered and difficult to navigate due to a lack of governance and structure. In the script, it is mentioned as an undesirable outcome of the initial hype around data lakes, where data was thrown into storage without any thought to its organization or utility, leading to a state where the data became 'pretty useless pretty quickly.'

💡Hadoop

Hadoop is an open-source framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It is mentioned in the script as the technology that initially fueled the excitement around data lakes about 12 years ago, with its ability to handle massive data processing using Hadoop MapReduce and the Hadoop Distributed File System.

💡Relational Database

A 'Relational Database' is a type of database that stores data tables in a way that allows relationships between the data to be defined and queried using a structured query language (SQL). The video script discusses the features of relational databases, such as support for SQL, ACID transactions, and various types of constraints, which are essential for maintaining data integrity and are being emulated in the data lake house architecture.

💡ACID

ACID is a set of properties that ensure that database transactions are processed reliably. The acronym stands for Atomicity, Consistency, Isolation, and Durability. In the script, ACID is used to describe the guarantees provided by relational databases during transactions, ensuring that each transaction is treated as a single 'All or Nothing' unit of work, which is crucial for maintaining data integrity.

💡Transaction Log

A 'Transaction Log' is a record of all the changes made to a database. It is used for data recovery in case of a system failure. In the video, transaction logs are highlighted as an important feature of relational databases, allowing for the rollback of changes in the event of a failed transaction and ensuring the database's recoverability.

💡Data Governance

Data Governance refers to the overall management of the availability, usability, integrity, and security of the data in an organization. The script discusses the lack of data governance in early data lakes, which contributed to the formation of data swamps. The data lake house aims to reintroduce data governance to avoid such issues.

💡Schema on Read

Schema on Read is a concept where the schema of the data is not enforced at the time the data is written but rather when it is read or queried. This is contrasted with 'Schema on Write,' where data must conform to a predefined schema before it is written to the database. The script mentions that data lakes initially relied on schema on read, which contributed to their flexibility but also to the challenges in maintaining data integrity.
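
A minimal sketch of schema on read with PySpark (paths and column names are invented; it assumes a local Spark session): the raw file carries no schema, and structure is imposed only at query time.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-on-read-sketch").getOrCreate()

# Land some raw text in the "lake" with no schema attached (illustrative path).
rows = ["1,Customer A,2022-11-01", "2,Customer B,2022-11-02"]
spark.sparkContext.parallelize(rows).saveAsTextFile("/tmp/lake/customers_raw")

# Schema on read: the structure is declared only when the files are queried.
customers = (spark.read
             .schema("customer_key INT, name STRING, signup_date DATE")
             .csv("/tmp/lake/customers_raw"))

customers.createOrReplaceTempView("customers")   # Hive-style table over flat files
spark.sql("SELECT name, signup_date FROM customers").show()
```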

💡Delta Lake

Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark and big data workloads. It is mentioned in the script as a significant development in the data lake house architecture, providing transactional support and adding transaction logs to what were previously just flat files in a data lake, thus enhancing data reliability and management.
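
A hedged sketch of the idea in PySpark (it assumes an environment where Delta Lake is already available, such as Databricks or a session configured with the delta-spark package; the path and table name are invented):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-sketch").getOrCreate()

df = spark.createDataFrame([(1, "Customer A"), (2, "Customer B")],
                           ["customer_key", "name"])

# A Delta write produces Parquet data files plus a _delta_log directory of
# transaction log entries -- that log is what makes commits atomic (ACID).
df.write.format("delta").mode("overwrite").save("/tmp/lakehouse/customers")

spark.sql("CREATE TABLE IF NOT EXISTS customers USING DELTA "
          "LOCATION '/tmp/lakehouse/customers'")

# Each statement below is its own atomic transaction recorded in the log.
spark.sql("UPDATE customers SET name = 'Customer A2' WHERE customer_key = 1")
spark.sql("DESCRIBE HISTORY customers").show(truncate=False)
```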

💡Schema Evolution

Schema Evolution refers to the ability of a database to adapt to changes in the data structure without requiring significant modifications to the underlying system. In the script, it is highlighted as a feature implemented by Databricks that allows for the dynamic addition of new columns or changes to the data schema, making the data lake house more flexible and adaptable to evolving data needs.
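
A small sketch of how that option is typically used with Delta tables in PySpark (again assuming Delta Lake is available; the path and the extra country column are invented for the example):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-evolution-sketch").getOrCreate()

path = "/tmp/lakehouse/customers_evolving"
v1 = spark.createDataFrame([(1, "Customer A")], ["customer_key", "name"])
v1.write.format("delta").mode("overwrite").save(path)

# A new feed shows up with an extra column nobody warned us about.
v2 = spark.createDataFrame([(2, "Customer B", "US")],
                           ["customer_key", "name", "country"])

# Without mergeSchema this append fails with a schema mismatch; with it, the
# table's schema evolves -- the new column is added and older rows read NULL.
(v2.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .save(path))

spark.read.format("delta").load(path).show()
```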

Highlights

Introduction to the concept of Data Lake House and its significance.

The evolution from Data Lake to Data Swamp due to lack of data governance.

The challenges faced with the traditional data warehouse and the need for evolution.

The role of Hadoop and MapReduce in the early hype around big data processing.

The importance of understanding the technical details before diving into Data Lake House.

The concept of 'Freedom' in data management and its drawbacks.

The necessity of data governance and the questions it raises about data integrity and accuracy.

The evolution of SQL Server and the core requirements for a data warehouse.

The features of relational databases that support a robust data warehouse.

The significance of transactions and ACID properties in maintaining data integrity.

The role of constraints in ensuring data integrity within relational databases.

The importance of transaction logs for recoverability in relational databases.

The architectural differences between relational databases and data lakes.

The challenges in implementing relational database features in a scaled-out data platform.

The introduction of Delta Lake and its role in adding transactional support to data lakes.

The ongoing evolution of Data Lake House in terms of constraints and security.

The differences in workloads between transactional systems and data warehouses.

The importance of high availability and recoverability in data lake house architecture.

The concept of schema evolution and its relevance in the dynamic world of data lakes.

The integration of metadata and governance in the Data Lake House.

The support for various file structures and the need for handling unstructured data in the Data Lake House.

The role of Data Lake House in supporting machine learning and AI.

Transcripts

play00:00

welcome back to my channel I'm Bryan

play00:01

Cafferky and in this video we're going to

play00:04

be talking about the data lake house and

play00:05

this will be an introduction to give you

play00:07

a conceptual background to what that is

play00:09

before I jump in please consider

play00:11

supporting me on patreon you'll get

play00:13

direct access to me special content and

play00:16

periodic q and A's among the benefits

play00:18

now one of the differences on my channel

play00:20

I try to maintain is introducing

play00:23

everything conceptually and making sure

play00:25

you have a firm foundation under which

play00:27

to understand the technical details that

play00:30

are discussed further on and in that

play00:32

same kind of mentality I'll be talking

play00:35

about the data lake house so we'll be

play00:37

talking about data lake house to data

play00:39

swamp revisiting the traditional data

play00:41

warehouse what's so hard about that and

play00:44

introducing the data lake house if

play00:46

you're going out Wayback machine maybe

play00:48

12 years ago a lot of hype around Hadoop

play00:50

and the idea that you could do all kinds

play00:52

of massive data processing using Hadoop

play00:55

mapreduce and as part of that the Hadoop

play00:58

distributed file system this sounded

play01:01

great people said let's use this stuff

play01:03

so we got people out there and they

play01:04

threw stuff out into the data Lake which

play01:06

is really just a storage system just

play01:08

like a file folder right throw some

play01:10

files out there did querying then they

play01:12

told their friends and they threw files

play01:14

out there and more people and so on and

play01:16

so on and eventually you had lots and

play01:18

lots of data sitting out on these file

play01:20

folders with no data governance no

play01:22

really thought about how it should be

play01:24

used and this was called Freedom right

play01:26

hey it was like we don't need no

play01:28

education we don't need no rules problem

play01:32

eventually they started asking questions

play01:33

like what is this data it was put out a

play01:36

year ago and the guy left what happened

play01:37

what is this is it current is it

play01:39

accurate where did it come from why are

play01:42

there so many bogus values in the data

play01:44

how do I get the information I need and

play01:46

has someone already gotten this data I

play01:48

need together so I can just use it so

play01:50

this became a data swamp instead of a

play01:54

data Lake in other words it became

play01:55

pretty useless pretty quickly people

play01:57

started to you know enter the trough of

play02:00

Despair This Promise wasn't all it was

play02:02

cracked up to be but In fairness

play02:04

sometimes people just jump too quickly

play02:06

and they forget that the data warehouse

play02:08

in the SQL Server world also had to

play02:11

evolve and you can't just throw away the

play02:14

core requirements it's not things don't

play02:16

work like that there's no magic but

play02:17

people forget that and Everybody sung

play02:19

the Praises of the end of the old

play02:21

relational databases and all the

play02:23

requirements of doing a data warehouse

play02:25

so let's talk about that for a minute

play02:27

let's talk about relational databases

play02:29

and the kinds of features that they

play02:31

offer to do a data warehouse well we

play02:33

know that they support these structured

play02:35

query language and that they're built

play02:37

upon set theory and E.F. Codd came out

play02:39

with all these rules and they came up

play02:41

with this idea of a relational database

play02:43

system and structured query language is

play02:45

a very rich and robust language that has

play02:47

been extended over the years and the

play02:49

idea behind a relational database is

play02:51

that you have these objects to store

play02:53

data called tables each table is

play02:55

supposed to store a discrete set of

play02:57

information maybe about sales or

play02:59

customers or products and only about

play03:01

that information so it relates all to

play03:04

that subject you may have a relationship

play03:06

between given tables sales for instance

play03:09

has a customer somebody bought the

play03:11

product so there's a relationship

play03:13

between these relational databases are

play03:15

also good at supporting transactions

play03:17

right what are transactions Brian this

play03:19

is how you maintain the data it starts

play03:21

by inserting data then you needed to go

play03:23

back and do updates and maybe eventually

play03:25

you realize like the customer isn't

play03:27

going to be a customer anymore you

play03:28

delete them so insert update and delete

play03:30

are the kinds of things we do against

play03:32

our tables to maintain the data and we

play03:35

do that in something called transactions

play03:37

transactions on a relational database

play03:39

support acid which means atomicity

play03:42

consistency isolation and durability you

play03:44

don't need to worry about all the

play03:45

details what that means but the idea

play03:48

behind it is sort of an All or Nothing

play03:49

Concept in which we apply all the

play03:52

related sets of maintenance tasks

play03:54

together or we do them not at all so if

play03:57

we consider something like the customer

play04:00

in sales we have something called

play04:02

referential Integrity here we want in

play04:04

other words we don't want to allow

play04:06

someone to insert a sales row unless

play04:09

there's a related customer which is

play04:11

represented on the sales Row the

play04:13

customer key that means that we have to

play04:15

first insert the customer then we can

play04:17

insert the sales row and if we can't do

play04:20

both we should roll everything back and

play04:22

that's the idea so there could be more

play04:24

tables involved we could have many

play04:26

tables involved and we would wrap the

play04:28

whole thing up in what's called a

play04:30

transaction do this insert do this

play04:32

insert do this update whatever it is we

play04:34

need to do and we put it together as a

play04:36

package called the transaction and at

play04:38

the end of it we say commit it now if

play04:41

the commit has a problem suddenly it

play04:43

runs out of space or something goes on

play04:44

it can't do it then it's supposed to

play04:46

roll back or the program it can test it

play04:49

and say oh something went wrong I need

play04:50

to roll back the transaction meaning

play04:52

undo everything as if I never touched it

play04:54

now you can imagine you've done a lot of

play04:56

changes it's important that you get rid

play04:59

of everything you did and return it to

play05:00

the state where it was before you even

play05:02

started that gives it a sort of

play05:04

consistency you may be missing data for

play05:07

that transaction

play05:08

but at least you're not Half Baked some

play05:10

of it got through some of it didn't and

play05:12

you don't even know what got through and

play05:13

what didn't so it's better to have this

play05:16

consistency and that's supported in

play05:18

these transactions All or Nothing it can

play05:20

do this or it doesn't do this

play05:23

another thing that extends the sort of

play05:25

Integrity of the data are constraints

play05:28

there's a whole bunch of different types

play05:30

of constraints you can see in the

play05:32

database picture there we talked about

play05:34

referential integrity and that means you

play05:36

can define a rule in the database that

play05:38

says you cannot insert a sales row

play05:41

unless the customer key is found on the

play05:44

customer table in other words if I'm

play05:46

inserting a sales for customer a then

play05:50

customer a must exist on the customer

play05:51

table there's a lot of other types of

play05:54

constraints domain constraints enforce

play05:56

that data types are what they should be

play05:59

integers and decimals you can't put the

play06:01

wrong type of data in a column we have

play06:03

key constraints what's a key constraint

play06:06

well things like uniqueness if you're

play06:08

going to have a key for a table which is

play06:10

the unique identifier for a given

play06:12

customer it has to be unique so that's a

play06:14

unique key constraint entity Integrity

play06:16

constraints mean that a key a primary

play06:19

key for instance for a customer cannot

play06:21

be null you cannot have null primary

play06:22

keys we've already talked about

play06:24

referential integrity constraints and

play06:26

then we talk about column value

play06:28

constraints also known as check

play06:30

constraints so what are those Brian but

play06:33

the idea is that you can say sales

play06:34

cannot be for instance less than zero in

play06:37

fact probably should always be greater

play06:38

than zero a date on a sale must be

play06:41

filled in you can't have a null value so

play06:43

null check constraints are one you could

play06:46

say values are a certain thing there's

play06:47

all these types of things that are meant

play06:49

to kind of keep a sane set of values in

play06:52

the tables supporting this whole idea of

play06:55

transactions are transaction logs these

play06:58

are external files that are maintained

play07:00

when the transactions are occurring so

play07:02

as you do inserts there's this little

play07:04

transcription taking place in the

play07:05

database to say oh I see you're adding a

play07:08

customer I see you're adding a sale I

play07:09

see you're doing this and making these

play07:11

changes and it's keeping track of all

play07:13

these changes so that when you say oh

play07:15

wait something went wrong it can use the

play07:17

transaction log to reverse the data back

play07:19

to the way it was before you started

play07:21

when you say commit it locks it down

play07:23

and says okay that's the way the

play07:25

data is and the transaction log record

play07:26

that that's the current state of data

play07:28

this was committed so transaction logs

play07:31

are really important for recoverability

play07:33

and they also allow us to know what

play07:35

happened to the data typically in

play07:37

relational databases now this is a

play07:38

self-contained world right the

play07:40

environment of relational databases that

play07:42

you can only go through the database

play07:44

server whether it's Oracle or SQL Server

play07:46

software to do anything these are not

play07:48

external files like in a data Lake these

play07:50

are all self-contained in the database

play07:51

and it must manage them periodically to

play07:54

make sure we don't lose our data if

play07:55

something goes wrong and the server

play07:57

crashes or something dbas will take

play07:59

database backups and typically this is

play08:01

like nightly we'll do a nightly database

play08:02

backup and then during the day all the

play08:05

transactions happening will be logged in

play08:06

the transaction log so lo and behold

play08:08

something happens in the system crashes

play08:10

and we lost all the data well you would

play08:12

go and grab the backup last backup and

play08:14

it might be from last night 8 pm and

play08:17

then you take the transaction logs of

play08:18

everything that happened since then to

play08:20

the latest transaction and you apply

play08:22

those transactions so you

play08:24

restore the backup and start applying

play08:25

the transactions until you get to the

play08:28

most current state you can get to which

play08:30

hopefully is pretty good and you haven't

play08:31

lost much data these are all parts of

play08:34

resilience and recoverability that SQL

play08:36

Server databases have been doing for a

play08:38

long time and relational databases also

play08:40

include a lot of security because again

play08:41

you have to go through this sort of

play08:43

veneer wrapper around it to get at the

play08:45

data so it's going to say who are you

play08:47

and do you have a password and what

play08:48

permissions do you have in the database

play08:50

so they have a lot of security around

play08:51

them and they typically also support

play08:54

triggers additionally to triggers that

play08:56

can be stored procedures and functions

play08:58

which I'm not really going to get too

play08:59

much into because it's not critical to a

play09:01

database to have that but it is a nice

play09:02

feature you can store essentially

play09:04

programs in the database written in the

play09:06

SQL language but triggers get back to

play09:08

sort of this data maintenance kind of

play09:10

thing for instance way back in the old

play09:12

days triggers would be used to enforce

play09:14

referential Integrity somebody would try

play09:16

to insert a row into the sales table and

play09:19

you would have a trigger and a trigger

play09:21

is something you write a piece of code

play09:23

that should execute when an action is

play09:25

performed on a table so you try to

play09:27

insert into sales you say well if

play09:28

there's an insert on the sales table

play09:30

before you do that check that customer

play09:33

key coming in is it on the customer

play09:35

table yes okay let it go through no

play09:37

don't let this transaction happen don't

play09:39

let them do the insert and there's on

play09:41

updates and on inserts and after and all

play09:43

these different types of controls are on

play09:45

the trigger I'm not a huge fan of

play09:46

triggers I haven't used them in a long

play09:47

time but there are use cases where they

play09:49

make sense they can also be used

play09:51

sometimes to do automatic logging you

play09:52

insert into the sales table and you

play09:54

could write to a log saying somebody

play09:55

inserted a row and before we get into

play09:57

the data lake house and what it is I

play10:00

need to talk about the two primary types

play10:02

of workloads that relational databases

play10:04

support on the left we talk about the

play10:07

transaction or online transactional

play10:09

process in oltp database workload and on

play10:13

the right we're going to talk about the

play10:14

data warehouse which is probably what

play10:16

we're going to focus on right but we've

play10:17

got these two potential types of

play10:19

workloads and both of these coexist and

play10:21

have coexisted in the relational

play10:23

database world for decades but the bread

play10:25

and butter of relational databases

play10:27

really has been on the left the

play10:29

transactional processing systems these

play10:31

are databases that store your mission

play10:33

critical data in other words you've got

play10:35

your applications running in this during

play10:37

the database typically in a relational

play10:38

database this could be your sales system

play10:41

Financial record systems like accounting

play10:44

and general ledger and accounts payable

play10:46

banking transactions all kinds of things

play10:49

that are absolutely essential for the

play10:50

daily operations of your business data

play10:53

warehouse purpose is reporting and it's

play10:55

decision making and planning that's what

play10:57

it's kind of involved around what should

play10:58

we be doing what products should we sell

play11:00

how can we do a better job transactional

play11:03

systems are focusing primarily on the

play11:06

data maintenance and doing it really

play11:07

well so that's why you saw that thing

play11:09

about transactions earlier insert update

play11:11

and deletes key they have to be fast

play11:13

they have to be efficient and they have

play11:15

to avoid a lot of locking because many

play11:17

different people could be trying to do

play11:19

these kinds of operations in different

play11:21

places all over the system and nowadays

play11:23

it could be a website that's distributed

play11:25

all over the world people are trying to

play11:27

do updates and inserts so it's an

play11:28

important thing but remember these are

play11:30

sort of small sets of data it's

play11:32

inserting a sales row and a customer row

play11:34

and things like that not hundreds of

play11:36

thousands of rows from a given single

play11:38

person data warehouse it's not about

play11:40

maintaining the data typically there's

play11:42

some sort of a batch window maybe once a

play11:45

night

play11:45

ETL kicks off and it does all the

play11:48

loading and merging and crushing up the

play11:50

data to where it needs to be

play11:51

what's really important is when people

play11:53

run queries do power bi do reporting

play11:55

they get quick responses in the querying

play11:58

which is involved in sorting and

play12:00

aggregating not looking at individual

play12:02

rows so again a very different Focus

play12:04

here in transactional systems the thing

play12:06

that is really crucial because these are

play12:08

like if this is down your business is

play12:10

dying and you have to solve this problem

play12:13

quickly so it's crucial it needs to be

play12:15

reliable it needs to be secure it needs

play12:18

to be resilient and very fault tolerant

play12:19

you'll see lots of architectures where

play12:22

something fails over the database system

play12:24

fails and it triggers a failover to a

play12:27

whole nother database system and that

play12:29

may even have another failover so if

play12:31

that goes down it goes to a third it's a

play12:33

lot of things going on it has to be

play12:35

recoverable we saw the whole backup

play12:36

recovery thing when we can restore the

play12:38

database if you lose it you're in deep

play12:40

trouble whereas on the data warehouse

play12:42

we're doing queries we're doing lots of

play12:45

aggregation also the originating source

play12:48

of data is the transactional systems

play12:50

which typically the data from there is

play12:52

going to be pulled out and put into our

play12:54

data warehouses so the transactional

play12:56

systems are really our system of record

play12:59

they're the ones that are going to

play13:00

ultimately say what's right and what

play13:02

isn't so they have to be accurate the

play13:04

data warehouse has to make sure they

play13:05

pull all that into the data warehouse

play13:07

and get it right also means that data

play13:10

warehouses have a lot of integration

play13:11

they pull data from many different

play13:13

sources and consolidate it so it can be

play13:15

used for reporting and again as I

play13:17

mentioned aggregation of large data sets

play13:19

and finally transactional oltp's

play13:22

databases

play13:24

use a modeling technique called entity

play13:26

relationship modeling and they apply the

play13:29

laws of normalization remember that in

play13:31

your interview

play13:32

so these ERM models allow you to

play13:35

eliminate data redundancy which is the

play13:38

enemy of transactional databases however

play13:41

the data warehouse says no way you want

play13:45

redundancy we're going to use

play13:46

dimensional modeling we don't care if we

play13:48

have 50 copies of the same customer all

play13:51

we care about is that the queries run

play13:53

fast so a very different orientation now

play13:55

why do I care about this Brian why is

play13:57

this so important because when I talk

play13:58

about borrowing concepts and features

play14:02

that are in relational databases I want

play14:04

you to remember that the data lake house

play14:06

is focused on emulating the data

play14:09

warehouse functionality in fact the data

play14:12

lake house name comes from taking data

play14:14

Lake data warehouse and merging those

play14:16

words together to get data lake house

play14:18

very clever all this is good Brian but

play14:21

what's the big deal why is it so hard to

play14:23

just add those kinds of features to a

play14:25

data Lake seems pretty easy right hmm

play14:28

think about it for a minute and this is

play14:30

important because as you run into trying

play14:32

to build your own data lake houses and

play14:34

things you might say hmm why is this so

play14:37

challenging and how can I get around it

play14:38

I'll tell you why because in the old

play14:40

relational database world you got a

play14:43

single box everything's local very

play14:45

simple you try to insert a row and you

play14:46

say is that a unique row it can very

play14:49

quickly check against the table and say

play14:50

yep that's unique doesn't exist yet or

play14:52

whatever that's unique row this key

play14:54

doesn't exist referential Integrity not

play14:56

a problem I go to insert a sale row

play14:58

takes the customer key says is that on

play15:00

the customer table it's all very local

play15:02

very close no network traffic really

play15:04

it's all right there very efficient all

play15:06

on one machine simple architecture now

play15:08

when you look at a scaled out data

play15:10

platform whether it's Databricks

play15:11

Snowflake or Synapse it's a very different

play15:13

world you've got many different nodes

play15:15

running many different machines they're

play15:16

actually separate they can be even be

play15:18

separated by some Network barriers this

play15:21

is a lot of overhead and this could be

play15:22

like 10 000 nodes in a cluster and bear

play15:25

in mind these are actually files right

play15:27

it's not like the relational database

play15:29

World in which everything's wrappered in

play15:31

a single monolithic interface in a

play15:35

service that wraps it all up these are

play15:37

actually separate storage files in the

play15:39

data lake house it based it on parquet

play15:42

which it enhanced and called Delta

play15:44

format but they're just flat files so

play15:46

this adds a lot of complexity because

play15:48

you say okay is this a unique key value

play15:50

it's going to have to look at all the

play15:52

cluster nodes to find that out if you

play15:54

said is this customer key on the

play15:56

customer table it's going to have to do

play15:58

a shuffle to co-locate the data so we

play16:00

can find out if that customer key is

play16:01

there or not these are very expensive

play16:03

operations and this non-efficient use of

play16:05

a scaled out platform but we need that

play16:08

functionality so we're kind of in a

play16:10

quandary and as mentioned that the

play16:11

database is external files so even

play16:13

though we're trying to emulate this sort

play16:14

of relational database functionality it

play16:17

has some key differences and that's a

play16:18

big one it's just a bunch of files

play16:20

really somebody could just take a file

play16:22

and delete it and that's part of your

play16:23

database so let's step back and let's

play16:26

say how well has the data Lakehouse

play16:28

implemented that old relational database

play16:31

data warehouse functionality like where

play16:33

are they at based on our previous

play16:34

diagram well spark has had at least the

play16:37

query level select type of support for

play16:40

SQL since 1.0 it's been good it's had that

play16:43

for a while and it can do that on top of

play16:45

flat files in fact the whole data Lake

play16:47

thing I talked about earlier I should

play16:48

have mentioned basically that was schema

play16:51

on read you throw a file out there

play16:52

Define an SQL schema over it which we

play16:55

call Hive and then you can use SQL to

play16:58

query and all that stuff and that's been

play16:59

around a long time bear in mind that

play17:01

data lake house wasn't worried about

play17:02

maintaining data it was just querying

play17:04

data that was dumped out there now we're

play17:06

talking about maintaining it and the

play17:08

good starting point is we have the SQL

play17:09

language to start with transactions have

play17:11

been implemented that's huge this is

play17:13

massive and this is when people started

play17:15

talking about Delta Lake what is Delta

play17:16

Lake the beginning of Delta Lake was the

play17:18

addition of transactional support on a

play17:21

data Lake all right it's based on

play17:22

parquet new format called Delta but what

play17:26

does Delta really do it adds transaction

play17:28

logs we're going to talk about in a

play17:30

minute we get acid support right so

play17:32

we're getting that kind of robustness we

play17:34

can do a complete commit or rollback

play17:36

something that didn't exist only a few

play17:38

years ago now exists and we have

play17:40

transaction logging so a Delta file is

play17:43

really just a parquet file but it's got

play17:45

transaction logging and we'll see more

play17:47

of that in future videos constraints

play17:49

first time I'm not using green I'm using

play17:51

an orange because although databricks

play17:54

has done a great job of implementing

play17:56

many types of constraints on the lake

play17:57

house it's still evolving there's still

play18:00

parts that aren't completely there I

play18:02

used a primary key generated the other

play18:04

day using something called the identity

play18:05

column and it worked great but then I found

play18:08

I had to kind of do a hack to get what

play18:10

the identity column value was so I could

play18:13

use it as a foreign key in another table

play18:15

which is a very common thing so I found

play18:17

it wasn't completely there in terms of

play18:19

its implementation but I was able to get

play18:21

it to work so it did what I needed so

play18:23

just be aware it's evolving and don't

play18:25

expect it to be just like a good old

play18:27

Oracle or SQL Server yeah the

play18:30

referential Integrity constraints I'm

play18:32

really excited about because I honestly

play18:33

didn't think I'd see this in a scaled

play18:35

out platform but databricks says they

play18:37

got it there I have not tried it yet and

play18:39

it's in public preview which means it's

play18:41

pretty far along so take a look at that

play18:43

if you get a chance I'm going to be

play18:44

interested in looking at that that's a

play18:46

challenging thing to do on a scaled out

play18:48

platform now security honestly you're in

play18:50

a cloud platform with databricks you've

play18:52

got to deal with that cloud platform's

play18:54

way of securing things and bear in mind

play18:56

this is not like a relational database

play18:58

that it's all wrapped around in the

play19:00

service these are just files that are

play19:01

sitting out on blob or Azure data Lake

play19:04

storage or AWS S3 storage or Google

play19:06

storage so security there's some there's

play19:09

some Grant revoke type stuff you can do

play19:11

within the data Lake there's some grants

play19:13

revoke kind of things you can do within

play19:14

the data lake house but really you need

play19:16

to really look at the cloud architecture

play19:18

you're working on and secure everything

play19:20

along those lines so I wouldn't worry

play19:22

much about that and as far as triggers

play19:23

go I don't think they're implemented I

play19:25

haven't really researched it yet last I

play19:27

checked they weren't but I'm okay

play19:28

without triggers too because they

play19:30

typically have caused more problems than

play19:32

they solved so I typically avoid them

play19:34

but I did want to mention them knowing

play19:36

databricks I wouldn't be surprised if

play19:37

they do Implement them for those people

play19:39

that really wanted database backups

play19:41

again this gets back to you kind of had

play19:43

to do database backups when you're in

play19:45

this monolithic specialized proprietary

play19:48

service called SQL Server or Oracle why

play19:50

Brian it's still storing files under the

play19:52

covers but it doesn't tell you where

play19:54

they are and it doesn't let you manage

play19:56

them or copy them in fact if you tried

play19:57

to do anything to them you'd probably

play19:58

corrupt your entire database so you need

play20:00

to leave them alone and let the system

play20:02

manage them because that's the case you

play20:04

need to use the special extraction

play20:06

program called the database backup to

play20:09

write a database backup file and then

play20:11

you have to use the database backup

play20:12

restore command to bring it back that's

play20:15

not the case with something like a data

play20:17

lake house these are flat files and you

play20:19

can slice and dice whatever you

play20:20

want and it's up to you to really manage

play20:22

them and be careful with what you do I'm

play20:24

sure that some functionality around

play20:25

governance for those is definitely

play20:27

needed but the database backup idea

play20:31

probably is less relevant here High

play20:33

availability and all those things and

play20:35

recoverability definitely are and you're

play20:37

going to need to think about that and

play20:39

definitely for these files that you're

play20:40

putting out on your lake house figure

play20:42

out how you can make sure you have good

play20:45

recoverability copies of them maybe you

play20:48

copy them to a few places in storage or

play20:50

move them to Archive storage so that if

play20:52

something happens and you lose them you

play20:53

can get them back typically cloud

play20:55

storage does a couple of copies maybe

play20:57

maybe more depending on the type of

play20:58

storage but I I think you should make

play21:01

sure you have the kind of reliability to

play21:03

make sure you don't lose data what is

play21:05

schema evolution first in the old

play21:07

data warehousing world I've seen many

play21:09

times where they've gone along and

play21:12

somebody just throws a new column into a

play21:14

data set or a table that you're pulling

play21:15

and you aren't expecting it happens all

play21:18

the time now in the traditional world

play21:19

everything breaks and you're kind of

play21:21

left around panicking trying to solve

play21:23

the problem because nobody told you they

play21:25

were adding new tables or columns or

play21:26

making changes or dropping columns but

play21:28

databricks added a really interesting

play21:31

feature called schema Evolution and

play21:33

schema Evolution when new columns come

play21:35

in or changes occur you can make a

play21:37

decision in your code in architecture to

play21:39

say how do you want to handle that and

play21:41

if you choose to you can allow your

play21:43

schema to evolve like adding new columns

play21:45

so that's a very cool feature and I'm

play21:47

sure you have to be careful how you use

play21:49

it but good to have because in this fast

play21:51

changing world with data lakes and all

play21:53

kinds of things being dumped into your

play21:54

your data storage you probably can't

play21:57

afford to just stop everything and do a

play21:59

project to figure out how to handle a

play22:01

new column I want to add a few things to

play22:03

what I've said I took a very specific

play22:05

Viewpoint it is based on my own

play22:07

background if I look at the history of

play22:08

things I come from that SQL relational

play22:10

background and most people that are

play22:12

moving to the data Lake do as well and

play22:14

database is aware of this however it's

play22:16

important to understand that the data

play22:18

Lake ain't your old-fashioned Legacy

play22:20

data warehouse there's a lot of

play22:22

functionality in there that wasn't in

play22:24

the old world data warehouse for

play22:27

instance if we go to the right and look

play22:28

at data lake house we get now because of

play22:31

this new data lake house functionality

play22:33

metadata and governance so that's what

play22:35

the data lake house brings to us but we

play22:37

also are getting all kinds of support

play22:40

for different types of file structures

play22:42

the Big Data world right not just

play22:44

structured data that we used to but

play22:46

pictures and images and sound and video

play22:48

and on and on we need to be able to

play22:50

process that it's not an option anymore

play22:52

we need to be able to analyze videos and

play22:53

say what's happening in the video or

play22:55

transcribe it we need to look at sound

play22:57

we need to be able to handle all kinds

play22:59

of data so this is something you just

play23:00

can't do in the relational database

play23:02

world and we want to be able to support

play23:03

machine learning and AI right this is

play23:06

a big part of where things are going

play23:07

again something that isn't traditionally

play23:09

supported in a data warehouse world now

play23:12

I also want to call your attention to the

play23:13

fact that this is databricks view of

play23:15

their evolution right going from your

play23:17

traditional data warehouse to the data

play23:19

Lake and then realizing they lacked the

play23:22

governance and features needed to be a

play23:24

proper data warehouse with all that

play23:25

transactional support so we get to the

play23:27

data lake house so this is theirs

play23:29

there's a lot of other things they talk

play23:31

about and if you look at this link at

play23:33

the bottom here it says more that's a

play23:35

link that will take you to a databricks

play23:36

Blog from about two years ago where they

play23:38

talk about data lake house and what it

play23:40

does and all these features and it

play23:42

doesn't really take it from the sort of

play23:43

viewpoint I'm taking it they're just

play23:44

looking at more from an overall

play23:45

functionality that was lacking that

play23:47

they're trying to build into the lake

play23:49

house so there's more in that than I've

play23:51

discussed here but I hope this gives you

play23:52

a sort of smattering of what's involved

play23:55

and what the lake house is about and by

play23:57

the way a link to these slides is

play23:59

available in the video description so we

play24:01

started out by talking about the data

play24:03

Lake to data swamp and the idea that

play24:05

wow we can handle big data and everybody

play24:07

jumped on board and got so excited that

play24:09

they thought that the magic of this

play24:11

technology meant we didn't have to do

play24:12

any work anymore just throw it out there

play24:14

and analyze and that disillusionment was

play24:18

identified pretty quickly then we got to

play24:20

looking at as databricks did what the

play24:22

traditional data warehouse did and how

play24:24

do we get some of that functionality

play24:26

back that we need in the data Lake world

play24:28

and then we talked about the

play24:29

architectural differences between a

play24:32

relational database and a data Lake

play24:34

which make applying some of these things

play24:36

particularly challenging but somehow

play24:39

they did most of it and they're doing

play24:41

more and so we talked about introducing

play24:43

the data lake house and all the things

play24:45

that have been implemented towards

play24:46

giving us that kind of data warehouse

play24:48

functionality that's it I want to thank

play24:51

you for watching please like share

play24:53

subscribe put comments in and questions

play24:55

until next time I'm Bryan and remember we're all

play24:58

in this together thank you

Related tags: Data Lake, Data Lake House, Data Governance, Big Data, Data Warehouse, Hadoop, SQL Server, Relational Databases, Data Management, Data Analytics