Data Lakehouse: An Introduction
Summary
TL;DR: In this video, Brian introduces the concept of the Data Lakehouse, a convergence of data lakes and data warehouses. He discusses the evolution from the data lake, which became a data swamp due to a lack of governance, to the structured environment of traditional data warehouses. Brian explores the challenges of implementing data warehouse features on a distributed data platform, highlighting advancements in technologies like Delta Lake that offer transactional support and ACID properties. The summary also touches on the architectural differences between relational databases and data lakes, and the importance of features like schema evolution and metadata governance in the Data Lakehouse.
Takeaways
- The Data Lakehouse is an emerging concept that combines the best of data lakes and data warehouses, aiming to provide a unified platform for data storage and analytics.
- The Data Lakehouse introduces transactional support to data lakes with technologies like Delta Lake, which adds transaction logs to Parquet files, enabling ACID properties (see the sketch after this list).
- The evolution from data lake to data swamp highlighted the lack of data governance, leading to the need for a more structured approach to handle big data effectively.
- Relational databases offer robust features like structured query language (SQL), ACID transactions, and various constraints that ensure data integrity and security.
- Traditional data warehouses are built on top of relational databases and are optimized for reporting and decision-making through data aggregation and fast querying.
- The architectural differences between data lakes and relational databases present challenges in implementing data warehouse features on a distributed data platform.
- Schema evolution is a feature of the Data Lakehouse that allows for dynamic changes to the data schema without disrupting existing systems, accommodating the fast pace of data changes.
- Security in the Data Lakehouse context relies on cloud platform security measures, as opposed to the encapsulated security features of traditional relational databases.
- The Data Lakehouse aims to support a wide range of data types beyond structured data, including images, videos, and other multimedia formats, which is essential for modern data analytics.
- Support for machine learning and AI is a significant aspect of the Data Lakehouse, expanding its capabilities beyond traditional data warehousing to include advanced analytics.
- The Data Lakehouse concept is continuously evolving, with features like referential integrity and other constraints still in development to enhance data management and governance.
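To make the Delta Lake takeaway concrete, here is a minimal sketch of an ACID write to a Delta table using the open-source delta-spark package; the table path and column names are invented for illustration.

```python
# Minimal sketch (assumes `pip install pyspark delta-spark`); the path and
# columns below are made up for illustration.
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder.appName("lakehouse-intro")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Each write below is an atomic ACID transaction: it either fully commits a
# new entry to the table's _delta_log transaction log or leaves the table
# untouched.
df = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["customer_id", "name"])
df.write.format("delta").mode("overwrite").save("/tmp/customers_delta")

# Readers always see the last committed snapshot, never a half-finished write.
spark.read.format("delta").load("/tmp/customers_delta").show()
```

The transaction log is what turns a folder of plain Parquet files into a transactional table.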
Q & A
What is the main topic of the video?
-The main topic of the video is the concept of the data lakehouse, its introduction, and how it combines elements of data lakes and traditional data warehouses.
What was the initial problem with data lakes?
-The initial problem with data lakes was the lack of data governance, which led to a situation where a lot of data was stored without any structure or thought about how it should be used, eventually turning into a 'data swamp'.
What are the core features of relational databases that support a data warehouse?
-The core features of relational databases that support a data warehouse include support for structured query language (SQL), built on set theory, ACID transactions for data integrity, constraints for data validity, and transaction logs for recoverability.
What is the difference between OLTP and data warehouse workloads in relational databases?
-OLTP (Online Transaction Processing) workloads focus on transactional processing systems for daily operations, while data warehouse workloads focus on reporting and decision-making with an emphasis on querying large datasets for aggregation.
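As a rough illustration of the two workload styles, the sketch below contrasts an OLTP-style single-row update with a warehouse-style aggregation, using Python's built-in sqlite3 and an invented sales table; real OLTP and warehouse systems would of course be separate platforms.

```python
# Illustrative contrast only; the sales table and its contents are made up.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, "
            "customer_id INTEGER, region TEXT, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)",
                [(1, 10, "east", 25.0), (2, 11, "west", 40.0),
                 (3, 10, "east", 15.0)])

# OLTP-style work: small, surgical changes to individual rows.
con.execute("UPDATE sales SET amount = 30.0 WHERE sale_id = 1")

# Warehouse-style work: scanning and aggregating large sets for reporting.
for row in con.execute("SELECT region, SUM(amount) FROM sales GROUP BY region"):
    print(row)
```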
What is the significance of ACID in relational databases?
-ACID in relational databases stands for Atomicity, Consistency, Isolation, and Durability. These properties ensure that database transactions are processed reliably, preserving data integrity by applying either all of a transaction's changes or none at all.
What is Delta Lake and how does it relate to data lakehouses?
-Delta Lake is a technology that adds transactional support to data lakes, providing ACID properties. It is based on the Parquet file format and includes transaction logging, which is crucial to the development of data lakehouses.
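A hedged peek at the mechanism: assuming the /tmp/customers_delta table written in the earlier sketch exists, every committed transaction appears as a numbered JSON entry inside the table's _delta_log directory, alongside the Parquet data files.

```python
# Inspect the transaction log of the hypothetical table from the earlier sketch.
import os

table_path = "/tmp/customers_delta"  # invented local path
print(sorted(os.listdir(os.path.join(table_path, "_delta_log"))))
# e.g. ['00000000000000000000.json'] -- one entry per committed transaction
```

The same log underpins Delta's time travel, letting readers query earlier versions of the table.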
What challenges arise when trying to implement relational database features in a data lakehouse environment?
-Challenges include the distributed nature of data storage in a data lakehouse, which requires additional overhead and network traffic to perform operations like checking for unique keys or referential integrity across multiple nodes.
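A sketch of why such checks are costly: with rows spread across many nodes, validating a foreign key amounts to a distributed join, which forces a shuffle so that matching keys end up co-located. This uses the Spark session from the earlier sketch, with invented tables.

```python
# Find "orphan" sales rows whose customer_id has no matching customer.
# Assumes the `spark` session from the earlier Delta sketch; data is made up.
customers = spark.createDataFrame([(1,), (2,)], ["customer_id"])
sales = spark.createDataFrame([(100, 1), (101, 99)], ["sale_id", "customer_id"])

# A "left anti" join keeps only sales rows with no match -- on a cluster this
# triggers an expensive shuffle, which is trivial on a single-box database.
orphans = sales.join(customers, on="customer_id", how="left_anti")
orphans.show()  # sale 101 references customer 99, which does not exist
```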
What is schema evolution and why is it important in data lakehouses?
-Schema evolution is the ability of a system to adapt to changes in the schema, such as adding new columns, without breaking existing processes. It is important in data lakehouses to accommodate the fast-paced and dynamic nature of data storage and analysis.
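A minimal sketch of schema evolution with Delta Lake's mergeSchema option (same assumed Spark session, invented path): the second write adds an email column, and without the option it would be rejected.

```python
# First write establishes the table schema (id, name).
spark.createDataFrame([(1, "Alice")], ["id", "name"]) \
    .write.format("delta").mode("overwrite").save("/tmp/evolve_demo")

# Second write introduces a new column; mergeSchema lets the table evolve
# instead of failing the append.
spark.createDataFrame([(2, "Bob", "bob@example.com")],
                      ["id", "name", "email"]) \
    .write.format("delta").mode("append") \
    .option("mergeSchema", "true") \
    .save("/tmp/evolve_demo")

# Old rows simply show NULL for the new email column.
spark.read.format("delta").load("/tmp/evolve_demo").show()
```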
How does the data lakehouse approach differ from traditional data warehouses in terms of security?
-In data lakehouses, security is often managed through the cloud platform's architecture rather than being encapsulated within the database service itself. This means that security measures need to be implemented at the infrastructure level, where the data is stored.
What are the new capabilities that data lakehouses bring to the table compared to traditional data warehouses?
-Data lakehouses bring capabilities such as support for a variety of file structures, not just structured data, and the ability to handle unstructured data like images, videos, and sounds. They also support machine learning and AI, which are not traditionally part of data warehouse functionality.
What is the final message or conclusion of the video?
-The final message of the video is that while data lakehouses have made significant progress in emulating the functionality of traditional data warehouses, they still face unique challenges due to their architectural differences. However, they offer new capabilities that are essential for modern data processing and analysis.
Outlines
Introduction to Data Lakehouse Concept
In this introductory video, Brian from Kathkey's channel discusses the concept of the data lakehouse. He emphasizes the importance of understanding the conceptual background before diving into technical details. Brian introduces the idea of a data lakehouse as a blend of data lake and data warehouse, revisiting the traditional data warehouse and the hype around Hadoop about a decade ago. He explains how the initial excitement of using Hadoop for massive data processing led to the creation of data lakes, which eventually turned into data swamps due to lack of governance and data management. The video sets the stage for a deeper exploration of the data lakehouse, highlighting the need for a structured approach to managing large volumes of data.
Importance of Data Integrity and Transactions in Relational Databases
Brian delves into the features of relational databases that are crucial for maintaining data integrity. He discusses the role of Structured Query Language (SQL) and the principles of set theory in organizing data into discrete tables. The video highlights the significance of transactions, which support operations like insert, update, and delete, and ensure data consistency through the ACID properties (Atomicity, Consistency, Isolation, Durability). Brian also covers various types of constraints, such as referential integrity, domain constraints, key constraints, and check constraints, that help maintain data integrity. He further explains the role of transaction logs in recording changes and enabling data recovery, and the importance of database backups for resilience and recoverability.
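The constraint types listed above map directly onto standard SQL DDL. Here is a compact sketch using Python's built-in sqlite3, with invented customer and sales tables:

```python
# Compact illustration of the constraint types discussed above.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only if asked
con.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,    -- key/entity integrity: unique, not null
    name        TEXT NOT NULL           -- null check constraint
);
CREATE TABLE sales (
    sale_id     INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),  -- referential integrity
    amount      REAL CHECK (amount > 0),  -- column value (check) constraint
    sale_date   TEXT NOT NULL
);
""")
con.execute("INSERT INTO customer VALUES (1, 'Alice')")
# Both of these would raise sqlite3.IntegrityError:
#   INSERT INTO sales VALUES (1, 99, 10.0, '2024-01-01')  -- no customer 99
#   INSERT INTO sales VALUES (1, 1, -5.0, '2024-01-01')   -- amount fails CHECK
con.execute("INSERT INTO sales VALUES (1, 1, 10.0, '2024-01-01')")
```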
Understanding OLTP and Data Warehouse Workloads in Relational Databases
This paragraph focuses on the two primary types of workloads supported by relational databases: transactional processing (OLTP) and data warehousing. Brian explains that transactional systems are critical for daily operations, handling tasks like sales, accounting, and banking transactions. These systems prioritize fast and efficient data maintenance. On the other hand, data warehouses are designed for reporting, decision-making, and planning, focusing on query performance and data aggregation. Brian also discusses the architectural differences between these systems, such as the need for high availability and fault tolerance in transactional systems, and the emphasis on integration and large data set aggregation in data warehouses. He also touches on the modeling techniques used in these systems, such as entity-relationship modeling for transactional databases and dimensional modeling for data warehouses.
Challenges in Implementing Data Warehouse Features in Data Lakehouses
Brian explores the challenges of implementing data warehouse features in a data lakehouse environment. He contrasts the simple architecture of relational databases, where operations like checking for unique keys or referential integrity are local and efficient, with the complexities of a scaled-out data platform. Data lakehouses, which are based on files like Parquet and Delta format, face overhead in operations due to the distributed nature of the system. Brian discusses the evolution of data lakehouses, starting with basic query support and moving towards transactional support with the introduction of Delta Lake, which adds transaction logs and ACID compliance. He also mentions the ongoing development of constraints and the challenges of implementing referential integrity in a distributed system.
Data Lakehouse: Emulating Data Warehouse Features and Beyond
In this final paragraph, Brian wraps up the discussion by highlighting the progress made in data lakehouses in emulating traditional data warehouse features. He mentions the implementation of SQL language support, transactions, and constraints, although noting that some features are still evolving. Brian also addresses the differences in security, backup, and recovery strategies between relational databases and data lakehouses. He emphasizes the need for careful management of files in a data lakehouse environment. Additionally, he discusses the concept of schema evolution, which allows for flexibility in handling changes in data structures. Brian concludes by acknowledging the unique capabilities of data lakehouses, such as handling diverse data types and supporting machine learning and AI, which are not traditionally found in data warehouses.
Keywords
Data Lakehouse
Data Swamp
Hadoop
Relational Database
ACID
Transaction Log
Data Governance
Schema on Read
Delta Lake
Schema Evolution
Highlights
Introduction to the concept of the Data Lakehouse and its significance.
The evolution from Data Lake to Data Swamp due to lack of data governance.
The challenges faced with the traditional data warehouse and the need for evolution.
The role of Hadoop and MapReduce in the early hype around big data processing.
The importance of understanding the technical details before diving into Data Lake House.
The concept of 'Freedom' in data management and its drawbacks.
The necessity of data governance and the questions it raises about data integrity and accuracy.
The evolution of SQL Server and the core requirements for a data warehouse.
The features of relational databases that support a robust data warehouse.
The significance of transactions and ACID properties in maintaining data integrity.
The role of constraints in ensuring data integrity within relational databases.
The importance of transaction logs for recoverability in relational databases.
The architectural differences between relational databases and data lakes.
The challenges in implementing relational database features in a scaled-out data platform.
The introduction of Delta Lake and its role in adding transactional support to data lakes.
The ongoing evolution of the Data Lakehouse in terms of constraints and security.
The differences in workloads between transactional systems and data warehouses.
The importance of high availability and recoverability in Data Lakehouse architecture.
The concept of schema evolution and its relevance in the dynamic world of data lakes.
The integration of metadata and governance in the Data Lakehouse.
The support for various file structures and the need for handling unstructured data in the Data Lakehouse.
The role of the Data Lakehouse in supporting machine learning and AI.
Transcripts
welcome back to my channel I'm Brian
kathkey and in this video we're going to
be talking about the data lake house and
this will be an introduction to give you
a conceptual background to what that is
before I jump in please consider
supporting me on patreon you'll get
direct access to me special content and
periodic q and A's among the benefits
now one of the differences on my channel
I try to maintain is introducing
everything conceptually and making sure
you have a firm foundation under which
to understand the technical details that
are discussed further on and in that
same kind of mentality I'll be talking
about the data lake house so we'll be
talking about data lake house to data
swamp revisiting the traditional data
warehouse what's so hard about that and
introducing the data lake house if
you're going in the Wayback machine maybe
12 years ago a lot of hype around Hadoop
and the idea that you could do all kinds
of massive data processing using Hadoop
mapreduce and as part of that the Hadoop
distributed file system this sounded
great people said let's use this stuff
so we got people out there and they
threw stuff out into the data Lake which
is really just a storage system just
like a file folder right throw some
files out there did querying then they
told their friends and they threw files
out there and more people and so on and
so on and eventually you had lots and
lots of data sitting out on these file
folders with no data governance no
really thought about how it should be
used and this was called Freedom right
hey it was like we don't need no
education we don't need no rules problem
eventually they started asking questions
like what is this data it was put out a
year ago and the guy left what happened
what is this is it current is it
accurate where did it come from why are
there so many bogus values in the data
how do I get the information I need and
has someone already gotten this data I
need together so I can just use it so
this became a data swamp instead of a
data Lake in other words it became
pretty useless pretty quickly people
started to you know enter the trough of
Despair This Promise wasn't all it was
cracked up to be but In fairness
sometimes people just jump too quickly
and they forget that the data warehouses
in the SQL Server World also had to
evolve and you can't just throw away the
core requirements it's not things don't
work like that there's no magic but
people forget that and Everybody sung
the Praises of the end of the old
relational databases and all the
requirements of doing a data warehouse
so let's talk about that for a minute
let's talk about relational databases
and the kinds of features that they
offer to do a data warehouse well we
know that they support the structured
query language and that they're built
upon set theory and E.F. Codd came up
with all these rules and they came up
with this idea of a relational database
system and structured query language is
a very rich and robust language that has
been extended over the years and the
idea behind a relational database is
that you have these objects to store
data called tables each table is
supposed to store a discrete set of
information maybe about sales or
customers or products and only about
that information so it relates all to
that subject you may have a relationship
between given tables sales for instance
has a customer somebody bought the
product so there's a relationship
between these tables relational databases are
also good at supporting transactions
right what are transactions Brian this
is how you maintain the data it starts
by inserting data then you needed to go
back and do updates and maybe eventually
you realize like the customer isn't
going to be a customer anymore you
delete them so insert update and delete
are the kinds of things we do against
our tables to maintain the data and we
do that in something called transactions
transactions on a relational database
support acid which means atomicity
consistency isolation and durability you
don't need to worry about all the
details what that means but the idea
behind it is sort of an All or Nothing
Concept in which we apply all the
related sets of maintenance tasks
together or we do them not at all so if
we consider something like the customer
in sales we have something called
referential Integrity here in
other words we don't want to allow
someone to insert a sales row unless
there's a related customer which is
represented on the sales Row the
customer key that means that we have to
first insert the customer then we can
insert the sales row and if we can't do
both we should roll everything back and
that's the idea so there could be more
tables involved we could have many
tables involved and we would wrap the
whole thing up in what's called a
transaction do this insert do this
insert do this update whatever it is we
need to do and we put it together as a
package called the transaction and at
the end of it we say commit it now if
the commit has a problem suddenly it
runs out of space or something goes on
it can't do it then it's supposed to
roll back or the program it can test it
and say oh something went wrong I need
to roll back the transaction meaning
undo everything as if I never touched it
now you can imagine you've done a lot of
changes it's important that you get rid
of everything you did and return it to
the state where it was before you even
started that gives it a sort of
consistency you may be missing data for
that transaction
but at least you're not Half Baked some
of it got through some of it didn't and
you don't even know what got through and
what didn't so it's better to have this
consistency and that's supported in
these transactions All or Nothing it can
do this or it doesn't do this
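To make the all-or-nothing behavior concrete, here is a minimal sketch using Python's built-in sqlite3, mirroring the customer and sales example; either both inserts commit or neither does.

```python
# Minimal all-or-nothing transaction; the tables are invented for illustration.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT)")
con.execute("CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, "
            "customer_id INTEGER, amount REAL)")

try:
    with con:  # opens a transaction; commits on success, rolls back on error
        con.execute("INSERT INTO customer VALUES (1, 'Alice')")
        con.execute("INSERT INTO sales VALUES (1, 1, 25.0)")
        raise RuntimeError("simulated failure mid-transaction")
except RuntimeError:
    pass

# Neither row survived: the failure rolled back the whole package of changes.
print(con.execute("SELECT COUNT(*) FROM customer").fetchone())  # (0,)
print(con.execute("SELECT COUNT(*) FROM sales").fetchone())     # (0,)
```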
another thing that extends the sort of
Integrity of the data are constraints
there's a whole bunch of different types
of constraints you can see in the
database picture there we talked about
referential integrity and that means you
can define a rule in the database that
says you cannot insert a sales row
unless the customer key is found on the
customer table in other words if I'm
inserting a sales for customer a then
customer a must exist on the customer
table there's a lot of other types of
constraints domain constraints enforce
that data types are what they should be
integers and decimals you can't put the
wrong type of data in a column we have
key constraints what's a key constraint
well things like uniqueness if you're
going to have a key for a table which is
the unique identifier for a given
customer it has to be unique so that's a
unique key constraint entity Integrity
constraints mean that a key a primary
key for instance for a customer cannot
be null you cannot have null primary
keys we've already talked about
referential integrity constraints and
then we talk about column value
constraints also known as check
constraints so what are those Brian well
the idea is that you can say sales
cannot be for instance less than zero in
fact probably should always be greater
than zero a date on a sale must be
filled in you can't have a null value so
null check constraints are one you could
say values are a certain thing there's
all these types of things that are meant
to kind of keep a sane set of values in
the tables supporting this whole idea of
transactions are transaction logs these
are external files that are maintained
when the transactions are occurring so
as you do inserts there's this little
transcription taking place in the
database to say oh I see you're adding a
customer I see you're adding a sale I
see you're doing this and making these
changes and it's keeping track of all
these changes so that when you say oh
wait something went wrong it can use the
transaction log to reverse the data back
to the way it was before you started
when you say commit it locks it
down and says okay that's the way the
data is and the transaction log record
that that's the current state of data
this was committed so transaction logs
are really important for recoverability
and they also allow us to know what
happened to the data typically in
relational databases now this is a
self-contained world right the
environment of relational databases is that
you can only go through the database
server whether it's Oracle or SQL Server
software to do anything these are not
external files like in a data Lake these
are all self-contained in the database
and it must manage them periodically to
make sure we don't lose our data if
something goes wrong and the server
crashes or something dbas will take
database backups and typically this is
like nightly we'll do a nightly database
backup and then during the day all the
transactions happening will be logged in
the transaction log so lo and behold
something happens in the system crashes
and we lost all the data well you would
go and grab the backup last backup and
it might be from last night 8 pm and
then you take the transaction logs of
everything that happened since then to
the latest transaction and you apply
those transactions to do this you
restore the backup and start applying
the transactions until you get to the
most current state you can get to which
hopefully is pretty good and you haven't
lost much data these are all parts of
resilience and recoverability that SQL
Server databases have been doing for a
long time and relational databases also
include a lot of security because again
you have to go through this sort of
veneer wrapper around it to get at the
data so it's going to say who are you
and do you have a password and what
permissions do you have in the database
so they have a lot of security around
them and they typically also support
triggers and additionally to triggers there
can be stored procedures and functions
which I'm not really going to get too
much into because it's not critical to a
database to have that but it is a nice
feature you can store essentially
programs in the database written in the
SQL language but triggers get back to
sort of this data maintenance kind of
thing for instance way back in the old
days triggers would be used to enforce
referential Integrity somebody would try
to insert a row into the sales table and
you would have a trigger and a trigger
is something you write a piece of code
that should execute when an action is
performed on a table so you try to
insert into sales you say well if
there's an insert on the sales table
before you do that check that customer
key coming in is it on the customer
table yes okay let it go through no
don't let this transaction happen don't
let them do the insert and there's on
updates and on inserts and after and all
these different types of controls are on
the trigger I'm not a huge fan of
triggers I haven't used them in a long
time but there are use cases where they
make sense they can also be used
sometimes to do automatic logging you
insert into the sales table and you
could write to a log saying somebody
inserted a row
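The referential-integrity trigger described here can be sketched in a few lines with Python's built-in sqlite3 (tables invented for illustration):

```python
# Before any insert into sales, check the incoming customer key exists.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY);
CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, customer_id INTEGER);

CREATE TRIGGER check_customer BEFORE INSERT ON sales
WHEN NOT EXISTS (SELECT 1 FROM customer
                 WHERE customer_id = NEW.customer_id)
BEGIN
    SELECT RAISE(ABORT, 'customer does not exist');
END;
""")
con.execute("INSERT INTO customer VALUES (1)")
con.execute("INSERT INTO sales VALUES (100, 1)")     # allowed
# con.execute("INSERT INTO sales VALUES (101, 99)")  # would abort: no customer 99
```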
and before we get into the data lake house and what it is I
need to talk about the two primary types
of workloads that relational databases
support on the left we talk about the
transaction or online transactional
processing or oltp database workload and on
the right we're going to talk about the
data warehouse which is probably what
we're going to focus on right but we've
got these two potential types of
workloads and both of these coexist and
have coexisted in the relational
database world for decades but the bread
and butter of relational databases
really has been on the left the
transactional processing systems these
are databases that store your mission
critical data in other words you've got
your applications running against
the database typically in a relational
database this could be your sales system
Financial record systems like accounting
and general ledger and accounts payable
banking transactions all kinds of things
that are absolutely essential for the
daily operations of your business data
warehouse purpose is reporting and it's
decision making and planning that's what
it's kind of involved around what should
we be doing what products should we sell
how can we do a better job transactional
systems are focusing primarily on the
data maintenance and doing it really
well so that's why you saw that thing
about transactions earlier insert update
and deletes key they have to be fast
they have to be efficient and they have
to avoid a lot of locking because many
different people could be trying to do
these kinds of operations in different
places all over the system and nowadays
it could be a website that's distributed
all over the world people are trying to
do updates and inserts so it's an
important thing but remember these are
sort of small sets of data it's
inserting a sales row and a customer row
and things like that not hundreds of
thousands of rows from a given single
person data warehouse it's not about
maintaining the data typically there's
some sort of a batch window maybe once a
night
ETL kicks off and it does all the
loading and merging and crunching up the
data to where it needs to be
what's really important is when people
run queries do power bi do reporting
they get quick responses in the querying
which is involved in sorting and
aggregating not looking at individual
rows so again a very different Focus
here in transactional systems the thing
that is really crucial because these are
like if this is down your business is
dying and you have to solve this problem
quickly so it's crucial it needs to be
reliable it needs to be secure it needs
to be resilient and very fault tolerant
you'll see lots of architectures where
something fails over the database system
fails and it triggers a failover to a
whole nother database system and that
may even have another failover so if
that goes down it goes to a third it's a
lot of things going on it has to be
recoverable we saw the whole backup
recovery thing when we can restore the
database if you lose it you're in deep
trouble whereas on the data warehouse
we're doing queries we're doing lots of
aggregation also the originating source
of data is the transactional systems
which typically the data from there is
going to be pulled out and put into our
data warehouses so the transactional
systems are really our system of record
they're the ones that are going to
ultimately say what's right and what
isn't so they have to be accurate the
data warehouse has to make sure they
pull all that into the data warehouse
and get it right also means that data
warehouses have a lot of integration
they pull data from many different
sources and consolidate it so it can be
used for reporting and again as I
mentioned aggregation of large data sets
and finally transactional oltp
databases
use a modeling technique called entity
relationship modeling and they apply the
laws of normalization remember that in
your interview
so these ERM models allow you to
eliminate data redundancy which is the
enemy of transactional databases however
the data warehouse says no way you want
redundancy we're going to use
dimensional modeling we don't care if we
have 50 copies of the same customer all
we care about is that the queries run
fast so a very different orientation now
why do I care about this Brian why is
this so important because when I talk
about borrowing concepts and features
that are in relational databases I want
you to remember that the data lake house
is focused on emulating the data
warehouse functionality in fact the data
lake house name comes from taking data
Lake data warehouse and merging those
words together to get data lake house
very clever all this is good Brian but
what's the big deal why is it so hard to
just add those kinds of features to a
data Lake seems pretty easy right hmm
think about it for a minute and this is
important because as you run into trying
to build your own data lake houses and
things you might say hmm why is this so
challenging and how can I get around it
I'll tell you why because in the old
relational database world you got a
single box everything's local very
simple you try to insert a row and you
say is that a unique row it can very
quickly check against the table and say
yep that's unique doesn't exist yet or
whatever that's unique row this key
doesn't exist referential Integrity not
a problem I go to insert a sale row
takes the customer key says is that on
the customer table it's all very local
very close no network traffic really
it's all right there very efficient all
on one machine simple architecture now
when you look at a scaled out data
platform whether it's databricks
snowflake or synapse it's a very different
world you've got many different nodes
running many different machines they're
actually separate they can be even be
separated by some Network barriers this
is a lot of overhead and this could be
like 10 000 nodes in a cluster and bear
in mind these are actually files right
it's not like the relational database
World in which everything's wrappered in
a single monolithic interface in a
service that wraps it all up these are
actually separate storage files in the
data lake house it's based on parquet
which databricks enhanced and called Delta
format but they're just flat files so
this adds a lot of complexity because
you say okay is this a unique key value
it's going to have to look at all the
cluster nodes to find that out if you
said is this customer key on the
customer table it's going to have to do
a shuffle to co-locate the data so we
can find out if that customer key is
there or not these are very expensive
operations and this is an inefficient use of
a scaled out platform but we need that
functionality so we're kind of in a
quandary and as mentioned the
database is external files so even
though we're trying to emulate this sort
of relational database functionality it
has some key differences and that's a
big one it's just a bunch of files
really somebody could just take a file
and delete it and that's part of your
database so let's step back and let's
say how well has the data Lakehouse
implemented that old relational database
data warehouse functionality like where
are they at based on our previous
diagram well spark has had at least the
query level select type of support for
SQL since 1.0 so it's had that
for a while and it can do that on top of
flat files in fact the whole data Lake
thing I talked about earlier I should
have mentioned basically that was schema
on read you throw a file out there
Define an SQL schema over it which we
call Hive and then you can use SQL to
query and all that stuff and that's been around a long time
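Schema on read can be sketched with PySpark, assuming a Spark session and an invented CSV path: the raw file sits on storage untouched, and a schema is declared only at query time.

```python
# Declare a schema over a raw file at read time, then query it with SQL.
# Assumes a `spark` session as in the earlier sketches; the path is made up.
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema = StructType([
    StructField("customer_id", IntegerType()),
    StructField("name", StringType()),
])
df = spark.read.schema(schema).csv("/tmp/raw_dump/customers.csv")

# Register a SQL view over the raw files and query them like a table.
df.createOrReplaceTempView("customers")
spark.sql("SELECT name FROM customers WHERE customer_id = 1").show()
```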
bear in mind that
data lake house wasn't worried about
maintaining data it was just querying
data that was dumped out there now we're
talking about maintaining it and the
good starting point is we have the SQL
language to start with transactions have
been implemented that's huge this is
massive and this is when people started
talking about Delta Lake what is Delta
Lake the beginning of Delta Lake was the
addition of transactional support on a
data Lake all right it's based on
parquet new format called Delta but what
does Delta really do it adds transaction
logs we're going to talk about in a
minute we get acid support right so
we're getting that kind of robustness we
can do a complete commit or rollback
something that didn't exist only a few
years ago now exists and we have
transaction logging so a Delta file is
really just a parquet file but it's got
transaction logging and we'll see more
of that in future videos constraints
first time I'm not using green I'm using
an orange because although databricks
has done a great job of implementing
many types of constraints on the lake
house it's still evolving there's still
parts that aren't completely there I
used a primary key generated the other
day using something called the identity
column it worked great but then I found
I had to kind of do a hack to get what
the identity column value was so I could
use it as a foreign key in another table
which is a very common thing so I found
it wasn't completely there in terms of
its implementation but I was able to get
it to work so it did what I needed so
just be aware it's evolving and don't
expect it to be just like a good old
Oracle or SQL Server yeah the
referential Integrity constraints I'm
really excited about because I honestly
didn't think I'd see this in a scaled
out platform but databricks says they
got it there I have not tried it yet and
it's in public preview which means it's
pretty far along so take a look at that
if you get a chance I'm going to be
interested in looking at that that's a
challenging thing to do on a scaled out
platform now security honestly you're in
a cloud platform with databricks you've
got to deal with that cloud platform's
way of securing things and bear in mind
this is not like a relational database
that it's all wrapped around in the
service these are just files that are
sitting out on blob or Azure data Lake
storage or AWS storage or Google
storage so for security there's some
Grant revoke type stuff you can do
within the data lake house but really you need
to look at the cloud architecture
you're working on and secure everything
along those lines so I wouldn't worry
much about that and as far as triggers
go I don't think they're implemented I
haven't really researched it yet last I
checked they weren't but I'm okay
without triggers too because they
typically have caused more problems than
they solved so I typically avoid them
but I did want to mention them knowing
databricks I wouldn't be surprised if
they do Implement them for those people
that really wanted database backups
again this gets back to you kind of had
to do database backups when you're in
this monolithic specialized proprietary
service called SQL Server Oracle why
Brian it's still storing files under the
covers but it doesn't tell you where
they are and it doesn't let you manage
them or copy them in fact if you tried
to do anything to them you'd probably
corrupt your entire database so you need
to leave them alone and let the system
manage them because that's the case you
need to use the special extraction
program called the database backup to
write a database backup file and then
you have to use the database backup
restore command to bring it back that's
not the case with something like a data
lake house these are flat files and you
can slice and dice them however you
want and it's up to you to really manage
them and be careful with what you do I'm
sure that some functionality around
governance for those is definitely
needed but the database backup idea
probably is less relevant here High
availability and all those things and
recoverability definitely are and you're
going to need to think about that and
definitely for these files that you're
putting out on your lake house figure
out how you can make sure you have good
recoverability copies of them maybe you
copy them to a few places in storage or
move them to Archive storage so that if
something happens and you lose them you
can get them back typically cloud
storage does a couple of copies maybe
more depending on the type of
storage but I think you should make
sure you have the kind of reliability to
make sure you don't lose data what is
schema evolution well first in the old
data warehousing world I've seen many
times where they've gone along and
somebody just throws a new column into a
data set or a table that you're pulling
and you aren't expecting it happens all
the time now in the traditional world
everything breaks and you're kind of
left around panicking trying to solve
the problem because nobody told you they
were adding new tables or columns or
making changes or dropping columns but
databricks added a really interesting
feature called schema Evolution and
schema Evolution when new columns come
in or changes occur you can make a
decision in your code and architecture to
say how do you want to handle that and
if you choose to you can allow your
schema to evolve like adding new columns
so that's a very cool feature and I'm
sure you have to be careful how you use
it but good to have because in this fast
changing world with data lakes and all
kinds of things being dumped into your
data storage you probably can't
afford to just stop everything and do a
project to figure out how to handle a
new column I want to add a few things to
what I've said I took a very specific
Viewpoint it is based on my own
background if I look at the history of
things I come from that SQL relational
background and most people that are
moving to the data Lake do as well and
databricks is aware of this however it's
important to understand that the data
Lake ain't your old-fashioned Legacy
data warehouse there's a lot of
functionality in there that wasn't in
the old world data warehouse for
instance if we go to the right and look
at data lake house we get now because of
this new data lake house functionality
metadata and governance so that's what
the data lake house brings to us but we
also are getting all kinds of support
for different types of file structures
the Big Data world right not just
structured data that we used to but
pictures and images and sound and video
and on and on we need to be able to
process that it's not an option anymore
we need to be able to analyze videos and
say what's happening in the video or
transcribe it we need to look at sound
we need to be able to handle all kinds
of data so this is something you just
can't do in the relational database
world and we want to be able to support
machine learning and AI right this is
a big part of where things are going
again something that isn't traditionally
supported in a data warehouse world now
I also want to call your attention to the
fact that this is databricks view of
their evolution right going from your
traditional data warehouse to the data
Lake and then realizing they lacked the
governance and features needed to be a
proper data warehouse with all that
transactional support so we get to the
data lake house so this is theirs
there's a lot of other things they talk
about and if you look at this link at
the bottom here it says more that's a
link that will take you to a databricks
Blog from about two years ago where they
talk about data lake house and what it
does and all these features and it
doesn't really take it from the sort of
viewpoint I'm taking they're just
looking at more from an overall
functionality that was lacking that
they're trying to build into the lake
house so there's more in that than I've
discussed here but I hope this gives you
a sort of smattering of what's involved
and what the lake house is about and by
the way a link to these slides is
available in the video description so we
started out by talking about the data
Lake to data Swap and the idea that why
wow we can handle big data and everybody
jumped on board and got so excited that
they thought that the magic of this
technology meant we didn't have to do
any work anymore just throw it out there
and analyze and that disillusionment was
identified pretty quickly then we got to
looking at as databricks did what the
traditional data warehouse did and how
do we get some of that functionality
back that we need in the data Lake world
and then we talked about the
architectural differences between a
relational database and a data Lake
which make applying some of these things
particularly challenging but somehow
they did most of it and they're doing
more and so we talked about introducing
the data lake house and all the things
that have been implemented towards
giving us that kind of data warehouse
functionality that's it I want to thank
you for watching please like share
subscribe put comments in and questions
until next time we're all
in this together thank you