Google SWE teaches systems design | EP21: Hadoop File System Design
Summary
TL;DR: This video delves into the architecture of the Hadoop Distributed File System (HDFS), explaining its design for high-throughput reading and writing of large-scale data. It covers HDFS's distributed nature, chunk-based storage, metadata management by the Name Node, and the importance of replication for data availability. It also touches on the evolution of HDFS toward high availability through coordination services, addressing the single point of failure issue. The video promises to connect these concepts to databases built on top of HDFS in a subsequent video.
Takeaways
- 🌞 The video discusses the architecture of Hadoop Distributed File System (HDFS), focusing on its design for high throughput reads and writes.
- 📚 HDFS is based on the Google File System and is popular because it runs on commodity hardware (standard desktop-class machines), making large-scale distributed systems accessible.
- 🔍 HDFS is designed for write-once-read-many times files, storing data in chunks across multiple data nodes to improve parallelism for large files.
- 🗃️ The Name Node is a critical component of HDFS, storing all metadata about files and their chunks, and is kept in memory for quick access.
- 🔄 HDFS uses a write-ahead log (edit log) and an fsimage file for persistent storage of metadata changes, ensuring data is not lost on Name Node failure.
- 🔁 HDFS employs rack-aware replication to enhance data availability and throughput, placing replicas in different racks to minimize the risk of simultaneous node failures.
- 🔄 Pipelining is used in replication to ensure all replicas acknowledge write operations, maintaining strong consistency despite potential failures.
- 📖 Reading from HDFS involves querying the Name Node for the location of data chunks and selecting the data node with the least network latency for the client.
- 🖊️ Writing to HDFS, especially appending, involves a process of selecting a primary replica and ensuring data is propagated through the replication pipeline.
- ⚠️ A single point of failure exists with the Name Node; however, High Availability (HA) HDFS uses a quorum journal manager and Zookeeper for failover.
- 🔑 The video concludes by highlighting HDFS's strengths for large-scale compute and data storage, and its role as a foundation for databases that provide more complex querying capabilities.
Q & A
What is the primary purpose of the Hadoop Distributed File System (HDFS)?
-The primary purpose of HDFS is to store large data sets reliably across clusters of commodity hardware, providing high throughput access to the data for distributed processing.
Why is HDFS designed to be written once and then read many times?
-HDFS is designed this way to optimize for large-scale data processing workloads, where data is often written once and then processed or analyzed multiple times.
What is the typical size of the chunks in which HDFS stores files?
-Typically, the chunks in HDFS are around 64 to 128 megabytes in size.
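To make the chunking concrete, here is a minimal sketch of how a file's size maps onto fixed-size blocks. The 128 MB block size matches the common default mentioned above; the function name and sizes are illustrative, not part of any real HDFS API.

```python
# Illustrative sketch: splitting a file into HDFS-style fixed-size blocks.
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, a common HDFS default

def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE) -> list[int]:
    """Return the size in bytes of each block a file of file_size bytes occupies."""
    if file_size == 0:
        return []
    full, remainder = divmod(file_size, block_size)
    return [block_size] * full + ([remainder] if remainder else [])

# A 1 GB file splits into 8 full 128 MB blocks; a 130 MB file into one full
# block plus a 2 MB tail. Each block can then live on a different data node,
# which is where the read/write parallelism comes from.
blocks = split_into_blocks(1024 * 1024 * 1024)
```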
What is the role of the NameNode in HDFS?
-The NameNode in HDFS is responsible for storing all the metadata regarding files, including the mapping of file blocks to the DataNodes where they are stored.
How does HDFS handle file system metadata changes?
-HDFS handles file system metadata changes by using an edit log, which is a write-ahead log for the NameNode, and periodically checkpointing the in-memory state to an fsimage file on disk.
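The edit-log-plus-checkpoint scheme can be sketched as a toy model: log each metadata change before applying it in memory, occasionally snapshot the in-memory state, and recover by replaying only the log entries written after the last snapshot. Class and method names here are made up for illustration; real NameNode persistence is far more involved.

```python
# Toy model of NameNode persistence: in-memory namespace, write-ahead edit
# log, periodic fsimage checkpoints, and crash recovery by replay.
import copy

class MiniNameNode:
    def __init__(self):
        self.namespace = {}    # path -> metadata, the in-memory state
        self.edit_log = []     # write-ahead log of metadata changes
        self.fsimage = {}      # last checkpointed snapshot of the namespace
        self.checkpointed = 0  # edit-log index covered by the fsimage

    def apply(self, op, path, meta=None):
        # Log first (write-ahead), then mutate the in-memory state.
        self.edit_log.append((op, path, meta))
        self._replay_one(self.namespace, (op, path, meta))

    def _replay_one(self, state, entry):
        op, path, meta = entry
        if op == "create":
            state[path] = meta
        elif op == "delete":
            state.pop(path, None)

    def checkpoint(self):
        # Persist the current in-memory state as the fsimage.
        self.fsimage = copy.deepcopy(self.namespace)
        self.checkpointed = len(self.edit_log)

    def recover(self):
        # fsimage + edits written after the checkpoint == pre-crash state.
        state = copy.deepcopy(self.fsimage)
        for entry in self.edit_log[self.checkpointed:]:
            self._replay_one(state, entry)
        self.namespace = state

nn = MiniNameNode()
nn.apply("create", "/a", {"size": 1})
nn.checkpoint()
nn.apply("create", "/b", {"size": 2})
nn.recover()  # simulated reboot: /a from the fsimage, /b replayed from the log
```

The point of the checkpoint is exactly what the answer above describes: sequential log appends are cheap, so only the occasional fsimage write pays the cost of serializing the whole namespace.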
What is the significance of the DataNode's block report in HDFS?
-The block report from a DataNode informs the NameNode about the blocks it holds, allowing the NameNode to maintain an up-to-date map of file blocks and their locations.
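A small sketch of what the NameNode does with block reports: aggregate who holds which block, then flag any block whose live replica count is below the target factor. The data shapes here are invented for illustration.

```python
# Illustrative sketch: aggregating DataNode block reports and detecting
# under-replicated blocks, as the NameNode does after leaving safe mode.
from collections import defaultdict

def find_under_replicated(block_reports, replication_factor=3):
    """block_reports: {datanode_id: [block_ids held]}.
    Returns {block_id: extra replicas needed} for under-replicated blocks."""
    locations = defaultdict(set)
    for node, blocks in block_reports.items():
        for blk in blocks:
            locations[blk].add(node)
    return {blk: replication_factor - len(nodes)
            for blk, nodes in locations.items()
            if len(nodes) < replication_factor}

reports = {"dn1": ["b1", "b2"], "dn2": ["b1"], "dn3": ["b1", "b2"]}
missing = find_under_replicated(reports)  # b2 has only 2 replicas
```

The same computation runs when a data node stops sending heartbeats: its blocks drop out of the location map, and any block that falls below the threshold gets re-replicated.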
What does rack-aware replication mean in the context of HDFS?
-Rack-aware replication in HDFS means that data chunks are replicated in a way that considers the physical location of the nodes, typically placing one replica in the same rack as the writer and the remaining replicas together on a single remote rack, to maximize availability while minimizing cross-rack network traffic.
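The default placement policy described above can be sketched in a few lines: one replica local to the writer's rack, the rest grouped on one randomly chosen remote rack. Rack names and the topology representation are made up; the real placement policy is pluggable and handles many more edge cases.

```python
# Hedged sketch of the default rack-aware placement heuristic: local replica
# plus the remaining replicas together on a single random remote rack.
import random

def place_replicas(writer_rack, racks, replication_factor=3, rng=random):
    """Return one rack per replica for a new block."""
    remote_rack = rng.choice([r for r in racks if r != writer_rack])
    # Grouping the remote replicas on one rack cuts cross-rack bandwidth for
    # the pipeline, while still surviving the loss of an entire rack.
    return [writer_rack] + [remote_rack] * (replication_factor - 1)

placement = place_replicas("rack-A", ["rack-A", "rack-B", "rack-C"])
```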
How does the replication process in HDFS ensure data consistency?
-The replication process in HDFS ensures data consistency by using a pipeline approach where data is written to a primary replica and then propagated to secondary replicas. A write is only considered successful if all replicas in the pipeline acknowledge it.
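A toy model of the pipeline semantics: the write travels replica to replica, and the client only sees success if every hop acknowledges. It also shows the failure mode the video describes, where an early replica commits the data even though the client is told the write failed. Everything here is a stand-in; real DataNodes stream packets, not whole writes.

```python
# Toy model of pipelined replication with all-replicas-must-ack semantics.
def pipeline_write(chain, data):
    """chain: list of replica write functions, each returning True on success.
    Returns True only if every replica in the pipeline acknowledges."""
    for write in chain:
        if not write(data):
            # Replicas earlier in the chain may already hold the data even
            # though the client will be told the write failed.
            return False
    return True

def make_replica(storage, healthy=True):
    def write(data):
        if not healthy:
            return False
        storage.append(data)  # the replica commits the write locally
        return True
    return write

s1, s2, s3 = [], [], []
ok = pipeline_write(
    [make_replica(s1), make_replica(s2, healthy=False), make_replica(s3)],
    "chunk-7",
)
# ok is False, yet s1 already committed "chunk-7" -- exactly the partial-write
# inconsistency that forces the client to retry.
```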
What is the client's strategy when it encounters a write failure in HDFS?
-When a client encounters a write failure in HDFS, it should keep retrying the write operation until it receives a success message.
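The retry behavior is simple enough to sketch directly: reissue the write until the pipeline acknowledges or a retry budget runs out. The function name and retry limit are illustrative, not part of the HDFS client API.

```python
# Sketch of the client-side retry loop for failed pipeline writes.
def write_with_retry(attempt_write, max_attempts=5):
    """Reissue the write until it is acknowledged; return the attempt count."""
    for attempt in range(1, max_attempts + 1):
        if attempt_write():
            return attempt
    raise RuntimeError(f"write failed after {max_attempts} attempts")

# Simulate a pipeline that rejects the write twice before acknowledging.
outcomes = iter([False, False, True])
tries = write_with_retry(lambda: next(outcomes))
```

Retrying is safe in this model because a replica that already committed the data simply ends up holding it again once the full pipeline finally acknowledges.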
What is the main issue with the original design of HDFS in terms of fault tolerance?
-The main issue with the original design of HDFS in terms of fault tolerance is the single point of failure represented by the NameNode. If the NameNode goes down, the entire system crashes.
How does High Availability (HA) in HDFS address the single NameNode issue?
-High Availability in HDFS addresses the single NameNode issue by using a backup NameNode that stays updated with the state of the primary NameNode through a replicated edit log, allowing for a failover to the backup NameNode in case the primary one goes down.
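The failover mechanism can be modeled as a distributed lock: both NameNodes tail the replicated edit log, and whichever holds the lock acts as active. The class below is a plain in-process stand-in for a ZooKeeper ephemeral lock, purely for illustration.

```python
# Toy stand-in for the ZooKeeper-based failover lock in HA HDFS.
class FailoverLock:
    def __init__(self):
        self.holder = None  # which NameNode currently holds the lock

    def try_acquire(self, node):
        if self.holder is None:
            self.holder = node
            return True
        return False

    def release(self, node):
        # In ZooKeeper this happens automatically when the holder's
        # session expires (e.g. the active NameNode crashes).
        if self.holder == node:
            self.holder = None

lock = FailoverLock()
lock.try_acquire("nn-primary")   # primary becomes active
lock.release("nn-primary")       # primary dies; its lock is released
became_active = lock.try_acquire("nn-standby")  # standby takes over
```

Because the standby has been replaying the same replicated edit log all along, its namespace is already close to current when it grabs the lock.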
Outlines
📚 Introduction to HDFS Architecture
The speaker begins by discussing their personal situation before diving into the main topic: the architecture of the Hadoop Distributed File System (HDFS). They explain HDFS's significance in large-scale distributed systems and its origins from the Google File System. The speaker highlights HDFS's popularity due to its ability to run on standard desktop computers and its use in batch processing with tools like MapReduce and Spark. They also mention the importance of HDFS as a foundational building block for databases, allowing for efficient data interaction and computation. The paragraph concludes with an overview of HDFS's design, focusing on its write-once-read-many (WORM) approach and the use of data chunks for improved parallelism in reading and writing large files.
🔑 The Crucial Role of the Name Node in HDFS
This section delves into the critical component of HDFS: the Name Node. The Name Node is responsible for storing all metadata about files, including names, subdirectories, and the locations of data chunks across various data nodes. The speaker describes the Name Node's operation, which involves maintaining an in-memory representation of the file system's state and using an edit log as a write-ahead log for changes. They also explain the process of checkpointing the file system state to disk and the Name Node's recovery procedure in case of a crash. The paragraph further discusses the Name Node's role in handling replication, ensuring data availability, and managing the replication factor through block reports from data nodes.
🔄 HDFS Replication and High Availability
The speaker addresses the replication process in HDFS, emphasizing its rack-aware design to maximize availability and throughput. They explain the default replication factor of three and how HDFS places replicas across different racks to minimize the risk of data loss due to rack failures. The paragraph also covers the pipelining mechanism for data replication, ensuring that all replicas acknowledge a write operation before the client considers it successful. The speaker then discusses the read process in HDFS, where the client selects the nearest data node for minimal latency. They also describe the process for appending to a file, including the selection of a primary replica and the replication pipeline. The paragraph concludes with a brief mention of issues in Hadoop, particularly the single point of failure with the Name Node, and introduces the concept of high availability in HDFS through coordination services like ZooKeeper.
🛡️ Enhancing HDFS with High Availability and Coordination Services
In this final paragraph, the speaker discusses the evolution of HDFS to include high availability features, overcoming the single Name Node limitation. They explain the use of a quorum journal manager, which replicates the edit log across multiple nodes, allowing a backup Name Node to stay synchronized with the primary. In the event of a primary Name Node failure, the backup can take over by acquiring a distributed lock, ensuring continuous operation. The speaker concludes by summarizing HDFS's strengths in providing high read and write throughput, its rack-aware replication schema, and its improved fault tolerance through coordination services. They also mention the trade-offs of HDFS, such as its strong consistency model and potential data inconsistencies that applications may need to handle. The paragraph ends with a teaser for the next video, which will explore databases built on top of HDFS.
Keywords
💡HDFS
💡High Throughput
💡Name Node
💡Data Node
💡Replication
💡Rack Awareness
💡Edit Log
💡FS Image File
💡Pipelining
💡High Availability
💡HBase
Highlights
Introduction to Hadoop Distributed File System (HDFS) architecture and its role in high-throughput reads and writes.
HDFS is based on the Google File System and is popular for running on standard desktop computers.
HDFS is designed for batch processing with tools like MapReduce, Spark, and Tez.
Overview of HDFS's storage method using data chunks across multiple data nodes to improve parallelism.
The importance of the Name Node in storing metadata and managing file system changes through the edit log.
Explanation of the Name Node's boot process, including entering safe mode and receiving block reports from data nodes.
Rack-aware replication in HDFS to maximize availability and throughput, reducing latency and risk of data loss.
How pipelining works in HDFS for efficient data replication across nodes.
Client-side process for reading files in HDFS, including querying the Name Node and choosing the optimal data node for minimal latency.
The complexity of appending to files in HDFS, involving selecting a primary replica and managing the replication pipeline.
Visualization of the write process in HDFS, demonstrating the interaction between replicas and the Name Node.
Challenges with the single Name Node design and the risks associated with Name Node failure.
Introduction to High Availability HDFS and the use of coordination services for Name Node failover.
The role of the Quorum Journal Manager in maintaining a replicated log for Name Node state synchronization.
HDFS's strengths in providing high read and write throughput and its evolution towards fault tolerance.
HDFS's limitations, including potential data inconsistencies and the need for application-level handling.
Upcoming discussion on databases built on top of HDFS for enhanced data interaction and complex querying.
Transcripts
all right i'm back again this time in
the morning because my roommate's not
here if you guys can tell my voice is a
little messed up
i guess i was playing with the boys a
little too late last night and for some
reason my knees are a little scratched
up too i don't really get that one but
who knows so anyways today we're going
to talk about the architecture of hdfs
and figure out why that works the way it
does and how they're able to achieve
high throughput both on reads and writes
and then that'll allow us to segue
hopefully pretty easily into hbase and
see kind of the good reasons to use
something like that
okay so hdfs and the design of it
just to give a background i've mentioned
distributed file systems in the past
but basically they're a really important
component of a ton of large-scale
distributed systems
even though hdfs is probably the most
popular one it itself is based off the
google file system which is a paper that
came out well over a decade ago now
and because of the fact that hdfs can be
run on just normal desktop computers
it's really popular obviously there's a
big wave to be able to just like spin up
instances of things using things like
ec2 clusters or just amazon web
services in general
so even though hdfs is really useful for
things like batch processing we've
talked about this with mapreduce spark
and tez
hdfs is really really good for a
database building block so that you can
provide an extra layer to kind of
interact with the data on hdfs and then
ultimately run a ton of computations on
it
okay so just to give an overview as you
can see on the right
you're gonna see a ton of terms
that you don't know yet but by the end
of this video you will so generally
speaking hadoop is designed to basically
be able to write a file once and then
you know from then on you can append or
truncate it but generally speaking just
read it many times over the way this is
done is by storing them in chunks across
a bunch of different data nodes and
typically these are around 64 128
megabyte chunks and the reason you do
that is to improve the parallelism of
both reading and writing big files
oftentimes these files are gigabytes or
maybe even terabytes in size and as a
result having to write them all
sequentially would be terrible
and then finally in order to ensure
availability and no data loss chunks are
obviously going to be replicated
okay so the first component of hdfs that
we have to talk about is probably the
most important one and that's going to
be called the name node so the name node
generally speaking is where all of the
metadata regarding files is stored so
that includes things like names but not
only does it just hold the names of the
files and perhaps even their
subdirectories if it is a directory
but more importantly it has to keep
track of all the chunks so basically all
of the data nodes where those chunks are
located as well as their corresponding
version numbers like i said you can
append or truncate files and doing so
would increment the version number of it
okay so how does it do this well it
keeps all of that metadata in memory all
of the changes to file system memory go
to something
sorry all of the changes to file system
metadata go to something called the edit
log so the edit log is effectively just
a write ahead log for the name node
because obviously if we had to you know
like go ahead and change kind of disk
state every single time or some
persistent state of the entire state of
the file system those writes would not
be sequential and they would take longer
so we do is we put them to a write-ahead
log
change the state in memory and then
occasionally checkpoint that state on
disk to something called an fs image
file
and then if the name node ever crashes
and has to reboot the fs image
checkpoint file in conjunction with
whatever edit log rights come after that
checkpoint can be combined to create a
new state for the name node
okay in terms of continuing to talk
about the name node
it actually only keeps the location of
all the chunks
only in memory so when it first boots
what happens is the name node goes into
safe mode and it's going to receive
something called a block report from
each data node where the data node is
going to tell it which chunks are held
on that data node the name node is then
going to compile all of this information
construct that local state and say
here's where the chunks are located and
say now
it sees that only one replica is holding
a given chunk and you know the user is
say specified replication factor of
three for that chunk it's going to say
okay we don't have enough replicas for
this
chunk in particular let's go ahead and
add some additional replicas for it so
go ahead and replicate that chunk to two
other nodes and that way we can reach
the replication threshold the same thing
will occur if a name node assumes that a
given data node is dead because it
hasn't received any heartbeats from it
for a while
okay now let's talk about replication so
replication in hadoop is something
called rack aware and this is really
important because it allows for both
maximizing availability and throughput
so we'll talk about that in a second
chunks are going to be replicated in a
way that not only reduces latency for
clients but also reduces the possibility
of all the replica nodes going down
because of the fact that they're put in
a different rack or data center
so for example for the default
replication factor of three hadoop is
going to put one replica in the same
rack as the writer and then two replicas
on the same remote random rack
the reason they put them in the same
random rack is just to minimize network
bandwidth you don't have to go to two
different racks and since we have kind
of a synchronous replication here where
we wait for all of these writes to
complete it's actually pretty important
that all of these replicas complete
their write as fast as possible it's not
eventually consistent so i'll touch upon
that in a second
so
how does replication actually work well
there's something called pipelining
basically the replicas are arranged into
some order and the data is pipelined
from one replica to the next on a write
or an append or a truncate writes are
only considered successful from the
client's point of view if all of the
replicas in this pipeline actually
acknowledge them so even though in
theory this should lead to strong
consistency
the issue is that say the first replica
receives a write it's going to go ahead
and commit that to itself so it's going
to perform the write and then the second
and third replicas don't ever actually
acknowledge the write meaning that you
know they didn't perform them themselves
well the client is going to receive a
failure for its write however
it's going to be the case that the write
is still in one of the replicas so
generally speaking when a client
receives a failure on a write
it needs to just keep retrying until it
receives a success message
so as you can see the first replica is
going to send that write to the second
one which sends the write to the third
one which then sends the acknowledgement
back and back again so once this whole
process is complete the client sees its
write as successful
okay
in terms of reading in hadoop basically
all that happens here is the client is
going to query the master or when i say
the master in this case i mean the name
node to get a list of data nodes
carrying the chunk that it wants it's
going to figure out which data node is
closest to it because like i said hadoop
is aware of the rack that the nodes are
in and as a result of that it can say
for a given client which one is probably
going to have minimal network latency
when communicating with it so you choose
the best data node to read from
you're going to cache this result on the
client in the case that you want to read
that file again because like i said
write once read many times and then the
client's going to just go ahead and
perform that read
okay in terms of doing writes this is a
little bit more complex but i'll have a
visualization after i walk through this
process
so to append to a file
go ahead and reach out to the name node
see the data nodes where the chunk is
located and then you have to pick
something called a primary replica this
is going to be the first replica in that
replication pipeline
if there's already a primary and the
lease for the primary is still valid
because you know the lease basically
says how long until there's no longer a
primary perform the write to the primary
replica let it go through the chain of
replication otherwise we need to pick a
primary replica how can we do this well
we look at the data nodes that the chunk
is located on
and pick one of them with the most
up-to-date version of that chunk if it
doesn't exist we have a data loss
problem and hopefully this never comes
up
once the primary replica is determined
all the other replicas are considered
secondary we kind of establish that path
for the replication and then the client
is going to make the write to the
primary replica and you know hope to get
a success result
okay so to actually visualize this let's
say i'm trying to write jordan's
nudes.png we've got three replicas
replica one two and three
and the first thing we're going to do is
i'm going to
go ahead and ask the name node for what
chunks are holding it and try and find
out what the leader was so i find out
that the leader was replica 3 because as
you can see it has version 23 of the
file there but that's since expired
so what are we going to do we're going
to randomly pick a replica with an
up-to-date version number as the leader
so now the leader is going to be r1 and
let's say we have a lease that expires
in an hour for it because replica 1 also
has version 23 it could have been one or
three here
now all that's going to happen is we're
going to go ahead and contact replica
one which is the leftmost one in the
bottom there and send the write through
it and let it propagate
through the
pipeline okay so what are some issues
with hadoop well if you've been paying
attention so far you may have noticed
that i've only mentioned one name node
which is obviously a problem what
happens if the name node goes down well
everything crashes so in the original
hadoop implementation
there was kind of like this hacky way of
solving things called a secondary name
node which was basically just like a
standby name node that went ahead and
tried to take in all those changes but
there's actually a better way of solving
this and it uses coordination services
like i talked about in the last video so
this is known as high availability hdfs
what do we actually do well keep in mind
that the name node basically the
main persistence point of the name node
that you can use to derive the state of
the name node is the edit log so the
edit log is basically just going to keep
track of all that file metadata changes
such as renaming files creating a new
directory anything along those lines and
so instead of just keeping all of those
changes locally to the name node what
we're going to do is go ahead and use
something like a few zookeeper nodes to
create a replicated log which represents
the edit log
this in hadoop terms is known as the
quorum journal manager so anyways after
we have this replicated log we can now
have a second instance of a name node
which you know we'll just call it the
backup for now and all it's going to do
is read that replicated log and keep its
state up to date in the same way the
name node does
so by using uh this coordination service
here we're actually able to keep a
secondary name node relatively up to
date so in the event that the first one
goes down
you know say we have
the coordination service also has a
distributed lock
the first one goes down it will no
longer be holding that distributed lock
and then the second one can grab the
distributed lock to basically say i'm
the leader now i am going to be the name
node
okay so in conclusion um hdfs can
provide really high read and write
throughput by using a rack-aware
replication schema
and reading as well
this is super useful and in addition to
that while the original hdfs kind of
design wasn't very fault tolerant the
fact that they've now added a
coordination service to it is great for
kind of that leader failover of the name
node and allows high availability
obviously hdfs isn't perfect like i said
it kind of aims for strong consistency
but in reality you might have data
inconsistencies and have to handle this
in your application code
but on the whole hdfs is really good for
storing data in conjunction with large
scale compute we're going to see that
plenty of databases are built on top of
hdfs in order to kind of provide a
better programming interface and allow
for more complicated querying of it and
that's what i'm going to be talking
about in my next video
so i hope this was useful guys and
welcome to all the new subscribers again
and i'll see you soon