Google SWE teaches systems design | EP21: Hadoop File System Design

Jordan has no life
25 Apr 2022 · 12:12

Summary

TL;DR: This video delves into the architecture of the Hadoop Distributed File System (HDFS), explaining its design for high throughput in both reading and writing large-scale data. It covers HDFS's distributed nature, chunk storage, metadata management by the Name Node, and the importance of replication for data availability. The script also touches on the evolution of HDFS to include high availability through coordination services, addressing the single point of failure issue. The video promises to connect these concepts to databases built on top of HDFS in a subsequent video.

Takeaways

  • 🌞 The video discusses the architecture of the Hadoop Distributed File System (HDFS), focusing on its design for high-throughput reads and writes.
  • 📚 HDFS is based on the Google File System and is popular because it runs on commodity hardware, making large-scale distributed storage broadly accessible.
  • 🔍 HDFS is designed for write-once, read-many-times files, storing data in chunks across multiple data nodes to improve parallelism for large files.
  • 🗃️ The Name Node is a critical component of HDFS: it stores all metadata about files and their chunks, keeping it in memory for quick access.
  • 🔄 HDFS uses a write-ahead log (the edit log) and an fsimage checkpoint file to persist metadata changes, so metadata survives a Name Node failure.
  • 🔁 HDFS employs rack-aware replication to enhance data availability and throughput, placing replicas in different racks to reduce the risk of simultaneous node failures.
  • 🔄 Replication is pipelined, and a write succeeds only once all replicas acknowledge it, aiming at strong consistency despite potential failures.
  • 📖 Reading from HDFS involves querying the Name Node for the location of data chunks and selecting the data node with the least network latency to the client.
  • 🖊️ Writing to HDFS, especially appending, involves selecting a primary replica and propagating data through the replication pipeline.
  • ⚠️ The Name Node is a single point of failure; High Availability (HA) HDFS addresses this with a quorum journal manager and ZooKeeper-based failover.
  • 🔑 The video concludes by highlighting HDFS's strengths for large-scale compute and data storage, and its role as a foundation for databases that provide richer querying capabilities.

Q & A

  • What is the primary purpose of the Hadoop Distributed File System (HDFS)?

    -The primary purpose of HDFS is to store large data sets reliably across clusters of commodity hardware, providing high throughput access to the data for distributed processing.

  • Why is HDFS designed to be written once and then read many times?

    -HDFS is designed this way to optimize for large-scale data processing workloads, where data is often written once and then processed or analyzed multiple times.

  • What is the typical size of the chunks in which HDFS stores files?

    -Typically, HDFS blocks (chunks) are 64 or 128 megabytes; 64 MB was the original default, and 128 MB is the default in newer Hadoop releases.
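
As a back-of-the-envelope illustration (hypothetical Java, not Hadoop source), splitting a file into fixed-size blocks looks like this; a 1 GB file at the 128 MB block size yields 8 blocks that can be read and written in parallel:

```java
import java.util.ArrayList;
import java.util.List;

public class BlockSplitter {
    static final long BLOCK_SIZE = 128L * 1024 * 1024; // 128 MB, configurable in real HDFS

    /** Returns (offset, length) pairs describing each block of a file. */
    static List<long[]> split(long fileSize) {
        List<long[]> blocks = new ArrayList<>();
        for (long offset = 0; offset < fileSize; offset += BLOCK_SIZE) {
            long len = Math.min(BLOCK_SIZE, fileSize - offset); // last block may be short
            blocks.add(new long[] { offset, len });
        }
        return blocks;
    }

    public static void main(String[] args) {
        // A 1 GB file yields 8 full 128 MB blocks.
        System.out.println(split(1024L * 1024 * 1024).size()); // prints 8
    }
}
```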

  • What is the role of the NameNode in HDFS?

    -The NameNode in HDFS is responsible for storing all the metadata regarding files, including the mapping of file blocks to the DataNodes where they are stored.

  • How does HDFS handle file system metadata changes?

    -HDFS handles file system metadata changes by using an edit log, which is a write-ahead log for the NameNode, and periodically checkpointing the in-memory state to an fsimage file on disk.
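
A minimal sketch of that edit-log-plus-checkpoint idea, with hypothetical types (the real NameNode persists the log and image to disk): every metadata mutation is appended to the log before the in-memory state changes, and recovery replays only the entries recorded after the last fsimage checkpoint:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NameNodeMetadata {
    private Map<String, String> namespace = new HashMap<>(); // path -> metadata, in memory
    private final List<String[]> editLog = new ArrayList<>(); // durable on disk in reality
    private Map<String, String> fsImage = new HashMap<>();    // last checkpoint snapshot
    private int checkpointedUpTo = 0;                          // log entries covered by fsImage

    void rename(String from, String to) {
        editLog.add(new String[] { "RENAME", from, to }); // write-ahead: log first...
        namespace.put(to, namespace.remove(from));        // ...then mutate memory
    }

    void checkpoint() {
        fsImage = new HashMap<>(namespace); // persist a full snapshot
        checkpointedUpTo = editLog.size();  // earlier log entries are now redundant
    }

    void recover() {
        namespace = new HashMap<>(fsImage); // load the last checkpoint
        for (int i = checkpointedUpTo; i < editLog.size(); i++) {
            String[] e = editLog.get(i);    // replay only post-checkpoint edits
            if (e[0].equals("RENAME")) namespace.put(e[2], namespace.remove(e[1]));
        }
    }
}
```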

  • What is the significance of the DataNode's block report in HDFS?

    -The block report from a DataNode informs the NameNode about the blocks it holds, allowing the NameNode to maintain an up-to-date map of file blocks and their locations.
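
A sketch of what the NameNode derives from block reports, using hypothetical types: an in-memory block-to-locations map, from which under-replicated blocks can be queued for re-replication:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class BlockManagerSketch {
    private final Map<String, Set<String>> blockLocations = new HashMap<>();
    private static final int TARGET_REPLICATION = 3; // user-configurable in real HDFS

    /** Called when a DataNode reports the block IDs it currently holds. */
    void processBlockReport(String dataNodeId, List<String> blockIds) {
        for (String blockId : blockIds) {
            blockLocations.computeIfAbsent(blockId, k -> new HashSet<>())
                          .add(dataNodeId);
        }
    }

    /** Blocks whose live replica count is below the target, to be re-replicated. */
    List<String> underReplicatedBlocks() {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Set<String>> e : blockLocations.entrySet()) {
            if (e.getValue().size() < TARGET_REPLICATION) result.add(e.getKey());
        }
        return result;
    }
}
```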

  • What does rack-aware replication mean in the context of HDFS?

    -Rack-aware replication in HDFS means that data chunks are replicated with the physical location of nodes in mind: with the default replication factor of three, one replica is placed on the writer's own rack and the other two together on a single remote rack, maximizing availability while minimizing cross-rack network traffic.
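
A sketch of that default placement policy (hypothetical types; it assumes at least one remote rack containing two or more nodes): one replica on the writer's rack, then two on a single randomly chosen remote rack so only one cross-rack transfer is needed:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Random;

public class RackAwarePlacement {
    /** racks: rack id -> nodes in that rack. Returns three target nodes. */
    static List<String> chooseTargets(Map<String, List<String>> racks,
                                      String writerRack, Random rnd) {
        List<String> targets = new ArrayList<>();
        // 1st replica: a node on the writer's own rack (low write latency).
        targets.add(pick(racks.get(writerRack), rnd));
        // Pick any rack other than the writer's.
        List<String> remoteRacks = new ArrayList<>(racks.keySet());
        remoteRacks.remove(writerRack);
        String remote = remoteRacks.get(rnd.nextInt(remoteRacks.size()));
        // 2nd and 3rd replicas: two distinct nodes on that single remote rack,
        // so only one cross-rack hop is paid during the replication pipeline.
        List<String> remoteNodes = new ArrayList<>(racks.get(remote));
        Collections.shuffle(remoteNodes, rnd);
        targets.add(remoteNodes.get(0));
        targets.add(remoteNodes.get(1));
        return targets;
    }

    private static String pick(List<String> nodes, Random rnd) {
        return nodes.get(rnd.nextInt(nodes.size()));
    }
}
```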

  • How does the replication process in HDFS ensure data consistency?

    -The replication process in HDFS ensures data consistency by using a pipeline approach where data is written to a primary replica and then propagated to secondary replicas. A write is only considered successful if all replicas in the pipeline acknowledge it.
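
A sketch of the pipeline with a hypothetical Replica interface: the client hands the data to the first replica, each replica forwards it downstream, and acknowledgements propagate back up the chain, so the client's write succeeds only if every replica acknowledged:

```java
import java.util.List;

interface Replica {
    boolean store(byte[] data); // true if the local write succeeded
}

public class ReplicationPipeline {
    /** Writes via the pipeline; success requires an ack from every replica. */
    static boolean write(List<Replica> pipeline, byte[] data) {
        return forward(pipeline, 0, data);
    }

    private static boolean forward(List<Replica> pipeline, int i, byte[] data) {
        if (i == pipeline.size()) return true;          // end of chain: all acked
        if (!pipeline.get(i).store(data)) return false; // local write failed
        // Forward downstream; the ack travels back up the chain. Note that a
        // failure here leaves earlier replicas holding the data even though
        // the client sees a failure -- which is why the client must retry.
        return forward(pipeline, i + 1, data);
    }
}
```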

  • What is the client's strategy when it encounters a write failure in HDFS?

    -When a client encounters a write failure in HDFS, it should keep retrying the write operation until it receives a success message.
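
A sketch of that client-side retry loop; the exponential backoff policy here is an illustrative assumption, not HDFS's exact behavior:

```java
public class RetryingClient {
    interface WriteOp { boolean attempt(); } // returns true on success

    static boolean writeWithRetry(WriteOp op, int maxRetries)
            throws InterruptedException {
        long backoffMs = 100;
        for (int i = 0; i <= maxRetries; i++) {
            if (op.attempt()) return true;  // all replicas acknowledged
            Thread.sleep(backoffMs);        // wait before retrying
            backoffMs = Math.min(backoffMs * 2, 10_000); // cap the backoff
        }
        return false; // surface the failure to the application
    }
}
```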

  • What is the main issue with the original design of HDFS in terms of fault tolerance?

    -The main issue with the original design of HDFS in terms of fault tolerance is the single point of failure represented by the NameNode. If the NameNode goes down, the entire system crashes.

  • How does High Availability (HA) in HDFS address the single NameNode issue?

    -High Availability in HDFS addresses the single NameNode issue by using a backup NameNode that stays updated with the state of the primary NameNode through a replicated edit log, allowing for a failover to the backup NameNode in case the primary one goes down.
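
A sketch of the failover mechanics with hypothetical interfaces (the real implementation uses the quorum journal manager plus a ZooKeeper-based failover controller): both NameNodes tail the replicated edit log, and the standby promotes itself by grabbing the distributed lock once the active node stops holding it:

```java
interface DistributedLock {
    boolean tryAcquire(); // held until the owner releases it or dies
}

interface ReplicatedEditLog {
    String[] entriesSince(long index); // e.g. backed by the quorum journal
}

public class HaNameNode {
    private final DistributedLock lock;
    private final ReplicatedEditLog journal;
    private long applied = 0;
    private boolean active = false;

    HaNameNode(DistributedLock lock, ReplicatedEditLog journal) {
        this.lock = lock;
        this.journal = journal;
    }

    /** Run periodically on both NameNodes. */
    void tick() {
        // The standby keeps its in-memory state current by replaying the log.
        for (String entry : journal.entriesSince(applied)) {
            apply(entry);
            applied++;
        }
        // If the active NameNode dies, its lock is released, and the
        // standby promotes itself to active by acquiring the lock.
        if (!active && lock.tryAcquire()) active = true;
    }

    private void apply(String editLogEntry) { /* update namespace state */ }
}
```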

Outlines

00:00

📚 Introduction to HDFS Architecture

The speaker begins by discussing their personal situation before diving into the main topic: the architecture of the Hadoop Distributed File System (HDFS). They explain HDFS's significance in large-scale distributed systems and its origins from the Google File System. The speaker highlights HDFS's popularity due to its ability to run on standard desktop computers and its use in batch processing with tools like MapReduce and Spark. They also mention the importance of HDFS as a foundational building block for databases, allowing for efficient data interaction and computation. The paragraph concludes with an overview of HDFS's design, focusing on its write-once-read-many (WORM) approach and the use of data chunks for improved parallelism in reading and writing large files.

05:01

🔑 The Crucial Role of the Name Node in HDFS

This section delves into the critical component of HDFS: the Name Node. The Name Node is responsible for storing all metadata about files, including names, subdirectories, and the locations of data chunks across various data nodes. The speaker describes the Name Node's operation, which involves maintaining an in-memory representation of the file system's state and using an edit log as a write-ahead log for changes. They also explain the process of checkpointing the file system state to disk and the Name Node's recovery procedure in case of a crash. The paragraph further discusses the Name Node's role in handling replication, ensuring data availability, and managing the replication factor through block reports from data nodes.

10:02

🔄 HDFS Replication and High Availability

The speaker addresses the replication process in HDFS, emphasizing its rack-aware design to maximize availability and throughput. They explain the default replication factor of three and how HDFS places replicas across different racks to minimize the risk of data loss due to rack failures. The paragraph also covers the pipelining mechanism for data replication, ensuring that all replicas acknowledge a write operation before the client considers it successful. The speaker then discusses the read process in HDFS, where the client selects the nearest data node for minimal latency. They also describe the process for appending to a file, including the selection of a primary replica and the replication pipeline. The paragraph concludes with a brief mention of issues in Hadoop, particularly the single point of failure with the Name Node, and introduces the concept of high availability in HDFS through coordination services like ZooKeeper.
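
To make the append path concrete, here is a sketch of primary-replica selection with hypothetical types: reuse the current primary if its lease is still valid; otherwise promote a replica holding the newest version of the chunk (if no replica is up to date, that is data loss):

```java
import java.util.List;

public class PrimarySelection {
    static class ReplicaInfo {
        String node; int version; long leaseExpiresAt; boolean isPrimary;
        ReplicaInfo(String n, int v) { node = n; version = v; }
    }

    static ReplicaInfo choosePrimary(List<ReplicaInfo> replicas, long now) {
        // Reuse a primary whose lease has not expired.
        for (ReplicaInfo r : replicas)
            if (r.isPrimary && r.leaseExpiresAt > now) return r;
        if (replicas.isEmpty())
            throw new IllegalStateException("no replicas: data loss");
        // Otherwise find the newest version any replica holds...
        int newest = 0;
        for (ReplicaInfo r : replicas) newest = Math.max(newest, r.version);
        // ...and promote one of the replicas that has it, with a fresh lease.
        for (ReplicaInfo r : replicas) {
            if (r.version == newest) {
                r.isPrimary = true;
                r.leaseExpiresAt = now + 3_600_000; // e.g. a one-hour lease
                return r;
            }
        }
        throw new IllegalStateException("unreachable");
    }
}
```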

πŸ›‘οΈ Enhancing HDFS with High Availability and Coordination Services

In this final paragraph, the speaker discusses the evolution of HDFS to include high availability features, overcoming the single Name Node limitation. They explain the use of a quorum journal manager, which replicates the edit log across multiple nodes, allowing a backup Name Node to stay synchronized with the primary. If the primary Name Node fails, the backup can take over by acquiring a distributed lock, ensuring continuous operation. The speaker concludes by summarizing HDFS's strengths: high read and write throughput, a rack-aware replication schema, and improved fault tolerance through coordination services. They also note HDFS's trade-offs: it aims for strong consistency, but applications may still need to handle occasional data inconsistencies. The paragraph ends with a teaser for the next video, which will explore databases built on top of HDFS.

Keywords

💡 HDFS

HDFS stands for Hadoop Distributed File System, a critical component of large-scale distributed systems designed to store and manage large volumes of data across clusters of commodity servers. In the video, HDFS is the central theme, with discussions on its architecture, benefits, and how it achieves high throughput for both reading and writing operations. The script mentions HDFS's ability to run on normal desktop computers, making it popular for batch processing and as a building block for databases.
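
From an application's point of view, all of this hides behind a small client API. A minimal example against the real org.apache.hadoop.fs interface (the cluster address and paths are placeholders): write a file once, then read it back, with HDFS handling chunking, replication, and replica selection underneath:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000"); // placeholder address

        try (FileSystem fs = FileSystem.get(conf)) {
            Path path = new Path("/data/example.txt");

            // Write once: the client streams data; HDFS splits it into
            // blocks and replicates each block through the pipeline.
            try (FSDataOutputStream out = fs.create(path)) {
                out.writeUTF("hello hdfs");
            }

            // Read many times: the NameNode supplies block locations and
            // the client reads from a nearby DataNode.
            try (FSDataInputStream in = fs.open(path)) {
                System.out.println(in.readUTF());
            }
        }
    }
}
```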

💡 High Throughput

High throughput in the context of the video refers to HDFS's capability to handle a large volume of read and write operations efficiently. The script explains that HDFS achieves this by storing data in chunks across different data nodes, which improves parallelism and thus throughput. The term describes the performance benefits of HDFS when handling large files, often gigabytes or terabytes in size.

💡 Name Node

The Name Node is a key component of HDFS, responsible for storing all metadata regarding files in the system. The video script describes it as the most important part of HDFS, which keeps track of file names, data nodes where chunks are located, and version numbers. The Name Node's role is crucial for maintaining the system's integrity and availability, as it manages the file system namespace and regulates client access to file data.

💡 Data Node

Data Nodes are the workers in an HDFS cluster that store the actual data. The script explains that data is stored in chunks on these nodes, typically around 64 to 128 megabytes in size. Data Nodes report back to the Name Node with block reports, informing it of the chunks they hold, which is essential for the Name Node to maintain an updated state of the system.

💡 Replication

Replication in HDFS is the process of making multiple copies of data chunks to ensure data availability and fault tolerance. The script mentions that HDFS uses a rack-aware replication strategy, placing one replica in the same rack as the writer and two on a remote random rack to minimize network bandwidth usage while maximizing availability. Replication is a fundamental concept in HDFS, critical for maintaining data integrity.

💡 Rack Awareness

Rack awareness is a feature of HDFS replication that ensures data is distributed across different racks or data centers to reduce the risk of data loss due to rack failures. The video script explains that this feature is important for maximizing availability and throughput by placing replicas in such a way that they are less likely to be affected by the same failure points.

💡 Edit Log

The Edit Log in HDFS is a write-ahead log that records all changes to the file system metadata. The video script describes how the Name Node uses the Edit Log to keep track of all file metadata changes, such as renaming files or creating directories. It is crucial for the Name Node's recovery process, as it allows the system to reconstruct its state after a failure.

💡 FS Image File

The FS Image File is a checkpoint of the file system's metadata state, stored on disk. The video script explains that the Name Node periodically checkpoints its in-memory state to the FS Image File. In the event of a Name Node failure, the FS Image File, combined with the Edit Log, is used to restore the system's state.

💡 Pipelining

Pipelining in the context of HDFS replication refers to the process where data is sent through a series of replicas in a specific order. The video script describes how this technique is used to ensure that all replicas acknowledge a write operation before the client considers it successful. This method is important for maintaining strong consistency in the system.

💡 High Availability

High Availability in HDFS refers to the system's ability to remain operational and accessible even when components fail. The video script discusses how the original HDFS design was not very fault-tolerant but has since been improved with High Availability features, such as a standby Name Node kept in sync via a replicated edit log and coordination services like ZooKeeper, to ensure continuous service.

💡 HBase

HBase is a distributed, scalable, big data store based on Hadoop. Although not extensively covered in the script, the video mentions HBase as a segue from HDFS, suggesting that it is used for building databases on top of HDFS to provide a better programming interface and enable more complex querying of data stored in HDFS.

Highlights

Introduction to Hadoop Distributed File System (HDFS) architecture and its role in high-throughput reads and writes.

HDFS is based on the Google File System and is popular for running on standard desktop computers.

HDFS is designed for batch processing with tools like MapReduce, Spark, and Tez.

Overview of HDFS's storage method using data chunks across multiple data nodes to improve parallelism.

The importance of the Name Node in storing metadata and managing file system changes through the edit log.

Explanation of the Name Node's boot process, including entering safe mode and receiving block reports from data nodes.

Rack-aware replication in HDFS to maximize availability and throughput, reducing latency and risk of data loss.

How pipelining works in HDFS for efficient data replication across nodes.

Client-side process for reading files in HDFS, including querying the Name Node and choosing the optimal data node for minimal latency.

The complexity of appending to files in HDFS, involving selecting a primary replica and managing the replication pipeline.

Visualization of the write process in HDFS, demonstrating the interaction between replicas and the Name Node.

Challenges with the single Name Node design and the risks associated with Name Node failure.

Introduction to High Availability HDFS and the use of coordination services for Name Node failover.

The role of the Quorum Journal Manager in maintaining a replicated log for Name Node state synchronization.

HDFS's strengths in providing high read and write throughput and its evolution towards fault tolerance.

HDFS's limitations, including potential data inconsistencies and the need for application-level handling.

Upcoming discussion on databases built on top of HDFS for enhanced data interaction and complex querying.

Transcripts

00:00

All right, I'm back again, this time in the morning because my roommate's not here. If you guys can tell, my voice is a little messed up; I guess I was playing with the boys a little too late last night, and for some reason my knees are a little scratched up too. I don't really get that one, but who knows. So anyways, today we're going to talk about the architecture of HDFS, figure out why it works the way it does and how it's able to achieve high throughput both on reads and writes, and then that'll allow us to segue hopefully pretty easily into HBase and see the good reasons to use something like that.

00:36

Okay, so HDFS and its design. Just to give a background, I've mentioned distributed file systems in the past, but basically they're a really important component of a ton of large-scale distributed systems. Even though HDFS is probably the most popular one, it itself is based off the Google File System, which is a paper that came out well over a decade ago now. And because HDFS can be run on just normal desktop computers, it's really popular; obviously there's a big wave of being able to just spin up instances of things using EC2 clusters, or Amazon Web Services in general. So even though HDFS is really useful for things like batch processing (we've talked about this with MapReduce, Spark, and Tez), HDFS is really, really good as a database building block, so that you can provide an extra layer to interact with the data on HDFS and then ultimately run a ton of computations on it.

01:35

Okay, so just to give an overview, as you can see on the right, you're gonna see a ton of terms that you don't know yet, but by the end of this video you will. Generally speaking, Hadoop is designed so that you basically write a file once, and from then on you can append to or truncate it, but generally speaking you just read it many times over. The way this is done is by storing files in chunks across a bunch of different data nodes, typically around 64-128 megabyte chunks, and the reason you do that is to improve the parallelism of both reading and writing big files. Oftentimes these files are gigabytes or maybe even terabytes in size, and as a result having to write them all sequentially would be terrible. And then finally, in order to ensure availability and no data loss, chunks are obviously going to be replicated.

02:21

Okay, so the first component of HDFS that we have to talk about is probably the most important one, and that's called the name node. The name node, generally speaking, is where all of the metadata regarding files is stored. Not only does it hold the names of the files, and perhaps their subdirectories if it's a directory, but more importantly it has to keep track of all the chunks: basically all of the data nodes where those chunks are located, as well as their corresponding version numbers. Like I said, you can append to or truncate files, and doing so would increment the version number.

02:54

Okay, so how does it do this? Well, it keeps all of that metadata in memory. All of the changes to file system metadata go to something called the edit log. The edit log is effectively just a write-ahead log for the name node, because obviously if we had to go ahead and change disk state every single time (some persistent state of the entire file system), those writes would not be sequential and they would take longer. So what we do is put them in a write-ahead log, change the state in memory, and then occasionally checkpoint that state to disk in something called an fsimage file. And then if the name node ever crashes and has to reboot, the fsimage checkpoint file, in conjunction with whatever edit log writes come after that checkpoint, can be combined to create a new state for the name node.

03:48

Okay, continuing to talk about the name node: it actually keeps the location of all the chunks only in memory. So when it first boots, the name node goes into safe mode, and it's going to receive something called a block report from each data node, where the data node tells it which chunks are held on that data node. The name node is then going to compile all of this information, construct that local state, and say, here's where the chunks are located. And say it now sees that only one replica is holding a given chunk, and the user has specified a replication factor of three for that chunk; it's going to say, okay, we don't have enough replicas for this particular chunk, so let's go ahead and replicate it to two other nodes, and that way we can reach the replication threshold. The same thing will occur if the name node assumes that a given data node is dead because it hasn't received any heartbeats from it for a while.

04:45

Okay, now let's talk about replication. Replication in Hadoop is something called rack-aware, and this is really important because it allows for both maximizing availability and throughput. Chunks are going to be replicated in a way that not only reduces latency for clients but also reduces the possibility of all the replica nodes going down, because they're put in a different rack or data center. So for example, for the default replication factor of three, Hadoop is going to put one replica in the same rack as the writer and then two replicas on the same remote random rack. The reason they're put on the same random rack is just to minimize network bandwidth: you don't have to go to two different racks. And since we have synchronous replication here, where we wait for all of these writes to complete, it's actually pretty important that all of these replicas complete their write as fast as possible. It's not eventually consistent, so I'll touch upon that in a second.

05:40

So how does replication actually work? Well, there's something called pipelining. Basically, the replicas are arranged into some order, and the data is pipelined from one replica to the next on a write, an append, or a truncate. Writes are only considered successful from the client's point of view if all of the replicas in this pipeline actually acknowledge them. So even though in theory this should lead to strong consistency, the issue is: say the first replica receives a write. It's going to go ahead and commit that to itself, so it performs the write, and then the second and third replicas don't ever actually acknowledge the write, meaning they didn't perform it themselves. Well, the client is going to receive a failure for its write; however, the write is still in one of the replicas. So generally speaking, when a client receives a failure on a write, it needs to just keep retrying until it receives a success message. So as you can see, the first replica sends that write to the second one, which sends the write to the third one, which then sends the acknowledgement back and back again, and once this whole process is complete the client sees its write as successful.

06:49

Okay, in terms of reading in Hadoop, basically all that happens is the client queries the master (when I say the master in this case I mean the name node) to get a list of data nodes carrying the chunk that it wants. It's going to figure out which data node is closest to it, because like I said, Hadoop is aware of the rack that the nodes are in, and as a result it can say, for a given client, which one is probably going to have minimal network latency when communicating with it. So you choose the best data node to read from, you cache this result on the client in case you want to read that file again (because like I said, write once, read many times), and then the client just goes ahead and performs that read.

07:27

Okay, in terms of doing writes, this is a little bit more complex, but I'll have a visualization after I walk through this process. So to append to a file: go ahead and reach out to the name node, see the data nodes where the chunk is located, and then you have to pick something called a primary replica; this is going to be the first replica in that replication pipeline. If there's already a primary and the lease for the primary is still valid (the lease basically says how long until there's no longer a primary), perform the write to the primary replica and let it go through the chain of replication. Otherwise we need to pick a primary replica. How can we do this? Well, we look at the data nodes that the chunk is located on and pick one of them with the most up-to-date version of that chunk. If one doesn't exist, we have a data loss problem, and hopefully this never comes up. Once the primary replica is determined, all the other replicas are considered secondary; we establish that path for the replication, and then the client makes the write to the primary replica and hopes to get a success result.

08:30

Okay, so to actually visualize this, let's say I'm trying to write jordan's nudes.png, and we've got three replicas: replica one, two, and three. The first thing we're going to do is ask the name node which nodes are holding its chunks and try to find out who the leader was. So I find out that the leader was replica 3, because as you can see it has version 23 of the file there, but its lease has since expired. So what are we going to do? We're going to randomly pick a replica with an up-to-date version number as the leader. So now the leader is going to be R1, and let's say we have a lease that expires in an hour for it; because replica 1 also has version 23, it could have been one or three here. Now all that's going to happen is we're going to go ahead and contact replica one, which is the leftmost one on the bottom there, send the write through it, and let it propagate through the pipeline.

09:20

Okay, so what are some issues with Hadoop? Well, if you've been paying attention so far, you may have noticed that I've only mentioned one name node, which is obviously a problem. What happens if the name node goes down? Well, everything crashes. So in the original Hadoop implementation there was kind of a hacky way of solving things called a secondary name node, which was basically just a standby name node that went ahead and tried to take in all those changes. But there's actually a better way of solving this, and it uses coordination services like I talked about in the last video; this is known as high availability HDFS.

09:58

What do we actually do? Well, keep in mind that the main persistence point of the name node, the thing you can use to derive its state, is the edit log. The edit log is basically just going to keep track of all the file metadata changes, such as renaming files, creating a new directory, anything along those lines. And so instead of just keeping all of those changes local to the name node, what we're going to do is use something like a few ZooKeeper nodes to create a replicated log which represents the edit log. This, in Hadoop terms, is known as the quorum journal manager. So anyways, after we have this replicated log, we can now have a second instance of a name node (we'll just call it the backup for now), and all it's going to do is read that replicated log and keep its state up to date in the same way the name node does. So by using this coordination service, we're actually able to keep a secondary name node relatively up to date. So in the event that the first one goes down, and say the coordination service also has a distributed lock, the first one will no longer be holding that distributed lock, and then the second one can grab the distributed lock to basically say: I'm the leader now, I am going to be the name node.

11:12

Okay, so in conclusion: HDFS can provide really high read and write throughput by using a rack-aware replication schema, for reading as well. This is super useful, and in addition to that, while the original HDFS design wasn't very fault tolerant, the fact that they've now added a coordination service to it is great for that leader failover of the name node and allows high availability. Obviously HDFS isn't perfect; like I said, it kind of aims for strong consistency, but in reality you might have data inconsistencies and have to handle this in your application code. But on the whole, HDFS is really good for storing data in conjunction with large-scale compute. We're going to see that plenty of databases are built on top of HDFS in order to provide a better programming interface and allow for more complicated querying of it, and that's what I'm going to be talking about in my next video. So I hope this was useful, guys, and welcome to all the new subscribers again, and I'll see you soon.


Related Tags
HDFS, Distributed Systems, Data Storage, Big Data, High Availability, Replication, Name Node, Data Nodes, Rack Awareness, Hadoop Ecosystem, Large-Scale Computing