Introduction to Hadoop

Kavitha S
4 Aug 2024 · 15:02

Summary

TL;DR: This video, presented by Mrs. Kavitha S., an assistant professor at MIT Academy of Engineering, offers a comprehensive overview of Hadoop architecture. It explains Hadoop's role as an open-source framework for distributed processing of large data sets using clusters of computers. Key components like MapReduce for computation, HDFS for storage, and YARN for resource management are discussed. The video also covers the map and reduce steps, a word count example, and the workings of Hadoop's distributed file system, highlighting its fault tolerance and efficiency in handling large-scale data.

Takeaways

  • 😀 Hadoop is an open-source framework by Apache, written in Java, allowing distributed processing of large datasets across clusters of computers.
  • 💻 Hadoop's architecture includes four main modules: MapReduce, HDFS (Hadoop Distributed File System), YARN (Yet Another Resource Negotiator), and Hadoop Common.
  • 📊 Hadoop MapReduce processes data in parallel by breaking tasks into smaller parts (Map phase) and then aggregating them (Reduce phase).
  • 🗂️ HDFS is a distributed file system that provides high-throughput access to application data, ensuring fault tolerance by replicating data across different nodes.
  • 🔑 In MapReduce, data is processed using key-value pairs, which can be simple or complex, like a filename as the key and its contents as the value (see the sketch after this list).
  • 📈 A common example problem solved by MapReduce is word count, where the input is split into words, grouped, and counted across large datasets.
  • 🔄 The Hadoop framework ensures data is distributed across nodes and uses replication to handle hardware failures, ensuring data integrity and high availability.
  • 🖥️ HDFS works with a master node (NameNode) and multiple worker nodes (DataNodes) to manage and store data efficiently, with the NameNode as a central point of access.
  • 🚨 The NameNode is a single point of failure in the system, but high availability features allow for failover with an active and standby NameNode setup.
  • 🗃️ YARN separates resource management from job scheduling, replacing the JobTracker and TaskTracker components of older Hadoop versions.
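
To make the key-value idea concrete, here is a minimal sketch of Hadoop's writable key and value types (assuming hadoop-common on the classpath; the file name and contents are made-up examples):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;

    public class KeyValueDemo {
        public static void main(String[] args) {
            // A simple (word, count) pair, as produced in the word count example
            Text word = new Text("apple");
            IntWritable count = new IntWritable(2);
            System.out.println(word + " -> " + count.get());

            // Pairs can also be complex: a file name as the key and the file's
            // entire contents as the value, both carried as Text
            Text fileName = new Text("input/sample.txt");  // hypothetical path
            Text contents = new Text("this is an apple apple is red in color");
            System.out.println(fileName + " holds " + contents.getLength() + " bytes");
        }
    }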

Q & A

  • What is Hadoop?

    -Hadoop is an open-source framework by Apache, written in Java, that allows for distributed processing of large data sets across clusters of computers using simple programming models.

  • How does Hadoop scale across machines?

    -Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage, making it ideal for handling large amounts of data.

  • What is the purpose of the MapReduce algorithm in Hadoop?

    -MapReduce is an algorithm used in Hadoop for parallel processing. It breaks down tasks into smaller sub-tasks (Map step) and processes them in parallel before combining the results (Reduce step) to produce the final output.

  • What are the four main modules of Hadoop architecture?

    -The four main modules are: 1. Hadoop Common – Java libraries and utilities, 2. HDFS – Hadoop Distributed File System, 3. YARN – resource management, and 4. MapReduce – a system for parallel processing.

  • What is the role of Hadoop Distributed File System (HDFS)?

    -HDFS is a distributed file system in Hadoop that provides high-throughput access to application data. It stores large data sets reliably across many nodes and is fault-tolerant.

  • Can you explain how MapReduce works with an example?

    -In MapReduce, the 'Map' step processes data to extract useful information, and the 'Reduce' step aggregates the results. For example, in a word count problem, 'Map' emits each word with a count of one, and 'Reduce' sums those counts to give the total occurrences of every word.
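
    The classic word count job, written against the org.apache.hadoop.mapreduce Java API, looks roughly like the sketch below (based on the standard Apache example; input and output paths are taken from the command line):

        import java.io.IOException;
        import java.util.StringTokenizer;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.Mapper;
        import org.apache.hadoop.mapreduce.Reducer;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

        public class WordCount {

            // Map step: emit (word, 1) for every word in this input split
            public static class TokenizerMapper
                    extends Mapper<Object, Text, Text, IntWritable> {
                private static final IntWritable ONE = new IntWritable(1);
                private final Text word = new Text();

                public void map(Object key, Text value, Context context)
                        throws IOException, InterruptedException {
                    StringTokenizer itr = new StringTokenizer(value.toString());
                    while (itr.hasMoreTokens()) {
                        word.set(itr.nextToken());
                        context.write(word, ONE);
                    }
                }
            }

            // Reduce step: the framework has already grouped values by key,
            // so summing them gives the total count per word
            public static class IntSumReducer
                    extends Reducer<Text, IntWritable, Text, IntWritable> {
                private final IntWritable result = new IntWritable();

                public void reduce(Text key, Iterable<IntWritable> values,
                                   Context context)
                        throws IOException, InterruptedException {
                    int sum = 0;
                    for (IntWritable val : values) {
                        sum += val.get();
                    }
                    result.set(sum);
                    context.write(key, result);
                }
            }

            public static void main(String[] args) throws Exception {
                Job job = Job.getInstance(new Configuration(), "word count");
                job.setJarByClass(WordCount.class);
                job.setMapperClass(TokenizerMapper.class);
                job.setCombinerClass(IntSumReducer.class); // optional local pre-aggregation
                job.setReducerClass(IntSumReducer.class);
                job.setOutputKeyClass(Text.class);
                job.setOutputValueClass(IntWritable.class);
                FileInputFormat.addInputPath(job, new Path(args[0]));
                FileOutputFormat.setOutputPath(job, new Path(args[1]));
                System.exit(job.waitForCompletion(true) ? 0 : 1);
            }
        }

    Packaged into a jar, a job like this is typically submitted with a command along the lines of hadoop jar wordcount.jar WordCount /input /output, after which YARN schedules the map and reduce tasks across the cluster.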

  • What is YARN, and what role does it play in Hadoop?

    -YARN stands for Yet Another Resource Negotiator. It separates resource management from job scheduling and monitoring, helping Hadoop efficiently allocate resources across the cluster.

  • How does Hadoop handle failures and ensure data reliability?

    -Hadoop handles failures by replicating data blocks, typically across three nodes, so that in case of hardware failure, the system can continue operating with minimal disruption.
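
    For illustration, here is a small sketch of how a client can inspect and change a file's replication factor through the HDFS Java FileSystem API (assuming a reachable cluster; the file path is hypothetical):

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileStatus;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class ReplicationDemo {
            public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();
                // dfs.replication commonly defaults to 3 on a cluster
                System.out.println("default replication = " + conf.get("dfs.replication", "3"));

                FileSystem fs = FileSystem.get(conf);
                Path file = new Path("/data/sample.txt");  // hypothetical file
                FileStatus status = fs.getFileStatus(file);
                System.out.println("current replication = " + status.getReplication());

                // Ask the NameNode to bring this file's blocks to three copies
                fs.setReplication(file, (short) 3);
            }
        }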

  • What are the roles of the NameNode and DataNode in HDFS?

    -The NameNode acts as the master server, managing the file system's namespace and regulating access to files. DataNodes store the actual data and handle read/write operations based on client requests, under instructions from the NameNode.
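
    The division of labour is visible in the client read path; the following sketch (hypothetical file path) shows that block locations come from the NameNode while the bytes themselves are streamed from DataNodes:

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.BlockLocation;
        import org.apache.hadoop.fs.FSDataInputStream;
        import org.apache.hadoop.fs.FileStatus;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class ReadDemo {
            public static void main(String[] args) throws Exception {
                FileSystem fs = FileSystem.get(new Configuration());
                Path file = new Path("/data/sample.txt");  // hypothetical file

                // Metadata lookup answered by the NameNode:
                // which DataNodes hold each block of the file?
                FileStatus status = fs.getFileStatus(file);
                for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
                    System.out.println("block at offset " + loc.getOffset()
                            + " on hosts " + String.join(",", loc.getHosts()));
                }

                // The actual bytes are then read directly from DataNodes
                try (FSDataInputStream in = fs.open(file)) {
                    byte[] buf = new byte[4096];
                    int n = in.read(buf);
                    System.out.println("read " + n + " bytes");
                }
            }
        }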

  • What are some real-world use cases of Hadoop?

    -Some real-world use cases include LinkedIn's processing of daily logs and user activities, and Yahoo's deployment for search index creation, web page content optimization, and spam filtering.

Outlines

00:00

📘 Introduction to Hadoop and its Architecture

The speaker, Mrs. Kavitha S., introduces the topic of Hadoop architecture. Hadoop is an open-source framework developed by Apache, written in Java, designed for the distributed processing of large data sets across computer clusters. Hadoop allows scaling from a single server to thousands of machines. It uses the MapReduce algorithm for parallel data processing. The architecture comprises four main components: MapReduce for computation, HDFS for storage, YARN for resource management, and Hadoop Common utilities.

05:00

🛠 Components of Hadoop Architecture

This section elaborates on the four modules of Hadoop. Hadoop Common contains Java libraries and utilities for the system. YARN handles job scheduling and resource management. HDFS is a high-throughput, distributed file system. MapReduce breaks down large tasks into smaller ones, processing them in parallel. A sample problem—word counting in a large text file—is introduced to demonstrate MapReduce’s efficiency. The steps involve mapping words, sorting, and reducing the data to obtain the final word count.

10:01

🔄 Word Count Problem with MapReduce

The word count example is explored in depth. After breaking sentences into words, MapReduce maps each word to a count, groups similar words, and then shuffles and reduces the data. This process allows counting word occurrences across sentences. The advantages of MapReduce include distributing workloads across multiple machines, running tasks in parallel, managing errors, and optimizing performance even during partial system failures.
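
Traced on the video's two sentences, the stages produce the following pairs:

    Input:   "this is an apple"  |  "apple is red in color"
    Map:     (this,1) (is,1) (an,1) (apple,1)  |  (apple,1) (is,1) (red,1) (in,1) (color,1)
    Shuffle: this->[1]  is->[1,1]  an->[1]  apple->[1,1]  red->[1]  in->[1]  color->[1]
    Reduce:  (this,1) (is,2) (an,1) (apple,2) (red,1) (in,1) (color,1)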

🗃 Hadoop Distributed File System (HDFS)

HDFS is introduced as a fault-tolerant, distributed file system designed to run on low-cost hardware. It ensures high throughput access to data, making it ideal for large-scale applications. Files are divided into blocks and replicated across cluster nodes. HDFS ensures data reliability by replicating blocks to handle hardware failures. The process involves splitting files, distributing blocks, and managing replication to ensure data availability and fault tolerance.
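
As a worked example, a 1 GB file stored with the default 128 MB block size is split into eight blocks; at the usual replication factor of three, the cluster keeps 24 block copies (about 3 GB of raw storage), so the loss of any single DataNode still leaves every block available on at least two other nodes.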

🖥 Master-Slave Architecture in HDFS

This section explains the role of NameNode (master) and DataNodes (slaves) in HDFS. The NameNode manages file system operations such as renaming, opening, or closing files. DataNodes handle storage tasks, such as reading and writing files. The HDFS system is highly reliant on the NameNode, making it a potential point of failure. Solutions like NameNode High Availability and Secondary NameNode are described to enhance system resilience.

🎯 YARN: Resource Management in Hadoop

YARN (Yet Another Resource Negotiator) is introduced as the component responsible for separating resource management from job scheduling and monitoring. YARN replaces the traditional JobTracker and TaskTracker system, improving cluster efficiency. The section also discusses how jobs are scheduled and managed in a Hadoop cluster, including LinkedIn and Yahoo’s use cases for Hadoop in tasks like analyzing user activity, optimizing web content, and managing ad placements.

Keywords

💡Hadoop

Hadoop is an Apache open-source framework written in Java, designed to handle large data sets distributed across clusters of computers. It allows for scalable storage and parallel processing, making it essential for managing massive data volumes. In the video, Hadoop is introduced as the central technology, enabling applications to perform complex analyses on vast data.

💡MapReduce

MapReduce is a core component of the Hadoop framework used for parallel processing of large data sets. It consists of two main steps: the 'Map' step, which processes data into key-value pairs, and the 'Reduce' step, which consolidates the results. The video explains MapReduce with an example of counting words in a document, showing how it breaks down tasks for efficient execution.

💡HDFS (Hadoop Distributed File System)

HDFS is a distributed file system that allows Hadoop to store data across multiple machines while ensuring fault tolerance and high throughput. It divides files into uniform-sized blocks, which are distributed across various nodes. The video highlights HDFS as crucial for handling large-scale data storage and managing data replication to prevent loss due to hardware failure.

💡YARN (Yet Another Resource Negotiator)

YARN is Hadoop’s cluster resource management framework, responsible for job scheduling and managing system resources. It separates resource management from job scheduling, making Hadoop more efficient. In the video, YARN is shown as the replacement for the older Job Tracker and Task Tracker components, improving resource utilization in the cluster.

💡Cluster

A cluster in Hadoop refers to a group of interconnected computers that work together to perform tasks. Each computer in the cluster can store data and run processing tasks simultaneously, allowing for the distribution of large-scale computations. The video emphasizes Hadoop’s ability to scale from a single server to thousands of machines, illustrating how clusters manage both computation and storage.

💡Data Node

Data Nodes are part of the HDFS architecture in Hadoop, where they handle the storage of data. Each node stores file blocks and manages read/write requests from clients. The video explains that Data Nodes perform tasks such as block creation, deletion, and replication, and communicate with the Name Node to manage data effectively.

💡Name Node

The Name Node is the master server in HDFS responsible for managing the file system’s namespace and regulating client access to files. It tracks the locations of file blocks stored across Data Nodes. In the video, the Name Node is described as a critical component, though its single point of failure is noted as a potential drawback in certain configurations.

💡Fault Tolerance

Fault tolerance refers to Hadoop's ability to handle failures without losing data or disrupting processing. This is achieved through data replication across multiple nodes. In the video, the importance of HDFS’s fault-tolerant design is highlighted, where each data block is replicated to three nodes to ensure reliability, even in the event of hardware failure.

💡Word Count Problem

The Word Count problem is a classic example used to illustrate how MapReduce works in Hadoop. It involves counting the frequency of each distinct word in a text document. The video uses this problem to demonstrate how data is split, mapped, shuffled, and reduced to produce a final count of word occurrences.

💡Job Tracker

The Job Tracker is a component in older versions of Hadoop (pre-YARN) responsible for resource management and job scheduling across the cluster. It works with Task Trackers to monitor task progress. In the video, the Job Tracker’s role in managing task life cycles and ensuring efficient resource use is explained, though it is eventually replaced by YARN for better scalability.

Highlights

Hadoop is an open-source framework written in Java that allows distributed processing of large datasets across clusters of computers.

Hadoop is designed to scale from a single server to thousands of machines, each offering local computation and storage.

Hadoop applications use the MapReduce algorithm, in which data is processed in parallel across the nodes of the cluster.

The four core modules of Hadoop are MapReduce, HDFS (Hadoop Distributed File System), YARN, and common utilities.

MapReduce allows breaking a large task into smaller tasks, running them in parallel, and consolidating the outputs into the final result.

MapReduce uses key-value pairs for input and output, which can be used for complex problems like word counting in large text documents.

Hadoop Distributed File System (HDFS) provides high-throughput access to application data and is suitable for large data sets.

HDFS splits files into uniform-sized blocks (typically 128 MB) and replicates them across nodes for fault tolerance.

The NameNode in HDFS manages the file system namespace, while DataNodes manage the actual data storage.

YARN (Yet Another Resource Negotiator) separates resource management from job scheduling and monitoring.

YARN replaces the older Hadoop JobTracker and TaskTracker with improved functionality for managing cluster resources.

HDFS is designed to be fault-tolerant, scalable, and highly efficient for processing large amounts of data.

Companies like LinkedIn and Yahoo use Hadoop to process transaction logs, analyze user activity, and optimize services like ad placement and spam filters.

Hadoop's high availability feature allows for the use of two NameNodes (active and standby) to ensure continuous operation.

Hadoop's MapReduce framework excels at distributing workloads across clusters of computers to handle massive datasets efficiently.

Transcripts

play00:01

hello everyone this video is about

play00:04

Hadoop architecture I am Mrs Kavitha S

play00:07

working as assistant professor in the

play00:09

department of Computer Engineering of

play00:11

MIT Academy of

play00:13

engineering Hadoop what is Hadoop Hadoop

play00:17

is an Apache open source framework

play00:19

written in Java that allows distributed

play00:21

processing of large data sets across

play00:24

clusters of computers using simple

play00:27

programming models Hadoop is designed to

play00:29

scale up from single server to thousands

play00:32

of machines each offering local

play00:34

computation and storage in short Hadoop is

play00:38

used to develop applications that could

play00:40

perform complete statistical analysis on

play00:43

huge amounts of data Hadoop runs

play00:47

applications using the MapReduce

play00:49

algorithm where the data is processed in

play00:51

parallel with

play00:53

others Hadoop architecture Hadoop

play00:56

framework consists of MapReduce for

play00:58

distributed computation hdfs for

play01:01

distributed storage yarn framework and

play01:04

common utilities we will see each of

play01:06

these in the subsequent

play01:08

slides Hadoop architecture Hadoop

play01:11

framework includes following four

play01:13

modules Hadoop

play01:15

common these are Java libraries and

play01:18

utilities required by other Hadoop

play01:20

modules these libraries provide file

play01:23

system and OS level abstractions and

play01:26

contains the necessary Java files and

play01:28

scripts required to start Hadoop Hadoop

play01:31

YARN this is a framework for job scheduling

play01:34

and cluster Resource Management Hadoop

play01:37

distributed file system hdfs it is a

play01:40

distributed file system that provides

play01:42

High throughput access to application

play01:44

data then Hadoop MapReduce this is a YARN

play01:48

based system for parallel processing of

play01:51

large data sets MapReduce MapReduce

play01:55

Paradigm provides the means to break a

play01:57

larger task into smaller tasks run tasks

play02:00

in parallel and consolidate the outputs

play02:02

of the individual task into the final

play02:05

output as its name implies MapReduce

play02:08

consists of two basic parts a map step

play02:11

and a reduce step which are detailed as

play02:13

follows

play02:15

map applies an operation to a piece of

play02:18

data and map provides some intermediate

play02:22

output then reduce reduce consolidates

play02:25

the intermediate outputs from the map

play02:28

steps and it provides the final output

play02:32

MapReduce each step uses key value pairs

play02:35

denoted as key comma value as input and

play02:39

output it is useful to think of the key

play02:41

value pairs as a simple ordered pair

play02:45

however the pairs can take fairly

play02:47

complex forms for example the key could

play02:50

be a file name and the value could be

play02:52

the entire contents of the

play02:54

file example problem counting words

play02:57

assume that we have a huge text document

play03:00

and count the number of times each

play03:02

distinct word appears in the file sample

play03:05

application for the same would be

play03:07

analyzing web server logs to find

play03:09

popular

play03:11

URLs so for the word counting

play03:13

problem there can be two cases case one

play03:16

file is too large for the memory but all

play03:19

word count pairs fit in the memory so

play03:22

you can generate a big string array or

play03:25

you can create a hash table the second

play03:27

case would be all word pairs do not fit in

play03:30

the memory but fit into the disk a

play03:32

possible approach would be write

play03:34

computer functions or programs for each

play03:37

step that is break the text document

play03:40

into the sequence of words sort the

play03:43

words this will bring the same words

play03:45

together and third step would be count

play03:47

the frequencies in a single

play03:51

pass so case two captures the essence of

play03:54

MapReduce

play03:56

so for the word count problem

play03:59

initially we have to get the words that

play04:02

is the data file then sort it and then

play04:04

count it so in MapReduce it would be

play04:08

mapping that is extract something we are

play04:12

caring about that is in this example it

play04:14

will be the words and the

play04:17

count then then comes grouping the words

play04:21

and

play04:23

Shuffle and finally reduce the words

play04:27

that is reduction includes

play04:29

aggregation summarization etc and

play04:32

finally save the

play04:34

results so the word count problem is

play04:38

picturized in this slide that is input

play04:42

consider two sentences for the input in

play04:45

this is a simplified example so

play04:48

consider two sentences for the input the

play04:51

sentences are this is an apple and apple

play04:54

is red in color split the input into two

play04:58

sentences this is an apple and second

play05:00

one would be apple is red in

play05:02

color then apply the map that is this

play05:07

occurrence is one is occurrence one an

play05:11

occurrence is one and apple occurrence

play05:13

is one for the first sentence then for

play05:15

the second sentence apple is red in and

play05:20

color for all these words occurrences

play05:22

would be one then after mapping the next

play05:25

step would be

play05:27

shuffling that is similar words are

play05:29

grouped together that is this occurrence

play05:32

will be one and is is is occurring in

play05:35

the first sentence as well as in the

play05:37

second sentence so the word is is

play05:39

grouped together then An Occurrence is

play05:42

only one then the word apple apple

play05:46

occurs in the first sentence as well as

play05:48

in the second sentence so the word apple

play05:50

is grouped together similarly red in and

play05:53

color occurrence is only one then after

play05:57

shuffling the next step is reducing that

play06:00

is this occurrence is only once for is

play06:03

is is occurring two times so is is

play06:07

returned with the value two similarly an

play06:11

occurrence is only one and for apple there

play06:13

are two occurrences and for red in and

play06:17

color occurrence is only one and from

play06:20

the reducing step the output

play06:24

is this occurrence one is occurrence

play06:27

two an one apple occurrence is two red is

play06:30

one in is one and color is

play06:34

one so MapReduce has the advantage of

play06:37

being able to distribute the workload

play06:39

over a cluster of computers and run the

play06:41

task in

play06:43

parallel executing a MapReduce job

play06:45

requires the management and coordination

play06:47

of several

play06:49

activities MapReduce jobs need to be

play06:51

scheduled based on the system's

play06:54

workload then jobs need to be monitored

play06:57

and managed to ensure that they

play06:59

encountered

play07:02

errors sorry jobs need to be

play07:06

monitored and managed to ensure that any

play07:08

encountered errors are properly handled

play07:11

so that the job continues to execute if

play07:14

the system partially fails input data

play07:16

needs to be spread across the cluster

play07:19

map step processing of the input needs

play07:21

to be conducted across the distributed

play07:23

system preferably on the same machines

play07:26

where the data resides then intermediate

play07:29

outputs from the numerous map steps

play07:32

need to be collected and

play07:34

provided to the proper machines for the

play07:36

reduce step execution final output

play07:39

needs to be made available for use by

play07:42

another user another application or

play07:45

perhaps another MapReduce job the next

play07:49

component of Hadoop is Hadoop

play07:51

distributed file system

play07:54

hdfs hdfs is based on the Google file

play07:57

system and provides a distributed file

play07:59

system that is designed to run on

play08:02

commodity Hardware it has many

play08:04

similarities with existing distributed

play08:06

file systems however the differences

play08:09

from other distributed file systems are

play08:11

significant it is highly fault tolerant

play08:14

and is designed to be deployed on

play08:16

low-cost hardware it provides high

play08:19

throughput access to application data

play08:21

and is suitable for applications having

play08:24

large data sets how does Hadoop work

play08:27

Hadoop runs code across clusters of

play08:29

computers this process includes the

play08:31

following core tasks that Hadoop performs

play08:34

data is initially divided into

play08:36

directories and files files are divided

play08:38

into uniform sized blocks of 128 MB and

play08:43

64 MB preferably

play08:45

128 MB these files are then distributed

play08:48

across various cluster nodes for further

play08:50

processing hdfs being on top of the

play08:53

local file system supervises the

play08:55

processing how does Hadoop work blocks

play08:58

are replicated for handling Hardware

play08:59

failure generally three replicas will be

play09:02

there checking that the code was

play09:04

executed successfully performing the

play09:06

sort that takes place between the map

play09:08

and reduce stages sending the sorted

play09:11

data to a certain computer and writing

play09:13

the debugging logs for each

play09:16

job features of hdfs it is suitable for

play09:19

the distributed storage and

play09:20

processing Hadoop provides a command

play09:23

interface to interact with hdfs the

play09:25

built-in servers of name node and data

play09:27

node help users to easily check the

play09:30

status of cluster streaming access to

play09:32

file system data and hdfs provides file

play09:35

permissions and

play09:38

authentication name node the system

play09:41

having the name node acts as the master

play09:43

server and it does the following tasks

play09:46

manages the file system name space

play09:48

regulates the client's access to files

play09:50

it also executes file system operations

play09:53

such as renaming closing opening files

play09:55

and directories keeps track of where

play09:58

the various blocks of the data file are

play10:00

stored data

play10:03

node these nodes manage the data storage

play10:06

of their system data nodes perform

play10:09

read-write operations on the file system

play10:11

as per client request they also perform

play10:14

operations such as block creation

play10:17

deletion and replication according to

play10:19

the instructions of the name node each

play10:22

data node periodically builds a report

play10:24

about the block stored on the data node

play10:27

and sends the report to the name node

play10:29

schematically hdfs is shown here that is

play10:33

there is a master node then eight worker

play10:35

nodes across two racks are there and

play10:38

there is a secondary node already the

play10:41

functions of name node secondary node

play10:44

and data node are explained in the

play10:46

previous

play10:48

slides then the process if a client

play10:50

application wants to access a particular

play10:53

file stored in hdfs the application

play10:55

contacts the name node then name node

play10:58

provides the application with the

play10:59

locations of the various blocks for that

play11:01

file the application then communicates

play11:03

with appropriate data nodes to access

play11:06

the

play11:07

file name node limitation for

play11:10

performance reasons the name node

play11:12

resides in a machine's memory name node

play11:14

is critical to the operations of hdfs

play11:17

any unavailability or Corruption of the

play11:19

name node results in a data

play11:21

unavailability event on the

play11:23

cluster name node is viewed as a single

play11:25

point of failure in the Hadoop environment

play11:27

name node is typically run on a

play11:30

dedicated machine secondary name node it

play11:33

provides the capability to perform some

play11:36

of the name node task to reduce the load

play11:37

on the name node updating the file

play11:39

system image with the contents of the

play11:41

file system edit logs secondary name

play11:44

node is not a backup or redundant name

play11:47

node then hdfs high availability feature

play11:51

this feature enables the use of two name

play11:53

nodes one in the active state and

play11:56

another in a standby state if an active

play11:59

name node fails the standby name node

play12:01

takes over when using the hdfs high

play12:04

availability feature a secondary name

play12:07

node is

play12:09

unnecessary then the next component in

play12:11

Hadoop architecture is yarn YARN stands

play12:14

for yet another resource negotiator YARN

play12:17

separates the resource management of the

play12:18

cluster from the scheduling and

play12:21

monitoring of jobs running on the

play12:23

cluster YARN replaces the functionality

play12:26

previously provided by the job tracker

play12:28

and task tracker daemons let us discuss

play12:31

the job tracker and task

play12:33

trackers Hadoop with the job tracker and

play12:36

task tracker is shown here high level

play12:40

Hadoop architecture then task tracker

play12:43

is there job tracker is there the name

play12:47

node data

play12:50

node the functionalities of job tracker

play12:53

and task tracker primary function of the

play12:55

job tracker is Resource Management

play12:58

tracking resource availability and

play13:00

task life cycle management the task

play13:02

tracker has a simple function of

play13:06

following the orders of the job tracker

play13:08

and updating the job tracker with its

play13:10

progress status periodically the task

play13:13

tracker is preconfigured with a number

play13:15

of slots indicating the number of tasks

play13:18

it can accept when the job tracker tries

play13:20

to schedule a task it looks for an empty

play13:23

slot in the task tracker running on the

play13:26

same server which hosts the data node

play13:28

where the data for that task resides if not

play13:32

found it looks for the machine in the

play13:34

same rack there is no consideration of

play13:36

system load during this

play13:39

allocation LinkedIn LinkedIn is an

play13:42

online Professional Network of 250

play13:44

million users in 200 countries as of

play13:47

early 2014 LinkedIn provides several

play13:50

free and subscription Based Services

play13:53

such as company information Pages job

play13:55

postings Talent searches Etc

play14:00

so LinkedIn utilizes Hadoop for the

play14:02

following purposes process daily

play14:05

production database transaction logs

play14:07

examine the users activities such as

play14:09

views and clicks feed the extracted data

play14:13

back to the production systems

play14:14

restructure the data to add to an

play14:17

analytical database and develop and test

play14:20

analytical

play14:22

models

play14:25

Yahoo as of 2012 Yahoo has one of the

play14:28

largest publicly announced Hadoop

play14:30

deployments at 42,000 nodes across

play14:33

several clusters utilizing 350 petabytes

play14:36

of raw storage Yahoo's applications

play14:40

include the following search index

play14:42

creation and maintenance web page

play14:44

content optimization web ad placement

play14:47

optimization spam filters ad hoc analysis

play14:49

and analytical model

play14:52

development the references used for the

play14:55

video are from Hadoop DataFlair and

play14:57

introduction to MapReduce from

play14:59

G thank you
