Introduction to Hadoop
Summary
TLDR: This video, presented by Mrs. Kavida S., an assistant professor at MIT Academy of Engineering, offers a comprehensive overview of Hadoop architecture. It explains Hadoop's role as an open-source framework for distributed processing of large data sets using clusters of computers. Key components like MapReduce for computation, HDFS for storage, and YARN for resource management are discussed. The video also covers the map and reduce steps, a word count problem example, and the functioning of Hadoop's distributed file system, highlighting its fault tolerance and efficiency in handling large-scale data.
Takeaways
- 😀 Hadoop is an open-source framework by Apache, written in Java, allowing distributed processing of large datasets across clusters of computers.
- 💻 Hadoop's architecture includes four main modules: MapReduce, HDFS (Hadoop Distributed File System), YARN (Yet Another Resource Negotiator), and Hadoop Common.
- 📊 Hadoop MapReduce processes data in parallel by breaking tasks into smaller parts (Map phase) and then aggregating them (Reduce phase).
- 🗂️ HDFS is a distributed file system that provides high-throughput access to application data, ensuring fault tolerance by replicating data across different nodes.
- 🔑 In MapReduce, data is processed using key-value pairs, which can be simple or complex, like a filename as the key and its contents as the value.
- 📈 A common example problem solved by MapReduce is word count, where the input is split into words, grouped, and counted across large datasets.
- 🔄 The Hadoop framework ensures data is distributed across nodes and uses replication to handle hardware failures, ensuring data integrity and high availability.
- 🖥️ HDFS works with a master node (NameNode) and multiple worker nodes (DataNodes) to manage and store data efficiently, with the NameNode as a central point of access.
- 🚨 The NameNode is a single point of failure in the system, but high availability features allow for failover with an active and standby NameNode setup.
- 🗃️ YARN separates resource management from job scheduling, replacing the JobTracker and TaskTracker components of older Hadoop versions.
Q & A
What is Hadoop?
-Hadoop is an open-source framework by Apache, written in Java, that allows for distributed processing of large data sets across clusters of computers using simple programming models.
How does Hadoop scale across machines?
-Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage, making it ideal for handling large amounts of data.
What is the purpose of the MapReduce algorithm in Hadoop?
-MapReduce is an algorithm used in Hadoop for parallel processing. It breaks down tasks into smaller sub-tasks (Map step) and processes them in parallel before combining the results (Reduce step) to produce the final output.
What are the four main modules of Hadoop architecture?
-The four main modules are: 1. Hadoop Common – Java libraries and utilities, 2. HDFS – Hadoop Distributed File System, 3. YARN – resource management, and 4. MapReduce – a system for parallel processing.
What is the role of Hadoop Distributed File System (HDFS)?
-HDFS is a distributed file system in Hadoop that provides high-throughput access to application data. It stores large data sets reliably across many nodes and is fault-tolerant.
Can you explain how MapReduce works with an example?
-In MapReduce, the 'Map' step processes data to extract useful information, and the 'Reduce' step aggregates the results. For example, in a word count problem, 'Map' emits each word paired with a count of one, and 'Reduce' sums those counts for every occurrence of the word across the input.
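As an illustration only (the video presents the algorithm, not code), a minimal word-count map and reduce pair written against the standard Hadoop MapReduce Java API might look like this; the class names are hypothetical:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map step: emit (word, 1) for every word in a line of input.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE); // intermediate output: (word, 1)
        }
    }
}

// Reduce step: sum the counts collected for each word after the shuffle.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum)); // final output: (word, total)
    }
}
```

The framework performs the grouping and shuffling between the two steps, so the reducer sees each word together with all of its intermediate counts.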
What is YARN, and what role does it play in Hadoop?
-YARN stands for Yet Another Resource Negotiator. It separates resource management from job scheduling and monitoring, helping Hadoop efficiently allocate resources across the cluster.
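For context, here is a sketch of the driver that submits such a job to the cluster; under YARN, the ResourceManager then allocates containers for the map and reduce tasks. The class names follow the word-count sketch above and are illustrative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver: packages the map and reduce classes into a job and submits it
// to the cluster, where YARN schedules and monitors the tasks.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1); // block until the job finishes
    }
}
```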
How does Hadoop handle failures and ensure data reliability?
-Hadoop handles failures by replicating data blocks, typically across three nodes, so that in case of hardware failure, the system can continue operating with minimal disruption.
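As a small illustration (the file path is hypothetical), the replication factor of a file can be inspected and changed through the standard HDFS FileSystem API:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: inspect and change the replication factor of a file in HDFS.
// dfs.replication defaults to 3, matching the video's "three replicas".
public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/sample.txt");             // hypothetical file
        short current = fs.getFileStatus(file).getReplication();
        System.out.println("current replication: " + current);
        fs.setReplication(file, (short) 3);                   // ask HDFS to keep 3 replicas
        fs.close();
    }
}
```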
What are the roles of the NameNode and DataNode in HDFS?
-The NameNode acts as the master server, managing the file system's namespace and regulating access to files. DataNodes store the actual data and handle read/write operations based on client requests, under instructions from the NameNode.
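The read path described in this answer can be sketched with the HDFS Java client API; the cluster address and file path below are assumptions for illustration:

```java
import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Sketch of the HDFS read path: the client asks the NameNode where each
// block of the file lives, then streams the bytes from the DataNodes.
public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // hypothetical cluster address and file path
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);
        Path file = new Path("/data/sample.txt");

        // Block metadata comes from the NameNode: one entry per block.
        FileStatus status = fs.getFileStatus(file);
        for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("block at offset " + loc.getOffset()
                    + " stored on " + String.join(",", loc.getHosts()));
        }

        // The actual bytes are streamed from the DataNodes holding the blocks.
        try (InputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}
```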
What are some real-world use cases of Hadoop?
-Some real-world use cases include LinkedIn's processing of daily logs and user activities, and Yahoo's deployment for search index creation, web page content optimization, and spam filtering.
Outlines
📘 Introduction to Hadoop and its Architecture
The speaker, Mrs. Kavida S., introduces the topic of Hadoop architecture. Hadoop is an open-source framework developed by Apache, written in Java, designed for the distributed processing of large data sets across computer clusters. Hadoop allows scaling from a single server to thousands of machines. It uses the MapReduce algorithm for parallel data processing. The architecture comprises four main components: MapReduce for computation, HDFS for storage, YARN for resource management, and Common utilities.
🛠 Components of Hadoop Architecture
This section elaborates on the four modules of Hadoop. Hadoop Common contains Java libraries and utilities for the system. YARN handles job scheduling and resource management. HDFS is a high-throughput, distributed file system. MapReduce breaks down large tasks into smaller ones, processing them in parallel. A sample problem—word counting in a large text file—is introduced to demonstrate MapReduce’s efficiency. The steps involve mapping words, sorting, and reducing the data to obtain the final word count.
🔄 Word Count Problem with MapReduce
The word count example is explored in depth. After breaking sentences into words, MapReduce maps each word to a count, groups similar words, and then shuffles and reduces the data. This process allows counting word occurrences across sentences. The advantages of MapReduce include distributing workloads across multiple machines, running tasks in parallel, managing errors, and optimizing performance even during partial system failures.
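Using the two sentences from the video, the key-value pairs at each stage look like this:

```
Input:   "this is an apple"                  "apple is red in color"
Map:     (this,1) (is,1) (an,1) (apple,1)    (apple,1) (is,1) (red,1) (in,1) (color,1)
Shuffle: (an,[1]) (apple,[1,1]) (color,[1]) (in,[1]) (is,[1,1]) (red,[1]) (this,[1])
Reduce:  (an,1) (apple,2) (color,1) (in,1) (is,2) (red,1) (this,1)
```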
🗃 Hadoop Distributed File System (HDFS)
HDFS is introduced as a fault-tolerant, distributed file system designed to run on low-cost hardware. It ensures high throughput access to data, making it ideal for large-scale applications. Files are divided into blocks and replicated across cluster nodes. HDFS ensures data reliability by replicating blocks to handle hardware failures. The process involves splitting files, distributing blocks, and managing replication to ensure data availability and fault tolerance.
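For example, a 1 GB file stored with a 128 MB block size is split into 8 blocks; with the typical replication factor of 3, HDFS keeps 24 block replicas spread across different DataNodes, so the loss of any single node leaves every block still available.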
🖥 Master-Slave Architecture in HDFS
This section explains the role of the NameNode (master) and DataNodes (slaves) in HDFS. The NameNode manages file system operations such as renaming, opening, or closing files. DataNodes handle storage tasks, such as reading and writing files. HDFS is highly reliant on the NameNode, making it a potential single point of failure. Two mitigations are described: the NameNode High Availability feature (active and standby NameNodes) and the Secondary NameNode, which offloads some NameNode tasks but is not a backup.
🎯 YARN: Resource Management in Hadoop
YARN (Yet Another Resource Negotiator) is introduced as the component responsible for separating resource management from job scheduling and monitoring. YARN replaces the traditional JobTracker and TaskTracker system, improving cluster efficiency. The section also discusses how jobs are scheduled and managed in a Hadoop cluster, including LinkedIn and Yahoo’s use cases for Hadoop in tasks like analyzing user activity, optimizing web content, and managing ad placements.
Keywords
💡Hadoop
💡MapReduce
💡HDFS (Hadoop Distributed File System)
💡YARN (Yet Another Resource Negotiator)
💡Cluster
💡DataNode
💡NameNode
💡Fault Tolerance
💡Word Count Problem
💡Job Tracker
Highlights
Hadoop is an open-source framework written in Java that allows distributed processing of large datasets across clusters of computers.
Hadoop is designed to scale from a single server to thousands of machines, each offering local computation and storage.
Hadoop applications use the MapReduce algorithm, in which data is processed in parallel across the nodes of the cluster.
The four core modules of Hadoop are MapReduce, HDFS (Hadoop Distributed File System), YARN, and common utilities.
MapReduce allows breaking a large task into smaller tasks, running them in parallel, and consolidating the outputs into the final result.
MapReduce uses key-value pairs for input and output, which can be used for complex problems like word counting in large text documents.
Hadoop Distributed File System (HDFS) provides high-throughput access to application data and is suitable for large data sets.
HDFS splits files into uniform-sized blocks (typically 128MB) and replicates them across nodes for fault tolerance.
The NameNode in HDFS manages the file system namespace, while DataNodes manage the actual data storage.
YARN (Yet Another Resource Negotiator) separates resource management from job scheduling and monitoring.
YARN replaces the older Hadoop JobTracker and TaskTracker with improved functionality for managing cluster resources.
HDFS is designed to be fault-tolerant, scalable, and highly efficient for processing large amounts of data.
Companies like LinkedIn and Yahoo use Hadoop to process transaction logs, analyze user activity, and optimize services like ad placement and spam filters.
Hadoop's high availability feature allows for the use of two NameNodes (active and standby) to ensure continuous operation.
Hadoop's MapReduce framework excels at distributing workloads across clusters of computers to handle massive datasets efficiently.
Transcripts
Hello everyone. This video is about Hadoop architecture. I am Mrs. Kavida S., working as an assistant professor in the Department of Computer Engineering of MIT Academy of Engineering.

What is Hadoop? Hadoop is an Apache open-source framework, written in Java, that allows distributed processing of large data sets across clusters of computers using simple programming models. Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage. In short, Hadoop is used to develop applications that can perform complete statistical analysis on huge amounts of data. Hadoop runs applications using the MapReduce algorithm, where the data is processed in parallel with other tasks.
Hadoop architecture: the Hadoop framework consists of MapReduce for distributed computation, HDFS for distributed storage, the YARN framework, and common utilities. We will see each of these in the subsequent slides.

The Hadoop framework includes the following four modules. Hadoop Common: these are the Java libraries and utilities required by the other Hadoop modules; they provide file system and OS-level abstractions and contain the necessary Java files and scripts required to start Hadoop. Hadoop YARN: this is a framework for job scheduling and cluster resource management. Hadoop Distributed File System (HDFS): a distributed file system that provides high-throughput access to application data. Hadoop MapReduce: a YARN-based system for parallel processing of large data sets.

The MapReduce paradigm provides the means to break a larger task into smaller tasks, run the tasks in parallel, and consolidate the outputs of the individual tasks into the final output. As its name implies, MapReduce consists of two basic parts, a map step and a reduce step, which are detailed as follows.
Map applies an operation to a piece of data and provides some intermediate output. Reduce consolidates the intermediate outputs from the map steps and provides the final output. Each step uses key-value pairs, denoted as (key, value), as input and output. It is useful to think of a key-value pair as a simple ordered pair; however, the pairs can take fairly complex forms. For example, the key could be a file name and the value could be the entire contents of the file.
Example problem: counting words. Assume that we have a huge text document and we must count the number of times each distinct word appears in the file. A sample application would be analyzing web server logs to find popular URLs.

For the word counting problem there can be two cases. Case one: the file is too large for memory, but all (word, count) pairs fit in memory; here you can generate a big string array or create a hash table. Case two: all (word, count) pairs do not fit in memory but fit on disk; a possible approach is to write functions or programs for each step, that is, break the text document into a sequence of words, sort the words (this brings the same words together), and count the frequencies in a single pass. Case two captures the essence of MapReduce.
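A minimal sketch of case one in Java (the class name and sample words are illustrative): the words stream past, and a plain hash table keeps all the (word, count) pairs in memory:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Case one: the file is too large for memory, but the (word, count)
// pairs fit, so a single in-memory hash table suffices.
public class InMemoryWordCount {
    public static Map<String, Integer> count(Iterable<String> words) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : words) {
            counts.merge(w, 1, Integer::sum); // add 1 to this word's count
        }
        return counts;
    }

    public static void main(String[] args) {
        Iterable<String> words = Arrays.asList(
                "this", "is", "an", "apple", "apple", "is", "red", "in", "color");
        System.out.println(count(words)); // e.g. {apple=2, is=2, this=1, ...}
    }
}
```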
So for the word count problem, initially we have to get the words from the data file, then sort them, and then count them. In MapReduce this becomes: mapping, that is, extracting what we care about (in this example, the words and their counts); then grouping the words and shuffling; and finally reducing, which includes aggregation, summarization, and so on; and finally saving the results.
The word count problem is pictured in this slide. This is a simplified example, so consider two sentences as the input: "this is an apple" and "apple is red in color". Split the input into the two sentences. Then apply the map: for the first sentence, "this" occurs once, "is" once, "an" once, and "apple" once; for the second sentence, each of the words "apple", "is", "red", "in", and "color" occurs once.

After mapping, the next step is shuffling, in which similar words are grouped together. "this" has a single occurrence; "is" occurs in the first sentence as well as in the second, so its occurrences are grouped together; "an" has a single occurrence; "apple" occurs in the first sentence as well as in the second, so its occurrences are grouped together; similarly, "red", "in", and "color" each have a single occurrence.

After shuffling, the next step is reducing: "this" occurs only once; "is" occurs two times, so "is" is returned with the value two; "an" occurs only once; "apple" has two occurrences; and "red", "in", and "color" each occur only once. The output from the reducing step is: this 1, is 2, an 1, apple 2, red 1, in 1, and color 1.
MapReduce has the advantage of being able to distribute the workload over a cluster of computers and run the tasks in parallel. Executing a MapReduce job requires the management and coordination of several activities. MapReduce jobs need to be scheduled based on the system's workload. Jobs need to be monitored and managed to ensure that any encountered errors are properly handled, so that the job continues to execute if the system partially fails. Input data needs to be spread across the cluster. Map-step processing of the input needs to be conducted across the distributed system, preferably on the same machines where the data resides. Intermediate outputs from the numerous map steps need to be collected and provided to the proper machines for the reduce-step execution. And the final output needs to be made available for use by another user, another application, or perhaps another MapReduce job.

The next component of Hadoop is the Hadoop Distributed File System (HDFS).
HDFS is based on the Google File System and provides a distributed file system that is designed to run on commodity hardware. It has many similarities with existing distributed file systems; however, the differences from other distributed file systems are significant. It is highly fault tolerant and is designed to be deployed on low-cost hardware. It provides high-throughput access to application data and is suitable for applications with large data sets.

How does Hadoop work? Hadoop runs code across a cluster of computers, and this process includes the following core tasks that Hadoop performs. Data is initially divided into directories and files; files are divided into uniform-sized blocks of 128 MB or 64 MB, preferably 128 MB. These files are then distributed across various cluster nodes for further processing, and HDFS, sitting on top of the local file system, supervises the processing. Blocks are replicated to handle hardware failure; generally there will be three replicas. Hadoop also checks that the code was executed successfully, performs the sort that takes place between the map and reduce stages, sends the sorted data to a certain computer, and writes the debugging logs for each job.
Features of HDFS: it is suitable for distributed storage and processing; Hadoop provides a command interface to interact with HDFS; the built-in servers of the NameNode and DataNode help users easily check the status of the cluster; it offers streaming access to file system data; and HDFS provides file permissions and authentication.

NameNode: the system hosting the NameNode acts as the master server, and it does the following tasks: it manages the file system namespace, regulates the clients' access to files, executes file system operations such as renaming, closing, and opening files and directories, and keeps track of where the various blocks of a data file are stored.

DataNodes: these nodes manage the data storage of their system. DataNodes perform read-write operations on the file system as per client requests. They also perform operations such as block creation, deletion, and replication according to the instructions of the NameNode. Each DataNode periodically builds a report about the blocks stored on it and sends the report to the NameNode.
Schematically, HDFS is shown here: there is a master node, eight worker nodes across two racks, and a secondary node. The functions of the NameNode, secondary node, and DataNodes were explained in the previous slides.

Then the process: if a client application wants to access a particular file stored in HDFS, the application contacts the NameNode. The NameNode provides the application with the locations of the various blocks for that file, and the application then communicates with the appropriate DataNodes to access the file.

NameNode limitation: for performance reasons, the NameNode resides in a machine's memory. The NameNode is critical to the operation of HDFS; any unavailability or corruption of the NameNode results in a data unavailability event on the cluster, so the NameNode is viewed as a single point of failure in the Hadoop environment. The NameNode is typically run on a dedicated machine.

Secondary NameNode: it provides the capability to perform some of the NameNode tasks, to reduce the load on the NameNode, such as updating the file system image with the contents of the file system edit logs. The secondary NameNode is not a backup or redundant NameNode.

Then the HDFS high availability feature: this feature enables the use of two NameNodes, one in the active state and another in a standby state. If the active NameNode fails, the standby NameNode takes over. When using the HDFS high availability feature, a secondary NameNode is unnecessary. Then the next component in the Hadoop architecture is YARN.
YARN stands for Yet Another Resource Negotiator. YARN separates the resource management of the cluster from the scheduling and monitoring of jobs running on the cluster. YARN replaces the functionality previously provided by the JobTracker and TaskTracker daemons. Let us discuss the JobTracker and TaskTracker.

Hadoop with the JobTracker and TaskTracker is shown here in the high-level Hadoop architecture: there is a TaskTracker, a JobTracker, the NameNode, and the DataNodes. The functionalities of the JobTracker and TaskTracker are as follows. The primary functions of the JobTracker are resource management, tracking resource availability, and task life cycle management. The TaskTracker has the simple function of following the orders of the JobTracker and updating the JobTracker with its progress status periodically. The TaskTracker is preconfigured with a number of slots indicating the number of tasks it can accept. When the JobTracker tries to schedule a task, it looks for an empty slot in the TaskTracker running on the same server that hosts the DataNode where the data for that task resides; if not found, it looks for a machine in the same rack. There is no consideration of system load during this allocation.
LinkedIn: LinkedIn is an online professional network of 250 million users in 200 countries as of early 2014. LinkedIn provides several free and subscription-based services, such as company information pages, job postings, and talent searches. LinkedIn utilizes Hadoop for the following purposes: to process daily production database transaction logs; to examine the users' activities, such as views and clicks; to feed the extracted data back to the production systems; to restructure the data to add it to an analytical database; and to develop and test analytical models.

Yahoo: as of 2012, Yahoo has one of the largest publicly announced Hadoop deployments, at 42,000 nodes across several clusters utilizing 350 petabytes of raw storage. Yahoo's applications include the following: search index creation and maintenance, web page content optimization, web ad placement optimization, spam filters, and ad hoc analysis and analytical model development.

The references used for the video are Hadoop from DataFlair and the introduction to MapReduce from G. Thank you.