Map Reduce explained with example | System Design
Summary
TLDR: This video script introduces the MapReduce programming model, a powerful tool for processing massive datasets across multiple machines. Originating from Google in 2004, it simplifies parallel data processing through two phases: the 'map' phase, which converts data into key-value pairs, and the 'reduce' phase, which aggregates these pairs into meaningful outputs. The script explains the model's resilience to machine failures and its efficiency in handling large-scale data, using the example of word count across files. It also highlights the importance of recognizing MapReduce patterns in system design interviews.
Takeaways
- The MapReduce programming model operates in two phases: 'Map', which handles data splitting and mapping, and 'Reduce', which shuffles and reduces the data into a final output.
- MapReduce was developed to address the challenge of processing massive amounts of data across hundreds or thousands of machines efficiently and with meaningful insights.
- Originating in 2004, the MapReduce model was introduced by two Google engineers in a white paper, providing a framework for handling large datasets in a distributed system.
- The model assumes the existence of a distributed file system where data chunks are replicated and spread across multiple machines, managed by a central controller.
- The 'Map' function transforms data into key-value pairs, which are then shuffled and reorganized for the 'Reduce' function to process into a final output.
- The importance of the key-value structure in the intermediary step is highlighted, as it allows for the identification of common patterns or insights from the data.
- The MapReduce model handles machine failures or network partitions by re-performing map or reduce operations, assuming the functions are idempotent.
- Idempotency is a requirement in MapReduce, meaning that repeating map or reduce functions does not change the output, ensuring consistency.
- Engineers using MapReduce need to focus on understanding the inputs and expected outputs at each step, simplifying the complexities of distributed data processing.
- The script provides an example of word count across files using MapReduce, illustrating the parallel mapping, shuffling, and reducing processes.
- The ability to identify MapReduce patterns in system design interviews is crucial, as it helps in solving problems that require analyzing large distributed datasets.
Q & A
What are the two main phases of the MapReduce program?
-The two main phases of the MapReduce program are the 'map' phase, which deals with splitting and mapping of data, and the 'reduce' phase, which involves shuffling and reducing the data.
What is the purpose of the map function in MapReduce?
-The map function in MapReduce transforms the input data into key-value pairs, which are then used in the intermediary step of the MapReduce process.
What happens during the shuffle phase in MapReduce?
-During the shuffle phase, the key-value pairs produced by the map function are grouped by keys, preparing them for the reduce phase where they will be reduced into a final output.
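To make the grouping concrete, below is a minimal single-process sketch of what the shuffle conceptually does (plain Python, purely illustrative; in a real system the grouped pairs are moved between machines):

```python
from collections import defaultdict

def shuffle(mapped_pairs):
    """Group mapper output (key, value) pairs by key.

    In a real MapReduce job this grouping happens across machines;
    here it is simulated in memory purely to show the idea.
    """
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    return grouped

# Pairs emitted by two mappers for "this is an apple" and "apple is red in color"
pairs = [("this", 1), ("is", 1), ("an", 1), ("apple", 1),
         ("apple", 1), ("is", 1), ("red", 1), ("in", 1), ("color", 1)]
print(dict(shuffle(pairs)))
# {'this': [1], 'is': [1, 1], 'an': [1], 'apple': [1, 1], 'red': [1], 'in': [1], 'color': [1]}
```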
Why was the MapReduce model created?
-The MapReduce model was created to process massive amounts of data efficiently and with speed, without sacrificing meaningful insights, across hundreds or thousands of machines in a distributed setting.
How does the MapReduce model handle failures like machine failures or network partitions?
-The MapReduce model handles failures by re-performing the map or reduce operations that were affected by the failure. This is possible because the map and reduce functions need to be idempotent, meaning repeating them multiple times does not change the output.
What is the significance of a distributed file system in the context of MapReduce?
-A distributed file system is crucial in MapReduce as it allows large datasets to be split into chunks, replicated, and spread across multiple machines, with a central controller managing the data's location and processing.
Why is it important for map and reduce functions to be idempotent in MapReduce?
-Idempotency of map and reduce functions is important because it ensures that the output remains consistent even if the operations are repeated due to failures, which is a common practice in handling errors in distributed systems.
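As a toy illustration of that requirement (not how any real framework enforces it), the reducer below is a pure function of its inputs, so re-running it after a simulated failure produces an identical result:

```python
def reduce_word_counts(word, counts):
    # A pure, deterministic reducer: the same inputs always give the same output,
    # so the central controller can safely re-run it after a machine failure.
    return word, sum(counts)

first_run = reduce_word_counts("apple", [1, 1])
second_run = reduce_word_counts("apple", [1, 1])  # re-executed after a simulated failure
assert first_run == second_run == ("apple", 2)
```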
What is the role of the central controller in a MapReduce job?
-The central controller in a MapReduce job is responsible for knowing where the data chunks reside and for communicating with all the machines that store or process data, ensuring the coordination of the MapReduce tasks.
How does the key-value structure of data in the intermediary step facilitate the reduce phase?
-The key-value structure in the intermediary step is important because it groups data with the same key together, making it easier for the reducer to process and identify common patterns or values associated with each key.
What is an example of a practical application of the MapReduce model?
-A practical application of the MapReduce model is counting the number of occurrences of each unique word across multiple files, where the mapper counts word frequencies and the reducer sums these frequencies to produce the final output.
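A minimal single-machine sketch of that word-count example in Python follows; the function names and the in-memory stand-ins for the two input files are illustrative only:

```python
from collections import defaultdict

def map_word_count(document):
    # Map phase: emit a (word, 1) key-value pair for every word in the document.
    return [(word, 1) for word in document.split()]

def reduce_word_count(word, counts):
    # Reduce phase: sum the counts gathered for a single word.
    return word, sum(counts)

files = ["this is an apple", "apple is red in color"]  # stand-ins for the two input files

# Map (would run in parallel, one mapper per file or chunk)
mapped = [pair for doc in files for pair in map_word_count(doc)]

# Shuffle: group the emitted values by key
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce (one call per key, also parallelizable)
result = dict(reduce_word_count(w, c) for w, c in grouped.items())
print(result)  # {'this': 1, 'is': 2, 'an': 1, 'apple': 2, 'red': 1, 'in': 1, 'color': 1}
```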
Why is understanding the MapReduce pattern important for system design interviews?
-Understanding the MapReduce pattern is important for system design interviews because it helps candidates identify whether a given problem can be efficiently solved using MapReduce, which is a key skill in handling large-scale data processing tasks.
Outlines
Introduction to MapReduce
This paragraph introduces the MapReduce programming model, which is fundamental for processing large datasets across multiple machines. It explains the two phases: 'map', which involves splitting and mapping data into key-value pairs, and 'reduce', which shuffles and reduces these pairs into a final output. The paragraph also discusses the necessity of MapReduce, which arose from the need to efficiently process vast amounts of data collected by Google in the early 2000s. The model was designed to handle parallel processing and machine failures without sacrificing insights. The origin of MapReduce is attributed to a 2004 white paper by two Google engineers, which presented a method for processing large datasets in a distributed environment. The paragraph concludes by encouraging viewers to understand the framework through examples, especially those with a background in Python or JavaScript, since the map and reduce functions in distributed systems differ from their single-machine counterparts.
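For context, here is a quick sketch of the single-machine built-ins the paragraph alludes to: Python's map and functools.reduce work on one in-memory sequence, whereas the distributed model applies the same two ideas to chunks spread across many machines with a shuffle step in between (a rough analogy, not an equivalence):

```python
from functools import reduce

words = "this is an apple".split()

# Single-machine map: one (word, 1) pair per word, all in local memory.
pairs = list(map(lambda w: (w, 1), words))

# Single-machine reduce: fold the pairs into a single dict of counts.
counts = reduce(lambda acc, kv: {**acc, kv[0]: acc.get(kv[0], 0) + kv[1]}, pairs, {})
print(counts)  # {'this': 1, 'is': 1, 'an': 1, 'apple': 1}
```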
Deep Dive into MapReduce with an Example
The second paragraph delves deeper into the MapReduce model by providing an example from the original Google white paper. It outlines the process of counting the occurrences of each unique word across multiple files using MapReduce. The mapper function counts word occurrences and outputs them as key-value pairs, with the key being the word and the value being its frequency. The subsequent shuffle phase is crucial and often overlooked; it groups values by key, simplifying the reducer's task. In the reduce phase, the grouped values for each key are combined to produce a final output showing the frequency of each word. The paragraph also touches on the importance of recognizing when to apply MapReduce in system design interviews, suggesting that it is a skill that comes with practice. It concludes by hinting at the complexity of identifying MapReduce patterns in real-world problems, such as analyzing metadata from millions of YouTube videos.
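For the YouTube-metadata problem mentioned above, here is a hedged sketch of how the same pattern might look; the record fields, threshold, and function names are made up for illustration. The mapper filters records where the metadata chunk lives, and the reducer simply collects the surviving video IDs:

```python
def map_videos(metadata_records, min_likes=100_000):
    # Map phase: runs where the metadata chunk lives; emit ("popular", video_id)
    # only for videos that clear the (hypothetical) likes threshold.
    for record in metadata_records:
        if record["likes"] >= min_likes:
            yield "popular", record["video_id"]

def reduce_videos(key, video_ids):
    # Reduce phase: gather all IDs emitted under the same key into one output list.
    return key, list(video_ids)

chunk = [  # a stand-in for one machine's chunk of video metadata
    {"video_id": "a1", "likes": 250_000},
    {"video_id": "b2", "likes": 4_000},
    {"video_id": "c3", "likes": 180_000},
]
mapped = list(map_videos(chunk))
print(reduce_videos("popular", (vid for _, vid in mapped)))  # ('popular', ['a1', 'c3'])
```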
Keywords
MapReduce
Distributed File System
Shuffle
Key-Value Pairs
Horizontal Scaling
Data Processing
Idempotence
Failure Handling
System Design Interviews
Word Count Example
YouTube Videos Metadata
Highlights
The MapReduce programming model operates in two phases: Map and Reduce, with Map handling data splitting and mapping, and Reduce performing data shuffling and reduction.
Map function transforms data into key-value pairs, which are essential for the intermediary step in the MapReduce process.
The Reduce function consolidates data into a final output, such as a file, after the shuffling and reorganization of key-value pairs.
Google's need for processing large datasets with speed and efficiency led to the creation of the MapReduce model in 2004.
MapReduce allows for efficient processing of terabyte-scale datasets across thousands of machines in a distributed setting.
The model simplifies data processing by enabling engineers to focus on input and output without managing the complexities of distributed systems.
A distributed file system is assumed in MapReduce, where data chunks are replicated and spread across multiple machines.
Data locality is crucial; Map functions operate on data where it resides to avoid moving large datasets.
The key-value structure in the intermediary step is vital for identifying patterns and reducing data chunks effectively.
Idempotency of Map and Reduce functions is required to handle failures by re-performing operations without changing the outcome.
Engineers using MapReduce need a deep understanding of the inputs, intermediary key-value pairs, and expected final output.
The MapReduce framework is often implemented using libraries like Hadoop, which abstract the complexities of large-scale data processing (see the sketch after this list).
An example from the original Google white paper illustrates counting word occurrences across multiple files using MapReduce.
The Shuffle phase, often overlooked, is critical for grouping values by keys, simplifying the Reducer's task.
Identifying patterns suitable for MapReduce is key in system design interviews, where understanding the problem's requirements is crucial.
MapReduce is applicable to large datasets distributed across many machines, requiring collective analysis or deduction.
The video aims to provide a high-level understanding of MapReduce with examples relevant to system design interviews.
The Map and Reduce functions in distributed systems differ from those in Python or JavaScript, adapted for large-scale data processing.
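To make the Hadoop point above concrete, here is a simplified, Hadoop Streaming-style sketch: the mapper and reducer are small stdin/stdout scripts, and the engineer only wires up these two programs plus the input and output paths (exact paths and jar locations vary by installation).

```python
# mapper.py -- emit "word<TAB>1" for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
# reducer.py -- sum the counts for each word; Hadoop Streaming hands the reducer
# its input lines already sorted by key, so equal keys arrive adjacent to each other
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.strip().rsplit("\t", 1)
    if word != current_word and current_word is not None:
        print(f"{current_word}\t{current_count}")
        current_count = 0
    current_word = word
    current_count += int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

These would be launched with something roughly like hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input <input_dir> -output <output_dir>; the framework then handles chunking, shuffling, sorting, and retries.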
Transcripts
MapReduce programs work in two phases
namely map and reduce map tasks deal
with splitting and mapping of data while
reduce tasks shuffle and reduce the data
so the map function will transform the
data into key-value pairs and these
key-value pairs live here in the
intermediary step of the MapReduce job
and then these key-value pairs
are shuffled around and reorganized
in a way that makes sense in the final
step they are then reduced into some
final output the output may be some file
that can be used somewhere else in the
system
now before diving deep let's familiarize
ourselves with why it was needed in the
first place back in the early 2000s Google's
algorithms and applications were
collecting data 24/7 about people
processes systems and organizations
resulting in huge volumes of data and
since vertical scaling is limited to
process these large data sets they had
to horizontally scale to hundreds if not
thousands of machines
now when it comes to data processing
processing data across so many machines
is a very difficult task because you
need to do parallel processing across
all these machines and you have to
handle failures like network partitions
or machine failures
the challenge Google faced was how to
process this massive amount of data with
speed and efficiency and without
sacrificing the meaningful insights
this is where the MapReduce programming
model comes to the rescue
back in 2004 two Google engineers wrote
a white paper on the MapReduce model which
allowed engineers to process very large
data sets that were spread across
hundreds or thousands of machines in a
distributed setting very efficiently in a
fault-tolerant manner it's a pretty short
white paper but may be fairly complex to
understand for some but I highly
encourage you to skim through it you
can find the link to the white paper in
my description in this video I will give
a high-level understanding of the MapReduce
framework with examples in the context
of system design interviews
now if you have a Python or JavaScript
background you are certainly familiar
with map and reduce functions here in the
context of MapReduce in distributed
system settings map and reduce are a
little bit different
Google engineers realized that most
data processing tasks could be split
up into two steps and they created a
library that allowed engineers to
process huge data sets on the order of
terabytes spread across thousands of
machines and this is a fairly simple
concept to understand
there is this map function that
transforms the data into key-value pairs
the key-value pairs are shuffled around
and reorganized they are then reduced to
some final output now before
understanding it any further through an
example there are a few things worth
noting
the first thing is that when dealing with
the MapReduce model we assume that we have a
distributed file system which means we
have got some large data set that is
split up into chunks these chunks are
replicated and spread out across
hundreds or thousands of machines and
then this distributed file system has
some sort of central controller which is
aware of everything going on in the
MapReduce job
meaning the central controller will know
exactly where the chunks of data reside
and is able to communicate with all the
machines that store or process data
that is the map machines or reduce
machines or workers
the second thing to note is that since
we are dealing with very large data
sets we don't actually want to move
these large data sets we let them live
where they currently live on their
respective machines what we do is we
have the map functions operate on the
data locally
so instead of grabbing and aggregating
the data and moving it somewhere else we
instead send the map programs to the
data and avoid moving all this large
data the third very important thing to
note about this model is that the key-value
structure of data in the intermediary
step is very important because when you
perform a reduce on these data chunks
these data chunks come from the same
data set because remember all these data
chunks are chunks of the same large
data set so when you reduce these data
chunks you're likely looking for some
sort of commonality or pattern in these
various pieces of data because when you've
got key-value pairs you will have some
keys that are common up here and some
keys that are common down here and it
becomes easy to reduce them into one single
meaningful value based on that key this
will become very clear when we deep dive
into our example the fourth thing to
note is how the model handles machine
failures or network partitions to handle
failures the MapReduce model basically
re-performs the map operations or the
reduce operations when a failure
occurs so if a failure occurs in this
step here that is maybe there was a
network partition when generating the
key-value pairs the central controller
which I mentioned earlier will just
re-perform this operation and then move
on to the reduce step
and this can happen only if the map and
reduce functions are idempotent that is
if you repeat the map or reduce function
multiple times the output doesn't change
the outcome doesn't change regardless of
how many times you repeat the map or
reduce function so idempotency is a
requirement in the MapReduce model now
when engineers are dealing with the
MapReduce model they just need to make
sure that they have a good understanding
of the inputs and expected outputs that
is what is the output of the map
function what do the intermediary key-value
pairs look like how can they be
reorganized in a meaningful way to act
as an input to the reduce function and
allow it to optimally reduce to a final
output
as you can see this framework simplifies
a lot of things for engineers engineers
just need to worry about the input and
output data in the various steps they don't
need to worry about the intricacies and
complexities of processing large data
sets in a distributed system
basically engineers will end up using
some sort of library implementation such
as Hadoop and they just need to be aware
of the various inputs and outputs of
these various functions
all right with all these points in mind
let's understand it through an example
which I am taking from the same white
paper presented by the Google engineers
so here I have a set of input files and
typically these input files will be
distributed across several machines for
our example for the sake of simplicity I
just have two input files and our task
here is to count the number of
occurrences of each unique word across
all the files so here our first file
contains the sentence this is an apple
and our second file contains apple is
red in color the mapper here counts
the number of times each word occurs
in the input splits in the form of key-value
pairs where the key will be the
word and the value is the frequency of
the word so the word this occurs one
time the word is occurs one time the
word an occurs one time and the word
apple occurs one time and same with the
other input file where it says apple is
red in color the word apple occurs one
time is occurs one time red occurs one
time in occurs one time and color occurs
one time by the way both these
operations both the mapping operations
are happening here in parallel now once
the mapping is done it is followed by a
shuffle phase which is a lot of times
ignored during MapReduce discussions
but it's a critical step here we group
the values by keys in the form of key-value
pairs so here we get a total of
seven groups of key-value pairs and this
shuffling or grouping of keys makes the
job of the reducer so easy and efficient
because now the same reducer is used for
all the key-value pairs with the same
key now all the words present in the
data are combined into a single output
in the reducer phase and the output
shows the frequency of each word
here in the example we get the final
output of key-value pairs as this
occurred one time the word is occurred
two times an occurred one time apple
occurred two times red occurred one time
in occurred one time and color occurred
one time
the trickier part in the context of
system design interviews is to be able
to identify this pattern of MapReduce
that is given a problem can you solve
that problem using MapReduce is that the
right use case so that is something you
need to be able to identify and which
comes by solving more and more problems
so for example you are given a large
data set of millions of YouTube videos
and their metadata information such as
the likes the comments the engagement
rate the length of the video and whatnot
and let's say you are asked to find all
the videos which have a certain number of
likes so for this you need to parse
through the entire metadata information
of all these videos and then finally
produce an output file and likewise
whenever you are given a large data set
that you have to deduce from or analyze
where say the files are
distributed across many machines and you
need to analyze all the files
collectively somehow that is a key hint
for applying MapReduce with that I hope
this video made sense and for any future
system design problems you will be able
to identify whether to apply MapReduce
or not at least you should be able to
have a discussion with your interviewer
to check if applying MapReduce makes
sense for a particular problem
[Music]
thank you