005 Understanding Big Data Problem
Summary
TL;DR: This script dives into the challenges of handling big data, using a scenario where a programmer at a major stock exchange is tasked with calculating the maximum closing price of every stock symbol from a 1TB dataset. The discussion covers storage solutions like NAS and SAN, the limitations of HDDs, and the potential of SSDs. It introduces the concept of parallel computation and data replication for fault tolerance. The script concludes with an introduction to Hadoop, a framework for distributed computing that efficiently manages large datasets across clusters of inexpensive hardware, offering scalability and a solution to the complex problem presented.
Takeaways
- The video discusses the challenges of handling big data, particularly focusing on a scenario where one is asked to calculate the maximum closing price of every stock symbol traded on a major exchange like NYSE or NASDAQ.
- The script highlights the importance of storage, noting that with only 20 GB of free space on a workstation, a 1 terabyte dataset requires a more robust solution like a NAS or SAN server.
- The video emphasizes the time it takes to transfer large datasets, using the example of a 1 terabyte dataset taking approximately 2 hours and 22 minutes to transfer from a hard disk drive.
- It introduces the concept of using SSDs over HDDs to significantly reduce data transfer time, but also acknowledges the higher cost of SSDs as a potential barrier.
- The script poses the question of how to reduce computation time for such a large dataset, suggesting parallel processing as a potential solution.
- The idea of dividing the dataset into 100 equal parts and processing them in parallel on 100 nodes is presented to theoretically reduce computation time.
- The video discusses the network bandwidth issue that arises when multiple nodes try to transfer data simultaneously, suggesting local storage on each node as a solution.
- The importance of data replication to prevent data loss in case of hard disk failure is highlighted, with the suggestion of keeping three copies of each data block (a configuration sketch follows this list).
- The script introduces Hadoop as a framework for distributed computing that addresses both storage and computation complexities in big data scenarios.
- Hadoop's HDFS and MapReduce components are explained as solutions for managing data blocks and performing computations across multiple nodes.
- The video concludes by emphasizing Hadoop's ability to scale horizontally, allowing for the addition of more nodes to a cluster to further reduce execution time.
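For readers who want to see how the three-copy replication mentioned above is expressed in practice, here is a minimal sketch. It assumes a standard Hadoop client setup; `dfs.replication` is Hadoop's standard replication property, while the file paths are hypothetical placeholders for the 1 TB stock dataset.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch: ask HDFS to keep three copies of each block, as suggested above.
// The local and HDFS paths are hypothetical placeholders for the 1 TB dataset.
public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "3"); // three copies of every block written by this client

        FileSystem fs = FileSystem.get(conf);
        // HDFS splits the file into blocks and replicates each block to three nodes.
        fs.copyFromLocalFile(new Path("/local/data/stocks.csv"),
                             new Path("/user/analyst/stocks.csv"));
        fs.close();
    }
}
```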
Q & A
What is the main problem presented in the script?
-The main problem is calculating the maximum closing price of every stock symbol traded in a major exchange like the New York Stock Exchange or Nasdaq, given a data set size of one terabyte.
Why is the data set size a challenge for the workstation?
-The workstation has only 20 GB of free space, which is insufficient to handle a one terabyte data set, necessitating the use of a server or SAN for storage.
What do NAS and SAN stand for, and what is their purpose?
-NAS stands for Network Attached Storage and SAN stands for Storage Area Network. They are used for storing large data sets and can be accessed by any computer on the network with the proper permissions.
What are the two main challenges in solving the big data problem presented?
-The two main challenges are storage and computation. Storage is addressed by using a server or SAN, while computation requires an optimized program and efficient data transfer.
What is the average data access rate for a traditional hard disk drive?
-The average data access rate for a traditional hard disk drive is about 122 megabytes per second.
How long would it take to read a 1 terabyte file from a hard disk drive?
-It would take approximately 2 hours and 22 minutes to read a 1 terabyte file from a hard disk drive.
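As a quick check of that figure, here is a back-of-the-envelope sketch. It assumes 1 terabyte is treated as 1,048,576 MB, which is what the script's roughly 2 hour 22 minute figure implies at 122 megabytes per second.

```java
// Back-of-the-envelope transfer-time estimate using the script's numbers.
public class TransferTimeEstimate {
    public static void main(String[] args) {
        double datasetMB = 1024.0 * 1024.0; // assume 1 TB is treated as 1,048,576 MB
        double hddRateMBps = 122.0;         // average HDD data access rate from the script
        long seconds = Math.round(datasetMB / hddRateMBps);
        System.out.printf("%d s = %d h %d min%n",
                seconds, seconds / 3600, (seconds % 3600) / 60);
        // prints 8595 s = 2 h 23 min, in line with the roughly 2 h 22 min quoted above
    }
}
```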
What is the business user's reaction to the initial ETA of three hours?
-The business user is shocked by the three-hour ETA, as they were hoping for a much quicker turnaround time, ideally within 30 minutes.
What is an SSD and how does it compare to an HDD in terms of speed?
-An SSD is a Solid-State Drive, which is much faster than an HDD because it does not have moving parts and is based on flash memory. However, SSDs are also more expensive.
What is the proposed solution to reduce computation time for the big data problem?
-The proposed solution is to divide the data set into 100 equal-sized chunks and use 100 nodes to compute the data in parallel, which theoretically reduces the data access and computation time significantly.
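Extending the same back-of-the-envelope estimate to 100 nodes shows why the idea is attractive. This is only a sketch built on the script's figures; the 60-minute compute time is the script's own hypothetical, not a measurement.

```java
// Rough effect of splitting the dataset across 100 nodes, using the script's figures.
public class ParallelEstimate {
    public static void main(String[] args) {
        double readSeconds = 1024.0 * 1024.0 / 122.0; // single-node read time, ~8595 s
        double computeSeconds = 60 * 60;              // script's hypothetical 60-minute compute
        int nodes = 100;
        System.out.printf("read per node:    ~%.0f s%n", readSeconds / nodes);    // ~86 s
        System.out.printf("compute per node: ~%.0f s%n", computeSeconds / nodes); // ~36 s
        // both drop to a minute or two, which is why the 30-minute target becomes plausible
    }
}
```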
Why is storing data locally on each node's hard disk a better approach?
-Storing data locally on each node's hard disk allows for true parallel reading and eliminates the network bandwidth issue, as each node can access its own data without relying on the network.
What is the role of Hadoop in solving the big data problem?
-Hadoop is a framework for distributed processing of large data sets across clusters of commodity computers. It has two core components: HDFS for storage-related complexities and MapReduce for computational complexities, making it an efficient solution for handling big data.
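To make the MapReduce half concrete, here is a minimal mapper/reducer sketch for the maximum-closing-price problem using Hadoop's Java API. The comma-separated record layout (exchange, symbol, date, open, high, low, close, volume) is an assumption for illustration; the script never specifies the exact columns, so the field indices would need adjusting to the real dataset.

```java
import java.io.IOException;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxClosePrice {

    // Mapper: each node runs this over its local blocks and emits (symbol, close price).
    public static class MaxCloseMapper
            extends Mapper<LongWritable, Text, Text, FloatWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Assumed layout: exchange,symbol,date,open,high,low,close,volume
            String[] fields = line.toString().split(",");
            if (fields.length >= 7) {
                context.write(new Text(fields[1]),
                              new FloatWritable(Float.parseFloat(fields[6])));
            }
        }
    }

    // Reducer: consolidates the partial results from all nodes into one maximum per symbol.
    public static class MaxCloseReducer
            extends Reducer<Text, FloatWritable, Text, FloatWritable> {
        @Override
        protected void reduce(Text symbol, Iterable<FloatWritable> closes, Context context)
                throws IOException, InterruptedException {
            float max = Float.NEGATIVE_INFINITY;
            for (FloatWritable close : closes) {
                max = Math.max(max, close.get());
            }
            context.write(symbol, new FloatWritable(max));
        }
    }
}
```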
What does Hadoop offer that makes it a suitable solution for big data problems?
-Hadoop offers a scalable, distributed processing framework that can handle large data sets efficiently. It uses commodity hardware, making it cost-effective and adaptable to various cluster sizes, from small to very large.
Outlines
Introduction to Big Data Challenges
The script introduces the concept of big data and its challenges, using a scenario where an employee at a major stock exchange is tasked with calculating the maximum closing price for every stock symbol ever traded since the exchange's inception. The data set size is a staggering one terabyte, which is too large for a workstation with only 20 GB of free space. The script outlines the need for storage and computation solutions, highlighting the initial steps taken to address the problem, such as moving the data to a server with more storage capacity and the considerations for computation time.
Storage and Computation Strategies
This paragraph delves into the specifics of data storage and computation. It discusses the limitations of traditional hard disk drives (HDD) in terms of data access rates and the time it would take to read a one terabyte file. The script then explores the idea of using solid-state drives (SSD) for faster data access but acknowledges the cost implications. It also introduces the concept of parallel computation by dividing the data set into smaller chunks and processing them across multiple nodes, while addressing potential issues such as network bandwidth and data replication for fault tolerance.
Distributed Computing with Hadoop
The final paragraph introduces Hadoop as a solution to the challenges of big data processing. It explains that Hadoop consists of two core components: the Hadoop Distributed File System (HDFS) for storage-related tasks, such as data block management and replication, and MapReduce for computational tasks, which includes the processing and consolidation of results from multiple nodes. The script emphasizes the flexibility of Hadoop to scale horizontally by adding more nodes to the cluster, allowing for the efficient processing of large data sets across clusters of commodity computers.
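As an illustration of how the two components meet in code, a typical driver wires a mapper and reducer (such as the MaxClosePrice sketch earlier on this page) into a job and submits it to the cluster. The class names and input/output paths here are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Minimal driver sketch: HDFS supplies the blocks, MapReduce runs the computation,
// and the framework consolidates the per-node results into one output.
public class MaxClosePriceDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "max close price");
        job.setJarByClass(MaxClosePriceDriver.class);
        job.setMapperClass(MaxClosePrice.MaxCloseMapper.class);
        job.setReducerClass(MaxClosePrice.MaxCloseReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(FloatWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. the dataset directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```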
Keywords
Big Data
Data Storage
Computation
Network Attached Storage (NAS)
Storage Area Network (SAN)
Data Access Rate
Solid-State Drives (SSD)
Parallel Computation
Hadoop
Distributed File System
MapReduce
Commodity Computers
Highlights
Introduction to the challenges of big data, specifically calculating the maximum closing price of every stock symbol ever traded.
The problem of handling a 1 terabyte dataset with only 20 GB of free space on a workstation.
Utilizing NAS and SAN servers for large-scale data storage solutions.
The importance of data access rate and the time it takes to read large datasets from HDDs.
The concept of estimating an ETA for data processing tasks, including data transfer and computation time.
The impracticality of using SSDs for big data due to their higher cost compared to HDDs.
Proposing a solution to reduce computation time by dividing the dataset and using parallel processing.
Challenges with parallel data access and network bandwidth limitations.
The idea of storing data locally on each node to achieve true parallel reads and reduce network strain.
Addressing data loss and corruption through data replication across multiple nodes.
The complexity of coordinating data distribution and computation in a distributed system.
Introducing Hadoop as a framework for distributed processing of large datasets.
Explanation of HDFS and its role in managing storage complexities in Hadoop.
MapReduce as a programming model for handling computational complexities in Hadoop.
Hadoop's ability to horizontally scale by adding more nodes to the cluster to reduce execution time.
The flexibility of Hadoop to work with commodity computers, not requiring specialized hardware.
The practicality of starting with a small Hadoop cluster and scaling up as needed.
Conclusion on the efficiency of Hadoop in solving big data problems and its adaptability to various cluster sizes.
Transcripts
[Music]
hey guys welcome back now you know the
key characteristics of Big Data it's
now time to understand the challenges and
problems that come with big data sets so
in this lesson let's take a sample big
data problem analyze it and see how we
can arrive at a solution together ready
imagine you work at one of the major
exchanges like New York Stock Exchange
or Nasdaq one morning someone from your
risk department stops by your desk and
asks you to calculate the maximum closing
price of every stock symbol that is ever
traded in the exchange since inception
also assume the size of the data set you
were given is one terabyte so your data
set would look like this so each line in
this data set is an information about a
stock for a given date immediately the
business user who gave this problem asks
you for an ETA and when he can expect
the results Wow there is a lot to think
here so you ask him to give you some
time and you start to work what would be
your next steps you have two things to
figure out
the first one is storage and second one
is computation let's talk about storage
first so your workstation has only 20 GB
of free space but the size of the data
set is 1 terabyte so you go to your
storage team and ask them to copy the
data set to a NAS server or even a SAN
server NAS stands for network attached
storage and SAN stands for storage area
network so once the data set is copied
you ask them to give you the location of
the data set so a NAS or SAN is
connected to your network so any
computer with access to the network can
access the data provided it has
permission to see the data so far so good
the data is stored and you have access
to the data now you set out to solve the
next problem which is computation you're
a Java programmer so you wrote an
optimized Java program to parse the data
set and perform the computation
everything looks good and you are now
ready to execute the program against the
data set you realize it's already noon
the business user who gave you this
request stops by for an ETA that's an
interesting question isn't it so you
start to think what is the ETA for
this whole operation to complete and you
come up with the following for the
program to work on the data set first
the data set needs to be copied from the
storage to the working memory or Ram so
how long does it take to copy a one
terabyte data set from storage let's
take our traditional hard disk drive
that is the one that is connected to a
laptop or workstation right HDDs
or hard disk drives have magnetic platters
in which the data is stored when you
request to read data the head in the
hard disk first positions itself on the
platter and start transferring the data
from the platter to the head the speed
in which the data is transferred from
the platter to the head is called the
data access rate this is very important
so listen carefully right the average data
access rate in HDDs is usually
about 122 megabytes per second so if you
do the math to read a 1 terabyte file
from a hard disk drive you need 2 hours
and 22 minutes Wow now that is for a HDD
that is connected to your workstation
when you transfer a file from a NAS
server or from your SAN right you
should know the transfer rate of the
hard disk drives in the NAS for now we
will assume it is the same as the regular
HDD which is 122 megabytes per second
and hence it would take 2 hours and 22
minutes now what about the computation
time since you have not executed the
program yet at least once you cannot say
for sure plus your data comes from a
storage server that is attached to the
network so you have to consider the
network bandwidth also so with all that
in mind you give him an ETA of about
three hours but it could be easily over
three hours since you're not sure about
the computation time your business user
is so shocked to hear three hours for an
ETA so he has the next question can we
get it sooner than three hours say maybe
in 30 minutes you know there is no way
you can execute the results in 30
minutes of course the business cannot
wait for three hours especially in
finance where time is money right so
let's work this problem together how can
we calculate the result in less than 30
minutes let's break this down majority
of the time you spend in calculating the
result set will be attributed to two
tasks first is transferring the data
from storage or hard disk drive which
is about two and a half hours and the
second task is the computation time
right that is the time to perform the
actual calculation by your program let's
say it's going to take about 60 minutes
it could be more or it could be less I
have a crazy idea what if we replace HDD
by SSD SSD stands for solid-state drives
SSDs are a very powerful alternative to
HDDs SSDs do not have magnetic platters
or heads they do not have any moving
components and it's based on flash
memory so it is extremely fast sounds
great so why don't we use SSD in place
of HDD by doing that we can
significantly reduce the time it would
take to read the data from the storage
but here is the problem SSD comes with a
price
they are usually 5 to 6 times the price of
your regular HDD although the price
continues to go down given the data
volume that we are talking about with
respect to Big Data it is not a viable
option right now so for now we are stuck
with hard disk drives
let's talk about how we can reduce the
computation time hypothetically we think
the program will take 60 minutes to
complete also assume your program is
already optimized for execution so what
can be done next any ideas I have a
crazy idea how about dividing the one
terabyte data set into 100 equal sized
chunks or blocks and have a hundred
computers or a hundred nodes do the
computation in parallel in theory this
means you cut the data access time by a
factor of 100 and also the computation
time by a factor of 100 so with this
idea you can
during the data access time - less than
two minutes on computation time in less
than one minute so that sounds great it
is a promising idea so let's explore
even further there are a couple of
issues here if you have more than one
chunk of your data set stored in the
same hard drive you cannot get a true
parallel read because there is only one
head in your hard disk which does the
actual read but for the sake of the
argument let's assume you get a true
parallel read which means you have
hundred nodes trying to read data at the
same time now assuming the data can be
read in parallel you will now have 100
times 122 megabytes per second of data
flowing through the network
imagine this what would happen when each
one of your family members at home starts
to stream their favorite TV show or
movie at the same time using a single
internet connection at your home it
would result in a very poor streaming
experience with lot of buffering no one
in the family can enjoy their show right
what you have essentially done is
choked up your network the download
speed requested by each one of the
devices combined exceeded the download
speed offered by the internet connection
resulting in a poor service this is
exactly what will happen here when
a hundred nodes try to transfer the
data over the network at the same time
so how can we solve this why do we have
to rely on a storage which is attached
to the network why don't we bring the
data closer to the computation that is
why don't we store the data locally in
each node's hard disk so you would store
block 1 of data in node 1 block 2 of
data in node 2 etc something like this
now we can achieve a true parallel read
on all 100 nodes and also we have
eliminated the network bandwidth issue
perfect that's a significant improvement
to our design right now let's talk about
something which is very important how
many of you have suffered data loss due
to a hard disk failure I myself have
suffered twice it is not a fun situation
right I'm sure
most of you at least once faced a hard
drive failure so how can you protect
your data from hard disk failure or data
corruption etc let's take an example
let's say you have a photo of your loved
ones and you treasure that picture in
your mind there is no way you can lose
that picture how would you protect it
you would keep copies of your picture in
different places right maybe one in your
personal laptop one copy in Picasa one
copy in your external hard drive you get
the idea so if your laptop crashes you
can still get that picture from Picasa
or your external hard drive so let's do
this
why don't we copy each block of data to
two more nodes in other words we can
replicate the block in two more nodes so
in total we have three copies of each
block take a look at this node one has
blocks one seven and ten node two has
blocks seven eleven and forty-two node
3 has blocks one seven and ten so if
block one is unavailable in node two due
to a hard disk failure or corruption in
the block it can be easily fetched from
node 3 so this means that node one two
and three must have access to one
another and they should be connected in
a network right conceptually this is
great but there are some challenges
implementing it let's think about this
how does node one know that node 3 has
block 1 and who decides block 7 for
instance should be stored in node one
two and three first of all who will
break the one terabyte into hundred
blocks so as you can see this solution
doesn't look that easy does it and
that's just the storage part of it
computation brings other challenges node
one can only compute the maximum close
price from just block one similarly node 2
can only compute the maximum close
price from block 2 this brings up a
problem because for example data for
stock GE can be in block 1
can also be in block two and could also
be on block 82 for instance right so you
have to consolidate the result from all
the nodes together to compute the final
result so who's going to coordinate all
that the solution we are proposing is
distributed computing and as we are
seeing there are several complexities
involved in implementing the solution
both at the storage layer and also at
the computation layer the answer to all
these open questions and complexities is
Hadoop Hadoop offers a framework for
distributed computing so Hadoop has two
core components HDFS and MapReduce HDFS
stands for a Hadoop distributed file
system and it takes care of all your
storage related complexities like
splitting your data set into blocks
replicating each block to more than one
node and also keep track of which block
is stored on which node etc MapReduce is
a programming model and Hadoop
implements MapReduce and it takes care
of all the computational complexities so
Hadoop framework takes care of bringing
all the intermediate results from every
single node to offer a consolidated
output so what is Hadoop Hadoop is a
framework for distributed processing of
large data sets across clusters of
commodity computers the last two words
in the definition is what makes Hadoop even
more special commodity computers that
means all the hundred nodes that we have
in the cluster do not have to have any
specialized hardware they're enterprise
grade server nodes with a processor hard
disk and RAM in each of them that's it
there is nothing more special about that
but don't confuse commodity computers
with cheap hardware commodity computers
mean inexpensive hardware and not cheap
hardware now you know what Hadoop is and
how it can offer an efficient solution
to your maximum close price problem
against a one terabyte data set now you
can go back to the business and propose
hadoop to solve the problem and to
achieve the execution time that your users
are expecting but if you propose a
hundred node cluster to your business
expect to get some crazy looks but
that's the beauty of Hadoop you don't
need to have a hundred node cluster we
have seen successful Hadoop production
environments from small ten node cluster
all the way to hundred to thousand node
cluster you can simply even start with a
ten node cluster and if you want to
reduce the execution time even further
all you have to do is add more nodes to
your cluster that's simple in other
words Hadoop will horizontally scale
so now you know what is Hadoop and
conceptually how it solves the problem
of big datasets with that let's wrap this
lesson and move on to the next lesson