005 Understanding Big Data Problem

15 Apr 2018, 14:25

Summary

TL;DR: This script dives into the challenges of handling big data, using a scenario where a programmer at a major stock exchange is tasked with calculating the maximum closing price of every stock symbol from a 1TB dataset. The discussion covers storage solutions like NAS and SAN, the limitations of HDDs, and the potential of SSDs. It introduces the concept of parallel computation and data replication for fault tolerance. The script concludes with an introduction to Hadoop, a framework for distributed computing that efficiently manages large datasets across clusters of inexpensive hardware, offering scalability and a solution to the complex problem presented.

Takeaways

  • πŸ“ˆ The video discusses the challenges of handling big data, particularly focusing on a scenario where one is asked to calculate the maximum closing price of every stock symbol traded on a major exchange like NYSE or NASDAQ.
  • πŸ’Ύ The script highlights the importance of storage, noting that with only 20 GB of free space on a workstation, a 1 terabyte dataset requires a more robust solution like a NAS or SAN server.
  • πŸ”„ The video emphasizes the time it takes to transfer large datasets, using the example of a 1 terabyte dataset taking approximately 2 hours and 22 minutes to transfer from a hard disk drive.
  • πŸš€ It introduces the concept of using SSDs over HDDs to significantly reduce data transfer time, but also acknowledges the higher cost of SSDs as a potential barrier.
  • πŸ€” The script poses the question of how to reduce computation time for such a large dataset, suggesting parallel processing as a potential solution.
  • πŸ”’ The idea of dividing the dataset into 100 equal parts and processing them in parallel on 100 nodes is presented to theoretically reduce computation time.
  • πŸ’‘ The video discusses the network bandwidth issue that arises when multiple nodes try to transfer data simultaneously, suggesting local storage on each node as a solution.
  • πŸ”’ The importance of data replication to prevent data loss in case of hard disk failure is highlighted, with the suggestion of keeping three copies of each data block.
  • πŸ”„ The script introduces Hadoop as a framework for distributed computing that addresses both storage and computation complexities in big data scenarios.
  • πŸ› οΈ Hadoop's HDFS and MapReduce components are explained as solutions for managing data blocks and performing computations across multiple nodes.
  • 🌐 The video concludes by emphasizing Hadoop's ability to scale horizontally, allowing for the addition of more nodes to a cluster to further reduce execution time.

Q & A

  • What is the main problem presented in the script?

    -The main problem is calculating the maximum closing price of every stock symbol traded in a major exchange like the New York Stock Exchange or Nasdaq, given a data set size of one terabyte.

  • Why is the data set size a challenge for the workstation?

    -The workstation has only 20 GB of free space, which is insufficient to handle a one terabyte data set, necessitating the use of a server or SAN for storage.

  • What do NAS and SAN stand for, and what is their purpose?

    -NAS stands for Network Attached Storage and SAN stands for Storage Area Network. Both are used for storing large data sets and can be accessed by any computer on the network with the proper permissions.

  • What are the two main challenges in solving the big data problem presented?

    -The two main challenges are storage and computation. Storage is addressed by using a server or SAN, while computation requires an optimized program and efficient data transfer.

  • What is the average data access rate for a traditional hard disk drive?

    -The average data access rate for a traditional hard disk drive is about 122 megabytes per second.

  • How long would it take to read a 1 terabyte file from a hard disk drive?

    -It would take approximately 2 hours and 22 minutes to read a 1 terabyte file from a hard disk drive at the 122 MB/s access rate; the worked arithmetic appears just after this Q&A section.

  • What is the business user's reaction to the initial ETA of three hours?

    -The business user is shocked by the three-hour ETA, as they were hoping for a much quicker turnaround time, ideally within 30 minutes.

  • What is an SSD and how does it compare to an HDD in terms of speed?

    -An SSD is a Solid-State Drive, which is much faster than an HDD because it does not have moving parts and is based on flash memory. However, SSDs are also more expensive.

  • What is the proposed solution to reduce computation time for the big data problem?

    -The proposed solution is to divide the data set into 100 equal-sized chunks and use 100 nodes to compute the data in parallel, which theoretically reduces the data access and computation time significantly.

  • Why is storing data locally on each node's hard disk a better approach?

    -Storing data locally on each node's hard disk allows for true parallel reading and eliminates the network bandwidth issue, as each node can access its own data without relying on the network.

  • What is the role of Hadoop in solving the big data problem?

    -Hadoop is a framework for distributed processing of large data sets across clusters of commodity computers. It has two core components: HDFS for storage-related complexities and MapReduce for computational complexities, making it an efficient solution for handling big data.

  • What does Hadoop offer that makes it a suitable solution for big data problems?

    -Hadoop offers a scalable, distributed processing framework that can handle large data sets efficiently. It uses commodity hardware, making it cost-effective and adaptable to various cluster sizes, from small to very large.
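
To make the two estimates above concrete, here is the arithmetic, assuming 1 TB is treated as roughly 1,048,576 MB (that binary rounding accounts for the small gap from the quoted 2 hours 22 minutes) and taking the script's hypothetical 60-minute compute time at face value:

```latex
% Sequential read of the full 1 TB dataset at the average HDD rate of 122 MB/s
t_{\text{read}} \approx \frac{1{,}048{,}576~\text{MB}}{122~\text{MB/s}} \approx 8{,}595~\text{s} \approx 2~\text{h}~23~\text{min}

% Splitting the data across 100 nodes, each reading and processing ~1/100 of it
t_{\text{read}}^{(100)} \approx \frac{8{,}595~\text{s}}{100} \approx 86~\text{s} < 2~\text{min}
\qquad
t_{\text{compute}}^{(100)} \approx \frac{60~\text{min}}{100} < 1~\text{min}
```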

Outlines

00:00

πŸ“ˆ Introduction to Big Data Challenges

The script introduces the concept of big data and its challenges, using a scenario where an employee at a major stock exchange is tasked with calculating the maximum closing price for every stock symbol ever traded since the exchange's inception. The data set size is a staggering one terabyte, which is too large for a workstation with only 20 GB of free space. The script outlines the need for storage and computation solutions, highlighting the initial steps taken to address the problem, such as moving the data to a server with more storage capacity and the considerations for computation time.

05:00

πŸ’Ύ Storage and Computation Strategies

This paragraph delves into the specifics of data storage and computation. It discusses the limitations of traditional hard disk drives (HDD) in terms of data access rates and the time it would take to read a one terabyte file. The script then explores the idea of using solid-state drives (SSD) for faster data access but acknowledges the cost implications. It also introduces the concept of parallel computation by dividing the data set into smaller chunks and processing them across multiple nodes, while addressing potential issues such as network bandwidth and data replication for fault tolerance.

10:01

πŸ”„ Distributed Computing with Hadoop

The final paragraph introduces Hadoop as a solution to the challenges of big data processing. It explains that Hadoop consists of two core components: the Hadoop Distributed File System (HDFS) for storage-related tasks, such as data block management and replication, and MapReduce for computational tasks, which includes the processing and consolidation of results from multiple nodes. The script emphasizes the flexibility of Hadoop to scale horizontally by adding more nodes to the cluster, allowing for the efficient processing of large data sets across clusters of commodity computers.

Keywords

πŸ’‘Big Data

Big Data refers to data sets that are so large and complex that traditional data processing software is inadequate to deal with them. In the video, Big Data is the central theme, with the script discussing the challenges of handling such vast amounts of data, like calculating the maximum closing price of stocks from a one-terabyte dataset.

πŸ’‘Data Storage

Data Storage is the means by which digital information is collected, retained, and protected for future use. The script addresses the issue of storing a one-terabyte dataset, emphasizing the limitations of a workstation with only 20 GB of free space, and the need for solutions like Network Attached Storage (NAS) or Storage Area Network (SAN).

πŸ’‘Computation

Computation in this context refers to the process of performing calculations or processing data. The script discusses the challenges of computation when dealing with big data, such as the time it would take to process a one-terabyte dataset using a Java program optimized for the task.

πŸ’‘Network Attached Storage (NAS)

NAS is a file-level computer data storage server that connects to a computer network providing data access to a heterogeneous group of clients. The script mentions NAS as a solution for storing large datasets, allowing any computer with network access to retrieve the data if authorized.

πŸ’‘Storage Area Network (SAN)

SAN is a dedicated high-speed network that provides block-level access to data from multiple storage devices. In the script, SAN is considered as an alternative to NAS for storing the large dataset, highlighting the need for high-speed data transfer capabilities.

πŸ’‘Data Access Rate

Data Access Rate is the speed at which data is transferred from a storage device to a computer's memory. The script calculates the time required to read a one-terabyte file from a Hard Disk Drive (HDD) using the average data access rate of 122 megabytes per second, which is crucial for understanding the time constraints of data processing.

πŸ’‘Solid-State Drives (SSD)

SSDs are storage devices that use flash memory to store data, offering faster read and write speeds compared to traditional HDDs. The script suggests using SSDs to reduce data read times significantly, but also notes the higher cost associated with SSDs, making it less viable for large-scale data storage in the context of big data.
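
As a rough, hedged comparison (the video does not quote an SSD transfer rate): assuming a typical SATA SSD sustains on the order of 500 MB/s for sequential reads, the same 1 TB scan would drop from roughly 2 hours 23 minutes to about half an hour:

```latex
t_{\text{read}}^{\text{SSD}} \approx \frac{1{,}048{,}576~\text{MB}}{500~\text{MB/s}} \approx 2{,}097~\text{s} \approx 35~\text{min}
```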

πŸ’‘Parallel Computation

Parallel Computation is a method of performing calculations in which many calculations are carried out simultaneously, operating on the principle that large problems can often be divided into smaller ones that can be solved at the same time. The script proposes dividing the dataset into 100 chunks and processing them in parallel across 100 nodes to reduce computation time.
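
As a rough single-machine illustration of that split-and-merge idea, the sketch below farms chunks of parsed records out to a thread pool and then merges the per-chunk maxima. The chunk loader DataChunks.load() and the record layout (symbol in field 0, close price in field 5) are hypothetical stand-ins, not part of any real API:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Single-machine sketch of "split the data into chunks, compute each chunk in
// parallel, then merge the partial results".
public class ParallelMaxClose {

  // Partial result for one chunk: symbol -> maximum close price seen in that chunk.
  static Map<String, Double> maxPerSymbol(List<String[]> records) {
    Map<String, Double> partial = new HashMap<>();
    for (String[] r : records) {
      partial.merge(r[0], Double.parseDouble(r[5]), Math::max);
    }
    return partial;
  }

  public static void main(String[] args) throws Exception {
    List<List<String[]>> chunks = DataChunks.load();   // hypothetical chunk loader
    ExecutorService pool =
        Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());

    // Submit one task per chunk so all chunks are processed concurrently.
    List<Future<Map<String, Double>>> partials = new ArrayList<>();
    for (List<String[]> chunk : chunks) {
      partials.add(pool.submit(() -> maxPerSymbol(chunk)));
    }

    // Merge the per-chunk maxima into the global answer (the same kind of merge
    // a MapReduce reducer performs across nodes).
    Map<String, Double> global = new HashMap<>();
    for (Future<Map<String, Double>> f : partials) {
      f.get().forEach((symbol, max) -> global.merge(symbol, max, Math::max));
    }
    pool.shutdown();

    global.forEach((symbol, max) -> System.out.println(symbol + "\t" + max));
  }
}
```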

πŸ’‘Hadoop

Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is introduced in the script as a solution to manage the complexities of storage and computation for big data, with its core components being the Hadoop Distributed File System (HDFS) and MapReduce.

πŸ’‘Distributed File System

A Distributed File System (DFS) is a system that manages files across multiple machines, providing a unified namespace and location transparency. In the context of the video, HDFS is a type of DFS that Hadoop uses to store files in a distributed fashion, handling the complexities of data block management and replication.
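
For a flavour of how HDFS absorbs those storage chores, here is a minimal sketch using Hadoop's standard Java FileSystem API to load a file, request three replicas, and list which hosts hold each block. The paths are placeholders, and the cluster address is assumed to come from the usual core-site.xml/hdfs-site.xml files on the client's classpath:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: HDFS splits the file into blocks, replicates them, and tracks
// block-to-node placement so the client never has to.
public class HdfsStorageSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();          // reads cluster config from the classpath
    FileSystem fs = FileSystem.get(conf);

    Path local = new Path("/data/stocks.csv");         // placeholder local path
    Path inHdfs = new Path("/exchange/stocks.csv");    // placeholder HDFS path

    fs.copyFromLocalFile(local, inHdfs);               // HDFS splits the file into blocks
    fs.setReplication(inHdfs, (short) 3);              // keep 3 copies of every block

    // HDFS already knows which node holds which block; we simply ask for it.
    FileStatus status = fs.getFileStatus(inHdfs);
    for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
      System.out.println(block.getOffset() + " -> " + String.join(",", block.getHosts()));
    }
    fs.close();
  }
}
```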

πŸ’‘MapReduce

MapReduce is a programming model and an associated implementation for processing and generating large datasets. The script explains that Hadoop implements MapReduce to handle computational complexities, allowing for the processing of data in parallel across multiple nodes and the consolidation of results.
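
Conceptually, a MapReduce job for the max-closing-price example could look like the sketch below, written against Hadoop's org.apache.hadoop.mapreduce API. The comma-separated record layout (symbol in field 0, close price in field 5) is an assumption for illustration, not the exact format used in the video:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxClosePrice {

  // Mapper: emits (symbol, closePrice) for every input line.
  public static class MaxCloseMapper
      extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // Assumed layout: symbol,date,open,high,low,close,... (comma-separated)
      String[] fields = value.toString().split(",");
      if (fields.length >= 6) {
        context.write(new Text(fields[0]),
                      new DoubleWritable(Double.parseDouble(fields[5])));
      }
    }
  }

  // Reducer: keeps the maximum close price seen for each symbol.
  public static class MaxCloseReducer
      extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    @Override
    protected void reduce(Text key, Iterable<DoubleWritable> values, Context context)
        throws IOException, InterruptedException {
      double max = Double.NEGATIVE_INFINITY;
      for (DoubleWritable v : values) {
        max = Math.max(max, v.get());
      }
      context.write(key, new DoubleWritable(max));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "max close price");
    job.setJarByClass(MaxClosePrice.class);
    job.setMapperClass(MaxCloseMapper.class);
    job.setCombinerClass(MaxCloseReducer.class);   // max is associative, so safe as a combiner
    job.setReducerClass(MaxCloseReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(DoubleWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```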

πŸ’‘Commodity Computers

Commodity Computers refer to inexpensive, mass-produced, interchangeable hardware components. The script emphasizes that Hadoop can utilize commodity computers for its clusters, meaning there is no need for specialized or high-end hardware to process big data, which makes the solution more accessible and cost-effective.

Highlights

Introduction to the challenges of big data, specifically calculating the maximum closing price of every stock symbol ever traded.

The problem of handling a 1 terabyte dataset with only 20 GB of free space on a workstation.

Utilizing NAS and SAN servers for large-scale data storage solutions.

The importance of data access rate and the time it takes to read large datasets from HDDs.

The concept of estimating an ETA for data processing tasks, including data transfer and computation time.

The impracticality of using SSDs for big data due to their higher cost compared to HDDs.

Proposing a solution to reduce computation time by dividing the dataset and using parallel processing.

Challenges with parallel data access and network bandwidth limitations.

The idea of storing data locally on each node to achieve true parallel reads and reduce network strain.

Addressing data loss and corruption through data replication across multiple nodes.

The complexity of coordinating data distribution and computation in a distributed system.

Introducing Hadoop as a framework for distributed processing of large datasets.

Explanation of HDFS and its role in managing storage complexities in Hadoop.

MapReduce as a programming model for handling computational complexities in Hadoop.

Hadoop's ability to horizontally scale by adding more nodes to the cluster to reduce execution time.

The flexibility of Hadoop to work with commodity computers, not requiring specialized hardware.

The practicality of starting with a small Hadoop cluster and scaling up as needed.

Conclusion on the efficiency of Hadoop in solving big data problems and its adaptability to various cluster sizes.

Transcripts

play00:00

[Music]

play00:13

hey guys welcome back now you know the

play00:16

key characteristics of Big Data there's

play00:19

now time to understand the challenges of

play00:22

problems that come with big data set so

play00:25

in this lesson let's take a sample big

play00:27

data problem analyze it and see how we

play00:30

can arrive at a solution together ready

play00:33

imagine you work at one of the major

play00:35

exchanges like New York Stock Exchange

play00:37

or Nasdaq one morning someone from your

play00:40

risk department stops by your desk and

play00:43

asks you to calculate the maximum closing

play00:45

price of every stock symbol that is ever

play00:48

traded in the exchange since inception

play00:50

also assume the size of the data set you

play00:53

were given as one terabyte so your data

play00:55

set would look like this so each line in

play00:57

this data set is an information about a

play00:59

stock for a given date immediately the

play01:02

business user who gave this problem asks

play01:04

you for an ETA and when he can expect

play01:07

the results Wow there is a lot to think

play01:09

here so you ask him to give you some

play01:12

time and you start to work what would be

play01:14

your next steps you have two things to

play01:17

figure out

play01:17

the first one is storage and second one

play01:20

is computation let's talk about storage

play01:23

first so your workstation has only 20 GB

play01:25

of free space but the size of the data

play01:28

said is 1 terabyte so you go to your

play01:30

storage team and ask them to copy the

play01:32

data set to a NAS server or even a SAN

play01:35

server NAS stands for network attached

play01:37

storage and SAN stands for storage area

play01:40

network so once the data set is copied

play01:43

you ask them to give you the location of

play01:44

the data set so a NAS or SAN is

play01:47

connected to your network so any

play01:48

computer with access to the network can

play01:51

access the data provided they have

play01:53

permission to see the data so far so good

play01:55

the data is stored and you have access

play01:58

to the data now you set out to solve the

play02:01

next problem which is computation you're

play02:03

a Java programmer so you wrote an

play02:05

optimized Java program to parse the data

play02:07

set and perform the computation

play02:10

everything looks good and you are now

play02:12

ready to execute the program against the

play02:14

data set you realize it's already noon

play02:17

the business user who gave you this

play02:19

request stops by for an ETA that's an

play02:22

interesting question isn't it so you

play02:24

start to think what is the ETA for

play02:27

this whole operation to complete and you

play02:29

come up with the following for the

play02:33

program to work on the data set first

play02:35

the data set needs to be copied from the

play02:38

storage to the working memory or Ram so

play02:41

how long does it take to copy a one

play02:44

terabyte data set from storage let's

play02:48

take our traditional hard disk drive

play02:50

that is the one that is connected to a

play02:52

laptop or workstation except for right

play02:55

HDDs

play02:56

a hard disk drive has magnetic platters

play02:59

in which the data is stored when you

play03:01

request to read data the head in the

play03:03

hard disk first positions itself on the

play03:05

platter and start transferring the data

play03:08

from the platter to the head the speed

play03:12

in which the data is transferred from

play03:14

the platter to the head is called the

play03:16

data access rate this is very important

play03:18

so listen carefully the average data

play03:21

access rate in HDDs is usually

play03:24

about 122 megabytes per second so if you

play03:28

do the math to read a 1 terabyte file

play03:31

from a hard disk drive you need 2 hours

play03:34

and 22 minutes Wow now that is for a HDD

play03:39

that is connected to your workstation

play03:41

when you transfer a file from a NAS

play03:44

server or from your SAN even right you

play03:46

should know the transfer rate of the

play03:48

hard disk drives in the NAS for now we

play03:51

will assume it is same as the regular

play03:53

HDD which is 122 megabytes per second

play03:56

and hence it would take 2 hours and 22

play03:59

minutes now what about the computation

play04:02

time since you have not executed the

play04:04

program yet at least once you cannot say

play04:07

for sure plus your data comes from a

play04:10

storage server that is attached to the

play04:12

network so you have to consider the

play04:13

network bandwidth also so with all that

play04:16

in mind you give him an ETA of about

play04:18

three hours but it could be easily over

play04:21

three hours since you're not sure about

play04:23

the computation time your business user

play04:26

is so shocked to hear three hours for an

play04:29

ETA so he has the next question can we

play04:32

get it sooner than three hours say maybe

play04:34

in 30 minutes you know there is no way

play04:37

you can execute the results in 30

play04:38

minutes of course the business cannot

play04:41

wait for three hours especially in

play04:42

finance where time is money right so

play04:44

let's work this problem together how can

play04:46

we calculate the result in less than 30

play04:49

minutes let's break this down majority

play04:53

of the time you spend in calculating the

play04:55

result set will be attributed to two

play04:57

tasks first is transferring the data

play05:00

from storage our hard disk drive which

play05:02

is about two and a half hours and the

play05:05

second task is the computation time

play05:06

right that is the time to perform the

play05:09

actual calculation by your program let's

play05:12

say it's going to take about 60 minutes

play05:14

it could be more or it could be less I

play05:17

have a crazy idea what if we replace HDD

play05:21

by SSD SSD stands for solid-state drives

play05:25

SSDs are a very powerful alternative to

play05:29

HDD SSD does not have magnetic platters

play05:32

or heads they do not have any moving

play05:35

components and it's based on flash

play05:38

memory so it is extremely fast sounds

play05:41

great so why don't we use SSD in place

play05:44

of HDD by doing that we can

play05:47

significantly reduce the time it would

play05:50

take to read the data from the storage

play05:52

but here is the problem SSD comes with a

play05:56

price

play05:56

they are usually 5 to 6 times the price of

play05:59

your regular HDD although the price

play06:01

continues to go down given the data

play06:04

volume that we are talking about with

play06:05

respect to Big Data it is not a viable

play06:08

option right now so for now we are stuck

play06:10

with hard disk drives

play06:11

let's talk about how we can reduce the

play06:14

computation time hypothetically we think

play06:16

the program will take 60 minutes to

play06:18

complete also assume your program is

play06:20

already optimized for execution so what

play06:23

can be done next any ideas I have a

play06:29

crazy idea how about dividing the one

play06:32

terabyte data set into 100 equal sized

play06:36

chunks or blocks and have hundred

play06:40

computers or hundred nodes do the

play06:42

computation in parallel in theory this

play06:45

means you cut the data access rate by a

play06:48

factor of 100 and also the computation

play06:51

time by a factor of 100 so with this

play06:54

idea you can

play06:55

bring the data access time to less than

play06:57

two minutes and the computation time to less

play07:00

than one minute so that sounds great it

play07:03

is a promising idea so let's explore

play07:05

even further there are a couple of

play07:07

issues here if you have more than one

play07:09

chunk of your data set stored in the

play07:12

same hard drive you cannot get a true

play07:14

parallel read because there is only one

play07:16

head in your hard disk which does the

play07:19

actual read but for the sake of the

play07:21

argument let's assume you get a true

play07:23

parallel read which means you have

play07:25

hundred nodes trying to read data at the

play07:27

same time now assuming the data can be

play07:30

read in parallel you will now have 100

play07:33

times 122 megabytes per second of data

play07:37

flowing through the network

play07:39

imagine this what would happen when each

play07:42

one of your family member at home starts

play07:44

to stream their favorite TV show or

play07:47

movie at the same time using a single

play07:49

internet connection at your home it

play07:51

would result in a very poor streaming

play07:53

experience with lot of buffering no one

play07:56

in the family can enjoy their show right

play07:58

what you have essentially done is

play08:00

choked up your network the download

play08:02

speed requested by each one of the

play08:04

devices combined exceeded the download

play08:07

speed offered by the internet connection

play08:08

resulting in a poor service this is

play08:11

exactly what will happen here when

play08:13

hundred nodes trying to transfer the

play08:15

data over the network at the same time

play08:17

so how can we solve this why do we have

play08:23

to rely on a storage which is attached

play08:25

to the network why don't we bring the

play08:28

data closer to the computation that is

play08:30

why don't we store the data locally in

play08:33

each nodes hard disk so you would store

play08:36

block 1 of data in node 1 block 2 of

play08:39

data in node 2 etc something like this

play08:41

now we can achieve a true parallel read

play08:45

on all 100 nodes and also we have

play08:49

eliminated the network bandwidth issue

play08:51

perfect that's a significant improvement

play08:54

in our design right now let's talk about

play08:56

something which is very important how

play08:59

many of you have suffered data loss due

play09:01

to a hard disk failure I myself have

play09:04

suffered twice it is not a fun situation

play09:07

right I'm sure

play09:08

most of you at least once faced a hard

play09:10

drive failure so how can you protect

play09:14

your data from hard disk failure or data

play09:16

corruption etc let's take an example

play09:18

let's say you have a photo of your loved

play09:21

ones and you treasure that picture in

play09:23

your mind there is no way you can lose

play09:25

that picture how would you protect it

play09:28

you would keep copies of your picture in

play09:31

different places right maybe one in your

play09:33

personal laptop one copy in Picasa one

play09:36

copying your external hard drive you get

play09:38

the idea so if your laptop crashes you

play09:41

can still get that picture from Picasa

play09:44

or your external hard drive so let's do

play09:47

this

play09:47

why don't we copy each block of data to

play09:51

two more nodes in other words we can

play09:54

replicate the block in two more nodes so

play09:58

in total we have three copies of each

play10:00

block take a look at this node one has

play10:05

blocks one seven and ten node two has

play10:08

blocks seven eleven and forty-two node

play10:11

3 has blocks one seven and ten so if

play10:16

block one is unavailable in node two due

play10:19

to a hard disk failure or corruption in

play10:21

the block it can be easily fetched from

play10:23

node 3 so this means that node one two

play10:27

and three must have access to one

play10:30

another and they should be connected in

play10:32

a network right conceptually this is

play10:34

great but there are some challenges

play10:36

implementing it let's think about this

play10:39

how does node one know that node 3 has

play10:42

block 1 and who decides block 7 for

play10:45

instance should be stored in node one

play10:47

two and three first of all who will

play10:50

break the one terabyte into hundred

play10:52

blocks so as you can see this solution

play10:55

doesn't look that easy isn't it and

play10:57

that's just the storage part of it

play11:00

computation brings other challenges node

play11:04

one can only compute the maximum close

play11:06

price from just block one similarly node 2

play11:09

can only compute the maximum close

play11:12

price from block 2 this brings up a

play11:16

problem because for example data for

play11:19

stock GE can be in block 1

play11:22

can also be in block two and could also

play11:25

be on block 82 for instance right so you

play11:28

have to consolidate the result from all

play11:31

the nodes together to compute the final

play11:33

result so who's going to coordinate all

play11:36

that the solution we are proposing is

play11:39

distributed computing and as we are

play11:41

seeing there are several complexities

play11:43

involved in implementing the solution

play11:46

both at the storage layer and also at

play11:49

the computation layer the answer to all

play11:52

these open questions and complexities is

play11:55

Hadoop Hadoop offers a framework for

play11:59

distributed computing so Hadoop has two

play12:02

core components HDFS and MapReduce HDFS

play12:06

stands for a Hadoop distributed file

play12:08

system and it takes care of all your

play12:11

storage related complexities like

play12:13

splitting your data set into blocks

play12:15

replicating each block to more than one

play12:18

node and also keep track of which block

play12:21

is stored on which node etc MapReduce is

play12:24

a programming model and Hadoop

play12:25

implements MapReduce and it takes care

play12:28

of all the computational complexities so

play12:31

Hadoop framework takes care of bringing

play12:33

all the intermediate results from every

play12:35

single node to offer a consolidated

play12:38

output so what is Hadoop Hadoop is a

play12:41

framework for distributed processing of

play12:43

large data sets across clusters of

play12:46

commodity computers the last two words

play12:49

in the definition is what makes Hadoop even

play12:52

more special commodity computers that

play12:55

means all the hundred nodes that we have

play12:58

in the cluster does not have to have any

play13:01

specialized hardware they're enterprise

play13:03

grade server nodes with a processor hard

play13:06

disk and RAM in each of them that's it

play13:09

there is nothing more special about that

play13:11

but don't confuse commodity computers

play13:13

with cheap hardware commodity computers

play13:16

mean inexpensive hardware and not cheap

play13:19

hardware now you know what Hadoop is and

play13:23

how it can offer an efficient solution

play13:25

to your maximum close price problem

play13:28

against a one terabyte data set now you

play13:31

can go back to the business and propose

play13:33

hadoop to solve the problem and to

play13:35

achieve the execution time that your users

play13:37

are expecting but if you propose a

play13:40

hundred node cluster to your business

play13:42

expect to get some crazy looks but

play13:45

that's the beauty of Hadoop you don't

play13:47

need to have a hundred node cluster we

play13:49

have seen successful Hadoop production

play13:51

environments from small ten node cluster

play13:53

all the way to hundred to thousand node

play13:57

cluster you can simply even start with a

play14:00

ten node cluster and if you want to

play14:02

reduce the execution time even further

play14:04

all you have to do is add more nodes to

play14:07

your cluster that's simple in other

play14:09

words Hadoop will horizontally scale

play14:12

so now you know what is Hadoop and

play14:14

conceptually how it solves the problem

play14:17

of big datasets with that let's wrap this

play14:21

lesson and move on to the next lesson


Related Tags
Big Data, Hadoop, Data Storage, Data Computation, SSD vs HDD, Parallel Processing, Network Bandwidth, Data Replication, Distributed Computing, Financial Analysis