005 Understanding Big Data Problem
Summary
TL;DR: This script dives into the challenges of handling big data, using a scenario where a programmer at a major stock exchange is tasked with calculating the maximum closing price of every stock symbol from a 1TB dataset. The discussion covers storage solutions like NAS and SAN, the limitations of HDDs, and the potential of SSDs. It introduces the concept of parallel computation and data replication for fault tolerance. The script concludes with an introduction to Hadoop, a framework for distributed computing that efficiently manages large datasets across clusters of inexpensive hardware, offering scalability and a solution to the complex problem presented.
Takeaways
- The video discusses the challenges of handling big data, particularly focusing on a scenario where one is asked to calculate the maximum closing price of every stock symbol traded on a major exchange like NYSE or NASDAQ.
- The script highlights the importance of storage, noting that with only 20 GB of free space on a workstation, a 1 terabyte dataset requires a more robust solution like a NAS or SAN server.
- The video emphasizes the time it takes to transfer large datasets, using the example of a 1 terabyte dataset taking approximately 2 hours and 22 minutes to transfer from a hard disk drive.
- It introduces the concept of using SSDs over HDDs to significantly reduce data transfer time, but also acknowledges the higher cost of SSDs as a potential barrier.
- The script poses the question of how to reduce computation time for such a large dataset, suggesting parallel processing as a potential solution.
- The idea of dividing the dataset into 100 equal parts and processing them in parallel on 100 nodes is presented to theoretically reduce computation time.
- The video discusses the network bandwidth issue that arises when multiple nodes try to transfer data simultaneously, suggesting local storage on each node as a solution.
- The importance of data replication to prevent data loss in case of hard disk failure is highlighted, with the suggestion of keeping three copies of each data block (a configuration sketch follows this list).
- The script introduces Hadoop as a framework for distributed computing that addresses both storage and computation complexities in big data scenarios.
- Hadoop's HDFS and MapReduce components are explained as solutions for managing data blocks and performing computations across multiple nodes.
- The video concludes by emphasizing Hadoop's ability to scale horizontally, allowing for the addition of more nodes to a cluster to further reduce execution time.
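For readers who want to see how the three-copy replication mentioned above is expressed in practice, here is a minimal sketch. It assumes a standard Hadoop client setup; `dfs.replication` is Hadoop's standard replication property, while the file paths are hypothetical placeholders for the 1 TB stock dataset.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch: ask HDFS to keep three copies of each block, as suggested above.
// The local and HDFS paths are hypothetical placeholders for the 1 TB dataset.
public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "3"); // three copies of every block written by this client

        FileSystem fs = FileSystem.get(conf);
        // HDFS splits the file into blocks and replicates each block to three nodes.
        fs.copyFromLocalFile(new Path("/local/data/stocks.csv"),
                             new Path("/user/analyst/stocks.csv"));
        fs.close();
    }
}
```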
Q & A
What is the main problem presented in the script?
-The main problem is calculating the maximum closing price of every stock symbol traded in a major exchange like the New York Stock Exchange or Nasdaq, given a data set size of one terabyte.
Why is the data set size a challenge for the workstation?
-The workstation has only 20 GB of free space, which is insufficient to handle a one terabyte data set, necessitating the use of a server or SAN for storage.
What do NAS and SAN stand for, and what is their purpose?
-NAS stands for Network Attached Storage and SAN stands for Storage Area Network. They are used for storing large data sets and can be accessed by any computer on the network with the proper permissions.
What are the two main challenges in solving the big data problem presented?
-The two main challenges are storage and computation. Storage is addressed by using a server or SAN, while computation requires an optimized program and efficient data transfer.
What is the average data access rate for a traditional hard disk drive?
-The average data access rate for a traditional hard disk drive is about 122 megabytes per second.
How long would it take to read a 1 terabyte file from a hard disk drive?
-It would take approximately 2 hours and 22 minutes to read a 1 terabyte file from a hard disk drive.
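As a quick check of that figure, here is a back-of-the-envelope sketch. It assumes 1 terabyte is treated as 1,048,576 MB, which is what the script's roughly 2 hour 22 minute figure implies at 122 megabytes per second.

```java
// Back-of-the-envelope transfer-time estimate using the script's numbers.
public class TransferTimeEstimate {
    public static void main(String[] args) {
        double datasetMB = 1024.0 * 1024.0; // assume 1 TB is treated as 1,048,576 MB
        double hddRateMBps = 122.0;         // average HDD data access rate from the script
        long seconds = Math.round(datasetMB / hddRateMBps);
        System.out.printf("%d s = %d h %d min%n",
                seconds, seconds / 3600, (seconds % 3600) / 60);
        // prints 8595 s = 2 h 23 min, in line with the roughly 2 h 22 min quoted above
    }
}
```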
What is the business user's reaction to the initial ETA of three hours?
-The business user is shocked by the three-hour ETA, as they were hoping for a much quicker turnaround time, ideally within 30 minutes.
What is an SSD and how does it compare to an HDD in terms of speed?
-An SSD is a Solid-State Drive, which is much faster than an HDD because it does not have moving parts and is based on flash memory. However, SSDs are also more expensive.
What is the proposed solution to reduce computation time for the big data problem?
-The proposed solution is to divide the data set into 100 equal-sized chunks and use 100 nodes to compute the data in parallel, which theoretically reduces the data access and computation time significantly.
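Extending the same back-of-the-envelope estimate to 100 nodes shows why the idea is attractive. This is only a sketch built on the script's figures; the 60-minute compute time is the script's own hypothetical, not a measurement.

```java
// Rough effect of splitting the dataset across 100 nodes, using the script's figures.
public class ParallelEstimate {
    public static void main(String[] args) {
        double readSeconds = 1024.0 * 1024.0 / 122.0; // single-node read time, ~8595 s
        double computeSeconds = 60 * 60;              // script's hypothetical 60-minute compute
        int nodes = 100;
        System.out.printf("read per node:    ~%.0f s%n", readSeconds / nodes);    // ~86 s
        System.out.printf("compute per node: ~%.0f s%n", computeSeconds / nodes); // ~36 s
        // both drop to a minute or two, which is why the 30-minute target becomes plausible
    }
}
```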
Why is storing data locally on each node's hard disk a better approach?
-Storing data locally on each node's hard disk allows for true parallel reading and eliminates the network bandwidth issue, as each node can access its own data without relying on the network.
What is the role of Hadoop in solving the big data problem?
-Hadoop is a framework for distributed processing of large data sets across clusters of commodity computers. It has two core components: HDFS for storage-related complexities and MapReduce for computational complexities, making it an efficient solution for handling big data.
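To make the MapReduce half concrete, here is a minimal mapper/reducer sketch for the maximum-closing-price problem using Hadoop's Java API. The comma-separated record layout (exchange, symbol, date, open, high, low, close, volume) is an assumption for illustration; the script never specifies the exact columns, so the field indices would need adjusting to the real dataset.

```java
import java.io.IOException;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxClosePrice {

    // Mapper: each node runs this over its local blocks and emits (symbol, close price).
    public static class MaxCloseMapper
            extends Mapper<LongWritable, Text, Text, FloatWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Assumed layout: exchange,symbol,date,open,high,low,close,volume
            String[] fields = line.toString().split(",");
            if (fields.length >= 7) {
                context.write(new Text(fields[1]),
                              new FloatWritable(Float.parseFloat(fields[6])));
            }
        }
    }

    // Reducer: consolidates the partial results from all nodes into one maximum per symbol.
    public static class MaxCloseReducer
            extends Reducer<Text, FloatWritable, Text, FloatWritable> {
        @Override
        protected void reduce(Text symbol, Iterable<FloatWritable> closes, Context context)
                throws IOException, InterruptedException {
            float max = Float.NEGATIVE_INFINITY;
            for (FloatWritable close : closes) {
                max = Math.max(max, close.get());
            }
            context.write(symbol, new FloatWritable(max));
        }
    }
}
```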
What does Hadoop offer that makes it a suitable solution for big data problems?
-Hadoop offers a scalable, distributed processing framework that can handle large data sets efficiently. It uses commodity hardware, making it cost-effective and adaptable to various cluster sizes, from small to very large.
Outlines
Introduction to Big Data Challenges
The script introduces the concept of big data and its challenges, using a scenario where an employee at a major stock exchange is tasked with calculating the maximum closing price for every stock symbol ever traded since the exchange's inception. The data set size is a staggering one terabyte, which is too large for a workstation with only 20 GB of free space. The script outlines the need for storage and computation solutions, highlighting the initial steps taken to address the problem, such as moving the data to a server with more storage capacity and the considerations for computation time.
Storage and Computation Strategies
This paragraph delves into the specifics of data storage and computation. It discusses the limitations of traditional hard disk drives (HDD) in terms of data access rates and the time it would take to read a one terabyte file. The script then explores the idea of using solid-state drives (SSD) for faster data access but acknowledges the cost implications. It also introduces the concept of parallel computation by dividing the data set into smaller chunks and processing them across multiple nodes, while addressing potential issues such as network bandwidth and data replication for fault tolerance.
Distributed Computing with Hadoop
The final paragraph introduces Hadoop as a solution to the challenges of big data processing. It explains that Hadoop consists of two core components: the Hadoop Distributed File System (HDFS) for storage-related tasks, such as data block management and replication, and MapReduce for computational tasks, which includes the processing and consolidation of results from multiple nodes. The script emphasizes the flexibility of Hadoop to scale horizontally by adding more nodes to the cluster, allowing for the efficient processing of large data sets across clusters of commodity computers.
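As an illustration of how the two components meet in code, a typical driver wires a mapper and reducer (such as the MaxClosePrice sketch earlier on this page) into a job and submits it to the cluster. The class names and input/output paths here are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Minimal driver sketch: HDFS supplies the blocks, MapReduce runs the computation,
// and the framework consolidates the per-node results into one output.
public class MaxClosePriceDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "max close price");
        job.setJarByClass(MaxClosePriceDriver.class);
        job.setMapperClass(MaxClosePrice.MaxCloseMapper.class);
        job.setReducerClass(MaxClosePrice.MaxCloseReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(FloatWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. the dataset directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```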
Keywords
Big Data
Data Storage
Computation
Network Attached Storage (NAS)
Storage Area Network (SAN)
Data Access Rate
Solid-State Drives (SSD)
Parallel Computation
Hadoop
Distributed File System
MapReduce
Commodity Computers
Highlights
Introduction to the challenges of big data, specifically calculating the maximum closing price of every stock symbol ever traded.
The problem of handling a 1 terabyte dataset with only 20 GB of free space on a workstation.
Utilizing NAS and SAN servers for large-scale data storage solutions.
The importance of data access rate and the time it takes to read large datasets from HDDs.
The concept of estimating an ETA for data processing tasks, including data transfer and computation time.
The impracticality of using SSDs for big data due to their higher cost compared to HDDs.
Proposing a solution to reduce computation time by dividing the dataset and using parallel processing.
Challenges with parallel data access and network bandwidth limitations.
The idea of storing data locally on each node to achieve true parallel reads and reduce network strain.
Addressing data loss and corruption through data replication across multiple nodes.
The complexity of coordinating data distribution and computation in a distributed system.
Introducing Hadoop as a framework for distributed processing of large datasets.
Explanation of HDFS and its role in managing storage complexities in Hadoop.
MapReduce as a programming model for handling computational complexities in Hadoop.
Hadoop's ability to horizontally scale by adding more nodes to the cluster to reduce execution time.
The flexibility of Hadoop to work with commodity computers, not requiring specialized hardware.
The practicality of starting with a small Hadoop cluster and scaling up as needed.
Conclusion on the efficiency of Hadoop in solving big data problems and its adaptability to various cluster sizes.
Transcripts
[Music]
hey guys welcome back now you know the
key characteristics of Big Data it's
now time to understand the challenges and
problems that come with big data sets so
in this lesson let's take a sample big
data problem analyze it and see how we
can arrive at a solution together ready
imagine you work at one of the major
exchanges like New York Stock Exchange
or Nasdaq one morning someone from your
risk department stops by your desk and
asks you to calculate the maximum closing
price of every stock symbol that is ever
traded in the exchange since inception
also assume the size of the data set you
were given is one terabyte so your data
set would look like this so each line in
this data set is an information about a
stock for a given date immediately the
business user who gave this problem asks
you for an ETA and when he can expect
the results Wow there is a lot to think
here so you ask him to give you some
time and you start to work what would be
your next steps you have two things to
figure out
the first one is storage and second one
is computation let's talk about storage
first so your workstation has only 20 GB
of free space but the size of the data
set is 1 terabyte so you go to your
storage team and ask them to copy the
data set to a NAS server or even a SAN
server NAS stands for network attached
storage and SAN stands for storage area
network so once the data set is copied
you ask them to give you the location of
the data set so a NAS or SAN is
connected to your network so any
computer with access to the network can
access the data provided it has
permission to see the data so far so good
the data is stored and you have access
to the data now you set out to solve the
next problem which is computation you're
a Java programmer so you wrote an
optimized Java program to parse the data
set and perform the computation
everything looks good and you are now
ready to execute the program against the
data set you realize it's already noon
the business user who gave you this
request stops by for an ETA that's an
interesting question isn't it so you
start to think what is the ETA for
this whole operation to complete and you
come up with the following for the
program to work on the data set first
the data set needs to be copied from the
storage to the working memory or Ram so
how long does it take to copy a one
terabyte data set from storage let's
take our traditional hard disk drive
that is the one that is connected to a
laptop or workstation right HDDs
or hard disk drives have magnetic platters
in which the data is stored when you
request to read data the head in the
hard disk first positions itself on the
platter and start transferring the data
from the platter to the head the speed
in which the data is transferred from
the platter to the head is called the
data access rate this is very important
so listen carefully right the average data
access rate in HDDs is usually
about 122 megabytes per second so if you
do the math to read a 1 terabyte file
from a hard disk drive you need 2 hours
and 22 minutes Wow now that is for a HDD
that is connected to your workstation
when you transfer a file from a NAS
server or from your SAN right you
should know the transfer rate of the
hard disk drives in the NAS for now we
will assume it is the same as the regular
HDD which is 122 megabytes per second
and hence it would take 2 hours and 22
minutes now what about the computation
time since you have not executed the
program yet at least once you cannot say
for sure plus your data comes from a
storage server that is attached to the
network so you have to consider the
network bandwidth also so with all that
in mind you give him an ETA of about
three hours but it could be easily over
three hours since you're not sure about
the computation time your business user
is so shocked to hear three hours for an
ETA so he has the next question can we
get it sooner than three hours say maybe
in 30 minutes you know there is no way
you can execute the results in 30
minutes of course the business cannot
wait for three hours especially in
finance where time is money right so
let's work this problem together how can
we calculate the result in less than 30
minutes let's break this down majority
of the time you spend in calculating the
result set will be attributed to two
tasks first is transferring the data
from storage or hard disk drive which
is about two and a half hours and the
second task is the computation time
right that is the time to perform the
actual calculation by your program let's
say it's going to take about 60 minutes
it could be more or it could be less I
have a crazy idea what if we replace HDD
by SSD SSD stands for solid-state drives
SSDs are a very powerful alternative to
HDDs SSDs do not have magnetic platters
or heads they do not have any moving
components and it's based on flash
memory so it is extremely fast sounds
great so why don't we use SSD in place
of HDD by doing that we can
significantly reduce the time it would
take to read the data from the storage
but here is the problem SSD comes with a
price
they are usually 5 to 6 times the price of
your regular HDD although the price
continues to go down given the data
volume that we are talking about with
respect to Big Data it is not a viable
option right now so for now we are stuck
with hard disk drives
let's talk about how we can reduce the
computation time hypothetically we think
the program will take 60 minutes to
complete also assume your program is
already optimized for execution so what
can be done next any ideas I have a
crazy idea how about dividing the one
terabyte data set into 100 equal sized
chunks or blocks and have a hundred
computers or a hundred nodes do the
computation in parallel in theory this
means you cut the data access time by a
factor of 100 and also the computation
time by a factor of 100 so with this
idea you can
during the data access time - less than
two minutes on computation time in less
than one minute so that sounds great it
is a promising idea so let's explore
even further there are a couple of
issues here if you have more than one
chunk of your data set stored in the
same hard drive you cannot get a true
parallel read because there is only one
head in your hard disk which does the
actual read but for the sake of the
argument let's assume you get a true
parallel read which means you have
hundred nodes trying to read data at the
same time now assuming the data can be
read in parallel you will now have 100
times 122 megabytes per second of data
flowing through the network
imagine this what would happen when each
one of your family members at home starts
to stream their favorite TV show or
movie at the same time using a single
internet connection at your home it
would result in a very poor streaming
experience with lot of buffering no one
in the family can enjoy their show right
what you have essentially done is
choked up your network the download
speed requested by each one of the
devices combined exceeded the download
speed offered by the internet connection
resulting in a poor service this is
exactly what will happen here when
a hundred nodes try to transfer the
data over the network at the same time
so how can we solve this why do we have
to rely on a storage which is attached
to the network why don't we bring the
data closer to the computation that is
why don't we store the data locally in
each node's hard disk so you would store
block 1 of data in node 1 block 2 of
data in node 2 etc something like this
now we can achieve a true parallel read
on all 100 nodes and also we have
eliminated the network bandwidth issue
perfect that's a significant improvement
to our design right now let's talk about
something which is very important how
many of you have suffered data loss due
to a hard disk failure I myself have
suffered twice it is not a fun situation
right I'm sure
most of you at least once faced a hard
drive failure so how can you protect
your data from hard disk failure or data
corruption etc let's take an example
let's say you have a photo of your loved
ones and you treasure that picture in
your mind there is no way you can lose
that picture how would you protect it
you would keep copies of your picture in
different places right maybe one in your
personal laptop one copy in Picasa one
copy in your external hard drive you get
the idea so if your laptop crashes you
can still get that picture from Picasa
or your external hard drive so let's do
this
why don't we copy each block of data to
two more nodes in other words we can
replicate the block in two more nodes so
in total we have three copies of each
block take a look at this node one has
blocks one seven and ten node two has
blocks seven eleven and forty-two node
3 has blocks one seven and ten so if
block one is unavailable in node two due
to a hard disk failure or corruption in
the block it can be easily fetched from
node 3 so this means that node one two
and three must have access to one
another and they should be connected in
a network right conceptually this is
great but there are some challenges
implementing it let's think about this
how does node one know that node 3 has
block 1 and who decides block 7 for
instance should be stored in node one
two and three first of all who will
break the one terabyte into hundred
blocks so as you can see this solution
doesn't look that easy does it and
that's just the storage part of it
computation brings other challenges node
one can only compute the maximum close
price from just block one similarly node 2
can only compute the maximum close
price from block 2 this brings up a
problem because for example data for
stock GE can be in block 1
can also be in block two and could also
be on block 82 for instance right so you
have to consolidate the result from all
the nodes together to compute the final
result so who's going to coordinate all
that the solution we are proposing is
distributed computing and as we are
seeing there are several complexities
involved in implementing the solution
both at the storage layer and also at
the computation layer the answer to all
these open questions and complexities is
Hadoop Hadoop offers a framework for
distributed computing so Hadoop has two
core components HDFS and MapReduce HDFS
stands for a Hadoop distributed file
system and it takes care of all your
storage related complexities like
splitting your data set into blocks
replicating each block to more than one
node and also keep track of which block
is stored on which node etc MapReduce is
a programming model and Hadoop
implements MapReduce and it takes care
of all the computational complexities so
Hadoop framework takes care of bringing
all the intermediate results from every
single node to offer a consolidated
output so what is Hadoop Hadoop is a
framework for distributed processing of
large data sets across clusters of
commodity computers the last two words
in the definition is what makes Hadoop even
more special commodity computers that
means all the hundred nodes that we have
in the cluster do not have to have any
specialized hardware they're enterprise
grade server nodes with a processor hard
disk and RAM in each of them that's it
there is nothing more special about that
but don't confuse commodity computers
with cheap hardware commodity computers
mean inexpensive hardware and not cheap
hardware now you know what Hadoop is and
how it can offer an efficient solution
to your maximum close price problem
against a one terabyte data set now you
can go back to the business and propose
hadoop to solve the problem and to
achieve the execution time that your users
are expecting but if you propose a
hundred node cluster to your business
expect to get some crazy looks but
that's the beauty of Hadoop you don't
need to have a hundred node cluster we
have seen successful Hadoop production
environments from small ten node cluster
all the way to hundred to thousand node
cluster you can simply even start with a
ten node cluster and if you want to
reduce the execution time even further
all you have to do is add more nodes to
your cluster that's simple in other
words Hadoop will horizontally scale
so now you know what is Hadoop and
conceptually how it solves the problem
of big datasets with that let's wrap this
lesson and move on to the next lesson