Spark Tutorial For Beginners | Big Data Spark Tutorial | Apache Spark Tutorial | Simplilearn

Simplilearn
13 Jul 2017 · 15:40

Summary

TL;DR: Apache Spark, an open-source cluster computing framework, was developed to overcome the limitations of Hadoop's MapReduce. It excels at real-time processing, handles simple operations such as filter and join that are cumbersome in MapReduce, copes better with large data moving across the network, and offers up to 100 times faster performance for certain applications. Spark's components, including Spark Core with RDDs, Spark SQL, Spark Streaming, MLlib, and GraphX, provide a unified platform for diverse data processing tasks, from batch and real-time analytics to machine learning and graph processing. Its in-memory processing capabilities and support for multiple languages improve the developer experience and enable versatile data analysis.

Takeaways

  • πŸš€ Apache Spark was developed at UC Berkeley's AMP Lab in 2009 and became an open-source project in 2010 under the Berkeley Software Distribution license.
  • πŸ”„ In 2013, the project was donated to the Apache Software Foundation and the license was changed to Apache 2.0, with Spark becoming an Apache top-level project in 2014.
  • πŸ† Databricks, founded by the creators of Apache Spark, used Spark to set a world record in large-scale sorting in November 2014 and now provides commercial support and certification for Spark.
  • πŸ” Spark is a next-generation real-time and batch processing framework that can be compared with MapReduce, another data processing framework in Hadoop.
  • πŸ“ˆ Batch processing in Spark involves processing large amounts of data in a single run over a time period, typically used for heavy data load, generating reports, and managing data workflow offline.
  • πŸ”₯ Real-time processing in Spark occurs instantaneously on data entry or command receipt, with applications like fraud detection requiring stringent response time constraints.
  • 🚧 Spark was created to overcome MapReduce's limitations: it suits only batch processing rather than real-time processing, makes even simple operations such as filter and join cumbersome to write, and struggles with large data moving across the network.
  • πŸ’» Spark is an open-source cluster computing framework that addresses the limitations of MapReduce, offering real-time processing, support for trivial operations, and efficient handling of large data on a network.
  • 🌐 Spark's performance is significantly faster than MapReduce for certain applications, thanks to its in-memory processing capabilities, making it suitable for machine learning algorithms.
  • πŸ› οΈ A Spark project includes components like Spark Core and RDDs, Spark SQL, Spark Streaming, MLlib (Machine Learning Library), and GraphX, each serving different computational needs from basic I/O to advanced analytics.
  • πŸ”‘ In-memory processing in Spark allows for faster data access and improved performance, reducing the need for disk-based storage and enabling more efficient data compression and query execution.

Q & A

  • What was the original purpose behind the development of Spark?

    -Spark was developed at UC Berkeley's AMP Lab in 2009 to address the limitations of the MapReduce framework and to provide a more efficient data processing framework for both batch and real-time processing.

  • When did Spark become an open source project?

    -Spark became an open source project in 2010 under the Berkeley Software Distribution license.

  • What significant change occurred in 2013 regarding the Spark project?

    -In 2013, the project was donated to the Apache Software Foundation and its license was changed to Apache 2.0.

  • Why did Spark become an Apache Top-Level Project in February 2014?

    -Spark became an Apache Top-Level Project in February 2014 due to its growing popularity and the recognition of its capabilities in the big data processing domain.

  • What is the difference between batch processing and real-time processing as mentioned in the script?

    -Batch processing involves processing a large amount of data in a single run over a time period without manual intervention, typically used for offline data workflows like generating reports. Real-time processing, on the other hand, occurs instantaneously on data entry or command receipt and requires stringent response time constraints, such as in fraud detection.

  • What limitations of MapReduce did Spark aim to overcome?

    -Spark aimed to overcome limitations such as the slow processing time for large data sets, the complexity of writing trivial operations like filter and join, issues with large data on the network, unsuitability for online transaction processing (OLTP), and the inability to handle iterative program execution and graph processing efficiently.

  • What are the main components of a Spark project?

    -The main components of a Spark project include Spark Core and Resilient Distributed Data Sets (RDDs), Spark SQL, Spark Streaming, the Machine Learning Library (MLlib), and GraphX.

  • How does Spark Core and its RDDs simplify the complexity of programming?

    -Spark Core and RDDs simplify programming by providing basic input/output functionalities, distributed task dispatching, and scheduling. RDDs abstract the complexity by allowing data to be partitioned across machines and manipulated through transformations similar to local data collections.
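
    As a concrete illustration of the RDD abstraction (not shown in the video), here is a minimal sketch in Scala, assuming a local Spark installation; the object name and data are made up:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddExample {
  def main(args: Array[String]): Unit = {
    // Local SparkContext for illustration; a real cluster would use a different master URL
    val sc = new SparkContext(new SparkConf().setAppName("rdd-example").setMaster("local[*]"))

    // Build an RDD from a local collection; Spark partitions it across the cluster
    val numbers = sc.parallelize(1 to 1000)

    // Transformations read like operations on a local collection: filter, map
    val evenSquares = numbers.filter(_ % 2 == 0).map(n => n * n)

    // Actions such as reduce trigger the distributed computation
    println(evenSquares.reduce(_ + _))

    sc.stop()
  }
}
```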

  • What is Spark SQL and how does it support data manipulation?

    -Spark SQL is a component that resides on top of Spark Core, introducing Schema RDD, a new data abstraction that supports semi-structured and structured data. Schema RDDs can be manipulated through the domain-specific language Spark SQL provides in Java, Scala, and Python, and SQL queries are supported through JDBC or ODBC interfaces.
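
    A minimal sketch of the same idea in Scala (not from the video), assuming a local Spark installation. The Schema RDD abstraction mentioned in the video is exposed as the DataFrame API in later Spark releases; the sample data and names are made up:

```scala
import org.apache.spark.sql.SparkSession

object SparkSqlExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("sql-example").master("local[*]").getOrCreate()
    import spark.implicits._

    // A small structured dataset; in practice this could come from JSON, Parquet, Hive, etc.
    val people = Seq(("Alice", 34), ("Bob", 19), ("Carol", 45)).toDF("name", "age")

    // Manipulate the data through the DataFrame DSL...
    people.filter($"age" > 21).show()

    // ...or register it as a view and query it with plain SQL
    people.createOrReplaceTempView("people")
    spark.sql("SELECT name FROM people WHERE age > 21").show()

    spark.stop()
  }
}
```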

  • How does Spark Streaming differ from traditional batch processing?

    -Spark Streaming leverages the fast scheduling capability of Spark Core for streaming analytics by ingesting data in small batches and performing RDD transformations on them. This design allows the same application code set written for batch analytics to be used for streaming analytics.
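
    A minimal Spark Streaming sketch in Scala (not from the video), using the micro-batch DStream API the description refers to; it assumes a local Spark installation and a hypothetical text source on localhost port 9999:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("streaming-example").setMaster("local[2]")
    // Ingest data in small batches, here every 5 seconds
    val ssc = new StreamingContext(conf, Seconds(5))

    // Hypothetical source: a text stream on localhost:9999 (e.g. started with `nc -lk 9999`)
    val lines = ssc.socketTextStream("localhost", 9999)

    // The same flatMap/map/reduceByKey word-count logic used on a batch RDD, applied per micro-batch
    val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

    The transformation chain is the same word-count logic one would write against a batch RDD, which is the code-reuse point made above.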

  • What advantages does Spark offer over MapReduce in terms of performance and versatility?

    -Spark offers up to 100 times faster performance for certain applications due to its in-memory processing capabilities, making it suitable for machine learning algorithms. It is also more versatile, being suitable for real-time processing, trivial operations, processing larger data on a network, OLTP, graphs, and iterative execution.

Outlines

00:00

πŸš€ Introduction to Apache Spark and Its Evolution

Apache Spark is a data processing framework that originated at UC Berkeley's AMP Lab in 2009. It was open-sourced in 2010 under the Berkeley Software Distribution license. In 2013, it was donated to the Apache Software Foundation, transitioning to the Apache 2.0 license. By February 2014, Spark had become an Apache top-level project. It is recognized for its ability to handle both real-time and batch processing, distinguishing it from MapReduce, which is limited to batch processing. Spark's creators used it to set a world record in large-scale sorting by November 2014. It is now supported commercially by Databricks, the company founded by Spark's creators, and stands as a next-generation framework that overcomes the limitations of MapReduce: poor fit for real-time processing, cumbersome handling of simple operations such as filter and join, inefficiency with large data on the network, and unsuitability for online transaction processing and iterative program execution.

05:01

🌟 Core Components and Capabilities of Apache Spark

Apache Spark's core components include Spark Core and Resilient Distributed Datasets (RDDs), which form the foundation of the project by providing basic I/O functionalities, distributed task dispatching, and scheduling. RDDs are the fundamental programming abstraction and simplify programming by letting applications manipulate distributed data much as they would local collections. Spark SQL introduces Schema RDD, a new data abstraction for semi-structured and structured data, supporting SQL queries and a domain-specific language in Java, Scala, and Python. Spark Streaming enables real-time streaming analytics by leveraging Spark Core's fast scheduling capability. MLlib, the machine learning library, applies common statistical and machine learning algorithms, and GraphX is a distributed graph processing framework that provides an API and runtime for graph computations. In-memory processing is highlighted as a key feature, allowing for faster performance and enabling Spark to handle large datasets efficiently.

10:02

πŸ› οΈ Spark's Advantages and Its Place in the Hadoop Ecosystem

Spark is favored over MapReduce for its performance and versatility, offering a rewarding development experience with support for multiple languages like Java, Scala, and Python. It simplifies the development process by allowing the use of lambda functions and closures. In contrast, the Hadoop ecosystem, which uses MapReduce for batch analytics, is limited to batch processing and requires extensive setup for different types of data processing. Spark, however, supports a variety of workloads, including streaming, iterative algorithms, and batch applications, all on the same engine. It also integrates with Hadoop by creating distributed datasets from files stored in Hadoop's file systems or other supported storage systems. The unification provided by Spark simplifies the learning curve for developers and allows for easy management of applications across different systems.

15:04

πŸ“ˆ The Impact of In-Memory Processing and Spark's Future Prospects

In-memory processing allows for faster data access and improved performance, which is crucial for interactive data exploration and analysis. Spark's in-memory capabilities provide speed and efficiency, making it suitable for complex applications and various processing types. It also supports the development of distributed applications that combine different processing models, such as real-time data categorization using machine learning. The IT team benefits from maintaining a single system, as Spark integrates tightly with various components for different workloads. For those aspiring to become big data experts, the Simplilearn channel offers educational content and certification opportunities in big data, highlighting the growing importance and demand for expertise in this field.


Keywords

πŸ’‘Apache Spark

Apache Spark is an open-source cluster-computing framework that was developed to address the limitations of Hadoop's MapReduce. It is designed for both real-time and batch processing of large-scale data. In the video, Spark is highlighted for its ability to perform up to 100 times faster than MapReduce for certain applications, making it suitable for machine learning algorithms and other data-intensive tasks.

πŸ’‘Batch Processing

Batch processing refers to the execution of a large number of operations or transactions in a single run over a period of time without any manual intervention. The video explains that in batch processing, data is pre-selected and fed using command line parameters and scripts, often used for generating reports and managing data workflows offline, such as creating daily or hourly reports for decision making.

πŸ’‘Real-time Processing

Real-time processing is the instantaneous processing of data upon entry or command receipt, requiring stringent response time constraints. The video uses fraud detection as an example of real-time processing, highlighting its importance in scenarios where immediate data processing is crucial.

πŸ’‘MapReduce

MapReduce is a programming model and an associated implementation for processing and generating large datasets. The video discusses the limitations of MapReduce, such as its suitability primarily for batch processing and its inability to handle real-time processing efficiently, as well as its complexity when dealing with trivial operations like filter and join.

πŸ’‘Databricks

Databricks is a company founded by the creators of Apache Spark, which provides commercial support and certification for Spark. The video mentions that the engineering team at Databricks used Spark to set a world record in large-scale sorting, demonstrating the framework's capabilities.

πŸ’‘Resilient Distributed Datasets (RDDs)

RDDs are the fundamental data structure in Spark, representing a fault-tolerant collection of elements that can be operated on in parallel. The video describes RDDs as being created by applying transformations on existing RDDs or by referencing external datasets, simplifying the complexity of programming by allowing operations similar to those on local data collections.

πŸ’‘Spark SQL

Spark SQL is a Spark module that provides support for structured and semi-structured data, allowing users to run SQL queries or use the DataFrame API for data manipulation. The video explains that Spark SQL introduces Schema RDD, a new data abstraction that supports SQL with JDBC or ODBC interfaces.

πŸ’‘Spark Streaming

Spark Streaming is a component of Spark that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. The video describes how Spark Streaming leverages the fast scheduling capability of Spark Core for streaming analytics, allowing the same application code set written for batch analytics to be used for streaming analytics as well.

πŸ’‘MLlib

MLlib is Spark's machine learning library, which provides a distributed machine learning framework. The video notes that MLlib is nine times faster than Apache Mahout's disk-based version due to its memory-based architecture and supports various common statistical and machine learning algorithms.
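
A minimal MLlib sketch in Scala (not from the video), using the RDD-based k-means API; the tiny dataset is made up, and the cache() call reflects the iterative, memory-based execution described above:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object MLlibExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("mllib-example").setMaster("local[*]"))

    // Tiny made-up dataset of 2-D points; real input would be loaded from distributed storage
    val points = sc.parallelize(Seq(
      Vectors.dense(0.0, 0.1), Vectors.dense(0.2, 0.0),
      Vectors.dense(9.0, 9.1), Vectors.dense(9.2, 8.9)
    )).cache() // cached because k-means passes over the data repeatedly

    // Train k-means with 2 clusters and up to 20 iterations
    val model = KMeans.train(points, 2, 20)
    model.clusterCenters.foreach(println)

    sc.stop()
  }
}
```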

πŸ’‘GraphX

GraphX is a component of Spark designed for graph processing, providing an API and optimized runtime for the Pregel abstraction, which is used for large-scale graph processing. The video explains that GraphX allows for the computation of graphs and can model the Pregel abstraction for graph processing tasks.
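
A minimal GraphX sketch in Scala (not from the video), assuming a local Spark installation; the users and friendships are made up, echoing the social-graph example from the video:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

object GraphXExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("graphx-example").setMaster("local[*]"))

    // Vertices are (id, attribute) pairs; edges carry their own attribute
    val users = sc.parallelize(Seq((1L, "Alice"), (2L, "Bob"), (3L, "Carol")))
    val friendships = sc.parallelize(Seq(Edge(1L, 2L, "friend"), Edge(2L, 3L, "friend")))

    val graph = Graph(users, friendships)

    // A simple graph computation: how many connections does each user have?
    graph.degrees.join(users).collect().foreach {
      case (_, (degree, name)) => println(s"$name has $degree connection(s)")
    }

    sc.stop()
  }
}
```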

πŸ’‘In-memory Processing

In-memory processing refers to the practice of loading data into the memory of a system for faster data access and processing. The video discusses the advantages of in-memory processing, such as increased speed of processing, reduced memory requirements for queries, and the ability to implement compression algorithms that decrease the in-memory size.
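
A minimal sketch of in-memory caching in Spark (not from the video), in Scala; the HDFS path is hypothetical:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CachingExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("caching-example").setMaster("local[*]"))

    // Hypothetical input path; any Hadoop-compatible storage would work
    val logs = sc.textFile("hdfs:///data/app-logs")

    // Keep the filtered data in cluster memory after it is first computed
    val errors = logs.filter(_.contains("ERROR")).cache()

    // The first action reads from storage and populates the cache...
    println(s"errors: ${errors.count()}")

    // ...subsequent queries over the same RDD are served from memory
    println(s"timeouts: ${errors.filter(_.contains("timeout")).count()}")

    sc.stop()
  }
}
```

The first count() reads from storage and fills the cache; later queries over the same RDD run against cluster memory, which is where the speed-up described above comes from.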

Highlights

Spark was developed at UC Berkeley's AMP Lab in 2009 and became an open-source project in 2010.

In 2013, Spark was donated to the Apache Software Foundation and its license changed to Apache 2.0.

Spark became an Apache Top-Level Project in February 2014.

Databricks, founded by Spark creators, used Spark to set a world record in large-scale sorting in 2014.

Spark supports both real-time and batch processing, unlike MapReduce which is limited to batch processing.

Batch processing in Spark is used for operations like generating reports and managing data workflows offline.

Real-time processing in Spark is instantaneous and crucial for applications like fraud detection.

MapReduce's limitations include its unsuitability for real-time processing and complex operations like filter and join.

Spark addresses MapReduce's limitations, offering superior performance for real-time processing and network data processing.

Spark provides up to 100 times faster performance in certain applications due to its in-memory processing capabilities.

Spark's components include Spark Core, RDDs, Spark SQL, Spark Streaming, MLlib, and GraphX.

RDDs are Spark's fundamental programming abstraction, simplifying distributed data processing.

Spark SQL introduces Schema RDD for structured and semi-structured data manipulation.

Spark Streaming enables the use of the same batch analytics code for streaming analytics.

MLlib is Spark's distributed machine learning framework, offering performance improvements over other frameworks.

GraphX is a distributed graph processing framework in Spark, providing APIs and runtime for graph computations.

In-memory processing in Spark allows for faster data access and reduced memory requirements compared to disk-based systems.

Spark's in-memory capabilities provide a significant speed advantage for machine learning algorithms.

Spark supports multiple development languages, including Java, Scala, and Python, enhancing developer experience.

Spark's lambda functions and closures allow for inline function definitions, simplifying code comprehension.

Spark can replace Hadoop's MapReduce for batch processing, offering speed and versatility advantages.

Spark's unification feature simplifies development by allowing the use of one platform for various processing types.

Spark's integration with Hadoop systems provides flexibility and support for various data storage formats.

Spark's performance and ease of use make it a preferred choice for big data processing over MapReduce.

Transcripts

00:02

Spark as a data processing framework was developed at UC Berkeley's AMP Lab by Matei Zaharia in 2009. In 2010 it became an open source project under a Berkeley Software Distribution license. In 2013 the project was donated to the Apache Software Foundation and the license was changed to Apache 2.0. In February 2014 Spark became an Apache top-level project. By November 2014, Spark was used by the engineering team at Databricks, a company founded by the creators of Apache Spark, to set a world record in large-scale sorting. Databricks now provides commercial support and certification for taking the Spark programming test. At present, Spark exists as a next-generation real-time and batch processing framework.

01:04

Let's try to understand what batch and real-time processing mean. We will use this information in the subsequent slides to compare Spark with MapReduce, both of which are data processing frameworks in Hadoop. In batch processing, a large amount of data or transactions is processed in a single run over a time period. The associated jobs generally run entirely without any manual intervention. Additionally, the entire data set is pre-selected and fed using command line parameters and scripts. In typical cases, batch processing is used to execute multiple operations, handle heavy data loads, generate reports, and manage data workflows offline. An example is creating daily or hourly reports to aid decision making. Real-time processing, on the other hand, occurs instantaneously on data entry or command receipt. It needs to execute within stringent response time constraints. An example is fraud detection.

02:03

The need for Spark was created by the limitations of MapReduce, which is another data processing framework in Hadoop. Let's see what these limitations are. MapReduce is suitable for batch processing, where data is processed as a periodic job, so it takes time to process the data and provide results when the data volume is high. Depending on the amount of data and the number of nodes in the cluster, completing a job takes minutes at best, so it is not a good choice for real-time processing. MapReduce is also not suitable for writing trivial operations such as filter and join: to write such operations you might need to rewrite the jobs using the MapReduce framework, which becomes complex because of the key-value pattern that must be followed in mapper and reducer code. MapReduce also doesn't work well with large data on the network, because copying the data takes a lot of time and may cause network bandwidth issues; it works on the data locality principle and hence works well on the node where the data resides. MapReduce is likewise unsuitable for online transaction processing, or OLTP, which involves a large number of short transactions: since it is a batch-oriented framework, it cannot deliver second or sub-second latency. Additionally, MapReduce is unfit for processing graphs. Graphs represent structures used to explore relationships between points, for example finding common friends on social media sites like Facebook. Hadoop has the Apache Giraph library for such cases, but it runs on top of MapReduce and adds to the complexity. Another important limitation is its unsuitability for iterative program execution. Some use cases, like k-means, need data to be processed again and again to refine results, but MapReduce runs from the start every time because it is a stateless executor.

04:03

Spark is an open source cluster computing framework that addresses all of these limitations of MapReduce. It is suitable for real-time processing, trivial operations, and processing larger data on a network. It is also suitable for OLTP, graphs, and iterative execution. Compared to the disk-based, two-stage MapReduce of Hadoop, Spark provides up to 100 times faster performance for a few applications with in-memory primitives. This fast performance makes it suitable for machine learning algorithms, as it allows programs to load data into the memory of a cluster and query the data constantly. A Spark project comprises various components, such as Spark Core and Resilient Distributed Datasets, or RDDs, Spark SQL, Spark Streaming, the Machine Learning Library, or MLlib, and GraphX.

05:01

Let's discuss the components of Spark. The first component, Spark Core and RDDs, is the foundation of the entire Spark project. They provide basic input-output functionalities, distributed task dispatching, and scheduling. Let's look at RDDs closely. RDDs are the basic programming abstraction: a collection of data that is logically partitioned across machines. RDDs can be created by applying coarse-grained transformations on existing RDDs or by referencing external data sets; examples of these transformations are reduce, join, filter, and map. The abstraction of RDDs is exposed similarly to in-process, local collections through a language-integrated application programming interface, or API, in Python, Java, and Scala. As a result of the RDD abstraction, the complexity of programming is simplified, because the manner in which applications change RDDs is similar to changing local data collections.

06:10

The second component is Spark SQL, which resides on top of Spark Core. It introduces Schema RDD, a new data abstraction, and supports semi-structured and structured data. Schema RDDs can be manipulated through the domain-specific language Spark SQL provides in Java, Scala, and Python. Spark SQL also supports SQL with Open Database Connectivity or Java Database Connectivity, commonly known as ODBC or JDBC, server and command line interfaces.

06:46

The third component is Spark Streaming. Spark Streaming leverages the fast scheduling capability of Spark Core for streaming analytics, ingesting data in small batches and performing RDD transformations on them. With this design, the same application code written for batch analytics can be used on a single engine for streaming analytics.

07:10

The fourth component of Spark is the Machine Learning Library, also known as MLlib. It lies on top of Spark and is a distributed machine learning framework. MLlib applies various common statistical and machine learning algorithms. With its memory-based architecture, it is nine times faster than the Hadoop disk-based version of Apache Mahout. In addition, the library performs even better than Vowpal Wabbit, or VW; the VW project is a fast out-of-core learning system sponsored by Microsoft.

07:48

The last component, GraphX, also lies on top of Spark and is a distributed graph processing framework. For the computation of graphs it provides an API and an optimized runtime for the Pregel abstraction. Pregel is a system for large-scale graph processing, and the API can also model the Pregel abstraction.

08:10

We discussed earlier that Spark provides up to 100 times faster performance for a few applications with in-memory primitives. Let's discuss the application of in-memory processing using column-centric databases. In column-centric databases, similar information can be stored together, so data can be stored with more compression and efficiency. This also permits storing large amounts of data in the same space, thereby reducing the amount of memory required for performing a query, and it increases the speed of processing. In an in-memory database, the entire data set is loaded into memory, eliminating the need for indices, aggregates, optimized databases, star schemas, and cubes. With the use of in-memory tools, compression algorithms can be implemented that decrease the in-memory size even beyond what is required for hard disks. Users querying data loaded in memory is different from caching; in-memory processing also helps to avoid performance bottlenecks and slow database access. Caching is a popular method for speeding up query performance, where caches are subsets of very particular, organized data that are already defined. With in-memory tools, data analysis can be flexible in size and can be accessed within seconds by concurrent users, with excellent analytics potential; this is possible because the data lies completely in memory. In theoretical terms, this leads to data access that is 10,000 to one million times faster than going to disk. In addition, it reduces the performance tuning needed by IT professionals and therefore provides faster data access for end users. With in-memory processing it is possible to access visually rich dashboards and existing data sources, an ability provided by several vendors. In turn, in-memory processing allows end users and business analysts to create customized queries and reports without any need for extensive expertise or training.

10:21

We have already discussed that Spark provides performance, which in turn offers developers a rewarding experience. Spark is chosen over MapReduce mainly for its performance advantages and versatility. Apart from these, another critical advantage is its development experience along with language flexibility. Spark provides support for various development languages like Java, Scala, and Python, and will likely support R as well. In addition, Spark has the capability to define functions inline. With the temporary exception of Java, a common element in these languages is that they provide methods to express operations using lambda functions and closures. Using lambdas and closures, you can define functions inline within the application's core logic, which helps to create easy-to-comprehend code and preserve application flow.

11:16

Let's look at MapReduce in the Hadoop ecosystem. The Hadoop ecosystem, which allows you to store large files on various machines, uses MapReduce for batch analytics, which is as easy as it is distributed in nature. Apache Spark, on the other hand, supports both real-time and batch processing. In Hadoop, third-party support is also available; for example, by using Talend ETL tools, various batch-oriented workflows can be designed, and Pig and Hive queries enable non-Java developers to prepare batch workflows using SQL scripts. You can perform every type of data processing in Spark that you can execute in Hadoop. For batch processing, Spark batch can be used over Hadoop MapReduce. For structured data analysis, Spark SQL can be implemented using SQL. For machine learning analysis, the Machine Learning Library can be used for clustering, recommendation, and classification. For interactive SQL analysis, Spark SQL can be used instead of Impala. In addition, for real-time streaming data analysis, Spark Streaming can be used in place of a specialized library like Storm.

12:32

Spark has three main advantages: it provides speed, it combines various processing types, and it supports Hadoop. The feature of speed is critical to process large data sets, as it makes the difference between waiting for hours or minutes and exploring the data interactively. With respect to speed, Spark has extended the MapReduce model to support computations like stream processing and interactive queries, including the ability to run computations in memory. It is also more efficient than MapReduce at running complex applications on disk. This adds to the speed capability of Spark. Spark covers various workloads that would otherwise require different distributed systems, such as streaming, iterative algorithms, and batch applications. As these workloads are supported on the same engine, combining different processing types, which is normally required in production data analysis pipelines, is easy. The combination feature also allows easy management of separate tools.

13:36

Spark is capable of creating distributed data sets from any file stored in the Hadoop Distributed File System or any other supported storage system. You must note that Spark does not need Hadoop; it simply supports the storage systems that implement the Hadoop APIs, along with SequenceFiles, Parquet, Avro, text files, and all other input-output formats of Hadoop.

14:01

Now the question is why unification matters. Unification not only gives developers the advantage of learning only one platform but also allows users to take their apps everywhere. The graphic shows the apps and systems that can be combined with Spark. A Spark project includes various closely integrated components for distributing, scheduling, and monitoring applications with many computational tasks across a computing cluster, or various worker machines. The Spark Core engine is general purpose and fast; as a result, it empowers various higher-level components that are specialized for different workloads, like machine learning or SQL, and these components can interoperate closely. Another important advantage is that it integrates tightly, allowing you to create applications that easily combine different processing models. An example is the ability to write an application that uses machine learning to categorize data in real time as it is ingested from streaming sources. Additionally, it allows analysts to query the resulting data through SQL. Moreover, data scientists and engineers can access the same data through the Python shell for ad-hoc analysis and in standalone batch applications. For all this, the IT team needs to maintain only one system.

15:27

Hey, want to become an expert in big data? Then subscribe to the Simplilearn channel and click here to watch more such videos. To nerd up and get certified in big data, click here.


Related Tags
Apache Spark, Big Data, Data Processing, Real-Time, Batch Processing, Hadoop Ecosystem, In-Memory Computing, Machine Learning, Data Analytics, Spark SQL