What is Apache Hive? : Understanding Hive

BigDataElearning
26 Dec 2017 · 05:23

Summary

TL;DR: Apache Hive is a pivotal component of the Big Data landscape, originally developed by Facebook and now maintained by the Apache Software Foundation. It bridges the gap between traditional SQL and the Hadoop ecosystem by allowing users to run SQL-like queries (HQL) on large datasets stored in HDFS. Hive is well suited to OLAP and analytics but not to OLTP or real-time processing, owing to the latency of compiling queries into MapReduce jobs. It supports various file formats, offers compression techniques, and allows user-defined functions (UDFs). Unlike a traditional RDBMS, Hive operates on a 'schema on read' model and scales to massive data volumes by storing data in HDFS.

Takeaways

  • 🐝 **What is Hive?** - Apache Hive is a data warehouse software that facilitates reading, writing, and managing large datasets in distributed storage using SQL-like queries.
  • 🚀 **Development and Maintenance** - Originally developed by Facebook, Hive is now maintained by the Apache Software Foundation and widely used by companies like Netflix and Amazon.
  • 🔍 **Purpose of Hive** - It was created to make Hadoop more accessible to SQL users by allowing them to run queries using a SQL-like language called HiveQL.
  • 🧩 **Integration with Hadoop** - Hive interfaces with the Hadoop ecosystem and HDFS file system, converting HiveQL queries into MapReduce jobs for processing.
  • 📊 **Use Cases for Hive** - It's ideal for OLAP (Online Analytical Processing), providing a scalable and flexible platform for querying large datasets stored in HDFS.
  • ❌ **Limitations of Hive** - Not suitable for OLTP (Online Transaction Processing), real-time updates, or scenarios requiring low-latency data retrieval due to the overhead of converting Hive scripts to MapReduce jobs.
  • 📚 **Features of Hive** - Supports various file formats, stores metadata in an RDBMS, offers compression techniques, and allows user-defined functions (UDFs) to extend functionality.
  • 🔄 **Schema Enforcement** - Hive operates on a 'schema on read' model, which means it doesn't enforce schema during data ingestion but rather when data is queried.
  • 📈 **Scalability** - Unlike traditional RDBMS, which typically have a storage capacity limit of around 10 terabytes, Hive can handle storage of hundreds of petabytes due to its integration with HDFS.
  • 📋 **Comparison with RDBMS** - Hive differs from traditional RDBMS in that it doesn't support OLTP and enforces schema on read rather than on write, making it more suitable for big data analytics than transactional databases.

Q & A

  • What is Apache Hive?

    -Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data query and analysis. It was originally developed by Facebook and is now maintained by the Apache Software Foundation.

  • Why was Hive developed?

    -Hive was developed to provide a SQL-like interface (Hive Query Language or HQL) for users to interact with the Hadoop ecosystem, making it easier for those familiar with SQL to work with large volumes of data stored in Hadoop.

  • How does Hive complement the Hadoop ecosystem?

    -Hive complements the Hadoop ecosystem by allowing users to run SQL-like queries on large datasets in Hadoop, which are then converted into MapReduce jobs for processing.
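For a sense of what this looks like in practice, here is a minimal HiveQL sketch; the table and column names are hypothetical, but any query of this shape would be compiled by Hive into one or more MapReduce jobs over files in HDFS.

```sql
-- Hypothetical example: a SQL-like HiveQL query over a table whose
-- underlying files live in HDFS; Hive compiles it into MapReduce jobs.
SELECT country, COUNT(*) AS visits
FROM   page_views                  -- assumed table backed by HDFS files
WHERE  view_date = '2017-12-26'
GROUP  BY country;
```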

  • What are the main use cases for Hive?

    -Hive is primarily used for OLAP (Online Analytical Processing), allowing for scalable, fast, and flexible data analysis on large datasets residing on the Hadoop Distributed File System (HDFS).

  • When is Hive not suitable for use?

    -Hive is not suitable for OLTP (Online Transaction Processing), real-time updates or queries, or scenarios requiring low-latency data retrieval due to the inherent latency in converting Hive scripts into MapReduce jobs.

  • What file formats does Hive support?

    -Hive supports various file formats including Sequence File, Text File, Avro, ORC, and RC file formats.
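The file format is chosen per table with the `STORED AS` clause. A hedged sketch, with hypothetical table names:

```sql
-- STORED AS selects the on-disk file format for the table's data.
CREATE TABLE events_text (id INT, payload STRING)
STORED AS TEXTFILE;   -- plain text files

CREATE TABLE events_orc (id INT, payload STRING)
STORED AS ORC;        -- columnar ORC format, generally better for analytics
```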

  • Where does Hive store its metadata?

    -Hive stores its metadata in an RDBMS like Apache Derby, which allows for metadata management and query optimization.

  • What are some of the finest features of Hive?

    -Hive features include support for different file formats, compression techniques, user-defined functions (UDFs), and specialized joins to improve query performance.
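One example of such a specialized join is the map-side join, where a small table is loaded into memory on each mapper so the join needs no reduce phase. A sketch using the classic `MAPJOIN` hint, with hypothetical table names:

```sql
-- Map-side join: the small dimension table d is broadcast to mappers,
-- avoiding an expensive shuffle/reduce step for the join.
SELECT /*+ MAPJOIN(d) */ f.order_id, d.country_name
FROM   orders f
JOIN   dim_country d ON f.country_id = d.id;
```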

  • How does Hive's schema enforcement differ from traditional RDBMS?

    -Hive enforces schema on read, meaning data can be inserted without checking the schema, which is verified only upon reading. Traditional RDBMS enforce schema on write, verifying data during insertion.
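Schema on read is easiest to see with an external table, where the table definition is just metadata laid over files already sitting in HDFS. A sketch, with a hypothetical path and table name:

```sql
-- Nothing is validated when the table is defined or data is loaded;
-- the schema is applied only when the data is read.
CREATE EXTERNAL TABLE raw_logs (ts STRING, level STRING, msg STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/raw_logs';

-- Malformed fields surface only at query time, typically as NULLs:
SELECT level, COUNT(*) FROM raw_logs GROUP BY level;
```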

  • What is the storage capacity difference between Hive and traditional RDBMS?

    -Hive can store hundreds of petabytes of data in HDFS, whereas traditional RDBMS typically have a storage capacity of around 10 terabytes.

  • Does Hive support OLTP operations?

    -No, Hive does not support OLTP operations as it is designed for batch processing and analytics rather than transactional processing.

Outlines

00:00

💡 Introduction to Apache Hive

Apache Hive is a prominent data warehouse software project designed to facilitate querying and managing large datasets residing in distributed storage using SQL. Originally developed by Facebook, it is now overseen by the Apache Software Foundation. Hive serves as an interface for Hadoop, allowing users accustomed to SQL to interact with the Hadoop ecosystem seamlessly. It executes SQL-like queries, written in Hive Query Language (HQL), by translating them into jobs for MapReduce or other processing engines such as Tez or Spark. The video introduces Hive's purpose, its development history, and its adoption by major companies like Netflix and Amazon, and sets the stage for further exploration of Hive's capabilities and use cases in the context of the Hadoop ecosystem.

05:02

🚀 Hive's Features and Comparison with Traditional RDBMS

The script delves into the practical applications and limitations of Hive. It is highlighted as an optimal tool for Online Analytical Processing (OLAP) due to its scalability, speed, and flexibility. However, Hive is not suited for Online Transaction Processing (OLTP), real-time updates, or low-latency data retrieval scenarios. The script enumerates Hive's capabilities, such as support for various file formats, metadata storage in RDBMS like Derby, and compression techniques. It also touches on Hive's use of user-defined functions (UDF) and specialized joins to enhance query performance. A key differentiator between Hive and traditional RDBMS is the concept of 'schema on read' versus 'schema on write', with Hive being more flexible in data ingestion. The script concludes by contrasting Hive's ability to handle massive datasets with the more limited storage capacities of traditional RDBMS and sets the agenda for the next video, which will cover Hive installation.


Keywords

💡 Apache Hive

Apache Hive is an open-source data warehouse software that facilitates reading, writing, and managing large datasets in distributed storage using SQL queries. It was originally developed by Facebook and is now maintained by the Apache Software Foundation. In the video, Hive is introduced as a popular component of the Big Data landscape, designed to complement the Hadoop file system and enable users to interact with large datasets using SQL-like queries.

💡 Hadoop

Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is integral to the discussion in the video as Hive is designed to work with Hadoop's ecosystem, particularly with its file system, HDFS, to manage and process large volumes of data.

💡 Hive Query Language (HQL)

HQL is a SQL-like language used in Hive that allows users to write queries to extract data from Hadoop. It is a key feature highlighted in the video, as it enables users familiar with SQL to interact with the Hadoop ecosystem without needing to learn a new programming model. The script mentions that HQL queries are converted into MapReduce jobs by Hive.

💡 OLAP (Online Analytical Processing)

OLAP refers to a category of business intelligence (BI) software that allows users to analyze data from multiple perspectives. In the context of the video, Hive is highlighted as a tool for OLAP, enabling scalable, fast, and flexible analysis of large datasets residing on HDFS.

💡 OLTP (Online Transaction Processing)

OLTP is a class of software that supports transaction-oriented applications that require high performance and fast response times. The video script contrasts Hive with traditional RDBMS by noting that Hive is not suitable for OLTP, as it is designed for batch processing and not for real-time transaction processing.

💡 Schema on Read

Schema on read is a concept where the schema is applied when the data is read, rather than when it is written. This is a key feature of Hive, allowing for flexible data ingestion without strict schema enforcement at the time of data insertion, as opposed to traditional RDBMS which enforce schema on write.

💡 Metadata

Metadata in the context of Hive refers to data about data, such as table names, column names, and data types. The video mentions that Hive stores metadata in an RDBMS like Derby, which is crucial for managing and querying the data stored in Hadoop.

💡 Compression Techniques

Compression techniques are methods used to reduce the size of data for storage and transmission. The video script highlights that Hive supports various compression techniques, such as snappy and gzip, which are important for efficiently managing large datasets in Hadoop.
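Compression is typically switched on with session settings. A sketch using standard Hive/Hadoop properties for Snappy-compressed query output:

```sql
-- Compress the final output of queries using the Snappy codec.
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
```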

💡 User-Defined Functions (UDF)

UDFs are custom functions that users can define and use in their Hive queries to perform operations that are not supported by the standard functions provided by Hive. The video mentions that users can plug-in MapReduce scripts into Hive queries using UDFs, which extends the functionality of Hive.
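A typical workflow is to package the custom function in a JAR and register it in the session. A sketch where the JAR path, Java class, and function name are all hypothetical:

```sql
-- Register a custom UDF from a user-supplied JAR, then use it in a query.
ADD JAR /tmp/my-udfs.jar;
CREATE TEMPORARY FUNCTION normalize_url
  AS 'com.example.hive.udf.NormalizeUrl';

SELECT normalize_url(url) FROM page_views;
```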

💡 MapReduce

MapReduce is a programming model for processing large datasets with a parallel, distributed algorithm on a cluster. In the video, it is explained that Hive converts HQL queries into MapReduce jobs, which are then executed on the Hadoop cluster to process the data.
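The compilation into map and reduce stages can be inspected with `EXPLAIN`, which prints the plan Hive generates for a query. A sketch with a hypothetical table name:

```sql
-- Show the execution plan Hive compiles this query into; with the
-- classic engine it is a DAG of map and reduce stages.
EXPLAIN
SELECT country, COUNT(*) FROM page_views GROUP BY country;
```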

💡 Spark

Apache Spark is an open-source distributed general-purpose cluster-computing framework. The script notes that Hive can convert queries into Spark jobs, and Hive can be integrated with Spark as an execution engine for improved performance over MapReduce.

Highlights

Apache Hive is a popular data warehouse component in the Big Data landscape.

Hive was originally developed by Facebook and is now maintained by the Apache Software Foundation.

Hive is used by major companies like Netflix and Amazon.

Hive provides SQL-like interface to interact with the Hadoop ecosystem.

Hive was developed to ease the transition for organizations with traditional SQL-based data warehouses.

Hive allows users to write SQL-like queries called HQL to extract data from Hadoop.

Hive can be used for OLAP (Online Analytical Processing) and is scalable, fast, and flexible.

Hive is not suitable for OLTP (Online Transaction Processing) or real-time updates.

Hive supports various file formats like Sequence File, Text File, Avro, and ORC.

Hive stores metadata in RDBMS like Derby database.

Hive offers compression techniques for efficient data storage and retrieval.

Users can write SQL-like queries which Hive converts into MapReduce, Tez, or Spark jobs.

Hive supports User-Defined Functions (UDF) to extend query capabilities.

Hive offers specialized joins to improve query performance.

Hive enforces schema on read, allowing data insertion without immediate schema validation.

Hive can store hundreds of petabytes of data due to its integration with HDFS.

Traditional RDBMS enforces schema on write, verifying data during insertion.

Hive is not designed for OLTP, unlike traditional RDBMS systems.

The video concludes with a summary of Hive's development, use cases, features, and differences from traditional RDBMS.

Transcripts

00:01

what is hive

00:05

in this video you will get a quick

00:07

overview of Apache hive one of the most

00:10

popular data warehouse components on the

00:12

Big Data landscape it's mainly used to

00:15

complement the Hadoop file system with

00:17

its interface hive was originally

00:20

developed by Facebook and is now

00:22

maintained as Apache hive by Apache

00:24

Software Foundation it is used and

00:27

developed by biggies such as Netflix and

00:29

Amazon as well

00:32

let's start understanding hive to do this

00:35

we'll need to get a grip on the

00:37

following things

00:38

why was hive developed how and when it

00:41

can be used

00:43

when hive cannot be used

00:46

finest features of hive

00:48

hive versus traditional RDBMS

00:54

why was hive developed

00:57

the Hadoop ecosystem is not just

01:00

scalable but also cost-effective when it

01:03

comes to processing large volumes of

01:04

data it is also a fairly new framework

01:07

that packs a lot of punch

01:09

however organizations with traditional

01:12

data warehouses are based on SQL with

01:15

users and developers that rely on SQL

01:17

queries for extracting data it makes

01:20

getting used to the Hadoop ecosystem an

01:22

uphill task and that is exactly why hive

01:25

was developed hive provides a SQL

01:28

interface so that users can write SQL

01:30

like queries called HQL or hive query

01:34

language to extract the data from Hadoop

01:37

these SQL like queries will be

01:39

converted into MapReduce jobs by the

01:41

hive component and that is how it talks

01:43

to Hadoop ecosystem and HDFS file system

01:49

how and when hive can be used

01:55

hive can be used for OLAP online

01:57

analytic processing it is scalable fast

02:00

and flexible it is a great platform for

02:04

the SQL users to write SQL like

02:06

queries to interact with the large

02:08

datasets that reside on HDFS file system

02:14

when hive cannot be used

02:19

here is what hive cannot be used for

02:22

it is not a relational database it cannot be

02:25

used for OLTP online transaction

02:28

processing it cannot be used for

02:30

real-time updates or queries it cannot

02:33

be used for scenarios where low latency

02:35

data retrieval is expected because there

02:38

is a latency in converting the hive

02:40

scripts into MapReduce scripts by hive

02:45

some of the finest features of hive

02:50

it supports different file formats like

02:52

sequence file text file

02:54

Avro file format ORC file and RC file

03:00

metadata gets stored in RDBMS like

03:03

Derby database

03:05

hive provides a lot of compression

03:07

techniques to query on the compressed

03:09

data such as snappy compression and gzip

03:12

compression

03:14

users can write SQL like queries that

03:16

hive converts into MapReduce or tez or

03:19

spark jobs to query against Hadoop

03:22

datasets

03:23

users can plug in MapReduce scripts into

03:26

the hive queries using UDF user-defined

03:29

functions

03:32

specialized joins are available that

03:34

help to improve the query performance

03:37

if you don't understand any of the above

03:39

terms that's fine we will look into the

03:41

above features in detail in our upcoming

03:43

videos

03:47

hive versus traditional RDBMS

03:51

hive enforces schema on read schema on

03:55

read allows the component to insert data

03:58

without checking for the type or schema

04:00

definition of the table it verifies the

04:02

data only when data is read from the

04:05

table

04:06

traditional RDBMS enforce schema on

04:09

write schema on write includes verifying

04:12

if the data is inserted as per the table

04:14

definition and schema definition during

04:16

the write phase itself this is how

04:19

RDBMS databases like MySQL or Oracle

04:22

servers work

04:25

hive allows you to store hundreds of

04:27

petabytes of data because hive stores

04:29

data in HDFS which has a scalable

04:32

storage space

04:34

RDBMS have a max storage capacity of

04:37

around 10 terabytes of data and querying

04:40

such large data is not an easy task

04:44

hive doesn't support OLTP

04:47

RDBMS supports OLTP

04:53

to quickly summarize in this video we

04:55

learned why hive was developed

04:57

we also learned how and where hive can

04:59

best work its magic and where it is not

05:01

such a great fit then we went through

05:04

some great features that come with hive

05:06

finally we explored the striking

05:08

difference between the hive and

05:10

traditional RDBMS

05:15

in the next video we will see how to

05:17

install hive

