What is a Data Warehouse?

IBM Technology
4 Jun 2021 08:20

Summary

TLDR: Luv Aggarwal, a Data Platform Solution Engineer at IBM, explains the concept of an Enterprise Data Warehouse (EDW), distinguishing it from data lakes and data marts. EDWs are organized collections of clean business data, crucial for decision-making. They can be deployed on-premises, in the cloud, or through a hybrid approach. Aggarwal highlights the benefits and challenges of each deployment method, emphasizing the importance of EDWs in enterprise architecture.

Takeaways

  • Introduction: Luv Aggarwal, a Data Platform Solution Engineer for IBM, introduces the topic of enterprise data warehouses (EDW).
  • Definition of EDW: An enterprise data warehouse is a large, organized collection of clean business data designed to support decision-making within an organization.
  • Distinction from Data Lakes: Data lakes store raw structured and unstructured data for later cleaning and organization, whereas data warehouses are more purpose-specific.
  • Data Marts: A data mart is a subset of a data warehouse, focused on a specific business domain, such as finance.
  • Single Source of Truth: The data warehouse serves as a single source of truth, integrating data from various source systems.
  • Data Transformation: Data is transformed from raw to high-quality, analytics-optimized data through ETL processes.
  • Source Systems: The data in a warehouse can come from diverse systems like CRMs, ERP systems, and supply chain databases.
  • User Roles: Users of a data warehouse include business analysts, data scientists, and data engineers who leverage the data for analytics and machine learning.
  • Deployment Options: Data warehouses can be deployed on-premises, in the cloud, or through a hybrid approach combining both.
  • On-Premises Benefits: On-premises deployment offers control, local network speeds, high availability, and regulatory compliance, but requires upfront investment and maintenance.
  • Cloud Benefits: Cloud-based data warehouses offer scalability, resource efficiency, and automatic upgrades, but can take a performance hit and incur unanticipated costs.
  • Hybrid Approach: A hybrid approach combines the benefits of on-premises and cloud deployments, allowing for flexibility in use-cases and disaster recovery.

Q & A

  • What is Luv Aggarwal's professional role?

    -Luv Aggarwal is a Data Platform Solution Engineer for IBM.

  • What is the primary purpose of a data warehouse?

    -A data warehouse is a large collection of organized and clean business data, designed to help an organization make decisions.

  • How does a data warehouse differ from a data lake?

    -A data warehouse is more purpose-specific and contains organized and clean data, while a data lake is a place to store raw, structured, and unstructured data for later cleaning and organization.

  • What is a data mart in the context of data warehousing?

    -A data mart is a subset of a data warehouse that is specific to a particular business domain, such as a finance data mart.

  • What is the role of ETL in the context of data warehousing?

    -ETL, or Extract, Transform, and Load, is the process used to convert raw data from various source systems into high-quality, optimized data for analytics within the data warehouse.

  • What types of data can be found in a data warehouse?

    -A data warehouse can contain various types of data, including customer data from CRMs, sales data, ERP system data, supply chain data, and more.

  • Who are the typical users of a data warehouse?

    -Typical users of a data warehouse include business analysts, data scientists, and data engineers who leverage the data for analytics, business intelligence, and machine learning.

  • What are the three common deployment methods for a data warehouse?

    -The three common deployment methods for a data warehouse are on-premises, cloud-based, and a hybrid approach combining both on-premises and cloud.

  • What are the benefits of having an on-premises data warehouse?

    -Benefits of an on-premises data warehouse include maintaining complete control over the tech stack, leveraging local network speeds, high availability, and strict governance and regulatory compliance.

  • What are the advantages of a cloud-based data warehouse?

    -Advantages of a cloud-based data warehouse include freeing up resources to focus on analytics tasks, easy scalability without needing to procure new hardware, and automatic upgrades.

  • What is the hybrid approach to data warehouse deployment and why is it chosen?

    -The hybrid approach combines on-premises and cloud data warehouses, chosen for exploring new cloud-born use-cases and for disaster recovery and backup scenarios.

Outlines

00:00

Introduction to Enterprise Data Warehouses

In this introductory paragraph, Luv Aggarwal, a Data Platform Solution Engineer at IBM, sets the stage for a discussion on enterprise data warehouses (EDWs). He mentions the growing complexity of data management solutions over the past two decades and clarifies the distinction between data lakes, data warehouses, and data marts. Aggarwal emphasizes that a data warehouse is a purpose-specific, organized collection of clean business data, in contrast to a data lake, which is a repository for raw data of various formats. He also touches on the concept of a data mart, which is a more domain-specific subset of a data warehouse, such as a finance data mart. The paragraph establishes the data warehouse as a critical component for organizational decision-making, highlighting its role as a single source of truth derived from multiple source systems and optimized for analytics through ETL processes.

05:04

šŸ¢ Deployment Options for Data Warehouses

This paragraph covers the three primary deployment options for data warehouses: on-premises, cloud-based, and hybrid. An on-premises warehouse can run on commodity hardware, using either MPP (Massively Parallel Processing) or SMP (Symmetric Multi-Processing) architecture, or on a purpose-built appliance. The benefits include maintaining control over the tech stack, leveraging local network speeds, and ensuring high availability and compliance. However, this method requires an upfront investment and ongoing maintenance. Cloud-based data warehouses free up resources for analytics tasks, scale easily, and receive automatic upgrades, but they can take a performance hit and incur unanticipated costs. The hybrid approach runs both in conjunction, which allows exploration of new cloud-born data sources while mission-critical workloads stay on-premises, and supports disaster recovery and backup. The paragraph concludes by acknowledging the vast topic of enterprise data warehouses and their place within an overall enterprise architecture, inviting viewers to engage with the content and explore IBM's data solutions further.
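The MPP idea mentioned above (add compute nodes as the workload grows) can be illustrated with a small Python sketch. It is only a toy model under stated assumptions: threads stand in for compute nodes, the sales rows are made up, and the function names are hypothetical rather than taken from the video or from any product. Rows are hash-partitioned across nodes, each node aggregates its own shard, and a coordinator merges the partial results.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
import zlib

# Hypothetical sales rows: (region, amount).
SALES = [("north", 100), ("south", 250), ("north", 50), ("east", 75), ("south", 25)]


def partition(rows, num_nodes):
    """Assign each row to a node by hashing its key (deterministic via crc32)."""
    nodes = [[] for _ in range(num_nodes)]
    for region, amount in rows:
        node_id = zlib.crc32(region.encode("utf-8")) % num_nodes
        nodes[node_id].append((region, amount))
    return nodes


def local_aggregate(rows):
    """Work done independently on one node: sum amounts per region."""
    totals = Counter()
    for region, amount in rows:
        totals[region] += amount
    return totals


def mpp_total_by_region(rows, num_nodes):
    """Coordinator: fan out to the nodes, then merge their partial totals."""
    shards = partition(rows, num_nodes)
    # Threads are an assumption standing in for separate machines.
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        partials = list(pool.map(local_aggregate, shards))
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # Counter.update adds values per key
    return dict(merged)


if __name__ == "__main__":
    print(mpp_total_by_region(SALES, num_nodes=3))
```

Changing `num_nodes` changes how the rows are spread out but not the final totals, which is the property that lets an MPP system grow by adding nodes. An SMP system, by contrast, would run the same aggregation on one multi-CPU machine with shared resources.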

Keywords

Data Platform Solution Engineer

A Data Platform Solution Engineer is a professional who specializes in designing, implementing, and managing data platforms. In the context of the video, Luv Aggarwal holds this role at IBM, indicating a deep understanding of data solutions and their application within an enterprise setting. The role is central to the video's theme as it deals with the complexities and advancements in data warehousing.

Enterprise Data Warehouse (EDW)

An Enterprise Data Warehouse (EDW) is a system used to report and analyze data from one or more business systems. The video emphasizes the EDW as a large, organized collection of clean business data, which is crucial for decision-making within an organization. The EDW is distinguished from data lakes and data marts, highlighting its purpose-specific nature and its role as a single source of truth.

Data Lake

A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed. In the video, the data lake is described as a place to dump all sorts of raw, structured, and unstructured data for later cleaning and organization, contrasting it with the more structured and purpose-specific nature of a data warehouse.

Data Mart

A data mart is a subset of a data warehouse that is designed to cater to a specific business unit or group. The script uses the example of a finance data mart to illustrate how a data mart is more focused than a general data warehouse, which serves the entire organization.
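One way to picture a data mart as a subset of the warehouse is a view that exposes only one domain's rows. The Python sketch below uses the standard-library `sqlite3` module as a stand-in for a warehouse; the `fact_transactions` table, the `finance_mart` view, and the sample rows are hypothetical. In practice a data mart may also be a separate physical store rather than a view.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical warehouse table covering several business domains.
conn.execute(
    "CREATE TABLE fact_transactions ("
    "id INTEGER PRIMARY KEY, domain TEXT, account TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO fact_transactions (domain, account, amount) VALUES (?, ?, ?)",
    [
        ("finance", "revenue", 1200.0),
        ("finance", "expenses", -450.0),
        ("supply_chain", "freight", -80.0),
        ("sales", "pipeline", 300.0),
    ],
)

# The "finance data mart": only the finance slice of the warehouse.
conn.execute(
    "CREATE VIEW finance_mart AS "
    "SELECT id, account, amount FROM fact_transactions WHERE domain = 'finance'"
)

finance_rows = conn.execute(
    "SELECT account, amount FROM finance_mart ORDER BY id"
).fetchall()
print(finance_rows)
```

Finance users query `finance_mart` and never see supply chain or sales rows, while the warehouse table remains the single source of truth behind it.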

ETL (Extract, Transform, Load)

ETL refers to the process of extracting data from various sources, transforming it into a suitable format, and loading it into a target system, such as a data warehouse. The video mentions ETL tools as a means to convert raw data into high-quality, analytics-optimized data, which is essential for the functionality of an EDW.
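As a rough illustration of the extract, transform, and load steps described above, the following Python sketch again uses `sqlite3` as a stand-in for a warehouse. The sample rows, the `dim_customer` table, and the function names are all hypothetical and do not come from the video; real ETL tools handle far more (scheduling, incremental loads, error handling).

```python
import sqlite3

# Hypothetical raw rows as they might arrive from a CRM export:
# stray whitespace, inconsistent casing, a duplicate, and a missing email.
RAW_CUSTOMERS = [
    {"id": "1", "name": "  Ada Lovelace ", "email": "ADA@Example.com"},
    {"id": "2", "name": "Alan Turing", "email": None},
    {"id": "1", "name": "Ada Lovelace", "email": "ada@example.com"},
]


def extract():
    """Extract: in practice this would read from a source system."""
    return list(RAW_CUSTOMERS)


def transform(rows):
    """Transform: normalize fields, drop incomplete rows, de-duplicate by id."""
    cleaned = {}
    for row in rows:
        if not row["email"]:
            continue  # skip rows that fail a basic quality check
        customer_id = int(row["id"])
        cleaned[customer_id] = (
            customer_id,
            row["name"].strip(),
            row["email"].strip().lower(),
        )
    return sorted(cleaned.values())


def load(conn, rows):
    """Load: write the cleaned rows into a warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS dim_customer ("
        "customer_id INTEGER PRIMARY KEY, name TEXT, email TEXT)"
    )
    conn.executemany("INSERT OR REPLACE INTO dim_customer VALUES (?, ?, ?)", rows)
    conn.commit()


def run_etl(conn):
    """Run the three steps in order and return what ended up in the warehouse."""
    load(conn, transform(extract()))
    return conn.execute(
        "SELECT customer_id, name, email FROM dim_customer ORDER BY customer_id"
    ).fetchall()


if __name__ == "__main__":
    warehouse = sqlite3.connect(":memory:")
    print(run_etl(warehouse))
```

Keeping the three steps as separate functions mirrors the raw-to-clean flow in the video: only rows that pass the transform step reach the warehouse table.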

CRM (Customer Relationship Management)

CRM systems are used to manage a company's interaction with current and potential customers. In the context of the video, CRM data is one of the types of data that can be included in a data warehouse, highlighting the importance of customer information in business analytics.

ERP (Enterprise Resource Planning)

ERP systems integrate an organization's core business processes, such as finance, inventory, and human resources. The video script mentions ERP as a source of data for the data warehouse, indicating the comprehensive nature of data that an EDW can include.

Business Analyst

A business analyst is a professional who analyzes data and uses it to help make business decisions. The video identifies business analysts as one of the key users of a data warehouse, emphasizing their role in leveraging data for analytics and decision-making.

Data Scientist

A data scientist is an expert in analyzing and interpreting complex digital data to aid decision-making. The script mentions data scientists as users of the data warehouse, indicating the advanced analytical techniques they may apply to the data.

On-Premises

An on-premises deployment refers to installing and running a data warehouse within the organization's own physical infrastructure. The video discusses the benefits of on-premises deployment, such as control over the tech stack and local network speeds, as well as the challenges like upfront investment and maintenance.

Cloud-Based Data Warehouse

A cloud-based data warehouse is a data warehouse that is delivered as a service via a public cloud provider. The video outlines the benefits of cloud-based deployment, such as resource optimization and scalability, while also noting potential drawbacks like performance issues and unexpected costs.

Hybrid Approach

A hybrid approach combines both on-premises and cloud-based data warehouse deployments. The video script explains that this approach allows organizations to leverage the benefits of both deployment methods, such as exploring new cloud-based use-cases while maintaining mission-critical on-premises workloads.
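The placement rule implied here (cloud-born data sources go to the cloud warehouse, mission-critical workloads stay on-premises) can be written down as a tiny Python sketch. The `Workload` class, the example workloads, and the tie-breaking choices are assumptions made for illustration; the video does not prescribe a specific policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    name: str
    cloud_born: bool        # data originates in a cloud service
    mission_critical: bool  # must stay under full local control


def choose_target(workload):
    """Pick a warehouse for a workload under the assumed policy."""
    if workload.mission_critical:
        return "on_prem"  # assumption: critical workloads always stay on-prem
    if workload.cloud_born:
        return "cloud"
    return "on_prem"  # assumption: default to the existing on-prem warehouse


def plan(workloads):
    """Group workload names by their chosen target."""
    placement = {"on_prem": [], "cloud": []}
    for workload in workloads:
        placement[choose_target(workload)].append(workload.name)
    return placement


if __name__ == "__main__":
    print(plan([
        Workload("billing", cloud_born=False, mission_critical=True),
        Workload("clickstream", cloud_born=True, mission_critical=False),
    ]))
```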

Highlights

Introduction of the speaker, Luv Aggarwal, a Data Platform Solution Engineer for IBM.

Explaining the growth and complexity of data warehouses over the past 20+ years.

Clarifying the difference between data lakes, data warehouses, and data marts.

Describing data warehouses as purpose-specific collections of clean and organized business data.

Data marts are subsets of data warehouses specific to a particular business domain.

Focusing on the data warehouse as the single source of truth across multiple knowledge domains.

Data in the warehouse comes from various source systems and is transformed for analytics.

Source systems include transactional systems and relational databases, and they cover a wide variety of business domains.

Data warehouse users include business analysts, data scientists, and data engineers.

Users leverage data sets for analytics and machine learning using built-in tools or external platforms.

Three common deployment methods for data warehouses: on-premises, cloud-based, and hybrid.

On-premises deployment can be configured using MPP or SMP architecture.

Benefits of on-premises data warehouses include control, local network speeds, and high availability.

Challenges of on-premises deployment include upfront investment and ongoing maintenance.

Cloud-based data warehouses are delivered as a managed SaaS offering, with easy scaling and automatic upgrades.

Potential drawbacks of cloud-based data warehouses include performance hits and unanticipated costs.

Hybrid approach combines the benefits of on-premises and cloud deployments.

Hybrid deployment allows for exploring new use-cases and disaster recovery scenarios.

Enterprise data warehouses fit into overall enterprise architecture and support various analytical tasks.

Transcripts

play00:00

Hey, what's up, everyone? My name is Luv Aggarwal and I'm

play00:03

a Data Platform Solution Engineer for IBM.

play00:06

Data warehouses. Their prevalence across enterprises has grown significantly

play00:10

over the past 20+ years. But with multiple modern advancements,

play00:15

the numerous options out there are now much more complex.

play00:19

So, let's talk about what an enterprise data warehouse, or "EDW", is. So, first and foremost,

play00:25

there's often confusion between "data lakes" and "data warehouses" and even "data marts".

play00:46

So, I like to think of a data warehouse as being more purpose-specific than a data lake. So,

play00:52

while a data lake is a great place to dump all sorts of raw, structured and unstructured data

play00:57

in a quick way to clean and organize later, a data warehouse, on the other hand, is a large

play01:02

collection of organized and clean business data, ready to help an organization make decisions.

play01:09

And a data mart is like a subset of a data warehouse that's more specific to a

play01:14

particular business domain. So, for example, you could have a finance data mart.

play01:19

But for today, let's focus on the data warehouse.

play01:22

So, we'll get rid of data lakes and data marts, and we'll make this a little bit bigger.

play01:27

So, the data warehouse serves as the single source of truth for an organization across multiple

play01:32

knowledge domains. And data in the warehouse comes from multiple different source systems.

play01:43

And is transformed from raw data to high quality data,

play01:48

optimized for analytics via various different ETL, or "Extract, Transform and Load" tools.

play01:58

So, as I mentioned, data that's in our source systems can be in

play02:04

different types. It could be transactional systems, it can be relational databases,

play02:08

and they can cover a wide variety of business domains.

play02:12

So, the data could cover things like customer data from our CRMs. We could have sales data.

play02:22

We could have data from our ERP systems. We could even have supply chain data.

play02:30

And the list goes on and on. Right.

play02:34

So, once data has been cleaned, transformed and

play02:38

loaded into our data warehouse, it's now ready for us to expose to our users,

play02:45

who can then start to take it and do analytics and machine learning on these data sets.

play02:52

So, who are our users? Our users can be folks like business analysts. We can have data

play03:03

scientists. We could even have data engineers. And these folks can now start leveraging these data

play03:16

sets, either using the built-in analytics tools in the data warehouse or using a variety of different

play03:25

business intelligence or predictive analytics and machine learning platforms.

play03:34

OK, so now that we know what an enterprise data warehouse is,

play03:38

let's talk about the different ways in which it can be implemented.

play03:42

So, three common ways in which a data warehouse can be deployed.

play03:46

The first way is on-premises. Now, a couple different ways in which an

play03:52

on-prem data warehouse can be configured, we could have our data warehouse running on

play03:59

commodity hardware. Now, this could be set up and structured using either MPP, or "Massively

play04:08

Parallel Processing", architecture where we just add more compute nodes as our workload grows,

play04:15

or using SMP, or "Symmetric Multi-Processing", architecture where, typically, we have a

play04:23

tightly coupled, multi-CPU system that shares resources from one common operating system.

play04:30

Now, the other way is through a purpose-built appliance format.

play04:38

Now, this is typically an integrated stack of CPU, memory, storage, and software,

play04:46

all purpose-built and optimized for a data warehouse workload from a single vendor.

play04:51

So, what are some of the benefits of having an on-prem data warehouse?

play04:56

Well, first you get to maintain complete control over the entire tech stack, right?

play05:03

Second, you can leverage your local network speeds and perhaps avoid some bandwidth challenges

play05:11

typically associated with the cloud. You can also leverage high availability, and we can maintain

play05:20

strict governance and regulatory compliance, but on the other hand, an on-prem data warehouse does

play05:27

come with an upfront investment and the need for ongoing support and maintenance.

play05:33

Now, the other way in which a data warehouse can be deployed

play05:36

is through a cloud-based data warehouse, where our data warehouse is delivered as

play05:43

a managed SaaS offering via the multiple public cloud providers.

play05:50

So, moving data warehouses to the cloud is the next frontier for a lot of enterprises

play05:56

and for valid reasons. Some of the benefits include being able to free up resources

play06:03

to focus on other high value analytics tasks, right, instead of just managing systems.

play06:10

Another benefit can also be the ability to scale easily. Right,

play06:15

because we don't have to go out and procure new hardware

play06:19

and we get to leverage automatic upgrades. Right. Now, on the other hand, oftentimes a cloud-based

play06:31

data warehouse can take a performance hit due to how it's fine tuned for that specific workload,

play06:37

and there can be some unanticipated high costs due to how cloud data warehouse is scaled.

play06:44

OK, the third option is actually a hybrid approach. So, this takes the best of on-prem

play06:54

and cloud and brings them together. And a lot of enterprises choose to run both their on-prem

play06:59

and cloud data warehouses in conjunction. And this can be done for a couple of different reasons.

play07:05

So, one benefit can be that this allows us to explore new use-cases. Right. So as an enterprise,

play07:13

we may have certain data sources that were born in the cloud. So, it can be

play07:18

beneficial to start leveraging a cloud data warehouse for analytics against those use-cases

play07:24

while still maintaining their mission critical workloads on-prem.

play07:30

Another benefit can be for a disaster recovery and backup scenario.

play07:38

This is where we would use both our environments in conjunction for DR and backup reasons.

play07:44

So, if we take a step back, we can see that we've barely started to scratch the surface of

play07:49

enterprise data warehouses and how they fit into an overall enterprise architecture. But I hope

play07:55

this video has given us a good idea of how data warehouses fit in and what they're used for. Thank

play08:02

you. If you have any questions, please drop us a line below. If you want to see more videos like

play08:08

this in the future, please like and subscribe. And don't forget, if you want to learn more about any

play08:13

of the IBM data solutions we've discussed today, please feel free to check out the link below.

Related Tags
Data Warehousing, IBM Solutions, Enterprise Analytics, ETL Processes, Cloud Migration, On-Prem Deployment, Data Lakes, Data Marts, Business Intelligence, Data Governance, SaaS Offerings, Data Transformation, Analytic Tools, Machine Learning, Data Cleanliness, Hybrid Approach, Disaster Recovery, Regulatory Compliance, Performance Optimization, Resource Management, Predictive Analytics