What is a Data Warehouse?

IBM Technology
4 Jun 2021, 08:20

Summary

TLDR: Luv Aggarwal, a Data Platform Solution Engineer at IBM, explains the concept of an Enterprise Data Warehouse (EDW), distinguishing it from data lakes and data marts. EDWs are organized collections of clean business data, crucial for decision-making. They can be deployed on-premises, in the cloud, or through a hybrid approach. Aggarwal highlights the benefits and challenges of each deployment method, emphasizing the importance of EDWs in enterprise architecture.

Takeaways

  • 👋 Introduction: Luv Aggarwal, a Data Platform Solution Engineer for IBM, introduces the topic of enterprise data warehouses (EDW).
  • 📚 Definition of EDW: An enterprise data warehouse is a large, organized collection of clean business data designed to support decision-making within an organization.
  • 🔍 Distinction from Data Lakes: Data lakes store raw data, both structured and unstructured, for later cleaning and organization, whereas data warehouses are more purpose-specific.
  • 🏪 Data Marts: A data mart is a subset of a data warehouse, focused on a specific business domain, such as finance.
  • 🔑 Single Source of Truth: The data warehouse serves as a single source of truth, integrating data from various source systems.
  • 🔄 Data Transformation: Data is transformed from raw to high-quality, analytics-optimized data through ETL processes.
  • 🛠️ Source Systems: The data in a warehouse can come from diverse systems like CRMs, ERP systems, and supply chain databases.
  • 🤖 User Roles: Users of a data warehouse include business analysts, data scientists, and data engineers who leverage the data for analytics and machine learning.
  • 🏭 Deployment Options: Data warehouses can be deployed on-premises, in the cloud, or through a hybrid approach combining both.
  • 💾 On-Premises Benefits: On-premises deployment offers control, local network speeds, high availability, and regulatory compliance, but requires upfront investment and maintenance.
  • ☁️ Cloud Benefits: Cloud-based data warehouses offer scalability, resource efficiency, and automatic upgrades, but may take a performance hit and incur unanticipated costs.
  • 🌐 Hybrid Approach: A hybrid approach combines the benefits of on-premises and cloud deployments, allowing for flexibility in use-cases and disaster recovery.

Q & A

  • What is Luv Aggarwal's professional role?

    -Luv Aggarwal is a Data Platform Solution Engineer for IBM.

  • What is the primary purpose of a data warehouse?

    -A data warehouse is a large collection of organized and clean business data, designed to help an organization make decisions.

  • How does a data warehouse differ from a data lake?

    -A data warehouse is more purpose-specific and contains organized and clean data, while a data lake is a place to store raw, structured, and unstructured data for later cleaning and organization.

  • What is a data mart in the context of data warehousing?

    -A data mart is a subset of a data warehouse that is specific to a particular business domain, such as a finance data mart.

  • What is the role of ETL in the context of data warehousing?

    -ETL, or Extract, Transform, and Load, is the process used to convert raw data from various source systems into high-quality, optimized data for analytics within the data warehouse.

  • What types of data can be found in a data warehouse?

    -A data warehouse can contain various types of data, including customer data from CRMs, sales data, ERP system data, supply chain data, and more.

  • Who are the typical users of a data warehouse?

    -Typical users of a data warehouse include business analysts, data scientists, and data engineers who leverage the data for analytics, business intelligence, and machine learning.

  • What are the three common deployment methods for a data warehouse?

    -The three common deployment methods for a data warehouse are on-premises, cloud-based, and a hybrid approach combining both on-premises and cloud.

  • What are the benefits of having an on-premises data warehouse?

    -Benefits of an on-premises data warehouse include maintaining complete control over the tech stack, leveraging local network speeds, high availability, and strict governance and regulatory compliance.

  • What are the advantages of a cloud-based data warehouse?

    -Advantages of a cloud-based data warehouse include freeing up resources to focus on analytics tasks, easy scalability without needing to procure new hardware, and automatic upgrades.

  • What is the hybrid approach to data warehouse deployment and why is it chosen?

    -The hybrid approach combines on-premises and cloud data warehouses, chosen for exploring new cloud-born use-cases and for disaster recovery and backup scenarios.

Outlines

00:00

📚 Introduction to Enterprise Data Warehouses

In this opening segment, Luv Aggarwal, a Data Platform Solution Engineer at IBM, sets the stage for a discussion on enterprise data warehouses (EDWs). He notes that data warehouses have become far more prevalent over the past 20+ years and that modern advancements have made the available options more complex, then clarifies the distinction between data lakes, data warehouses, and data marts. Aggarwal emphasizes that a data warehouse is a purpose-specific, organized collection of clean business data, in contrast to a data lake, which is a repository for raw data, both structured and unstructured. He also touches on the concept of a data mart, which is a more domain-specific subset of a data warehouse, such as a finance data mart. The segment establishes the data warehouse as a critical component for organizational decision-making, highlighting its role as a single source of truth derived from multiple source systems and optimized for analytics through ETL processes.

05:04

🏢 Deployment Options for Data Warehouses

This segment covers the three primary deployment options for data warehouses: on-premises, cloud-based, and hybrid. An on-premises warehouse can run on commodity hardware using either MPP or SMP architecture, or on a purpose-built appliance. The benefits include maintaining control over the tech stack, leveraging local network speeds, and ensuring high availability and compliance. However, this method requires an upfront investment and ongoing maintenance. Cloud-based data warehouses free up resources for analytics tasks, scale easily, and upgrade automatically, but they may take a performance hit and incur unanticipated costs. The hybrid approach combines the two, allowing for analytics on data sources that were born in the cloud and for disaster recovery and backup. The segment concludes by acknowledging that the video only scratches the surface of enterprise data warehouses and their place within an overall enterprise architecture, and invites viewers to engage with the content and explore IBM's data solutions further.
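The MPP idea mentioned above, adding compute nodes as the workload grows, can be illustrated with a small sketch. This is not how any particular warehouse product is implemented; it only shows the general pattern of splitting rows across nodes, letting each node compute a partial result, and merging the partials. The function names (`partition`, `node_sum`, `mpp_total`) and the sample rows are hypothetical, and threads stand in for separate compute nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(rows, node_count):
    """Distribute (key, amount) rows across nodes by key modulo node count."""
    shards = [[] for _ in range(node_count)]
    for key, amount in rows:
        shards[key % node_count].append((key, amount))
    return shards


def node_sum(shard):
    """Work done independently on one node: sum the amounts in its own shard."""
    return sum(amount for _, amount in shard)


def mpp_total(rows, node_count):
    """Coordinator: fan the shards out to the nodes, then merge the partial sums."""
    shards = partition(rows, node_count)
    # Threads are only a stand-in for separate compute nodes in this sketch.
    with ThreadPoolExecutor(max_workers=node_count) as pool:
        partials = list(pool.map(node_sum, shards))
    return sum(partials)


if __name__ == "__main__":
    sales = [(order_id, order_id * 10) for order_id in range(1, 11)]
    # The total is the same whether 2 or 4 nodes share the work.
    print(mpp_total(sales, node_count=2), mpp_total(sales, node_count=4))
```

In this toy version, growing the workload means raising `node_count`. An SMP system, by contrast, would run the whole sum on one tightly coupled multi-CPU machine under a single operating system.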

Keywords

💡Data Platform Solution Engineer

A Data Platform Solution Engineer is a professional who specializes in designing, implementing, and managing data platforms. In the context of the video, Luv Aggarwal holds this role at IBM, indicating a deep understanding of data solutions and their application within an enterprise setting. The role is central to the video's theme as it deals with the complexities and advancements in data warehousing.

💡Enterprise Data Warehouse (EDW)

An Enterprise Data Warehouse (EDW) is a system used to report and analyze data from one or more business systems. The video emphasizes the EDW as a large, organized collection of clean business data, which is crucial for decision-making within an organization. The EDW is distinguished from data lakes and data marts, highlighting its purpose-specific nature and its role as a single source of truth.

💡Data Lake

A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed. In the video, the data lake is described as a place to dump all sorts of raw, structured, and unstructured data for later cleaning and organization, contrasting it with the more structured and purpose-specific nature of a data warehouse.

💡Data Mart

A data mart is a subset of a data warehouse that is designed to cater to a specific business unit or group. The script uses the example of a finance data mart to illustrate how a data mart is more focused than a general data warehouse, which serves the entire organization.
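One way to picture a data mart as a subset of a warehouse is a filtered view over a warehouse table. The sketch below uses Python's built-in `sqlite3` module as a stand-in for a real warehouse. The table name `fact_transactions`, the view name `finance_mart`, and the sample rows are all hypothetical, and in practice a data mart is often a separate schema or database rather than a view.

```python
import sqlite3


def build_finance_mart():
    """Create a tiny 'warehouse' table and expose a finance-only view on top of it."""
    conn = sqlite3.connect(":memory:")  # in-memory stand-in for a warehouse
    conn.execute(
        "CREATE TABLE fact_transactions ("
        "txn_id INTEGER PRIMARY KEY, domain TEXT, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO fact_transactions VALUES (?, ?, ?)",
        [
            (1, "finance", 1200.0),
            (2, "sales", 300.0),
            (3, "finance", 800.0),
            (4, "supply_chain", 450.0),
        ],
    )
    # The 'data mart': only the rows that belong to the finance domain.
    conn.execute(
        "CREATE VIEW finance_mart AS "
        "SELECT txn_id, amount FROM fact_transactions WHERE domain = 'finance'"
    )
    return conn


if __name__ == "__main__":
    conn = build_finance_mart()
    print(conn.execute("SELECT txn_id, amount FROM finance_mart ORDER BY txn_id").fetchall())
```

The warehouse table still holds every domain, while the finance team queries only its own slice.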

💡ETL (Extract, Transform, Load)

ETL refers to the process of extracting data from various sources, transforming it into a suitable format, and loading it into a target system, such as a data warehouse. The video mentions ETL tools as a means to convert raw data into high-quality, analytics-optimized data, which is essential for the functionality of an EDW.
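The three ETL steps can be sketched in a few lines of Python. The video does not show any code or name a specific tool, so everything here is an assumption made for illustration: the raw CRM-style records, the cleaning rules, and the `dim_customer` table are hypothetical, and `sqlite3` stands in for the warehouse.

```python
import sqlite3

# Hypothetical raw rows, as they might arrive from a CRM export.
RAW_CUSTOMERS = [
    {"id": "1", "name": "  Ada Lovelace ", "country": "uk"},
    {"id": "2", "name": "Grace Hopper", "country": "US"},
    {"id": "2", "name": "Grace Hopper", "country": "US"},  # duplicate record
    {"id": "", "name": "Unknown", "country": "us"},        # missing key
]


def extract():
    """Extract: read raw records from the source system (an in-memory list here)."""
    return list(RAW_CUSTOMERS)


def transform(rows):
    """Transform: drop rows without a key, trim names, upper-case countries, de-duplicate."""
    seen = set()
    clean = []
    for row in rows:
        if not row["id"]:
            continue  # cannot load a customer without an identifier
        customer_id = int(row["id"])
        if customer_id in seen:
            continue  # keep only the first occurrence of each customer
        seen.add(customer_id)
        clean.append((customer_id, row["name"].strip(), row["country"].upper()))
    return clean


def load(conn, rows):
    """Load: write the cleaned rows into a warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS dim_customer ("
        "customer_id INTEGER PRIMARY KEY, name TEXT, country TEXT)"
    )
    conn.executemany("INSERT INTO dim_customer VALUES (?, ?, ?)", rows)
    conn.commit()


def run_etl():
    """Run extract, transform, and load, then return the loaded rows."""
    conn = sqlite3.connect(":memory:")  # in-memory stand-in for a warehouse
    load(conn, transform(extract()))
    return conn.execute(
        "SELECT customer_id, name, country FROM dim_customer ORDER BY customer_id"
    ).fetchall()


if __name__ == "__main__":
    print(run_etl())
```

Of the four raw records, two reach the warehouse: the row with no identifier is rejected and the duplicate is collapsed. This is the kind of cleanup that turns raw source data into analytics-ready data.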

💡CRM (Customer Relationship Management)

CRM systems are used to manage a company's interaction with current and potential customers. In the context of the video, CRM data is one of the types of data that can be included in a data warehouse, highlighting the importance of customer information in business analytics.

💡ERP (Enterprise Resource Planning)

ERP systems integrate core business processes, such as finance, inventory, and human resources, into a single system. The video script mentions ERP as a source of data for the data warehouse, indicating the comprehensive nature of data that an EDW can include.

💡Business Analyst

A business analyst is a professional who analyzes data and uses it to help make business decisions. The video identifies business analysts as one of the key users of a data warehouse, emphasizing their role in leveraging data for analytics and decision-making.

💡Data Scientist

A data scientist is an expert in analyzing and interpreting complex digital data to aid decision-making. The script mentions data scientists as users of the data warehouse, indicating the advanced analytical techniques they may apply to the data.

💡On-Premises

An on-premises deployment refers to installing and running a data warehouse within the organization's own physical infrastructure. The video discusses the benefits of on-premises deployment, such as control over the tech stack and local network speeds, as well as the challenges like upfront investment and maintenance.

💡Cloud-Based Data Warehouse

A cloud-based data warehouse is a data warehouse that is delivered as a service via a public cloud provider. The video outlines the benefits of cloud-based deployment, such as resource optimization and scalability, while also noting potential drawbacks like performance issues and unexpected costs.

💡Hybrid Approach

A hybrid approach combines both on-premises and cloud-based data warehouse deployments. The video script explains that this approach allows organizations to leverage the benefits of both deployment methods, such as exploring new cloud-based use-cases while maintaining mission-critical on-premises workloads.

Highlights

Introduction of the speaker, Luv Aggarwal, a Data Platform Solution Engineer for IBM.

Explaining the growth and complexity of data warehouses over the past 20+ years.

Clarifying the difference between data lakes, data warehouses, and data marts.

Describing data warehouses as purpose-specific collections of clean and organized business data.

Data marts are subsets of data warehouses specific to a particular business domain.

Focusing on the data warehouse as the single source of truth across multiple knowledge domains.

Data in the warehouse comes from various source systems and is transformed for analytics.

Source systems include transactional systems and relational databases, and cover a wide variety of business domains.

Data warehouse users include business analysts, data scientists, and data engineers.

Users leverage data sets for analytics and machine learning using built-in tools or external platforms.

Three common deployment methods for data warehouses: on-premises, cloud-based, and hybrid.

On-premises deployment can be configured using MPP or SMP architecture.

Benefits of on-premises data warehouses include control, local network speeds, and high availability.

Challenges of on-premises deployment include upfront investment and ongoing maintenance.

Cloud-based data warehouses offer managed SaaS, easy scaling, and automatic upgrades.

Potential drawbacks of cloud-based data warehouses include performance hits and unanticipated costs.

Hybrid approach combines the benefits of on-premises and cloud deployments.

Hybrid deployment allows for exploring new use-cases and disaster recovery scenarios.

Enterprise data warehouses fit into overall enterprise architecture and support various analytical tasks.

Transcripts

play00:00

Hey, what's up, everyone? My  name is Luv Aggarwal and I'm  

play00:03

a Data Platform Solution Engineer for IBM.

play00:06

Data warehouses. Their prevalence across  enterprises has grown significantly  

play00:10

over the past 20+ years. But with  multiple modern advancements,  

play00:15

the numerous options out there  are now much more complex.

play00:19

So, let's talk about what an enterprise data  warehouse, or "EDW", is. So, first and foremost,  

play00:25

there's often confusion between "data lakes"  and "data warehouses" and even "data marts".  

play00:46

So, I like to think of a data warehouse as being  more purpose-specific than a data lake. So,  

play00:52

while a data lake is a great place to dump all  sorts of raw, structured and unstructured data  

play00:57

in a quick way to clean and organize later, a  data warehouse, on the other hand, is a large  

play01:02

collection of organized and clean business data,  ready to help an organization make decisions.  

play01:09

And a data mart is like a subset of a  data warehouse that's more specific to a  

play01:14

particular business domain. So, for example,  you could have a finance data mart.

play01:19

But for today, let's focus on the data warehouse.  

play01:22

So, we'll get rid of data lakes and data marts,  and we'll make this a little bit bigger.

play01:27

So, the data warehouse serves as the single source  of truth for an organization across multiple  

play01:32

knowledge domains. And data in the warehouse  comes from multiple different source systems.  

play01:43

And is transformed from raw  data to high quality data,  

play01:48

optimized for analytics via various different  ETL, or "Extract, Transform and Load" tools.

play01:58

So, as I mentioned, data that's  in our source systems can be in  

play02:04

different types. It could be transactional  systems, it can be relational databases,  

play02:08

and they can cover a wide  variety of business domains.

play02:12

So, the data could cover things like customer  data from our CRMs. We could have sales data.  

play02:22

We could have data from our ERP systems.  We could even have supply chain data.  

play02:30

And the list goes on and on. Right.

play02:34

So, once data has been cleaned, transformed and  

play02:38

loaded into our data warehouse, it's  now ready for us to expose to our users,  

play02:45

who can then start to take it and do analytics  and machine learning on these data sets.

play02:52

So, who are our users? Our users can be folks  like business analysts. We can have data  

play03:03

scientists. We could even have data engineers. And  these folks can now start leveraging these data  

play03:16

sets, either using the built-in analytics tools in  the data warehouse or using a variety of different  

play03:25

business intelligence or predictive  analytics and machine learning platforms.

play03:34

OK, so now that we know what an  enterprise data warehouse is,  

play03:38

let's talk about the different ways  in which it can be implemented.  

play03:42

So, three common ways in which a  data warehouse can be deployed.

play03:46

The first way is on-premises. Now,  a couple different ways in which an  

play03:52

on-prem data warehouse can be configured,  we could have our data warehouse running on  

play03:59

commodity hardware. Now, this could be set up  and structured using either MPP, or "Massively  

play04:08

Parallel Processing", architecture where we just  add more compute nodes as our workload grows,  

play04:15

or using SMP, or "Symmetric Multi-Processing",  architecture where, typically, we have a  

play04:23

tightly coupled, multi-CPU system that shares  resources from one common operating system.

play04:30

Now, the other way is through a  purpose-built appliance format.  

play04:38

Now, this is typically an integrated stack of CPU, memory, storage, and software,

play04:46

all purpose-built and optimized for a data  warehouse workload from a single vendor.

play04:51

So, what are some of the benefits  of having an on-prem data warehouse?  

play04:56

Well, first you get to maintain complete  control over the entire tech stack, right?  

play05:03

Second, you can leverage your local network  speeds and perhaps avoid some bandwidth challenges  

play05:11

typically associated with the cloud. You can also  leverage high availability, and we can maintain  

play05:20

strict governance and regulatory compliance, but  on the other hand, an on-prem data warehouse does  

play05:27

come with an upfront investment and the  need for ongoing support and maintenance.

play05:33

Now, the other way in which a  data warehouse can be deployed  

play05:36

is through a cloud-based data warehouse,  where our data warehouse is delivered as  

play05:43

a managed SaaS offering via the multiple public cloud providers.

play05:50

So, moving data warehouses to the cloud is  the next frontier for a lot of enterprises  

play05:56

and for valid reasons. Some of the benefits  include being able to free up resources  

play06:03

to focus on other high value analytics tasks,  right, instead of just managing systems.

play06:10

Another benefit can also be the  ability to scale easily. Right,  

play06:15

because we don't have to go  out and procure new hardware  

play06:19

and we get to leverage automatic upgrades. Right.  Now, on the other hand, oftentimes a cloud-based  

play06:31

data warehouse can take a performance hit due to  how it's fine tuned for that specific workload,  

play06:37

and there can be some unanticipated high costs due to how a cloud data warehouse is scaled.

play06:44

OK, the third option is actually a hybrid  approach. So, this takes the best of on-prem  

play06:54

and cloud and brings them together. And a lot  of enterprises choose to run both their on-prem  

play06:59

and cloud data warehouses in conjunction. And this  can be done for a couple of different reasons.

play07:05

So, one benefit can be that this allows us to  explore new use-cases. Right. So as an enterprise,  

play07:13

we may have certain data sources that  were born in the cloud. So, it can be  

play07:18

beneficial to start leveraging a cloud data  warehouse for analytics against those use-cases  

play07:24

while still maintaining our mission-critical workloads on-prem.

play07:30

Another benefit can be for a disaster  recovery and backup scenario.  

play07:38

This is where we would use both our environments  in conjunction for DR and backup reasons.

play07:44

So, if we take a step back, we can see that  we've barely started to scratch the surface of  

play07:49

enterprise data warehouses and how they fit into  an overall enterprise architecture. But I hope  

play07:55

this video has given us a good idea of how data  warehouses fit in and what they're used for. Thank  

play08:02

you. If you have any questions, please drop us a  line below. If you want to see more videos like  

play08:08

this in the future, please like and subscribe. And  don't forget, if you want to learn more about any  

play08:13

of the IBM data solutions we've discussed today,  please feel free to check out the link below.

Related Tags
Data Warehousing, IBM Solutions, Enterprise Analytics, ETL Processes, Cloud Migration, On-Prem Deployment, Data Lakes, Data Marts, Business Intelligence, Data Governance, SaaS Offerings, Data Transformation, Analytic Tools, Machine Learning, Data Cleanliness, Hybrid Approach, Disaster Recovery, Regulatory Compliance, Performance Optimization, Resource Management, Predictive Analytics