What is a good model for data governance? | Amazon Web Services
Summary
TLDR: In this masterclass, Kevin Lewis delves into data governance with AWS, emphasizing the holistic approach required for effective data management. He outlines key practices including data profiling, cataloging, lineage, and quality management, crucial for aligning data with business initiatives. Lewis stresses the importance of collaboration between IT and business, the role of data stewards, and the need for a strategic roadmap to prioritize and scale data governance efforts.
Takeaways
- 📚 Data governance is not just about data cataloging or access rights; it's a holistic approach to managing data effectively for business initiatives.
- 🔍 Data profiling is crucial for systematically examining data to identify issues that could hinder the success of business initiatives.
- 🗂️ A robust data catalog is essential for making data easily accessible and well-documented for end users and application developers.
- 🌐 Data lineage is important for understanding the origins and transformations of data, which is crucial for data transparency and trust.
- 🛠️ Data quality management involves addressing specific data issues that could impede targeted business initiatives, often requiring a partnership between IT and business.
- 🔗 Data integration is necessary for combining data from various sources coherently, which is not just a technical process but also involves field-by-field alignment.
- 🎯 Master data management focuses on entities like customers, suppliers, and products, ensuring that data about the same entity is consistent across systems.
- 🛡️ Protecting data involves implementing basic security measures, access controls, and compliance with regulations to safeguard data privacy and integrity.
- 🔄 Data lifecycle management considers the cost-effective storage of data over time, balancing the need for access with the desire to optimize storage costs.
- 📈 The success of data governance lies in its ability to support specific business initiatives and improve overall data management capabilities incrementally.
Q & A
What is the main focus of the masterclass on Data Governance with AWS?
-The main focus is on the data governance capabilities and data management aspects that are crucial for preparing data to be successful with business initiatives.
Why is it a mistake to equate data governance with just a data catalog or access rights?
-Equating data governance with just a data catalog or access rights is a mistake because it overlooks the holistic approach required for effective data management. Data governance includes understanding, protecting, and curating data, which involves more than just cataloging or access control.
What are the three broad buckets that encompass data governance capabilities?
-The three broad buckets of data governance capabilities are understanding the data, protecting the data, and curating the data.
What is data profiling and why is it important?
-Data profiling is the systematic examination of data through statistics and other elements to identify any issues that may hinder the success of business initiatives. It's important for understanding the data and ensuring it's in the right condition to support business initiatives.
How does data cataloging fit into a data governance program?
-A data catalog is an important part of a data governance program as it helps make data easily accessible and well-documented for end users and application developers, facilitating the use of data for projects.
What is data lineage and why is it significant in data governance?
-Data lineage refers to understanding the history and origins of data, including which data sources it came from and how it has been transformed. It's significant for tracing data's journey and ensuring its reliability and trustworthiness.
Why is the partnership between IT and the business crucial in data governance?
-The partnership between IT and the business is crucial because it combines technical expertise with business knowledge, helping to identify and address data quality issues, prioritize initiatives, and ensure that data supports business goals effectively.
What role does data quality management play in curating data?
-Data quality management plays a critical role in curating data by identifying and addressing data quality issues that could impede business initiatives. It involves prioritizing issues, establishing data quality rules, and setting up proactive monitoring and reporting.
How does data integration differ from master data management?
-Data integration involves combining data from various sources to create a coherent whole, while master data management takes on special responsibilities for certain entities like customers, suppliers, and products, ensuring that the master data is in the necessary condition for integration and is managed effectively across systems.
What are the key aspects of protecting data in a data governance program?
-Protecting data in a data governance program involves implementing basic security measures, establishing access controls, ensuring compliance with regulations, and managing the data lifecycle to store data in the most cost-effective way over time.
Why is it important to prioritize data management practices based on targeted business initiatives?
-Prioritizing data management practices based on targeted business initiatives ensures that resources are focused on the most critical areas first, leading to more effective data management and better support for specific business goals. It also helps in building momentum and capability over time.
Outlines
👋 Introduction to Data Governance with AWS
Kevin Lewis introduces the topic of data governance, emphasizing its importance in supporting business initiatives. He explains that data governance is often mistaken as limited to certain areas like data catalogs or access rights, which are vital but not comprehensive. A holistic approach to managing data is essential to ensure it's ready for business needs, categorized into understanding, protecting, and curating data.
🔍 Understanding Data: Profiling, Catalog, and Lineage
This section discusses the 'understanding the data' aspect of governance, starting with data profiling, which involves examining data through statistics to identify issues that might affect business goals. Kevin highlights the importance of IT-business collaboration, especially with data stewards, in evaluating how data problems impact business initiatives. He also covers data cataloging for easy data access and data lineage to track data's origin and transformations.
🛠 Curating Data: Addressing Data Quality and Integration
Kevin delves into 'curating data,' focusing on data quality management, which involves identifying and addressing data quality issues. Prioritization is key to handling data that impacts business objectives. Tools and processes, coupled with business knowledge, help resolve issues. He also explains data integration—merging data from various sources—and master data management, which involves ensuring consistent and organized data across systems.
🛡 Protecting Data: Security, Compliance, and Lifecycle Management
In the 'protecting data' category, Kevin emphasizes security by determining who can access what data, both automatically as part of a role and on a project-by-project basis, with input from data stewards and data owners. Compliance with regulations is also critical. He further explains data lifecycle management, which helps store data efficiently based on business needs, ensuring it's available when necessary without overspending on storage.
📊 Prioritizing Data Management: Where to Start?
Kevin advises against starting with a specific data management practice, such as master data management or data quality management, for its own sake. Instead, businesses should prioritize based on their specific initiatives. He highlights the importance of targeting business use cases to determine which data management practices to implement first, ensuring the capability can scale over time.
📈 Holistic Data Management: Avoiding a Narrow Approach
A case study is discussed where a financial services company overly relied on a data catalog for governance. The lack of master data management and integration caused data fragmentation, making it difficult for users to join data across domains. This created inefficiencies, as teams had to repeatedly integrate core data for different projects. Kevin emphasizes the importance of holistic governance to prevent these recurring issues.
🔄 Building Momentum with Incremental Improvements
In the final section, Kevin explains how to plan and implement data management practices incrementally. Instead of tackling everything at once, businesses should prioritize data governance based on targeted use cases. Over time, this approach builds momentum, allowing for continuous improvement, reuse of data management practices, and enhanced coordination between projects.
Keywords
💡Data Governance
💡Data Catalog
💡Data Profiling
💡Data Steward
💡Data Quality Management
💡Data Integration
💡Master Data Management (MDM)
💡Data Security
💡Compliance
💡Data Lifecycle Management
💡Holistic Approach
Highlights
Introduction to the data governance capabilities and data management aspects within AWS.
The importance of a holistic approach to data governance, beyond just data cataloging or access rights.
The three broad buckets of data governance: understanding the data, protecting the data, and curating the data.
Data profiling as a systematic examination of data to identify potential issues.
The role of data stewards and the partnership between IT and business in data governance.
The necessity of a good data catalog for easy access and documentation of data.
Data lineage's importance in understanding the origins and transformations of data.
Addressing data quality issues and the prioritization of data quality management.
The collaboration between IT and business for effective data quality management.
Data integration processes and tools for combining data from various sources coherently.
Master data management's focus on special entities like customers, suppliers, and products.
Basic security measures for protecting data access and establishing data access policies.
The increasing importance of compliance with data regulations and privacy protection.
Data lifecycle management for cost-effective data storage and access over time.
The strategic approach to starting with data management practices based on targeted business initiatives.
The dangers of thinking too narrowly in data governance and the benefits of a holistic approach.
The example of a financial services organization's data governance program focused on data cataloging.
The importance of prioritizing data management practices to support targeted use cases.
The concept of building momentum and capability in data governance over time.
Transcripts
- So, hello, I'm Kevin Lewis
and welcome back to our masterclass
on Data Governance with AWS.
So, now we're going to talk
about the data governance capabilities,
the data management aspects of data governance.
So, we'll go through those.
Now that we have talked
about how to get the program started,
we can get into a little bit more detail
about the things that the data governance program does
to prepare data to be successful with business initiatives.
Okay.
So, you know, let's start with a challenge that we see.
A lot of times you'll see data governance equated with
one or two very specific data management capabilities.
So, for example, a data catalog,
which is a very important element
of a data governance program,
but it doesn't equal data governance.
Okay? It's one important part of the program.
Or you'll see a data governance program equated
with access rights
and who can access what data
and associated with security and privacy
and compliance and those kinds of things.
And again, those are very, very important aspects
of a data governance program.
But, in order to be successful in ensuring the data is ready
for supported business initiatives
we need to think holistically
about, you know,
what it means to manage the data effectively,
what it means to make sure the data's
in the right condition to support
the business initiatives.
So, we can organize this
under sort of three broad buckets,
understanding the data, protecting the data,
curating the data, okay?
So, and within each of these we can talk
about specific data management practices.
So, let's go through those.
So, when it comes to understanding the data,
let's start with data profiling.
The idea of data profiling is simply examining data
in a systematic way through statistics
and other elements to see is there
anything wrong with this data?
So, it's systematically looking at the data to look
for challenges that may hinder the success
of business initiatives.
So, here we can think about,
and through all of these data management practices,
we can think about the partnership again,
between IT and the business.
Because if I am profiling data,
and I'm using statistics to understand the data
and looking for issues with the data,
you know, it's important to
know we're never gonna get the data perfect.
You know, every single customer attribute is
never going to be 100% correct for every customer.
So, that's why it requires the business.
That's one reason why I require someone like a data steward
from the business to help me think through
not only is this data right or wrong
but even more importantly,
how is this issue going to impact
the targeted business initiative.
So, it takes this world of potential problems
and brings it down into a much, much more narrow
and manageable scope.
And again, this is why it's so important to have
targeted business initiatives that you're cascading
from to make a decision.
It's not just to justify the program.
It's to get down into the details
as to what actual work needs to be done to ensure
the data is ready.
So, data profiling is the first one.
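As a rough illustration of the profiling idea described here, the following Python sketch computes a few basic statistics for one column. It is not from the talk: the function name `profile_column`, the sample `customers` records, and the choice of statistics are all assumptions made for illustration.

```python
from collections import Counter


def profile_column(rows, column):
    """Return simple profiling statistics for one column of a list of dicts."""
    values = [row.get(column) for row in rows]
    # Treat None and empty or whitespace-only strings as missing.
    present = [v for v in values if v is not None and str(v).strip() != ""]
    counts = Counter(present)
    return {
        "rows": len(values),
        "missing": len(values) - len(present),
        "distinct": len(counts),
        "most_common": counts.most_common(3),
    }


# Hypothetical customer records, for illustration only.
customers = [
    {"id": 1, "state": "TX"},
    {"id": 2, "state": "tx"},
    {"id": 3, "state": ""},
    {"id": 4, "state": None},
    {"id": 5, "state": "TX"},
]

stats = profile_column(customers, "state")
print(stats)
```

The statistics only surface candidates such as missing values or inconsistent casing. Whether any of them matters for the targeted initiative is still the judgment call the data steward helps with.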
Then we can talk about data catalog again.
One part of a healthy data governance program is making sure
that the data is available for people who need access to it.
So, a good data catalog should allow the end user
and application developers to not have to hunt everywhere
for the data that they're looking for for their projects,
but it's available very easily,
and it's well documented, and it's accessible.
And then, there's data lineage.
So, when I'm looking at data,
I wanna know where did it come from?
Which data sources did it come from?
How is that data translated on its way
to the data that I'm actually looking at?
So, all of these are around understanding the data.
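One way to picture the lineage questions above (where did the data come from, and how was it translated on the way) is as a small graph walk. This is a minimal sketch under stated assumptions: the `LINEAGE` table, the dataset names, and `trace_sources` are hypothetical, and the recorded lineage is assumed to contain no cycles.

```python
# Hypothetical lineage records: each dataset maps to the inputs it was built
# from, along with a short note on the transformation applied.
LINEAGE = {
    "claims_report": [("claims_clean", "aggregate by month")],
    "claims_clean": [
        ("claims_system_a", "standardize codes"),
        ("claims_system_b", "standardize codes"),
    ],
}


def trace_sources(dataset, lineage):
    """Return the set of original source datasets that feed `dataset`.

    Assumes the lineage graph is acyclic.
    """
    upstream = lineage.get(dataset, [])
    if not upstream:
        # No recorded inputs: treat this dataset as an original source.
        return {dataset}
    sources = set()
    for parent, _transformation in upstream:
        sources |= trace_sources(parent, lineage)
    return sources


origins = trace_sources("claims_report", LINEAGE)
print(sorted(origins))
```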
Then it comes to curating the data.
So, we talked about profiling the data,
looking for data quality issues.
Now what do we do about those data quality issues?
So, if there is a challenge with customer data attributes,
if there is a challenge with, you know, claims data
whatever it is, now I need to, first I've prioritized
and I've decided which issues I wanna address,
because there may be issues with the data
that's just out of scope and you have to leave it
the way it is for now.
But then, you focus on the specific issues
that are gonna get in the way of your targeted initiatives.
This is where data quality management comes into play.
And again, this is why we need a partnership
between IT and the business,
because there's tools that can help support this
but you need business knowledge
of the data itself and its role in the targeted initiative.
So, for data quality management,
we're gonna do things like get to the root
of what's causing it.
Maybe there is a technical issue bringing data in
from a source.
Maybe there's some mistakes in the translation
as data comes from a certain source.
Maybe there are issues in the business process.
So, when we talk about, you know,
we talked about claims adjudication earlier.
Well, when someone is reporting a claim,
are there freeform text fields
that are not descriptive enough to be able to provide
the information necessary for the targeted initiatives?
That is gonna require potentially training,
and it may require even monitoring
of the issue going forward.
So, when we've narrowed down our data quality work
to very specific issues, then I wanna set
up some data quality rules so that in production
I'm looking for these issues to emerge.
If I can't just fix them
through a technical change to feel 100% confident
that those data elements are gonna be correct,
then I want to alert
and provide some reporting around the issues
so that we can take action very proactively
in a closed loop.
So, that's data quality management.
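The data quality rules described above could be sketched as named checks that run against incoming records and report failures for follow-up. Everything here is an assumption for illustration: the rule names, the 20-character threshold for a "descriptive" free-form field, and the sample `claims` records do not come from the talk.

```python
def check_rules(records, rules):
    """Return a list of (claim_id, rule_name) pairs for every failed rule."""
    violations = []
    for record in records:
        for name, predicate in rules.items():
            if not predicate(record):
                violations.append((record["claim_id"], name))
    return violations


# Hypothetical rules; the thresholds are illustrative, not recommendations.
RULES = {
    "description_is_descriptive": lambda r: len(r.get("description", "").strip()) >= 20,
    "amount_is_positive": lambda r: r.get("amount", 0) > 0,
}

# Hypothetical claim records.
claims = [
    {"claim_id": "C1", "description": "Rear bumper damaged in parking lot collision", "amount": 1200.0},
    {"claim_id": "C2", "description": "damage", "amount": 300.0},
    {"claim_id": "C3", "description": "Water leak from upstairs unit ruined flooring", "amount": 0},
]

violations = check_rules(claims, RULES)
for claim_id, rule in violations:
    print(f"ALERT: claim {claim_id} failed rule {rule}")
```

In a production setting the alert would feed reporting rather than a print statement, so that someone can act on it and close the loop.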
Data integration.
So, we may need to collect data from a variety of sources.
Well, that data needs to fit together coherently.
So, you need processes and tools to do that.
So, going back to claims, for example.
Let's say I've had a merger,
and I have two business units
under two brands within one insurance company.
But yet, I need to do analysis on claims data
across the company.
And I need to link that claims data to policy data
and so on.
Well, I have to take data from System A claims data
and data from system B claims data,
and they're not automatically gonna match up.
But, if I need to do analysis on that,
then I need people to work together to make sure
that happens.
It is not just a technical process,
but it's actually going field by field
in terms of what I'm analyzing,
so that it links together in a coherent way.
So, that's data integration.
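The field-by-field alignment described above can be sketched as an explicit mapping from each source system's field names to a shared schema. The field names, the two mapping tables, and `to_common` are hypothetical. In practice, the mappings are exactly what the business and IT would need to agree on together.

```python
# Hypothetical field mappings: source field name -> common field name.
FIELD_MAP_A = {"claim_no": "claim_id", "policy_no": "policy_id", "amt": "amount"}
FIELD_MAP_B = {"ClaimNumber": "claim_id", "PolicyRef": "policy_id", "PaidAmount": "amount"}


def to_common(record, field_map, source):
    """Rename fields to the common schema and normalize the amount to float."""
    common = {target: record[src] for src, target in field_map.items()}
    common["amount"] = float(common["amount"])
    # Keep track of where each record came from.
    common["source_system"] = source
    return common


# Hypothetical records from the two claims systems.
system_a = [{"claim_no": "A-100", "policy_no": "P-1", "amt": "250.00"}]
system_b = [{"ClaimNumber": "B-7", "PolicyRef": "P-9", "PaidAmount": 410}]

combined = (
    [to_common(r, FIELD_MAP_A, "system_a") for r in system_a]
    + [to_common(r, FIELD_MAP_B, "system_b") for r in system_b]
)
print(combined)
```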
Master data management is, it's like data integration
but it's taking on, you know, special responsibilities
for certain entities like customers, suppliers, products.
These are entities that have special considerations
because the master data tends to show
up in a variety of systems about the same entity.
So information about the same customer will show up
in multiple systems.
And information about the same product will show up
in multiple systems,
and they'll be organized typically
into hierarchies and categories and that kind of thing.
So master data management takes on
those special characteristics
that you're gonna need to make sure
that the master data is
in a condition necessary to be integrated
from a variety of sources.
And then so that it's managed effectively and coherently.
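To make the master data idea concrete, here is a minimal sketch that groups records about the same customer from different systems and builds one consolidated record per customer. It assumes, purely for illustration, that a normalized email address is a good enough match key and that the first non-empty value wins. Real matching and survivorship rules are usually more involved, and all names and records here are hypothetical.

```python
from collections import defaultdict


def match_key(record):
    """Normalize the email so the same customer matches across systems."""
    return record["email"].strip().lower()


def build_master(records):
    """Group records by match key and build one consolidated record each."""
    groups = defaultdict(list)
    for record in records:
        groups[match_key(record)].append(record)

    master = {}
    for key, group in groups.items():
        golden = {"email": key, "sources": [r["system"] for r in group]}
        for field in ("name", "phone"):
            # Survivorship rule (an assumption): first non-empty value wins.
            golden[field] = next((r[field] for r in group if r.get(field)), None)
        master[key] = golden
    return master


# Hypothetical customer records from two systems.
records = [
    {"system": "crm", "email": "Ana@Example.com", "name": "Ana Diaz", "phone": ""},
    {"system": "billing", "email": "ana@example.com ", "name": "A. Diaz", "phone": "555-0100"},
    {"system": "crm", "email": "bo@example.com", "name": "Bo Chen", "phone": "555-0199"},
]

master = build_master(records)
print(master["ana@example.com"])
```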
And then, okay, so now we're in protect.
So, that's another, you know
classification of data management practices,
protecting the data.
And to protect the data,
we wanna do things like, you know, basic security.
So, who can have access to the data?
When should they have access to the data?
What roles should have access
to the data just as a natural course of their job?
So you have, again, a lot of information
about individual customers.
Which roles as a natural part of their responsibilities
in customer service or sales or whatever it is,
which data should they have automatic
and easy access to because of the role that they're in.
But then you have other, you know,
sort of project-based responsibilities.
Or, there's one particular activity
where there's a request to access certain data.
Again, this is where the data steward participates,
but also the data owner,
because we need to establish policies,
not only general policies around who has access
to what data automatically as a part of their job,
but to make specific decisions on a project by project basis
that maybe someone needs temporary access to sensitive data.
And that's where, you know,
the data owner is gonna be very helpful.
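The two kinds of access described above, automatic access by role and temporary access granted for a project, might be modeled as follows. This is a sketch under stated assumptions, not an AWS API: the role names, dataset names, `ROLE_ACCESS`, `TEMP_GRANTS`, and `can_access` are all hypothetical, and the temporary grants stand in for decisions a data owner would approve.

```python
from datetime import date

# Hypothetical general policy: datasets each role can access automatically.
ROLE_ACCESS = {
    "customer_service": {"customer_contact"},
    "sales": {"customer_contact", "customer_orders"},
}

# Hypothetical temporary grants approved by a data owner:
# (user, dataset) -> last day the grant is valid.
TEMP_GRANTS = {
    ("dana", "customer_payment_details"): date(2030, 1, 31),
}


def can_access(user, role, dataset, today):
    """Allow access by role, or by an unexpired temporary grant."""
    if dataset in ROLE_ACCESS.get(role, set()):
        return True
    expiry = TEMP_GRANTS.get((user, dataset))
    return expiry is not None and today <= expiry


print(can_access("dana", "customer_service", "customer_contact", date(2030, 1, 1)))
print(can_access("dana", "customer_service", "customer_payment_details", date(2030, 2, 1)))
```

Keeping the temporary grants separate from the role policy makes it easy to see which access exists only because of a specific, time-limited decision.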
And then compliance, of course.
There's all kinds of regulations around data,
and it's just getting more and more important over time.
So, we need to make sure we're not only complying
to company policy, but also we're being conscious of,
you know, regulations that are looking to protect
the privacy of individuals.
So, we have to be very aware
of the regulations that are in place today
and the regulations that are proposed
and gonna be important going forward.
And then finally, in the protect category
we have the data management practice
of data lifecycle management.
So, this is simply
about being conscious about storing data
and over time how you wanna store that data
in the most cost effective way.
So, for example, you may need to keep information
about employees over years and years.
But, if the data is very old
you may not need instant access to that data.
So, data lifecycle management helps us think through
not only how long do I need to keep data,
but also what mechanism should I use to store
that data so that it's available when I need it,
but we're also cost optimizing that
so that we can keep as much data as we need,
but we're not overspending on that capability
simply because we want to, you know,
keep that data around for a long time
just in case we need it for a specific purpose.
So, that's data lifecycle management.
And then with data lifecycle management,
you wanna think about, you know,
the business initiatives that you're supporting
and what needs to, you know,
what kind of archival requirements are there
for the data that you're working with in that initiative.
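The lifecycle trade-off described above, keeping data available while not overspending on storage, can be sketched as a rule that picks a storage tier from a record's age. The tier names and age thresholds in `TIERS` are hypothetical. Actual retention periods would come from the business initiative and any regulatory requirements.

```python
from datetime import date

# Hypothetical policy: (maximum age in days, storage tier), checked in order.
TIERS = [
    (90, "hot"),           # recent data: instant access
    (365 * 2, "warm"),     # older data: cheaper, slower access
    (365 * 7, "archive"),  # rarely needed: lowest-cost storage
]


def storage_tier(created, today):
    """Pick a storage tier from the age of the data.

    Returns "expired" once the data is past the longest retention period,
    meaning it is a candidate for review or deletion.
    """
    age_days = (today - created).days
    for max_age, tier in TIERS:
        if age_days <= max_age:
            return tier
    return "expired"


today = date(2024, 2, 1)
for created in (date(2024, 1, 1), date(2023, 1, 1), date(2019, 1, 1), date(2010, 1, 1)):
    print(created, storage_tier(created, today))
```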
And really, I wanna reemphasize
all of these data management practices.
You know, the question comes up, where should I start?
And, it's very, very common.
When we run through all of these
data management capabilities.
Geez, you've talked about data quality management
and data integration and security and all of these things.
Hmm, let's start with master data management.
Let's start with data quality management.
And I have to reemphasize
that it's not a matter of which one do you start with.
It's a matter of what is your target?
What is the targeted business initiative?
What is the condition of that data?
So, you may find, for example
in a customer experience initiative
that customer data is fragmented to the extent
that the customer experience
across multiple channels is going to be hindered
because you have separate data stores
and unreconciled data of that customer
across your various channels.
So, it's not that you're gonna do master data management
because it's a good idea.
You're going to do it because it's absolutely essential.
You're not gonna do all of master data management.
You're gonna do master data management
for a segment of customer data that is necessary to succeed.
And you're gonna do it in such a way that it can scale,
so that you can extend one initiative after the other,
more customer data for master data management.
And then maybe take those same capabilities
and extend it into product and vendor,
but not because we wanna create a foundation
for any possible use case,
but because we are prioritizing based on
the targeted use cases and we're building it out.
Remember we talked about that every act
of data governance should do basically two things.
Number one, help you succeed
with a very targeted business initiative
and an application or analytic use case
within that initiative.
And two, it should help you extend your capability overall.
It should help you get better
and better over time with data management
across the board and help you integrate data better
across a variety of initiatives,
help you reuse data better.
So that's important with this.
So, let me talk about what happens
when we are thinking too narrowly.
We're not thinking holistically.
So, I was working with a financial services organization
and you know, it was one of those situations
where they had a data governance program,
but it was very heavily equated with data cataloging.
And they did have a very nice data catalog,
but the problem was that when application teams
or end users went to the data catalog to find
the data that they would need,
they could find data,
but it was not described very well.
It was very, very difficult to thread data together
across domains so as to join data again
from different business units.
So, you have, let's say mortgage and credit card data
and that data needs to be joined together
in some kind of holistic way.
Well, without things like master data management
and data integration, simply a data catalog is not
enough to be able to thread that data together for the end.
So, then the end user
and the application development teams were burdened
with having to thread it together for their projects.
And the worst part of it is that they would have to do it
over and over and over again
for the different projects that required threading
very core data together.
And then there were issues of trust
of the data that they found.
And so, quality wasn't as proactively dealt with.
So, the point I'm trying to make here is
that we need to think holistically,
but that does not mean
that we implement all of them all at once
or even any one of them completely.
We take pieces of each of these data management practices,
and we prioritize them,
and we'll talk about how to do this in a roadmap.
We prioritize them to support the very targeted use cases.
And little by little,
everything begins to fill in over time.
And this happens in real life.
You'll see a significant acceleration of projects,
because the best data management that you can do,
the most effective data management you can do
is the data management you don't have to do at all,
because it's been done before.
And again, we see this when we have
a well coordinated data governance program targeting
specific initiatives.
You build momentum over time.
You build capability and quality over time,
and then the work that you do simply extends
and enhances your data resource
rather than starting over again.
So, we'll talk more
about sort of how to plan that out,
and we'll get into more details
about the responsibilities
and how to distribute those to implement
these data management practices.
And so we'll see you in the next video.
(soft music)