Data Governance Tutorial
Summary
TLDR: This tutorial delves into data governance, highlighting its critical role in organizations amid escalating data and stringent regulations. It distinguishes data governance from data management, emphasizing governance's focus on overarching policies and processes, while management deals with the practical execution. The video outlines the necessity of data governance for ensuring data quality, security, and compliance, and discusses the pivotal roles of data owners, stewards, and champions. It advises starting with a clear scope, documenting data sources, and maintaining data integrity for effective governance. The tutorial also stresses the importance of regular reviews to adapt to the evolving data landscape.
Takeaways
- Data governance is increasingly important due to the exponential growth of data and new regulations around data management.
- It encompasses the rules, processes, and accountability concerning data, aiming for its routine use, harmonization of sources, and controlled access.
- Data governance involves defining data ownership and ensuring data is managed and updated correctly by those responsible.
- The difference between data governance and data management is that governance sets the structure and rules, while management implements these rules in day-to-day operations.
- Good data governance ensures quality data is accessible to the right people efficiently and avoids data redundancy or unauthorized access.
- Key roles in data governance include data owners, data stewards, subject matter experts, and data champions, each with specific responsibilities and areas of expertise.
- The process of implementing data governance begins with identifying who's involved, defining the scope of data to be governed, and documenting available data sources.
- Data mapping is crucial for understanding how different data sources relate and combine to form a complete picture of the data.
- Metadata provides essential information about data, such as format and content, aiding in the understanding and use of data sets.
- Data integrity is a critical aspect of data quality, focusing on maintaining the accuracy, validity, and consistency of data throughout its lifecycle.
- Data governance is not a one-time task but requires periodic review and updating to adapt to changes in data volume, types, and usage patterns.
Q & A
What is data governance?
-Data governance refers to the rules, processes, and accountability around data. It involves ensuring that data is used routinely, sources are harmonized, access is granted to those who need it, and ownership and management of data are clearly defined.
Why is data governance important for businesses?
-Data governance is crucial as it helps maintain the quality and security of data, ensures compliance with regulations, and allows data to be used effectively within an organization. It prevents issues like multiple databases with the same information or unauthorized access to systems.
How does data governance differ from data management?
-Data governance outlines the overall structure, rules, processes, and accountability for data use, while data management is the hands-on implementation of these governance rules. Data management involves the day-to-day tasks of ensuring the governance policies are followed.
What are the roles involved in data governance?
-Roles in data governance include data owners or sponsors, who have decision-making power and are accountable for data accuracy, data stewards who oversee the data on a day-to-day basis, subject matter experts who understand the data content, and data champions who promote good data practices.
What is the role of a data owner in data governance?
-A data owner has ultimate decision-making authority and accountability for the data they oversee. They ensure the data is correct, up-to-date, and that those working under them comply with data governance rules.
Why should an organization start with a specific scope when implementing data governance?
-Focusing on a specific scope when starting with data governance helps prioritize areas that are most critical or have regulatory compliance requirements. This approach increases the likelihood of successful implementation and avoids the slow progress that comes from trying to control all data at once.
How does data mapping play a role in data governance?
-Data mapping helps understand how information in one data source relates to another, creating a more complete picture. It is essential for combining data from various sources and ensuring that the data is used accurately and consistently across the organization.
What is metadata and why is it important in data governance?
-Metadata is information about the data, such as format, content, and field descriptions. It provides a guide to understanding what data fields contain and how they should be interpreted, which is vital for maintaining data quality and consistency.
Can you explain the concept of data integrity in the context of data governance?
-Data integrity refers to the stability, accuracy, validity, and consistency of data throughout its lifecycle. In data governance, maintaining data integrity ensures that data remains reliable and trustworthy as it is accessed, moved, and used within systems.
Why is it necessary to periodically review data governance policies?
-Data governance policies need periodic review because data and its usage are constantly evolving. Regular checks help ensure that policies remain relevant and effective, adapting to changes in data volume, types, and user needs over time.
Outlines
Introduction to Data Governance
Jen introduces the concept of data governance, emphasizing its growing importance due to the exponential growth of data and increased regulatory oversight. She outlines the goals of data governance, which include ensuring data is used routinely, harmonized, and accessible only to authorized individuals. Data governance also involves data ownership and ensuring data is managed and updated correctly. Successful governance considers the who, what, when, where, how, and why of data, controlling security, and ensuring compliance. The tutorial aims to differentiate data governance from data management, with the former focusing on the overarching structure and rules, and the latter on the practical implementation of these rules.
Roles and Responsibilities in Data Governance
This section delves into the roles involved in data governance, particularly highlighting the data owner or sponsor who has decision-making power and is accountable for data accuracy. It discusses how larger organizations may have multiple data owners overseeing different data types, such as manufacturing or sales data. The paragraph also introduces other roles like data stewards, subject matter experts, and data champions, who work closely with the data and are crucial for effective data governance. The importance of a data governance committee in larger organizations is also mentioned, which is responsible for making decisions, resolving conflicts, and standardizing data usage across the organization.
Getting Started with Data Governance
The paragraph discusses the practical steps for organizations to begin implementing data governance. It suggests starting with a focus on who is involved and what data needs to be governed. It advises against trying to control all data from the start, recommending instead to focus on top priorities, such as areas tied to regulatory compliance. The paragraph also stresses the importance of documenting available data sources, understanding how data is currently used, and involving those who are knowledgeable about the data in the governance process to avoid future complications.
Data Mapping and Metadata
This section explains the importance of data mapping, which shows how information from different sources relates and combines to form a complete data set. It uses sales data as an example, illustrating how order information, customer data, and inventory data can be mapped together based on common identifiers. The paragraph also introduces metadata, which provides information about the data, such as format and content, and is crucial for understanding and maintaining data quality.
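As a minimal sketch of the mapping described above, the following Python combines three small in-memory tables on shared identifiers. The sample records and the field names (`customer_id`, `part_number`) are hypothetical, chosen only to mirror the sales-data example; they are not taken from the video.

```python
# Hypothetical sample data: three separate sources that together form "sales data".
customers = {
    "C001": {"name": "Acme Corp"},
    "C002": {"name": "Jane Doe"},
}
inventory = {
    "P1": {"description": "T-shirt, blue, size M", "in_stock": 12},
    "P2": {"description": "T-shirt, red, size L", "in_stock": 0},
}
orders = [
    {"order_id": 1, "customer_id": "C001", "part_number": "P1", "amount": 40.0},
    {"order_id": 2, "customer_id": "C002", "part_number": "P2", "amount": 20.0},
    {"order_id": 3, "customer_id": "C001", "part_number": "P2", "amount": 20.0},
]


def map_orders(orders, customers, inventory):
    """Combine each order with its customer and inventory records.

    Orders map directly to customers (via customer_id) and to inventory
    (via part_number). Unmatched identifiers yield None so gaps stay visible.
    """
    combined = []
    for order in orders:
        combined.append({
            **order,
            "customer": customers.get(order["customer_id"]),
            "item": inventory.get(order["part_number"]),
        })
    return combined


def parts_for_customer(customer_id, orders):
    """Indirect mapping: customer -> order -> part number."""
    return sorted({o["part_number"] for o in orders if o["customer_id"] == customer_id})


sales = map_orders(orders, customers, inventory)
acme_parts = parts_for_customer("C001", orders)
```

Here customers and inventory have no shared field, so `parts_for_customer` has to go through the orders to relate them, which is the indirect mapping the tutorial mentions.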
Data Quality and Integrity
The focus here is on data integrity as a subtopic of data quality, which concerns the stability and reliability of data throughout its lifecycle. It discusses the importance of maintaining data accuracy, validity, and consistency to prevent data from becoming corrupted or inaccurate as it's used and moved within systems. The paragraph also touches on data scraping as a method to create structure where it may be lacking and to ensure that data used for critical decisions, like warranty claims, is accurate and relevant.
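One way to check that data stays accurate and consistent as it moves between systems is to compare the source records with the loaded records. The sketch below is only an illustration under assumed names: the `claim_id` key, the sample rows, and the report layout are all hypothetical.

```python
import hashlib
import json


def fingerprint(record):
    """Return a stable SHA-256 hash of a record (keys sorted for determinism)."""
    payload = json.dumps(record, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def integrity_report(source, destination, key):
    """Compare two lists of dict records, matched on the field named by `key`."""
    src = {r[key]: fingerprint(r) for r in source}
    dst = {r[key]: fingerprint(r) for r in destination}
    return {
        # Present in the source but never arrived in the destination.
        "missing": sorted(src.keys() - dst.keys()),
        # Present in the destination with no matching source record.
        "unexpected": sorted(dst.keys() - src.keys()),
        # Present in both, but the content differs.
        "changed": sorted(k for k in src.keys() & dst.keys() if src[k] != dst[k]),
    }


# Hypothetical sample data for illustration only.
source_rows = [
    {"claim_id": 1, "mileage": 42000},
    {"claim_id": 2, "mileage": 58500},
    {"claim_id": 3, "mileage": 12000},
]
loaded_rows = [
    {"claim_id": 1, "mileage": 42000},
    {"claim_id": 2, "mileage": 85500},  # digits transposed during the load
]

report = integrity_report(source_rows, loaded_rows, key="claim_id")
```

Running a comparison like this after every automated load is one way to catch mistakes before a user stumbles into them.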
Continuous Data Governance
The final paragraph emphasizes that data governance is not a one-time task but requires ongoing attention and periodic review. It stresses the need for policies and guidelines to evolve with changing data and business practices. The tutorial concludes with a call to action for viewers to apply the principles discussed, and it invites feedback and sharing of the tutorial with others who might benefit from the information.
Keywords
Data Governance
Data Management
Data Ownership
Data Stewards
Data Quality
Data Integrity
Data Sources
Data Mapping
Metadata
Data Scraping
Highlights
Data governance is crucial for every business due to the exponential growth of data and increased regulations.
Data governance encompasses the rules, processes, and accountability surrounding data usage.
The goal of data governance includes harmonizing data sources, controlling access, and ensuring data ownership and management.
Data governance is about more than just rules; it's also concerned with making data useful to the organization.
Data management is the implementation of data governance rules, focusing on the day-to-day work of data.
Good data governance ensures quality data is accessible to the right people in an efficient manner.
Data governance involves multiple roles, including data owners, stewards, subject matter experts, and champions.
Data owners have the ultimate decision-making power and accountability for data within an organization.
Data stewards and subject matter experts play a crucial role in understanding and guiding data usage.
A data governance committee is responsible for making decisions and resolving conflicts regarding data usage.
When starting with data governance, it's important to define the scope and prioritize areas like regulatory compliance.
Documenting data sources, understanding data usage, and mapping data are essential steps in data governance.
Data mapping shows how information from different sources relates and combines to form a complete data set.
Metadata provides information about the data, such as format and content, aiding in understanding and usage.
Data scraping is used to capture and relate information when direct mapping between data sources is not possible.
Data integrity is a key aspect of data quality, ensuring data remains accurate and consistent throughout its lifecycle.
Data governance policies should be regularly reviewed and updated to keep up with the evolving data landscape.
Transcripts
in the 15 years i've been working in
analytics i've seen a
growing focus on data governance in this
data governance tutorial i'll go over
what data governance is
why it's important for pretty much every
single business or organization
and what sets apart good data governance
from
poor or even non-existent data
governance at other companies
hi i'm jen i help people learn about
analytics skills and careers
check this video description for
additional resources
[Music]
as the amount of available data has
grown exponentially over the last decade
and more regulation has been put in
place
around data and information management
many organizations have started to think
more about data governance
so what exactly is this data governance
data governance is the rules
processes and accountability around data
there are multiple goals of data
governance you want the organization to
use data in a routine way for sources to
be harmonized
for the people that have access to it to
be people that
need to have access to the data and for
people that shouldn't have access
to not have access it also means
ownership
of the data who's responsible for it
being right
who's responsible for it being managed
and updated correctly
successful data governance considers the
who
what when where how and why of the data
that it's governing
while controlling the security of the
data and ensuring compliance
among many other things data governance
should also be
concerned with how can this data be made
useful to the organization
how can we do more than just have a
giant storage location for information
you may have also heard about data
management what's the difference between
data management and data governance
the main difference is data governance
outlines
the overall structure that should exist
it has the rules it has the processes
the accountability
it's more about what should happen how
should things happen
and data management is more about
implementing all of those rules it's the
hands-on
everyday work to ensure that the
governance
that's been put in place is being
followed so it's the IT
teams executing on it it's the
day-to-day
management of information and access
requirements and
whatnot that people may need when it
comes to the data
if you want to know more about data
management i'll link to a video
on my second channel for avant analytics
my consulting company
now let's talk about why data governance
matters
i mentioned that data governance isn't
just about the rules it's also about the
use of the data and making
it useful really good data governance
implementation means that quality data
is accessible to the right people and
only the right people
in an efficient way throughout the
organization
it means not having multiple databases
that have the same information
or access to systems for certain people
that shouldn't have them
it's making sure that these are in place
so that there's a consistent
understanding of
who has access why they have access and
what they're doing with that information
that exists
let's talk about getting started with
data governance what does that
actually look like practically for an
organization that's implementing it or
focusing more on it
when it comes to data governance one of
the first things that you want to think
about
is who's involved typically there are
multiple roles for a very structured
larger organization
that is implementing data governance or
has been working on it for a while
sometimes in smaller organizations or
ones that are new to data governance you
might see these roles overlap
let's talk about what each of those
roles are first though the first role is
the data
owner or this data sponsor these are
people that have ultimate
decision-making ability about the data
and have ultimate accountability for
that data being correct and up-to-date
typically these people are going to be
higher up within the organization
and have the ability to order or
ensure that those working beneath them
are complying with
what is outlined in the rules that are
defined as part of data governance
there are typically many different data
owners and sponsors when it comes to
data governance
often overseeing specific types of data
so in an organization you might have a
manufacturing data owner data sponsor
who's responsible for
maintaining everything related to a
product that's being manufactured
you may also have a sales data owner
or even maybe a segment of sales you
might have a commercial
and a residential data owner depending
on the different applications that your
company is working with
they're responsible for owning providing
and following whatever guidelines are
outlined for their set of data
you can still have multiple data owners
even for companies or organizations that
aren't dealing with physical products
for instance in a local government
organization you
may have one person that's responsible
for voter registration
data and one that's responsible for
real estate tax data and one that's
responsible for
home ownership data for the locale
so regardless of what scale what type of
product or service
that your organization has or is
providing
you can still have multiple different
data owners
if you're in a super small company if
you've got a dozen people maybe there is
just one person but even then sometimes
you have multiple
data owners if you have people
responsible for different
parts of the business in addition to
data owners
in larger organizations you'll also have
data stewards
subject matter experts and data
champions these are people that
are working more regularly with the data
that really understand the content of
the data
they should ideally be consulted in any
data governance project
because they understand the more on the
ground
work that's being done with this data
they understand the different
uses that people have for it why certain
people may need access
that an executive may not see an
immediate answer for
any time that these people are left out
of the data governance process
there usually end up being a lot of
headaches and hoops to jump through
in the practical application of using
that data
there's still room to question here
whether just because someone
had access in the past do they still
need to have access
but incorporating involving these people
that have more knowledge
can really help improve that process and
make sure that you're not backpedaling
and having to redo a lot of work later
on after you roll out these new rules
larger organizations also typically will
have a data governance committee
this group is ultimately responsible for
all the decisions that are made
if there's conflicts between different
groups they can help resolve them
if there's decisions that need to be
made or implementations that maybe need
to be made to standardize
how the data is used or stored or
accessed across the organization
this committee can act as a central
resource to make sure that
the data of different types isn't
implemented in a lot of different ways
across each different area so maybe
instead of having
one sales database with one
access point to
that and then a separate application
that deals with client information or
production information
the data governance committee can look
at it and say how can we
integrate these better how can we have
one location how can we make it
similar regardless of the type of data
you're looking for
which can usually lead to a more
streamlined process
overall and make it much simpler to
implement changes rather than dealing
with maybe a dozen different
types of systems to access data you
consolidate to
much fewer which can pool all of the
resources
into one location and make it easier to
teach people
how to access the data that they need to
access
even if it's contained within one system
this doesn't mean
automatically everybody has access the
nice thing with the actual
implementation is you can still have
different roles that allow
access to different pieces of data but
having that centralized location can
make it a lot easier
for individuals or groups within the
organization
that need to use data from multiple
different sources to complete their work
in addition to establishing who's
involved one of the very first steps
that you should take concerning data
governance
is to think about the scope of the data
that you want to govern
it's really tempting to say we want to
control it all
but the reality is unless you're a very
very very small company
it's usually not practical to try to
control
everything from the start instead
think about what your top priorities are
so
an easy solution for this is if you have
areas of data management that tie
to government or regulatory compliance
this is a great place to start with your
data governance because it's not just
about your company
or your organization it's about are you
meeting the requirements
of the law so focus on that area and
then as you have
pieces in place you can expand further
and further
but anytime that you try to take
everything within the scope of work that
you're doing
you are much more likely to fail it's
much more likely to take a lot longer
to make the same type of progress
because you're trying to take care of
everything at once
instead of one piece at a time an easy
comparison is
think if you tripped and fell down the
stairs if you
cut yourself and were bleeding profusely
and you had a broken arm and you hit
your head
you ideally yeah you would fix
everything at once
but the company that is going through
this with their data
you fix the thing that's going to hurt
you the most so if you fell down the
stairs and you're bleeding
you stop the bleeding that is the most
immediate pressing concern that doesn't
mean you
ignore the broken bone or the potential
head injury
but you take them in order of what is
the most serious that i deal with first
the same is true anytime you're working
with data what's the most immediate need
what's going to have the most immediate
consequence
negative consequence if i don't do
something about it
and then once that's taken care of you
can move on to the next thing
if you're not dealing with compliance
issues or something that's otherwise
urgent you still need to set some sort
of scope
in this case you can just pick an area
that may have
a lot of advantage to working with or
just pick an area
sometimes people get too hung up on
making sure they pick the right area
that they don't just take action
so if you're not sure what to do pick
something say that you want to
work on client data as the first step of
governance
and then you can move on to the next
step or pick manufacturing data whatever
you do
don't let it stop you from doing
something this can also sometimes inform
who is involved in the data governance
process up front if you're just getting
started
and you have people that you know are
eager and want to be involved
that can be a guideline for what you
pick to work on
or if you pick what to work on that
could inform who should be involved in
that process
now that you have the who and the what
it's time to move into more detail
document what data you have available
these are your data sources
what information is in this data where
does it come from
do you have multiple sources that are
providing the same information
who owns the data who's an expert in it
how often is it updated who checks to
make sure it's updated correctly
who accesses it and what do they use it
for when they do access it
answering these types of questions can
really help you make a more informed
decision on what rules
processes and accountability that you
put in place regarding
a specific type of data before you jump
right into making rules it's important
to understand
how people are currently using the
information
and why they're using it in that way
otherwise
again you end up with a poor
implementation
you end up making more work for people
that
still have to get their job done but now
they have someone who doesn't have any
idea what they're doing
making decisions about what they can and
can't have access to and how they're
going to access it
this doesn't mean you're not going to
bother people by the decisions you make
for data governance
there are going to be people that are
unhappy with the decisions you've made
but they tend to be a lot more receptive
to change and the organization is
typically a lot more receptive overall
when you've at least taken the time to
listen to them
account for their concerns factor that
into the decision
you're making and at least make an
informed decision
even if you know that makes things more
challenging for some individuals or some
teams or departments it's easy for people
to think about how
they think the data should be used it's
a completely different story to know how
it
actually is being used it's rare that
there
aren't some surprises along the way of
how people are using
information sometimes because they can't
access what they really need and so
they're substituting
and making adjustments to existing
information
to be able to do the work that they need
to get done you may also find that as
you start exploring the data that's
available that there are multiple
sources for the same
data in this case you have a decision to
make
do you still retain the information from
multiple sources
which is your primary authoritative
source if there's a conflict
for instance a simple example of this is
in the automotive industry
if someone files a warranty claim for
their vehicle
there are multiple ways that the company
can get information about that
they can get mileage information based
on what's manually submitted on the
warranty claim how much the dealer or
the customer reports
in terms of mileage there was on the
vehicle at the time that the repair was
scheduled
however with newer vehicles that have a
lot of remote technology
they can also read this information off
the control units on the vehicle
so if there's a conflict there if
there's a difference between the mileage
that the dealer says and what the
vehicle says
unless there's a known issue with the
vehicle where the mileage would be
reported wrong
typically you're going to want to
prioritize what the vehicle says what
the control units say
automatically because there's usually
less room for error there
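The priority rule just described can be sketched in a few lines of Python. This is an illustration only: the source names (`control_unit`, `warranty_claim_form`) and the `untrusted` parameter are hypothetical, not something named in the video.

```python
# Hypothetical source names, listed from most to least authoritative.
SOURCE_PRIORITY = ["control_unit", "warranty_claim_form"]


def resolve_mileage(readings, untrusted=()):
    """Pick one mileage value from several sources.

    readings:  dict mapping source name -> reported mileage (or None if absent)
    untrusted: sources to skip, e.g. a control unit with a known reporting issue
    Returns (value, source), or (None, None) if no usable reading exists.
    """
    for source in SOURCE_PRIORITY:
        value = readings.get(source)
        if source not in untrusted and value is not None:
            return value, source
    return None, None


readings = {"control_unit": 48210, "warranty_claim_form": 45000}
preferred = resolve_mileage(readings)
fallback = resolve_mileage(readings, untrusted={"control_unit"})
```

Returning the source alongside the value keeps a record of which authority was used for each decision.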
you can run into this with all sorts of
data where you'll have these conflicts
even if you don't see immediate
conflicts it's still good to set a
priority of what is your main source
what is going to be the authority when
it comes to
the accuracy of that data as you look
into the available data
in most areas you're going to find that
the data being used
doesn't just come from one singular
location it's usually made up of a
variety of different sources
for instance let's take sales data sales
data might sound like it's one
complete isolated thing however most
sales data consists of client
data information about who made the
purchase
it consists of actual sales data like
what was the sales date what was the
sale amount what was the exact order
that was placed and it often consists
of some sort of inventory or production
information
this isn't quite as typical for
something off the shelf
where client information isn't reported
but if you're working for an
organization that provides a service or
provides any sort of customized product
or even products that offer multiple
variants
this production information probably has
information on
if somebody orders a shirt for instance
what color did they order what size did
they order
was there inventory to fulfill it so all
of these different pieces
are in themselves individual data sets
that are brought together to form the
one complete
set of data the one source of
information that we think of the sales
data to combine all of these we get into
data mapping data mapping tells us how
the data the information in one of our
sources
relates or maps to data in another
source
to combine to give a more complete
picture so in the example of that sales
data
we would have the order information
linked to customer information
probably based on a customer name or a
customer id
so if you look in the customer file you
have an id
or you have a name there that is
identical
and unique for an individual or a
company that
is doing the purchase and then in the
order you have that same unique
identifier that same unique name that
same unique number
so that when you match up you look for
the same thing in one
and two and that's how you tell the
systems to combine this information how
to map this information
same thing with the order and maybe
inventory data where
you have the part number that was
ordered the service that was ordered
and then in your inventory or your
service information
you have probably more detail so you
have
part number one in your order you have
part number one
in your inventory database and in your
inventory you talk more about the
details of that
and so you combine those then we have
cases where the mapping isn't direct
so to map customer to inventory there's
no direct mapping there's no
direct single relationship they only
relate or map
together because of that order
information
that's a fairly simple step to get there
sometimes it can be more complicated
sometimes there can be two
three or more steps in between to
connect these different pieces together
make them relevant make them relate to
each other another piece of information
that you'll typically have or should
have about your data is
metadata think of this as information
about the data
what type of format should it take by
default
what type of information is contained
within it so for instance let's talk
about order information
metadata would describe every field that
exists in that set of data
so if we have our date our metadata
would tell us
what format that it's in let's say it's
a
a month day year format then
what does it contain a short description
so
date of order from customer or date
order received
it gives almost a dictionary of sorts
and
sometimes you'll see it called a data
dictionary which describes
the information that's contained within
that data set metadata and a data
dictionary
aren't always exactly the same but in
general they're giving you more
general information about what's
contained there so that everyone can
understand what content should exist
there
and does exist within those types of
data fields
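A data dictionary like the one described can be written down as a simple structure and used to check records. The sketch below is hypothetical throughout: the field names, the month/day/year format string, and the `validate` helper are assumptions made for illustration.

```python
from datetime import datetime

# Hypothetical data dictionary for an order data set: one entry per field.
ORDER_METADATA = {
    "order_date": {
        "type": str,
        "format": "%m/%d/%Y",  # month/day/year
        "description": "Date order received from customer",
    },
    "amount": {
        "type": float,
        "format": None,
        "description": "Total sale amount",
    },
}


def validate(record, metadata):
    """Return a list of problems found when checking a record against metadata."""
    problems = []
    for field, spec in metadata.items():
        if field not in record:
            problems.append(f"{field}: missing")
            continue
        value = record[field]
        if not isinstance(value, spec["type"]):
            problems.append(f"{field}: expected {spec['type'].__name__}")
            continue
        if spec["format"]:
            try:
                # strptime raises ValueError when the text does not fit the format.
                datetime.strptime(value, spec["format"])
            except ValueError:
                problems.append(f"{field}: does not match {spec['format']}")
    return problems


good = validate({"order_date": "03/14/2024", "amount": 19.99}, ORDER_METADATA)
bad = validate({"order_date": "2024-03-14"}, ORDER_METADATA)
```

The descriptions serve human readers, while the type and format entries let a program confirm that a field contains what the dictionary says it should.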
ideally different tables of data or
different data sets will have
clear mapping of how they relate even if
they have to go through one or
two other tables in the intermediate to
connect
one to the other however that's not
always
reality and when that's not the reality
sometimes we need to use data scraping
to be able to
capture the right information for
instance
maybe we need to know the mileage of
that vehicle
when there was a warranty claim but what
if the warranty database doesn't have
mileage in that case maybe we ask that
someone
puts that information in a text box when
they submit the claim
data scraping is going and finding that
and automatically pulling it out
to see how it relates so how might it
relate on a warranty claim
well most vehicles have a fixed mileage
or
age limitation on the warranty so if
this process has to be done through data
scraping
we'll scrape out what the mileage is and
then
map that to the warranty coverage to see
is that mileage within the limits is the
age within the limits
does it qualify to be covered or does it
not meet the criteria is it too old does
it have too many miles those sorts of
things
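As a rough sketch of that warranty example, the Python below pulls a mileage figure out of a free-text note and checks it against coverage limits. The limits, the regular expression, and the sample note are all assumptions made for illustration; real claim text would likely need more patterns.

```python
import re

# Hypothetical warranty limits; real limits vary by manufacturer and contract.
MAX_MILES = 36_000
MAX_AGE_YEARS = 3

# Matches e.g. "41,250 miles", "41250 mi", "41250 miles".
MILEAGE_PATTERN = re.compile(r"(\d{1,3}(?:,\d{3})+|\d+)\s*(?:miles|mi)\b", re.IGNORECASE)


def scrape_mileage(text):
    """Extract the first mileage figure from free text, or None if absent."""
    match = MILEAGE_PATTERN.search(text)
    if match is None:
        return None
    return int(match.group(1).replace(",", ""))


def is_covered(mileage, age_years):
    """Check a claim against the assumed mileage and age limits."""
    return mileage is not None and mileage <= MAX_MILES and age_years <= MAX_AGE_YEARS


note = "Customer reports grinding noise. Odometer at 41,250 miles at drop-off."
mileage = scrape_mileage(note)
covered = is_covered(mileage, age_years=2)
```

A claim with no recoverable mileage is treated as not covered here, which is a design choice; routing such claims to manual review would be an equally reasonable rule.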
so that's a simple example of data
scraping sometimes it can get more
complicated
in general think about it as a way to
try to create structure where
not much structure exists i mentioned
data quality
which has a lot of different subtopics
that could
easily be hours on their own i'm not
going to get into all of those today
but there is one area that i do want to
talk a bit about data integrity
is a subtopic within the data quality
area
And data integrity is not just what's the overall quality of the data, but it's how stable is our data? How routine is it? Can we always trust it? How is it updated? How do we know that it's not corrupted? Think of data integrity as how well the accuracy, validity, and consistency of data are maintained across its life cycle. That is, from the moment that we first collect that data, does it remain the same? Does it remain consistent? Does it remain true? Does it continue to be accurate as we move it around within our systems, as we put it into different tools, as different people start to use it? Do we maintain that integrity of information, so we don't essentially end up with the telephone game that you might have played in school, where you whisper something in one person's ear at the start, and by the time you've gone through 15 different people, the message out the other end is very different than the message at the start?
The same thing can happen with data for a variety of reasons.
It could be system problems that create this challenge. It can also be multiple people being involved who don't understand the context. They don't understand where, why, or how the data was collected, and they make assumptions. A lot of times there are assumptions at every step, and by the time you get to the end, the result is not really representative of what you started with.
This doesn't always have to be the case, and as long as you're aware of it, it's something that you can put more things in place to check. You could have someone who checks the source data and the end result data to see: do they match? Do they properly convey the right information? Are they still accurate? Are they still valid? Really checking to make sure that you don't end up with a completely different picture than what you started with.
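That source-versus-end-result check can also be done by a small script rather than a person. The sketch below is one possible approach, not a prescribed method. It assumes both data sets are lists of records sharing a key field, and the names `fingerprint`, `reconcile`, and `claim_id` are hypothetical.

```python
import hashlib
from typing import Dict, List, Mapping, Sequence

Record = Mapping[str, object]


def fingerprint(record: Record, fields: Sequence[str]) -> str:
    """Hash the chosen fields so two versions of a record are cheap to compare."""
    canonical = "|".join(f"{name}={record.get(name)!r}" for name in fields)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def reconcile(
    source: Sequence[Record],
    result: Sequence[Record],
    key: str,
    fields: Sequence[str],
) -> Dict[str, List[object]]:
    """Report records that were lost, added, or altered between source and result."""
    src = {r[key]: fingerprint(r, fields) for r in source}
    res = {r[key]: fingerprint(r, fields) for r in result}
    return {
        "missing_in_result": sorted(k for k in src if k not in res),
        "unexpected_in_result": sorted(k for k in res if k not in src),
        "changed": sorted(k for k in src if k in res and src[k] != res[k]),
    }


if __name__ == "__main__":
    source = [
        {"claim_id": "C1", "mileage": 41250},
        {"claim_id": "C2", "mileage": 18200},
        {"claim_id": "C3", "mileage": 500},
    ]
    end_result = [
        {"claim_id": "C1", "mileage": 41250},
        {"claim_id": "C2", "mileage": 1820},  # a digit was lost along the way
        {"claim_id": "C4", "mileage": 7},
    ]
    print(reconcile(source, end_result, key="claim_id", fields=["mileage"]))
```

A check like this only shows that the values still match. Whether they still convey the right information in context remains a question for the people who understand how the data was collected.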
Maintaining data integrity and data quality throughout the life cycle isn't just a one-off thing. You don't do it and then it's done. It's also about having checks in place to make sure everything continually functions as expected.
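One way to picture such an ongoing check is a small validation routine that runs after every automated load and reports problems before anyone stumbles into them. This is only a sketch under assumed conditions: rows arrive as dictionaries, and the function name `check_load`, the field names, and the thresholds are all hypothetical.

```python
from typing import Dict, List, Optional, Sequence

Row = Dict[str, Optional[object]]


def check_load(
    rows: Sequence[Row],
    key: str,
    required: Sequence[str],
    min_rows: int,
) -> List[str]:
    """Return a list of problems found in one load; an empty list means it looks healthy."""
    problems: List[str] = []

    # A load that is much smaller than usual often means an upstream failure.
    if len(rows) < min_rows:
        problems.append(
            f"row count {len(rows)} is below the expected minimum {min_rows}"
        )

    seen = set()
    for index, row in enumerate(rows):
        # Required fields must be present and non-empty.
        for field in required:
            if row.get(field) in (None, ""):
                problems.append(f"row {index}: required field '{field}' is empty")
        # The same key appearing twice suggests the load ran or joined incorrectly.
        row_key = row.get(key)
        if row_key in seen:
            problems.append(f"row {index}: duplicate key {row_key!r}")
        seen.add(row_key)

    return problems


if __name__ == "__main__":
    todays_rows = [
        {"claim_id": "C1", "mileage": 41250},
        {"claim_id": "C1", "mileage": None},
    ]
    for problem in check_load(todays_rows, "claim_id", ["claim_id", "mileage"], min_rows=3):
        print("ALERT:", problem)
```

In a real pipeline the returned messages would feed whatever alerting the organization already uses; printing them here just keeps the sketch self-contained.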
If you have data automatically pulled into a system every day, what checks do you have in place to make sure it all happened accurately? How do you catch when mistakes were made, so that somebody doesn't need to stumble into a problem with it and raise an alert? How do you automate some of that to help ensure that that integrity, that quality, is maintained all along on an ongoing basis?
Ultimately, all of this work of data governance should lead to sets of rules, processes, and policies that are applied across the business to make sure that you have good data being used in a good way throughout the organization. It's the right people accessing the right data at the right time with the right amount of accountability. This work should inform business policies as well as data management.
As with data integrity and data quality, data governance in general is not a do-it-once-and-forget-about-it-forever sort of thing. It constantly needs to be rechecked. As fast as information is growing (it's exponentially growing every year), you may be getting more information tomorrow than you were getting yesterday, and you may have different people doing different things with it than they were in the past, so it's important to keep up with that.
If you set up data governance policies now, even if they're perfect, chances are very high that two years from now they're not going to be perfect. Something is going to have changed, so be aware of that. That doesn't mean making changes every day, every week, every month, but it does mean that you have some periodic schedule on which you come back and review and check and make sure that your policies, your rules, your guidelines are really keeping up with the information that's there, rather than being really reactionary.
It's already a little reactionary to only follow up once a year, or whatever frequency it is, but at least then you only have a small thing you need to react to, instead of realizing, five years from now or two years from now, that none of the rules that you put in place are being followed because they're no longer relevant. They no longer apply to the information that's available and how people really need to use that data to effectively run the business, to effectively do their jobs.
I hope you enjoyed this data governance tutorial. If you did enjoy it, please consider giving it a thumbs up and sharing it with someone that you think may benefit from it. Thank you so much for watching.