Data Lineage with Unity Catalog
Summary
TL;DR: In this video, Pearl, a technical marketing engineer at Databricks, demonstrates how Unity Catalog captures real-time data lineage across all data objects. The lineage helps data teams perform root cause analysis, reducing debugging time and ensuring compliance. Pearl walks through a use case involving a loan dataset, showing how lineage is tracked from notebooks to dashboards, and across various data sources. The interactive lineage graph and detailed tracing capabilities are highlighted as key features for maintaining data quality and streamlining auditing processes.
Takeaways
- 🌟 Unity Catalog by Databricks captures real-time data lineage across data objects, aiding in debugging and compliance.
- 🔍 Data lineage helps in root cause analysis of errors in data pipelines, applications, dashboards, and machine learning models, saving significant time and effort.
- 📈 It provides end-to-end visibility into data flows and consumption within an organization, enhancing operational efficiency.
- 🛠️ Users can auto-capture runtime data lineage on Databricks clusters or SQL warehouses, tracking down to table and column levels.
- 🔑 Unity Catalog integrates with common permission models, facilitating secure and controlled access to data lineage information.
- 📚 The script demonstrates capturing lineage from a notebook, showcasing the process of data analysis and model building with the Lending Club dataset.
- 📊 A dashboard is created to display the status of bad and good loans, along with loan status distribution, providing visual insights into the data.
- 📈 The lineage graph in Unity Catalog is interactive, showing the flow from the volume to the model, all originating from a single notebook.
- 🔄 Lineage can be viewed at the column level, highlighting relationships between specific data points, such as 'bad loan' status.
- 🔗 Unity Catalog supports lineage across federated data sources, ensuring data from trusted and quality sources is maintained.
- 🛑 In case of workflow failures, data traceability is available for analyzing pipeline errors, impacting upstream and downstream processes.
Q & A
What is the purpose of Unity Catalog in the context of the video?
-Unity Catalog is used to automatically capture real-time data lineage across all data objects, which helps data teams perform root cause analysis of errors in their data pipelines, applications, dashboards, and machine learning models.
How does data lineage assist in reducing debugging time?
-Data lineage helps by tracing errors to their source, which significantly reduces the time spent on manual debugging, saving days or even months of effort.
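The "trace the error to its source" idea can be sketched as a simple upstream walk over a lineage graph. The sketch below is purely illustrative (it is not a Databricks API): the table names loosely mirror the video's loan example, and the graph is hand-written rather than captured automatically.

```python
from collections import deque

# Toy upstream-lineage map: each asset points to the assets it reads from.
# Names are hypothetical, loosely mirroring the video's loan example.
UPSTREAM = {
    "loan_dashboard": ["loan_features"],
    "loan_model": ["loan_features"],
    "loan_features": ["loan_data"],
    "loan_data": ["demo_volume"],
    "demo_volume": [],
}

def trace_to_sources(asset):
    """Walk the lineage graph upstream (BFS) to list every ancestor of `asset`."""
    ancestors = []
    visited = {asset}
    queue = deque([asset])
    while queue:
        current = queue.popleft()
        for parent in UPSTREAM.get(current, []):
            if parent not in visited:
                visited.add(parent)
                ancestors.append(parent)
                queue.append(parent)
    return ancestors

# If the dashboard shows bad numbers, lineage narrows the suspects to its ancestors:
print(trace_to_sources("loan_dashboard"))  # ['loan_features', 'loan_data', 'demo_volume']
```

Instead of inspecting every pipeline by hand, the debugging surface shrinks to the handful of upstream assets that actually feed the broken one.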
What are the compliance and audit-related benefits of using data lineage?
-Data lineage helps organizations to be compliant and audit-ready by alleviating the operational overhead of manually creating trails of data flows for auditing and reporting purposes.
How does Unity Catalog provide visibility into data flows?
-Unity Catalog provides end-to-end visibility by offering data lineage that shows how data flows and is consumed within an organization.
What can customers autocapture using Unity Catalog on a Databricks cluster or SQL warehouse?
-Customers can autocapture runtime data lineage on a Databricks cluster or SQL warehouse and track lineage down to the table and column level.
What is the significance of the lineage graph in Unity Catalog?
-The lineage graph is an interactive tool that showcases the lineage from the volume to the data table, feature table, and models, providing a visual representation of the data flow.
How does Unity Catalog capture lineage at the column level?
-Unity Catalog allows users to view the relationship between specific columns, such as the 'bad loan' column in the loan features table and the 'loan status' column in the loan data table.
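Column-level lineage is essentially a mapping from each downstream column to the upstream columns it was derived from. The following is a minimal, hand-written model of that idea, using the video's bad loan / loan status relationship; the data structure and names are illustrative, not Unity Catalog's actual representation.

```python
# Illustrative column-level lineage: (table, column) -> upstream (table, column)
# pairs it was derived from. Names mirror the video's example, not a real API.
COLUMN_LINEAGE = {
    ("loan_features", "bad_loan"): [("loan_data", "loan_status")],
    ("loan_features", "loan_amnt"): [("loan_data", "loan_amnt")],
}

def derived_from(table, column):
    """Return the upstream (table, column) pairs that feed the given column."""
    return COLUMN_LINEAGE.get((table, column), [])

print(derived_from("loan_features", "bad_loan"))  # [('loan_data', 'loan_status')]
```

This granularity is what lets a team ask not just "which table fed this one?" but "which exact source column produced this feature?".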
What is the role of Unity Catalog in ensuring data quality from federated data sources?
-Unity Catalog captures lineage across federated data sources, Delta tables, and views, ensuring that data comes from trusted and quality sources.
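The "trusted sources" check amounts to following lineage upstream until you hit assets with no parents, then verifying those roots are on an approved list. Here is a small sketch under that assumption; the asset names (a Delta table built from a mirrored PostgreSQL table) are invented for illustration.

```python
# Sketch: lineage edges spanning a federated Postgres source and the lakehouse.
# Asset names are hypothetical.
EDGES = {
    "delta.orders_clean": ["postgres.public.orders"],  # Delta table from a mirrored table
    "postgres.public.orders": [],                      # federated source (a root)
}

TRUSTED_SOURCES = {"postgres.public.orders"}

def roots(asset):
    """Follow lineage upstream to the origin assets (those with no parents)."""
    parents = EDGES.get(asset, [])
    if not parents:
        return {asset}
    found = set()
    for parent in parents:
        found |= roots(parent)
    return found

# Confirm the Delta table ultimately comes only from trusted federated sources:
print(roots("delta.orders_clean") <= TRUSTED_SOURCES)  # True
```

Because lineage spans the federated source, the Delta table, and any views on top, this check can cover the whole chain rather than stopping at the lakehouse boundary.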
How does Unity Catalog assist with workflow data lineage?
-Unity Catalog captures data lineage for any workflow that reads or writes to Unity Catalog, providing traceability to analyze errors in the pipeline that may impact upstream or downstream processes.
What is the advantage of having data traceability in the event of a workflow failure?
-Data traceability allows data teams to automatically track sensitive data for compliance requirements, ensure data quality across all workloads, and conduct root cause analysis of errors in data pipelines.
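The failure scenario in the video (a bronze-to-silver-to-gold workflow) can be modeled as a downstream "blast radius" query: given a failed table, list everything written from it, directly or transitively. This is a conceptual sketch, with table names taken from the video's example.

```python
# Toy workflow lineage from the video's bronze -> silver -> gold example:
# each table maps to the assets written directly from it.
DOWNSTREAM = {
    "line_item_bronze": ["line_item_silver"],
    "orders_bronze": ["order_silver"],
    "line_item_silver": ["order_status_gold"],
    "order_silver": ["order_status_gold", "alert"],
    "order_status_gold": ["dashboard"],
}

def impacted_by(failed_table):
    """Everything downstream of a failed table, i.e. its blast radius."""
    impacted, stack = set(), [failed_table]
    while stack:
        for child in DOWNSTREAM.get(stack.pop(), []):
            if child not in impacted:
                impacted.add(child)
                stack.append(child)
    return sorted(impacted)

# If the silver step fails, lineage shows which downstream assets are now stale:
print(impacted_by("order_silver"))  # ['alert', 'dashboard', 'order_status_gold']
```

That is the practical payoff of traceability on failure: the team knows exactly which gold tables, alerts, and dashboards to hold or refresh, instead of guessing.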
How does the video demonstrate the use of Unity Catalog in a practical scenario?
-The video demonstrates the use of Unity Catalog by showing how a user can load a dataset into a volume, write it to a Delta table, perform exploratory data analysis, create a feature table, build and register a model, and create a dashboard, all while capturing data lineage.
Outlines
📊 Data Lineage with Unity Catalog
Pearl introduces herself as a technical marketing engineer at Databricks and outlines the purpose of the video: to demonstrate how Unity Catalog captures real-time data lineage across various data objects. Data lineage is crucial for data teams to perform root cause analysis on errors in data pipelines, applications, dashboards, and machine learning models, significantly reducing debugging time and manual effort. It also aids in compliance and audit readiness by automating the creation of data flow trails. Unity Catalog's built-in data lineage offers end-to-end visibility, allowing users to auto-capture runtime data lineage on Databricks clusters or SQL warehouses and track it down to the table and column level. The video showcases the process of capturing lineage from a notebook, creating a feature table for a model, and registering the model in Unity Catalog. It also highlights the interactive lineage graph, the ability to view lineage at the column level, and the capability to trace lineage across federated data sources and workflows, emphasizing the importance of data traceability for error analysis and compliance.
🔍 Enhancing Data Compliance and Quality
In this paragraph, the focus shifts to the benefits of data lineage for compliance, audit reporting, and ensuring data quality across all workloads. The script explains how teams can leverage Unity Catalog to automatically track sensitive data for compliance requirements and conduct root cause analysis of errors in data pipelines. This capability is essential for maintaining data integrity and addressing any issues that may arise, thus streamlining the process of identifying and rectifying errors in the data workflow.
Mindmap
Keywords
💡Technical Marketing Engineer
💡Unity Catalog
💡Data Lineage
💡Root Cause Analysis
💡Compliance
💡Data Pipelines
💡Databricks
💡Delta Table
💡Feature Table
💡Federated Data Sources
💡Workflow
Highlights
Introduction to Unity Catalog's automatic capture of real-time data lineage across data objects.
Explanation of how data lineage helps in root cause analysis and reduces debugging time significantly.
The importance of data lineage for compliance and audit readiness in organizations.
Unity Catalog's built-in data lineage feature offering end-to-end visibility into data flow and consumption.
Autocapture of runtime data lineage on Databricks cluster or SQL warehouse at table and column levels.
Utilization of common permission models from Unity Catalog for data lineage tracking.
Demonstration of lineage capture from a notebook using the Lending Club dataset.
Process of writing data to a Delta table for future use and exploratory data analysis.
Creation of a feature table for building models to estimate loan status.
Registration of the trained model in Unity Catalog.
Creation of a dashboard to display loan status distribution and other relevant metrics.
Interactive lineage graph showcasing data flow from volume to models generated from a notebook.
Ability to view lineage at the column level and its relationship with specific data features.
Lineage capture across federated data sources and Delta tables from external databases like PostgreSQL.
Ensuring data quality and trustworthiness from federated sources through data lineage.
Workflow example demonstrating lineage capture for data transformation and alert generation.
Data traceability for error analysis in case of workflow failures, impacting upstream or downstream processes.
Automatic tracking of sensitive data for compliance and audit reporting using Unity Catalog.
Ensuring data quality across all workloads and conducting root cause analysis of errors in data pipelines.
Transcripts
Hi, my name is Pearl, and I am a technical marketing engineer here at Databricks. In this video I'll show you how Unity Catalog automatically captures real-time data lineage across all your data objects run on Databricks.

Data lineage helps data teams perform a root cause analysis of any errors in their data pipelines, applications, dashboards, and machine learning models by tracing the error to its source. This significantly reduces debugging time, saving days, or in many cases months, of manual effort. Data lineage also helps organizations be compliant and audit ready, thereby alleviating the operational overhead of manually creating the trails of data flows for auditing and reporting purposes.

Unity Catalog provides built-in data lineage and offers end-to-end visibility into how data flows and is consumed in your organization. Customers can auto-capture runtime data lineage on a Databricks cluster or SQL warehouse, track lineage down to the table and column level, and even leverage common permission models from Unity Catalog.

First, let's take a look at how lineage is captured from a notebook. I loaded my Lending Club dataset into a volume called demo volume, and then I'm writing it to a Delta table called Loan Data for future use. After doing some exploratory data analysis on the table, I understand that this dataset is quite large and contains tons of information surrounding loans. To focus on a specific task from the given dataset, I want to reasonably estimate a specific loan status. First, I'm going to create a feature table that will include all the features I need to build my model. After creating, testing, and training my model, I can now register it in Unity Catalog. Offline, I've also created a dashboard that showcases the status of bad loans and good loans, as well as the loan status distribution and more.

Now let's see how Unity Catalog has captured runtime data lineage for our use case. In the Catalog Explorer, I'm going to locate the loan features table that I created, and in the Lineage tab I'm able to see upstream or downstream lineage from tables, notebooks, dashboards, or even models associated with my features table. But the best part is the lineage graph here. This is an interactive graph that showcases lineage from our volume to our loan data table, to our feature table, and then to our models, all generated from that one notebook.

There is also the ability to view lineage at the column level. Here we are seeing the relationship between the bad loan column in the loan features table and the loan status column in the loan data table.

Lineage can also be seen in any of your federated data sources. Here I have data from a Postgres instance whose tables are being mirrored into my lakehouse. I can then create a Delta table using one of the tables from Postgres, and back in the Lineage tab of the Delta table we created, we can see the lineage across the data. This ability to trace data lineage between federated data sources, Delta tables, and even views allows you to ensure that your data came from trusted, quality sources.

Lastly, lineage can also be captured for any workflow that reads or writes to Unity Catalog. Here I've already created a workflow that takes the line item bronze table and the orders bronze table and transforms them into the line item silver and order silver tables, respectively. From there, the line item silver and order silver tables are joined to create an order status gold table, and an alert is generated off the order silver table as well. To finish off our workflow, a dashboard is refreshed. In the job details on the right-hand side, I'm able to see the lineage information associated with this particular workflow, or job. Here I can see five upstream tables being read and five downstream tables being written. This is really impactful because, in the event that a workflow fails, as it does here, there is data traceability to analyze any errors in the pipeline that may impact teams upstream or downstream.

With this knowledge, your data teams can now automatically track sensitive data for compliance requirements and audit reporting, ensure data quality across all workloads, and conduct root cause analysis of errors in any of your data pipelines.