Data Lineage with Unity Catalog

Databricks
16 Jan 2024 · 05:19

Summary

TL;DR: In this video, Pearl Uu, a technical marketing engineer at Databricks, demonstrates how Unity Catalog captures real-time data lineage across all data objects. The lineage helps data teams perform root cause analysis, reducing debugging time and ensuring compliance. Pearl walks through a use case involving a loan dataset, showing how lineage is tracked from notebooks to dashboards, and across various data sources. The interactive lineage graph and detailed tracing capabilities are highlighted as key features for maintaining data quality and streamlining auditing processes.

Takeaways

  • 🌟 Unity Catalog by Databricks captures real-time data lineage across data objects, aiding in debugging and compliance.
  • 🔍 Data lineage helps in root cause analysis of errors in data pipelines, applications, dashboards, and machine learning models, saving significant time and effort.
  • 📈 It provides end-to-end visibility into data flows and consumption within an organization, enhancing operational efficiency.
  • 🛠️ Users can auto-capture runtime data lineage on Databricks clusters or SQL warehouses, tracking down to table and column levels.
  • 🔑 Unity Catalog integrates with common permission models, facilitating secure and controlled access to data lineage information.
  • 📚 The script demonstrates capturing lineage from a notebook, showcasing the process of data analysis and model building with the Lending Club dataset.
  • 📊 A dashboard is created to display the status of bad and good loans, along with loan status distribution, providing visual insights into the data.
  • 📈 The lineage graph in Unity Catalog is interactive, showing the flow from the volume to the model, all originating from a single notebook.
  • 🔄 Lineage can be viewed at the column level, highlighting relationships between specific data points, such as 'bad loan' status.
  • 🔗 Unity Catalog supports lineage across federated data sources, ensuring data from trusted and quality sources is maintained.
  • 🛑 In case of workflow failures, data traceability is available for analyzing pipeline errors, impacting upstream and downstream processes.

Q & A

  • What is the purpose of Unity Catalog in the context of the video?

    -Unity Catalog is used to automatically capture real-time data lineage across all data objects, which helps data teams perform root cause analysis of errors in their data pipelines, applications, dashboards, and machine learning models.

  • How does data lineage assist in reducing debugging time?

    -Data lineage helps by tracing errors to their source, which significantly reduces the time spent on manual debugging, saving days or even months of effort.

  • What are the compliance and audit-related benefits of using data lineage?

    -Data lineage helps organizations to be compliant and audit-ready by alleviating the operational overhead of manually creating trails of data flows for auditing and reporting purposes.

  • How does Unity Catalog provide visibility into data flows?

    -Unity Catalog provides end-to-end visibility by offering data lineage that shows how data flows and is consumed within an organization.

  • What can customers autocapture using Unity Catalog on a Databricks cluster or SQL warehouse?

    -Customers can autocapture runtime data lineage on a Databricks cluster or SQL warehouse and track lineage down to the table and column level.

  • What is the significance of the lineage graph in Unity Catalog?

    -The lineage graph is an interactive tool that showcases the lineage from the volume to the data table, feature table, and models, providing a visual representation of the data flow.
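    The upstream/downstream traversal that the lineage graph performs can be modeled as a directed graph walk. Below is a minimal pure-Python sketch: the object names mirror the demo (volume → loan data table → feature table → model), but the edge list and function names are illustrative assumptions, not the Unity Catalog API.

    ```python
    from collections import defaultdict, deque

    # Hypothetical lineage edges mirroring the demo: the volume feeds a
    # Delta table, which feeds a feature table, which feeds a model.
    EDGES = [
        ("demo_volume", "loan_data"),
        ("loan_data", "loan_features"),
        ("loan_features", "loan_model"),
    ]

    def build_graph(edges):
        """Return forward (downstream) and reverse (upstream) adjacency maps."""
        down, up = defaultdict(list), defaultdict(list)
        for src, dst in edges:
            down[src].append(dst)
            up[dst].append(src)
        return down, up

    def trace(adjacency, start):
        """Breadth-first walk from `start`; returns every reachable object."""
        seen, queue = set(), deque([start])
        while queue:
            node = queue.popleft()
            for nxt in adjacency[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen

    down, up = build_graph(EDGES)
    print(trace(down, "loan_data"))    # downstream objects of the Delta table
    print(trace(up, "loan_features"))  # upstream objects of the feature table
    ```

    The same traversal works in both directions, which is why the lineage tab can answer "what feeds this table?" and "what breaks if this table changes?" from one set of edges.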

  • How does Unity Catalog capture lineage at the column level?

    -Unity Catalog allows users to view the relationship between specific columns, such as the 'bad loan' column in the loan features table and the 'loan status' column in the loan data table.
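    Column-level lineage is conceptually a mapping from each derived column to the source columns it was computed from. A minimal sketch, assuming a hypothetical lineage mapping (the column names follow the demo; the data structure is illustrative, not how Unity Catalog stores lineage):

    ```python
    # Hypothetical column-level lineage: each derived (table, column) maps
    # to the source (table, column) pairs it was computed from, mirroring
    # how loan_features.bad_loan derives from loan_data.loan_status.
    COLUMN_LINEAGE = {
        ("loan_features", "bad_loan"): [("loan_data", "loan_status")],
        ("loan_data", "loan_status"): [("demo_volume", "loan_status")],
    }

    def column_sources(table, column, lineage):
        """Recursively resolve a column back to its root source columns."""
        parents = lineage.get((table, column), [])
        if not parents:
            return [(table, column)]  # no recorded parent: treat as a root
        roots = []
        for parent in parents:
            roots.extend(column_sources(*parent, lineage))
        return roots

    print(column_sources("loan_features", "bad_loan", COLUMN_LINEAGE))
    # traces back through loan_data.loan_status to the volume
    ```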

  • What is the role of Unity Catalog in ensuring data quality from federated data sources?

    -Unity Catalog captures lineage across federated data sources, Delta tables, and views, ensuring that data comes from trusted and quality sources.

  • How does Unity Catalog assist with workflow data lineage?

    -Unity Catalog captures data lineage for any workflow that reads or writes to Unity Catalog, providing traceability to analyze errors in the pipeline that may impact upstream or downstream processes.

  • What is the advantage of having data traceability in the event of a workflow failure?

    -Data traceability allows data teams to automatically track sensitive data for compliance requirements, ensure data quality across all workloads, and conduct root cause analysis of errors in data pipelines.
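    The failure-impact analysis described above amounts to finding everything downstream of the failed object. A minimal sketch using the demo's medallion workflow (the edge list is assumed from the video's description; this illustrates the idea, not a Databricks API):

    ```python
    from collections import defaultdict, deque

    # Hypothetical workflow from the demo: bronze tables are transformed to
    # silver, silver tables are joined into a gold table, and an alert is
    # generated off one silver table.
    TASK_EDGES = [
        ("lineitem_bronze", "lineitem_silver"),
        ("orders_bronze", "orders_silver"),
        ("lineitem_silver", "order_status_gold"),
        ("orders_silver", "order_status_gold"),
        ("orders_silver", "orders_alert"),
    ]

    def impacted_by_failure(edges, failed):
        """Everything downstream of a failed object is potentially stale."""
        down = defaultdict(list)
        for src, dst in edges:
            down[src].append(dst)
        impacted, queue = set(), deque([failed])
        while queue:
            for nxt in down[queue.popleft()]:
                if nxt not in impacted:
                    impacted.add(nxt)
                    queue.append(nxt)
        return impacted

    print(sorted(impacted_by_failure(TASK_EDGES, "orders_silver")))
    # → ['order_status_gold', 'orders_alert']
    ```

    This is the question a data team asks first when a task fails: which downstream tables, alerts, and dashboards can no longer be trusted until the failure is fixed.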

  • How does the video demonstrate the use of Unity Catalog in a practical scenario?

    -The video demonstrates the use of Unity Catalog by showing how a user can load a dataset into a volume, write it to a Delta table, perform exploratory data analysis, create a feature table, build and register a model, and create a dashboard, all while capturing data lineage.

Outlines

00:00

📊 Data Lineage with Unity Catalog

Pearl Uu introduces herself as a technical marketing engineer at Databricks and outlines the purpose of the video: to demonstrate how Unity Catalog captures real-time data lineage across various data objects. Data lineage is crucial for data teams to perform root cause analysis on errors in data pipelines, applications, dashboards, and machine learning models, significantly reducing debugging time and manual effort. It also aids in compliance and audit readiness by automating the creation of data flow trails. Unity Catalog's built-in data lineage offers end-to-end visibility, allowing users to auto-capture runtime data lineage on Databricks clusters or SQL warehouses and track it down to the table and column level. The video showcases the process of capturing lineage from a notebook, creating a feature table for a model, and registering it in Unity Catalog. It also highlights the interactive lineage graph, the ability to view lineage at the column level, and the capability to trace lineage across federated data sources and workflows, emphasizing the importance of data traceability for error analysis and compliance.

05:01

🔍 Enhancing Data Compliance and Quality

In this paragraph, the focus shifts to the benefits of data lineage for compliance, audit reporting, and ensuring data quality across all workloads. The script explains how teams can leverage Unity Catalog to automatically track sensitive data for compliance requirements and conduct root cause analysis of errors in data pipelines. This capability is essential for maintaining data integrity and addressing any issues that may arise, thus streamlining the process of identifying and rectifying errors in the data workflow.

Keywords

💡Technical Marketing Engineer

A technical marketing engineer is a professional who combines technical expertise with marketing strategies to promote products or services effectively. In the context of the video, Pearl Uu, as a technical marketing engineer at Databricks, is responsible for demonstrating and explaining technical products to potential customers. This role is crucial for bridging the gap between the technical team and the market, ensuring that the product's capabilities are communicated clearly.

💡Unity Catalog

Unity Catalog is a feature within Databricks that provides a centralized repository for managing metadata about data assets. It is designed to capture and visualize data lineage, which is the tracking of data from its origin to its final state. In the video, Unity Catalog is showcased for its ability to automatically capture real-time data lineage across various data objects, which is essential for data teams to perform root cause analysis and ensure compliance.

💡Data Lineage

Data lineage refers to the documentation of the origins of data, how it is processed, and where it moves within an organization. It is critical for debugging data issues, ensuring compliance, and preparing for audits. In the video, data lineage is highlighted as a way to trace errors back to their source, significantly reducing debugging time and operational overhead associated with manual data flow tracking.

💡Root Cause Analysis

Root cause analysis is a systematic approach to identifying the underlying reasons for an issue or error. In the context of data pipelines, it involves tracing an error to its source to understand why it occurred. The video emphasizes how data lineage in Unity Catalog helps data teams perform root cause analysis, which is vital for maintaining the integrity and reliability of data pipelines.

💡Compliance

Compliance in the context of data management refers to adhering to regulations, standards, and policies that govern the handling and processing of data. The video mentions that data lineage helps organizations be compliant and audit-ready, which means that they can demonstrate to auditors how data is managed and protected, reducing the operational overhead of manually creating data flow trails.

💡Data Pipelines

Data pipelines are the workflows through which data flows from its origin to its destination, often involving various transformations and processing steps. In the video, data lineage is shown to be crucial for debugging errors in data pipelines, which can save significant time and effort in identifying and resolving issues.

💡Databricks

Databricks is a company that provides a unified analytics platform based on Apache Spark. It offers a collaborative environment for data science and engineering teams to work with data. In the video, Databricks is the platform where the Unity Catalog feature is demonstrated, showcasing its capabilities in capturing and visualizing data lineage.

💡Delta Table

A Delta table is a storage format in Databricks that supports ACID transactions and provides a scalable and performant way to store and query data. In the script, the creation of a Delta table called 'Loan Data' is mentioned, which is a key step in the data workflow and is used for further analysis and model building.

💡Feature Table

A feature table is a dataset that contains all the features needed for building a predictive model. In the video, the creation of a feature table is discussed as part of the process to estimate loan status, highlighting the importance of selecting relevant features for model training.

💡Federated Data Sources

Federated data sources refer to the ability to access and query data from various sources without moving the data into a central location. In the video, the script mentions data from a PostgreSQL instance being mirrored into the Lakehouse, demonstrating how lineage can be captured across federated sources.

💡Workflow

A workflow in the context of data processing refers to a sequence of steps or processes that data goes through from ingestion to transformation and analysis. The video script describes a workflow that involves reading from bronze tables, transforming data, and writing to silver and gold tables, illustrating how data lineage can be tracked throughout these processes.

Highlights

Introduction to Unity Catalog's automatic capture of real-time data lineage across data objects.

Explanation of how data lineage helps in root cause analysis and reduces debugging time significantly.

The importance of data lineage for compliance and audit readiness in organizations.

Unity Catalog's built-in data lineage feature offering end-to-end visibility into data flow and consumption.

Autocapture of runtime data lineage on Databricks cluster or SQL warehouse at table and column levels.

Utilization of common permission models from Unity Catalog for data lineage tracking.

Demonstration of lineage capture from a notebook using the Lending Club dataset.

Process of writing data to a Delta table for future use and exploratory data analysis.

Creation of a feature table for building models to estimate loan status.

Registration of models in Unity Catalog for offline access.

Creation of a dashboard to display loan status distribution and other relevant metrics.

Interactive lineage graph showcasing data flow from volume to models generated from a notebook.

Ability to view lineage at the column level and its relationship with specific data features.

Lineage capture across federated data sources and Delta tables from external databases like PostgreSQL.

Ensuring data quality and trustworthiness from federated sources through data lineage.

Workflow example demonstrating lineage capture for data transformation and alert generation.

Data traceability for error analysis in case of workflow failures, impacting upstream or downstream processes.

Automatic tracking of sensitive data for compliance and audit reporting using Unity Catalog.

Ensuring data quality across all workloads and conducting root cause analysis of errors in data pipelines.

Transcripts

[00:00] Hi, my name is Pearl Uu and I am a technical marketing engineer here at Databricks. In this video I'll show you how Unity Catalog automatically captures real-time data lineage across all your data objects run on Databricks.

[00:16] Data lineage helps data teams perform a root cause analysis of any errors in their data pipelines, applications, dashboards, and machine learning models by tracing the error to its source. This significantly reduces the debugging time, saving days or in many cases months of manual effort. Data lineage helps organizations be compliant and audit-ready, thereby alleviating the operational overhead of manually creating the trails of data flows for auditing and reporting purposes.

[00:53] Unity Catalog provides built-in data lineage and offers end-to-end visibility into how data flows and is consumed in your organization. Customers can auto-capture runtime data lineage on a Databricks cluster or SQL warehouse, track lineage down to the table and column level, and even leverage common permission models from Unity Catalog.

[01:19] First, let's take a look at how lineage is captured from a notebook. I loaded my Lending Club dataset into a volume called demo volume, and then I'm writing it to a Delta table called Loan Data for future use. After doing some exploratory data analysis on the table, I understand that this dataset is quite large and contains tons of information surrounding loans. To focus on a specific task from the given dataset, I want to reasonably estimate a specific loan status. First, I'm going to create a feature table that will include all the features I need to build my model. After creating, testing, and training my model, I can now register it in Unity Catalog. Offline, I've also created a dashboard that showcases the status of bad loans and good loans, as well as loan status distribution and more.

[02:16] Now let's see how Unity Catalog has captured runtime data lineage for our use case. In the Catalog Explorer, I'm going to locate my loan features table that I created, and in the lineage tab I'm able to see upstream or downstream lineage from tables, notebooks, dashboards, or even models associated with my features table. But the best part is the lineage graph here. This is an interactive graph that showcases lineage from our volume to our loan data table to our feature table and then our models, all generated from that one notebook.

[03:00] There is also an ability to view lineage at the column level. Here we are seeing the relationship between the bad loan column in the loan features table and the loan status column in the loan data table.

[03:14] Lineage can also be seen in any of your federated data sources. Here I have data from a Postgres instance whose tables are being mirrored into my lakehouse. I can then create a Delta table using one of the tables from Postgres, and we can see the lineage across the data back in the lineage tab of the Delta table we created. Here, the ability to trace data lineage between federated data sources, Delta tables, and even views allows you to ensure that your data came from trusted, quality sources.

[03:49] Lastly, lineage can also be captured for any workflow that reads or writes to Unity Catalog. Here I've already created a workflow that takes the line item bronze table and the orders bronze table and transforms them into the line item silver and order silver tables, respectively. From there, the line item silver and order silver tables are joined to create an order status gold table, and an alert is generated off this order silver table as well. To finish off our workflow, a dashboard is refreshed.

[04:30] In the job details on the right-hand side, I'm able to see the lineage information associated with this particular workflow or job. Here I can see five upstream tables being read and five downstream tables being written. This is really impactful because in the event that a workflow fails, as it does here, there is data traceability to analyze any errors in the pipeline that may impact teams upstream or downstream.

[05:01] With this knowledge, your data teams can now automatically track sensitive data for compliance requirements and audit reporting, ensure data quality across all workloads, and conduct root cause analysis of errors in any of your data pipelines.
