Data Federation with Unity Catalog

Databricks
16 Jan 2024 · 08:12

Summary

TL;DR: In this video, Pearl UU, a Technical Marketing Engineer at Databricks, discusses how the Databricks Data Intelligence Platform simplifies the discovery, querying, and governance of distributed data across multiple sources. She addresses the challenges posed by fragmented data and demonstrates how Databricks' Lakehouse Federation allows organizations to securely access and manage their data, regardless of its location, without the need for data ingestion. The video also includes a demo on creating connections and managing data across different databases using Databricks' Unity Catalog.

Takeaways

  • 🌐 Data is often scattered across multiple systems, making it hard to discover and access, which can hinder informed decision-making and innovation.
  • 🔍 The Databricks Data Intelligence Platform aims to simplify the discovery, querying, and governance of data, regardless of its location.
  • 🚀 Organizations face challenges with data integration due to the time and resources required to move data to a single platform, which can slow down execution.
  • 🔒 Fragmented governance can lead to weak compliance and increased risk of inappropriate data access or leakage, affecting collaboration and data democratization.
  • 🏰 Lakehouse Federation is introduced as a solution to address the pain points of a data mesh framework, allowing for easier exposure, querying, and governance of siloed data systems.
  • 🔑 With Lakehouse Federation, users can automatically classify and discover all data types in one place, enabling secure access and exploration for everyone in the organization.
  • đŸ› ïž The platform accelerates ad hoc analysis and prototyping across all data and analytics use cases without the need for data ingestion.
  • 📊 Advanced query planning and caching across sources ensure optimal query performance, even when combining data from multiple platforms with a single query.
  • đŸ›Ąïž A unified permission model allows for setting and applying access rules and safeguarding data across sources, including row and column-level security and tag-based policies.
  • 🔄 Data lineage and auditability are built-in, helping to track data usage and meet compliance requirements, with the ability to visualize upstream and downstream relationships.
  • 📝 The demonstration showcases how to discover, govern, and query data spread across PostgreSQL, MySQL, and Delta Lake in a unified and easy manner using Databricks tools.

Q & A

  • What is the main challenge with data scattered across multiple systems?

    -The main challenge is that it makes data discovery and access difficult, leading to incomplete data and insights, which hinders informed decision-making and innovation.

  • Why does data integration take time and resources, slowing down execution?

    -Data integration is time-consuming because it requires moving data from external sources to a chosen platform; some data may not be worth the effort, and some takes too long to land in a single unified location, slowing down execution.

  • What are the risks associated with fragmented governance?

    -Fragmented governance leads to weak compliance, duplication of efforts, and an increased risk of not being able to monitor and guard against inappropriate access or data leakage, which hinders collaboration and data democratization.

  • How does the Lakehouse Federation approach address the pain points of a data mesh?

    -Lakehouse Federation simplifies the exposure, querying, and governance of siloed data systems as an extension of the Lakehouse, enabling automatic classification, discovery, and secure access to all data, regardless of its location.

  • What does it mean to enable everyone in an organization to securely access and explore all data available?

    -It means that all users within an organization can access and explore data from various sources in a unified manner, with no need for data ingestion, and with advanced query planning and caching for optimal performance.

  • How does the Lakehouse Federation ensure optimal query performance across multiple platforms?

    -It uses a single engine for advanced query planning across sources and caching, ensuring that even when combining data from multiple platforms with a single query, the performance is optimized.

  • What is the significance of having a single permission model for data governance?

    -A single permission model allows access rules to be set and applied consistently across different data sources, enabling row and column level security, tag-based policies, and centralized auditing, and it helps meet compliance requirements through built-in data lineage and auditability.

  • How does the script demonstrate the process of creating a connection to an external database system?

    -The script shows the process by creating a connection to a PostgreSQL database with the CREATE CONNECTION SQL command and then creating a foreign catalog that mirrors the PostgreSQL database in Unity Catalog, so the data can be queried and Databricks user access managed (a SQL sketch of these steps follows this Q&A list).

  • What is the purpose of granting access to users or groups of users to use a connection?

    -Granting access promotes democratic data processing in a seamless and efficient way, allowing appropriate teams to respond to data changes quickly.

  • How does the script illustrate the importance of data lineage for understanding data relationships?

    -The script uses a lineage graph to show how tables were created, starting with federating into a PostgreSQL database directly from Databricks, and subsequently creating Delta tables, illustrating the lineage between different tables and their sources.

  • What are the prerequisites for creating a foreign catalog in the Databricks environment?

    -To create a foreign catalog, one must have the 'create catalog' permission on the metastore and be either the owner of the connection or have the 'create foreign catalog' privilege on the connection.
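The answers above describe the connection-then-foreign-catalog flow and its permissions in prose. As a rough Databricks SQL sketch of what those steps typically look like, assuming a hypothetical host, secret scope, database name, and group names (none of these values come from the demo):

    -- Create a connection holding the path and credentials for PostgreSQL.
    CREATE CONNECTION postgres_conn TYPE postgresql
    OPTIONS (
      host 'pg.example.com',
      port '5432',
      user secret('demo_scope', 'pg_user'),
      password secret('demo_scope', 'pg_password')
    );

    -- Let a group build foreign catalogs on top of the connection.
    GRANT CREATE FOREIGN CATALOG ON CONNECTION postgres_conn TO `data_engineers`;

    -- Mirror one PostgreSQL database as a foreign catalog in Unity Catalog.
    CREATE FOREIGN CATALOG postgres_catalog
    USING CONNECTION postgres_conn
    OPTIONS (database 'sales_db');

    -- Grant query access at the catalog level.
    GRANT USE CATALOG, USE SCHEMA, SELECT ON CATALOG postgres_catalog TO `analysts`;

The same pattern would repeat for the MySQL instance, using TYPE mysql and its own credentials.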

Outlines

00:00

🔍 Data Mesh Challenges and Databricks' Solution

Pearl introduces herself as a technical marketing engineer at Databricks and addresses the complexities of a data mesh framework. She explains how the Databricks data intelligence platform simplifies the discovery, querying, and governance of data across various systems. The script discusses the challenges of data scattered across multiple sources, the time and resources required for data integration, and the issues with fragmented governance. It introduces Lakehouse Federation as a solution that allows organizations to expose, query, and govern data from siloed systems, enabling automatic classification, discovery, and secure access to data. The platform supports advanced query planning and caching for optimal performance and a unified permission model for data security and compliance.
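To make the "single query across platforms" point concrete, here is a minimal sketch of an ad hoc federated query. The foreign catalog, schema, table, and column names are illustrative assumptions, not the demo's actual identifiers:

    -- Join a PostgreSQL table and a MySQL table directly; no ingestion step,
    -- with query planning and pushdown handled by a single engine.
    SELECT p.user_id, p.last_name, o.order_total
    FROM postgres_catalog.public.online_users AS p
    JOIN mysql_catalog.shop.orders AS o
      ON p.user_id = o.user_id;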

05:00

đŸ› ïž Setting Up Data Access and Security with Databricks

This paragraph delves into the process of setting up data access and security within Databricks. It emphasizes the importance of being a metastore admin or having specific privileges to create connections and catalogs. The script outlines the steps to create a connection to external databases like PostgreSQL and MySQL, and to establish foreign catalogs that mirror these databases within the Databricks catalog. It highlights the ability to grant access to users and groups, promoting democratic data processing. The paragraph also discusses the capabilities of the unified catalog to provide row and column level security for external database tables. The script concludes by demonstrating the creation of Delta tables through joins with external database tables and the importance of data lineage for tracking data changes and relationships across different systems.
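The PD join and PM join tables from this walkthrough could be created with CREATE TABLE ... AS SELECT statements along these lines. This is a minimal sketch assuming illustrative catalog, schema, and column names, since the demo does not show its exact identifiers:

    -- Delta table from a right join between the federated PostgreSQL table
    -- and an existing Delta table in the lakehouse.
    CREATE TABLE main.demo.pd_join AS
    SELECT u.user_id, u.email, l.loan_amount
    FROM postgres_catalog.public.online_users AS u
    RIGHT JOIN main.demo.loan_data AS l
      ON u.user_id = l.user_id;

    -- Delta table joining the federated PostgreSQL and MySQL tables.
    CREATE TABLE main.demo.pm_join AS
    SELECT u.user_id, u.email, o.order_total
    FROM postgres_catalog.public.online_users AS u
    JOIN mysql_catalog.shop.orders AS o
      ON u.user_id = o.user_id;

Because these tables land in a Unity Catalog schema, the lineage between the foreign tables and the new Delta tables is captured automatically, which is what the lineage graph in the demo visualizes.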

Keywords

💡Data Mesh

Data Mesh is a decentralized data architecture that distributes data ownership and governance throughout an organization. It is a framework that promotes data democratization, allowing data to be managed and governed by the teams closest to it. In the video, the Data Mesh framework is discussed as a solution to the complexities of managing scattered data across various systems, which can hinder decision-making and innovation.

💡Databricks

Databricks is a data analytics platform that unifies data science, engineering, and business analytics. It is used for data processing, machine learning, and data visualization. In the script, Databricks is the platform that simplifies the discovery, querying, and governance of data, regardless of its location, by leveraging its data intelligence platform.

💡Data Integration

Data Integration refers to the process of combining data from different sources into a unified view. It is crucial for providing a comprehensive understanding of data and enabling better decision-making. The script mentions that data integration can be time-consuming and resource-intensive, which can slow down execution and innovation.

💡Lakehouse

A Lakehouse is a modern data architecture that combines the best of data lakes and data warehouses. It allows for the storage of structured and unstructured data and supports both batch and real-time analytics. The script discusses how Lakehouse Federation addresses the pain points of a data mesh by enabling organizations to expose, query, and govern siloed data systems.

💡Data Governance

Data Governance is the process of managing data availability, usability, integrity, and security in an organization. It is essential for ensuring compliance and preventing data misuse. In the video, fragmented governance is highlighted as a problem that leads to weak compliance and increased risk of inappropriate data access or leakage.

💡Data Democratization

Data Democratization is the practice of making data accessible to a broader range of people within an organization. It aims to empower more individuals to make data-driven decisions. The script emphasizes the importance of data democratization in enabling collaboration and innovation by allowing everyone to securely access and explore data.

💡Data Catalog

A Data Catalog is a tool that provides a searchable index of all the data assets in an organization. It helps users discover and understand the data available to them. In the script, the Data Catalog is used to discover and manage data from various sources, including external databases, within Databricks.

💡Data Lineage

Data Lineage is the tracking of data from its origin to its final consumption, showing the data's movement and transformation over time. It is crucial for understanding data context and ensuring compliance. The script mentions how Databricks captures data lineage, allowing users to visualize the relationships and transformations of data.
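Lineage is explored in the video through the Catalog Explorer graph. As a rough sketch, workspaces with system tables enabled can also query lineage in SQL; the table and column names below follow the commonly documented Databricks system-table layout and should be verified for your workspace, and the target table name is hypothetical:

    -- List upstream tables recorded for the (hypothetical) pd_join Delta table.
    SELECT source_table_full_name, target_table_full_name, event_time
    FROM system.access.table_lineage
    WHERE target_table_full_name = 'main.demo.pd_join'
    ORDER BY event_time DESC;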

💡Foreign Catalog

A Foreign Catalog in the context of Databricks is a virtual catalog that represents an external database system. It allows users to query external databases as if they were querying tables within Databricks. The script demonstrates how to create a foreign catalog to mirror an external database and manage user access to it.

💡Data Access

Data Access refers to the ability to retrieve and use data within an organization. It is a critical component of data governance and security. The script discusses granting data access to users or groups of users to promote democratic data processing and efficient collaboration.

💡Row and Column Level Security

Row and Column Level Security are security measures that control access to specific rows or columns in a database table. This allows organizations to restrict sensitive data access based on user roles or policies. In the script, the ability to apply row and column level security to external database tables within Databricks is highlighted as a way to safeguard data.
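In the demo, a masking function is created and applied to the last name column of the federated online users table. A minimal sketch of that pattern follows, assuming a hypothetical group name and catalog paths; depending on the workspace, the mask may need to be applied to a view over the foreign table rather than to the foreign table itself:

    -- Mask the column for members of the super_admins group, mirroring the
    -- demo, where the presenter's own (admin) account sees the masked value.
    CREATE OR REPLACE FUNCTION main.demo.mask_last_name(last_name STRING)
    RETURNS STRING
    RETURN CASE
      WHEN is_account_group_member('super_admins') THEN '****'
      ELSE last_name
    END;

    ALTER TABLE postgres_catalog.public.online_users
      ALTER COLUMN last_name SET MASK main.demo.mask_last_name;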

Highlights

The Databricks Data Intelligence Platform simplifies discovering, querying, and governing data across various systems.

Thousands of organizations use Databricks for data and AI innovation.

Data scattered across multiple systems creates challenges in data discovery and access.

Data integration is time-consuming and resource-intensive, slowing down execution.

Fragmented governance leads to weak compliance and increased risk of data leakage.

Lakehouse Federation addresses data mesh pain points by simplifying data exposure, querying, and governance.

Automatic classification and discovery of all data, both structured and unstructured, in one place.

Secure access and exploration of all available data for everyone in the organization.

Acceleration of ad hoc analysis and prototyping across all data analytics and AI use cases without data ingestion.

Advanced query planning and caching for optimal query performance across multiple platforms.

Unified permission model for setting access rules and safeguarding data across data sources.

Demonstration of discovering, governing, and querying data spread across PostgreSQL, MySQL, and Delta Lake.

Creating connections and foreign catalogs to federate into external database systems.

Granting access to users or groups for democratic data processing.

Setting up permissions at the catalog level for flexibility in responding to data changes.

Providing row and column level security for external database tables using Databricks.

Creating Delta tables by joining data from federated PostgreSQL and MySQL instances.

Visualizing data lineage to track data changes and relationships between tables in real-time.

Databricks captures metadata changes automatically, enabling real-time data lineage visualization.

Transcripts

play00:00

Hi, my name is Pearl UU and I am a technical marketing engineer here at Databricks. Given the complexities surrounding a data mesh framework, I'll share with you how the Databricks Data Intelligence Platform makes it easy for you to discover, query, and govern all of your data no matter where it lives. Thousands of organizations of all sizes are innovating across the world with data and AI on the Databricks Data Intelligence Platform, but for historical, organizational, or technological reasons, data is scattered across many operational and analytic systems, causing more challenges.

play00:40

First, not all data is in one place, making it difficult to discover and access all data. Most organizations have valuable data distributed across multiple data sources; it may be in several databases, a data warehouse, object storage systems, and more. This leads to incomplete data and insights, which hinders customers' ability to make informed decisions and innovate faster.

play01:09

Second, data integration takes time and resources, which slows down execution due to engineering bottlenecks. To query data across multiple data sources, customers typically need to first move their data from external data sources to their platform of choice. Some data might not even be worth the effort, and some data will take too long before landing in a single unified location, slowing down innovation.

play01:40

And lastly, fragmented governance leads to weak compliance across siloed systems. Fragmented governance leads to duplication of efforts and increases the risk of not being able to monitor and guard against inappropriate access or leakage, which hinders collaboration and data democratization.

play02:02

Lakehouse Federation addresses these critical pain points that a data mesh would promote and makes it simple for organizations to expose, query, and govern siloed data systems as an extension of their lakehouse. With these new capabilities, you can automatically classify and discover all of your data, structured and unstructured, in one place, and enable everyone in your organization to securely access and explore all the data available at their fingertips, no matter where it is. You can also accelerate ad hoc analysis and prototyping across all of your data, analytics, and AI use cases on the most complete data, with no ingestion required. With a single engine, advanced query planning across sources and caching ensures optimal query performance, even when accessing and combining data from multiple platforms with a single query. And lastly, you can use one permission model to set and apply access rules and safeguard all of your data across data sources. You can apply rules like row and column level security, tag-based policies, and centralized auditing consistently across platforms, and track data usage and meet compliance requirements with built-in data lineage and auditability.

play03:26

This demo will consist of various data spread across Postgres, MySQL, and my Delta Lake, and I'll show you how to discover, govern, and query this data in a unified and easy way. Let's get into the workspace to see exactly how this works. Here I'm in my Catalog Explorer. This is where we'll find our regular standard catalogs that are based on our cloud storage data, and our foreign catalogs that allow connection into external database sources.

play04:00

To federate into an external database system, a connection needs to be created and subsequently a foreign catalog. A connection specifies a path and credentials to access this system. To create a connection, you can use the Catalog Explorer or the CREATE CONNECTION SQL command in a Databricks notebook or the Databricks SQL editor. I'm going to create the connection to Postgres by just using this function, and then create a foreign catalog that will mirror my Postgres database in Unity Catalog so that I can query the data and manage Databricks user access to this database. I'll do the exact same thing here for my MySQL instance.

play04:47

Back in the Catalog Explorer, by selecting External Data and then Connections, I can confirm that my connections have been made. Now that they have been confirmed, we can grant access to any user or groups of users to use this connection. This promotes democratic data processing in a seamless and efficient way. To create a connection, you must be a metastore admin or a user with the CREATE CONNECTION privilege on the Unity Catalog metastore attached to the workspace.

play05:19

Back in our Catalog Explorer, we can view our new Postgres catalog and MySQL catalog. We can see the catalogs we just made, the schemas from our Postgres instance have populated, and we can see the tables within the schema. We can also preview the data in Postgres by viewing the sample data in Databricks. Best of all, we can provide access to users and groups of users by setting up permissioning at the catalog level. This allows users to use the catalogs and provides more flexibility for the appropriate team to respond to data changes quickly. To create a foreign catalog, you must have the CREATE CATALOG permission on the metastore and be either the owner of the connection or have the CREATE FOREIGN CATALOG privilege on the connection.

play06:13

Additionally, with the power of Unity Catalog, we can provide both row and column level security for our external database tables. Here is the original online users table from our Postgres instance. Then I'll create a function that will mask a certain column if I am a super admin. I can apply that mask function to my online users table, and because I am in fact a super admin, the last name column is now masked.

play06:42

The ability to work with your data no matter where it lives is important to our customers here at Databricks, and so in this particular notebook I've gone ahead and created a Delta table called PD join by doing a right join on the online users table from my Postgres instance and a loan data table, which is a Delta table. Similarly, I've also created a Delta table called PM join that joins data from the Postgres instance online users table and my MySQL external table.

play07:16

Let's see how these tables relate to one another by taking a closer look at the data lineage. It's important to note that since the foreign catalog mirrors the database, it will automatically capture any data or metadata changes occurring there in real time, with no caching or manual synchronization involved. Because of this, Databricks takes on all of that data or metadata change and allows you to visualize upstream and downstream notebooks, workflows, dashboards, tables, and views associated with your data. As you can see in the lineage graph here, we can see how tables were created, starting with federating into Postgres directly from Databricks and subsequently our Delta tables. Also, we can see the lineage between our other Postgres table and MySQL table as well.

play08:10

So now you know how...


Related Tags
Data Governance, Lakehouse Federation, Data Mesh, Data Integration, Data Analytics, AI Innovation, Data Access, Data Democratization, Data Security, Data Lineage