Data Federation with Unity Catalog
Summary
TL;DR: In this video, Pearl, a Technical Marketing Engineer at Databricks, discusses how the Databricks Data Intelligence Platform simplifies the discovery, querying, and governance of distributed data across multiple sources. She addresses the challenges posed by fragmented data and demonstrates how Databricks' Lakehouse Federation allows organizations to securely access and manage their data, regardless of its location, without the need for data ingestion. The video also includes a demo on creating connections and managing data across different databases using Databricks' Unity Catalog.
Takeaways
- Data is often scattered across multiple systems, making it hard to discover and access, which can hinder informed decision-making and innovation.
- The Databricks Data Intelligence Platform aims to simplify the discovery, querying, and governance of data, regardless of its location.
- Organizations face challenges with data integration due to the time and resources required to move data to a single platform, which can slow down execution.
- Fragmented governance can lead to weak compliance and increased risk of inappropriate data access or leakage, affecting collaboration and data democratization.
- Lakehouse Federation is introduced as a solution to the pain points of a data mesh framework, allowing siloed data systems to be exposed, queried, and governed more easily.
- With Lakehouse Federation, users can automatically classify and discover all data types in one place, enabling secure access and exploration for everyone in the organization.
- The platform accelerates ad hoc analysis and prototyping across all data and analytics use cases without the need for data ingestion.
- Advanced query planning and caching across sources ensure optimal query performance, even when combining data from multiple platforms with a single query.
- A unified permission model allows for setting and applying access rules and safeguarding data across sources, including row- and column-level security and tag-based policies.
- Data lineage and auditability are built in, helping to track data usage and meet compliance requirements, with the ability to visualize upstream and downstream relationships.
- The demonstration shows how to discover, govern, and query data spread across PostgreSQL, MySQL, and Delta Lake in a unified and easy way using Databricks tools.
Q & A
What is the main challenge with data scattered across multiple systems?
-The main challenge is that it makes data discovery and access difficult, leading to incomplete data and insights, which hinders informed decision-making and innovation.
Why does data integration take time and resources, slowing down execution?
-Data integration is time-consuming because it requires moving data from external sources to a chosen platform; some data may not be worth that effort, and some takes so long to land in a single unified location that it slows down innovation.
What are the risks associated with fragmented governance?
-Fragmented governance leads to weak compliance, duplication of efforts, and an increased risk of not being able to monitor and guard against inappropriate access or data leakage, which hinders collaboration and data democratization.
How does the Lakehouse Federation approach address the pain points of a data mesh?
-Lakehouse Federation simplifies the exposure, querying, and governance of siloed data systems as an extension of the Lakehouse, enabling automatic classification, discovery, and secure access to all data, regardless of its location.
What does it mean to enable everyone in an organization to securely access and explore all data available?
-It means that all users within an organization can access and explore data from various sources in a unified manner, with no need for data ingestion, and with advanced query planning and caching for optimal performance.
How does the Lakehouse Federation ensure optimal query performance across multiple platforms?
-It uses a single engine for advanced query planning across sources and caching, ensuring that even when combining data from multiple platforms with a single query, the performance is optimized.
What is the significance of having a single permission model for data governance?
-A single permission model allows for setting and applying access rules consistently across different data sources, enabling row and column level security, tag-based policies, centralized auditing, and compliance requirements with built-in data lineage and auditability.
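As an illustration of that permission model, here is a minimal sketch of a Unity Catalog row filter in Databricks SQL. The function name us_region_filter, the group admins, and the table sales are hypothetical placeholders, not names from the video:

```sql
-- A boolean SQL UDF that decides, per row, whether the caller may see it.
CREATE OR REPLACE FUNCTION us_region_filter(region STRING)
RETURNS BOOLEAN
RETURN is_account_group_member('admins') OR region = 'US';

-- Attach the filter so non-admins only see US rows.
ALTER TABLE sales SET ROW FILTER us_region_filter ON (region);
```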
How does the script demonstrate the process of creating a connection to an external database system?
-The script shows the process by creating a connection to a PostgreSQL database using a function and then creating a foreign catalog that mirrors the PostgreSQL database in the unified catalog for querying and managing user access.
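For reference, a sketch of what those two steps can look like in Databricks SQL; the host, secret scope, and database names are placeholders:

```sql
-- Step 1: a connection stores the path and credentials for the external system.
CREATE CONNECTION IF NOT EXISTS postgres_conn TYPE postgresql
OPTIONS (
  host 'my-host.example.com',
  port '5432',
  user secret('my_scope', 'pg_user'),         -- credentials pulled from a secret scope
  password secret('my_scope', 'pg_password')
);

-- Step 2: a foreign catalog mirrors the PostgreSQL database in Unity Catalog.
CREATE FOREIGN CATALOG IF NOT EXISTS postgres_catalog
USING CONNECTION postgres_conn
OPTIONS (database 'my_database');
```

A MySQL connection works the same way with TYPE mysql; MySQL foreign catalogs omit the database option, since MySQL has no extra database layer.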
What is the purpose of granting access to users or groups of users to use a connection?
-Granting access promotes democratic data processing in a seamless and efficient way, allowing appropriate teams to respond to data changes quickly.
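A minimal sketch of such a grant in Databricks SQL, assuming a placeholder group named data-analysts:

```sql
-- Let a group of users use this connection.
GRANT USE CONNECTION ON CONNECTION postgres_conn TO `data-analysts`;
```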
How does the script illustrate the importance of data lineage for understanding data relationships?
-The script uses a lineage graph to show how tables were created, starting with federating into a PostgreSQL database directly from Databricks, and subsequently creating Delta tables, illustrating the lineage between different tables and their sources.
What are the prerequisites for creating a foreign catalog in the Databricks environment?
-To create a foreign catalog, one must have the 'create catalog' permission on the metastore and be either the owner of the connection or have the 'create foreign catalog' privilege on the connection.
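Expressed as grants, under the assumption of a placeholder user someone@example.com:

```sql
-- Metastore-level permission to create catalogs of any kind.
GRANT CREATE CATALOG ON METASTORE TO `someone@example.com`;

-- Connection-level privilege, for users who don't own the connection.
GRANT CREATE FOREIGN CATALOG ON CONNECTION postgres_conn TO `someone@example.com`;
```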
Outlines
Data Mesh Challenges and Databricks' Solution
Pearl introduces herself as a technical marketing engineer at Databricks and addresses the complexities of a data mesh framework. She explains how the Databricks data intelligence platform simplifies the discovery, querying, and governance of data across various systems. The script discusses the challenges of data scattered across multiple sources, the time and resources required for data integration, and the issues with fragmented governance. It introduces Lakehouse Federation as a solution that allows organizations to expose, query, and govern data from siloed systems, enabling automatic classification, discovery, and secure access to data. The platform supports advanced query planning and caching for optimal performance and a unified permission model for data security and compliance.
Setting Up Data Access and Security with Databricks
This paragraph delves into the process of setting up data access and security within Databricks. It emphasizes the importance of being a metastore admin or having specific privileges to create connections and catalogs. The script outlines the steps to create a connection to external databases like PostgreSQL and MySQL, and to establish foreign catalogs that mirror these databases within the Databricks catalog. It highlights the ability to grant access to users and groups, promoting democratic data processing. The paragraph also discusses the capabilities of the unified catalog to provide row and column level security for external database tables. The script concludes by demonstrating the creation of Delta tables through joins with external database tables and the importance of data lineage for tracking data changes and relationships across different systems.
Keywords
Data Mesh
Databricks
Data Integration
Lakehouse
Data Governance
Data Democratization
Data Catalog
Data Lineage
Foreign Catalog
Data Access
Row and Column Level Security
Highlights
The Databricks Data Intelligence Platform simplifies discovering, querying, and governing data across various systems.
Thousands of organizations use Databricks for data and AI innovation.
Data scattering across multiple systems creates challenges in data discovery and access.
Data integration is time-consuming and resource-intensive, slowing down execution.
Fragmented governance leads to weak compliance and increased risk of data leakage.
Lakehouse Federation addresses data mesh pain points by simplifying data exposure, querying, and governance.
Automatic classification and discovery of all data, both structured and unstructured, in one place.
Secure access and exploration of all available data for everyone in the organization.
Acceleration of ad hoc analysis and prototyping across all data analytics and AI use cases without data ingestion.
Advanced query planning and caching for optimal query performance across multiple platforms.
Unified permission model for setting access rules and safeguarding data across data sources.
Demonstration of discovering, governing, and querying data spread across PostgreSQL, MySQL, and Delta Lake.
Creating connections and foreign catalogs to federate into external database systems.
Granting access to users or groups for democratic data processing.
Setting up permissions at the catalog level for flexibility in responding to data changes.
Providing row- and column-level security for external database tables using Databricks.
Creating Delta tables by joining data from federated PostgreSQL and MySQL instances.
Visualizing data lineage to track data changes and relationships between tables in real-time.
Databricks captures metadata changes automatically, enabling real-time data lineage visualization.
Transcripts
Hi, my name is Pearl, and I am a technical marketing engineer here at Databricks. Given the complexities surrounding a data mesh framework, I'll share with you how the Databricks Data Intelligence Platform makes it easy for you to discover, query, and govern all of your data, no matter where it lives.

Thousands of organizations of all sizes are innovating across the world with data and AI on the Databricks Data Intelligence Platform. But for historical, organizational, or technological reasons, data is scattered across many operational and analytic systems, causing more challenges.

First, not all data is in one place, making it difficult to discover and access all data. Most organizations have valuable data distributed across multiple data sources; it may be in several databases, a data warehouse, object storage systems, and more. This leads to incomplete data and insights, which hinders customers' ability to make informed decisions and innovate faster.

Second, data integration takes time and resources, which slows down execution due to engineering bottlenecks. To query data across multiple data sources, customers typically need to first move their data from external data sources to their platform of choice. Some data might not even be worth the effort; some data will take too long before landing in a single unified location, slowing down innovation.

And lastly, fragmented governance leads to weak compliance across siloed systems. Fragmented governance leads to duplication of efforts and increases the risk of not being able to monitor and guard against inappropriate access or leakage, which hinders collaboration and data democratization.
Lakehouse Federation addresses these critical pain points that a data mesh can introduce, and makes it simple for organizations to expose, query, and govern siloed data systems as an extension of their lakehouse. With these new capabilities, you can automatically classify and discover all of your data, structured and unstructured, in one place, and enable everyone in your organization to securely access and explore all the data available at their fingertips, no matter where it is. You can also accelerate ad hoc analysis and prototyping across all of your data, analytics, and AI use cases on the most complete data, with no ingestion required. With a single engine, advanced query planning across sources and caching ensure optimal query performance, even when accessing and combining data from multiple platforms with a single query.
And lastly, you can use one permission model to set and apply access rules and safeguard all of your data across data sources. You can apply rules like row- and column-level security and tag-based policies, centralize auditing consistently across platforms, and track data usage and meet compliance requirements with built-in data lineage and auditability.

This demo will consist of various data spread across Postgres, MySQL, and my Delta lake, and I'll show you how to discover, govern, and query this data in a unified and easy way. Let's get into the workspace to see exactly how this works.

Here I'm in my Catalog Explorer. This is where we'll find our regular standard catalogs that are based on our cloud storage data, and our foreign catalogs that allow connections into external database sources.
To federate into an external database system, a connection needs to be created, and subsequently a foreign catalog. A connection specifies a path and credentials to access this system. To create a connection, you can use the Catalog Explorer or the CREATE CONNECTION SQL command in a Databricks notebook or the Databricks SQL editor. I'm going to create the connection to Postgres by just using this function, and then create a foreign catalog that will mirror my Postgres database in Unity Catalog, so that I can query the data and manage Databricks user access to this database. I'll do the exact same thing here for my MySQL instance.
Back in the Catalog Explorer, by selecting External Data, then Connections, I can confirm that my connections have been made. Now that they have been confirmed, we can grant access to any user or groups of users to use this connection. This promotes democratic data processing in a seamless and efficient way. To create a connection, you must be a metastore admin or a user with the CREATE CONNECTION privilege on the Unity Catalog metastore attached to the workspace.
Back in our Catalog Explorer, we can view our new Postgres catalog and MySQL catalog. We can see the catalogs we just made, the schemas from our Postgres instance have populated, and we can see the tables within the schema. We can also preview the data in Postgres by viewing the sample data in Databricks. Best of all, we can provide access to users and groups of users by setting up permissioning at the catalog level. This allows users to use the catalogs and provides more flexibility for the appropriate team to respond to data changes quickly. To create a foreign catalog, you must have the CREATE CATALOG permission on the metastore and be either the owner of the connection or have the CREATE FOREIGN CATALOG privilege on the connection.
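A sketch of that catalog-level permissioning on the mirrored catalog; the group name data-analysts is a placeholder:

```sql
-- Grants at the catalog level cascade to the schemas and tables beneath it.
GRANT USE CATALOG, USE SCHEMA, SELECT
ON CATALOG postgres_catalog
TO `data-analysts`;
```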
Additionally, with the power of Unity Catalog, we can provide both row- and column-level security for our external database tables. Here is the original online users table from our Postgres instance. Then I'll create a function that will mask a certain column if I am a super admin. I can apply that mask function to my online users table, and because I am in fact a super admin, the last name column is now masked.
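A minimal sketch of that masking step, assuming a column named last_name, a schema named public, and a group named super_admins (all placeholders):

```sql
-- A SQL UDF that hides the value from members of the super_admins group,
-- mirroring the demo, where the super admin sees the masked column.
CREATE OR REPLACE FUNCTION mask_last_name(last_name STRING)
RETURNS STRING
RETURN CASE
  WHEN is_account_group_member('super_admins') THEN '****'
  ELSE last_name
END;

-- Attach the mask to the federated table's column.
ALTER TABLE postgres_catalog.public.online_users
  ALTER COLUMN last_name SET MASK mask_last_name;
```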
The ability to work with your data no matter where it lives is important to our customers here at Databricks. So in this particular notebook, I've gone ahead and created a Delta table called pd_join by doing a right join on the online users table from my Postgres instance and a loan data table, which is a Delta table. Similarly, I've also created a Delta table called pm_join that joins data from the Postgres instance's online users table and my MySQL external table.
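A sketch of what that pd_join table might look like; the join key user_id, the join direction, and the catalog and schema names are assumptions, not taken from the video:

```sql
-- Join a federated PostgreSQL table with a local Delta table and persist
-- the result as a managed Delta table; no ingestion pipeline required.
CREATE OR REPLACE TABLE main.demo.pd_join AS
SELECT u.*, l.* EXCEPT (user_id)
FROM main.demo.loan_data AS l
RIGHT JOIN postgres_catalog.public.online_users AS u
  ON l.user_id = u.user_id;
```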
Let's see how these tables relate to one another by taking a closer look at the data lineage. It's important to note that since the foreign catalog mirrors the database, it will automatically capture any data or metadata changes occurring there in real time, with no caching or manual synchronization involved. Because of this, Databricks takes on all of that data or metadata change and allows you to visualize upstream and downstream notebooks, workflows, dashboards, tables, and views associated with your data. As you can see in the lineage graph here, we can see how tables were created, starting with federating into Postgres directly from Databricks and subsequently our Delta tables. Also, we can see the lineage between our other Postgres table and MySQL table as well. So now you know how