Databricks Unity Catalog: A Technical Overview
Summary
TL;DR: This video offers an in-depth introduction to Databricks Unity Catalog, emphasizing its centralized approach to access control, auditing, lineage, and data discovery across workspaces. It contrasts Unity Catalog's robust governance capabilities with the more limited features of traditional Hive metastore setups. The presenter demonstrates Unity Catalog's functionality, including managing permissions, data lineage tracking, and federating queries across data sources, showcasing the platform's efficiency and ease of use.
Takeaways
- Unity Catalog is a centralized system for access control, auditing, lineage, and data discovery across Databricks workspaces.
- Prior to Unity Catalog, access control and user management were decentralized, with each workspace having its own Hive metastore and limited governance capabilities.
- Unity Catalog introduces a shared metastore across multiple workspaces, enhancing data governance and privacy controls compared to the traditional Hive metastore.
- The Unity Catalog metastore supports functionalities like data discovery and lineage tracking, and is designed to work with cloud object storage solutions like Amazon S3 and Azure Data Lake Storage.
- Unity Catalog offers a single interface to administer data access policies, using standard SQL to grant and revoke permissions on catalog objects.
- It automatically captures user-level audit logs, providing transparency on who accesses the data and how it is used.
- Data lineage is provided by tracking and visualizing the flow of data across different datasets and processes within the platform.
- Users can tag and document data assets and search for them based on those tags, enhancing data discovery capabilities.
- The main administrative roles in Unity Catalog include Account Admin, Metastore Admin, and Workspace Admin, each with specific responsibilities and permissions.
- Unity Catalog organizes data objects hierarchically, starting with catalogs, followed by schemas (or databases), and then tables, views, functions, and volumes for non-tabular data.
- It enables Lakehouse query federation, allowing users to query data from external sources like Snowflake and even other Databricks workspaces without migrating data to a unified system.
Q & A
What is Unity Catalog in Databricks?
-Unity Catalog is a feature in Databricks that provides centralized access control, auditing, lineage, and data discovery capabilities across Databricks workspaces.
How was data governance managed before Unity Catalog in Databricks?
-Prior to Unity Catalog, data governance and management in Databricks were controlled at a workspace level, with each workspace having its own Hive metastore, leading to a decentralized and fragmented approach.
What are the limitations of the traditional Hive metastore in Databricks?
-The traditional Hive metastore had limitations such as basic security features and a reliance on access control lists for managing permissions, without robust data governance and privacy controls.
How does Unity Catalog's approach to Access Control and user management differ from the traditional approach?
-Unity Catalog uses a centralized approach for Access Control and user management, with a shared metastore across multiple workspaces, as opposed to the decentralized approach of the traditional Hive metastore.
What functionalities does Unity Catalog's metastore support that the Hive metastore does not?
-Unity Catalog's metastore supports functionalities such as data discovery and lineage tracking, and is designed to work with cloud object storage like Amazon S3 and Azure Data Lake Storage, unlike the Hive metastore which works with the Hadoop Distributed File System.
What are the main administrative roles in Unity Catalog?
-The main administrative roles in Unity Catalog are Account Admin, Metastore Admin, and Workspace Admin, each with different levels of permissions and responsibilities.
How does Unity Catalog enable data access policies administration across workspaces?
-Unity Catalog offers a single place to administer data access policies across all workspaces using standard SQL commands to grant and revoke permissions on Unity Catalog objects.
What is the significance of the three-level namespace in Unity Catalog?
-The three-level namespace in Unity Catalog (catalog, schema, table) allows for a more structured and organized way to reference data, as opposed to the two-level namespace (database, table) in the traditional Hive metastore.
How does Unity Catalog support data lineage and what benefits does it provide?
-Unity Catalog supports data lineage by tracking and visualizing the flow of data across different datasets and processes within the platform, providing transparency and understanding of data relationships and transformations.
What is the purpose of the 'Lakehouse query Federation' feature in Unity Catalog?
-Lakehouse query Federation allows Unity Catalog to access and query external databases, enabling federated queries across different data sources without the need to migrate data to a unified system.
What are the differences between a managed table and an external table in Unity Catalog?
-In Unity Catalog, a managed table is of the Delta format and can only be managed within Databricks, while an external table can have multiple formats such as Delta, Parquet, ORC, Avro, CSV, JSON, or text and can reference data stored outside of Databricks.
How does Unity Catalog enable sharing of data assets?
-Unity Catalog allows for data sharing through features like Delta sharing, which enables sharing of data assets outside the organization, and external data access using storage credentials and connections to query against multiple data sources.
What is the role of the Account Admin in the context of Unity Catalog?
-The Account Admin in Unity Catalog can create and link metastores to workspaces, assign Metastore Admins, configure storage credentials, and manage user group and service principal permissions.
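The managed versus external table answer above can be sketched as a small lookup. This is a minimal illustration in plain Python, not a Databricks API: the names `ALLOWED_FORMATS` and `is_allowed_format` are hypothetical, and the format lists simply restate what the video says.

```python
# Formats per table type, as stated in the video (not queried from Databricks).
ALLOWED_FORMATS = {
    "MANAGED": {"DELTA"},
    "EXTERNAL": {"DELTA", "PARQUET", "ORC", "AVRO", "CSV", "JSON", "TEXT"},
}


def is_allowed_format(table_type: str, data_format: str) -> bool:
    """Return True if the format is listed for the given table type."""
    formats = ALLOWED_FORMATS.get(table_type.upper())
    if formats is None:
        raise ValueError(f"unknown table type: {table_type!r}")
    return data_format.upper() in formats


print(is_allowed_format("managed", "delta"))     # managed tables: Delta only
print(is_allowed_format("managed", "parquet"))   # not listed for managed tables
print(is_allowed_format("external", "parquet"))  # external tables: several formats
```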
Outlines
Introduction to Databricks Unity Catalog
This paragraph introduces the concept of Databricks Unity Catalog, which is a centralized solution for access control, auditing, lineage, and data discovery across Databricks workspaces. Prior to Unity Catalog, governance and data management were decentralized, with each workspace having its own Hive metastore. The limitations of this approach led to a fragmented governance model relying on access control lists. Unity Catalog changes this by offering a centralized approach to access control and user management, with a shared metastore across multiple workspaces. It supports advanced functionalities like data discovery and lineage tracking and is designed to work with cloud object storage solutions.
Unity Catalog Features and Administrative Roles
The second paragraph delves into the specific features of Unity Catalog, such as the ability to administer data access policies across all workspaces using standard SQL, automatic capture of user-level audit logs, and data lineage tracking. It also discusses the tagging and documentation of data assets for searchable access. The paragraph outlines the main administrative roles within Unity Catalog: account admins, metastore admins, and workspace admins, each with specific responsibilities and privileges. The structure of Unity Catalog's data objects is also explained, including the metastore, catalog, schema, and various other objects like tables, views, functions, and machine learning models.
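As a rough sketch of granting and revoking permissions with standard SQL, the helper below only builds the statement text. The `Grant` class, the helper names, and the principal `user1@example.com` are hypothetical; the statements follow the general `GRANT ... ON ... TO` and `REVOKE ... ON ... FROM` shape, and would be run in a SQL cell or passed to `spark.sql` in a notebook.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grant:
    privilege: str        # e.g. "SELECT", "USE SCHEMA", "USE CATALOG"
    securable_type: str   # e.g. "TABLE", "SCHEMA", "CATALOG"
    securable_name: str   # e.g. "hr.bronze.countries"
    principal: str        # user, group, or service principal (hypothetical here)


def quote(identifier: str) -> str:
    """Backtick-quote a principal so characters like '@' are accepted."""
    return "`" + identifier.replace("`", "``") + "`"


def grant_sql(g: Grant) -> str:
    return (f"GRANT {g.privilege} ON {g.securable_type} "
            f"{g.securable_name} TO {quote(g.principal)}")


def revoke_sql(g: Grant) -> str:
    return (f"REVOKE {g.privilege} ON {g.securable_type} "
            f"{g.securable_name} FROM {quote(g.principal)}")


select_grant = Grant("SELECT", "TABLE", "hr.bronze.countries", "user1@example.com")
print(grant_sql(select_grant))
print(revoke_sql(select_grant))
```

Building the text separately from running it keeps the sketch runnable outside Databricks.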
Navigating Unity Catalog in a Databricks Workspace
This paragraph provides a practical demonstration of navigating Unity Catalog within a Databricks workspace. It explains how to access the account console, view workspaces and their assigned metastores, and manage user permissions. The paragraph also describes the different types of catalogs in Unity Catalog, including the traditional Hive metastore, system catalog, and user-created catalogs. It shows how to use a foreign catalog to query data from external sources like Snowflake, illustrating the concept of Lakehouse query federation. The process of creating schemas and managing tables within those schemas is also covered, along with the ability to execute queries using both Python and SQL interfaces.
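The Python and SQL queries from the demonstration can be sketched as follows. The helper `qualified_name` is hypothetical; the catalog, schema, and table names (`hr`, `bronze`, `countries`) are the ones used in the video. The Spark calls are left as comments because they assume a Databricks notebook where `spark` and `display` are already defined.

```python
def qualified_name(catalog: str, schema: str, table: str) -> str:
    """Join the three namespace levels into catalog.schema.table."""
    parts = (catalog, schema, table)
    for part in parts:
        if not part or "." in part:
            raise ValueError(f"invalid name part: {part!r}")
    return ".".join(parts)


name = qualified_name("hr", "bronze", "countries")
query = f"SELECT * FROM {name}"

# In a Databricks notebook (assumption: `spark` is the predefined session):
#   df = spark.read.table(name)   # Python interface
#   display(df)
#   spark.sql(query)              # same table through SQL

print(name)
print(query)
```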
Comparing Unity Catalog with Non-Unity Catalog Workspaces
The final paragraph contrasts a Unity Catalog-enabled workspace with a non-Unity Catalog workspace. It highlights the limitations of the legacy Hive metastore, such as the lack of lineage information, fewer features for data sharing, and a two-level namespace instead of the three-level namespace provided by Unity Catalog. The paragraph demonstrates how tables are managed and referenced in a non-Unity Catalog workspace, showing the differences in functionality and the propagation of permissions across workspaces. The comparison serves to emphasize the enhanced capabilities and centralized management offered by Unity Catalog.
Keywords
Unity Catalog
Data Governance
Hive Metastore
Data Lineage
Cloud Object Storage
Access Control
Metastore Admin
Data Assets
Lakehouse Query Federation
Delta Sharing
Three-Level Namespace
Highlights
Introduction to Unity Catalog in Databricks and its key features.
Unity Catalog provides centralized access control, auditing, lineage, and data discovery across Databricks workspaces.
Prior to Unity Catalog, governance and data management were decentralized and managed at the workspace level.
Unity Catalog introduces a centralized approach to Access Control and user management with a shared metastore.
The Unity Catalog metastore supports broader functionalities like data discovery and lineage tracking compared to the Hive metastore.
Unity Catalog offers a single place to administer data access policies using standard SQL.
Automatic capture of user-level audit logs for data access recording.
Data lineage is provided by tracking and visualizing data flow across datasets and processes.
Tagging and documentation of data assets with a search interface for asset retrieval based on tags.
Administrative roles in Unity Catalog include Account Admin, Metastore Admin, and Workspace Admin with distinct responsibilities.
Metastore is the top-level container for metadata in Unity Catalog with a three-level namespace.
Catalogs, schemas, and tables form the object hierarchy in Unity Catalog for referencing data.
Models for machine learning and other data objects like storage credentials and external locations are part of Unity Catalog.
Demonstration of Unity Catalog enabled workspaces and comparison with non-Unity Catalog enabled workspaces.
Permissions in Unity Catalog are propagated to all workspaces assigned to the same metastore.
Lineage graph in Unity Catalog provides transparency of data flow and relationships between datasets.
Delta sharing and external data access features in Unity Catalog for sharing and querying data across platforms.
Upcoming tutorial on enabling Unity Catalog in Databricks workspaces in the next video of the series.
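The lineage graph highlight can be pictured as a walk over upstream dependencies. This is a toy model in plain Python, not how Unity Catalog stores lineage; the `UPSTREAM` mapping and the exact table names are hypothetical, loosely based on the bronze, silver, and gold example in the video.

```python
# Hypothetical lineage: each table maps to the tables it was built from.
UPSTREAM = {
    "hr.gold.employee_details": ["hr.silver.employees", "hr.silver.departments"],
    "hr.silver.employees": ["hr.bronze.employees"],
    "hr.silver.departments": ["hr.bronze.departments"],
}


def all_upstream(table: str) -> list[str]:
    """Return every table that feeds into `table`, directly or indirectly."""
    seen: set[str] = set()
    stack = list(UPSTREAM.get(table, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(UPSTREAM.get(current, []))
    return sorted(seen)


for source in all_upstream("hr.gold.employee_details"):
    print(source)
```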
Transcripts
hey everyone and welcome to this
overview video on Databricks Unity
catalog this video is the first video as
part of a wider Unity catalog series on
my channel in this initial video I'll
introduce you to Unity catalog and
discuss some of its most important
features so what is Unity catalog Unity
catalog provides centralized Access
Control auditing lineage and data
Discovery capabilities across Databricks
workspaces prior to Unity catalog on
Databricks everything related to
governance and data management was
controlled at a workspace level so
Access Control user management and the
hive metastore were completely
decentralized and they were managed
individually at a workspace level each
workspace had its own Hive metastore
this approach had its limitations so
governance capabilities were somewhat
fragmented it relied primarily on access
control lists for managing permissions
at a basic
level data management was also
decentralized with individual teams
managing access to their own data assets
at a workspace level you also had basic
security features such as network
security encryption and authentication
mechanisms however comprehensive data
governance and privacy controls were not
as
robust with unity catalog it's
completely different as you can see from
the image on the screen there is a
centralized approach in Access Control
and user management and there's a
metastore that's shared across multiple
workspaces the unity catalog metastore
is different to The Hive
metastore it supports a broader range of
functionalities such as data Discovery
and lineage tracking while the hive
metastore is designed to work with the
Hadoop distributed file system the unity
catalog metastore has been designed to
work with Cloud object storage such as
Amazon S3 and Azure Data Lake Storage
there are numerous features in unity
catalog Unity catalog offers a single
place to administer data access policies
across all of your workspaces you can
use standard ANSI SQL to grant and revoke
permissions on Unity catalog objects
Unity catalog automatically captures
user level audit logs that record access
to your data it provides data lineage by
tracking and visualizing the flow of
data across different data sets and
processes Within the platform you can
also tag and document data assets and
then use a search interface to search
for these data assets based on those
tags the main administrative roles in
unity catalog are account admin
metastore admin and workspace admins account
admins can create and link metastores
to workspaces they can assign
metastore admins and configure storage
credentials among other things metastore
admins have extensive privileges over
the metastore itself and then
workspace admins can add users
and groups to a workspace they can
delegate workspace admin roles and they
can manage job ownership and the
handling of workspace
objects so the image on the screen
depicts the unity catalog data
objects the metas store is the top level
container for metadata each meta store
exposes a thre level name space so you
have catalog schema and table that's how
you reference your data catalog is the
first layer of the object hierarchy
schemas also known as databases are the
second layer of the object hierarchy so
they store objects such as tables and
Views you then have within each schema
tables views functions and volumes so
volumes will store your non-tabular data
you also have models models refer to
machine learning models
you can also see other objects in the
diagram too storage credentials and
external locations work together for
managing data access shares and
recipients are used to distribute data
assets within Unity catalog and then
connections enable Unity catalog to
access and query external databases
providing a seamless way to Federate
queries across different data sources
okay so that's enough of the theory let
me now show you one of my Unity catalog
enabled workspaces and then I'll compare
that with one of my non-Unity catalog
enabled workspaces so you can see the
differences for yourself okay so I'm in
one of my Unity catalog enabled
Databricks workspaces the account console is
where the administration of your
metastores across all workspaces in your
organization occurs to access the
account console under your username
click on manage account so under my user
under my username I can click on manage
account
only account admins can access this so
on the workspaces tab here you can see
all workspaces in your organization so I
have three workspaces and you can see
the relevant information such as the
metastore that they're assigned to so
these two workspaces have been assigned
to metastores so that implies that
these two are unity catalog enabled and
this one is not Unity catalog enabled
because it's not assigned to a
metastore on the data
tab you can see all of your metastores
so I have one metastore if I click into
that you can see the details for the
metastore so it belongs to the UK South
region and here is the metastore admin
so I can also as the account
admin edit this metastore admin and
then on the workspaces tab here you can
see the workspaces assigned to this
metastore so there are
two on the user management tab you can
change user group and service principal
permissions you can inherit these from
your cloud provider you can also assign
account admins here so if I click on a
user go on their roles you can assign
them as account admin only account
admins can assign account admins so keep
that in
mind okay so back in my workspace on
catalog Explorer you can see all of the
catalogs present in my
workspace so I'll start with the
hive metastore this isn't really a
proper catalog it reflects the
traditional Hive metastore allowing
users to access and manage the metadata
of tables and databases that were
previously managed by The Hive metastore
using the old approach this is
particularly useful for workspaces that
need to be migrated to Unity catalog
from the traditional Hive metastore so
this Hive metastore catalog is workspace
specific unlike the other normal
catalogs which can be shared across
multiple
workspaces this main catalog here is
automatically
created you also have this system
catalog here this contains metadata
about the unity catalog itself such as
information about tables views schemas
and other data assets the schemas in
this catalog are used for administrative
and monitoring
purposes you also have this samples
catalog this is created by default and
just contains sample data for you to
play around
with you'll also notice this catalog
called snowflake foreign if I click into
it you can see that's created using a
connection to Snowflake so this data is
actually stored outside of Databricks
and I am creating a foreign catalog so I
can actually query data from
Snowflake this is known as Lakehouse
query Federation and you can do this on
multiple platforms including
snowflake you can even use this to
connect to Databricks workspaces
outside of your organization so moving
on this HR catalog is one that I've
created when I click into
it there are five schemas or databases
you have the bronze silver and gold
schemas that I've created and then there
are two by default one is called default
and one is called information schema
these are created by default for each
catalog that you create so let me
click into the bronze schema here are
the tables in this
schema so previously when referencing
tables in the traditional Hive
metastore you specify the database and the
table Unity catalog has a three-level
namespace so if I want to query this
countries table I would specify it by
doing HR which is the catalog dot the
schema which is bronze dot the table
name so let me quickly Show You by
opening a notebook so I'll just add a
notebook
make sure it's connected to a cluster it
is so to read that table I can just do
spark.read.table and I'm using Python
right now so I can just reference
hr.bronze.countries and then run
this and then I can actually display
that like
so
here's the data using an SQL cell I
can just simply type select star from
hr.bronze.countries and then run
this and that's worked as well as you
can
see so back in
catalog and if I go back to this table
you can see the details for this table
so this table is a managed table you can
have managed and external tables managed
tables can only be of the Delta format
external tables can have multiple
formats such as Delta Parquet ORC Avro CSV
JSON or
text so since this is a managed table it
can only be Delta and here under the
details you can see the path to the
storage location you can see the
metastore ID that it's associated with and the
table ID as well you can also see
information such as who it's created by
and other useful bits of information as
well you can also manage permissions to
this table as well by going on this
permissions
tab so you can do that using this UI
where you can grant and revoke certain
permissions and you can also do that
using ANSI SQL in notebooks as well so
this user has select privileges on this
table so this is user one the user also
has use schema privileges and then they
also
have use catalog privileges as well so
to be able to query this table the user
also needs permissions on the catalogs
and the schemas for that table as well
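The privilege requirement just described can be sketched as a simple check. This is a simplified model in plain Python: the function `missing_privileges` and the set of grants are hypothetical, and it ignores details such as ownership and privileges inherited from parent objects.

```python
def missing_privileges(table_name: str, grants: set[tuple[str, str]]) -> list[tuple[str, str]]:
    """List the (privilege, securable) pairs still needed to query the table."""
    catalog, schema, _table = table_name.split(".")
    needed = [
        ("USE CATALOG", catalog),
        ("USE SCHEMA", f"{catalog}.{schema}"),
        ("SELECT", table_name),
    ]
    return [pair for pair in needed if pair not in grants]


# Grants for the user in the demo, written out by hand for illustration.
user_one = {
    ("USE CATALOG", "hr"),
    ("USE SCHEMA", "hr.bronze"),
    ("SELECT", "hr.bronze.countries"),
}

print(missing_privileges("hr.bronze.countries", user_one))
print(missing_privileges("hr.bronze.countries", {("SELECT", "hr.bronze.countries")}))
```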
furthermore at the catalog level notice
workspaces so I can assign this catalog
and all of the contents of that catalog
to multiple workspaces so right now all
workspaces that are a part of this
metastore can access this catalog and all of
the permissions that I apply to this
catalog will be propagated to all of the
workspaces that have access so if I
uncheck this I can now specify which
workspaces have access and you'll notice
right now I'm no longer able to access
this because it's been crossed out so if
I assign this to workspaces these
workspaces will now have access so I can
assign and then when I refresh this this
should change from no access to show me
the contents of that catalog and as you
can see that's the
case
great so just to reiterate that the
permissions that you give to your
catalogs schemas and tables and other
data assets will be applied to all
workspaces that the data is shared
to so now if I go to this countries
table again notice columns you can
actually add comments and
tags these comments and tags
can then be used in the global search
icon to search for the data so if I have
sensitive information I can add a
comment saying this is sensitive
information and I can search for that
using that specific tag further along
you can see lineage this shows the flow
of the data you can see under
tables this is the downstream table that
this table has been referenced in and
that is in the silver
layer you can also see the notebooks
that have been referencing this table
and then you can see workflows pipelines
paths and queries that have referenced
the table as
well you can also see the lineage graph
so you can see here is the table that
I'm currently selecting and here is the
downstream table that it's linked to and
to give you a better example of this let
me go on one of my gold tables so on
employee details if I go to
lineage see lineage graph and then I
expand this I can now see the full
lineage of this specific data model so
the bronze layer to the silver layer to
the gold layer so this silver employees
table has been joined with this
departments table to create this
employee details table you can see all
of the columns the data types and other
useful metrics as well so it gives you
full transparency of the data and the
flow of that
data so there's a lot of useful and
Powerful features on Unity catalog that
you can see there are also features that
allow you to share your
data so Delta sharing allows you to
share data assets outside of your
organization under external data you can
access data stored in Cloud object
storage using storage credentials and
external
locations and then using connections you
can run queries against multiple data
sources such as Snowflake MySQL PostgreSQL
and even other Databricks workspaces
outside of your metastore without
needing to migrate all of the data to a
unified system so you can see I've
already created a connection to a
snowflake data warehouse and that is how
I have connected to this foreign catalog
on snowflake using that connection so
now you know what Unity catalog can do
let me show you one of my non-unity
catalog enabled workspaces so you can
compare the two
so let me go back to the account console
I'll go to my workspaces and this is the
workspace that has not been assigned to
a metastore as you can see so let's open
that so I'll go to
catalog so to be able to access the
information let me just start the
serverless warehouse and that will just
take a moment to spin
up
and now
notice when I select a specific
table I have the details I have
permissions but I don't have lineage
information you also don't have
additional features such as external
data and Delta
sharing and this Legacy Hive metast
store also uses a two-level namespace so
you simply have the database and then
the tables so it does not use the three-
level namespace so when you reference
tables you just type HR doth table
net so that will be the database do the
table so where I was able to have a
separate database for my bronze silver
and gold tables in unity catalog here I
have all of the tables stored in the
same database of course I can have a
separate database for each layer as well
but this is how I've done it in this
instance I've prefixed the table with
the type of table that it is so you can
see bronze silver and gold
tables however you still can't see the
link between these tables or how they
flow between each other in general
there's much less functionality compared
to Unity catalog you can manage
permissions but the permissions apply
only to each workspace unlike Unity
catalog where you can set permissions
that are propagated to each workspace
assigned to the same
metastore okay so in this video I've
summarized the main features of unity
catalog which are centralized Access
Control auditing lineage and data
Discovery capabilities to name a
few I've also shown you a Unity catalog
enabled workspace so you can get a
better understanding of these features
in practice and I've also shown you a
non-unity catalog enabled workspace so
you can get a comparison of the two in
the next video of this Unity catalog
series I'll show you how to enable Unity
catalog on your Azure Databricks
workspace so if you found this video
useful then please give it a like And
subscribe to my channel for more content
like this