Access Controls with Unity Catalog
Summary
TLDR: In this video, Pearl uu from Databricks demonstrates Unity Catalog's central governance and auditing capabilities for data access control. The demo shows how to organize data assets, set permissions at different levels, and use a shared cluster for fine-grained access control. It also covers creating a feature store table, training and registering a model with MLflow, and setting up row- and column-level security for sensitive data so that data analysts can safely query only the subset they need.
Takeaways
- Unity Catalog is a central governance tool that manages and audits data access across workspaces.
- Data governance leaders and central governance teams have full admin capabilities to grant access to workspaces or metastores.
- Access can be granted to users, groups, or service principals within an organization for specific data assets.
- The demonstration uses a Databricks ML Workshop notebook to showcase the power of access controls within Unity Catalog.
- Catalogs are the first layer of Unity Catalog's three-level namespace, used for organizing data assets and setting permissions.
- Schemas, the second layer, organize tables, views, volumes, and models, with permissions set so that teams can use them.
- Volumes, the third layer, contain directories and files for data stored in any format, providing non-tabular data access.
- The demo grants read and write privileges on a volume to the data science team for a specific dataset.
- The video shows setting up a cluster with Unity Catalog for data analysis and machine learning workflows.
- The data science team creates a feature store table for training and testing models, leveraging the catalog and schema setup.
- Unity Catalog provides row- and column-level security through row filters and column masks to protect sensitive data.
- SQL functions are used to create row filters and column masks, tailoring data access for different teams' needs.
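The three-level namespace described in the takeaways can be sketched in a few lines of Python. This is a minimal illustration, not a Databricks API: the helper names and the example catalog, schema, table, and volume names are hypothetical. The `/Volumes/<catalog>/<schema>/<volume>/...` layout is the path convention Databricks documents for files stored in volumes.

```python
def full_name(catalog: str, schema: str, obj: str) -> str:
    """Build the three-level name catalog.schema.object."""
    parts = [catalog, schema, obj]
    for part in parts:
        # Reject empty parts and parts that would add extra levels.
        if not part or "." in part:
            raise ValueError(f"invalid name part: {part!r}")
    return ".".join(parts)


def volume_path(catalog: str, schema: str, volume: str, relative: str = "") -> str:
    """Build the file path for a volume, optionally with a file or directory below it."""
    base = f"/Volumes/{catalog}/{schema}/{volume}"
    return f"{base}/{relative.lstrip('/')}" if relative else base


# Hypothetical names, used only for illustration.
table = full_name("demo_catalog", "default", "loan_data")
csv_file = volume_path("demo_catalog", "default", "demo_volume", "lending_club.csv")
```

Defining the catalog and schema once and deriving every table, model, and volume reference from them is what keeps references consistent across a notebook.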
Q & A
What is the role of Pearl uu in the video?
-Pearl uu is a technical marketing engineer at Databricks, and she presents a demonstration on how Unity Catalog governs and audits data access.
What capabilities are granted to the data governance leader and central governance team in Unity Catalog?
-The data governance leader and central governance team are granted full admin capabilities, allowing them to grant access to workspaces or metastores to users, groups, or service principals within the organization.
What is a catalog in Unity Catalog and how is it used?
-A catalog in Unity Catalog is the first layer of the three-level namespace. It is used to organize data assets and set permissions for teams to use within the catalog.
What is a schema in Unity Catalog and what is its purpose?
-A schema in Unity Catalog is the second layer of the three-level namespace. It organizes tables, views, volumes, and models, and allows for permissioning to be set at the schema level.
What is a volume in Unity Catalog and how does it relate to data storage?
-A volume in Unity Catalog is part of the third layer of the namespace. It resides under a schema and contains directories and files for data stored in any format, providing non-tabular access to data.
What is the significance of the 'NP volume' in the demonstration?
-The 'NP volume' is significant as it contains a dataset called 'Lending Club' that the data science team will use. It demonstrates how to grant read and write privileges on a volume.
What is the purpose of creating a shared cluster in Unity Catalog?
-A shared cluster covers most use cases: multiple users can share its resources, and it supports Unity Catalog's fine-grained access control. It is suitable for collaborative work environments.
What is a feature store table and how is it utilized in the script?
-A feature store table, such as 'loan features Test 2' in the script, is used to house features needed for model training and testing. It is created to manage and organize the features for machine learning models.
How does Unity Catalog facilitate the registration of models?
-Unity Catalog allows the registration of models by referencing the same catalog and schema names used throughout the process. This ensures consistency and organization of models within the catalog.
What is the Catalog Explorer and how does it help in managing tables and models?
-The Catalog Explorer is a tool within Unity Catalog that allows users to view and manage all the tables and models within a specific catalog and schema. It helps in organizing and providing access to data assets.
How does Unity Catalog provide row and column-level security?
-Unity Catalog provides row and column-level security through the use of row filters and column masks. This allows for the control of access to specific rows of data and the masking of sensitive columns.
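The effect of a row filter and a column mask can be illustrated with a small pure-Python simulation. This is only a sketch of the semantics, not how Unity Catalog implements them: the group names, the column names `home_ownership` and `annual_inc`, and the choice to apply the row filter to every caller are assumptions made for the example.

```python
from typing import Dict, List, Optional, Set

Row = Dict[str, object]


def mask_income(value: Optional[float], groups: Set[str]) -> Optional[float]:
    """Column mask: only members of the (hypothetical) data-science group see the value."""
    return value if "data-science" in groups else None


def keep_row(row: Row) -> bool:
    """Row filter: keep only homeowners with a mortgage."""
    return row["home_ownership"] == "MORTGAGE"


def query_as(rows: List[Row], groups: Set[str]) -> List[Row]:
    """Return what a caller with the given group memberships would see."""
    visible = []
    for row in rows:
        if not keep_row(row):
            continue
        masked = dict(row)  # copy so the underlying data is never modified
        masked["annual_inc"] = mask_income(row["annual_inc"], groups)
        visible.append(masked)
    return visible


rows = [
    {"home_ownership": "MORTGAGE", "annual_inc": 85000.0},
    {"home_ownership": "RENT", "annual_inc": 42000.0},
]
analyst_view = query_as(rows, {"data-analysts"})
```

The stored table is unchanged; the filter and mask are evaluated at query time, so different callers see different results from the same table.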
Outlines
Data Governance with Unity Catalog
In this segment, Pearl introduces herself as a technical marketing engineer at Databricks and outlines the purpose of the video: to demonstrate how Unity Catalog manages and audits data access across workspaces. As the data governance leader, one has full admin capabilities to grant access to workspaces or metastores. The demonstration uses a Databricks ML Workshop notebook to showcase access controls within Unity Catalog. The video begins with setting up a catalog called 'niore catalog,' which is the first layer of Unity Catalog used for organizing data assets and setting permissions. The data science team has been granted privileges to use this catalog and create various data structures within it, such as schemas, volumes, and datasets. The setup includes defining catalog and schema names for consistency and setting up a shared cluster for data analysis. The process involves loading the 'Lending Club' dataset into a Delta table named 'Loan Data' for further analysis and model training.
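The permissions described in this segment map onto SQL `GRANT` statements. The sketch below only builds the statement strings; the helper names and the catalog, schema, volume, and group names are hypothetical, and the privilege list mirrors what the video describes rather than the exact grants used in the demo.

```python
from typing import Iterable, List


def grant(privileges: Iterable[str], securable_type: str, name: str, principal: str) -> str:
    """Build one GRANT statement; the principal is back-quoted because group names may contain dashes."""
    return f"GRANT {', '.join(privileges)} ON {securable_type} {name} TO `{principal}`"


def data_science_grants(catalog: str, schema: str, volume: str, group: str) -> List[str]:
    """Grants at each level of the namespace: catalog, schema, then volume."""
    schema_privileges = [
        "USE SCHEMA",
        "CREATE FUNCTION",
        "CREATE MATERIALIZED VIEW",
        "CREATE MODEL",
        "CREATE TABLE",
        "CREATE VOLUME",
    ]
    return [
        grant(["USE CATALOG"], "CATALOG", catalog, group),
        grant(schema_privileges, "SCHEMA", f"{catalog}.{schema}", group),
        grant(["READ VOLUME", "WRITE VOLUME"], "VOLUME", f"{catalog}.{schema}.{volume}", group),
    ]


# Hypothetical names, used only for illustration.
statements = data_science_grants("demo_catalog", "default", "demo_volume", "data-science")
```

In a Databricks notebook, each string could presumably be run with `spark.sql(statement)` by a user who is allowed to grant those privileges.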
Row and Column Level Security in Unity Catalog
This paragraph delves into the advanced security features of Unity Catalog, focusing on row and column-level security through the use of row filters and column masks. The data science team, having full privileges, wishes to provide the data analyst team with access to the 'loan data' table but with certain restrictions due to sensitive data. To achieve this, SQL functions are created to mask the 'annual income' column and filter the data to show only homeowners with mortgages. The functions are applied to the 'loan data' table, ensuring that the data analyst team can access the necessary data without exposure to sensitive information. The video concludes by verifying the successful application of these security measures, demonstrating how Unity Catalog can centrally govern and audit data access across workspaces, ensuring data privacy and compliance.
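The SQL behind this step might look roughly like the statements built below. This is a sketch under assumptions, not the notebook's actual code: the function names `income_mask` and `mortgage_filter`, the column names `annual_inc` and `home_ownership`, and the group name are hypothetical, and the row filter here applies to every caller. `is_account_group_member`, `SET ROW FILTER`, and `SET MASK` are existing Databricks SQL features.

```python
from typing import List


def security_statements(table: str, schema_prefix: str, privileged_group: str) -> List[str]:
    """Build SQL that defines a column mask and a row filter, then attaches both to a table."""
    mask_fn = f"{schema_prefix}.income_mask"        # hypothetical function name
    filter_fn = f"{schema_prefix}.mortgage_filter"  # hypothetical function name
    return [
        # Column mask: return the real income only to the privileged group, NULL otherwise.
        f"CREATE OR REPLACE FUNCTION {mask_fn}(annual_inc DOUBLE) "
        f"RETURN CASE WHEN is_account_group_member('{privileged_group}') "
        "THEN annual_inc ELSE NULL END",
        # Row filter: keep only rows for homeowners with a mortgage.
        f"CREATE OR REPLACE FUNCTION {filter_fn}(home_ownership STRING) "
        "RETURN home_ownership = 'MORTGAGE'",
        # Attach the filter and the mask to the table.
        f"ALTER TABLE {table} SET ROW FILTER {filter_fn} ON (home_ownership)",
        f"ALTER TABLE {table} ALTER COLUMN annual_inc SET MASK {mask_fn}",
    ]


# Hypothetical names, used only for illustration.
stmts = security_statements("demo_catalog.user.loan_data", "demo_catalog.user", "data-science")
```

Keeping the functions in the same catalog and schema as the table means the filter and mask are governed by the same permissions model as the data they protect.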
Keywords
Unity Catalog
Technical Marketing Engineer
Data Governance
Catalog
Schema
Volume
Data Science Team
Feature Store
MLflow
Row and Column Level Security
Data Analyst Team
Highlights
Unity Catalog centrally governs and audits data access across workspaces, providing full admin capabilities to the central governance team.
Catalogs are the first layer of Unity Catalog's three-level namespace, used for organizing data assets and permissioning.
Schemas, also known as databases, are the second layer of Unity Catalog, organizing tables, views, volumes, and models with permissioning capabilities.
Volumes, the third layer of Unity Catalog, contain directories and files for data stored in any format, providing non-tabular access to data.
Data access can be consistently referenced by defining catalog and schema names appropriately.
Two types of clusters are supported: shared clusters for fine-grained access control and single-user clusters for advanced use cases like GPU or distributed machine learning.
Data can be loaded from volumes and saved as Delta tables for future use, as demonstrated with the Lending Club dataset.
Feature store tables, like 'loan features test 2', are created to house features needed for model training and testing.
Training sets can be built from feature store tables, and MLflow can be used to train and test models and register them in Unity Catalog.
Models and tables in Unity Catalog can be explored in the Catalog Explorer, showing their relationships and permissions.
Execute permissions can be set for models, allowing the data science team to use them when necessary.
Row and column-level security can be provided through row filters and column masks to protect sensitive data.
SQL functions can be created to mask specific columns, like annual income, and filter rows based on criteria, such as homeowners with mortgages.
Data analysts can utilize tables with applied row filters and column masks, ensuring access to only the necessary data without exposing sensitive information.
Unity Catalog enables centralized governance and auditing of data access, ensuring secure and controlled data usage across workspaces.
Transcripts
hi my name is Pearl uu and I am a
technical marketing engineer here at
Databricks in this video I'm going to
share with you how Unity catalog
centrally governs and audits data access
across workspaces as the data governance
leader of your organization you and your
central governance team have been
granted full admin capabilities and can
grant access to workspaces or
metastores to users groups or service
principals in your organization for the
purpose of this demo we will use the
Databricks ML Workshop notebook to see
the power of access controls within
Unity catalog and to get started we'll
need some data to work with to do this
the data science lead has created a
catalog called niore catalog a catalog
is the first layer of Unity Catalog's
three-level namespace it's used to organize
your data assets and permissioning has
been set for the data science team to
use the catalog within the catalog
there's a schema titled default that has
already been created by the data science
lead as well a schema also called a
database is the second layer of Unity
Catalog's three-level namespace a schema
organizes tables views volumes and
models permissioning can also be set at
the catalog level and in this case it's
been set for the data science team to
have privileges to use the schema create
functions materialized views models
tables and volumes within the schema
within the schema there's a volume
called NP volume again created by the
data science lead a volume resides in
the third layer of unity catalog
three-level
namespace volumes are organized under a
schema in Unity Catalog volumes
contain directories and files for data
stored in any format and provide
non-tabular access to
data NP volume has a data set called
Lending Club that the data science team
will use by granting read and write
privileges on volume
access back in the notebook before we do
anything we're going to make sure that
we can consistently reference the
catalog and schema name by defining
them appropriately for the workshop
we'll be referencing the catalog as ni
catalog and another schema called user
before any member of the data science
team can even run this code we have to
make sure that we have our cluster set
up so let's create a new resource with
unity catalog there are two different
clusters that we support one is the
shared cluster which is good for 90% of
your use cases users can share it and we
support all of the fine-grained
access
control then we have the single user
clusters this is where if you want to do
something slightly more advanced maybe
you want to use a GPU or distributed
machine learning then you can break out
onto this cluster and safely isolate
yourself for the purpose of this
Workshop we're going to use a shared
cluster then we're going to load our
Lending Club data set from our volume and
then save it as a Delta table called
Loan Data for future
use after doing some exploratory data
analysis on this loan data set the data
science team would like to reasonably
estimate a specific loan status given
the data provided in the loan data
table to do this a feature store table
will be created called loan features
Test
2 this table will house the features
needed for when the model is trained and
tested as you can see this table is
referencing the catalog and the schema
we created earlier harnessing the power
of unity catalog now we can create a
training set using our features from the
feature store and then test and train
the model using MLflow then we'll
register the model in unity catalog by
referencing the same catalog schema name
and the name of the model which in this
case is loan estimator the configuration
of the mlflow client to access models in
unity catalog is referenced here now
let's take a look at where all of the
tables and models reside in the catalog
Explorer under the Nico catalog and
under the user schema that was
referenced throughout the workshop we
have our Loan Data which is our full
Loan Data set we also have loan features
test two which is the feature store
table to hold our loan features and then
below that as you can see we have our
loan estimator model registered right
here in our catalog as
well if we click the model we can set
execute permissions to our data science
team giving them the ability to use the
model when necessary similarly our
tables also provide permissioning and we
can showcase that by giving the data
science team select privileges on the
table speaking of tables let's see how
Unity catalog provides row and column
level security through row filters and
column masks respectively the data
science team knows that eventually the
data analyst team might want to utilize
the loan data table to generate
reports however there might be some
columns that may need to be masked due
to the sensitive data that they hold or
maybe rows would need to be filtered out
so that the data analyst team can query
a subset of the data that they need to
do this it's very simple first we're
going to see what our current data set
looks like this is what the data science
team will be able to see since they have
full
privileges they want to show the data
analyst team only data where homeowners
have a mortgage and where their income
is hidden so let's create the SQL
functions I'm going to leverage the
power of Unity Catalog here by
using the catalog and the schema where
my data set resides then I'll create a
function that masks the annual income
and only allows the data science team to
see it next we'll set up a function that
filters out the home ownership status to
just homeowners with
mortgages now we can verify that our
functions have been set up correctly and
then apply these newly created functions
to the loan data table
let's take a look at our new loan data
table and here we can verify that our
annual income column is masked to show
the null value instead and the
homeowners are all mortgages this table
can now be utilized by the data analyst
team as needed without any concerns
surrounding sensitive data and with that
you've learned how Unity catalog
centrally governs and audits data access
across workspaces