Access Controls with Unity Catalog

Databricks
16 Jan 2024, 07:28

Summary

TLDR: In this video, Pearl Uu from Databricks demonstrates Unity Catalog's central governance and auditing capabilities for data access control. The video shows how to organize data assets, set permissions at different levels, and use a shared cluster that supports fine-grained access control. It also covers creating a feature store table, training and registering a model with MLflow, and setting up row- and column-level security so that data analysts can safely query only the subset of data they need.

Takeaways

  • Unity Catalog is a central governance tool that manages and audits data access across workspaces.
  • Data governance leaders and central governance teams have full admin capabilities to grant access to workspaces or metastores.
  • Access can be granted to users, groups, or service principals within an organization for specific data assets.
  • The demonstration uses a Databricks ML Workshop notebook to showcase access controls within Unity Catalog.
  • Catalogs are the first layer of Unity Catalog's three-level namespace, used for organizing data assets and setting permissions.
  • Schemas, the second layer, organize tables, views, volumes, and models, and permissions can be set on them for teams.
  • Volumes sit in the third layer, under a schema, and contain directories and files for data stored in any format, providing non-tabular data access.
  • The demo grants read and write privileges on a volume so the data science team can use a specific dataset.
  • The video shows setting up a cluster with Unity Catalog for data analysis and machine learning workflows.
  • The data science team creates a feature store table for training and testing models, using the catalog and schema set up earlier.
  • Unity Catalog provides row- and column-level security through row filters and column masks to protect sensitive data.
  • SQL functions are used to create row filters and column masks, tailoring data access to different teams' needs.

Q & A

  • What is the role of Pearl Uu in the video?

    -Pearl Uu is a technical marketing engineer at Databricks, and she presents a demonstration of how Unity Catalog governs and audits data access.

  • What capabilities are granted to the data governance leader and central governance team in Unity Catalog?

    -The data governance leader and central governance team are granted full admin capabilities, allowing them to grant access to workspaces or metastores to users, groups, or service principals within the organization.

  • What is a catalog in Unity Catalog and how is it used?

    -A catalog in Unity Catalog is the first layer of the three-level namespace. It is used to organize data assets and set permissions for teams to use within the catalog.

  • What is a schema in Unity Catalog and what is its purpose?

    -A schema in Unity Catalog is the second layer of the three-level namespace. It organizes tables, views, volumes, and models, and allows for permissioning to be set at the schema level.

  • What is a volume in Unity Catalog and how does it relate to data storage?

    -A volume in Unity Catalog is part of the third layer of the namespace. It resides under a schema and contains directories and files for data stored in any format, providing non-tabular access to data.

  • What is the significance of the 'NP volume' in the demonstration?

    -The 'NP volume' is significant as it contains a dataset called 'Lending Club' that the data science team will use. It demonstrates how to grant read and write privileges on a volume.

  • What is the purpose of creating a shared cluster in Unity Catalog?

    -A shared cluster is the option suited to most use cases: multiple users can share it, and it supports Unity Catalog's fine-grained access control. The video contrasts it with single-user clusters, which are meant for more advanced needs such as GPUs or distributed machine learning.

  • What is a feature store table and how is it utilized in the script?

    -A feature store table, such as 'loan features Test 2' in the script, is used to house features needed for model training and testing. It is created to manage and organize the features for machine learning models.

  • How does Unity Catalog facilitate the registration of models?

    -Unity Catalog allows the registration of models by referencing the same catalog and schema names used throughout the process. This ensures consistency and organization of models within the catalog.

  • What is the Catalog Explorer and how does it help in managing tables and models?

    -The Catalog Explorer is a tool within Unity Catalog that allows users to view and manage all the tables and models within a specific catalog and schema. It helps in organizing and providing access to data assets.

  • How does Unity Catalog provide row and column-level security?

    -Unity Catalog provides row and column-level security through the use of row filters and column masks. This allows for the control of access to specific rows of data and the masking of sensitive columns.

Outlines

00:00

Data Governance with Unity Catalog

In this segment, Pearl introduces herself as a technical marketing engineer at Databricks and outlines the purpose of the video: to demonstrate how Unity Catalog manages and audits data access across workspaces. As the data governance leader, one has full admin capabilities to grant access to workspaces or metastores. The demonstration uses a Databricks ML Workshop notebook to showcase access controls within Unity Catalog. The video begins with setting up a catalog called 'niore catalog,' which is the first layer of Unity Catalog used for organizing data assets and setting permissions. The data science team has been granted privileges to use this catalog and create various data structures within it, such as schemas, volumes, and datasets. The setup includes defining catalog and schema names for consistency and setting up a shared cluster for data analysis. The process involves loading the 'Lending Club' dataset into a Delta table named 'Loan Data' for further analysis and model training.
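The permissions described in this segment could be expressed as Unity Catalog `GRANT` statements. The sketch below only builds the SQL text. The catalog name `demo_catalog`, the volume name `demo_volume`, and the group name `data-science` are placeholders (the video's actual identifiers cannot be recovered from the transcript), and the privilege list is inferred from the narration rather than copied from the notebook.

```python
def grant(privileges, securable_type, securable_name, principal):
    """Build one Unity Catalog GRANT statement as a string."""
    privs = ", ".join(privileges)
    # The principal is wrapped in backticks because group names may contain hyphens.
    return f"GRANT {privs} ON {securable_type} {securable_name} TO `{principal}`"


# Placeholder names, not the ones used in the video.
CATALOG = "demo_catalog"
SCHEMA = "default"
VOLUME = "demo_volume"
GROUP = "data-science"

statements = [
    # Layer 1: allow the team to use the catalog.
    grant(["USE CATALOG"], "CATALOG", CATALOG, GROUP),
    # Layer 2: allow the team to use the schema and create objects in it.
    grant(
        ["USE SCHEMA", "CREATE FUNCTION", "CREATE MATERIALIZED VIEW",
         "CREATE MODEL", "CREATE TABLE", "CREATE VOLUME"],
        "SCHEMA", f"{CATALOG}.{SCHEMA}", GROUP,
    ),
    # Layer 3: read and write access to files in the volume.
    grant(["READ VOLUME", "WRITE VOLUME"], "VOLUME",
          f"{CATALOG}.{SCHEMA}.{VOLUME}", GROUP),
]

for stmt in statements:
    print(stmt)  # in a Databricks notebook, each could be run with spark.sql(stmt)
```

Keeping the statements as plain strings makes it easy to review them before running anything against a real metastore.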

05:00

Row and Column Level Security in Unity Catalog

This segment covers Unity Catalog's row- and column-level security, implemented through row filters and column masks. The data science team, which has full privileges, wants to give the data analyst team access to the 'loan data' table while restricting sensitive data. To do so, it creates SQL functions that mask the 'annual income' column and filter the rows to show only homeowners with mortgages. The functions are then applied to the 'loan data' table, so the data analyst team can query the data it needs without seeing sensitive values. The video concludes by verifying that the mask and filter work as intended, showing how Unity Catalog centrally governs and audits data access across workspaces.
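A rough sketch of what the mask and filter in this segment might look like in SQL follows. The exact functions from the video are not in the transcript, so everything here is assumed: the names `demo_catalog.demo_schema`, `mask_annual_income`, and `mortgage_only`, the column names `annual_inc` and `home_ownership`, the `DOUBLE` type, the `'MORTGAGE'` value, and the `data-science` group. The code only assembles the statements as strings.

```python
# All names below are placeholders; the video's actual identifiers are not in the transcript.
SCHEMA = "demo_catalog.demo_schema"
TABLE = f"{SCHEMA}.loan_data"
MASK_FN = f"{SCHEMA}.mask_annual_income"
FILTER_FN = f"{SCHEMA}.mortgage_only"
PRIVILEGED_GROUP = "data-science"

# Column mask: members of the privileged group see the value, everyone else sees NULL.
create_mask = (
    f"CREATE OR REPLACE FUNCTION {MASK_FN}(annual_inc DOUBLE) "
    f"RETURN CASE WHEN is_account_group_member('{PRIVILEGED_GROUP}') "
    "THEN annual_inc ELSE NULL END"
)

# Row filter: a boolean function; only rows for which it returns true are visible.
create_filter = (
    f"CREATE OR REPLACE FUNCTION {FILTER_FN}(home_ownership STRING) "
    "RETURN home_ownership = 'MORTGAGE'"
)

# Attach the mask to one column and the filter to the table.
apply_mask = f"ALTER TABLE {TABLE} ALTER COLUMN annual_inc SET MASK {MASK_FN}"
apply_filter = f"ALTER TABLE {TABLE} SET ROW FILTER {FILTER_FN} ON (home_ownership)"

statements = [create_mask, create_filter, apply_mask, apply_filter]
for stmt in statements:
    print(stmt)  # in a Databricks notebook, each could be run with spark.sql(stmt)
```

The functions have to exist before the `ALTER TABLE` statements run, which is why the list is ordered create first, apply second.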

Keywords

Unity Catalog

Unity Catalog is a central governance tool that manages and audits data access across workspaces. It is pivotal in the video's theme as it showcases how data governance leaders can control access to data assets. The script mentions Unity Catalog's ability to grant permissions at various levels, such as catalogs, schemas, and volumes, which is central to the demonstration provided.

Technical Marketing Engineer

A technical marketing engineer, like Pearl Uu in the video, is a professional who combines technical expertise with marketing strategies to promote and explain complex products or technologies. Pearl's role is to demonstrate Unity Catalog's capabilities, highlighting its importance in the video's narrative.

Data Governance

Data governance refers to the overall management of the availability, usability, integrity, and security of the data in an organization. In the video, the concept is central as it explains how Unity Catalog facilitates the governance of data access, ensuring that the right people have the right level of access to the data they need.

Catalog

In the context of Unity Catalog, a catalog is the first layer of the three-level namespace used to organize data assets and set permissions. The script introduces 'niore catalog' as an example, demonstrating how it is used to manage permissions for the data science team.

Schema

A schema, also referred to as a database in the script, is the second layer of Unity Catalog's namespace. It organizes tables, views, volumes, and models, and is where permissioning can be set. The 'default' schema is mentioned, showing how it provides a structured way to manage data within the catalog.
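One way to keep catalog and schema references consistent across a notebook is to define them once and derive every three-level name from them. This is a minimal sketch with placeholder names (`demo_catalog`, `demo_schema`, `loan_data`). The helper `qualified_name` is hypothetical and not part of any Databricks API.

```python
def qualified_name(catalog, schema, obj):
    """Return a three-level Unity Catalog name: catalog.schema.object."""
    for part in (catalog, schema, obj):
        # Reject empty parts and parts that would silently add a fourth level.
        if not part or "." in part:
            raise ValueError(f"invalid name part: {part!r}")
    return f"{catalog}.{schema}.{obj}"


# Placeholder names, defined once and reused everywhere.
CATALOG = "demo_catalog"
SCHEMA = "demo_schema"

# Statements that set the session defaults; in a notebook: spark.sql(stmt)
setup_statements = [f"USE CATALOG {CATALOG}", f"USE SCHEMA {SCHEMA}"]

loan_table = qualified_name(CATALOG, SCHEMA, "loan_data")
print(loan_table)
```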

Volume

A volume in Unity Catalog is part of the third layer of the namespace and contains directories and files for data stored in any format. It provides non-tabular access to data. The script uses 'NP volume' as an example, illustrating how it can be used to grant read and write privileges to the data science team.
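Files in a volume are addressed by a path of the form `/Volumes/<catalog>/<schema>/<volume>/<path>`. The sketch below builds such a path. The names and the file `lending_club.csv` are placeholders, since the transcript does not give the real volume name or the file format.

```python
from pathlib import PurePosixPath


def volume_path(catalog, schema, volume, *parts):
    """Build the /Volumes/<catalog>/<schema>/<volume>/... path for a file in a volume."""
    return str(PurePosixPath("/Volumes", catalog, schema, volume, *parts))


# Placeholder names; the file name and CSV format are assumptions.
csv_path = volume_path("demo_catalog", "default", "demo_volume", "lending_club.csv")
print(csv_path)

# In a Databricks notebook, the file could then be loaded and saved as a table, e.g.:
#   df = spark.read.csv(csv_path, header=True, inferSchema=True)
#   df.write.saveAsTable("demo_catalog.demo_schema.loan_data")
```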

Data Science Team

The data science team is a group within an organization that is responsible for analyzing and interpreting complex digital data to inform business decisions. In the script, the team is granted specific privileges within Unity Catalog, such as access to the 'niore catalog' and the 'default' schema, to perform their work.

Feature Store

A feature store is a system used to manage, share, and retrieve machine learning features for model training and inference. The script describes creating a feature store table called 'loan features Test 2' to house features needed for model training and testing, demonstrating its role in the data science workflow.

MLflow

MLflow is an open-source platform for managing the machine learning lifecycle, including experimentation, reproducibility, and deployment. The video script mentions using MLflow to train and test the model, showcasing its importance in the machine learning process within Unity Catalog.
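Registering a model in Unity Catalog with MLflow generally means pointing the registry at Unity Catalog and using a three-level model name. The sketch below assumes the `mlflow` package is available in the notebook environment. The names `demo_catalog` and `demo_schema` are placeholders, `loan_estimator` is an assumed identifier for the "loan estimator" model named in the video, and both helpers are hypothetical rather than the notebook's own code.

```python
def uc_model_name(catalog, schema, model_name):
    """Three-level name under which the model is registered."""
    return f"{catalog}.{schema}.{model_name}"


def register_in_unity_catalog(model_uri, catalog, schema, model_name):
    """Register an already-logged MLflow model in Unity Catalog."""
    import mlflow  # imported lazily so uc_model_name works without MLflow installed

    # Point the MLflow client at the Unity Catalog model registry.
    mlflow.set_registry_uri("databricks-uc")
    return mlflow.register_model(model_uri, uc_model_name(catalog, schema, model_name))


# Placeholder catalog and schema; the model name follows the video's "loan estimator".
full_name = uc_model_name("demo_catalog", "demo_schema", "loan_estimator")
print(full_name)

# After a training run, with run_id taken from that run:
#   register_in_unity_catalog(f"runs:/{run_id}/model",
#                             "demo_catalog", "demo_schema", "loan_estimator")
```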

Row and Column Level Security

Row and column level security are techniques used to restrict access to specific rows or columns in a database based on the user's role or other factors. The script explains how Unity Catalog provides this security through row filters and column masks, allowing the data analyst team to access only the necessary data without exposing sensitive information.

Data Analyst Team

The data analyst team is responsible for analyzing data to provide insights and reports for decision-making. In the script, the team's potential use of the 'loan data' table is discussed, highlighting the need for row and column level security to protect sensitive data while allowing access to relevant information.

Highlights

Unity Catalog centrally governs and audits data access across workspaces, providing full admin capabilities to the central governance team.

Catalogs are the first layer of Unity Catalog's three-level namespace, used for organizing data assets and permissioning.

Schemas, also known as databases, are the second layer of Unity Catalog, organizing tables, views, volumes, and models with permissioning capabilities.

Volumes, the third layer of Unity Catalog, contain directories and files for data stored in any format, providing non-tabular access to data.

Data access can be consistently referenced by defining catalog and schema names appropriately.

Two types of clusters are supported: shared clusters for fine-grained access control and single-user clusters for advanced use cases like GPU or distributed machine learning.

Data can be loaded from volumes and saved as Delta tables for future use, as demonstrated with the Lending Club dataset.

Feature store tables, like 'loan features test 2', are created to house features needed for model training and testing.

Training sets can be built from feature store features, and models can be trained and tested with MLflow and then registered in Unity Catalog.

Models and tables in Unity Catalog can be explored in the Catalog Explorer, showing their relationships and permissions.

Execute permissions can be set for models, allowing the data science team to use them when necessary.

Row and column-level security can be provided through row filters and column masks to protect sensitive data.

SQL functions can be created to mask specific columns, like annual income, and filter rows based on criteria, such as homeowners with mortgages.

Data analysts can utilize tables with applied row filters and column masks, ensuring access to only the necessary data without exposing sensitive information.

Unity Catalog enables centralized governance and auditing of data access, ensuring secure and controlled data usage across workspaces.

Transcripts

00:00

Hi, my name is Pearl Uu and I am a technical marketing engineer here at Databricks. In this video I'm going to share with you how Unity Catalog centrally governs and audits data access across workspaces. As the data governance leader of your organization, you and your central governance team have been granted full admin capabilities and can grant access to workspaces or metastores to users, groups, or service principals in your organization.

00:29

For the purpose of this demo, we will use the Databricks ML Workshop notebook to see the power of access controls within Unity Catalog, and to get started we'll need some data to work with. To do this, the data science lead has created a catalog called niore catalog. A catalog is the first layer of Unity Catalog's three-level namespace. It's used to organize your data assets, and permissioning has been set for the data science team to use the catalog.

01:02

Within the catalog there's a schema titled default that has already been created by the data science lead as well. A schema, also called a database, is the second layer of Unity Catalog's three-level namespace. A schema organizes tables, views, volumes, and models. Permissioning can also be set at the catalog level, and in this case it's been set for the data science team to have privileges to use the schema and to create functions, materialized views, models, tables, and volumes within the schema.

01:41

Within the schema there's a volume called NP volume, again created by the data science lead. A volume resides in the third layer of Unity Catalog's three-level namespace. Volumes are organized under a schema in Unity Catalog. Volumes contain directories and files for data stored in any format and provide non-tabular access to data. NP volume has a dataset called Lending Club that the data science team will use by granting read and write privileges on volume access.

02:19

Back in the notebook, before we do anything, we're going to make sure that we can consistently reference the catalog and schema name by defining them appropriately. For the workshop, we'll be referencing the catalog as ni catalog and another schema called user.

02:38

Before any member of the data science team can even run this code, we have to make sure that we have our cluster set up, so let's create a new resource. With Unity Catalog there are two different clusters that we support. One is the shared cluster, which is good for 90% of your use cases: users can share it, and we support all of the fine-grained access control. Then we have the single-user clusters. This is where, if you want to do something slightly more advanced, maybe you want to use a GPU or distributed machine learning, you can break out onto this cluster and safely isolate yourself. For the purpose of this workshop, we're going to use a shared cluster.

03:22

Then we're going to load our Lending Club dataset from our volume and then save it as a Delta table called Loan Data for future use. After doing some exploratory data analysis on this loan dataset, the data science team would like to reasonably estimate a specific loan status given the data provided in the loan data table.

03:46

To do this, a feature store table will be created called loan features test 2. This table will house the features needed for when the model is trained and tested. As you can see, this table is referencing the catalog and the schema we created earlier, harnessing the power of Unity Catalog. Now we can create a training set using our features from the feature store and then test and train the model using MLflow. Then we'll register the model in Unity Catalog by referencing the same catalog, schema name, and the name of the model, which in this case is loan estimator. The configuration of the MLflow client to access models in Unity Catalog is referenced here.

04:33

Now let's take a look at where all of the tables and models reside in the Catalog Explorer. Under the Nico catalog and under the user schema that was referenced throughout the workshop, we have our Loan Data, which is our full loan dataset. We also have loan features test 2, which is the feature store table to hold our loan features, and then below that, as you can see, we have our loan estimator model registered right here in our catalog as well.

05:08

If we click the model, we can set execute permissions for our data science team, giving them the ability to use the model when necessary. Similarly, our tables also provide permissioning, and we can showcase that by giving the data science team select privileges on the table.

05:28

Speaking of tables, let's see how Unity Catalog provides row- and column-level security through row filters and column masks, respectively. The data science team knows that eventually the data analyst team might want to utilize the loan data table to generate reports. However, there might be some columns that may need to be masked due to the sensitive data that they hold, or maybe rows would need to be filtered out so that the data analyst team can query a subset of the data that they need.

06:01

To do this, it's very simple. First we're going to see what our current dataset looks like. This is what the data science team will be able to see, since they have full privileges. They want to show the data analyst team only data where homeowners have a mortgage and where their income is hidden, so let's create the SQL functions. I'm going to leverage the power of Unity Catalog here by using the catalog and the schema where my dataset resides. Then I'll create a function that masks the annual income and only allows the data science team to see it. Next we'll set up a function that filters out the home ownership status to just homeowners with mortgages. Now we can verify that our functions have been set up correctly and then apply these newly created functions to the loan data table.

07:00

Let's take a look at our new loan data table, and here we can verify that our annual income column is masked to show the null value instead and the homeowners are all mortgages. This table can now be utilized by the data analyst team as needed without any concerns surrounding sensitive data. And with that, you've learned how Unity Catalog centrally governs and audits data access across workspaces.


Related Tags
Data Governance, Unity Catalog, Access Control, Technical Marketing, Data Science, ML Workshop, Data Security, Catalog Schema, Feature Store, Model Training, Data Analysis