Intro to Databricks Lakehouse Platform Architecture and Security

Databricks
23 Nov 2022 · 28:47

Summary

TL;DR: The video script delves into the architecture and security fundamentals of Databricks' Lakehouse platform, emphasizing the significance of data reliability and performance. It introduces Delta Lake, an open-source storage format that ensures ACID transactions and schema enforcement, and Photon, a next-gen query engine for cost-effective performance. The script also covers unified governance and security structures, including Unity Catalog for data governance and Delta Sharing for secure data exchange. Additionally, it highlights the benefits of serverless compute for on-demand, scalable data processing, and introduces key Lakehouse data management terminology.

Takeaways

  • 📈 **Data Reliability and Performance**: The importance of having reliable and clean data for accurate business insights and conclusions is emphasized, as bad data leads to bad outcomes.
  • 💧 **Data Lakes vs. Data Swamps**: Data lakes are great for storing raw data but often lack features for data reliability and quality, sometimes leading to them being referred to as data swamps.
  • 🚀 **Performance Issues with Data Lakes**: Data lakes rarely match data warehouse performance because data sits in immutable object-storage files, leading to ineffective partitioning and the small file problem, while missing ACID transaction support, schema enforcement, and data catalog integration undermines reliability.
  • 🔒 **Delta Lake**: Delta Lake is an open-source storage format that addresses the reliability and performance issues of data lakes by providing ACID transactions, scalable metadata handling, and schema enforcement.
  • 🛠️ **Photon Query Engine**: Photon is a next-generation query engine designed to provide the performance of a data warehouse with the scalability of a data lake, offering significant infrastructure cost savings.
  • 🌐 **Compatibility and Flexibility**: Delta Lake runs on top of existing data lakes and is compatible with Apache Spark and other processing engines, providing flexibility for data management infrastructure.
  • 🔑 **Unified Governance and Security**: The importance of a unified governance and security structure is highlighted, with features like Unity Catalog, Delta Sharing, and the separation of control and data planes.
  • 🔄 **Data Sharing with Delta Sharing**: Delta Sharing is an open-source solution for securely sharing live data across platforms, allowing for centralized administration and governance of data.
  • 🛡️ **Security Structure**: The Databricks Lakehouse platform offers a simple and unified approach to data security by splitting the architecture into control and data planes, ensuring data stays secure and compliant.
  • 🌟 **Serverless Compute**: Serverless compute is a fully managed service that simplifies the process of setting up and managing compute resources, reducing costs and increasing productivity.

Q & A

  • Why is data reliability and performance important in the context of a data platform architecture?

    -Data reliability and performance are crucial because they ensure that the data used for business insights and decision-making is accurate and clean. Poor data quality can lead to incorrect conclusions, and performance issues can slow down data processing and analysis.

  • What are some of the common issues with data lakes that can affect data reliability and performance?

    -Data lakes often lack features that support reliability and quality, such as ACID transaction support, schema enforcement, and integration with a data catalog, which is why they are sometimes called data swamps. On the performance side, storing data as immutable files leads to the small file problem and to ineffective partitioning (often worsened by high-cardinality partition columns), both of which degrade query performance.

  • What is Delta Lake and how does it address the issues associated with data lakes?

    -Delta Lake is a file-based open-source storage format that provides ACID transaction guarantees, scalable data and metadata handling, audit history, schema enforcement, and support for deletes, updates, and merges. It runs on top of existing data lakes and is compatible with Apache Spark and other processing engines.
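
As an illustration of how these guarantees show up in practice, here is a minimal PySpark sketch of a Delta table with an update, a delete, and a merge (the CDC-style upsert mentioned above). It assumes a Databricks or Delta-enabled Spark session; the path, table, and column names are invented for the example.

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()  # on Databricks, `spark` is predefined

# Write a Delta table; each commit is atomic, so readers never see partial files.
events = spark.createDataFrame([(1, "open"), (2, "click")], ["event_id", "action"])
events.write.format("delta").mode("overwrite").save("/tmp/demo/events")

tbl = DeltaTable.forPath(spark, "/tmp/demo/events")

# Deletes and updates are supported directly, unlike plain data-lake files.
tbl.update(condition="action = 'open'", set={"action": "'opened'"})
tbl.delete("event_id = 2")

# MERGE enables change data capture and streaming upsert patterns.
changes = spark.createDataFrame([(1, "closed"), (3, "open")], ["event_id", "action"])
(tbl.alias("t")
    .merge(changes.alias("c"), "t.event_id = c.event_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```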

  • How does Photon improve the performance of the Databricks Lakehouse platform?

    -Photon is a next-generation query engine that provides significant infrastructure cost savings and performance improvements. It is compatible with Spark APIs and offers increased speed for data ingestion, ETL, streaming, data science, and interactive queries directly on the data lake.
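
Photon requires no query changes; it is selected when the compute is created. The hedged sketch below creates a Photon-enabled cluster through the Databricks Clusters REST API. The workspace URL, token, runtime label, and node type are placeholders, and the exact field names (notably `runtime_engine`) should be checked against your workspace's API documentation.

```python
import requests

DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"  # placeholder
TOKEN = "<personal-access-token>"                                   # placeholder

cluster_spec = {
    "cluster_name": "photon-demo",
    "spark_version": "13.3.x-scala2.12",    # example runtime label
    "node_type_id": "i3.xlarge",            # cloud-specific instance type
    "num_workers": 2,
    "runtime_engine": "PHOTON",             # request the Photon engine
}

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
)
resp.raise_for_status()
print(resp.json())  # returns the new cluster_id

# Existing Spark DataFrame and SQL code then runs unchanged; Photon transparently
# accelerates the eligible portions of each query.
```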

  • What are the benefits of using Delta tables over traditional Apache Parquet tables?

    -Delta tables, based on Apache Parquet, offer additional features such as versioning, reliability, metadata management, and time travel capabilities. They provide a transaction log that ensures multi-user environments have a single source of truth and prevent data corruption.
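
To make the comparison concrete, the sketch below converts an existing Parquet directory in place and then uses the transaction log for history and time travel. It is illustrative only: the paths are made up, and `spark` is assumed to be an active Delta-enabled SparkSession (predefined on Databricks).

```python
# Convert an existing Parquet directory to a Delta table in place; the data files
# are kept and a transaction log is added alongside them.
spark.sql("CONVERT TO DELTA parquet.`/tmp/demo/legacy_parquet`")

# Every change is recorded in the transaction log and can be inspected...
spark.sql("DESCRIBE HISTORY delta.`/tmp/demo/events`").show(truncate=False)

# ...and the table can be read as of an earlier version (time travel), for example
# to reproduce an experiment or to roll back a bad write.
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/demo/events")
v0.show()
```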

  • What is Unity Catalog and how does it contribute to data governance in the Databricks Lakehouse platform?

    -Unity Catalog is a unified governance solution for all data assets. It provides fine-grained access control, SQL query auditing, attribute-based access control, data versioning, and data quality constraints. It offers a common governance model across clouds, simplifying permissions and reducing risk.
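
A hedged sketch of what that fine-grained, ANSI SQL-style control looks like is shown below; the catalog, schema, table, and group names are invented, and it assumes a Unity Catalog-enabled workspace.

```python
# Privileges are granted with ANSI SQL-style statements on the three-level hierarchy.
spark.sql("GRANT USE CATALOG ON CATALOG main TO `analysts`")
spark.sql("GRANT USE SCHEMA ON SCHEMA main.sales TO `analysts`")
spark.sql("GRANT SELECT ON TABLE main.sales.transactions TO `analysts`")

# Grants can be inspected, complementing the audit log of who accessed what.
spark.sql("SHOW GRANTS ON TABLE main.sales.transactions").show(truncate=False)
```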

  • What is Delta Sharing and how does it facilitate secure data sharing across platforms?

    -Delta Sharing is an open-source solution for sharing live data from the Lakehouse to any computing platform securely. It allows data providers to maintain governance and track usage, enabling centralized administration and secure collaboration without data duplication.
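
On the recipient side, shared tables can be read with the open-source `delta-sharing` Python client using only a small profile file supplied by the provider; no Databricks account is required. The profile path and share/schema/table names below are placeholders.

```python
import delta_sharing

profile = "/tmp/config.share"                              # placeholder profile file
table_url = profile + "#retail_share.sales.transactions"   # <share>.<schema>.<table>

# Discover what this recipient has been granted, then load one table.
client = delta_sharing.SharingClient(profile)
print(client.list_all_tables())

df = delta_sharing.load_as_pandas(table_url)   # reads live data, no copy maintained
print(df.head())
```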

  • How does the security structure of the Databricks Lakehouse platform ensure data protection?

    -The security structure splits the architecture into a control plane and a data plane. The control plane manages back-end services in Databricks' own cloud account, while the data plane processes data in the business's cloud account. Data is encrypted at rest and in transit, and access is tightly controlled and audited.

  • What is the serverless compute option in Databricks and how does it benefit users?

    -Serverless compute is a fully managed service where Databricks provisions and manages compute resources. It offers immediate environment startup, scaling up and down within seconds, and resource release after use, reducing total cost of ownership and admin overhead while increasing user productivity.
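
At the time of the video, serverless compute surfaces through Databricks SQL warehouses. A hedged sketch of creating one with the SQL Warehouses REST API follows; the host, token, and sizing values are placeholders, and field names such as `enable_serverless_compute` should be verified against your workspace's API documentation.

```python
import requests

DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"  # placeholder
TOKEN = "<personal-access-token>"                                   # placeholder

warehouse_spec = {
    "name": "serverless-demo",
    "cluster_size": "Small",
    "auto_stop_mins": 10,                # release resources shortly after going idle
    "enable_serverless_compute": True,   # let Databricks provision and manage compute
    "warehouse_type": "PRO",
}

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/sql/warehouses",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=warehouse_spec,
)
resp.raise_for_status()
print(resp.json()["id"])  # warehouse id; startup takes seconds rather than minutes
```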

  • What are the key components of Unity Catalog's data object hierarchy?

    -The key components of Unity Catalog's data object hierarchy include the metastore, catalog, schema, table, view, and user-defined function. The metastore is the logical container for metadata, the catalog is the topmost container for data objects, and schemas, tables, views, and functions are used to organize and manage data.
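
The hierarchy maps directly onto SQL DDL. Below is a minimal sketch with invented names, assuming a Unity Catalog-enabled workspace and an already-configured external location for the external table.

```python
# metastore (account level) > catalog > schema > table / view / function
spark.sql("CREATE CATALOG IF NOT EXISTS main")
spark.sql("CREATE SCHEMA IF NOT EXISTS main.sales")

# Managed table: data files live in the metastore's managed storage location.
spark.sql("""
    CREATE TABLE IF NOT EXISTS main.sales.transactions (
        txn_id BIGINT,
        amount DECIMAL(10, 2))
""")

# External table: metadata is managed by the metastore, data stays at an external
# storage path (placeholder; requires a storage credential / external location).
spark.sql("""
    CREATE TABLE IF NOT EXISTS main.sales.transactions_ext (
        txn_id BIGINT,
        amount DECIMAL(10, 2))
    LOCATION 's3://my-bucket/sales/transactions'
""")

# A read-only view and a SQL user-defined function complete the hierarchy.
spark.sql("""
    CREATE VIEW IF NOT EXISTS main.sales.large_txns AS
    SELECT * FROM main.sales.transactions WHERE amount > 1000
""")
spark.sql("""
    CREATE FUNCTION IF NOT EXISTS main.sales.to_cents(x DECIMAL(10, 2))
    RETURNS BIGINT RETURN CAST(x * 100 AS BIGINT)
""")
```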

  • What is the purpose of the three-level namespace introduced by Unity Catalog?

    -The three-level namespace introduced by Unity Catalog provides improved data segregation capabilities. It consists of the catalog, schema, and table/view/function levels, allowing for more granular and organized data management compared to the traditional two-level namespace.
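
The practical difference shows up in how tables are referenced; a short sketch with invented names:

```python
# Legacy two-level reference: schema.table, resolved against the workspace-local
# Hive metastore (surfaced in Unity Catalog workspaces as the `hive_metastore` catalog).
spark.sql("SELECT * FROM sales.transactions")

# Unity Catalog three-level reference: catalog.schema.table.
spark.sql("SELECT * FROM main.sales.transactions")

# Defaults can be set so shorter references continue to work.
spark.sql("USE CATALOG main")
spark.sql("USE SCHEMA sales")
spark.sql("SELECT * FROM transactions")
```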

Outlines

00:00

🛠 Data Reliability and Performance in Databricks Lakehouse Platform

The first paragraph introduces the importance of data reliability and performance within the Databricks Lakehouse platform. It discusses the challenges of data lakes, which often lack features for data reliability and can lead to data swamps. The paragraph also touches on performance issues such as the lack of ACID transactions, schema enforcement, and integration with data catalogs. Delta Lake is highlighted as a solution with its ACID transaction support, scalable data and metadata handling, audit history, schema enforcement, and support for complex use cases like change data capture and streaming upserts. Delta Lake is built on top of existing data lakes and is compatible with Apache Spark and other processing engines, using Delta tables based on Apache Parquet for structured, semi-structured, and unstructured data.

05:00

🚀 Photon: The Next-Gen Query Engine for Databricks Lakehouse

This paragraph delves into Photon, the next-generation query engine designed to address the challenges of the lakehouse paradigm. Photon is highlighted for its ability to provide data warehouse-like performance while maintaining the scalability of a data lake. It offers significant infrastructure cost savings, with customers reportedly seeing up to an 80% reduction in total cost of ownership compared to the traditional Databricks Runtime (Spark). Photon is compatible with Spark APIs and accelerates SQL and Spark queries without the need for user intervention. It has evolved to support a wide range of data and analytics workloads and is the first purpose-built lakehouse engine featured in the Databricks Lakehouse platform.

10:02

🔒 Unified Governance and Security in Databricks Lakehouse Platform

The third paragraph focuses on the importance of unified governance and security within the Databricks Lakehouse platform. It outlines the challenges of data and AI governance, such as the diversity of data assets, the use of disparate data platforms, and the complexities introduced by multi-cloud adoption. Databricks addresses these with Unity Catalog, a unified governance solution for all data assets, and Delta Sharing, an open solution for secure live data sharing to any computing platform. Unity Catalog provides centralized governance for data and AI, enabling better performance management and security across clouds, while Delta Sharing offers a simple REST protocol for secure data sharing.

15:04

🌐 Databricks Lakehouse Platform's Security Structure and Serverless Compute

This paragraph discusses the security structure of the Databricks Lakehouse platform, emphasizing the need for a simple and unified approach. The platform architecture is split into a control plane and a data plane, with the control plane managing back-end services and the data plane processing data. The paragraph also introduces serverless compute as a solution to the challenges of managing compute resources, offering immediate environment startup, scaling, and complete management by Databricks. Serverless compute reduces total cost of ownership, eliminates admin overhead, and increases user productivity. It provides a secure, elastic, and isolated compute resource that is released back to Databricks once the task is completed.

20:05

📚 Introduction to Lakehouse Data Management Terminology

The final paragraph provides an introduction to common lakehouse data management terminology, such as metastore, catalog, schema, table, view, and function, as used in the Databricks Lakehouse platform. It explains the role of Unity Catalog as the data governance solution and how it allows administrators to manage and control access to data. The paragraph also details the hierarchy of data objects in Unity Catalog, starting with the metastore as the top-level logical container, followed by catalogs, schemas, and finally tables, views, and functions. It discusses the differences between managed and external tables, the role of views and user-defined functions, and the use of storage credentials and Delta Sharing for secure data sharing across organizations.

Keywords

💡Data Reliability

Data reliability refers to the accuracy, consistency, and trustworthiness of data. In the context of the video, it is crucial because it directly impacts the quality of business insights and decisions made from that data. The script emphasizes that 'bad data in equals bad data out,' highlighting the importance of starting with reliable data to ensure meaningful outcomes. Data lakes, while useful for storing large volumes of data, are often criticized as 'data swamps' due to their lack of features supporting data reliability and quality.

💡Data Performance

Data performance is about how well a system handles, processes, and retrieves data. The video discusses the limitations of data lakes in terms of performance, such as the ineffective partitioning and the 'small file problem,' which leads to query performance degradation. Data performance is a key aspect of the Databricks Lakehouse platform, which aims to offer performance on par with data warehouses while maintaining the scalability of data lakes.

💡Delta Lake

Delta Lake is an open-source storage format introduced in the video as a foundational technology for the Databricks Lakehouse platform. It is designed to provide ACID transaction guarantees, scalable metadata handling, and schema enforcement, which are essential for ensuring data reliability. The script mentions that Delta Lake supports operations like deletes, updates, and merges, which are uncommon in distributed processing frameworks, and it uses Delta tables based on Apache Parquet, a common data structuring format.
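
Because a Delta table is just Parquet data files plus a transaction log, its storage layout can be inspected directly. The sketch below lists a made-up, locally written table path (continuing the earlier example); on Databricks, `dbutils.fs.ls` over cloud storage would be the equivalent.

```python
import os

table_path = "/tmp/demo/events"  # placeholder path from the earlier sketch

# Data files: ordinary Parquet, readable by any Parquet-aware engine.
print([f for f in os.listdir(table_path) if f.endswith(".parquet")])

# Transaction log: an ordered series of JSON commit files that layers ACID
# guarantees, versioning, and time travel on top of those Parquet files.
print(sorted(os.listdir(os.path.join(table_path, "_delta_log"))))
```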

💡Photon

Photon is described as the next-generation query engine in the Databricks Lakehouse platform. It addresses the challenges of the lakehouse paradigm by providing the performance of a data warehouse with the scalability of a data lake. The script highlights that Photon offers dramatic infrastructure cost savings and is compatible with Spark APIs, leading to increased speed for various use cases such as ETL, streaming, data science, and interactive queries.

💡Unity Catalog

Unity Catalog is Databricks' unified governance solution for all data assets. It provides a common governance model based on ANSI SQL to define and enforce fine-grained access control on data and AI assets across clouds. The video script explains that Unity Catalog offers features like centralized governance, data lineage, and integration with existing tools, which are vital for managing and securing data in a multi-cloud environment.

💡Data Governance

Data governance is the set of processes, policies, standards, and metrics that ensure the effective and efficient use of information in enabling an organization to achieve its goals. In the video, data governance is addressed as a critical challenge due to the diversity of data assets, the use of disparate data platforms, and the rise of multi-cloud adoption. Databricks offers solutions like Unity Catalog and Delta Sharing to tackle these governance challenges.

💡Data Security

Data security involves the protection of data from unauthorized access, use, disclosure, disruption, modification, inspection, recording, or destruction. It is a critical component of the Databricks Lakehouse platform, which splits its architecture into a control plane and a data plane to simplify permissions and reduce risk. The script mentions that Databricks supports various compliance standards and offers encryption, isolation, and auditing to ensure data security.

💡Serverless Compute

Serverless compute, as discussed in the video, is a fully managed service where Databricks provisions and manages compute resources for businesses. It is designed to address the challenges of managing clusters, such as complexity in setup, long startup times, and resource management. Serverless compute offers immediate environment startup, scaling up and down within seconds, and resource release after use, leading to decreased total cost of ownership and increased productivity.

💡Data Lake

A data lake is a system or repository that holds a vast amount of raw data in its native format until it is needed. The video script points out that while data lakes are great for storing large quantities of data, they often lack features for ensuring data reliability and quality, which can lead to them being referred to as 'data swamps.' Data lakes also face performance issues due to the immutable nature of the files stored in object storage.

💡Data Swamp

The term 'data swamp' is used in the script to describe the limitations of traditional data lakes. It implies that without proper features for data reliability and quality, data lakes can become overwhelming and unmanageable, much like a swamp. This concept is used to highlight the need for solutions like Delta Lake and the Databricks Lakehouse platform, which aim to provide structure and reliability to what would otherwise be a chaotic data environment.

💡Data Plane

The data plane in the context of the Databricks Lakehouse platform refers to the part of the architecture where data is processed. It is where compute resources, such as clusters, perform distributed data analysis. The script explains that Databricks separates the architecture into a control plane and a data plane to enhance security and simplify permissions. In the case of serverless compute, the data plane resources are managed by Databricks, reducing the administrative burden on the user.

Highlights

Data reliability and performance are crucial for building business insights and drawing actionable conclusions.

Data lakes often lack features for data reliability and quality, leading to the term 'data swamps'.

Data lakes do not offer the same level of performance as data warehouses.

Delta Lake provides ACID transaction support, schema enforcement, and support for deletes, updates, and merges.

Delta Lake uses a transaction log to maintain an audit history and enable time travel for data.

Photon is a next-generation query engine that offers significant infrastructure cost savings and improved speed.

Photon is compatible with Spark APIs and accelerates SQL and Spark queries without tuning or user intervention.

Unity Catalog offers a unified governance solution for all data assets with fine-grained access control and data quality constraints.

Delta Sharing is an open-source solution for securely sharing live data across platforms.

Databricks Lakehouse platform architecture splits into control and data planes for simplified permissions and reduced risk.

Serverless compute in Databricks offers instant compute resources, reducing total cost of ownership and admin overhead.

Databricks supports a range of compliance standards including SOC 2, ISO 27001, and GDPR.

Unity Catalog introduces a three-level namespace for improved data segregation in the Lakehouse platform.

Delta Lake tables are based on Apache Parquet and are compatible with semi-structured and unstructured data.

Data lineage in Unity Catalog provides an end-to-end view of data interactions and transformations.

Databricks Lakehouse platform offers encryption, isolation, and auditing for robust security.

Serverless SQL in Databricks provides on-demand, elastic compute resources for data processing.

Transcripts

play00:00

databricks Lakehouse platform

play00:02

architecture and security fundamentals

play00:04

data reliability and performance

play00:07

in this video you'll learn about the

play00:09

importance of data reliability and

play00:10

performance on platform architecture

play00:12

Define delta Lake and describe how

play00:15

Photon improves the performance of The

play00:17

databricks Lakehouse platform

play00:19

first we'll address why data reliability

play00:21

and performance is important

play00:24

it is common knowledge that bad data in

play00:26

equals bad data out so the data used to

play00:29

build business insights and draw

play00:31

actionable conclusions needs to be

play00:33

reliable and clean

play00:35

while data Lakes are a great solution

play00:37

for holding large quantities of raw data

play00:39

they lack important features for data

play00:41

reliability and quality often leading

play00:44

them to be called Data swamps also data

play00:46

Lakes don't often offer as good of

play00:48

performance as that of data warehouses

play00:53

some of the problems data Engineers May

play00:55

encounter when using a standard data

play00:57

Lake include a lack of acid transaction

play00:59

support making it impossible to mix

play01:02

updates, appends, and reads

play01:04

a lack of schema enforcement creating

play01:07

inconsistent and low quality data and a

play01:09

lack of integration with the data

play01:11

catalog resulting in dark data and no

play01:13

single source of Truth these can bring

play01:15

the reliability of the available data in

play01:18

a data Lake into question as for

play01:20

performance using object storage means

play01:22

data is mostly kept in immutable files

play01:24

leading to issues such as ineffective

play01:26

partitioning and having too many small

play01:28

files

play01:29

partitioning is sometimes used as a poor

play01:31

man's indexing practice by data

play01:33

Engineers leading to hundreds of Dev

play01:35

hours lost tuning file sizes to improve

play01:37

performance in the end partitioning

play01:40

tends to be ineffective if the wrong

play01:42

field was selected for partitioning or

play01:44

due to high cardinality columns and

play01:47

because data lakes lack transaction

play01:48

support appending new data takes the

play01:50

shape of Simply adding new files the

play01:53

small file problem however is a known

play01:55

root cause of query performance

play01:57

degradation

play01:59

the databricks lake house platform

play02:00

solves these issues with two

play02:02

foundational Technologies Delta Lake and

play02:04

photon

play02:06

Delta lake is a file-based open source

play02:09

storage format it provides guarantees

play02:11

for acid transactions meaning no partial

play02:13

or corrupted files

play02:15

scalable data and metadata handling

play02:17

leveraging spark to scale out all the

play02:19

metadata processing handling metadata

play02:22

for petabyte scale tables

play02:24

audit history and time travel by

play02:26

providing a transaction log with details

play02:28

about every change to data providing a

play02:31

full audit Trail including the ability

play02:33

to revert to earlier versions for

play02:34

rollbacks or to reproduce experiments

play02:38

schema enforcement and schema Evolution

play02:40

preventing the insertion of data with

play02:42

the wrong schema while also allowing

play02:44

table schema to be explicitly and safely

play02:47

changed to accommodate ever-changing

play02:49

data

play02:50

support for deletes updates and merges

play02:52

which is rare for a distributed

play02:54

processing framework to support this

play02:57

allows Delta Lake to accommodate complex

play02:59

use cases such as change data capture

play03:01

slowly changing Dimension operations and

play03:04

streaming upserts to name a few

play03:07

and lastly a unified streaming and batch

play03:09

data processing allowing data teams to

play03:12

work across a wide variety of data

play03:14

latencies

play03:15

from streaming data ingestion to batch

play03:18

history backfill to interactive queries

play03:20

they all work from the start

play03:23

Delta Lake runs on top of existing data

play03:26

lakes and is compatible with Apache

play03:28

spark and other processing engines

play03:30

Delta Lake uses Delta tables which are

play03:33

based on Apache parquet a common format

play03:35

for structuring data currently used by

play03:37

many organizations

play03:39

this similarity makes switching from

play03:41

existing parquet tables to Delta tables

play03:43

quick and easy Delta tables are also

play03:45

usable with semi-structured and

play03:47

unstructured data providing versioning

play03:49

reliability metadata management and time

play03:51

travel capabilities making these types

play03:53

of data more manageable the key to all

play03:56

these features and functions is the

play03:57

Delta Lake transaction log this ordered

play04:00

record of every transaction makes it

play04:02

possible to accomplish a multi-user work

play04:04

environment because every transaction is

play04:06

accounted for the transaction log acts

play04:10

as a single source of Truth so that the

play04:12

databricks Lakehouse platform always

play04:13

presents users with correct views of the

play04:15

data when a user reads a Delta lake

play04:18

table for the first time or runs a new

play04:20

query on an Open Table spark checks the

play04:23

transaction log for new transactions

play04:25

that have been posted to the table

play04:27

if a change exists spark updates the

play04:29

table this ensures users are working

play04:31

with the most up-to-date information and

play04:33

the user table is synchronized with the

play04:35

master record it also prevents the user

play04:37

from making divergent or conflicting

play04:39

changes to the table and finally Delta

play04:41

lake is an open source project meaning

play04:43

it provides flexibility to your data

play04:45

management infrastructure

play04:48

you aren't limited to storing data in a

play04:50

single cloud provider and you can truly

play04:52

engage in a multi-cloud system

play04:55

additionally databricks has a robust

play04:58

partner solution ecosystem allowing you

play05:00

to work with the right tools for your

play05:02

use case

play05:05

next let's explore Photon the

play05:07

architecture of the lake house Paradigm

play05:09

can pose challenges with the underlying

play05:12

query execution engine for accessing and

play05:14

processing structured and unstructured

play05:16

data

play05:17

to support the lake house Paradigm the

play05:19

execution engine has to provide the same

play05:22

performance as a data warehouse while

play05:24

still having the scalability of a data

play05:26

Lake and the solution in the databricks

play05:28

lakehouse platform architecture for

play05:30

these challenges is photon

play05:32

photon is the next Generation query

play05:34

engine it provides dramatic

play05:37

infrastructure cost savings where

play05:38

typical customers are seeing up to an 80%

play05:41

total cost of ownership savings over the

play05:44

traditional databricks runtime Spark

play05:47

photon is compatible with spark apis

play05:49

implementing a more General execution

play05:51

framework for efficient processing of

play05:53

data with support of the spark apis

play05:57

so with Photon you see increased speed

play05:58

for use cases such as data ingestion ETL

play06:01

streaming data science and interactive

play06:04

queries directly on your data Lake

play06:07

as databricks has evolved over the years

play06:09

query performance has steadily increased

play06:11

powered by spark and thousands of

play06:14

optimization packages as part of the

play06:15

databricks runtime Photon offers two

play06:18

times the speed per the TPC-DS one

play06:21

terabyte Benchmark compared to the

play06:23

latest dbr versions

play06:27

some customers have reported observing

play06:29

significant speed UPS using Photon on

play06:31

workloads such as SQL based jobs

play06:33

Internet of Things use cases data

play06:36

privacy and compliance and loading data

play06:39

into Delta and parquet

play06:43

photon is compatible with the Apache

play06:45

spark data frame and SQL apis to allow

play06:48

workloads to run without having to make

play06:50

any code changes

play06:52

Photon coordinates work on resources

play06:54

transparently accelerating portions of

play06:56

SQL and Spark queries without tuning or

play06:59

user intervention

play07:01

while Photon started out focusing on SQL

play07:04

use cases it has evolved in scope to

play07:06

accelerate all data and Analytics

play07:08

workloads

play07:09

photon is the first purpose-built lake

play07:11

house engine that can be found as a key

play07:14

feature for data performance and The

play07:16

databricks Lakehouse platform

play07:18

unified governance and security

play07:21

in this video you'll learn about the

play07:23

importance of having a unified

play07:24

governance and security structure the

play07:27

available security features Unity

play07:29

catalog and Delta sharing and the

play07:31

control and data planes of the

play07:33

databricks Lakehouse platform

play07:36

while it's important to make high

play07:38

quality data available to data teams the

play07:41

more individual access points added to a

play07:43

system such as users groups or external

play07:45

connectors

play07:47

higher the risk of data breaches along

play07:49

any of those lines and any breach

play07:52

has long lasting negative impacts on a

play07:54

business and their brand

play07:56

there are several challenges to data and

play07:58

AI governance

play08:00

such as the diversity of data and AI

play08:01

assets as data takes many forms Beyond

play08:04

files and tables to complex structures

play08:06

such as dashboards machine learning

play08:08

models videos or images

play08:11

the use of two disparate and

play08:13

incompatible data platforms where past

play08:15

needs have forced businesses to use data

play08:18

warehouses for bi and data Lakes for AI

play08:20

resulting in data duplication and

play08:22

unsynchronized governance models

play08:25

the rise of multi-cloud adoption where

play08:27

each cloud has a unique governance model

play08:29

that requires individual familiarity and

play08:32

fragmented tool usage for data

play08:34

governance on the lake house introducing

play08:36

complexity in multiple integration

play08:38

points in the system leading to poor

play08:40

performance

play08:42

to address these challenges databricks

play08:43

offers the following Solutions Unity

play08:45

catalog as a unified governance solution

play08:48

for all data assets Delta sharing as an

play08:51

open solution to securely share live

play08:53

data to any Computing platform and a

play08:56

divided architecture into two planes

play08:58

control and data to simplify permissions

play09:01

avoid data duplication and reduce risk

play09:05

we'll start by exploring Unity catalog

play09:07

Unity catalog is a unified governance

play09:09

solution for all data assets

play09:12

modern lake house Systems Support

play09:14

fine-grained row column and view level

play09:16

Access Control via SQL query auditing

play09:19

attribute-based Access Control Data

play09:21

versioning and data quality constraints

play09:23

and monitoring database admins should be

play09:26

familiar with the standard interfaces

play09:27

allowing existing Personnel to manage

play09:30

all the data in an organization in a

play09:32

uniform way

play09:34

in The databricks Lakehouse platform

play09:36

Unity catalog provides a common

play09:38

governance model based on ANSI SQL to

play09:42

Define and enforce fine-grained access

play09:44

control on all data and AI assets on any

play09:46

Cloud Unity catalog supplies one

play09:49

consistent model to discover access and

play09:51

share data enabling better native

play09:53

Performance Management and security

play09:55

across clouds

play09:58

because Unity catalog provides

play10:00

centralized governance for data and AI

play10:02

there is a single source of Truth for

play10:04

all user identities and data Assets in

play10:06

The databricks Lakehouse platform

play10:09

the common metadata layer for cross

play10:11

workspace metadata is at the account

play10:13

level it provides a single access point

play10:16

with a common interface for

play10:17

collaboration from any workspace in the

play10:19

platform removing data team silos Unity

play10:22

catalog allows you to restrict access to

play10:25

certain rows and columns to users or

play10:27

groups authorized to query them and with

play10:29

attribute-based Access Control you can

play10:31

further simplify governance at scale by

play10:33

controlling access to multiple data

play10:35

items at one time for example personally

play10:38

identifiable information in multiple

play10:40

given columns can be tagged as such and

play10:42

a single rule can restrict or provide

play10:44

access as needed Regulatory Compliance

play10:47

is putting pressure on businesses for

play10:49

full compliance and data access audits

play10:51

are critical to ensure these regulations

play10:53

are being met

play10:54

for this Unity catalog provides a highly

play10:57

detailed audit Trail logging who has

play10:59

performed what action against the data

play11:03

to break down data silos and democratize

play11:06

data across your organization for

play11:07

data-driven decisions Unity catalog has

play11:10

a user interface for data search and

play11:12

discovery allowing teams to quickly

play11:14

search for Relevant data assets for any

play11:17

use case

play11:18

also the low latency metadata serving

play11:21

and auto tuning of tables enables Unity

play11:23

catalog to provide 38 times faster

play11:25

metadata processing compared to hive

play11:28

metastore

play11:29

all the Transformations and refinements

play11:32

of data from source to insights is

play11:34

encompassed in data lineage all of the

play11:36

interactions with the data including

play11:38

where it came from what other data sets

play11:40

it might have been combined with who

play11:42

created it and when what Transformations

play11:44

were performed and other events and

play11:46

attributes are included in a data set's

play11:48

data lineage Unity catalog provides

play11:51

automated data lineage charts down to

play11:53

table and column level giving that

play11:55

end-to-end view of the data not limited

play11:58

to just one workload multiple data teams

play12:00

can quickly investigate errors in their

play12:02

data pipelines or end applications

play12:04

impact analysis can also be performed to

play12:07

identify dependencies of data changes on

play12:09

Downstream systems or teams and then

play12:11

notified of potential impacts to their

play12:13

work and with this power of data lineage

play12:16

there is an increased understanding of

play12:18

the data reducing tribal knowledge and

play12:21

to round it out Unity catalog integrates

play12:23

with existing tools to help you future

play12:25

proof your data and AI governance

play12:27

next we'll discuss data sharing with

play12:30

Delta sharing

play12:31

data sharing is an important aspect of

play12:34

the digital economy that has developed

play12:36

with the Advent of big data but data

play12:38

sharing is difficult to manage existing

play12:41

data sharing Technologies come with

play12:42

several limitations

play12:44

traditional data sharing Technologies do

play12:46

not scale well and often serve files

play12:49

offloaded to a server

play12:51

Cloud object stores operate on an object

play12:53

level and are Cloud specific and

play12:56

Commercial data sharing offerings and

play12:58

vendor products often share tables

play13:00

instead of files scaling is expensive

play13:02

and they aren't open therefore don't

play13:04

permit data sharing to a different

play13:06

platform

play13:08

to address these challenges and

play13:10

limitations databricks developed Delta

play13:12

sharing with contributions from the OSS

play13:14

community and donated it to the Linux

play13:16

Foundation it is an open source solution

play13:19

to share live data from your lakehouse

play13:21

to any Computing platform securely

play13:24

recipients don't have to be on the same

play13:27

cloud or even use the databricks lake

play13:30

house platform

play13:31

and the data isn't simply replicated or

play13:34

moved additionally data providers still

play13:37

maintain management and governance of

play13:39

the data with the ability to track and

play13:41

audit usage

play13:43

some key benefits of Delta sharing

play13:45

include that it is an open

play13:47

cross-platform sharing tool easily

play13:49

allowing you to share existing data in

play13:51

Delta Lake and Apache parquet formats

play13:53

without having to establish new

play13:55

ingestion processes to consume data

play13:57

since it provides native integration

play13:59

with power bi Tableau spark pandas and

play14:03

Java

play14:05

data is shared live without copying it

play14:07

with data being maintained on the

play14:10

provider's data Lake

play14:11

ensuring the data sets are reliable in

play14:13

real time and provide the most current

play14:15

information to the data recipient

play14:18

as mentioned earlier Delta sharing

play14:21

provides centralized Administration and

play14:23

governance to the data provider as the

play14:25

data is governed tracked and audited

play14:28

from a single location allowing usage to

play14:30

be monitored at the table partition and

play14:32

version level

play14:34

with Delta sharing you can build and

play14:36

package data products through a central

play14:38

Marketplace for distribution to anywhere

play14:42

and it is safe and secure with privacy

play14:44

safe data clean rooms meaning

play14:46

collaboration between data providers and

play14:48

recipients is hosted in a secure

play14:49

environment while safeguarding data

play14:52

privacy

play14:54

Unity catalog natively supports Delta

play14:57

sharing making these two tools smart

play14:59

choices in your data and AI governance

play15:01

and security structure

play15:03

Delta sharing is a simple rest protocol

play15:05

that securely shares access to part of a

play15:08

cloud data set leveraging modern cloud

play15:10

storage systems it can reliably transfer

play15:13

large data sets

play15:16

finally let's talk about the security

play15:17

structure of the data lake house

play15:19

platform a simple and unified approach

play15:22

to data security for the lake house is a

play15:24

critical requirement and the databricks

play15:26

lakehouse platform provides this by

play15:28

splitting the architecture into two

play15:29

separate planes the control plane and

play15:31

the data plane

play15:33

the control plane consists of the

play15:35

managed back-end services that

play15:36

databricks provides these live in

play15:38

databrick's own cloud account and are

play15:40

aligned with whatever cloud service the

play15:42

customer is using that is AWS Azure or

play15:45

gcp

play15:46

here databricks runs the workspace

play15:48

application and manages notebooks

play15:50

configuration and clusters

play15:52

the data plane is where your data is

play15:54

processed unless you choose to use

play15:56

serverless compute the compute resources

play15:58

in the data plane run inside the

play16:00

business owner's own cloud account

play16:03

all the data stays where it is

play16:06

while some data such as notebooks

play16:07

configurations logs and user information

play16:09

are available in the control plane the

play16:12

information is encrypted at rest

play16:14

and communication to and from the

play16:16

control plan is encrypted in transit

play16:18

security of the data plane within your

play16:20

chosen cloud service provider is very

play16:21

important so the databricks Lakehouse

play16:23

platform has several security key points

play16:25

for the networking of the environment if

play16:28

the business decides to host the data

play16:30

plane databricks will configure the

play16:31

networking by default the serverless

play16:34

data plane networking infrastructure is

play16:35

managed by databricks in a databricks

play16:37

cloud service provider account and

play16:39

shared among customers with additional

play16:41

Network boundaries between workspaces

play16:43

and clusters

play16:45

for servers in the data plane databricks

play16:48

clusters are run using the latest

play16:50

hardened system images

play16:51

older less secure images or code cannot

play16:54

be chosen databricks code itself is peer

play16:57

reviewed by security trained developers

play16:58

and extensively reviewed with security

play17:00

in mind

play17:01

databricks clusters are typically

play17:03

short-lived often terminated after a job

play17:05

and do not persist data after

play17:07

termination code is launched in an

play17:09

unprivileged container to maintain

play17:10

system stability this security design

play17:13

provides protection against persistent

play17:15

attackers and privilege escalation

play17:19

for databricks support cases databricks

play17:21

access to the environment is limited to

play17:23

cloud service provider apis for

play17:26

Automation and support access databricks

play17:28

has a custom-built system allowing our

play17:31

staff access to fix issues or handle

play17:32

support requests and it requires either

play17:35

a support ticket or an engineering

play17:37

ticket tied expressly to your workspace

play17:39

access is limited to a specific group of

play17:42

employees for limited periods of time

play17:44

and with security audit logs the initial

play17:47

access event and the support team

play17:48

members actions are tracked

play17:52

for user identity and access databrick

play17:54

supports many ways to enable users to

play17:57

access their data

play17:59

the table ACLS feature uses traditional

play18:02

SQL based statements to manage access to

play18:04

data and enable fine-grained view-based

play18:06

access IAM instance profiles enable AWS

play18:10

clusters to assume an IAM role so users

play18:13

of that cluster automatically access

play18:15

allowed resources without explicit

play18:17

credentials

play18:18

external storage can be mounted and

play18:20

accessed using a securely stored access

play18:22

key and the secrets API separates

play18:24

credentials from code when accessing

play18:27

external resources

play18:29

as mentioned previously databricks

play18:31

provides encryption isolation and

play18:32

auditing throughout the governance and

play18:34

security structure users can also be

play18:36

isolated at different levels such as the

play18:39

workspace level where each team or

play18:40

Department uses a different workspace

play18:42

the cluster level where cluster ACLS can

play18:45

restrict users who attach notebooks to a

play18:47

given cluster

play18:48

for high concurrency clusters process

play18:50

isolation JVM whitelisting and

play18:53

language limitations can be used for

play18:55

safe coexistence of users with different

play18:57

access levels and single user clusters

play19:00

if permitted allow users to create a

play19:03

private dedicated cluster

play19:05

and finally for compliance databrick

play19:07

supports these compliance standards on

play19:09

our multi-tenant platform

play19:11

SOC 2 type 2

play19:13

ISO 27001 ISO

play19:17

27017 and ISO 27018 certain clouds also

play19:22

support databricks development options

play19:24

for FedRAMP High

play19:26

HITRUST

play19:27

HIPAA

play19:28

and PCI and databricks in the databricks

play19:31

platform are also gdpr and CCPA ready

play19:35

instant compute and serverless

play19:39

in this video you'll learn about the

play19:41

available compute resources for The

play19:42

databricks Lakehouse platform

play19:44

what serverless compute is

play19:46

and the benefits of databricks

play19:48

serverless SQL

play19:50

The databricks Lakehouse platform

play19:52

architecture is split into the control

play19:53

plane and the data plane the data plane

play19:56

is where data is processed by clusters

play19:58

of compute resources this architecture

play20:00

is known as the classic data plane

play20:03

with the classic data plane compute

play20:04

resources are run in the business's

play20:06

cloud storage account and clusters

play20:09

perform distributed data analysis using

play20:11

queries in the databrick SQL workspace

play20:13

or notebooks in the data science and

play20:15

engineering or databricks machine

play20:16

learning environments

play20:18

however in using this structure

play20:21

businesses encountered challenges

play20:23

first creating clusters is a complicated

play20:25

task choosing the correct size instance

play20:28

type and configuration for the cluster

play20:30

can be overwhelming to the user

play20:31

provisioning the cluster

play20:33

next it takes several minutes for the

play20:35

environment to start after making the

play20:37

multitude of choices to configure and

play20:38

provision the cluster

play20:40

and finally because these clusters are

play20:42

hosted within the businesses cloud

play20:43

account there are many additional

play20:45

considerations to make about managing

play20:47

the capacity and pool of resources

play20:49

available and this leads to users

play20:51

exhibiting some costly behaviors such as

play20:54

leaving clusters running for longer than

play20:56

necessary to avoid the startup times and

play20:58

over provisioning their resources to

play21:00

ensure the cluster can handle spikes and

play21:02

data processing needs leading to users

play21:04

paying for unneeded resources and having

play21:07

large amounts of admin overhead ending

play21:09

up with unproductive users

play21:12

to solve these problems for the business

play21:14

databricks has released the serverless

play21:16

compute option or serverless data plane

play21:19

as of the release of this content

play21:21

serverless compute is only available for

play21:23

use with databrick SQL and is referred

play21:25

to at times as databrick serverless SQL

play21:29

serverless compute is a fully managed

play21:31

service that databricks provisions and

play21:32

manages the compute resources for a

play21:34

business in the databricks cloud account

play21:36

instead of the businesses the

play21:39

environment starts immediately scales up

play21:41

and down within seconds is completely

play21:43

managed by databricks

play21:45

you have clusters available on demand

play21:47

and when finished the resources are

play21:50

released back to databricks because of

play21:52

this the total cost of ownership

play21:53

decreases on average between 20 to 40

play21:56

percent admin overhead is eliminated and

play21:59

users see an increase in their

play22:00

productivity

play22:02

at the heart of the serverless compute

play22:04

is a fleet of databricks clusters that are

play22:06

always running unassigned to any

play22:08

customer waiting in a warm State ready

play22:11

to be assigned within seconds

play22:13

the pool of resources managed by

play22:15

databricks so the business doesn't need

play22:16

to worry about the offerings from the

play22:18

cloud service and databricks works with

play22:20

the cloud vendors to keep things patched

play22:22

and upgraded as needed

play22:24

when allocated to the business the

play22:27

serverless compute resource is elastic

play22:29

being able to scale up or down as needed

play22:32

and has three layers of isolation the

play22:34

container hosting the runtime the

play22:36

virtual machine hosting the container

play22:38

and the virtual Network for the

play22:39

workspace

play22:41

and each part is isolated with no

play22:43

sharing or cross-network traffic allowed

play22:45

ensuring your work is secure

play22:47

when finished the VM is terminated and

play22:50

not reused but entirely deleted and a

play22:53

new unallocated VM is released back into

play22:56

the pool of waiting resources

play22:58

introduction to Lake House data

play23:00

management terminology

play23:02

in this video you'll learn about the

play23:04

definitions for common lake house terms

play23:06

such as metastore catalog schema table

play23:09

View and function and how they are used

play23:12

to describe data management in the

play23:13

databricks lake house platform

play23:16

Delta Lake a key architectural component

play23:19

of the databricks lake house platform

play23:21

provides a data storage format built for

play23:23

the lake house and unity catalog the

play23:26

data governance solution for the

play23:27

databricks lakehouse platform allows

play23:30

administrators to manage and control

play23:31

access to data

play23:34

Unity catalog provides a common

play23:36

governance model to Define and enforce

play23:39

fine-grained access control on all data

play23:42

and AI assets on any Cloud Unity catalog

play23:45

supplies one consistent place for

play23:46

governing all workspaces to discover

play23:48

access and share data enabling better

play23:51

native Performance Management and

play23:53

security across clouds

play23:55

let's look at some of the key elements

play23:57

of unity catalog that are important to

play23:59

understanding how data management works

play24:01

in databricks

play24:03

the metastore is the top level logical

play24:05

container in unity catalog it's a

play24:07

construct that represents the metadata

play24:09

metadata is the information about the

play24:12

data objects being managed by the

play24:14

metastore and the ACLS governing those

play24:16

lists

play24:17

compared to the hive metastore which is

play24:20

a local metastore linked to each

play24:21

databricks workspace Unity catalog

play24:24

metastores offer improved security and

play24:26

auditing capabilities as well as other

play24:28

useful features

play24:30

the next thing in the data object

play24:32

hierarchy is the catalog a catalog is

play24:34

the topmost container for data objects

play24:37

in unity catalog

play24:38

a metastore can have as many catalogs as

play24:40

desired although only those with

play24:42

appropriate permissions can create them

play24:45

because catalogs constitute the topmost

play24:47

element in the addressable data

play24:49

hierarchy the catalog forms the first

play24:51

part of the three-level namespace that

play24:54

data analysts use to reference data

play24:56

objects in unity catalog

play24:58

this image illustrates how a three-level

play25:02

namespace compares to a traditional

play25:03

two-level namespace analysts familiar

play25:06

with the traditional databricks or SQL

play25:08

for that matter should recognize the

play25:10

traditional two-level namespace used to

play25:12

address tables Within schemas

play25:15

Unity catalog introduces a third level

play25:17

to provide improved data segregation

play25:19

capabilities complete SQL references in

play25:22

unity catalog use three levels

play25:26

a schema is part of traditional SQL and

play25:29

is unchanged by unity catalog it

play25:32

functions as a container for data assets

play25:34

like tables and Views and is the second

play25:36

part of the three level namespace

play25:38

referenced earlier

play25:39

catalogs can contain as many schemas as

play25:42

desired which in turn can contain as

play25:44

many data objects as desired

play25:47

at the bottom layer of the hierarchy are

play25:49

tables views and functions starting with

play25:51

tables these are SQL relations

play25:53

consisting of an ordered list of columns

play25:56

though databricks doesn't change the

play25:58

overall concept of a table tables do

play26:01

have two key variations it's important

play26:03

to recognize that tables are defined by two

play26:05

distinct elements first the metadata or

play26:09

the information about the table such as

play26:10

comments tags and the list of columns

play26:13

and Associated data types and then the

play26:15

data that populates the rows of the

play26:17

table the data originates from formatted

play26:20

data files stored in the businesses

play26:21

Cloud object storage

play26:25

there are two types of tables in this

play26:27

structure managed and external tables

play26:29

both tables have metadata managed by the

play26:32

metastore in the control plane the

play26:34

difference lies in where the table data

play26:37

is stored with a manage table data files

play26:40

are stored in the meta stores manage

play26:42

storage location whereas within an

play26:44

external table data files are stored in

play26:46

an external storage location

play26:49

from an access control point of view

play26:51

managing both types of tables is

play26:52

identical

play26:54

views are stored queries executed when

play26:57

you query The View views perform

play26:59

arbitrary SQL Transformations on tables

play27:02

and other views and are read only they

play27:05

do not have the ability to modify the

play27:07

underlying data

play27:08

the final element in the data object

play27:11

hierarchy are user-defined functions

play27:13

user-defined functions enable you to

play27:15

encapsulate custom functionality into a

play27:18

function that can be invoked within

play27:19

queries

play27:21

storage credentials are created by

play27:23

admins and are used to authenticate with

play27:25

cloud storage containers either external

play27:28

storage user supplied storage or the

play27:30

managed storage location for the

play27:31

metastore

play27:33

external locations are used to provide

play27:35

Access Control at the file level

play27:38

shares and recipients relate to Delta

play27:40

sharing an open protocol developed by

play27:43

databricks for secure low overhead data

play27:45

sharing across organizations it's

play27:48

intrinsically built into Unity catalog

play27:50

and is used to explicitly declare shares

play27:53

read-only logical collections of tables

play27:56

these can be shared with one or more

play27:58

recipients inside or outside the

play28:00

organization

play28:01

shares can be used for two main purposes

play28:04

to secure share data outside the

play28:07

organization in a performant way or to

play28:10

provide linkage between metastores in

play28:12

different parts of the world

play28:14

the metastore is best described as a

play28:17

logical construct for organizing your

play28:18

data and its Associated metadata rather

play28:21

than a physical container itself

play28:23

the metastore essentially functions as a

play28:25

reference for a collection of metadata

play28:27

and a link to the cloud storage

play28:28

container

play28:30

the metadata information about the data

play28:32

objects and the ACLS for those objects

play28:34

are stored in the control plane and data

play28:37

related to objects maintained by the

play28:39

metastore is stored in a cloud storage

play28:41

container

Related Tags
Data Reliability, Performance Optimization, Lakehouse Architecture, Delta Lake, Photon Engine, Data Management, Business Insights, Databricks Platform, Scalable Processing, Data Governance