Intro to Databricks Lakehouse Platform Architecture and Security

Databricks
23 Nov 2022 · 28:47

Summary

TL;DR: The video script explains the importance of data reliability and performance in platform architecture, highlighting Delta Lake and Photon as foundational technologies of the Databricks Lakehouse platform. Delta Lake, an open-source, file-based storage format, guarantees ACID transactions, scalable data and metadata handling, and schema evolution. Photon is the next-generation query engine that delivers infrastructure cost savings and improved query performance. The script also covers unified governance and security with Unity Catalog and Delta Sharing, and introduces the concept of serverless compute and its benefits in the Databricks Lakehouse.

Takeaways

  • 📈 The script highlights the importance of data reliability and performance in the platform architecture.
  • 💧 Data lakes are often called data swamps because they lack features for data reliability and quality.
  • 🔄 Delta Lake is an open-source, file-based storage format that provides ACID transaction guarantees, scalable data and metadata handling, and schema evolution.
  • 🛠️ Photon is the next-generation query engine that delivers significant infrastructure cost savings and improves query performance on the data lake.
  • 🔒 The Databricks Lakehouse platform provides a unified governance and security structure, which is crucial for protecting a company's data and brand.
  • 🌐 Unity Catalog is a unified governance solution for all data assets, providing fine-grained access control and SQL query auditing.
  • 🔗 Delta Sharing is an open-source tool for securely and efficiently sharing live data across platforms.
  • 🛡️ The platform architecture is split into two planes, the control plane and the data plane, which simplifies permissions and reduces risk.
  • 🚀 Databricks serverless compute offers a fully managed option that reduces costs and increases user productivity.
  • 🔑 Databricks offers several ways to enable user access to data, including table ACLs, AWS instance profiles, and the Secrets API.

Q & A

  • What problems can data engineers face when using a standard data lake?

    -Data engineers may run into a lack of ACID transaction support, which makes it impossible to mix appends, updates, and reads; a lack of schema enforcement, which produces inconsistent, low-quality data; and a lack of integration with a data catalog, which leads to dark data and no single source of truth.

  • How does Delta Lake improve reliability and performance in the Databricks Lakehouse platform?

    -Delta Lake improves reliability and performance by providing ACID transaction guarantees, scalable data and metadata handling, audit history and time travel, schema enforcement and schema evolution, and support for deletes, updates, and merges (see the first sketch after this Q&A section).

  • What is Photon and how does it solve performance challenges in the Databricks Lakehouse platform?

    -Photon is the next-generation query engine. It delivers dramatic infrastructure cost savings, is compatible with the Spark APIs, and implements a more general execution framework for efficient data processing. It speeds up use cases such as data ingestion, ETL, streaming, interactive data science, and interactive queries directly on the data lake.

  • What benefits does Delta Lake's compatibility with Apache Spark and other processing engines provide?

    -Compatibility with Apache Spark and other processing engines lets data teams work across a wide variety of data latencies, from streaming data ingestion to batch historical backfill and interactive queries, all working from the start.

  • How does Unity Catalog address governance and security challenges in the Databricks Lakehouse platform?

    -Unity Catalog offers a unified governance solution for all data assets, with fine-grained access control at the row, column, and view level, SQL query auditing, attribute-based access control, data versioning, data quality constraints, and monitoring (see the Unity Catalog sketch after this Q&A section).

  • What is Delta Sharing and how does it help share data securely and efficiently?

    -Delta Sharing is an open, cross-platform tool for securely sharing live data. It lets you share data in Delta Lake and Apache Parquet formats without setting up new ingestion processes, and it keeps data management and governance with the data provider, with the ability to track and audit usage (see the Delta Sharing sketch after this Q&A section).

  • How is the Databricks Lakehouse platform architecture split to improve security?

    -The architecture is split into two separate planes: the control plane and the data plane. The control plane consists of the managed back-end services that Databricks provides, and the data plane is where the data is processed, ensuring that the data stays in the business owner's own cloud account.

  • What advantages does serverless compute offer on the Databricks Lakehouse platform?

    -Serverless compute is a fully managed service in which Databricks provisions and manages the compute resources for a business in the Databricks cloud account. It lowers the total cost of ownership, eliminates admin overhead, and increases user productivity, with environments that start immediately and scale within seconds.

  • What are some key Unity Catalog elements that are important for understanding how data management works in Databricks?

    -Unity Catalog's key elements include the metastore, the top-level logical container in Unity Catalog that represents the metadata; the catalog, the topmost container for data objects; and the schema, a container for data assets such as tables and views and the second level of the three-level namespace (the Unity Catalog sketch after this Q&A section shows the resulting namespace).

  • What are views in the context of Unity Catalog and how do they relate to SQL queries?

    -Views are stored queries that execute when the view is queried. They perform arbitrary SQL transformations on tables and other views and are read-only, meaning they cannot modify the underlying data (a view also appears in the Unity Catalog sketch after this Q&A section).
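
Below is a minimal PySpark sketch, not taken from the video, of two of the Delta Lake capabilities described above: ACID-safe upserts (MERGE) and time travel. The table path, column names, and source data are illustrative assumptions.

```python
# Hypothetical illustration of Delta Lake upserts and time travel.
# Paths and column names are placeholders, not from the video.
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

# ACID upsert: merge a batch of updates into an existing Delta table atomically.
target = DeltaTable.forPath(spark, "/mnt/lake/customers")
updates = spark.read.format("json").load("/mnt/raw/customer_updates")

(target.alias("t")
    .merge(updates.alias("u"), "t.customer_id = u.customer_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())

# Time travel: read the same table as it existed at an earlier version,
# using the transaction log's version history.
previous = (spark.read.format("delta")
    .option("versionAsOf", 3)
    .load("/mnt/lake/customers"))
```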
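
The next sketch illustrates the Unity Catalog concepts from the answers above: the three-level namespace (catalog.schema.table), a read-only view, and fine-grained grants. It assumes a Unity Catalog-enabled Databricks notebook where `spark` is predefined; every object and group name is invented for illustration.

```python
# Hypothetical three-level namespace objects; all names are placeholders.
spark.sql("CREATE CATALOG IF NOT EXISTS main")
spark.sql("CREATE SCHEMA IF NOT EXISTS main.sales")
spark.sql("""
    CREATE TABLE IF NOT EXISTS main.sales.orders (
        order_id BIGINT, order_date DATE, amount DOUBLE)
""")

# A view is a stored query; it runs when queried and is read-only.
spark.sql("""
    CREATE VIEW IF NOT EXISTS main.sales.daily_revenue AS
    SELECT order_date, SUM(amount) AS revenue
    FROM main.sales.orders
    GROUP BY order_date
""")

# Fine-grained access: grant a group the privileges needed to query the table.
spark.sql("GRANT USE CATALOG ON CATALOG main TO `data_analysts`")
spark.sql("GRANT USE SCHEMA ON SCHEMA main.sales TO `data_analysts`")
spark.sql("GRANT SELECT ON TABLE main.sales.orders TO `data_analysts`")
```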
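
Finally, a hedged sketch of consuming shared data with the open-source delta-sharing Python client mentioned in the Delta Sharing answer; the profile file and the share, schema, and table names are placeholders a provider would supply.

```python
# Hypothetical Delta Sharing consumer; the .share profile file and the
# share/schema/table names come from the data provider and are placeholders.
import delta_sharing

profile = "/path/to/provider.share"                 # credentials file from the provider
table_url = profile + "#retail_share.sales.orders"  # <profile>#<share>.<schema>.<table>

# Load the live, shared table into a pandas DataFrame; the data stays
# governed on the provider's side and is not pre-copied by the recipient.
df = delta_sharing.load_as_pandas(table_url)
print(df.head())
```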

Outlines

00:00

🛠️ Databricks Lakehouse Architecture and Security Fundamentals

This paragraph introduces the fundamental concepts of the Databricks Lakehouse platform, focusing on the importance of data reliability and performance. It notes that data lakes are often called "data swamps" because they lack features that guarantee data quality and reliability. It also discusses common problems such as the lack of ACID transaction support, the lack of schema enforcement, and the lack of integration with a data catalog. The Databricks platform uses Delta Lake and Photon to solve these challenges, providing ACID transactions, scalable data and metadata handling, audit history, and compatibility with Apache Spark, among other benefits.

05:00

🌟 Introducing Photon, the Databricks Query Engine

This segment explores Photon, Databricks' next-generation query engine, which delivers significant infrastructure cost savings and improves query performance. Photon is compatible with the Spark APIs and provides a more general execution framework for processing data efficiently. The engine significantly speeds up tasks such as data ingestion, ETL, streaming, data science, and interactive queries directly on the data lake. Photon has also evolved to accelerate all kinds of data and analytics workloads, serving as a native performance solution for the Databricks Lakehouse platform.

10:02

🔒 The Importance of Unified Governance and Security in Databricks

This segment highlights the importance of having a unified governance and security structure in the Lakehouse environment. It discusses the challenges of data and AI governance, such as the diversity of data assets, the use of incompatible data platforms, and multi-cloud adoption. Databricks offers solutions such as Unity Catalog, a unified governance solution for all data assets, and Delta Sharing, an open solution for securely sharing live data. It also covers the architecture split into two planes, the control plane and the data plane, to simplify permissions and reduce risk.

15:04

🔑 Delta Sharing and Security on the Databricks Lakehouse Platform

This part covers Delta Sharing, an open-source tool developed by Databricks for sharing live data securely and efficiently. It lets data providers keep management and governance of the data, with the ability to track and audit usage. Delta Sharing integrates with Power BI, Tableau, Spark, pandas, and Java, and supports building and packaging data products through a central marketplace. The segment also discusses the Lakehouse platform's security structure, split into the control plane and the data plane, with a focus on network security, hardened system images, and separating credentials from code when accessing external resources.

20:05

🚀 Serverless Compute and Serverless SQL in Databricks

This paragraph presents Databricks' serverless compute option, a solution that simplifies cluster management and reduces cost. Serverless compute is a fully managed service that gives users on-demand compute resources, cutting startup times and avoiding over-provisioning. It also eliminates admin overhead and increases user productivity. Serverless compute scales resources within seconds and uses three layers of isolation to keep user workloads secure.

25:06

📚 Lakehouse Data Management Terminology in Databricks

This segment introduces common terms used in Lakehouse data management on Databricks, such as metastore, catalog, schema, table, view, and function. It describes how Unity Catalog acts as the data governance solution for Databricks, allowing administrators to manage and control access to data. It explains the data object hierarchy in Unity Catalog, which includes metastores, catalogs, schemas, tables, views, and functions, and notes the differences between managed and external tables as well as the types of tables and views (see the sketch below).
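
As a small illustration of the managed versus external table distinction mentioned in this segment, the sketch below creates one of each in a Databricks notebook (where `spark` is predefined); the catalog, schema, and storage path are invented for illustration.

```python
# Managed table: Unity Catalog stores the data files in the metastore's
# managed storage location.
spark.sql("""
    CREATE TABLE IF NOT EXISTS main.sales.orders_managed (
        order_id BIGINT, amount DOUBLE)
""")

# External table: the metastore manages only the metadata, while the data
# files live in a storage location the business controls (placeholder path).
spark.sql("""
    CREATE TABLE IF NOT EXISTS main.sales.orders_external (
        order_id BIGINT, amount DOUBLE)
    LOCATION 'abfss://data@examplestorage.dfs.core.windows.net/sales/orders'
""")
```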

Keywords

💡Data reliability

Data reliability is fundamental to any data platform, since it guarantees that the information used to make business decisions is trustworthy and clean. The video emphasizes that "bad data in equals bad data out," which shows how important accurate data is for drawing useful, actionable conclusions. Data reliability is directly tied to platform architecture and performance, both key aspects of the Databricks Lakehouse platform.

💡Data lakes

Data lakes are widely used for storing large volumes of raw data. However, they often lack important features for guaranteeing data reliability and quality, which sometimes turns them into what are called "data swamps." The script notes that data lakes do not offer performance as good as data warehouses and can suffer from problems such as the lack of ACID transaction support and the inability to integrate with a data catalog.

💡Delta Lake

Delta Lake is an open-source, file-based storage format that provides ACID transaction guarantees, meaning no partial or corrupted files. It is a foundational technology of the Databricks Lakehouse platform, enabling scalable data and metadata handling as well as the ability to roll back changes or reproduce experiments. The video describes how Delta Lake improves the reliability and performance of data lakes by providing features they often lack.

💡Photon

Photon is the next-generation query engine in the Databricks Lakehouse platform. The script presents it as the architectural answer to the performance challenges of the Lakehouse paradigm, delivering data warehouse-like performance. Photon is compatible with the Spark APIs and offers a more general execution framework for processing data efficiently, resulting in significant speedups across a wide range of workloads.

💡Unity catalog

Unity Catalog is a unified governance solution for all data assets in the Databricks Lakehouse platform. It provides a common governance model based on ANSI SQL to define and enforce fine-grained access control on all data and AI assets on any cloud. The video highlights how Unity Catalog helps simplify permissions, avoid data duplication, and reduce risk, while offering features such as row-, column-, and view-level access control and automatically generated data lineage.

💡Data governance

Data governance is a central topic in the video, since it addresses the challenges of managing and controlling a diverse set of data assets and incompatible data platforms. Governance is crucial for keeping data both accessible and secure, and for meeting data compliance regulations. Databricks offers solutions such as Unity Catalog and Delta Sharing to address these challenges, enabling unified, secure governance of data.

💡Schema enforcement

Schema enforcement is a key Delta Lake feature mentioned in the script. It prevents data with the wrong schema from being inserted, while still allowing the table schema to be explicitly and safely changed to accommodate ever-changing data. This is important for maintaining data quality and consistency in the data lake (a short sketch follows).
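
A minimal PySpark sketch of this behaviour, assuming an existing Delta table at a placeholder path in a Databricks notebook where `spark` is predefined:

```python
# Schema enforcement: appending a DataFrame whose schema does not match the
# target Delta table is rejected rather than silently corrupting the table.
bad_rows = spark.createDataFrame(
    [(1, "2024-01-01", "surprise")],
    ["id", "event_date", "unexpected_col"])
try:
    bad_rows.write.format("delta").mode("append").save("/mnt/lake/events")
except Exception as err:
    print("Rejected by schema enforcement:", err)

# Schema evolution: explicitly allow the table schema to change to
# accommodate the new column.
(bad_rows.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .save("/mnt/lake/events"))
```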

💡Data sharing

Data sharing is an integral part of the digital economy, but it can be hard to manage. The script discusses the limitations of traditional data sharing technologies and how Delta Sharing, developed by Databricks, addresses those challenges by providing an open, secure, cross-platform approach to sharing live data.

💡Security structure

The Lakehouse platform's security structure is essential for protecting an organization's data and information. The script describes how Databricks splits the architecture into two separate planes: the control plane and the data plane. The control plane handles the managed back-end services Databricks provides, while the data plane handles data processing. Security in both planes is crucial to prevent data leaks and maintain data integrity.

💡Serverless compute

Serverless compute is an option Databricks offers to simplify compute resource management. The script presents it as a solution that eliminates admin overhead and reduces cost by providing on-demand compute resources that scale quickly and are released when no longer needed. This approach improves user productivity and allows more efficient use of resources.

Highlights

Data reliability and performance are crucial for building accurate business insights and drawing actionable conclusions from data.

Data lakes often lack features for data reliability and quality, leading to them being called data swamps.

Data lakes typically do not offer the same level of performance as data warehouses.

Using object storage in data lakes can lead to issues like ineffective partitioning and the small file problem that degrades query performance.

Delta Lake is a file-based open source storage format that provides ACID transaction guarantees, scalable metadata handling, audit history, schema enforcement, and support for deletes, updates and merges.

Delta Lake uses Delta tables based on Apache Parquet, making it easy to switch from existing Parquet tables and providing versioning, reliability and time travel capabilities.

Delta Lake runs on top of existing data lakes and is compatible with Apache Spark and other processing engines.

Photon is a next-generation query engine that provides significant infrastructure cost savings and up to 2x the speed per TPC-DS 1TB benchmark compared to Databricks Runtime Spark.

Photon is compatible with Spark APIs and accelerates SQL and Spark queries transparently without needing user intervention.

Unity Catalog provides a unified governance solution for all data assets with fine-grained access control, SQL query auditing, attribute-based access control, data versioning, and data quality constraints.

Unity Catalog offers centralized governance with a single source of truth for all user identities and data assets in the Databricks Lakehouse platform.

Delta Sharing is an open-source solution for securely sharing live data from your Databricks Lakehouse to any computing platform.

Delta Sharing provides centralized administration and governance for data providers, allowing them to track and audit data usage.

The Databricks Lakehouse platform splits the architecture into a control plane and a data plane to simplify permissions, avoid data duplication, and reduce risk.

Databricks provides a robust partner solution ecosystem allowing you to work with the right tools for your use case.

Databricks supports many ways to enable users to access their data, including table ACLs, instance profiles, external storage, and the Secrets API (see the sketch after this list).

Databricks offers serverless compute as a fully managed service that provisions and manages compute resources, eliminating admin overhead and increasing productivity.

Unity Catalog introduces a three-level namespace (catalog, schema, table) to provide improved data segregation capabilities compared to the traditional two-level namespace.

Databricks supports compliance standards like SOC 2 Type 2, ISO 27001, ISO 27017, ISO 27018, and is GDPR and CCPA ready.
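
As a hedged illustration of the Secrets API highlight above (the secret scope, key, and storage account names are placeholders), inside a Databricks notebook, where `dbutils` and `spark` are predefined, credentials can be kept out of code like this:

```python
# Fetch a storage access key from a secret scope instead of hard-coding it,
# then use it to read from external storage. All names are placeholders.
storage_key = dbutils.secrets.get(scope="storage-creds", key="adls-access-key")

spark.conf.set(
    "fs.azure.account.key.examplestorage.dfs.core.windows.net",
    storage_key)

df = spark.read.format("delta").load(
    "abfss://data@examplestorage.dfs.core.windows.net/sales/orders")
```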

Transcripts

Databricks Lakehouse platform architecture and security fundamentals: data reliability and performance.

In this video you'll learn about the importance of data reliability and performance on platform architecture, define Delta Lake, and describe how Photon improves the performance of the Databricks Lakehouse platform.

First we'll address why data reliability and performance are important. It is common knowledge that bad data in equals bad data out, so the data used to build business insights and draw actionable conclusions needs to be reliable and clean. While data lakes are a great solution for holding large quantities of raw data, they lack important features for data reliability and quality, often leading them to be called data swamps. Also, data lakes often don't offer as good performance as data warehouses.

Some of the problems data engineers may encounter when using a standard data lake include a lack of ACID transaction support, making it impossible to mix appends, updates, and reads; a lack of schema enforcement, creating inconsistent and low-quality data; and a lack of integration with the data catalog, resulting in dark data and no single source of truth. These can bring the reliability of the available data in a data lake into question. As for performance, using object storage means data is mostly kept in immutable files, leading to issues such as ineffective partitioning and having too many small files. Partitioning is sometimes used as a poor man's indexing practice by data engineers, leading to hundreds of dev hours lost tuning file sizes to improve performance. In the end, partitioning tends to be ineffective if the wrong field was selected for partitioning or due to high-cardinality columns. And because data lakes lack transaction support, appending new data takes the shape of simply adding new files; the small file problem, however, is a known root cause of query performance degradation.

The Databricks Lakehouse platform solves these issues with two foundational technologies: Delta Lake and Photon.

Delta Lake is a file-based, open source storage format. It provides guarantees for ACID transactions, meaning no partial or corrupted files; scalable data and metadata handling, leveraging Spark to scale out all the metadata processing and handle metadata for petabyte-scale tables; audit history and time travel, by providing a transaction log with details about every change to data, giving a full audit trail including the ability to revert to earlier versions for rollbacks or to reproduce experiments; schema enforcement and schema evolution, preventing the insertion of data with the wrong schema while also allowing table schema to be explicitly and safely changed to accommodate ever-changing data; support for deletes, updates, and merges, which is rare for a distributed processing framework to support, allowing Delta Lake to accommodate complex use cases such as change data capture, slowly changing dimension operations, and streaming upserts, to name a few; and lastly, unified streaming and batch data processing, allowing data teams to work across a wide variety of data latencies, from streaming data ingestion to batch history backfill to interactive queries, and they all work from the start.

Delta Lake runs on top of existing data lakes and is compatible with Apache Spark and other processing engines. Delta Lake uses Delta tables, which are based on Apache Parquet, a common format for structuring data currently used by many organizations. This similarity makes switching from existing Parquet tables to Delta tables quick and easy. Delta tables are also usable with semi-structured and unstructured data, providing versioning, reliability, metadata management, and time travel capabilities, making these types of data more manageable.

The key to all these features and functions is the Delta Lake transaction log. This ordered record of every transaction makes it possible to accomplish a multi-user work environment, because every transaction is accounted for. The transaction log acts as a single source of truth so that the Databricks Lakehouse platform always presents users with correct views of the data. When a user reads a Delta Lake table for the first time or runs a new query on an open table, Spark checks the transaction log for new transactions that have been posted to the table. If a change exists, Spark updates the table. This ensures users are working with the most up-to-date information and the user's table is synchronized with the master record. It also prevents the user from making divergent or conflicting changes to the table.

And finally, Delta Lake is an open source project, meaning it provides flexibility to your data management infrastructure. You aren't limited to storing data in a single cloud provider, and you can truly engage in a multi-cloud system. Additionally, Databricks has a robust partner solution ecosystem, allowing you to work with the right tools for your use case.

Next, let's explore Photon. The architecture of the lakehouse paradigm can pose challenges with the underlying query execution engine for accessing and processing structured and unstructured data. To support the lakehouse paradigm, the execution engine has to provide the same performance as a data warehouse while still having the scalability of a data lake, and the solution in the Databricks Lakehouse platform architecture for these challenges is Photon.

Photon is the next-generation query engine. It provides dramatic infrastructure cost savings, where typical customers are seeing up to an 80% total cost of ownership savings over the traditional Databricks Runtime Spark. Photon is compatible with Spark APIs, implementing a more general execution framework for efficient processing of data with support for the Spark APIs. So with Photon you see increased speed for use cases such as data ingestion, ETL, streaming, data science, and interactive queries directly on your data lake.

As Databricks has evolved over the years, query performance has steadily increased, powered by Spark and thousands of optimization packages as part of the Databricks Runtime. Photon offers two times the speed per the TPC-DS 1 TB benchmark compared to the latest DBR versions. Some customers have reported observing significant speed-ups using Photon on workloads such as SQL-based jobs, Internet of Things use cases, data privacy and compliance, and loading data into Delta and Parquet.

Photon is compatible with the Apache Spark DataFrame and SQL APIs, allowing workloads to run without having to make any code changes. Photon coordinates work and resources transparently, accelerating portions of SQL and Spark queries without tuning or user intervention. While Photon started out focusing on SQL use cases, it has evolved in scope to accelerate all data and analytics workloads. Photon is the first purpose-built lakehouse engine and can be found as a key feature for data performance on the Databricks Lakehouse platform.

Unified governance and security.

In this video you'll learn about the importance of having a unified governance and security structure, the available security features, Unity Catalog and Delta Sharing, and the control and data planes of the Databricks Lakehouse platform.

While it's important to make high-quality data available to data teams, the more individual access points added to a system, such as users, groups, or external connectors, the higher the risk of data breaches along any of those lines, and any breach has long-lasting negative impacts on a business and their brand.

There are several challenges to data and AI governance, such as the diversity of data and AI assets, as data takes many forms beyond files and tables, including complex structures such as dashboards, machine learning models, videos, or images; the use of two disparate and incompatible data platforms, where past needs have forced businesses to use data warehouses for BI and data lakes for AI, resulting in data duplication and unsynchronized governance models; the rise of multi-cloud adoption, where each cloud has a unique governance model that requires individual familiarity; and fragmented tool usage for data governance on the lakehouse, introducing complexity at multiple integration points in the system and leading to poor performance.

To address these challenges, Databricks offers the following solutions: Unity Catalog as a unified governance solution for all data assets, Delta Sharing as an open solution to securely share live data to any computing platform, and an architecture divided into two planes, control and data, to simplify permissions, avoid data duplication, and reduce risk.

We'll start by exploring Unity Catalog. Unity Catalog is a unified governance solution for all data assets. Modern lakehouse systems support fine-grained row-, column-, and view-level access control via SQL, query auditing, attribute-based access control, data versioning, and data quality constraints and monitoring. Database admins should be familiar with the standard interfaces, allowing existing personnel to manage all the data in an organization in a uniform way.

In the Databricks Lakehouse platform, Unity Catalog provides a common governance model based on ANSI SQL to define and enforce fine-grained access control on all data and AI assets on any cloud. Unity Catalog supplies one consistent model to discover, access, and share data, enabling better native performance, management, and security across clouds.

Because Unity Catalog provides centralized governance for data and AI, there is a single source of truth for all user identities and data assets in the Databricks Lakehouse platform. The common metadata layer for cross-workspace metadata is at the account level. It provides a single access point with a common interface for collaboration from any workspace in the platform, removing data team silos. Unity Catalog allows you to restrict access to certain rows and columns to users or groups authorized to query them, and with attribute-based access control you can further simplify governance at scale by controlling access to multiple data items at one time. For example, personally identifiable information in multiple given columns can be tagged as such, and a single rule can restrict or provide access as needed.

Regulatory compliance is putting pressure on businesses for full compliance, and data access audits are critical to ensure these regulations are being met. For this, Unity Catalog provides a highly detailed audit trail, logging who has performed what action against the data.

To break down data silos and democratize data across your organization for data-driven decisions, Unity Catalog has a user interface for data search and discovery, allowing teams to quickly search for relevant data assets for any use case. Also, the low-latency metadata serving and auto-tuning of tables enable Unity Catalog to provide 38 times faster metadata processing compared to Hive metastore.

All the transformations and refinements of data from source to insights are encompassed in data lineage. All of the interactions with the data, including where it came from, what other data sets it might have been combined with, who created it and when, what transformations were performed, and other events and attributes, are included in a data set's data lineage. Unity Catalog provides automated data lineage charts down to the table and column level, giving an end-to-end view of the data that is not limited to just one workload. Multiple data teams can quickly investigate errors in their data pipelines or end applications. Impact analysis can also be performed to identify dependencies of data changes on downstream systems or teams, which can then be notified of potential impacts to their work. And with this power of data lineage there is an increased understanding of the data, reducing tribal knowledge. To round it out, Unity Catalog integrates with existing tools to help you future-proof your data and AI governance.

Next we'll discuss data sharing with Delta Sharing. Data sharing is an important aspect of the digital economy that has developed with the advent of big data, but data sharing is difficult to manage, and existing data sharing technologies come with several limitations. Traditional data sharing technologies do not scale well and often serve files offloaded to a server. Cloud object stores operate on an object level and are cloud specific. And commercial data sharing offerings and vendor products often share tables instead of files, scaling is expensive, and they aren't open and therefore don't permit data sharing to a different platform.

To address these challenges and limitations, Databricks developed Delta Sharing with contributions from the OSS community and donated it to the Linux Foundation. It is an open source solution to share live data from your lakehouse to any computing platform securely. Recipients don't have to be on the same cloud or even use the Databricks Lakehouse platform, and the data isn't simply replicated or moved. Additionally, data providers still maintain management and governance of the data, with the ability to track and audit usage.

Some key benefits of Delta Sharing include that it is an open, cross-platform sharing tool, easily allowing you to share existing data in Delta Lake and Apache Parquet formats without having to establish new ingestion processes to consume data, since it provides native integration with Power BI, Tableau, Spark, pandas, and Java. Data is shared live without copying it, with data being maintained on the provider's data lake, ensuring the data sets are reliable in real time and provide the most current information to the data recipient. As mentioned earlier, Delta Sharing provides centralized administration and governance to the data provider, as the data is governed, tracked, and audited from a single location, allowing usage to be monitored at the table, partition, and version level. With Delta Sharing you can build and package data products through a central marketplace for distribution to anywhere. And it is safe and secure, with privacy-safe data clean rooms, meaning collaboration between data providers and recipients is hosted in a secure environment while safeguarding data privacy.

Unity Catalog natively supports Delta Sharing, making these two tools smart choices in your data and AI governance and security structure. Delta Sharing is a simple REST protocol that securely shares access to part of a cloud data set. Leveraging modern cloud storage systems, it can reliably transfer large data sets.

Finally, let's talk about the security structure of the data lakehouse platform. A simple and unified approach to data security for the lakehouse is a critical requirement, and the Databricks Lakehouse platform provides this by splitting the architecture into two separate planes: the control plane and the data plane. The control plane consists of the managed back-end services that Databricks provides. These live in Databricks' own cloud account and are aligned with whatever cloud service the customer is using, that is, AWS, Azure, or GCP. Here Databricks runs the workspace application and manages notebooks, configuration, and clusters. The data plane is where your data is processed. Unless you choose to use serverless compute, the compute resources in the data plane run inside the business owner's own cloud account. All the data stays where it is. While some data such as notebooks, configurations, logs, and user information is available in the control plane, that information is encrypted at rest, and communication to and from the control plane is encrypted in transit.

Security of the data plane within your chosen cloud service provider is very important, so the Databricks Lakehouse platform has several key security points. For the networking of the environment, if the business decides to host the data plane, Databricks will configure the networking by default. The serverless data plane networking infrastructure is managed by Databricks in a Databricks cloud service provider account and shared among customers, with additional network boundaries between workspaces and clusters.

For servers in the data plane, Databricks clusters are run using the latest hardened system images; older, less secure images or code cannot be chosen. Databricks code itself is peer reviewed by security-trained developers and extensively reviewed with security in mind. Databricks clusters are typically short-lived, often terminated after a job, and do not persist data after termination. Code is launched in an unprivileged container to maintain system stability. This security design provides protection against persistent attackers and privilege escalation.

For Databricks support cases, Databricks access to the environment is limited to cloud service provider APIs for automation, and for support access Databricks has a custom-built system allowing staff access to fix issues or handle support requests. It requires either a support ticket or an engineering ticket tied expressly to your workspace, access is limited to a specific group of employees for limited periods of time, and with security audit logs the initial access event and the support team member's actions are tracked.

For user identity and access, Databricks supports many ways to enable users to access their data. The table ACLs feature uses traditional SQL-based statements to manage access to data and enable fine-grained, view-based access. IAM instance profiles enable AWS clusters to assume an IAM role, so users of that cluster automatically access allowed resources without explicit credentials. External storage can be mounted and accessed using a securely stored access key, and the Secrets API separates credentials from code when accessing external resources.

As mentioned previously, Databricks provides encryption, isolation, and auditing throughout the governance and security structure. Users can also be isolated at different levels, such as the workspace level, where each team or department uses a different workspace, and the cluster level, where cluster ACLs can restrict which users attach notebooks to a given cluster. For high-concurrency clusters, process isolation, JVM whitelisting, and language limitations can be used for safe coexistence of users with different access levels, and single-user clusters, if permitted, allow users to create a private, dedicated cluster.

And finally, for compliance, Databricks supports these compliance standards on our multi-tenant platform: SOC 2 Type 2, ISO 27001, ISO 27017, and ISO 27018. Certain clouds also support Databricks deployment options for FedRAMP High, HITRUST, HIPAA, and PCI, and Databricks and the Databricks platform are also GDPR and CCPA ready.

Instant compute and serverless.

In this video you'll learn about the available compute resources for the Databricks Lakehouse platform, what serverless compute is, and the benefits of Databricks Serverless SQL.

The Databricks Lakehouse platform architecture is split into the control plane and the data plane. The data plane is where data is processed by clusters of compute resources; this architecture is known as the classic data plane. With the classic data plane, compute resources are run in the business's cloud account, and clusters perform distributed data analysis using queries in the Databricks SQL workspace or notebooks in the Data Science and Engineering or Databricks Machine Learning environments.

However, in using this structure, businesses encountered challenges. First, creating clusters is a complicated task; choosing the correct size, instance type, and configuration for the cluster can be overwhelming to the user provisioning the cluster. Next, it takes several minutes for the environment to start after making the multitude of choices to configure and provision the cluster. And finally, because these clusters are hosted within the business's cloud account, there are many additional considerations to make about managing the capacity and pool of resources available. This leads to users exhibiting some costly behaviors, such as leaving clusters running for longer than necessary to avoid the startup times, and over-provisioning their resources to ensure the cluster can handle spikes in data processing needs, leading to users paying for unneeded resources, having large amounts of admin overhead, and ending up unproductive.

To solve these problems for the business, Databricks has released the serverless compute option, or serverless data plane. As of the release of this content, serverless compute is only available for use with Databricks SQL and is referred to at times as Databricks Serverless SQL. Serverless compute is a fully managed service in which Databricks provisions and manages the compute resources for a business in the Databricks cloud account instead of the business's. The environment starts immediately, scales up and down within seconds, and is completely managed by Databricks. You have clusters available on demand, and when finished the resources are released back to Databricks. Because of this, the total cost of ownership decreases on average between 20 and 40 percent, admin overhead is eliminated, and users see an increase in their productivity.

At the heart of serverless compute is a fleet of Databricks clusters that are always running, unassigned to any customer, waiting in a warm state, ready to be assigned within seconds. The pool of resources is managed by Databricks, so the business doesn't need to worry about the offerings from the cloud service, and Databricks works with the cloud vendors to keep things patched and upgraded as needed. When allocated to the business, the serverless compute resource is elastic, able to scale up or down as needed, and has three layers of isolation: the container hosting the runtime, the virtual machine hosting the container, and the virtual network for the workspace. Each part is isolated, with no sharing or cross-network traffic allowed, ensuring your work is secure. When finished, the VM is terminated and not reused but entirely deleted, and a new unallocated VM is released back into the pool of waiting resources.

Introduction to lakehouse data management terminology.

In this video you'll learn the definitions of common lakehouse terms such as metastore, catalog, schema, table, view, and function, and how they are used to describe data management in the Databricks Lakehouse platform.

Delta Lake, a key architectural component of the Databricks Lakehouse platform, provides a data storage format built for the lakehouse, and Unity Catalog, the data governance solution for the Databricks Lakehouse platform, allows administrators to manage and control access to data. Unity Catalog provides a common governance model to define and enforce fine-grained access control on all data and AI assets on any cloud. Unity Catalog supplies one consistent place for governing all workspaces to discover, access, and share data, enabling better native performance, management, and security across clouds.

Let's look at some of the key elements of Unity Catalog that are important to understanding how data management works in Databricks.

The metastore is the top-level logical container in Unity Catalog. It's a construct that represents the metadata; metadata is the information about the data objects being managed by the metastore and the ACLs governing those objects. Compared to the Hive metastore, which is a local metastore linked to each Databricks workspace, Unity Catalog metastores offer improved security and auditing capabilities as well as other useful features.

The next thing in the data object hierarchy is the catalog. A catalog is the topmost container for data objects in Unity Catalog. A metastore can have as many catalogs as desired, although only those with appropriate permissions can create them. Because catalogs constitute the topmost element in the addressable data hierarchy, the catalog forms the first part of the three-level namespace that data analysts use to reference data objects in Unity Catalog. This image illustrates how a three-level namespace compares to a traditional two-level namespace. Analysts familiar with traditional Databricks, or SQL for that matter, should recognize the traditional two-level namespace used to address tables within schemas. Unity Catalog introduces a third level to provide improved data segregation capabilities; complete SQL references in Unity Catalog use three levels.

A schema is part of traditional SQL and is unchanged by Unity Catalog. It functions as a container for data assets like tables and views and is the second part of the three-level namespace referenced earlier. Catalogs can contain as many schemas as desired, which in turn can contain as many data objects as desired.

At the bottom layer of the hierarchy are tables, views, and functions. Starting with tables, these are SQL relations consisting of an ordered list of columns. Though Databricks doesn't change the overall concept of a table, tables do have two key variations. It's important to recognize that tables are defined by two distinct elements: first, the metadata, or the information about the table such as comments, tags, and the list of columns and associated data types; and then the data that populates the rows of the table. The data originates from formatted data files stored in the business's cloud object storage. There are two types of tables in this structure: managed and external tables. Both have metadata managed by the metastore in the control plane; the difference lies in where the table data is stored. With a managed table, data files are stored in the metastore's managed storage location, whereas with an external table, data files are stored in an external storage location. From an access control point of view, managing both types of tables is identical.

Views are stored queries executed when you query the view. Views perform arbitrary SQL transformations on tables and other views and are read only; they do not have the ability to modify the underlying data.

The final element in the data object hierarchy is user-defined functions. User-defined functions enable you to encapsulate custom functionality into a function that can be invoked within queries.

Storage credentials are created by admins and are used to authenticate with cloud storage containers, either external, user-supplied storage or the managed storage location for the metastore. External locations are used to provide access control at the file level.

Shares and recipients relate to Delta Sharing, an open protocol developed by Databricks for secure, low-overhead data sharing across organizations. It's intrinsically built into Unity Catalog and is used to explicitly declare shares, which are read-only logical collections of tables. These can be shared with one or more recipients inside or outside the organization. Shares can be used for two main purposes: to securely share data outside the organization in a performant way, or to provide linkage between metastores in different parts of the world.

The metastore is best described as a logical construct for organizing your data and its associated metadata, rather than a physical container itself. The metastore essentially functions as a reference for a collection of metadata and a link to the cloud storage container. The metadata, information about the data objects and the ACLs for those objects, is stored in the control plane, and data related to objects maintained by the metastore is stored in a cloud storage container.

Related Tags
Databricks, Lakehouse, Delta Lake, Photon, Data Reliability, Performance, Data Governance, Security, Data Performance