Intro to Supported Workloads on the Databricks Lakehouse Platform

Databricks
23 Nov 2022 20:57

Summary

TL;DR: The video script introduces the Databricks Lakehouse Platform as a solution for modern data warehousing challenges, supporting SQL analytics and BI tasks with Databricks SQL. It highlights the platform's benefits, including cost-effectiveness, scalability, and built-in governance with Delta Lake. The script also covers the platform's capabilities in data engineering, ETL pipelines, data streaming, and machine learning, emphasizing its unified approach to simplifying complex data tasks and improving data quality and reliability.

Takeaways

  • The Databricks Lakehouse Platform supports data warehousing workloads with Databricks SQL and provides a unified solution for SQL analytics and BI tasks.
  • Traditional data warehouses are struggling to keep up with current business needs, leading to the rise of the data lakehouse concept for more efficient data handling.
  • The platform offers cost-effective scalability and elasticity, reducing infrastructure costs by 20 to 40 percent on average and minimizing resource management overhead.
  • Built-in governance with Delta Lake allows for a single copy of data with fine-grained control, data lineage, and standard SQL, enhancing data security and management.
  • A rich ecosystem of tools supports BI on data lakes, enabling data analysts to use their preferred tools, such as dbt, Fivetran, Power BI, or Tableau, for better collaboration.
  • The Databricks Lakehouse Platform simplifies data engineering by providing a unified platform for data ingestion, transformation, processing, scheduling, and delivery, improving data quality and reliability.
  • The platform automates ETL pipelines and supports both batch and streaming data operations, making it easier for data engineers to implement business logic and quality checks.
  • Databricks supports high data quality through its end-to-end data engineering and ETL platform, which automates pipeline building and maintenance.
  • Delta Live Tables (DLT) is an ETL framework that simplifies the building of reliable data pipelines with automatic infrastructure scaling, supporting both Python and SQL.
  • Databricks Workflows is a managed orchestration service that simplifies the building of reliable data, analytics, and ML workflows on any cloud, reducing operational overhead.
  • The platform supports the data streaming workload, providing real-time analytics, machine learning, and applications in one unified platform, which is crucial for quick business decisions.

Q & A

  • What is the primary challenge traditional data warehouses face in today's business environment?

    -Traditional data warehouses are no longer able to keep up with the needs of businesses due to their inability to handle the rapid influx of new data and the complexity of managing multiple systems for different tasks like BI and AI/ML.

  • How does the Databricks Lakehouse platform support data warehousing workloads?

    -The Databricks Lakehouse Platform supports data warehousing workloads through Databricks SQL and Databricks Serverless SQL, enabling data practitioners to perform SQL analytics and BI tasks and to deliver real-time business insights in a unified environment.

  • What are some key benefits of using the Databricks Lakehouse platform for data warehousing?

    -Key benefits include the best price for performance, greater scale and elasticity, instant elastic SQL serverless compute, reduced infrastructure costs, and built-in governance supported by Delta Lake.

  • How does the Databricks Lakehouse platform address the challenge of managing data in a unified way?

    -The platform allows organizations to unify all their analytics and simplify their architecture by using Databricks SQL, which helps in managing data with fine-grained governance, data lineage, and standard SQL.

  • What is the role of Delta Lake in the Databricks Lakehouse platform?

    -Delta Lake plays a crucial role in maintaining a single copy of all data in existing data lakes, seamlessly integrated with Unity Catalog, enabling discovery, security, and management of data with fine-grained governance.

  • How does the Databricks Lakehouse platform support data engineering tasks?

    -The platform provides an end-to-end data engineering solution, enabling data teams to ingest, transform, process, schedule, and deliver data with ease. It automates the complexity of building and managing pipelines and running ETL workloads directly on the data lake.

  • What are the challenges faced by data engineering teams in traditional data processing?

    -Challenges include complex data ingestion methods, the need for Agile development methods, complex orchestration tools, performance tuning of pipelines, and inconsistencies between various data warehouse and data lake providers.

  • How does the Databricks Lakehouse platform simplify data engineering operations?

    -The platform offers a unified data platform with managed data ingestion, schema detection, enforcement, and evolution, paired with declarative auto-scaling data flow and integrated with a native orchestrator that supports all kinds of workflows.

  • What is the significance of Delta Live Tables (DLT) in the Databricks Lakehouse platform?

    -Delta Live Tables (DLT) is an ETL framework that uses a simple declarative approach to building reliable data pipelines. It automates infrastructure scaling, supports both Python and SQL, and is tailored to work with both streaming and batch workloads.

  • How does the Databricks Lakehouse platform support streaming data workloads?

    -The platform empowers real-time analysis, real-time machine learning, and real-time applications by providing the ability to build streaming pipelines and applications faster, simplified operations from automated tooling, and unified governance for real-time and historical data.

  • What are the main challenges businesses face in harnessing machine learning and AI?

    -Challenges include siloed and disparate data systems, complex experimentation environments, difficulties in getting models served to a production setting, and the multitude of tools available that can complicate the ML lifecycle.

  • How does the Databricks Lakehouse platform facilitate machine learning and AI projects?

    -The platform provides a space for data scientists, ML engineers, and developers to use data, derive insights, build predictive models, and serve them to production. It simplifies tasks with MLflow, AutoML, and built-in tools for model versioning, monitoring, and serving.

Outlines

00:00

Data Warehousing with Databricks Lakehouse Platform

This paragraph introduces the Databricks Lakehouse Platform's support for data warehousing workloads. It highlights the challenges traditional data warehouses face in keeping up with modern business needs and the complexities introduced by separate BI and AI data structures. The Lakehouse platform, with Databricks SQL and Databricks Serverless SQL, offers a unified solution for SQL analytics and BI tasks such as data ingestion, transformation, querying, and dashboard building. Key benefits include cost-effectiveness, scalability, and reduced resource management overhead. The platform also supports built-in governance with Delta Lake, enabling a single data copy with fine-grained control and a rich ecosystem for BI tools integration.

05:02

Simplify Data Engineering with Databricks Lakehouse

This section discusses the modernization of data engineering through the Databricks Lakehouse platform. It addresses the challenges of complex data ingestion, ETL workloads, and pipeline management. The platform provides a unified data platform with managed data ingestion, schema management, and declarative auto-scaling. Databricks offers an end-to-end engineering solution that automates pipeline complexity, supports software engineering principles, and enables high data quality. Features include easy data ingestion, automated ETL pipelines, data quality checks, and simplified operations for deploying and managing data pipelines. The platform also supports Delta Live Tables (DLT) for building reliable data pipelines with a simple declarative approach, reducing the need for advanced data engineering skills.

10:02

Real-time Streaming Data Processing with Databricks

This paragraph focuses on the explosion of real-time streaming data and its impact on traditional data processing platforms. It outlines the three primary categories of streaming use cases supported by the Databricks Lakehouse platform: real-time analysis, real-time machine learning, and real-time applications. The platform empowers businesses to make quick decisions, detect fraud, provide personalized offerings, and predict machine failures. The top reasons for using Databricks for streaming data include faster pipeline and application development, simplified operations with automated tooling, and unified governance for real-time and historical data. The platform supports real-time analytics, machine learning, and applications, enabling businesses to harness the full potential of streaming data.

15:04

Machine Learning and AI with Databricks Lakehouse

This section delves into the challenges businesses face in implementing machine learning and AI, such as siloed data systems, complex experimentation environments, and model deployment issues. The Databricks Lakehouse platform provides a comprehensive solution for data scientists, ML engineers, and developers to perform exploratory data analysis, model training, and production deployment. It offers tools like MLflow for model tracking and versioning, a feature store for feature management, and AutoML for low-code experimentation. The platform simplifies the ML lifecycle by tracking lineage and governance, ensuring regulatory compliance and security. Databricks makes it easy to experiment, create, serve, and monitor models within the same platform.

20:05

Model Versioning and Monitoring with Databricks

This final paragraph emphasizes the importance of model versioning, monitoring, and serving within the Databricks Lakehouse platform. It highlights how the platform provides a world-class experience for these tasks, tracking lineage and governance throughout the entire ML lifecycle. This approach reduces regulatory compliance and security concerns, saving costs in the long run. Tools like MLflow and AutoML, built on top of Delta Lake, simplify the process of experimenting with data, creating models, and serving them to production, all within a unified platform.

Keywords

Databricks Lakehouse Platform

The Databricks Lakehouse Platform is a unified data analytics platform that combines the best of data lakes and data warehouses. It is designed to handle a wide range of data workloads, including data warehousing, machine learning, and big data processing. In the video, it is highlighted as a solution that supports data warehousing workloads with Databricks SQL and offers benefits such as cost-effectiveness, scalability, and ease of use.

Data Warehousing Workload

Data Warehousing Workload refers to the tasks involved in managing and analyzing large volumes of data, typically for business intelligence (BI) purposes. These tasks include data ingestion, transformation, querying, and the creation of dashboards. The script mentions that the Databricks Lakehouse Platform supports these tasks with Databricks SQL, emphasizing its role in providing real-time business insights.

Databricks SQL

Databricks SQL is a part of the Databricks Lakehouse Platform that allows users to execute SQL queries on large datasets. It is offered alongside Databricks Serverless SQL and can be used by data practitioners to perform data analysis and deliver insights. The script describes it as a tool that simplifies the process of working with data and BI tools within the Lakehouse environment.

Data Lakes

Data Lakes are storage repositories that hold a vast amount of raw data in its native format until it is needed. They are used for big data processing and analytics. The video script discusses the challenges of using data lakes for AI and ML and how the Databricks Lakehouse Platform addresses these by integrating data lakes with data warehousing capabilities.

Delta Lake

Delta Lake is an open-source storage layer that brings reliability and structure to data lakes. It supports ACID transactions and helps in managing the data lifecycle. In the context of the video, Delta Lake is mentioned as a key component of the Databricks Lakehouse Platform that provides built-in governance and supports data lineage and standard SQL.

Elastic SQL Serverless Compute

Elastic SQL Serverless Compute is a feature of the Databricks Lakehouse Platform that allows for the dynamic scaling of computing resources based on demand. This helps in managing infrastructure costs and reduces the overhead for data and platform administration teams. The script highlights the cost savings and efficiency improvements it brings to data warehousing workloads.

Data Governance

Data Governance involves the processes and policies that ensure the availability, usability, integrity, and security of the data used in an organization. The video script mentions that the Databricks Lakehouse Platform supports data governance through fine-grained controls, enabling organizations to manage their data effectively.

Data Engineering

Data Engineering is the process of designing, building, and maintaining the systems and processes that enable the collection, storage, and transformation of data. The script discusses the challenges faced by data engineering teams and how the Databricks Lakehouse Platform provides an end-to-end solution for data ingestion, transformation, and orchestration.

Data Quality

Data Quality refers to the overall integrity and reliability of data. It is a critical aspect of data engineering and analytics. The video script emphasizes the importance of data quality for data engineering within the Lakehouse, highlighting features like Delta Live Tables and the use of the Medallion architecture for improving data quality.

Data Streaming

Data Streaming involves the continuous flow of data in real-time, which is increasingly important for businesses to make timely decisions. The script discusses the support for data streaming workloads in the Databricks Lakehouse Platform, enabling real-time analysis, machine learning, and applications.

Machine Learning and AI

Machine Learning (ML) and Artificial Intelligence (AI) involve the development of algorithms and models that can learn from and make decisions based on data. The video script explains how the Databricks Lakehouse Platform supports ML and AI workloads by providing tools for data scientists and ML engineers to experiment, build models, and serve them to production.

Highlights

The Databricks Lakehouse Platform supports the data warehousing workload with Databricks SQL and offers benefits such as real-time business insights and cost-effective performance.

Traditional data warehouses struggle to keep up with current business needs, and complex architectures create challenges in providing timely and cost-effective data value.

Data lakehouses offer a home for data warehousing workloads, with features and tools, particularly Databricks SQL, that simplify SQL analytics and BI tasks.

Databricks Lakehouse platform unifies analytics and simplifies architecture, providing instant elastic SQL serverless compute to lower infrastructure costs.

Built-in governance supported by Delta Lake allows for single data copy management with fine-grained governance and data lineage.

The platform features a rich ecosystem with tools for BI on data lakes, enabling data analysts to use preferred tools without needing specific knowledge or skills.

Data engineering teams face challenges with complex data ingestion methods and the need for Agile development methods and CI/CD pipelines.

Databricks Lakehouse platform simplifies modern data engineering with a unified data platform, managed data ingestion, and integrated orchestration.

Data quality is emphasized in data engineering, with the platform providing an end-to-end solution for data ingestion, transformation, and orchestration.

Delta Live Tables (DLT) is an ETL framework that simplifies building reliable data pipelines with a declarative approach and automatic infrastructure scaling.

Databricks Workflows is a managed orchestration service that simplifies building reliable data analytics and ML workflows on any cloud.

The platform supports real-time streaming data, enabling businesses to make quick decisions and keep pace with their industries.

Databricks Lakehouse platform empowers streaming use cases for real-time analysis, machine learning, and real-time applications.

The platform provides a space for data scientists and ML engineers to experiment, create models, and serve them to production within a unified environment.

MLflow, an open-source platform created by Databricks, simplifies tracking model training sessions and packaging models for reuse.

AutoML in the platform allows data scientists to experiment with low to no code, automatically training models and tuning hyperparameters.

Databricks Lakehouse platform offers model versioning, monitoring, and serving with lineage and governance tracking throughout the ML lifecycle.

Transcripts

00:00

Supported workloads on the Databricks Lakehouse Platform: data warehousing. In this video, you'll learn how the Databricks Lakehouse Platform supports the data warehousing workload with Databricks SQL, and the benefits of data warehousing with the Databricks Lakehouse Platform.

00:17

Traditional data warehouses are no longer able to keep up with the needs of businesses in today's world. And although organizations have attempted using complicated and complex architectures, with data warehouses for BI and data lakes for AI and ML, too many challenges have come to light with those structures to provide value from the data in a timely or cost-effective manner.

00:39

With the advent of the data lakehouse, data warehousing workloads finally have a home, and the Databricks Lakehouse Platform provides several features and tools to support this workload, especially with Databricks SQL.

00:52

When we refer to the data warehousing workload, we are referencing SQL analytics and BI tasks such as ingesting, transforming, and querying data, building dashboards, and delivering business insights. The Databricks Lakehouse Platform supports these tasks with Databricks SQL and Databricks Serverless SQL. Data practitioners can complete their data analysis tasks all in one location, using the SQL and BI tools of their choice, and deliver real-time business insights at the best price for performance. Organizations can unify all their analytics and simplify their architecture by using Databricks SQL.

01:30

Some of the key benefits include the following.

The best price for performance. Cloud data warehouses provide the greater scale and elasticity needed to handle the rapid influx of new data, and the Databricks Lakehouse Platform offers instant, elastic SQL serverless compute that can lower overall infrastructure costs by 20 to 40 percent on average. This also reduces or removes the resource management overhead from the workload of the data and platform administration teams.

02:00

Built-in governance. Supported by Delta Lake, the Databricks Lakehouse Platform allows you to keep a single copy of all your data in your existing data lakes. Seamlessly integrated with Unity Catalog, you can discover, secure, and manage all of your data with fine-grained governance, data lineage, and standard SQL.

02:20

A rich ecosystem. Tools for conducting BI on data lakes are few and far between, often requiring data analysts to use developer interfaces or tools designed for data scientists that require specific knowledge and skills. The Databricks Lakehouse Platform allows you to work with your preferred tools, such as dbt, Fivetran, Power BI, or Tableau. Teams can quickly collaborate across the organization without having to move or transfer data, thus leading to the breakdown of silos.

02:53

Data engineering teams are challenged with needing to enable data analysts at the speed a business requires. Data needs to be ingested and processed ahead of time before it can be used for BI. The Databricks Lakehouse Platform provides a complete end-to-end data warehousing solution, empowering data teams and business users by providing them with the tools to quickly and effortlessly work with data, all in one single platform.

03:19

Data engineering. In this video, you'll learn why data quality is so important for data engineering, how the Databricks Lakehouse Platform supports the data engineering workload, what Delta Live Tables are and how they support data transformation, and how Databricks Workflows support data orchestration in the lakehouse.

03:43

Data is a valuable asset to businesses, and it can be collected and brought into the platform, or ingested, from hundreds of different sources, cleaned in various different ways, then shared and utilized by multiple different teams for their projects. The data engineering workload focuses on ingesting that data, transforming it, and orchestrating it out to the different data teams that utilize it for day-to-day insights, innovation, and tasks. However, while the data teams rely on getting the right data at the right time for their analytics, data science, and machine learning tasks, data engineers often face several challenges trying to meet these needs as data reaches new heights in volume, velocity, and variety.

04:26

Several of the challenges to the data engineering workload are:

  • Complex data ingestion methods, where data engineers need to use an always-running streaming platform, keep track of which files haven't been ingested yet, or spend time hand-coding error-prone, repetitive data ingestion tasks.
  • Data engineering principles need to be supported, such as Agile development methods, isolated development and production environments, CI/CD, and version-controlled transformations.
  • Third-party tools for orchestration increase the operational overhead and decrease the reliability of the system.
  • Performance tuning of pipelines and architectures requires knowledge of the underlying architecture and constantly observing throughput parameters.
  • With platform inconsistencies between the various data warehouse and data lake providers, businesses struggle to get multiple products to work in their environments due to different limitations, workloads, development languages, and governance models.

05:29

The Databricks Lakehouse Platform makes modern data engineering simple. As there is no industry-wide definition of what this means, Databricks offers the following: a unified data platform with managed data ingestion, schema detection, enforcement, and evolution, paired with declarative, auto-scaling data flow, integrated with a lakehouse-native orchestrator that supports all kinds of workflows.
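
The three schema steps named here (detection, enforcement, evolution) can be illustrated with a minimal plain-Python sketch. This is not Databricks code and does not reflect how the platform implements these features; the function names `infer_schema`, `evolve_schema`, and `enforce_schema` are hypothetical and exist only for illustration.

```python
from typing import Any

# A schema here is simply a mapping of column name -> Python type.
Schema = dict[str, type]


def infer_schema(record: dict[str, Any]) -> Schema:
    """Detection: derive a schema from a single record."""
    return {name: type(value) for name, value in record.items()}


def evolve_schema(schema: Schema, record: dict[str, Any]) -> Schema:
    """Evolution: return a schema extended with columns not seen before."""
    evolved = dict(schema)
    for name, value in record.items():
        evolved.setdefault(name, type(value))
    return evolved


def enforce_schema(schema: Schema, record: dict[str, Any]) -> bool:
    """Enforcement: accept a record only if known columns have the expected type."""
    return all(
        isinstance(record[name], expected)
        for name, expected in schema.items()
        if name in record
    )


schema = infer_schema({"id": 1, "name": "a"})
# A later record carries a new column, so the schema grows.
schema = evolve_schema(schema, {"id": 2, "name": "b", "country": "DE"})
ok = enforce_schema(schema, {"id": 3, "name": "c", "country": "FR"})
bad = enforce_schema(schema, {"id": "three", "name": "c"})  # wrong type for id
```

In this sketch a record with a mistyped column is rejected rather than silently loaded, which is the general idea behind enforcing a schema at ingestion time.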

05:56

The Databricks Lakehouse Platform gives data engineers an end-to-end engineering solution for ingesting, transforming, processing, scheduling, and delivering data. The complexity of building and managing pipelines and running ETL workloads is automated directly on the data lake, so data engineers can focus on quality and reliability.

06:15

The key capabilities of data engineering on the lakehouse include:

  • Easy data ingestion, where petabytes of data can be automatically ingested quickly and reliably for analytics, data science, and machine learning.
  • Automated ETL pipelines that help reduce development time and effort, so data engineers can focus on implementing business logic and data quality checks in data pipelines.
  • Data quality checks that can be defined, with errors automatically addressed, so data teams can confidently trust the information they're using.
  • Batch and streaming data latency that can be tuned with cost controls, without data engineers having to know complex stream processing details.
  • Automatic recovery from common errors during a pipeline operation.
  • Data pipeline observability that allows data engineers to monitor overall data pipeline status and visibly track pipeline health.
  • Simplified operations for deploying data pipelines to production, or for rolling back pipelines and minimizing downtime.
  • Scheduling and orchestration that is simple, clear, and reliable for data processing tasks, with the ability to run non-interactive tasks as a directed acyclic graph on a Databricks compute cluster.
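
Running tasks "as a directed acyclic graph" means each task starts only after the tasks it depends on have finished. The sketch below shows that ordering idea with Python's standard-library `graphlib`; it is not the Databricks scheduler, and the task names are made up for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical task names; each task maps to the set of tasks it depends on.
dependencies = {
    "ingest": set(),
    "clean": {"ingest"},
    "aggregate": {"clean"},
    "train_model": {"clean"},
    "report": {"aggregate", "train_model"},
}


def run_order(deps: dict[str, set[str]]) -> list[str]:
    """Return one valid execution order; raises CycleError if the graph has a cycle."""
    return list(TopologicalSorter(deps).static_order())


order = run_order(dependencies)
```

Because `aggregate` and `train_model` both depend only on `clean`, either may come first in the returned order; an orchestrator could also run them in parallel.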

07:29

High data quality is the goal of modern data engineering within the lakehouse, so a critical workload for data teams is to build ETL pipelines to ingest, transform, and orchestrate data for machine learning and analytics. Databricks data engineering enables data teams to unify batch and streaming operations on a simplified architecture, provide modern, software-engineered data pipeline development and testing, build reliable data, analytics, and AI workflows on any cloud platform, and meet regulatory requirements to maintain world-class governance. The lakehouse provides an end-to-end data engineering and ETL platform that automates the complexity of building and maintaining pipelines and running ETL workloads, so data engineers and analysts can focus on quality and reliability to drive valuable insights.

08:19

As data loads into the Delta Lake lakehouse, Databricks automatically infers the schema and evolves it as the data comes in. The Databricks Lakehouse Platform also provides Auto Loader, an optimized data ingestion tool that processes new data files as they arrive in the lakehouse cloud storage. It auto-detects the schema and enforces it on your data, guaranteeing data quality. Data ingestion for data analysts and analytics engineers is easy with the COPY INTO SQL command, which follows the lake-first approach and loads data from a folder into a Delta Lake table. When run, only new files from the source will be processed.
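
The "only new files will be processed" behavior can be pictured as a loader that remembers which files it has already ingested. The following is a plain-Python sketch of that idea under simplifying assumptions (files are identified by name, the table is an in-memory list); it is not how Auto Loader or COPY INTO are implemented, and `load_new_files` is a hypothetical helper.

```python
import tempfile
from pathlib import Path


def load_new_files(source_dir: Path, table: list[str], seen: set[str]) -> int:
    """Append lines from files not ingested before; return how many files were loaded."""
    loaded = 0
    for path in sorted(source_dir.glob("*.csv")):
        if path.name in seen:
            continue  # already ingested on an earlier run
        table.extend(path.read_text().splitlines())
        seen.add(path.name)
        loaded += 1
    return loaded


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp)
    table: list[str] = []
    seen: set[str] = set()

    (source / "a.csv").write_text("1,alice\n")
    first = load_new_files(source, table, seen)   # picks up a.csv

    (source / "b.csv").write_text("2,bob\n")
    second = load_new_files(source, table, seen)  # picks up only b.csv

    third = load_new_files(source, table, seen)   # nothing new to load
```

Re-running the loader is safe in this sketch: a repeated run with no new files adds nothing to the table, which is what makes this style of ingestion easy to schedule.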

08:58

Data transformation through the use of the medallion architecture shown earlier is an established and reliable pattern for improving data quality. However, implementation is challenging for many data engineering teams. Attempts to hand-code the architecture are hard for data engineers, and data pipeline creation is simply impossible for data analysts not able to code with Spark Structured Streaming in Scala or Python. So even in small-scale implementations, data engineering time is spent on tooling and managing infrastructure instead of transformations.
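
The medallion architecture is commonly described as three layers: bronze (raw data as it arrived), silver (cleaned and validated), and gold (aggregated for business use). The transcript does not spell out the layers, so the sketch below is an assumption-laden illustration in plain Python with made-up order records; a real implementation would use Delta tables rather than in-memory lists.

```python
from collections import defaultdict

# Bronze: raw records as they arrived, including a malformed row and a duplicate.
bronze = [
    {"order_id": "1", "country": "de", "amount": "10.5"},
    {"order_id": "2", "country": "FR", "amount": "oops"},
    {"order_id": "1", "country": "de", "amount": "10.5"},
    {"order_id": "3", "country": "fr", "amount": "4.5"},
]


def to_silver(rows: list[dict[str, str]]) -> list[dict]:
    """Clean and deduplicate: typed columns, normalized country, bad rows dropped."""
    silver, seen_ids = [], set()
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # drop rows whose amount cannot be parsed
        if row["order_id"] in seen_ids:
            continue  # drop duplicate orders
        seen_ids.add(row["order_id"])
        silver.append({
            "order_id": int(row["order_id"]),
            "country": row["country"].upper(),
            "amount": amount,
        })
    return silver


def to_gold(rows: list[dict]) -> dict[str, float]:
    """Aggregate to a business-level view: total amount per country."""
    totals: dict[str, float] = defaultdict(float)
    for row in rows:
        totals[row["country"]] += row["amount"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
```

Each layer reads only from the one before it, so data quality improves step by step and the raw bronze records stay available for reprocessing.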

09:32

Delta Live Tables (DLT) is the first ETL framework that uses a simple declarative approach to building reliable data pipelines. DLT automatically scales the infrastructure, so data analysts and engineers spend less time on tooling and can focus on getting value from their data. Engineers treat their data as code and apply software engineering best practices to deploy reliable pipelines at scale. DLT fully supports both Python and SQL and is tailored to work with both streaming and batch workloads.
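
To give a feel for what "declarative" means here, the sketch below declares tables as functions and attaches quality checks to them, leaving it to a small runner to materialize the tables and drop failing rows. It is a toy in plain Python and deliberately does not use or reproduce the DLT API; the helpers `table`, `expect_or_drop`, and `read` are hypothetical names defined in the example itself.

```python
from typing import Callable

Row = dict
TABLES: dict[str, Callable[[], list[Row]]] = {}      # declared tables, by name
CHECKS: dict[str, list[Callable[[Row], bool]]] = {}  # quality checks per table


def table(func: Callable[[], list[Row]]) -> Callable[[], list[Row]]:
    """Register a function as the declaration of a table (hypothetical helper)."""
    TABLES[func.__name__] = func
    return func


def expect_or_drop(check: Callable[[Row], bool]):
    """Attach a quality check; rows failing it are dropped (hypothetical helper)."""
    def decorator(func):
        CHECKS.setdefault(func.__name__, []).append(check)
        return func
    return decorator


def read(name: str) -> list[Row]:
    """Materialize a declared table and apply its quality checks."""
    rows = TABLES[name]()
    return [r for r in rows if all(check(r) for check in CHECKS.get(name, []))]


@table
def raw_events() -> list[Row]:
    # Made-up sample data: one good row, one missing user, one negative count.
    return [
        {"user": "a", "clicks": 3},
        {"user": None, "clicks": 1},
        {"user": "b", "clicks": -2},
    ]


@table
@expect_or_drop(lambda r: r["user"] is not None)
@expect_or_drop(lambda r: r["clicks"] >= 0)
def clean_events() -> list[Row]:
    return read("raw_events")


result = read("clean_events")
```

The pipeline author states what each table is and which rows are acceptable; how and when the tables are computed is left to the runner, which is the part a managed framework takes over.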

10:04

By speeding up deployment and automating complex tasks, DLT reduces implementation time. Software engineering principles are applied to data engineering to foster the idea of treating your data as code. Beyond transformations, there are many things to include in the code that defines your data, such as declaratively expressing entire data flows in SQL or Python and natively enabling modern software engineering best practices, such as separate production and development environments, testing before deploying, using parameterization to deploy and manage environments, unit testing, and documentation. Unlike other products, DLT supports both batch and streaming workloads in a single API, reducing the need for advanced data engineering skills.

10:47

Orchestrating and managing end-to-end production workflows can be a challenge if a business relies on external or cloud-specific tools that are separate from the lakehouse platform. This structure also reduces the overall reliability of production workloads, limits observability, and increases the complexity in the environment for end users.

11:07

Databricks Workflows is the first fully managed orchestration service embedded in the Databricks Lakehouse Platform. Workflows allows data teams to build reliable data, analytics, and ML workflows on any cloud without needing to manage complex infrastructure. Databricks Workflows allows you to orchestrate data flow pipelines written in DLT or dbt, machine learning pipelines, and other tasks such as notebooks or Python wheels. As a fully managed feature, Databricks Workflows eliminates operational overhead for data engineers. With an easy point-and-click authoring experience, all data teams can utilize Databricks Workflows.

play11:47

while you can create workflows with the

play11:49

UI you can use the databricks workflows

play11:51

API or external orchestrators such as

play11:53

Apache airflow even with an external

play11:56

orchestrator databricks workflows

play11:58

monitoring acts like a window that

play12:00

includes externally triggered workflows

play12:02

Delta live tables is one of the many

play12:04

task types for databricks workflows and

play12:07

is where the managed data flow pipelines

play12:09

with DLT join with the easy point-and-click

play12:11

authoring experience of databricks

play12:13

workflows this example illustrates an

play12:16

end-to-end workflow where data is

play12:17

streamed from Twitter according to

play12:19

search terms ingested with Auto Loader

play12:21

using automatic schema detection and

play12:24

then cleaned and transformed with Delta

play12:26

live tables pipelines written in SQL

play12:29

finally the data is run through a

play12:31

pre-trained BERT language model from

play12:33

Hugging Face for sentiment analysis of

play12:36

the tweets as you can see different

play12:38

tasks for ingestion cleansing and

play12:40

transforming the data and machine

play12:41

learning are all combined in a single

play12:43

workflow using workflows tasks can be

play12:46

scheduled to provide daily overviews of

play12:48

social media coverage and customer

play12:50

sentiment

play12:51

so needless to say you can orchestrate

play12:53

anything with databricks workflows
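
The example workflow above is a chain of dependent tasks, and that dependency structure can be sketched with Python's standard library. This is a minimal sketch and not the Databricks Workflows API: the task names and the `run_task` helper are hypothetical, and the tasks only record that they ran.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Each task maps to the set of tasks it depends on (hypothetical names).
dependencies: Dict[str, Set[str]] = {
    "ingest_tweets": set(),
    "clean_and_transform": {"ingest_tweets"},
    "sentiment_analysis": {"clean_and_transform"},
    "daily_overview": {"sentiment_analysis"},
}


def run_task(name: str, log: List[str]) -> None:
    """Placeholder for real work; here a task only records that it ran."""
    log.append(name)


def run_workflow(deps: Dict[str, Set[str]]) -> List[str]:
    """Run every task after all of its dependencies have finished."""
    log: List[str] = []
    for task in TopologicalSorter(deps).static_order():
        run_task(task, log)
    return log


execution_order = run_workflow(dependencies)
```

An orchestrator adds scheduling, retries, and monitoring on top of this ordering, which is the part a managed service takes over.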

play12:56

data streaming

play12:59

in this video you'll learn what

play13:01

streaming data is and how the data

play13:03

streaming workload in the databricks

play13:05

lake house platform is supported

play13:08

in the last few years we have seen an

play13:10

explosion of real-time streaming data

play13:12

and it is overwhelming traditional data

play13:14

processing platforms that were never

play13:16

designed with streaming data in mind

play13:19

constantly generated by every individual

play13:22

every machine and every organization on

play13:24

the planet businesses require this data

play13:26

to make necessary decisions and keep

play13:28

Pace with their respective industries

play13:30

from transactions to operational systems

play13:33

to customer and employee interactions to

play13:36

third-party data services in the cloud

play13:38

and Internet of Things data from sensors

play13:41

and devices real-time data is everywhere

play13:44

all this real-time data creates new

play13:46

opportunities to build Innovative

play13:48

real-time applications to detect fraud

play13:51

provide personalized offerings to

play13:53

customers dynamically adjust pricing in

play13:56

real time and predict when a machine or

play13:59

part is going to fail and much more

play14:02

the databricks lake house platform

play14:03

empowers three primary categories of

play14:05

streaming use cases

play14:07

real-time analysis by supplying your

play14:10

data warehouses and bi tools and

play14:12

dashboards with real-time data for

play14:14

instant insights and faster decision

play14:15

making

play14:16

real-time machine learning first with

play14:19

training of machine learning models on

play14:21

real-time data as it's coming in and

play14:23

second with the application of those

play14:25

models to score new events leading to

play14:27

machine learning inference in real time

play14:30

and real-time applications

play14:32

applications can mean a lot of things so

play14:34

this might be an embedded application

play14:36

for real-time analytics or machine

play14:38

learning but it also could be as simple

play14:41

as if-then business rules based on

play14:44

streaming data triggering actions in

play14:46

real time
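
Business rules of this if-then kind can be sketched as condition and action pairs evaluated on each incoming event. This is a minimal sketch in plain Python, not a Databricks API: the rules, field names, thresholds, and action names are hypothetical, and a short list stands in for the stream.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

# A rule is a condition on an event plus the name of the action to trigger.
Rule = Tuple[Callable[[Dict], bool], str]

# Hypothetical rules for a payments stream.
RULES: List[Rule] = [
    (lambda e: e["amount"] > 1000, "flag_for_review"),
    (lambda e: e["country"] != e["home_country"], "send_alert"),
]


def apply_rules(events: Iterable[Dict], rules: List[Rule]) -> Iterator[Tuple[int, str]]:
    """Yield (event id, action) as soon as an event satisfies a rule."""
    for event in events:
        for condition, action in rules:
            if condition(event):
                yield event["id"], action


# Illustrative events; a real source would be unbounded.
events = [
    {"id": 1, "amount": 50, "country": "DE", "home_country": "DE"},
    {"id": 2, "amount": 5000, "country": "DE", "home_country": "DE"},
    {"id": 3, "amount": 20, "country": "US", "home_country": "DE"},
]

triggered = list(apply_rules(events, RULES))
```

Because `apply_rules` is a generator, each action is produced as its event is processed rather than after the whole input has been read.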

play14:48

further different industries will have

play14:50

different use cases for streaming data

play14:53

making it highly important for the

play14:55

future of data processing and Analytics

play14:57

for example in a retail environment

play14:59

real-time inventory helps support

play15:01

business activities pricing and supply

play15:03

chain demands

play15:05

in Industrial Automation streaming and

play15:07

predictive analysis help manufacturers

play15:09

improve production processes and product

play15:12

quality sending alerts and shutting down

play15:14

production automatically if there is an

play15:16

active dip in quality

play15:18

for healthcare streaming patient monitor

play15:20

data can help ensure appropriate

play15:22

medication and care are provided when

play15:24

needed without delay

play15:26

for financial institutions real-time

play15:28

analysis of transactions can detect

play15:30

fraudulent activity and send alerts and by

play15:33

using machine learning algorithms firms

play15:35

can gain Insight from fraud analytics to

play15:38

identify patterns and there are still

play15:40

many more use cases for the value of

play15:42

streaming data to businesses

play15:47

so the top three reasons for using the

play15:49

databricks lake house platform for

play15:51

streaming data are the ability to build

play15:53

streaming pipelines and applications

play15:55

faster simplified operations from

play15:58

automated tooling and unified governance

play16:00

for real-time and historical data

play16:03

one of the key takeaways is that the

play16:05

databricks lake house platform unlocks

play16:07

many different real-time use cases

play16:09

Beyond those already mentioned giving

play16:11

you the ability to solve really high

play16:13

value problems for your business

play16:16

the databricks lake house platform has

play16:17

the capability to support the data

play16:19

streaming workload to provide real-time

play16:21

analytics machine learning and

play16:23

applications all in one platform

play16:26

data streaming helps business teams to

play16:28

make quicker better decisions

play16:30

development teams to deliver real-time

play16:32

and differentiated experiences and

play16:34

operations teams to detect and react to

play16:37

operational issues in real time data

play16:39

streaming is one of the fastest growing

play16:41

workloads for the lake house

play16:42

architecture and is the future of all

play16:44

data processing data science and machine

play16:47

learning

play16:48

in this video you'll learn about the

play16:50

challenges businesses face in attempting

play16:52

to harness machine learning and AI

play16:54

Endeavors and how the databricks lake

play16:56

house platform supports the data science

play16:58

and machine learning workload for

play17:00

successful machine learning and AI

play17:01

projects

play17:03

businesses know machine learning and AI

play17:06

have a myriad of benefits but realizing

play17:08

these benefits proves challenging for

play17:10

businesses brave enough to attempt

play17:11

machine learning and AI

play17:13

several of the challenges businesses

play17:15

face include siloed and disparate data

play17:18

Systems complex experimentation

play17:20

environments and getting models served

play17:23

to a production setting

play17:24

additionally businesses have multiple

play17:26

concerns when it comes to using machine

play17:28

learning such as there are so many tools

play17:31

available covering each phase of the ml

play17:33

lifecycle but unlike traditional

play17:36

software development machine learning

play17:37

development benefits from trying

play17:39

multiple tools available to see if

play17:41

results improve

play17:42

experiments are hard to track as there

play17:45

are so many parameters tracking the

play17:47

parameters code and data that went into

play17:49

producing a model can be cumbersome

play17:52

reproducing results is difficult

play17:54

especially without detailed tracking and

play17:57

when you want to release your trained

play17:58

code for use in production or even debug

play18:01

a problem reproducing past steps of the

play18:03

ml workflow is key
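
What tracking the parameters, code, and data behind a model involves can be sketched with the standard library alone. This is an illustration under stated assumptions and not the MLflow API: `fingerprint`, `log_run`, the code version string, and the metric values are hypothetical placeholders.

```python
import hashlib
import json
from typing import Any, Dict, List


def fingerprint(data: List[List[float]]) -> str:
    """Hash the training data so a run can be tied to the exact data used."""
    payload = json.dumps(data, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def log_run(
    runs: List[Dict[str, Any]],
    params: Dict[str, Any],
    code_version: str,
    data: List[List[float]],
    metrics: Dict[str, float],
) -> Dict[str, Any]:
    """Record everything needed to reproduce or compare a training run."""
    record = {
        "params": params,
        "code_version": code_version,
        "data_sha256": fingerprint(data),
        "metrics": metrics,
    }
    runs.append(record)
    return record


runs: List[Dict[str, Any]] = []
training_data = [[0.1, 0.2], [0.3, 0.4]]  # placeholder data

# Metric values below are made-up placeholders, not real results.
first = log_run(runs, {"learning_rate": 0.01}, "abc123", training_data, {"accuracy": 0.9})
second = log_run(runs, {"learning_rate": 0.1}, "abc123", training_data, {"accuracy": 0.85})

# With every run recorded, picking the best one is a simple lookup.
best = max(runs, key=lambda r: r["metrics"]["accuracy"])
```

Matching `data_sha256` values show that both runs used the same data, so any difference in results comes from the parameters or the code.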

play18:05

and it's hard to deploy ml especially

play18:08

when there are so many available tools

play18:10

for moving a model to production and as

play18:13

there is no standard way to move models

play18:15

there is always a new risk with each new

play18:18

deployment

play18:19

The databricks Lakehouse platform

play18:21

provides a space for data scientists ml

play18:24

engineers and developers to use data and

play18:27

derive Innovative insights build

play18:29

powerful predictive models all within

play18:31

the space of machine learning and AI

play18:33

with data all in one location data

play18:35

scientists can perform exploratory data

play18:37

analysis easily in the notebook style

play18:40

experience with support for multiple

play18:42

languages and built-in visualizations

play18:44

and dashboards

play18:45

code can be shared securely and

play18:47

confidently for co-authoring and

play18:49

commenting with automatic versioning git

play18:52

Integrations and role-based access

play18:53

controls

play18:55

from data ingestion to model training

play18:58

and tuning all the way through to

play18:59

production model serving and versioning

play19:02

the databricks lake house platform

play19:04

brings the tools you need to simplify

play19:06

those tasks

play19:07

the databricks machine learning runtimes

play19:09

help you get started with experimenting

play19:12

and are optimized and pre-configured

play19:14

with the most popular libraries

play19:16

with GPU support for distributed

play19:18

training and Hardware acceleration you

play19:20

can scale as needed

play19:22

MLflow is an open source machine

play19:24

learning platform created by databricks

play19:27

and is a managed service within the

play19:29

databricks Lakehouse platform

play19:31

with MLflow you can track model

play19:33

training sessions from within the

play19:35

runtimes and package and reuse models

play19:37

with ease a feature store is available

play19:40

allowing you to create new features and

play19:42

reuse existing ones for training and

play19:44

scoring machine learning models

play19:46

automl allows both beginner and

play19:49

experienced data scientists to get

play19:51

started with low to no code

play19:52

experimentation automl points to your

play19:55

data set automatically trains models and

play19:57

tunes hyperparameters to save you time

play20:00

in the machine learning process

play20:01

additionally automl reports back metrics

play20:05

related to the results as well as the

play20:07

code necessary to repeat the training

play20:09

customized to your data set this glass

play20:11

box feature means you don't need to feel

play20:13

trapped by vendor lock-in

play20:16

the databricks lake house platform

play20:18

provides a world-class experience for

play20:20

model versioning monitoring and serving

play20:23

within the same platform used to

play20:25

generate the models themselves lineage

play20:27

and governance is tracked throughout the

play20:29

entire ml lifecycle so Regulatory

play20:32

Compliance and security concerns can be

play20:34

reduced saving costs down the road

play20:37

with tools like mlflow and automl and

play20:40

built on top of Delta Lake the

play20:42

databricks lake house platform makes it

play20:44

easy for data scientists to experiment

play20:46

create models and serve them to

play20:49

production and monitor them all in one

play20:51

place

Related Tags
Data Warehousing, Data Engineering, Data Streaming, Databricks Lakehouse, SQL Analytics, BI Tools, Machine Learning, AI Insights, Cloud Computing, ETL Automation, Real-time Analysis, Data Governance