Automating Databricks Environment | How to use Databricks Rest API | Databricks Spark Automation

Learning Journal
5 Oct 2023 · 28:03

Summary

TL;DR: In this session, the focus is on automation tools provided by Databricks. The discussion covers the transition from manual tasks to automated processes, particularly for deploying projects into production environments. The presenter introduces three approaches for automation: Databricks REST API, Databricks SDK, and Databricks CLI. A detailed walkthrough of using the REST API to create and manage jobs is provided, including a live demonstration of automating job creation and execution within a Databricks workspace. The session aims to equip viewers with the knowledge to automate various tasks using these tools, with a comprehensive example set to be explored in the Capstone project.

Takeaways

  • 🔧 The session focuses on automation tools provided by Databricks, which are crucial for automating tasks in a Databricks workspace.
  • 🛠️ Databricks offers three main approaches for automation: REST API, Databricks SDK, and Databricks CLI, each suitable for different programming languages and use cases.
  • 📚 The REST API is the most frequently used method, allowing users to perform almost any action programmatically that can be done through the Databricks UI.
  • 🔗 The REST API documentation is platform-agnostic and provides a comprehensive list of endpoints for various Databricks services.
  • 💻 The Databricks SDK provides language-specific libraries, such as Python, Scala, and Java, for automation tasks.
  • 📝 The Databricks CLI is a command-line tool that enables users to perform UI actions through command-line commands, suitable for shell scripting.
  • 🔄 The session includes a live demo of using the REST API to create and manage jobs in Databricks, showcasing the process from job creation to monitoring job status.
  • 🔑 Authentication is a critical aspect of using the Databricks REST API, requiring an access token that can be generated from the user's settings in the Databricks UI (a minimal authentication sketch follows this list).
  • 🔍 The process of creating a job via REST API involves defining a JSON payload that includes job details such as name, tasks, and cluster configurations.
  • 🔎 The script provided in the session demonstrates how to automate job creation, triggering, and monitoring, which is part of a larger automation strategy in Databricks environments.
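The authentication flow mentioned above can be set up in a few lines of Python. The sketch below is illustrative only: the workspace URL and token are placeholders (not values from the video), and the clusters list endpoint is used purely as a connectivity check.

```python
import requests

# Placeholder values -- substitute your own workspace URL and a personal
# access token generated under User Settings > Developer > Access tokens.
HOST = "https://adb-1234567890123456.7.azuredatabricks.net"
TOKEN = "dapi-XXXXXXXXXXXXXXXXXXXX"

# Databricks REST calls authenticate with a Bearer token header.
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

# Simple connectivity check against the compute API (GET /api/2.0/clusters/list).
resp = requests.get(f"{HOST}/api/2.0/clusters/list", headers=HEADERS)
resp.raise_for_status()
print(resp.json().get("clusters", []))
```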

Q & A

  • What are the automation tools offered by Databricks?

    -Databricks offers three approaches for automation: Databricks REST API, Databricks SDK, and Databricks CLI.

  • How does the Databricks REST API work?

    -The Databricks REST API allows you to perform actions programmatically using HTTP requests. It's a universal tool that can be used from any language that supports calling REST-based APIs.

  • What is the purpose of the Databricks SDK?

    -The Databricks SDK provides language-specific libraries for Python, Scala, and Java, which can be used to interact with Databricks services in a more straightforward way than using raw REST API calls.

  • What can you do with the Databricks CLI?

    -The Databricks CLI is a command-line tool that allows you to perform operations that you can do through the UI, making it useful for scripting and automation tasks.

  • How can you automate the creation of a job in Databricks?

    -You can automate the creation of a job in Databricks by using the 'jobs create' REST API endpoint (POST /api/2.1/jobs/create), which requires a JSON payload that defines the job configuration; a minimal sketch of this call appears after this Q&A list.

  • What is the role of the 'run now' API in Databricks automation?

    -The 'run now' API (POST /api/2.1/jobs/run-now) is used to trigger the execution of a job in Databricks. It takes a job ID, plus optional parameters such as notebook parameters, to start a run of the job.

  • How can you monitor the status of a job run in Databricks using the REST API?

    -You can monitor a run's status using the 'runs get' API (GET /api/2.1/jobs/runs/get), which returns details about the run, including the life cycle state of its tasks.

  • What is the significance of the job ID and run ID in Databricks automation?

    -The job ID uniquely identifies a job in Databricks, while the run ID identifies a specific execution of that job. These IDs are crucial for tracking and managing jobs and their runs programmatically.

  • How can you automate the deployment of a Databricks project to a production environment?

    -You can automate the deployment of a Databricks project by using CI/CD pipelines that trigger on code commits, automatically build and test the code, and then deploy it to the Databricks workspace environment.

  • What is the process of generating a JSON payload for job creation in Databricks?

    -The JSON payload for job creation can be generated by manually defining the job through the UI, viewing the JSON, and copying it for use in automation scripts, or by constructing it programmatically based on the job's requirements.

  • How does the speaker demonstrate the use of Databricks REST API in the provided transcript?

    -The speaker demonstrates the use of Databricks REST API by showing how to create a job, trigger it, and monitor its status using Python code that makes HTTP requests to the Databricks REST API endpoints.
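As a rough illustration of the job-creation call discussed above, here is a hedged sketch assembled from the endpoints named in the session. The payload is a stripped-down placeholder, not the job definition used in the video.

```python
import json
import requests

HOST = "https://<your-workspace-url>"               # placeholder
HEADERS = {"Authorization": "Bearer <your-token>"}  # placeholder token

# Stripped-down placeholder job definition; a real one carries the full task
# and cluster configuration (see the payload sketch in the Outlines section).
job_payload = {
    "name": "sbit-stream-test",
    "max_concurrent_runs": 1,
    "tasks": [
        {
            "task_key": "run_sbit_notebook",
            "notebook_task": {"notebook_path": "/Workspace/Users/me/run_notebook"},
            "existing_cluster_id": "<cluster-id>",  # or define job_clusters instead
        }
    ],
}

# POST /api/2.1/jobs/create expects the job definition as a JSON body.
create_resp = requests.post(
    f"{HOST}/api/2.1/jobs/create",
    data=json.dumps(job_payload),  # the Python dict must be serialized to JSON
    headers=HEADERS,
)
create_resp.raise_for_status()
job_id = create_resp.json()["job_id"]  # a 200 response carries the new job_id
print("Created job:", job_id)
```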

Outlines

00:00

🤖 Introduction to Databricks Automation Tools

The speaker begins by introducing the session's focus on automation tools provided by Databricks. They discuss the various manual tasks that can be automated in Databricks, such as creating notebooks, clusters, and defining jobs with workflows. The speaker emphasizes the importance of automation in production environments and mentions the use of CI/CD pipelines to automate build processes, integration testing, and deployment. They introduce three main approaches for automation in Databricks: REST API, Databricks SDK, and Databricks CLI. The REST API is highlighted as the most frequently used method, with the documentation being a universal resource across all platforms. The speaker promises to provide a demo of these tools and their capabilities.

05:00

🔗 Exploring Databricks REST API

The speaker dives into the details of the Databricks REST API, explaining how it allows for automation of tasks within the Databricks workspace. They mention the various areas covered by the API, such as workspace, compute, workflows, and more. The speaker provides an example of how to use the Jobs API to list and create jobs, emphasizing the use of JSON for request and response payloads. They explain the structure of API calls, including the use of GET and POST methods, and provide a high-level overview of how to interact with the API. The speaker also demonstrates how to find specific API documentation and gives a live example of creating a cluster in Databricks.
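To make the GET/POST pattern concrete, here is a small, hedged example of the list-jobs call. The host and token are placeholders, and the response fields follow the Jobs API 2.1 documentation referenced in the session.

```python
import requests

HOST = "https://<your-workspace-url>"               # placeholder
HEADERS = {"Authorization": "Bearer <your-token>"}  # placeholder token

# GET /api/2.1/jobs/list returns a JSON document describing existing jobs.
resp = requests.get(f"{HOST}/api/2.1/jobs/list", headers=HEADERS)
resp.raise_for_status()

for job in resp.json().get("jobs", []):
    # Each entry carries the job_id plus its settings (name, tasks, schedule, ...).
    print(job["job_id"], job["settings"]["name"])
```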

10:02

📚 Automating Job Creation with REST API

The speaker illustrates how to automate the creation of a Databricks job using the REST API. They discuss the need for an integration test for a streaming application and how to create an automation test case for it. The speaker outlines the process of defining job parameters, such as the job name, schedule, tasks, and cluster configuration, within a JSON payload. They demonstrate how to use Python's 'requests' library to make a POST request to the Databricks REST API to create a job. The speaker also shows how to extract the job ID from the API response, which is crucial for further automation tasks.
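The JSON payload described here can be sketched roughly as the Python dictionary below. Every value is illustrative (the real notebook path, cluster settings, and names come from the speaker's workspace), and the dictionary is serialized with json.dumps before being POSTed to /api/2.1/jobs/create, as shown in the sketch after the Q&A section.

```python
# Illustrative job definition -- names, paths, and cluster settings are placeholders.
job_payload = {
    "name": "sbit-stream-test",
    "max_concurrent_runs": 1,
    "tasks": [
        {
            "task_key": "sbit_stream",
            "notebook_task": {
                "notebook_path": "/Workspace/Users/me/run_notebook",
                "source": "WORKSPACE",
            },
            "job_cluster_key": "sbit_job_cluster",
        }
    ],
    "job_clusters": [
        {
            "job_cluster_key": "sbit_job_cluster",
            "new_cluster": {
                "spark_version": "13.3.x-scala2.12",   # placeholder runtime
                "node_type_id": "Standard_DS3_v2",     # placeholder Azure node type
                "num_workers": 0,                      # single-node style cluster
                "spark_conf": {
                    "spark.databricks.cluster.profile": "singleNode",
                    "spark.master": "local[*]",
                },
                "custom_tags": {"ResourceClass": "SingleNode"},
            },
        }
    ],
}
```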

15:04

📝 Demonstrating Job Creation and Execution

The speaker provides a practical demonstration of creating and executing a Databricks job using the REST API. They first create the job manually in the Databricks UI, detailing the steps involved in setting up a job with a notebook task, a job cluster, and parameters, and then show how to view the job's JSON definition, which can be copied and reused for automation. The speaker also demonstrates how to trigger a job run using the REST API and how to monitor the job's status until it starts execution.
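A hedged sketch of the trigger step, using the run-now endpoint named in the session; the notebook parameters are illustrative and mirror the three inputs the speaker mentions for the run notebook.

```python
import json
import requests

HOST = "https://<your-workspace-url>"               # placeholder
HEADERS = {"Authorization": "Bearer <your-token>"}  # placeholder token
job_id = 123456789                                   # returned by /api/2.1/jobs/create

# POST /api/2.1/jobs/run-now needs at least the job_id; notebook_params are optional.
run_payload = {
    "job_id": job_id,
    "notebook_params": {            # illustrative parameters for the run notebook
        "environment": "dev",
        "run_type": "stream",
        "processing_time": "1 second",
    },
}
run_resp = requests.post(
    f"{HOST}/api/2.1/jobs/run-now",
    data=json.dumps(run_payload),
    headers=HEADERS,
)
run_resp.raise_for_status()
run_id = run_resp.json()["run_id"]  # needed later to monitor this specific run
print("Triggered run:", run_id)
```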

20:05

🔄 Monitoring Job Status and Running Test Cases

The speaker continues the demonstration by showing how to monitor the status of a job run using the REST API. They explain the use of a while loop to check the job's lifecycle state until it transitions from pending to running. Once the job starts, the speaker outlines the steps for running test cases, which include loading historical data, validating data across different layers of a data architecture, and producing additional data batches. They emphasize the importance of waiting for the job to start before running test cases to ensure data availability. The speaker also mentions the use of additional REST APIs for job cancellation and deletion after testing is complete.
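The polling logic described here can be sketched as follows. The field path tasks[0].state.life_cycle_state follows the speaker's description of the runs/get response; the host, headers, and run ID are placeholders carried over from the earlier sketches.

```python
import time
import requests

HOST = "https://<your-workspace-url>"               # placeholder
HEADERS = {"Authorization": "Bearer <your-token>"}  # placeholder token
run_id = 987654321                                   # from the run-now response

# Poll GET /api/2.1/jobs/runs/get until the single task leaves the PENDING state
# (the job cluster can take a few minutes to launch).
job_state = "PENDING"
while job_state == "PENDING":
    time.sleep(10)  # wait between status checks
    status = requests.get(
        f"{HOST}/api/2.1/jobs/runs/get",
        params={"run_id": run_id},
        headers=HEADERS,
    )
    status.raise_for_status()
    # The run has a single task here, so inspect the first element of tasks.
    job_state = status.json()["tasks"][0]["state"]["life_cycle_state"]
    print("Current state:", job_state)

# Once the state is RUNNING, the integration test cases can safely start.
```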

25:06

🔄 Practical Implementation and Future Automation Topics

The speaker concludes the demonstration by running the automation script to create and execute a job, monitor its status, and run test cases. They successfully create a job and trigger it, demonstrating the practical application of the REST API for automation in Databricks. The speaker also mentions the upcoming topics of Databricks SDK and Databricks CLI for automation, indicating that these will be covered in future sessions. They invite questions from the audience while the job is being executed, highlighting the interactive nature of the session.

Keywords

💡Databricks

Databricks is a unified data analytics platform that provides a collaborative environment for working with data. In the script, Databricks is central to the discussion as the platform where automation tools are being explored. The video aims to educate on how to automate tasks within Databricks, such as job creation and workflow management.

💡Automation

Automation refers to the process of creating technology to perform tasks with minimal human intervention. In the context of the video, automation is discussed as a means to streamline and simplify workflows on Databricks. The script mentions automating tasks like job scheduling, cluster management, and notebook execution, which are crucial for efficient data processing and analytics.

💡REST API

REST API stands for Representational State Transfer Application Programming Interface, which is a set of rules and protocols for building and interacting with APIs. The script emphasizes the use of Databricks REST API for automating tasks. It is highlighted as a tool that allows users to perform various operations programmatically, such as creating jobs and managing clusters.

💡SDK

SDK is an acronym for Software Development Kit, which is a collection of tools, libraries, documentation, code samples, and guidelines used to develop software for a specific platform or service. In the script, Databricks SDK is mentioned as a language-specific tool that can be used for automation, providing a more direct and easier way to interact with Databricks services compared to using REST APIs.

💡CLI

CLI stands for Command-Line Interface, which is a text-based interface used to interact with computer programs. The script introduces Databricks CLI as a tool for automation, allowing users to execute commands for various tasks such as job management directly from the command line, offering a flexible way to script and automate workflows.

💡Notebooks

Notebooks in Databricks refer to web-based interactive documents that allow users to combine live code, equations, visualizations, and narrative text. The script discusses creating and executing notebooks as part of the automation process, emphasizing their role in data processing and analytics workflows that can be automated for efficiency.

💡Workflows

Workflows are the orchestrated sequence of steps or tasks that must be executed in a particular order to accomplish a goal. In the video, workflows are mentioned in relation to job scheduling and automation on Databricks. They are essential for defining the sequence of operations that need to be automated, such as running a series of data processing tasks.

💡Jobs

In the context of Databricks, a job is a non-interactive unit of work, typically one or more tasks such as notebooks, that can be scheduled or triggered on demand in a workspace. The script focuses on automating the creation and management of jobs, which is crucial for automating data processing pipelines and ensuring that tasks are executed at the right time and in the correct order.

💡Clusters

Clusters in Databricks refer to a collection of computing resources like CPUs and memory allocated to execute tasks and jobs. The script discusses the automation of cluster creation and management as a way to optimize resource usage and job execution, which is vital for performance and cost efficiency in data processing.

💡Integration Testing

Integration Testing is the phase in software development where individual units are combined and tested as a group to ensure that the components work together as expected. The script mentions writing and automating integration tests, which is an important step in the deployment process to validate that the automated workflows and jobs function correctly within the Databricks environment.

Highlights

Introduction to automation tools offered by Databricks.

Overview of automating tasks in Databricks workspace such as creating notebooks, clusters, and defining jobs.

The importance of automating deployment to production environments.

Building CI/CD pipelines for automated builds and deployments in Databricks.

Automating integration tests using Databricks' automation tools.

Three approaches for automating work in Databricks: REST API, Databricks SDK, and Databricks CLI.

Explanation of Databricks REST API and its documentation.

How to use Databricks REST API for creating and managing jobs.

Demonstration of using Python to interact with Databricks REST API for job creation.

Details on creating a JSON payload for job creation using REST API.

Using REST API to trigger a job and obtain a run ID.

Monitoring job status using REST API to ensure successful start and execution.

Automating test cases and validation checks in a Databricks notebook.

Process of cleaning up after tests are completed using REST API to cancel and delete jobs.

Integration of REST API usage in a full-fledged Capstone project for end-to-end automation.

Introduction to Databricks SDK as an alternative approach for automation.

Brief on Databricks CLI as a command-line tool for automation.

Invitation for questions while waiting for a job cluster to start.

Transcripts

00:00

Okay, so in today's session I want to talk about the automation tools offered by Databricks. So what are the things that we want to automate? You have been learning Databricks, and you know that we can connect to the Databricks workspace, which gives us a browser-based UI. In the workspace we can create notebooks and write code there, we can execute those notebooks, we can create clusters, attach notebooks to a cluster and run them on the cluster, we can define jobs using Workflows, and then we can schedule the jobs to trigger automatically at a certain time, or manually go and trigger them. All of that you have already learned how to do using the Databricks UI. But in projects, everything is not done manually, right? At the end of the day, when your project is complete, you want to deploy it to the production environment, and there are a lot of things that you want to automate, or do through code.

01:08

For example, you might want to build a CI/CD pipeline for your project and automate a few things with it: as soon as you commit your code to your repository, it should automatically trigger a build, and when we say build, your pipeline should automatically pull the latest code from the repository, execute unit test cases, package everything, and then deploy all your notebooks and other code files to the Databricks workspace for your production environment. That's one kind of automation. Then maybe you have written an integration test, perhaps a notebook containing the code for integration testing, and you want to trigger that integration test automatically from the CI/CD pipeline itself. So what you want to do is create a job cluster, run your integration test as a job on that cluster, and once the integration test has executed and everything has passed, run a cleanup script. All of that you may want to do from a CI/CD pipeline. For various other reasons you might also want to write code for creating a job in your Databricks production environment; you don't want someone to go and create the job manually. So how do we automate that? What tools and capabilities does Databricks offer us for automation? That's the topic for today.

02:38

Databricks offers three approaches for automating your work. The first, and most frequently used, approach is the Databricks REST API; I'll show you where the documentation is. The next is the Databricks SDK. The REST API is a REST-based API, so you can call it from any language that supports REST calls, and practically every language does: Python, Java, Scala, all of them. These APIs are universal: you learn how to work with the REST API once and you can use the same approach from any language. Databricks also offers the Databricks SDK, which consists of language-specific SDKs: there is an SDK for Python, one for Scala, and one for Java. So the second approach is the SDK; it is used less often, but you have the option. The third approach is the Databricks CLI. It is a command-line tool, and you can use its commands to do almost everything you can do through the UI, so the Databricks CLI is another tool for automation. If we want to use the Databricks CLI, most likely we will write shell scripts that call the various CLI commands; if we want to use the REST API, most likely we will write Python code for the automation. I'll give you a quick demo of all three approaches with small examples: how to use the REST API, how to use the Databricks SDK, and how to use the Databricks CLI. A more integrated and elaborate example is given in your Capstone project, where we use the Databricks REST API to automate a few things and the Databricks CLI to build the entire automated DevOps pipeline, so you will get a full-fledged example there. Today we want to learn how to use these tools and what we can do with them.

04:28

So let's close the slides, go to the browser, and I'll point you to the documentation link so you can refer to it, because the REST API is huge. You go to the Databricks REST API documentation. This documentation is common across all platforms: whether you are working in Azure, AWS, or Google Cloud, the REST API is the same for every platform, and the documentation is also the same.

05:00

If you look at the REST API, it is broken down into different areas. You have the Databricks Workspace REST API, which lets you write code to do everything you would do in the workspace: you can create Git credentials, perform repository operations, perform secret operations (save credentials and create Databricks secrets), and inside the workspace you can get workspace object permissions and permission levels, delete a workspace object, create a directory, and so on. Everything and anything that you do through the UI has a REST API. For compute-related activities such as cluster creation, cluster policies, and cluster pools, you use the compute REST APIs. Then there are REST APIs for Workflows, Delta Live Tables, DBFS, machine learning, real-time serving, access management, Databricks SQL, Unity Catalog, Delta Sharing, tokens, and more. Everything you can do through the UI, you can do through the REST API.

06:14

Let's come to the Jobs API. The Jobs API is one of the REST APIs that lets you work with Databricks workflow jobs. You can list the jobs, and the API for that is /api/2.1/jobs/list; 2.1 is the API version, they have multiple versions and the latest one is 2.1. That is the API URL, and it is a GET API. If you know a little about REST APIs, some are GET APIs and some are POST APIs. So you get to know what the REST API for listing jobs is, and then there is an API for creating a new job: the create API, which is a POST API. If you look at the details, the documentation explains how to use it, but at a high level every API works the same way: to make a call you provide some input, that input is a JSON message that we call the request, and once the API executes it gives you a response, which is also JSON. That's how every API works; I'll show you a demo of this.

07:23

In the documentation you can see a sample of what a typical request for the create API looks like. The jobs create API will create a job in your Databricks workspace, and for creating a job you have to specify a lot of things: the job name, the schedule for the job, the different tasks in the job, and so on. All of that you specify in JSON. You don't have to write this JSON by hand; we will see how to generate it. But for creating a job, the input is a JSON message that gives the API the job definition, and the response is simple: it is a JSON response that gives you the job ID. So you use this REST API to create the job, and once the job is created it returns the job ID, which you can fetch. At the bottom you can see that 200 is the success response, so if your response code is 200 the call succeeded and you also get a job ID; it might return other HTTP codes. If you have programmed against REST APIs before, the same concepts apply here. So now you know where to look for the different kinds of APIs and the details of their inputs and outputs, because this is an exhaustive list. Now let me show you one demo of how to use it.

08:46

We'll go to our Azure account, and in Azure I already have one workspace which I created, so let's go to that workspace and look at one example there. This is my workspace. For doing anything I'll need a cluster, so let me create one: Create Compute, a single-node cluster without Photon, terminating after maybe 60 minutes. It's 0.75 DBU, so that's cheap. The cluster is creating and will be ready in a few minutes.

09:23

Now, I have this stream test notebook where I'm doing some work using the REST API, so let me explain what this notebook is. Let's assume I created an application, and once that application is done I also want to create an integration test case for it. So I created an integration test case, and my application is a streaming application: a real-time stream processing application which reads data from some source, and then I have a three-layer architecture implemented: a bronze layer, a silver layer, and then a gold layer.

play10:03

I've defined so many processes for

play10:05

bronze layer I have defined three four

play10:07

processes which will read data from a

play10:08

landing zone or from some sources

play10:10

injested into the bronze table then uh

play10:12

there are some uh processes written uh

play10:15

to uh read data from the bronze layer

play10:17

and fill the silver layer and similarly

play10:19

create the uh gold layer that's a

play10:21

typical project and now I want to write

play10:23

an automation test case for that

play10:26

streaming application and that's what

play10:27

this notebook is uh uh trying to do

play10:30

right so at high level uh what I want to

play10:32

do as an automation test is first step I

play10:35

want to create a job and Trigger that

play10:37

job and that job will be uh will uh run

play10:41

the entire workflow right but I don't

play10:43

want to go and manually create that job

play10:45

that job is like I can go to the

play10:47

workflow databas workflow and create a

play10:50

job manually but I don't want to do that

play10:52

what I want to do is I want to write

play10:53

code for creating job triggering the job

play10:56

and then executing my uh test case and

play10:59

then performing validation and then

play11:01

performing cleanup after that for

play11:03

everything that entire test I want to

play11:06

write code so that I can automate it so

play11:09

how do we do that you I want to use rest

play11:10

API for doing that so let's see what I'm

play11:13

doing so basically this notebook is uh

play11:15

taking some uh inputs at the beginning

play11:18

uh three inputs environment name host

play11:20

and access token and then taking that

play11:22

into extracting that into the python

play11:25

variable you already learned all that

play11:26

and then I have a set up notebooks

play11:28

written so I'm importing the setup

play11:30

notebook and then creating uh instance

play11:32

of the setup notebook and uh running the

play11:34

cleanup uh method from the setup module

play11:37

you will have a good sense of this uh

play11:39

when you come to the Capstone project

play11:40

because this is kind of part of the

play11:42

Capstone project your Capstone project

play11:44

so but I'm using it for the demo so

play11:47

cleanup method will clean everything

play11:48

clean the environment remove everything

play11:50

and after the cleanup is done I what I

play11:52

want to do I want to create a workflow

play11:53

job and Trigger that job uh so for

play11:57

creating a workflow job I need to use

play11:59

the uh rest API and this is the rest API

play12:03

for creating a job so api.

play12:05

22.1 jobs. create this is the call I

play12:08

want to make and it's a post call I'm

play12:10

using python so how do I do it in Python

play12:12

in Python uh there is a request package

play12:15

which I will import there is one more

play12:17

package Json because I'll be passing

play12:19

argument as a Json response will come as

play12:20

a Json uh so I need to uh handle some

play12:23

Json operations so I'm importing Json

play12:25

package from python this is this is

play12:26

nothing to do with these spark these are

play12:27

pure python packages so I import that

play12:29

and then using the requests I'm making a

play12:32

post call right request. poost that's

play12:34

how we make a rest API call in Python so

play12:37

request. poost and why post because this

play12:40

is a post method if it is a get method

play12:42

I'll use request.get so request. poost

play12:45

and request. poost takes three arguments

play12:47

first is the URL for the rest API so URL

play12:50

should be host name my workpace host

play12:54

name right in which datab workpace I

play12:57

want to run this

play12:59

um API right so host name uh which is a

play13:02

variable I created here in the beginning

play13:04

host name so I'll take the host name as

play13:06

an input for this notebook so host name

play13:09

plus the rest API sorry plus the rest

play13:12

API rest API URL you already learned

play13:14

from the documentation this is the URL

play13:16

so uh SL API 2.1 jobs create so that's

play13:20

your rest API and then next argument is

play13:23

uh the input parameter the input Json

play13:26

right so we know that rest API takes

play13:28

this kind of Json uh which defines the

play13:30

job what job I want to create right so

play13:32

which defines the job so I'm passing the

play13:35

Json and that Json payload I already

play13:38

defined here so if you look at the Json

play13:39

payload this is my job definition right

play13:42

this is my job definition from there to

play13:44

here so how it looks what is the name of

play13:46

the job do I need email notification no

play13:49

web hooks no timeout no Max concurrent

play13:51

runs I want one what is the task in the

play13:53

that job so task name is is as bit

play13:57

stream and it's a notebook task ask so

play13:59

notebook can be found at this place so

play14:01

in the work space so what I want this

play14:03

job to do is to run this notebook right

play14:05

and uh which cluster this notebook

play14:08

should run so it should run on the job

play14:09

cluster and then I Define job cluster

play14:11

here and for job cluster uh spark

play14:13

version should be this and maybe it's a

play14:16

single node cluster so spark Master

play14:18

should be this right and all those

play14:21

definitions are uh defined here for the

play14:23

cluster and that's how I Define the job

play14:26

what job I want to create right uh but

14:28

But how do I get this JSON? Either you know all the JSON syntax from the documentation, where everything is defined, or one easy way is to go to Workflows. I know I want to automate the job creation, but let's first see how we would create the same job manually. I go to Create Job, give the job name, let's say sbit stream test, and then I want only one task in that job, so I create one task, "run sbit notebook". I give the task name, the type is Notebook, the source is the workspace, and for the path I go to the notebook, since I know where it can be found, and select it. The task is defined. Where do I want this job to run? On a job cluster, yes, so let me edit that job cluster: I don't want this big job cluster, I want a single node, because my job is small and a single node should work. That's all; this is my definition for the cluster, so confirm it. So I manually defined the job. Dependent libraries? No, I don't want to install any libraries. What are the parameters? For notebook parameters, let me go to the workspace and check my notebook: this run notebook takes three parameters, environment name, run type, and processing time, but they all come with default values. So let me take one parameter whose default is "once" and set it to something else: in the job creation I can say this is the parameter name and the value should be "continuous", not the default. And with that I have defined the job; that's how we do it using the UI.

16:29

Let me create this job. I'm not going to run it; I just created the definition. Then I can come here, choose View JSON, and this is the JSON for the job definition, so I can copy it and use it in my automation when I'm defining the job payload, the input. This create job API requires an input JSON; you can prepare it manually, but nobody does that. What we do is generate the JSON from the UI, copy it, and use it in our automation script, in our code. I'm not going to replace anything now, because I already have a similar JSON in the notebook, and that's how we define the JSON. Now let me cancel this, come back to Jobs, and delete this job so we have a clean slate.

play17:30

Json and once you have the Json it's

play17:32

simple uh what I'm doing is uh create uh

play17:36

uh request. poost passing the URL

play17:41

passing the Json uh for converting this

play17:44

payload variable into a valid Json this

play17:47

looks like Json but it is not a Json

play17:48

right it uh it is a python dictionary

play17:51

object so we need to convert that job

play17:54

payload into a Json so for that I'm

play17:55

using json. dumps I'm passing the

play17:57

variable here which will converted it

play17:59

into a valid Json and then for running

play18:01

the rest API we also need to provide the

play18:03

authentication token right so Au token

play18:05

uh last parameter which I'm taking as a

play18:08

uh input for this notebook which can be

play18:11

supplied here right so what we will do

play18:13

what this code is doing is making a call

play18:16

to the create job um rest API and once

play18:21

this is executed we take the response in

play18:23

the create response and from that create

play18:25

response using the Json uh method I can

play18:28

take the job ID right so this is uh

play18:30

python code uh to take out to pass the

play18:33

response Json and take out the element

play18:34

whatever element you want right so job

play18:36

ID element element I want in return so

play18:38

I'll take that job ID into the into a

play18:39

variable and print the job ID and that's

play18:42

how we use the rest API so you uh and uh

play18:46

I'm using some more rest apis here like

play18:48

next is

play18:51

uh jobs run now rest API if you come to

play18:53

the documentation uh create job API you

play18:55

saw how to use it uh maybe

play18:59

run

play19:00

job um API where it is list job get

play19:03

single job trigger a new job so trigger

play19:06

a new job is a post API uh this is the

play19:08

URL so once I created the job it will

play19:11

only create the job in this databas

play19:12

workflow uh right but we need to trigger

play19:14

the job so to run that job uh we have

play19:16

another API run Now API as per the

play19:18

documentation here right so I'm making

play19:20

another call to run API and for running

play19:23

the API I need input uh run API takes uh

play19:26

job ID at the minimum to run the given

play19:30

job ID there are a lot of other things

play19:31

you can uh provide but bare minimum is

play19:33

the job ID uh it will run once if you

play19:36

want to schedule it on a regular

play19:37

interval and all that you have to

play19:38

provide all the details so this run

play19:41

payload Json I have built here which

play19:43

gives only job ID and uh some notebook

play19:45

params parameters

play19:47

so I'm passing those notebook you you

play19:49

saw my run note this job is supposed to

play19:51

run the Run notebook and run notebook

play19:53

takes three arguments right so I can

play19:54

also pass those arguments from here so

play19:56

environment I'm passing environment run

play19:58

type I'm passing a streaming processing

play20:00

time I'm passing one second so these

play20:01

three arguments I'm passing and then

play20:02

making a call and once it is executed

play20:05

I'm taking a response back in the Run

play20:06

response variable and out of from this

play20:08

run response I'm taking the Run ID so

play20:10

this code will give me a run ID I'll

play20:11

print it why I need a run ID because I

play20:13

want to monitor I created a job I took

play20:15

the job ID and then using this job ID

play20:18

I'm running the job uh here passing the

play20:21

job ID in the Json input and then took

play20:23

the Run ID and using that run ID I'm

play20:26

waiting for the status of the job so

20:28

Another REST API I'm calling here is jobs runs get, "get a single job run". This API gives me the status of the current run, or whichever run we want, and I want to monitor this job because triggering it will launch a new job cluster and then start the job. Until that cluster is created and the job status changes from pending to running, I want to wait, and that's why I created a while loop here. In the while loop I sleep for 10 seconds, then get the job status and take the response into a status variable, and from the status I take out tasks[0].state.life_cycle_state, which I learned about from the documentation of the response. If you look at the response sample, once the run exists the get call returns everything in the response, and what I want to track is tasks, state, and life cycle state. Tasks is an array, because a job can have multiple tasks; I know my job has a single task, so I take the first element of the tasks array and look at its state. Inside the state there is the life cycle state, which will be pending in the beginning, later it will be running, and finally, once I terminate the job, it will be terminated. So I take the life cycle state into the job state variable and keep looping while it is pending. That's how we build the logic.

22:07

Once my job has started, I want to run my test cases. In the test cases I want to load some historical data; these are the packages I import, the code is here, and these are the object creations. Once the job has started I call produce first batch of data, then validate first batch of data, then maybe sleep for two minutes so that the data is picked up by the bronze layer, in fact by all three layers. Then I validate the bronze layer, call the validate function for the silver layer, call the validate function for the gold layer, then produce a second batch of data, and so on. If all of this runs successfully, my integration test is assumed to have passed; if it fails, you will get the error. Once everything is done, I use one more REST API to cancel the job, because it's a test job; once testing is done I want to cancel that job, and then I also want to delete it, so for that I call further REST APIs, and at the end I print a success message. That's how I automated it; if I run it, you will see everything happening.
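A hedged sketch of the cleanup calls mentioned here, using the cancel and delete endpoints of the Jobs API; the run and job IDs are placeholders standing in for the values captured earlier.

```python
import json
import requests

HOST = "https://<your-workspace-url>"               # placeholder
HEADERS = {"Authorization": "Bearer <your-token>"}  # placeholder token
run_id, job_id = 987654321, 123456789                # captured from earlier responses

# Cancel the still-running test run (POST /api/2.1/jobs/runs/cancel) ...
requests.post(
    f"{HOST}/api/2.1/jobs/runs/cancel",
    data=json.dumps({"run_id": run_id}),
    headers=HEADERS,
).raise_for_status()

# ... then remove the temporary job definition (POST /api/2.1/jobs/delete).
requests.post(
    f"{HOST}/api/2.1/jobs/delete",
    data=json.dumps({"job_id": job_id}),
    headers=HEADERS,
).raise_for_status()
```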

23:17

So let's run it; let me close everything else. To run it I need to attach the notebook to a running cluster and provide these inputs. Dev is fine for the environment. For the Databricks workspace URL, I want to create the job in this workspace itself, but if you wanted to create the job in your QA workspace you would provide the QA workspace URL as the input. Since I want to run it in the same workspace, I provide the URL of this workspace: the URL runs from here up to this point, and the rest are URL arguments, so you copy it and paste it here; the slash is probably not required at the end. That's your workspace URL. This code will create a job, and for creating a job you also need to authenticate, and for authentication we need a token. We already know where to create a token: you go to user settings, and the UI keeps changing, so they have changed the user settings page; now, under the Developer menu, you will find Access tokens, then click Manage. I already have two tokens I created, so let me delete them, I don't need them, and generate a new token. You can give it a comment: this is temporary, I will delete it later, and the lifetime for this token is one day only. The token is created, so I copy it and paste it here as an argument. Now we are ready to run.

24:38

I can run the cells one by one, or run all of them; the notebook will run and you will see everything in action, but let's go one by one so that we can see what is happening. The variables are defined and we have them in Python, then the setup is imported, and then I run the cleanup. The cleanup module cleans the environment and prepares it for running my integration test; that's part of a typical project, since almost every project needs a cleanup script if you are doing automated testing. So cleanup is done, then I define my job payload, and from here I can start making calls to the REST API. As soon as I execute that, this code will create a new job. Let's open Workflows and check: we don't have any job here, so if this works correctly I should automatically get a job created here. Let me run it. Done, job created, and this is the job ID. If you come here you can see the sbit stream job has been created; it's not running, but the job is created. You can click on it and see that it says Run Now; it's not running. Go to the task: there is one task definition with all the notebook details and the job cluster configuration, everything is in place. So the job is created but not running, and I have code to run it too, so here is an example of how to run a job using the REST API. Let me run this; the job has started, and here is the run ID. If you come here you can confirm it: the job run shows here, it's running, and it's in the pending state because it is launching a job cluster, which takes maybe four or five minutes.

26:19

So let's run this part, where we keep waiting until the job cluster is created and the job is in the running state. We know the status is pending, and that's what we are tracking here: this will wait for 10 seconds and print pending, wait another 10 seconds, check the status again, print pending, and keep waiting until the job starts, because until the job cluster is created and the job has started there is no point in running our test cases; they would certainly fail, since they would not find any data and all the validations would fail. This is how we wait for the job to start; as soon as it has started, we come out of the loop. Then we can run the rest, it will perform the validation, and at the end I have a script to do the cleanup. That's how we use the REST API.

27:07

I hope this made sense and that you have become familiar with how to use the REST API for automating things in the Databricks environment. We will use this technique fully, with a proper end-to-end example, in our Capstone project; we are nearing the point where we should start talking about the Databricks Capstone project, so you will learn about that soon. We are left with two more approaches for automation, the Databricks SDK and the Databricks CLI. We have already used a lot of time, so it may not be possible to cover the SDK and CLI today; maybe I'll cover them in the next session. And while the notebook is waiting for the cluster to start, we can take some questions.



Related Tags

Databricks, Automation, REST API, SDK, CLI, Workflow, DevOps, Python, Integration Testing, Job Scheduling