Dagster Crash Course: develop data assets in under ten minutes

Dagster
10 Oct 2022 · 28:38

Summary

TL;DR: In this video, Pete takes you on a crash course for building an ETL pipeline with Dagster, a data orchestration platform. Starting from scratch, he guides you through the process of fetching data from the GitHub API, transforming it, visualizing it in a Jupyter notebook, and uploading the notebook as a GitHub Gist. Along the way, he covers key Dagster concepts like software-defined assets, resources for managing external dependencies, testing strategies, and scheduling pipelines to run automatically. By the end, you'll have a solid understanding of how Dagster streamlines data workflows and empowers you to build robust, testable, and production-ready ETL pipelines with ease.

Takeaways

  • 🔑 Dagster is a tool for building data pipelines and ETL (Extract, Transform, Load) workflows as a DAG (Directed Acyclic Graph) of software-defined assets.
  • 📦 Dagster provides a command-line interface and UI for scaffolding projects, managing dependencies, and running pipelines.
  • 🧩 Software-defined assets are Python functions that represent data assets (e.g., reports, models, databases) and can be interconnected to form a pipeline.
  • 🔄 Dagster caches intermediate computations, allowing efficient re-execution and iteration on specific pipeline steps.
  • ⚙️ Dagster's resource system facilitates secure configuration, secret management, and test-driven development through dependency injection.
  • 📅 Dagster supports scheduling pipelines to run at regular intervals, enabling automated data workflows.
  • 🔬 The presenter demonstrated building an ETL pipeline to fetch GitHub repository stars, transform the data, visualize it in a Jupyter notebook, and publish the result as a GitHub Gist.
  • 🧪 Test-driven development is encouraged in Dagster, with utilities for mocking external dependencies and asserting pipeline outputs.
  • 📚 Extensive documentation and tutorials are available to learn more about Dagster's features and best practices.
  • 🌟 The presenter encouraged users to explore Dagster further and contribute to the open-source project by starring the repository.

Q & A

  • What is the purpose of this video?

    -The video provides a crash course on how to build an ETL (Extract, Transform, Load) pipeline using Dagster, a data orchestration platform.

  • What is the example pipeline demonstrated in the video?

    -The example pipeline fetches GitHub stars data for the Dagster repository, transforms the data into a week-by-week count, creates a visualization in a Jupyter notebook, and uploads the notebook as a GitHub Gist.

  • What is the role of software-defined assets in Dagster?

    -Software-defined assets are functions that return data representing a data asset in the pipeline graph, such as a machine learning model, a report, or a database table. These assets can have dependencies on other assets, forming a data pipeline.

  • How does Dagster handle caching and reusing computations?

    -Dagster uses a system called IO managers to cache the output of computations in persistent storage like local disk or S3. This allows reusing cached data instead of recomputing it, improving efficiency and iteration speed.

  • What is the purpose of the resources system in Dagster?

    -The resources system in Dagster allows abstracting away external dependencies, like API clients, into configurable resources. This enables testability, secret management, and swapping in test doubles for external services.

  • How does the video demonstrate secret management?

    -The video shows how to move the GitHub API access token out of the source code and into an environment variable, which is then read by the GitHub API resource. This prevents secrets from being stored in the codebase.

  • What is the role of the test demonstrated in the video?

    -The test is a smoke test that verifies the happy path of the pipeline by mocking the GitHub API and asserting expected outputs from the software-defined assets. It demonstrates Dagster's testability features.

  • How does Dagster enable scheduling pipelines?

    -Dagster allows defining jobs and schedules within a repository. The video demonstrates adding a daily schedule to the ETL pipeline job, which can then be run automatically by Dagster's scheduler daemon.

  • What is the role of the Dagster UI in the development process?

    -The Dagster UI provides a visual interface for launching and monitoring pipeline runs, inspecting asset metadata, and managing schedules. It aids in the development and operation of Dagster pipelines.

  • What are some potential next steps after completing this tutorial?

    -The video suggests exploring Dagster's documentation, tutorials, and guides further, as the tutorial covers only a basic introduction. There are more advanced features and best practices to learn for production-ready Dagster pipelines.

Outlines

00:00

📽️ Introduction to Building an ETL Pipeline with Dagster

Pete, an employee at Dagster, introduces the video and provides context. He explains that the video serves as a companion to a blog post, offering a crash course on building an ETL (Extract, Transform, Load) pipeline with Dagster. The completed code is available on GitHub, and Pete will be using an in-browser editor called Gitpod to start from scratch. The objective is to create a report visualizing GitHub stars over time for the Dagster repository, utilizing the GitHub API, Dagster's software-defined assets, and the Jupyter API.

05:02

🔧 Setting Up the Development Environment and Fetching GitHub Data

Pete installs Dagster and its dependencies, explaining the project structure and components. He then creates a software-defined asset to fetch stargazers data from the GitHub API using a GitHub token. This asset returns the raw API response containing timestamps and usernames of stargazers.
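For orientation, here is a minimal sketch of what that first asset can look like. The asset name, target repository, and placeholder token are assumptions based on the video (the exact code in the companion blog post may differ), and the hard-coded token is moved out of the source code later in the tutorial.

```python
from dagster import asset
from github import Github  # PyGithub

ACCESS_TOKEN = "ghp_..."  # placeholder only; a real token must never live in source code

@asset
def github_stargazers():
    # Raw GitHub API response: one record per stargazer, carrying the
    # username and the exact timestamp at which they starred the repo.
    return list(
        Github(ACCESS_TOKEN)
        .get_repo("dagster-io/dagster")
        .get_stargazers_with_dates()
    )
```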

10:04

🔄 Transforming and Visualizing the GitHub Data

Pete creates another software-defined asset to transform the raw GitHub stargazers data into weekly counts. He then uses a Jupyter notebook to visualize the transformed data, creating a software-defined asset that authors the notebook as a markdown string (via jupytext), executes it, and writes the executed notebook out in a format readable by Jupyter. Finally, he creates an asset that uploads the notebook as a GitHub Gist so the visualization can be shared with stakeholders.
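A sketch of the transformation step, under the same assumptions (the asset and column names are illustrative; pandas does the aggregation):

```python
from datetime import timedelta

import pandas as pd
from dagster import asset

@asset
def github_stargazers_by_week(github_stargazers):
    # The parameter name matches the upstream asset, which is how Dagster
    # infers the data dependency between the two assets.
    df = pd.DataFrame(
        [
            {
                "user": stargazer.user.login,
                # Truncate each star's timestamp to the Monday of its week.
                "week": stargazer.starred_at.date()
                - timedelta(days=stargazer.starred_at.weekday()),
            }
            for stargazer in github_stargazers
        ]
    )
    # Count stars per week and order the result chronologically.
    return df.groupby("week").count().sort_index()
```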

15:05

🚀 Running the Pipeline and Introducing Schedules

Pete demonstrates running the pipeline in the Dagster UI by materializing all assets. He highlights the caching mechanism that avoids redundant data fetching. Pete then introduces schedules, modifying the code to create a daily job that refreshes all assets. He shows the new daily schedule in the Dagster UI but notes that the daemon for running schedules is not running in this example.
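A sketch of the repository edit that adds the job and schedule. The job name, cron expression, and module layout are assumptions, and the resource wiring added later in the video is omitted here.

```python
from dagster import (
    ScheduleDefinition,
    define_asset_job,
    load_assets_from_package_module,
    repository,
)

from my_dagster_project import assets  # the scaffolded assets package

# "*" selects every asset in the project for this job.
daily_refresh_job = define_asset_job("daily_refresh", selection="*")

# Ask the Dagster daemon to kick the job off once a day at midnight.
daily_schedule = ScheduleDefinition(job=daily_refresh_job, cron_schedule="0 0 * * *")

@repository
def my_dagster_project():
    return [
        *load_assets_from_package_module(assets),
        daily_refresh_job,
        daily_schedule,
    ]
```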

20:05

🔐 Addressing Production Readiness: Secrets Management and Testing

Pete identifies two issues to address before considering the project production-ready: the hardcoded GitHub token (secret) in the source code and the lack of tests. He introduces the concepts of resources and configuration in Dagster, refactoring the code to use a resource for the GitHub API client and configuring it to read the token from an environment variable. This removes the secret from the source code.
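A sketch of that resource refactor, assuming a resource key of `github_api` and an environment variable named `GITHUB_ACCESS_TOKEN` (both illustrative choices); the snippet collapses what the video spreads across the resources module, the assets, and the repository:

```python
from dagster import StringSource, asset, resource, with_resources
from github import Github

# resources.py -- a resource definition wrapping the PyGithub client; the token
# now arrives via config instead of living in the source code.
@resource(config_schema={"access_token": StringSource})
def github_api(init_context):
    return Github(init_context.resource_config["access_token"])

# An asset declares the resources it needs by key and reaches them through the
# context object -- effectively dependency injection.
@asset(required_resource_keys={"github_api"})
def github_stargazers(context):
    return list(
        context.resources.github_api.get_repo("dagster-io/dagster")
        .get_stargazers_with_dates()
    )

# repository.py -- bind a configured instance of the resource to the assets;
# StringSource reads the secret from the named environment variable at runtime.
assets_with_resources = with_resources(
    [github_stargazers],
    {
        "github_api": github_api.configured(
            {"access_token": {"env": "GITHUB_ACCESS_TOKEN"}}
        )
    },
)
```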

25:06

🧪 Writing Tests for the ETL Pipeline

Pete demonstrates writing unit tests for the ETL pipeline using Dagster's testing utilities and the Python mock library. He creates a smoke test that simulates GitHub stargazers data, mocks the GitHub API, and asserts the expected behavior of the software-defined assets. Running the tests locally verifies their correctness without interacting with external services.
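A sketch of such a smoke test, carrying over the asset names, resource key, and column name assumed in the earlier sketches:

```python
from datetime import datetime
from unittest.mock import MagicMock

from dagster import materialize_to_memory

from my_dagster_project.assets import (  # hypothetical module path
    github_stargazers,
    github_stargazers_by_week,
)

def test_smoke():
    # Fake stargazer records: two stars in the same January week, one in February.
    fake_stargazers = [
        MagicMock(user=MagicMock(login="user1"), starred_at=datetime(2021, 1, 1)),
        MagicMock(user=MagicMock(login="user2"), starred_at=datetime(2021, 1, 2)),
        MagicMock(user=MagicMock(login="user3"), starred_at=datetime(2021, 2, 1)),
    ]

    # Mock the GitHub client so the read path never touches the real API.
    github_api_mock = MagicMock()
    github_api_mock.get_repo.return_value.get_stargazers_with_dates.return_value = (
        fake_stargazers
    )

    # Materialize the assets in memory, overriding the resource with the mock.
    result = materialize_to_memory(
        [github_stargazers, github_stargazers_by_week],
        resources={"github_api": github_api_mock},
    )

    assert result.success
    by_week = result.output_for_node("github_stargazers_by_week")
    # Two stars in the first week, one in the second.
    assert list(by_week["user"]) == [2, 1]
```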

Keywords

💡ETL Pipeline

An ETL (Extract, Transform, Load) pipeline is a process that involves extracting data from various sources, transforming or processing the data into a desired format, and then loading it into a target destination, such as a data warehouse or a data lake. In the context of this video, the speaker is demonstrating how to build an ETL pipeline using Dagster, a data orchestration platform, to fetch data from the GitHub API, transform it, and visualize it in a notebook.

💡Software-Defined Assets

Software-defined assets are functions in Dagster that represent data assets, such as a machine learning model, a report, or a database table. These functions return data that represents the asset in the computational graph. In the video, the speaker creates software-defined assets like `github_stargazers` and `github_stargazers_by_week` to fetch and transform data from the GitHub API.

💡Dagster

Dagster is a data orchestration platform that allows developers to build and manage data pipelines. It provides a user interface, a scheduling system, and a set of tools for defining, executing, and monitoring data pipelines. The video is a tutorial on how to use Dagster to build an ETL pipeline for visualizing GitHub stars over time.

💡Data Dependencies

Data dependencies refer to the relationships between different data assets or operations in a pipeline. In Dagster, these dependencies are declared with the `asset` decorator by giving an asset function a parameter whose name matches an upstream asset. For example, the `github_stargazers_by_week` asset depends on the `github_stargazers` asset, so Dagster ensures that `github_stargazers` runs first and passes its output to `github_stargazers_by_week`.

💡Caching

Caching is a technique used to store the results of computations or data fetches, so that they can be reused later without having to recompute or refetch the data. In the video, the speaker mentions that Dagster uses a caching system called I/O managers to store the output of assets, which can significantly speed up subsequent runs of the pipeline, especially for expensive operations like fetching data from the GitHub API.
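As a hedged illustration, the default filesystem I/O manager can be reconfigured (or swapped for an S3-backed one in production). The base directory and the asset import path below are assumptions carried over from the earlier sketches:

```python
from dagster import fs_io_manager, with_resources

from my_dagster_project.assets import (  # hypothetical module path
    github_stargazers,
    github_stargazers_by_week,
)

# Assets persist their outputs through the resource bound to the "io_manager"
# key; here we keep the built-in filesystem I/O manager but point it at a
# specific directory. Binding an S3-backed I/O manager (e.g. from dagster-aws)
# under the same key is how cached outputs can survive restarts in production.
assets_with_io = with_resources(
    [github_stargazers, github_stargazers_by_week],
    {"io_manager": fs_io_manager.configured({"base_dir": "/tmp/dagster-assets"})},
)
```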

💡Resources

Resources in Dagster are objects that represent external systems or services that your assets interact with, such as databases, APIs, or cloud storage. The video demonstrates how to use resources to abstract away the GitHub API client, making it easier to configure and swap out with a test double for writing tests.

💡Secrets Management

Secrets management refers to the practice of securely storing and managing sensitive information, such as API keys, passwords, or access tokens. In the video, the speaker shows how to use Dagster's resources and configuration system to read the GitHub API access token from an environment variable, avoiding the insecure practice of hard-coding secrets in the source code.

💡Testing

Testing is the process of evaluating software to ensure that it meets its requirements and works as expected. In the video, the speaker demonstrates how to write tests for Dagster assets using pytest, Python's `unittest.mock` library, and the `materialize_to_memory` utility provided by Dagster. This approach allows testing assets in isolation, without making actual API calls or triggering external side effects.

💡Schedules

Schedules in Dagster are used to define when and how often a pipeline or job should run. In the video, the speaker shows how to create a daily schedule for the ETL pipeline, which would refresh the GitHub stars data and generate a new visualization on a daily basis. Schedules are managed by the Dagster daemon process.

💡Data Visualization

Data visualization is the graphical representation of data or information, often using charts, plots, or other visual aids. In the context of the video, the speaker creates a Jupyter notebook as part of the ETL pipeline to visualize the GitHub stars data over time, using libraries like Matplotlib and Pandas. The notebook is then uploaded to a GitHub Gist for sharing with stakeholders.
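A sketch of the notebook and Gist assets. The video authors the notebook as a markdown string with jupytext; this sketch uses jupytext's percent format to keep the example compact, and the asset names, Gist file name, and resource key are assumptions:

```python
import pickle

import jupytext
import nbformat
from dagster import asset
from github import InputFileContent
from nbconvert.preprocessors import ExecutePreprocessor

@asset
def github_stars_notebook(github_stargazers_by_week):
    # Author the notebook as text, pickling the dataframe into a cell so the
    # executed notebook can plot it without refetching anything.
    source = (
        "# %% [markdown]\n"
        "# GitHub stars by week\n"
        "# %%\n"
        "import pickle\n"
        f"df = pickle.loads({pickle.dumps(github_stargazers_by_week)!r})\n"
        "df.plot()\n"
    )
    nb = jupytext.reads(source, fmt="py:percent")  # text -> notebook object
    ExecutePreprocessor(timeout=600, kernel_name="python3").preprocess(nb, {})  # run the cells
    return nbformat.writes(nb)  # serialize as standard .ipynb JSON

@asset(required_resource_keys={"github_api"})
def github_stars_notebook_gist(context, github_stars_notebook):
    # Upload the executed notebook as a private Gist and log its URL.
    gist = context.resources.github_api.get_user().create_gist(
        public=False,
        files={"github_stars.ipynb": InputFileContent(github_stars_notebook)},
    )
    context.log.info(f"Notebook gist created at {gist.html_url}")
    return gist.html_url
```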

Highlights

Dagster is a framework for building ETL pipelines, where data assets are defined as software-defined assets and linked together to form a data pipeline.

Software-defined assets are functions that return data representing a data asset, such as a machine learning model, report, or database table.

Dagster uses an asset decorator to mark functions as software-defined assets, allowing it to understand the relationships and dependencies between assets.

Dagster caches intermediate computations, enabling faster iteration and bug fixing by reusing cached results instead of recomputing from scratch.

Dagster provides a resource system to abstract external dependencies, like API clients, enabling better testability and secrets management.

Resources can be configured separately from the assets, allowing different configurations for different environments or assets.

Dagster's resource system supports dependency injection, making it easier to swap out resources with test doubles for better testing.

Dagster provides a `StringSource` config type that can read secrets from environment variables, avoiding hard-coding secrets in source code.

Dagster provides utilities for writing smoke tests, which test the happy path and common cases of the pipeline.

Tests can mock out external dependencies like API calls, allowing tests to run quickly without talking to external systems or triggering real-world effects.

Dagster's materialize_to_memory function allows testing assets in memory, overriding resources with test doubles as needed.

Dagster supports defining jobs, which are collections of assets to be materialized together, and scheduling those jobs to run periodically.

Dagster provides a UI for visualizing assets, launching runs, and monitoring schedules.

Dagster uses a workspace.yaml file to configure the code locations it should load.

Dagster supports various Python environments and operating systems, making it easy to get started with a simple command line tool.

Transcripts

play00:00

hi I'm Pete I work on dagster and today

play00:03

I'm going to take you through a quick

play00:04

crash course for how to build an ETL

play00:06

Pipeline with dagster this is a

play00:09

companion video to a blog post that we

play00:11

wrote you can go check it out here to

play00:13

get some additional context and links

play00:15

to various documentation if you want to

play00:17

just skip straight to the code the

play00:19

completed code is available here at this

play00:20

GitHub repo and finally as we go along

play00:23

I'm going to be using an in-browser

play00:25

editor called gitpod and starting from

play00:27

scratch and then building it up from

play00:29

there so you can just go here to get

play00:32

started with that it's free and it makes

play00:35

getting spun up with a python

play00:36

environment very easy just to set the

play00:39

context here we're going to create this

play00:41

report and store it in GitHub gist this

play00:44

is going to be created from an IPython

play00:46

notebook and it's going to visualize

play00:48

GitHub Stars over time for the dagster

play00:51

repo

play00:52

so we're going to use the dagster API or

play00:54

sorry the GitHub API and dagster software

play00:57

defined assets and the Jupyter API to

play01:00

do all this

play01:02

um so to get started I've got this Cloud

play01:04

development environment here it's using

play01:05

a service called gitpod there's a link

play01:07

in the in the blog post it's a free

play01:09

Cloud development environment that

play01:10

guarantees a pretty stable and specific

play01:13

python version you can use whatever

play01:15

python environment you want dagster

play01:16

works on on you know Mac Linux and

play01:18

windows but I'm just going to use this

play01:20

for consistency so the first thing we're

play01:22

going to do is PIP install dagster

play01:24

this installs our command line tools

play01:27

that help you scaffold out a project so

play01:30

we're just going to use the the default

play01:32

simple project but there are a number of

play01:35

different templates that you can use

play01:37

um by accessing the help command but I'm

play01:39

just going to say dagster project

play01:40

scaffold name my dagster project

play01:47

and you can see here that we've created

play01:49

a project

play01:51

in this uh in this workspace so if I CD

play01:55

my dagster project I'm going to install

play01:57

the dependencies that this example

play01:58

project needs so this is just how you do

play02:01

it while it installs we'll give you a

play02:03

little tour of

play02:04

um of what's in here we have a little

play02:07

readme describing the project linking to

play02:09

the documentation we've got our normal

play02:11

kind of python setup.py where we

play02:13

list all the dependencies that we need

play02:15

this workspace.yaml tells dagster when

play02:18

we run it locally where to find the code

play02:21

so dagster looks for this workspace.yaml

play02:24

file to figure out what code load and

play02:27

then we've got my dagster project which

play02:29

contains

play02:31

um you know the the source code for our

play02:33

project uh we'll be doing most of our

play02:35

work in there and then we have the my

play02:36

dagster project tests uh package which

play02:39

contains all the unit tests uh for our

play02:41

project

play02:43

so I'm going to start dagster

play02:45

using this command dagit there's

play02:47

actually two components to dagster there

play02:50

is the UI and then there's the Daemon

play02:51

that runs the schedules because we're

play02:54

not going to be using schedules

play02:55

um right now I'm just going to launch

play02:57

the UI and so you can see here that

play02:59

we've got this this empty UI because we

play03:02

haven't done anything yet but you just

play03:04

run that one command dagit it looks at

play03:06

that workspace.yaml file and it loads up

play03:09

your UI

play03:11

um so like I said uh we are building

play03:14

that um GitHub Stars dashboard so we

play03:16

need a number of dependencies in order

play03:19

to do that

play03:20

um so in order so the way we add

play03:22

dependencies is it's like any other

play03:24

python project we just um update this

play03:27

install_requires part of the setup.py

play03:29

and then we rerun the install steps this

play03:32

pip install -e command

play03:35

and so this is going to install PyGithub

play03:36

so we can access the GitHub API

play03:39

I'm going to install matplotlib which

play03:41

will help us visualize

play03:43

um what's going on with the GitHub Stars

play03:45

pandas is our data frame which is how we

play03:47

manipulate and transform the data and

play03:49

then these four packages are what's

play03:50

needed in order to render a notebook

play03:55

um so the first thing we're going to do

play03:56

is we're going to want to fetch the raw

play03:59

data from GitHub so this is going to

play04:01

create

play04:02

um you know basically the raw response

play04:03

from the GitHub API

play04:05

so

play04:06

um we're going to use software-defined

play04:08

assets to do that we're going to use

play04:09

software-defined assets to begin to

play04:13

build our application

play04:15

so I um I just copied and pasted this in

play04:17

before the cut

play04:19

this is our example

play04:22

um you know software-defined asset for

play04:25

fetching the GitHub star gazers from the

play04:27

GitHub API

play04:28

and so

play04:30

um I'm going to actually put a real

play04:31

GitHub token in here

play04:33

by the time you see this video I'm gonna

play04:35

have deleted it so you can't use it for

play04:37

anything

play04:38

um obviously inlining a token

play04:41

um or any sort of Secret In Your source

play04:42

code is a really bad idea

play04:44

we will fix that by the end of this

play04:45

tutorial but for now we're just gonna

play04:46

put that in there and we're going to

play04:48

create what's called a software defined

play04:50

asset software defined asset is a

play04:52

function that returns

play04:54

um some data that represents an asset a

play04:58

data asset in your graph so this could

play04:59

be a machine learning model a report or

play05:02

a database table in this case we're just

play05:04

returning um the response from the

play05:06

GitHub API so we instantiate the PyGithub

play05:08

client we pass it the access

play05:11

token

play05:12

we do a little bit of function calls to

play05:14

get the star gazers with the dates this

play05:16

is effectively like the username and

play05:18

then the date that they starred the repo

play05:20

and that the exact time stamp that they

play05:22

starred the repo and marking it with an

play05:24

asset decorator

play05:26

um indicates to dagster that this is a

play05:28

software-defined asset

play05:30

um so uh the next thing we're going to

play05:32

need to do

play05:34

is now that we have that raw API

play05:36

response from GitHub we're going to need

play05:38

to transform that into week by week

play05:41

counts so we need to go from these pairs

play05:43

of timestamps and usernames to the

play05:47

number of unique users that have starred

play05:49

the repo in a given week

play05:52

and so I'm going to just paste some of

play05:55

this code in from the blog post you can

play05:58

follow follow along with a blog post if

play05:59

you like

play06:01

and so right here I'll take you through

play06:03

how this works uh again we have a second

play06:06

software-defined asset which is called

play06:09

GitHub stargazers by week

play06:11

this takes a parameter here called

play06:13

GitHub star gazers what this actually is

play06:16

is a it's got a special name because it

play06:19

references this name right here now

play06:21

dagster via the magic of

play06:23

software-defined Assets in this asset

play06:24

decorator knows how to match these two

play06:26

up and so this basically declares a data

play06:29

dependency between the GitHub Star

play06:31

gazers by week and the GitHub stargazers

play06:33

asset

play06:35

um so uh what dagster will do is it will

play06:38

know to materialize the GitHub

play06:40

stargazer's asset before materializing

play06:42

GitHub stargazers by week so now that we

play06:44

have that data we iterate through it

play06:46

here

play06:47

and we create a new data frame where

play06:50

um you know we we basically create

play06:54

um one row for every user and when they

play06:56

starred it except we convert the timestamp

play06:58

from a um the exact time stamp to just

play07:02

the the start of the week and then we

play07:04

will aggregate by the start of the week

play07:06

so you can see here

play07:09

um we call Group by week which

play07:11

Aggregates everything into week by week

play07:13

Aggregates we call count which counts

play07:16

within the week and then we sort

play07:18

chronologically by the week so then we

play07:19

get an ordered a data frame

play07:21

um the start of the week and then the

play07:23

number of users that starred during that

play07:25

week

play07:27

um if you have questions about how this

play07:29

works check out the pandas documentation

play07:32

so the next thing we need to do is go

play07:34

from this data frame to some sort of

play07:37

visualization Jupyter notebooks are a

play07:39

really common way to do it normally I

play07:42

would you know open up Jupyter to

play07:44

develop the notebook but there's a

play07:45

little library that makes it easier for

play07:47

example it's called jupytext where you

play07:49

can write a notebook just as a as a

play07:51

string of markdown inside of your uh

play07:53

inside of your project

play07:54

so I'm gonna just paste in some

play07:56

additional code here

play08:01

um

play08:02

and you can see we've added another

play08:04

software defined asset called GitHub

play08:06

Stars notebook

play08:07

so this takes in GitHub Star gazers by

play08:10

week which we defined up here

play08:13

we create markdown representing the um

play08:16

the notebook so you can just think of

play08:17

this as like an IPython notebook like an

play08:20

ipynb file but just encoded as a

play08:23

markdown string using this this Library

play08:25

here we convert it to an actual IPython

play08:27

notebook right here

play08:29

um we call this

play08:31

ExecutePreprocessor.preprocess this is

play08:32

something that we've imported from the

play08:35

Jupyter library which will basically

play08:36

execute the notebook and put the results

play08:38

into it

play08:39

and then finally we call nbformat.write

play08:41

which will write out the notebook as a

play08:44

string that can be read by Jupyter or

play08:47

any other service that supports

play08:48

notebooks

play08:50

um

play08:51

finally we want to take this notebook

play08:55

and upload it to GitHub as a gist so we

play08:58

can take that URL and then share it with

play09:00

stakeholders they can see the

play09:01

visualization

play09:02

so let's go and do that

play09:05

um I'm going to to just paste in the the

play09:08

code here

play09:13

so you can see here we've defined

play09:14

another software-defined asset with this

play09:16

asset decorator

play09:17

GitHub Stars notebook gist it takes in

play09:21

the GitHub star's notebook

play09:24

um and uh this is uh we don't need this

play09:27

right now

play09:28

um

play09:30

it takes in the GitHub Stars notebook it

play09:32

calls the GitHub API it tells it to

play09:34

create a gist and then uploads the

play09:36

contents of that notebook as a file

play09:39

attached to the gist and then we just

play09:41

log out the the URL

play09:45

and so

play09:47

we um we've basically created all of our

play09:51

um all of our software-defined assets

play09:53

now let's

play09:55

um let's try to take a look at them in

play09:56

the in the dagster UI

play09:59

so we've started up dag it let's open it

play10:01

up

play10:02

and now you can see our four assets that

play10:04

we created the GitHub stargazers GitHub

play10:06

stargazers by week GitHub Stars

play10:08

notebook and then the GitHub Stars

play10:10

notebook gist

play10:12

so I can just click this materialize all

play10:13

button

play10:15

and you can see that we launched a run

play10:18

and it'll it'll go

play10:22

so what this is doing right now is this

play10:23

is fetching all of the Star gazers from

play10:25

GitHub

play10:26

this is actually quite a long operation

play10:29

because it is fetching all of the

play10:31

stargazers from the beginning of time

play10:32

which is a quite expensive operation you

play10:35

have to do multiple calls to the GitHub

play10:37

API in order to do it and as you can see

play10:39

it takes a long time one of the

play10:41

advantages to modeling your computation

play10:43

as a dag of assets the way that that

play10:45

dagster does is that we can cache this

play10:48

computation and reuse it in the future

play10:50

so for example if we want to iterate on

play10:53

how the notebook works or how we're

play10:55

transforming the data basically The Core

play10:57

Business logic we don't have to do that

play10:59

fetch again dagster uses a system called

play11:01

i o managers and it stores that in

play11:04

persistent storage in this case it's

play11:06

stored on on my local disk

play11:09

um but uh but you know in the future

play11:11

um you know if you kind of go to

play11:13

production you can use um S3 uh it will

play11:16

be stored on um you know in a in a more

play11:18

kind of production ready blob store so

play11:21

as you can see

play11:22

I made an edit uh to my code

play11:25

um and I'm missing my context variable

play11:28

so uh I'm gonna add that back

play11:32

so context is not the name of an asset

play11:35

context is actually a kind of a special

play11:39

magic context that's passed through to

play11:42

every asset if it's asked for so if your

play11:44

first argument of your asset is called

play11:45

context you get this context object it

play11:48

has a number of things on it including

play11:50

this log function or this logger where

play11:53

you can call log log.info

play11:55

so as you can see this failed because I

play11:58

introduced an error

play12:00

what I can actually do here is just

play12:03

click on this and uh I can re-execute

play12:07

the GitHub Stars notebook gist

play12:10

so basically I don't have to sit and

play12:12

wait for all of that fetching from the

play12:13

GitHub API I can instead just run that

play12:16

one step and it happens really quickly

play12:18

so you know one of the things to take

play12:20

away from this here is that fixing bugs

play12:22

can be really fast when you model your

play12:23

your computation this way

play12:25

and so if I take a look here

play12:29

I have a brand new

play12:31

visualization

play12:33

uploaded to um GitHub

play12:36

and I can go and share this with with

play12:38

any of my stakeholders looks like the

play12:40

stars are going up that's great

play12:43

um one thing to to kind of note here

play12:46

um that I didn't cover uh is how this

play12:49

notebook is created we basically pickle

play12:51

the stargazers data in order to get it

play12:54

in the notebook so the notebook can

play12:55

actually like visualize it if you're

play12:58

using a different visualization tool you

play13:00

might do it a different way

play13:03

um the last thing

play13:05

um I want to show you is how to add a

play13:08

schedule

play13:09

and so

play13:12

you know we we basically have created a

play13:14

one-off job at this point

play13:16

um so

play13:18

effectively you can go into dagster

play13:21

click on the launch run button or click

play13:23

on the materialize button and do a

play13:26

one-off run but really when you get to

play13:27

production you want to

play13:29

um

play13:30

you really want to put things on a

play13:31

schedule

play13:32

let's all show you how to do that we

play13:34

open up this repository here

play13:36

and I'm going to make an edit here so

play13:38

you can see that right now our

play13:39

repository which is kind of dagster's

play13:41

word for project in many ways

play13:43

um it just contains all the assets from

play13:45

our assets package

play13:47

I'm going to put some additional things

play13:49

in here

play13:51

um we've got a

play13:54

defined asset job function here this

play13:57

defines what's called a dagster job so

play14:00

when we click that materialize button

play14:02

that created what we call a job

play14:04

and and that job then materialize those

play14:07

assets and we could run that job

play14:09

multiple times and each one of those is

play14:11

called a run

play14:13

so we basically say hey we want to have

play14:16

a job called daily refresh and it will

play14:19

refresh all of the Assets in the project

play14:21

that's what the star means and we will

play14:24

put it on a daily schedule

play14:26

and then we simply add the job and the

play14:29

schedule to our project

play14:31

and we can take a look at that in the

play14:34

dagster UI so I go to my workspace I

play14:37

want to just reload all my all my

play14:39

schedules here

play14:41

and then if I go to status and schedules

play14:43

you can see here we've got a daily

play14:46

schedule

play14:47

linked to the job and the job you know

play14:51

materializes all four of those assets

play14:54

this warning over here by the way is

play14:57

because our daemon isn't running like I

play14:58

said earlier there's the UI and then

play14:59

there's the Daemon these are the two big

play15:01

processes that you have to think about

play15:02

with with dagster the daemon runs the

play15:05

schedules I didn't start the the Daemon

play15:06

for this example

play15:08

um so uh

play15:10

you know that's why we have that warning

play15:13

all right and there is a couple problems

play15:15

now with this project even though it

play15:17

works we are able to build a full ETL

play15:19

pipeline from GitHub to a visualization

play15:22

and run it on a schedule there's still

play15:24

some things we need to do before we can

play15:26

consider this project production ready

play15:29

um the biggest problem I think is this

play15:32

um this secret just sitting around in

play15:34

your source code that's really bad the

play15:36

second thing is we haven't written any

play15:37

tests

play15:39

um and the good news is dagster is

play15:40

designed from the ground up to support

play15:43

um you know really great testing um as

play15:45

well as uh dealing with secrets

play15:48

um so in order to do that we're going to

play15:50

use two extra concepts the first one

play15:51

is called config and the second one is

play15:53

called resources

play15:55

so I'm going to create a file here

play15:58

called resources.py

play16:01

it's going to contain our example

play16:02

resource so a resource is basically

play16:06

um it's usually like a client that talks

play16:08

to an external system so in our case

play16:11

it's going to be our GitHub client we're

play16:13

going to abstract that away into

play16:14

something called a resource and what the

play16:16

resource lets us do is it can be

play16:17

configured separately from the rest of

play16:18

the application

play16:20

and the assets will depend on the GitHub

play16:22

resource rather than the GitHub API

play16:24

itself so we can specify you know

play16:26

different access tokens to different

play16:27

assets we can also swap it out and swap

play16:31

in a test double for example so when we

play16:33

write a test we'll use the resource

play16:35

system in order to to test without

play16:37

hitting the GitHub API

play16:39

so just talking talking this through we

play16:42

use the resource decorator to indicate

play16:44

that we've defined a resource

play16:46

um technically this is called a resource

play16:47

definition uh the name of the definition

play16:50

is called GitHub API and it simply

play16:52

returns a PyGithub client that takes in

play16:56

the the the token that we are going to

play16:59

include in the config

play17:01

we also need to give it a config schema

play17:04

this also this can take regular python

play17:07

type so for example I could just say hey

play17:09

this takes an access token as a string but

play17:11

there's a super powered

play17:14

um you know dagster object called

play17:16

StringSource which has some extra features

play17:19

including reading values from the

play17:21

environment

play17:22

um as or reading from environment

play17:24

variables which which is very useful for

play17:27

secrets so we're actually going to use

play17:28

StringSource here

play17:32

um now that we've added the resource or

play17:35

that we've created the resource we have

play17:37

to actually add it to our project

play17:39

so um I'm going to go back into our

play17:42

Repository

play17:43

I'm going to import the resource

play17:46

and then I'm going to add it to our

play17:49

project here so the way that we do that

play17:52

is we use this function called with

play17:54

resources

play17:55

um which I will import up here

play18:00

and we say

play18:02

with resources and then that takes in

play18:05

definitions and then the resource

play18:07

definition so the first thing we'll do

play18:09

is just pass in our assets what this

play18:11

basically says is it'll give the

play18:13

resources to all of our Assets in our

play18:15

project

play18:16

and then

play18:18

we provide a resource name so in this

play18:20

case we're going to call it GitHub API

play18:22

and then it's going to take in

play18:25

the GitHub API with a configuration so

play18:30

um

play18:31

the way that we do this is

play18:34

um this can be a little confusing the

play18:36

GitHub API name here is the name of the

play18:38

resource definition

play18:40

um you can actually reuse that

play18:41

definition in multiple contexts in your

play18:42

application and so there's a resource

play18:44

key which is kind of like the instance

play18:45

of that resource and so the way we

play18:48

create an instance the resource is we

play18:50

call configured which basically passes a

play18:52

configuration to it and this can also be

play18:54

provided as an external configuration

play18:56

file but

play18:57

for this we're going to include in the

play18:59

code and we can say access token which

play19:01

was the the name of the the field in the

play19:04

config schema that we defined over here

play19:06

in resources

play19:09

um I'm going to show you this this

play19:10

feature of string Source now where we

play19:12

can actually pass an object with the key

play19:15

of env and then the name of the

play19:17

environment variable

play19:28

configured let me just make sure I I

play19:30

mapped all my

play19:32

stuff correctly I think I did

play19:37

um what this basically does is it

play19:39

um it creates a new GitHub client it

play19:42

reads the secret or the token out of the

play19:44

environment and then it gives that

play19:46

resource to all of our software-defined

play19:48

assets

play19:50

um

play19:51

so let me just um reload the project and

play19:54

make sure that I didn't

play19:55

um didn't introduce any errors

play19:59

okay that looks good

play20:03

up next we have to actually use this

play20:05

resource from the uh the assets

play20:08

themselves

play20:09

and so if we go over here we really use

play20:12

this GitHub client in two places uh the

play20:14

first is in this GitHub star gazers

play20:16

asset and the second is in this GitHub

play20:18

Stars notebook gist asset so one reads

play20:21

from GitHub the other one writes to

play20:23

GitHub

play20:26

um and so uh

play20:30

right here we provide required resource

play20:33

keys this basically defines a dependency

play20:36

on that resource called GitHub API

play20:40

and what this means is that the resource

play20:41

is now available to this resource to

play20:43

this uh asset and we can get at it by

play20:46

just saying

play20:49

context.resources.github_api

play20:51

and passing in um or taking the context

play20:54

as a parameter I talked about this

play20:55

before if your first argument to your

play20:58

asset is called context you get a

play21:00

context object from dagster that

play21:02

includes things like the logger and also

play21:05

the resources

play21:07

and let's make a similar refactor

play21:10

um to this GitHub Stars notebook gist

play21:13

down here we will say required resource

play21:18

keys

play21:19

API

play21:21

and then we will say

play21:24

context.resources.github_api

play21:28

um by the way other other Frameworks or

play21:30

technologies you may have worked with might

play21:32

call this dependency injection it's very

play21:34

similar concept

play21:36

so let's

play21:38

um let's actually test this here now we

play21:40

don't need this access token in our

play21:42

source anymore

play21:44

we pass it as an environment variable

play21:49

and then we'll run the UI

play21:55

and if all goes well

play21:59

I should be able to materialize all

play22:08

and it looks like this is fetching

play22:11

um all the GitHub stars from the API now

play22:14

because we've restarted the service and

play22:15

I'm using the developer mode we lose our

play22:18

cache in between restarts of of the

play22:19

dagit service but you can configure it

play22:21

with a custom what we call i o manager

play22:24

in order to persist that data somewhere

play22:25

in between restarts so for example S3

play22:31

um but as that runs

play22:33

um you know it's taking a long time so

play22:34

I'm pretty sure that it's accessing the

play22:35

GitHub API correctly so this is this is

play22:37

huge we've basically just gotten rid of

play22:40

that secret from our source code

play22:42

um and we're reading it from the

play22:43

environment that can be provided using

play22:45

any um any sort of automation that you

play22:47

want

play22:48

um

play22:50

so the final thing that we need to do is

play22:52

we need to write some tests and the

play22:55

resource system helps us write tests as

play22:57

well

play22:58

so if I go over here

play23:00

and I open up the my dagster project

play23:02

tests I'm gonna I'm gonna shut down the

play23:04

UI because we don't need it anymore

play23:06

um I'm going to open up this test assets

play23:08

dot py file and I'm going to bring in

play23:12

um some some test code here

play23:15

first thing I'm going to do is

play23:18

bring in a bunch of imports so I'm going

play23:20

to import the software defined assets

play23:22

from our project I'm going to import a

play23:24

python utility called MagicMock and

play23:26

we're going to use a lot of dagster test

play23:28

utility called materialize_to_memory and

play23:30

then some helper functions so pandas to

play23:33

create data frames and then date time to

play23:35

help us create some test data

play23:38

then I'm going to create our our smoke

play23:41

test a smoke test is kind of like a

play23:43

really um you know not not a super

play23:46

comprehensive test of every Edge case

play23:48

it's just testing the happy path and

play23:49

making sure that you know the common

play23:51

case works and we have a blog post

play23:53

coming out about this as well

play23:55

um so we're going to just simulate you

play23:58

know three users starring this repo

play24:01

um two of them on the same day in

play24:03

January of 2021 and another one in

play24:05

February

play24:08

then we are going to create a mock

play24:11

GitHub API so using the MagicMock

play24:14

Library

play24:15

um we instantiate it and then you know I

play24:18

would definitely check out the

play24:20

documentation for um for magic mock uh

play24:24

because

play24:25

um

play24:27

You Know It uh it has some subtleties

play24:32

let me just make sure I've done this oh

play24:33

yeah okay so that's correct um

play24:35

so so just just to hand me over this a

play24:39

little bit um this mocks out this call

play24:41

from our asset that calls get stargazers

play24:44

with dates and it mocks it out to return

play24:46

uh effectively this this data set so

play24:49

we're simulating that the GitHub API is

play24:51

going to return that data

play24:55

um another thing we need to mock out is

play24:56

the write path so we've mocked out the

play24:58

read path from GitHub but let's also

play25:00

mock out the write path and the reason

play25:01

we're mocking this out is so that our

play25:03

test can run

play25:05

um and it's fast and it doesn't talk to

play25:07

any external service and it also doesn't

play25:09

trigger any effects in the real world so

play25:11

it doesn't if you know you can imagine

play25:12

these apis could cost money or have

play25:15

quotas associated with them and you

play25:16

don't want your tests burning through

play25:18

that

play25:19

um so this mocks out

play25:21

um you know the create gist function to

play25:23

return a fixed URL it's not actually uh

play25:26

you know upload a real gist

play25:28

and then finally

play25:31

we're going to actually materialize our

play25:34

Assets in the test so we have this

play25:36

function materialized to memory we pass

play25:38

in the assets that we want to

play25:39

materialize and we can also override

play25:42

um you know specific resources that we

play25:44

would like to use or provide them so

play25:47

this is a resource key and this is the

play25:48

resource definition

play25:51

and then the last step is we want to

play25:53

actually write some assertions here and

play25:55

I will paste these in

play25:57

um

play25:58

the first one just checks that there was

play26:02

actually a successful run and that the

play26:04

Run didn't throw any errors the second

play26:07

is we look at the output of the GitHub

play26:08

stargazers by week asset we do a little

play26:12

bit of um pandas magic to compare it

play26:14

with the expected data so if you look at

play26:16

our mock data set we would expect two

play26:19

weeks of data the first week in January

play26:21

to have two stars in the first week of

play26:23

February to have one star

play26:25

we've got that here and you can look at

play26:27

the original mock data if you don't

play26:29

believe me

play26:32

um then we also want to assert that we

play26:35

have actually called that create gist

play26:37

function and that we are returning the

play26:39

URL correctly so we do that here

play26:42

uh and then finally I added a little

play26:44

smoke test just to make sure that the

play26:45

GitHub Stars notebook content like the

play26:47

notebook file that was written out

play26:49

contained

play26:51

um you know data that we would expect to

play26:53

be there and that the gist was created

play26:56

as a a private gist not a public gist

play27:00

so if we want to run this test we say

play27:03

pytest -s

play27:05

um my dagster

play27:07

project tests

play27:10

the dash s is kind of running it in

play27:13

verbose mode so we see all the log

play27:14

output and any print statements that we

play27:16

might put in there and you can see here

play27:18

that our test passed

play27:21

um in uh in a pretty short amount of

play27:23

time so it wasn't pulling all that data

play27:25

from GitHub it wasn't talking to any

play27:26

external system it's basically just good

play27:28

old-fashioned in-memory testing

play27:31

um and it's always good practice to try

play27:33

to break your test and make sure that

play27:35

your test is actually testing something

play27:36

so this is changing it to assert that

play27:38

we've actually made a public gist

play27:40

instead of a private gist

play27:42

and

play27:44

you should be able to see that this test

play27:45

will break

play27:46

yep it breaks

play27:48

and then we can fix it again

play27:52

so anyway

play27:54

um this was a a crash course into

play27:56

um uh building an ETL Pipeline with

play27:59

dagster you learned a couple of things

play28:00

you learned the the primary way of

play28:02

development which is using

play28:03

software-defined assets you learned a

play28:06

little bit about how to use the user

play28:07

interface

play28:08

um you learned how to kind of migrate

play28:11

from a hello world to a more

play28:12

production-ready application through

play28:14

using the resource system which lets you

play28:16

both unlock testability as well as you

play28:21

know unlock Secrets management and

play28:22

configuration

play28:25

um but there's a lot more to learn and

play28:26

so we have a lot of documentation both

play28:28

linked from the blog post as well as

play28:30

additional tutorials and guides on our

play28:32

website

play28:33

um please check it out and uh if you

play28:34

like what you see please star the repo

play28:36

thank you very much