Dagster Crash Course: develop data assets in under ten minutes
Summary
TL;DR: In this video, Pete takes you on a crash course for building an ETL pipeline with Dagster, a data orchestration platform. Starting from scratch, he guides you through the process of fetching data from the GitHub API, transforming it, visualizing it in a Jupyter notebook, and uploading the notebook as a GitHub Gist. Along the way, he covers key Dagster concepts like software-defined assets, resources for managing external dependencies, testing strategies, and scheduling pipelines to run automatically. By the end, you'll have a solid understanding of how Dagster streamlines data workflows and empowers you to build robust, testable, and production-ready ETL pipelines with ease.
Takeaways
- 🔑 Dagster is a tool for building data pipelines and ETL (Extract, Transform, Load) workflows as a DAG (Directed Acyclic Graph) of software-defined assets.
- 📦 Dagster provides a command-line interface and UI for scaffolding projects, managing dependencies, and running pipelines.
- 🧩 Software-defined assets are Python functions that represent data assets (e.g., reports, models, databases) and can be interconnected to form a pipeline.
- 🔄 Dagster caches intermediate computations, allowing efficient re-execution and iteration on specific pipeline steps.
- ⚙️ Dagster's resource system facilitates secure configuration, secret management, and test-driven development through dependency injection.
- 📅 Dagster supports scheduling pipelines to run at regular intervals, enabling automated data workflows.
- 🔬 The presenter demonstrated building an ETL pipeline to fetch GitHub repository stars, transform the data, visualize it in a Jupyter notebook, and publish the result as a GitHub Gist.
- 🧪 Test-driven development is encouraged in Dagster, with utilities for mocking external dependencies and asserting pipeline outputs.
- 📚 Extensive documentation and tutorials are available to learn more about Dagster's features and best practices.
- 🌟 The presenter encouraged users to explore Dagster further and contribute to the open-source project by starring the repository.
Q & A
What is the purpose of this video?
-The video provides a crash course on how to build an ETL (Extract, Transform, Load) pipeline using Dagster, a data orchestration platform.
What is the example pipeline demonstrated in the video?
-The example pipeline fetches GitHub stars data for the Dagster repository, transforms the data into a week-by-week count, creates a visualization in a Jupyter notebook, and uploads the notebook as a GitHub Gist.
What is the role of software-defined assets in Dagster?
-Software-defined assets are functions that return data representing a data asset in the pipeline graph, such as a machine learning model, a report, or a database table. These assets can have dependencies on other assets, forming a data pipeline.
How does Dagster handle caching and reusing computations?
-Dagster uses a system called IO managers to cache the output of computations in persistent storage like local disk or S3. This allows reusing cached data instead of recomputing it, improving efficiency and iteration speed.
What is the purpose of the resources system in Dagster?
-The resources system in Dagster allows abstracting away external dependencies, like API clients, into configurable resources. This enables testability, secret management, and swapping in test doubles for external services.
How does the video demonstrate secret management?
-The video shows how to move the GitHub API access token out of the source code and into an environment variable, which is then read by the GitHub API resource. This prevents secrets from being stored in the codebase.
What is the role of the test demonstrated in the video?
-The test is a smoke test that verifies the happy path of the pipeline by mocking the GitHub API and asserting expected outputs from the software-defined assets. It demonstrates Dagster's testability features.
How does Dagster enable scheduling pipelines?
-Dagster allows defining jobs and schedules within a repository. The video demonstrates adding a daily schedule to the ETL pipeline job, which can then be run automatically by Dagster's scheduler daemon.
What is the role of the Dagster UI in the development process?
-The Dagster UI provides a visual interface for launching and monitoring pipeline runs, inspecting asset metadata, and managing schedules. It aids in the development and operation of Dagster pipelines.
What are some potential next steps after completing this tutorial?
-The video suggests exploring Dagster's documentation, tutorials, and guides further, as the tutorial covers only a basic introduction. There are more advanced features and best practices to learn for production-ready Dagster pipelines.
Outlines
📽️ Introduction to Building an ETL Pipeline with Dagster
Pete, an employee at Dagster, introduces the video and provides context. He explains that the video serves as a companion to a blog post, offering a crash course on building an ETL (Extract, Transform, Load) pipeline with Dagster. The completed code is available on GitHub, and Pete will be using an in-browser editor called Gitpod to start from scratch. The objective is to create a report visualizing GitHub stars over time for the Dagster repository, utilizing the GitHub API, Dagster's software-defined assets, and the Jupyter API.
🔧 Setting Up the Development Environment and Fetching GitHub Data
Pete installs Dagster and its dependencies, explaining the project structure and components. He then creates a software-defined asset to fetch stargazers data from the GitHub API using a GitHub token. This asset returns the raw API response containing timestamps and usernames of stargazers.
🔄 Transforming and Visualizing the GitHub Data
Pete creates another software-defined asset to transform the raw GitHub stargazers data into weekly counts. He then uses a Jupyter notebook to visualize the transformed data, creating a software-defined asset that generates the notebook as a markdown string, executes it, and writes the output as a string readable by Jupyter. Finally, he creates an asset to upload the notebook as a GitHub gist, enabling sharing the visualization with stakeholders.
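The week-by-week transform described above can be sketched with plain pandas (the column names and sample data are illustrative, not taken from the video's code):

```python
import pandas as pd

# Hypothetical raw records shaped like the GitHub stargazer response:
# one (user, starred_at timestamp) pair per star.
raw = pd.DataFrame({
    "user": ["alice", "bob", "carol"],
    "starred_at": pd.to_datetime(["2021-01-04", "2021-01-05", "2021-02-01"]),
})

# Truncate each timestamp to the start of its week, then count stars
# per week and sort chronologically.
raw["week"] = raw["starred_at"].dt.to_period("W").dt.start_time
stars_by_week = raw.groupby("week")["user"].count().sort_index()
```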
🚀 Running the Pipeline and Introducing Schedules
Pete demonstrates running the pipeline in the Dagster UI by materializing all assets. He highlights the caching mechanism that avoids redundant data fetching. Pete then introduces schedules, modifying the code to create a daily job that refreshes all assets. He shows the new daily schedule in the Dagster UI but notes that the daemon for running schedules is not running in this example.
🔐 Addressing Production Readiness: Secrets Management and Testing
Pete identifies two issues to address before considering the project production-ready: the hardcoded GitHub token (secret) in the source code and the lack of tests. He introduces the concepts of resources and configuration in Dagster, refactoring the code to use a resource for the GitHub API client and configuring it to read the token from an environment variable. This removes the secret from the source code.
🧪 Writing Tests for the ETL Pipeline
Pete demonstrates writing unit tests for the ETL pipeline using Dagster's testing utilities and the Python mock library. He creates a smoke test that simulates GitHub stargazers data, mocks the GitHub API, and asserts the expected behavior of the software-defined assets. Running the tests locally verifies their correctness without interacting with external services.
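The mocking pattern can be sketched without the Dagster-specific `materialize_to_memory` call, so this stays self-contained (`get_stargazers_with_dates` and `create_gist` follow PyGithub's API; the data is fabricated test input):

```python
from datetime import datetime
from unittest.mock import MagicMock

# Simulated stargazer records, mirroring the shape the test fakes for
# the GitHub API (the user names are illustrative).
fake_stargazers = [
    MagicMock(starred_at=datetime(2021, 1, 4), user=MagicMock(login="alice")),
    MagicMock(starred_at=datetime(2021, 1, 4), user=MagicMock(login="bob")),
    MagicMock(starred_at=datetime(2021, 2, 1), user=MagicMock(login="carol")),
]

# Mock out the read path: the client returns canned data instead of
# calling the real API.
mock_github = MagicMock()
mock_github.get_repo.return_value.get_stargazers_with_dates.return_value = (
    fake_stargazers
)

# Mock out the write path: creating a gist returns a fixed URL rather
# than uploading anything real.
mock_github.get_user.return_value.create_gist.return_value.html_url = (
    "https://gist.github.com/fake"
)

stars = mock_github.get_repo("dagster-io/dagster").get_stargazers_with_dates()
```

In the video, a mock like `mock_github` is then handed to the pipeline via the resource system so the assets never touch the network.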
Keywords
💡ETL Pipeline
💡Software-Defined Assets
💡Dagster
💡Data Dependencies
💡Caching
💡Resources
💡Secrets Management
💡Testing
💡Schedules
💡Data Visualization
Highlights
Dagster is a framework for building ETL pipelines, where data assets are defined as software-defined assets and linked together to form a data pipeline.
Software-defined assets are functions that return data representing a data asset, such as a machine learning model, report, or database table.
Dagster uses an asset decorator to mark functions as software-defined assets, allowing it to understand the relationships and dependencies between assets.
Dagster caches intermediate computations, enabling faster iteration and bug fixing by reusing cached results instead of recomputing from scratch.
Dagster provides a resource system to abstract external dependencies, like API clients, enabling better testability and secrets management.
Resources can be configured separately from the assets, allowing different configurations for different environments or assets.
Dagster's resource system supports dependency injection, making it easier to swap out resources with test doubles for better testing.
Dagster uses the StringSource config type to read secrets from environment variables, avoiding hard-coding secrets in source code.
Dagster provides utilities for writing smoke tests, which test the happy path and common cases of the pipeline.
Tests can mock out external dependencies like API calls, allowing tests to run quickly without talking to external systems or triggering real-world effects.
Dagster's materialize_to_memory function allows testing assets in memory, overriding resources with test doubles as needed.
Dagster supports defining jobs, which are collections of assets to be materialized together, and scheduling those jobs to run periodically.
Dagster provides a UI for visualizing assets, launching runs, and monitoring schedules.
Dagster uses a workspace.yaml file to configure the code locations it should load.
Dagster supports various Python environments and operating systems, making it easy to get started with a simple command line tool.
Transcripts
hi I'm Pete I work on dagster and today
I'm going to take you through a quick
crash course for how to build an ETL
Pipeline with dagster this is a
companion video to a blog post that we
wrote you can go check it out here to
get some additional context and links
to various documentation if you want to
just skip straight to the code the
completed code is available here at this
GitHub repo and finally as we go along
I'm going to be using an in-browser
editor called gitpod and starting from
scratch and then building it up from
there so you can just go here to get
started with that it's free and it makes
getting spun up with a python
environment very easy just to set the
context here we're going to create this
report and store it in GitHub gist this
is going to be created from an IPython
notebook and it's going to visualize
GitHub Stars over time for the dagster
repo
so we're going to use the dagster API or
sorry the GitHub API and dagster software
defined assets and the Jupyter API to
do all this
um so to get started I've got this Cloud
development environment here it's using
a service called gitpod there's a link
in the in the blog post it's a free
Cloud development environment that
guarantees a pretty stable and specific
python version you can use whatever
python environment you want dagster
works on on you know Mac Linux and
windows but I'm just going to use this
for consistency so the first thing we're
going to do is PIP install dagster
this installs our command line tools
that help you scaffold out a project so
we're just going to use the the default
simple project but there are a number of
different templates that you can use
um by accessing the help command but I'm
just going to say dagster project
scaffold name my dagster project
and you can see here that we've created
a project
in this uh in this workspace so if I cd
my dagster project I'm going to install
the dependencies that this example
project needs so this is just how you do
it while it installs we'll give you a
little tour of
um of what's in here we have a little
readme describing the project linking to
the documentation we've got our normal
kind of python setup.py where we
list all the dependencies that we need
this workspace.yaml tells dagster when
we run it locally where to find the code
so dagster looks for this workspace.yaml
file to figure out what code to load
then we've got my Dexter project which
contains
um you know the the source code for our
project uh we'll be doing most of our
work in there and then we have the my
dagster project tests uh package which
contains all the unit tests uh for our
project
so I'm going to start dagster
using this command dagit there's
actually two components to dagster there
is the UI and then there's the Daemon
that runs the schedules because we're
not going to be using schedules
um right now I'm just going to launch
the UI and so you can see here that
we've got this this empty UI because we
haven't done anything yet but you just
run that one command dagit it looks at
that workspace.yaml file and it loads up
your UI
um so like I said uh we are building
that um GitHub Stars dashboard so we
need a number of dependencies in order
to do that
um so in order so the way we add
dependencies is it's like any other
python project we just um update this
install requires part of the setup.py
and then we rerun the install steps this
pip install -e command
and so this is going to install
PyGithub so we can access the GitHub API
I'm going to install matplotlib which
will help us visualize
um what's going on with the GitHub Stars
pandas is our data frame which is how we
manipulate and transform the data and
then these four packages are what's
needed in order to render a notebook
um so the first thing we're going to do
is we're going to want to fetch the raw
data from GitHub so this is going to
create
um you know basically the raw response
from the GitHub API
so
um we're going to use software-defined
assets to do that we're going to use
software-defined assets to begin to
build our application
so I um I just copied and pasted this in
before the cut
this is our example
um you know software-defined asset for
fetching the GitHub star gazers from the
GitHub API
and so
um I'm going to actually put a real
GitHub token in here
by the time you see this video I'm gonna
have deleted it so you can't use it for
anything
um obviously inlining a token
um or any sort of Secret In Your source
code is a really bad idea
we will fix that by the end of this
tutorial but for now we're just gonna
put that in there and we're going to
create what's called a software defined
asset software defined asset is a
function that returns
um some data that represents an asset a
data asset in your graph so this could
be a machine learning model a report or
a database table in this case we're just
returning um the response from the
GitHub API so we instantiate the
PyGithub client we pass it the access
token
we do a little bit of function calls to
get the star gazers with the dates this
is effectively like the username and
then the date that they started the repo
and that the exact time stamp that they
start the repo and marking it with an
asset decorator
um indicates to dagster that this is a
software-defined asset
um so uh the next thing we're going to
need to do
is now that we have that raw API
response from GitHub we're going to need
to transform that into week by week
counts so we need to go from these pairs
of timestamps and usernames to the
number of unique users that have starred
the repo in a given week
and so I'm going to just paste some of
this code in from the blog post you can
follow follow along with a blog post if
you like
and so right here I'll take you through
how this works uh again we have a second
software-defined asset which is called
GitHub stargazers by week
this takes a parameter here called
GitHub star gazers what this actually is
is a it's got a special name because it
references this name right here now
dagster via the magic of
software-defined Assets in this asset
decorator knows how to match these two
up and so this basically declares a data
dependency between the GitHub Star
gazers by week and the GitHub stargazers
asset
um so uh what dagster will do is it will
know to materialize the GitHub
stargazers asset before materializing
GitHub stargazers by week so now that we
have that data we iterate through it
here
and we create a new data frame where
um you know we we basically create
um one row for every user and when they
starred it except we convert the timestamp
from a um the exact time stamp to just
the the start of the week and then we
will aggregate by the start of the week
so you can see here
um we call Group by week which
Aggregates everything into week by week
Aggregates we call count which counts
within the week and then we sort
chronologically by the week so then we
get an ordered data frame
um the start of the week and then the
number of users that starred during that
week
um if you have questions about how this
works check out the pandas documentation
so the next thing we need to do is go
from this data frame to some sort of
visualization Jupyter notebooks are a
really common way to do it normally I
would you know open up Jupyter to
develop the notebook but there's a
little library that makes it easier for
example it's called jupytext where you
can write a notebook just as a as a
string of markdown inside of your uh
inside of your project
so I'm gonna just paste in some
additional code here
um
and you can see we've added another
software defined asset called GitHub
Stars notebook
so this takes in GitHub Star gazers by
week which we defined up here
we create markdown representing the um
the notebook so you can just think of
this as like an IPython notebook like an
ipynb file but just encoded as a
markdown string using this this Library
here we convert it to an actual IPython
notebook right here
um we call this execute
pre-processor.preprocess this is
something that we've imported from the
Jupyter library which will basically
execute the notebook and put the results
into it
and then finally we call nbformat.write
which will write out the notebook as a
string that can be read by Jupyter or
any other service that supports
notebooks
um
finally we want to take this notebook
and upload it to GitHub as a gist so we
can take that URL and then share it with
stakeholders they can see the
visualization
so let's go and do that
um I'm going to to just paste in the the
code here
so you can see here we've defined
another software-defined asset with this
asset decorator
GitHub Stars notebook gist it takes in
the GitHub Stars notebook
um and uh this is uh we don't need this
right now
um
it takes in the GitHub Stars notebook it
calls the GitHub API it tells it to
create a gist and then uploads the
contents of that notebook as a file
attached to the gist and then we just
log out the the URL
and so
we um we've basically created all of our
um all of our software-defined assets
now let's
um let's try to take a look at them in
the in the dagster UI
so we've started up dag it let's open it
up
and now you can see our four assets that
we created the GitHub stargazers GitHub
stargazers by week GitHub Stars
notebook and then the GitHub Stars
notebook gist
so I can just click this materialize all
button
and you can see that we launched a run
and it'll it'll go
so what this is doing right now is this
is fetching all of the Star gazers from
GitHub
this is actually quite a long operation
because it is fetching all of the
stargazers from the beginning of time
which is a quite expensive operation you
have to do multiple calls to the GitHub
API in order to do it and as you can see
it takes a long time one of the
advantages to modeling your computation
as a dag of assets the way that that
dagster does is that we can cache this
computation and reuse it in the future
so for example if we want to iterate on
how the notebook works or how we're
transforming the data basically The Core
Business logic we don't have to do that
fetch again dagster uses a system called
i o managers and it stores that in
persistent storage in this case it's
stored on on my local disk
um but uh but you know in the future
um you know if you kind of go to
production you can use um S3 uh it will
be stored on um you know in a in a more
kind of production ready blob store so
as you can see
I made an edit uh to my code
um and I'm missing my context variable
so uh I'm gonna add that back
so context is not the name of an asset
context is actually a kind of a special
magic context that's passed through to
every asset if it's asked for so if your
first argument of your asset is called
context you get this context object it
has a number of things on it including
this log function or this logger where
you can call log.info
so as you can see this failed because I
introduced an error
what I can actually do here is just
click on this and uh I can re-execute
the GitHub Stars notebook gist
so basically I don't have to sit and
wait for all of that fetching from the
GitHub API I can instead just run that
one step and it happens really quickly
so you know one of the things to take
away from this here is that fixing bugs
can be really fast when you model your
your computation this way
and so if I take a look here
I have a brand new
visualization
uploaded to um GitHub
and I can go and share this with with
any of my stakeholders looks like the
stars are going up that's great
um one thing to to kind of note here
um that I didn't cover uh is how this
notebook is created we basically pickle
the stargazers data in order to get it
in the notebook so the notebook can
actually like visualize it if you're
using a different visualization tool you
might do it a different way
um the last thing
um I want to show you is how to add a
schedule
and so
you know we we basically have created a
one-off job at this point
um so
effectively you can go into dagster
click on the launch run button or click
on the materialize button and do a
one-off run but really when you get to
production you want to
um
you really want to put things on a
schedule
let me show you how to do that we
open up this repository here
and I'm going to make an edit here so
you can see that right now our
repository which is kind of dagster's
word for project in many ways
um it just contains all the assets from
our assets package
I'm going to put some additional things
in here
um we've got a
define asset job function here this
defines what's called a dagster job so
when we click that materialize button
that created what we call a job
and and that job then materialize those
assets and we could run that job
multiple times and each one of those is
called a run
so we basically say hey we want to have
a job called daily refresh and it will
refresh all of the Assets in the project
that's what the star means and we will
put it on a daily schedule
and then we simply add the job and the
schedule to our project
and we can take a look at that in the
dagster UI so I go to my workspace I
want to just reload all my all my
schedules here
and then if I go to status and schedules
you can see here we've got a daily
schedule
linked to the job and the job you know
materializes all four of those assets
this warning over here by the way is
because our daemon isn't running like I
said earlier there's the UI and then
there's the daemon these are the two big
processes that you have to think about
with dagster the daemon runs the
schedules I didn't start the daemon
for this example
um so uh
you know that's why we have that warning
all right and there is a couple problems
now with this project even though it
works we are able to build a full ETL
pipeline from GitHub to a visualization
and run it on a schedule there's still
some things we need to do before we can
consider this project production ready
um the biggest problem I think is this
um this secret just sitting around in
your source code that's really bad the
second thing is we haven't written any
tests
um and the good news is dagster is
designed from the ground up to support
um you know really great testing um as
well as uh dealing with secrets
um so in order to do that we're going to
use two extra concepts the first one
is called config and the second one is
called resources
so I'm going to create a file here
called resources.py
it's going to contain our example
resource so a resource is basically
um it's usually like a client that talks
to an external system so in our case
it's going to be our GitHub client we're
going to abstract that away into
something called a resource and what the
resource lets us do is it can be
configured separately from the rest of
the application
and the assets will depend on the GitHub
resource rather than the GitHub API
itself so we can specify you know
different access tokens to different
assets we can also swap it out and swap
in a test double for example so when we
write a test we'll use the resource
system in order to to test without
hitting the GitHub API
so just talking talking this through we
use the resource decorator to indicate
that we've defined a resource
um technically this is called a resource
definition uh the name of the definition
is called GitHub API and it simply
returns a PyGithub client that takes in
the the the token that we are going to
include in the config
we also need to give it a config schema
this also this can take regular python
type so for example I could just say hey
this takes an access token as a string but
there's a super powered
um you know dagster object called
StringSource which has some extra features
including reading values from the
environment
um as or reading from environment
variables which which is very useful for
secrets so we're actually going to use
StringSource here
um now that we've added the resource or
that we've created the resource we have
to actually add it to our project
so um I'm going to go back into our
Repository
I'm going to import the resource
and then I'm going to add it to our
project here so the way that we do that
is we use this function called with
resources
um which I will import up here
and we say
with resources and then that takes in
definitions and then the resource
definition so the first thing we'll do
is just pass in our assets what this
basically says is it'll give the
resources to all of our Assets in our
project
and then
we provide a resource name so in this
case we're going to call it GitHub API
and then it's going to take in
the GitHub API with a configuration so
um
the way that we do this is
um this can be a little confusing the
GitHub API name here is the name of the
resource definition
um you can actually reuse that
definition in multiple contexts in your
application and so there's a resource
key which is kind of like the instance
of that resource and so the way we
create an instance the resource is we
call configured which basically passes a
configuration to it and this can also be
provided as an external configuration
file but
for this we're going to include in the
code and we can say access token which
was the the name of the the field in the
config schema that we defined over here
in resources
um I'm going to show you this this
feature of StringSource now where we
can actually pass an object with the key
of env and then the name of the
environment variable
once that's
configured let me just make sure I
mapped all my
stuff correctly I think I did
um what this basically does is it
um it creates a new GitHub client it
reads the secret or the token out of the
environment and then it gives that
resource to all of our software-defined
assets
um
so let me just um reload the project and
make sure that I didn't
um didn't introduce any errors
okay that looks good
up next we have to actually use this
resource from the uh the assets
themselves
and so if we go over here we really use
this GitHub client in two places uh the
first is in this GitHub star gazers
asset and the second is in this GitHub
Stars notebook gist asset so one reads
from GitHub the other one writes to
GitHub
um and so uh
right here we provide required resource
keys which basically defines a dependency
on that resource called GitHub API
and what this means is that the resource
is now available to this resource to
this uh asset and we can get at it by
just saying
context.resources.github API
and passing in um or taking the context
as a parameter I talked about this
before if your first argument to your
asset is called context you get a
context object from dagster that
includes things like the logger and also
the resources
and let's make a similar refactor
um to this GitHub Stars notebook gist
down here we will say required resource
keys
github API
and then we will say
context.resources.github API
um by the way other frameworks or
technologies you may have worked with might
call this dependency injection it's a very
similar concept
so let's
um let's actually test this here now we
don't need this access token in our
source anymore
we pass it as an environment variable
and then we'll run the UI
and if all goes well
I should be able to materialize all
and it looks like this is fetching
um all the GitHub stars from the API now
because we've restarted the service and
I'm using the developer mode we lose our
cache in between restarts of of the
dagit service but you can configure it
with a custom what we call i o manager
in order to persist that data somewhere
in between restarts so for example S3
um but as that runs
um you know it's taking a long time so
I'm pretty sure that it's accessing the
GitHub API correctly so this is this is
huge we've basically just gotten rid of
that secret from our source code
um and we're reading it from the
environment that can be provided using
any um any sort of automation that you
want
um
so the final thing that we need to do is
we need to write some tests and the
resource system helps us write tests as
well
so if I go over here
and I open up the my dagster project
tests I'm gonna I'm gonna shut down the
UI because we don't need it anymore
um I'm going to open up this test assets
dot Pi file and I'm going to bring in
um some some test code here
first thing I'm going to do is
bring in a bunch of imports so I'm going
to import the software defined assets
from our project I'm going to import a
python utility called Magic Mock and
we're going to use a lot of dagster test
utility called materialize to memory and
then some helper functions so pandas to
create data frames and then date time to
help us create some test data
then I'm going to create our our smoke
test a smoke test is kind of like a
really um you know not not a super
comprehensive test of every Edge case
it's just testing the happy path and
making sure that you know the common
case works and we have a blog post
coming out about this as well
um so we're going to just simulate you
know three users starring this repo
um two of them on the same day in
January of 2021 and another one in
February
then we are going to create a mock
GitHub API so using the magic mock
Library
um we instantiate it and then you know I
would definitely check out the
documentation for um for magic mock uh
because
um
You Know It uh it has some subtleties
let me just make sure I've done this oh
yeah okay so that's correct um
so just to go over this a
little bit um this mocks out this call
from our asset that calls get stargazers
with dates and it mocks it out to return
uh effectively this this data set so
we're simulating that the GitHub API is
going to return that data
um another thing we need to mock out is
the write path so we've mocked out the
read path from GitHub but let's also
mock out the write path and the reason
we're mocking this out is so that our
test can run
um and it's fast and it doesn't talk to
any external service and it also doesn't
trigger any effects in the real world so
it doesn't if you know you can imagine
these apis could cost money or have
quotas associated with them and you
don't want your tests burning through
that
um so this mocks out
um you know the create gist function to
return a fixed URL so it's not actually uh
you know uploading a real gist
and then finally
we're going to actually materialize our
Assets in the test so we have this
function materialized to memory we pass
in the assets that we want to
materialize and we can also override
um you know specific resources that we
would like to use or provide them so
this is a resource key and this is the
resource definition
and then the last step is we want to
actually write some assertions here and
I will paste these in
um
the first one just checks that there was
actually a successful run and that the
Run didn't throw any errors the second
is we look at the output of the GitHub
stargazers by week asset we do a little
bit of um pandas magic to compare it
with the expected data so if you look at
our mock data set we would expect two
weeks of data the first week in January
to have two stars in the first week of
February to have one star
we've got that here and you can look at
the original mock data if you don't
believe me
um then we also want to assert that we
have actually called that create gist
function and that we are returning the
URL correctly so we do that here
uh and then finally I added a little
smoke test just to make sure that the
GitHub Stars notebook content like the
notebook file that was written out
contained
um you know data that we would expect to
be there and that the gist was created
as a private gist not a public gist
so if we want to run this test we say
pytest -s
um my dagster
project tests
the -s flag is kind of running it in
verbose mode so we see all the log
output and any print statements that we
might put in there and you can see here
that our test passed
um in uh in a pretty short amount of
time so it wasn't pulling all that data
from GitHub it wasn't talking to any
external system it's basically just good
old-fashioned in-memory testing
um and it's always good practice to try
to break your test and make sure that
your test is actually testing something
so this is changing it to assert that
we've actually made a public gist
instead of a private gist
and
you should be able to see that this test
will break
yep it breaks
and then we can fix it again
so anyway
um this was a a crash course into
um uh building an ETL Pipeline with
dagster you learned a couple of things
you learned the the primary way of
development which is using
software-defined assets you learned a
little bit about how to use the user
interface
um you learned how to kind of migrate
from a hello world to a more
production-ready application through
using the resource system which lets you
both unlock testability as well as you
know unlock Secrets management and
configuration
um but there's a lot more to learn and
so we have a lot of documentation both
linked from the blog post as well as
additional tutorials and guides on our
website
um please check it out and uh if you
like what you see please star the repo
thank you very much