DE Zoomcamp 1.2.1 - Introduction to Docker
Summary
TLDR: The video introduces Docker, explaining key concepts like containers and isolation to show why Docker is useful for data engineers. It walks through examples of running different Docker containers locally to showcase reproducibility across environments and support for local testing and integration. The presenter sets up a basic Python data pipeline in a Docker container, parameterizes it to process data for a specific day, and previews connecting it to a local Postgres database in the next video to load the New York taxi dataset for SQL practice.
Takeaways
- 😀 Docker provides software delivery through isolated packages called containers
- 👍 Containers ensure reproducibility across environments like local and cloud
- 📦 Docker helps set up local data pipelines for experiments and tests
- 🛠 Docker images allow building reproducible and portable services
- 🔌 Docker enables running databases like Postgres without installing them
- ⚙️ Docker containers can be parameterized to pass inputs like dates
- 🎯 The Docker entrypoint sets the default command a container runs and can be overridden at run time
- 🚆 Dockerfiles define instructions to build Docker images
- 🖥 Docker builds images automatically from Dockerfiles
- 🌎 Docker Hub provides public Docker images to use
Q & A
What is Docker and what are some of its key features?
-Docker is a software platform that allows you to build, run, test, and deploy applications quickly using containers. Key features include portability, isolation, and reproducibility.
Why is Docker useful for data engineers?
-Docker is useful for data engineers because it enables local testing and experimentation, integration testing, reproducibility across environments, and deployment to production platforms like AWS and Kubernetes.
What is a Docker image?
-A Docker image is a snapshot or template that contains all the dependencies and configurations needed to run an application or service inside a Docker container.
What is a Docker container?
-A Docker container is an isolated, self-contained execution environment created from a Docker image. The container has its own filesystem, resources, dependencies and configurations.
What is a Dockerfile?
-A Dockerfile is a text file with instructions for building a Docker image. It specifies the base image to use, commands to run, files and directories to copy, environment variables, dependencies to install, and more.
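As an illustrative sketch (mirroring the Dockerfile assembled later in this lesson, not a definitive template):

```dockerfile
# base image pulled from Docker Hub
FROM python:3.9
# command run while building the image
RUN pip install pandas
# working directory inside the image
WORKDIR /app
# copy a file from the host build context into the image (source, destination)
COPY pipeline.py pipeline.py
# default command executed when a container starts
ENTRYPOINT ["python", "pipeline.py"]
```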
How do you build a Docker image?
-You build a Docker image by running the 'docker build' command, specifying the path of the Dockerfile and a tag for the image. This runs the instructions in the Dockerfile and creates an image.
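For example, with the tag used later in this lesson:

```bash
docker build -t test:pandas .
```

The dot tells Docker to use the current directory as the build context and to look for a file named Dockerfile there.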
How do you run a Docker container?
-You run a Docker container using the 'docker run' command along with parameters for the image name, ports to expose, environment variables to set, volumes to mount, and the command to execute inside the container.
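A sketch of the flag combinations described above; the specific values (user, port, path, image) are illustrative placeholders, not commands from this video:

```bash
# -e sets an environment variable, -p publishes host:container ports,
# -v mounts a host directory into the container's filesystem
docker run -it \
  -e POSTGRES_USER=root \
  -p 5432:5432 \
  -v $(pwd)/data:/var/lib/postgresql/data \
  postgres:13
```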
What data set will be used in the course?
-The New York City taxi rides dataset will be used throughout the course for tasks like practicing SQL, building data pipelines, and data processing.
What tools will be used alongside Docker?
-The course will use PostgreSQL in a Docker container as a database. The pgAdmin tool will also be run via Docker to communicate with Postgres and run SQL queries.
What will be covered in the next video?
-The next video will cover running PostgreSQL locally with Docker, loading sample data into it using a Python script, and dockerizing that script to automate data loading.
Outlines
😊 Introducing Docker and its uses for data engineering
The paragraph introduces Docker, explaining that it delivers software in isolated containers. It discusses using Docker to run data pipelines and databases locally for experiments and testing. Docker provides reproducibility to run the same environments locally and on the cloud. It enables local integration testing by isolating different database instances. The paragraph explains why Docker is useful for data engineers.
😃 Docker enables reproducible and self-contained environments
This paragraph further discusses Docker's advantages. Docker images allow reproducing identical environments across different systems. Containers contain all dependencies, simplifying local experiments without installing anything locally. Multiple isolated containers can run simultaneously on one host without conflicts. The paragraph gives examples of running Postgres and pgAdmin with Docker without installations.
🤓 Using Windows terminal, Docker and Python in examples
The speaker uses Windows with a Git Bash terminal for the demos. The paragraph shows basic Docker commands to run images like Ubuntu and Python. It demonstrates interacting with the containers, installing packages, and shows that runtime changes are lost when a container restarts, which motivates Dockerfiles. An example sets up Python with Pandas using a Dockerfile and builds an image that runs a pipeline script.
💡Persisting state in Docker containers using Dockerfiles
The paragraph continues the Dockerfile example to persist the Pandas installation in the image. It explains that every container started from an image resets to the image's state, so changes made at runtime are lost. The Dockerfile installs Pandas and sets the entry point to a Bash shell for running commands. After building this image with Pandas, the import works inside the container.
📝 Complete pipeline script example with Docker
The paragraph completes the Dockerfile pipeline example. It copies a pipeline script into the image that prints a success message after importing Pandas. The Dockerfile overrides the entry point to directly run this script. After building the updated image, it demonstrates passing arguments to this script when running the container.
Keywords
💡Docker
💡Container
💡Data pipeline
💡Reproducibility
💡Integration testing
💡CLI
💡Entrypoint
💡Dockerfile
💡Docker Hub
💡Docker image
Highlights
Docker delivers software in isolated packages called containers
Data pipelines transform input data into output data in isolated steps
Docker containers provide reproducible environments to run data pipelines
Docker enables local data pipeline experiments without installing dependencies
Docker allows running databases like Postgres without installing them
Docker containers run in isolation on the host machine without conflicts
Docker images snapshot container environments for portability across systems
CI/CD pipelines use Docker images to ensure reproducible deployments
Spark and serverless platforms utilize Docker for environment configuration
The Dockerfile defines the environment and commands to build a Docker image
Docker images allow installing libraries like pandas once for all containers
Docker containers can be parameterized with command line arguments
Entrypoint configures the default command run when starting a container
Volumes mount host directories into containers for file access
Next video will cover running Postgres in Docker and loading data
Transcripts
welcome to our data engineering course
and in this series of videos i'll talk
about docker and sql we'll start
with docker
this is actually what we'll cover in
that part so we'll start with docker
will tell you why we need docker why
should we care as data engineers about
docker
and
then after that we'll use docker to run
postgres which is a database quite
powerful database which we will use to
practice some sql
and then in the meantime while doing
that we'll also take a look at the data
set we will use for this course so this
data set is this taxi rides new york
data set we'll take a look at this data
set and
we will use this for practice in sql and
we will use this data set also
throughout the course
for building data pipelines and for
processing this data okay so let's start
we'll start with docker so let me just
go to google and
look at what docker is so if i type in
docker
it tells you a bunch of things the
interesting one is that it delivers
software in packages called containers
and containers are isolated from one
another so this is important for us both
things containers and isolation suppose
we have a data pipeline we want to run
this data pipeline in a docker container
so it is this data pipeline is isolated
from the rest of things
let's start with data pipelines so data
pipeline is a fancy name for a
process or a service that gets in data and
produces more data and this could be
let's say a python script that gets some
csv files some data
so let's say it can be csv files and
then it takes in this data and does
something with this data some processing
some transformation some cleaning and
then it produces other data and this
could be for example a table in postgres
with something else so we have some
input data and we have output so or this
could be source and this could be a
destination for example and yeah so this
goes in to our data pipeline of course
the data pipeline can contain multiple
data sources it outputs data to some
destination and of course inside data
pipeline there could be many many
different steps that
also follow the same pattern we can have
this mini pipelines that we can chain
and then this whole thing
would also be called the data pipeline
and now let's focus on one particular
step so we have a script that gets in
some data in csv format and writes it to
postgres
and we want to run it on our computer on
a host machine
so this is host computer i use windows
but if you use linux or macos this is
the environment you have and then on
this host computer you can have multiple
containers that
we run with docker so for example one
container could be this our
data pipeline and for example for
running this data pipeline we need let's
say we want to
use ubuntu 20.04 right in this pipeline
and then there are a bunch of things we
depend on in order to run this pipeline
so for example can be python 3.9 and
then let's say if we want to read the
csv file we will
use pandas which is library in python
for processing data and let's say this
data pipeline will write results to our
postgres database so then the data
pipeline needs to know how to
communicate with postgres so it needs
postgres connection library and probably
a bunch of other things so we can put
this in a self-contained container and
this container will have everything that
this particular service this particular
thing needs version of python all
versions of libraries it will contain
everything it needs for running this
pipeline and we can actually have
multiple containers in one host machine
so for example we can also run our
database postgres in a container so we
will not need to install anything on our
host computer we will not need to
install postgres we will only need
docker to be able to run a database it
can also happen that we already have
postgres on our computer on our host
machine that we installed that we don't
use through docker but we just installed
it on our host computer and in this case
this database and this database they
will not conflict with each other so
they will run in complete isolation and
what is more we can actually have
multiple databases running on our host
computer inside docker and they will not
know anything about each other they will
not interfere with each other so this is
quite good we can have more things so
for example for accessing for
communicating with postgres for running
sql queries we will use a tool called
pg admin we also don't need to install
it we can also do we use this from
docker so we can just run this as a
container on our host computer we will
not need to worry about installing
anything as long as we have docker we
can just run this pg admin
and communicate with postgres run sql
queries do testing and so on and another
advantage that docker gives us is
reproducibility so let's say we created
a docker image a docker image is like a
snapshot sort of of your container it
has all the instructions that are needed
to set up this particular environment so
you have this docker image and you can
take this docker image and run it in a
different environment say
we want now to take the data pipeline we
developed and we want to run it in a
different environment in google cloud in
kubernetes or it could be aws batch or
some other environment it doesn't matter
and we can take this image and just run
it there as a container and it will be
the same container exactly the same
container as we have locally so we
this way we make sure that we have 100%
reproducibility because this image and
this image are identical they have the
same versions of libraries they have the
same operating system there so they are
identical and this way we make sure that
if it works on my computer then it will
also work there so this is the main
advantage of docker so why should we
care about docker as data engineers we
already mentioned the reproducibility so
that's quite useful then setting up
things for local experiments
this is quite useful so this is what we
are going to do in this series of videos
we will use postgres we will run it
locally so local experiments and not
only experiments but also local tests
integration tests so for example let's
say we have this data pipeline which is
a complex pipeline that is doing some
something with the data and we want to
make sure that
whatever it's doing we expect these
results so we can come up with a bunch
of tests to make sure that the behavior
is what we expect and when we run this
this data pipeline against that database
and we go to this database to make sure
that all the records we expect to be
there are there and records that we do
not expect to be there are not there
these things are called integration
tests and docker is quite useful for
setting up these integration tests in
cicd this is not something we'll cover
in this course i think this is general
software engineering best practices to
have things like that and uh yeah you
can look up what cicd is for that we
usually use things like github actions
or gitlab cicd or jenkins things like
that so you can take a look at that if
you're interested if you haven't come
across this concept before of cicd we
will not be covering these things in
this course but this is very useful i do
recommend learning about this
and then many times when we write batch
jobs like these data pipelines we want
to make sure that they are reproducible
and we can run them so
here and
running on the cloud it can be aws batch
kubernetes jobs and and so on so we just
take our image and we run it on the
cloud
and then things like spark or
serverless
we can specify in spark so the spark is
a thing for uh also defining data
pipelines so we can specify all the
dependencies we need for our data
pipeline in spark with docker and then
serverless this is a concept that is
quite useful for processing data usually
one record at a time so these are things
like aws lambda and
i don't remember it's called google cloud
functions maybe i'm not sure but
these things
usually let us define the environment
also as a docker image now you can see
containers are everywhere and for data
engineers it's quite important to know
how to use docker how to use containers
to be able to run local experiments to
make sure they are reproducible and to
use in different environments
they're everywhere okay by now i think i
convinced you that docker is useful so
let's see this in action so right now i
am in
our course repo
in week one basics and setup
so i will create now a directory i'll
call it
docker
sql let me cd to this directory by the
way i think i mentioned i use windows
and on windows i use a thing called
mingw i don't know how to actually
pronounce this but this is a linux-like
environment in windows so it has all the
linux commands like ls and
others and this mingw comes from
git bash when you install git for
windows it comes with a bash
emulator this mingw
and i use that as a terminal
i think you can also use like standard
command prompt or powershell but i would
recommend to actually use mingw
or cygwin or something like this if you
are on windows or you can use
windows subsystem for linux which could
be even better so i also have it here
so this is a usual ubuntu but i run it
on windows you can experiment with both
and see what you like yeah on mac you
don't have this problem you can just use
the usual unix
terminal and uh of course on ubuntu or
on linux you don't have this problem at
all so i'm using git bash what i want to
do now is i want to execute
to start my editor for editing i use
visual studio code again you don't have
to use it you can use something else you
can use pycharm you can use sublime
editor you can use notepad plus plus you
can use vim you can use whatever you
want if you don't have any preferences
and if you don't know what to use you
can just pick visual studio code this is
what i personally will use for that
course and
here we can create a new file so this
file should be called dockerfile and
this is where we will specify our image
actually before that i wanted to show
you something once you install docker
you can test it using docker run
hello world and what will
it will do it will go to docker hub this
is a place where docker keeps all the
images and it will look for an image
with this name hello world and it will
download this image and it will run
this image and what we see here this is
actually output
from docker from running this image
and it outputs something it means that
docker works and now it suggests
to do something more ambitious let's say
we can run docker run ubuntu so run
it means that we want to run this image
-it means we want to do this in
interactive mode i means interactive t means
terminal so it means that we want to be
able to type something and then docker
will react to that so let's run this and
now what i mean by typing is yeah so you
see i can type things here so ubuntu is
the name of the image we want to run and
then bash is here is a command that we
want to execute in this image so this is
like a parameter so everything that
comes after the image name
is parameter to
this container
and in this way we just say that we want
to execute a bash on this image and this
way we get this bash prompt so we can
execute things and let's say we want to
do something stupid here like
remove everything
that we have in our image so do this
and it says yeah it's dangerous okay
whatever i want to execute it anyways
yeah so now i did a stupid thing i don't
even have ls because ls is also
a program i deleted it so i deleted this
/usr/bin so i cannot even ls things so i
don't know what is left on this
container we don't have any files
anymore and let us exit this container
and run it again and when we run it
we are again back to the state that we
were in before so this container is not
affected by anything we did previously
this is what isolation actually means if
an app does something stupid our host
machine is not affected
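The sequence demonstrated here, roughly (the exact rm invocation is an assumption, shown only as a comment because it wipes the container's filesystem):

```bash
docker run -it ubuntu bash   # bash, placed after the image name, is a parameter
# inside the container the presenter deletes everything, e.g.:
#   rm -rf / --no-preserve-root   # destructive, but contained to this container
exit                          # leave the container
docker run -it ubuntu bash    # a fresh container starts from the image's clean state
```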
okay so we again can do ls here this
is not super exciting let's do something
even more interesting let us run
python let's say 3.9 here we specify the
image that we want to run and this is
a tag the tag you can think of a tag as
a specific version of what you want to
run so this is a tag and for us it
means that we will run python 3.9
let me execute this
because i already ran this image it
doesn't download it it uses an
image i have downloaded
previously but for you if you use it for
the first time on your computer you will
first see that it downloads an image and
then runs it and then we get this python
prompt and we can do things like
hello world
we can
write any python code like input others
and for example we can do python stuff
here
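The command in question, for reference:

```bash
docker run -it python:3.9   # 'python' is the image name, '3.9' is the tag
# this lands in a python prompt, where e.g. print('hello world') works
```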
what if we wanted to write a python
script in this data pipeline we'll need
to use pandas actually this one right so
say it needs pandas so now i write
import pandas and it says there is no
module named pandas so we need to be
able to install pandas here and what we
usually do is something like pip
install pandas but we do this outside of
the python prompt we can't just do this in
python and install things there so let
me exit this i pressed ctrl d right now
to leave the python prompt
so now we somehow need to get to bash to
be able to install a command and for
that we need to overwrite the entry
point entry point is what exactly is
executed when we run this container and
entry point can be let's say bash and
now instead of having a python prompt we
have a bash prompt and we can execute
bash commands and then we can do pip
install pandas
and right now we are installing pandas
on this specific docker container
and it needs to install a bunch of
things for pandas so pandas depends on
numpy
okay now we installed pandas and we do
python now and we enter the python
prompt and we can execute things here
for example import pandas and now it
works so you can see what is the version
of pandas for example
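Roughly the commands used in this segment (standard docker CLI flag spelling assumed):

```bash
docker run -it --entrypoint=bash python:3.9   # get bash instead of the python prompt
pip install pandas   # run inside the container's bash prompt
python               # then at the python prompt: import pandas; pandas.__version__
```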
yeah we can execute things here with
pandas the problem here is now when we
leave it so press ctrl d again now when
we leave it and we execute this again
and we run python again and go to import
pandas there is no module named pandas
for the same reason as with rm -rf we
were able to recover from this so when
we run this we run this specific
container at this at that state so it
was before we installed pandas so when
we run it again it doesn't know that
there should be pandas because this
particular image doesn't have pandas
even though we started the container
based on this image we did some changes
but the next time we started the
container from this image we get the
same state as uh before running all
these things so somehow we need to add
pandas to make sure that the pandas
library is there when we run our code so
let me exit this and for that we
go back to this docker file that i
created so here we can specify in the
docker file we can specify all the
instructions all the things that we want
to run in order to create a new image
based on whatever we want the docker
file starts with usually with the from
statements and in this statement we say
what kind of base image we want to use
for example we want to base our image on
python 3.9 whatever we will run after
that we'll use python 3.9 as a base
image and then we can run a command so
this command can be pip install
pandas this will install pandas inside the
container and it will create a new image
based on that and then we can also
override so remember when we do docker
run we get
python prompt we can override it and get
bash entry point can be bash yeah so
this is as simple simplest possible
docker file with just two instructions
we install pandas and we overwrite this
entry point now we can build it
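A sketch of the dockerfile at this stage (exec-form ENTRYPOINT shown; the video may write it slightly differently):

```dockerfile
FROM python:3.9
RUN pip install pandas
ENTRYPOINT ["bash"]
```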
for that we do docker build docker build
command builds an image from dockerfile
it needs a couple of things so first of all
it needs a tag which could be
for example let's call it test and then
we can
just leave it like that and the image
name will be test or we can add a tag
here like a version could be test:pandas
or whatever and then we need a dot at the
end dot means that we want docker to
build an image in this directory and in
this directory in the current directory
it will look for docker file and it will
execute this docker file and we'll
create an image
with this name so that's right
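That is:

```bash
docker build -t test:pandas .   # '.' = use the current directory as build context
```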
yeah actually i was doing some
experiments before so you see that it
says cached so when you run this it will
be a little bit different so it will
actually run pip install pandas in
docker you will see this for me i
already did this yesterday when i was
preparing the materials so that's why
for me it's cached maybe to see how it
actually works let me take a specific
version of python 3.9.1 i hope this
version exists and then let's see what
happens
so now it downloads this specific
version of python and then after it
finishes downloading it it will also
install pandas on top of that image
now it's running in this thing so we can
see the output what it is doing so it's
installing dependencies for pandas
okay so now it finished installing
it took two minutes
and now we can run this so let's do this
docker run -it we don't need any
parameters here so we run this
image with this tag and because entry
point here is bash we get a bash
prompt here and now if we do python and
do import and you see that this is
actually the version of python we have
it's a bit an older version it's almost
one year old so now we do when we do
import pandas
it successfully can import pandas you
can also check the version of pandas
which is in this version so now we have
this image and let's do something a bit
more exciting so let me create a
pipeline
this will be our data pipeline in this
data pipeline we will use pandas usually
the convention when we import pandas
import pandas as pd i don't know why it's
just people use it this way so let's do
this and then we will do
some
fancy stuff with pandas like loading
csv file and
yeah let's just do print
job finished successfully so this will
be our data pipeline that will do some
fancy stuff with pandas so for us it
will be just a way to check that pandas
can be imported it will not do much and
now we can copy this file
from this directory from our current
working directory to the docker image
this pipeline.py file so first we specify the
name of the source on our host machine
and then the name at the destination we
can keep the same name you can also
specify the working directory so work
directory this will be the location in
the image in the container where we will
copy the file so i'll just call it app it will
create a /app directory and it will
do cd /app to this directory and
then it will copy the file there so let
me execute
i will build it um i will keep the same
tag so it will overwrite the previous
tag
okay that was quite fast because we
didn't need to install pandas so it used
the cached version and now let me run it
and now we you see we are in this slash
app directory so if i do pwd this is our
current directory so current directory
is /app because this is what we specified
here and we have our pipeline.py file
there now let me run it
and oh it finished the job successfully
but in order to call it like a data
pipeline this container has to be
self-sufficient so we don't want to run
the container
go there and execute python pipeline.py
we also want to add some parameters
there like for example we want to run
this pipeline for a specific day it will
pull all the data for this specific day
apply some transformation and save the
results somewhere so let me configure it
i will use sys.argv so import sys
and then this is sys.argv so these are
the command line arguments that we pass
to the script so let me just print
everything like all the arguments for
you to see what can be there and then i
think
argument number zero is the name of the
file the argument number one is whatever
we pass so let's say here we can have a
variable that we call day that will be
the first
command line argument and you can see
job finished successfully for day equals
the day we pass
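Putting the pieces together, the script at this point reads approximately:

```python
# pipeline.py - a sketch of the script as built up in the video
import sys
import pandas as pd  # imported only to verify pandas is available

# sys.argv[0] is the script name; sys.argv[1] is the first real argument
print(sys.argv)
day = sys.argv[1]

# some fancy pandas stuff would go here

print(f'job finished successfully for day = {day}')
```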
now let's see what happens i will
rebuild it again
one more thing i want to do okay so now
we specify this pipeline i want to
override this entry point so i want to
say that when we do docker run i want
docker to do python
pipeline.py so this is what i want
docker to do so let me build it one more
time
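The dockerfile now reads approximately:

```dockerfile
FROM python:3.9
RUN pip install pandas
WORKDIR /app
COPY pipeline.py pipeline.py
ENTRYPOINT ["python", "pipeline.py"]
```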
and now i want to run it now when i do
this it will run this pipeline and i
want to configure it to run it for a
specific day so let's say today is
15th of january and when i write an
argument like this it will be an
argument for the thing running in the
container now let me execute you will
see
so this is the
this argv thing this one so it shows
all the arguments and we use this day
parameter and then we see that our job
finished successfully for this
particular day i don't need f here
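The run command with a date argument looks like this (the date value is illustrative):

```bash
docker run -it test:pandas 2021-01-15
# everything after the image name is forwarded to the entrypoint,
# so the date arrives as sys.argv[1] inside pipeline.py
```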
and if i put more arguments uh here so
let's say one two three hello so all
these arguments will be passed also to
argv you see we have a longer
list with more arguments so this is how
we can parameterize our
data pipeline scripts okay this i just
wanted to give you a taste of what we can
do and
in the next video we will see how we can
run postgres
locally with docker and we will also see
how to put some data in this postgres
with python we actually will keep
working this pipeline script and then we
will dockerize the script and it will
put this new york taxi rides data set
to postgres that's all for this
video and see you soon