DE Zoomcamp 1.2.1 - Introduction to Docker

DataTalksClub
15 Jan 2022 · 23:55

Summary

TL;DR: The video introduces Docker, explaining key concepts like containers and isolation to show why Docker is useful for data engineers. It walks through running different Docker containers locally to demonstrate reproducibility across environments and support for local testing and integration. The presenter sets up a basic Python data-pipeline container, parameterizes it to process data for a specific day, and previews connecting it to a local Postgres database in the next video to load the New York taxi dataset for SQL practice.

Takeaways

  • 😀 Docker provides software delivery through isolated packages called containers
  • 👍 Containers ensure reproducibility across environments like local and cloud
  • 📦 Docker helps set up local data pipelines for experiments and tests
  • 🛠 Docker images allow building reproducible and portable services
  • 🔌 Docker enables running databases like Postgres without installing them
  • ⚙️ Docker containers can be parameterized to pass inputs like dates
  • 🎯 The Docker entrypoint sets the command a container runs on start and can be overridden
  • 🚆 Dockerfiles define instructions to build Docker images
  • 🖥 Docker builds images automatically from Dockerfiles
  • 🌎 Docker Hub provides public Docker images to use

Q & A

  • What is Docker and what are some of its key features?

    -Docker is a software platform that allows you to build, run, test, and deploy applications quickly using containers. Key features include portability, isolation, and reproducibility.

  • Why is Docker useful for data engineers?

    -Docker is useful for data engineers because it enables local testing and experimentation, integration testing, reproducibility across environments, and deployment to production platforms like AWS and Kubernetes.

  • What is a Docker image?

    -A Docker image is a snapshot or template that contains all the dependencies and configurations needed to run an application or service inside a Docker container.

  • What is a Docker container?

    -A Docker container is an isolated, self-contained execution environment created from a Docker image. The container has its own filesystem, resources, dependencies and configurations.

  • What is a Dockerfile?

    -A Dockerfile is a text file with instructions for building a Docker image. It specifies the base image to use, commands to run, files and directories to copy, environment variables, dependencies to install, and more.
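
For reference, a minimal Dockerfile in the spirit of the one built later in the video (the same instructions, written out here for convenience):

```dockerfile
# Base image pulled from Docker Hub
FROM python:3.9

# Install the dependency the pipeline needs
RUN pip install pandas

# Directory inside the image where subsequent commands run
WORKDIR /app

# Copy the script from the host into the image
COPY pipeline.py pipeline.py

# Command executed when a container starts from this image
ENTRYPOINT ["python", "pipeline.py"]
```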

  • How do you build a Docker image?

    -You build a Docker image by running the 'docker build' command, specifying the path of the Dockerfile and a tag for the image. This runs the instructions in the Dockerfile and creates an image.
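
For example, the build command used in the video (the tag `test:pandas` is simply the name chosen in the walkthrough):

```bash
# Build an image from the Dockerfile in the current directory (.)
# and tag it test:pandas
docker build -t test:pandas .
```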

  • How do you run a Docker container?

    -You run a Docker container using the 'docker run' command along with parameters for the image name, ports to expose, environment variables to set, volumes to mount, and the command to execute inside the container.
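
A generic shape of the command; the flag values below are placeholders, not specific values from the video:

```bash
# -it : interactive terminal
# -e  : set an environment variable inside the container
# -p  : publish host port 5432 to container port 5432
# -v  : mount a host directory into the container (volume)
docker run -it \
  -e SOME_VAR="value" \
  -p 5432:5432 \
  -v /host/path:/container/path \
  image-name:tag arg1 arg2
```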

  • What data set will be used in the course?

    -The New York City taxi rides dataset will be used throughout the course for tasks like practicing SQL, building data pipelines, and data processing.

  • What tools will be used alongside Docker?

    -The course will use PostgreSQL running in a Docker container as the database. The pgAdmin tool will also be run via Docker to communicate with Postgres and run SQL queries.

  • What will be covered in the next video?

    -The next video will cover running PostgreSQL locally with Docker, loading sample data into it using a Python script, and dockerizing that script to automate data loading.
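
As a preview of that video, running Postgres in a container typically looks like the sketch below; the user, password, database name and paths are assumptions here, and the exact values used in the course are shown in the next video:

```bash
docker run -it \
  -e POSTGRES_USER="root" \
  -e POSTGRES_PASSWORD="root" \
  -e POSTGRES_DB="ny_taxi" \
  -p 5432:5432 \
  -v $(pwd)/ny_taxi_postgres_data:/var/lib/postgresql/data \
  postgres:13
```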

Outlines

00:00

😊 Introducing Docker and its uses for data engineering

The paragraph introduces Docker, explaining that it delivers software in isolated containers. It discusses using Docker to run data pipelines and databases locally for experiments and testing. Docker provides reproducibility, so the same environment can run locally and in the cloud, and it supports local integration testing by isolating database instances. The paragraph explains why Docker is useful for data engineers.

05:02

😃 Docker enables reproducible and self-contained environments

This paragraph further discusses Docker's advantages. Docker images allow reproducing identical environments across different systems. Containers contain all dependencies, simplifying local experiments without installing anything locally. Multiple isolated containers can run simultaneously on one host without conflicts. The paragraph gives examples of running Postgres and pgAdmin with Docker without installations.
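
For example, pgAdmin can be started as a container instead of being installed; a sketch using the official `dpage/pgadmin4` image (the credentials and port mapping here are placeholder choices):

```bash
docker run -it \
  -e PGADMIN_DEFAULT_EMAIL="admin@admin.com" \
  -e PGADMIN_DEFAULT_PASSWORD="root" \
  -p 8080:80 \
  dpage/pgadmin4
# pgAdmin's web UI is then reachable at http://localhost:8080
```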

10:03

🤓 Using Windows terminal, Docker and Python in examples

The speaker uses Windows with a Git Bash terminal for the demos. The paragraph shows basic Docker commands to run images like Ubuntu and Python. It demonstrates interacting with the containers and installing packages, and shows that such changes do not persist across runs, which motivates building new images from Dockerfiles. An example sets up Python with pandas using a Dockerfile and builds an image that runs a pipeline script.
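
The commands demonstrated in this part are roughly:

```bash
# Verify that Docker is installed and working
docker run hello-world

# Interactive Ubuntu container running bash; everything after the image name
# is passed to the container, and changes are lost when the container exits
docker run -it ubuntu bash

# Python 3.9 image; by default this drops into a Python prompt
docker run -it python:3.9

# Override the entrypoint to get a shell instead, e.g. to pip install pandas
docker run -it --entrypoint=bash python:3.9
```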

15:04

💡Persisting state in Docker containers using Dockerfiles

The paragraph continues the Dockerfile example to persist the pandas installation. It explains that every container started from an image begins in the image's original state, so changes made inside a running container are lost unless they are baked into the image. The Dockerfile installs pandas and sets the entrypoint to a Bash shell for running commands. After building this image, importing pandas works inside the container.
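
The Dockerfile at this stage of the video is just two instructions on top of the base image; it is built with `docker build -t test:pandas .` and run with `docker run -it test:pandas`:

```dockerfile
FROM python:3.9.1

RUN pip install pandas

# Start a shell instead of the Python prompt when the container runs
ENTRYPOINT [ "bash" ]
```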

20:07

📝 Complete pipeline script example with Docker

The paragraph completes the Dockerfile pipeline example. It copies a pipeline script into the image that prints a success message after importing pandas. The Dockerfile sets the entrypoint to run this script directly. After rebuilding the image, the video demonstrates passing arguments to the script when running the container.
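
Putting the pieces together, the script and final Dockerfile from this part look roughly like this (the script only prints its arguments at this stage; the "fancy pandas stuff" is left as a placeholder):

```python
# pipeline.py
import sys
import pandas as pd  # imported only to prove pandas is available in the image

# sys.argv[0] is the script name; sys.argv[1] is the first CLI argument (the day)
print(sys.argv)
day = sys.argv[1]

# some fancy pandas stuff would go here

print(f'job finished successfully for day = {day}')
```

```dockerfile
FROM python:3.9.1

RUN pip install pandas

WORKDIR /app
COPY pipeline.py pipeline.py

ENTRYPOINT [ "python", "pipeline.py" ]
```

After `docker build -t test:pandas .`, running `docker run -it test:pandas 2022-01-15` passes the date (and any further arguments) straight through to the script.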

Keywords

💡Docker

Docker is a software platform that allows applications to be packaged in containers, which are isolated from each other and the underlying infrastructure. The transcript discusses using Docker to package a data pipeline application with its dependencies into a container. This ensures reproducibility across environments.

💡Container

A container is a standardized unit that packages up code and dependencies so an application can run quickly and reliably across computing environments. The transcript talks about running the data pipeline in a Docker container so it remains isolated from other processes on the host machine.

💡Data pipeline

A data pipeline is an automated process that pulls in data from one or more sources, executes data transformation tasks, and writes the results to a destination. The transcript discusses dockerizing a sample data pipeline script that processes CSV data using pandas.
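
As an illustration only (the video's script just prints a message), a single step of such a pipeline might look like:

```python
import pandas as pd

# hypothetical file names for illustration
df = pd.read_csv('trips.csv')               # source: CSV input
df = df[df['passenger_count'] > 0]          # transformation / cleaning
df.to_csv('trips_clean.csv', index=False)   # destination: output data
```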

💡Reproducibility

Reproducibility refers to the ability to reliably recreate the same computational environment and obtain the same results. Docker enables reproducibility by allowing container images to be easily shared and deployed across machines.

💡Integration testing

Integration testing verifies that different modules or services of an application pass data and work correctly together. The transcript suggests Docker is useful for setting up integration tests where the data pipeline is tested against a database.
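
A rough sketch of what such a check could look like, assuming the pipeline has loaded data into a dockerized Postgres reachable on localhost:5432 (the table name and credentials are assumptions):

```python
import pandas as pd
from sqlalchemy import create_engine  # requires sqlalchemy and a Postgres driver such as psycopg2

# assumed connection details for a local dockerized Postgres
engine = create_engine('postgresql://root:root@localhost:5432/ny_taxi')

# verify that the records we expect are there
count = pd.read_sql('SELECT count(*) AS n FROM yellow_taxi_data', engine)
assert count['n'][0] > 0
```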

💡CLI

CLI stands for command-line interface, where you interact with the computer by typing commands instead of using the graphical interface. The transcript shows an example of passing arguments via the CLI to the Dockerized data pipeline.

💡Entrypoint

The entrypoint is the command that is run when a Docker container starts from an image. The transcript overrides the entrypoint to directly execute the Python data pipeline script.
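
Both forms appear in the video: the ENTRYPOINT instruction inside the Dockerfile, and the --entrypoint flag that overrides it for a single run:

```bash
# In the Dockerfile:  ENTRYPOINT ["python", "pipeline.py"]
# Overriding it at run time to get a shell instead:
docker run -it --entrypoint=bash python:3.9
```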

💡Dockerfile

A Dockerfile is a text file with instructions for building a Docker image automatically. The transcript creates a Dockerfile that starts from a Python image, installs Pandas, and sets the entrypoint.

💡Docker Hub

Docker Hub is Docker's public registry that hosts container images which can be downloaded and run on your machine. The transcript shows pulling example images like hello-world from Docker Hub.

💡Docker image

A Docker image is a read-only template used for creating container instances. Built images act as snapshots to reliably recreate configured containers across environments.

Highlights

Docker delivers software in isolated packages called containers

Data pipelines transform input data into output data in isolated steps

Docker containers provide reproducible environments to run data pipelines

Docker enables local data pipeline experiments without installing dependencies

Docker allows running databases like Postgres without installing them

Docker containers run in isolation on the host machine without conflicts

Docker images snapshot container environments for portability across systems

CI/CD pipelines use Docker images to ensure reproducible deployments

Spark and serverless platforms utilize Docker for environment configuration

The Dockerfile defines the environment and commands to build a Docker image

Docker images allow installing libraries like pandas once for all containers

Docker containers can be parameterized with command line arguments

Entrypoint configures the default command run when starting a container

Volumes mount host directories into containers for file access

Next video will cover running Postgres in Docker and loading data

Transcripts

play00:00

welcome to our data engineering course

play00:02

and in this series of videos i'll talk

play00:04

about docker and this skill we'll start

play00:06

with docker

play00:11

this is actually what we'll cover in

play00:14

that part so we'll start with docker

play00:16

will tell you why we need docker why

play00:18

should we care as data engineers about

play00:20

docker

play00:22

and

play00:23

then after that we'll use docker to run

play00:25

postgres which is a database quite

play00:28

powerful database which we will use to

play00:30

practice some sql

play00:31

and then in the meantime while doing

play00:34

that we'll also take a look at the data

play00:35

set we will use for this course so this

play00:38

data set is this taxi rides new york

play00:41

data set we'll take a look at this data

play00:42

set and

play00:44

we will use this for practice in sql and

play00:47

we will use this data set also

play00:48

throughout the course

play00:50

for building data pipelines and for

play00:52

processing this data okay so let's start

play00:55

we'll start with docker so let me just

play00:57

go to google and

play00:59

look at what docker is so if i type in

play01:02

docker

play01:03

it tells you a bunch of things the

play01:05

interesting one is that it delivers

play01:08

software in packages called containers

play01:11

and containers are isolated from one

play01:13

another so this is important for us both

play01:16

things containers and isolation suppose

play01:18

we have a data pipeline we want to run

play01:21

this data pipeline in a docker container

play01:24

so it is this data pipeline is isolated

play01:27

from the rest of things

play01:29

let's start with data pipelines so data

play01:31

pipeline is a fancy name of

play01:35

process or service that gets in data and

play01:38

produces more data and this could be

play01:41

let's say a python script that gets some

play01:44

csv files some data

play01:46

so let's say it can be csv files and

play01:49

then it takes in this data and does

play01:51

something with this data some processing

play01:54

some transformation some cleaning and

play01:56

then it produces other data and this

play01:59

could be for example a table in postgres

play02:03

with something else so we have some

play02:04

input data and we have output so or this

play02:08

could be source and this could be a

play02:10

destination for example and yeah so this

play02:13

goes in to our data pipeline of course

play02:16

the data pipeline can contain multiple

play02:18

data sources it outputs data to some

play02:20

destination and of course inside data

play02:23

pipeline there could be many many

play02:24

different steps that

play02:26

also follow the same pattern we can have

play02:29

this mini pipelines that we can chain

play02:34

and then this whole thing

play02:36

would also be called the data pipeline

play02:38

and now let's focus on one particular

play02:40

step so we have a script that gets in

play02:42

some data in csv format and writes it to

play02:46

postgres

play02:47

and we want to run it on our computer on

play02:48

a host machine

play02:50

so this is host computer i use windows

play02:53

but if you use linux or macos this is

play02:57

the environment you have and then on

play02:59

this host computer you can have multiple

play03:02

containers that

play03:03

we run with docker so for example one

play03:05

container could be this our

play03:07

data pipeline and for example for

play03:09

running this data pipeline we need let's

play03:12

say we want to

play03:13

use ubuntu 20.04 right in this pipeline

play03:16

and then there are a bunch of things we

play03:19

depend on in order to run this pipeline

play03:21

so for example can be python 3.9 and

play03:24

then let's say if we want to read the

play03:27

csv file we will

play03:29

use pandas which is library in python

play03:31

for processing data and let's say this

play03:34

data pipeline will write results to our

play03:37

postgres database so then the data

play03:40

pipeline needs to know how to

play03:41

communicate with postgres so it needs

play03:43

postgres connection library and probably

play03:47

a bunch of other things so we can put

play03:49

this in a self-contained container and

play03:52

this container will have everything that

play03:55

this particular service this particular

play03:57

thing needs version of python all

play03:59

versions of libraries it will contain

play04:01

everything it needs for running this

play04:03

pipeline and we can actually have

play04:06

multiple containers in one host machine

play04:09

so for example we can also run our

play04:11

database postgres in a container so we

play04:15

will not need to install anything on our

play04:18

host computer we will not need to

play04:19

install postgres we will only need

play04:21

docker to be able to run a database it

play04:24

can also happen that we already have

play04:25

postgres on our computer on our host

play04:27

machine that we installed that we don't

play04:29

use through docker but we just installed

play04:31

it on our host computer and in this case

play04:34

this database and this database they

play04:37

will not conflict with each other so

play04:39

they will run in complete isolation and

play04:41

what is more we can actually have

play04:43

multiple databases running on our host

play04:46

computer inside docker and they will not

play04:50

know anything about each other they will

play04:52

not interfere with each other so this is

play04:54

quite good we can have more things so

play04:56

for example for accessing for

play04:59

communicating with postgres for running

play05:01

sql queries we will use a tool called

play05:03

pg admin we also don't need to install

play05:05

it we can also do we use this from

play05:07

docker so we can just run this as a

play05:10

container on our host computer we will

play05:12

not need to worry about installing

play05:13

anything as long as we have docker we

play05:15

can just run this pg admin

play05:17

and communicate with postgres run sql

play05:21

queries do testing and so on and another

play05:24

advantage that docker gives us is

play05:26

reproducibility so let's say we created

play05:29

a docker image a docker image is like a

play05:31

snapshot sort of of your container it

play05:33

has all the instructions that are needed

play05:36

to set up this particular environment so

play05:38

you have this docker image and you can

play05:40

take this docker image and run it in a

play05:42

different environment say

play05:44

we want now to take the data pipeline we

play05:46

developed and we want to run it in a

play05:48

different environment in google cloud in

play05:51

kubernetes or it could be aws batch or

play05:54

some other environment it doesn't matter

play05:57

and we can take this image and just run

play06:00

it there as a container and it will be

play06:02

the same container exactly the same

play06:04

container as we have locally so we

play06:06

this way we make sure that we have 100%

play06:08

reproducibility because this image and

play06:12

this image are identical they have the

play06:14

same versions of libraries they have the

play06:16

same operating system there so they are

play06:19

identical and this way we make sure that

play06:21

if it works on my computer then it will

play06:24

also work there so this is the main

play06:26

advantage of docker so why should we

play06:29

care about docker as data engineers we

play06:31

already mentioned the reproducibility so

play06:33

that's quite useful then setting up

play06:36

things for local experiments

play06:38

this is quite useful so this is what we

play06:40

are going to do in this series of videos

play06:42

we will use postgres we will run it

play06:45

locally so local experiments and not

play06:48

only experiments but also local tests

play06:50

integration tests so for example let's

play06:51

say we have this data pipeline which is

play06:53

a complex pipeline that is doing some

play06:55

something with the data and we want to

play06:57

make sure that

play06:58

whatever it's doing we expect these

play07:00

results so we can come up with a bunch

play07:03

of tests to make sure that the behavior

play07:05

is what we expect and when we run this

play07:08

this data pipeline against that database

play07:11

and we go to this database to make sure

play07:13

that all the records we expect to be

play07:14

there are there and records that we do

play07:17

not expect to be there are not there

play07:18

these things are called integration

play07:20

tests and docker is quite useful for

play07:22

setting up this integration tests in

play07:24

cicd this is not something we'll cover

play07:27

in this course i think this is general

play07:29

software engineering best practices to

play07:31

have things like that and uh yeah you

play07:33

can look up what ci/cd is for that we

play07:35

usually use things like github actions

play07:38

or gitlab cicd or jenkins things like

play07:41

that so you can take a look at that if

play07:43

you're interested if you haven't come

play07:45

across this concept before of cicd we

play07:48

will not be covering these things in

play07:50

this course but this is very useful i do

play07:52

recommend learning about this

play07:54

and then many times when we write batch

play07:56

jobs like these data pipelines we want

play07:59

to make sure that they are reproducible

play08:01

and we can run them so

play08:04

here and

play08:05

running on the cloud it can be aws batch

play08:08

kubernetes jobs and and so on so we just

play08:11

take our image and we run it on the

play08:14

cloud

play08:15

and then things like spark or

play08:18

serverless

play08:19

we can specify in spark so the spark is

play08:22

a thing for uh also defining data

play08:25

pipelines so we can specify all the

play08:27

dependencies we need for our data

play08:28

pipeline in spark with docker and then

play08:30

serverless this is a concept that is

play08:33

quite useful for processing data usually

play08:36

one record at a time so these are things

play08:38

like aws lambda and

play08:40

i don't remember it's called google

play08:41

functions maybe i'm not sure but

play08:44

these things

play08:45

usually let us define the environment

play08:47

also as a docker image now you can see

play08:49

containers are everywhere and for data

play08:51

engineers it's quite important to know

play08:53

how to use docker how to use containers

play08:55

to be able to run local experiments to

play08:57

make sure they are reproducible and to

play08:59

use in different environments

play09:01

there everywhere okay by now i think i

play09:04

convinced you that docker is useful so

play09:06

let's see this in action so right now i

play09:09

am in

play09:10

our course repo

play09:12

in week one basics and setup

play09:16

so i will create now a directory i'll

play09:18

call it

play09:20

docker

play09:21

sql let me cd to this directory by the

play09:23

way i think i mentioned i use windows

play09:25

and on windows i use a thing called

play09:28

mingw i don't know how to actually

play09:30

pronounce this but this is a linux like

play09:33

environment in windows so it has always

play09:36

linux commands like ls and

play09:39

others and this mingw comes from

play09:44

git bash when you install git for

play09:47

windows it comes with a package called bash

play09:50

emulator this mingw

play09:53

and i use that as a terminal

play09:56

i think you can also use like standard

play09:58

command prompt or powershell but i would

play10:00

recommend to actually use mingw

play10:03

or cygwin or something like this if you

play10:05

are on windows or you can use

play10:07

windows subsystem for linux which could

play10:09

be even better so i also have it here

play10:13

so this is a usual ubuntu but i run it

play10:16

on windows you can experiment with both

play10:18

and see what you like yeah on mac you

play10:20

don't have this problem you can just use

play10:22

the usual unix

play10:24

terminal and uh of course on ubuntu or

play10:26

on linux you don't have this problem at

play10:28

all so i'm using gitbash what i want to

play10:31

do now is i want to execute

play10:33

to start my editor for editing i use

play10:35

visual studio code again you don't have

play10:38

to use it you can use something else you

play10:40

can use pycharm you can use sublime

play10:42

editor you can use notepad plus plus you

play10:45

can use vim you can use whatever you

play10:47

want if you don't have any preferences

play10:49

and if you don't know what to use you

play10:51

can just pick visual studio code this is

play10:53

what i personally will use for that

play10:55

course and

play10:56

here we can create a new file so this

play10:58

file should be called docker file and

play11:01

this is where we will specify our image

play11:04

actually before that i wanted to show

play11:05

you something once you install docker

play11:07

you can test it using docker run

play11:10

hello world and what will

play11:13

it will do it will go to docker hub this

play11:17

is a place where docker keeps all the

play11:18

images and it will look for an image

play11:21

with this name hello world and it will

play11:23

download this this image and it will run

play11:25

this image and what we see here this is

play11:28

actually output

play11:30

from docker from this image around this

play11:33

and it outputs something it means that

play11:35

docker works and now we can it suggests

play11:37

to do something more ambitious let's say

play11:39

we can run that docker run ubuntu so run

play11:43

it means that we want to run this image

play11:45

-it means we want to do this in

play11:47

interactive mode i interactive t means

play11:49

terminal so it means that we want to be

play11:52

able to type something and then docker

play11:54

will react to that so let's run this and

play11:56

now what i mean by typing is yeah so you

play11:59

see i can type things here so ubuntu is

play12:02

the name of the image we want to run and

play12:04

then bash here is a command that we

play12:07

want to execute in this image so this is

play12:10

like a parameter so everything that

play12:11

comes after the image name

play12:13

is parameter to

play12:15

this container

play12:18

and in this way we just say that we want

play12:20

to execute a bash on this image and this

play12:22

way we get this patch prompt so we can

play12:25

execute things and let's say we want to

play12:27

do something stupid here like

play12:30

remove everything

play12:31

that we have in our image so do this

play12:36

and it says yeah it's dangerous okay

play12:39

whatever i want to execute it anyways

play12:43

yeah so now i did a stupid thing i don't

play12:46

even have ls because ls is also

play12:48

a program i deleted it so i deleted this

play12:51

usr bin i cannot even ls things so i

play12:54

don't know what is left on this

play12:55

container we don't have any files

play12:57

anymore and let us exit this container

play13:00

and run it again and when we run it

play13:04

we are again back to the state that we

play13:06

were before so this container is not

play13:08

affected by anything we did previously

play13:12

this is what isolation actually means if

play13:14

an app does something stupid our host

play13:16

machine is not affected

play13:18

okay so we again can do ls here this

play13:21

is not super exciting let's do something

play13:23

even more interesting let us run

play13:26

python let's say 3.9 here we specify the

play13:30

image that we want to run and this is

play13:31

a tag the tag you can think of a tag as

play13:34

a specific version of what you want to

play13:35

run so this is a tag and for us it

play13:38

means that we will run python 3.9

play13:41

let me execute this

play13:43

because i already run this image it

play13:46

doesn't download it it already used an

play13:48

image i downloaded i have downloaded

play13:50

previously but for you if you use it for

play13:53

the first time on your computer you will

play13:55

first see that it downloads an image and

play13:57

then runs it and then we get this python

play14:00

prompt and we can do things like

play14:02

print

play14:03

hello world

play14:05

we can

play14:06

write any python code like input others

play14:09

and for example we can do python stuff

play14:12

here

play14:13

what if we wanted to write a python

play14:15

script in this data pipeline we'll need

play14:17

to use pandas actually this one right so

play14:19

say it needs pandas so now i write

play14:22

import pandas and it says there is no

play14:25

module named pandas so we need to be

play14:27

able to install pandas here and what we

play14:29

usually do is something like pip

play14:30

install pandas but we do this outside of

play14:34

python prompt so we cannot just do this in

play14:36

python and install things there so let

play14:38

me exit this i pressed ctrl d right now

play14:41

to leave the python prompt

play14:43

so now we somehow need to get to bash to

play14:45

be able to install a command and for

play14:47

that we need to overwrite the entry

play14:49

point entry point is what exactly is

play14:51

executed when we run this container and

play14:54

entry point can be let's say bash and

play14:56

now instead of having a python prompt we

play14:58

have a bash prompt and we can execute

play15:00

bash commands and then we can do pip

play15:02

install pandas

play15:04

and right now we are installing pandas

play15:06

on this specific docker container

play15:09

and it needs to install a bunch of

play15:10

things for pandas so pandas depends on

play15:12

numpy

play15:17

okay now we installed pandas and we do

play15:21

python now and we enter the python

play15:24

prompt and we can execute things here

play15:26

for example import pandas and now it

play15:28

works so you can see what is the version

play15:30

of pandas for example

play15:33

yeah we can execute things here with

play15:35

pandas the problem here is now when we

play15:38

leave it so press ctrl d again now when

play15:40

we leave it and we execute this again

play15:42

and we run python again and go to import

play15:45

pandas there is no module named pandas

play15:48

for the same reason as our rm -rf slash we

play15:51

were able to recover from this so when

play15:53

we run this we run this specific

play15:54

container at this at that state so it

play15:57

was before we installed pandas so when

play16:00

we run it again it doesn't know that

play16:02

there should be pandas because this

play16:03

particular image doesn't have pandas

play16:05

even though we started the container

play16:07

based on this image we did some changes

play16:09

but the next time we started the

play16:10

container from this image we get the

play16:13

same state as uh before running all

play16:15

these things so somehow we need to add

play16:17

pandas to make sure that the pandas

play16:19

library is there when we run our code so

play16:22

let me exit this and for that we

play16:26

go back to this docker file that i

play16:27

created so here we can specify in the

play16:29

docker file we can specify all the

play16:31

instructions all the things that we want

play16:33

to run in order to create a new image

play16:36

based on whatever we want the docker

play16:37

file starts with usually with the from

play16:40

statements and in this statement we say

play16:43

what kind of base image we want to use

play16:45

for example we want to base our image on

play16:48

python 3.9 whatever we will run after

play16:51

that we'll use python 3.9 as a base

play16:53

image and then we can run a command so

play16:56

this command can be pip install

play16:59

pandas this will install pandas inside the

play17:01

container and it will create a new image

play17:03

based on that and then we can also

play17:06

override so remember when we do docker

play17:08

run we get

play17:10

python prompt we can override it and get

play17:13

bash entry point can be bash yeah so

play17:16

this is as simple simplest possible

play17:18

docker file with just two instructions

play17:20

we install pandas and we overwrite this

play17:23

entry point now we can build it

play17:25

for that we do docker build docker build

play17:28

command builds an image from a dockerfile

play17:32

it needs a couple of things so first of all

play17:33

it needs a tag the tag could be

play17:35

for example let's call it test and then

play17:38

we can

play17:39

just leave it like that and the image

play17:41

name will be test or we can add a tag

play17:43

here like a version could be test pandas

play17:45

or whatever and then we need a dot

play17:47

dot means that we want docker to

play17:50

build an image in this directory and in

play17:53

this directory in the current directory

play17:54

it will look for docker file and it will

play17:56

execute this docker file and we'll

play17:58

create an image

play18:00

with this name so that's right

play18:03

yeah actually i was doing some

play18:04

experiments before so you see that it

play18:06

says cached so when you run this it will

play18:09

be a little bit different so it will

play18:11

actually run pip install pandas in

play18:13

docker you will see this for me i

play18:15

already did this yesterday when i was

play18:18

preparing the materials so that's why

play18:20

for me it's cached maybe to see how it

play18:23

actually works let me take a specific

play18:25

version of python 3.9.1 i hope this

play18:28

version exists and then let's see what

play18:29

happens

play18:32

so now it downloads this specific

play18:34

version of python and then after it

play18:37

finishes downloading it it will also

play18:40

install pandas on top of that image

play18:43

now it's running in this thing so we can

play18:45

see the output what it is doing so it's

play18:47

installing dependencies for pandas

play18:52

okay so now it finished installing

play18:55

it took two minutes

play18:56

and now we can run this so let's do this

play19:00

docker run -it we don't need any

play19:03

parameters here so we run this

play19:05

image with this tag and because entry

play19:08

point here is bash we get a bash

play19:10

prompt here and now if we do python and

play19:13

do import and you see that this is

play19:15

actually the version of python we have

play19:17

it's a bit an older version it's almost

play19:19

one year old so now we do when we do

play19:21

import pandas

play19:23

it successfully can import pandas you

play19:25

can also check the version of pandas

play19:27

which is in this version so now we have

play19:29

this image and let's do something a bit

play19:31

more exciting so let me create a

play19:33

pipeline

play19:35

this will be our data pipeline in this

play19:37

data pipeline we will use pandas usually

play19:39

the convention when we import pandas

play19:42

import pandas as pd i don't know why let's

play19:44

just people use it this way so let's do

play19:46

this and then we will do

play19:48

some

play19:49

fancy stuff with pandas like loading

play19:53

csv file and

play19:55

yeah let's just do print

play19:58

job finished successfully so this will

play20:01

be our data pipeline that will do some

play20:04

fancy stuff with pandas so for us it

play20:06

will be just a way to check that pandas

play20:08

can be imported it will not do much and

play20:10

now we can copy this file

play20:13

from this directory from our current

play20:15

working directory to the docker image

play20:17

pipeline.py file so first we specify the

play20:19

name in the source on our host machine

play20:21

and then the name on the destination we

play20:23

can keep the same name you can also

play20:25

specify the working directory so work

play20:28

directory this will be the location in

play20:31

the image in the container where we will

play20:34

copy the file so i'll just call it app it will

play20:37

create a slash app directory and it will

play20:40

do cd slash app to this directory and

play20:43

then it will copy the file there so let

play20:45

me execute

play20:48

i will build it um i will keep the same

play20:50

tag so it will overwrite the previous

play20:52

tag

play20:54

okay that was quite fast because we

play20:55

didn't need to install pandas so it used

play20:58

the cached version and now let me run it

play21:01

and now we you see we are in this slash

play21:04

app directory so if i do pwd this is our

play21:07

current directory so current directory

play21:08

is slash app because this is what we specified

play21:11

here and we have our pipeline.py file

play21:14

there now let me run it

play21:17

and oh it finished the job successfully

play21:19

but in order to call it like a data

play21:21

pipeline this container has to be

play21:23

self-sufficient so we don't want to run

play21:25

the container

play21:26

go there and execute python pipeline.py

play21:29

we also want to add some parameters

play21:30

there like for example we want to run

play21:32

this pipeline for a specific day it will

play21:35

pull all the data for this specific day

play21:37

apply some transformation and save the

play21:39

results somewhere so let me configure it

play21:43

i will use uh cli things import sys

play21:47

and then this is argv so these are

play21:50

the command line arguments that we pass

play21:53

to the script so let me just print

play21:54

everything like all the arguments for

play21:56

you to see what can be there and then i

play21:59

think

play22:00

argument number zero is the name of the

play22:02

file the argument number one is whatever

play22:05

with us so let's say here we can have a

play22:07

variable that will call day that will be

play22:10

the first

play22:11

command line argument and you can see a

play22:13

job finished successfully for the equals

play22:16

p

play22:17

now let's see what happens i will

play22:19

rebuild it again

play22:20

one more thing i want to do okay so now

play22:22

we specify this pipeline i want to

play22:24

override this entry point so i want to

play22:27

say that when we do docker run i want

play22:29

docker to do python

play22:31

pipeline.py so this is what i want

play22:34

docker to do so let me build it one more

play22:36

time

play22:37

and now i want to run it now when i do

play22:39

this it will run this pipeline and i

play22:43

want to configure it to run it for a

play22:44

specific day so let's say today is

play22:47

15th of january and when i write an

play22:51

argument like this it will be an

play22:52

argument for the thing running in the

play22:55

container now let me execute you will

play22:56

see

play22:58

so this is the

play23:00

this argv thing this one so it shows

play23:02

all the arguments and we use this day

play23:05

parameter and then we see that our job

play23:08

finished successfully for this

play23:09

particular day i don't need f here

play23:13

and if i put more arguments uh here so

play23:16

let's say one two three hello so all

play23:19

these arguments will be passed also to

play23:21

argv you see we have a larger longer

play23:23

list with more arguments so this is how

play23:26

we can parameterize our

play23:28

data pipeline scripts okay this i just

play23:30

wanted to give you a taste of what we can

play23:32

do and

play23:33

in the next video we will see how we can

play23:35

run postgres

play23:37

locally with docker and we will also see

play23:40

how to put some data in this postgres

play23:42

with python we actually will keep

play23:44

working on this pipeline script and then we

play23:46

will dockerize the script and it will

play23:48

put this new york taxi rides data set

play23:51

to postgres that's all for this

play23:53

video and see you soon