What Tools Should Data Engineers Know In 2024 - 100 Days Of Data Engineering
Summary
TL;DR: The video discusses the multitude of tools and skills necessary for a successful career as a data engineer. It emphasizes the importance of understanding programming languages like SQL and Python, working with Linux, and mastering version control with Git. The speaker also highlights the significance of working with databases, cloud data platforms, and ETL/data pipelines, as well as the evolving nature of data engineering tools. The video serves as a guide for those looking to break into the field, stressing the value of a solid foundation in both tools and best practices for data management and processing.
Takeaways
- The landscape of data engineering tools is vast and constantly evolving, requiring adaptability and continuous learning.
- Core programming languages and technologies like SQL, Python, and Linux are fundamental to a data engineer's skill set.
- Understanding the basics of object-oriented programming and writing efficient functions is essential for effective data engineering.
- Familiarity with version control systems like Git is crucial for managing code and collaborating with teams.
- Knowledge of secure file transfer protocols (SFTP) and encryption tools (PGP) is necessary for data security and compliance.
- Working with databases, both traditional RDBMS and NoSQL, is a key responsibility of data engineers for data extraction and manipulation.
- Cloud data platforms and warehouses like Snowflake, Databricks, and BigQuery are becoming increasingly important in modern data engineering.
- Data orchestration and pipeline tools such as Airflow and Azure Data Factory help automate and manage data workflows.
- A basic understanding of containerization (Docker) and orchestration (Kubernetes) can be beneficial, even if managed by a DevOps team.
- The ability to choose the right tool for the job, whether it's a data warehouse, ETL, or data pipeline solution, is a valuable skill for data engineers.
- Focusing on building a solid foundation in data engineering principles and tools can lead to a successful and adaptable career in the field.
Q & A
What are some of the core programming languages and technologies a data engineer should be familiar with?
- A data engineer should have a strong understanding of SQL, Python, and Linux. They should also be comfortable working with Bash scripts and have a basic knowledge of networking.
How have the tools used in data engineering evolved over time?
- Data engineering tools have changed significantly over the years. Initially, engineers had to manually manage solutions like Hadoop and Spark by setting up their own infrastructure. Nowadays, cloud-based services like Databricks, Athena, and others have simplified the process.
What is the importance of version control in data engineering?
- Version control is crucial for managing code changes, collaborating with other engineers, and maintaining a record of the development process. Familiarity with tools like Git is essential for any data engineer.
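As a tiny illustration of the everyday add/commit/push loop, here is a hedged sketch using the GitPython library; the repository path, file name, and remote name are placeholders, and the same steps map one-to-one onto the `git add`, `git commit`, and `git push` commands on the CLI.

```python
from git import Repo

# Placeholder path to a local clone of your pipeline repository.
repo = Repo("/path/to/data-pipelines")

# Stage a changed file, commit it, and push to the shared remote --
# the same loop as `git add`, `git commit`, and `git push`.
repo.index.add(["dags/daily_orders_pipeline.py"])
repo.index.commit("Add daily orders pipeline DAG")
repo.remote(name="origin").push()
```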
What are some of the basic technical tools and skills that a data engineer should possess?
- Basic technical skills for a data engineer include understanding SFTP for secure file transfers, using PGP for encryption, and having a foundational knowledge of object-oriented programming and writing functions in Python.
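For illustration, here is a minimal sketch of the kind of SFTP push described above, assuming the paramiko library and a hypothetical partner host, credentials, and file path; the file would typically be PGP-encrypted beforehand, for example with a tool like python-gnupg.

```python
import paramiko

# Hypothetical host, credentials, and paths -- replace with your partner's details.
transport = paramiko.Transport(("sftp.partner.example.com", 22))
transport.connect(username="etl_user", password="secret")

sftp = paramiko.SFTPClient.from_transport(transport)
# Push an already PGP-encrypted export to the partner's inbound folder.
sftp.put("daily_export.csv.pgp", "/inbound/daily_export.csv.pgp")

sftp.close()
transport.close()
```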
How do different databases play a role in data engineering?
- Data engineers often interact with a variety of databases, from traditional relational databases like PostgreSQL and MySQL to NoSQL databases like MongoDB. Understanding how to pull data from these sources and manipulate it is a key part of the job.
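As a rough example of what pulling from a source database looks like, here is a short sketch using psycopg2 against a hypothetical PostgreSQL orders table; the host, credentials, table, and cutoff date are assumptions for illustration.

```python
import psycopg2

# Hypothetical connection details for a source PostgreSQL database.
conn = psycopg2.connect(
    host="source-db.internal",
    dbname="sales",
    user="etl_reader",
    password="secret",
)

with conn, conn.cursor() as cur:
    # Incremental extract: only rows changed since the last successful run.
    cur.execute(
        "SELECT order_id, customer_id, amount, updated_at "
        "FROM orders WHERE updated_at > %s",
        ("2024-01-01",),
    )
    rows = cur.fetchall()

conn.close()
print(f"Pulled {len(rows)} changed rows")
```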
What is the role of cloud data platforms and warehouses in data engineering?
- Cloud data platforms and warehouses like Snowflake, Databricks, and BigQuery are used to build data lakes or data warehouses. They offer different architectures and features compared to traditional databases, and a data engineer must understand these differences to use them effectively.
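To make it concrete that you still drive these platforms mostly through SQL even though the engine underneath differs from a traditional RDBMS, here is a hedged sketch using the Snowflake Python connector; the account, warehouse, and table names are placeholders.

```python
import snowflake.connector

# Placeholder account and object names -- adjust for your environment.
conn = snowflake.connector.connect(
    account="my_account",
    user="etl_user",
    password="secret",
    warehouse="TRANSFORM_WH",
    database="ANALYTICS",
    schema="STAGING",
)

cur = conn.cursor()
# A typical warehouse-side transformation: materialize a staging table with SQL.
cur.execute("""
    CREATE OR REPLACE TABLE stg_orders AS
    SELECT order_id, customer_id, amount, updated_at
    FROM raw.orders
    WHERE updated_at >= DATEADD(day, -1, CURRENT_DATE())
""")
cur.close()
conn.close()
```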
Why is it important for a data engineer to understand both tools and the underlying concepts?
- Understanding both tools and concepts allows a data engineer to make informed decisions about which tools to use for specific tasks, optimize their work, and troubleshoot issues effectively. It also helps them adapt to new technologies and stay current in the field.
What are some orchestration and ETL tools that a data engineer might use?
- Orchestration and ETL tools like Airflow, SSIS, Azure Data Factory, and Informatica are used to automate data workflows, extract data from various sources, transform it into the desired format, and load it into target systems.
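As a small illustration of the orchestration side, here is a minimal Airflow DAG sketch (assuming Airflow 2.x); the DAG name and task functions are placeholders standing in for real extract and load logic.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    # Placeholder: pull data from a source system (database, API, SFTP, ...).
    print("extracting source data")


def load():
    # Placeholder: load the extracted data into the warehouse.
    print("loading into the warehouse")


with DAG(
    dag_id="daily_orders_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)

    # Run extract before load.
    extract_task >> load_task
```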
How does a data engineer decide which cloud platform to learn?
- A data engineer should consider the popularity and prevalence of cloud platforms in the job market, as well as the specific needs of the companies they want to work for. AWS is often a safe bet due to its widespread use, while Azure may be preferred by large enterprises.
What additional tools might a data engineer need to know for containerization and infrastructure management?
- For containerization, a data engineer might need to understand Docker and Kubernetes. For infrastructure management, tools like Terraform can be useful. However, these are often managed by DevOps teams, so data engineers may not need deep expertise in these areas.
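For the Docker piece, even a little hands-on practice helps; here is a hedged sketch using the Docker SDK for Python (assuming the `docker` package and a local Docker daemon), which runs a throwaway container the way you might when sanity-checking an image locally.

```python
import docker

# Requires a running Docker daemon and the `docker` Python package.
client = docker.from_env()

# Run a short-lived container and capture its output, roughly what you might
# do when testing an image before wiring it into a pipeline.
output = client.containers.run(
    "python:3.11-slim",
    ["python", "-c", "print('hello from a container')"],
    remove=True,
)
print(output.decode())
```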
What advice would you give to someone looking to break into the field of data engineering?
- Focus on building a strong foundation with the core tools and technologies, and don't feel rushed to learn everything at once. It's more important to understand the concepts and how the tools fit into the bigger picture. As you gain experience, you'll naturally learn more advanced tools and techniques.
Outlines
Introduction to Data Engineering Tools
This paragraph introduces the vast array of tools available to data engineers and acknowledges the challenge of keeping up with the constantly evolving landscape of data engineering. It reflects on how tools like Hadoop and Spark have changed over time, moving from self-hosted solutions to managed services like Databricks and Athena. The speaker aims to create a video series to help viewers understand which tools are essential for a data engineer, emphasizing the importance of foundational skills like programming languages (SQL, Python), operating systems (Linux), and basic scripting (Bash).
Core Skills and Tools for Data Engineers
The speaker delves into the core skills and tools that a data engineer should possess, starting with programming languages and basic system interactions. It highlights the necessity of understanding networks and having a baseline of coding skills, including object-oriented programming and writing functions in Python. The paragraph also introduces version control systems like Git as essential tools for managing code, along with other technical tools like SFTP and PGP for secure file transfers.
Working with Databases and Data Platforms
This section focuses on the interaction with databases, both traditional relational databases and NoSQL counterparts, as a key aspect of data engineering. It discusses the importance of understanding different database systems and their nuances, such as indexing and data manipulation. The paragraph then transitions into cloud data platforms and warehouses like Snowflake and Databricks, emphasizing the need to grasp the differences between these services and traditional databases to effectively build data solutions.
Orchestration, ETL, and Data Pipelines
The speaker addresses the realm of orchestration, ETL processes, and data pipelines, highlighting the various tools and platforms that data engineers may encounter. It mentions the use of Apache Airflow for workflow orchestration and how it can be utilized as an ETL or data pipeline solution. The paragraph also touches on the importance of understanding different data processing engines like Spark, Presto, and Trino, and the decision-making process behind choosing the right tool for the job.
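As an example of the kind of engine-level decision mentioned here, the sketch below uses PySpark to join a large fact table to a small dimension table with an explicit broadcast hint, which avoids shuffling the large side; the S3 paths and column names are hypothetical.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("broadcast_join_example").getOrCreate()

# Hypothetical inputs: a large fact table and a small dimension table.
orders = spark.read.parquet("s3://my-bucket/orders/")
customers = spark.read.parquet("s3://my-bucket/customers/")

# Broadcasting the small table avoids shuffling the large one -- one of the
# join decisions you weigh when choosing and tuning an engine like Spark.
enriched = orders.join(F.broadcast(customers), on="customer_id", how="left")

enriched.write.mode("overwrite").parquet("s3://my-bucket/enriched_orders/")
```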
Cloud Services and Advanced Tools
In this part, the speaker discusses the importance of cloud services in data engineering, suggesting AWS as a good starting point due to its popularity and wide usage. It also mentions other cloud providers like Azure and GCP, and their specific use cases. The paragraph further explores additional tools like Docker and Kubernetes, suggesting that while they may not be an immediate focus, having a basic understanding of these technologies is beneficial for managing infrastructure and containers in a data engineering context.
Keywords
Data Engineer
Tools
SQL
Python
Linux
Version Control
Data Platforms
ETL
Data Pipelines
Cloud Computing
DevOps
Highlights
The ever-changing landscape of data engineering tools
The importance of having a foundational understanding of programming languages like SQL, Python, and Linux
The evolution from self-hosting solutions like Hadoop and Spark to managed services on platforms such as Databricks and Athena
The necessity of understanding basic networking concepts for a data engineer
The role of version control systems like Git in data engineering workflows
The importance of learning and applying basic coding principles even for using drag-and-drop tools
The use of SFTP and PGP for secure data transfer and encryption
The need for familiarity with various databases, both traditional RDBMS and NoSQL
The distinction between data engineers, software engineers, and data scientists based on the tools they use
The concept of cloud data platforms and data warehouses, and how they differ from traditional databases
The learning curve associated with understanding the nuances of different cloud platforms like AWS, Azure, and GCP
The importance of not rushing through learning tools and taking the time to understand their intricacies
The role of orchestration tools like Airflow in ETL and data pipeline processes
The potential need for data engineers to understand and work with Docker and Kubernetes
The value of having a baseline understanding of tools to be competitive in the job market
The importance of continuous learning and growth in the field of data engineering
Transcripts
there are what feel like an infinite
amount of tools you can pick from as a
data engineer and likely if you've
worked in the industry for a while
you've maybe worked with some and heard
of others and are always wondering what
do you actually need to know to be a
successful data engineer and the funny
thing is and I think the challenge is
what we're working on today will
probably change a little bit tomorrow
you know when I first broke into the
data engineering world uh Hadoop and
Spark were all the rage and you would
have to figure out how to host it
yourself and spin up zookeeper and it
would be like 30 different solutions
just to get it working and now you know
we're just running things on Databricks
or Athena or something other than uh you
know us manually managing some of these
Solutions so the tools that we use
changed drastically uh over the years
but I wanted to create a video that was
in conjunction with my 100 days of data
engineering video that helps you guys
understand what tools you need to know
as a data engineer and we are taking a
quick pause from uh the AWS cloud videos
but I'll be back on those here shortly
if you haven't watched those uh give
those a checkout later if you'd like to
learn more about how data Engineers can
work with Cloud but for now let's talk
about tools from a high level let's
first cover the basics and this is one
of the challenges is like where do tools
start and where do tools end in terms of
solutions I think it's fair to say that
programming uh and certain languages and
basic solutions kind of fit in the tool
aspect and Tool World right like they
are tools we've built as humans as tool
Builders to help us automate and and
build processes so with that I think the
tools you'll definitely need as a data
engineer even with things like ChatGPT
obviously uh SQL python Linux I say
Linux more more overarchingly um most
likely you're going to have to write
your fair share of bash scripts or at
least interact with um servers right you
might not need to be an expert but you
need to be able to interact with those
systems so yeah python SQL Linux some
level of understanding how to work with
networks all of that will likely come
into play and you need these Baseline
skills like they seem basic but you need
them like you can't get around them uh
yeah there's lots of drag-and-drop tools
but I recall recently I was working with
someone who was working on SSIS and they
were like oh yeah I don't do the C# um
blocks in SSIS cuz I don't know how to
do it which to me is a little bit of a
copout because yeah you kind of should
be able to do at least some baseline
coding doesn't have to be fancy but at
least you know some level of
understanding you know of
object-oriented programming how to write
functions just your Baseline uh
understanding of python now along with
those Basics come kind of your other
technical basic Solutions and tools
right uh things like git right you
probably think of it as GitHub but git
is a broader solution and broader tool
that you will need to know you will need
to know how to Version Control uh all of
your various code that you will be
putting into places whether it's in a
Lambda or in a larger system that you're
developing you know in Airflow Etc it
needs to go somewhere and so at least
understanding the four or five commands
in git that you will likely use all the
time you know things like git add git
commit and git push uh at the very least
at least to understand how those
actually operate and if you want to go
deeper there are plenty of articles that
explain to you how this tool operates
but just being able to understand like
how to create branches and these small
things is very vital to being successful
as a data engineer and again these
skills kind of really build up your
Baseline and I think this is why it can
be hard to break into Data engineering
is because these are the tools that can
take time just in themselves to become
decent at you can probably in the 100
days that I've set up get a good idea of
these Solutions but maybe becoming
really good at them is hard and honestly
I'm still working on these Solutions and
constantly finding new things that I
maybe don't know fully um in these
Solutions uh another kind of basic
technical set of tools that you will
likely need to know is things like SFTP
and pgp these again are kind of this
interesting space I haven't started
talking about like actual tools yet like
uh people probably think of like Airflow
or Snowflake but these are Baseline
skills and Baseline tools you will
likely need to work with you know you
likely will have somewhere uh that SFTP
will come into play it still exists
today even with data sharing I had to
do it at Facebook I had to use SFTP or
secure file transfer protocol in order
to push files out to external Partners
who would then ingest that data and then
do analytics on it and then give us back
some sort of reporting on it and
similarly as we're going through that
process often we'd encrypt that file uh
with some set of keys so using something
like pgp uh or some similar uh protocol
will likely be required as well and now
you've got a baseline set of tools and I
probably used skills and tools here
interchangeably in all fairness some of
this is tools some of this is skills but
I think you can't avoid these right like
these are things you have to know how to
work with you're going to have to know how
to program you're going to have to know how to
use SQL you're going to likely have to
interact with a Linux box somewhere
whether it's an ec2 instance or a gcp
cloud compute somewhere you're going to
be interacting with these Solutions and
these tools now you are a data engineer
so you can't just know these obviously
these are tools that likely depending on
how you apply them either make you more
of a software engineer a data scientist
or a data engineer which why we break
these uh names up uh I've seen some
people I think recently kind of poke fun
at these names or be like you know
you're not really an engineer that's
less of the point to me to me it breaks
up the difference of why these different
jobs exist and what they do right sure
we're a data plumber that's fine
plumbers still have specific sets of
tools and plumbers also solve very hard
problems I know I've had them have to
fix a few around my house so as a data
engineer there are specific tools that
we use heavily first often at least
you'll have to interact with databases
in particular likely you'll pull a lot
of things from certain Source databases
and these Source databases tend to be
your traditional relational database
Management systems or something more on
the like nosql side so things that are
maybe document databases uh like MongoDB
uh so that's always great or Cassandra
DB you'll also likely need to know
obviously again the traditional
Postgres MySQL you don't have to know
every database that exists right there's
IBM db2 there's you know your Oracle
databases more than likely as long as
you get two or three under your belt and
they're kind of uh you know using some
different uh dialect of SQL you will be
familiar enough to pull from various
sources in the future yes they might all
interact slightly differently like you
they one will use change data Capture
One Way one will do it another way one
will have bin logs one won't but as long
as you get I think three or four that
you're comfortable with that you can
build a basic schema on you can
understand how to insert data into that
you can understand how to update data
you'll likely be okay here you don't
have to be again an expert but you need
to be familiar enough to build on them
to understand why someone might put an
index somewhere it will take again time
these aren't things you have to rush
through don't let my 100 days of data
engineering make you feel like you have
to run through any of this stuff again
if you don't learn it now if you happen
to get a job as a data engineer
somewhere you'll be learning it there uh
so just make sure you don't run too too
fast otherwise you're going to be
stressed out in your actual job so again
now that you've kind of got your
databases under underway right you've
kind of built a good understanding of
how databases operate this will kind of
give you the next layer of knowledge so
that when you go into what now people
kind of call like cloud data platforms
or cloud data warehouses honestly
there's so many different terms now that
people use for these Solutions because
you look at them and they aren't
actually set up like sometimes
traditional databases which is why it's
good to actually understand how I think
traditional databases operate so that
when you look at snowflake uh you don't
think they're the same thing you don't
think like oh this operates 100% exactly
the same way um as my traditional
database or same thing with Databricks
which is even farther removed from your
traditional databases um and harder to
probably grasp that hey there is a a
compute engine here or there's some sort
of you know uh query engine here sort of
with spark and there is kind of storage
but some of the traditional stuff
that exists is kind of all piecemealed
out right it doesn't exist in the same
framework and so that's why it's really
important I think to build these steps
slowly so that you understand these
differences um I always remember when I
first uh interacted with my First Data
Warehouse because I had taken like your
traditional relational database course um
in school and I was literally uh
interning at the same time while I was
taking that course uh and looking at
this data warehouse I was like oh these
are kind of the same right like you've
got something called a key here and an
ID here they're the same thing right is
in my mind and obviously a few months
later I I eventually learned that no
these are different things and I had to
like dig into that and start reading
things like Kimble and actually dig into
the differences and that's why I think
again the more you can kind of
understand and see when when things are
different it it just makes you uh more
valuable as a data engineer moving
forward so the next one I'm sorry for
that diatribe but the next one is
really these data platforms so Snowflake
Databricks we're going to throw
BigQuery in there again it doesn't fit
maybe as much of the data platform is
space but I guess if you add in
everything else gcp has it kind of can
so if you add in all the gcp data flows
and things like that it kind of fits but
these are the traditional Solutions you
might have to know um you can also again
throw in Redshift there and Azure Synapse
Analytics which actually does fit more
of that data platform space but those
are kind of your key um data platforms
that you'll likely be building on and in
general building some sort of data
warehouse or data lake house if that's
your cup of tea um on obviously it's
going to depend on which one of these
Solutions you pick they all do operate
slightly differently um the way I often
feel it is gcp tends to feel to me like
it's a little more limited in terms
of like what I can fine-tune whereas
Snowflake tends to be a happy medium
and Databricks like it gives you a lot
of uh control but then you have to
understand how that operates almost kind
of like the old Oracle days where it's
like Oracle gave you a ton of control
but that's why you'd pay a lot for
Oracle Consultants cuz they'd have to
know how to like set up control files
and all this stuff um as you were
loading data and as well as fine-tuning
a lot of other stuff whereas you could
just use SQL Server which I often found
a little easier to work with versus
Oracle now as you're learning these
Solutions you're going to again be
layering more and more of your skills on
top of each other think about it what
are you going to be likely writing when
you are working on snowflake or big
query likely SQL that is how you're
going to interact with these various
Solutions hopefully on top of these
tools you also have the skills and best
practices to build a data warehouse or
data lake house but those that's where I
definitely draw the line in terms of
like what's a tool what's more of a
skill and a best practice right that's
going to fall more in terms of like how
you build data pipelines how you build
data warehouses that comes more into
skills and best practices and design
versus actual tools that can help you um
Implement these uh best practices and
and designs and along with that if
you're a Databricks uh fan person
you will have to learn Spark and how
it operates and how you can best
interact with it including when you're
writing things like SQL instead of maybe
uh python or Scala in order to interact
with spark you know what's the best way
to run joins things like that um are
really important to understand why you
might want to use something like an
engine like Spark versus maybe Presto or
trino and where in fact at Facebook we
even had the ability to switch in
between spark and Presto depending on
the job cuz sometimes it was more
efficient to use Presto or trino and
sometimes it was more efficient to know
or to use spark and you'd have to know
why and so it's really good to know um
at least a little bit about all these
tools if not a deep understanding
because it will become valuable as
you're going along I think the important
thing is as you're going through these
steps of learning again you don't need
to feel rushed you will learn all of
this stuff through time as long as
you're putting in the effort you know if
you're putting in 10 minutes a day
probably won't learn it but if you're
putting in hours a day like most of us
have at some point if not still today
you will pick up these Solutions you
will learn them and you will feel
confident in actually being able to
deliver with them all right so now
you've again you've kind of built all of
this Baseline the next set of tools
you'll often see that you need to know
are things like orchestration ETL and
data pipelines you can throw an elt in
there these all kind of fit in this
similar space and I know some people
would get mad if I said that but I say
that because airflow which um obviously
fits in this workflow orchestrator space
often when I see it implemented gets
used as an ETL type solution or data
pipeline
to run a very basic um extraction of
data and then load it somewhere and then
maybe add in snowpipe or something
similar that can just pick up a trigger
of one file dropping into S3 but you're
really going to see that there are a lot
of different ways you can do pipelines
and honestly what you often find is
there's a few kind of types of tools you
can do things very custom you can build
it yourself people love doing that for
some reason even though we we've built a
ton people love having um open- Source
Solutions again airflow and Mage are
examples of that these kind of fit again
in that orchestrator world but also
often just get used as data pipelines or
ETL type flows as well and then you have
things that are maybe fully you know
managed things like SSIS very drag and
droppy um so SSIS Azure Data Factory
a few others that all involve you know
dragging and dropping and and automating
tasks that way and so those are kind of
the various tools you'll see um there's
a few others that like very much focus
on maybe like just extract and load most
of those tend to be very easy to work
with so I don't think you need to put a
ton of effort learning them off the bat
I think it's very much worth to at some
point but more than likely you'll need
to pick some of these Solutions because
you will uh likely use them and they
tend to be the easiest to learn cuz
there are others even again we could go
on forever in terms of orchestrators
data pipelines Informatica and a few
others that often cost a lot of money to
get access to but for now I think just
understanding the concept and getting a
few of these tools under your belt maybe
one or two is good enough generally to
at least make you uh hirable which is
your first goal is just get hired in a
junior position and then eventually you
can go from there and again like I kind
of referenced earlier the cloud is
another set of tools that you will
eventually learn there are a ton of
clouds you don't need to learn all the
clouds generally I tell most people AWS
is a safe bet because most people use it
if you do want to learn Azure understand
that most of that is going to be uh at
large Enterprises that use it whereas
AWS tends to be a broader range and I find
that most people that use gcp use it for
big query because they like big query so
if you are going to pick AWS tools
that's where I'd start and I do have
this AWS video that you can go through
to actually look at all the various uh
tools you'll likely need to know I think
I go through like eight or nine um cuz
you don't need to know every solution
and so the cloud is always a baseline
that you need to know and then there's a
ton of one-off tools that you may or may
not need to know like honestly I have
mixed feelings I will say that it's
worth at least digging into Docker cuz
you probably will have to occasionally
start up a Docker instance here or there
and same thing with kubernetes at least
understand how it operates understand
how to kill you know a pod occasionally
but more than likely you will have a
devops team that manages it and if not
you are the devops team and that's now
your new job like you it's very hard to
like build data pipelines and manage a
bunch of infrastructure that you've
developed similar thing can be said
about terraform right like it's worth
knowing but in theory you should have a
devops team that does that obviously
nowadays people are um starting to I
feel like reduce the amount of people
that work on these teams so maybe maybe
you will be a one-stop shop for all
of this stuff but these tend to be the
things you can maybe learn last uh you
don't need to put a ton of effort in
immediately um some of it will just come
through uh you just naturally doing your
work but also you don't want to be
running or trying to figure out how
Docker Works while you're pushing code
to production so make sure you've at
least run it a few times and if it is
your first time seeing it in production
and you've never run it in production
try to find someone to help you out
there cuz it's just there's a lot of
ways it can go wrong and those are most
of the Baseline tools you need to know
I'm sure there's others that people feel
like I've missed please comment below if
you think uh I'll pin it if I think it's
a good tool I'm like yeah that's
actually true I should have covered this
but I think that's the Baseline it takes
a little bit of time to become a data
engineer and I think this is part of it
like you don't have to know all of these
tools super in- depth but you need to
know what they do in an interview
someone likely might ask like where
would you maybe use one of these
Solutions versus another um how do joins
work uh on one solution versus another
uh does Redshift have merge if it
doesn't how can you you know end up
running something similar to a merge
statement all of this stuff is important
to at least understand and have touched
here and there I don't want this to be
discouraging again you have long careers
um it took me a few years to get to a
point where I had a title of data
engineer and and even now whether that's
the title whether the title is data
plumber I think is less the point I
think the point is how you do your work
so hopefully this was helpful for you
out there whether you're an analyst an
engineer a data scientist um to
understand the tools that a data
engineer will likely need to know with that
guys thanks so much for watching and I
will see you guys in the next one thanks
all
goodbye