What Tools Should Data Engineers Know In 2024 - 100 Days Of Data Engineering

Seattle Data Guy
2 Apr 2024 · 17:30

Summary

TL;DR: The video discusses the multitude of tools and skills necessary for a successful career as a data engineer. It emphasizes the importance of understanding programming languages like SQL and Python, working with Linux, and mastering version control with Git. The speaker also highlights the significance of working with databases, cloud data platforms, and ETL/data pipelines, as well as the evolving nature of data engineering tools. The video serves as a guide for those looking to break into the field, stressing the value of a solid foundation in both tools and best practices for data management and processing.

Takeaways

  • 🛠️ The landscape of data engineering tools is vast and constantly evolving, requiring adaptability and continuous learning.
  • 🔧 Core programming languages and technologies like SQL, Python, and Linux are fundamental to a data engineer's skill set.
  • 📚 Understanding the basics of object-oriented programming and writing efficient functions is essential for effective data engineering.
  • 🖥️ Familiarity with version control systems like Git is crucial for managing code and collaborating with teams.
  • 🔍 Knowledge of secure file transfer protocols (SFTP) and encryption tools (PGP) is necessary for data security and compliance.
  • 💾 Working with databases, both traditional RDBMS and NoSQL, is a key responsibility of data engineers for data extraction and manipulation.
  • 🌐 Cloud data platforms and warehouses like Snowflake, Databricks, and BigQuery are becoming increasingly important in modern data engineering.
  • 🔄 Data orchestration and pipeline tools such as Airflow and Azure Data Factory help automate and manage data workflows.
  • 🔧 A basic understanding of containerization (Docker) and orchestration (Kubernetes) can be beneficial, even if managed by a DevOps team.
  • 🚀 The ability to choose the right tool for the job, whether it's a data warehouse, ETL, or data pipeline solution, is a valuable skill for data engineers.
  • 🎯 Focusing on building a solid foundation in data engineering principles and tools can lead to a successful and adaptable career in the field.

Q & A

  • What are some of the core programming languages and technologies a data engineer should be familiar with?

    -A data engineer should have a strong understanding of SQL, Python, and Linux. They should also be comfortable working with Bash scripts and have a basic knowledge of networking.
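    For example, a minimal sketch of the kind of baseline Python this implies—reading a file and aggregating it with a plain function. (The file name and column names here are made up for illustration.)

        import csv
        from collections import defaultdict

        def total_by_region(path: str) -> dict:
            """Sum the 'amount' column per 'region' -- plain Python, no frameworks."""
            totals = defaultdict(float)
            with open(path, newline="") as f:
                for row in csv.DictReader(f):
                    totals[row["region"]] += float(row["amount"])
            return dict(totals)

        print(total_by_region("sales.csv"))  # hypothetical input file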

  • How have the tools used in data engineering evolved over time?

    -Data engineering tools have changed significantly over the years. Initially, engineers had to manually manage solutions like Hadoop and Spark by setting up their own infrastructure. Nowadays, cloud-based services like Databricks, Athena, and others have simplified the process.

  • What is the importance of version control in data engineering?

    -Version control is crucial for managing code changes, collaborating with other engineers, and maintaining a record of the development process. Familiarity with tools like Git is essential for any data engineer.
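    As a rough illustration, the handful of Git commands the video calls out can even be scripted from Python with the standard-library subprocess module (assuming git is installed and you are inside a repository; the file and branch names are hypothetical):

        import subprocess

        def git(*args: str) -> None:
            """Run a git command and fail loudly if it errors."""
            subprocess.run(["git", *args], check=True)

        git("checkout", "-b", "feature/new-source")   # branch for new work
        git("add", "etl_job.py")                      # stage a change
        git("commit", "-m", "Fix null handling")      # record it
        git("push", "origin", "feature/new-source")   # share it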

  • What are some of the basic technical tools and skills that a data engineer should possess?

    -Basic technical skills for a data engineer include understanding SFTP for secure file transfers, using PGP for encryption, and having a foundational knowledge of object-oriented programming and writing functions in Python.
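    As a sketch of that flow, using the third-party paramiko (SFTP) and python-gnupg (PGP) packages—the hostnames, paths, credentials, and recipients below are placeholders, not any specific vendor's setup:

        import gnupg      # pip install python-gnupg (wraps a local gpg binary)
        import paramiko   # pip install paramiko

        # Encrypt the file with the partner's public PGP key before sending.
        gpg = gnupg.GPG()
        with open("report.csv", "rb") as f:
            gpg.encrypt_file(f, recipients=["partner@example.com"],
                             output="report.csv.gpg")

        # Push the encrypted file to the partner's SFTP server.
        client = paramiko.SSHClient()
        client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
        client.connect("sftp.example.com", username="deploy",
                       key_filename="/home/me/.ssh/id_rsa")
        sftp = client.open_sftp()
        sftp.put("report.csv.gpg", "/inbound/report.csv.gpg")
        sftp.close()
        client.close()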

  • How do different databases play a role in data engineering?

    -Data engineers often interact with various databases, from traditional relational databases like PostgreSQL and MySQL to NoSQL databases like MongoDB. Understanding how to pull data from these sources and manipulate it is a key part of the job.
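    The connect-query-iterate pattern is nearly identical across engines. Here it is with Python's built-in sqlite3 standing in for Postgres or MySQL—with psycopg2 or a MySQL driver, essentially only the connect line changes (the table and column names are invented):

        import sqlite3

        conn = sqlite3.connect("app.db")  # hypothetical source database
        conn.row_factory = sqlite3.Row    # rows behave like dicts

        rows = conn.execute(
            "SELECT customer_id, SUM(amount) AS total "
            "FROM orders WHERE created_at >= ? "
            "GROUP BY customer_id",
            ("2024-01-01",),
        ).fetchall()

        for row in rows:
            print(row["customer_id"], row["total"])
        conn.close()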

  • What is the role of cloud data platforms and warehouses in data engineering?

    -Cloud data platforms and warehouses like Snowflake, Databricks, and BigQuery are used to build data lakes or data warehouses. They offer different architectures and features compared to traditional databases, and a data engineer must understand these differences to use them effectively.
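    For a taste of "different architecture, same SQL," here is a sketch against BigQuery with the google-cloud-bigquery client library (assumes the package is installed and GCP credentials are configured; the project, dataset, and table names are invented):

        from google.cloud import bigquery  # pip install google-cloud-bigquery

        client = bigquery.Client()  # uses your configured GCP credentials

        # You still speak SQL to the warehouse -- the engine underneath differs.
        query = """
            SELECT region, COUNT(*) AS orders
            FROM `my_project.sales.orders`  -- hypothetical table
            GROUP BY region
            ORDER BY orders DESC
        """
        for row in client.query(query).result():
            print(row.region, row.orders)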

  • Why is it important for a data engineer to understand both tools and the underlying concepts?

    -Understanding both tools and concepts allows a data engineer to make informed decisions about which tools to use for specific tasks, optimize their work, and troubleshoot issues effectively. It also helps them adapt to new technologies and stay current in the field.

  • What are some orchestration and ETL tools that a data engineer might use?

    -Orchestration and ETL tools like Airflow, SSIS, Azure Data Factory, and Informatica are used to automate data workflows, extract data from various sources, transform it into the desired format, and load it into target systems.
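    As one concrete flavor, a minimal Airflow DAG wiring an extract-transform-load sequence (Airflow 2.x style; the task bodies are stubs and the DAG name is made up):

        from datetime import datetime

        from airflow import DAG
        from airflow.operators.python import PythonOperator

        def extract(): ...    # pull rows from the source system
        def transform(): ...  # clean and reshape the data
        def load(): ...       # write to the warehouse

        with DAG(
            dag_id="daily_sales_pipeline",  # hypothetical pipeline name
            start_date=datetime(2024, 1, 1),
            schedule="@daily",
            catchup=False,
        ) as dag:
            t1 = PythonOperator(task_id="extract", python_callable=extract)
            t2 = PythonOperator(task_id="transform", python_callable=transform)
            t3 = PythonOperator(task_id="load", python_callable=load)
            t1 >> t2 >> t3  # run the steps in order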

  • How does a data engineer decide which cloud platform to learn?

    -A data engineer should consider the popularity and prevalence of cloud platforms in the job market, as well as the specific needs of the companies they want to work for. AWS is often a safe bet due to its widespread use, while Azure may be preferred by large enterprises.
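    A first AWS touchpoint for most data engineers is S3. A small boto3 sketch (assumes AWS credentials are configured; the bucket and file names are made up):

        import boto3  # pip install boto3; reads credentials from your environment

        s3 = boto3.client("s3")

        # The bread and butter of cloud data work: moving files in and out of S3.
        s3.upload_file("daily_extract.csv", "my-data-bucket",
                       "raw/daily_extract.csv")

        response = s3.list_objects_v2(Bucket="my-data-bucket", Prefix="raw/")
        for obj in response.get("Contents", []):
            print(obj["Key"], obj["Size"])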

  • What additional tools might a data engineer need to know for containerization and infrastructure management?

    -For containerization, a data engineer might need to understand Docker and Kubernetes. For infrastructure management, tools like Terraform can be useful. However, these are often managed by DevOps teams, so data engineers might not need to be as deeply knowledgeable in these areas.
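    As a sketch of the "start up a container occasionally" level of Docker knowledge, using the Docker SDK for Python (assumes pip install docker and a running Docker daemon):

        import docker  # pip install docker; talks to the local Docker daemon

        client = docker.from_env()

        # Roughly: docker run --rm python:3.11 python -c "print(...)"
        output = client.containers.run(
            "python:3.11",
            ["python", "-c", "print('hello from a container')"],
            remove=True,
        )
        print(output.decode())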

  • What advice would you give to someone looking to break into the field of data engineering?

    -Focus on building a strong foundation with the core tools and technologies, and don't feel rushed to learn everything at once. It's more important to understand the concepts and how the tools fit into the bigger picture. As you gain experience, you'll naturally learn more advanced tools and techniques.

Outlines

00:00

🛠️ Introduction to Data Engineering Tools

This paragraph introduces the vast array of tools available to data engineers and acknowledges the challenge of keeping up with the constantly evolving landscape of data engineering. It reflects on how tools like Hadoop and Spark have changed over time, moving from self-hosted solutions to managed services like Databricks and Athena. The speaker aims to create a video series to help viewers understand which tools are essential for a data engineer, emphasizing the importance of foundational skills like programming languages (SQL, Python), operating systems (Linux), and basic scripting (Bash).

05:00

🔧 Core Skills and Tools for Data Engineers

The speaker delves into the core skills and tools that a data engineer should possess, starting with programming languages and basic system interactions. It highlights the necessity of understanding networks and having a baseline of coding skills, including object-oriented programming and writing functions in Python. The paragraph also introduces version control systems like Git as essential tools for managing code, along with other technical tools like SFTP and PGP for secure file transfers.

10:01

💾 Working with Databases and Data Platforms

This section focuses on the interaction with databases, both traditional relational databases and NoSQL counterparts, as a key aspect of data engineering. It discusses the importance of understanding different database systems and their nuances, such as indexing and data manipulation. The paragraph then transitions into cloud data platforms and warehouses like Snowflake and Databricks, emphasizing the need to grasp the differences between these services and traditional databases to effectively build data solutions.

15:01

🔄 Orchestration, ETL, and Data Pipelines

The speaker addresses the realm of orchestration, ETL processes, and data pipelines, highlighting the various tools and platforms that data engineers may encounter. It mentions the use of Apache Airflow for workflow orchestration and how it can be utilized as an ETL or data pipeline solution. The paragraph also touches on the importance of understanding different data processing engines like Spark, Presto, and Trino, and the decision-making process behind choosing the right tool for the job.

🌐 Cloud Services and Advanced Tools

In this part, the speaker discusses the importance of cloud services in data engineering, suggesting AWS as a good starting point due to its popularity and wide usage. It also mentions other cloud providers like Azure and GCP, and their specific use cases. The paragraph further explores additional tools like Docker and Kubernetes, suggesting that while they may not be an immediate focus, having a basic understanding of these technologies is beneficial for managing infrastructure and containers in a data engineering context.

Keywords

💡 Data Engineer

A data engineer is a professional responsible for designing, building, and maintaining the systems that handle and process data. In the context of the video, the speaker discusses the skills and tools necessary to be successful in this role, emphasizing the importance of understanding various technologies and programming languages to manage and analyze data effectively.

💡 Tools

In the context of data engineering, tools refer to the software, programming languages, and systems used to manage, process, and analyze data. The video highlights the vast array of tools available, such as SQL, Python, Linux, and specific database systems, and how they are essential for a data engineer to perform their job effectively.

💡 SQL

SQL, or Structured Query Language, is a domain-specific language used to manage and manipulate relational databases. The video mentions SQL as a fundamental tool for data engineers, as it is used to query, update, and manipulate data stored in databases, which is a core responsibility of the role.

💡 Python

Python is a high-level, interpreted programming language that is widely used in the field of data engineering. The speaker in the video notes that Python is a crucial tool for data engineers due to its versatility and powerful libraries that facilitate data analysis, automation, and the creation of data pipelines.

💡 Linux

Linux is an open-source operating system that is widely used in servers and data centers. In the video, the speaker mentions Linux as a necessary tool for data engineers, as they often need to interact with server systems, write bash scripts, and manage data processing tasks on Linux environments.

💡 Version Control

Version control is a system that records changes to a file or set of files over time, allowing developers to track and manage these changes. The video emphasizes the importance of understanding version control, specifically Git, for data engineers to manage their codebase and collaborate with other team members effectively.

💡 Data Platforms

Data platforms refer to the cloud-based services and technologies designed to manage and analyze large volumes of data. The video discusses various data platforms such as Snowflake, Databricks, and BigQuery, highlighting their role in modern data engineering practices and how they provide scalable and efficient solutions for data storage and processing.

💡 ETL

ETL, or Extract, Transform, Load, is a process used in data engineering to move data from one system to another, often involving data cleaning and transformation. The speaker in the video mentions ETL as a core component of data engineering, where tools like Airflow and SSIS can be used to automate and manage the ETL process.

💡 Data Pipelines

Data pipelines are the infrastructure that facilitates the movement and processing of data within an organization. The video explains that data engineers need to understand how to build and maintain data pipelines, which may involve using various tools and technologies to ensure data flows efficiently and reliably from source to destination.

💡 Cloud Computing

Cloud computing refers to the delivery of computing services, such as storage, processing power, and databases, over the internet. The video discusses the relevance of cloud computing in data engineering, noting that understanding cloud platforms like AWS, Azure, and GCP is essential for leveraging their services for data storage, processing, and analysis.

💡 DevOps

DevOps is a set of practices that combines software development and IT operations to shorten the system development life cycle and provide continuous delivery of value to end users. In the video, the speaker touches on the importance of having at least a basic understanding of DevOps tools like Docker and Kubernetes for data engineers, as they may be required to manage infrastructure or work closely with teams that do.

Highlights

The ever-changing landscape of data engineering tools

The importance of having a foundational understanding of programming languages like SQL, Python, and Linux

The evolution from self-hosting solutions like Hadoop and Spark to managed services on platforms such as Databricks and Athena

The necessity of understanding basic networking concepts for a data engineer

The role of version control systems like Git in data engineering workflows

The importance of learning and applying basic coding principles even for using drag-and-drop tools

The use of SFTP and PGP for secure data transfer and encryption

The need for familiarity with various databases, both traditional RDBMS and NoSQL

The distinction between data engineers, software engineers, and data scientists based on the tools they use

The concept of cloud data platforms and data warehouses, and how they differ from traditional databases

The learning curve associated with understanding the nuances of different cloud platforms like AWS, Azure, and GCP

The importance of not rushing through learning tools and taking the time to understand their intricacies

The role of orchestration tools like Airflow in ETL and data pipeline processes

The potential need for data engineers to understand and work with Docker and Kubernetes

The value of having a baseline understanding of tools to be competitive in the job market

The importance of continuous learning and growth in the field of data engineering

Transcripts

00:00

There are what feel like an infinite number of tools you can pick from as a data engineer, and if you've worked in the industry for a while, you've likely worked with some, heard of others, and are always wondering: what do you actually need to know to be a successful data engineer? The funny thing—and I think the challenge—is that what we're working on today will probably change a little bit tomorrow. When I first broke into the data engineering world, Hadoop and Spark were all the rage, and you'd have to figure out how to host them yourself, spin up ZooKeeper, and stitch together what felt like 30 different solutions just to get things working. Now we're just running things on Databricks or Athena rather than manually managing those solutions. So the tools we use have changed drastically over the years, but I wanted to create a video, in conjunction with my 100 days of data engineering video, that helps you understand what tools you need to know as a data engineer. We're taking a quick pause from the AWS cloud videos, but I'll be back on those shortly—if you haven't watched them, give them a look later if you'd like to learn more about how data engineers can work with the cloud. For now, let's talk about tools from a high level.

01:04

Let's first cover the basics. This is one of the challenges: where do tools start and where do they end in terms of solutions? I think it's fair to say that programming languages and basic solutions fit in the tool world—they are tools we've built, as human tool builders, to help us automate and build processes. With that, the tools you'll definitely need as a data engineer, even with things like ChatGPT around, are SQL, Python, and Linux. I say Linux overarchingly: most likely you're going to have to write your fair share of Bash scripts, or at least interact with servers. You might not need to be an expert, but you need to be able to interact with those systems. So Python, SQL, Linux, and some level of understanding of how to work with networks will all likely come into play. These baseline skills seem basic, but you can't get around them. Yes, there are lots of drag-and-drop tools, but I recall recently working with someone on SSIS who said, "Oh, I don't do the C# script blocks in SSIS because I don't know how to do it," which to me is a bit of a cop-out. You should be able to do at least some baseline coding—it doesn't have to be fancy, but you need some level of understanding of object-oriented programming and how to write functions: your baseline understanding of Python.

02:25

Along with those basics come your other basic technical solutions and tools—things like Git. You probably think of it as GitHub, but Git is the broader tool that you will need to know. You'll need to know how to version control all the code you'll be putting into places, whether it's in a Lambda or a larger system you're developing in Airflow, etc.—it needs to go somewhere. So at least understand the four or five Git commands you'll likely use all the time—things like git add, git commit, and git push—and at the very least understand how they actually operate. If you want to go deeper, there are plenty of articles that explain how the tool works, but just being able to understand how to create branches and these small things is vital to being successful as a data engineer. These skills really build up your baseline, and I think this is why it can be hard to break into data engineering: these tools can take time just in themselves to become decent at. In the 100 days I've set up, you can probably get a good idea of these solutions, but becoming really good at them is hard—honestly, I'm still working on them and constantly finding new things I don't fully know.

03:38

Another basic technical set of tools you'll likely need to know is things like SFTP and PGP. These sit in an interesting space—I haven't started talking about "actual" tools yet, the ones people usually think of like Airflow or Snowflake—but these are baseline skills and tools you will likely need to work with. SFTP will come into play somewhere; it still exists today, even with data sharing. At Facebook I had to use SFTP, or secure file transfer protocol, to push files out to external partners, who would then ingest that data, run analytics on it, and give us back some sort of reporting. Similarly, as we went through that process we would often encrypt the file with a set of keys, so using something like PGP, or a similar protocol, will likely be required as well. Now you've got a baseline set of tools—and I've probably used "skills" and "tools" interchangeably here; in all fairness, some of this is tools and some of this is skills—but you can't avoid these. You're going to have to know how to program, how to use SQL, and you're likely going to have to interact with a Linux box somewhere, whether it's an EC2 instance or a GCP compute instance—you're going to be interacting with these solutions and these tools.

04:48

Now, you are a data engineer, so you can't just know these. These are tools that, depending on how you apply them, make you more of a software engineer, a data scientist, or a data engineer, which is why we break these names up. I've seen some people recently poke fun at these names, or say "you're not really an engineer." That's less of the point to me—to me, it captures why these different jobs exist and what they do. Sure, we're data plumbers, that's fine: plumbers still have specific sets of tools, and plumbers also solve very hard problems; I've had them fix a few around my house. So as a data engineer, there are specific tools we use heavily. First, you'll often have to interact with databases—in particular, you'll likely pull a lot of data from source databases, and these tend to be your traditional relational database management systems or something more on the NoSQL side, like document databases such as MongoDB, or Cassandra. You'll also need to know the traditional ones—Postgres, MySQL. You don't have to know every database that exists—there's IBM Db2, there are your Oracle databases—but as long as you get two or three under your belt, each using a somewhat different dialect of SQL, you'll be familiar enough to pull from various sources in the future. Yes, they might all behave slightly differently—one will handle change data capture one way, another a different way; one will have binlogs, one won't—but as long as you get three or four you're comfortable with, where you can build a basic schema, insert data, and update data, you'll likely be okay. You don't have to be an expert, but you need to be familiar enough to build on them and to understand why someone might put an index somewhere.
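To make that concrete, here is that checklist—build a basic schema, insert, update, and add an index—as a tiny sketch with Python's built-in sqlite3. The table is invented, and the same SQL carries over to Postgres or MySQL with only minor dialect changes.

    import sqlite3

    conn = sqlite3.connect(":memory:")  # throwaway database for illustration

    # 1. Build a basic schema.
    conn.execute("""
        CREATE TABLE orders (
            id         INTEGER PRIMARY KEY,
            customer   TEXT NOT NULL,
            amount     REAL NOT NULL,
            created_at TEXT NOT NULL
        )
    """)

    # 2. Insert data.
    conn.execute(
        "INSERT INTO orders (customer, amount, created_at) VALUES (?, ?, ?)",
        ("acme", 99.5, "2024-04-01"),
    )

    # 3. Update data.
    conn.execute("UPDATE orders SET amount = 120.0 WHERE customer = 'acme'")

    # 4. Why someone might put an index somewhere: frequent lookups by customer.
    conn.execute("CREATE INDEX idx_orders_customer ON orders (customer)")
    conn.commit()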

06:38

Again, this will take time—these aren't things you have to rush through. Don't let my 100 days of data engineering make you feel like you have to run through any of this stuff. If you don't learn it now, and you happen to get a job as a data engineer somewhere, you'll be learning it there—so just make sure you don't run too fast, otherwise you're going to be stressed out in your actual job. Now that you've got your databases underway and have built a good understanding of how they operate, you have the next layer of knowledge for what people now call cloud data platforms or cloud data warehouses. Honestly, there are so many different terms for these solutions now, because when you look at them, they aren't always set up like traditional databases. That's why it's good to understand how traditional databases operate: so that when you look at Snowflake, you don't think it's the same thing, that it operates 100% exactly the same way as your traditional database. The same goes for Databricks, which is even further removed from traditional databases and probably harder to grasp—there's a compute engine here, a query engine of sorts with Spark, and there is storage of a kind, but some of the traditional pieces are all piecemealed out; they don't exist in the same framework. That's why I think it's really important to build these steps slowly, so you understand the differences. I always remember when I first interacted with my first data warehouse. I had taken a traditional relational database course in school—I was interning at the same time I was taking it—and looking at this data warehouse I thought, oh, these are kind of the same, right? You've got something called a key here and an ID there—they're the same thing, right? That was my mindset, and a few months later I eventually learned that no, these are different things, and I had to dig in and start reading things like Kimball to actually understand the differences. Again, the more you can understand and see when things are different, the more valuable you become as a data engineer going forward.

08:32

Sorry for that diatribe, but the next set is really these data platforms: Snowflake, Databricks—and we'll throw BigQuery in there. On its own, BigQuery maybe doesn't fit the data platform space as neatly, but if you add in everything else GCP has, like Dataflow and similar services, it kind of does. You can also throw in Redshift and Azure Synapse Analytics, which actually does fit more of that data platform space. Those are your key data platforms, and on them you'll generally be building some sort of data warehouse—or data lakehouse, if that's your cup of tea. It's obviously going to depend on which of these solutions you pick; they all operate slightly differently. The way I often feel it: GCP tends to feel a little more limited in terms of what I can fine-tune, Snowflake tends to be a happy medium, and Databricks gives you a lot of control—but then you have to understand how it operates. It's almost like the old Oracle days: Oracle gave you a ton of control, but that's why you'd pay a lot for Oracle consultants, because they'd have to know how to set up control files and fine-tune a lot of other things as you loaded data, whereas you could just use SQL Server, which I often found a little easier to work with than Oracle. As you learn these solutions, you're going to keep layering more and more of your skills on top of each other. Think about it: what are you likely to be writing when you're working on Snowflake or BigQuery? Likely SQL—that's how you're going to interact with these various solutions. Hopefully, on top of these tools, you also have the skills and best practices to build a data warehouse or data lakehouse, but that's where I draw the line between what's a tool and what's more of a skill and a best practice. How you build data pipelines and data warehouses falls more into skills, best practices, and design, versus the actual tools that help you implement those practices and designs. Along with that, if you're a Databricks fan, you will have to learn Spark: how it operates and how best to interact with it, including when you're writing SQL instead of, say, Python or Scala to interact with Spark—what's the best way to run joins, things like that. It's really important to understand why you might want an engine like Spark versus, say, Presto or Trino. In fact, at Facebook we even had the ability to switch between Spark and Presto depending on the job, because sometimes it was more efficient to use Presto or Trino and sometimes it was more efficient to use Spark—and you'd have to know why. So it's good to know at least a little bit about all of these tools, if not have a deep understanding, because it will become valuable as you're going along.
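For a flavor of what "learning Spark" means day to day, here is a minimal PySpark sketch showing the same aggregation through the DataFrame API and through Spark SQL (assumes pyspark is installed; the input path is made up):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("demo").getOrCreate()

    df = spark.read.csv("s3://my-bucket/orders.csv",
                        header=True, inferSchema=True)

    # The DataFrame API...
    df.groupBy("region").agg(F.sum("amount").alias("total")).show()

    # ...or the same thing expressed as Spark SQL.
    df.createOrReplaceTempView("orders")
    spark.sql(
        "SELECT region, SUM(amount) AS total FROM orders GROUP BY region"
    ).show()

    spark.stop()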

11:13

I think the important thing, as you're going through these steps of learning, is that you don't need to feel rushed. You will learn all of this stuff over time as long as you're putting in the effort. If you're putting in 10 minutes a day, you probably won't learn it, but if you're putting in hours a day—like most of us have at some point, if not still today—you will pick up these solutions, you will learn them, and you will feel confident in actually being able to deliver with them. All right, so now that you've built all of this baseline, the next set of tools you'll often see that you need to know are things like orchestration, ETL, and data pipelines—you can throw ELT in there too. These all fit in a similar space, and I know some people would get mad at me for saying that, but I say it because Airflow, which obviously fits in the workflow orchestrator space, often gets used, when I see it implemented, as an ETL-type solution or data pipeline [...]

12:59

...to run a very basic extraction of data and then load it somewhere, and then maybe add in Snowpipe or something similar that can just pick up a trigger when a file drops into S3. You're really going to see that there are a lot of different ways you can do pipelines, and honestly, what you often find is that there are a few kinds of tools. You can do things very custom and build it yourself—people love doing that for some reason, even though we've collectively built a ton of these. People love open-source solutions—Airflow and Mage are examples; these fit in the orchestrator world but also often just get used as data pipelines or ETL-type flows. And then you have things that are more fully managed and very drag-and-droppy, like SSIS, Azure Data Factory, and a few others that all involve dragging, dropping, and automating tasks that way. So those are the various kinds of tools you'll see. There are a few others that focus very much on just extract-and-load; most of those tend to be very easy to work with, so I don't think you need to put a ton of effort into learning them off the bat—it's very much worth it at some point, but more than likely you'll need to pick some of these solutions, because you will likely use them and they tend to be the easiest to learn. There are others—we could go on forever about orchestrators and data pipelines: Informatica and a few others that often cost a lot of money to get access to. For now, though, understanding the concept and getting a few of these tools under your belt—maybe one or two—is generally enough to at least make you hirable, which is your first goal: get hired into a junior position, and then you can go from there.

14:33

And again, as I referenced earlier, the cloud is another set of tools you will eventually learn. There are a ton of clouds; you don't need to learn them all. Generally I tell most people AWS is a safe bet, because most people use it. If you do want to learn Azure, understand that most of that is going to be at the large enterprises that use it, whereas AWS tends to cover a broader range, and I find that most people who use GCP use it for BigQuery, because they like BigQuery. So if you're going to pick, AWS tools are where I'd start, and I do have an AWS video you can go through to look at the various tools you'll likely need to know—I think I go through eight or nine, because you don't need to know every solution. So the cloud is always a baseline that you need to know.

15:10

Then there's a ton of one-off tools that you may or may not need to know. Honestly, I have mixed feelings, but I will say it's worth at least digging into Docker, because you'll probably have to occasionally start up a Docker instance here or there. Same with Kubernetes: at least understand how it operates, and understand how to kill a pod occasionally. More than likely you'll have a DevOps team that manages it—and if not, you are the DevOps team, and that's now your new job, because it's very hard to build data pipelines and also manage a bunch of infrastructure you've developed. A similar thing can be said about Terraform: it's worth knowing, but in theory you should have a DevOps team that handles it. Obviously, nowadays people seem to be reducing the number of people on these teams, so maybe you will be a one-stop shop for all of this, but these tend to be the things you can learn last—you don't need to put a ton of effort in immediately. Some of it will come through just naturally doing your work. That said, you don't want to be figuring out how Docker works while you're pushing code to production, so make sure you've at least run it a few times—and if it's your first time seeing it in production and you've never run it there before, try to find someone to help you out, because there are a lot of ways it can go wrong.

16:17

And those are most of the baseline tools you need to know. I'm sure there are others people feel I've missed—please comment below; if I think it's a good tool and I'm like, "yeah, that's actually true, I should have covered this," I'll pin it. But I think that's the baseline. It takes a little bit of time to become a data engineer, and this is part of it: you don't have to know all of these tools super in-depth, but you need to know what they do. In an interview, someone might ask where you'd use one of these solutions versus another, how joins work on one solution versus another, or whether Redshift has MERGE—and if it doesn't, how you can end up running something similar to a MERGE statement. All of this is important to at least understand and to have touched here and there.
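For example, the classic answer to that last question: on an engine without a MERGE statement, an upsert is typically staged as a delete-then-insert inside one transaction. Sketched here with Python's built-in sqlite3 so it runs anywhere; the table names are invented, and on a warehouse you would load the staging table with its bulk loader (e.g., COPY) first.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE target  (id INTEGER PRIMARY KEY, amount REAL);
        CREATE TABLE staging (id INTEGER PRIMARY KEY, amount REAL);
        INSERT INTO target  VALUES (1, 10.0), (2, 20.0);
        INSERT INTO staging VALUES (2, 25.0), (3, 30.0);  -- one update, one new row
    """)

    # MERGE-equivalent: delete the rows about to be replaced, then insert.
    with conn:  # wraps both statements in a single transaction
        conn.execute("DELETE FROM target WHERE id IN (SELECT id FROM staging)")
        conn.execute("INSERT INTO target SELECT * FROM staging")

    print(conn.execute("SELECT * FROM target ORDER BY id").fetchall())
    # -> [(1, 10.0), (2, 25.0), (3, 30.0)]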

play16:57

discouraging again you have long careers

play17:00

um it took me a few years to get to a

play17:01

point where I had a title of data

play17:03

engineer and and even now whether that's

play17:05

the title whether the title is data

play17:06

plumber I think is less the point I

play17:08

think the point is how you do your work

play17:10

so hopefully this was helpful for you

play17:12

out there whether you're an analyst an

play17:14

engineer data scientists um to

play17:16

understand the tools that a data

play17:17

engineer will likely need know with that

play17:19

guys thanks so much for watching and I

play17:20

will see you guys in the next one thanks

play17:22

all

play17:27

goodbye a


Related Tags

Data Engineering · Tools Overview · Skill Development · SQL · Python · Linux · Cloud Platforms · ETL · Orchestration · Data Pipelines · Big Data · DevOps