The Harsh Reality of Being a Data Engineer

Seattle Data Guy
30 Sept 2022 10:31

Summary

TL;DR: In this video, Ben Rogojan, the Seattle Data Guy, addresses the harsh realities of being a data engineer. He discusses how little attention software engineers pay to data pipelines, the trend of companies trying to eliminate data engineering roles, and the prevalence of 'data swamps' instead of well-structured data lakes. He also touches on the unrealistic expectation that data professionals be experts in every area of data work and the importance of acknowledging one's limits. Lastly, he advises not to worry about always using the latest technologies, emphasizing the value of understanding foundational concepts.

Takeaways

  • 🔧 Data engineering involves dealing with the harsh realities of data pipeline maintenance and not always working on Big Data systems.
  • 🛠️ Many software engineers may not be aware of how their changes can impact data pipelines, leading to the need for data contracts to ensure data integrity.
  • 🔍 Companies sometimes attempt to eliminate data engineering roles, but often realize the importance of having someone manage and own data pipelines for analysts and scientists.
  • 🏗️ The industry struggles with 'data swamps' where data is dumped without structure, leading to chaos and difficulty in managing and accessing it.
  • 🤔 There's a misconception in companies expecting data professionals to be experts in all aspects of data work, similar to expecting a programmer to know all technology areas.
  • 📈 Data engineers must understand the depth of their skills and know their limits, seeking additional training or senior assistance when necessary.
  • 📚 Learning from older technologies can provide valuable insights into best practices and the evolution of data warehousing and modeling techniques.
  • 🚀 Not using the latest technologies doesn't mean falling behind; it can offer a deeper understanding of where current practices and solutions originated.
  • 💡 The speaker emphasizes the importance of continuous learning and sharing knowledge within the data community to improve overall understanding.
  • 🌐 Data engineers should focus on mastering the basics and not worry too much about the hype around new technologies, as fundamentals are crucial for long-term success.

Q & A

  • What is the main topic of Ben Rogojan's video?

    -The main topic of Ben Rogojan's video is the harsh realities of being a data engineer.

  • Why might software engineers not always care about data?

    -Software engineers might not always care about data because their reviews and performance metrics are often focused on delivering new features and functionality, which may not take into account how these changes could impact data pipelines.

  • What is the purpose of data contracts in the context discussed in the video?

    -Data contracts are becoming important to ensure that changes made by data producers, such as software engineers, do not break data pipelines, as these changes can have a significant impact on data engineers' work.

  • Why did Ben Rogojan mention the development of a system at Facebook?

    -Ben Rogojan mentioned that Facebook had to develop a system that automatically scanned source tables for changes, such as a changed field or data type, so that data engineers knew when a table they relied on was no longer the same.

  • What is one reason companies might want to remove data engineering roles?

    -Some companies want to remove data engineering roles because they see them as a bottleneck and would prefer data analysts and scientists to have direct access to data without the need for data engineering processes.

  • What is the term used to describe poorly structured data storage that Ben Rogojan discussed in the video?

    -The term used to describe poorly structured data storage is 'data swamps,' which refers to chaotic and unorganized data storage situations.

  • Why do companies sometimes struggle with defining the roles of data engineers, data scientists, and data architects?

    -Companies sometimes struggle with defining these roles because they expect individuals in these positions to have a broad range of skills and be able to handle all data-related tasks, which is unrealistic given the complexity and specialization required in the data field.

  • What is the 'Iceberg' meme mentioned by Ben Rogojan regarding SQL, and what does it signify?

    -The 'Iceberg' meme signifies that SQL is a deep and complex skill, with many layers and nuances to understand beyond just the basic commands, and that even experienced professionals continue to learn and discover new aspects of it.

  • Why is it important for data professionals to know their limits and seek help when needed?

    -It is important for data professionals to know their limits because the data field is vast and constantly evolving, making it impossible for one person to be an expert in every area. Seeking help ensures that projects are completed efficiently and accurately.

  • What is the advice given by Ben Rogojan regarding the use of older technologies in data engineering?

    -Ben Rogojan advises that working with older technologies is not a disadvantage, as it allows professionals to understand the history and evolution of best practices in data warehousing and modeling, and to appreciate the reasons behind current approaches.

  • What is the final reality that Ben Rogojan discusses in the video about data engineering?

    -The final reality discussed is that data engineers won't always get to use the newest and most hyped technologies, but focusing on the fundamentals and understanding the evolution of the field is more valuable than chasing the latest trends.

Outlines

00:00

🔧 The Realities of Data Engineering

In this paragraph, Ben Rogojan, the Seattle Data Guy, introduces the topic of the harsh realities faced by data engineers. He explains that the role is not always as exciting as it seems, often involving mundane tasks such as migrating from one platform to another and dealing with the impact of changes made by software engineers on data pipelines. He emphasizes the importance of data contracts to manage these impacts and shares his experience at Facebook, where a system was developed to monitor changes in data sources. He also discusses the misconception that data engineering can be eliminated, highlighting the necessity of data engineers to manage and maintain data pipelines for analysts and data scientists.

05:01

🌐 Data Swamps and the Scope of Data Engineering

This paragraph delves into the issue of 'data swamps,' where data is stored without structure or consideration for future integration and transformation. Ben uses an image from the data engineering subreddit to illustrate the chaotic nature of such environments. He connects this to the broader challenge of companies not fully understanding the roles and responsibilities of data engineers, data scientists, and data architects. The paragraph highlights the unrealistic expectations placed on data professionals to be experts in all aspects of data work, and the need for continuous learning and specialization in the field.

10:01

🛠 Embracing the Evolution of Data Technologies

In the final paragraph, Ben addresses the concern that data engineers might feel left behind by not always using the latest technologies. He argues that working with older technologies can provide valuable insights into best practices and the evolution of data warehousing. He encourages data engineers to focus on mastering the basics rather than chasing the latest trends, suggesting that a deep understanding of foundational concepts is more important than familiarity with new tools. Ben concludes by thanking viewers for watching and looking forward to the next video.

Keywords

💡Data Engineer

A data engineer is a professional who specializes in designing, building, and maintaining systems that manage large volumes of data. In the video, Ben Rogojan, the Seattle Data Guy, discusses the harsh realities of being a data engineer, emphasizing that the role often involves more mundane tasks than the exciting work with big data systems that many might imagine.

💡Data Pipelines

Data pipelines refer to the process of moving and transforming data from one place to another. In the script, it's mentioned that software engineers may not always consider how their changes could impact data pipelines, which is a significant concern for data engineers who are responsible for maintaining the integrity and flow of data.
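To make the pipeline-breaking problem concrete, here is a minimal sketch of how a pipeline owner might detect upstream changes by comparing two schema snapshots. The function name `diff_schemas` and the sample columns are hypothetical, and this is not the system described in the video, only an illustration of the general idea of scanning source tables for changed fields and data types.

```python
from typing import Dict, List

Schema = Dict[str, str]  # column name -> data type


def diff_schemas(expected: Schema, observed: Schema) -> List[str]:
    """Return human-readable differences between two schema snapshots."""
    changes = []
    # Columns the pipeline relies on: were they removed or retyped?
    for column, dtype in expected.items():
        if column not in observed:
            changes.append(f"removed column: {column}")
        elif observed[column] != dtype:
            changes.append(f"type changed: {column} {dtype} -> {observed[column]}")
    # Columns that appeared upstream since the last snapshot.
    for column in observed:
        if column not in expected:
            changes.append(f"added column: {column}")
    return changes


# Hypothetical snapshots of one upstream table taken on two different days.
yesterday = {"id": "int", "email": "varchar", "signup_date": "date"}
today = {"id": "bigint", "email": "varchar", "plan": "varchar"}

for change in diff_schemas(yesterday, today):
    print(change)
```

In practice the snapshots would come from the source system's catalog rather than hard-coded dictionaries, and the differences would feed an alert instead of a print statement.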

💡Data Contracts

Data contracts are agreements that outline the responsibilities and expectations between data producers and consumers, ensuring that data remains consistent and reliable. The video script highlights the increasing prevalence of data contracts as a response to the potential disruptions in data pipelines caused by changes made by various data producers.
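As a rough illustration of what such an agreement can look like in code, the sketch below checks a record against a declared set of required fields and types. The contract contents (`ORDER_CONTRACT`) and the function name `contract_violations` are assumptions made for this example; real data contracts usually cover more, such as ownership, semantics, and change procedures.

```python
from typing import Any, Dict, List

# Hypothetical contract: required fields and the Python type each must have.
ORDER_CONTRACT: Dict[str, type] = {
    "order_id": int,
    "customer_email": str,
    "amount_usd": float,
}


def contract_violations(record: Dict[str, Any], contract: Dict[str, type]) -> List[str]:
    """List every way a record fails to meet the contract (empty list if none)."""
    violations = []
    for field, expected_type in contract.items():
        if field not in record:
            violations.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            violations.append(
                f"wrong type for {field}: expected {expected_type.__name__}, "
                f"got {type(record[field]).__name__}"
            )
    return violations


good = {"order_id": 1, "customer_email": "a@example.com", "amount_usd": 19.99}
bad = {"order_id": "1", "amount_usd": 19.99}  # producer changed a type and dropped a field

print(contract_violations(good, ORDER_CONTRACT))
print(contract_violations(bad, ORDER_CONTRACT))
```

Running a check like this on the producer side, before a change ships, is what lets the producer learn about a breaking change before the downstream pipeline does.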

💡Data Swamps

The term 'data swamps' is used in the video to describe disorganized and chaotic data storage situations where data is dumped without structure or consideration for future retrieval and analysis. It serves as a counterpoint to the idealized concept of a 'data lake,' illustrating the challenges data engineers face in real-world scenarios.
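One common way to avoid the flat-bucket situation described in the video (thousands of raw files in one place, with no folders and no record of when each file landed) is to encode the source, dataset, and landing date in every object key. The sketch below only builds the key string; the `raw/` prefix, the `year=/month=/day=` layout, and the function name are assumptions for illustration, not something the video prescribes.

```python
from datetime import datetime, timezone


def partitioned_key(source: str, dataset: str, filename: str, landed_at: datetime) -> str:
    """Build an object key that records where a file came from and when it landed."""
    return "/".join(
        [
            "raw",
            source,
            dataset,
            f"year={landed_at:%Y}",
            f"month={landed_at:%m}",
            f"day={landed_at:%d}",
            filename,
        ]
    )


# Hypothetical file landing on 30 Sept 2022.
landed = datetime(2022, 9, 30, 10, 31, tzinfo=timezone.utc)
print(partitioned_key("salesforce", "accounts", "export.csv", landed))
```

With keys shaped like this, finding the most recently dropped file becomes a prefix listing rather than a scan of the whole bucket.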

💡Technical Debt

Technical debt is the accumulated cost of rework created when code or systems are built hastily or with known shortcuts. In the video, it comes up as a consequence of trying to eliminate or reduce data engineering: the approach often leads to problems later, and some companies end up having to rebuild their data engineering strategy.

💡Data Quality

Data quality refers to the overall integrity and reliability of data, which is crucial for accurate analysis and decision-making. The video script discusses how companies that try to remove data engineering roles often encounter issues with data quality, necessitating the reimplementation of data engineering strategies.

💡Data Warehousing

Data warehousing is the process of collecting, storing, and managing large amounts of data in a way that facilitates easy access and analysis. The script touches on the importance of understanding the historical development of data warehousing practices, as this knowledge informs modern approaches to data management.

💡Slowly Changing Dimension (SCD)

A slowly changing dimension is a dimension (for example, a customer or product record) whose attributes change occasionally rather than constantly, so the warehouse must decide whether to overwrite the old value or preserve it as history. The video uses SCDs as an example of a concept whose purpose is easier to appreciate after working with older data warehousing technologies.
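A small sketch may help show why the concept exists. The code below follows the commonly used "Type 2" approach, in which a change closes the current row and appends a new one so that history stays queryable. The `CustomerRow` fields and the `apply_scd2` helper are hypothetical names chosen for this example; the video does not walk through an implementation.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class CustomerRow:
    customer_id: int
    city: str
    valid_from: date
    valid_to: Optional[date] = None  # None marks the current version


def apply_scd2(
    history: List[CustomerRow], customer_id: int, new_city: str, changed_on: date
) -> List[CustomerRow]:
    """Close the customer's current row and append a new version (Type 2 style)."""
    current = next(
        (r for r in history if r.customer_id == customer_id and r.valid_to is None),
        None,
    )
    if current is not None and current.city == new_city:
        return list(history)  # nothing changed, keep history as-is
    # End-date the old version instead of overwriting it.
    updated = [replace(r, valid_to=changed_on) if r is current else r for r in history]
    updated.append(CustomerRow(customer_id, new_city, valid_from=changed_on))
    return updated


rows = [CustomerRow(1, "Seattle", date(2020, 1, 1))]
rows = apply_scd2(rows, 1, "Portland", date(2022, 9, 30))
for row in rows:
    print(row)
```

Overwriting the city in place would be simpler, but any report that asks where the customer lived in 2021 would then return the wrong answer, which is the problem this pattern was designed to solve.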

💡SQL

SQL, or Structured Query Language, is the standard language for managing and manipulating relational databases. The video emphasizes the depth and complexity of SQL, noting that even seasoned professionals like the speaker consider themselves to be only moderately proficient due to the constant evolution of the language and database engines.

💡Hype and FOMO

Hype refers to the excitement or publicity generated around a product or idea, while FOMO stands for 'Fear of Missing Out.' The script advises against getting too caught up in the hype of new technologies or the fear of missing out on them, suggesting that focusing on the fundamentals and mastering existing technologies is more valuable.

💡Data Modeling

Data modeling is the process of creating a representation of data structures and their relationships within a database. The video script mentions that understanding the historical context of data modeling techniques is important, as it provides insight into why certain practices exist and how they are evolving.

Highlights

Reviewing the harsh realities of being a data engineer, including the less exciting aspects of the job.

Software engineers often don't consider the impact of their new features on data pipelines.

The rise of data contracts to manage changes that could affect data pipelines.

Data producers' lack of awareness of how their changes can break data pipelines.

The necessity of data engineers to manage and own data pipelines for data sets to be understandable.

Companies' attempts to remove data engineering roles and the subsequent need to rehire.

The prevalence of data swamps versus the ideal of well-structured data lakes.

The chaotic nature of data swamps and the difficulty in managing unstructured data.

Misunderstandings within companies about the roles and responsibilities of data professionals.

The unrealistic expectation for data professionals to be experts in all data-related fields.

The depth and complexity of SQL and database engines, beyond basic SQL commands.

The importance of recognizing one's limitations and seeking help or training when needed.

The value of understanding the history and evolution of data practices and technologies.

The shift in technology trends and the importance of not getting stuck on using only the newest tools.

The balance between learning new technologies and mastering the fundamentals of data engineering.

The reassurance that working with older technologies can provide valuable insights into data practices.

Transcripts

play00:00

what is going on guys welcome back to

play00:02

another video with me Ben Rogojan aka

play00:04

the Seattle data guy today I wanted to

play00:06

review a subject that I kind of put

play00:08

together a while ago on a previous video

play00:10

which is the harsh realities of being a

play00:13

data engineer now I'm going to do this

play00:15

because a lot of people always ask you

play00:16

know should I become a data engineer is

play00:18

it the right role for me and things

play00:21

similar to those types of questions so I

play00:23

wanted to cover some of the harsh

play00:24

realities of being a data engineer

play00:26

because at the end of the day a lot of

play00:28

the work you're going to do might not

play00:29

always be exhilarating it might not

play00:31

always involve you working on Big Data

play00:33

Systems and a lot of it might just be

play00:35

migration from one platform to another

play00:38

and this isn't always exciting let's

play00:40

start with the fact that in general most

play00:43

software Engineers just don't care about

play00:45

data now let me be clear obviously one

play00:48

this is a massive generalization and two

play00:50

I mean more in terms of analytical

play00:52

purposes most of the time if you are a

play00:55

data engineer you're pulling data from

play00:56

various sources many of which are often

play00:59

being built by software Engineers who if

play01:02

they're being judged or have reviews

play01:04

that are all geared on their ability to

play01:06

deliver new features and functionality

play01:08

don't always pay attention to how those

play01:10

new features and functionality could

play01:12

possibly break your data pipelines and

play01:14

this is why we see a lot about data

play01:17

contracts kind of coming out because

play01:18

there are a lot of people that could be

play01:19

producing data and I'm saying software

play01:21

Engineers but honestly it's not just

play01:23

them it's General data producers for

play01:25

example if you work on Salesforce and

play01:27

you're a person who's maybe adding new

play01:29

features and columns in terms of trying

play01:31

to track any information or maybe taking

play01:33

away previous information from different

play01:36

Salesforce objects you could also

play01:39

possibly break data pipelines so really

play01:41

it's more General to say that a lot of

play01:43

data producers don't always necessarily

play01:45

care or at the very least know that if

play01:48

they make these small changes they will

play01:50

drastically impact your life as a data

play01:53

engineer again this is why data

play01:54

contracts are becoming a thing I feel

play01:55

like I'm seeing this pop up everywhere

play01:57

from various newsletters to new startups

play02:00

to LinkedIn posts all about this because

play02:02

everyone knows that this is a problem

play02:04

when I was at Facebook we had to develop

play02:06

a whole system that basically

play02:07

automatically looked and scanned to see

play02:09

if tables changed from your sources to

play02:12

make sure you knew that hey this table

play02:14

you're relying on is no longer the same

play02:16

you know some field has changed some

play02:18

data type has changed something similar

play02:20

to that so there is no way to sugarcoat

play02:22

it a lot of your work is going to be

play02:24

stuck spending time trying to fix all of

play02:26

these small changes that someone else

play02:28

produces on top of that I feel like I've

play02:30

had a few conversations now with various

play02:32

heads of data where they discussed how

play02:34

in previous companies they were all

play02:36

trying to remove data engineering

play02:38

somehow they wanted to get rid of data

play02:40

Engineers they were like well we just

play02:42

want the data analysts and data

play02:45

scientists to directly access it that's

play02:46

why I put together this picture and it's

play02:48

many of these companies had to rehire

play02:50

and re-implement their data engineering

play02:52

strategies because I do think that as

play02:54

much as companies want to get rid of

play02:56

what they consider a bottleneck which is

play02:58

data engineering they must also admit

play03:00

that someone needs to manage and own

play03:03

these pipelines that create data sets

play03:05

that are actually understandable by

play03:06

analysts and data scientists so yes there

play03:09

are a lot of companies that are trying

play03:10

to remove data engineering as well as

play03:12

tooling that just wants to kind of

play03:13

eliminate or reduce the amount of Need

play03:15

for data Engineers which kind of makes

play03:17

sense because there's not a lot of us

play03:19

out there I think this is kind of proven

play03:21

by the difference in numbers when you

play03:22

look at the different subreddits for

play03:24

data science versus data engineering but

play03:26

it often leads to problems and a lot of

play03:28

technical debt in the future I mean I

play03:30

think you're seeing this at a lot of

play03:32

companies you can read some articles

play03:33

like the Airbnb article where they

play03:35

eventually had to start reinstating or

play03:37

just implementing a data quality and

play03:40

data engineering strategy because they

play03:42

just kept running into various problems

play03:44

so it's a reality that a lot of

play03:46

companies want to get rid of data

play03:47

engineering to some degree or another

play03:49

but it's also a reality that it is very

play03:52

very hard to do so so for those of you

play03:54

who wonder if DE is a good job choice I

play03:57

would say we've got at least another

play03:58

decade of doing this work another great

play04:01

harsh reality that exists in the whole

play04:04

world of data engineering is as much as

play04:06

we like to think that everything is

play04:08

perfect and every company out there that

play04:10

has written an article about you know

play04:12

developing a perfect data lake or data

play04:14

lake house exists out there are just a

play04:17

lot of data swamps out there as well and

play04:20

I've definitely had to go through a few

play04:21

of them I had to work with one company

play04:22

where they were just dumping all of

play04:26

their files into an S3 bucket no folders

play04:29

no structure nothing about time like

play04:31

when something was dropped it was just

play04:33

all of their raw files into one S3 data

play04:37

bucket there were thousands and

play04:39

thousands of files and there was very

play04:41

little ability to even search and figure

play04:43

out where and what file was just

play04:45

recently dropped so it was definitely a

play04:46

chaotic mess and you don't even know

play04:48

where sometimes to start in those

play04:50

situations because you're just so

play04:52

flabbergasted that it happened so there

play04:54

are tons of data swamps it's not just me

play04:56

who's posted this out there in fact I

play04:57

saw this hilarious image on the data

play05:00

engineering subreddit covered a lot of

play05:02

these points you know if you look at

play05:03

this image you can kind of see that you

play05:04

know data has just gone out of control

play05:06

there's no real structure everything's

play05:08

just kind of stored in the data Lake

play05:10

there's just a lot of problems that

play05:11

arise from these situations where yeah

play05:14

it's just kind of a data swamp it is a

play05:16

reality that you know we try to go into

play05:19

this world where we're going to move

play05:20

fast and create data and create value

play05:23

but all that really ended up happening

play05:25

was we stuffed a bunch of data somewhere

play05:27

and we never really thought about the

play05:28

transform or how we're gonna like

play05:30

integrate it or all these other key

play05:32

things that data Engineers do so this is

play05:35

somewhat connected with getting rid of

play05:36

data Engineers another reality that will

play05:38

never I think go away is just the

play05:41

ability for companies to actually know

play05:42

what data engineers and data scientists

play05:45

and data Architects all should be doing

play05:47

I was looking at a post on the data

play05:50

engineering subreddit and I think it

play05:51

kind of covered this really well you

play05:53

know where they just kind of point out

play05:54

the fact that a lot of companies expect

play05:55

data people to do all data things it's

play05:58

similar to if you were some sort of

play06:00

programmer before they kind of just

play06:02

expect you to do all technology things

play06:04

you know you should understand how to do

play06:06

database things and website things and

play06:08

back-end things and front-end things and

play06:10

networking things and Linux things and

play06:12

anything that has to do with you know a

play06:14

keyboard and a mouse and a terminal you

play06:17

should be able to do and we're kind of

play06:19

in the same space now in the data world

play06:21

where everyone's just kind of expected

play06:23

to do everything even if it's not what

play06:25

you're good at and I liked how they kind

play06:27

of point out here that a lot of people

play06:28

especially people who are just breaking

play06:29

into the industry are using like high

play06:32

level YouTube videos and I've definitely

play06:33

put out plenty of high level videos to

play06:35

kind of say that they are proficient in

play06:37

these skills and most of these skills

play06:38

even SQL are a lifetime skill you can

play06:41

learn a lot of SQL in a year but really

play06:44

it's just the surface I really love that

play06:47

Iceberg meme recently put out about SQL

play06:49

but it really is so deep because it's

play06:51

not just about like SQL it's also about

play06:54

the database engine underneath like are

play06:55

you an expert in Oracle or SQL server or

play06:58

Snowflake or Databricks or whatever

play07:00

solution you're picking because all of

play07:02

these even operate differently at least

play07:04

to some degree and obviously you need

play07:07

to as a data person be able to work on

play07:09

most of these but to say that you're an

play07:12

expert on all these because there's just

play07:13

so much to know how to optimize each of

play07:16

these different solutions how to make

play07:18

sure you're you know writing a SQL in

play07:20

the best way it's all going to be

play07:22

slightly different so even if you know

play07:23

all of the SQL commands there's just so

play07:26

much more to know that's why I don't

play07:28

think I'll ever say that I'm a 10 in SQL

play07:31

whenever an interviewer asks how good is

play07:33

your SQL I'll probably always say like a

play07:35

six or seven because every year there's

play07:37

some new skill or New Concept that I'm

play07:40

like oh I didn't know this before but

play07:42

yet companies still expect you to know

play07:44

everything because that's just the way

play07:46

the tech World works it's like when you

play07:48

go to your parents house and they're

play07:50

having an issue with the router and they

play07:51

expect you to fix it because for some

play07:53

reason you can write a few lines of

play07:54

python in these cases it's important to

play07:56

know where your limit is if you're new

play07:58

to any specific area yes you can kind of

play08:01

figure out some of it but as soon as

play08:03

something gets deeper it's probably not

play08:05

a bad idea for you to go to your manager

play08:07

or your director and just let them know

play08:10

that you're kind of out of your depth

play08:11

and either you need some training or

play08:14

maybe there needs to be someone else

play08:15

that comes in that is more senior

play08:17

because there's just too much in the

play08:18

data world for one person to know that's

play08:21

why in a lot of my recent videos I've

play08:22

definitely tried to like bring in other

play08:23

people's knowledge because there's just

play08:25

so much information and so much

play08:27

knowledge that's trapped in everyone's

play08:29

brains and the more we can kind of get

play08:30

it out there I think hopefully the more

play08:32

people can kind of glean and understand

play08:34

what exactly is going on in this whole

play08:36

data World finally the last reality I

play08:38

think that's important just to

play08:40

understand is that you won't always get

play08:42

to use the new hyped and cool things

play08:44

that exist out there and I think that's

play08:47

more than okay mostly because I remember

play08:48

when I first started out in the data

play08:50

world I was working mostly on Oracle SQL

play08:52

server postgres and a lot of things that

play08:54

were on-prem at this point a lot of

play08:55

companies were moving to like redshift

play08:57

and Hadoop and I felt like really left

play09:00

behind but the interesting thing was

play09:02

that as soon as I finally kind of

play09:04

started working with more modern

play09:06

solutions companies start to move away

play09:07

from Hadoop and even redshift although

play09:10

redshift is still I think very popular

play09:11

to use and even Hadoop at plenty of

play09:13

companies just often in a different form

play09:16

it's not the end of the world I think to

play09:18

sometimes be on technologies that are a

play09:20

little bit older one you get to learn

play09:22

where a lot of like our best practices

play09:24

and solutions that we've developed over

play09:26

the last few decades have all come from

play09:28

why have we developed data warehousing

play09:30

the way that we have why in the world do

play09:33

slowly changing dimensions exist like

play09:35

all these things that you might just

play09:37

learn or read in a book you can kind of

play09:39

understand a little more why they exist

play09:41

why they're around and why some

play09:43

people want to change them by

play09:45

using different modeling techniques

play09:46

whereas if you just kind of jumped into

play09:48

the modern world of data engineering and

play09:51

data modeling and things in that space

play09:53

you might not get all of the Nuance from

play09:55

where we came from and why we're

play09:57

changing or at least trying to change or

play09:59

reapproach a lot of the problems that

play10:01

we've been looking at for the last few

play10:02

decades so I just wouldn't get too stuck

play10:05

on the fact that hey we're using old

play10:07

technology I'm falling behind because

play10:09

one you're going to likely work with

play10:11

that technology at some point if it's

play10:12

worth working with a lot of stuff is

play10:14

just a combination of marketing and hype

play10:16

and people fomoing about not getting to

play10:19

use the new technology that everyone

play10:20

else is testing out but I wouldn't worry

play10:22

about it too much and I would just make

play10:24

sure you focus on the basics and just

play10:26

get better at that with that guys I want

play10:27

to say thank you so much for watching

play10:28

this video and I will see you next time

play10:30

thank you and

Related Tags
Data Engineering, Career Advice, Tech Reality, Data Pipelines, Software Engineers, Data Contracts, Data Swamps, Data Lakes, Data Quality, Industry Insights, Skill Development