The Harsh Reality of Being a Data Engineer
Summary
TLDR: In this video, Ben Rogojan, the Seattle Data Guy, addresses the harsh realities of being a data engineer. He discusses the lack of attention from software engineers to data pipelines, the trend of companies trying to eliminate data engineering roles, and the prevalence of 'data swamps' instead of well-structured data lakes. He also touches on the unrealistic expectations placed on data professionals to be experts in all areas of data work and the importance of acknowledging one's limits. Lastly, he advises not to worry about always using the latest technologies, emphasizing the value of understanding foundational concepts.
Takeaways
- 🔧 Data engineering involves dealing with the harsh realities of data pipeline maintenance and not always working on Big Data systems.
- 🛠️ Many software engineers may not be aware of how their changes can impact data pipelines, leading to the need for data contracts to ensure data integrity.
- 🔍 Companies sometimes attempt to eliminate data engineering roles, but often realize the importance of having someone manage and own data pipelines for analysts and scientists.
- 🏗️ The industry struggles with 'data swamps' where data is dumped without structure, leading to chaos and difficulty in managing and accessing it.
- 🤔 Companies often mistakenly expect data professionals to be experts in every aspect of data work, much like expecting a programmer to know every area of technology.
- 📈 Data engineers must understand the depth of their skills and know their limits, seeking additional training or senior assistance when necessary.
- 📚 Learning from older technologies can provide valuable insights into best practices and the evolution of data warehousing and modeling techniques.
- 🚀 Not using the latest technologies doesn't mean falling behind; it can offer a deeper understanding of where current practices and solutions originated.
- 💡 The speaker emphasizes the importance of continuous learning and sharing knowledge within the data community to improve overall understanding.
- 🌐 Data engineers should focus on mastering the basics and not worry too much about the hype around new technologies, as fundamentals are crucial for long-term success.
Q & A
What is the main topic of Ben Rogojan's video?
-The main topic of Ben Rogojan's video is the harsh realities of being a data engineer.
Why might software engineers not always care about data?
-Software engineers might not always care about data because their reviews and performance metrics are often focused on delivering new features and functionality, which may not take into account how these changes could impact data pipelines.
What is the purpose of data contracts in the context discussed in the video?
-Data contracts are becoming important to ensure that changes made by data producers, such as software engineers, do not break data pipelines, as these changes can have a significant impact on data engineers' work.
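As a minimal sketch of what a data contract check can look like, the Python below validates one record against a declared list of fields. It is not taken from the video: the `FieldSpec` class, the `ORDERS_CONTRACT` feed, and the `contract_violations` function are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSpec:
    """One field a producer promises to deliver (hypothetical structure)."""
    name: str
    type_: type
    required: bool = True


# Hypothetical contract for an illustrative "orders" feed.
ORDERS_CONTRACT = [
    FieldSpec("order_id", int),
    FieldSpec("amount", float),
    FieldSpec("coupon_code", str, required=False),
]


def contract_violations(record: dict, contract: list[FieldSpec]) -> list[str]:
    """Return human-readable problems; an empty list means the record conforms."""
    problems = []
    for spec in contract:
        if spec.name not in record:
            # Optional fields may be absent; required ones may not.
            if spec.required:
                problems.append(f"missing required field: {spec.name}")
            continue
        value = record[spec.name]
        if not isinstance(value, spec.type_):
            problems.append(
                f"{spec.name}: expected {spec.type_.__name__}, "
                f"got {type(value).__name__}"
            )
    return problems
```

A producer-side test or a pipeline's first step could call `contract_violations` on sample records and fail loudly when the list is non-empty, so a breaking change surfaces before it reaches downstream tables.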
Why did Ben Rogojan mention the development of a system at Facebook?
-Ben Rogojan mentioned a system developed at Facebook that automatically scanned source tables for changes, so that data engineers knew when a table they relied on was no longer the same (for example, when a field or data type had changed).
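The idea of scanning source tables for changes can be sketched as a comparison between two schema snapshots. This is only an illustration under stated assumptions, not the system described in the video: the `diff_schemas` function and the column-name-to-type dictionaries are hypothetical, and how the snapshots are captured (for example from a database's information schema) is left out.

```python
def diff_schemas(old: dict[str, str], new: dict[str, str]) -> dict[str, list]:
    """Compare two {column_name: data_type} snapshots of the same table."""
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    # Columns present in both snapshots whose declared type differs.
    retyped = sorted(
        (col, old[col], new[col])
        for col in set(old) & set(new)
        if old[col] != new[col]
    )
    return {"added": added, "removed": removed, "retyped": retyped}


# Hypothetical snapshots taken on two consecutive days.
yesterday = {"id": "bigint", "email": "varchar", "signup_ts": "timestamp"}
today = {"id": "bigint", "email": "text", "plan": "varchar"}

changes = diff_schemas(yesterday, today)
if any(changes.values()):
    print("source table changed:", changes)
```

Running such a comparison on a schedule and alerting the pipeline owner when any of the three lists is non-empty gives the "this table you rely on is no longer the same" signal the answer describes.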
What is one reason companies might want to remove data engineering roles?
-Some companies want to remove data engineering roles because they see them as a bottleneck and would prefer data analysts and scientists to have direct access to data without the need for data engineering processes.
What is the term used to describe poorly structured data storage that Ben Rogojan discussed in the video?
-The term used to describe poorly structured data storage is 'data swamps,' which refers to chaotic and unorganized data storage situations.
Why do companies sometimes struggle with defining the roles of data engineers, data scientists, and data architects?
-Companies sometimes struggle with defining these roles because they expect individuals in these positions to have a broad range of skills and be able to handle all data-related tasks, which is unrealistic given the complexity and specialization required in the data field.
What is the 'Iceberg' meme mentioned by Ben Rogojan regarding SQL, and what does it signify?
-The 'Iceberg' meme signifies that SQL is a deep and complex skill, with many layers and nuances to understand beyond just the basic commands, and that even experienced professionals continue to learn and discover new aspects of it.
Why is it important for data professionals to know their limits and seek help when needed?
-It is important for data professionals to know their limits because the data field is vast and constantly evolving, making it impossible for one person to be an expert in every area. Seeking help ensures that projects are completed efficiently and accurately.
What is the advice given by Ben Rogojan regarding the use of older technologies in data engineering?
-Ben Rogojan advises that working with older technologies is not a disadvantage, as it allows professionals to understand the history and evolution of best practices in data warehousing and modeling, and to appreciate the reasons behind current approaches.
What is the final reality that Ben Rogojan discusses in the video about data engineering?
-The final reality discussed is that data engineers won't always get to use the newest and most hyped technologies, but focusing on the fundamentals and understanding the evolution of the field is more valuable than chasing the latest trends.
Outlines
🔧 The Realities of Data Engineering
In this paragraph, Ben Rogojan, the Seattle Data Guy, introduces the topic of the harsh realities faced by data engineers. He explains that the role is not always as exciting as it seems, often involving mundane tasks such as data migration and dealing with the impact of changes made by software engineers on data pipelines. He emphasizes the importance of data contracts to manage these impacts and shares his experience at Facebook, where a system was developed to monitor changes in data sources. He also discusses the misconception that data engineering can be eliminated, highlighting the necessity of data engineers to manage and maintain data pipelines for analysts and data scientists.
🌐 Data Swamps and the Scope of Data Engineering
This paragraph delves into the issue of 'data swamps,' where data is stored without structure or consideration for future integration and transformation. Ben uses an image from the data engineering subreddit to illustrate the chaotic nature of such environments. He connects this to the broader challenge of companies not fully understanding the roles and responsibilities of data engineers, data scientists, and data architects. The paragraph highlights the unrealistic expectations placed on data professionals to be experts in all aspects of data work, and the need for continuous learning and specialization in the field.
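The swamp described in the video was a single S3 bucket with no folders and no record of when a file arrived. One common way to avoid that is to build every object key from the source, the dataset, and the arrival time. The sketch below shows the idea only; the `landing_key` function and the `raw/source=.../dt=...` layout are assumptions for illustration, not a convention taken from the video.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath


def landing_key(source: str, dataset: str, filename: str, received_at: datetime) -> str:
    """Build a partitioned object key so files can be found by source and arrival time."""
    if received_at.tzinfo is None:
        raise ValueError("received_at must be timezone-aware")
    ts = received_at.astimezone(timezone.utc)  # normalise to UTC so partitions line up
    return "/".join([
        "raw",
        f"source={source}",
        f"dataset={dataset}",
        f"dt={ts:%Y-%m-%d}",
        f"hour={ts:%H}",
        PurePosixPath(filename).name,  # keep only the file name, drop any directories
    ])


key = landing_key(
    "salesforce",
    "accounts",
    "export_001.csv",
    datetime(2024, 3, 5, 14, 30, tzinfo=timezone.utc),
)
print(key)
```

With keys shaped like this, "which files were dropped most recently?" becomes a listing of one date prefix instead of a scan over thousands of unordered objects.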
🛠 Embracing the Evolution of Data Technologies
In the final paragraph, Ben addresses the concern that data engineers might feel left behind by not always using the latest technologies. He argues that working with older technologies can provide valuable insights into best practices and the evolution of data warehousing. He encourages data engineers to focus on mastering the basics rather than chasing the latest trends, suggesting that a deep understanding of foundational concepts is more important than familiarity with new tools. Ben concludes by thanking viewers for watching and looking forward to the next video.
Keywords
💡Data Engineer
💡Data Pipelines
💡Data Contracts
💡Data Swamps
💡Technical Debt
💡Data Quality
💡Data Warehousing
💡Slowly Changing Dimension (SCD)
💡SQL
💡Hype and FOMO
💡Data Modeling
Highlights
Reviewing the harsh realities of being a data engineer, including the less exciting aspects of the job.
Software Engineers often don't consider the impact of their new features on data pipelines.
The rise of data contracts to manage changes that could affect data pipelines.
Data producers' lack of awareness of how their changes can break data pipelines.
The need for data engineers to manage and own the pipelines that make data sets understandable.
Companies' attempts to remove data engineering roles and the subsequent need to rehire.
The prevalence of data swamps versus the ideal of well-structured data lakes.
The chaotic nature of data swamps and the difficulty in managing unstructured data.
Misunderstandings within companies about the roles and responsibilities of data professionals.
The unrealistic expectation for data professionals to be experts in all data-related fields.
The depth and complexity of SQL and database engines, beyond basic SQL commands.
The importance of recognizing one's limitations and seeking help or training when needed.
The value of understanding the history and evolution of data practices and technologies.
The shift in technology trends and the importance of not getting stuck on using only the newest tools.
The balance between learning new technologies and mastering the fundamentals of data engineering.
The reassurance that working with older technologies can provide valuable insights into data practices.
Transcripts
what is going on guys welcome back to
another video with me Ben Rogojan aka
the Seattle Data Guy today I wanted to
review a subject that I kind of put
together a while ago on a previous video
which is the harsh realities of being a
data engineer now I'm going to do this
because a lot of people always ask you
know should I become a data engineer is
it the right role for me and things
similar to those types of questions so I
wanted to cover some of the harsh
realities of being a data engineer
because at the end of the day a lot of
the work you're going to do might not
always be exhilarating it might not
always involve you working on Big Data
Systems and a lot of it might just be
migration from one platform to another
and this isn't always exciting let's
start with the fact that in general most
software Engineers just don't care about
data now let me be clear obviously one
this is a massive generalization and two
I mean more in terms of analytical
purposes most of the time if you are a
data engineer you're pulling data from
various sources many of which are often
being built by software Engineers who if
they're being judged or have reviews
that are all geared on their ability to
deliver new features and functionality
don't always pay attention to how those
new features and functionality could
possibly break your data pipelines and
this is why we see a lot about data
contracts kind of coming out because
there are a lot of people that could be
producing data and I'm saying software
Engineers but honestly it's not just
them it's General data producers for
example if you work on Salesforce and
you're a person who's maybe adding new
features and columns in terms of trying
to track any information or maybe taking
away previous information from different
Salesforce objects you could also
possibly break data pipelines so really
it's more General to say that a lot of
data producers don't always necessarily
care or at the very least know that if
they make these small changes they will
drastically impact your life as a data
engineer again this is why data
contracts are becoming a thing I feel
like I'm seeing this pop up everywhere
from various newsletters to new startups
to LinkedIn posts all about this because
everyone knows that this is a problem
when I was at Facebook we had to develop
a whole system that basically
automatically looked and scanned to see
if tables changed from your sources to
make sure you knew that hey this table
you're relying on is no longer the same
you know some field has changed some
data type has changed something similar
to that so there is no way to sugarcoat
it a lot of your work is going to be
stuck spending time trying to fix all of
these small changes that someone else
produces on top of that I feel like I've
had a few conversations now with various
heads of data where they discussed how
in previous companies they were all
trying to remove data engineering
somehow they wanted to get rid of data
Engineers they were like well we just
want the data analysts and data
scientists to directly access it that's
why I put together this picture and it's
many of these companies had to rehire
and re-implement their data engineering
strategies because I do think that as
much as companies want to get rid of
what they consider a bottleneck which is
data engineering they must also admit
that someone needs to manage and own
these pipelines that create data sets
that are actually understandable by
analysts and data scientists so yes there
are a lot of companies that are trying
to remove data engineering as well as
tooling that just wants to kind of
eliminate or reduce the amount of Need
for data Engineers which kind of makes
sense because there's not a lot of us
out there I think this is kind of proven
by the difference in numbers when you
look at the different subreddits for
data science versus data engineering but
it often leads to problems and a lot of
technical debt in the future I mean I
think you're seeing this at a lot of
companies you can read some articles
like the Airbnb article where they
eventually had to start reinstating or
just implementing a data quality and
data engineering strategy because they
just kept running into various problems
so it's a reality that a lot of
companies want to get rid of data
engineering to some degree or another
but it's also a reality that it is very
very hard to do so so for those of you
who wonder if DE is a good job choice I
would say we've got at least another
decade of doing this work another great
harsh reality that exists in the whole
world of data engineering is as much as
we like to think that everything is
perfect and every company out there that
has written an article about you know
developing a perfect data lake or data
lakehouse exists there are just a
lot of data swamps out there as well and
I've definitely had to go through a few
of them I had to work with one company
where they were just dumping all of
their files into an S3 bucket no folders
no structure nothing about time like
when something was dropped it was just
all of their raw files into one S3 data
bucket there were thousands and
thousands of files and there was very
little ability to even search and figure
out where and what file was just
recently dropped so it was definitely a
chaotic mess and you don't even know
where sometimes to start in those
situations because you're just so
flabbergasted that it happened so there
are tons of data swamps it's not just me
who's posted this out there in fact I
saw this hilarious image on the data
engineering subreddit that covered a lot of
these points you know if you look at
this image you can kind of see that you
know data has just gone out of control
there's no real structure everything's
just kind of stored in the data Lake
there's just a lot of problems that
arise from these situations where yeah
it's just kind of a data swamp it is a
reality that you know we try to go into
this world where we're going to move
fast and create data and create value
but all that really ended up happening
was we stuffed a bunch of data somewhere
and we never really thought about the
transform or how we're gonna like
integrate it or all these other key
things that data Engineers do so this is
somewhat connected with getting rid of
data Engineers another reality that will
never I think go away is just the
ability for companies to actually know
what data engineers and data scientists
and data Architects all should be doing
I was looking at a post on the data
engineering subreddit and I think it
kind of covered this really well you
know where they just kind of point out
the fact that a lot of companies expect
data people to do all data things it's
similar to if you were some sort of
programmer before they kind of just
expect you to do all technology things
you know you should understand how to do
database things and website things and
back-end things and front-end things and
networking things and Linux things and
anything that has to do with you know a
keyboard and a mouse and a terminal you
should be able to do and we're kind of
in the same space now in the data world
where everyone's just kind of expected
to do everything even if it's not what
you're good at and I liked how they kind
of point out here that a lot of people
especially people who are just breaking
into the industry are using like high
level YouTube videos and I've definitely
put out plenty of high level videos to
kind of say that they are proficient in
these skills and most of these skills
even SQL are a lifetime skill you can
learn a lot of SQL in a year but really
it's just the surface I really love that
Iceberg meme recently put out about SQL
but it really is so deep because it's
not just about like SQL it's also about
the database engine underneath like are
you an expert in Oracle or SQL Server or
Snowflake or Databricks or whatever
solution you're picking because all of
these even operate differently at least
to some degree and obviously you need
to as a data person be able to work on
most of these but to say that you're an
expert on all these because there's just
so much to know how to optimize each of
these different solutions how to make
sure you're you know writing SQL in
the best way it's all going to be
slightly different so even if you know
all of the SQL commands there's just so
much more to know that's why I don't
think I'll ever say that I'm a 10 in SQL
whenever an interviewer asks how good is
your SQL I'll probably always say like a
six or seven because every year there's
some new skill or New Concept that I'm
like oh I didn't know this before but
yet companies still expect you you know
everything because that's just the way
the tech World works it's like when you
go to your parents house and they're
having an issue with the router and they
expect you to fix it because for some
reason you can write a few lines of
Python in these cases it's important to
know where your limit is if you're new
to any specific area yes you can kind of
figure out some of it but as soon as
something gets deeper it's probably not
a bad idea for you to go to your manager
or your director and just let them know
that you're kind of out of your depth
and either you need some training or
maybe there needs to be someone else
that comes in that is more senior
because there's just too much in the
data world for one person to know that's
why in a lot of my recent videos I've
definitely tried to like bring in other
people's knowledge because there's just
so much information and so much
knowledge that's trapped in everyone's
brains and the more we can kind of get
it out there I think hopefully the more
people can kind of glean and understand
what exactly is going on in this whole
data World finally the last reality I
think that's important just to
understand is that you won't always get
to use the new hyped and cool things
that exist out there and I think that's
more than okay mostly because I remember
when I first started out in the data
world I was working mostly on Oracle, SQL
Server, Postgres and a lot of things that
were on-prem at this point a lot of
companies were moving to like Redshift
and Hadoop and I felt like really left
behind but the interesting thing was
that as soon as I finally kind of
started working with more modern
solutions companies start to move away
from Hadoop and even Redshift although
Redshift is still I think very popular
to use and even Hadoop at plenty of
companies just often in a different form
it's not the end of the world I think to
sometimes be on technologies that are a
little bit older one you get to learn
where a lot of like our best practices
and solutions that we've developed over
the last few decades have all come from
why have we developed data warehousing
the way that we have why in the world do
slowly changing dimensions exist like
all these things that you might just
learn or read in a book you can kind of
understand a little more why they exist
why they're around and why some
people want to change them by
using different modeling techniques
whereas if you just kind of jumped into
the modern world of data engineering and
data modeling and things in that space
you might not get all of the Nuance from
where we came from and why we're
changing or at least trying to change or
reapproach a lot of the problems that
we've been looking at for the last few
decades so I just wouldn't get too stuck
on the fact that hey we're using old
technology I'm falling behind because
one you're going to likely work with
that technology at some point if it's
worth working with a lot of stuff is
just a combination of marketing and hype
and people fomoing about not getting to
use the new technology that everyone
else is testing out but I wouldn't worry
about it too much and I would just make
sure you focus on the basics and just
get better at that with that guys I want
to say thank you so much for watching
this video and I will see you next time
thank you and