I analyzed 2,765,739 jobs to solve THIS
Summary
TL;DR: The video script discusses the discrepancy between recommended data science skills and actual job market demands. The creator unveils an app that analyzes job postings to identify top skills like SQL and Excel, in contrast to outdated suggestions found online. They critique misleading skill endorsements and advocate for evidence-based recommendations, akin to Stack Overflow's surveys. The script also details the development of a new solution to collect and analyze global job data more efficiently, using Python, APIs, and data engineering tools like BigQuery, Airflow, and Apache Spark. The result is a resource offering real-time insights into in-demand skills and salary data for data professionals, accessible at datanerd.tech.
Takeaways
- The speaker discovered a discrepancy between the skills recommended by various sites and the actual top skills required for data analyst jobs, based on their app's analysis of job postings.
- Some websites were promoting outdated skills or selling courses for the skills they claimed were top-ranked, without data to back up their claims.
- The speaker compared their findings with the Stack Overflow survey, which is valuable for developers but less so for data professionals due to their low representation among respondents.
- The speaker's initial app was limited to U.S. data analysts, but they recognized the need for a global perspective to better serve their diverse subscriber base.
- Technical issues with the app's design led to slow processing times and crashes, highlighting the need for a more robust solution built on data engineering practices.
- The speaker collaborated with a former Meta data engineer to develop a plan for data extraction and cleaning using Python, BigQuery, and Apache Airflow.
- The project involved collecting and analyzing a large dataset of job postings to identify trends in job demand, required skills, and salary information.
- Data engineers emerged as the most in-demand job role, surpassing data scientists, previously dubbed the 'sexiest job of the 21st century'.
- The dataset revealed that many job postings do not require a traditional degree, suggesting a shift toward skills-based hiring in the data industry.
- Salary data from job postings showed very wide ranges, making averages unreliable and prompting the use of median values for more accurate insights.
- The speaker used Apache Spark for natural language processing to extract salary details and skills from job descriptions, addressing the limitations of SQL and single-threaded processing.
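The median-over-mean takeaway is easy to see in miniature: a single posting with a very wide salary range drags the mean upward while the median stays near a typical posting. A minimal sketch with made-up numbers (not the video's dataset):

```python
from statistics import mean, median

# Hypothetical midpoints of posted salary ranges (USD/year).
# One very wide posting (midpoint $450k) skews the mean.
salaries = [85_000, 90_000, 95_000, 100_000, 110_000, 450_000]

print(mean(salaries))    # 155000.0 — pulled up by the single outlier
print(median(salaries))  # 97500.0  — closer to a typical posting
```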
Q & A
What was the main issue the creator found in the data science industry regarding skills recommendations?
-The creator found that some websites were recommending outdated skills or promoting their own products as top skills without any data to back up these claims.
What was the initial approach to address the skills recommendation issue in the video?
-The initial approach was to build an app that analyzed data analyst job posts in the United States to identify the most common skills required.
How did the creator plan to expand the data collection beyond just data analysts in the United States?
-The creator planned to use SerpApi to collect data on different job titles and locations globally, focusing on the countries their subscribers come from.
What was the issue with the app's performance when dealing with larger datasets?
-The app was poorly designed and would crash or take nearly an hour to generate a visualization when processing larger datasets.
Why did the creator decide to involve a data engineer in the project?
-The project became more complex with the need to search different job titles and locations, requiring a more robust solution that a data engineer specializes in.
What tools and services were used to build the new solution for data collection and processing?
-The new solution involved using Python, SerpApi, Google BigQuery, SQL, Airflow for data pipeline scheduling, and Apache Spark for processing large datasets.
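The ordering guarantee Airflow provides can be sketched without Airflow itself: a DAG is just a set of tasks plus dependency constraints, and a topological sort yields a valid execution order. The task names below are hypothetical, not the project's actual DAG:

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dag = {
    "collect_jobs": set(),                  # call the jobs API, land raw JSON
    "load_bigquery": {"collect_jobs"},      # load raw results into the warehouse
    "clean_fact_table": {"load_bigquery"},  # unpack JSON into a fact table
    "extract_skills": {"clean_fact_table"}, # NLP pass over job descriptions
}

# A valid run order respecting every dependency.
order = list(TopologicalSorter(dag).static_order())
print(order)
```

In Airflow proper, each key would be a task (e.g. a `PythonOperator`) and the dependencies would be declared with `>>` between tasks; the scheduler then runs the graph daily.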
What was the significance of using Apache Spark in the project?
-Apache Spark was used to handle the large volume of data by distributing the processing across multiple computers in a Spark cluster, which is efficient for big data tasks.
How did the creator approach the problem of extracting salary and skills information from job postings?
-The creator used natural language processing with Apache Spark to extract salary ranges and a list of predefined skills from the job descriptions.
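The video's extraction ran on a Spark cluster against a curated list of roughly 250 keywords; the core matching step can be sketched in plain Python. The skill list here is a tiny stand-in for that list, and the matching logic is a simplified guess at the approach:

```python
import re

# Stand-in for the curated ~250-word skill list described in the video.
SKILLS = ["sql", "python", "excel", "tableau", "airflow", "spark"]

def extract_skills(description: str) -> list[str]:
    """Return each known skill that appears as a whole word in the text."""
    text = description.lower()
    return [s for s in SKILLS if re.search(rf"\b{re.escape(s)}\b", text)]

posting = "Requires SQL and Python; experience with Apache Spark a plus."
print(extract_skills(posting))  # ['sql', 'python', 'spark']
```

Word-boundary matching (`\b`) avoids false hits such as finding "r" inside every word, which is why a naive substring search would not work for short skill names.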
What insights were gained from analyzing the job postings regarding the demand for different data-related job roles?
-Data engineers were found to be in the highest demand, surpassing data scientists, which aligns with the complexity and data handling needs of current projects.
How does the final app help users determine the top skills needed for data-related jobs?
-The app provides real-time insights into the top skills being requested in job postings, allowing users to filter by job title and see the most important skills for each role.
What additional feature does the app offer regarding salary information?
-The app links salary data to the identified skills, enabling users to find out potential salaries based on specific skills and compare them across different job titles.
Outlines
Data Science Skills Gap Analysis
The speaker identifies a discrepancy between the skills recommended by various websites and what is actually in demand in the data science industry. They built an app to analyze job postings for data analysts in the U.S. to determine the most sought-after skills. However, they found that some sites were promoting outdated skills or those they were selling, without data to back up their claims. The speaker expresses frustration with this lack of evidence-based guidance, sharing a personal anecdote about wasting time learning Microsoft Access, which turned out to be an obsolete tool. They suggest that a more reliable model could be the annual survey conducted by Stack Overflow, which provides valuable insights for developers but may not be as representative for data professionals, who make up a small percentage of respondents.
Building a Global Data Collection App
The speaker outlines the need for a new solution to collect data on job skills beyond just U.S. data analysts, aiming for a global perspective. They discuss the limitations of their previous app, which was not only U.S.-centric but also poorly designed, causing it to crash. The speaker collaborates with a former data engineer to develop a more robust system using Python, SerpApi, and Google's BigQuery to handle the data extraction and storage. They mention using Airflow for scheduling data pipelines and emphasize the importance of simplifying the process. The speaker also delves into the initial steps of data collection, including identifying the countries their subscribers come from and setting up a system to collect and clean job data daily.
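The API results described here arrive as nested JSON, and the cleaning step amounts to flattening each posting into one row. A minimal sketch; the field names below are illustrative of a jobs-API payload, not necessarily SerpApi's exact schema:

```python
import json

# Hypothetical raw posting as it might land from the jobs API.
raw = json.loads("""{
  "title": "Data Engineer",
  "company_name": "Acme Corp",
  "location": "Anywhere",
  "detected_extensions": {"schedule_type": "Full-time", "work_from_home": true}
}""")

# Flatten the nested posting into one fact-table-style row.
ext = raw.get("detected_extensions", {})
row = {
    "job_title": raw["title"],
    "company": raw["company_name"],
    "location": raw["location"],
    "schedule": ext.get("schedule_type"),
    "remote": ext.get("work_from_home", False),
}
print(row)
```

In the project this unpacking was done with SQL inside BigQuery rather than in Python, but the shape of the transformation is the same: nested JSON in, one flat row per posting out.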
Analyzing Job Demand and Skills with Big Data Tools
The speaker discusses the process of cleaning and analyzing the collected job data, focusing on extracting useful information from JSON files into a structured format. They explore the most in-demand jobs, the relevance of degrees in job postings, and the prevalence of remote work flags. The analysis reveals that data engineers are in higher demand than data scientists, and that many job postings do not require a degree. The speaker also examines the formats in which salaries are listed and the challenges this poses for accurate analysis, such as wide salary ranges that can skew averages. They introduce the use of median values as a more reliable measure and touch on the use of Apache Spark for processing large datasets and natural language processing (NLP) for extracting skills from job descriptions.
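The demand comparison in this outline is, at heart, a count of postings per search title. A toy version with made-up rows, standing in for the fact table's search-term column:

```python
from collections import Counter

# Hypothetical search-term column from the job postings fact table.
titles = ["data engineer", "data scientist", "data engineer",
          "data analyst", "data engineer", "data scientist"]

demand = Counter(titles)
for title, n in demand.most_common():
    print(title, n)  # data engineer tops this toy sample
```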
Launching a Real-time Data Nerd Skills App
The speaker introduces an app that provides real-time insights into the top skills requested in job postings for data professionals, filtering by job title and skill. They describe the process of setting up a Spark cluster to clean salary data and extract skills from job descriptions, using a predefined list of keywords. The app, accessible at datanerd.tech, allows users to explore median salaries, skill requirements, and comparisons across job titles. The speaker also discusses the challenges of handling large salary ranges and the importance of using median values for more accurate salary insights. They conclude by inviting feedback and improvements for the website, aiming to provide a valuable resource for data professionals to understand job market dynamics.
Keywords
Data Science Industry
SQL
Excel
Outdated Skills
Stack Overflow
Data Engineers
BigQuery
Airflow
Natural Language Processing (NLP)
Apache Spark
Data Analyst
Highlights
The speaker identified a discrepancy between the skills recommended by online sites and the actual top skills in data analyst job postings.
Outdated skills are being recommended by some sites, while others promote skills they also sell, without data to back up their claims.
The speaker's personal experience with learning Microsoft Access, which turned out to be an unnecessary skill, highlights the issue of misinformation.
Stack Overflow's annual survey is mentioned as a valuable resource, but it may not accurately represent the needs of data professionals, who make up a small percentage of respondents.
The previous app built by the speaker only catered to data analysts in the United States, limiting its global applicability.
The app was also criticized for its poor design, causing it to crash and process data very slowly.
The speaker proposes a new solution involving Python, SerpApi, and Google BigQuery to collect and process data more efficiently.
SerpApi provided free credits for the project, which the speaker acknowledges as a form of support.
The project's complexity necessitates the involvement of a data engineer to handle the data pipeline and processing.
Airflow is used as a data pipeline scheduler to automate the collection and cleaning of job data daily.
The speaker used YouTube channel analytics to determine the most relevant countries for job data collection.
Data cleaning involved extracting key information from JSON files into a fact table for analysis.
The speaker explored the demand for different job titles, finding that data engineers are currently in higher demand than data scientists.
The importance of degrees in job postings is discussed, with a significant number of postings not requiring a degree for data engineers and analysts.
Salary data is extracted and analyzed using Apache Spark, revealing wide salary ranges and the highest and lowest salaries for different job roles.
Natural Language Processing (NLP) is used to extract skills from job descriptions, with the speaker manually curating a list of relevant keywords.
The final app, datanerd.tech, provides real-time insights into top skills and salary data for data professionals.
The app allows filtering by job title and skill, providing a personalized view of in-demand skills and associated salaries.
The speaker invites feedback and improvements for the website, emphasizing the ongoing nature of the project.
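The salary formats described in the highlights (sometimes yearly, sometimes hourly, sometimes a range) imply a parsing step along the lines below. This is a simplified sketch of that logic, not the project's actual PySpark code:

```python
import re

def parse_salary(text: str) -> dict:
    """Parse strings like '$90,000 - $120,000 a year' or '$18 an hour'."""
    nums = [float(n.replace(",", ""))
            for n in re.findall(r"[\d,]+(?:\.\d+)?", text)]
    lo, hi = min(nums), max(nums)  # a single figure yields lo == hi
    period = "hour" if "hour" in text.lower() else "year"
    return {"min": lo, "max": hi, "avg": (lo + hi) / 2, "period": period}

print(parse_salary("$90,000 - $120,000 a year"))  # a yearly range
print(parse_salary("$18 an hour"))                # a single hourly rate
```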
Transcripts
hey data nerds I found a pretty big problem
in the data science Industry and well
let me show you in my last video I built
an app that analyzed data analyst job
posts in the United States for top
skills and found with the app data
analysts can focus on learning top
skills like SQL and Excel as they're most
common in job postings so after building
this I was curious how do these skills
stack up to what the internet is
suggesting and well I was in for an
Awakening some sites were recommending
outdated skills that weren't even close
to being in my top 10. others were
suggesting a skill was number one while
conveniently also selling you this skill
those with access to the most valuable
insights provided skills that could be
applied to any job and although a lot of
the sites did have skills that matched
up with the job posting data none of
these sites had any sort of data to back
up their claim for this hold up stop the
music how can a site recommend a top
skill to a data analyst without
providing any evidence to that claim I
mean the whole job of a data analyst is
to provide data to support a
claim super ironic and this is all
personally upsetting to me because
whenever I first started as a data
analyst I was recommended to learn the
skill of Microsoft Access from one of
these sites I spent weeks trying to
learn access and I came to find out
afterwards that there were not only more
powerful but more popular tools that
Microsoft was actively trying to replace
access with so basically I wasted weeks
of My Life Learning an unnecessary tool
and if it's this bad for data analysts
what about for data scientists or data
Engineers well there's a solution I like
to model after that's a survey done by
the popular developer site stack
Overflow now if you're not familiar with
stack Overflow it's a site that's
primarily used to get you help with
popular tools like python SQL
surprisingly even Excel and they poll
their users annually to find the most
popular skills like programming SQL
databases and Cloud platforms along with
going as far as telling you salary for
top jobs in the developer industry and
this survey is extremely valuable for
aspiring developers in order to identify
the skills that they need to know most
all online education sites for
programmers even go as far as to quote
the results from the survey to Showcase
what skills you should learn but as
great as the survey is for developers
it's not so good for data nerds who
comprise less than 15 percent of the
respondents of this survey and because
of this low percentage it's hard for
data nerds to extract value of what
skills they should be learning so what
about that previous app that I built for
my subscribers well the first major
problem is that this data is only for
data analysts in the United States and
my subscribers aren't just data analysts
and are from around the world so we need
to collect data that supports this the
second and actually bigger problem is
that I keep on getting these emails
saying that my app is crashing the app
that I build is pretty poorly designed
as a test I used an even larger data set
with my current code and it took nearly
an hour to generate a visualization and
we're going to be processing a lot more
data I mean you can tell from this
clickbait title that previous app that I
built was only handling around 7 000
jobs at the time of filming this so we
need a completely new solution so let's
get into solving that first
problem of collecting data Beyond just
data analysts in the United States and
we're going to be still following a
similar approach that I did before which
is using python to connect to an API in
order to extract this data into a
database and specifically using a
service called serp API to handle this
typically if you're trying to scrape
this data from the website they don't
like this they're going to use methods
to block you such as those captchas that
even humans can't solve sometimes
so SerpApi handles all of this and gets
me the data that I need so I was really
surprised when I reached out to SerpApi
and asked for a few hundred thousand
search credits that they agreed so
thanks to SerpApi for supporting this
okay you're looking like a hot mess one
quick note this video is not sponsored
but I did want to be transparent about
SerpApi providing those free credits my
cloud bills are getting pretty high
right now so I am open to sponsors so
Google Cloud hit me up all right back to
Luke but now this project is getting
more complex we're not only needing to
search different job titles we also need
to search different locations with this
added complexity we need a more robust
solution and frankly this is a job for a
data engineer somebody that specializes
in moving data from point A to point B
what is up oh
yeah my headphones so this is Ben he's
not only a former data engineer at meta
but he also runs the YouTube channel
over at Seattle data guy so I started
with showing him some python code that I
had written for my previous project to get
his feedback and well so I'm putting all
these exclamation points
um
well the logs are so we went over a plan
to extract the data so I would build on
that initial plan I would continue to
use Python to call SerpApi and get
this data into our database for this I'm
using a popular cloud-based solution of
BigQuery from Google now these results
from SerpApi come in a JSON file which
isn't very usable so then I could use
SQL to unpack all of this data once we
have this cleaned up in our data
warehouse we can then use our app to
connect to this now to make sure we were
collecting job data daily and also
cleaning it up I'd use a popular data
pipeline scheduler called airflow this
tool allows you to write jobs in Python
to run any number of tasks while also
controlling the order I think this is
good I think the big thing with a lot of
these things is always simplify like
don't try to kill yourself
which is exactly what I didn't do and
this project took way longer than my
normal video but let's focus on the
positives so I started with the data
collection first building out a python
script and airflow to collect not only
all those job titles via search term but
also all those different search
locations for collecting jobs from
around the world I decided to dig into
my YouTube channel analytics to find the
countries my subscribers come from which
is a lot of dang countries so after the
collection pipeline was built it was now
time to dive into cleaning the data for
this portion I just focused on
extracting all that data from those JSON
files into our main data table or as
data nerds call it a fact table that has
a ton of key information in each column
for all of those different job postings
all right so we're not completely done
with the data cleanup just yet but we
need to perform some EDA first to see
what we're working with let's see how
big this table is as we're collecting
around 6 000 jobs per day querying it we
can see we have around 380 000 jobs
which may be different from what you see
down here in the meantime future Luke is
supposed to be automating this title
update so that way it matches the number
of jobs in our database next let's look
at where all these job postings are
coming from and it looks like an
overwhelming majority are from LinkedIn
which also checks with what the majority
of my subscribers say they use so let's
do some comparisons of these postings
I'm curious to find out what is the most
in demand job right now now back in 2012
Harvard Business Review released this
article on data scientists claiming it
was the sexiest job of the 21st century
and talked about the need for this
profession as this field began to
explode and so I'm curious is data
scientist still the most popular job
well not anymore it looks like data
Engineers have an overwhelming majority
with many companies requesting this but
I mean this makes sense because look at
this data project that I'm doing right
now it required data engineering to get
the data I needed to perform this
analysis oh and it looks like those
authors released an updated article last
year where they captured this about data
Engineers but sadly they left out data
analysts let's look at another hot topic
for my data nerds and that's whether job
postings have any mention of a degree in
it this topic has really intensified by
Google's release of the data analytics
certificate that claimed they would
accept this certificate Over a four-year
degree The Wall Street Journal
investigated the need for degrees and
found that the pandemic has helped blue
collar workers transition more easily
into tech-based careers without a degree
with I.T and data processing being one
of the most common Industries for career
transitioners lucky for us the data set
includes information on whether a degree
is mentioned in the job posting and
looking at it right here we can see that
for every one in three job postings for
data engineers and analysts they have no
mention of a degree which I think is
pretty high number unfortunately for
data scientists this was only found in a
severely low seven percent all right
speed round here's some other
interesting insights less than 10
percent of job postings are flagged for
remote work with data analysts at the
lowest around six percent for job
locations we have an assortment from
around the world with anywhere being the
highest which correlates to all those
remote work jobs that we previously
found finally for the different type of
jobs offered it seems like it's not
even close as most all opportunities in
this industry are focused on full-time
jobs which is really disheartening
because things like internships and
contract opportunities are really great
at getting those with low experience
experience so what about salary and
skills for data nerds you know the whole
point that you're watching this video
well inspecting the salary we can see
that it comes in this crazy format
sometimes it's yearly other times it's
hourly sometimes it's a range sometimes
it's not for the skills they're very
deep in the job descriptions so we need
to develop a way to extract these
keywords out both of these issues to fix
will be classified under NLP or natural
language processing
wrong way natural language processing I
never know which way to go simply put
it's a way for computers to process
human language and we've come a long way with
NLP as witnessed by ChatGPT so what
tools should we use for this processing
well SQL is a little too structured for
what we need and from that python
example that I showed earlier it took
nearly an hour to generate a
visualization on a single computer
but that's a single computer what else
if we used multiple computers it turns
out there's a tool specifically designed
for this called Apache spark now when a
pack of wild computers band together
they form a spark cluster look how cute
this little guy is and these guys are
great at fighting against big data now
how and where do you even run these
spark clusters well most all Cloud
providers offer this and they're more
than happy then to take your money to
offer this service now the framework for
Apache spark is written in Scala but
because it's such a popular tool they
offer apis to interact with this
framework including python which is
conveniently named PySpark so I've never used
PySpark before but it was pretty
easy to figure out as it's very similar
to pandas for this I have airflow spin
up a spark cluster daily and then import
all those new job entries in from there
I focus on the salary column first
extracting information like the minimum
maximum and average I even plotted this
chart to check my results
next up is the hard part of extracting
the skills I found that the best way to
handle this was to provide the cluster
with a list of skills to find and then
extract from the job description to get
this list was pretty painful I went
through and did a count of all the
popular words in the job descriptions
and hand-picked out all these tools
after I went through a few hundred
common words specific to data science I
then added to this list all those skills
from the stack Overflow survey in total
we have 250 words specific to
data nerds I'm like 99% sure we got all
the keywords that we need but in the
future we're probably going to have to
develop some sort of machine learning
algorithm to extract these keywords in a
better method so now we're finally done
with the cleaning we use a spark cluster
to not only generate all that clean
salary data but also to extract all
those skills for each job posting all
right so now let's jump into exploring
those newly cleaned salary and skill
columns first let's look at the average
salary by the different job search terms
now I have a bad feeling about using an
average for this you see recently there
have been laws imposed in States like
New York and California that require
salaries be listed on job postings
there's just one big problem companies
are listing very wide ranges for
salaries in order to satisfy this
requirement so let's inspect some of
these ranges and it looks like data
scientists are the biggest offender at
350 000 for a range and not only are
these job postings coming from States
like New York and California but they're
also from popular tech companies and
surprisingly this 350 000 range isn't
really that bad on comprehensive.io a
salary tracking website they found a
data engineering role with a 750 000
variance at Netflix along with this
crazy range for a data analyst now
because of these large ranges in the
data set averages may not be the best
for this instead we're going to be using
medians
not that kind of median this type of median
with which salaries aren't
skewed as high due to those large ranges
and also they check better with salary
aggregation sites like Glassdoor now I'm
also curious just for fun what is the
highest and lowest salaries based on
these large ranges of these companies
for the highest a data analyst takes the
win at 800 000 for the lowest a data
analyst also takes the win with this at
25
000. I even found this internship at
eight dollars an hour so sort of crazy
that data analysts have both the highest
and lowest salaries all right let's now
get into exploring those skills and for
this we're going to be using the app
that I built with the same python
framework of Streamlit that I used on my
past app but instead of using python on
a single computer to aggregate all this
data on the front end we're now using
the power of multiple computers to use
SQL and Pi spark to aggregate this data
on the back end basically that long
extravagant storyline of how we've
cleaned up the data was done in order to
provide this in an easily accessible
manner via this app which you can access
via datanerd.tech you can access it
via your phone or even a web browser
so how should this be used well let's
say you're an aspiring data nerd and
curious about the skills you should be
focused on first to land a job with the
app you can get real-time insights into
what the top skills are being requested
in job postings today but this is for
all data nerds we can actually filter
down further based on a job title let's
look at data Engineers first for them we
can see SQL and python are most
important along with Cloud Technologies
looking at data scientists next we can
see that Python and SQL are still
important but so are other tools such as
R and Tableau oh yeah about this
percentage so out of 22 000 job postings
for data scientists 17 000 list the
skill of python or basically three out
of four data scientist jobs are
requesting this finally with data
analysts we can see that clearly SQL is
the most important followed by
spreadsheets programming languages and
Vis tools we can even filter this
further to see how things like languages
or even Cloud Technologies compare now
remember we do have all of that salary
data and because it's linked to these
skills we can then go through and
actually find out how much you get paid
based on a certain skill so let's
actually find out what would I get paid
for the skills that I used in this
project so let's first filter down by my
job title of data analysts we can look
at languages which we use SQL and python
paying around ninety thousand dollars
for cloud Technologies we use Google
cloud and specifically bigquery which is
around a hundred and eleven thousand
dollars for libraries we use both
airflow and Spark which have some of the
highest and also lowest salaries in this
one so based on these skills that I know
and used I feel I can get a better
representation of what my potential
salary could be now I obviously didn't
leave out a salary comparator between
all the different job titles so that's
available as well and you can see both
the annual and hourly rate
all right so this is all pretty crazy
we're able to now have a website that
data nerds can go to in order to find
out what are the top skills they need to
learn for their jobs based on actual
data from job descriptions on what
skills are being requested I'm looking
forward to continue to collect this data
over the year and then compare this and
see how skills and also salary track
over time I'm always open to feedback
and improvements for this website so if
you have any ideas to make this website
more accessible to others please feel
free to drop them in the comments below
alright as always if you got value out
of this video smash that like button
with that I'll see you in the next one