I analyzed 2,765,739 jobs to solve THIS

Luke Barousse
17 Feb 2023 15:18

Summary

TLDR: The video discusses the gap between the data science skills recommended online and what the job market actually demands. The creator presents an app that analyzes job postings to identify top skills like SQL and Excel, in contrast to outdated suggestions found online. They criticize skill recommendations made without supporting data and argue for evidence-based recommendations, modeled on Stack Overflow's annual survey. The video also details a new solution for collecting and analyzing global job data more efficiently, using Python, APIs, and data engineering tools such as BigQuery, Airflow, and Apache Spark. The result is a resource offering real-time insights into in-demand skills and salary data for data professionals, accessible at datanerd.tech.

Takeaways

  • The speaker discovered a discrepancy between the skills recommended by various sites and the actual top skills required for data analyst jobs, based on their app's analysis of job postings.
  • Some websites were promoting outdated skills or selling courses for the skills they claimed to be top-ranked, without data to back up their claims.
  • The speaker compared their findings with the Stack Overflow survey, which is valuable for developers but less so for data professionals because of their low representation among respondents.
  • The speaker's initial app was limited to U.S. data analysts, but they recognized the need for a global perspective to better serve their diverse subscriber base.
  • Technical issues with the app's design led to slow processing times and crashes, highlighting the need for a more robust solution involving data engineering practices.
  • The speaker collaborated with a former data engineer from Meta to develop a plan for data extraction and cleaning using Python, BigQuery, and Apache Airflow.
  • The project involved collecting and analyzing a large dataset of job postings to identify trends in job demand, required skills, and salary information.
  • Data engineers emerged as the most in-demand job role, surpassing data scientists, whose role was once called the 'sexiest job of the 21st century'.
  • The dataset revealed that many job postings do not mention a degree, suggesting a shift towards skills-based hiring in the data industry.
  • Salary data from job postings was found to have wide ranges, making averages less reliable and prompting the use of median values for more accurate insights.
  • The speaker used Apache Spark for natural language processing to extract salary details and skills from job descriptions, addressing the limitations of SQL and of processing on a single computer.

Q & A

  • What was the main issue the creator found in the data science industry regarding skills recommendations?

    -The creator found that some websites were recommending outdated skills or promoting their own products as top skills without any data to back up these claims.

  • What was the initial approach to address the skills recommendation issue in the video?

    -The initial approach was to build an app that analyzed data analyst job posts in the United States to identify the most common skills required.

  • How did the creator plan to expand the data collection beyond just data analysts in the United States?

    -The creator planned to use SerpApi to collect data on different job titles and locations globally, focusing on the countries where the subscribers come from.

  • What was the issue with the app's performance when dealing with larger datasets?

    -The app was poorly designed and would crash or take nearly an hour to generate a visualization when processing larger datasets.

  • Why did the creator decide to involve a data engineer in the project?

    -The project became more complex with the need to search different job titles and locations, requiring a more robust solution that a data engineer specializes in.

  • What tools and services were used to build the new solution for data collection and processing?

    -The new solution involved using Python, SerpApi, Google BigQuery, SQL, Airflow for data pipeline scheduling, and Apache Spark for processing large datasets.

  • What was the significance of using Apache Spark in the project?

    -Apache Spark was used to handle the large volume of data by distributing the processing across multiple computers in a Spark cluster, which is efficient for big data tasks.

  • How did the creator approach the problem of extracting salary and skills information from job postings?

    -The creator used natural language processing with Apache Spark to extract salary ranges and a list of predefined skills from the job descriptions.

  • What insights were gained from analyzing the job postings regarding the demand for different data-related job roles?

    -Data engineers were found to be in the highest demand, surpassing data scientists, which aligns with the complexity and data handling needs of current projects.

  • How does the final app help users determine the top skills needed for data-related jobs?

    -The app provides real-time insights into the top skills being requested in job postings, allowing users to filter by job title and see the most important skills for each role.

  • What additional feature does the app offer regarding salary information?

    -The app links salary data to the identified skills, enabling users to find out potential salaries based on specific skills and compare them across different job titles.
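The skill extraction described in the answers above (matching a predefined list of skills against each job description) can be sketched in plain Python. This is only an illustration under stated assumptions: the video runs this step on a Spark cluster, the `SKILLS` list here is a tiny hypothetical sample rather than the creator's roughly 250 keywords, and `build_pattern` and `extract_skills` are names invented for this sketch.

```python
import re

# Hypothetical sample of skills; the real list in the video is much longer.
SKILLS = ["sql", "excel", "python", "power bi", "r", "spark"]


def build_pattern(skills):
    """Compile one case-insensitive pattern that matches any listed skill."""
    # Longest first so multi-word skills are tried before shorter ones.
    ordered = sorted(skills, key=len, reverse=True)
    alternation = "|".join(re.escape(skill) for skill in ordered)
    # The lookarounds stop short skills such as "r" from matching inside words.
    return re.compile(rf"(?<![\w+#])(?:{alternation})(?![\w+#])", re.IGNORECASE)


def extract_skills(description, pattern):
    """Return the sorted, de-duplicated skills mentioned in a description."""
    return sorted({match.group(0).lower() for match in pattern.finditer(description)})


PATTERN = build_pattern(SKILLS)
description = (
    "We need strong SQL and Excel skills; Python or R is a plus. "
    "Experience with Power BI preferred."
)
found = extract_skills(description, PATTERN)
print(found)
```

A fixed keyword list is simple and predictable, but it only finds skills someone thought to add, which is why the creator mentions possibly moving to a machine learning approach later.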

Outlines

00:00

Data Science Skills Gap Analysis

The speaker identifies a discrepancy between the skills recommended by various websites and what is actually in demand in the data science industry. They built an app to analyze job postings for data analysts in the U.S. to determine the most sought-after skills. However, they found that some sites were promoting outdated skills or those they were selling, without data to back up their claims. The speaker expresses frustration with this lack of evidence-based guidance, sharing a personal anecdote about wasting time learning Microsoft Access, which turned out to be an obsolete tool. They suggest that a more reliable model could be the annual survey conducted by Stack Overflow, which provides valuable insights for developers but may not be as representative for data professionals, who make up a small percentage of respondents.

05:01

Building a Global Data Collection App

The speaker outlines the need for a new solution to collect data on job skills beyond just U.S. data analysts, aiming for a global perspective. They discuss the limitations of their previous app, which was not only U.S.-centric but also poorly designed, causing it to crash. The speaker collaborates with a former data engineer to develop a more robust system using Python, SerpApi, and Google's BigQuery to handle the data extraction and storage. They mention using Airflow for scheduling data pipelines and emphasize the importance of simplifying the process. The speaker also covers the initial steps of data collection, including identifying the countries their subscribers come from and setting up a system to collect and clean job data daily.
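The collection step described here, searching every job title in every location, can be sketched as a simple grid of searches. This is a minimal illustration, not the creator's script: the titles and countries are made-up examples, and `fetch` stands in for whatever function actually calls the job-search API (no real API call or signature is shown).

```python
from itertools import product

# Hypothetical inputs; the video does not list the exact titles or countries.
JOB_TITLES = ["Data Analyst", "Data Engineer", "Data Scientist"]
LOCATIONS = ["United States", "India", "United Kingdom"]


def build_search_grid(titles, locations):
    """Return one search request per (title, location) pair."""
    return [
        {"query": title, "location": location}
        for title, location in product(titles, locations)
    ]


def collect_jobs(search_grid, fetch):
    """Call `fetch` once per search and tag each result with its search."""
    rows = []
    for search in search_grid:
        for job in fetch(search):
            rows.append({
                **job,
                "search_term": search["query"],
                "search_location": search["location"],
            })
    return rows


def fake_fetch(search):
    """Stand-in for a real API call so the sketch runs offline."""
    return [{"title": f"{search['query']} (sample)", "via": "LinkedIn"}]


grid = build_search_grid(JOB_TITLES, LOCATIONS)
jobs = collect_jobs(grid, fake_fetch)
print(len(grid), len(jobs))
```

Tagging each row with its search term and location keeps later comparisons by job title possible even when a posting's own title is worded differently.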

10:02

Analyzing Job Demand and Skills with Big Data Tools

The speaker discusses the process of cleaning and analyzing the collected job data, focusing on extracting useful information from JSON files into a structured format. They explore the most in-demand jobs, the relevance of degrees in job postings, and the prevalence of remote work flags. The analysis reveals that data engineers are in higher demand than data scientists, and that many job postings do not require a degree. The speaker also examines the formats in which salaries are listed and the challenges this poses for accurate analysis, such as wide salary ranges that can skew averages. They introduce the use of median values as a more reliable measure and touch on the use of Apache Spark for processing large datasets and natural language processing (NLP) for extracting skills from job descriptions.
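Unpacking nested JSON into a flat, one-row-per-posting table might look roughly like the sketch below. The video does this with SQL in BigQuery; this Python version only illustrates the idea. The payload shape and every field name (`results`, `extras`, `remote`, and so on) are assumptions made for the example, not the actual API response format.

```python
import json

# Hypothetical raw response; field names are invented for illustration.
RAW = """
{
  "results": [
    {"title": "Data Engineer", "company": "Acme", "location": "Anywhere",
     "source": "LinkedIn",
     "extras": {"schedule_type": "Full-time", "remote": true}},
    {"title": "Data Analyst", "company": "Globex", "location": "Austin, TX",
     "source": "LinkedIn",
     "extras": {"schedule_type": "Contractor"}}
  ]
}
"""


def to_fact_rows(payload):
    """Flatten each nested job posting into one flat dict (one table row)."""
    rows = []
    for job in payload.get("results", []):
        extras = job.get("extras", {})
        rows.append({
            "title": job.get("title"),
            "company": job.get("company"),
            "location": job.get("location"),
            "source": job.get("source"),
            "schedule_type": extras.get("schedule_type"),
            # A missing flag is treated as "not remote".
            "remote": bool(extras.get("remote", False)),
        })
    return rows


rows = to_fact_rows(json.loads(RAW))
for row in rows:
    print(row)
```

Using `.get` with defaults means a posting that lacks an optional field still produces a complete row instead of raising an error.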

15:02

Launching a Real-time Data Nerd Skills App

The speaker introduces an app that provides real-time insights into the top skills requested in job postings for data professionals, filtering by job title and skill. They describe the process of setting up a Spark cluster to clean salary data and extract skills from job descriptions, using a predefined list of keywords. The app, accessible at datanerd.tech, allows users to explore salaries and skill requirements and to compare different job titles. The speaker also discusses the challenges of handling large salary ranges and the importance of using median values for more accurate salary insights. They conclude by inviting feedback and improvements for the website, aiming to provide a valuable resource for data professionals to understand the job market dynamics.
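A small worked example shows why one very wide salary range pulls the average up while barely moving the median. The numbers below are made up for illustration and are not taken from the video's dataset.

```python
from statistics import mean, median

# Made-up (min, max) yearly salary ranges; the last one is extremely wide.
ranges = [
    (80_000, 100_000),
    (90_000, 110_000),
    (85_000, 105_000),
    (100_000, 800_000),
]

# Represent each posting by the midpoint of its listed range.
midpoints = [(low + high) / 2 for low, high in ranges]

average_salary = mean(midpoints)
median_salary = median(midpoints)

print(average_salary)  # 183750.0
print(median_salary)   # 97500.0
```

Three of the four postings sit near 95,000, and the median stays there, while the single wide range nearly doubles the average.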

Keywords

Data Science Industry

The Data Science Industry refers to the field that encompasses the collection, analysis, and interpretation of data to extract valuable insights and support decision-making. In the video, the creator identifies a significant issue within this industry, specifically the discrepancy between the skills being taught or recommended and those that are actually in demand according to job postings.

SQL

SQL (Structured Query Language) is a domain-specific language used in programming and designed for managing data held in a relational database management system. The video emphasizes SQL as one of the top skills that data analysts should focus on, as it is commonly mentioned in job postings.

Excel

Excel is a widely used spreadsheet program that is part of the Microsoft Office suite. In the context of the video, Excel is highlighted as another crucial skill for data analysts, often required in job postings, and is a tool that the creator initially spent time learning.

Outdated Skills

Outdated skills are those that were once relevant but are no longer in demand or have been replaced by newer technologies. The script mentions the creator's personal experience with learning Microsoft Access, which is now considered an outdated skill in the data science industry.

Stack Overflow

Stack Overflow is a question and answer site for professional and enthusiast programmers. It is mentioned in the video as a valuable resource that conducts an annual survey to identify the most popular skills among developers, which can be used as a model for the data science community.

Data Engineers

Data Engineers are professionals who specialize in building and maintaining the infrastructure that supports the storage, processing, and retrieval of large volumes of data. The video reveals that data engineers are in high demand, surpassing data scientists as the most sought-after job in the industry.

BigQuery

BigQuery is a fully-managed, serverless data warehouse that enables super-fast SQL queries using the processing power of Google's infrastructure. In the video, BigQuery is used as a cloud-based solution for storing and managing the large datasets collected for the analysis.

Airflow

Airflow is an open-source tool used to programmatically author, schedule, and monitor workflows. The script describes using Airflow to manage the data pipeline, ensuring that job data is collected and cleaned up daily.
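The core idea of a pipeline scheduler, running tasks in an order that respects their dependencies, can be sketched without Airflow itself. The example below uses Python's standard-library `graphlib` and is not Airflow's API; the task names and the placeholder actions are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical task names; each maps to the tasks that must finish first.
DEPENDENCIES = {
    "collect_jobs": set(),
    "load_raw_json": {"collect_jobs"},
    "build_fact_table": {"load_raw_json"},
    "clean_salary_and_skills": {"build_fact_table"},
}


def run_pipeline(dependencies, actions):
    """Run each task's action in dependency order and return that order."""
    order = list(TopologicalSorter(dependencies).static_order())
    for task in order:
        actions[task]()
    return order


log = []
# Placeholder actions that only record that the task ran.
ACTIONS = {name: (lambda name=name: log.append(name)) for name in DEPENDENCIES}

order = run_pipeline(DEPENDENCIES, ACTIONS)
print(order)
```

A real scheduler adds what this sketch leaves out, such as running on a daily timer, retrying failed tasks, and keeping logs.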

Natural Language Processing (NLP)

Natural Language Processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language. The video discusses the use of NLP to extract salary information and skills from job descriptions.
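As a rough illustration of the salary side of this, the sketch below turns free-text salary strings (yearly or hourly, a range or a single value) into minimum, maximum, and average yearly figures. It uses plain regular expressions rather than the Spark job from the video. The function name `parse_salary`, the sample strings, and the 2,080 working hours per year used to convert hourly pay are all assumptions.

```python
import re

HOURS_PER_YEAR = 2080  # assumption: 40 hours a week for 52 weeks

# A number with optional thousands separators, decimals, and a "k" suffix.
_NUMBER = re.compile(r"(\d[\d,]*(?:\.\d+)?)([kK]\b)?")


def parse_salary(text):
    """Return (min, max, avg) as yearly amounts, or None if no number is found."""
    values = []
    for digits, kilo in _NUMBER.findall(text):
        value = float(digits.replace(",", ""))
        if kilo:
            value *= 1000
        values.append(value)
    if not values:
        return None
    low, high = min(values), max(values)
    # Convert hourly pay to a yearly figure so all rows are comparable.
    if re.search(r"hour", text, re.IGNORECASE):
        low, high = low * HOURS_PER_YEAR, high * HOURS_PER_YEAR
    return low, high, (low + high) / 2


for sample in ["$90,000 to $120,000 a year", "$45 an hour", "80K-100K a year"]:
    print(sample, "->", parse_salary(sample))
```

Real postings are messier than these samples (monthly pay, other currencies, bonuses), so a production version would need more rules and checks.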

Apache Spark

Apache Spark is an open-source distributed general-purpose cluster-computing framework. It is highlighted in the video as a tool used for processing large datasets, specifically for extracting salary data and identifying skills from job postings.
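The general pattern behind this kind of processing is to split the data into partitions, process each partition independently, and merge the partial results. The sketch below imitates that pattern on one machine with a thread pool. It is not Spark or PySpark code, the sample postings are made up, and Python threads do not give the multi-machine speedup a real cluster provides.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_skills(partition):
    """Count skill mentions within one partition of postings."""
    counts = Counter()
    for skills in partition:
        counts.update(skills)
    return counts


def split(items, n_parts):
    """Split a list into n_parts roughly equal slices."""
    return [items[i::n_parts] for i in range(n_parts)]


# Made-up data: each posting is the list of skills found in it.
postings = [["sql", "excel"], ["sql", "python"], ["python", "spark"], ["sql"]]

# Process each partition independently, then merge the partial counts.
with ThreadPoolExecutor(max_workers=2) as pool:
    partials = list(pool.map(count_skills, split(postings, 2)))

total = sum(partials, Counter())
print(total.most_common())
```

Because each partition is counted without looking at the others, the same logic can be spread across many machines, which is the property a cluster framework relies on.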

Data Analyst

A Data Analyst is a professional who collects, processes, and performs statistical analyses on data to help make informed decisions in various industries. The video focuses on the skills and salary expectations for data analysts, as well as the importance of identifying the most relevant skills for job postings.

Highlights

The speaker identified a discrepancy between the skills recommended by online sites and the actual top skills in data analyst job postings.

Outdated skills are being recommended by some sites, while others promote skills they also sell, without data to back up their claims.

The speaker's personal experience with learning Microsoft Access, which turned out to be an unnecessary skill, highlights the issue of misinformation.

Stack Overflow's annual survey is mentioned as a valuable resource, but it may not accurately represent the needs of data professionals, who make up a small percentage of respondents.

The previous app built by the speaker only catered to data analysts in the United States, limiting its global applicability.

The app was also criticized for its poor design, causing it to crash and process data very slowly.

The speaker proposes a new solution involving Python, SerpApi, and Google BigQuery to collect and process data more efficiently.

SerpApi provided free search credits for the project, which the speaker acknowledges as a form of support while noting the video is not sponsored.

The project's complexity necessitates the involvement of a data engineer to handle the data pipeline and processing.

Airflow is used as a data pipeline scheduler to automate the collection and cleaning of job data daily.

The speaker used YouTube channel analytics to determine the most relevant countries for job data collection.

Data cleaning involved extracting key information from JSON files into a fact table for analysis.

The speaker explored the demand for different job titles, finding that data engineers are currently in higher demand than data scientists.

The importance of degrees in job postings is discussed, with a significant number of postings not requiring a degree for data engineers and analysts.

Salary data is extracted and analyzed using Apache Spark, revealing wide salary ranges and the highest and lowest salaries for different job roles.

Natural Language Processing (NLP) is used to extract skills from job descriptions, with the speaker manually curating a list of relevant keywords.

The final app, datanerd.tech, provides real-time insights into top skills and salary data for data professionals.

The app allows filtering by job title and skill, providing a personalized view of in-demand skills and associated salaries.

The speaker invites feedback and improvements for the website, emphasizing the ongoing nature of the project.
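The keyword curation mentioned in the highlights above started from a count of the most common words in the job descriptions, which the speaker then hand-picked from. A minimal sketch of that counting step is shown below; the sample descriptions and the short stopword list are made up for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"and", "the", "with", "in", "of", "a", "to", "for", "is", "or"}


def top_words(descriptions, n=5):
    """Return the n most common non-stopword tokens across all descriptions."""
    counts = Counter()
    for text in descriptions:
        # Keep letters plus "+" and "#" so names like "c++" and "c#" survive.
        words = re.findall(r"[a-z][a-z+#]*", text.lower())
        counts.update(word for word in words if word not in STOPWORDS)
    return counts.most_common(n)


descriptions = [
    "Experience with SQL and Python",
    "SQL and Excel required",
    "Python, SQL, and Tableau",
]
print(top_words(descriptions, 2))  # [('sql', 3), ('python', 2)]
```

The output is only a candidate list; deciding which frequent words are actually skills remains a manual step, as described in the video.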

Transcripts

00:00

Data nerds, I found a pretty big problem in the data science industry, and, well, let me show you. In my last video I built an app that analyzed data analyst job posts in the United States for top skills. Now, with the app, data analysts can focus on learning top skills like SQL and Excel, as they're most common in job postings. So after building this I was curious: how do these skills stack up to what the internet is suggesting? And, well, I was in for an awakening. Some sites were recommending outdated skills that weren't even close to being in my top 10. Others were suggesting a skill was number one while conveniently also selling you this skill. Those with access to the most valuable insights provided skills that could be applied to any job. And although a lot of the sites did have skills that matched up with the job posting data, none of these sites had any sort of data to back up their claim for this.

00:46

Hold up, stop the music. How can a site recommend a top skill to a data analyst without providing any evidence for that claim? I mean, the whole job of a data analyst is to provide data to support a claim. Super ironic. And this is all personally upsetting to me, because when I first started as a data analyst I was recommended to learn Microsoft Access by one of these sites. I spent weeks trying to learn Access, and I came to find out afterwards that there were not only more powerful but more popular tools that Microsoft was actively trying to replace Access with. So basically I wasted weeks of my life learning an unnecessary tool. And if it's this bad for data analysts, what about for data scientists or data engineers?

01:27

Well, there's a solution I like to model after: a survey done by the popular developer site Stack Overflow. Now, if you're not familiar with Stack Overflow, it's a site that's primarily used to get help with popular tools like Python, SQL, and, surprisingly, even Excel. They poll their users annually to find the most popular skills, like programming, SQL, databases, and cloud platforms, along with going as far as telling you salaries for top jobs in the developer industry. This survey is extremely valuable for aspiring developers in order to identify the skills they need to know. Most all online education sites for programmers even go as far as to quote the results from the survey to showcase what skills you should learn. But as great as the survey is for developers, it's not so good for data nerds, who comprise less than 15 percent of the respondents of this survey. Because of this low percentage, it's hard for data nerds to extract value on what skills they should be learning.

02:18

So what about that previous app that I built for my subscribers? Well, the first major problem is that this data is only for data analysts in the United States, and my subscribers aren't just data analysts and are from around the world, so we need to collect data that supports this. The second and actually bigger problem is that I keep on getting these emails saying that my app is crashing. The app that I built is pretty poorly designed. As a test, I used an even larger dataset with my current code and it took nearly an hour to generate a visualization, and we're going to be processing a lot more data. I mean, you can tell from this clickbait title. That previous app that I built was only handling around 7,000 jobs at the time of filming this, so we need a completely new solution.

02:57

So let's get into solving that first problem of collecting data beyond just data analysts in the United States. We're still going to follow a similar approach to what I did before, which is using Python to connect to an API in order to extract this data into a database, specifically using a service called SerpApi to handle this. Typically, if you're trying to scrape this data from a website, they don't like this. They're going to use methods to block you, such as those CAPTCHAs that even humans can't solve sometimes.

03:29

So SerpApi handles all of this and gets me the data that I need. I was really surprised when I reached out to SerpApi and asked for a few hundred thousand search credits that they agreed, so thanks to SerpApi for supporting this.

03:42

Okay, you're looking like a hot mess. One quick note: this video is not sponsored, but I did want to be transparent about SerpApi providing those free credits. My cloud bills are getting pretty high right now, so I am open to sponsors. So, Google Cloud, hit me up. All right, back to Luke.

03:58

But now this project is getting more complex. We not only need to search different job titles, we also need to search different locations. With this added complexity we need a more robust solution, and frankly this is a job for a data engineer, somebody who specializes in moving data from point A to point B.

04:16

"What is up?" "Oh yeah, my headphones." So this is Ben. He's not only a former data engineer at Meta, but he also runs the YouTube channel Seattle Data Guy. I started by showing him some Python code that I had written for my previous project to get his feedback, and, well... "So I'm putting all these exclamation points, um... well, the logs are..."

04:37

So we went over a plan to extract the data. I would build on that initial plan: I would continue to use Python to call SerpApi to get this data into our database. For this I'm using a popular cloud-based solution, BigQuery from Google. Now, these results from SerpApi come in a JSON file, which isn't very usable, so I could then use SQL to unpack all of this data. Once we have this cleaned up in our data warehouse, we can then use our app to connect to it. Now, to make sure we were collecting job data daily and also cleaning it up, I'd use a popular data pipeline scheduler called Airflow. This tool allows you to write jobs in Python to run any number of tasks while also controlling the order.

05:16

"I think this is good. I think the big thing with a lot of these things is always simplify. Like, don't try to kill yourself."

05:25

Which is exactly what I didn't do, and this project took way longer than my normal video. But let's focus on the positives. I started with the data collection first, building out a Python script in Airflow to collect not only all those job titles via search term but also all those different search locations. For collecting jobs from around the world, I decided to dig into my YouTube channel analytics to find the countries my subscribers come from, which is a lot of dang countries.

05:47

So after the collection pipeline was built, it was time to dive into cleaning the data. For this portion I just focused on extracting all that data from those JSON files into our main data table, or, as data nerds call it, a fact table, which has a ton of key information in each column for all of those different job postings.

06:07

All right, so we're not completely done with the data cleanup just yet, but we need to perform some EDA first to see what we're working with. Let's see how big this table is, as we're collecting around 6,000 jobs per day. Querying it, we can see we have around 380,000 jobs, which may be different from what you see down here in the title. Future Luke is supposed to be automating this title update so that it matches the number of jobs in our database. Next, let's look at where all these job postings are coming from. It looks like an overwhelming majority are from LinkedIn, which also checks with what the majority of my subscribers say they use.

06:38

So let's do some comparisons of these postings. I'm curious to find out what the most in-demand job is right now. Back in 2012, Harvard Business Review released this article on data scientists, claiming it was the sexiest job of the 21st century, and talked about the need for this profession as this field began to explode. So I'm curious: is data scientist still the most popular job? Well, not anymore. It looks like data engineers have an overwhelming majority, with many companies requesting this. But I mean, this makes sense, because look at this data project that I'm doing right now: it required data engineering to get the data I needed to perform this analysis. Oh, and it looks like those authors released an updated article last year where they captured this about data engineers, but sadly they left out data analysts.

07:20

Let's look at another hot topic for my data nerds, and that's whether job postings have any mention of a degree in them. This topic has really intensified with Google's release of the Data Analytics Certificate, with the claim that they would accept this certificate over a four-year degree. The Wall Street Journal investigated the need for degrees and found that the pandemic has helped blue-collar workers transition more easily into tech-based careers without a degree, with IT and data processing being one of the most common industries for career transitioners. Lucky for us, the dataset includes information on whether a degree is mentioned in the job posting. Looking at it right here, we can see that one in every three job postings for data engineers and analysts has no mention of a degree, which I think is a pretty high number. Unfortunately for data scientists, this was only found in a severely low seven percent.

08:06

All right, speed round. Here are some other interesting insights. Less than 10 percent of job postings are flagged for remote work, with data analysts at the lowest, around six percent. For job locations, we have an assortment from around the world, with "Anywhere" being the highest, which correlates to all those remote work jobs that we previously found. Finally, for the different types of jobs offered, it seems like it's not even close, as most all opportunities in this industry are focused on full-time jobs, which is really disheartening, because things like internships and contract opportunities are really great at getting those with little experience some experience.

08:37

So what about salary and skills for data nerds? You know, the whole point that you're watching this video. Well, inspecting the salary, we can see that it comes in this crazy format: sometimes it's yearly, other times it's hourly; sometimes it's a range, sometimes it's not. As for the skills, they're very deep in the job descriptions, so we need to develop a way to extract these keywords out. Both of these issues will be classified under NLP, or natural language processing.

09:01

Wrong way. Natural language processing. I never know which way to go. Simply put, it's a way for computers to process human language, and we've come a long way with NLP, as witnessed by ChatGPT. So what tools should we use for this processing? Well, SQL is a little too structured for what we need, and from that Python example that I showed earlier, it took nearly an hour to generate a visualization on a single computer.

09:23

But that's a single computer. What if we used multiple computers? It turns out there's a tool specifically designed for this called Apache Spark. Now, when a pack of wild computers band together, they form a Spark cluster. Look how cute this little guy is. And these guys are great at fighting against big data. Now, how and where do you even run these Spark clusters? Well, most all cloud providers offer this, and they're more than happy to take your money to offer this service. The framework for Apache Spark is written in Scala, but because it's such a popular tool, they offer APIs to interact with this framework, including one for Python, which is conveniently named PySpark. I've never used PySpark before, but it was pretty easy to figure out, as it's very similar to pandas.

10:01

For this, I have Airflow spin up a Spark cluster daily and then import all those new job entries. From there, I focus on the salary column first, extracting information like the minimum, maximum, and average. I even plotted this chart to check my results.

play10:14

next up is the hard part of extracting

play10:17

the skills I found that the best way to

play10:19

handle this was to provide the cluster

play10:20

with a list of skills to find and then

play10:23

distract from the job description to get

play10:25

this list was pretty painful I went

play10:26

through and did a count of all the

play10:27

popular words in the job descriptions

play10:29

and hand-picked out all these tools

play10:32

after I went through a few hundred

play10:33

common words specific to data science I

play10:36

then added to this list all those skills

play10:38

from the stack Overflow survey in total

play10:39

we have a total of 250 words specific to

play10:43

data nerds I'm like 99 sure we got all

play10:46

the keywords that we need but in the

play10:48

future we're probably going to have to

play10:49

develop some sort of machine learning

play10:51

algorithm to extract these keywords in a

play10:54

better method so now we're finally done

play10:55

with the cleaning we use a spark cluster

play10:57

to not only generate all that clean

play10:59

salary data but also to extract all

play11:01

those skills for each job posting all

play11:03

right so now let's jump into exploring

play11:04

those newly cleaned salary and skill

play11:06

columns first let's look at the average

play11:08

salary by the different job search terms

play11:10

now I have a bad feeling about using an

play11:12

average for this you see recently there

play11:14

have been laws imposed in States like

play11:16

New York and California that require

play11:18

salaries be listed on job postings

play11:20

there's just one big problem companies

play11:22

are listing very wide ranges for

play11:24

salaries in order to satisfy this

play11:27

requirement so let's inspect some of

play11:28

these ranges and it looks like data

play11:30

scientists are the biggest offender at

play11:31

$350,000 for a range and not only are

play11:35

these job postings coming from states

play11:36

like New York and California but they're

play11:38

also from popular tech companies and

play11:40

surprisingly this $350,000 range isn't

play11:42

really that bad on comprehensive.io a

play11:45

salary tracking website they found a

play11:47

data engineering role with a $750,000

play11:50

variance at Netflix along with this

play11:52

crazy range for a data analyst now

play11:54

because of these large ranges in the

play11:56

data set averages may not be the best

play11:58

for this instead we're going to be using

play11:59

medians
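
A small illustration of why one very wide range distorts the average more than the median. The salary figures below are invented for the example and are not taken from the dataset.

```python
from statistics import mean, median

# Invented salary midpoints; the last one mimics a posting with a huge range.
salaries = [90_000, 95_000, 100_000, 105_000, 400_000]

average_salary = mean(salaries)   # pulled upward by the single outlier
median_salary = median(salaries)  # the middle value, unaffected by it

print(average_salary, median_salary)
```

With these numbers the average lands well above what four of the five postings offer, while the median stays at the middle posting.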

play12:01

not that median this type of median

play12:03

when evaluating on this the salaries aren't

play12:06

skewed as high due to those large ranges

play12:08

and also they check better with salary

play12:09

aggregation sites like Glassdoor now I'm

play12:12

also curious just for fun what is the

play12:14

highest and lowest salaries based on

play12:16

these large ranges of these companies

play12:17

for the highest a data analyst takes the

play12:19

win at $800,000 for the lowest a data

play12:23

analyst also takes the win with this at

play12:25

$25,000

play12:25

I even found this internship at

play12:27

eight dollars an hour so sort of crazy

play12:29

that data analysts have both the highest

play12:31

and lowest salaries all right let's now

play12:33

get into exploring those skills and for

play12:35

this we're going to be using the app

play12:36

that I built with the same Python

play12:38

framework of Streamlit that I used on my

play12:40

past app but instead of using Python on

play12:42

a single computer to aggregate all this

play12:44

data on the front end we're now using

play12:45

the power of multiple computers to use

play12:47

SQL and PySpark to aggregate this data

play12:49

on the back end basically that long

play12:51

extravagant storyline of how we've

play12:53

cleaned up the data was done in order to

play12:54

provide this in an easily accessible

play12:56

manner via this app which you can access

play12:58

via datanerd.tech you can access it

play13:01

via your phone or even a web browser

play13:02

so how should this be used well let's

play13:04

say you're an aspiring data nerd and

play13:06

curious about the skills you should be

play13:07

focused on first to land a job with the

play13:10

app you can get real-time insights into

play13:12

what top skills are being requested

play13:13

in job postings today but this is for

play13:15

all data nerds we can actually filter

play13:17

down further based on a job title let's

play13:19

look at data engineers first for them we

play13:21

can see SQL and Python are most

play13:22

important along with cloud technologies

play13:25

looking at data scientists next we can

play13:27

see that Python and SQL are still

play13:28

important but so are other tools such as

play13:30

R and Tableau oh yeah about this

play13:32

percentage so out of 22,000 job postings

play13:35

for data scientists 17,000 list the

play13:38

skill of python or basically three out

play13:40

of four data scientist jobs are

play13:42

requesting this finally with data

play13:43

analysts we can see that clearly SQL is

play13:45

the most important followed by

play13:46

spreadsheets programming languages and

play13:48

visualization tools we can even filter this

play13:50

further to see how things like languages

play13:52

or even cloud technologies compare now

play13:54

remember we do have all of that salary

play13:56

data and because it's linked to these

play13:57

skills we can then go through and

play13:59

actually find out how much you get paid

play14:01

based on a certain skill so let's

play14:04

actually find out what would I get paid

play14:06

for the skills that I used in this

play14:07

project so let's first filter down by my

play14:09

job title of data analyst we can look

play14:11

at languages for which we use SQL and Python

play14:13

paying around ninety thousand dollars

play14:14

for cloud technologies we use Google

play14:16

Cloud and specifically BigQuery which is

play14:18

around a hundred and eleven thousand

play14:20

dollars for libraries we use both

play14:21

Airflow and Spark which have some of the

play14:23

highest and also lowest salaries in this

play14:25

one so based on these skills that I know

play14:26

and used I feel I can get a better

play14:28

representation of what my potential

play14:29

salary could be now I obviously didn't

play14:31

leave out a salary comparator between

play14:33

all the different job titles so that's

play14:35

available as well and you can see both

play14:36

the annual and hourly rate
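
Linking salary to skills, as described above, could be sketched like this. It is a plain-Python illustration with made-up postings; the app itself aggregates with SQL and PySpark on the back end, and the `median_salary_by_skill` helper and the record layout are assumptions.

```python
from collections import defaultdict
from statistics import median

# Made-up postings: each has its extracted skills and a cleaned annual salary.
postings = [
    {"skills": ["sql", "python"], "salary": 90_000},
    {"skills": ["sql", "bigquery"], "salary": 110_000},
    {"skills": ["python", "spark"], "salary": 120_000},
    {"skills": ["sql"], "salary": None},  # no salary listed
]


def median_salary_by_skill(postings):
    """Group salaries by skill and return the median for each skill."""
    salaries = defaultdict(list)
    for posting in postings:
        if posting["salary"] is None:
            continue  # skip postings without salary data
        for skill in posting["skills"]:
            salaries[skill].append(posting["salary"])
    return {skill: median(values) for skill, values in salaries.items()}


result = median_salary_by_skill(postings)
print(result)
```

A posting counts toward every skill it lists, so the same salary can appear under several skills.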

play14:39

all right so this is all pretty crazy

play14:41

we're able to now have a website that

play14:44

data nerds can go to in order to find

play14:46

out what are the top skills they need to

play14:48

learn for their jobs based on actual

play14:51

data from job descriptions on what

play14:53

skills are being requested I'm looking

play14:55

forward to continuing to collect this data

play14:57

over the year and then compare this and

play14:59

see how skills and also salary track

play15:02

over time I'm always open to feedback

play15:03

and improvements for this website so if

play15:05

you have any ideas to make this website

play15:07

more accessible to others please feel

play15:09

free to drop them in the comments below

play15:11

alright as always if you got value out

play15:14

of this video smash that like button

play15:16

with that I'll see you in the next one

Related Tags
Data Science, Job Analysis, Skills, Salaries, Data Analyst, Data Engineer, SQL, Python, Cloud Tech, NLP, Spark