You NEED to AVOID these Mistakes as a Data Analyst | raw truth

Rohan Adus
1 Sept 2024, 08:58

Summary

TL;DR: In this video, Rohan discusses three common mistakes made by beginner data analysts and offers advice on how to avoid them. He emphasizes the importance of avoiding sampling bias by ensuring representative data sets, maintaining high data quality to prevent errors in analysis, and cautions against confusing correlation with causation. Rohan also stresses the need for critical thinking and proper data validation to become a high-quality data analyst.

Takeaways

  • 😀 Most data analysts struggle not due to coding skills but because they neglect the analysis itself.
  • 📊 Statistics is crucial and forms the backbone of data analysis and data science.
  • 🚫 Avoiding sampling bias is essential; ensure the sample is representative of the entire population.
  • 🏥 Examples of sampling bias include political polling in urban areas or medical studies focusing only on hospital visitors.
  • 💡 Proper sampling methods are critical for accuracy, reliability, and integrity in studies, and sampling is far more cost-effective than surveying an entire population.
  • 📈 Poor data quality can severely impact analysis and subsequent business decisions.
  • 🔍 Data analysts should validate data from multiple sources to ensure accuracy before conducting analysis.
  • 🧼 Data cleaning is vital, but be cautious not to introduce bias through incorrect handling of missing values.
  • 🔗 Understanding the difference between correlation and causation is fundamental; correlation does not imply causation.
  • 🌡 Common examples used to illustrate this include ice cream sales and shark attacks, or chocolate consumption and Nobel Prizes.
  • 🔍 As a data analyst, it's important to think critically and investigate potential confounding variables that may affect data interpretation.

Q & A

  • What is the main reason most data analysts are not great at their job according to the speaker?

    -The speaker suggests that most data analysts are not great at their job not because of their coding skills with SQL, Python, R, or BI tools, but because they neglect the actual analysis, particularly the importance of statistics in data analysis.

  • What is sampling bias and how does it affect data analysis?

    -Sampling bias occurs when the sample taken for analysis is not representative of the entire population. This can lead to inaccurate conclusions because the data collected may not reflect the broader population's characteristics, thus affecting the reliability of the analysis.

  • Why is it important to use proper sampling methods in data analysis?

    -Proper sampling methods ensure accuracy, reliability, and integrity in a study. Sampling is also more cost-effective than surveying an entire population, making it feasible to conduct large-scale analyses without incurring excessive costs.

  • What is the second biggest mistake data analysts make according to the video?

    -The second biggest mistake data analysts make is using poor-quality data. This can be due to issues like data pipeline errors, inaccurate data entry, or unreliable data sources, which can significantly impact the accuracy of analysis and subsequent business decisions.

  • How can data analysts validate the data they are using for analysis?

    -Data analysts can validate the data by comparing it with a second or third source, or even third-party data, to ensure consistency and accuracy. This due diligence helps in identifying discrepancies and ensuring the data used is reliable before conducting analysis.

  • Why is data cleaning important in the data analysis process?

    -Data cleaning is crucial to ensure data quality. It involves handling missing values, correcting errors, and removing inconsistencies. However, improper data cleaning can introduce bias or incorrect values, so it should be done carefully with proper automation and testing.

  • What is the difference between correlation and causation as explained in the video?

    -Correlation refers to the relationship between two or more variables, while causation implies that one variable causes the other to change. The video emphasizes that just because two variables are correlated, it does not mean one causes the other, which is a common misconception that data analysts must avoid.

  • What is a confounding variable and how does it relate to correlation?

    -A confounding variable is an external factor that influences two different variables, making them appear causally related when they are not. The video uses the example of hot weather leading to both increased ice cream sales and shark attacks, where the weather is the confounding variable.

  • Why is it important for data analysts to consider sample size when analyzing correlations?

    -Smaller samples can show high correlations that arise by chance and do not accurately represent the true population. Data analysts should ensure their sample size is large enough to provide meaningful and representative correlations in their analysis.

  • What is the speaker's stance on the future demand for data analysts and data scientists?

    -The speaker believes that data analysts and data scientists will continue to be in high demand, as data collection and analysis are still in their early stages and critical for businesses to make informed decisions, especially with advancements in AI.

Outlines

00:00

📊 Common Mistakes in Data Analysis

The paragraph discusses the prevalent issue of data analysts not being adept at their jobs, not due to a lack of coding skills, but because they often overlook the importance of proper analysis. The speaker emphasizes the significance of statistics in data analysis and data science. The video aims to educate viewers on three common mistakes that novice data analysts make and how to avoid them. The speaker, Rohan, introduces himself as the head of an educational company focused on data analytics and shares his experience in freelancing for e-commerce and tech startups. He invites viewers to join a community on Discord for like-minded individuals interested in data analytics and data science. The first mistake highlighted is sampling bias, which occurs when the sample taken for analysis does not represent the entire population. Examples are given to illustrate how sampling bias can skew results, such as political polling in urban areas or medical studies focusing only on hospital visitors. The importance of using proper sampling methods for accuracy, reliability, and integrity in studies is stressed, along with the cost-effectiveness of sampling over analyzing entire populations.

05:01

🔍 Ensuring Data Quality and Avoiding Correlation-Causation Errors

The second paragraph delves into the second major mistake made by data analysts: using poor-quality data. The speaker points out that while this may not always be the analyst's fault, the responsibility lies with them to ensure the data's reliability and accuracy. An example is given where a data pipeline error led to incorrect analysis and subsequent misguided business decisions. The speaker argues for the necessity of data analysts in the modern business landscape, especially with advancements in AI. The paragraph also touches on the importance of validating data from multiple sources and the common issue of data entry errors. The speaker advises careful data cleaning, backed by automation and testing, to avoid introducing bias and to ensure data quality. The final mistake discussed is the misconception that correlation implies causation, a critical error that can lead to flawed interpretations of data. The speaker uses the examples of ice cream sales and shark attacks, and chocolate consumption and Nobel Prizes, to illustrate how correlation does not equate to causation and the importance of identifying confounding variables. The paragraph concludes with a call for critical thinking and thorough research to avoid jumping to conclusions based on correlation alone.

Keywords

💡Data Analyst

A data analyst is a professional who collects, processes, and interprets complex digital data to help businesses make informed decisions. In the video, the role of a data analyst is emphasized as crucial for understanding and applying statistical methods to ensure the accuracy and reliability of data analysis. The video discusses common mistakes made by data analysts, such as overlooking sampling bias and poor data quality, which highlights the importance of this role in maintaining the integrity of data-driven decision-making.

💡Sampling Bias

Sampling bias occurs when a selected sample of a population is not representative of the entire population, leading to inaccurate or skewed results. The video uses the example of polling in urban areas to illustrate how sampling bias can occur, as these areas may lean politically different from rural areas, thus not accurately reflecting the views of the entire country. This concept is central to the video's message about the importance of proper sampling methods in data analysis.
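
The urban-polling example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population counts and support rates below are invented, not taken from the video, and `support_rate` is a hypothetical helper.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population of (region, supports_policy) pairs. All counts are made up.
population = (
    [("urban", True)] * 39_000 + [("urban", False)] * 21_000
    + [("rural", True)] * 14_000 + [("rural", False)] * 26_000
)

def support_rate(people):
    """Share of people in the list who support the policy."""
    return sum(supports for _, supports in people) / len(people)

true_rate = support_rate(population)

# Convenience sample: only urban residents are polled.
urban_only = [person for person in population if person[0] == "urban"]
biased_sample = random.sample(urban_only, 1_000)

# Simple random sample: every person has the same chance of being selected.
random_sample = random.sample(population, 1_000)

print(f"true rate:       {true_rate:.3f}")
print(f"urban-only poll: {support_rate(biased_sample):.3f}")
print(f"random sample:   {support_rate(random_sample):.3f}")
```

Both samples contain 1,000 people, so the gap between them comes from who was eligible to be sampled, not from how many were asked.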

💡Data Quality

Data quality refers to the accuracy, reliability, and consistency of data. The video stresses that poor data quality can significantly impact the outcomes of data analysis, leading to incorrect business decisions. An example given is a data pipeline error that misrepresents sales data, which could lead to unnecessary marketing or R&D spending. Ensuring high data quality is presented as a fundamental responsibility of data analysts.

💡Data Pipeline

A data pipeline is the process through which data is collected, processed, and stored for further analysis. The video mentions a scenario where a data pipeline error could lead to incorrect data logging, which in turn affects sales analysis. Understanding and validating data pipelines is highlighted as a critical aspect of a data analyst's job to ensure the data used for analysis is accurate.

💡Data Entry

Data entry is the process of manually entering data into a system. The video points out that data entry errors are common, especially when the task is tedious, and these errors can lead to poor data quality. As a data analyst, it's important to validate data from different sources to account for potential data entry mistakes, which is a key theme in the video's discussion on maintaining data integrity.
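
A cross-source check of this kind might look like the sketch below. The 5% tolerance follows the margin the speaker mentions in the transcript; the function name `within_margin` and the sales totals are hypothetical, and comparing against the larger of the two totals is just one reasonable choice of baseline.

```python
def within_margin(primary: float, secondary: float, tolerance: float = 0.05) -> bool:
    """Return True if two totals differ by at most `tolerance`, relative to the larger one."""
    baseline = max(abs(primary), abs(secondary))
    if baseline == 0:
        return True  # both totals are zero, so they agree
    return abs(primary - secondary) / baseline <= tolerance

# Hypothetical daily sales totals from two systems that should report the same number.
warehouse_total = 102_300.0
billing_total = 98_750.0

if within_margin(warehouse_total, billing_total):
    print("Sources agree within 5%; proceed with the analysis.")
else:
    print("Sources disagree by more than 5%; raise it before analyzing.")
```

Running a check like this before each analysis, for example as part of a morning status report, surfaces broken pipelines or entry errors before they reach a recommendation.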

💡Data Cleaning

Data cleaning is the process of removing inconsistencies, correcting errors, and dealing with missing or incomplete data. The video warns against improper data cleaning, such as incorrectly imputing missing values, which can introduce bias into the data. Proper data cleaning techniques are essential for a data analyst to ensure the quality of the data used for analysis.
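
To make the imputation risk concrete, here is a minimal sketch with invented survey rows. The segment names, income figures, and the helper `missing_rate_by_segment` are assumptions for illustration only.

```python
from statistics import mean

# Hypothetical survey rows: (customer_segment, reported_income or None if missing).
rows = [
    ("standard", 40_000), ("standard", 45_000), ("standard", 50_000),
    ("standard", 42_000), ("premium", 120_000), ("premium", None),
    ("premium", None), ("premium", None),
]

def missing_rate_by_segment(rows):
    """Fraction of missing values within each segment."""
    counts = {}
    for segment, value in rows:
        total, missing = counts.get(segment, (0, 0))
        counts[segment] = (total + 1, missing + (value is None))
    return {segment: missing / total for segment, (total, missing) in counts.items()}

observed = [value for _, value in rows if value is not None]
overall_mean = mean(observed)

# Naive fix: fill every gap with the overall mean of the observed values.
imputed = [(seg, overall_mean if value is None else value) for seg, value in rows]
premium_mean_after = mean(value for seg, value in imputed if seg == "premium")

print("missing rate per segment:", missing_rate_by_segment(rows))
print("overall observed mean:", overall_mean)
print("premium mean after imputation:", premium_mean_after)
```

In this toy data the gaps sit almost entirely in one segment, so filling them with the overall mean drags that segment's average far below its only observed value. Checking the missing rate per group before imputing is one way to notice that values are not missing at random.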

💡Correlation

Correlation is a statistical measure that expresses the extent to which two variables are linearly related. The video emphasizes the common mistake of equating correlation with causation, which is a critical misunderstanding in data analysis. The video uses the example of ice cream sales and shark attacks to illustrate how two correlated variables do not necessarily have a causal relationship.

💡Causation

Causation refers to a cause-and-effect relationship between events or variables. The video discusses how data analysts must differentiate between correlation and causation to avoid misleading conclusions. It uses the example of chocolate consumption and Nobel Prizes to show that correlation does not imply causation and that other variables, such as wealth and education, may be the true drivers.

💡Confounding Variable

A confounding variable is one that influences two other variables, making them appear causally related when they are not. The video explains the concept using the example of hot weather influencing both ice cream sales and shark attacks, where the observed correlation is due to the confounding variable (hot weather) rather than a direct relationship between the two events.
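
The ice cream and shark attack example can be simulated as a rough sketch. Everything here is assumed for illustration: the temperature range, the coefficients, the noise levels, and the `pearson` helper are not from the video, and the shark series is a toy index rather than real counts.

```python
import math
import random

random.seed(1)  # fixed seed so the sketch is repeatable

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Simulated days: temperature drives both series; neither series affects the other.
temps = [random.uniform(10, 35) for _ in range(2_000)]
ice_cream_sales = [20 * t + random.gauss(0, 50) for t in temps]
shark_attack_index = [0.5 * t + random.gauss(0, 2) for t in temps]

overall_r = pearson(ice_cream_sales, shark_attack_index)

# Hold the confounder roughly constant: keep only days between 20 and 22 degrees.
band = [i for i, t in enumerate(temps) if 20 <= t < 22]
within_r = pearson(
    [ice_cream_sales[i] for i in band],
    [shark_attack_index[i] for i in band],
)

print(f"all days:          r = {overall_r:.2f}")
print(f"20-22 degree days: r = {within_r:.2f}")
```

Comparing the correlation across all days with the correlation inside a narrow temperature band is a simple form of stratification: if the relationship largely disappears once the suspected confounder is held steady, the confounder is the more likely explanation.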

💡Critical Thinking

Critical thinking is the objective analysis and evaluation of an issue to form a judgment. The video stresses the importance of critical thinking for data analysts to avoid common pitfalls such as mistaking correlation for causation. It suggests that data analysts should always investigate further and consider all possible variables before drawing conclusions from data.

💡Sample Size

Sample size refers to the number of observations or individuals in a sample. The video notes that smaller samples can show high correlations that arise by chance and do not reflect the true population, which can result in misleading analysis. It emphasizes the importance of having a sufficiently large and representative sample for accurate data analysis.
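
One way to see why small samples mislead is to correlate two variables that are independent by construction and count how often a seemingly strong correlation appears anyway. The sketch below is an assumption-laden illustration: the sample sizes, the 0.5 threshold, the trial count, and both helper functions are choices made here, not figures from the video.

```python
import math
import random

random.seed(2)  # fixed seed so the sketch is repeatable

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def share_of_strong_correlations(sample_size, trials=400, threshold=0.5):
    """Fraction of trials in which two independent variables show |r| above the threshold."""
    strong = 0
    for _ in range(trials):
        xs = [random.gauss(0, 1) for _ in range(sample_size)]
        ys = [random.gauss(0, 1) for _ in range(sample_size)]
        if abs(pearson(xs, ys)) > threshold:
            strong += 1
    return strong / trials

small = share_of_strong_correlations(sample_size=5)
large = share_of_strong_correlations(sample_size=500)

print(f"n=5:   share of trials with |r| > 0.5 = {small:.2f}")
print(f"n=500: share of trials with |r| > 0.5 = {large:.2f}")
```

The true correlation is zero in every trial, so any strong value is noise. With only a handful of observations such values turn up regularly, while with hundreds they essentially do not.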

Highlights

Data analysts often fall short because they neglect the analysis itself, not because they lack technical skills.

Statistics is the backbone of data analysis and data science.

Three common mistakes by beginner data analysts are discussed.

Sampling bias occurs when a sample is not representative of the entire population.

Urban areas may skew political polling data due to their left-leaning nature.

Medical studies can be flawed if only hospital visitors are sampled.

Proper sampling methods ensure accuracy, reliability, and cost-effectiveness.

Data quality is paramount; poor data can lead to incorrect analysis and decisions.

Data pipeline errors can mislead analysis, such as a week or two of unlogged data making a whole quarter's sales look lower than they were.

Data analysts should validate data from multiple sources before conducting analysis.

Data entry errors are common and can significantly affect data accuracy.

Data cleaning should be done carefully to avoid introducing bias.

Correlation does not imply causation, a common mistake in data analysis.

Ice cream sales and shark attacks example illustrates the difference between correlation and causation.

Chocolate consumption and Nobel Prizes example shows the impact of a confounding variable.

Critical thinking is necessary to identify hidden variables influencing data.

Sample size is crucial; smaller sizes can lead to misleading correlations.

Data analysts play a vital role in the data-driven business landscape.

The demand for data analysts is expected to grow with advancements in AI.

Transcripts

00:00

This is going to sound a bit harsh, but most data analysts, quite frankly, aren't that great at their job. And it's not because they aren't good at coding with SQL, Python, R, or even any BI tool. It's simply because they just neglect the analysis. If you've been on this channel for a while now, you know how much I value statistics, and I truly believe statistics is the backbone of data analysis and data science. So in this video we're going to be going over three of the most common mistakes beginner data analysts make and how you can avoid them. So make sure to actually save this video, take some notes, and then refer back to them later, because I promise you these are things that will not be taught to you on the job, and you just learn them with years of experience.

00:34

If you're new here, my name is Rohan, and I run an education company teaching people data analytics, and I also do data analyst freelancing for e-commerce and tech startups. If you're interested in joining a like-minded community all into data analytics and data science, go ahead and join the Discord down below. We have almost 5,000 people in it.

00:48

Now, the first mistake I see people make is sampling bias. Sampling bias basically occurs when a small sample size that you're taking for your analysis isn't representative of the entire population. So let's say you're doing a polling analysis and you want to actually figure out the political views of a certain region or the country. And let's say you were going to go to a bunch of urban areas because you know friends in Chicago, San Francisco, New York, and you ask all your friends to take polls and collect a ton of data from their friends and their family members in these urban areas. The thing is, most of these urban environments are pretty left-leaning rather than right-leaning, so your poll and your data may not be as representative of the entire country, where you might be leaving more rural areas out of the analysis. So these biased collections of data, even though the analysis may be correct, may just not be representative of the entire population, and this is what we call sampling bias.

01:39

Another example of this is a health study. Let's say you conduct a medical study for a new drug that you're releasing, and you're only taking a small sample from just people who visit hospitals. It might be easiest for you to collect data from people who already visit hospitals versus people who don't visit hospitals, but your study would be overlooking one thing: the people who visit hospitals more may tend to be older and more vulnerable to diseases, so it may not be that accurate for the entire population. So it is very important to use proper sampling methods to make sure you have accuracy, reliability, and integrity in your study.

02:10

It's also much more cost-effective to sample rather than take an entire population. Imagine asking all 300 million people in the United States, or the billions of people in the world, for a poll. It's just not feasible, very expensive. So it is much more cost-effective to just take a sample that is representative of the entire population. So while taking these studies and doing the analysis, be cognizant of the data you're actually using while conducting any analysis for any of your stakeholders or customers.

02:36

Now, the second biggest mistake I see data analysts and data scientists make is using poor-quality data. And I know this may not be your fault as a data scientist or data analyst. It may be a data engineer's fault, or maybe the fault of a software engineer who's collecting the data itself. But the fact of the matter is, if your data isn't reliable or your data isn't accurate by any means, this will completely mess up your analysis and completely mess up the decisions that you're recommending to your customers.

02:59

So let's say you're working for a company that has a huge... The problem is, you don't understand that there's actually a data pipeline error in Q4 2024 rather than Q4 2023. So these maybe one or two weeks inside of the quarter would lead to a decrease in sales for the entire quarter, just because you didn't log the data. So because of this, you think that the sales are actually going down, and you actually spend more on marketing even though you probably didn't need to, you spend more on R&D even though you probably didn't need to, and this can hurt the business in the long run.

03:26

It is always just so funny to me when people think data analysts or data scientists aren't needed at companies. But the reality is, if you want to survive in business in this, like, honestly, century, you need to be using data to some extent. Gone are the days where you can just use intuition. And the next step is making sure you have accurate and reliable data sources for this. And I'll be honest, I think we've only reached the tip of the iceberg when it comes to data collection and data analysis. So when people ask me, "Are data analysts still going to be in demand in 10 years?", you do the math. We haven't even touched the tip of the iceberg when it comes to data collection. How are we going to actually measure and make decisions based on this data if we don't have data analysts and data scientists, especially with all these AI advancements?

04:06

Another thing I want you to watch out for in data accuracy is data entry. There are a ton of people who are doing data entry currently, and it is not uncommon for a human to make a mistake when it comes to data entry. Oftentimes this is a very, like, tedious task and you just want to get done with it, and there are a lot of errors that come into this. So what I recommend you to do as a data analyst... In my career I've had jobs where I've taken two data sources that should have yielded the same number, because they're coming from the same sales data, but the problem is they were different by 5 to 10%. And this is maybe because data entry was incorrect or the data pipelines weren't working. So my point here is, in your job as a data analyst, you may not know if a pipeline's broken, you may not know if the data entry was correct, but your job is to figure that out and validate the data with a second source, or even a third source, or even third-party data, and just validate before conducting analysis.

04:57

This is so important, and I recommend this for every single analysis you do: you do the proper due diligence and make sure you are validating data beforehand. What I've done at a lot of companies is I actually have status reports. Every morning I'd get a status report on the data that was imported from our data pipelines, coming from third-party or even first-party data. So I'd always just validate with two different data sources, make sure it's within, like, a 5% margin, and if it is, normally it's good to go. But if it isn't, I always bring it up before doing any work for that day, because the data analysis of poor data is incorrect.

05:26

Next thing I want to mention is data cleaning. People clean data to make sure data quality is good, but I've seen a lot of people make this mistake of actually, like, imputing missing values, whether it be with an average or something, but sometimes it's just incorrect. And then you can also have bias, where those columns that aren't filled up with data may not be filled up for a reason, and that could also cause a bias in the study. So I want you to be very careful while doing data cleaning. Make sure you have the proper automation and the proper testing in place before conducting an analysis.

05:57

Okay, so the most common mistake I see data analysts make is they think correlation is causation. And this was probably drilled into you in your stats classes or high school math classes or college classes, but it is more important now than ever before, because you are a data analyst, you are a data scientist, and quite frankly it's your job to make sense of data and provide decisions. For those of you who don't know, correlation just means the relationship between two or more variables with each other.

06:21

However, just because two variables are correlated does not mean they are causing each other to go up or down. So the most popular example I like to give is the ice cream example. During the summertime, with hotter weather, ice cream sales go up and shark attacks also go up. But is it safe to say that just because Cold Stone decides to sell more ice cream in the summer, this is causing shark attacks? No, that's ridiculous, right? It's because it's hot outside, which is called the confounding variable. A confounding variable is a variable that influences two different variables that then appear to be correlated even though there's no causation. So because it's hot outside in the summer, more people are likely to go to the beach to cool off, and more people are also going to go buy ice cream because it's hot outside, to cool down. So even though two variables have gone up, it does not mean that they cause each other.

07:10

The next example I like to bring up with this is that countries with higher chocolate consumption tend to have more Nobel Prize winners. And you may be wondering, hey, that makes sense: if I eat chocolate, I can get smarter and I can achieve more Nobel Prizes. But the reality is, think about it: where is chocolate more predominant in terms of consumption? Wealthier countries. Wealthier countries also tend to have better education systems, which also leads to more people being able to do research, and more research equals more Nobel Prizes being given. So it's not necessarily that chocolate is causing people to win all these Nobel Prizes and become a very prestigious nation. It's because the wealth in the country has caused people both to buy and consume more chocolate and also to win a Nobel Prize.

07:51

So always remember, in every single job I have, there may be another variable influencing the two variables you are comparing, and oftentimes it's hidden, and it's your job as a data analyst and data scientist to find that variable and bring it to light in your presentation and in your analysis. So, I mean, there's really no clear solution to this. It just comes down to critical thinking. Even if you see a very strong correlation of, like, 0.6 or 0.7, I wouldn't necessarily jump and think, hey, it's causation. Always do your research and always find out and investigate further. This is why it takes time and you shouldn't rush through analysis.

08:23

My last note on correlation is: make sure your sample size is big enough. Typically, smaller sample sizes lead to higher correlations, even though they may not be indicative of the true population. So make sure you have a measure of what sample size you're using and what problem you're trying to solve, and really piece all these things together, and that's what forms a high-quality data analyst.

08:41

So these were the three most common mistakes that I myself have seen other data analysts make in the beginning, myself included. So if you got any value from this video, please leave a like, subscribe, and I'll see you in the next one.

08:52

[Music]


Related Tags
Data Analysis, Sampling Bias, Data Quality, Correlation vs Causation, Statistical Mistakes, Data Science, Data Accuracy, Data Cleaning, Sampling Methods, Data Validation