Statistics made easy ! ! ! Learn about the t-test, the chi square test, the p value and more

Global Health with Greg Martin
10 Jun 2019 · 12:50

Summary

TLDR: This script offers a simplified approach to learning statistics by focusing on practical thinking rather than complex formulas. It introduces common statistical questions and explains how to analyze sample data to identify differences between groups and relationships between variables. The video covers summarizing and visualizing data, selecting appropriate statistical tests, and interpreting results. It also discusses the importance of defining hypotheses and choosing an alpha value before analyzing data. Examples include t-tests, chi-square tests, and correlation tests, emphasizing the significance of statistical findings in understanding population characteristics.

Takeaways

  • 📊 **Understanding Statistics**: The script emphasizes a simplified approach to learning statistics by focusing on thought processes rather than complex formulas and theories.
  • 🔍 **Analyzing Sample Data**: It discusses the common tasks in statistics, which include identifying differences between groups and relationships between variables within sample data.
  • 🤔 **Questioning Realness**: The script raises the question of whether observed differences and relationships in sample data are 'real' and how to define this term.
  • 📈 **Data Variables**: It explains the importance of understanding the two types of variables in datasets: categorical (like gender) and numeric (like height).
  • 📋 **Summarizing Data**: The script outlines methods for summarizing data, such as counting observations for categorical data and calculating median, mean, and standard deviation for numeric data.
  • 📊 **Visual Representation**: It describes how to visualize data using tables, bar charts, box plots, and histograms to better understand the distribution and central tendencies.
  • 🧐 **Combining Variables**: The script explores analyzing combinations of variables to uncover specific differences or relationships, such as comparing average heights between genders.
  • 📚 **Statistical Tests**: It introduces the concept of applying statistical tests to determine if sample observations can be generalized to the wider population.
  • 🔑 **Hypothesis and Null Hypothesis**: The script stresses the importance of defining a hypothesis and a null hypothesis before analyzing data, along with setting an alpha value to determine statistical significance.
  • 📝 **Research Questions**: It provides examples of how to form research questions based on the type of variables involved, such as comparing a single numeric variable to a theoretical value or examining the relationship between two numeric variables.
  • 🔗 **Sponsorship Acknowledgement**: The script includes a thank you note to BioMed Central (BMC) for sponsoring the video and briefly discusses the importance of open access journals.

Q & A

  • What is the main focus of the video script?

    -The main focus of the video script is to simplify the learning of statistics by introducing a way of thinking that enables addressing common statistical questions when analyzing sample data.

  • What are the two primary types of variables typically found in data sets?

    -The two primary types of variables typically found in data sets are categorical variables (like gender) and numeric variables (like height).

  • How does the script suggest summarizing categorical data?

    -The script suggests summarizing categorical data by counting the number of observations in each category and representing them in a table and on a bar chart.

  • What are the key summary measures for numeric data mentioned in the script?

    -The key summary measures for numeric data mentioned in the script include the range, interquartile range, standard deviation, median, and mean.

  • What visualization tools are suggested for numeric data?

    -The script suggests using box plots and histograms as visualization tools for numeric data.

  • What is the significance of the term 'real' in the context of the script?

    -In the context of the script, the term 'real' refers to whether the observed differences or relationships in sample data are statistically significant and can be inferred to represent the wider population.

  • What is the role of statistical tests in analyzing data according to the script?

    -Statistical tests play a role in determining if the observed differences or relationships in sample data are statistically significant and can be generalized to the wider population.

  • What is the significance of the alpha value in statistical analysis as discussed in the script?

    -The alpha value is significant in statistical analysis as it represents the probability threshold below which the null hypothesis is rejected, indicating that the observed difference is statistically significant.

  • What is the null hypothesis and how is it used in the script?

    -The null hypothesis is a statistical assumption that there is no effect or difference. In the script, it is used as a baseline to compare against observed data, and if the observed data is unlikely under the null hypothesis, it can be rejected.

  • How does the script explain the process of analyzing data with one categorical variable?

    -The script explains that with one categorical variable, such as gender, a one-sample proportion test can be conducted to determine if there is a statistically significant difference in proportions between groups.

  • What is the purpose of the chi-square test as mentioned in the script?

    -The purpose of the chi-square test, as mentioned in the script, is to determine if there is a statistically significant association between two categorical variables.

  • How does the script describe the process of analyzing two numeric variables?

    -The script describes the process of analyzing two numeric variables by using a correlation test to determine if there is a statistically significant relationship between the variables, as indicated by the correlation coefficient and the p-value.

Outlines

00:00

📊 Introduction to Statistical Thinking

The script begins by simplifying the approach to learning statistics, emphasizing a conceptual understanding over complex formulas. It discusses the examination of sample data to identify differences between groups and relationships between variables. The speaker introduces the concept of determining whether observed differences and relationships are 'real'. A hypothetical scenario involving the height and weight of people in Ireland is used to illustrate the process of analyzing a dataset with variables such as gender and age group. The script explains the importance of summarizing and visualizing data to make it more interpretable, including the use of tables, bar charts, box plots, and histograms. The goal is to transform raw data into meaningful insights.
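The summarizing step described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the sample values and the helper names `summarize_categorical` and `summarize_numeric` are made up for this example and do not come from the video.

```python
from collections import Counter
import statistics

# Hypothetical sample: (gender, height in metres). Values are invented for illustration.
sample = [
    ("male", 1.82), ("female", 1.65), ("male", 1.75), ("female", 1.60),
    ("male", 1.78), ("female", 1.70), ("male", 1.85), ("female", 1.62),
]

def summarize_categorical(values):
    """Count the observations in each category."""
    return dict(Counter(values))

def summarize_numeric(values):
    """Return spread and centre measures for a numeric variable."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    return {
        "min": min(values),
        "max": max(values),
        "iqr": q3 - q1,
        "sd": statistics.stdev(values),      # sample standard deviation
        "median": statistics.median(values),
        "mean": statistics.mean(values),
    }

genders = [g for g, _ in sample]
heights = [h for _, h in sample]

print(summarize_categorical(genders))
print(summarize_numeric(heights))

# Combine a categorical and a numeric variable: mean height per gender.
group_means = {
    group: statistics.mean(h for g, h in sample if g == group)
    for group in sorted(set(genders))
}
print(group_means)
```

The counts correspond to what a table or bar chart would show, and the numeric summary holds the values a box plot draws (range, interquartile range, median).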

05:01

🔍 Hypothesis Testing and Statistical Significance

This section delves into the process of hypothesis testing, starting with defining a research question and hypothesis. It underscores the importance of setting a null hypothesis and an alpha value before analyzing data. The speaker uses the example of gender distribution to explain how to apply a one-sample proportion test. The concept of statistical significance is introduced, discussing p-values and the decision to reject or fail to reject the null hypothesis based on the alpha value. The script also covers the chi-square test for comparing categorical variables and the t-test for numeric variables, providing a framework for determining if observed differences are statistically significant.
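As a rough sketch of the chi-square step, the Python below computes the statistic for a gender by age-group table and compares the p-value with an alpha chosen in advance. The counts are hypothetical. The p-value shortcut `exp(-x/2)` is exact only for 2 degrees of freedom, which is what a 2 x 3 table gives; a statistics package would handle the general case.

```python
import math

def chi_square_independence(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

def p_value_dof2(stat):
    """Upper-tail probability of a chi-square distribution with exactly 2 degrees of freedom."""
    return math.exp(-stat / 2)

# Hypothetical counts: rows = gender (male, female), columns = three age groups.
observed = [
    [30, 20, 10],
    [20, 20, 20],
]
alpha = 0.05  # chosen before looking at the data

stat, dof = chi_square_independence(observed)
assert dof == 2  # the closed-form p-value above only holds for 2 degrees of freedom
p = p_value_dof2(stat)
print(f"chi-square = {stat:.3f}, dof = {dof}, p = {p:.4f}")
print("reject the null" if p < alpha else "fail to reject the null")
```

With these made-up counts the proportions differ across age groups in the sample, yet the p-value stays above alpha, which illustrates why an observed difference alone is not enough.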

10:02

📈 Advanced Statistical Analysis Techniques

The final paragraph explores more complex statistical analyses involving multiple variables. It discusses how to analyze a single numeric variable against a theoretical value using a t-test and how to use ANOVA for comparing means across multiple categories. The script then introduces the concept of correlation between two numeric variables, explaining the use of the correlation coefficient to measure the strength and direction of a relationship. The speaker emphasizes the importance of statistical tests in determining if observed correlations are statistically significant. The section concludes with a brief mention of resources for further learning in statistical analysis and programming for statistical purposes.
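To make the single-numeric-variable case concrete, here is a minimal sketch of a one-sample t statistic in Python. The ten heights are invented, the reference value of 1.4 m is the one used in the video's example, and the critical value 2.262 is the standard two-sided t-table entry for alpha = 0.05 with 9 degrees of freedom. Comparing against a critical value is a stand-in for the p-value that a statistics package would report directly.

```python
import math
import statistics

def one_sample_t(values, mu0):
    """Return the t statistic and degrees of freedom for H0: population mean == mu0."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)              # sample standard deviation (n - 1 divisor)
    standard_error = sd / math.sqrt(n)
    return (mean - mu0) / standard_error, n - 1

# Hypothetical sample of heights in metres.
heights = [1.62, 1.70, 1.75, 1.68, 1.80, 1.66, 1.73, 1.71, 1.77, 1.69]
historic_mean = 1.4          # previously established height from the video's example
critical_value = 2.262       # two-sided, alpha = 0.05, 9 degrees of freedom (t table)

t_stat, dof = one_sample_t(heights, historic_mean)
print(f"t = {t_stat:.2f} with {dof} degrees of freedom")
if abs(t_stat) > critical_value:
    print("reject the null: the mean differs from the historic value")
else:
    print("fail to reject the null")
```

The same statistic generalizes to comparing two group means (a two-sample t-test), and ANOVA extends the idea to three or more groups.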

Keywords

💡Statistics

Statistics is a branch of mathematics dealing with the collection, analysis, interpretation, presentation, and organization of data. In the video, the presenter aims to simplify the learning of statistics by focusing on a way of thinking rather than complicated formulas, which is central to understanding the data presented.

💡Sample Data

Sample data refers to a subset of a larger population that is used to represent the whole for the purpose of research or analysis. The video emphasizes the examination of sample data to observe differences between groups and relationships between variables, which is a common practice in statistical analysis.

💡Categorical Variables

Categorical variables are variables that can take on one of a limited number of possible values, which are usually words or categories. In the script, gender is given as an example of a categorical variable, which is used to group data for analysis.

💡Numeric Variables

Numeric variables are variables that represent quantities or numerical values. Height is used as an example in the video, where numeric variables are analyzed for their distribution, range, and relationships with other variables.

💡Statistical Tests

Statistical tests are methods used to determine if a result from sample data is statistically significant. The video discusses various tests like t-tests and ANOVA, which are used to analyze the sample data and infer about the larger population.
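The video's mapping from variable types to tests can be written down as a small lookup. The function name `choose_test` and its parameters are hypothetical; the five combinations and the test names are the ones listed in the transcript.

```python
def choose_test(categorical=0, numeric=0, categories=2):
    """Map a combination of variable types to the test named in the video.

    categorical / numeric: how many variables of each type are being analysed.
    categories: number of levels in the categorical variable (used for the mixed case).
    """
    combo = (categorical, numeric)
    if combo == (1, 0):
        return "one-sample proportion test"
    if combo == (2, 0):
        return "chi-square test"
    if combo == (0, 1):
        return "t-test"  # one numeric variable compared with a theoretical value
    if combo == (1, 1):
        # Two groups: t-test. More than two groups: analysis of variance.
        return "t-test" if categories <= 2 else "ANOVA"
    if combo == (0, 2):
        return "correlation test"
    raise ValueError("combination not covered in the video")

print(choose_test(categorical=1))                           # gender alone
print(choose_test(categorical=2))                           # gender and age group
print(choose_test(categorical=1, numeric=1, categories=3))  # age group and height
print(choose_test(numeric=2))                               # height and weight
```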

💡Hypothesis

A hypothesis is a proposed explanation for a phenomenon, which can be tested through experimentation or further research. The video script mentions defining a hypothesis before analyzing data, which is a fundamental step in the scientific method.

💡Null Hypothesis

The null hypothesis is a default position that there is no relationship between the variables being studied, used especially in statistics. The video explains that one must consider the null hypothesis and calculate the probability of observing the sample results if it were true.

💡P-Value

The p-value is the probability of obtaining results at least as extreme as the ones observed, assuming that the null hypothesis is true. The video script discusses calculating the p-value to determine the significance of observed differences in the sample data.
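The definition above can be checked by brute force for the gender example. The sketch below assumes a hypothetical sample of 100 people containing 60 men and a null hypothesis of equal proportions; it sums the binomial probability of every outcome that is no more likely than the observed one, which is one common convention for an exact two-sided p-value. The function name is made up for this example.

```python
from math import comb

def binomial_two_sided_p(successes, n, p0=0.5):
    """Exact two-sided p-value under H0: the true proportion equals p0.

    Adds up the probability of every outcome that is no more likely
    than the observed count.
    """
    def pmf(k):
        return comb(n, k) * p0**k * (1 - p0) ** (n - k)

    observed = pmf(successes)
    tolerance = 1 + 1e-9  # guards against floating-point ties
    total = sum(pmf(k) for k in range(n + 1) if pmf(k) <= observed * tolerance)
    return min(1.0, total)

alpha = 0.05                      # chosen before calculating the p-value
men, sample_size = 60, 100        # hypothetical sample
p_value = binomial_two_sided_p(men, sample_size)

print(f"p = {p_value:.4f}")
print("reject the null" if p_value < alpha else "fail to reject the null")
```

Even a 60/40 split in a sample of 100 does not fall below the 0.05 threshold here, which is the "should we get excited? not yet" point the video makes about sample data.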

💡Alpha Value

The alpha value is the level of statistical significance used to determine whether to reject the null hypothesis. In the video, an alpha value of 0.05 is used as the threshold to decide if the observed difference is statistically significant.

💡Correlation

Correlation measures the extent to which two variables are linearly related. The video discusses the use of a correlation test to determine if there is a relationship between two numeric variables, such as height and weight.

💡Correlation Coefficient

The correlation coefficient is a statistic that measures the strength and direction of a linear relationship between two variables. The video explains that it ranges from -1 to 1, with -1 indicating a perfect negative correlation, 1 a perfect positive correlation, and 0 no correlation.
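A short sketch of how the coefficient is computed, assuming the Pearson definition (covariance divided by the product of the standard deviations). The height and weight values are invented; the function name `pearson_r` is hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariation / math.sqrt(spread_x * spread_y)

# Hypothetical heights (m) and weights (kg).
heights = [1.60, 1.65, 1.70, 1.75, 1.80, 1.85]
weights = [55.0, 62.0, 64.0, 72.0, 75.0, 83.0]

print(round(pearson_r(heights, weights), 3))      # positive: taller people weigh more
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))      # perfectly positive
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))      # perfectly negative
print(pearson_r([1, 2, 3, 4], [1, -1, -1, 1]))    # no linear relationship
```

Swapping the arguments gives the same value, which matches the video's remark that it does not matter which variable is on the x axis and which is on the y axis.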

Highlights

Learning statistics can be simplified by focusing on a way of thinking rather than complex formulas and theories.

Statistical analysis often involves looking at differences between groups and relationships between variables.

The key question in statistics is determining whether observed differences and relationships are real.

A simple dataset can reveal specific differences between groups and relationships between variables.

Statistical tests should be used to interpret results and determine if sample data implies anything about the wider population.

Data sets typically contain categorical and numeric variables, which are summarized and visualized for analysis.

Categorical variables are grouped into categories, while numeric variables are measured on a numerical scale.

Summarizing data involves counting observations for categorical data and describing distribution for numeric data.

Visual representations like bar charts, box plots, and histograms help in understanding data distribution and central tendencies.

Combining variables can reveal interesting insights, such as differences in average height between genders.

Statistical tests are chosen based on the type of variables involved, such as t-tests for numeric variables or chi-square tests for categorical variables.

The process of hypothesis testing involves defining a null hypothesis, selecting an alpha value, and calculating a p-value.

A low p-value (less than the alpha value) indicates that the observed difference is statistically significant.

The correlation coefficient measures the strength and direction of the relationship between two numeric variables.

The correlation coefficient ranges from -1 to 1 and indicates the degree of linear relationship, with -1 being perfectly negative and 1 being perfectly positive.

The video is sponsored by BioMed Central, a publisher of open access journals.

The speaker is the editor-in-chief of one of BioMed Central's journals, 'Globalization and Health'.

The speaker emphasizes the importance of defining research questions and hypotheses before analyzing data.

The video provides an overview of the five most important combinations of data types and the corresponding statistical tests.

The speaker offers additional resources for learning more about statistical analysis and programming languages for statistics.

Transcripts

play00:00

learning statistics does not need to be

play00:02

difficult

play00:03

now instead of bombarding you with a

play00:06

complicated formula and statistical

play00:08

theory I'm gonna walk you through a way

play00:10

of thinking and that's gonna enable you

play00:12

to address the most common statistical

play00:13

questions when we look at sample data

play00:16

for the most part we see two things we

play00:18

see differences between groups so men

play00:20

are taller than women and we see

play00:22

relationships between variables like

play00:24

taller people weigh more than shorter

play00:26

people and the big question is

play00:28

are those differences and are those

play00:30

associations or relationships real and

play00:33

I'm going to talk you through what it is

play00:34

that we mean by the term real over the

play00:36

next few minutes we're going to take a

play00:38

look at a very simple data set and we're

play00:40

gonna see how by looking at various

play00:42

combinations of variables and variable

play00:44

types we can identify very specific

play00:47

differences between groups and very

play00:49

specific relationships between variables

play00:50

and I'm gonna walk you through when and

play00:53

how to use statistical tests and how to

play00:56

interpret your results now let's imagine

play00:59

that we have a research question and

play01:01

it's about the height and the weight of

play01:02

people living in Ireland of course we

play01:05

can't measure the height and the weight

play01:06

of the entire population so instead we

play01:09

take a random sample of the population

play01:11

and we measure the weight and the height

play01:13

of that sample and we collect some

play01:15

additional information like gender and

play01:17

age group from each of the people in our

play01:19

sample and we arranged these data in a

play01:21

spreadsheet or data set with the various

play01:24

attributes in columns and these are

play01:26

called variables and these variables

play01:29

will be the object of our inquiry

play01:31

[Music]

play01:32

now most data sets that you work with

play01:35

will contain two types of variables

play01:37

categorical and numeric variables

play01:39

categorical variables like gender

play01:41

contain categories as the name suggests

play01:43

think of them as groups or buckets that

play01:45

the data can be arranged into in this

play01:47

case males and females numeric variables

play01:50

like height are numbers as the name

play01:52

suggests and can be arranged on a number

play01:54

line now to better understand our data

play01:58

and to make sense of it we summarize it

play02:00

and we visualize it in the case of

play02:02

categorical data we can count up the

play02:04

number of observations in any given

play02:05

category and we can represent them in a

play02:07

table and on a bar chart and to

play02:09

summarize numeric data we're firstly

play02:11

interested in the spread

play02:12

the distribution of the data so we might

play02:14

describe the range of the data the

play02:16

interquartile range we could also

play02:17

include the standard deviation to get a

play02:20

sense of the middle of the data we use

play02:22

the median which divides the data into

play02:24

two equal halves and we use the mean

play02:26

which is the average the mean is

play02:28

probably the most commonly used summary

play02:29

value to represent this kind of data we

play02:32

can visualize that data using a box plot

play02:34

which is a visual representation of the

play02:36

range the interquartile range and the

play02:38

median and of course we can create a

play02:40

histogram and this gives us the shape of

play02:42

the data so I hope you can see that this

play02:44

process of summarizing and visualizing

play02:46

the data takes it from being just

play02:47

numbers and words on a spreadsheet and

play02:49

turns it into something that is

play02:50

meaningful to us something that we can

play02:52

get our heads around something that we

play02:53

can think about now in this very simple

play02:55

data set we've got two categorical and

play02:57

two numeric variables and things start

play02:59

to get interesting when we start looking

play03:00

at combinations of variables so for

play03:03

example we can take a look at a

play03:04

categorical and a numeric variable like

play03:06

gender and height and so we can group

play03:08

the data by gender which is the

play03:09

categorical variable and create a

play03:11

summary of the numeric variable in this

play03:14

case height that is separated out into

play03:16

those two groups and looking at the

play03:18

summary we can see that in our sample

play03:19

data men are on average taller than

play03:22

women what I want you to see here is

play03:23

that we've looked at a combination of

play03:25

the categorical and a numeric variable

play03:26

but as you can imagine there are other

play03:29

possible combinations of variables that

play03:31

we could have looked at we could have

play03:32

looked at height and weight which are

play03:33

both numeric we could have looked at

play03:35

gender and age group both categorical

play03:37

and in each case we might see either

play03:39

differences between groups or

play03:41

relationships between variables and in

play03:43

each of these cases there are specific

play03:45

statistical tests that we can apply to

play03:48

see if what we are seeing in the sample

play03:50

data has implications for what we think

play03:53

about the wider population can we infer

play03:55

anything is what we are seeing

play03:57

statistically significant so let's take

play04:00

a quick look at the five most important

play04:02

combinations of data that we have and

play04:04

we'll look at firstly what might we

play04:06

observe in our sample data given that

play04:07

sort of combination of data types and

play04:09

secondly what statistical test we might

play04:12

apply to determine whether or not we can

play04:14

infer anything about the wider

play04:15

population so we might look at a single

play04:16

categorical variable like gender and we

play04:18

could do a one sample proportion test

play04:21

for two categorical variables we would

play04:23

do a chi-square test for a single

play04:25

numeric

play04:25

variable a t-test if we have a categorical

play04:28

and a numeric variable we do a t-test or

play04:31

analysis of variance or ANOVA if there

play04:33

are more than two categories in a

play04:34

categorical variable and for two numeric

play04:36

variables we do a correlation test now

play04:38

I'm going to come back to each of these

play04:39

scenarios and each of these tests so

play04:41

don't panic at this point what I want

play04:43

you to see is how the data can be

play04:45

divided up and in just a few minutes

play04:47

we're going to take each of these

play04:48

scenarios and work through exactly what

play04:50

questions you can ask and how it is that

play04:52

you can apply statistical tests and

play04:54

importantly how to interpret your

play04:55

results now before we carry on I just

play04:58

want to say a big thank you to BioMed

play05:00

Central or BMC for sponsoring this video

play05:03

BMC are a publishing company that

play05:05

publish open access journals and that

play05:07

means that the full-text of all of the

play05:09

papers published are available for free

play05:10

to anyone in the world I'm the

play05:12

editor-in-chief of one of the journals

play05:14

that they publish called Globalization

play05:15

and Health and I'm genuinely impressed with

play05:17

them as a company I believe that they

play05:19

have integrity and I honestly believe

play05:20

that they are making the world a better

play05:22

place they have a portfolio of over 300

play05:25

journals that they publish so check them

play05:26

out at biomedcentral.com I'll put a

play05:28

link in the description below

play05:30

at this point I want to say this it's

play05:32

not good science to take a data set and

play05:35

just randomly stab around blindly hoping

play05:38

to find something that's statistically

play05:39

significant

play05:40

before you interrogate the data you

play05:42

start off by defining your question your

play05:44

hypothesis you define your null

play05:46

hypothesis you identify the alpha value

play05:48

that you're going to use and then you

play05:50

analyze the data so let's look at what

play05:52

we can do with just one categorical

play05:54

variable like gender we might ask the

play05:56

question is there a difference in the

play05:58

number of men and women in the

play05:59

population now we could state that as a

play06:01

hypothesis which is that there is a

play06:04

difference between the number of men and

play06:05

women in the population and we could

play06:07

check to see whether or not we think

play06:08

that that is the case and when we look

play06:10

at our sample data well we do in fact

play06:12

see that there's a difference in the

play06:14

proportion of men and women so should we

play06:15

get excited well no not yet

play06:17

remember this is just sample data we

play06:20

could have by chance selected a sample

play06:23

that just happened to show a difference

play06:25

so let's consider the possibility that

play06:27

in actual fact there is no difference in

play06:30

the number of men and women in the

play06:31

population and we call that our null

play06:33

hypothesis and if that were true how

play06:37

likely would it be what

play06:39

the chances what is the probability that

play06:41

we would see the difference that we have

play06:43

observed or greater difference for that

play06:45

matter and if we can show that that

play06:47

probability is low then we can have a

play06:49

degree of confidence that the null

play06:50

hypothesis is wrong and we can reject it

play06:52

but before we calculate this probability

play06:55

which we're going to call our p-value we

play06:57

must be clear about how small is small

play07:00

enough below what value of P would we

play07:03

reject the null and we must decide on

play07:05

that cutoff before we calculate the

play07:07

p-value and we call that cutoff the

play07:09

alpha value and for the rest of the

play07:11

examples in this video we're going to

play07:12

use an alpha value of point zero five or

play07:15

five percent so we've really got two

play07:16

scenarios we've got the null hypothesis

play07:18

which is that there's no difference and

play07:20

the alternative hypothesis which is that

play07:22

there is a difference and the next step

play07:24

is to apply a statistical test and in

play07:26

this case we're doing one sample

play07:28

proportion test and we generate a

play07:30

p-value if the P is less than the alpha

play07:34

then we can reject the null hypothesis

play07:36

and state that the difference that we

play07:38

observe is statistically significant if

play07:40

we add another categorical variable in

play07:42

this case age group we may have a

play07:45

research question like does the

play07:47

proportion of males and females differ

play07:49

across these groups so our hypothesis is

play07:52

that the number of men and women that we

play07:55

observe is dependent on the age category

play07:58

that we look at in other words the

play07:59

proportions change or depend on or are

play08:02

dependent on the age category now we can

play08:04

collect our sample data we look at it

play08:07

and we can see that yes in fact the

play08:09

proportions do change across the age

play08:11

groups in other words in our sample data

play08:13

the proportions are dependent on age

play08:15

category now is that due to chance well

play08:18

let's test the idea that the proportions

play08:21

are all the same or that they are

play08:22

independent of age category that's our

play08:25

null hypothesis now here we can conduct

play08:28

a chi-square test and that gives us a

play08:31

p-value and if the p-value is less than

play08:34

the Alpha we can reject the null

play08:35

hypothesis and state that our

play08:38

observation is statistically significant

play08:40

if we want to look at just one numeric

play08:42

variable on its own like height then we

play08:45

don't have any groups to look for

play08:47

differences between and we don't have

play08:48

another numeric variable to look for

play08:50

some sort of association or relationship

play08:51

with so what questions can we ask well

play08:55

we might have some theoretical value

play08:57

that we want to compare our data to for

play08:59

example in the case of average height we

play09:01

might have some historic data we might

play09:03

wonder if the current population is

play09:05

significantly different from that

play09:06

historic data so our question might

play09:08

be is the average height different from

play09:11

a previously established height let's

play09:13

imagine that the previously established

play09:14

height was one point four meters we want

play09:16

to know if the average height in our

play09:18

current population is different to that

play09:20

our hypothesis is that there is a

play09:22

difference again we collect some sample

play09:24

data we find that the average height is

play09:26

indeed different from the historic

play09:27

height is that statistically significant

play09:30

well if there were no difference what

play09:32

would the chances be that we observed

play09:34

the difference that we do or a greater

play09:36

difference we conduct a t-test comparing

play09:38

the averages and if the p-value is less

play09:40

than the alpha then we can reject the

play09:42

null hypothesis and state that the

play09:44

observed difference is statistically

play09:46

significant now let's consider a

play09:48

categorical and a numeric variable and

play09:50

ask the question is there a

play09:52

difference between the average height of

play09:53

men and women in this case our

play09:55

hypothesis is that there is a difference

play09:57

in our sample we do observe a difference

play10:01

let's assume that there's no difference

play10:05

we conduct a t-test which gives us a

play10:09

p-value if the P is less than the Alpha

play10:11

we'll reject the null and we state that

play10:13

the observation is statistically

play10:15

significant if we had a categorical

play10:17

variable with more than two categories

play10:18

like age group that's got three

play10:20

categories then instead of doing a

play10:21

t-test we would do an analysis of

play10:23

variance or ANOVA now let's look at the

play10:26

last combination of variable types in

play10:27

the data set two numeric variables

play10:29

height and weight here we might start

play10:33

with the question is there a

play10:34

relationship between height and weight

play10:35

our hypothesis is that there is a

play10:37

relationship we collect sample data we

play10:40

look at it and we do see some

play10:41

sort of relationship is it real well let's

play10:44

assume that it's not let's assume that

play10:46

there's no correlation between the two

play10:47

variables and if it weren't real then

play10:49

what are the chances that we'd see the

play10:51

relationship that we do and here we

play10:53

conduct a

play10:53

correlation test now a correlation

play10:56

test is going to give us two things

play10:57

firstly it's going to give you a

play10:59

correlation coefficient which tells us

play11:01

something about the nature of the

play11:02

association between the two variables

play11:03

and I'm going to talk about that in just

play11:06

a minute

play11:06

but of course it also gives us a p-value

play11:09

and again if the p-value is less than

play11:12

the Alpha we can reject the null

play11:13

hypothesis and state that the

play11:15

correlation that we see is statistically

play11:18

significant and the correlation that we

play11:19

see can be represented by a number that

play11:22

we call the correlation coefficient so

play11:23

let's talk about that for a second

play11:25

correlation coefficient is a number

play11:27

between negative 1 and 1 and it looks at

play11:30

the relationship between two numeric

play11:32

variables if as the X variable gets

play11:38

larger the Y variable gets smaller we

play11:40

say that they are negatively correlated

play11:41

if they are perfectly negatively

play11:43

correlated then the correlation

play11:45

coefficient will be negative 1 if

play11:47

there's no relationship between the two

play11:49

variables then the correlation

play11:50

coefficient will be 0 and if there's a

play11:54

perfectly positive correlation as X goes

play11:56

up Y goes up then the correlation

play11:57

coefficient will be 1 and of course you

play12:00

can have any value in between and by the

play12:02

way it doesn't matter which of your

play12:04

variables is on the x and the y axis the

play12:06

correlation coefficient will be the same

play12:07

of course we've only just been able to

play12:09

scratch the surface in terms of what

play12:11

there is to learn about statistical

play12:12

analysis if you want to learn more then

play12:14

go to learnmore365.com and I've got

play12:18

some courses there that you can look at if

play12:20

you'd like to learn about R

play12:21

programming which is a programming

play12:22

language that gets used for statistical

play12:24

analysis and it's free it's very

play12:26

powerful it's easy to use it's

play12:28

absolutely fantastic I have a YouTube

play12:30

channel that focuses specifically on

play12:32

that so that's R programming 101 I'll

play12:35

put a link in the description below go

play12:37

and check it out otherwise please

play12:38

subscribe to this channel hit the bell

play12:40

notification if you want notification of

play12:41

future videos leave your comments below

play12:42

and share this video with anyone that

play12:44

you think might find it useful until

play12:46

next time take care

Related Tags
Statistics, Data Analysis, Research Methods, Sample Data, Statistical Tests, Categorical Data, Numeric Variables, Data Visualization, Hypothesis Testing, Correlation