Statistics 101: Introduction to the Chi-square Test

Brandon Foltz
31 Jul 2012 · 37:39

Summary

TLDR: In this video, we explore the basics of the Chi-Square test, a crucial tool in hypothesis testing. Designed for beginners in statistics, the video explains key concepts like random variation, expected versus observed frequencies, and interpreting results. The instructor uses real-world data and visual aids such as graphs to clarify these concepts. A simple example involving a fair versus loaded die illustrates the Chi-Square test step-by-step. The video emphasizes understanding data visually and sets the stage for a more complex problem to be solved in the next session.

Takeaways

  • 📚 The video series is designed to introduce basic statistics concepts, particularly for those new to the subject or in need of a review.
  • 🗣️ The presenter emphasizes the correct pronunciation of 'Chi-square' as 'Kai Square', like 'kite', to avoid common mispronunciations.
  • 📈 The video discusses the use of various graphs such as line graphs, stacked bar charts, stacked percentage bar charts, stacked area charts, stacked percentage area charts, and spider diagrams to visualize and understand data better.
  • 🔍 The chi-square test is introduced as a method to determine if observed data varies significantly from expected data, which can indicate more than just random chance at play.
  • 🎯 The presenter sets up a hypothetical scenario involving a fair and a loaded die to illustrate how the chi-square test works in practice.
  • ⚖️ The concept of 'null hypothesis' (H0) and 'alternative hypothesis' (H1) is explained, where H0 assumes no significant difference (e.g., the die is fair), and H1 assumes there is a significant difference.
  • 📉 The video explains the process of calculating the chi-square statistic: take the difference between each observed and expected frequency, square it, divide by the expected frequency, and sum the results.
  • 📊 The importance of the critical chi-square value is highlighted, which serves as a threshold to determine whether to reject the null hypothesis based on the calculated chi-square statistic.
  • 🔢 The degrees of freedom, a key component in the chi-square test, are discussed, and in the context of the die example, they are simply one less than the number of categories (6 − 1 = 5).
  • 💯 The impact of the chosen P-value threshold (the significance level) on the strictness of the test is demonstrated, showing how a more stringent threshold (e.g., 0.01 vs. 0.05) increases the critical chi-square value needed to reject the null hypothesis.
  • 🔑 The video concludes with a reminder that the next installment will apply the concepts learned to a more complex data set involving student enrollment data over five years.
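The calculation described in the takeaways above can be written compactly. This is the standard Pearson chi-square statistic; the symbols O_i, E_i, and k are notation introduced here and are not taken from the video.

```latex
\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}, \qquad df = k - 1
```

Here O_i is the observed count in category i, E_i is the expected count, and k is the number of categories (six for a die, so df = 5).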

Q & A

  • What is the purpose of the video series on basic statistics?

    -The purpose of the video series is to introduce and explain basic statistical concepts, particularly aimed at individuals who are new to statistics or need to review foundational ideas.

  • Why does the speaker prefer using the word 'stats'?

    -The speaker prefers using 'stats' because it has fewer 'S's and 'T's, reducing the likelihood of tripping over their own tongue while speaking, which they admit happens often.

  • What is the primary focus of the video on the chi-square test?

    -The video focuses on introducing the chi-square test, explaining its common misunderstandings, and demonstrating how to perform a simple chi-square test step by step.

  • What type of data is the speaker analyzing in the video?

    -The speaker is analyzing data on the number of undergraduate students at different class levels (freshman, sophomore, junior, senior, and unclassified) over a 5-year period at a regional university.

  • What is the main question the speaker is trying to answer with the data?

    -The main question is whether the variation in student headcount over the 5-year period is beyond what would be expected due to chance alone.

  • What types of graphs are discussed in the video to visualize data?

    -The types of graphs discussed include simple line graphs, stacked bar charts, stacked percentage bar charts, stacked area charts, stacked percentage area charts, and spider or radar diagrams.

  • What does the speaker notice about the junior and senior class levels in the data?

    -The speaker notices that the headcount for junior and senior class levels, as well as the unclassified students, has increased significantly over the 5-year period.

  • What is the correct pronunciation of 'chi-square' according to the speaker?

    -The correct pronunciation is 'Kai Square', with the first syllable sounding like the start of 'kite', not like the 'ch' in 'cheetah' or 'chai'.

  • What are the two categorical variables in the dice experiment presented in the video?

    -The two categorical variables in the dice experiment are the fairness of the die (fair or loaded) and the outcome of the dice rolls (numbers 1 through 6).

  • How does the speaker describe the relationship between the chi-square test and the observed versus expected data?

    -The speaker describes the chi-square test as a tool to help understand the relationship between two categorical variables by comparing the observed data (actual outcomes) with what is expected (theoretical outcomes), and determining if the variation is due to random chance or something else.

  • What is the significance of the P value in the context of the chi-square test?

    -The P value chosen for the test sets how much variation is tolerated before the null hypothesis is rejected. A lower P value raises the threshold for rejection, so the observed data must differ more from the expected data before the difference is treated as more than chance.

  • How does the speaker explain the concept of 'degrees of freedom' in the chi-square test?

    -In the context of the dice experiment, the degrees of freedom are explained as the number of categories minus one, which in this case is 6 (the six sides of the die) minus 1, equaling 5.

  • What is the null hypothesis in the dice experiment?

    -The null hypothesis in the dice experiment is that the die is fair, meaning that each roll has an equal chance of resulting in any of the six numbers.

  • What does the speaker conclude about the die based on the chi-square test results?

    -Based on the chi-square test results, the speaker concludes that the die is not fair, as the observed frequencies of the numbers differ significantly from what would be expected on a theoretically fair die.

  • What is the effect of changing the P value on the chi-square critical value?

    -Changing the P value affects the chi-square critical value. A lower P value results in a higher critical value, making it more difficult to reject the null hypothesis because the threshold for considering the variation as not due to chance is higher.

Outlines

00:00

📚 Introduction to Basic Statistics and Chi-Square Test

The speaker introduces a video series on basic statistics, clarifying that 'stats' will be used for ease of pronunciation. The videos are aimed at beginners or those needing a review. The first topic is the Chi-Square test, often misunderstood, and the speaker plans to set up a complex problem to be solved in a subsequent video. The Chi-Square test will be introduced with a simple example before tackling the complex problem. The context involves analyzing changes in undergraduate student headcount at a university over five years, with the goal of determining if observed variations are due to chance. The speaker emphasizes the natural random variation in such data and sets the stage for using graphs and the Chi-Square test to analyze it.

05:02

📊 Exploring Data Visualization Techniques

This paragraph delves into various data visualization methods to better understand the student headcount data. The speaker extols the virtues of graphs for their ability to enhance data comprehension. The options discussed include simple line graphs, stacked bar charts, stacked percentage bar charts, stacked area charts, stacked percentage area charts, and spider or radar diagrams. Each visualization technique offers a unique perspective on the data, from tracking changes over time to comparing proportional enrollments and relative percentages. The speaker provides examples of how these methods can reveal insights into the data, such as the increasing headcount of juniors, seniors, and unclassified students compared to freshmen and sophomores.
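A stacked percentage chart plots each class level's share of that year's total rather than its raw count. The sketch below shows that conversion. The headcounts are made-up placeholders chosen only to resemble the scale described in the video, not the university's actual data, and `percent_shares` is a hypothetical helper name.

```python
def percent_shares(counts):
    """Convert raw headcounts per class level into percentages of the yearly total."""
    total = sum(counts.values())
    return {level: 100.0 * n / total for level, n in counts.items()}

# Hypothetical fall headcounts for a single year (NOT the data from the video).
fall_counts = {
    "Freshman": 550,
    "Sophomore": 300,
    "Junior": 250,
    "Senior": 280,
    "Unclassified": 70,
}

shares = percent_shares(fall_counts)
for level, pct in shares.items():
    print(f"{level:<13}{pct:5.1f}%")
```

Because every year's shares sum to 100%, this view hides any growth in total enrollment, so it is best read alongside the raw-count charts.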

10:03

🎲 Understanding the Chi-Square Test with a Dice Experiment

The speaker introduces the Chi-Square test with a dice experiment to illustrate the concept in a relatable way. The test is used to examine the relationship between two categorical variables, such as the outcome of dice rolls. The experiment involves rolling a die 100 times a day for six days, recording the frequency of each number. The expected outcome is that each number would appear 100 times if the die is fair. The Chi-Square test will compare the observed frequencies with the expected frequencies to determine if the variation is due to random chance or if it suggests the die is loaded. The explanation includes setting up a null hypothesis (the die is fair) and an alternative hypothesis (the die is not fair), and it touches on the concepts of P values and degrees of freedom, although it does not delve into their technical definitions.

15:05

🔢 Calculating the Chi-Square Statistic

The speaker provides a step-by-step guide to calculating the Chi-Square statistic using the observed and expected frequencies from the dice experiment. The process involves subtracting the expected frequency from the observed frequency, squaring the result, and then dividing by the expected frequency. The outcomes of these calculations are then summed to obtain the Chi-Square value. The speaker emphasizes the simplicity of the math involved and explains that the final Chi-Square value will be used to make a statistical conclusion about the fairness of the die.
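The steps described above (subtract, square, divide by expected, sum) can be sketched in a few lines. The observed counts below are invented for illustration and are not the rolls recorded in the video; `chi_square_statistic` is a hypothetical function name.

```python
def chi_square_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical counts for faces 1..6 over 600 rolls (NOT the video's data).
observed = [115, 97, 91, 101, 110, 86]
# A fair die spreads the 600 rolls evenly: 600 / 6 = 100 per face.
expected = [sum(observed) / len(observed)] * len(observed)

statistic = chi_square_statistic(observed, expected)
print(f"chi-square = {statistic:.2f}")  # prints "chi-square = 6.12"
```

These placeholder counts give 6.12, not the 12.26 reported for the video's own data, which underlines that they are illustrative only.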

20:06

🎯 Interpreting the Chi-Square Test Results

The paragraph explains how to interpret the Chi-Square test results by comparing the calculated Chi-Square value with a critical value obtained from the Chi-Square distribution. If the calculated value exceeds the critical value, the null hypothesis is rejected, suggesting the die is not fair. The speaker uses an example with a Chi-Square value of 12.26 and a critical value of 11.07, leading to the rejection of the null hypothesis. The explanation includes the impact of the P value on the strictness of the test, with a lower P value requiring greater observed variation to reject the null hypothesis. The speaker also demonstrates how changing the P value from 0.05 to 0.01 increases the critical value, thus affecting the conclusion of the test.
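The decision rule described above can be sketched as a simple comparison. The statistic 12.26 and the critical value 11.07 come from the summary; the 0.01 critical value of 15.09 is the standard table value for 5 degrees of freedom and is supplied here as an assumption about what the video shows. `decide` is a hypothetical helper name.

```python
def decide(statistic, critical_value):
    """Reject H0 only if the statistic exceeds the critical value."""
    return "reject H0" if statistic > critical_value else "fail to reject H0"

statistic = 12.26  # chi-square value quoted in the summary

# Critical values for 5 degrees of freedom.
# 11.07 is quoted in the summary; 15.09 is the usual table value (assumed here).
critical_values = {0.05: 11.07, 0.01: 15.09}

decisions = {level: decide(statistic, cv) for level, cv in critical_values.items()}
for level, decision in decisions.items():
    print(f"P = {level}: critical value {critical_values[level]} -> {decision}")
```

The same statistic leads to opposite conclusions at the two levels, which is the sense in which the stricter level demands more observed variation.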

25:07

📘 Recap and Preview of Upcoming Video Content

In conclusion, the speaker summarizes the Chi-Square test, emphasizing its purpose for analyzing the relationship between two categorical variables and comparing observed data with expected outcomes to determine if variations are due to random chance. The speaker also previews the next video, which will apply the Chi-Square test to the university enrollment data introduced earlier. The goal of the next video will be to assess whether the variations in student headcount can be attributed to random chance or if there are other factors at play.

Keywords

💡Statistics

Statistics refers to the branch of mathematics that deals with the collection, analysis, interpretation, presentation, and organization of data. In the context of the video, statistics is the overarching theme, as the script discusses various statistical concepts and methods, such as hypothesis testing and the chi-square test, to analyze and interpret data, particularly in the field of institutional research.

💡Chi-Square Test

The chi-square test is a statistical test used to determine if there is a significant difference between the expected frequencies and the observed frequencies in a dataset. The video provides an in-depth introduction to this test, explaining its purpose, how it is conducted, and how to interpret its results. The test is used to analyze the relationship between two categorical variables, as demonstrated in the script with an example involving dice rolls.

💡Hypothesis Testing

Hypothesis testing is a statistical method used to make decisions about populations based on sample data. The video script introduces the concept of hypothesis testing by explaining the null hypothesis (H0) and the alternative hypothesis (H1). It uses the example of a fair versus a loaded die to illustrate how to set up and test hypotheses, and how to interpret the results to determine if the observed data is significantly different from what would be expected by chance.

💡Observed Frequency

Observed frequency refers to the actual data collected during an experiment or study. In the video, observed frequency is used in the context of the chi-square test, where the script describes how to calculate the difference between the observed frequencies of dice rolls and the expected frequencies to determine if the die is fair or loaded.
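As a rough illustration of where observed frequencies come from, the sketch below simulates 600 rolls of a fair die and tallies each face. The simulation stands in for the physical rolls described in the video and does not reproduce its data; `tally_rolls` is a hypothetical helper name.

```python
import random
from collections import Counter

def tally_rolls(rolls):
    """Count how many times each face 1..6 appears, including faces never rolled."""
    counts = Counter(rolls)
    return {face: counts.get(face, 0) for face in range(1, 7)}

# Simulate 600 rolls of a fair die; the fixed seed only makes the run repeatable.
rng = random.Random(0)
rolls = [rng.randint(1, 6) for _ in range(600)]

observed = tally_rolls(rolls)
for face, count in observed.items():
    print(f"face {face}: {count}")
```

Even with a fair simulated die the six counts will rarely all equal 100, which is the natural random variation the test has to account for.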

💡Expected Frequency

Expected frequency is the frequency of an event that would be expected if the null hypothesis were true. The script explains how to calculate expected frequencies for each category in a dataset, using the assumption that a die is fair, and each number should appear 100 times out of 600 rolls as an example.
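The arithmetic behind the fair-die expectation can be written out as a short worked equation. The symbols N (total rolls) and p_i (probability of face i under the null hypothesis) are notation introduced here for illustration.

```latex
E_i = N \cdot p_i = 600 \cdot \tfrac{1}{6} = 100 \quad \text{for each face } i = 1, \dots, 6
```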

💡Degrees of Freedom

Degrees of freedom (DF) is a statistical concept that represents the number of values that can vary freely in a set of data. In the video, the degrees of freedom are used in the context of the chi-square test, where the script explains that for a simple chi-square test with six categories, the degrees of freedom would be 5 (6 - 1).

💡Critical Value

A critical value is a value of a test statistic that determines the threshold for rejecting the null hypothesis in a statistical test. The video script discusses how to find the chi-square critical value using Excel and explains that if the calculated chi-square value exceeds this critical value, the null hypothesis is rejected.
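The summary says the critical value is looked up in Excel but does not show the formula. As a rough equivalent, the sketch below finds it using only Python's standard library: `chi2_sf` evaluates the upper-tail probability through a series for the incomplete gamma function, and `chi2_critical` bisects for the value whose upper tail equals the chosen level. Both function names are hypothetical, and this is one way to reproduce the lookup, not the presenter's method.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) for a chi-square variable with df degrees of freedom."""
    if x <= 0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    # Series for the regularized lower incomplete gamma function P(a, z).
    term = 1.0 / a
    total = term
    n = 1
    while term > 1e-15 * total:
        term *= z / (a + n)
        total += term
        n += 1
    lower = total * math.exp(-z + a * math.log(z) - math.lgamma(a))
    return 1.0 - lower

def chi2_critical(alpha, df):
    """Value c with P(X >= c) = alpha, found by bisection (assumes 0 < alpha < 1)."""
    lo, hi = 0.0, float(df) + 10.0
    while chi2_sf(hi, df) > alpha:  # widen the bracket until it contains c
        hi *= 2.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if chi2_sf(mid, df) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

crit_05 = chi2_critical(0.05, 5)
crit_01 = chi2_critical(0.01, 5)
print(f"df = 5, 0.05 level: {crit_05:.2f}")
print(f"df = 5, 0.01 level: {crit_01:.2f}")
```

For 5 degrees of freedom the 0.05 result should land near the 11.07 quoted in the summary, and the 0.01 result should come out higher.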

💡P-Value

The p-value is the probability that the observed results (or something more extreme) would occur if the null hypothesis were true. The script uses the term for the threshold chosen before the test (the significance level), and explains how changing that threshold affects the critical value and the decision to reject or fail to reject the null hypothesis.
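To make the definition above concrete, the sketch below computes the upper-tail probability for the statistic quoted in the summary (12.26 with 5 degrees of freedom). It uses a closed form that holds for odd degrees of freedom and is written for df = 5 only; `chi2_sf_df5` is a hypothetical function name, and the video itself compares against a critical value rather than computing this number.

```python
import math

def chi2_sf_df5(x):
    """Upper-tail probability P(X >= x) for a chi-square variable with 5 degrees of freedom.

    Uses the closed form available for odd degrees of freedom; valid for df = 5 only.
    """
    if x <= 0:
        return 1.0
    root = math.sqrt(x)
    tail = math.erfc(math.sqrt(x / 2.0))
    correction = math.sqrt(2.0 / math.pi) * math.exp(-x / 2.0) * (root + root ** 3 / 3.0)
    return tail + correction

p_value = chi2_sf_df5(12.26)  # 12.26 is the statistic quoted in the summary
print(f"p-value for chi-square = 12.26, df = 5: {p_value:.4f}")
```

The result falls between 0.01 and 0.05, which matches the summary: the null hypothesis is rejected at 0.05 but would not be at 0.01.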

💡Graphs and Data Visualization

Graphs and data visualization are methods used to represent data in a graphical format to make it easier to understand and interpret. The video script discusses various types of graphs, such as line graphs, bar charts, and spider diagrams, and how they can be used to visualize changes in student headcount over time, helping to understand patterns and variations in the data.

💡Categorical Variables

Categorical variables are variables that can take on one of a limited, and usually fixed, number of possible values, assigning each individual or case to a particular group or category. The script emphasizes that the chi-square test is used to analyze the relationship between two categorical variables, such as class levels (freshman, sophomore, etc.) and years in the context of student enrollment data.

Highlights

Introduction to the Chi-Square test, a fundamental statistical method for hypothesis testing.

Explanation of the pronunciation and spelling of 'Chi-Square' to avoid common mistakes.

Overview of the Chi-Square test's purpose in understanding the relationship between two categorical variables.

Description of the test's application in comparing observed data with expected outcomes.

Introduction of a hypothetical scenario involving student enrollment data at a university to illustrate the test.

Discussion on the natural random variation in data and how Chi-Square helps determine if observed variation is beyond chance.

Presentation of various graphing options like line graphs and bar charts to visualize data effectively.

Analysis of student headcount data over five years using different graphical representations.

Explanation of how to calculate the Chi-Square statistic step by step using a dice-rolling experiment.

Clarification of statistical terms such as 'null hypothesis' and 'alternative hypothesis' in the context of the test.

Importance of the P-value in determining the level of confidence in the test results.

Calculation of the Chi-Square critical value using Excel for making statistical inferences.

Interpretation of the Chi-Square test result comparing the test statistic to the critical value.

Impact of changing the P-value on the stringency of the test and its critical value threshold.

APA format guidelines for reporting Chi-Square test results in academic writing.

Summary of the Chi-Square test process from hypothesis formulation to conclusion drawing.

Preview of the next video's content, which will involve a more complex application of the Chi-Square test to enrollment data.

Transcripts

play00:01

[Music]

play00:08

hello and welcome to my video series on

play00:11

basic statistics now two notes before we

play00:13

get going number one I will most often

play00:17

just use the word stats there are fewer

play00:20

S's and T's crammed together in stats

play00:23

and therefore I am less likely to trip

play00:25

over my own tongue which happens often

play00:28

number two these videos are geared

play00:30

towards individuals who are relatively

play00:32

new or just need to review the basic

play00:36

concepts in stats so if you have

play00:39

advanced study in quantitative methods

play00:42

these videos are probably a bit below

play00:44

what you would need also if you do have

play00:47

advanced study in quantitative methods

play00:49

just keep in mind that I am simplifying

play00:52

some of the concepts for those who are

play00:53

new to the topic so all that being said

play00:56

let's go ahead and dive right

play00:58

in

play01:01

in this video we will be doing an

play01:03

introduction to the chi-square test this

play01:06

is one of the most often

play01:08

misunderstood tests in hypothesis

play01:10

testing so we're going to do a couple

play01:12

things we're going to set up a more

play01:15

complex problem that we will actually

play01:16

solve in the next video but we'll talk

play01:19

about it in this video after we talk

play01:22

about that data we'll look at some

play01:23

graphs that help us understand that data

play01:25

better and then we will actually do a

play01:27

simple chi-square test

play01:30

step by step so you can see exactly how

play01:33

the numbers are calculated which are

play01:35

actually fairly simple and then how we

play01:37

interpret that so let's go ahead and

play01:40

talk about our

play01:43

problem now again this problem will be

play01:45

solved in the next video in this video I

play01:48

will actually be doing a simple example

play01:49

that we will then apply to this more

play01:51

complex problem so we'll see this more

play01:54

again in part

play01:55

two so you work in the office of

play01:58

institutional research at a small but

play02:00

growing regional 4-year university over

play02:04

the past 5 years the number of

play02:06

undergraduate students at each level so

play02:09

freshman sophomore Junior and senior and

play02:11

then we have unclassified students which

play02:13

are sometimes high school students or

play02:16

others um has changed so we have had

play02:20

variation in our student headcount over

play02:23

this 5-year

play02:25

period now here are our questions now

play02:28

even though some headcount random

play02:31

variation is

play02:33

inevitable is that variation beyond what

play02:37

we would expect due to chance alone now

play02:40

there is a lot packed into that question

play02:42

and I want to explain a couple of things

play02:44

just out in the world when we count the

play02:47

occurrences of things if we count an

play02:51

occurrence maybe today and then we count

play02:53

the same thing tomorrow and then we

play02:56

count the same thing the day after that

play02:58

and the day after that there's going to

play03:00

be a natural random variation in the

play03:04

number that we count so maybe I go out

play03:08

to a busy stoplight and I count the

play03:10

number of cars that go through it during

play03:12

a 15-minute interval say during rush hour

play03:16

well if I do the same thing tomorrow I'm

play03:18

going to get a different number if I do

play03:20

the same thing the day after that I'm

play03:22

probably going to get a different number

play03:23

but those numbers are probably going to

play03:25

be close together even though they vary

play03:28

a little bit just you know randomly so

play03:33

what we're trying to ask here is is that

play03:36

variation

play03:37

beyond what we would expect just due to

play03:41

the normal random chance

play03:46

variation now what types of graphs can

play03:48

we use to better visualize our

play03:53

data and how can a chi-square test help

play03:56

us rule out that variation due to chance

play04:00

alone so where is the threshold by which

play04:03

we can say wait a minute that change is

play04:06

just beyond what we'd expect due to

play04:08

chance

play04:12

alone so here is our student headcount

play04:16

now this is actual data by the way I did

play04:18

not make this up so we have the years

play04:21

2007 through

play04:22

2011 then we have the class levels

play04:25

freshman sophomore junior senior and

play04:26

then we have the unclassified and this

play04:29

tracks the student headcount or

play04:31

enrollment you can think of it during

play04:33

the fall semester of each one of these

play04:35

years so go ahead and take a look at

play04:38

that and we'll talk about what we

play04:42

see but one thing I noticed is that for

play04:45

Junior and senior the headcount goes up

play04:48

quite a bit over that amount of time and

play04:50

the same thing for unclassified if I

play04:52

look at freshman and sophomore it kind

play04:55

of goes up and down there is no real you

play04:58

know pattern as far as straight up or

play05:01

straight

play05:05

down now let's go ahead and do some

play05:07

graphing options so we can visualize

play05:09

this data

play05:10

better now one of my credos is graphs

play05:14

are your friend graphs are awesome use

play05:18

more

play05:20

graphs take advantage of our ability to

play05:23

understand data visually a column of

play05:26

numbers is one thing but making it

play05:29

visual is a whole another thing and that

play05:31

can really help with your classmates or

play05:33

your instructor or your co-workers or

play05:37

your Dean or whoever else you might be

play05:39

presenting this to take advantage of our

play05:41

ability to understand data visually so

play05:44

in this problem we're going to consider

play05:46

simple line

play05:47

graph a stacked bar

play05:50

chart a stacked percentage bar chart a

play05:54

stacked area

play05:55

chart a stacked percentage area chart

play05:59

and then a spider or radar diagram and

play06:03

we're going to talk about what each one

play06:04

does for us as far as interpreting or

play06:07

understanding our data

play06:09

better so here's our simple line graph

play06:12

and we have our five class levels over

play06:15

on the right hand side denoted by each

play06:17

line now as you can see if you start at

play06:19

the bottom the special uh category seems

play06:23

to go up over time the junior which is

play06:27

the green line goes up the senior line

play06:31

which is the purple goes up over time

play06:34

but then we have sophomore which kind of

play06:35

goes up and down and then the same thing

play06:37

for the Freshman it starts up goes down

play06:40

and comes up again and then kind of goes

play06:41

down again so when we talk about

play06:44

variation in our data across the years

play06:48

this is what we're talking about now

play06:50

because of the natural way enrollments

play06:52

work and other things in you know in

play06:54

society and nature work there's going to

play06:56

be some natural random variation we

play06:58

cannot expect unless we do some

play07:00

serious quota filling to have the exact

play07:03

same enrollment every year so we're

play07:06

going to have natural variation now what

play07:09

we're trying to figure out is is that

play07:11

variation within what we would expect

play07:14

by just sort of random chance

play07:19

alone now here is a stacked bar chart

play07:22

which is another way of looking at our

play07:23

data So within each year we have the

play07:27

number of students in each class level

play07:30

and then they're stacked on top of each

play07:32

other so what does this do for us you

play07:34

know that's new well it helps us see

play07:38

proportional enrollment so you can see

play07:41

that the Freshman which is there in the

play07:43

blue bar not only can we see its pattern

play07:45

over time you know down a little bit up

play07:48

and then down a little bit but we can

play07:50

also see its size relative to the other

play07:53

class levels so it seems to be about

play07:55

twice as much as the sophomore which is

play07:57

about twice as much as the junior

play08:00

depending on the year you're looking at

play08:02

and then Senior and our special category

play08:05

so it helps us see relative in this case

play08:08

enrollments you know as one class level

play08:10

is compared to

play08:14

another now here is our stacked

play08:16

percentage chart now of course in this

play08:19

case each class level is described by

play08:23

the percentage of the total enrollment

play08:26

it occupies for any given year so if we

play08:29

look in 2007 we can see that freshmen

play08:33

were approximately

play08:35

38% of our total undergraduate

play08:39

enrollment or headcount in 2008 that

play08:42

went down to about maybe

play08:44

33% by the time we got to 2011 we're

play08:47

almost all the way down to

play08:49

30% now of course the entire enrollment

play08:52

takes up 100% so it's all relative here

play08:56

again and you can see that the special

play09:00

color there at the very top gets bigger

play09:02

so it takes up a larger percent of our

play09:04

enrollment same thing for seniors it

play09:08

appears and

play09:10

juniors and then sophomores seem to

play09:13

narrow in their percentage as we go

play09:15

across time so this helps us look at

play09:18

relative percent for each

play09:23

year now here is a stacked area chart

play09:26

now this is very similar to the Stacked

play09:29

bar chart except we take that data and

play09:32

we go all the way across the graph with

play09:34

it so this tells us a few things if we

play09:37

look at the very top of the graph we can

play09:39

see that it increases so what we can say

play09:42

is that our overall head count our

play09:44

overall student enrollment increased

play09:48

over this time period from you know in

play09:50

the mid

play09:51

1400s up to above

play09:54

1,600 now as far as the individual bands

play09:57

of course those represent each

play09:59

class level so if we look the senior the

play10:02

purple seems to widen over time the

play10:06

junior the green seems to widen over

play10:08

time the sophomore level seems to narrow

play10:11

a bit and the Freshman seems to narrow a

play10:15

bit so again with this visual

play10:17

information we can sort of make you know

play10:20

some ideas in our head about how this

play10:23

has changed over time it seems that our

play10:27

freshman and sophomore enrollments went

play10:30

down a bit but our junior and senior and

play10:34

special category enrollments went up

play10:37

over this time and actually with this

play10:39

University there's a I have a hypothesis

play10:42

at least why that happened but maybe

play10:44

we'll talk about that in the next

play10:47

video now here is our stacked percentage

play10:50

area chart and this is very similar to

play10:53

our stacked percentage bar chart where

play10:56

each band represents a percentage of the

play10:58

total so again you can see that overall

play11:01

The Freshman seems to narrow as a

play11:03

percentage same thing with the

play11:06

sophomore percentage and the junior you

play11:10

got to look at two things here does it

play11:12

get wider and sort of its direction so

play11:15

it does seem to get wider and then same

play11:18

thing with the senior there in the

play11:20

purple and the special category there on

play11:23

top so what can we say about this

play11:27

well it seems that our freshman and

play11:29

sophomores

play11:30

combined are taking up a smaller

play11:33

percentage of our overall undergraduate

play11:35

enrollment and then our junior senior

play11:39

and special categories are taking up a

play11:41

larger percent over this time so again

play11:44

we have variation in our enrollment but

play11:46

our question is is it within just random

play11:50

chance variation or is there something

play11:52

else going

play11:55

on now the last one we're going to look

play11:57

at is the spider diagram and again this

play11:59

is an often underused diagram that I

play12:02

think can be very helpful now of course

play12:05

each grade level is represented by a

play12:07

different color and in the center we can

play12:10

call that a hub kind of like the Hub of

play12:12

a wheel if you'd like and then radiating

play12:15

out are the years so each spoke coming

play12:17

out of that Center is a year so 2007 08

play12:21

09 2010 and

play12:24

2011 then of course we plot the number

play12:27

of students in each grade level along

play12:29

that spoke now what does this tell us

play12:32

new well if you notice as we swing

play12:35

around the spider diagram as we get

play12:39

towards 2010 and

play12:41

2011 we have like a bulge in the special

play12:45

the juniors and the seniors so the

play12:48

lighter blue the green and the purple

play12:50

and if you remember other graphs that

play12:53

was apparent in our areas in stacked

play12:55

bars because the Juniors the seniors and

play12:58

the special category were becoming a

play13:00

greater part of our overall

play13:02

enrollment then if you look at the red

play13:04

it doesn't change a whole

play13:06

lot over time sometimes it comes in a

play13:10

bit and goes back out and comes in a bit

play13:11

and goes back out but not by a whole lot

play13:14

and then the blue starts way out almost

play13:18

all the way to 600 students comes back

play13:20

in to 500 in 2008 goes back out to the

play13:24

middle stays in the middle and then

play13:26

comes back in again at 2011

play13:29

so it helps us sort of see bulges in our

play13:32

spider or radar diagram to see

play13:34

where our changes have

play13:39

been okay so let's actually get to the

play13:41

heart of this video and that is what is

play13:43

a chi-square

play13:46

test now first and

play13:49

foremost make sure you pronounce it

play13:52

[Music]

play13:53

correctly it is Kai Square as in kite

play13:57

not ch as in cheetah not a chi

play14:01

square or

play14:04

chai as in chai tea it's not chai square

play14:09

it is Kai Square so I've been in about

play14:12

10 different stats classes between

play14:14

undergraduate and all my graduate work

play14:17

every class every one of them someone

play14:20

has said I don't understand the chi

play14:23

square or I have a question about the

play14:26

chai square it's Kai Kai Square as in

play14:30

kite so don't be that person in your

play14:33

class okay now what does it do it helps

play14:36

us understand the relationship between

play14:39

two

play14:40

categorical variables and that's very

play14:43

important they have to be categorical

play14:45

variables so what do I mean by that well

play14:48

grade level that's one example in this

play14:50

case so we have freshman sophomore

play14:52

junior senior and then special or

play14:55

unclassified um sex male or female if we

play14:59

think of it as a binary

play15:01

category um age group so we could have

play15:04

you know you've probably seen them are

play15:06

you in the age group 18 to 25 or 26 to

play15:10

35 or 36 to 45 whatever so those are

play15:14

categories if they're put in

play15:17

groups years of course we have years in

play15:20

this example Etc so the important thing

play15:22

here is it has to be categorical

play15:27

variables now Kai squares involve the

play15:30

frequency of events or the count so

play15:34

we're only dealing with counting things

play15:37

we are counting members of these

play15:40

categories we're not dealing in percents

play15:43

we're not dealing in anything like that

play15:45

we are dealing in frequencies

play15:49

counting now it helps us compare what we

play15:52

actually

play15:53

observed with what we

play15:57

expected okay observed versus

play16:00

expected often times using population

play16:04

data and I don't want to go into all

play16:07

that right now but you know that's every

play16:10

member of a certain category we denote

play16:13

so that's a population or theoretical

play16:17

data and actually when we do our example

play16:19

we're going to be using a theoretical

play16:21

data event I guess you could

play16:24

say now Kai squares assist us in

play16:27

determining the role of random chance

play16:30

variation between these categorical

play16:33

variables so the relationship is going

play16:36

to change but the question is is that

play16:39

change within a certain limit we set

play16:43

that would account for just random

play16:47

variation and finally we use the Kai

play16:50

Square distribution now if that just

play16:52

went over your head do not worry for

play16:55

this video it's not important just know

play16:58

that we use it and I put it in here

play16:59

just to be technically correct and

play17:02

within that we use what's called a

play17:03

critical value which I'll explain here

play17:05

in a little bit to accept or

play17:08

reject our

play17:10

hypothesis okay so if all this talk

play17:13

about hypotheses and Kai Square

play17:16

distributions or critical values has

play17:18

your mind going in crazy directions

play17:19

right now don't worry in the example

play17:22

we're going to do is going to be so

play17:23

Crystal Clear step by step that uh

play17:26

you'll have it down pat
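As a reference point before the example, the pieces named so far (observed counts, expected counts, and a critical value) can be written compactly. The symbols are notation assumed here for illustration and are not taken from the slides: O_i and E_i are the observed and expected counts in category i, k is the number of categories, and alpha is what the video calls the P value.

```latex
\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i},
\qquad
\text{reject } H_0 \text{ if } \chi^2 > \chi^2_{\text{crit}}(\alpha,\ df)
```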

play17:30

now just look at our head count changes

play17:32

over time again so we can see we have a

play17:35

couple of categories that seem to go

play17:37

up over time almost in like a very flat

play17:40

straight line and then we have a couple

play17:42

sophomore and freshmen that kind of

play17:44

Bounce all over the place so we have

play17:46

variation and we just want to know are

play17:49

these categories grade level and year

play17:52

the variation that occurs is it due to

play17:55

random chance alone or is there

play17:56

something else going on in this

play18:01

data now in this video we're going to

play18:03

use a very simple experiment it's very

play18:05

common when talking about the Kai Square

play18:08

and that is the dice experiment and I'm

play18:10

going to set it up maybe a little bit

play18:11

differently than other people have so

play18:13

here is our

play18:15

example let's say I have two dice in my

play18:18

hand okay and just in case you maybe are

play18:22

not familiar you know dice are the six-sided

play18:25

squares that are often used in games

play18:27

especially like gambling games and they

play18:30

have you know one through six on each

play18:33

side so let's say I have two dice in my

play18:35

hand one is fair and the other is 5-6

play18:41

loaded that means it favors the numbers

play18:44

five and six due to alterations in its

play18:48

weight so some people that cheat at

play18:52

casinos swap out the

play18:55

actual uh casino dice or dice with

play18:59

weighted dice to get the numbers they

play19:01

want okay so I give you two of them and

play19:04

one is fair and one is

play19:06

loaded now I ask you to determine if

play19:10

it's the fair die or the loaded die I

play19:14

just gave you and I want you to be

play19:17

95% confident in your

play19:21

conclusion now to do that what you're

play19:23

going to do is I'm going to ask you to

play19:24

do is over the next 6 days I want you to

play19:27

roll that die okay 100 times each

play19:31

day for a total of 600 rolls okay and

play19:37

then record how many times each number

play19:41

occurs over those 600 rolls so you're

play19:44

going to do 100 rolls each day for

play19:46

6 days 600 total rolls keeping count a

play19:51

frequency of how many times each number

play19:53

comes

play19:55

up now let's assume

play19:58

okay that the die I gave you is fair

play20:02

let's assume that what would we expect

play20:06

to happen What Would We theoretically

play20:08

expect to happen over these 600

play20:13

rolls now if the die is fair if we roll

play20:17

it 600 times and we have six numbers on

play20:20

the die we would theoretically expect

play20:24

each number to come up 100 times so

play20:28

six numbers 600 rolls each one has the

play20:33

same probability of coming up so

play20:35

theoretically we would expect 100 of

play20:38

each number to

play20:43

occur so how are we going to state this

play20:45

hypothesis and again this is one of the

play20:47

more complicated slides so just hang

play20:49

with

play20:50

me first we have What's called the null

play20:52

hypothesis and that's represented by H

play20:56

sub zero so if you've been in a stats class

play20:58

you've probably seen something like this

play21:01

now our null hypothesis is that the die

play21:04

is

play21:06

fair then we have our alternative

play21:10

hypothesis which is denoted by H sub

play21:13

one and our alternative hypothesis is

play21:16

that the die is not

play21:20

fair okay so we have the null that says

play21:22

the die is fair we have the alternative

play21:24

that says the die is not fair pretty

play21:27

straightforward
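The expected counts under the null hypothesis can be sketched in a few lines of Python. This is only an illustration of the reasoning above, assuming a six-sided die and 600 rolls as described; the variable names are made up for this sketch.

```python
# Under the null hypothesis (the die is fair) every face is equally likely,
# so the 600 rolls are split evenly across the six faces.
faces = [1, 2, 3, 4, 5, 6]
total_rolls = 600

# Expected count per face: total rolls divided by the number of faces.
expected = {face: total_rolls / len(faces) for face in faces}

for face, count in expected.items():
    print(f"face {face}: expected {count:.0f} rolls")
```

These expected counts are the baseline that the observed counts get compared against later in the example.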

play21:30

now what is the everyday sort of English

play21:32

way of saying

play21:34

this now is the variation in our

play21:37

observed data simply due to

play21:40

chance or is the variation beyond what

play21:45

random chance should

play21:49

allow or how far can our data vary

play21:54

before we have to reject the null

play21:57

hypothesis and conclude that the die

play22:01

is not fair which is our

play22:05

alternative so we're going to have some

play22:07

variation but we need to know if that

play22:10

variation occurs within limits we

play22:14

set now I asked you to be 95% confident

play22:18

so that creates what's called A P value

play22:22

of

play22:25

0.05 so again if that P value concept

play22:28

kind of goes over your head don't worry

play22:30

about it too much now another way of

play22:33

thinking about the P value is what level

play22:37

of Tolerance are we willing to put on

play22:41

this

play22:42

variation if our tolerance is pretty

play22:44

loose we might have a P value of 0.1 or

play22:48

sort of

play22:50

10% if we want the tolerance for the

play22:53

variation in our data to be very narrow

play22:55

we want to be very strict

play22:58

we might choose a P value of

play23:02

0.01 or

play23:04

1% so I've sort of picked the medium which

play23:07

is .05 which is the most

play23:10

commonly used in a

play23:14

lot of social science

play23:16

research okay so degrees of freedom oh

play23:20

goodness this is one of those Concepts

play23:22

that gets flung around in

play23:25

stats classes and never gets explained

play23:28

at least in my experience very well and

play23:31

guess what I'm not going to explain it

play23:32

in this video either now for this

play23:36

test our degrees of

play23:39

freedom DF are simply the number of

play23:42

categories we have which is six we have

play23:44

six numbers minus one okay so in this

play23:48

example just kind of take it as it is

play23:51

that our degrees of freedom are 6 - 1

play23:53

which equals 5 now in the next video

play23:56

when we have more complex categories the

play23:58

degrees of freedom will be figured a bit

play24:00

differently but for this example it's

play24:02

just 6 - 1 or five now we have a concept

play24:05

called the kai Square critical value

play24:09

well what is that well the kai Square

play24:12

critical value is sort of um the

play24:16

threshold it is the point where we just

play24:20

have to conclude that our variation is

play24:22

too great to be explained by chance

play24:25

alone and therefore we'd have to reject

play24:28

our null hypothesis over there so the

play24:31

easiest way to find this actually is in

play24:33

Excel Excel has a built-in function sort

play24:36

of the chi inverse or CHIINV and

play24:40

then you just give it two inputs you

play24:42

give it your P value which in our case

play24:44

is

play24:45

0.05 then you give it your degrees of

play24:47

freedom which in our case is five then

play24:50

it spits back a value a critical value

play24:53

of

play24:56

11.07 so when we do this kind of keep

play24:59

that number in the back of your mind our

play25:01

threshold for our Kai Square critical

play25:03

value is going to be

play25:07

11.07 so what that means is that if our die

play25:11

Kai square is greater than

play25:15

11.07 then we have to reject our null

play25:20

hypothesis and claim that the die is not

play25:23

fair the variation is just beyond what

play25:27

we would expect by normal random chance

play25:31

or normal random variation so if we get

play25:35

a Kai Square that's greater than

play25:37

11.07 we got to throw the null

play25:39

hypothesis in the garbage and just

play25:42

accept the alternative hypothesis which

play25:44

states the die is not fair so let's go

play25:48

ahead and do this step by

play25:51

step okay so here is our expected

play25:55

frequency which we talked about now on

play25:57

the right hand side is what our

play25:58

data actually produced these are our

play26:00

actual observations so 6 days later you

play26:03

come to me and say here are my

play26:04

observations so the number one came up

play26:06

111 times the number two came up 90 the number three 81

play26:10

Etc and of course that adds up to 600

play26:13

total rolls so those are our observed

play26:17

frequencies when we actually did the

play26:22

experiment so here's the first step in

play26:25

figuring out our Kai Square the math is

play26:27

very very easy okay so I know you're

play26:32

smart and you can do it so let's go

play26:33

ahead and just do it step by step the

play26:36

first step is we take our observed our

play26:39

observation minus what we expected so as

play26:42

you can see on the right hand side it's

play26:43

simple subtraction we take our observed

play26:46

which in the first case is 111 minus our

play26:49

expected which was 100 because we

play26:51

expected to be a fair die so 111 minus

play26:55

100 is 11 and then we just do that all

play26:58

the way down that column that's it

play27:00

that's step one observed minus

play27:05

expected in the next step we take that

play27:09

observed minus the expected and square

play27:12

it that's it so for number one remember

play27:16

we had 111 - 100 which was 11 and then

play27:19

in this step we just Square it which is

play27:22

121 then we do that for each of our

play27:26

numbers so step one we subtract step two

play27:30

we Square it that's

play27:34

it now in step three we take what we got

play27:38

in step two which was the

play27:40

squaring and we divide that by what we

play27:45

expected which is e okay so in all of

play27:49

our cases we expected 100

play27:52

this is actually very simple division

play27:54

it's just moving the decimal place so

play27:56

for the number one we had 121 divided by 100

play28:00

that's

play28:00

1.21 for number two it's just well one

play28:05

and for number three we had

play28:08

3.61 and for number four we had

play28:13

0.04 and on down five and six so that's

play28:17

step three just remember Step One is

play28:20

subtraction step two we square that and

play28:23

then step three we divide that by our

play28:26

expected which in this case was 100 so

play28:28

it's very

play28:31

easy now in step four we just add all

play28:35

those up so in our right-hand column

play28:38

again we had 1.21, 1, 3.61, .04 etc we just

play28:43

add all those up that's what the

play28:44

summation sign at the top of the slide

play28:46

means and guess what folks we just did

play28:49

our Kai Square we're

play28:51

done at least with the math

play28:53

part so our Kai Square value for this

play28:56

experiment was

play29:00

12.26 of course that doesn't mean

play29:02

anything yet until we actually interpret

play29:05

it but the kai Square value for this is

play29:10

12.26 now remember our critical Kai

play29:14

Square value was

play29:18

11.07 now guess what 12.26 is greater

play29:22

than

play29:24

11.07 so therefore our die Kai Square

play29:28

value is greater than

play29:31

11.07 so we have to reject our null

play29:34

hypothesis which said the die is fair

play29:37

and claim that the die is in fact not

play29:41

fair so we have to accept our

play29:44

alternative hypothesis because our Kai

play29:47

Square was greater than our critical Kai

play29:50

square based on our Excel

play29:55

formula so how do we interpret that

play29:57

result now if we actually use the APA

play30:00

format which I encourage you to do

play30:02

depending on your discipline of course

play30:04

what we would say is that the observed

play30:05

frequency of each number on the die

play30:08

differed significantly from what would

play30:10

be expected on a theoretically Fair

play30:14

die of course we have our Kai Square

play30:17

Five is our degrees of freedom n is the

play30:19

number of times we rolled the die and

play30:21

that equal

play30:22

12.26 and our P value was

play30:25

.05 so if you were writing this in a

play30:28

journal that's exactly how it would look

play30:31

in APA

play30:32

format now our problem Kai Square was

play30:35

12.26 our critical Kai Square was 11.07

play30:38

which is lower therefore our variation was too

play30:41

great to be explained by chance alone

play30:44

therefore we must reject uh our null

play30:46

hypothesis which is the die is fair and accept

play30:50

H1 which was our alternative hypothesis

play30:54

and say the die is not fair so we are

play30:57

95% confident that you have the loaded

play31:01

die in your

play31:05

hand now I want to talk about just the

play31:08

effect of choosing a P value remember I

play31:10

said the P value was sort of how strict

play31:13

we're willing to be on accepting random

play31:17

variation if we change the P value we're

play31:20

going to sort of change the threshold of

play31:22

what we're willing to accept as random

play31:26

chance now this is the same slide I just

play31:28

changed a few numbers so null hypothesis

play31:30

is still the die is fair alternative is

play31:33

the die is not

play31:34

fair same interpretation okay we're

play31:37

talking about variation

play31:39

here here's what we changed instead of

play31:42

being 95% confident I want you to be

play31:46

99% confident so we're going to have a P

play31:49

value not .05 but now it's going to be

play31:53

0.01 so we're going to be much much more

play31:57

strict on interpreting our variation now

play32:02

what that means is that we're going to

play32:04

need a lot more variation in our

play32:08

observations in order to reject our null

play32:12

hypothesis we're going to need much more

play32:14

variation in our observations to reject

play32:18

our null hypothesis because we've

play32:20

selected a much more strict P value now

play32:25

degrees of freedom are the same now for

play32:27

the kai Square we changed our P value so

play32:30

we're going to have a new critical value

play32:33

so when we put this into Excel we

play32:36

changed that to 0.01 degrees of freedom

play32:38

are five now our critical value is

play32:56

15.09 so we need more variation to go over that threshold because

play33:00

we've picked such a strict P

play33:03

value so therefore if our die Kai square

play33:07

is greater than

play33:25

15.09 [...] on a theoretically fair die and

play33:30

everything there is the same five

play33:31

degrees of freedom 600 rolls our problem

play33:35

Kai Square was 12.26 that didn't change

play33:38

but our P value did so we have a p of

play33:42

.01 so our problem Kai Square was

play33:45

12.26 our critical Kai square with a P

play33:49

value of

play33:50

.01 was now

play33:53

[Music]

play33:55

15.09 so we did not cross that threshold therefore our

play33:59

variation was not too great to be

play34:02

explained by chance alone therefore in

play34:06

this case we have to accept our null

play34:09

hypothesis and conclude the die is fair

play34:14

and we are 99% confident that you have

play34:17

the fair

play34:19

die now wait a

play34:21

minute you have the same die in your

play34:25

hand but in the first case when P was

play34:28

.05 we concluded you had the loaded

play34:32

die now with a p of

play34:36

.01 we conclude you have the

play34:39

fair die what now this is my point on

play34:43

changing the P value or changing the

play34:47

strictness with which we are willing to

play34:50

explain

play34:51

variation so with a p of

play34:54

.05 we did not need as much variation

play34:58

to

play34:59

overcome the critical value as you

play35:02

notice the critical value went up

play35:04

significantly when we changed the P to

play35:07

0.01 so it's just much more strict we

play35:10

have to have more variation to cross

play35:12

that sort of 99% confidence because we

play35:15

selected such a low P

play35:19

value all right just a quick review what

play35:22

is a Kai Square test remember it is Kai

play35:24

Square not chi square or chai square

play35:28

it's Kai as in kite it helps us understand

play35:31

the relationship between two categorical

play35:34

variables that's

play35:36

important Kai squares involve the

play35:38

frequency of events the count so in this

play35:40

case we were counting the number of

play35:41

times each number comes up it helps us

play35:44

compare what we actually observed with

play35:47

what we

play35:49

expected Kai squares assist us in

play35:53

rejecting or ruling out to some extent

play35:56

random chance variations between

play35:58

categorical

play35:59

variables and we use the kai Square

play36:02

distribution which we didn't talk about

play36:03

it's not important to what we're doing

play36:05

um to accept or reject our hypothesis

play36:09

regarding random chance now basically

play36:12

what that means there is that we were

play36:13

able to put into Excel our degrees of

play36:16

freedom and our P value and it generated

play36:19

a critical Kai Square value that we have

play36:22

to surpass in order to reject our null

play36:25

that's all that little thing means on

play36:26

there okay so that's

play36:29

review now just reminder in our next

play36:32

video we will actually be looking at the

play36:34

data we started with so look in this

play36:37

case instead of having a die with six

play36:39

numbers on it we have five class level

play36:43

categorizations so freshman through

play36:46

unclassified then we have five years so

play36:49

we have five grade levels and five years

play36:53

for our categories and then we're going

play36:55

to try to determine are we are going to

play36:57

determine whether or not the variation

play37:00

present in this

play37:01

data can be explained just by you know

play37:04

random chance just the random chance

play37:06

that comes with

play37:10

enrollment all right so that is our

play37:12

in-depth introduction to the kai Square

play37:16

hopefully you learned a lot and I look

play37:18

forward to seeing you again in our next

play37:20

video when we look at that enrollment

play37:22

data in a more complex example again

play37:24

thank you very much for watching I look

play37:26

forward to seeing you again next

play37:28

[Music]

play37:37

time
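The four steps and the two critical values from the dice example can be checked with a short Python sketch. Several things here are assumptions rather than the video's own material. The transcript only spells out the observed counts for faces 1 to 3 (111, 90, 81); the counts used for faces 4 to 6 (102, 124, 92) are assumed values chosen to agree with the stated 0.04 term, the 600 total rolls, and the final statistic of 12.26. The video gets its critical value from Excel; this sketch swaps in a plain bisection search over a series expansion of the chi-square distribution, using only the standard library. All function and variable names are hypothetical.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with df degrees of freedom."""
    if x <= 0:
        return 1.0
    a, half_x = df / 2.0, x / 2.0
    # Series expansion of the regularized lower incomplete gamma function P(a, x/2).
    term = 1.0 / a
    total = term
    n = 0
    while term > total * 1e-15:
        n += 1
        term *= half_x / (a + n)
        total += term
    cdf = total * math.exp(-half_x + a * math.log(half_x) - math.lgamma(a))
    return 1.0 - cdf

def chi2_critical(alpha, df):
    """Value whose upper-tail probability equals alpha, found by bisection."""
    lo, hi = 0.0, 1.0
    while chi2_sf(hi, df) > alpha:  # widen the bracket until it contains the answer
        hi *= 2.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if chi2_sf(mid, df) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Faces 1-3 are from the transcript; faces 4-6 are ASSUMED (see the note above).
observed = [111, 90, 81, 102, 124, 92]
expected = [100] * 6                     # fair die, 600 rolls

differences = [o - e for o, e in zip(observed, expected)]   # step 1: observed minus expected
squared = [d ** 2 for d in differences]                     # step 2: square it
scaled = [s / e for s, e in zip(squared, expected)]         # step 3: divide by expected
statistic = sum(scaled)                                     # step 4: add them all up

df = len(observed) - 1                   # six categories minus one
critical_05 = chi2_critical(0.05, df)
critical_01 = chi2_critical(0.01, df)
reject_05 = statistic > critical_05
reject_01 = statistic > critical_01
p_value = chi2_sf(statistic, df)         # tail probability of the statistic itself

print(f"chi-square statistic: {statistic:.2f}")
print(f"critical value at .05: {critical_05:.2f}, reject null: {reject_05}")
print(f"critical value at .01: {critical_01:.2f}, reject null: {reject_01}")
print(f"tail probability of the statistic: {p_value:.4f}")
```

With these counts the statistic clears the .05 threshold but not the .01 threshold, which is the contrast the video draws between the two confidence levels. Where SciPy is available, `scipy.stats.chi2.ppf` and `scipy.stats.chisquare` cover the same ground with less code.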
