SEM Series (2016) 2. Data Screening

James Gaskin
22 Apr 2016 · 25:08

Summary

TL;DR: This video is a comprehensive guide to data screening in statistical analysis. It covers essential steps like identifying missing data, handling unengaged responses, managing outliers, and checking for skewness and kurtosis. The tutorial takes a practical approach with examples from a dataset, demonstrating how to clean and prepare data for analysis in SPSS. It also discusses the implications of each step and provides tips for reporting the findings in research.

Takeaways

  • πŸ” The video script discusses a systematic approach to data screening, focusing on handling missing data, identifying unengaged responses, and managing outliers in continuous variables.
  • πŸ—‚οΈ The process starts with organizing the dataset by removing unnecessary variables and keeping track of IDs to ensure data integrity.
  • πŸ“Š To detect missing data, the script suggests using Excel to count blank cells and sorting the data to identify rows with a high number of missing values, which may be candidates for removal (a pandas equivalent is sketched after this list).
  • 🚫 The script identifies unengaged responses by looking for patterns such as identical answers across all questions or extremely short survey completion times, indicating a lack of attention from respondents.
  • πŸ“ˆ For continuous variables, the script recommends checking for outliers using scatter plots and replacing extreme values with the mean or median to maintain data integrity.
  • πŸ“‰ The video also covers the importance of dealing with skewness and kurtosis, suggesting the use of conditional formatting in Excel to highlight values that exceed certain thresholds, indicating potential issues with data distribution.
  • πŸ“ It's advised to report the data screening process in a research paper, including details about missing data imputation, removal of unengaged responses, and handling of outliers.
  • ❌ The script emphasizes the need to be cautious when dealing with a high percentage of missing data in a variable, as it can dilute the potency of the variable and affect analysis outcomes.
  • πŸ”’ The process of imputing missing values differs for ordinal and continuous scales, with medians used for ordinal scales and means for continuous scales, to maintain the integrity of the data.
  • ⏱️ The video mentions the use of attention traps and reverse-coded items as strategies to identify unengaged respondents, which can help in cleaning the dataset.
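
As referenced in the missing-data takeaway above, the Excel workflow (=COUNTBLANK plus a largest-to-smallest sort) translates directly to pandas. A minimal sketch, assuming the dataset has been exported to CSV with an ID column; the file name and the 20% cutoff are illustrative assumptions, not from the video:

```python
import pandas as pd

# Load the survey export (hypothetical file name).
df = pd.read_csv("youtube_sem_series.csv")

# Equivalent of Excel's =COUNTBLANK(A1:AR1): missing cells per row,
# excluding the ID column so it never counts as an answer.
df["n_missing"] = df.drop(columns=["ID"]).isna().sum(axis=1)

# Sort largest-to-smallest to surface the worst rows, as in the video.
print(df.sort_values("n_missing", ascending=False)[["ID", "n_missing"]].head())

# The video drops only the two essentially blank rows (42 missing values);
# a 20% cutoff is one reasonable way to formalize that judgment call.
n_items = df.shape[1] - 2  # exclude ID and the helper column itself
df = df[df["n_missing"] <= 0.20 * n_items].drop(columns="n_missing")
```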

Q & A

  • What is the first step in data screening as described in the script?

    -The first step in data screening is to check for missing data in the rows of the dataset.

  • How does the script suggest handling unengaged responses in a survey?

    -Unengaged responses can be detected by visually inspecting data, using attention traps, recording time elapsed for survey completion, employing reverse-coded items, or identifying respondents who give the same response to every question.

  • What is an 'attention trap' in the context of survey data?

    -An 'attention trap' is a question in a survey designed to identify respondents who are not paying attention, such as asking them to select a specific, counterintuitive answer to a straightforward question.

  • How can outliers on continuous variables be identified and addressed according to the script?

    -Outliers on continuous variables can be identified using scatter plots or by calculating the standard deviation of the values. Addressing them may involve removing the outlier if it's due to an erroneous response or imputing a value like the mean or median for the variable.
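
A minimal sketch of the standard-deviation route in pandas, assuming the data sit in a DataFrame with the video's "experience" variable; the |z| > 3 cutoff is a common convention, not something the video prescribes:

```python
import pandas as pd

df = pd.read_csv("youtube_sem_series.csv")  # hypothetical file name

# Standardize the variable and flag cases more than 3 SDs from the mean.
z = (df["experience"] - df["experience"].mean()) / df["experience"].std()
print(df.loc[z.abs() > 3, ["ID", "experience"]])
```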

  • What is the purpose of using reverse-coded items in a survey?

    -Reverse-coded items are used to detect unengaged respondents. They are negatively worded questions where the expected answer should be the opposite of what is typically agreed upon, helping to identify respondents who may not be paying attention.

  • Why is it important to check for missing data before conducting further analysis?

    -Checking for missing data is important because it can affect the validity and reliability of the analysis. Missing data can lead to biased results or require imputation, which should be reported in the study.

  • What is the recommended approach for handling missing data in ordinal scales versus continuous scales?

    -For ordinal scales, the median is recommended for imputation, while for continuous scales, the mean is more appropriate. This distinction is made because ordinal scales do not have actual mean values between the scale points.
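
The same rule sketched in pandas (the video performs it through SPSS's Transform > Replace Missing Values dialog); the column lists echo variables named later in the video but are otherwise assumptions:

```python
import pandas as pd

df = pd.read_csv("youtube_sem_series.csv")  # hypothetical file name

# Median for ordinal Likert items, mean for continuous variables.
for col in ["useful2", "useful3", "useful5", "joy6", "dq1"]:  # ordinal items
    df[col] = df[col].fillna(df[col].median())

df["experience"] = df["experience"].fillna(df["experience"].mean())  # continuous
```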

  • How does the script suggest detecting respondents who may not be engaged in the survey?

    -The script suggests detecting unengaged respondents by looking for those who provide the same response to every question (liners), those who complete the survey in an unrealistically short amount of time, or those who do not correctly answer attention trap questions.

  • What is the significance of checking for skewness and kurtosis in the data?

    -Checking for skewness and kurtosis is important to assess the normality of the data distribution. Extreme values can indicate that the data may not meet the assumptions required for certain statistical tests, potentially affecting the analysis and its interpretation.

  • How should the presence of outliers be reported in a research paper according to the script?

    -The presence of outliers should be reported by noting the specific variables affected, the nature of the outliers (e.g., extremely high or low values), and the actions taken to address them, such as removal or correction of the values.

Outlines

00:00

πŸ“Š Data Screening and Missing Data Analysis

The paragraph discusses the initial steps of data screening, focusing on identifying rows with missing data in a dataset. It explains the process of removing rows with unengaged responses and handling outliers in continuous variables. The speaker demonstrates how to access and organize the data, checking for missing values, and deciding which variables to keep or remove based on their relevance to the study. The importance of keeping a unique identifier for each row is emphasized to maintain data integrity during sorting or other manipulations.

05:01

πŸ”’ Imputing Missing Values and Dealing with Unengaged Responses

This section delves into the process of imputing missing values in the dataset. The speaker explains the difference between ordinal and continuous variables and the appropriate methods for imputing missing data for each type. For ordinal variables, the median is used, while the mean is suitable for continuous variables. The paragraph also addresses the issue of unengaged responses during surveys, suggesting methods to detect and remove such data. The speaker demonstrates how to use SPSS for data manipulation and emphasizes the importance of reporting these steps in research.

10:03

πŸ•΅οΈβ€β™‚οΈ Detecting and Removing Unengaged Responses

The focus of this paragraph is on detecting unengaged responses in survey data. Techniques such as attention traps, time elapsed for survey completion, and reverse-coded items are discussed to identify respondents who may not have paid attention. The speaker also introduces the concept of using standard deviation to find respondents who gave the same response to every question, indicating a lack of engagement. The paragraph concludes with the speaker's decision to remove unengaged responses and the rationale for reporting these actions in research.

15:06

πŸ“‰ Identifying and Handling Outliers in Continuous Variables

The paragraph discusses the identification and handling of outliers in continuous variables such as age and experience. The speaker demonstrates how to use scatter plots to visualize potential outliers and the importance of corroborating extreme values with other data points. The decision-making process for dealing with outliers, including the choice to either correct erroneous responses or remove them, is explored. The speaker also discusses the implications of not having recorded timestamps for additional verification.

20:06

πŸ“‹ Variable Screening, Skewness, and Kurtosis

In this final paragraph, the speaker addresses the last steps of data screening, including checking for skewness and kurtosis in the variables. The process of using descriptive statistics in SPSS to obtain skewness and kurtosis values is demonstrated. The speaker sets a rule for what constitutes problematic values (absolute values greater than three) and uses conditional formatting in Excel to highlight these. The paragraph concludes with a discussion on how to report these findings and the potential actions to take if significant skewness or kurtosis is found, such as watching the variable or removing it from the analysis.

Keywords

πŸ’‘Case Screening

Case screening refers to the process of examining individual cases or rows in a dataset to determine their suitability for analysis. In the context of the video, the speaker discusses the importance of case screening to identify rows with missing data or unengaged responses, which may be candidates for removal from the dataset. This process is crucial for ensuring the quality and reliability of the data analysis.

πŸ’‘Missing Data

Missing data is a common issue in data analysis where some values within a dataset are not available or are incomplete. The video script mentions checking for missing data in rows as a part of the data cleaning process. The speaker demonstrates how to identify and handle missing data, such as by imputing values or removing rows with excessive missing data, to maintain the integrity of the dataset.

πŸ’‘Outliers

Outliers are data points that are significantly different from other observations, indicating potential errors or extreme values. In the video, the speaker discusses identifying outliers on continuous variables, such as age and experience, by visually inspecting data or using statistical methods. Outliers can skew the results of an analysis, so the speaker considers removing or correcting them to ensure accurate insights.

πŸ’‘Ordinal Scales

Ordinal scales are a level of measurement where the data can be ranked but the differences between the ranks are not necessarily equal. The script mentions that for ordinal variables, such as Likert scales, the median is often used for imputation of missing values because it is more appropriate than the mean, which assumes equal intervals between scale points.

πŸ’‘Continuous Variables

Continuous variables are variables that can take any value within a range, as opposed to discrete variables which can only take certain values. The video discusses handling missing data in continuous variables like experience by imputing the mean, which is suitable because these variables can have any value within a range and are not restricted to a set number of ordered categories.

πŸ’‘Imputation

Imputation is the process of estimating and filling in missing data points in a dataset. The speaker in the video uses imputation to deal with missing values by replacing them with the median for ordinal scales and the mean for continuous variables. This technique helps to maintain the completeness of the dataset and is a common practice in data preprocessing.

πŸ’‘Unengaged Responses

Unengaged responses are instances where survey respondents do not pay attention to the questions and provide random or identical answers across all items. The video script describes methods to detect unengaged responses, such as identical responses to all questions or extremely short completion times for a survey. The speaker advises removing these responses to maintain data quality.

πŸ’‘Attention Trap

An attention trap is a question or item intentionally designed to identify unengaged respondents in a survey. The video mentions using an attention trap question with a specific instruction to answer in a certain way, such as 'strongly disagree', to catch respondents who are not paying attention. Those who fail the attention trap are identified as unengaged and may have their data removed.
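
Scoring such a trap is a one-line filter. A minimal sketch, assuming a hypothetical att_check column whose instructed answer was 1 ("strongly disagree"):

```python
import pandas as pd

df = pd.read_csv("youtube_sem_series.csv")  # hypothetical file name

# Respondents who did not give the instructed answer failed the trap.
failed = df["att_check"] != 1  # "att_check" is an assumed column name
print(f"Removing {failed.sum()} respondents who failed the attention trap")
df = df[~failed]
```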

πŸ’‘Skewness and Kurtosis

Skewness and kurtosis are statistical measures that describe the shape of a distribution of data. Skewness measures the asymmetry of the distribution, while kurtosis measures the 'tailedness'. In the video, the speaker checks for skewness and kurtosis to identify any data that might not be normally distributed, which could indicate issues with the data or the need for transformation. The script mentions using these measures as part of the data screening process.
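
For reference, the common moment-based sample estimators (SPSS reports bias-adjusted variants, but the idea is identical; both statistics are approximately zero for normally distributed data):

```latex
g_1 = \frac{\tfrac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^3}
           {\left[\tfrac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2\right]^{3/2}}
\qquad
g_2 = \frac{\tfrac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^4}
           {\left[\tfrac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2\right]^{2}} - 3
```

Here g_1 is the skewness and g_2 the excess kurtosis of the n observations x_i with sample mean x-bar.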

πŸ’‘Descriptives Frequencies

"Descriptives Frequencies" refers to SPSS's Analyze > Descriptive Statistics > Frequencies procedure, which summarizes data in terms of central tendency (mean, median, mode) and dispersion (range, variance, standard deviation). The video script includes a step where the speaker uses this procedure to identify variables with missing data, which helps in deciding how to handle those missing values during the data cleaning process.

Highlights

Case screening is essential for identifying rows with missing data in a dataset.

Unengaged responses can be detected through row-level checks, such as zero variance across all items, and removed if necessary.

Outliers in continuous variables should be examined and addressed to maintain data integrity.

Data organization is crucial before analyzing missing data to ensure variables are correctly aligned.

Using Excel to check for missing data by counting blank cells can be an efficient method.

Deleting variables not required for analysis streamlines the dataset and focuses on relevant information.

ID variables are important for tracking individual responses within a dataset.

Imputing missing values is appropriate for variables with less than 5% missing data and maintains data completeness.

Median imputation is suitable for ordinal scales, while mean imputation is appropriate for continuous scales.

SPSS can be used to replace missing values with median or mean values for specific variables.

Attention traps in surveys identify unengaged respondents who fail to give the instructed answer to the trap question.

Recording time elapsed during a survey can be a method to detect unengaged respondents who complete it too quickly.

Reverse-coded items in surveys can reveal unengaged respondents who fail to answer them correctly.

Standard deviation can be used to detect respondents who give the same response to every question.

Outliers on continuous variables can be identified through scatter plots and addressed accordingly.

Skewness and kurtosis should be checked for normality in the data, with values over three indicating potential issues.

Extremely non-normal variables may need to be watched or dropped if they cause problems in further analysis.

Reporting the handling of missing data, unengaged responses, and outliers is important for transparency in research.

Transcripts

[00:00] Alright, first things first: let's go back to here and do some case screening. Cases refer to the rows in your data set. Let's go get that data set, and then we'll look at missing data in the rows; unengaged responses, which are potential candidates for dismissal (removing their row from the data set); and then outliers on our continuous variables. The data, again, is right here on the home page of StatWiki. It's the YouTube SEM series; I'm just going to click this so you know I'm using the exact same data as you. Open it up. Okay, we have the data. Now, this has more variables than you will need, but just barely.

[00:40] So first things first, we need to see if we have missing data in the rows. Let's see... missing... nope... here: missing data in rows. Now, the easiest way for you to do this is actually first to organize the data. Let me get rid of any variables we don't need. Do we need playful? Let's go back to this model here. Here's the model, here's the data. Do we need playful? The answer is yes. Do we need "comp latent new"? Let me just delete that for now; you might want to just drag it down to the bottom of yours, but I'm going to delete it because I won't be using it. Atypical use? Yes. We are using usefulness, yes; joy, yes (that's enjoyment, the same thing); InfoAcq, yes; decision quality, yes; gender, yes; age, yes; education, no; frequency and experience, yes and yes. And ID: I'm actually going to keep the ID because it allows us to keep track of which row belongs to which person. So we have all the data we need. I'm going to save this as something else in my downloads folder: "YouTube SEM series trimmed". Okay.

[01:58] Alright, now what I'd like to do is first check to see if I have a lot of missing data. I'm just going to do Ctrl+A, Ctrl+C (that's select all and copy), then go into Excel, make a new sheet, and paste that in. I'm just going to go to the very end. Here it is; the very end for me is column AS. And I'm going to include a formula: =COUNTBLANK. What that'll do is tell me how many missing values I have in whatever range I specify. My range is from A1 to AR1, and in this first row I have no missing data. That's awesome. Just double-click that to fill it down and we'll see what we get: I have 1, 1, 2... You know what, let me just do this: I'm going to sort by this column, largest to smallest.

[03:00] Now, before you start sorting data outside of SPSS, make sure you have the unique identifier in there, because if you start sorting and you only have a portion of the data, you need to be able to match it back up. In this case I have all of the data, so if I remove any rows it doesn't matter: I'm just going to delete all of the data in my current data set and replace it with whatever I end up with here. But if you're only sorting or manipulating a few of the variables out of all of them, you need to be able to re-sort them into the correct order, so that when you put them back in SPSS the data for each row belongs to the right row. Hope that made sense.

[03:40] Okay, so, largest to smallest, and we can see: whoa, there are two records, 288 and 303, that have 42 missing values. That's epic, wow. Okay, looks like they didn't fill out any of the survey except whatever this one question was, so I can't use those rows; they are completely useless to me. I'm going to go over to SPSS and delete those rows, that's 288 and 303 by their identifiers. Let me go over here and find IDs 288 and 303. Here's the ID column; looks like it's in order, which is helpful. 288: here, yep, missing. And you can see this one right here, 303, also missing. I just held Ctrl while single-clicking on the row numbers, and that lets me select both of them. Now if I right-click one of those selected rows and hit Clear, it will delete them and shift the others up, so those two are now gone from the data set. Other than that, it looks like I'm doing pretty good: just one or two missing here and there, not more than 5% missing.

[05:01] So what I'm going to do is just impute those values. Which variables are the ones that have missing? What I can do is go back over to SPSS and do Analyze > Descriptive Statistics > Frequencies and stick all my variables in there except ID (I don't need ID). Throw those in, make sure "Display frequency tables" is checked, and hit OK. This will show me right here which ones are missing data. Ah, useful3... To make my life easier, I'm going to copy this over, go back to Excel, new tab, and paste it in here. (I know I'm going fast; you're welcome to pause or stop, and you can even play these in slow motion: if you click on the cog on the bottom right you can put me at 0.5 or 0.75 speed. I'll still be talking, but it will be super slow. Anyway, if that is helpful.) I'm just going to get rid of the ones that don't have missing values, just to consolidate. Okay, so it looks like we only have a few with missing values. Oh, I didn't select gender; here we go, get rid of that one.

[06:14] Okay, so these are all ordinal Likert scales, so for these we'll impute the median. Decision quality 1 is also ordinal. Experience is continuous, so we'll want to impute the mean. Now why is that: median for ordinal and mean for continuous? Well, in an ordinal scenario there is no such thing as a 1.27 (which might be the mean) or a 3.66; those values don't exist on an ordinal scale. Ordinal is 1, 2, 3, 4, and so on. Whereas experience could be any range of numbers. This is experience in years, I believe, so it could be anything from zero to however old Excel is, twenty-five or something like that. So a mean is more appropriate here, although the median would work just as well.

[07:10] Okay, so let's go do that over in SPSS. We're going to go to Transform > Replace Missing Values, and let's put in the ones that had some missing. It was useful 2, 3, and 5, so go down here: useful2, useful3, useful5, throw those in. And joy6, there's joy6. And decision quality 1, there it is. And the last one was experience; throw that in there. Okay, with these guys, I wonder if I can select multiple at a time. I can, so I'll take all of them except experience. Oh, it won't let me change more than one at a time. I want to just replace the existing variable, so I'm going to get rid of this "_1" suffix and hit Change, and I want it to be "Median of nearby points". And how many points? All points. Then Change, then the same for this one. But for experience I want it to be the mean; I'll still get rid of the "_1" so it just replaces the existing variable. Then I hit OK and it asks, "Change existing variables?" It asks that because I'm naming them the exact same thing as variables I already have. The answer is yes, I want to replace them.

[08:48] At this point you might do a Save As and save your data set as something else, like "imputed" or "no missing" or something like that. Now, why do we want no missing values? Well, in Amos later on you can't do a few things if you have missing values: you can't estimate modification indices, and you can't run a bootstrap, I believe. There are a few different issues that come up when you have missing values, so it's best to impute them. Now, what do you do if you have more than five or maybe ten percent missing values in a single column? That gets tricky, because you start (what's the word?) diluting. Like when you dilute your drink with water, you start diluting the potency of the variable, because you bring everything towards the mean, and so all of the effects, the regressions, the correlations, will be diluted or dampened. That's not ideal.

[09:54] So, moving on: what do we report? What we could say in our paper is something like, "We had seven variables with missing values, all less than 5% missing, which we replaced with the median for ordinal scales and the mean for continuous scales." You also might want to say, "We deleted two rows due to having fully incomplete responses; more than 20% of the responses were missing, so we removed those rows." That's what you'd say in your report, nothing more. You don't need to say exactly how many values were missing in each variable, or which variables they were; just tell it like it is. Okay, that was simple. What's next?
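
For readers following along outside SPSS, a rough pandas equivalent of the Analyze > Descriptive Statistics > Frequencies step used above to find which columns have missing values (file name hypothetical):

```python
import pandas as pd

df = pd.read_csv("youtube_sem_series_trimmed.csv")  # hypothetical file name

# Missing-value count per column, mirroring the Frequencies output.
missing = df.drop(columns="ID").isna().sum()
print(missing[missing > 0].sort_values(ascending=False))

# The video's rule of thumb: under ~5% missing, impute; above that, be wary.
print(missing[missing > 0.05 * len(df)])
```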

[10:50] Unengaged responses. This is tricky, and not everybody looks into this. An unengaged response is when somebody is taking your survey but not really paying attention. So how do we detect this? Let me copy all this with Ctrl+C and create a new sheet. What we can do is visually inspect all of this data, which is kind of tricky, right? Especially if you have 300 respondents or more; that is a lot of visual inspection. So the easiest way to do this is actually to throw in an attention trap. An attention trap is something like, "If you are still paying attention, please answer 'strongly disagree'," or, "If you are still paying attention, please answer 'somewhat agree'." Anyone who does not answer that question correctly obviously wasn't paying attention, and so you are justified in removing their data.

[12:01] Another way you can do this is by recording the time elapsed while taking the survey, and then sorting by that time at the end. Flag those who took less than whatever's reasonable: let's say you had a 60-item survey and you see people finishing it in under 60 seconds. That is mentally and physically impossible, right? Clearly they were not engaged if they spent less than a minute filling out your 60-item questionnaire. Another way is to use reverse-coded items. Let me go back to the survey here; you can see in the labels we have the wording of each of the questions. A reverse-coded item is a negatively worded item. For playful you can see "I am playful when I interact with Excel." What if I had another question that said "I am rigid (or not playful) when I interact with Excel"? You would expect them to answer in the opposite direction: if they answered "strongly agree" to the first one, then on the reverse-coded item they should answer something like "strongly disagree," at the negative end of the scale. If they don't, well, they weren't paying attention, and you can remove their row.

[13:20] The other kind, what we call a "liner," is someone who answers the exact same value for every question. There's one way to detect this, and that is to take a standard deviation across the row: =STDEV, there it is. Then see if there is much of a standard deviation. Now, I actually don't need it for the whole row; once we get to gender and age I'm no longer interested, so let me change this range, just to there. Okay, and we'll see a standard deviation. What this is going to tell us is whether there is zero variance on a row, meaning they answered the exact same value for every question. And oh, we do have one, actually. Do we have any others? Everything else looks reasonable. Reasonable? What is reasonable? It depends on what kind of scales you're using. I use five-point Likert scales, so maybe something like a 0.5 standard deviation. There's no literature support for that; I'm just saying it's an indicator, sort of like a red flag: go look. Well, here's another one.

[14:46] Okay, looks like we have two. Let me scroll back up. In fact, I should just sort by this column, largest to smallest, so the zeros come out at the very bottom. Whoops, there. Okay, these two rows, which are cases 19 and 295, had zero variance on the questions. Look: this guy answered 4 all the way across, and this guy answered 3 all the way across. Clearly not engaged, if you answer the exact same thing for every single question. Now, if I had recorded times, I could also see how long they took. Anyway, that's a good way to detect them. What I would do at this point, now that I've visually inspected them after checking that standard deviation, is go to IDs 19 and 295 and just remove those two rows. And what would I report? I'd report that I found two respondents who were unengaged, as evidenced by giving the exact same response for every single item. That is fully justifiable.

[15:51] Okay, back to here. ID 19: there it is. I'm going to right-click and Clear, then go find number 295. Here we go, ID 295, this guy. I know the row number says 293, but the ID is 295; don't get those confused. There's 295, and I'm going to delete this one: Clear. Okay, we have now looked at unengaged responses. Excellent.
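
The row-wise =STDEV trick, sketched in pandas under the assumption that the Likert items occupy a contiguous block of columns (the playful1:dq8 slice is an assumed naming, and the 0.5 red-flag level is the speaker's informal heuristic, not a published cutoff):

```python
import pandas as pd

df = pd.read_csv("youtube_sem_series_trimmed.csv")  # hypothetical file name

# Standard deviation across each respondent's Likert answers only,
# excluding demographics such as gender and age, as in the video.
likert = df.loc[:, "playful1":"dq8"]  # assumed column range
df["row_sd"] = likert.std(axis=1)

# Zero variance = the same answer to every item (the video finds IDs 19, 295).
print(df.loc[df["row_sd"] == 0, "ID"])

# A low-but-nonzero SD (e.g. under 0.5 on a 5-point scale) is only a red
# flag worth inspecting, not an automatic removal rule.
print(df.loc[df["row_sd"] < 0.5, "ID"])
```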

[16:27] Last thing: outliers on continuous variables. What continuous variables am I actually using? Well, we have age and experience. Here's age; you can see it's fairly continuous. And experience, I believe, went in years, so it goes up to as many years as Excel has been in existence. Let's take these two and see if there are any outliers. How do we do this? One simple way is to go to Graphs > Chart Builder and select a scatter/dot, just a simple one; drag that out, and on the y-axis throw in your variable (I'll use experience in this case), and on the x-axis we can throw in ID, then hit OK. Looks like we're pretty good, except... whoa, this guy out here. We want to know who this is. How could we figure that out? They say they have 25 years of Excel experience. Now, that is possible; Excel has existed for quite a while. Let me sort this column: right-click it and Sort Descending. Here's that person with 25 years of experience. How can we corroborate whether this is possible? Well, we have their age. Look: this person is 21 years old. They've been using Excel for years prior to being born? Probably not possible. And these two people have used it since they were 6 or 7; hmm, definitely possible, so we'll leave those in. But this one: impossible.

[17:59] So what do we think happened here? Is this an erroneous response? Is this a useless case? You know, looking at this, they responded 4 for... everything, really. Were they even paying attention? They're 21; useful is 4, then 2, 2, 2, 2, 2, 2, 2, 2, then 1, 1, 1, 1, 1, then 2, then 4, 4, 4, 4. It's hard to say whether this was an engaged person or not. 25 was the top of the scale; I wonder if they just clicked on the slider and pulled it too far right. Oh, this is a tough choice. What I'm going to do (and this is your call) is... I would probably delete this guy. Why? Because they selected 2 for all the playfuls and 4 for all the joys, and those two should move together, right? And then 1s for all the atypicals; that might be valid. Oh, this is our decision. If you want to play it safe, what you can do is just correct the erroneous response. What is the median for experience? Let's go find out; I have experience just over here. Here's the experience column. Let me just get the median or the mean: if I highlight the whole column, you can see down at the bottom Excel says the average is 4.419, so 4.42. Let me go ahead and stick that in: 4.42. I'm going to give this person the average, not knowing what the real value might have been. I'm also fairly suspicious that this is just a bad respondent; I wish I had recorded timestamps. Oh well.

[19:43] Okay, that was the only one on experience. Let's do the same thing for age. In fact, we can just take age here, right-click it, Sort Descending: we have a 35-year-old, a 34-year-old, a 31-year-old; nothing really outlying on top. Let's sort it ascending: we have two people who didn't report age. Now, are we even using age in our analyses? Let's go back. Yes, we are controlling for age. So we have two here that haven't been imputed yet, where the respondent didn't respond. Let's stick the mean in there as well. Back to Excel to find out what that average was. Here's the age column: the average was 21.65. I'll go ahead and stick that in here: 21.65, 21.65. Okay, we have imputed those two values with the mean, and that takes care of outliers on continuous variables.
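
The same scatter-plus-corroboration check, sketched in pandas (file name hypothetical; the plot call needs matplotlib installed):

```python
import pandas as pd

df = pd.read_csv("youtube_sem_series_trimmed.csv")  # hypothetical file name

# Eyeball extreme values, mirroring the SPSS Chart Builder scatter plot.
df.plot.scatter(x="ID", y="experience")

# Corroboration rule from the video: years of Excel experience cannot
# plausibly equal or exceed the respondent's age (25 years at age 21).
impossible = df["experience"] >= df["age"]
print(df.loc[impossible, ["ID", "age", "experience"]])

# The "play it safe" option: overwrite the erroneous value with the mean
# rather than dropping the whole case.
df.loc[impossible, "experience"] = df["experience"].mean()
```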

[20:40] Next is variable screening. We already did this when we looked at missing data in the columns. The last thing is skewness and kurtosis, and then I'll start a new video. Okay: skewness and kurtosis. Back to here, go to Analyze > Descriptive Statistics > Frequencies, and we have it all in there already except these guys; throw those back in (those are the ones we imputed), go to Statistics, and what we want is skewness and kurtosis. Continue, continue, and here we go. If you go up to the very top of the output, it shows it right here. Boom. Okay, right-click this, copy it, go back over to Excel, new tab, and throw it in there.

[21:23] What we're looking for depends on who you cite, but a fairly liberal rule, which I'll use because I find it works just fine, is that values over 3 are problematic. If you want to be a little more rigid, there's 2.2; a little more rigid, an absolute value of 1; and even more rigid is something about dividing the value by its standard error and making sure it's not more than three times that. Let's use the easy one: 3. So if any of these are greater than 3, we have a problem. Highlight both rows and go to Conditional Formatting > Highlight Cells Rules > Greater Than... 3, and make it red. Then one more time: Conditional Formatting > Highlight Cells Rules > Less Than... negative 3, and highlight it. Okay, it highlighted these ones, but only because they're not numbers. Let's look it over. Looks like we're doing pretty good. Pretty good, actually: age is the only one. Wow, I'm surprised. Okay, so age has a kurtosis issue, and you may ask why that would be. Well, we were sampling undergraduate students at a college; they're all about 19 to 21 years old, so of course you're going to have a kurtosis issue. Everything else is fairly normal, so we're pretty good there; we wouldn't need to make any changes.

[22:53] Now, what would we do if we did have non-normal ordinal measures? You can't really transform them; they're on a short, ordered, five-point scale, so a transformation really won't do anything. So what do you do? If they are extremely non-normal, you know, greater than 3, watch them and see if they have other problems, or just drop them if you have that capability. Let's say, for example, decision quality 5 was highly non-normal; let's say this was an 8 right here, a big bad 8. Then we'd say, well, we can afford to drop decision quality 5. Why? Because we have decision quality 2, 3, 4, 6, 7, and 8. We have tons of these items, and it's reflective, so dropping one item really shouldn't change the nature of the construct. I would just get rid of it.

[23:53] Now let's say you had only one or two other items. Let's get rid of these guys and say decision quality only had three items, and it turns out decision quality 3 is highly skewed; put a 6 in there, highly skewed. What do we do? I only have two other items. This is tricky. What I would do is leave it in but watch it, and if it gives us problems in the future, if it starts giving us problems in the factor analysis (either exploratory or confirmatory), we may have to cut it out. And that's it.

[24:29] What do you do to report it? You'd say, "We have one item that was skewed, and we decided to watch it because we only had two other items on that construct." Or you could say, "We had one item that was highly skewed (value: 6), so we deleted that item from the set." You could also say, "We had a highly kurtotic age variable, but this is expected given the sample population: students, so obviously age is going to be highly kurtotic." There we go. Okay, I think that does it for data screening. Woohoo!
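
Finally, the skewness/kurtosis screen sketched in pandas with the video's liberal |value| > 3 flag (file name hypothetical; pandas, like SPSS, reports excess kurtosis, so a normal variable sits near zero):

```python
import pandas as pd

df = pd.read_csv("youtube_sem_series_trimmed.csv")  # hypothetical file name

items = df.drop(columns="ID")
stats = pd.DataFrame({
    "skewness": items.skew(numeric_only=True),
    "kurtosis": items.kurt(numeric_only=True),  # excess kurtosis
})

# Flag variables breaching the |value| > 3 rule; expect age to show up.
print(stats[(stats.abs() > 3).any(axis=1)])
```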