SEM Series (2016) 2. Data Screening
Summary
TL;DR: This video script is a comprehensive guide to data screening in statistical analysis. It covers essential steps like identifying missing data, handling unengaged responses, managing outliers, and checking for skewness and kurtosis. The tutorial uses a practical approach with examples from a dataset, demonstrating how to clean and prepare data for analysis in SPSS. It also discusses the implications of each step and provides tips for reporting the findings in research.
Takeaways
- 🔍 The video script discusses a systematic approach to data screening, focusing on handling missing data, identifying unengaged responses, and managing outliers in continuous variables.
- 🗂️ The process starts with organizing the dataset by removing unnecessary variables and keeping track of IDs to ensure data integrity.
- 📊 To detect missing data, the script suggests using Excel to count blank cells and sort the data to identify rows with a high number of missing values, which may be candidates for removal.
- 🚫 The script identifies unengaged responses by looking for patterns such as identical answers across all questions or extremely short survey completion times, indicating a lack of attention from respondents.
- 📈 For continuous variables, the script recommends checking for outliers using scatter plots and replacing extreme values with the mean or median to maintain data integrity.
- 📉 The video also covers the importance of dealing with skewness and kurtosis, suggesting the use of conditional formatting in Excel to highlight values that exceed certain thresholds, indicating potential issues with data distribution.
- 📝 It's advised to report the data screening process in a research paper, including details about missing data imputation, removal of unengaged responses, and handling of outliers.
- ❌ The script emphasizes the need to be cautious when dealing with a high percentage of missing data in a variable, as it can dilute the potency of the variable and affect analysis outcomes.
- 🔢 The process of imputing missing values differs for ordinal and continuous scales, with medians used for ordinal scales and means for continuous scales, to maintain the integrity of the data.
- ⏱️ The video mentions the use of attention traps and reverse-coded items as strategies to identify unengaged respondents, which can help in cleaning the dataset.
Q & A
What is the first step in data screening as described in the script?
-The first step in data screening is to check for missing data in the rows of the dataset.
How does the script suggest handling unengaged responses in a survey?
-Unengaged responses can be detected by visually inspecting data, using attention traps, recording time elapsed for survey completion, employing reverse-coded items, or identifying respondents who give the same response to every question.
What is an 'attention trap' in the context of survey data?
-An 'attention trap' is a question in a survey designed to identify respondents who are not paying attention, such as asking them to select a specific, counterintuitive answer to a straightforward question.
How can outliers on continuous variables be identified and addressed according to the script?
-Outliers on continuous variables can be identified using scatter plots or by calculating the standard deviation of the values. Addressing them may involve removing the outlier if it's due to an erroneous response or imputing a value like the mean or median for the variable.
What is the purpose of using reverse-coded items in a survey?
-Reverse-coded items are negatively worded questions used to detect unengaged respondents. An attentive respondent should answer them in the opposite direction from the positively worded items, so someone who answers the same way on both likely isn't reading the questions.
Why is it important to check for missing data before conducting further analysis?
-Checking for missing data is important because it can affect the validity and reliability of the analysis. Missing data can lead to biased results or require imputation, which should be reported in the study.
What is the recommended approach for handling missing data in ordinal scales versus continuous scales?
-For ordinal scales, the median is recommended for imputation, while for continuous scales, the mean is more appropriate. This distinction is made because ordinal scales do not have actual mean values between the scale points.
How does the script suggest detecting respondents who may not be engaged in the survey?
-The script suggests detecting unengaged respondents by looking for those who give the same response to every question (flatliners), those who complete the survey in an unrealistically short amount of time, or those who fail attention-trap questions.
What is the significance of checking for skewness and kurtosis in the data?
-Checking for skewness and kurtosis is important to assess the normality of the data distribution. Extreme values can indicate that the data may not meet the assumptions required for certain statistical tests, potentially affecting the analysis and its interpretation.
How should the presence of outliers be reported in a research paper according to the script?
-The presence of outliers should be reported by noting the specific variables affected, the nature of the outliers (e.g., extremely high or low values), and the actions taken to address them, such as removal or correction of the values.
Outlines
📊 Data Screening and Missing Data Analysis
The paragraph discusses the initial steps of data screening, focusing on identifying rows with missing data in a dataset. It explains the process of removing rows with unengaged responses and handling outliers in continuous variables. The speaker demonstrates how to access and organize the data, checking for missing values, and deciding which variables to keep or remove based on their relevance to the study. The importance of keeping a unique identifier for each row is emphasized to maintain data integrity during sorting or other manipulations.
🔢 Imputing Missing Values and Dealing with Unengaged Responses
This section delves into the process of imputing missing values in the dataset. The speaker explains the difference between ordinal and continuous variables and the appropriate methods for imputing missing data for each type. For ordinal variables, the median is used, while the mean is suitable for continuous variables. The paragraph also addresses the issue of unengaged responses during surveys, suggesting methods to detect and remove such data. The speaker demonstrates how to use SPSS for data manipulation and emphasizes the importance of reporting these steps in research.
🕵️♂️ Detecting and Removing Unengaged Responses
The focus of this paragraph is on detecting unengaged responses in survey data. Techniques such as attention traps, time elapsed for survey completion, and reverse-coded items are discussed to identify respondents who may not have paid attention. The speaker also introduces the concept of using standard deviation to find respondents who gave the same response to every question, indicating a lack of engagement. The paragraph concludes with the speaker's decision to remove unengaged responses and the rationale for reporting these actions in research.
📉 Identifying and Handling Outliers in Continuous Variables
The paragraph discusses the identification and handling of outliers in continuous variables such as age and experience. The speaker demonstrates how to use scatter plots to visualize potential outliers and the importance of corroborating extreme values with other data points. The decision-making process for dealing with outliers, including the choice to either correct erroneous responses or remove them, is explored. The speaker also discusses the implications of not having recorded timestamps for additional verification.
📋 Variable Screening, Skewness, and Kurtosis
In this final paragraph, the speaker addresses the last steps of data screening, including checking for skewness and kurtosis in the variables. The process of using descriptive statistics in SPSS to obtain skewness and kurtosis values is demonstrated. The speaker sets a rule for what constitutes problematic values (absolute values greater than three) and uses conditional formatting in Excel to highlight these. The paragraph concludes with a discussion of how to report these findings and the potential actions to take if significant skewness or kurtosis is found, such as watching the variable or removing it from the analysis.
Keywords
💡Case Screening
💡Missing Data
💡Outliers
💡Ordinal Scales
💡Continuous Variables
💡Imputation
💡Unengaged Responses
💡Attention Trap
💡Skewness and Kurtosis
💡Descriptives Frequencies
Highlights
Case screening is essential for identifying rows with missing data in a dataset.
Unengaged responses can be detected by analyzing missing data in rows and removing them if necessary.
Outliers in continuous variables should be examined and addressed to maintain data integrity.
Data organization is crucial before analyzing missing data to ensure variables are correctly aligned.
Using Excel to check for missing data by counting blank cells can be an efficient method.
Deleting variables not required for analysis streamlines the dataset and focuses on relevant information.
ID variables are important for tracking individual responses within a dataset.
When a variable has less than 5% missing data, imputing the missing values is a safe way to maintain data completeness.
Median imputation is suitable for ordinal scales, while mean imputation is appropriate for continuous scales.
SPSS can be used to replace missing values with median or mean values for specific variables.
Attention traps in surveys can help identify unengaged respondents by their inconsistent responses.
Recording time elapsed during a survey can be a method to detect unengaged respondents who complete it too quickly.
Reverse-coded items in surveys can reveal unengaged respondents who fail to answer them correctly.
Standard deviation can be used to detect respondents who give the same response to every question.
Outliers on continuous variables can be identified through scatter plots and addressed accordingly.
Skewness and kurtosis should be checked to assess normality, with absolute values over three indicating potential issues.
Extremely non-normal variables may need to be watched or dropped if they cause problems in further analysis.
Reporting the handling of missing data, unengaged responses, and outliers is important for transparency in research.
Transcripts
Alright, first things first: let's go back to here and do some case screening. Cases refer to the rows in your data set. We'll get the data set, then look at missing data in the rows; unengaged responses, which are potential candidates for dismissal (removing their row from the data set); and then outliers on our continuous variables. The data, again, is right here on the home page of StatWiki; it's the YouTube SEM series. I'm going to click this so you know I'm using the exact same data as you. Open it up. Okay, we have the data. Now, this has more variables than you will need, but just barely. So first things first, we need to see if we have missing data in the rows. The easiest way to do this is actually to organize the data first. Let me get rid of any variables we don't need. Do we need playful? Let's go back to the model; here's the model, here's the data. Do we need playful? The answer is yes. Do we need the computed latent variable? Let me just delete that for now; you might want to just drag it down to the bottom of yours, but I'm going to delete it because I won't be using it. Atypical use? Yes, we're using it. Usefulness, yes. Joy, yes, that's enjoyment. Same with info acquisition, yes. Decision quality, yes. Gender, yes. Age, yes. Education, no. Frequency and experience, yes and yes. And ID: I'm actually going to keep the ID because it allows us to keep track of which row belongs to which person. So we have all the data we need. I'm going to save this as something else in my downloads folder; that's fine, "YouTube SEM series trimmed". Okay.
Alright, now what I'd like to do is check whether I have a lot of missing data. I'm going to do Ctrl+A, Ctrl+C (select all and copy), then go into Excel, open a new sheet, and paste it in. I'll go to the very end; for me that's column AS. There I'll include a formula, =COUNTBLANK, which tells me how many missing values I have in whatever range I specify. My range is from A1 to AR1, and in this first row I have no missing data. That's awesome. Just double-click the fill handle to copy the formula down, and we can see what we get: 1, 1, 2... you know, let me just sort by this column, largest to smallest. Now, before you start sorting data outside of SPSS, make sure you have the unique identifier in there, because if you start sorting and you only have a portion of the data, you need to be able to match it back up. In this case I have all of the data, so if I remove any rows it doesn't matter; I'm just going to delete all of the data in my current data set and replace it with whatever I end up with here. But again, if you're only sorting or manipulating a few of the variables out of all of them, you need to be able to re-sort them into the correct order, so that when you put the data back in SPSS, each row's values belong to the right row. Hope that made sense.
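The COUNTBLANK step can also be sketched in code. This is a minimal Python illustration with made-up responses, not the actual dataset; `None` stands in for a blank cell, and the IDs echo the two mostly empty rows found in the video.

```python
# Count missing values per respondent and flag rows with too many blanks
# for removal, mirroring Excel's =COUNTBLANK plus a sort.
rows = {
    288: [None] * 42 + [3],           # answered almost nothing
    303: [None] * 42 + [5],
    101: [4, 5, 3, 2, 4, 4, 5, 3],    # complete response
    102: [4, None, 3, 2, 4, 4, 5, 3], # one blank: keep and impute later
}

missing_counts = {rid: sum(v is None for v in vals) for rid, vals in rows.items()}

# Candidates for deletion: rows where more than 20% of answers are blank.
to_remove = [rid for rid, vals in rows.items()
             if missing_counts[rid] / len(vals) > 0.20]

print(sorted(to_remove))  # -> [288, 303]
```

Rows below the threshold stay in the data set and get their handful of blanks imputed instead.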
Okay, so largest to smallest. We can see, whoa, there are two records, 288 and 303, that have 42 missing values each. That's epic, wow. It looks like they didn't fill out any of the survey except whatever this one question was, so I can't use those rows; they are completely useless to me. I'm going to go over to SPSS and delete those rows; it's 288 and 303 for the identifiers. Let me find IDs 288 and 303. Here's the ID column; it looks like it's in order, which is helpful. 288, here, yep, missing; and you can see this one right here, 303, also missing. I held Ctrl while single-clicking the row numbers, which lets me select both of them, and now if I right-click one of the selected rows and hit Clear, it deletes them and shifts the others up. Those two are now gone from the data set. Other than that, it looks like I'm doing pretty well: just one or two missing per row, not more than 5% missing, so I'm just going to impute those values.

So which variables are the ones with missing values? What I can do is go back over to SPSS, do Analyze > Descriptive Statistics > Frequencies, and stick all my variables in there except ID (I don't need ID). Throw those in, make sure "Display frequency tables" is checked, and hit OK. This shows me right here which ones have missing data: ah, useful3... To make my life easier, I'm going to copy this over, go back to Excel, open a new tab, and paste it in. I know I'm going fast; you're welcome to pause, or even play these videos in slow motion: if you click on the cog on the bottom right you can put me at 0.5 or 0.75 speed and I'll still be talking, just super slowly, if that's helpful. I'm going to get rid of the ones that don't have missing values, just to consolidate. Okay, it looks like we only have a few with missing values. Oh, I didn't mean to include gender; get rid of that one. Okay. These are all ordinal Likert scales, so for these we'll impute the median; decision quality 1 is also ordinal. Experience is continuous, so for that we'll want to impute the mean. Now, why median for ordinal and mean for continuous? Well, on an ordinal scale there is no such thing as a 1.27 (which might be the mean) or a 3.66; those values don't exist on an ordinal scale. Ordinal is 1, 2, 3, 4, and so on, whereas experience could be any range of numbers. This is experience in years, I believe, so it could be anything from zero up to however old Excel is, twenty-five years or something like that. So a mean is more appropriate here, although the median would work just as well.

Okay, let's go do that over in SPSS. We're going to go to Transform > Replace Missing Values, and put in the ones that had missing data: useful 2, 3, and 5; joy6; decision quality 1; and the last one was experience. Throw those in. Now, I wonder if I can select multiple at a time; I can, so let's take all of them except experience. Oh, it won't let me change them more than one at a time. I want to just replace the existing variable, so I'll remove this "_1" suffix from the new name, set the method to "Median of nearby points" with the span of nearby points set to "All points", and hit Change; then the next one, removing the same suffix, and so on. For experience, though, I want the mean; I'll still remove the "_1" so it just replaces the existing variable. Then I hit OK, and it asks whether I want to change existing variables. It asks that because I'm naming the new variables exactly the same as variables I already have. The answer is yes: I want to replace them.
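For readers who prefer code to the SPSS dialog, here is a minimal Python sketch of the same imputation rule: median for ordinal Likert items, mean for continuous ones. The column names are hypothetical stand-ins for the dataset's variables.

```python
# Impute missing values: median for ordinal scales, mean for continuous scales.
from statistics import mean, median

ordinal = {"useful3": [4, 2, None, 5, 3, 4]}        # 1-5 Likert -> median
continuous = {"experience": [2.0, 6.5, None, 4.0]}  # years -> mean

def impute(values, stat):
    """Replace None with the chosen statistic of the present values."""
    present = [v for v in values if v is not None]
    fill = stat(present)
    return [fill if v is None else v for v in values]

imputed_ordinal = {k: impute(v, median) for k, v in ordinal.items()}
imputed_continuous = {k: impute(v, mean) for k, v in continuous.items()}

print(imputed_ordinal["useful3"])  # the blank becomes the median, 4
```

Note that the median of a 1-to-5 Likert column is always a value that actually exists on the scale, which is exactly why it is preferred over the mean for ordinal data.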
Okay. At this point, if you want, you might save your data set as something else, like "imputed" or "no missing" or something like that. Now, why do we want no missing values? Well, in Amos later on there are a few things you can't do if you have missing values: you can't estimate modification indices, and you can't run a bootstrap, I believe. There are a few different issues that come up with missing values, so it's best to impute them. Now, what do you do if you have more than five or maybe ten percent missing values in a single column? That gets tricky, because you start, what's the word, diluting, like when you dilute your drink with water. You start diluting the potency of the variable, because you bring everything towards the mean, and so all of the effects, the regressions and the correlations, will be diluted or dampened, and that's not ideal. So, moving on: what do we report? Let me bring this down again. What we could say in our paper is something like: "We had seven variables with missing values, all with less than 5% missing, which we replaced with the median for ordinal scales and the mean for continuous scales." You might also want to say: "We deleted two rows due to fully incomplete responses; more than 20% of the responses were missing, so we removed those rows." That's what you'd say in your report, nothing more. You don't need to say exactly how many values were missing in each variable, or which variables they were; just tell it like it is. Okay, that was simple. What's next?
Unengaged responses. This is tricky, and not everybody looks into it. An unengaged response is when somebody is taking your survey but not really paying attention. So how do we detect this? Let me copy all this with Ctrl+C and create a new sheet. What we can do is visually inspect all of this data, which is kind of tricky, right? Especially if you have 300 respondents or more; that is a lot of visual inspection. The easiest way to do this is actually to throw in an attention trap. An attention trap is something like: "If you are still paying attention, please answer strongly disagree," or "If you are still paying attention, please answer somewhat agree." Anyone who does not answer that question correctly obviously wasn't paying attention, so you are justified in removing their data.
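The attention-trap check is trivial to script. This Python sketch uses hypothetical respondent IDs and a hypothetical trap item coded 1 to 5, where 1 ("strongly disagree") is the instructed answer.

```python
# Flag respondents who failed the attention trap: anyone who did not give
# the exact answer the trap question instructed them to give.
trap_answers = {101: 1, 102: 1, 103: 4, 104: 1, 105: 5}

EXPECTED = 1  # the trap said "please answer strongly disagree"
failed_trap = sorted(rid for rid, ans in trap_answers.items() if ans != EXPECTED)

print(failed_trap)  # -> [103, 105]
```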
Another way you can do this is by recording the time elapsed while taking the survey and then sorting by that time at the end. Those who took less than whatever is reasonable are suspect: say you had a 60-item survey and you see many people finishing it in under 60 seconds; that is mentally and physically impossible, right? Clearly they were not engaged if they spent less than a minute filling out your 60-item questionnaire.
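The time-based screen looks like this in code. The completion times are hypothetical; the 60-second floor comes from the 60-item example above.

```python
# Flag respondents whose completion time is implausibly short.
completion_seconds = {1: 540, 2: 45, 3: 310, 4: 58}

MIN_PLAUSIBLE = 60  # one second per item on a 60-item survey is already generous
unengaged = sorted(rid for rid, t in completion_seconds.items()
                   if t < MIN_PLAUSIBLE)

print(unengaged)  # -> [2, 4]
```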
Another way to do this is to use reverse-coded items. Let me go back to the survey; you can see here in the labels we have the wording of each of the questions. A reverse-coded item is a negatively worded item. For playful, you can see "I am playful when I interact with Excel." What if I had another question that said "I am rigid, or not playful, when I interact with Excel"? You would expect them to answer in the opposite direction: if they answered strongly agree to the first one, then on the reverse-coded item they should answer something like strongly disagree, at the negative end of the scale. If they don't, they weren't paying attention, and you can remove their row. The other kind of unengaged respondent, what we call a flatliner, is someone who answers the exact same value for every question. There is one way to detect this, and that is to add a standard deviation formula, =STDEV, on the row, and see if there is much of a standard deviation.
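The same flatliner check can be done outside Excel. This Python sketch uses hypothetical Likert responses, with IDs mirroring the two flatliners found in the video; `pstdev` is the population standard deviation, the analogue of Excel's STDEVP.

```python
# Per-row standard deviation of Likert answers: zero variance means the
# respondent gave the same answer to every question.
from statistics import pstdev

responses = {
    19:  [4, 4, 4, 4, 4, 4],   # flatliner
    295: [3, 3, 3, 3, 3, 3],   # flatliner
    101: [4, 2, 5, 3, 4, 1],   # engaged respondent
}

row_sd = {rid: pstdev(answers) for rid, answers in responses.items()}

# An SD of exactly 0 is unambiguous; a low SD (say under 0.5 on a 5-point
# scale) is only a red flag to inspect, as the video notes.
flatliners = sorted(rid for rid, sd in row_sd.items() if sd == 0)

print(flatliners)  # -> [19, 295]
```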
Now, I actually don't need it for the whole row: once we get to gender and age I'm no longer interested, so let me shorten the range to stop just before those. Okay, we'll see a standard deviation, and what this tells us is whether there is zero variance on a row, meaning they answered the exact same value for every question. And oh, we do have one, actually. Do we have any others? Everything else looks reasonable. What is reasonable? It depends on what kind of scales you're using. I use five-point Likert scales, so maybe something like a 0.5 standard deviation; there's no literature support for that, I'm just saying it's an indicator, sort of a red flag: go then look. Well, here's another one. Okay, it looks like we have two. Let me scroll back up; in fact, I should just sort by this column, largest to smallest, so the zeros come out at the very bottom. These two rows, which are cases 19 and 295, had zero variance on the questions. Look, this one answered 4 all the way across, and this one answered 3 all the way across: clearly not engaged, if you answer the exact same thing for every single question. Now, if I had recorded times, I could also see how long they took. Anyway, that's a good way to detect them. What I'd do at this point, now that I've visually inspected them after looking at the standard deviation, is go to IDs 19 and 295 and remove those two rows. And what would I report? I'd report that I found two respondents who were unengaged, as evidenced by giving the exact same response to every single item; that is fully justifiable. 19 and 295, okay. Go back to the ID column: 19, there it is; I'm going to right-click and Clear. Then find number 295; here we go, ID 295. I know the row number says 293, but the ID is 295; don't get those confused. There's 295; I'm going to delete this one, Clear. Okay, we have now dealt with unengaged responses. Excellent.
Last thing: outliers on continuous variables. Which continuous variables am I actually using? We have age and experience. Here's age; you can see it's fairly continuous. And experience, I believe, was measured in years, so it goes up to as many years as Excel has been in existence. Let's take these two and see if there are any outliers. How do we do this? One simple way is to go to Graphs > Chart Builder and select a simple scatter/dot. Drag that out; on the y-axis throw in your variable (I'll use experience in this case), on the x-axis throw in ID, and hit OK. Looks like we're pretty good, except, whoa, this person way out here. We want to know who this is. How could we figure that out? They say they have 25 years of Excel experience. Now, that is possible; Excel has existed for quite a while. Let me sort this column: right-click it and Sort Descending. Here's that person with 25 years of experience. How can we corroborate whether this is possible? Well, we have their age: look, this person is 21 years old. They would have been using Excel for years prior to being born; probably not possible. These two other people have used it since they were 6 or 7; hmm, definitely possible, so we'll leave those in. But this one is impossible. So what do we think happened here? Is this an erroneous response? Is this a useless case? Looking at the row, they responded 4 for everything; were they even paying attention? They're 21; useful is 4; then 2, 2, 2, 2; 1, 1, 1, 1; 4, 4, 4, 4... it's hard to say whether this was an engaged person or not. 25 was the top of the scale; I wonder if they just grabbed the slider and pulled it all the way to the right. Oh, this is a tough choice. What I'm going to do, and this is your call, is probably delete this one.
Why? Because they selected 2 for all the playful items and 4 for all the joy items, and those two should move together, right? And then 1s for all the atypicality items. That might be valid, or... this is our decision. If you want to play it safe, what you can do is just correct the erroneous response. What is the mean for experience? Let's find out. Here's the experience column; if I highlight the whole column, you can see down at the bottom Excel says the average is 4.419, so 4.42. Let me stick that in: 4.42. I'm going to give this person the average, not knowing what the real value might have been. It's also fairly suspicious that this is just a bad respondent; I wish I had recorded timestamps. Oh well. Okay, that was the only one on experience; let's do the same thing for age. In fact, we can just take age here, right-click it, Sort Descending: we have a 35-year-old, a 34-year-old, a 31-year-old; nothing really outlying on top. Let's sort ascending: we have two people who didn't report age. Now, are we even using age in our analyses? Let's go back; yes, we are controlling for age. So we have two values here that haven't been imputed yet, where the respondent didn't respond. Let's stick the mean in there as well. Back to Excel to find the average: here's the age column, and the average was 21.65. I'll stick that in: 21.65 and 21.65. Okay, we have imputed those two values with the mean, and that takes care of outliers on continuous variables. Next would be variable screening, but we already did that when we looked at missing data in the columns. The last thing is skewness and kurtosis, and then I'll start a new video.
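The corroboration logic (years of experience cannot exceed a plausible span of the respondent's life) can also be scripted. This Python sketch uses hypothetical records echoing the case in the video; the minimum starting age of 5 is an assumption for illustration.

```python
# Replace impossible experience values with the mean of the plausible ones.
from statistics import mean

# Hypothetical (id, age, experience-in-years) records.
people = [(1, 21, 25.0), (2, 30, 24.0), (3, 25, 4.0), (4, 34, 7.0)]

MIN_START_AGE = 5  # assumption: nobody started using the software before age 5

valid_exp = [exp for _, age, exp in people if exp <= age - MIN_START_AGE]
fill = round(mean(valid_exp), 2)

cleaned = [(rid, age, exp if exp <= age - MIN_START_AGE else fill)
           for rid, age, exp in people]

print(cleaned[0])  # person 1's impossible 25 years is replaced by the mean
```

Whether to impute or delete such a case remains a judgment call, as the video stresses; the code only automates the flagging and the safe option.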
Okay, skewness and kurtosis. Back in SPSS, go to Analyze > Descriptive Statistics > Frequencies. We have almost everything in there already; throw the imputed ones back in, go to Statistics, and check skewness and kurtosis. Continue, continue, and here we go: if you go up to the very top of the output, it shows them right there. Right-click, copy, go back over to Excel, new tab, and paste it in. What we're looking for depends on who you cite, but a fairly liberal rule, which I'll use because I find it works just fine, is that absolute values over three are problematic. If you want to be a little more rigid, there's 2.2; more rigid still, an absolute value of one; and even more rigid, there's an approach of dividing the value by its standard error and making sure the result is not more than about three. Let's use the easy one: three. So if any of these are beyond plus or minus three, we have a problem. Highlight both rows, go to Conditional Formatting > Highlight Cells Rules > Greater Than, enter 3, and make it red; then once more, Conditional Formatting > Highlight Cells Rules > Less Than, enter -3, and highlight those too. It flags these cells because they're not numbers, but looking over the rest, we're doing pretty well. Actually, age is the only one. Wow, I'm surprised. Okay.
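The skewness and kurtosis screen can be reproduced in code. This Python sketch uses simple moment-based skewness and excess kurtosis (SPSS applies small-sample corrections, so its numbers will differ slightly) together with the absolute-value-over-three rule from the video; the columns are hypothetical.

```python
# Flag variables whose skewness or excess kurtosis exceeds |3|.
from statistics import mean, pstdev

def skewness(xs):
    m, s = mean(xs), pstdev(xs)
    return sum(((x - m) / s) ** 3 for x in xs) / len(xs)

def excess_kurtosis(xs):
    m, s = mean(xs), pstdev(xs)
    return sum(((x - m) / s) ** 4 for x in xs) / len(xs) - 3

columns = {
    "age": [19, 20, 20, 20, 21, 21, 20, 20, 35],  # one extreme value
    "useful1": [1, 2, 3, 3, 4, 4, 5, 2, 3],       # roughly symmetric Likert
}

flags = {name: abs(skewness(xs)) > 3 or abs(excess_kurtosis(xs)) > 3
         for name, xs in columns.items()}
```

With these numbers, the narrow age distribution plus one extreme value trips the kurtosis flag while the Likert item passes, which mirrors the pattern in the video.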
So age has a kurtosis issue, and you may ask why that would be. Well, we were sampling undergraduate students at a college; they're all around 19 to 21 years old, so of course you're going to have a kurtosis issue there. Everything else is fairly normal, so we're pretty good and wouldn't need to make any changes. Now, what would we do if we did have non-normal ordinal measures? You can't really transform them; they're on a short, five-point ordinal scale, so a transformation really won't do anything. So what do you do? If they are extremely non-normal, you know, greater than three, watch them and see if they cause other problems, or just drop them if you have that capability. Let's say, for example, that decision quality 5 was highly non-normal; say it had a big bad eight right here. Then we'd say, well, we can afford to drop decision quality 5. Why? Because we have decision quality 2, 3, 4, 6, 7, and 8; we have tons of these items, and the construct is reflective, so dropping one item really shouldn't change the nature of the construct. I would just get rid of it. Now let's say you had only a few items: suppose decision quality had just three items, and it turns out decision quality 3 is highly skewed; put a six in there, highly skewed. What do we do? I only have two other items. Yikes, this is tricky. What I would do is leave it in but watch it, and if it gives us problems later, in the exploratory or confirmatory factor analysis, we may have to cut it out. And that's it. What do you report? You'd say: "We had one item that was skewed, and we decided to watch it because we only had two other items on that construct." Or: "We had one item that was highly skewed (a value of six), so we deleted that item from the set." You could also say: "We had a highly kurtotic age variable, but this is expected given the sample population of students; obviously age is going to be highly kurtotic." There we go. Okay, I think that does it for data screening. Woohoo!