Statistics For Data Science | Data Science Tutorial | Simplilearn
Summary
TLDR: This script offers an insightful overview of statistics, a mathematical science for data collection, analysis, and interpretation. It distinguishes between statistical and non-statistical analysis, emphasizing the former's ability to reveal patterns and trends. The script delves into descriptive and inferential statistics, explaining their roles in summarizing data and making inferences about populations. It introduces key statistical concepts, measures, and terms, and demonstrates how to perform descriptive and inferential analysis using SAS software, including hypothesis testing and the application of various parametric and non-parametric tests.
Takeaways
- 📚 Statistics is a mathematical science for the collection, presentation, analysis, and interpretation of data, crucial for simplifying complex real-world problems and making informed decisions.
- 🔍 There are two main types of analysis: statistical (quantitative) and non-statistical (qualitative), with statistical analysis providing deeper insights and clearer pictures through data patterns and trends.
- 📈 Descriptive statistics organizes data and summarizes its main characteristics using measures like average, mode, standard deviation, and correlation.
- 🔎 Inferential statistics uses probability theory to generalize from a sample to a larger population, allowing for predictions and modeling of relationships within the data.
- 📝 The script introduces key statistical terms such as population, sample, variable, and different types of variables including quantitative, qualitative, discrete, and continuous.
- 📊 Descriptive statistics involves measures of frequency, central tendency, spread, and position to provide a comprehensive understanding of data.
- 🛠️ The Statistical Analysis System (SAS) offers various procedures for performing descriptive statistics, such as PROC PRINT, PROC CONTENTS, PROC MEANS, and PROC FREQ.
- 🧐 Hypothesis testing is an inferential technique to determine if there's sufficient evidence in a data sample to infer a condition holds true for the entire population.
- 📉 Different types of variables are categorized based on their nature: nominal, ordinal, interval, and ratio, each with distinct properties and uses in statistical analysis.
- 📝 The script explains hypothesis testing procedures in SAS, including setting up a null hypothesis, choosing an alpha value, and conducting a t-test to check the validity of the hypothesis.
- 📊 The advantages and disadvantages of both parametric and non-parametric tests are highlighted, with parametric tests providing detailed population information but requiring specific distributional assumptions, while non-parametric tests are more flexible but less efficient.
Q & A
What is the definition of statistics as mentioned in the script?
-Statistics is defined as a mathematical science related to the collection, presentation, analysis, and interpretation of data, which is used to understand and simplify complex real-world problems for making well-informed decisions.
How does statistical analysis differ from non-statistical analysis?
-Statistical analysis, also known as quantitative analysis, involves collecting, exploring, and presenting large amounts of data to identify patterns and trends. In contrast, non-statistical analysis, or qualitative analysis, provides generic information and may include text, sound, still images, and moving images but does not delve into numerical data patterns.
What are the two major categories of statistics?
-The two major categories of statistics are descriptive statistics and inferential statistics. Descriptive statistics organize and summarize data, while inferential statistics generalize from a sample to draw conclusions about a larger population.
Can you explain the role of descriptive statistics in analyzing data?
-Descriptive statistics help to organize data and focus on its main characteristics. It provides a summary of the data, either numerically or graphically, using measures such as average, mode, standard deviation, and correlation to describe the features of a dataset.
What is inferential statistics and how does it apply to data analysis?
-Inferential statistics generalizes from a sample to a larger dataset and applies probability theory to draw conclusions. It allows population parameters to be inferred from sample statistics and relationships within the data to be modeled, which helps in developing mathematical equations that describe the relationships between variables.
What is the purpose of hypothesis testing in inferential statistics?
-Hypothesis testing is an inferential statistical technique used to determine if there is enough evidence in a data sample to infer that a certain condition holds true for the entire population. It involves testing whether the identified conclusions from a sample correctly represent the population as a whole.
What are the differences between a null hypothesis and an alternative hypothesis?
-The null hypothesis is a statement of no effect or no difference, assumed to be true unless there is strong evidence to the contrary. The alternative hypothesis is any hypothesis other than the null, and it is assumed to be true when the null hypothesis is proven false.
What are the different types of variables mentioned in the script?
-The script defines population, sample, and variable, and then distinguishes four kinds of variable: quantitative, qualitative, discrete, and continuous. A population is the entire group from which data is collected, and a sample is a subset of that population. Quantitative variables differ in quantity, while qualitative variables differ in quality. Discrete variables cannot take values between two given values, while continuous variables can take any value within a range.
What are the four types of statistical measures used to describe data?
-The four types of statistical measures used to describe data are measures of frequency, measures of central tendency, measures of spread, and measures of position. Frequency measures how often a data value occurs, central tendency shows where data values tend to cluster, spread describes the variability of the data, and position identifies the location of a data value within the dataset.
Can you describe the role of the PROC MEANS procedure in SAS for descriptive statistics?
-The PROC MEANS procedure in SAS is used for data summarization. It computes descriptive statistics for variables across all observations and within groups of observations, providing insights into the central tendency, variability, and other summary measures of the dataset.
What is the significance of hypothesis testing procedures like parametric and non-parametric tests?
-Hypothesis testing procedures, both parametric and non-parametric, are significant for making inferences about a population based on sample data. Parametric tests make assumptions about the population distribution and are used when the data meets certain criteria, while non-parametric tests make fewer assumptions and are used when the data does not meet the assumptions required for parametric tests.
Outlines
📊 Introduction to Statistics and Its Importance
The first paragraph introduces the concept of statistics as a mathematical science for data collection, presentation, analysis, and interpretation. It highlights the role of statistics in simplifying complex real-world problems for informed decision-making. The paragraph distinguishes between statistical and non-statistical analysis, with the former being quantitative and revealing patterns and trends through data exploration. It also outlines the two major categories of statistics: descriptive, which organizes and summarizes data, and inferential, which generalizes findings to larger datasets using probability theory. Examples such as student heights in a classroom illustrate these concepts, emphasizing the prevalence of statistics in everyday life and business.
📈 Descriptive and Inferential Statistics with SAS Demo
This paragraph delves into the specifics of descriptive statistics, detailing measures like average, mode, standard deviation, and correlation used to describe data sets. It provides a practical example of analyzing student heights using descriptive methods. The paragraph then transitions to inferential statistics, explaining how it uses sample data to infer population parameters and model relationships. A demonstration using SAS software is described, showing how to import a dataset and use procedures like 'proc means' to analyze data. The concept of hypothesis testing as a part of inferential statistics is introduced, explaining the null and alternative hypotheses in the context of a pharmaceutical company's safety claims.
🔍 Understanding Variables and Hypothesis Testing in Statistics
The third paragraph focuses on the categorization of variables in statistics, distinguishing between nominal, ordinal, interval, and ratio variables. It explains the characteristics of each type with examples, such as gender for nominal and the Fahrenheit scale for interval variables. The paragraph also discusses the importance of recognizing variable types before statistical testing. A demonstration using SAS for hypothesis testing is provided, including a t-test example to determine if the mean delivery time deviates from a hypothesized value. The explanation includes the concepts of null hypothesis, alternative hypothesis, and p-values in the context of statistical significance.
🧐 Hypothesis Testing Techniques and Their Applications
This paragraph explores various hypothesis testing procedures, starting with an overview of parametric tests such as t-tests, ANOVA, chi-square, and linear regression. It describes the scenarios where each test is applicable, such as comparing means or assessing variances between groups. The paragraph also introduces non-parametric tests like the Wilcoxon rank sum test and Kruskal-Wallis H-test, which do not require strict distributional assumptions. Advantages and disadvantages of both parametric and non-parametric tests are listed, providing insight into their respective use cases and limitations in statistical analysis.
📚 Conclusion and Invitation to Learn More on Big Data
The final paragraph serves as a conclusion, summarizing the importance of understanding statistical methods and hypothesis testing for big data analysis. It encourages viewers to subscribe to the Simplilearn channel for more educational content on big data and to gain expertise in the field. The paragraph ends with a call to action, inviting the audience to watch more videos on the topic and pursue certification.
Keywords
💡Statistics
💡Descriptive Statistics
💡Inferential Statistics
💡Population
💡Sample
💡Variable
💡Quantitative Variable
💡Qualitative Variable
💡Hypothesis Testing
💡Parametric Tests
💡Non-Parametric Tests
Highlights
Statistics is defined as a mathematical science for data collection, presentation, analysis, and interpretation.
Statistics helps in simplifying complex real-world problems for well-informed decision-making.
An analysis can be done in two ways: statistical (quantitative) or non-statistical (qualitative).
Descriptive statistics organizes data and focuses on its main characteristics, using measures like average, mode, and standard deviation.
Inferential statistics uses probability theory to generalize findings from a sample to a larger population.
Descriptive statistics can summarize data numerically or graphically, such as finding maximum, minimum, and average values.
Inferential statistics categorizes data and uses samples to infer population parameters and model relationships.
The impact of statistics is evident in daily life, from home routines to the operation of major cities.
Key statistical terms include population, sample, variable, quantitative variable, qualitative variable, discrete variable, and continuous variable.
There are four types of statistical measures: frequency, central tendency, spread, and position.
SAS provides various procedures for performing descriptive statistics, such as PROC MEANS and PROC FREQ.
Hypothesis testing is an inferential statistical technique used to determine if a condition holds true for an entire population based on a sample.
Hypothesis testing involves the null hypothesis and the alternative hypothesis, used to make conclusions about population parameters.
Variables in hypothesis testing are classified into nominal, ordinal, interval, and ratio types.
Parametric tests like t-test and ANOVA are used when the population distribution is known, while non-parametric tests are used when it's not.
Parametric tests provide information about population parameters and relationships between variables, but require normally distributed data.
Non-parametric tests are simpler, make fewer assumptions, and do not require data to be normally distributed.
Examples of non-parametric tests include the Wilcoxon rank sum test and the Kruskal-Wallis H-test.
The advantages and disadvantages of parametric and non-parametric tests are discussed, highlighting their applicability and limitations.
Transcripts
[Music]
let's begin this lesson by defining the
term statistics
statistics is a mathematical science
pertaining to the collection
presentation analysis and interpretation
of data
it's widely used to understand the
complex problems of the real world and
simplify them to make well-informed
decisions
several statistical principles functions
and algorithms can be used to analyze
primary data build a statistical model
and predict the outcomes
an analysis of any situation can be done
in two ways statistical analysis or a
non-statistical analysis
statistical analysis is the science of
collecting exploring and presenting
large amounts of data to identify the
patterns and trends
statistical analysis is also called
quantitative analysis
non-statistical analysis provides
generic information and includes text
sound still images and moving images
non-statistical analysis is also called
qualitative analysis although both forms
of analysis provide results statistical
analysis gives more insight and a
clearer picture
a feature that makes it vital for
businesses
there are two major categories of
statistics descriptive statistics and
inferential statistics
descriptive statistics helps organize
data and focuses on the main
characteristics of the data
it provides a summary of the data
numerically or graphically
numerical measures such as average mode
standard deviation or sd and correlation
are used to describe the features of a
data set
suppose you want to study the height of
students in a classroom
in the descriptive statistics you would
record the height of every person in the
classroom and then find out the maximum
height minimum height and average height
of the population
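The classroom example can be sketched in a few lines of Python. This is only an illustration: the heights below are made-up values, not data from the video, and the standard library is used in place of SAS.

```python
from statistics import mean

# Hypothetical heights (in cm) of every student in a small classroom.
heights_cm = [150, 162, 158, 171, 165, 149, 175]

# Descriptive statistics summarize the whole group that was measured.
tallest = max(heights_cm)
shortest = min(heights_cm)
average = mean(heights_cm)

print(f"max={tallest} min={shortest} mean={average:.1f}")
```

Because every student is measured, these numbers describe the population directly and no inference is involved.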
inferential statistics generalizes the
larger data set and applies probability
theory to draw a conclusion
it allows you to infer population
parameters based on the sample
statistics and to model relationships
within the data
modeling allows you to develop
mathematical equations which describe
the inner relationships between two or
more variables
consider the same example of calculating
the height of students in the classroom
in inferential statistics you would
categorize height as tall
medium and small and then take only a
small sample from the population to
study the height of students in the
classroom
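A rough sketch of the sampling idea follows, assuming a simulated population of 500 heights (the distribution, the seed, and the sample size of 30 are all illustrative choices, not values from the video). The sample mean serves as an estimate of the population mean.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 500 simulated heights (cm).
population = [random.gauss(165, 8) for _ in range(500)]

# Inferential statistics works from a small random sample instead.
sample = random.sample(population, 30)

sample_mean = mean(sample)          # statistic computed from the sample
population_mean = mean(population)  # parameter we normally cannot observe

print(f"sample mean={sample_mean:.1f} population mean={population_mean:.1f}")
```

In practice the population mean is unknown, which is why the estimate from the sample is paired with probability theory to judge how far off it may be.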
the field of statistics touches our
lives in many ways from the daily
routines in our homes to the business of
making the greatest cities run the
effect of statistics are everywhere
there are various statistical terms that
one should be aware of while dealing
with statistics
population sample variable quantitative
variable qualitative variable discrete
variable continuous variable
a population is the group from which
data is to be collected
a sample is a subset of a population
a variable is a feature that is
characteristic of any member of the
population differing in quality or
quantity from another member
a variable differing in quantity is
called a quantitative variable for
example the weight of a person number of
people in a car
a variable differing in quality is
called a qualitative variable or
attribute for example color the degree
of damage of a car in an accident
a discrete variable is one in which no
value can be assumed between the two
given values
for example the number of children in a
family
a continuous variable is one in which
any value can be assumed between the two
given values
for example the time taken for a 100
meter run
typically there are four types of
statistical measures used to describe
the data
they are measures of frequency measures
of central tendency measures of spread
measures of position
let's learn each in detail
frequency of the data indicates the
number of times a particular data value
occurs in the given data set
the measures of frequency are number and
percentage
central tendency indicates whether the
data values tend to accumulate in the
middle of the distribution or toward the
end
the measures of central tendency are
mean
median and mode
spread describes how similar or varied
the set of observed values are for a
particular variable
the measures of spread are standard
deviation variance and quartiles
the measure of spread are also called
measures of dispersion
position identifies the exact location
of a particular data value in the given
data set
the measures of position are percentiles
quartiles and standard scores
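The four kinds of measures can be illustrated with Python's standard library. The scores are hypothetical, and the population versions of variance and standard deviation are used here as an assumption; a sample-based analysis would use the n - 1 divisor instead.

```python
from collections import Counter
from statistics import mean, median, mode, pstdev, pvariance, quantiles

scores = [4, 8, 6, 5, 3, 8, 9, 5, 8, 7]  # hypothetical data set

# Measures of frequency: count and percentage of each value.
counts = Counter(scores)
percentages = {value: 100 * c / len(scores) for value, c in counts.items()}

# Measures of central tendency: mean, median, mode.
center = (mean(scores), median(scores), mode(scores))

# Measures of spread: variance and standard deviation (population form).
spread = (pvariance(scores), pstdev(scores))

# Measures of position: quartiles and a standard score (z-score).
q1, q2, q3 = quantiles(scores, n=4)
z_score_of_9 = (9 - mean(scores)) / pstdev(scores)

print(counts[8], percentages[8], center, spread, (q1, q2, q3), z_score_of_9)
```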
statistical analysis system or sas
provides a list of procedures to perform
descriptive statistics
they are as follows
proc print
proc contents
proc means
proc freq
proc univariate
proc gchart
proc boxplot
proc gplot
proc print
it prints all the variables in a sas
data set
proc contents it describes the structure
of a data set
proc means
it provides data summarization tools to
compute descriptive statistics for
variables across all observations and
within the groups of observations
proc freq
it produces one-way to n-way frequency
and cross tabulation tables
frequencies can also be output to a
sas data set
proc univariate
it goes beyond what proc means does and
is useful in conducting some basic
statistical analyses and includes high
resolution graphical features
proc gchart
the gchart procedure produces six types
of charts block charts horizontal and
vertical bar charts
pie and donut charts and star charts
these charts graphically represent the
value of a statistic calculated for one
or more variables in an input sas data
set
the charted variables can be either
numeric or character
proc boxplot
the box plot procedure creates side by
side box and whisker plots of
measurements organized in groups
a box and whisker plot displays the mean
quartiles and minimum and maximum
observations for a group
proc gplot
the gplot procedure creates two-dimensional
graphs including simple scatter plots
overlay plots in which multiple sets of
data points are displayed on one set of
axes
plots against the second vertical axis
bubble plots and logarithmic plots
in this demo you'll learn how to use
descriptive statistics to analyze the
mean from the electronic data set
let's import the electronic data set
into the sas console
in the left pane right-click the
electronic.xlsx dataset and click import
data
the code to import the data generates
automatically
copy the code and paste it in the new
window
the proc means procedure is used to
analyze the mean of the imported data
set
the keyword data identifies the input
data set
in this demo the input data set is
electronic
the output obtained is shown on the
screen
note that the number of observations
mean standard deviation and maximum and
minimum values of the electronic data
set are obtained
this concludes the demo on how to use
descriptive statistics to analyze the
mean from the electronic data set
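The SAS code itself is not reproduced in the transcript. As a rough Python analogue of the summary described above (number of observations, mean, standard deviation, minimum, maximum), the sketch below uses a made-up numeric column, since the contents of the electronic data set are not shown.

```python
from statistics import mean, stdev

# Hypothetical stand-in for one numeric column of the electronic data set.
prices = [199.0, 249.5, 149.0, 329.0, 279.5]

summary = {
    "N": len(prices),
    "Mean": mean(prices),
    "Std Dev": stdev(prices),  # sample standard deviation (n - 1 divisor)
    "Minimum": min(prices),
    "Maximum": max(prices),
}

for name, value in summary.items():
    print(f"{name:>8}: {value:.2f}")
```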
so far you've learned about descriptive
statistics
let's now learn about inferential
statistics
hypothesis testing is an inferential
statistical technique to determine
whether there is enough evidence in a
data sample to infer that a certain
condition holds true for the entire
population
to understand the characteristics of the
general population we take a random
sample and analyze the properties of the
sample
we then test whether or not the
identified conclusions correctly
represent the population as a whole
the purpose of hypothesis testing is
to choose between two competing
hypotheses about the value of a
population parameter
for example
one hypothesis might claim that the
wages of men and women are equal while
the other might claim that women make
more than men
hypothesis testing is formulated in
terms of two hypotheses
null hypothesis which is referred to as
h0
alternative hypothesis which is referred
to as h1
the null hypothesis is assumed to be
true unless there is strong evidence to
the contrary
the alternative hypothesis is assumed to
be true when the null hypothesis is
proven false
let's understand the null hypothesis and
alternative hypothesis using a general
example
null hypothesis attempts to show that no
variation exists between variables and
alternative hypothesis is any hypothesis
other than the null
for example say a pharmaceutical company
has introduced a medicine in the market
for a particular disease and people have
been using it for a considerable period
of time and it's generally considered
safe
if the medicine is proved to be safe
then it is referred to as null
hypothesis
to reject null hypothesis we should
prove that the medicine is unsafe
if the null hypothesis is rejected then
the alternative hypothesis is used
before you perform any statistical tests
with variables it's significant to
recognize the nature of the variables
involved
based on the nature of the variables
it's classified into four types
they are categorical or nominal
variables ordinal variables
interval variables and ratio variables
nominal variables are ones which have
two or more categories and it's
impossible to order the values
examples of nominal variables include
gender and blood group
ordinal variables have values ordered
logically however the relative distance
between two data values is not clear
examples of ordinal variables include
considering the size of a coffee cup
large medium and small and considering
the ratings of a product bad good and
best
interval variables are similar to
ordinal variables except that the values
are measured in a way where their
differences are meaningful
with an interval scale equal differences
between scale values do have equal
quantitative meaning
for this reason an interval scale
provides more quantitative information
than the ordinal scale
the interval scale does not have a true
zero point a true zero point means that
a value of zero on the scale represents
zero quantity of the construct being
assessed examples of interval variables
include the fahrenheit scale used to
measure temperature and distance between
two compartments in a train
ratio scales are similar to interval
scales in that equal differences between
scale values have equal quantitative
meaning
however ratio scales also have a true
zero point which give them an additional
property
for example the system of inches used
with a common ruler is an example of a
ratio scale there is a true zero point
because zero inches does in fact
indicate a complete absence of length
in this demo you'll learn how to perform
the hypothesis testing using sas
in this example let's check against the
length of certain observations from a
random sample
the keyword data identifies the input
data set
the input statement is used to declare
the aging variable and cards to read
data into sas
let's perform a t-test to check the null
hypothesis
let's assume that the null hypothesis to
be that the mean days to deliver a
product is six days
so null hypothesis equals six
alpha value is the probability of making
an error which is 5 percent standard and
hence alpha equals 0.05
the variable statement names the
variable to be used in the analysis
the output is shown on the screen
note that the p-value is greater than
the alpha value which is 0.05 therefore
we fail to reject the null hypothesis
this concludes the demo on how to
perform the hypothesis testing using sas
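The demo's logic can be sketched as a one-sample t-test in plain Python. The delivery times are invented for illustration (the video's data is not shown), and instead of a p-value the sketch compares the statistic with the two-sided critical value for alpha = 0.05 and 9 degrees of freedom taken from a standard t table. The two decision rules are equivalent.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical sample of delivery times in days.
days = [5, 7, 6, 8, 6, 5, 7, 7, 4, 6]

mu0 = 6  # null hypothesis: the mean delivery time is six days
n = len(days)

# t = (sample mean - hypothesized mean) / standard error of the mean
t_stat = (mean(days) - mu0) / (stdev(days) / sqrt(n))

# Two-sided critical value for alpha = 0.05 with n - 1 = 9 degrees of
# freedom, read from a t table (valid only for a sample of 10 values).
T_CRITICAL = 2.262

reject_null = abs(t_stat) > T_CRITICAL
print(f"t={t_stat:.3f} reject H0: {reject_null}")
```

With these made-up values the statistic falls well inside the critical range, so the null hypothesis is not rejected, the same outcome as in the demo.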
let's now learn about hypothesis testing
procedures
there are two types of hypothesis
testing procedures
they are parametric tests and
non-parametric tests
in statistical inference or hypothesis
testing the traditional tests such as
t-test and anova are called parametric
tests
they depend on the specification of a
probability distribution except for a
set of free parameters
in simple words
you can say that if the population
information is known completely by its
parameter then it is called a parametric
test
if the population or parameter
information is not known and you are
still required to test the hypothesis of
the population then it's called a
non-parametric test
non-parametric tests do not require any
strict distributional assumptions
there are various parametric tests they
are as follows
t-test
anova
chi squared
linear regression
let's understand them in detail
t-test
a t-test determines if two sets of data
are significantly different from each
other
the t-test is used in the following
situations
to test if the mean is significantly
different than a hypothesized value
to test if the mean for two independent
groups is significantly different to
test if the mean for two dependent or
paired groups is significantly different
for example
let's say you have to find out which
region spends the highest amount of
money on shopping
it's impractical to ask everyone in the
different regions about their shopping
expenditure
in this case you can calculate the
highest shopping expenditure by
collecting sample observations from each
region
with the help of the t-test you can
check if the difference between the
regions are significant or a statistical
fluke
anova
anova is a generalized version of the
t-test and used when the mean of the
interval dependent variable is different
to the categorical independent variable
when we want to check variance between
two or more groups we apply the anova
test
for example let's look at the same
example of the t-test example
now you want to check how much people in
various regions spend every month on
shopping
in this case there are four groups
namely east west
north and south
with the help of the anova test you can
check if the difference between the
regions is significant or a statistical
fluke
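A minimal sketch of the one-way ANOVA F statistic for the four-region example follows. The spending figures are hypothetical, and the code stops at the F statistic; deciding significance would additionally need an F table or a statistics package.

```python
from statistics import mean

# Hypothetical monthly shopping spend sampled in each region.
groups = {
    "east": [200, 220, 210],
    "west": [250, 240, 260],
    "north": [205, 215, 210],
    "south": [300, 310, 290],
}

all_values = [x for g in groups.values() for x in g]
grand_mean = mean(all_values)
k = len(groups)            # number of groups
n_total = len(all_values)  # total number of observations

# Variation of the group means around the grand mean.
ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())
# Variation of the observations around their own group mean.
ss_within = sum((x - mean(g)) ** 2 for g in groups.values() for x in g)

# F = mean square between groups / mean square within groups
f_stat = (ss_between / (k - 1)) / (ss_within / (n_total - k))
print(f"F={f_stat:.2f}")
```

A large F means the regional averages differ by much more than the spread inside each region would explain.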
chi-square
chi-square is a statistical test used to
compare observed data with data you
would expect to obtain according to a
specific hypothesis
let's understand the chi-square test
through an example
you have a data set of male shoppers and
female shoppers
let's say you need to assess whether the
probability of females purchasing items
of 500 or more is significantly
different from the probability of males
purchasing items of 500 or more
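The shopper example can be sketched as a chi-square test on a 2x2 table. The counts are made up, no continuity correction is applied, and 3.841 is the standard table value for alpha = 0.05 with 1 degree of freedom.

```python
# Hypothetical counts: rows are female and male shoppers,
# columns are "purchased 500 or more" and "purchased less than 500".
observed = [[30, 70], [20, 80]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Expected count per cell if purchasing were independent of gender.
expected = [[r * c / total for c in col_totals] for r in row_totals]

chi2 = sum(
    (observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
    for i in range(2)
    for j in range(2)
)

CHI2_CRITICAL = 3.841  # alpha = 0.05, 1 degree of freedom (from a table)
significant = chi2 > CHI2_CRITICAL
print(f"chi-square={chi2:.3f} significant: {significant}")
```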
linear regression
there are two types of linear regression
simple linear regression and multiple
linear regression
simple linear regression is used when
one wants to test how well a variable
predicts another variable
multiple linear regression allows one to
test how well multiple variables or
independent variables predict a variable
of interest
when using multiple linear regression we
additionally assume the predictor
variables are independent
for example finding relationship between
any two variables say sales and profit
is called simple linear regression
finding relationship between any three
variables say sales cost telemarketing
is called multiple linear regression
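For the simple case, a least-squares line can be fitted by hand. The sales and profit figures are hypothetical, and the formulas are the usual ordinary least squares estimates for one predictor.

```python
from statistics import mean

# Hypothetical paired observations.
sales = [10, 20, 30, 40, 50]
profit = [2.9, 5.2, 6.8, 9.1, 11.0]

x_bar, y_bar = mean(sales), mean(profit)

# slope = covariance(x, y) / variance(x), intercept from the means
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(sales, profit))
s_xx = sum((x - x_bar) ** 2 for x in sales)
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

# Predict profit for a new, hypothetical sales figure.
predicted = intercept + slope * 60
print(f"profit = {intercept:.2f} + {slope:.3f} * sales; at 60: {predicted:.2f}")
```

Multiple linear regression extends the same idea to several predictors and is usually solved with a matrix library rather than by hand.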
some of the non-parametric tests are
wilcoxon rank sum test and
kruskal-wallis h-test
wilcoxon rank sum test
the wilcoxon signed rank test is a
non-parametric statistical hypothesis
test used to compare two related samples
or matched samples to assess whether or
not their population mean ranks differ
in wilcoxon rank sum test you can test
the null hypothesis on the basis of the
ranks of the observations
kruskal-wallis h-test
kruskal-wallis h-test is a rank-based
non-parametric test used to compare
independent samples of equal or
different sample sizes
in this test you can test the null
hypothesis on the basis of the ranks of
the independent samples
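To show what testing "on the basis of the ranks" means, here is a sketch of the Kruskal-Wallis H statistic. The three samples are invented, they deliberately contain no tied values (ties need an extra correction that is left out), and the code stops at H rather than looking up a critical value.

```python
# Hypothetical independent samples of unequal size, with no tied values.
samples = [[12, 15, 11], [18, 14, 20], [22, 19, 25, 17]]

# Rank every observation within the pooled data (1 = smallest).
pooled = sorted(x for s in samples for x in s)
rank = {value: i + 1 for i, value in enumerate(pooled)}

n_total = len(pooled)
rank_sums = [sum(rank[x] for x in s) for s in samples]

# H = 12 / (N (N + 1)) * sum(R_i^2 / n_i) - 3 (N + 1)
h_stat = (
    12 / (n_total * (n_total + 1))
    * sum(r ** 2 / len(s) for r, s in zip(rank_sums, samples))
    - 3 * (n_total + 1)
)
print(f"rank sums={rank_sums} H={h_stat:.2f}")
```

Only the ordering of the values enters the calculation, which is why no assumption about a normal distribution is needed.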
the advantages of parametric tests are
as follows
provide information about the population
in terms of parameters and confidence
intervals
easier to use in modeling analyzing and
for describing data with central
tendencies and data transformations
express the relationship between two or
more variables
don't need to convert data into rank
order to test
the disadvantages of parametric tests
are as follows
only support normally distributed data
only applicable on variables not
let's now list the advantages and
disadvantages of non-parametric tests
the advantages of non-parametric tests
are as follows
simple and easy to understand
do not involve population parameters and
a sampling theory
make fewer assumptions
provide results similar to parametric
procedures
the disadvantages of non-parametric
tests are as follows
not as efficient as parametric tests
difficult to perform operations on large
samples manually
hey want to become an expert in big data
then subscribe to the simplilearn
channel and click here to watch more
such videos to nerd up and get certified
in big data click here