The most important ideas in modern statistics
Summary
TL;DR: This video script discusses eight revolutionary ideas in statistics that have shaped the field from 1970 to 2021. It highlights the importance of counterfactual causal inference, the bootstrap method, simulation-based inference, overparameterization, regularization techniques, multi-level models, and the role of computational power. The script emphasizes the evolution of statistical practice, the significance of robust inference, and the innovative use of plots and visuals for data analysis. It also touches on the concept of adaptive decision analysis and the impact of robust statistics in providing trustworthy analyses despite potential assumption violations.
Takeaways
- 📊 Statistics is an evolving field with influential ideas shaping its trajectory.
- 📈 Andrew Gelman and Aki Vehtari are authorities in Bayesian statistics, known for their work on Bayesian data analysis.
- 🔍 The concept of counterfactual causal inference allows making causal statements from observational data.
- 🔄 The bootstrap method is a versatile algorithm for estimating the sampling distribution of a statistic using a single dataset.
- 💻 The rise of computational power has highlighted the importance of computation in statistics, enabling complex simulations and analyses.
- 🔧 Overparameterized models and neural networks offer extreme flexibility in modeling a wide range of phenomena.
- 🔧 Regularization techniques are used to prevent overfitting in flexible models by maintaining a degree of simplicity.
- 📈 Multi-level models, also known as hierarchical or mixed effect models, are used to aggregate information and provide more nuanced analyses.
- 🔄 The Expectation-Maximization (EM) algorithm and the Metropolis algorithm are key statistical algorithms that address complex estimation problems.
- 🔧 Adaptive decision analysis allows for the modification of experiments based on interim data, improving the design and decision-making process.
- 🔍 Robust inference provides trustworthy statistical analyses even when assumptions are violated, offering more confidence in the results.
Q & A
What is the main focus of the video?
-The video discusses eight innovations in statistics that have significantly shaped the field, making it accessible to a general audience.
Who are Andrew Gelman and Aki Vehtari, and why are they considered authorities in statistics?
-Andrew Gelman and Aki Vehtari are renowned statisticians known for their work in Bayesian statistics. They are considered authorities because they have written extensively on Bayesian data analysis, which is highly respected among practitioners.
What is the difference between experimental data and observational data in statistics?
-Experimental data comes from controlled experiments where researchers can manipulate variables, allowing for causal claims. Observational data, on the other hand, comes from observing real-world scenarios where researchers cannot control who receives a treatment, limiting them to making only correlational claims.
How does counterfactual causal inference help in dealing with observational data?
-Counterfactual causal inference allows statisticians to make adjustments to observational data, getting closer to causal statements by considering what would have happened in an alternate reality where the treatment was not applied.
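The potential-outcomes notation used in the video (y sub one for the studied reality, y sub zero for the counterfactual) can be made concrete with simulated data. Below is a minimal R sketch with made-up numbers, not the article's own example: it shows that the individual causal effect y1 - y0 is only computable because the simulation generates both outcomes, while a real analysis only ever observes one of them.

```r
# Minimal sketch: simulate potential outcomes for n students (hypothetical data).
set.seed(1)
n  <- 1000
y0 <- rnorm(n, mean = 70, sd = 10)      # test score if a student does not study
y1 <- y0 + 5                            # test score if the same student studies (true effect = 5)
studied <- rbinom(n, 1, 0.5)            # which reality we actually get to observe
y_obs   <- ifelse(studied == 1, y1, y0) # only one potential outcome is ever seen

# With both potential outcomes (impossible in practice) the causal effect is direct:
mean(y1 - y0)                           # exactly 5

# With only the observed data we estimate it by comparing groups:
mean(y_obs[studied == 1]) - mean(y_obs[studied == 0])  # close to 5 because studying was randomized here
```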
What is the bootstrap method and why is it significant?
-The bootstrap is an algorithm for estimating the sampling distribution of a statistic by resampling with replacement from the original data set. It is significant because it simplifies the process of creating confidence intervals and is applicable to many kinds of statistics, highlighting the importance of computation in statistics.
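As a rough illustration, here is a minimal base-R sketch of the resampling idea; the choice of dataset and of the sample mean as the statistic of interest are assumptions for the example, not taken from the video.

```r
# Minimal sketch: nonparametric bootstrap for the sampling distribution of the mean.
set.seed(1)
x <- rexp(50, rate = 1)            # one observed dataset (hypothetical)
B <- 2000                          # number of bootstrap datasets

boot_means <- replicate(B, {
  resample <- sample(x, size = length(x), replace = TRUE)  # resample with replacement
  mean(resample)                                           # statistic of interest
})

quantile(boot_means, c(0.025, 0.975))   # percentile bootstrap 95% confidence interval
```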
What is the role of simulations in statistics?
-Simulations allow statisticians to assess experiments and new statistical models without actually conducting them, saving resources and time. They can be used to evaluate the power and type I error of experimental designs, such as clinical trials.
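A hedged sketch of what such a simulation might look like in R: a hypothetical two-arm trial analysed with a t-test, where power and type I error are estimated simply by counting significant results over many simulated datasets. The effect size, sample size, and choice of test are illustrative assumptions.

```r
# Minimal sketch: estimate power and type I error of a two-arm design by simulation.
set.seed(1)
simulate_trial <- function(n_per_arm, effect) {
  control   <- rnorm(n_per_arm, mean = 0, sd = 1)
  treatment <- rnorm(n_per_arm, mean = effect, sd = 1)
  t.test(treatment, control)$p.value
}

n_sims <- 2000
# Power: proportion of significant results when the effect is real (0.5 SD, assumed).
power <- mean(replicate(n_sims, simulate_trial(50, effect = 0.5)) < 0.05)
# Type I error: proportion of significant results when there is no effect at all.
alpha <- mean(replicate(n_sims, simulate_trial(50, effect = 0.0)) < 0.05)
c(power = power, type_I_error = alpha)
```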
Why is increasing the number of parameters in a statistical model beneficial?
-Increasing the number of parameters in a model provides more flexibility, allowing it to better represent complex real-world scenarios. This can lead to more accurate predictions and a better understanding of the data.
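To make the progression concrete, here is a small R sketch on simulated longitudinal data (all variable names hypothetical) going from a single treatment parameter, to added time and interaction parameters, to subject-specific effects via the lme4 package (assumed installed). This mirrors the video's description rather than reproducing any specific model from it.

```r
# Minimal sketch with hypothetical column names: outcome, treatment, time, subject.
set.seed(1)
dat <- expand.grid(subject = factor(1:30), time = 0:4)
dat$treatment <- as.integer(as.integer(dat$subject) %% 2 == 0)
dat$outcome   <- 1 + 0.5 * dat$treatment + 0.3 * dat$time +
                 0.2 * dat$treatment * dat$time + rnorm(nrow(dat))

fit1 <- lm(outcome ~ treatment, data = dat)         # one treatment parameter
fit2 <- lm(outcome ~ treatment * time, data = dat)  # adds time and interaction parameters

# Subject-specific effects turn it into a mixed-effects model (requires lme4).
# install.packages("lme4")
library(lme4)
fit3 <- lmer(outcome ~ treatment * time + (1 | subject), data = dat)
```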
What is regularization in the context of statistical models?
-Regularization is a technique used to prevent overfitting in extremely flexible models by enforcing simplicity. It helps balance the complexity of the model, ensuring it does not just approximate the data but represents a more general phenomenon.
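For illustration only, a minimal sketch of ridge and lasso regularization using the glmnet package (assumed installed): the penalty keeps an overparameterized linear model from simply memorizing the noise in the data.

```r
# Minimal sketch: ridge and lasso regularization with the glmnet package.
# install.packages("glmnet")
library(glmnet)
set.seed(1)
n <- 100; p <- 50                         # many parameters relative to the data
X <- matrix(rnorm(n * p), n, p)
y <- X[, 1] * 2 + X[, 2] * -1 + rnorm(n)  # only the first two predictors matter

ridge <- cv.glmnet(X, y, alpha = 0)       # alpha = 0: ridge penalty shrinks all coefficients
lasso <- cv.glmnet(X, y, alpha = 1)       # alpha = 1: lasso penalty sets many coefficients to zero

coef(lasso, s = "lambda.min")             # most coefficients should be exactly zero
```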
How do multi-level models differ from simpler statistical models?
-Multi-level models, also known as hierarchical or mixed effect models, assume additional structure over the parameters, allowing for the aggregation of data from different levels or groups. This structure is useful for combining information from various sources and can be applied in both frequentist and Bayesian frameworks.
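A minimal sketch of a two-level model in R using lme4 (assumed installed), with hospitals as the hypothetical second-level units; the group effects are modeled as draws from a common distribution, which is what lets information be pooled across groups.

```r
# Minimal sketch: a two-level (hierarchical) model with the lme4 package.
library(lme4)
set.seed(1)
hospital    <- factor(rep(1:20, each = 15))   # second-level units (hypothetical clusters)
true_effect <- rnorm(20, mean = 1, sd = 0.5)  # each hospital has its own treatment effect
treatment   <- rbinom(300, 1, 0.5)
outcome     <- 2 + true_effect[hospital] * treatment + rnorm(300)

# Random slope for treatment: hospital effects are assumed to come from a common
# distribution, so information is shared across hospitals when each one is estimated.
fit <- lmer(outcome ~ treatment + (treatment | hospital))
fixef(fit)                 # population-level (first-level) treatment effect
head(coef(fit)$hospital)   # partially pooled hospital-specific effects
```

A Bayesian version of the same structure would add explicit priors on the group-level mean and variance, for example via Stan or the brms package.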
What is the expectation maximization (EM) algorithm and its significance?
-The EM algorithm is a statistical algorithm used for estimation problems, particularly when the model contains latent variables or unobserved data. It allows for the estimation of parameters in complex models that cannot be solved directly.
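A hand-rolled sketch of the EM idea for the latent-class case mentioned above: a two-component Gaussian mixture where the group labels are never observed. The data and starting values are made up for illustration, and the common standard deviation is an assumption to keep the example short.

```r
# Minimal sketch: EM for a two-component Gaussian mixture (latent class labels unobserved).
set.seed(1)
x <- c(rnorm(150, mean = 0, sd = 1), rnorm(150, mean = 4, sd = 1))  # true labels discarded

# Initial guesses for mixing proportion, means, and common standard deviation.
pi1 <- 0.5; mu <- c(-1, 5); sigma <- 2

for (iter in 1:100) {
  # E-step: posterior probability that each point belongs to component 1.
  d1 <- pi1 * dnorm(x, mu[1], sigma)
  d2 <- (1 - pi1) * dnorm(x, mu[2], sigma)
  gamma <- d1 / (d1 + d2)

  # M-step: update parameters using the responsibilities as weights.
  pi1   <- mean(gamma)
  mu[1] <- sum(gamma * x) / sum(gamma)
  mu[2] <- sum((1 - gamma) * x) / sum(1 - gamma)
  sigma <- sqrt(sum(gamma * (x - mu[1])^2 + (1 - gamma) * (x - mu[2])^2) / length(x))
}
c(pi1 = pi1, mu1 = mu[1], mu2 = mu[2], sigma = sigma)  # should land near 0.5, 0, 4, 1
```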
What is robust inference and its importance in statistics?
-Robust inference provides trustworthy statistical analyses even when assumptions are violated. It ensures that the results of statistical analyses, such as confidence intervals or estimated values, remain reliable even if the underlying assumptions are not perfectly met.
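A tiny illustrative R example (made-up numbers) of the median-versus-mean point made in the video: a few heavy-tailed outliers drag the mean but barely move the median.

```r
# Minimal sketch: the median as a robust estimator when heavy tails violate normality.
set.seed(1)
clean         <- rnorm(100, mean = 10, sd = 1)
with_outliers <- c(clean, c(95, 110, 120))   # a few extreme values from a heavier tail

c(mean_clean = mean(clean),         median_clean = median(clean))
c(mean_outl  = mean(with_outliers), median_outl  = median(with_outliers))
# The mean is dragged upward by the outliers; the median barely moves.
```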
Outlines
📊 Introduction to Statistical Innovations
The video script begins by discussing the importance of statistics in various aspects of life and introduces eight innovations that have shaped the field. The speaker, Christian, aims to make statistics accessible and highlights the authority of Andrew Gelman and Aki Vehtari, authors of a thought-provoking essay on the most important statistical ideas in the past 50 years. The script also touches on the limitations of observational data and the introduction of counterfactual causal inference, which allows for causal statements in observational studies.
🔍 The Bootstrap Method and Computational Power
This paragraph delves into the bootstrap method, a technique for estimating the sampling distribution of a statistic using a single dataset. It emphasizes the significance of computation in statistics and how the rise of computational power has facilitated the use of simulations and the development of complex statistical models. The script also mentions the importance of understanding statistical parameters and the role of overparameterization in allowing models to capture more complex realities, such as those represented by neural networks.
📈 Multi-Level Models and Bayesian Approaches
The third paragraph discusses multi-level models, also known as hierarchical or mixed effect models, which are used to aggregate data from multiple sources and incorporate prior knowledge into the analysis. It highlights the flexibility of Bayesian methods, particularly in handling small sample sizes and the ability to choose different priors for various levels of the model. The paragraph also touches on the importance of computers and computational power in the advancement of statistical algorithms and the development of more complex models.
🔬 Algorithms and Robust Inference
This section introduces two key statistical algorithms: the Expectation-Maximization (EM) algorithm for estimation problems and the Metropolis algorithm for generating samples from complex probability distributions. It also discusses adaptive decision analysis, which allows for the modification of experiments based on interim data, and robust inference, which provides trustworthy analyses even when assumptions are violated. The paragraph concludes with a discussion on the importance of visualizing data through plots and the impact of new technologies on the evolution of statistics.
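Since the Metropolis algorithm is named here but not unpacked, below is a minimal random-walk Metropolis sketch in R. The target log-density is a stand-in chosen for illustration, not a posterior from the video; the point is that summaries like means and credible intervals can be recovered from the draws even without a closed-form formula for the distribution.

```r
# Minimal sketch: random-walk Metropolis sampling from an unnormalized log-density.
set.seed(1)
log_target <- function(theta) dnorm(theta, mean = 2, sd = 0.5, log = TRUE)  # stand-in "posterior"

n_iter  <- 10000
samples <- numeric(n_iter)
theta   <- 0                                        # starting value
for (i in 1:n_iter) {
  proposal  <- theta + rnorm(1, sd = 1)             # symmetric random-walk proposal
  log_ratio <- log_target(proposal) - log_target(theta)
  if (log(runif(1)) < log_ratio) theta <- proposal  # accept with probability min(1, ratio)
  samples[i] <- theta
}

# Even without an equation for the posterior we can recover its summaries from the draws.
mean(samples[-(1:1000)])                            # posterior mean (after burn-in)
quantile(samples[-(1:1000)], c(0.025, 0.975))       # 95% credible interval
```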
Keywords
💡Statistics
💡Biostatistics
💡Counterfactual Causal Inference
💡Bootstrap
💡Overparameterization
💡Multi-level Models
💡Expectation Maximization (EM) Algorithm
💡Metropolis Algorithm
💡Adaptive Decision Analysis
💡Robust Inference
💡Propensity Score Matching
Highlights
Statistics is a field of research with influential ideas that have changed its trajectory.
Christian aims to make statistics accessible for practical application in daily life.
Andrew Gelman and Aki Vehtari published an essay on the most important statistical ideas in the past 50 years.
Gelman and Vehtari are authorities in Bayesian statistics, known for their work on Bayesian data analysis.
The essay discusses statistical innovations from 1970 to 2021, focusing on modern statistics.
Counterfactual causal inference allows making causal statements from observational data.
The bootstrap method is a general algorithm for estimating the sampling distribution of a statistic using a single dataset.
Simulation-based inference uses computational power to assess experiments and statistical models without actual data collection.
Overparameterized models and neural networks provide extreme flexibility in modeling complex phenomena.
Regularization techniques help balance complexity in extremely flexible models.
Multi-level models, also known as hierarchical or mixed effect models, are used to aggregate data from multiple sources.
The expectation-maximization (EM) algorithm is used for estimating parameters in complex models with latent classes.
The Metropolis algorithm and its descendants enable generation of samples from complex probability distributions.
Adaptive decision analysis allows for modifying experiments based on interim data collection.
Robust inference provides trustworthy statistical analyses even when assumptions are violated.
Propensity score matching is a technique used to estimate causal effects by matching similar individuals in treatment and control groups.
Plots and visuals are essential tools for examining data and assessing statistical models.
The tidyverse framework, popularized by Hadley Wickham, simplifies data manipulation and visualization in R.
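As a rough flavour of the tidyverse workflow mentioned in the last highlight (packages assumed installed), here is a short sketch that groups, summarises, and plots the built-in mtcars dataset.

```r
# Minimal sketch: tidyverse-style exploration of a built-in dataset.
# install.packages("tidyverse")
library(dplyr)
library(ggplot2)

mtcars %>%
  group_by(cyl) %>%                              # split into groups by cylinder count
  summarise(mean_mpg = mean(mpg), n = n())       # summarise each group

ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point() +                                 # look at the raw data
  geom_smooth(method = "lm", se = FALSE) +       # overlay a simple fitted trend per group
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon", colour = "Cylinders")
```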
Transcripts
most people only ever interact with
statistics for a limited part of their
lives but statistics is a field of
research like other areas statistics has
evolved influential ideas have come and
changed the trajectory of Statistics as
a student of biostatistics it's my
responsibility to be familiar with these
revolutionary ideas in the field in this
video we'll talk about eight Innovations
in statistics that have shaped how we
know it today and I'll do my best to
explain what these Innovations are and
why they were so impactful in a way
that makes sense to a general audience
if you're new to the channel welcome my
name is Christian my goal is to make
statistics accessible to more people so
that they can apply it to their daily lives
in 2021 Andrew Gelman and Aki Vehtari
published an article in the Journal of
the American statistical Association or
JASA JASA is one of the most
prestigious journals in the field of
Statistics so publishing here is a big
deal but instead of a research
manuscript Gelman and Vehtari published an
essay this essay is titled what are the
most important statistical ideas in
the past 50 years and this article is
what motivates this video but two
statisticians do not make up an entire
field of Statistics so what gives these
two authors the authority to answer such
a question the essay was meant to be
thought-provoking not authoritative
though I would argue that both Andrew
Gelman and Aki Vehtari are in fact
authorities in the field they are widely
known among practitioners of Bayesian
statistics since they basically wrote
the Bible on it Bayesian Data Analysis as
of the writing of this video Andrew
Gelman is a professor at Columbia
University in both statistics and
political science Aki Vehtari is a
professor of computational probabilistic
modeling at Aalto University in Finland
Andrew Gelman also maintains a fantastic
blog on statistics political science and
their intersection which I highly
recommend the article considers
statistical innovations that happened
from around 1970 to 2021 so this is the
time period for which I call Modern
statistics without further ado let's
have a look at the
list in an Ideal World all data comes
from experimental data where a
researcher can control who receives an
intervention and who doesn't when we can
do this in a carefully controlled manner
such as in an RCT we can claim cause and
effect between an intervention and some
outcome of interest but we live in the
real world and the real world gives us
observational data sometimes where we
can't control who receives a treatment
and who doesn't we can still perform
statistical analyses on observational data
but we cannot make the same causal claims
about them only correlational claims
this was until counterfactual causal
inference came onto the scene this
framework allows us to take
observational data and make adjustments
in a way that gets us closer to causal
statements how this works is the topic
of an entire other video so I'll give
you the basic breakdown let's consider a
world where I have an upcoming test I
can choose to study a bit more or I can
choose not to in this reality I choose
to study and I get some score on the
test later I'll denote this as y sub one
if a supernatural statistician wanted to
know if this decision caused the change
in my score then they would have to
examine another reality they would have
to find the reality where I didn't
choose to study and measured the test
score of that version of me who didn't
study I'll call that outcome y sub zero the
only difference between these two
versions of me is that I chose to study
in one but not in another this
unobserved version of myself is called
the counterfactual because this
version of me is counter to what
actually or factually happened then the
causal effect of me studying on my test
score is the difference between y sub
one and Y Sub 0 the fundamental problem
in causal inference is that we can only
ever observe one reality and therefore
one outcome in essence it's a missing
data problem the counterfactual
framework is important because it helped
give statisticians a way to formalize
causal effects in mathematical models
this is significant because several
fields of study are prone to having more
observational data such as economics and
psychology if you've been with my
channel for a while you may be familiar
with this one already that video delves
into more technical detail but I'll
briefly explain what it is in this video
the bootstrap is a general algorithm for
estimating the sampling distribution of a
statistic ordinarily this would require
Gathering multiple data sets which no
one has time for or it would require a
mathematical derivation which I don't
have time for rather than do either of
these the bootstrap takes the
interesting approach of reusing data
from a single data set the bootstrap
generates several bootstrap data sets by
sampling with replacement from the
original for each of these bootstrap
data sets a statistic of interest is
calculated and their distribution can be
derived from this entire collection this
is incredibly valuable not only because
it's super simple and therefore easy for
more people to use it's applicable to
many kinds of statistics we can use the
bootstrap to create confidence intervals for
Point parameters like a regression
coefficient or we could create
confidence bands for coefficient
functions like we might see in
functional data analysis the bootstrap
is significant not only because of its
usefulness but because it highlights the
significance of computation in
statistics a quote from one of my heroes
is very relevant here you see killbots
have a preset kill limit knowing their
weakness I sent wave after wave of my own
men at them until they reach their limit
and shut down instead of human life
statisticians can do a lot just by using
Wave After Wave of our own computer's
processing power the rise of
computational power has made it easier
to perform simulations and simulated
data allows us to assess experiments and
new statistical models for example
simulations can be used to assess power
and type I error of experimental
designs for clinical trials without
actually needing to run them this means
a lot of money and effort is saved for
pharmaceutical companies another example
of simulation based inference comes from
Bayesian statistics Bayesians encode knowledge
in the form of priors or probability
distributions on parameters using these
priors we can actually simulate data
from a prior distribution and check if
the resulting data we collected actually
makes sense here this is called a prior
predictive check the same can be done
for the posterior distribution of a
parameter which makes it a posterior
predictive check and these are
incredibly useful for validating our
models
to understand this idea we need some
context on statistical parameters one
way to view parameters is that they are
representations of ideas that are
important to us within statistical
models in a two sample T Test the mean
parameter represents the difference of
two groups such as a placebo and
treatment group in linear regression
we're interested in the coefficient
associated with treatment which
represents the associated change to the
outcome that the treatment has
statistical models are approximations of
the real world but we can
actually change our models to match the
real world a little better one way we
can do this is by increasing the number
of parameters there are in the model
consider the simple linear regression it
tells you that the distribution of an
outcome shifts according to this
coefficient but what if we expect this
change to vary over time in the current
model there's no parameter for time so
this model simply can't capture this
complexity we can move up a level by
incorporating more parameters into the
model and adding a coefficient for both
time and the interaction between time
and treatment what if we suspect that
each individual in the study will react
differently to the treatment the current
model tells us that this single
parameter will explain the change for
this population on average to give
everyone their own subject specific
effect we can make the model even more
complex and turn it into a mixed effect
model more parameters more flexibility
overparameterized models take this idea
to the extreme make the model extremely
flexible by adding tons and tons of
parameters
neural networks are a prime example of
this each Edge in the neural network is
associated with a parameter or weight
along with some extra bias parameters we
can easily overparameterize by making
these networks very large and by doing
so the universal approximation theorem
tells us that these networks can
approximate a wide variety of functions
and this extra flexibility is important
because it lets us Model A Wider range
of phenomena that simpler models just
can't handle one problem with extremely
flexible models is that they may start
to approximate the data itself rather
than representing a more general
phenomenon we can learn from
statisticians employ regularization
techniques which help to balance out
this complexity by enforcing that these
models maintain some degree of
Simplicity multi-level models also known
as hierarchical or mixed effect models
are models that assume additional
structure over the parameters for
example multi-level models are commonly
used to aggregate several N-of-1 trials
together each individual is associated
with their own treatment effect which
we'll denote theta J to indicate that each
individual has their own effect these
individuals form the second level of the
model the first level can be thought of
as describing the distribution or
structure of these individual effects
and in an N-of-1 context the first level might
be a normal distribution centered at
some population treatment effect theta
with some variance sigma squared in
different contexts the units of the
second level of the model could be
different things in a study taking place
over many locations these may be
different hospitals or cities something
to indicate a cluster of related units
in a basket trial each second level unit
is a specific disease and we suspect
that their treatment effects will be
similar because they share a common
mutation in meta analyses the second
level units could be estimated effects
from individual research studies Andrew
Gelman says that he used the multi-level
model as a way to combine different
sources of information into a single
analysis this kind of structure is
incredibly common in statistics so
that's why multi-level models take a
spot on the list multi-level models can
be both frequentist and Bayesian so why is
Bayesian specifically mentioned in the
article my guess is that the Bayesian
framework allows us to incorporate prior
knowledge into the models this is
especially helpful when deciding on
priors for the first level parameters
especially on the variance if you choose
a wide uninformative prior it encourages
the resulting model to treat second
level units as being independent of each
other on the other hand choosing a
narrow informative prior allows us to
pool data together which can help us
estimate treatment effects for second
level units with small sample sizes
being able to choose different priors
gives statisticians much more flexibility
in the modeling
process a recurring theme among the top
eight ideas is the importance of
computers and computational power to the
development of Statistics advances in
technology have allowed more complex
models to be invented for harder
problems to account for this several
important statistical algorithms have
been invented to help solve them an
algorithm is just a set of steps that
can be followed so a statistical
algorithm is an algorithm designed to
help solve some statistical problem but there
are so many types of statistical
problems out there it's hard to get an
appreciation for how useful these
algorithms are so I'll explain two to
give you a taste the expectation
maximization algorithm or EM algorithm
is famously known from the 1977 paper in
the Journal of the Royal Statistical
Society another heavy-hitting journal on
statistics the EM algorithm solves an
estimation problem which is where we
need to use data to compute educated
guesses about the values of parameters
in a model maximum likelihood estimation
is another example of an estimation
approach what makes the EM algorithm
distinct is that it tries to estimate
the parameters in a model that we can't
solve directly one instance where this
can happen is in the case of mixture
models with so-called latent classes in
this type of model we have data that may
come from one of several groups but we
don't have the group labels to tell us
who belongs where without delving into
the details the eem algorithm gives us a
way to still estimate the parameters in
this model despite not knowing these
classes the second example is the
Metropolis algorithm and its more modern
Descendants the Metropolis algorithm is
interesting because its roots actually
stem from physics as opposed to
statistics the Metropolis algorithm is
significant because it lets us generate
samples from very complex probability
distributions random number generation
According to some distribution may seem
weird but it's important for
statisticians to be able to do so the
posterior distribution that comes from
Bayes' rule can turn ugly if we turn away
from conveniences like conjugate
families the posterior can be so ugly
that we can't even derive an equation
for it but despite this we can still
generate samples from a complicated
posterior thanks to the Metropolis
algorithm even if we don't have a
formula for the posterior distribution
we can still use the generated data to
recover important quantities about the
distribution such as the mean the
quantiles and credible intervals these
two algorithms are just two examples
mentioned in the article there are many
I couldn't cover and still more that
have been developed since this article
was
written when statisticians designed
experiments it used to be a set and
forget type of thing figure out the sample
size and just run the experiment to
completion but Midway through the
experiment we might need to stop it
under a frequentist framework this would
hurt our power and our P value
interpretation but in modern times we
have a way to account for this adaptive
decision analysis is the idea that maybe
we don't have to wait for the entire
experiment to finish instead we can
adapt our experiment Based on data we
collect in the interim before it
finishes in the context of clinical
trials we may decide to stop a trial
early if preliminary evidence suggests
that the treatment sucks conversely if a
treatment shows early promise we can
even stop based on efficacy these
changes still have to be decided ahead
of time to make sure that we make good
decisions overall and that the trial is
well
designed statisticians have to make a
lot of assumptions if these assumptions
are right or at least plausible then we
can feel comfortable trusting the
results of statistical analyses stuff
like confidence intervals or estimated
values but of course assumptions won't
always be right and it's often hard to
even know if they actually are or not
and that's where robust inference comes
in robust statistics still provides
trustworthy statistical analyses even in
the face of violated assumptions if we
have a robust model then we don't have
to be so reliant on possibly shaky assumptions
the sample median is often cited as a
robust estimator for a typical value in
a distribution compared to the mean we
often hear that the mean is unduly
influenced by outliers in a data set and
this is true but what assumption do
outliers violate many times we assume a
distribution to be normal normal
distributions have the property that
most of their probability is
concentrated near the mean you often
hear this phrase as the 68-95-99.7 rule
outliers challenge this concentration if
there's a possibility that there can be
many outliers it poses a danger that the
data may come from a so-called heavy
tail distribution where extreme events
are more likely and this would violate
the normal distribution assumption in
causal inference there's a technique
called propensity score matching propensity
score matching is used to try to match people
in a treatment group to people in a
control group who are very similar to
them by doing this you can produce
estimates that better resemble a causal
effect propensity score matching
requires two models one model to
estimate the effect of the treatment on
the outcome and another to produce a
score that is used to match people
together both of these models have to be
correctly specified for the results to
be useful correct specification
essentially means that we choose the
correct model for its purpose but this
is almost never the case to account for
this there are robust versions of
propensity score matching that allow for
one of these models to be wrong the fewer
assumptions we have to make the better
we have to make sure our models can
actually account for
this yes you read that right we're done
with the theory we're done with the
computation we're going back to plots
and visuals plots give us a way to
examine our data and assess our
statistical models it's just easier to
learn from your data if you can look at
it rather than just have it in a CSV but
it's undeniable that this skill is an
important part of any statistician's or
data scientist's toolkit there's even an
entire paradigm of R programming
dedicated to formalizing exploratory
data analysis there are people who code
in boring base R and then there's people
who code using the tidyverse framework
popularized by the god Hadley Wickham
the tidyverse set of packages makes it
extremely easy to get your data into R
clean it and visualize it I highly
recommend learning it and I hope to have
a more in-depth video on it in the
future what does it mean for an idea to
be important at first I thought that a
statistical idea would be important if
the paper that introduced it was cited
many times but this was not the case the
author specifically mentioned avoiding
citation counts rather they view
important ideas as those that influence
the development of ideas that have
influenced statistical practice I highly
recommend reading the original article
it's free to read online on Andrew
Gelman's blog you can just Google most
important ideas in statistics and look
for his name this video only covers part
of the article it's full of citations so
readers can pick it up and read more
about a particular bullet point that
they were interested in other articles
have even performed actual statistical
analyses to answer this question if you
think the authors missed a cool idea
tell me about it in the comments I hope
that I've shown you that statistics
didn't stop with the two sample T tests
and linear regression new technologies
create new types of data so statistics
needs to innovate to keep up if you
think I've earned it please like the
video and subscribe to the channel for
more I've also started a newsletter to
accompany the YouTube channel so that
people can get my videos delivered
straight to their inbox I'll see you all
on the next
[Music]
one