BIOL 4330 Unit 2 1 2 Bayesian Analysis and Markov Chain Monte Carlo
Summary
TLDR: The transcript explains Bayesian analysis, particularly its use in phylogenetic studies. It highlights how complex probability distributions are sampled using the Markov Chain Monte Carlo (MCMC) method to identify optimal evolutionary trees. The analogy of searching for the highest peak in a rugged landscape is used to illustrate how random starting points help explore possible outcomes. Bayesian analysis iteratively refines the model of evolution and parameters, improving accuracy while being computationally efficient. The process is compared with other methods like maximum likelihood and parsimony, emphasizing Bayesian analysis' speed and ability to provide support values.
Takeaways
- 📊 Statistical analysis requires understanding the distribution of outcomes, which can be simple or complex depending on the population.
- 🌳 For human height, a normal distribution is suitable, but phylogenetic analysis involves more complex, multi-dimensional probability distributions.
- 🔍 Markov Chain Monte Carlo (MCMC) is used to estimate complex probability distributions by simulating random walks through the search space.
- 🗺 The MCMC process starts with a random point and explores nearby areas, using an algorithm to move towards better scoring areas, akin to finding the highest peak in a rugged landscape.
- 📈 The process involves running the MCMC, pausing to re-estimate parameters, and repeating until a plateau of scores is reached, indicating the best overall solution.
- 🌐 Phylogenetic search space is multi-dimensional and large, requiring extensive sampling to ensure the best solutions are found.
- 🔄 The Bayesian analysis involves an iterative process of estimating parameters, running MCMC, and re-estimating parameters until convergence.
- 🌲 The 'burn-in' period discards initial trees with lower scores, focusing on saving trees with high scores that represent the best solutions.
- 📊 A consensus tree is made from the saved trees, with support values indicating how often certain relationships appear, providing a measure of confidence in the phylogeny.
- 🤔 Bayesian analysis is favored for its computational efficiency and ability to provide support measures, although other methods like maximum likelihood and parsimony are also discussed.
Q & A
What is the primary purpose of using statistical analysis in phylogenetic studies?
-The primary purpose is to estimate the probability distribution of phylogenetic outcomes due to the complexity of phylogenies and the vast number of possible trees. Statistical analysis helps in identifying the most likely phylogenetic trees that represent evolutionary relationships.
Why are normal distributions insufficient for phylogenetic studies?
-Normal distributions are insufficient for phylogenetic studies because phylogenies involve complex, multidimensional probability distributions, unlike simpler distributions like human height, which follow a normal distribution.
What is the Markov Chain Monte Carlo (MCMC) method, and why is it useful in phylogenetics?
-MCMC is a statistical approach used to estimate probability distributions by sampling nearby regions in a complex search space. In phylogenetics, it helps simulate and estimate the likelihood of various phylogenetic trees and outcomes, even in highly rugged topologies with multiple local maxima.
How does the analogy of searching for the highest peak relate to phylogenetic tree optimization?
-The analogy compares phylogenetic search space to a landscape with peaks. In MCMC, a random starting point is selected, and the algorithm explores surrounding areas to find the highest peak, much like trying to optimize a phylogenetic tree by improving its score step-by-step.
What is the 'burn-in period' in Bayesian phylogenetic analysis?
-The burn-in period refers to the early stage of MCMC sampling where the trees with lower scores are discarded as the algorithm converges on better-scoring phylogenetic trees. During this phase, poorer solutions are filtered out.
How do Bayesian analyses differ from maximum likelihood and parsimony analyses?
-Bayesian analyses not only search for the best trees based on a model of evolution but also provide a posterior probability for relationships, giving a measure of support for different parts of the tree. Maximum likelihood and parsimony do not automatically provide this support for relationships.
What is the role of the model of evolution in Bayesian phylogenetic analysis?
-The model of evolution provides the framework for estimating the likelihood of different phylogenetic trees. Bayesian analysis refines the parameters of this model iteratively to improve tree scores, ultimately producing a tree that best fits the model.
Why is it important to re-estimate parameters during the Bayesian analysis?
-Re-estimating parameters during Bayesian analysis allows for the adjustment of the model of evolution to better fit the data. This iterative process helps in converging toward the best phylogenetic tree with improved accuracy.
What does it mean when the analysis reaches a plateau phase?
-The plateau phase occurs when repeated iterations of the Bayesian analysis no longer improve the tree scores, indicating that the best solutions have been found, and further iterations will not lead to better trees.
How does a consensus tree in Bayesian analysis provide support values for relationships?
-In Bayesian analysis, after saving the trees with good scores, a consensus tree is created by comparing all saved trees. Relationships found in most trees are given higher support values, while those found in fewer trees are given lower values or collapsed into polytomies.
Outlines
📊 Understanding Probability Distributions in Phylogenetic Analysis
This paragraph introduces the concept of analyzing complex probability distributions in phylogenetic trees. It begins by comparing simpler distributions like human height to the more complex multi-dimensional probability spaces in phylogenetic analysis. To navigate this complexity, Markov Chain Monte Carlo (MCMC) is introduced as a technique to estimate and sample probability distributions in rugged search topologies. The analogy of a blind person searching for a peak in a rugged landscape is used to explain how MCMC samples areas with good solutions and repeats the process to ensure accuracy.
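The peak-finding analogy can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two-peak `log_score` landscape, the step size, and all function names are hypothetical stand-ins, and a single number stands in for a phylogenetic tree. The lecture's walker only ever steps uphill; the sketch instead uses the standard Metropolis acceptance rule, which also accepts a downhill step now and then, so the chain keeps sampling around a peak rather than stopping on it.

```python
import math
import random


def log_score(x):
    """Toy rugged landscape (hypothetical): a lower peak near x = -2, a taller one near x = 3."""
    density = 0.3 * math.exp(-(x + 2.0) ** 2) + 0.7 * math.exp(-(x - 3.0) ** 2)
    return math.log(density + 1e-300)  # tiny constant avoids log(0) far from both peaks


def metropolis_walk(start, n_steps, step_size, rng):
    """Random walk: always accept uphill proposals, sometimes accept downhill ones."""
    x = start
    samples = []
    for _ in range(n_steps):
        proposal = x + rng.uniform(-step_size, step_size)  # look at a nearby point
        log_ratio = log_score(proposal) - log_score(x)
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            x = proposal  # move to the proposed point
        samples.append(x)  # record where the walker is after this step
    return samples


# Several random drop-off points, as in the lecture's landscape analogy.
rng = random.Random(42)
for _ in range(4):
    start = rng.uniform(-6.0, 6.0)
    chain = metropolis_walk(start, 2000, 1.0, rng)
    best = max(chain, key=log_score)
    print(f"start {start:+.2f} -> best sampled position {best:+.2f}")
```

Chains dropped on different sides of the valley may settle around different peaks, which is why several random starts are used before concluding that the tallest peak has been found.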
🧬 Refining Models in Phylogenetic Analysis Through Iteration
This paragraph explains the iterative process used to refine evolutionary models in phylogenetic analysis. Starting with a random model, scores are calculated and gradually improve as the analysis is repeated. The model is continually informed by posterior probabilities, and the process is repeated until scores plateau, indicating the best solution has been found. Rugged and simple topologies are contrasted, with an example of rugged peaks in the Superstition Mountains and a smoother topology like Mount Fuji, to illustrate how MCMC explores different possibilities.
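The idea of scores improving quickly and then levelling off can be made concrete with a small check for a plateau. This is a rough sketch, not how any particular phylogenetics program decides to stop: the function name `plateau_start`, the `window` and `tolerance` values, and the example scores are all made up for illustration.

```python
def plateau_start(scores, window=3, tolerance=0.5):
    """Return the first index after which `window` more rounds improve the score
    by less than `tolerance`, or None if the scores are still climbing."""
    for i in range(len(scores) - window):
        if scores[i + window] - scores[i] < tolerance:
            return i
    return None


# Hypothetical tree scores from successive rounds (higher is better).
round_scores = [-500.0, -420.0, -380.0, -362.0, -355.0,
                -354.8, -354.9, -354.7, -354.8, -354.75]

start = plateau_start(round_scores)
print("plateau begins at round index:", start)
print("rounds before the plateau:", round_scores[:start])
```

Rounds before the returned index are the ones where scores are still rising steeply; from that index on, further repetition changes the score very little.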
🏔️ Searching for the Best Phylogenetic Tree: Multiple Iterations and Burn-in Periods
This section delves into the process of searching for the best phylogenetic tree by running multiple iterations and saving trees with good scores. It discusses the 'burn-in' period where initial bad trees are discarded, and the process repeats multiple times (four to eight times typically) to ensure the best trees are found. A consensus tree is eventually created from all saved trees, where relationships are scored based on how frequently they appear across the saved trees. This consensus tree provides support values for different branches, which help indicate the reliability of the relationships.
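Discarding the burn-in and turning the saved trees into support values can be sketched as follows. The representation is an assumption made for illustration: each tree is reduced to the set of groupings (clades) it contains, the taxon labels A to E are placeholders, and the function names are hypothetical. Real programs work on full tree files, but the counting step is the same idea.

```python
from collections import Counter


def discard_burn_in(sampled_trees, n_burn_in):
    """Drop the first `n_burn_in` sampled trees (the early, lower-scoring ones)."""
    return sampled_trees[n_burn_in:]


def clade_support(trees):
    """Percentage of saved trees in which each grouping (clade) appears."""
    counts = Counter(clade for tree in trees for clade in tree)
    return {clade: 100.0 * n / len(trees) for clade, n in counts.items()}


def majority_rule(support, threshold=50.0):
    """Keep groupings found in more than `threshold` percent of trees;
    anything dropped here would be collapsed into a polytomy."""
    return {clade: pct for clade, pct in support.items() if pct > threshold}


# Placeholder groupings among taxa A-E; each tree is a set of clades.
AB = frozenset({"A", "B"})
ABC = frozenset({"A", "B", "C"})
ABD = frozenset({"A", "B", "D"})
BC = frozenset({"B", "C"})

# Two early trees (burn-in), then ten saved trees.
samples = [{BC}] * 2 + [{AB, ABC}] * 8 + [{AB, ABD}] * 2

saved = discard_burn_in(samples, n_burn_in=2)
support = clade_support(saved)
consensus = majority_rule(support)

for clade, pct in sorted(consensus.items(), key=lambda item: -item[1]):
    print(sorted(clade), f"{pct:.0f}")
```

In this toy set the A+B grouping appears in every saved tree and A+B+C in most of them, while A+B+D falls below the majority threshold and is left out of the consensus.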
🌲 Using Bayesian Analysis for Efficient Tree Searches
This paragraph highlights the advantages of Bayesian analysis over other methods like maximum likelihood and parsimony in phylogenetic studies. Bayesian analysis is praised for being computationally efficient and providing a built-in support measure for relationships within the trees. The process is particularly useful when maximum likelihood is too computationally intense. The paragraph emphasizes that Bayesian analysis allows researchers to better understand phylogenetic relationships by using a statistical approach to evaluate results.
🧠 Optimizing Models of Evolution with Bayesian Analysis
This final paragraph discusses how Bayesian analysis optimizes the model of evolution by refining parameters iteratively. It notes that while maximum likelihood and parsimony focus on finding the best tree based on set criteria, Bayesian analysis modifies both the model of evolution and its parameters to reach the best solution. This iterative refinement continues until a plateau is reached, ensuring the best possible phylogenetic model. The conclusion emphasizes the need to understand the principles of various phylogenetic methods to achieve the best overall phylogeny.
Keywords
💡Normal Distribution
💡Phylogeny
💡Search Space
💡Markov Chain Monte Carlo (MCMC)
💡Local Maximum
💡Posterior Probability
💡Burn-in Period
💡Model of Evolution
💡Support Values
💡Maximum Likelihood
Highlights
Understanding distribution of outcomes is crucial for statistical analysis, particularly when analyzing complex systems like phylogenies.
Normal distributions work well for simple populations, such as human height, but phylogenies require much more complex probability distributions.
Markov Chain Monte Carlo (MCMC) is used to simulate and estimate the probability distribution for complex phylogenetic spaces.
The analogy of optimizing elevation in a rugged landscape is used to explain how MCMC searches for the highest scoring phylogenetic trees.
By sampling local areas, MCMC efficiently searches complex, multi-dimensional spaces to find local and potentially global maxima.
In rugged search topologies, multiple high-scoring solutions can be found, and MCMC helps in identifying the best ones.
Random starting points and iterative sampling allow MCMC to explore vast phylogenetic spaces and avoid being trapped in local maxima.
The posterior probability distribution from MCMC informs models of evolution, helping to refine and improve subsequent analyses.
Scores typically improve rapidly at the start of an MCMC analysis, plateauing as the best solutions are found.
The 'burn-in period' involves discarding early trees with lower scores and saving those with higher scores as the analysis progresses.
MCMC helps create consensus trees by comparing all high-scoring trees and assigning support values based on the frequency of common relationships.
Bayesian analysis provides support values, which are useful for gauging the reliability of different phylogenetic relationships.
Compared to maximum likelihood, Bayesian analysis is computationally less intense but still provides reliable statistical results.
Bayesian analyses optimize both the model of evolution and the tree parameters, making it versatile for large datasets.
A key advantage of Bayesian analysis is its ability to incorporate statistical support measures directly into the tree-building process.
Transcripts
so as an example we've set up this
idea of having some understanding of a
distribution
of outcomes for us to be able to
do a statistical analysis now this is a
very simple these
normal distributions whether they're
flat or very pronounced
are fairly simple distributions and for
some
populations that works really really
well if we're looking at height
in humans that follows a fairly normal
distribution that's a fairly
simple thing to analyze now
when we are dealing with phylogenies our
probability
distributions are much much more complex
and this is because phylogenies are
complex and there are so many
of them and so although it's not ideal
maybe the best way to think about it is
like a
three-dimensional but in reality it's
multi-dimensional
space and so in this we might have some
areas that are pretty good
these local peaks and the higher we are
here the the better the
tree scores are but this is a very
rugged and
difficult topology and so because
phylogenies can have such a complex
probability of outcomes and because
we don't know what that might be ahead
of time
we need a way that we can estimate and
look
at this probability outcome this search
space
and to do that we use a process called
markov chain monte carlo now
you don't need to know all the details
about it it's a fairly advanced
statistical approach we're going to give
you the bird's eye overview
so it's a way to quickly simulate and
estimate what the probability
distribution is
okay and to do this very simply what you
do is you pick a random
point to start at and then you
use some sort of an algorithm to sample
areas that are nearby
and i'm going to use the analogy of
a landscape let's say that you're
looking
at to find the highest peak you're
optimizing
elevation and you're trying to find the
highest peak so what you do is you
randomly drop an individual off
and then because you can't look around
right so
in phylogeny if you've got an individual
phylogeny you can't just look at the
phylogeny look at all of the phylogenies
next to it see which ones are better you
have to test them
and so in this analogy let's say you
have a blind individual
and you're trying to help that blind
individual find the highest peak in an
area
so you drop them off you have them do
some
explorations right next to them so they
take a step
north south east west and if one of
those steps makes them
a little bit higher in elevation that's
the new place
and then they go from that one they do
north south east west and that's a new
place and so by doing
a very quick rough algorithm we could
potentially
continually go up in elevation until we
reach the highest point at that point
they would keep going north south east west
and get lower and we would sample this area
and sample and sample and sample
that would be a local maximum it's no
guarantee that it's overall the global
maximum
and so we would randomly start another
one and maybe we randomly start here we
go up up up and then we sample this peak
and so this is a very fast method
for sampling very rugged search
topologies and we're using this
three-dimensional search topology as an
analogy but in reality remember
that phylogenetic search space is a
multi-dimensional space and a very very
large one that we might need to spend
some time and effort
sampling but our markov chain monte
carlo
is an approach to sample the probability
distribution
and do so in a way that is fairly
certain of guaranteeing finding us all
of the really
best ones whether it's a you know very
rugged space like this we have a few
very good answers or it's a little bit
more spread out
and we'd have a sampling distribution
that was a little bit more spread out
and so this is again as long as you know
the details
which is that we pick a random start
we look around it if we find a better
answer nearby
that becomes our new one we do that over
and over and over again until we've
sampled the search topology
fairly rapidly okay now
in addition to this markov chain monte
carlo for sampling the probability
search space we then use those
probabilities to
calculate a model of evolution
and what we see when we do this for most
analyses
is that we start off with a model of
evolution
and we find a tree we then allow that
tree using this markov chain monte carlo
search strategy we allow that tree to
then inform that becomes our posterior
probability
and it informs our model of evolution we
re-estimate the parameters for that
model of evolution and then we run the
analysis again
again and we do that over and over and
over again
and what we find is that when we start
randomly we usually don't have a really
great score
but our scores usually will improve
rather rapidly
until we are getting really similar
scores and as we
do this over and over again right
allowing our posterior probability
to then inform the
parameters for our next round of that
analysis
and we get scores out and we usually
will find a plateau and this plateau
represents
a peak in scores where we found a very
best overall solution
and if it's a very rugged search
topology we might get slightly different
trees during another random event but
the scores are going to be very similar
if it's a very let me just i'm going to
google i'll bring in some scores
some uh possible over here um
i grew up in an area uh in arizona where
the nearby
mountains were called the superstition
uh mountains
and i choose them as an example because
they have this very rugged search
topology if you're trying to find the
highest peak in the superstitions
it could be difficult this is just the
most famous set of peaks but they're
actually others
so here let's use this one let's say
you're trying to find the highest peak
in something that looks like this
and obviously there are multiple
candidates and so this might take a lot
of searching but with enough we could be
relatively
certain that we found it and we might
get a set of answers that are very
similar from that one and that one as
far as what the highest peak is
a very different search topology would
be something like this
and i picked mount fuji because no
matter where you started in the in the
surroundings around mount fuji you would
be going up the slope to find the very
best
overall one and so if we have a very
small set of phylogenies that are all in
one kind of general vicinity with all
very similar
answers it would be a fairly simple one
no matter where we started we would end
up
recovering that but with markov chain
monte carlo we can also recover
all the best answers our population of
best answers
even if our search topology is very very
rugged so we do
random beginning points and then we save
all of the trees
that are in this very good score area we
discard
the phylogenies that have lower scores
and this is called the burn-in period
now during this burn-in period and
actually even during the
plateau period we also allow our model
of evolution to vary
so let's do a step-by-step
walkthrough and i'm going to i'll type
it out here but let's do a step by step
walk through of what we do for a
bayesian analysis
number one we estimate
our parameters
for the model of evolution
and we do this i'll put this here it's a
little bit long but we do this based
on our initial
prior probability
now remember we don't know anything at
all about the search topology we don't
know if it's mount fuji we don't know if
it's superstition mountains
we have no idea and so our best estimate
at the beginning let's make a neighbor joining tree
so our neighbor joining tree although
it's not the
probably the very best overall set of
relationships it gets us in the ballpark
and that helps us then estimate our
parameters for our model of evolution
and then we run some iterations we use
markov chain monte carlo we use that
model of evolution with the parameters
estimated from it to get a
answer and then we pause
so number two run
some analyses
and we'll put mcmc this markov chain
monte carlo as we run those analyses
and then at that point we pause
and re-estimate
our parameters so maybe as we're running
this analysis our scores for our trees
parameters
are getting a little bit better
and that's great if they're getting a
little bit better that means our
estimation of phylogeny is probably
better
and then we can re-estimate our
parameters and then we
repeat
and as long as our scores are getting
better and better and better we continue
to do that
then at some point our scores will reach
kind of this plateau phase where no
matter how many times we go through
these steps of repeating re-estimating
our
parameters for our model of evolution
getting the tree scores out of it
we get this burn-in or the burn-in
period is gone right so that says trees
are getting better and better and better
we reject all of those trees and then we
start saving all these trees that have
really good scores and many of these
trees might be
repetitions of ones we found before
we're finding the trees over and over
and over again
but we're kind of dialing in our model
of evolution
and then we repeat the whole thing
usually four times sometimes six or
eight times repeat the entire process
four to we'll say four to eight times
you can do more but after a while you're
kind of getting uh
limited returns for your efforts so
repeat this
entire process four to eight times
and each of these times when we started
we're doing these
random starts we save all of these trees
so the end result of this is this
large large set of trees hundreds of thousands
of trees
that we've saved that all have pretty
good scores as we've looked at our
entire search topology
we then make a consensus tree for all of
these trees that we've
found so basically we compare all of the
trees we look at
what relationships they have in common
and what relationships are different
and we put numbers on those
and these numbers become our support
values so for instance
if this relationship between Pleodorina
and
Chlamydomonas these two species
if that is found in every single one of
those good trees that we've saved then
we put 100 there representing 100 percent
of the time if a set of relationships so
these three species together is only
found in 82 percent of the trees
then we put that number there and if a
set of relationships is not found in the
majority of the trees we collapse that
node down
and we make it a polytomy so the nice
thing
about the maximum
i'm sorry the bayesian analysis is that
we end up with support values
as part of the process of doing the tree
we'll look at other support values in
our next unit that are used for other
analyses but a maximum likelihood
analysis
gives you a tree but it doesn't tell you
if there are some of those relationships
that are better supported than others
a parsimony analysis the same thing you
get an end result and says this is the
best tree overall based on this
parsimony criteria
but it doesn't tell you if some of those
relationships are better supported than
others
a bayesian analysis does that so it's a
another time-saving part of a bayesian
analysis
so not only are bayesian analyses
overall faster than maximum likelihood
analyses
but they also give us a support measure
at the end of the analysis so that's a
huge
advantage for bayesian analysis okay
so
review of some things we've talked about
already
we can use known phylogenies like from
bacteria and determine whether parsimony
maximum likelihood or bayesian analyses
are best
and we can also use simulated data
and we've talked about areas like long
branch attraction where parsimony fails
we talked about areas like rate
heterogeneity also called heterotachy
where maximum likelihood might fail
but overall if we're in kind of a
majority of these trees we get
pretty good results from even neighbor
joining but certainly maximum likelihood
parsimony and bayesian analyses are going
to give us good results
but because of statistical
considerations many people want to use
either maximum likelihood analysis but
that can be computationally intense and
so bayesian analysis is a little bit of
a less computationally intense way
to analyze and then we can look at
results
so here's a result where different
methods you know part weighted parsimony
versus parsimony
find the very best tree and notice that
in
most considerations given enough data
we're going to find the very best trees
but some methods perform better overall
like maximum likelihood even neighbor
joining if we have
correct models of evolution okay so
the take-home message again is that we
may want to use a variety of methods
if we want to use statistical approaches
we might do a maximum likelihood unless
it's too computationally intense for the
size of our data
and then we might want to do bayesian
but also compare it to a parsimony
analysis and our best overall phylogeny
might represent a consensus
among methods and so this congruence
among methods is a very good way and if
we have areas that
are not congruent then maybe we want to
reevaluate or gather more data to
determine if we can resolve that or
maybe we say well
i'm going to use the maximum likelihood
result or the bayesian result
because i want to do a falsifiability
test i want to
use those statistical approaches to be
able to do that
and so here's our big picture overview
of all of the methods that we've talked
about
now originally i put the bayesian
analysis in this
category of using nucleotide sites and
we certainly do that we're not using
distance data so it's very clearly in
this right hand column
but whether it goes in a clustering
algorithm or an optimality criterion is
a little bit different
in both parsimony and maximum likelihood
we really are
optimizing we're doing this by doing a
very intensive search of lots and lots
of trees
and we're really optimizing the score
based on either a parsimony criterion
or this likelihood score that we gave
you an overview of
in a bayesian analysis we're doing
something slightly different which is
why i put it up here but with a question
mark
we are really optimizing the model of
evolution
and the parameters for that model by
doing this iterative approach where we
start with an unknown distribution
maybe we give a neighbor joining tree we
then allow the analysis as we progress
to inform it
and to then go back in and re-evaluate
or
change the parameters for our model of
evolution and then see if that gives us
a better tree
if it does great we keep them if not
we allow the next round to
change those parameters for the model of
evolution so in reality a bayesian
analysis is
optimizing our model of evolution and
our parameters for that model of
evolution
and then allowing that to inform our
overall outcome
so it's not really maybe a better place
for bayesian analysis is kind of in the
gray area between these two things
because we're not really finding the
best tree based on a single optimality
criterion
we're allowing our analyses to inform
the parameters
change our model of evolution or at
least the parameters for that model of
evolution
and then reiterate to it again and again
and again until we don't get any better
we've kind of
reached a plateau phase where we're not
getting better scores no matter how many
times
we do this iterative approach so
review this make sure you're familiar
with these albeit somewhat of an
overview of methods for some of these
you should know how to map trees map
characters onto a tree in a parsimony
analysis and find a score for that tree
and then just understand the basic
principles for these more
computationally intense methods maximum
likelihood and statistical methods
for bayesian analysis