BIOL 4330 Unit 2 1 2 Bayesian Analysis and Markov Chain Monte Carlo

Matthew Terry
29 Sept 2020 17:21

Summary

TLDR: The transcript explains Bayesian analysis, particularly its use in phylogenetic studies. It highlights how complex probability distributions are sampled using the Markov Chain Monte Carlo (MCMC) method to identify optimal evolutionary trees. The analogy of searching for the highest peak in a rugged landscape illustrates how random starting points help explore possible outcomes. Bayesian analysis iteratively refines the model of evolution and its parameters, improving accuracy while remaining computationally efficient. The process is compared with other methods such as maximum likelihood and parsimony, emphasizing the speed of Bayesian analysis and its ability to provide support values.

Takeaways

  • 📊 Statistical analysis requires understanding the distribution of outcomes, which can be simple or complex depending on the population.
  • 🌳 For human height, a normal distribution is suitable, but phylogenetic analysis involves more complex, multi-dimensional probability distributions.
  • 🔍 Markov Chain Monte Carlo (MCMC) is used to estimate complex probability distributions by simulating random walks through the search space.
  • 🗺 The MCMC process starts with a random point and explores nearby areas, using an algorithm to move towards better scoring areas, akin to finding the highest peak in a rugged landscape.
  • 📈 The process involves running the MCMC, pausing to re-estimate parameters, and repeating until a plateau of scores is reached, indicating the best overall solution.
  • 🌐 Phylogenetic search space is multi-dimensional and large, requiring extensive sampling to ensure the best solutions are found.
  • 🔄 The Bayesian analysis involves an iterative process of estimating parameters, running MCMC, and re-estimating parameters until convergence.
  • 🌲 The 'burn-in' period discards initial trees with lower scores, focusing on saving trees with high scores that represent the best solutions.
  • 📊 A consensus tree is made from the saved trees, with support values indicating how often certain relationships appear, providing a measure of confidence in the phylogeny.
  • 🤔 Bayesian analysis is favored for its computational efficiency and ability to provide support measures, although other methods like maximum likelihood and parsimony are also discussed.

Q & A

  • What is the primary purpose of using statistical analysis in phylogenetic studies?

    - The primary purpose is to estimate the probability distribution of phylogenetic outcomes, which is needed because phylogenies are complex and the number of possible trees is vast. Statistical analysis helps identify the most likely phylogenetic trees representing evolutionary relationships.

  • Why are normal distributions insufficient for phylogenetic studies?

    - Normal distributions are insufficient because phylogenies involve complex, multi-dimensional probability distributions, unlike a simple trait such as human height, which follows a fairly normal distribution.

  • What is the Markov Chain Monte Carlo (MCMC) method, and why is it useful in phylogenetics?

    - MCMC is a statistical approach used to estimate probability distributions by sampling nearby regions in a complex search space. In phylogenetics, it helps simulate and estimate the probability of various phylogenetic trees, even in highly rugged topologies with multiple local maxima.

  • How does the analogy of searching for the highest peak relate to phylogenetic tree optimization?

    - The analogy compares phylogenetic search space to a landscape with peaks. In MCMC, a random starting point is selected and the algorithm explores the surrounding area to find the highest peak, much as a phylogenetic tree is improved by raising its score step by step.

  • What is the 'burn-in period' in Bayesian phylogenetic analysis?

    - The burn-in period is the early stage of MCMC sampling, before the scores plateau. Trees sampled during this stage have lower scores and are discarded as the algorithm converges on better-scoring phylogenetic trees.

  • How do Bayesian analyses differ from maximum likelihood and parsimony analyses?

    - Bayesian analyses not only search for the best trees based on a model of evolution but also provide a posterior probability for relationships, giving a measure of support for different parts of the tree. Maximum likelihood and parsimony do not automatically provide this support for relationships.

  • What is the role of the model of evolution in Bayesian phylogenetic analysis?

    - The model of evolution provides the framework for estimating the likelihood of different phylogenetic trees. Bayesian analysis refines the parameters of this model iteratively to improve tree scores, ultimately producing a tree that best fits the model.

  • Why is it important to re-estimate parameters during the Bayesian analysis?

    - Re-estimating parameters allows the model of evolution to be adjusted to better fit the data. This iterative process helps the analysis converge toward the best phylogenetic tree with improved accuracy.

  • What does it mean when the analysis reaches a plateau phase?

    - The plateau phase occurs when repeated iterations of the Bayesian analysis no longer improve the tree scores, indicating that the best solutions have been found and that further iterations will not lead to better trees.

  • How does a consensus tree in Bayesian analysis provide support values for relationships?

    - After the trees with good scores are saved, a consensus tree is created by comparing all of them. Relationships found in most trees receive higher support values, while those not found in the majority of trees are collapsed into polytomies.

Outlines

00:00

📊 Understanding Probability Distributions in Phylogenetic Analysis

This paragraph introduces the concept of analyzing complex probability distributions in phylogenetic trees. It begins by comparing simpler distributions like human height to the more complex multi-dimensional probability spaces in phylogenetic analysis. To navigate this complexity, Markov Chain Monte Carlo (MCMC) is introduced as a technique to estimate and sample probability distributions in rugged search topologies. The analogy of a blind person searching for a peak in a rugged landscape is used to explain how MCMC samples areas with good solutions and repeats the process to ensure accuracy.

05:00

🧬 Refining Models in Phylogenetic Analysis Through Iteration

This paragraph explains the iterative process used to refine evolutionary models in phylogenetic analysis. Starting with a random model, scores are calculated and gradually improve as the analysis is repeated. The model is continually informed by posterior probabilities, and the process is repeated until scores plateau, indicating the best solution has been found. Rugged and simple topologies are contrasted, with an example of rugged peaks in the Superstition Mountains and a smoother topology like Mount Fuji, to illustrate how MCMC explores different possibilities.
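The stopping rule described here, repeating until the scores plateau, can be sketched in Python. This is only an illustrative sketch: the function name `has_plateaued`, the `window` and `tolerance` parameters, and the example scores are assumptions made for the illustration, not values from the lecture or from any phylogenetics package.

```python
def has_plateaued(scores, window=3, tolerance=0.5):
    """Return True when the last `window` scores differ by at most `tolerance`.

    `scores` is a list of tree scores (for example log-likelihoods), one per
    round of the analysis. All names and thresholds here are hypothetical.
    """
    if len(scores) < window:
        return False  # not enough rounds yet to judge
    recent = scores[-window:]
    return max(recent) - min(recent) <= tolerance


# Hypothetical scores: rapid improvement early on, then a plateau.
example_scores = [-5200.0, -4100.0, -3600.0, -3550.2, -3550.0, -3549.9]

for round_number in range(1, len(example_scores) + 1):
    seen_so_far = example_scores[:round_number]
    print(round_number, seen_so_far[-1], has_plateaued(seen_so_far))
```

In this toy trace the check only becomes true once the last three scores sit within half a unit of each other, which mirrors the "scores stop improving" idea in the summary.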

10:00

🏔️ Searching for the Best Phylogenetic Tree: Multiple Iterations and Burn-in Periods

This section delves into the process of searching for the best phylogenetic tree by running multiple iterations and saving trees with good scores. It discusses the 'burn-in' period where initial bad trees are discarded, and the process repeats multiple times (four to eight times typically) to ensure the best trees are found. A consensus tree is eventually created from all saved trees, where relationships are scored based on how frequently they appear across the saved trees. This consensus tree provides support values for different branches, which help indicate the reliability of the relationships.

15:01

🌲 Using Bayesian Analysis for Efficient Tree Searches

This paragraph highlights the advantages of Bayesian analysis over other methods like maximum likelihood and parsimony in phylogenetic studies. Bayesian analysis is praised for being computationally efficient and providing a built-in support measure for relationships within the trees. The process is particularly useful when maximum likelihood is too computationally intense. The paragraph emphasizes that Bayesian analysis allows researchers to better understand phylogenetic relationships by using a statistical approach to evaluate results.

🧠 Optimizing Models of Evolution with Bayesian Analysis

This final paragraph discusses how Bayesian analysis optimizes the model of evolution by refining parameters iteratively. It notes that while maximum likelihood and parsimony focus on finding the best tree based on set criteria, Bayesian analysis modifies both the model of evolution and its parameters to reach the best solution. This iterative refinement continues until a plateau is reached, ensuring the best possible phylogenetic model. The conclusion emphasizes the need to understand the principles of various phylogenetic methods to achieve the best overall phylogeny.

Keywords

💡Normal Distribution

A normal distribution is a probability distribution that is symmetrical around its mean, where most occurrences take place close to the mean, and fewer occurrences happen as you move away from it. In the video, it's used as an example of a simple statistical model that works well for traits like human height, but it contrasts with the complex distributions required for phylogenetic analysis.

💡Phylogeny

Phylogeny refers to the evolutionary relationships between different species or organisms, often represented in the form of a tree. The video highlights that phylogenies are more complex and have much more intricate probability distributions compared to simpler traits like height, which adds complexity to their analysis.

💡Search Space

The search space refers to the range of possible outcomes or solutions that an algorithm must explore. In the context of phylogenies, this search space is described as 'rugged' and 'multi-dimensional,' meaning there are many possible solutions that need to be explored to find the best evolutionary tree.
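To give a feel for how large this search space is, the short Python sketch below counts the fully bifurcating trees possible for a given number of taxa, using the standard double-factorial formulas, (2n - 5)!! for unrooted trees and (2n - 3)!! for rooted trees. The function names are assumptions chosen for this illustration.

```python
def double_factorial(n):
    """Product n * (n-2) * (n-4) * ... down to 1 or 2; defined as 1 for n <= 1."""
    result = 1
    while n > 1:
        result *= n
        n -= 2
    return result


def count_unrooted_trees(num_taxa):
    """Number of unrooted, fully bifurcating trees: (2n - 5)!! for n >= 3."""
    if num_taxa < 3:
        raise ValueError("need at least 3 taxa for an unrooted tree")
    return double_factorial(2 * num_taxa - 5)


def count_rooted_trees(num_taxa):
    """Number of rooted, fully bifurcating trees: (2n - 3)!! for n >= 2."""
    if num_taxa < 2:
        raise ValueError("need at least 2 taxa for a rooted tree")
    return double_factorial(2 * num_taxa - 3)


for n in (4, 5, 10, 20):
    print(n, count_unrooted_trees(n), count_rooted_trees(n))
```

Even ten taxa already allow about two million unrooted trees, which is why the trees cannot all be scored one by one and have to be sampled instead.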

💡Markov Chain Monte Carlo (MCMC)

MCMC is a statistical method used to estimate the probability distribution of complex systems. In the video, it is applied to phylogenetic analysis, allowing for efficient sampling of the search space to identify the most probable evolutionary trees by starting at random points and optimizing step-by-step.
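The video gives a deliberately simplified picture in which the search always moves uphill. One widely used form of MCMC, the Metropolis algorithm, also accepts some downhill moves, so that over a long run each state is visited roughly in proportion to its probability. The Python sketch below illustrates that idea on a made-up one-dimensional "landscape" of ten states standing in for tree space. The weights, function name, and step count are assumptions for illustration only, not the method used by any particular phylogenetics program.

```python
import random


def metropolis_walk(weights, num_steps, rng):
    """Random walk over list indices that visits index i in proportion to weights[i].

    Proposal: step one position left or right with equal probability.
    A proposal that falls off either end is rejected (the walker stays put).
    """
    current = rng.randrange(len(weights))  # random starting point
    visits = [0] * len(weights)
    for _ in range(num_steps):
        proposal = current + rng.choice((-1, 1))
        if 0 <= proposal < len(weights):
            # Always accept uphill moves; accept downhill moves with
            # probability equal to the ratio of the two weights.
            accept_probability = min(1.0, weights[proposal] / weights[current])
            if rng.random() < accept_probability:
                current = proposal
        visits[current] += 1
    return visits


# Hypothetical rugged landscape: a lower peak at index 2, a higher peak at index 7.
landscape = [1, 2, 6, 2, 1, 1, 3, 12, 3, 1]

rng = random.Random(42)
visits = metropolis_walk(landscape, 500_000, rng)
total = sum(visits)
for index, (weight, count) in enumerate(zip(landscape, visits)):
    # Columns: state, target probability, observed visit frequency.
    print(index, round(weight / sum(landscape), 3), round(count / total, 3))
```

Because downhill moves are sometimes accepted, the walker can leave the lower peak, cross the valley, and spend most of its time around the higher peak, while still recording how often every other state is visited.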

💡Local Maximum

A local maximum is a peak in a probability landscape where the solution found is better than its immediate neighbors but might not be the best overall (global maximum). The video uses this concept to explain how MCMC might settle on a good but not the best solution when searching through phylogenies.
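The blind-hiker analogy from the video can be sketched as a greedy climb on a small grid: test the four neighboring cells, move to the highest one if it is higher, and stop when none is. The elevation map and the function names `hill_climb` and `best_peak` are hypothetical, chosen only to show how a single start can get stuck on a local maximum and how several starts help.

```python
import random


def hill_climb(grid, start):
    """Greedy climb: step to the highest N/S/E/W neighbor until none is higher."""
    row, col = start
    while True:
        neighbors = [
            (r, c)
            for r, c in ((row - 1, col), (row + 1, col), (row, col + 1), (row, col - 1))
            if 0 <= r < len(grid) and 0 <= c < len(grid[0])
        ]
        best = max(neighbors, key=lambda p: grid[p[0]][p[1]])
        if grid[best[0]][best[1]] <= grid[row][col]:
            return (row, col)  # no neighbor is higher: a local maximum
        row, col = best


def best_peak(grid, starts):
    """Run hill_climb from every start and keep the highest peak reached."""
    peaks = [hill_climb(grid, start) for start in starts]
    return max(peaks, key=lambda p: grid[p[0]][p[1]])


# Hypothetical elevation map: a lower peak (5) at (1, 1), the highest (7) at (3, 3).
elevation = [
    [1, 2, 1, 0, 0],
    [2, 5, 2, 1, 1],
    [1, 2, 1, 2, 3],
    [0, 1, 2, 7, 4],
    [0, 0, 1, 4, 3],
]

print(hill_climb(elevation, (0, 0)))  # climbs to the lower peak at (1, 1)
print(hill_climb(elevation, (4, 4)))  # climbs to the highest peak at (3, 3)

rng = random.Random(0)
random_starts = [(rng.randrange(5), rng.randrange(5)) for _ in range(10)]
print(best_peak(elevation, random_starts))
```

A climb that starts in the top-left corner never discovers the higher peak, which is the reason the lecture stresses repeating the search from several random starting points.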

💡Posterior Probability

Posterior probability is the updated probability of a model or hypothesis given new data. In Bayesian analysis of phylogenies, the posterior probability is used to update the model of evolution as the algorithm finds better trees, ensuring that the analysis improves over time.
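For reference, the posterior probability is usually written with Bayes' theorem. The symbols below are assumptions introduced for this illustration: $\tau$ stands for a tree, $\theta$ for the parameters of the model of evolution, and $D$ for the sequence data.

```latex
P(\tau, \theta \mid D) = \frac{P(D \mid \tau, \theta)\, P(\tau, \theta)}{P(D)},
\qquad
P(D) = \sum_{\tau} \int P(D \mid \tau, \theta)\, P(\tau, \theta)\, d\theta
```

The denominator sums over every possible tree and integrates over all parameter values, so it is generally not computed directly. Sampling approaches such as MCMC are used to approximate the posterior instead.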

💡Burn-in Period

The burn-in period refers to the initial phase of a Bayesian or MCMC analysis, during which early, less accurate results are discarded. The video describes this as the period where trees with lower scores are eliminated, and only high-quality trees are kept for further analysis.
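As a rough sketch of what discarding the burn-in looks like in practice, the Python below drops the first part of each run and pools what remains. The 25% fraction, the function names, and the score traces are all assumptions made for illustration; the lecture does not specify how much of a run is discarded.

```python
def discard_burn_in(samples, burn_in_fraction=0.25):
    """Drop the first `burn_in_fraction` of a chain's samples and keep the rest."""
    if not 0.0 <= burn_in_fraction < 1.0:
        raise ValueError("burn_in_fraction must be in [0, 1)")
    num_discarded = int(len(samples) * burn_in_fraction)
    return samples[num_discarded:]


def pool_runs(runs, burn_in_fraction=0.25):
    """Remove burn-in from each independent run, then combine what is left."""
    pooled = []
    for run in runs:
        pooled.extend(discard_burn_in(run, burn_in_fraction))
    return pooled


# Hypothetical score traces from two independent runs with random starts.
run_a = [-5200.0, -4100.0, -3551.0, -3550.2, -3550.0, -3549.9, -3550.1, -3550.3]
run_b = [-6100.0, -3900.0, -3550.4, -3550.1, -3549.8, -3550.0, -3550.2, -3549.9]

kept = pool_runs([run_a, run_b])
print(len(kept), min(kept), max(kept))
```

In a real analysis each sample would be a tree together with its parameter values, not a bare score; scores are used here only to keep the example small.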

💡Model of Evolution

The model of evolution is a framework used to describe how genetic sequences change over time. In the video, the speaker explains how different models are used to estimate parameters that guide the search for the best evolutionary trees during the Bayesian analysis.

💡Support Values

Support values are measures of confidence in the relationships found in a phylogenetic tree. In the video, these values represent the percentage of times a particular relationship or grouping of species is found in the best-scoring trees, offering a measure of how reliable the tree is.
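The counting behind these support values can be sketched in a few lines of Python. Representing each saved tree as a set of clades, and each clade as a `frozenset` of taxon names, is an assumption made to keep the example short, and the taxa A to D and the four trees are hypothetical.

```python
from collections import Counter


def clade_support(trees):
    """Percentage of saved trees in which each clade (group of taxa) appears."""
    counts = Counter(clade for tree in trees for clade in tree)
    return {clade: 100.0 * n / len(trees) for clade, n in counts.items()}


def majority_rule(support, threshold=50.0):
    """Keep clades found in more than `threshold` percent of trees."""
    return {clade: pct for clade, pct in support.items() if pct > threshold}


# Four hypothetical saved trees over taxa A, B, C, D.
saved_trees = [
    {frozenset("AB"), frozenset("ABC")},
    {frozenset("AB"), frozenset("ABC")},
    {frozenset("AB"), frozenset("ABC")},
    {frozenset("AB"), frozenset("ABD")},
]

support = clade_support(saved_trees)
consensus = majority_rule(support)
for clade, pct in sorted(consensus.items(), key=lambda item: -item[1]):
    print(sorted(clade), pct)
```

Here the grouping of A with B appears in every tree and gets a support value of 100, the grouping A, B, C appears in three of four trees and gets 75, and the grouping A, B, D falls below the majority threshold, so it would be collapsed into a polytomy.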

💡Maximum Likelihood

Maximum likelihood is a statistical method used to find the tree that best explains the observed genetic data. The video contrasts this with Bayesian analysis, explaining that while maximum likelihood is highly accurate, it can be computationally intense, making Bayesian methods a faster alternative.

Highlights

Understanding distribution of outcomes is crucial for statistical analysis, particularly when analyzing complex systems like phylogenies.

Normal distributions work well for simple populations, such as human height, but phylogenies require much more complex probability distributions.

Markov Chain Monte Carlo (MCMC) is used to simulate and estimate the probability distribution for complex phylogenetic spaces.

The analogy of optimizing elevation in a rugged landscape is used to explain how MCMC searches for the highest scoring phylogenetic trees.

By sampling local areas, MCMC efficiently searches complex, multi-dimensional spaces to find local and potentially global maxima.

In rugged search topologies, multiple high-scoring solutions can be found, and MCMC helps in identifying the best ones.

Random starting points and iterative sampling allow MCMC to explore vast phylogenetic spaces and avoid being trapped in local maxima.

The posterior probability distribution from MCMC informs models of evolution, helping to refine and improve subsequent analyses.

Scores typically improve rapidly at the start of an MCMC analysis, plateauing as the best solutions are found.

The 'burn-in period' involves discarding early trees with lower scores and saving those with higher scores as the analysis progresses.

MCMC helps create consensus trees by comparing all high-scoring trees and assigning support values based on the frequency of common relationships.

Bayesian analysis provides support values, which are useful for gauging the reliability of different phylogenetic relationships.

Compared to maximum likelihood, Bayesian analysis is computationally less intense but still provides reliable statistical results.

Bayesian analyses optimize both the model of evolution and the parameters for that model, making them versatile for large datasets.

A key advantage of Bayesian analysis is its ability to incorporate statistical support measures directly into the tree-building process.

Transcripts

00:01

So, as an example, we've set up this idea of having some understanding of a distribution of outcomes for us to be able to do a statistical analysis. Now, these normal distributions, whether they're flat or very pronounced, are fairly simple distributions, and for some populations that works really, really well. If we're looking at height in humans, that follows a fairly normal distribution. That's a fairly simple thing to analyze.

00:35

When we are dealing with phylogenies, our probability distributions are much, much more complex, and this is because phylogenies are complex and there are so many of them. And so, although it's not ideal, maybe the best way to think about it is like a three-dimensional space, but in reality it's a multi-dimensional space. In this we might have some areas that are pretty good, these local peaks, and the higher we are here, the better the tree scores are. But this is a very rugged and difficult topology.

01:05

And so, because phylogenies can have such a complex probability of outcomes, and because we don't know what that might be ahead of time, we need a way that we can estimate and look at this probability outcome, this search space. To do that we use a process called Markov chain Monte Carlo. You don't need to know all the details about it; it's a fairly advanced statistical approach. We're going to give you the bird's-eye overview: it's a way to quickly simulate and estimate what the probability distribution is.

01:43

Okay. To do this, very simply, what you do is you pick a random point to start at, and then you use some sort of an algorithm to sample areas that are nearby. I'm going to use the analogy of a landscape. Let's say that you're looking to find the highest peak: you're optimizing elevation. So what you do is you randomly drop an individual off. In phylogeny, if you've got an individual phylogeny, you can't just look at all of the phylogenies next to it and see which ones are better. You have to test them.

02:26

And so in this analogy, let's say you have a blind individual, and you're trying to help that blind individual find the highest peak in an area. So you drop them off and you have them do some explorations right next to them: they take a step north, south, east, west, and if one of those steps makes them a little bit higher in elevation, that's the new place. Then they go from that one, they do north, south, east, west, and that's a new place. And so by doing a very quick, rough algorithm we could potentially continually go up in elevation until we reach the highest point. At that point they would keep going north, south, east, west and get lower, and we would sample this area, and sample and sample and sample.

03:09

That would be a local maximum. There's no guarantee that it's the overall global maximum. And so we would randomly start another one, and maybe we randomly start here, we go up, up, up, and then we sample this peak.

03:20

And so this is a very fast method for sampling very rugged search topologies. We're using this three-dimensional search topology as an analogy, but in reality, remember that phylogenetic search space is a multi-dimensional space, and a very, very large one that we might need to spend some time and effort sampling. But Markov chain Monte Carlo is an approach to sample the probability distribution, and to do so in a way that is fairly certain of finding us all of the really best ones, whether it's a very rugged space like this, where we have a few very good answers, or it's a little bit more spread out, and we'd have a sampling distribution that was a little bit more spread out.

04:02

And so, again, the details you need to know are these: we pick a random start, we look around it, and if we find a better answer nearby, that becomes our new one. We do that over and over and over again until we've sampled the search topology fairly rapidly.

04:21

In addition to this Markov chain Monte Carlo for sampling the probability search space, we then use those probabilities to calculate a model of evolution. What we see when we do this for most analyses is that we start off with a model of evolution and we find a tree. We then allow that tree, using this Markov chain Monte Carlo search strategy, to become our posterior probability, and it informs our model of evolution. We re-estimate the parameters for that model of evolution, and then we run the analysis again, and we do that over and over and over again.

05:04

What we find is that when we start randomly we usually don't have a really great score, but our scores usually will improve rather rapidly until we are getting really similar scores. As we do this over and over again, allowing our posterior probability to then inform the parameters for our next round of that analysis, we get scores out and we usually will find a plateau. This plateau represents a peak in scores where we've found a very best overall solution. And if it's a very rugged search topology, we might get slightly different trees from another random start, but the scores are going to be very similar.

05:47

Let me just... I'm going to Google and bring in some possibilities over here. I grew up in an area in Arizona where the nearby mountains were called the Superstition Mountains, and I chose them as an example because they have this very rugged search topology. If you're trying to find the highest peak in the Superstitions, it could be difficult. This is just the most famous set of peaks, but there are actually others.

06:18

So here, let's use this one. Let's say you're trying to find the highest peak in something that looks like this. Obviously there are multiple candidates, and so this might take a lot of searching, but with enough we could be relatively certain that we found it, and we might get a set of answers that are very similar, from that one and that one, as far as what the highest peak is.

06:37

A very different search topology would be something like this. I picked Mount Fuji because no matter where you started in the surroundings around Mount Fuji, you would be going up the slope to find the very best overall one. And so if we have a very small set of phylogenies that are all in one general vicinity, with all very similar answers, it would be a fairly simple one: no matter where we started, we would end up recovering that.

07:04

But with Markov chain Monte Carlo we can also recover all the best answers, our population of best answers, even if our search topology is very, very rugged. So we do random beginning points, and then we save all of the trees that are in this very good score area. We discard the phylogenies that have lower scores, and this is called the burn-in period. Now, during this burn-in period, and actually even during the plateau period, we also allow our model of evolution to vary.

07:38

So let's do a step-by-step walkthrough of what we do for a Bayesian analysis; I'll type it out here. Number one: we estimate our parameters for the model of evolution, and we do this based on our initial prior probability.

08:18

Now remember, we don't know anything at all about the search topology. We don't know if it's Mount Fuji, we don't know if it's the Superstition Mountains; we have no idea. And so for our best estimate at the beginning, our neighbor-joining tree, although it's probably not the very best overall set of relationships, gets us in the ballpark, and that helps us estimate our parameters for our model of evolution. Then we run some iterations: we use Markov chain Monte Carlo, with that model of evolution and the parameters estimated from it, to get an answer, and then we pause.

08:56

So number two: run some analyses, and we'll put MCMC, this Markov chain Monte Carlo, as we run those analyses. Then at that point we pause and re-estimate our parameters. So maybe as we're running this analysis, our scores for our trees and parameters are getting a little bit better. That's great: if they're getting a little bit better, that means our estimation of phylogeny is probably better. Then we can re-estimate our parameters, and then we repeat.

09:46

As long as our scores are getting better and better and better, we continue to do that. Then at some point our scores will reach this plateau phase where, no matter how many times we go through these steps of re-estimating our parameters for our model of evolution and getting the tree scores out of it, the burn-in period is gone. While the trees are getting better and better and better, we reject all of those trees, and then we start saving all these trees that have really good scores. Many of these trees might be repetitions of ones we found before; we're finding the trees over and over and over again, but we're dialing in our model of evolution.

10:21

And then we repeat the whole thing, usually four times, sometimes six or eight times. Repeat the entire process, we'll say, four to eight times. You can do more, but after a while you're getting limited returns for your efforts.

10:47

Each of these times we're doing these random starts, and we save all of these trees. So the end result of this is a large set of trees, hundreds of thousands of trees, that we've saved and that all have pretty good scores, as we've looked at our entire search topology. We then make a consensus tree for all of these trees that we've found. Basically, we compare all of the trees, we look at what relationships they have in common and what relationships are different, and we put numbers on those. These numbers become our support values.

11:25

So for instance, if this relationship between Pleodorina and Chlamydomonas, these two species, is found in every single one of those good trees that we've saved, then we put 100 there, representing 100 percent of the time. If a set of relationships, so these three species together, is only found in 82 percent of the trees, then we put that number there. And if a set of relationships is not found in the majority of the trees, we collapse that node down and we make it a polytomy.

11:54

So the nice thing about the Bayesian analysis is that we end up with support values as part of the process of doing the tree. We'll look at other support values in our next unit that are used for other analyses. But a maximum likelihood analysis gives you a tree, and it doesn't tell you if some of those relationships are better supported than others. A parsimony analysis is the same thing: you get an end result that says this is the best tree overall based on this parsimony criterion, but it doesn't tell you if some of those relationships are better supported than others. A Bayesian analysis does that, so it's another time-saving part of a Bayesian analysis. Not only are Bayesian analyses overall faster than maximum likelihood analyses, but they also give us a support measure at the end of the analysis. So that's a huge advantage for Bayesian analysis.

play12:54

so

play12:58

review of some things we've talked about

play13:00

already

play13:01

we can use known phylogenies like from

play13:03

bacteria and determine whether parsimony

play13:06

maximum likelihood or bayesian analyses

play13:08

are best

play13:09

and we can also use simulated data

play13:13

and we've talked about areas like long

play13:14

branch attraction where parsimony fails

play13:17

we talked about areas like rate

play13:18

heterogeneity also called heterochronic

play13:21

where maximum likelihood might fail

play13:25

but overall if we're in kind of a

play13:27

majority of these trees we get

play13:28

pretty good results from even neighbor

play13:30

joining but certainly maximum likelihood

play13:32

parsimony and bayes analyses are going

play13:34

to give us good results

play13:36

but because of statistical

play13:37

considerations many people want to use

play13:40

either maximum likelihood analysis but

play13:42

that can be computationally intense and

play13:43

so bayesian analyses is a little bit of

play13:45

a less computationally intense way

play13:48

to analyze and then we can look at

play13:51

results

play13:52

so here's a result where different

play13:55

methods you know part weighted parsimony

play13:57

versus parsimony

play13:58

find the very best tree and notice that

play14:00

in

play14:01

most considerations given enough data

play14:04

we're going to find the very best trees

play14:06

but some methods perform better overall

play14:08

like maximum likelihood even neighbor

play14:10

joining if we have

play14:11

correct models of evolution okay so

play14:15

the take-home message again is that we

play14:17

may want to use a variety of methods

play14:19

if we want to use statistical approaches

play14:20

we might do a maximum likelihood unless

play14:22

it's too computationally intense for the

play14:24

size of our data

play14:25

and then we might want to do a Bayesian analysis, but also compare it to a parsimony analysis. Our best overall phylogeny might represent a consensus among methods, so this congruence among methods is a very good approach. If we have areas that are not congruent, then maybe we want to reevaluate or gather more data to determine if we can resolve that. Or maybe we say, well, I'm going to use the maximum likelihood result or the Bayesian result because I want to do a falsifiability test, and I want to use those statistical approaches to be able to do that.
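
To make "congruence among methods" concrete, here is a minimal sketch of how one might check which groupings two trees agree on. The nested-tuple tree format, the function names, and the two example trees are all hypothetical and chosen for illustration; they are not from the lecture or from any particular phylogenetics package.

```python
def clades(tree):
    """Collect every internal grouping (clade) of a rooted nested-tuple tree.

    Leaves are strings; internal nodes are tuples of subtrees.
    The trivial clade containing all taxa is left out.
    """
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        leaves = frozenset().union(*(walk(child) for child in node))
        found.add(leaves)
        return leaves

    all_taxa = walk(tree)
    found.discard(all_taxa)
    return found


def compare(tree_a, tree_b):
    """Return (shared clades, clades only in tree_a, clades only in tree_b)."""
    a, b = clades(tree_a), clades(tree_b)
    return a & b, a - b, b - a


# Hypothetical results from two different methods on the same five taxa.
ml_tree = ((("A", "B"), "C"), ("D", "E"))
parsimony_tree = ((("A", "C"), "B"), ("D", "E"))

shared, only_ml, only_parsimony = compare(ml_tree, parsimony_tree)
print("congruent groupings:", sorted(sorted(c) for c in shared))
print("only in ML tree:", sorted(sorted(c) for c in only_ml))
print("only in parsimony tree:", sorted(sorted(c) for c in only_parsimony))
```

In this toy case the two trees agree on the groupings A+B+C and D+E but disagree about which pair within A, B, C is closest. That is the kind of non-congruent area where one might reevaluate or gather more data.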

play14:59

And so here's our big-picture overview of all of the methods that we've talked about. Now, originally I put Bayesian analysis in this category of using nucleotide sites, and we certainly do that. We're not using distance data, so it's very clearly in this right-hand column. But whether it goes under a clustering algorithm or an optimality criterion is a slightly different question.

play15:21

In both parsimony and maximum likelihood, we really are optimizing. We're doing this by doing a very intensive search of lots and lots of trees, and we're really optimizing the score based on either a parsimony criterion or this likelihood score that we gave you an overview of. In a Bayesian analysis we're doing something slightly different, which is why I put it up here, but with a question mark.

play15:43

We are really optimizing the model of evolution and the parameters for that model by doing this iterative approach. We start with an unknown distribution, maybe we give it a neighbor-joining tree, and we then allow the analysis, as it progresses, to inform it and to go back in and re-evaluate or change the parameters for our model of evolution, and then see if that gives us a better tree. If it does, great, we keep them. If not, we allow the next round to change those parameters for the model of evolution. So in reality, a Bayesian analysis is optimizing our model of evolution and our parameters for that model of evolution, and then allowing that to inform our overall outcome.
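
The propose, score, keep-or-reject loop described here can be sketched with a toy Metropolis sampler. This is only an illustration under stated assumptions: a real phylogenetic run proposes changes to trees and model parameters, whereas this sketch walks along a single number `x` on a made-up two-peak landscape (`log_score`, the peak positions, and the step size are all hypothetical). One detail worth noting is that the standard Metropolis rule does not only keep improvements; it also accepts a worse proposal some of the time, which is what lets the chain leave a lower peak.

```python
import math
import random


def log_score(x):
    """Toy rugged landscape: a lower peak near x = -2 and a taller peak near x = 3."""
    a = math.log(0.3) - 0.5 * (x + 2.0) ** 2
    b = math.log(0.7) - 0.5 * (x - 3.0) ** 2
    m = max(a, b)
    # log(exp(a) + exp(b)), computed so it cannot underflow far from the peaks
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def metropolis(n_steps, start, step_size=1.0, seed=0):
    """Random walk that always accepts better scores and sometimes accepts worse ones."""
    rng = random.Random(seed)
    x, current = start, log_score(start)
    samples = []
    for _ in range(n_steps):
        proposal = x + rng.gauss(0.0, step_size)
        proposed = log_score(proposal)
        accept_probability = math.exp(min(0.0, proposed - current))
        if rng.random() < accept_probability:
            x, current = proposal, proposed
        samples.append(x)
    return samples


samples = metropolis(n_steps=20000, start=-8.0)  # deliberately poor starting point
burn_in = 2000
kept = samples[burn_in:]                         # discard the early, low-scoring steps
best = max(kept, key=log_score)
print("best-scoring sampled position:", round(best, 2))
```

The discarded first 2,000 steps play the role of the burn-in: the chain starts far from either peak, climbs quickly, and only the later samples are kept.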

play16:25

So maybe a better place for Bayesian analysis is in the gray area between these two things, because we're not really finding the best tree based on a single optimality criterion. We're allowing our analyses to inform the parameters, change our model of evolution, or at least the parameters for that model of evolution, and then reiterate again and again and again until we don't get any better. We've reached a plateau phase where we're not getting better scores no matter how many times

play16:54

we do this iterative approach. So review this and make sure you're familiar with these methods, albeit this is somewhat of an overview. For some of these, you should know how to map characters onto a tree in a parsimony analysis and find a score for that tree, and then just understand the basic principles for the more computationally intense methods, maximum likelihood and the statistical methods for Bayesian analysis.
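
As a study aid for mapping characters onto a tree and finding a parsimony score, here is a minimal sketch using Fitch's algorithm, one standard way to count the minimum number of changes on a fixed binary tree. The lecture does not name a specific algorithm, so treating Fitch's method as the one intended is an assumption, and the tree format, function names, and tiny alignment below are hypothetical.

```python
def fitch_site_score(tree, states):
    """Minimum number of state changes for one character on a rooted binary tree.

    tree: nested tuples with string leaves, e.g. (("A", "B"), "C")
    states: dict mapping each leaf name to its state at this site
    """
    changes = 0

    def walk(node):
        nonlocal changes
        if isinstance(node, str):
            return {states[node]}
        left, right = (walk(child) for child in node)
        common = left & right
        if common:                # children can agree: no change needed here
            return common
        changes += 1              # children disagree: count one change
        return left | right

    walk(tree)
    return changes


def parsimony_score(tree, alignment):
    """Sum the per-site scores over all sites in the alignment."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch_site_score(tree, {taxon: seq[i] for taxon, seq in alignment.items()})
        for i in range(n_sites)
    )


# Hypothetical two-site alignment for five taxa.
alignment = {"A": "GA", "B": "GA", "C": "TA", "D": "TC", "E": "TC"}
tree_1 = ((("A", "B"), "C"), ("D", "E"))
tree_2 = ((("A", "C"), "B"), ("D", "E"))

print(parsimony_score(tree_1, alignment))  # 2 changes
print(parsimony_score(tree_2, alignment))  # 3 changes
```

Under the parsimony criterion, `tree_1` is preferred here because it explains the same data with fewer changes.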
