Neighbour Joining
Summary
TL;DR: The video script discusses the application of distance-based methods in reconstructing genetic trees, focusing on the use of least squares and minimum evolution criteria for optimal tree selection. It highlights the impracticality of exhaustive search due to the exponential growth of possible trees with more leaves, and introduces heuristic and clustering algorithms as alternative approaches. The script provides a detailed explanation of the neighbor-joining method, a popular clustering algorithm, which involves iterative steps to build a tree by combining the nearest nodes based on specific criteria. The process is illustrated with an example using a distance matrix, demonstrating how the algorithm constructs a tree with branch lengths that match observed distances, although it does not guarantee finding the best possible tree.
Takeaways
- 🌳 The script discusses distance-based methods for reconstructing genetic trees, focusing on finding the optimal tree with the smallest sum of squared differences.
- 🔍 It highlights the impracticality of exhaustive search for large datasets due to the explosive growth in the number of possible trees with an increase in leaves.
- 🔄 An alternative to exhaustive search is heuristic search, which uses a hill climbing algorithm to gradually move towards better trees, starting from a random tree and improving upon it.
- 👉 The script introduces clustering algorithms, specifically neighbor-joining, as a third approach within distance-based methods, which is widely used and faster than the other two methods.
- 📊 The starting point for clustering algorithms is a distance matrix, from which the algorithm iteratively builds the tree by combining the two nearest nodes at each step.
- 🔬 The neighbor-joining algorithm computes a specific measure 'u' for each leaf to determine which nodes to combine first, based on the smallest pairwise observed distance minus the 'u' values.
- 🔗 After combining nodes, neighbor-joining uses formulas to calculate the branch lengths connecting the nodes to their new ancestral node, rather than simply splitting the observed distance.
- 📈 The algorithm iteratively updates the distance matrix by replacing the rows and columns of the combined nodes with a new row and column for the ancestral node, and recalculating distances.
- 🔚 The process continues until all nodes are combined into a tree, with the algorithm providing a rapid method to obtain an overview of the data set, even though it does not guarantee the best possible tree.
- 📝 The script provides a detailed step-by-step guide for implementing the neighbor-joining algorithm, which can be followed to construct a tree from a given distance matrix (see the code sketch after this list).
- 📉 While neighbor-joining does not always provide the perfect match between observed and patristic distances, especially for larger datasets, it often yields reasonably good trees for an overview.
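For readers who want to follow the recipe in code, here is a minimal, self-contained Python sketch of the procedure summarized above. It is written for this summary, not taken from the video; the function name, the dict-based distance matrix, and the "anc0", "anc1", ... labels for ancestral nodes are all choices made here.

```python
import itertools

def neighbor_joining(labels, d):
    """Minimal neighbor-joining sketch (not the video's own code).

    labels: list of tip names. d: dict mapping frozenset({i, j}) to the
    observed distance between i and j; it is extended in place as new
    ancestral nodes are created. Returns a list of (node, node, length) edges.
    """
    nodes = list(labels)
    edges = []
    next_id = 0
    while len(nodes) > 2:
        n = len(nodes)
        # u_i: sum of distances from i to every other node, divided by n - 2
        u = {i: sum(d[frozenset((i, k))] for k in nodes if k != i) / (n - 2)
             for i in nodes}
        # combine the pair (i, j) that minimizes d(i, j) - u_i - u_j
        i, j = min(itertools.combinations(nodes, 2),
                   key=lambda p: d[frozenset(p)] - u[p[0]] - u[p[1]])
        dij = d[frozenset((i, j))]
        anc = f"anc{next_id}"  # arbitrary name for the new ancestral node
        next_id += 1
        # branch lengths to the new ancestor (not a simple 50/50 split)
        edges.append((i, anc, 0.5 * dij + 0.5 * (u[i] - u[j])))
        edges.append((j, anc, 0.5 * dij + 0.5 * (u[j] - u[i])))
        # distances from the new ancestor to every remaining node
        for k in nodes:
            if k != i and k != j:
                d[frozenset((anc, k))] = 0.5 * (
                    d[frozenset((i, k))] + d[frozenset((j, k))] - dij)
        nodes.remove(i)
        nodes.remove(j)
        nodes.append(anc)
    # finally, connect the last two remaining nodes with a single branch
    i, j = nodes
    edges.append((i, j, d[frozenset((i, j))]))
    return edges
```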
Q & A
What are the main methods discussed in the transcript for reconstructing genetic trees?
-The transcript discusses three main methods for reconstructing genetic trees: exhaustive search, heuristic search, and clustering algorithms such as neighbor-joining.
What is the primary challenge with using exhaustive search to find the optimal genetic tree?
-The primary challenge with exhaustive search is that the number of possible trees grows exponentially with the number of leaves, making it impractical for more than a small number of leaves.
How does the heuristic search method differ from exhaustive search?
-The heuristic search method, also known as hill climbing, starts with a random tree and iteratively improves it by choosing the best neighboring tree based on the sum of squared differences, rather than examining all possible trees.
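As a rough illustration of that idea, here is a generic hill-climbing sketch in Python; the tree representation, the `neighbors` rearrangement generator, and the `q_score` function are placeholders assumed here, not anything specified in the video:

```python
def hill_climb(initial_tree, neighbors, q_score):
    """Generic hill climbing for least-squares tree search (sketch).

    neighbors(tree) should yield the trees reachable by small
    rearrangements; q_score(tree) should return the sum of squared
    differences Q (smaller is better). Both are hypothetical callables.
    """
    current = initial_tree
    current_q = q_score(current)
    while True:
        # score every neighboring tree and keep the best one
        candidates = [(q_score(t), t) for t in neighbors(current)]
        if not candidates:
            return current
        best_q, best = min(candidates, key=lambda c: c[0])
        if best_q >= current_q:
            return current  # no neighbor improves Q: a local optimum
        current, current_q = best, best_q
```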
What is the basic premise of clustering algorithms in genetic tree reconstruction?
-Clustering algorithms, such as neighbor-joining, work by iteratively combining the two 'nearest' nodes into a tree, gradually building up the tree structure until all nodes are included.
How does the neighbor-joining algorithm decide which nodes to combine first?
-The neighbor-joining algorithm decides which nodes to combine first by calculating, for each pair of nodes, their pairwise observed distance minus the 'u' values of both nodes, and then selecting the pair with the smallest value.
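Restated symbolically (notation chosen here; with d(i, j) the observed distance and u as defined in the next answer), the pair picked at each step is the one minimizing

$$d(i, j) - u_i - u_j$$

so a pair is favored when its two tips are close to each other but far, on average, from all the other tips.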
What is the purpose of the value 'u' in the neighbor-joining algorithm?
-The value 'u' for a leaf is the sum of the distances from that leaf to all other leaves, divided by the number of leaves minus two (roughly a scaled average distance). It is used to determine which pairs of nodes are 'nearest' and should be combined first in the tree.
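Written out, with n tips and d(i, j) the observed distance between tips i and j:

$$u_i = \frac{1}{n - 2} \sum_{j \neq i} d(i, j)$$

For tip A in the worked example from the transcript: u_A = (17 + 21 + 27) / (4 - 2) = 32.5.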
How does the neighbor-joining algorithm compute the branch lengths when combining two nodes?
-The algorithm uses specific formulas to compute the branch lengths from the combined ancestral node to each of the original nodes, based on the observed distances and the 'u' values of the nodes being combined.
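The transcript states these formulas verbally (half the observed distance plus half the difference of the 'u' values); written out, with v_i and v_j the branch lengths from tips i and j to their new ancestor:

$$v_i = \tfrac{1}{2} d(i, j) + \tfrac{1}{2}(u_i - u_j), \qquad v_j = \tfrac{1}{2} d(i, j) + \tfrac{1}{2}(u_j - u_i)$$

Plugging in the worked example's numbers for C and D: v_C = 14/2 + (23.5 - 29.5)/2 = 4 and v_D = 14/2 + (29.5 - 23.5)/2 = 10, matching the video.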
What is the significance of the patristic distance in the neighbor-joining algorithm?
-The patristic distance is the distance from one node to another in the tree. The neighbor-joining algorithm aims to construct a tree where the patristic distances match the observed distances as closely as possible.
Why might the neighbor-joining algorithm not always provide the best possible tree?
-The neighbor-joining algorithm does not guarantee finding the best possible tree because it is an iterative procedure that builds the tree step by step, which can lead to local optima rather than a global optimum.
What is the advantage of using clustering algorithms like neighbor-joining for reconstructing genetic trees?
-Clustering algorithms like neighbor-joining are advantageous because they are much faster than exhaustive or heuristic searches, making them suitable for large datasets where a rapid overview is needed.
Outlines
🌲 Distance-Based Phylogenetic Tree Reconstruction
This paragraph introduces the concept of using distance-based methods to reconstruct genetic trees. It discusses the challenges of using least squares and minimum evolution criteria due to the exponential growth of possible trees with an increasing number of leaves. The paragraph explains that exhaustive search is impractical for more than about 10 to 15 leaves, making heuristic search methods like hill climbing necessary. It also introduces clustering algorithms, specifically Neighbor-Joining, as a third method for tree construction. The Neighbor-Joining algorithm is highlighted as the most widely used clustering method, which iteratively builds a tree by combining the two nearest nodes based on a distance matrix. The paragraph concludes by explaining that while these methods are faster, they do not guarantee the best possible tree but seem to work well empirically.
🔍 Neighbor-Joining Algorithm: Step-by-Step Explanation
The second paragraph delves into the specifics of the Neighbor-Joining algorithm, starting with the computation of a measure 'u' for each leaf, which is the sum of the distances from that leaf to all others, divided by n-2. The algorithm combines the two tips for which the pairwise observed distance minus their respective 'u' values is smallest. After combining nodes, the paragraph explains the process of determining the branch lengths connecting the nodes to their new ancestral node and how to update the distance matrix with the new ancestral node's distances to the remaining tips (see the update formula below). The paragraph concludes by emphasizing that the Neighbor-Joining method is a rapid iterative procedure that does not guarantee the optimal tree but can provide a good least squares tree for large datasets.
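The update formula itself is not written out in this summary; the standard neighbor-joining rule, which reproduces the numbers in the example below (distance 17 from the new node X to A, 8 from X to B), is

$$d(X, k) = \tfrac{1}{2}\big(d(i, k) + d(j, k) - d(i, j)\big)$$

for each remaining tip k, where i and j are the two tips just combined into the ancestral node X.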
📊 Applying the Neighbor-Joining Algorithm to a Distance Matrix
This paragraph provides a practical example of applying the Neighbor-Joining algorithm to a real distance matrix. It walks through the initial step of computing the 'u' values for each node and then identifies the pair of nodes for which the observed distance minus the two 'u' values is smallest to start the tree construction. The paragraph details the calculation of branch lengths for the ancestral node and the process of updating the distance matrix to include the new node. It continues by repeating the process with the reduced dataset until only two nodes remain, which are then connected to complete the tree. The example concludes by verifying that the patristic distances in the resulting tree match the observed distances, demonstrating the Neighbor-Joining algorithm's effectiveness in this instance (a runnable check follows below).
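As a concrete check, the `neighbor_joining` sketch given after the Takeaways list can be run on the video's four-sequence matrix. Note that the sketch breaks the tie in the first selection step differently (it merges A and B first, where the video arbitrarily chose C and D), but it arrives at the same tree and the same branch lengths:

```python
labels = ["A", "B", "C", "D"]
dist = {frozenset(pair): v for pair, v in {
    ("A", "B"): 17, ("A", "C"): 21, ("A", "D"): 27,
    ("B", "C"): 12, ("B", "D"): 18, ("C", "D"): 14,
}.items()}
for child, parent, length in neighbor_joining(labels, dist):
    print(f"{child} -- {parent}: {length:g}")
# A -- anc0: 13, B -- anc0: 4, C -- anc1: 4, D -- anc1: 10, anc0 -- anc1: 4
```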
🤔 Limitations and Practicality of the Neighbor-Joining Algorithm
The final paragraph acknowledges the limitations of the Neighbor-Joining algorithm, noting that while it can provide good trees, it does not always guarantee the best possible tree, especially for larger datasets. It emphasizes that the algorithm is a heuristic method that offers a reasonably good solution without the need for exhaustive search. The paragraph concludes by highlighting the practicality of the Neighbor-Joining algorithm for quickly obtaining an overview of large datasets, suggesting its value in preliminary phylogenetic analysis.
Keywords
💡Distance-based methods
💡Least squares
💡Minimum evolution
💡Heuristic search
💡Neighbor-joining
💡Distance matrix
💡Clustering algorithms
💡Optimality criterion
💡Parsimony
💡Hill climbing algorithm
Highlights
Discussion on using distance-based methods to reconstruct genetic trees.
Exploration of least squares and minimum evolution optimality criteria for finding the best possible trees.
Challenge of the exponential growth of possible trees with an increasing number of leaves.
Introduction of heuristic search as an alternative to exhaustive search for large datasets.
Explanation of the hill climbing algorithm for gradually moving towards better tree structures.
Introduction of clustering algorithms as a third approach within distance-based methods.
Description of the neighbor-joining method as the most widely used clustering algorithm.
Process of starting with a distance matrix in clustering algorithms.
Iterative procedure of building a tree by combining the two nearest nodes.
Explanation of the neighbor-joining algorithm steps for tree construction.
Computation of a specific measure 'u' for deciding which nodes to combine first.
Combination of the pair with the smallest observed distance minus the two 'u' values in neighbor-joining.
Calculation of branch lengths to the combined ancestral node.
Updating the distance matrix with a new ancestral node and its distances to other nodes.
Example walkthrough of applying the neighbor-joining algorithm to a real distance matrix.
Demonstration of how the neighbor-joining algorithm builds a tree step by step.
Final result of the neighbor-joining algorithm matching observed and patristic distances.
Emphasis on the neighbor-joining algorithm not always guaranteeing the best tree but often providing good results.
Transcripts
so we're discussing how to use distance
based methods to reconstruct the genetic
trees we've been seeing how the least
squares and minimum evolution optimality
criteria can be used to find the best
possible trees in essence the idea is
that we try to look through all possible
trees for each of them we can compute
this measure this sum of squared
differences and the idea is then to pick
the tree with the smallest sum of squared
differences as the optimal tree
however as we saw when we were
discussing parsimony there's a problem
in that the number of trees grows
explosively with the number of leaves so
for any realistic number of leaves if we
have more than 10 to 15 leaves it's
impossible to look through all the trees
for distance based methods it's therefore
possible just as it was for parsimony
based methods
to do a heuristic search it's exactly
the same idea you start with a random
tree you look for a set of neighboring
trees by doing small rearrangements for
each of the neighboring trees you can
compute this sum of squared differences
you then pick the neighbor that has the
best sum of squared differences the
smallest value of Q pick that as your
new starting point
take the neighboring trees from that
again take the neighbor with the best
value climb one more step and that way
using this hill climbing algorithm you
will gradually move towards better and
better trees essentially reaching a local
and hopefully global optimum where you
have a good least squares tree so
exhaustive search one option if you have
few tips heuristic search another option
if you have more tips and that's
possible both for parsimony and for
distance based methods however in the case of
distance based methods there's actually
a third way and that is the group of
methods known as clustering algorithms
or clustering methods neighbor-joining
is by far the most widely used of
these and I'm going to explain how
that works the main idea in clustering
algorithms is that instead of having an
actual optimality criterion instead of
being able to for any given tree compute
how good is this tree and then use some
method to find the best tree the idea is
that instead you have an iterative
procedure where you gradually step by
step build your tree in the case of
clustering algorithms as always in
distance based methods the starting
point is a distance matrix the idea in
clustering algorithms is that you then
take two nodes that are in some sense of
the word and I'll return to that the two
nearest ones so you take two nodes that
you believe to be adjacent you put those
together in the tree in the sense that
you connect both of them to a common
ancestral node in the distance matrix
you now replace the columns and rows
that you had for these two original tips
you replace that by a new column and row
corresponding to your new ancestral
node you then just need to figure out
what should I put in the distance matrix
as the distance from this ancestral node
to the rest of them and depending on the
method there are different ways of doing
this you repeat these steps each step
putting together the two nearest nodes
then the next two nearest nodes etc
gradually building up the tree until all
the nodes are in the tree this is very
rapid you just have an iterative
procedure you look at each node once
and you will end up with a tree
quite rapidly there is of course no
guarantee that the tree you find will be
the best possible tree it won't
necessarily be the optimal
least-squares tree or the optimal
minimum evolution tree
however empirically these methods do
seem to work fairly well and they are
definitely a lot faster so especially in
situations where you want to rapidly get
an overview or where your data set is
very large it might be worth looking at
these algorithms so in the case of
neighbor-joining
which is the method we'll now take a
closer look at as I say each step in a
clustering algorithm is about picking
two nodes to put together and they are
the nodes that are in some sense of the
word nearest now in the case of
neighbor-joining the sense of nearest is
a bit peculiar perhaps
but it nevertheless works quite well so
I've put on this slide the entire
algorithm the entire recipe for doing
neighbor joining in principle if you know
how to program a
computer you can take this slide and
make a small computer program that will
take as its input a distance matrix and
then construct a neighbor joining tree
as its output just by following the
steps on this single slide so I'm going
to just rapidly read through it it may
not make sense while I'm going through
it now but after I've done that I will
walk you through an example
where we use the neighbor-joining
algorithm on a real distance matrix to
build a tree and you will see how that
works out so the idea in
neighbor-joining
is that you compute a particular measure
that you use as the basis for deciding
which nodes to combine first for each
tip for each leaf you compute this value
called u which is the sum of the
distance from your tip to all the other
tips divided by n minus 2 where n is the
number of leaves so this is essentially
the average distance from
your leaf i to all other leaves
except you divide by n minus 2 instead
of by n minus 1 there's a deeper reason
for why it looks like that I'm not going
to go into this now I'm just stating the
algorithm in neighbor-joining
the idea is now to combine the two tips
I and J where this expression is
smallest okay you combine the two tips
I and J where the pairwise observed
distance between the nodes minus u I
minus u J is the smallest for reasons
that I won't go into again this
turns out to be a good way of
iteratively building the tree once you
have combined the two nodes there are
some things you need to figure out you
have connected the two nodes to an
ancestral node so you now need to figure
out what is the length of the branch
connecting each of these nodes to the
ancestral node in principle you could
just divide it in half and say half the
observed distance would be put on either
of the branches going out to either node
that's what's done in a method called
UPGMA that we won't be discussing but in
neighbor-joining
you instead use the expressions down
here which I again will not go into for
computing the two separate branch lengths
finally you need to figure out this new
ancestral node that you just put in
place of the two original nodes what is
the distance from that ancestral
node to the rest of your tips in your
tree there's a formula for that also
based on the observed pairwise
distances in the following manner I
won't again go into that bottom line
follow these steps compute u for
each of the tips compute this value for
each pair of tips pick the pair where it is
the smallest compute new branch lengths
from your two combined tips down to the
ancestral node compute distances
from your ancestral node to all the
remaining nodes and then run another
iteration where you're now treating the
new ancestral node as a tip you repeat
until all the nodes are combined so
let's walk through an example and see
how that works let's say we start with
the distance matrix we have four
sequences we have these observed
distances step number one is to compute
the value u so for each node you take
the sum of the distance from your node
to the remaining nodes and divide it by
n minus 2 in this case n is 4 we have 4
tips so n minus 2 is 2 in the case of A
the distance from A to B
is 17 A to C is 21 A to D is 27 so we add up
those three numbers and divide by 2
that's 32.5 same for B 12 plus 18 plus
17 are the three distances from B to the
remaining three nodes divided by two we
get those numbers do the same for C and
D now we compute this number for each
pair of sequences the pairwise distance
between the two tips
minus u for one tip minus u for the other
tip
so in the case of A and B for instance the
observed distance from A to B is 17 u
for A is 32.5 u for B
is 23.5 so 17 minus
32.5 minus 23.5 that's
minus 39 okay we do the same
for all the other pairs of sequences
we now need to find
the smallest value in this table in
this particular case we actually have
two equally small values we have A and B
and C and D so we arbitrarily select C
and D as the starting point for this
combination and put them together on the
tree we have now decided C and D must be
adjacent in the tree so we connect both
of them to some ancestral node which
I've called X on this particular slide
we now need to figure out what are
the lengths of the branches connecting C
and D to X I showed you the formulas in
the original algorithm slide the formula
for node I here for instance is half the
observed distance between in this case C
and D plus half the difference between
u I and u J so the observed distance
between C and D was 14 so 0.5 times 14
plus 0.5 times u of C that's
23.5 minus u of D that's
29.5 plug in the
numbers do the computations it comes out at
4 do the same for the other node 0.5
times 14 plus 0.5 times
u for in this case D minus u for C plug in
the numbers the value is 10 so according
to this formula we have now combined C
and D connected them to a common
ancestral node and we've computed the
branch lengths from that ancestral node
out to them
notice that the sum of the branch
lengths in this case 4 plus 10 is 14
that's the same as the observed distance
between them so in this case the
patristic distance between
C and D actually matches the one we observed
originally we now need in the
distance matrix to replace the rows and
columns for C and D with a new row
and a new column for our new
ancestral node to do that we need to
compute the distance from our new node
here X to the remaining nodes in the
tree A and B are the two remaining nodes
that we haven't dealt with yet there's a
formula for that also based on
these distances I've put it up here plug in
the numbers you get the distance from X
to A is 17 the distance from X to B is 8 so
we put those numbers in the distance
matrix and we can now delete the C and D
rows and columns giving us this new
distance matrix and we're now ready to
run the next iteration again for each of
the nodes in the distance matrix we
compute this value u we take
the sum of the distances from a
node to the remaining nodes so in the
case of A that's the distance to B plus
the distance to X 17 plus 17 we divide
it by n minus 2 we now have three nodes
meaning we divide by 1 17 plus 17
divided by 1 is 34 for B 8 plus 17 is 25
and finally for X 8 plus 17 is also 25 we
now compute this other value for each
pair of sequences the pairwise distance
minus u for one node minus u for the other
node in this case the smallest value in
this table is between A and B we
therefore combine A and B in the tree we
connect them both to an ancestral
node again we compute the two separate
branch lengths using the formulas I
showed you before in this case they turn
out to be
13 and 4 and again you will
notice that this means that the
sum of these 13 plus 4 is 17
that's the same as the observed distance
between these nodes so again in this
case in this part of the tree the
patristic distance matches the observed
distance we now have just one remaining
pair of sequences we now need to replace
a and B with the new ancestral note why
we need to compute the distance from Y
to X that's again done using the formula
that was on the original algorithm slide
it turns out to be four and at this
point we're actually done because we now
only have these two nodes in the tree so
we just need to connect them with a
branch which has the link that we just
computed so the result of all of this is
the following we started with this
distance matrix we went through the
different steps in the algorithm and you
can see that we ended up with the tree
looking like this we have C and D
together and A and B together and we have these
branch lengths the distance from A to B
is 17 that's the same as we see in the
tree
13 plus 4 the distance from A to C is 21
13 plus 4 plus 4 that's 21 the distance
from A to D is 27 13 plus 4 plus 10 that's
27 distance from B to C 4 plus 4 plus 4
that's 12 that also fits distance from B
to D 4 plus 4 plus 10 that's 18 and
finally the distance from C to D 10 plus 4
is 14 so in this particular case the
neighbor-joining algorithm actually gave
us the perfect answer there's a perfect
match between the observed distances and
the patristic distances that will not
always be the case in particular for
larger data sets and for cases where
it's of course not possible to find the
best possible tree but often
neighbor-joining does give reasonably good trees
it will not guarantee you that you find
the best tree