ModelAngelo
Summary
TLDR: The video transcript features a presentation by Kiarash Jamali from the MRC Laboratory of Molecular Biology in Cambridge, UK, discussing ModelAngelo, an automated atomic model building program for cryo-electron microscopy (cryo-EM) maps. The host opens by noting that the talk falls on Pi Day, a celebration of mathematics and science, before Jamali delves into the technical aspects of ModelAngelo. He outlines the three-step process of the program: C alpha atom predictions, full atomic model building using sequence information, and post-processing to refine the model. The use of a graph neural network and the attention algorithm, which also underlies machine learning models like ChatGPT and AlphaFold, is highlighted. Jamali also addresses the application of ModelAngelo in building models without prior sequence knowledge and its potential in identifying unknown sequences from high-resolution maps. The talk concludes with a Q&A session where common issues are discussed, future improvements to ModelAngelo are mentioned, including the addition of nucleotide modeling and an integrated HMMER search, and the presenter's future targets in the field of structural biology are briefly touched upon.
Takeaways
- 🎓 **Pi Day Celebration**: The video begins with a mention of Pi Day (March 14), which is a celebration of mathematics and science, significant to the presenter's family.
- 🧬 **Introduction to Kiarash Jamali**: Kiarash Jamali, from the MRC Laboratory of Molecular Biology in Cambridge, UK, discusses ModelAngelo, a program for automated atomic model building for cryo-EM maps.
- 🛠️ **ModelAngelo Overview**: ModelAngelo is a three-step automated process for building atomic models from cryo-EM maps, involving C alpha atom predictions, full atomic model building, and post-processing.
- 🧠 **Machine Learning Approach**: ModelAngelo utilizes a graph neural network built on the attention algorithm, a technique used across machine learning applications, to combine cryo-EM map data, sequence information, and initial C alpha positions.
- 📈 **C Alpha Prediction**: The process starts by predicting, for each 1.5 Å voxel of the cryo-EM map, whether it contains a C alpha atom, which turns the task into a machine learning segmentation problem.
- 🔬 **Integration of Data**: The graph neural network in ModelAngelo is designed to integrate cryo-EM map data, sequence data, and initial C alpha positions into a comprehensive model.
- 🧬 **Sequence Module**: A key component is the sequence module, which uses a protein language model to analyze the relationship between the C alpha representation and the input amino acid sequence.
- 🧬 **Spatial Invariant Point Attention Module**: This module is similar to the one in AlphaFold and integrates geometric information about nearby residues, such as whether they form alpha helices or beta sheets.
- 🔍 **Post-Processing**: Post-processing involves turning probabilities for amino acid types into an actual atomic model, using a hidden Markov model to refine sequence predictions.
- 🧬 **Sequence Search**: ModelAngelo can perform sequence searches against a genome using HMM profiles, which can be useful for identifying unknown sequences in a cryo-EM map.
- ⚙️ **ModelAngelo Usage**: The command-line interface for ModelAngelo is designed to be simple, with options to build with or without sequence information, and detailed instructions available on GitHub.
Q & A
What is the significance of Pi Day in the context of this video?
-Pi Day, celebrated on March 14, is a holiday that is significant to this video because it is a celebration of mathematics and science in general. The presenter, Jason Key, mentions that it is a big holiday in his family and thematically appropriate for introducing Kiarash Jamali, who has a background in mathematics and computational approaches to structural biology.
What is ModelAngelo and what does it automate?
-ModelAngelo is an automated, atomic model building program for cryo-EM maps. It is designed to automate the process of building atomic models into cryo-EM maps in an end-to-end fashion, which traditionally is done manually on a residue-by-residue basis.
How does ModelAngelo utilize the attention algorithm in its process?
-ModelAngelo uses the attention algorithm as a differentiable search procedure within its graph neural network. This allows the system to extract information from a library of vectors representing different pieces of input data, such as the cryo-EM map, sequence information, and initial C alpha positions. The attention algorithm helps in integrating this data together to build a full atomic model.
What is the role of the sequence module in ModelAngelo?
-The sequence module in ModelAngelo processes the amino acid sequence for all the sequences expected to be seen in the map. It uses a protein language model to convert the text sequence into a series of vectors. These vectors are then used to calculate similarity to the sequence and update the representation of each residue with the added sequence information.
How does ModelAngelo handle the prediction of amino acid types for each residue?
-ModelAngelo predicts the probabilities for each amino acid type for every residue. Instead of choosing the highest scoring amino acid, it uses these probabilities to inform the selection of the correct sequence, often turning this into a hidden Markov model to align with the user-provided sequences.
What is the purpose of the post-processing step in ModelAngelo?
-The post-processing step in ModelAngelo is used to fix issues and refine the atomic model. It involves taking the probabilities for amino acid types and using them to determine the actual amino acid in each residue, correcting any issues with the initial C alpha positions, and refining the orientation of the atomic model.
How does ModelAngelo deal with unknown sequences in the map?
-ModelAngelo can be run without a sequence module, allowing it to build models without prior knowledge of the sequences. It can then use the built model to search against a genome or a database of known sequences to identify the unknown sequences present in the cryo-EM map.
What are the common issues faced when using ModelAngelo and how can they be diagnosed?
-Common issues include the map being in the wrong handedness, which can be diagnosed by checking the confidence values predicted by ModelAngelo or by flipping the hand of the map and re-running the model. A second issue is chains that are pruned from output.cif but still present in output_raw, which is caused either by low local resolution or by a sequence missing from the fasta file; running ModelAngelo without sequences and searching the resulting HMM profiles against a genome can reveal the missing sequence. A third issue is low global resolution, where only a higher-resolution map will give better results.
What are the improvements expected in the upcoming ModelAngelo 1.0 version?
-The upcoming ModelAngelo 1.0 version will include the ability to build nucleotides, improved performance with an integrated HMMER search, optimizations for faster processing, and the use of a Nyquist of 2 Angstroms for better results with nucleotides.
What is the ideal resolution range for ModelAngelo to build high-quality models?
-The ideal resolution range for ModelAngelo to build high-quality models is 3.5 Angstroms and better. The performance starts to drop off quickly after 3.5 Angstroms.
How does ModelAngelo handle symmetry in cryo-EM maps?
-ModelAngelo builds everything independently, assuming no symmetry. If symmetry is present in the map, the user may need to manually apply symmetry to the built chains, choosing the best chain and duplicating it accordingly.
What are the next steps or targets for Kiarash Jamali after working on ModelAngelo?
-Kiarash Jamali plans to continue improving ModelAngelo, particularly with the addition of nucleotide building capabilities. He is also involved with other projects in cryo-EM processing, with further plans to be disclosed as the projects develop.
Outlines
🎓 Introduction to ModelAngelo and Cryo-EM
The video begins with a greeting from Jason Key on Pi Day, emphasizing the significance of mathematics and science. He introduces Kiarash Jamali, a mathematician and computational biologist from the MRC Laboratory of Molecular Biology in Cambridge, UK. Jamali discusses ModelAngelo, an automated program for building atomic models from cryo-electron microscopy (cryo-EM) maps. The process involves three main steps: C alpha atom predictions, integrating sequence information to build a full atomic model, and post-processing to refine the model. The video also explains the importance of the cryo-EM map and the challenges of de novo residue-by-residue modeling.
🧠 The Attention Algorithm and ModelAngelo's Graph Network
Jamali explains the attention algorithm, a differentiable search procedure used in machine learning, which is central to ModelAngelo's operation. The algorithm helps the program identify connections between atoms by comparing vectors derived from the cryo-EM map and the amino acid sequence. ModelAngelo uses a graph neural network designed to combine cryo-EM map data, sequence information, and initial C alpha positions. The network employs convolutional neural networks (CNNs) to process density data and attention mechanisms to integrate this information, leading to an updated representation used to build the atomic model.
📐 Model Building Process and Post-Processing
The video outlines the process of using the updated representations from the graph network to build the atomic model. It details the backbone frame module, which predicts how far to shift the backbone atoms to refine the model's orientation. The program also predicts side chain orientations, amino acid probabilities, and a confidence measure based on the expected backbone RMSD. Post-processing involves converting probabilities into an actual atomic model, using a hidden Markov model to align against provided sequences and refine amino acid predictions.
🧬 Sequence-Informed and De Novo Model Building
Jamali demonstrates how ModelAngelo can build models both with and without user-provided sequences. He shows an example of building a model with the sequence and discusses the output directory structure, which includes folders for C alpha outputs and graph neural network (GNN) outputs. The video also explores de novo model building without sequences, highlighting the program's ability to predict and identify sequences even when they are not initially provided, using high-resolution cryo-EM maps.
🛠️ ModelAngelo's Command Line Interface and Common Issues
The video provides instructions on how to use ModelAngelo through its command line interface, detailing the process for building models with and without sequences. It explains the output directory structure and the contents of the output files. Jamali addresses common issues such as map handedness, local resolution problems, and the challenges of building models with low global resolution. He also suggests using hhblits for sequence searches and provides insights into troubleshooting and improving results.
🔍 Addressing Fragmented Chains and Future Directions
Jamali responds to a question about building fragmented chains, suggesting that local resolution might be the issue and manual fixes might be necessary. He discusses the potential of AI software in addressing model building challenges and the possibility of using ModelAngelo's intermediate files for localization. The video concludes with a Q&A session where Jamali addresses questions about the sensitivity of ModelAngelo to different density types and the possibility of using three-dimensional structures for unknown sequence cases. He also shares his future targets, including improvements to ModelAngelo and other projects in cryo-EM processing.
🚀 Upcoming Features in ModelAngelo 1.0
The video script mentions the upcoming release of ModelAngelo 1.0, which will include significant improvements. New features will enable the building of nucleotides, which are more challenging due to their complexity. The program will also integrate an HMMER search, streamlining the process of searching a genome. Additionally, ModelAngelo 1.0 will be faster and work better with higher resolutions, particularly at 2.5 Angstroms for nucleotides and 3.5 Angstroms and better for proteins.
Keywords
💡Cryo-EM
💡ModelAngelo
💡C Alpha Predictions
💡Attention Algorithm
💡Hidden Markov Model (HMM)
💡Resolution
💡Amino Acid Sequence
💡Post-Processing
💡Backbone Framework
💡Torsion Angle Prediction
💡Symmetry
Highlights
Kiarash Jamali introduces ModelAngelo, an automated atomic model building program for cryo-EM maps.
ModelAngelo is designed to automate the process of building atomic models into cryo-EM maps in an end-to-end fashion.
The program uses a three-step process involving C alpha atom predictions, sequence information integration, and post-processing.
A graph neural network is employed to combine cryo-EM map, sequence, and initial C alpha positions into a full atomic model.
The attention algorithm, used in ModelAngelo, is a differentiable search procedure aiding in extracting information from the library of vectors.
ModelAngelo utilizes convolutional neural networks for interpreting the density around C alpha atoms and predicting peptide bonds.
The sequence module uses a protein language model to integrate amino acid sequence information into the atomic model building process.
A spatial invariant point attention module is used to integrate geometric information about nearby residues.
ModelAngelo predicts confidence measures for the model based on the expected backbone RMSD, aiding in identifying areas of lower confidence.
Post-processing involves turning probabilities into an actual atomic model and using a hidden Markov model for sequence alignment.
ModelAngelo can build models without user-provided sequences, relying on the highest probability amino acids.
The program can identify and build models for unknown sequences in high-resolution cryo-EM maps.
ModelAngelo's command line interface is designed to be simple, with options to build with or without sequence information.
The program provides intermediate and final output files, with the final output being a CIF file for the atomic model.
Common issues with ModelAngelo include wrong map handedness, local resolution problems, and the need for manual sequence correction in some cases.
Upcoming ModelAngelo 1.0 will include improvements such as nucleotide modeling, integrated HMMER search, and performance optimizations.
ModelAngelo has been successfully used to build large models with up to 160,000 residues.
The ideal resolution for ModelAngelo to build high-quality models is 3.5 Angstroms or better.
The program is not limited by the size of the model and can handle large biomolecular complexes.
Transcripts
[COMPUTER SOUNDS]
Welcome to the SBGrid YouTube channel
software tutorials by developers,
lectures by structural biologists, unique content
brought to you by SBGrid.
[MUSIC PLAYING]
Hello, everybody.
Thank you for joining.
This is Jason Key at SBGrid.
Happy Pi Day.
Today is March 14, Pi Day.
So that's a celebration of mathematics,
and I take it as a celebration of science in general.
So Pi Day is a big holiday in our family.
We always make pie.
So with that, it is my pleasure to introduce Kiarash Jamali,
joining us from the MRC in Sjors Scheres's group.
Kiarash, thematically appropriate for Pi Day,
has a background in mathematics and has
taken on computational approaches
to structural biology problems.
And so I guess the first application borne
of those efforts that I've seen is ModelAngelo, which he's
going to tell us about today.
So Kiarash, go right ahead.
Thanks, so yes, I'm Kiarash Jamali
in the Scheres Group at the MRC Laboratory of Molecular Biology
in Cambridge, United Kingdom.
And today, I'm going to be talking about ModelAngelo,
which is an automated atomic model building
program for cryo-EM maps.
So I'll get started.
Here we go.
So this is a cryo-EM map.
And we want to build this atomic model into it.
And the way this is normally done
is if you want to do it de novo, you basically kind of just
do it residue-by-residue manually.
And so we wanted to automate this process
in an end-to-end fashion.
And so what we came up with is ModelAngelo,
which is a three-step process to build atomic models.
And the three steps are basically you first
start with C alpha atom predictions, which
is to place each residue separately,
just from the cryo-EM map.
And then you take the sequence information as well.
And you try to build the full atomic model
from those initial points as well as the cryo-EM map.
And then afterwards, we have post-processing
to fix some issues.
So for the C alpha part, which I won't really
be talking too much about today, is you have your C alpha atoms.
They're in this cryo-EM volume.
And we've voxelized it.
So we basically, say, ask the question within a 1 and 1/2
Angstrom cube of the cryo-EM map,
is there or is there not a C alpha atom existing there.
And then this can become a machine
learning segmentation problem.
So basically, you have a cryo-EM map that goes as an input,
and the output is just a bunch of probabilities for each voxel
whether or not it includes a C alpha atom.
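The segmentation output described here can be turned into candidate positions with a few lines of NumPy. This is only a minimal sketch, not ModelAngelo's code: the function name `ca_candidates`, the 0.5 threshold, and the convention that voxel `(i, j, k)` starts at `(i, j, k) * voxel_size` are all assumptions made for illustration.

```python
import numpy as np

def ca_candidates(prob, voxel_size=1.5, threshold=0.5):
    """Return (N, 3) coordinates in Angstrom of the centres of voxels whose
    predicted C alpha probability exceeds `threshold` (a hypothetical cutoff)."""
    idx = np.argwhere(prob > threshold)   # (N, 3) integer voxel indices
    return (idx + 0.5) * voxel_size       # centre of each selected voxel

# Toy 4x4x4 probability grid with two confident voxels.
prob = np.zeros((4, 4, 4))
prob[0, 0, 0] = 0.7
prob[1, 2, 3] = 0.9
coords = ca_candidates(prob)
print(coords)
```

Because only the voxel is known, each candidate is placed at the voxel centre, so it can be off by a good fraction of the voxel size before any later refinement.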
And so here's your map.
And this is real data.
So that's an actual map that we give online.
And on the right, for each orange dot,
is a C alpha prediction it has made.
But let's assume that this part we have already
trained the network.
And we want to move on to how to build
full atomic model from this.
And the way that works is basically
the rest of what ModelAngelo is, which
is this graph neural network.
And we had to specifically design it
so that you're able to combine all these different pieces
of information together, which is the cryo-EM map which
is a 3D voxel grid.
You have the sequence, which is just text.
The amino acid sequence for all the
sequences that you expect to see in the map.
And then you also have your initial C alpha positions,
which is just a graph of positions.
And how do you combine all this data together
is actually difficult and has never been done before.
So we designed this graph neural network,
where basically each module is
responsible for extracting information
from a specific piece of these inputs.
And then it's all integrated together within the graph.
And each module is based on this attention algorithm
that has now been taking over all of machine learning.
So if you've ever used ChatGPT, AlphaFold,
all use this attention algorithm.
So here it is.
And what the attention algorithm is, is it's
a differentiable search procedure.
So you have some object.
You represent it with a vector by putting it through a machine
learning, let's say, MLP.
And then you get a vector at the end.
And then you also have a library of things
that you want to compare it against.
And so what you do here is you do basically a dot
product of this vector to your library of vectors.
You get a similarity.
And then you marginalize over these similarities
with a different representation of the library.
And what this gives you at the end
is for a new object that has come in,
what kind of information can I extract
from the library of things I've seen in order
to get a new representation for the object
based on how it relates to the library.
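The search procedure just described is ordinary dot-product attention, and a single-query version fits in a few lines. The sketch below is a generic illustration, not ModelAngelo's implementation: the `1/sqrt(d)` scaling follows the standard scaled dot-product formulation, and the toy `keys` and `values` arrays are made up.

```python
import numpy as np

def attention(query, keys, values):
    """Single-query dot-product attention.
    query: (d,), keys: (n, d), values: (n, d_v)."""
    # Similarity of the incoming object to every entry in the library.
    scores = keys @ query / np.sqrt(query.shape[0])
    # Softmax turns similarities into weights that sum to one.
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Weighted sum over a second representation of the library (the values).
    return weights @ values, weights

keys = np.eye(5, 8)                        # five library entries, 8-dim keys
values = np.arange(20.0).reshape(5, 4)     # a different representation of them
query = 3.0 * keys[2]                      # a query resembling library entry 2
new_repr, weights = attention(query, keys, values)
print(weights.round(3), new_repr.round(3))
```

Every step is differentiable, which is what lets the network learn which library entries a query should attend to.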
That's a bit abstract, but it will crystallize a bit
when I go through each of these modules.
So for example, the cryo-EM module, what it does
is it looks at each of those C alpha initial points
that I talked about.
It takes the cube of density around this.
And you've probably heard of convolutional neural networks.
This is how, let's say, image processing with new machine
learning methods works and also 3D volume stuff as well.
And here in ModelAngelo we use it
as well for the cryo-EM images.
So you interpolate the cube around each C alpha,
and the cubes are also oriented based on the backbone
orientation at the C alpha position, which
you might be wondering, how do we even
get the backbone orientation?
At the very beginning, it's completely random.
But then ModelAngelo will update this backbone representation
as it goes along.
So we interpolate this cube of density
around the C alpha atom.
And we put it through a convolutional neural network.
We get a vector.
We do similar for rectangles of density between the current C
alpha atom and all its nearest neighbors, so 20
nearest neighbors.
We look at rectangles between them.
And this is supposed to give it an idea of
whether or not there's a peptide bond to that C alpha atom.
So then this also goes through a CNN,
and we also get another vector.
And then the cube kind of represents
the query, so the object that comes in.
And all the rectangles represent the library of things.
So the idea is to say, for this C alpha atom,
which of the other atoms around it does it actually connect to?
And then based on that, we can actually get the information
from those nearby atoms to come into the current atom
and update the representation.
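To make the neighbour lookup concrete, here is a small sketch of finding the nearest neighbours of every C alpha position. It uses a brute-force distance matrix, which is an assumption chosen for brevity; for large models a spatial index would be the more usual choice. The function name and toy coordinates are hypothetical.

```python
import numpy as np

def nearest_neighbours(coords, k=20):
    """Indices of the k nearest other points for each point, shape (N, k)."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)     # (N, N) pairwise distances
    np.fill_diagonal(dist, np.inf)           # a point is not its own neighbour
    k = min(k, len(coords) - 1)
    return np.argsort(dist, axis=1)[:, :k]

# Four toy C alpha positions along a line, in Angstrom.
coords = np.array([[0.0, 0.0, 0.0], [3.8, 0.0, 0.0],
                   [7.6, 0.0, 0.0], [30.0, 0.0, 0.0]])
nbrs = nearest_neighbours(coords, k=2)
print(nbrs)
```

Each row of `nbrs` gives the library that the corresponding C alpha attends over; the density sampled between a C alpha and each neighbour is what the talk refers to as the rectangles.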
The sequence module, which is the second module that
goes after it, then asks the question, OK, for each residue
that we have a C alpha for, given the representation that,
let's say, before it's been updated
from the cryo-EM module, can we see anything between this
and the sequence that came in?
And so the sequence, that's just text,
text goes through a protein language model.
So we use the ESM language model from Facebook.
And that turns it into a series of vectors.
And then we do a similar search against the sequence,
and we calculate similarity to the sequence.
And then we bring it back.
So this basically now corresponds--
before we just had cryo-EM information,
now we also add in the sequence information
from the amino acids that were inputted,
and you get an updated representation.
And lastly, we also do this module,
this spatial invariant point attention module,
that is very similar to the one that's in AlphaFold.
And the idea here is now you want
to integrate information about the geometry
of nearby residues, not just the cryo-EM information and not
just the sequence information but also the geometry,
whether or not something is alpha
helix, beta sheet, et cetera.
And the way this works is, for each residue, you
kind of predict points around what you should look at.
And then if there's a nearby residue that
is close at that point, then you get a high weight,
high similarity from that residue
when you bring in information again, OK.
So in this graph neural network, once the information
goes to each of these different modules,
the representations keep getting updated for each residue.
And then using these updated representations,
we put it through this backbone frame module.
And what this one does, for example,
this is one of the predictions we make,
is that for this backbone frame
that we talked about, which is just a triangle with the C
alpha atom, the nitrogen and carbon atoms,
it predicts how much to shift each of these atoms.
So in the beginning, when we did the segmentation task,
if you remember, it was like a 1 and 1/2 Angstrom voxel.
We just said that the C alpha atom exists in this voxel.
But that doesn't really tell you anything about where
in the voxel.
So the C alpha atom itself starts off a bit wrong.
And on average, it's like 1 Angstrom off.
And then the nitrogen and carbon atoms, we're actually,
we just started them off in a random rotation.
So those are completely wrong in the beginning.
But then every time these features are updated,
and they go through this backbone frame module,
we get these shift vectors, and then
now we can actually update the orientation of this triangle.
So each layer of the graph neural network
updates the orientations.
And we get closer and closer to what
the actual true orientation of the structure is.
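One way to picture the update is to derive an orientation from the three backbone atoms, apply the predicted shifts, and derive the orientation again. The sketch below assumes an AlphaFold-style Gram-Schmidt construction of the frame; the talk does not say that ModelAngelo computes it this way, and the function names and shift values are illustrative only.

```python
import numpy as np

def frame_from_backbone(n, ca, c):
    """Rotation matrix (columns are axes) and origin from N, CA, C positions."""
    e1 = c - ca
    e1 = e1 / np.linalg.norm(e1)
    u = n - ca
    e2 = u - np.dot(u, e1) * e1          # remove the component along e1
    e2 = e2 / np.linalg.norm(e2)
    e3 = np.cross(e1, e2)                # completes a right-handed frame
    return np.stack([e1, e2, e3], axis=1), ca

def apply_shifts(n, ca, c, shifts):
    """shifts: (3, 3) array of displacements for N, CA and C in that order."""
    return n + shifts[0], ca + shifts[1], c + shifts[2]

n = np.array([-0.5, 1.4, 0.0])
ca = np.zeros(3)
c = np.array([1.5, 0.0, 0.0])
rotation, origin = frame_from_backbone(n, ca, c)

# Hypothetical shift vectors, standing in for one layer's prediction.
shifts = np.array([[0.1, 0.0, 0.0], [0.0, 0.2, 0.0], [0.0, 0.0, -0.1]])
n2, ca2, c2 = apply_shifts(n, ca, c, shifts)
rotation2, origin2 = frame_from_backbone(n2, ca2, c2)
print(rotation2.round(3), origin2)
```

Repeating this once per layer is what moves the triangle from its random starting rotation toward the true backbone orientation.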
So this is one of these modules.
But there's a few more.
There's a torsion angle prediction for the side chains.
There's amino acid prediction probabilities.
And so all of those are predicted
after each of these modules has updated the representation.
OK, one of those things is also a confidence measure.
This is very reminiscent of the AlphaFold confidence measure,
but it's based on the backbone RMSD that we expect to see.
So basically, if the cryo-EM volume at some point
has a low local resolution, ModelAngelo kind of knows this.
And it knows that it can't confidently place the backbone.
So it gives you a lower predicted confidence
for how this loop is going to look, for example.
And that is represented in the B-factors
in the ModelAngelo output file, which we think
is very useful.
So that's also predicted by the graph neural network.
So for the post-processing, once everything
goes through this graph neural network,
we need to actually take this and turn it
into an actual atomic model.
So what we have is we have probabilities
for an amino acid, which amino acid type for each residue.
And it's not entirely clear how you
make the statement of which amino acid it actually is.
Because we can take the highest scoring
amino acid that ModelAngelo thinks
it is, but then you're not really
using the information that you got from the sequences
that the user has provided.
So the user knows what sequences exist in their map
most of the time.
And if you just take ModelAngelo's highest
predictions, you might actually end up with wrong sequences.
And so the way to fix that is to actually instead
of taking the highest scoring sequences, which is what people
normally do, you want to take the probabilities
that ModelAngelo is actually predicting,
and use those to somehow pick where the sequence is.
And the way we do that is we basically
turn this into a hidden Markov model.
And so basically, each amino acid that we have,
we have the probabilities, and then we create a hidden Markov
model profile, and then we do a search against the sequences
that the user has provided.
And so there you go.
That's a hidden Markov model.
The transition probabilities, as of right now,
are just average transition probabilities
but just taken from literature.
But actually, with the new ModelAngelo version,
which I'll talk about later, we've made this better.
So you take this hidden Markov model.
You search it against the sequence file with the HMM
Align, and then you get much closer amino acid predictions
than you had before, so more correct ones.
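The benefit of matching probabilities against a known sequence, rather than taking the top-scoring residue at each position, can be shown with a much simpler stand-in for the hidden Markov model. The sketch below is a deliberate simplification: it slides the built chain along the sequence without gaps and scores each placement by log-likelihood, whereas a real profile HMM also has insertion and deletion states. The function name, the `1e-6` floor, and the toy probabilities are all hypothetical.

```python
import math

def best_ungapped_match(probs, sequence):
    """probs: one dict per built residue mapping amino acid letter -> probability.
    Returns (offset, log_likelihood) of the best gap-free placement."""
    n = len(probs)
    best_offset, best_score = None, -math.inf
    for offset in range(len(sequence) - n + 1):
        score = sum(math.log(p.get(sequence[offset + i], 1e-6))
                    for i, p in enumerate(probs))
        if score > best_score:
            best_offset, best_score = offset, score
    return best_offset, best_score

# Three built residues; the middle one is ambiguous between L and I.
probs = [{"G": 0.6, "A": 0.4}, {"L": 0.5, "I": 0.5}, {"K": 0.9, "R": 0.1}]
sequence = "MAGIKT"

naive = "".join(max(p, key=p.get) for p in probs)
offset, score = best_ungapped_match(probs, sequence)
matched = sequence[offset:offset + len(probs)]
print(naive, matched, offset)
```

Here the per-residue argmax yields `GLK`, while matching against the provided sequence resolves the ambiguous position and yields `GIK`.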
And so this is an example of a model
we built with the sequence.
So the black outline is what's deposited.
And the blue is what ModelAngelo predicted.
And you can also see that using the torsion
angles that ModelAngelo predicts for the side chains,
we actually, more or less, get good side chain
orientations predicted.
And now the thing is this is before refining.
So you actually will, if you use Servalcat or something,
these side chains orientations become much better.
So we also wanted to see whether or not
we can do this without the user providing sequences.
And we know that if the resolution is high enough, when
people are building things in [INAUDIBLE], a lot of the time
they can make some judgments about what they think it is.
So we want to see if that's possible.
And so if you actually train the graph neural network itself
without having the sequence module,
you are then able to run ModelAngelo without sequences.
And then we wanted to see how that performs.
So this is for an amyloid.
In the Scheres Group we also work a lot with amyloids.
And this is TMEM106B which had a very high resolution map,
but they didn't know what protein it actually was.
And they spent months on this to find out
what it was just by basically looking at the map itself.
And this was much before ModelAngelo.
So we want to see if you could potentially use ModelAngelo
for a purpose like this.
And so this model you see here is built by ModelAngelo.
And how I mentioned before that if you just take the highest
probability without the sequence--
if you take the highest probability amino acids,
and you can see that already it's
pretty accurate because the resolution is very high here.
And then if we do the same thing,
and we build a hidden Markov model based off this,
and then instead of searching against that human-provided
sequence file, we search it against all
of the human genome, you can actually
see that you also find TMEM106B with a high confidence.
So you are able to also use this to find sequences
that you might not know exist in your cryo-EM map.
And so the conclusion for this is that we now have a novel way
to encode cryo-EM density information in graphs using
for human genome purposes.
It results in automated model building
of proteins when the resolution is high enough, so up to four
Angstrom.
Of course, as you get closer and closer to four Angstrom,
the results get worse.
And also, even if your map has a high global resolution,
if the local resolution is bad, it's
not going to do as well there.
And it's also capable of resolving unknown sequences
when the resolution is high.
So resolution should actually be a bit higher
if you wanted to get things that are unknown sequences.
So how do you actually use this? And this
is what I gather the seminar is mostly based around.
And the command line interface for ModelAngelo,
we strived to keep it very simple.
So there's actually two main ways
of running ModelAngelo, which I said before.
So it's with the sequence and without the sequence.
So if you want to build it with the sequence,
you just run ModelAngelo build.
You give it a path to the volume.
And you give it the path to the fasta file,
and actually, a lot of the time, when people have bugs,
it's because of the fasta file, so please make
sure it's formatted correctly.
And then you also give an output directory.
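Since badly formatted fasta files are named as a frequent source of bugs, a quick sanity check before launching a build may save a failed run. The sketch below is a generic check, not ModelAngelo's own validation: the accepted alphabet (the 20 standard amino acids plus `X`) is an assumption, and the flag names `-v`, `-f` and `-o` in `build_command` are assumed from memory of the project's README, so please confirm them against the program's help output.

```python
import shlex

VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWYX")   # assumed alphabet

def check_fasta(text):
    """Return a list of problems found in fasta text (empty means it looks fine)."""
    problems = []
    header_seen = False
    residues_since_header = 0
    for lineno, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header_seen and residues_since_header == 0:
                problems.append(f"line {lineno}: previous header has no sequence")
            header_seen = True
            residues_since_header = 0
        else:
            if not header_seen:
                problems.append(f"line {lineno}: sequence before any '>' header")
            bad = set(line.upper()) - VALID_RESIDUES
            if bad:
                problems.append(f"line {lineno}: unexpected characters {sorted(bad)}")
            residues_since_header += len(line)
    if header_seen and residues_since_header == 0:
        problems.append("last header has no sequence")
    return problems

def build_command(volume, fasta, output_dir):
    # Flag names are assumptions; verify them before use.
    return ["model_angelo", "build", "-v", volume, "-f", fasta, "-o", output_dir]

fasta_text = ">chain_A\nMKTAYIAKQR\n>chain_B\nGSHMLEDP\n"
problems = check_fasta(fasta_text)
command = build_command("map.mrc", "sequences.fasta", "output")
print(problems or "fasta looks fine")
print(shlex.join(command))
```

The command is only assembled and printed here, not executed, so the sketch runs without ModelAngelo installed.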
And what the output directory structure once it's done
looks like, is something like this.
So you get your output directory.
And inside are four folders.
So the C alpha output and then GNN output round 1, 2, and 3.
So I didn't really mention, but the GNN, it has eight layers.
And then once it's gone through the eight layers,
that's basically one round of the GNN.
But then at the very end of it, you have an atomic model.
This is closer to the real atomic model
than when it started.
So then you can apply it again.
And then you get closer.
And you can apply it again, and then output
round 3 is basically the end stage.
But these are intermediate results.
You probably, most of the time, you
don't really need to look inside these folders,
unless you want to debug.
So the two CIF files that exist in your output directory
are actually the outputs.
So we have the output.cif and we have output_raw.
Output.cif when you build with sequences,
is pretty aggressively pruned based
on the sequences provided.
So if ModelAngelo builds something and is
unsure of the sequence or if it's sure of the sequence
but the sequence does not exist in the fasta file,
it will prune it away.
Those still exist in output_raw.
So if something has been pruned away but is still in the raw file,
that could probably mean two things.
One, ModelAngelo was not very certain
about the actual sequence, so it predicted one kind of sequence,
or it had a low prediction confidence
for a portion of the sequence.
Or two, it just wasn't in the sequence file.
And so that could either happen if the local resolution is bad
or if you forgot to put the sequence in there
or that specific sequence in there.
Now, if you want to build without sequence,
it's the exact same.
It's just the command is build_no_seq,
and then you don't have to give it a sequence
file anymore, thankfully.
And then what the directory structure looks like then
is the same intermediate folders as before,
but now you have this other folder called HMM profiles.
HMM profiles is a folder, that for each chain
in your output.cif, also has an HMM profile file in it.
And this is actually what we use, for example, for the TMEM
case, where you want to search against a genome or something,
you would use these profiles.
And I'll explain how.
But then the output.cif here, is actually
the same as the output_raw in the build with sequence.
So it hasn't been pruned because you can't prune anything.
You don't have access to sequences anymore.
So it's a bit more messy than the normal ModelAngelo.
How would you actually do a sequence search?
Currently, the best way to do it is to use hhblits.
So the HMM files, HMM profile files that we provide currently
in ModelAngelo 0.2, are HHM files,
which are hhblits profiles.
And you have to install hhblits, and then you
would have to also use their database files.
It's a bit difficult to use.
And then basically, you run a command like this
to get access to your results, but it does work very well.
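The command shown on the slide is not part of the transcript. As a stand-in, here is a hedged sketch of what such a search could look like: it builds an hhblits call that takes one profile as the query (`-i`), a database prefix (`-d`), a results file (`-o`) and a number of iterations (`-n`). All paths are hypothetical, and this is not necessarily the command used in the talk.

```python
from pathlib import Path


def hhblits_command(profile_path, database_prefix, result_dir, iterations=1):
    """Build an hhblits command that searches one chain profile against a database."""
    # Write the result next to the other results, named after the profile.
    result_path = Path(result_dir) / (Path(profile_path).stem + ".hhr")
    return [
        "hhblits",
        "-i", str(profile_path),     # query: one HMM profile from the build
        "-d", str(database_prefix),  # HH-suite database prefix (hypothetical path)
        "-o", str(result_path),      # where the hit list is written
        "-n", str(iterations),       # number of search iterations
    ]


cmd = hhblits_command("hmm_profiles/A.hhm", "databases/my_genome_db", "search_results")
print(" ".join(cmd))
# To run it for real (requires HH-suite and a formatted database):
# import subprocess; subprocess.run(cmd, check=True)
```

Running one such command per chain profile gives a ranked hit list for each chain.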
And so some common issues.
One really big one, especially if you
think your map is very good and the result just looks horrible,
it's that the map is probably in the wrong handedness.
So the first thing to try is to flip the hand of the map,
try it again.
If the result gets better, that was probably why.
Another way to diagnose this is if you look--
like in Chimera for example, if you
look at the confidence values ModelAngelo predicts,
there should be some high confidence chains predicted.
So if there are no high confidence chains predicted,
yeah, it's probably in the wrong hand.
So that is the main most common issue we have.
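Flipping the hand amounts to mirroring the map along one axis. The sketch below shows the operation on a toy voxel grid stored as nested lists indexed `[z][y][x]`. That layout is an assumption made for the example; real maps would normally be read and written with a dedicated MRC library.

```python
def flip_hand(volume):
    """Mirror a 3D voxel grid (nested lists indexed [z][y][x]) along the x axis."""
    return [[row[::-1] for row in plane] for plane in volume]


# Toy 2 x 2 x 2 "map": each value just labels a voxel.
toy_map = [
    [[1, 2], [3, 4]],
    [[5, 6], [7, 8]],
]
mirrored = flip_hand(toy_map)
print(mirrored)

# Mirroring twice restores the original hand.
print(flip_hand(mirrored) == toy_map)
```

With a NumPy array the same mirror is `numpy.flip(data, axis=2)`. Mirroring along any single axis changes the hand, so the choice of axis does not matter for this test.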
Second one is if you see a lot of pruned away sequences
in the output.cif but in output_raw
it still exists, either it's a problem
with the local resolution, but if you definitely
think it's not a problem with the local resolution,
then it's highly likely that that sequence is not
provided in the sequence file that you provide.
So to diagnose this you could, for example, run ModelAngelo
without the sequence, then take the HMM profiles,
do a search against whatever genome you think,
and then look through the sequences.
If you see that, yeah, there's a new sequence in here that
could actually be the problem.
The third problem is that the global resolution is too low.
So it's not going to automatically
just stop working at four Angstrom and above,
but it will get worse.
And it'll get worse pretty quickly.
And it's going to perform best at 3.5 and higher resolutions.
Similar problem is the local resolution.
So even if your global resolution is high,
the local resolution, if it's low,
it's just not going to be very confident.
And it's not going to be able to distinguish its sequence and so
on.
And the last one is, yeah, sometimes it
can build a confident backbone.
This is when the local resolution
is on the cusp of something where it can see the backbone,
but it can't build the sequences, even if you've
provided the sequences.
And this happens sometimes.
And it's really hard to do anything about it
because just the side chain densities are probably
not good enough for ModelAngelo to understand what's happening.
So to diagnose this: if you're
confident that the sequence exists
in the input and ModelAngelo's raw output has that chain,
but it just doesn't exist in the pruned output,
then you could still use the backbone that ModelAngelo has
predicted, but you would have to go and fix the sequence
yourself.
So yeah, those are some of the common issues.
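The checks above can be summarized as a small decision function. This is only a paraphrase of the troubleshooting order given in the talk, not part of ModelAngelo. The argument names are invented for illustration, and each input is something the user has to judge by looking at the results.

```python
def diagnose(has_confident_chains, missing_from_pruned,
             local_resolution_ok, sequence_in_fasta):
    """Suggest a next step, following the order of common issues from the talk."""
    if not has_confident_chains:
        # No high-confidence chains at all: the map hand is the first suspect.
        return "flip the hand of the map and build again"
    if not missing_from_pruned:
        return "no common issue detected"
    if not local_resolution_ok:
        return "local resolution is likely too low to assign the sequence"
    if not sequence_in_fasta:
        return "sequence is probably missing from the fasta file: build without sequence and search with the HMM profiles"
    # Backbone is visible, sequence was provided, but side chains are too weak.
    return "use the backbone from the raw output and fix the sequence manually"


print(diagnose(has_confident_chains=False, missing_from_pruned=True,
               local_resolution_ok=True, sequence_in_fasta=True))
```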
So what are the next steps for ModelAngelo?
Well, we're going to have the 1.0, which
we will release soon.
And there's a lot of improvements.
The first one is that everything I've explained so far
is for proteins only.
But we are going to now also have the ability
to do nucleotides.
Nucleotides are considerably more
difficult to build automatically.
And so the results won't be as good, especially for lower
resolutions, as the proteins, but it should still
be respectable to build the backbone.
And when the resolution is very high, like 2 and 1/2 Angstrom,
then you could really actually do
a pretty good job of completely automatically building it.
ModelAngelo currently downsamples everything
to a Nyquist of 3 Angstroms, and because we
had some issues with the nucleotides,
2 Angstroms works much better for the nucleotides,
so that's what we're going to be using for ModelAngelo 1.0.
This actually also slightly improves
performance on proteins as well.
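For reference, the Nyquist limit is twice the pixel size, so the two settings mentioned correspond to different voxel spacings. The short sketch below works through that arithmetic. The function names are illustrative, and the voxel-count ratio is a general consequence of finer sampling, not a figure from the talk.

```python
def pixel_size_for_nyquist(nyquist_angstrom):
    """Pixel size (in Angstrom) whose Nyquist limit equals the given resolution."""
    return nyquist_angstrom / 2.0


def voxel_count_ratio(old_nyquist, new_nyquist):
    """How many times more voxels the same volume needs at the finer sampling."""
    return (old_nyquist / new_nyquist) ** 3


print(pixel_size_for_nyquist(3.0))  # pixel size for a 3 Angstrom Nyquist
print(pixel_size_for_nyquist(2.0))  # pixel size for a 2 Angstrom Nyquist
print(voxel_count_ratio(3.0, 2.0))  # relative number of voxels after the change
```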
And we will have an integrated HMMER
search, which is going to be very useful,
for ModelAngelo 1.0.
So instead of having to install hhblits
and dealing with all that, we're going
to have an integrated utility that
also understands how ModelAngelo HMM profile directories work.
So you can just finish your build_no_seq job,
and then just do an HMM search step right from there,
and it's going to understand everything and parse
everything correctly.
And it's going to make your job of just searching
a genome much easier.
So that's going to be in ModelAngelo 1.0 as well.
And it will be quite a bit faster,
the whole thing, because of some optimizations that we've done.
And yeah, so please try it out if you haven't already.
There's instructions available in the GitHub repo.
And also, if you just type ModelAngelo build and then
help, it will give you all the options that
are available for each group.
And yeah, so I'd like to acknowledge my supervisor Sjors
Scheres, Dari, who's the co-author of this paper,
other members of my team, and also Lukas Käll,
who helped with HMM.
And I can take some questions now.
Great, thank you very much, Kiarash.
For questions, you can raise your hand
and give the little clap symbol here.
But if you go to reactions, and then click Raise Your Hand
big button at the bottom, hands will pop up,
and it will bump you up to the top of the list.
And then I can unmute you, and you
can ask your question directly.
You can also send questions in the chat window.
And we can just pass those on.
Just a thought, as you were wrapping up there,
I can tell you that when I installed ModelAngelo,
I installed it, it's available in SBGrid,
I think 0.2.2 is there, I picked
a sort of arbitrarily hard map and just ran it
as my test case.
The very first time it ran through beautifully.
It built a very nice little model.
And it ran great.
And I can't say that for many other software packages,
so that's--
Thank you.
That's great.
One question that came in by chat,
we observed that the chain was built, but in fragments.
We tried after changing the map orientation,
but still observed the same results.
Can you suggest something regarding the same?
The map local resolution is between 3 and 5, much
of it in the 3.2 range.
Right, so yeah, if it's fragmented,
you could take a look at the raw file.
So if that one isn't fragmented, that's probably
what's happening is in the postprocessing script,
it's done a search, so it can see some portions
of the sequence that it's confident in
and other portions where it's probably
getting less confident, so it's going to prune those.
So there's not much to do other than just take the chain
and especially if you think that the sequence exists,
not much to do there, just kind of fix it manually.
It's probably just the local resolution
at those specific points is not good enough for ModelAngelo
or something like that.
All right, we've got a hand up.
Chao Kwan, you can unmute and ask your question.
Thank you.
That's good talk, so yeah, I'm actually working on a case
where it's difficult to take a model for the map in Chimera.
So I look at the map.
I find the map quality is pretty--
I mean, resolution is good by face value,
but the protein probably has a preferred orientation.
OK, right.
So I'm wondering, in this case, do you
think that AI software can help?
Like--
So what is the resolution, display resolution number?
It's 3.2.
Again, it's just the face value.
So the quality is actually--
You can try.
So what it's probably going to do is, for a portion of it,
it might be able to build it.
OK.
But yeah, if you already think it's not very good,
It's, yeah,
So basically, ModelAngelo, when a human can build it,
we can build it faster, that's the idea, and easier for you.
But then if you are having trouble,
it might be that ModelAngelo can't do it either,
but you should try it, and let me know.
If it does build it, that's great.
Yeah, yeah, I sure do hope so.
I mean, I'm thinking the first step here is localization
of the C alphas, right?
Yes.
If we haven't localized the C alphas, then it should help.
In theory, it should help localize
this secondary structure feature.
So that's going to be tough.
Yeah.
That's my style.
Yeah.
Yeah.
So if it's able to do that.
And so we put the intermediate files there for a reason.
So if this is a more advanced whatever, but once it runs,
you can look in the C alpha output folder,
and then there's a cif file with the C alphas in there.
and just look--
that might even help you even if ModelAngelo itself isn't
able to do anything afterwards, it
might help you see where things are, so.
Yeah, yeah, cool.
Thank you.
Thank you.
Appreciate that.
We had a question here from--
I noticed that building into maps in which symmetry has been
imposed, ModelAngelo builds chains that are not identical,
despite, let's say, the imposed symmetry, so can symmetry
be accounted for without user map segmentation?
So basically, no.
ModelAngelo will basically build everything independently almost
assuming there's no symmetry whatsoever.
And you will actually see specifically regions.
And this comes up more in when you want to do searches
against a genome.
But it's the exact same symmetry I
understand, but for some reason, ModelAngelo--
and the reason is, it's a bit technical,
but it will have better chains in one symmetry section and not
the other.
And I don't know, just you can duplicate it or something,
but, yeah.
Choose the best chain and apply symmetry.
Exactly, yeah.
Yeah.
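Applying symmetry to the best chain is a coordinate transformation. The sketch below generates the symmetry copies for the simplest case. It assumes cyclic (Cn) symmetry with the axis parallel to z and passing through a known point in the xy plane. Other point groups, or a different axis, would need the appropriate rotation matrices instead.

```python
import math


def cyclic_copies(coords, n, center=(0.0, 0.0)):
    """Return n copies of a chain, each rotated by a multiple of 360/n degrees about z.

    coords: list of (x, y, z) atom positions for the best-built chain.
    center: (x, y) position of the symmetry axis (assumed parallel to z).
    """
    cx, cy = center
    copies = []
    for k in range(n):
        angle = 2.0 * math.pi * k / n
        c, s = math.cos(angle), math.sin(angle)
        copies.append([
            (cx + c * (x - cx) - s * (y - cy),
             cy + s * (x - cx) + c * (y - cy),
             z)
            for x, y, z in coords
        ])
    return copies


# Toy example: a single atom, C4 symmetry about the origin.
for copy in cyclic_copies([(1.0, 0.0, 5.0)], 4):
    print([tuple(round(v, 3) for v in atom) for atom in copy])
```

The first copy is the original chain itself, and the remaining copies replace the less well-built symmetry mates.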
One other question, is there a size limit for ModelAngelo
for building a model?
No.
Yeah, we've built things now that are like 160,000 residues.
It takes longer, but it should be fine, yeah.
And I think you touched on this here.
Your resolution, you're aiming for about 3 and 1/2 you said
is sort of ideal for the best quality model.
So--
and better.
Yeah, say that again, sorry.
And better.
And better, OK, so--
Yeah.
It drops off pretty quickly after 3 and 1/2.
Yeah, yeah, yeah.
It's like, mainly the side chains that go first, right?
So it's-- yeah.
Yeah, so we had a question, which resolution of the map
can get a good model?
So 3 and 1/2 and then you're--
Yeah, 3 and 1/2.
--living dangerously.
Yeah, I think, yeah.
Pete had a question.
Go ahead, Pete.
Yeah, so great talk.
And I've actually got two questions.
And they're both a little bit speculative kind of ones.
But for the first one, so do you have
a sense for how sensitive it is for the density type?
If you wanted to use this for a low resolution X-ray map
or if you had a low resolution MicroED map,
would you have to retrain that portion of the model?
Or do you think things would be close enough?
So I've been asked this before.
And I know that there are other machine learning approaches
that work with densities, where apparently it
doesn't need to be retrained.
But I personally have never tested
ModelAngelo on anything else.
So I really couldn't say.
My intuition says it should be retrained, but I don't know.
Not knowing is a perfectly valid answer, and it's speculative.
So I guess the other one might be
a little bit more speculative.
But for the unknown sequence case,
have you thought about using--
if you've got a three-dimensional structure
that you don't know the sequence of,
would doing a search with that three-dimensional structure
rather than sequence probability,
is that something that could be useful?
Yeah, that could be useful.
So like Foldseek, for example, does something like that.
But basically, we've seen much, much better specificity
when we use the sequences.
Because if you think like, ModelAngelo has--
it's predicting two things.
It's predicting the structure, but it's also
predicting sequence probabilities.
And there's a lot more information
in the probabilities there.
Because structure is conserved much more than sequence.
So if you want to find--
Cool, thank you.
Any other questions, anyone?
Feel free to raise your hand or send it by chat.
All right, good.
I have a question.
So you're coming from a mathematics background.
You're tackling different problems in structural biology.
This is a challenging one, and this is a pretty productive
approach, at least in my hands.
What are your targets next?
Are you improving this?
You're going to nucleotides.
Yes.
That's interesting.
So, yeah, we're going to be improving this for now.
And I'm somewhat involved with other things
for cryo-EM processing itself.
But after that, who knows?
There's some other projects, but they're pretty preliminary now,
so.
I'll let you know.
All right, cool.
Yeah, that's great.
I just, I see there's so much rapid development happening
in this space.
So it seems like there's so many great ideas out there.
Just a good time to be developing
methods in structural biology.
And using them, hopefully.
Yeah, yeah, all right, well, with that, we can wrap up.
[MUSIC PLAYING]