ModelAngelo

SBGrid Consortium
18 Apr 2023, 38:10

Summary

TLDR: This video features a presentation by Kiarash Jamali from the MRC Laboratory of Molecular Biology in Cambridge, UK, on ModelAngelo, an automated atomic model building program for cryo-electron microscopy (cryo-EM) maps. After host Jason Key opens with a note on Pi Day as a celebration of mathematics and science, Jamali outlines the program's three-step process: C alpha atom prediction, full atomic model building using sequence information, and post-processing to refine the model. He highlights the graph neural network and the attention algorithm at its core, the same mechanism used in machine learning models such as ChatGPT and AlphaFold. Jamali also covers building models without prior sequence knowledge and identifying unknown sequences from high-resolution maps. The talk concludes with a Q&A session covering common issues, planned improvements to ModelAngelo (including nucleotide modeling and an integrated HMMER search), and the presenter's future targets in structural biology.

Takeaways

  • 🎓 **Pi Day Celebration**: The video begins with a mention of Pi Day (March 14), which is a celebration of mathematics and science, significant to the presenter's family.
  • 🧬 **Introduction to Kiarash Jamali**: Kiarash Jamali, from the MRC Laboratory of Molecular Biology in Cambridge, UK, discusses ModelAngelo, a program for automated atomic model building for cryo-EM maps.
  • 🛠️ **ModelAngelo Overview**: ModelAngelo is a three-step automated process for building atomic models from cryo-EM maps, involving C alpha atom predictions, full atomic model building, and post-processing.
  • 🧠 **Machine Learning Approach**: ModelAngelo uses a graph neural network built on the attention algorithm, a technique used across machine learning, to combine cryo-EM map data, sequence information, and initial C alpha positions.
  • 📈 **C Alpha Prediction**: The process starts with predicting C alpha atoms within a 1.5 Å cube of the cryo-EM map, transforming it into a machine learning segmentation problem.
  • 🔬 **Integration of Data**: The graph neural network in ModelAngelo is designed to integrate cryo-EM map data, sequence data, and initial C alpha positions into a comprehensive model.
  • 🧬 **Sequence Module**: A key component is the sequence module, which uses a protein language model to analyze the relationship between the C alpha representation and the input amino acid sequence.
  • 🧬 **Spatial Invariant Point Attention Module**: This module is similar to the invariant point attention used in AlphaFold and integrates geometric information about nearby residues, such as whether they form alpha helices or beta sheets.
  • 🔍 **Post-Processing**: Post-processing involves turning probabilities for amino acid types into an actual atomic model, using a hidden Markov model to refine sequence predictions.
  • 🧬 **Sequence Search**: ModelAngelo can perform sequence searches against a genome using HMM profiles, which can be useful for identifying unknown sequences in a cryo-EM map.
  • ⚙️ **ModelAngelo Usage**: The command-line interface for ModelAngelo is designed to be simple, with options to build with or without sequence information, and detailed instructions available on GitHub.

Q & A

  • What is the significance of Pi Day in the context of this video?

    -Pi Day, celebrated on March 14, is a holiday that is significant to this video because it is a celebration of mathematics and science in general. The presenter, Jason Key, mentions that it is a big holiday in his family and thematically appropriate for introducing Kiarash Jamali, who has a background in mathematics and computational approaches to structural biology.

  • What is ModelAngelo and what does it automate?

    -ModelAngelo is an automated, atomic model building program for cryo-EM maps. It is designed to automate the process of building atomic models into cryo-EM maps in an end-to-end fashion, which traditionally is done manually on a residue-by-residue basis.

  • How does ModelAngelo utilize the attention algorithm in its process?

    -ModelAngelo uses the attention algorithm as a differentiable search procedure within its graph neural network. This allows the system to extract information from a library of vectors representing different pieces of input data, such as the cryo-EM map, sequence information, and initial C alpha positions. The attention algorithm helps integrate this data to build a full atomic model.

  • What is the role of the sequence module in ModelAngelo?

    -The sequence module in ModelAngelo processes the amino acid sequence for all the sequences expected to be seen in the map. It uses a protein language model to convert the text sequence into a series of vectors. These vectors are then used to calculate similarity to the sequence and update the representation of each residue with the added sequence information.

  • How does ModelAngelo handle the prediction of amino acid types for each residue?

    -ModelAngelo predicts the probabilities for each amino acid type for every residue. Instead of choosing the highest scoring amino acid, it uses these probabilities to inform the selection of the correct sequence, often turning this into a hidden Markov model to align with the user-provided sequences.

  • What is the purpose of the post-processing step in ModelAngelo?

    -The post-processing step in ModelAngelo is used to fix issues and refine the atomic model. It involves taking the probabilities for amino acid types and using them to determine the actual amino acid in each residue, correcting any issues with the initial C alpha positions, and refining the orientation of the atomic model.

  • How does ModelAngelo deal with unknown sequences in the map?

    -ModelAngelo can be run without a sequence module, allowing it to build models without prior knowledge of the sequences. It can then use the built model to search against a genome or a database of known sequences to identify the unknown sequences present in the cryo-EM map.

  • What are the common issues faced when using ModelAngelo and how can they be diagnosed?

    -Common issues include the map being in the wrong handedness, which can be diagnosed by checking the confidence values predicted by ModelAngelo or by flipping the hand of the map and re-running the model. Other issues include low local resolution, which can be addressed by ensuring the sequence is correctly provided in the fasta file, and low global resolution, which might require higher resolution maps for better results.

  • What are the improvements expected in the upcoming ModelAngelo 1.0 version?

    -The upcoming ModelAngelo 1.0 version will include the ability to build nucleotides, improved performance with an integrated HMMER search, optimizations for faster processing, and the use of a Nyquist of 2 Angstroms for better results with nucleotides.

  • What is the ideal resolution range for ModelAngelo to build high-quality models?

    -The ideal resolution range for ModelAngelo to build high-quality models is 3.5 Angstroms and better. The performance starts to drop off quickly after 3.5 Angstroms.

  • How does ModelAngelo handle symmetry in cryo-EM maps?

    -ModelAngelo builds everything independently, assuming no symmetry. If symmetry is present in the map, the user may need to manually apply symmetry to the built chains, choosing the best chain and duplicating it accordingly.

  • What are the next steps or targets for Kiarash Jamali after working on ModelAngelo?

    -Kiarash Jamali plans to continue improving ModelAngelo, particularly with the addition of nucleotide building capabilities. He is also involved with other projects in cryo-EM processing, with further plans to be disclosed as the projects develop.

Outlines

00:00

🎓 Introduction to ModelAngelo and Cryo-EM

The video begins with a greeting from Jason Key on Pi Day, emphasizing the significance of mathematics and science. He introduces Kiarash Jamali, a mathematician and computational biologist from the MRC Laboratory of Molecular Biology in Cambridge, UK. Jamali discusses ModelAngelo, an automated program for building atomic models from cryo-electron microscopy (cryo-EM) maps. The process involves three main steps: C alpha atom predictions, integrating sequence information to build a full atomic model, and post-processing to refine the model. The video also explains the importance of the cryo-EM map and the challenges of de novo residue-by-residue modeling.

05:01

🧠 The Attention Algorithm and ModelAngelo's Graph Network

Jamali explains the attention algorithm, a differentiable search procedure used in machine learning, which is central to ModelAngelo's operation. The algorithm helps the program identify connections between atoms by comparing vectors derived from the cryo-EM map and the amino acid sequence. ModelAngelo uses a graph neural network designed to combine cryo-EM map data, sequence information, and initial C alpha positions. The network employs convolutional neural networks (CNNs) to process density data and attention mechanisms to integrate this information, leading to an updated representation used to build the atomic model.
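A minimal sketch of the graph set-up described above, assuming NumPy and two hypothetical helpers (`knn_edges`, `density_cube`). The 20-neighbor count comes from the talk; the cube width and the axis-aligned cropping are simplifying assumptions, since the talk describes cubes oriented along the backbone.

```python
import numpy as np

def knn_edges(ca_coords, k=20):
    """For each C-alpha position, return the indices of its k nearest neighbors."""
    coords = np.asarray(ca_coords, dtype=float)
    k = min(k, len(coords) - 1)
    # Pairwise Euclidean distances between all C-alpha positions
    dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)  # a residue is not its own neighbor
    return np.argsort(dists, axis=1)[:, :k]

def density_cube(volume, center_voxel, half_width=3):
    """Crop an axis-aligned cube of density around a voxel index (zero-padded at edges)."""
    padded = np.pad(volume, half_width)
    c = np.asarray(center_voxel) + half_width
    window = tuple(slice(ci - half_width, ci + half_width + 1) for ci in c)
    return padded[window]

rng = np.random.default_rng(0)
volume = rng.random((16, 16, 16))           # stand-in for a cryo-EM map
ca_coords = rng.uniform(0, 24, size=(30, 3))  # 30 hypothetical C-alpha positions

edges = knn_edges(ca_coords, k=20)
cube = density_cube(volume, center_voxel=(0, 8, 15))
print(edges.shape, cube.shape)
```

In the real program each cube and each neighbor "rectangle" would then be passed through a CNN to produce the vectors that attention compares; that step is omitted here.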

10:02

📐 Model Building Process and Post-Processing

The video outlines how the updated representations from the graph network are used to build the atomic model. It details the backbone frame module, which predicts shifts for the backbone atoms to refine the model's orientation. The program also predicts side chain orientations, amino acid probabilities, and a confidence measure based on the expected backbone RMSD. Post-processing involves converting probabilities into an actual atomic model, using a hidden Markov model to align against provided sequences and refine amino acid predictions.

15:03

🧬 Sequence-Informed and De Novo Model Building

Jamali demonstrates how ModelAngelo can build models both with and without user-provided sequences. He shows an example of building a model with the sequence and discusses the output directory structure, which includes folders for C alpha outputs and graph neural network (GNN) outputs. The video also explores de novo model building without sequences, highlighting the program's ability to predict and identify sequences even when they are not initially provided, using high-resolution cryo-EM maps.

20:04

🛠️ ModelAngelo's Command Line Interface and Common Issues

The video provides instructions on how to use ModelAngelo through its command line interface, detailing the process for building models with and without sequences. It explains the output directory structure and the contents of the output files. Jamali addresses common issues such as map handedness, local resolution problems, and the challenges of building models with low global resolution. He also suggests using hhblits for sequence searches and provides insights into troubleshooting and improving results.

25:05

🔍 Addressing Fragmented Chains and Future Directions

Jamali responds to a question about building fragmented chains, suggesting that local resolution might be the issue and manual fixes might be necessary. He discusses the potential of AI software in addressing model building challenges and the possibility of using ModelAngelo's intermediate files for localization. The video concludes with a Q&A session where Jamali addresses questions about the sensitivity of ModelAngelo to different density types and the possibility of using three-dimensional structures for unknown sequence cases. He also shares his future targets, including improvements to ModelAngelo and other projects in cryo-EM processing.

30:06

🚀 Upcoming Features in ModelAngelo 1.0

The video script mentions the upcoming release of ModelAngelo 1.0, which will include significant improvements. New features will enable the building of nucleotides, which are more challenging due to their complexity. The program will also integrate an HMMER search, streamlining the process of searching a genome. Additionally, ModelAngelo 1.0 will be faster and work better with higher resolutions, particularly at 2.5 Angstroms for nucleotides and 3.5 Angstroms and better for proteins.

Keywords

💡Cryo-EM

Cryo-electron microscopy (cryo-EM) is a technique used to study the structure of macromolecules and cells by imaging them in a frozen state. It is central to the video's theme as the software discussed, ModelAngelo, is designed to build atomic models from cryo-EM density maps. The script mentions the use of cryo-EM maps for ModelAngelo to predict C alpha atoms and build full atomic models, which is crucial for understanding the structural biology problems addressed in the video.

💡ModelAngelo

ModelAngelo is an automated, atomic model-building program for cryo-EM maps. It is the main subject of the video, with the speaker, Kiarash Jamali, discussing its features and applications. ModelAngelo is significant as it represents an advancement in structural biology, aiming to automate the process of building atomic models from cryo-EM data, which traditionally was a manual and time-consuming task.

💡C Alpha Predictions

C alpha predictions refer to the process of identifying the position of the C alpha atoms in a protein from a cryo-EM map. This is the first step in ModelAngelo's three-step process for building atomic models. The script explains that C alpha atoms are placed within the cryo-EM volume, and their existence is determined through a machine learning segmentation problem, which is vital for the initial stages of atomic model building.
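As a rough illustration of how the segmentation targets could be set up, the sketch below marks which 1.5 Å voxels contain a C alpha atom. The function name `ca_occupancy_grid`, the grid origin, and the example coordinates are assumptions made for illustration; only the 1.5 Å cube size comes from the talk.

```python
import numpy as np

VOXEL_SIZE = 1.5  # Angstroms, the cube size quoted in the talk

def ca_occupancy_grid(ca_coords, grid_shape, origin=(0.0, 0.0, 0.0),
                      voxel_size=VOXEL_SIZE):
    """Return a 0/1 grid marking voxels that contain at least one C-alpha atom."""
    coords = np.asarray(ca_coords, dtype=float)
    idx = np.floor((coords - np.asarray(origin)) / voxel_size).astype(int)
    grid = np.zeros(grid_shape, dtype=np.uint8)
    # Ignore atoms that fall outside the grid
    inside = np.all((idx >= 0) & (idx < np.asarray(grid_shape)), axis=1)
    i, j, k = idx[inside].T
    grid[i, j, k] = 1
    return grid

# Three hypothetical C-alpha positions in Angstroms; the last lies outside the grid
ca = [(0.2, 0.3, 0.1), (3.8, 0.0, 0.0), (100.0, 0.0, 0.0)]
labels = ca_occupancy_grid(ca, grid_shape=(4, 4, 4))
print(int(labels.sum()))
```

A segmentation network would be trained to output a probability for each voxel of such a grid, given the map as input.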

💡Attention Algorithm

The attention algorithm is a machine learning technique that is used within ModelAngelo to combine different pieces of information, such as the cryo-EM map, sequence information, and initial C alpha positions. It is described as a differentiable search procedure that helps the model understand how new objects relate to a library of known vectors. The attention algorithm is a key component of ModelAngelo's graph neural network and is also used in other machine learning models like ChatGPT and AlphaFold.
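The "differentiable search" idea can be sketched in a few lines of NumPy: take dot products between a query vector and a library of key vectors, turn the similarities into weights, and use them to average a second representation of the library (the values). The scaling by the square root of the dimension and the softmax follow the standard Transformer formulation, which is an assumption here; the talk only describes the dot product and the weighted combination.

```python
import numpy as np

def attention(query, keys, values):
    """Dot-product attention for a single query vector.

    query:  (d,)     vector for the incoming object
    keys:   (n, d)   library vectors used for the similarity search
    values: (n, d_v) a different representation of the same library
    """
    scores = keys @ query / np.sqrt(query.shape[0])  # similarity to each library item
    weights = np.exp(scores - scores.max())          # numerically stable softmax
    weights /= weights.sum()
    return weights @ values, weights                 # weighted average of the values

rng = np.random.default_rng(0)
query = rng.normal(size=8)
keys = rng.normal(size=(5, 8))
values = rng.normal(size=(5, 16))

updated, weights = attention(query, keys, values)
print(updated.shape, float(weights.sum()))
```

Because every step is differentiable, the vectors themselves can be learned by gradient descent, which is what distinguishes this from a hard nearest-neighbor lookup.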

💡Hidden Markov Model (HMM)

A Hidden Markov Model (HMM) is a statistical model used for pattern recognition and is employed in ModelAngelo for post-processing steps. In the context of the video, HMM is used to refine amino acid predictions by aligning the probabilities predicted by ModelAngelo with the sequences provided by the user. This helps in ensuring that the final atomic model aligns with the known sequences, which is essential for accurate model building.
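To show why the probabilities matter more than the top-scoring amino acid, here is a toy Viterbi decoder. It is not ModelAngelo's actual HMM: the state layout (one hidden state per position in the user's sequence), the transition probabilities `p_step` and `p_skip`, and the example data are all assumptions chosen to keep the sketch small.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def viterbi_align(probs, sequence, p_step=0.9, p_skip=0.1):
    """Map each built residue to a position in `sequence` by Viterbi decoding.

    probs: (n_residues, 20) predicted amino-acid probabilities.
    Hidden state j means "this residue is sequence[j]". The next residue is
    either the next sequence position (p_step) or skips one position (p_skip).
    """
    probs = np.asarray(probs, dtype=float)
    n, m = probs.shape[0], len(sequence)
    cols = [AMINO_ACIDS.index(aa) for aa in sequence]
    log_emit = np.log(probs[:, cols] + 1e-12)      # shape (n, m)
    score = np.full((n, m), -np.inf)
    back = np.zeros((n, m), dtype=int)
    score[0] = log_emit[0] - np.log(m)             # uniform start distribution
    for t in range(1, n):
        for j in range(m):
            candidates = []
            if j >= 1:
                candidates.append((score[t - 1, j - 1] + np.log(p_step), j - 1))
            if j >= 2:
                candidates.append((score[t - 1, j - 2] + np.log(p_skip), j - 2))
            if candidates:
                best, prev = max(candidates)
                score[t, j] = best + log_emit[t, j]
                back[t, j] = prev
    # Trace the best path backwards from the best final state
    path = [int(np.argmax(score[-1]))]
    for t in range(n - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

def peaked_row(aa, p=0.6):
    """Probability row with mass p on one amino acid, the rest spread evenly."""
    row = np.full(20, (1 - p) / 19)
    row[AMINO_ACIDS.index(aa)] = p
    return row

user_sequence = "MKTAYIAK"
probs = np.stack([peaked_row(a) for a in "TAYI"])
# Make the third residue ambiguous: F scores slightly higher than the true Y
probs[2] = 0.1 / 18
probs[2, AMINO_ACIDS.index("F")] = 0.5
probs[2, AMINO_ACIDS.index("Y")] = 0.4

greedy = "".join(AMINO_ACIDS[i] for i in probs.argmax(axis=1))
path = viterbi_align(probs, user_sequence)
aligned = "".join(user_sequence[j] for j in path)
print(greedy, aligned, path)
```

Taking the highest-scoring amino acid per residue yields a stretch that does not occur in the user's sequence, while decoding against the sequence recovers the matching segment.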

💡Resolution

In the context of the video, resolution refers to the level of detail that can be discerned in the cryo-EM map, which is crucial for the accuracy of the atomic model built by ModelAngelo. The speaker mentions that higher resolution maps (3.5 Angstrom and better) yield better results, while lower resolution maps can lead to less confident or inaccurate models. Resolution is a critical factor in determining the success of automated model building with ModelAngelo.

💡Amino Acid Sequence

The amino acid sequence is the specific order of amino acids in a protein. In the video, the sequence is used as an input for ModelAngelo to build the full atomic model. The sequence information is integrated with the cryo-EM map and C alpha positions to predict the full atomic structure of the protein. The correct incorporation of the amino acid sequence is essential for the accuracy of the final model, as it ensures that the model aligns with the known biological information.

💡Post-Processing

Post-processing in the context of ModelAngelo refers to the steps taken after the initial model building to refine and correct the atomic model. This includes using a Hidden Markov Model to align predicted amino acid types with known sequences and adjusting the model based on the confidence measures predicted by ModelAngelo. Post-processing is a critical step to ensure the accuracy and reliability of the final atomic model.

💡Backbone Frame Module

The backbone frame module is a component within ModelAngelo's graph neural network that predicts the shift vectors for the atoms in the protein backbone, allowing the model to update and refine the initial positions of the C alpha, nitrogen, and carbon atoms. This module is essential for achieving a more accurate representation of the protein's structure, as it adjusts the orientation and position of the protein's backbone based on the updated representations from the network.
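The iterative refinement can be pictured with a toy example: a backbone triangle (N, C alpha, C) starts at a random rotation with the C alpha about 1 Å off, and each layer applies per-atom shift vectors. The function `predicted_shift` below is a stand-in that moves atoms halfway toward a known target; in the real program the shifts come from the network, and the ideal local coordinates used here are approximate values assumed for illustration.

```python
import numpy as np

# Approximate ideal backbone geometry in a local frame (Angstroms): N, CA, C
IDEAL = np.array([[-0.525, 1.363, 0.0],
                  [0.0, 0.0, 0.0],
                  [1.526, 0.0, 0.0]])

def random_rotation(rng):
    """Draw a random 3D rotation matrix via QR decomposition."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q *= np.sign(np.diag(r))     # fix column signs so the factorisation is unique
    if np.linalg.det(q) < 0:     # ensure a proper rotation (determinant +1)
        q[:, 0] *= -1
    return q

def predicted_shift(current, target, step=0.5):
    """Stand-in for the network output: move each atom part of the way to the target."""
    return step * (target - current)

def rmsd(a, b):
    return float(np.sqrt(np.mean(np.sum((a - b) ** 2, axis=1))))

rng = np.random.default_rng(0)
true_frame = IDEAL + np.array([10.0, 5.0, 2.0])
# Random orientation, with the C-alpha placed roughly 1 Angstrom from the truth
start = IDEAL @ random_rotation(rng).T + np.array([10.6, 5.6, 2.6])

frame = start.copy()
for layer in range(8):           # one update per layer of the graph network
    frame = frame + predicted_shift(frame, true_frame)

rmsd_start, rmsd_end = rmsd(start, true_frame), rmsd(frame, true_frame)
print(rmsd_start, rmsd_end)
```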

💡Torsion Angle Prediction

Torsion angle prediction is a feature of ModelAngelo that predicts the angles of rotation around the bonds of the protein's backbone and side chains. These predictions are crucial for determining the three-dimensional structure of the protein. The video mentions that ModelAngelo provides torsion angle predictions for the side chains, which, when combined with the confidence measures, can give a good indication of the protein's structure before further refinement.

💡Symmetry

Symmetry in the context of the video refers to the repeating patterns or identical subunits in a protein's structure. ModelAngelo builds atomic models assuming no symmetry, treating each section of the protein independently. The script discusses the implications of this approach, noting that if a protein's structure has symmetry, the best chain from the model can be selected and then symmetry can be applied manually. This is an important consideration when dealing with proteins that have symmetrical arrangements.
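Applying symmetry by hand amounts to copying the chosen chain under each symmetry operation. The sketch below does this for a cyclic (C_n) symmetry about the z axis; the function name `apply_cn_symmetry`, the choice of axis, and the example coordinates are assumptions, and in practice this step would normally be done in a model-building or visualization program.

```python
import numpy as np

def apply_cn_symmetry(coords, n, center=(0.0, 0.0, 0.0)):
    """Return n copies of `coords` related by n-fold rotation about the z axis.

    coords: (n_atoms, 3) coordinates of the chosen chain.
    The result has shape (n, n_atoms, 3); copy 0 is the original chain.
    """
    coords = np.asarray(coords, dtype=float)
    center = np.asarray(center, dtype=float)
    copies = []
    for k in range(n):
        angle = 2 * np.pi * k / n
        rot = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                        [np.sin(angle),  np.cos(angle), 0.0],
                        [0.0,            0.0,           1.0]])
        copies.append((coords - center) @ rot.T + center)
    return np.stack(copies)

best_chain = np.array([[1.0, 0.0, 0.0],
                       [2.0, 0.0, 1.0]])   # two hypothetical atoms
tetramer = apply_cn_symmetry(best_chain, n=4)
print(tetramer.shape)
```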

Highlights

Kiarash Jamali introduces ModelAngelo, an automated atomic model building program for cryo-EM maps.

ModelAngelo is designed to automate the process of building atomic models into cryo-EM maps in an end-to-end fashion.

The program uses a three-step process involving C alpha atom predictions, sequence information integration, and post-processing.

A graph neural network is employed to combine cryo-EM map, sequence, and initial C alpha positions into a full atomic model.

The attention algorithm, used in ModelAngelo, is a differentiable search procedure aiding in extracting information from the library of vectors.

ModelAngelo utilizes convolutional neural networks for interpreting the density around C alpha atoms and predicting peptide bonds.

The sequence module uses a protein language model to integrate amino acid sequence information into the atomic model building process.

A spatial invariant point attention module is used to integrate geometric information about nearby residues.

ModelAngelo predicts confidence measures for the model based on the expected backbone RMSD, aiding in identifying areas of lower confidence.

Post-processing involves turning probabilities into an actual atomic model and using a hidden Markov model for sequence alignment.

ModelAngelo can build models without user-provided sequences, relying on the highest probability amino acids.

The program can identify and build models for unknown sequences in high-resolution cryo-EM maps.

ModelAngelo's command line interface is designed to be simple, with options to build with or without sequence information.

The program provides intermediate and final output files, with the final output being a CIF file for the atomic model.

Common issues with ModelAngelo include wrong map handedness, local resolution problems, and the need for manual sequence correction in some cases.

Upcoming ModelAngelo 1.0 will include improvements such as nucleotide modeling, integrated HMMER search, and performance optimizations.

ModelAngelo has been successfully used to build large models with up to 160,000 residues.

The ideal resolution for ModelAngelo to build high-quality models is 3.5 Angstroms or better.

The program is not limited by the size of the model and can handle large biomolecular complexes.

Transcripts

play00:00

[COMPUTER SOUNDS]

play00:01

Welcome to the SBGrid YouTube channel

play00:03

software tutorials by developers,

play00:06

lectures by structural biologists, unique content

play00:09

brought to you by SBGrid.

play00:17

[MUSIC PLAYING]

play00:24

Hello, everybody.

play00:25

Thank you for joining.

play00:26

This is Jason Key at SBGrid.

play00:28

Happy Pi Day.

play00:29

Today is March 14, Pi Day.

play00:33

So that's a celebration of mathematics,

play00:36

and I take it as a celebration of science in general.

play00:40

So Pi Day is a big holiday in our family.

play00:43

We always make pie.

play00:45

So with that, it is my pleasure to introduce Kiarash Jamali,

play00:51

joining us from the MRC in Sjors Scheres Group.

play00:54

Kiarash, thematically appropriate for Pi Day,

play00:58

has a background in mathematics and has

play01:01

taken on computational approaches

play01:03

to structural biology problems.

play01:05

And so I guess the first application borne

play01:08

of those efforts that I've seen is ModelAngelo, which he's

play01:11

going to tell us about today.

play01:13

So Kiarash, go right ahead.

play01:16

Thanks, so yes, I'm Kiarash Jamali

play01:18

in the Scheres Group at the MRC Laboratory of Molecular Biology

play01:21

in Cambridge, United Kingdom.

play01:24

And today, I'm going to be talking about ModelAngelo,

play01:27

which is an automated atomic model building

play01:30

program for cryo-EM maps.

play01:33

So I'll get started.

play01:37

Here we go.

play01:38

So this is a cryo-EM map.

play01:40

And we want to build this atomic model into it.

play01:46

And the way this is normally done

play01:50

is, if you want to do it de novo, you basically kind of just

play01:55

do it residue by residue, manually.

play01:57

And so we wanted to automate this process

play02:01

in an end-to-end fashion.

play02:02

And so what we came up with is ModelAngelo,

play02:06

which is a three-step process to build atomic models.

play02:11

And the three steps are basically you first

play02:14

start with C alpha atom predictions, which

play02:20

is to place each residue separately,

play02:24

just from the cryo-EM map.

play02:26

And then you take the sequence information as well.

play02:29

And you try to build the full atomic model

play02:33

from those initial points as well as the cryo-EM map.

play02:38

And then afterwards, we have post-processing

play02:40

to fix some issues.

play02:44

So for the C alpha part, which I won't really

play02:48

be talking too much about today, is you have your C alpha atoms.

play02:53

They're in this cryo-EM volume.

play02:56

And we've voxelized it.

play02:59

So we basically, say, ask the question within a 1 and 1/2

play03:02

Angstrom cube of the cryo-EM map,

play03:07

is there or is there not a C alpha atom existing there.

play03:11

And then this can become a machine

play03:14

learning segmentation problem.

play03:16

So basically, you have a cryo-EM map that goes as an input,

play03:19

and the output is just a bunch of probabilities for each voxel

play03:24

whether or not it includes a C alpha atom.

play03:27

And so here's your map.

play03:30

And this is real data.

play03:32

So that's an actual map that we give online.

play03:35

And on the right, for each orange dot,

play03:37

is a C alpha prediction it has made.

play03:40

But let's assume that this part we have already

play03:44

trained the network.

play03:45

And we want to move on to how to build

play03:47

full atomic model from this.

play03:49

And the way that works is basically

play03:52

the rest of what ModelAngelo is, which

play03:56

is this graph neural network.

play03:57

And we had to specifically design it

play04:00

so that you're able to combine all these different pieces

play04:04

of information together, which is the cryo-EM map which

play04:06

is a 3D voxel grid.

play04:10

You have the sequence, which is just text.

play04:13

The amino acid sequence for all the

play04:16

sequences that you expect to see in the map.

play04:19

And then you also have your initial C alpha positions,

play04:22

which is just a graph of positions.

play04:25

And how do you combine all this data together

play04:28

is actually difficult and has never been done before.

play04:32

So we designed this graph neural network,

play04:35

which basically, takes each module is

play04:40

responsible for extracting information

play04:42

from a specific piece of these inputs.

play04:46

And then it's all integrated together within the graph.

play04:51

And each module is based on this attention algorithm

play04:57

that has now been taking over all of machine learning.

play05:00

So if you've ever used ChatGPT, AlphaFold,

play05:04

all use this attention algorithm.

play05:08

So here it is.

play05:11

And what the attention algorithm is, is it's

play05:13

a differentiable search procedure.

play05:16

So you have some object.

play05:18

You represent it with a vector by putting it through a machine

play05:22

learning, let's say, MLP.

play05:25

And then you get a vector at the end.

play05:27

And then you also have a library of things

play05:30

that you want to compare it against.

play05:32

And so what you do here is you do basically a dot

play05:36

product of this vector to your library of vectors.

play05:40

You get a similarity.

play05:41

And then you marginalize over these similarities

play05:46

with a different representation of the library.

play05:49

And what this gives you at the end

play05:51

is for a new object that has come in,

play05:54

what kind of information can I extract

play05:56

from the library of things I've seen in order

play05:59

to get a new representation for the object

play06:02

based on how it relates to the library.

play06:04

That's a bit abstract, but I will crystallize a bit

play06:08

when I go through each of these modules.

play06:11

So for example, the cryo-EM module, what it does

play06:13

is it looks at each of those C alpha initial points

play06:16

that I talked about.

play06:18

It takes the cube of density around this.

play06:21

And you've probably heard of convolutional neural networks.

play06:24

This is how, let's say, image processing with new machine

play06:29

learning methods works and also 3D volume stuff as well.

play06:33

And here in ModelAngelo we use it

play06:34

as well for the cryo-EM images.

play06:36

So you interpolate the cube around each C alpha,

play06:41

and the cubes are also oriented based on the backbone

play06:46

orientation at the C alpha position, which

play06:51

you might be wondering, how do we even

play06:52

get the backbone orientation?

play06:54

At the very beginning, it's completely random.

play06:57

But then ModelAngelo will update this backbone representation

play07:00

as it goes along.

play07:01

So we interpolate this cube of density

play07:03

around the C alpha atom.

play07:06

And we put it through a convolutional neural network.

play07:09

We get a vector.

play07:11

We do similar for rectangles of density between the current C

play07:17

alpha atom and all its nearest neighbors, so 20

play07:21

nearest neighbors.

play07:22

We look at rectangles between them.

play07:25

And this is supposed to give it an idea of

play07:27

whether or not there's a peptide bond to that C alpha atom.

play07:32

So then this also goes through a CNN,

play07:34

and we also get another vector.

play07:36

And then the cube kind of represents

play07:42

the query, so the object that comes in.

play07:45

And all the rectangles represent the library of things.

play07:49

So the idea is to say, for this C alpha atom,

play07:53

which of the other atoms around it does it actually connect to?

play07:58

And then based on that, we can actually get the information

play08:01

from those nearby atoms to come into the current atom

play08:05

and update the representation.

play08:09

The sequence module, which is the second module that

play08:12

goes after it, then asks the question, OK, for each residue

play08:15

that we have a C alpha for, given the representation that,

play08:20

let's say, before it's been updated

play08:22

from the cryo-EM module, can we see anything between this

play08:26

and the sequence that came in?

play08:28

And so the sequence, that's just text,

play08:31

text goes through a protein language model.

play08:35

So we use the ESM language model from Facebook.

play08:39

And that turns it into a series of vectors.

play08:41

And then we do a similar search against the sequence,

play08:47

and we calculate similarity to the sequence.

play08:51

And then we bring it back.

play08:52

So this is basically now corresponds--

play08:54

before we just had cryo-EM information,

play08:56

now we also add in the sequence information

play08:59

from the amino acids that were inputted,

play09:02

and you get an updated representation.

play09:05

And lastly, we also do this module,

play09:09

this spatial invariant point attention module,

play09:11

that is very similar to the one that's in AlphaFold.

play09:14

And the idea here is now you want

play09:16

to integrate information about the geometry

play09:19

of nearby residues, not just the cryo-EM information and not

play09:24

just the sequence information but also the geometry,

play09:26

whether or not something is alpha

play09:29

helix, beta sheet, et cetera.

play09:31

And the way this works is, for each residue, you

play09:34

kind of predict points around what you should look at.

play09:38

And then if there's a nearby residue that

play09:42

is close at that point, then you get a high weight,

play09:45

high similarity from that residue

play09:46

when you bring in information again, OK.

play09:50

So in this graph neural network, once the information

play09:54

goes to each of these different modules,

play09:57

the representations keep getting updated for each residue.

play10:01

And then using these updated representations,

play10:05

we put it through this backbone frame module.

play10:07

And what this one does, for example,

play10:10

this is one of the predictions we make,

play10:13

is that for this backbone frame

play10:15

that we talked about, which is just a triangle with the C

play10:18

alpha atom, the nitrogen and carbon atoms,

play10:21

it predicts how much to shift each of these atoms.

play10:26

So in the beginning, when we did the segmentation task,

play10:30

if you remember, it was like a 1 and 1/2 Angstrom voxel.

play10:34

We just said that the C alpha atom exists in this voxel.

play10:37

But that doesn't really tell you anything about where

play10:39

in the voxel.

play10:40

So the C alpha atom itself starts off a bit wrong.

play10:45

And on average, it's like 1 Angstrom off.

play10:48

And then the nitrogen and carbon atoms, we're actually,

play10:53

we just started them off in a random rotation.

play10:55

So those are completely wrong in the beginning.

play10:57

But then every time these features are updated,

play11:01

and they go through this backbone frame module,

play11:04

we get these shift vectors, and then

play11:05

now we can actually update the orientation of this triangle.

play11:10

So each layer of the graph neural network

play11:13

updates the orientations.

play11:15

And we get closer and closer to what

play11:17

the actual true orientation of the backbone is.

play11:21

So this is one of these modules.

play11:24

But there's a few more.

play11:25

There's a torsion angle prediction for the side chains.

play11:29

There's amino acid prediction probabilities.

play11:32

And so all of those are predicted

play11:34

after each of these modules has updated the representation.

play11:39

OK, one of those things is also a confidence measure.

play11:43

This is very reminiscent of the AlphaFold confidence measure,

play11:48

but it's based on the backbone RMSD that we expect to see.

play11:54

So basically, if the cryo-EM volume at some point

play11:58

has a low local resolution, ModelAngelo kind of knows this.

play12:04

And it knows that it can't confidently place the backbone.

play12:10

So it gives you a lower predicted confidence

play12:13

for how this loop is going to look, for example.

play12:17

And that is represented in the B-factors

play12:21

in the ModelAngelo output file, which we think

play12:25

is very useful.

play12:27

So that's also predicted by the graphical network.

play12:31

So for the post-processing, once everything

play12:34

goes through this graphical network,

play12:36

we need to actually take this and turn it

play12:41

into an actual atomic model.

play12:43

So what we have is we have probabilities

play12:47

for an amino acid, which amino acid type for each residue.

play12:53

And it's not entirely clear how you

play12:56

make the statement of which amino acid it actually is.

play13:01

Because we can take the highest-scoring

play13:04

amino acid that ModelAngelo thinks

play13:05

it is, but then you're not really

play13:08

using the information that you got from the sequences

play13:11

that the user has provided.

play13:13

So the user knows what sequences exist in their map

play13:16

most of the time.

play13:17

And if you just take ModelAngelo's highest

play13:19

predictions, you might actually end up with wrong sequences.

play13:22

And so the way to fix that is to actually instead

play13:25

of taking the highest scoring sequences, which is what people

play13:29

normally do, you want to take the probabilities

play13:32

that ModelAngelo is actually predicting,

play13:36

and use those to somehow pick where the sequence is.

play13:42

And the way we do that is we basically

play13:45

turn this into a hidden Markov model.

play13:48

And so basically, each amino acid that we have,

play13:53

we have the probabilities, and then we create a hidden Markov

play13:56

model profile, and then we do a search against the sequences

play13:59

that the user has provided.

play14:02

And so there you go.

play14:03

That's a hidden Markov model.

play14:05

The transition probabilities, as of right now,

play14:08

are just average transition probabilities

play14:11

but just taken from literature.

play14:14

But actually, with the new ModelAngelo version,

play14:18

which I'll talk about later, we've made this better.

play14:22

So you take this hidden Markov model.

play14:24

You search it against the sequence file with the HMM

play14:28

Align, and then you get much closer amino acid predictions

play14:35

than you had before, so more correct ones.
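
Why using the full probabilities beats taking the top guess per residue can be seen in a small sketch. This is a deliberately simplified, gap-free version: it slides the per-residue probability profile along a user-supplied sequence and scores each window by log-likelihood. A real profile HMM also has insert and delete states with transition probabilities, which are omitted here. The profile values, the sequence, and all function names are invented for illustration.

```python
import math


def argmax_sequence(profile):
    """Take the single most probable amino acid at every position."""
    return "".join(max(column, key=column.get) for column in profile)


def window_log_likelihood(profile, window, floor=1e-3):
    """Log-probability of a candidate stretch under the profile.
    Amino acids missing from a column get a small floor probability."""
    return sum(math.log(column.get(aa, floor))
               for column, aa in zip(profile, window))


def best_match(profile, sequence):
    """Slide the profile along the sequence and return (start, window)
    for the highest-scoring ungapped placement."""
    n = len(profile)
    scored = [
        (window_log_likelihood(profile, sequence[i:i + n]), i)
        for i in range(len(sequence) - n + 1)
    ]
    _, start = max(scored)
    return start, sequence[start:start + n]


# Invented per-residue probabilities; position 3 slightly prefers the wrong type.
profile = [
    {"A": 0.70, "G": 0.20, "S": 0.10},
    {"L": 0.60, "I": 0.30, "V": 0.10},
    {"D": 0.45, "N": 0.40, "E": 0.15},
    {"K": 0.80, "R": 0.20},
]
user_sequence = "MGSALNKTTW"

if __name__ == "__main__":
    print(argmax_sequence(profile))            # top guesses only
    print(best_match(profile, user_sequence))  # placement using all probabilities
```

In this toy case the top guesses spell a stretch that does not occur in the supplied sequence, while the profile-based placement recovers the stretch that does.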

play14:40

And so this is an example of a model

play14:46

we built with the sequence.

play14:47

So the black outline is what's deposited.

play14:50

And the blue is what ModelAngelo predicted.

play14:57

And you can also see that using the torsion

play15:00

angles that ModelAngelo predicts for the side chains,

play15:03

we actually, more or less, get good side chain

play15:07

orientations predicted.

play15:08

And now the thing is this is before refining.

play15:11

So you actually will, if you use Servalcat or something,

play15:15

these side chain orientations become much better.

play15:22

So we also wanted to see whether or not

play15:24

we can do this without the user providing sequences.

play15:28

And we know that if the resolution is high enough, when

play15:32

people are building things in [INAUDIBLE], a lot of the time

play15:35

they can make some judgments about what they think it is.

play15:39

So we want to see if that's possible.

play15:41

And so if you actually train the graphical network itself

play15:46

without having the sequence module,

play15:50

you are then able to run ModelAngelo without sequences.

play15:55

And then we wanted to see how that performs.

play15:58

So this is for an amyloid.

play16:01

In the Scheres Group we also work a lot with amyloids.

play16:03

And this is TMEM106B which had a very high resolution map,

play16:11

but they didn't know what protein it actually was.

play16:14

And they spent months on this to find out

play16:21

what it was just by basically looking at the map itself.

play16:24

And this was much before ModelAngelo.

play16:26

So we want to see if you could potentially use ModelAngelo

play16:30

for a purpose like this.

play16:32

And so this model you see here is built by ModelAngelo.

play16:38

And how I mentioned before that if you just take the highest

play16:41

probability without the sequence--

play16:43

if you take the highest probability amino acids,

play16:45

and you can see that already it's

play16:47

pretty accurate because the resolution is very high here.

play16:51

And then if we do the same thing,

play16:55

and we build a hidden Markov model based off this,

play16:59

and then instead of searching against that human-provided

play17:02

sequence file, we search it against all

play17:04

of the human genome, you can actually

play17:07

see that you also find TMEM106B with a high confidence.

play17:11

So you are able to also use this to find sequences

play17:16

that you might not know exist in your cryo-EM map.

play17:23

And so the conclusion for this is that we now have a novel way

play17:29

to encode cryo-EM density information in graphs

play17:33

for GNN purposes.

play17:35

It results in automated model building

play17:37

of proteins when the resolution is high enough, so up to four

play17:40

Angstrom.

play17:41

Of course, as you get closer and closer to four Angstrom,

play17:44

the results get worse.

play17:46

And also, even if your map has a high global resolution,

play17:49

if the local resolution is bad, it's

play17:52

not going to do as well there.

play17:54

And it's also capable of resolving unknown sequences

play17:57

when the resolution is high.

play17:59

So resolution should actually be a bit higher

play18:02

if you wanted to get things that are unknown sequences.

play18:08

So how do you actually use this? And this

play18:11

is what I gather the seminar is mostly based around.

play18:15

And the command line interface for ModelAngelo,

play18:17

we strived to keep it very simple.

play18:20

So there's actually two main ways

play18:23

of running ModelAngelo, which I said before.

play18:26

So it's with the sequence and without the sequence.

play18:30

So if you want to build it with the sequence,

play18:33

you just run model_angelo build.

play18:36

You give it a path to the volume.

play18:40

And you give it the path to the fasta file,

play18:45

and actually, a lot of the time, when people have bugs,

play18:48

it's because of the fasta file, so please make

play18:51

sure it's formatted correctly.

play18:53

And then you also give an output directory.
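
Since malformed FASTA files are named as a frequent cause of problems, a quick sanity check before a build can save time. The sketch below is a generic checker, not ModelAngelo's own validation: the talk does not say which rules the program applies, so the checks here (a `>` header before every sequence, no empty records, only the 20 standard one-letter amino acid codes) are assumptions, and `check_fasta` is a hypothetical helper name.

```python
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")  # assumption: standard 20 only


def check_fasta(text):
    """Return a list of human-readable problems found in FASTA text.
    An empty list means none of these generic checks failed."""
    problems = []
    header = None   # name of the record currently being read
    residues = 0    # residues seen so far in that record
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None and residues == 0:
                problems.append(f"record '{header}' has no sequence")
            header = line[1:].strip() or f"line {lineno}"
            residues = 0
        elif header is None:
            problems.append(f"line {lineno}: sequence before any '>' header")
        else:
            bad = sorted(set(line.upper()) - VALID_RESIDUES)
            if bad:
                problems.append(f"line {lineno}: unexpected characters {bad}")
            residues += len(line)
    if header is None:
        problems.append("no '>' header lines found")
    elif residues == 0:
        problems.append(f"record '{header}' has no sequence")
    return problems


if __name__ == "__main__":
    example = ">chainA\nMKTAYIAK\nQRQISF\n>chainB\nGSHM\n"
    print(check_fasta(example))
```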

play18:57

And what the output directory structure once it's done

play19:01

looks like, is something like this.

play19:03

So you get your output directory.

play19:05

And inside are four folders.

play19:07

So the C alpha output and then GNN output round 1, 2, and 3.

play19:12

So I didn't really mention, but the GNN, it has eight layers.

play19:16

And then once it's gone through the eight layers,

play19:19

that's basically one round of the GNN.

play19:21

But then at the very end of it, you have an atomic model.

play19:25

This is closer to the real atomic model

play19:28

than when it started.

play19:29

So then you can apply it again.

play19:31

And then you get closer.

play19:32

And you can apply it again, and then output

play19:34

round 3 is basically the end stage.

play19:38

But these are intermediate results.

play19:39

You probably, most of the time, you

play19:41

don't really need to look inside these folders,

play19:43

unless you want to debug.

play19:45

So the two CIF files that exist in your output directory

play19:49

are actually the outputs.

play19:50

So we have the output.cif and we have output_raw.

play19:56

Output.cif when you build with sequences,

play20:00

is pretty aggressively pruned based

play20:03

on the sequences provided.

play20:05

So if ModelAngelo builds something and is

play20:08

unsure of the sequence or if it's sure of the sequence

play20:10

but the sequence does not exist in the fasta file,

play20:13

it will prune it away.

play20:17

Those still exist in output_raw.

play20:20

So if something has been pruned away but is still in the raw file,

play20:25

probably that could mean two things.

play20:28

One, ModelAngelo was not very certain

play20:31

about the actual sequence, so it predicted one kind of sequence,

play20:38

or it had a low prediction confidence

play20:41

for a portion of the sequence.

play20:43

Or it just wasn't in the sequence file.

play20:46

And so that could either happen if the local resolution is bad

play20:49

or if you forgot to put the sequence in there

play20:55

or that specific sequence in there.

play20:58

Now, if you want to build without sequence,

play21:00

it's the exact same.

play21:01

It's just that the command is build_no_seq,

play21:04

and then you don't have to give it a sequence

play21:07

file anymore, thankfully.

play21:09

And then what the directory structure looks like then

play21:13

is the same intermediate folders as before,

play21:15

but now you have this other folder called HMM profiles.

play21:19

HMM profiles is a folder, that for each chain

play21:24

in your output.cif, also has an HMM profile file in it.

play21:30

And this is actually what we use, for example, for the TMEM

play21:33

case, where you want to search against a genome or something,

play21:37

you would use these profiles.

play21:39

And I'll explain how.

play21:40

But then the output.cif here, is actually

play21:43

the same as the output_raw in the build with sequence.

play21:49

So it hasn't been pruned because you can't prune anything.

play21:53

You don't have access to sequences anymore.

play21:55

So it's a bit more messy than the normal ModelAngelo.

play22:01

How would you actually do a sequence search?

play22:05

Currently, the best way to do it is to use hhblits.

play22:09

So the HMM files, HMM profile files that we provide currently

play22:15

in ModelAngelo 0.2, are HHM files,

play22:20

which are hhblits profiles.

play22:23

And you have to install hhblits, and then you

play22:26

would have to also use their database files.

play22:32

It's a bit difficult to use.

play22:36

And then basically, you run a command like this

play22:39

to get access to your results, but it does work very well.
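
The command shown on the slide is not captured in this transcript. As a rough idea of what such a search might look like, the sketch below assembles one `hhblits` invocation per profile file using the `-i` (query), `-d` (database prefix) and `-o` (results file) options from HH-suite. The directory names, the database prefix, and the helper `hhblits_commands` are placeholders, and nothing is executed by the function itself.

```python
from pathlib import Path


def hhblits_commands(profile_paths, database_prefix, results_dir):
    """Build one hhblits argument list per profile file.
    The lists are only constructed here, not run."""
    commands = []
    for profile in map(Path, profile_paths):
        out_file = Path(results_dir) / (profile.stem + ".hhr")
        commands.append([
            "hhblits",
            "-i", str(profile),          # query: an HMM in .hhm format
            "-d", str(database_prefix),  # HH-suite database prefix (placeholder)
            "-o", str(out_file),         # hit list written in .hhr format
        ])
    return commands


if __name__ == "__main__":
    # Placeholder paths; substitute the profiles from your own output directory.
    for cmd in hhblits_commands(["hmm_profiles/A.hhm"], "db/example_db", "hits"):
        print(" ".join(cmd))
```

To actually run a search, each list could be passed to `subprocess.run(cmd, check=True)` once HH-suite and a suitable database are installed.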

play22:45

And so some common issues.

play22:49

One really big one, especially if you

play22:52

think your map is very good and the result just looks horrible,

play22:56

it's that the map is probably in the wrong handedness.

play22:59

So first thing to try is to flip the hand of the map,

play23:02

try it again.

play23:03

If the result gets better, that was probably why.
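
Flipping the hand means mirroring the map along one axis, which turns every structure in it into its mirror image. Here is a toy sketch on a tiny grid stored as nested lists indexed `[z][y][x]`; `flip_hand` is an illustrative name. Real maps would normally be flipped with an image-processing tool or an array library, and header details such as origin and pixel size are ignored here.

```python
def flip_hand(volume):
    """Mirror a 3D grid (nested lists indexed [z][y][x]) along x.
    Mirroring along any single axis inverts the handedness."""
    return [[list(reversed(row)) for row in plane] for plane in volume]


# A 2 x 2 x 2 toy volume with distinct values so the mirroring is visible.
volume = [
    [[1, 2], [3, 4]],
    [[5, 6], [7, 8]],
]
flipped = flip_hand(volume)

if __name__ == "__main__":
    print(flipped)
```

Applying the flip twice returns the original grid, which is a handy check that the operation is a pure mirror.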

play23:06

Another way to diagnose this is if you look--

play23:09

like in Chimera for example, if you

play23:10

look at the confidence values ModelAngelo predicts,

play23:14

there should be some high confidence chains predicted.

play23:18

So if there's no high-confidence chain predicted,

play23:22

yeah, it's probably in the wrong hand.

play23:24

So that is the most common issue we have.

play23:29

Second one is if you see a lot of pruned away sequences

play23:34

in the output.cif but in output_raw

play23:36

it still exists, either it's a problem

play23:40

with the local resolution, but if you definitely

play23:42

think it's not a problem with the local resolution,

play23:45

then it's highly likely that that sequence is not

play23:49

provided in the sequence file that you provide.

play23:53

So to diagnose this you could, for example, run ModelAngelo

play23:58

without the sequence, then take the HMM profiles,

play24:02

do a search against whatever genome you think,

play24:05

and then look through the sequences.

play24:06

If you see that, yeah, there's a new sequence in here that

play24:10

could actually be the problem.

play24:15

The third problem is that the global resolution is too low.

play24:18

So it's not going to automatically

play24:21

just stop working at four Angstrom and above,

play24:23

but it will get worse.

play24:25

And it'll get worse pretty quickly.

play24:28

And it's going to perform best at 3.5 and higher resolutions.

play24:34

Similar problem is the local resolution.

play24:35

So even if your global resolution is high,

play24:38

the local resolution, if it's low,

play24:41

it's just not going to be very confident.

play24:44

And it's not going to be able to distinguish the sequence and so

play24:47

on.

play24:48

And the last one is, yeah, sometimes it

play24:52

can build a confident backbone.

play24:56

This is when the local resolution

play24:59

is on the cusp of something where it can see the backbone,

play25:01

but it can't build the sequences, even if you've

play25:05

provided the sequences.

play25:06

And this happens sometimes.

play25:10

And it's really hard to do anything about it

play25:12

because just the side chain densities are probably

play25:16

not good enough for ModelAngelo to understand what's happening.

play25:19

So to diagnose this: if you're

play25:25

confident that the sequence exists

play25:29

in the input, and the ModelAngelo raw output has that chain,

play25:35

but it just doesn't exist in the pruned sequence,

play25:39

then you could still use the backbone that ModelAngelo has

play25:43

predicted, but you would have to go and fix the sequence

play25:47

yourself.

play25:49

So yeah, those are some of the common issues.
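
The checks walked through above can be collected into a small decision helper. This is only a summary of the advice in the talk written as code: the inputs are judgments the user makes by looking at the results (for example, whether any chain shows high confidence), the 4 and 3.5 Angstrom cut-offs are the ones mentioned in the talk, and `triage` is a hypothetical name rather than anything shipped with ModelAngelo.

```python
def triage(global_resolution, has_high_confidence_chain,
           pruned_but_in_raw, local_resolution_ok, sure_sequence_in_fasta):
    """Return suggestions following the troubleshooting order in the talk.

    global_resolution         -- reported map resolution in Angstrom
    has_high_confidence_chain -- any chain with high predicted confidence?
    pruned_but_in_raw         -- residues missing from output.cif but present
                                 in the raw output?
    local_resolution_ok       -- is local resolution good where pruning happened?
    sure_sequence_in_fasta    -- are you certain the sequence was provided?
    """
    hints = []
    if not has_high_confidence_chain:
        hints.append("flip the hand of the map and rebuild")
    if global_resolution > 4.0:
        hints.append("global resolution is worse than 4 Angstrom: expect poor results")
    elif global_resolution > 3.5:
        hints.append("global resolution is between 3.5 and 4 Angstrom: expect weaker side chains")
    if pruned_but_in_raw:
        if not local_resolution_ok:
            hints.append("low local resolution: sequence assignment is unreliable there")
        elif not sure_sequence_in_fasta:
            hints.append("build without sequences and search the HMM profiles for a missing sequence")
        else:
            hints.append("keep the raw backbone and fix the sequence by hand")
    return hints


if __name__ == "__main__":
    for hint in triage(3.2, True, True, True, False):
        print(hint)
```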

play25:51

So what are the next steps for ModelAngelo?

play25:55

Well, we're going to have the 1.0, which

play25:57

we will release soon.

play25:59

And there's a lot of improvements.

play26:01

The first one is that everything I've explained so far

play26:04

is for proteins only.

play26:08

But we are going to now also have the ability

play26:12

to do nucleotides.

play26:14

Nucleotides are considerably more

play26:16

difficult to build automatically.

play26:18

And so the results won't be as good, especially for lower

play26:25

resolutions, as the proteins, but it should still

play26:29

be respectable to build the backbone.

play26:31

And when the resolution is very high, like 2 and 1/2 Angstrom,

play26:35

then you could really actually do

play26:37

a pretty good job of completely automatically building it.

play26:43

ModelAngelo currently downsamples everything

play26:47

to Nyquist of 3 Angstroms, and because we

play26:53

had some issues with the nucleotides,

play26:55

2 Angstroms works much better for the nucleotides,

play26:58

so that's what we're going to be using for ModelAngelo 1.0.
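
The Nyquist limit of a sampled map is twice its pixel size, so a Nyquist of 3 Angstrom corresponds to 1.5 Angstrom voxels (the voxel size mentioned earlier in the talk) and a Nyquist of 2 Angstrom corresponds to 1 Angstrom voxels. The short sketch below works through that arithmetic for an example box. The 256-voxel box at 0.83 Angstrom per pixel is an invented input, and the rounding in `resampled_box_size` is an assumption, not ModelAngelo's actual resampling code.

```python
def pixel_size_for_nyquist(nyquist_resolution):
    """Pixel size (Angstrom) whose Nyquist limit equals the given resolution."""
    return nyquist_resolution / 2.0


def resampled_box_size(box_size, pixel_size, target_pixel_size):
    """Box edge in voxels after resampling to a new pixel size while
    keeping the same physical extent (rounded to the nearest voxel)."""
    return round(box_size * pixel_size / target_pixel_size)


def voxel_count_ratio(old_pixel_size, new_pixel_size):
    """How many times more voxels a finer grid needs for the same volume."""
    return (old_pixel_size / new_pixel_size) ** 3


if __name__ == "__main__":
    for nyquist in (3.0, 2.0):
        pixel = pixel_size_for_nyquist(nyquist)
        print(nyquist, pixel, resampled_box_size(256, 0.83, pixel))
    print(voxel_count_ratio(1.5, 1.0))
```

The finer grid costs more than three times as many voxels for the same volume, which gives a sense of the trade-off behind the change.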

play27:03

This actually also slightly improves

play27:06

performance on proteins as well.

play27:09

And we will have an integrated HMMER

play27:14

search, which is going to be very useful,

play27:16

for ModelAngelo 1.0.

play27:18

So instead of having to install hhblits

play27:20

and dealing with all that, we're going

play27:22

to have an integrated utility that

play27:24

also understands how ModelAngelo HMM profile directories work.

play27:30

So you can just finish your build_no_seq job,

play27:35

and then just do an HMM search step right from there,

play27:39

and it's going to understand everything and parse

play27:41

everything correctly.

play27:42

And it's going to make your job of just searching

play27:45

a genome much easier.

play27:47

So that's going to be in ModelAngelo 1.0 as well.

play27:49

And it will be quite a bit faster,

play27:52

the whole thing, because of some optimizations that we've done.

play27:56

And yeah, so please try it out if you haven't already.

play28:02

There's instructions available in the GitHub repo.

play28:07

And also, if you just type ModelAngelo build and then

play28:12

help, it will give you all the options that

play28:15

are available for each group.

play28:20

And yeah, so I'd like to acknowledge my supervisor Sjors

play28:25

Scheres, Dari, who's the co-author of this paper,

play28:29

other members of my team, and also Lukas Käll,

play28:34

who helped with HMM.

play28:36

And I can take some questions now.

play28:42

Great, thank you very much, Kiarash.

play28:44

For questions, you can raise your hand

play28:48

and give the little clap symbol here.

play28:50

But if you go to reactions, and then click Raise Your Hand

play28:52

big button at the bottom, hands will pop up,

play28:55

and it will bump you up to the top of the list.

play28:58

And then I can unmute you, and you

play29:00

can ask your question directly.

play29:03

You can also send questions in the chat window.

play29:07

And we can just pass those on.

play29:10

Just a thought, as you were wrapping up there,

play29:13

I can tell you that when I installed ModelAngelo,

play29:16

I installed it, it's available in SBGrid,

play29:18

I think we have 0.2.2 there, I picked

play29:22

a sort of arbitrarily hard map and just ran it

play29:26

as my test case.

play29:27

The very first time it ran through beautifully.

play29:30

It built a very nice little model.

play29:33

And it ran great.

play29:34

And I can't say that for many other software packages,

play29:37

so that's--

play29:39

Thank you.

play29:39

That's great.

play29:42

One question that came in by chat,

play29:44

we observed that chain was built but in fragments.

play29:48

We tried after changing the map orientation,

play29:50

but still observed the same results.

play29:52

Can you suggest something regarding the same?

play29:55

The map local resolution between 3 to 5, much

play29:59

of it in the 3.2 range.

play30:02

Right, so yeah, if it's fragmented,

play30:05

you could take a look at the raw file.

play30:07

So if that one isn't fragmented, that's probably

play30:11

what's happening is in the postprocessing script,

play30:14

it's done a search, so it can see some portions

play30:17

of the sequence that it's confident in

play30:21

and other portions where it's probably

play30:25

getting less confident, so it's going to prune those.

play30:27

So there's not much to do other than just take the chain

play30:34

and especially if you think that the sequence exists,

play30:37

not much to do there, just kind of fix it manually.

play30:41

It's probably just the local resolution

play30:43

at those specific points is not good enough for ModelAngelo

play30:48

or something like that.

play30:53

All right, we've got a hand up.

play30:55

Chao Kwan, you can unmute and ask your question.

play30:59

Thank you.

play31:00

That's good talk, so yeah, I'm actually working on a case

play31:04

where it's difficult to take a model for the map in Chimera.

play31:07

So I look at the map.

play31:08

I find the map quality is pretty--

play31:10

I mean, resolution is good by face value,

play31:15

but the particles probably have a preferred orientation.

play31:20

OK, right.

play31:21

So I'm wondering, in this case, do you

play31:24

think that AI software can help?

play31:26

Like--

play31:27

So what is the resolution, display resolution number?

play31:33

It's 3.2.

play31:34

Again, it's just the face value.

play31:36

So the quality is actually--

play31:39

You can try.

play31:40

So what it's probably going to do is, for a portion of it,

play31:43

it might be able to build it.

play31:45

OK.

play31:46

But yeah, if you already think it's not very good,

play31:51

It's, yeah,

play31:51

So basically, ModelAngelo, when a human can build it,

play31:55

we can build it faster, that's the idea, and easier for you.

play32:00

But then if you are having trouble,

play32:04

it might be that ModelAngelo can't do it either,

play32:06

but you should try it, and let me know.

play32:08

If it does build it, that's great.

play32:11

Yeah, yeah, I sure do hope so.

play32:13

I mean, I'm thinking the first step here is localization

play32:18

of the C alphas, right?

play32:19

Yes.

play32:20

If we haven't localized the C alphas, then it should help.

play32:24

In theory, it should help localize

play32:25

this secondary structure feature.

play32:28

So that's going to be tough.

play32:29

Yeah.

play32:30

That's my style.

play32:30

Yeah.

play32:31

Yeah.

play32:32

So if it's able to do that.

play32:33

And so we put the intermediate files there for a reason.

play32:37

So if this is a more advanced whatever, but once it runs,

play32:42

you can look in the C alpha output folder,

play32:45

and then there's a cif file with the C alphas in there.

play32:50

and just look--

play32:51

that might even help you even if ModelAngelo itself isn't

play32:54

able to do anything afterwards, it

play32:56

might help you see where things are, so.

play32:58

Yeah, yeah, cool.

play33:00

Thank you.

play33:00

Thank you.

play33:01

Appreciate that.

play33:02

We had a question here from--

play33:06

I noticed that building into maps in which symmetry has been

play33:09

imposed, build chains that were not identical,

play33:14

let's say, imposed symmetry, so can symmetry

play33:17

be accounted for without user map segmentation?

play33:21

So basically, no.

play33:24

ModelAngelo will basically build everything independently almost

play33:29

assuming there's no symmetry whatsoever.

play33:32

And you will actually see specifically regions.

play33:35

And this comes up more in when you want to do searches

play33:38

against a genome.

play33:39

But it's the exact same symmetry I

play33:42

understand, but for some reason, ModelAngelo--

play33:46

and the reason is, it's a bit technical,

play33:48

but it will have better chains in one symmetry section and not

play33:53

the other.

play33:53

And I don't know, just you can duplicate it or something,

play33:58

but, yeah.

play34:00

Choose the best chain and apply symmetry.

play34:01

Exactly, yeah.
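
The workaround agreed on here (pick the best-built copy, then regenerate the others by symmetry) can be sketched as follows. The example assumes a cyclic (Cn) symmetry whose axis is the z axis through the coordinate origin, which will generally not hold for a real map without first shifting coordinates to the symmetry axis. The per-atom tuples `(x, y, z, confidence)` and all function names are invented for illustration.

```python
import math


def best_chain(chains):
    """Return the ID of the chain with the highest mean confidence.
    chains maps chain ID -> list of (x, y, z, confidence) tuples."""
    return max(
        chains,
        key=lambda cid: sum(atom[3] for atom in chains[cid]) / len(chains[cid]),
    )


def rotate_z(point, angle_deg):
    """Rotate an (x, y, z) point about the z axis through the origin."""
    a = math.radians(angle_deg)
    x, y, z = point
    return (x * math.cos(a) - y * math.sin(a),
            x * math.sin(a) + y * math.cos(a),
            z)


def symmetry_copies(coords, n_fold):
    """Generate all n_fold copies of a coordinate list under Cn symmetry.
    The first copy (rotation by 0 degrees) is the input itself."""
    return [
        [rotate_z(p, 360.0 * k / n_fold) for p in coords]
        for k in range(n_fold)
    ]


# Two copies of the same subunit, one built with higher confidence.
chains = {
    "A": [(10.0, 0.0, 0.0, 0.9), (11.0, 0.0, 1.0, 0.8)],
    "B": [(-10.0, 0.0, 0.0, 0.5), (-11.0, 0.0, 1.0, 0.4)],
}
chosen = best_chain(chains)
copies = symmetry_copies([atom[:3] for atom in chains[chosen]], n_fold=2)

if __name__ == "__main__":
    print(chosen)
    for copy in copies:
        print(copy)
```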

play34:03

Yeah.

play34:05

One other question, is there a size limit for ModelAngelo

play34:07

for building a model?

play34:08

No.

play34:09

Yeah, we've built things now that's like 160,000 residues.

play34:15

It takes longer, but it should be fine, yeah.

play34:19

And I think you touched on this here.

play34:22

Your resolution, you're aiming for about 3 and 1/2 you said

play34:25

is sort of ideal for the best quality model.

play34:29

So--

play34:29

and better.

play34:31

Yeah, say that again, sorry.

play34:33

And better.

play34:34

And better, OK, so--

play34:35

Yeah.

play34:35

It drops off pretty quickly after 3 and 1/2.

play34:38

Yeah, yeah, yeah.

play34:39

It's like, mainly the side chains that go first, right?

play34:42

So it's-- yeah.

play34:45

Yeah, so we had a question, which resolution of the map

play34:48

can get a good model?

play34:50

So 3 and 1/2 and then you're--

play34:52

Yeah, 3 and 1/2.

play34:54

--living dangerously.

play34:55

Yeah, I think, yeah.

play34:57

Pete had a question.

play34:58

Go ahead, Pete.

play34:59

Yeah, so great talk.

play35:01

And I've actually got two questions.

play35:03

And they're both a little bit speculative kind of ones.

play35:06

But for the first one, so do you have

play35:08

a sense for how sensitive it is for the density type?

play35:11

If you wanted to use this for a low resolution X-ray map

play35:14

or if you had a low resolution MicroED map,

play35:17

would you have to retrain that portion of the model?

play35:19

Or do you think things would be close enough?

play35:21

So I've been asked this before.

play35:24

And I know that there are other machine learning approaches

play35:28

that work with densities, where apparently it

play35:30

doesn't need to be retrained.

play35:31

But I personally have never tested

play35:34

ModelAngelo on anything else.

play35:36

So I really couldn't say.

play35:38

My intuition says it should be retrained, but I don't know.

play35:45

Not knowing is a perfectly valid answer, and it's speculative.

play35:48

So I guess the other one might be

play35:51

a little bit more speculative.

play35:53

But for the unknown sequence case,

play35:56

have you thought about using--

play35:58

if you've got a three-dimensional structure

play36:00

that you don't know the sequence of,

play36:01

would doing a search with that three-dimensional structure

play36:05

rather than sequence probability,

play36:07

is that something that could be useful?

play36:09

Yeah, that could be useful.

play36:11

So like Foldseek, for example, does something like that.

play36:14

But basically, we've seen much, much better specificity

play36:20

when we use the sequences.

play36:22

Because if you think like, ModelAngelo has--

play36:25

it's predicting two things.

play36:27

It's predicting the structure, but it's also

play36:28

predicting these probabilities.

play36:30

And there's a lot more information

play36:32

in the probabilities there.

play36:33

Because structure is conserved much more than sequences.

play36:38

So if you want to find--

play36:41

Cool, thank you.

play36:50

Any other questions, anyone?

play36:52

Feel free to raise your hand or send it by chat.

play36:56

All right, good.

play36:57

I have a question.

play36:58

So you're coming from a mathematics background.

play37:01

You're tackling different problems in structural biology.

play37:04

This is a challenging one, and this is a pretty productive

play37:08

approach, at least in my hands.

play37:11

What are your targets next?

play37:14

Are you improving this?

play37:15

You're going to nucleotides.

play37:18

Yes.

play37:18

That's interesting.

play37:19

So, yeah, we're going to be improving this for now.

play37:22

And I'm somewhat involved with other things

play37:29

for cryo-EM processing itself.

play37:32

But after that, who knows?

play37:34

There's some other projects, but they're pretty preliminary now,

play37:38

so.

play37:39

I'll let you know.

play37:40

All right, cool.

play37:42

Yeah, that's great.

play37:42

I just, I see there's so much rapid development happening

play37:47

in this space.

play37:48

So it seems like there's so many great ideas out there.

play37:51

Just a good time to be developing

play37:54

methods in structural biology.

play37:57

And using them, hopefully.

play37:59

Yeah, yeah, all right, well, with that, we can wrap up.

play38:02

[MUSIC PLAYING]
