FUNCTIONALLY PROFILING METAGENOMES AND... - Eric Franzosa - Late-Breaking Research - ISMB 2016
Summary
TLDR: In this talk, Eric Franzosa discusses advancements in profiling microbial communities from shotgun sequencing data, focusing on taxonomic and functional profiling. The tool 'MetaPhlAn 2' is highlighted for taxonomic profiling, while 'HUMAnN2' is introduced for functional profiling, using a tiered read mapping strategy to improve accuracy and speed. HUMAnN2 enables species-level functional profiling, offering insights into both the composition and activity of microbial communities, with applications demonstrated in the Human Microbiome Project and other cohorts.
Takeaways
- 🌟 The script discusses advancements in profiling microbial communities from shotgun sequencing data, focusing on taxonomic and functional profiling.
- 🔍 Taxonomic profiling identifies species and higher-level clades present in a microbial community, while functional profiling identifies gene families and pathway composition.
- 🚀 MetaPhlAn 2 is a tool developed for taxonomic profiling that searches reads against a pre-selected set of marker genes unique to each clade or species for efficiency and accuracy.
- 🛠 HUMAnN2 introduces a 'tiered read mapping' strategy to improve the accuracy, speed, and resolution of functional profiling by leveraging species identification for more targeted searches.
- 🔑 The first tier of HUMAnN2's method rapidly identifies species in a community by mapping reads against species-specific marker genes.
- 🌐 The second tier builds a custom database for the sample by concatenating the pan-genomes of detected species, allowing for a detailed search of the remaining reads against these genomes.
- 📚 HUMAnN2's tiered search is more specific and sensitive, leading to higher accuracy and faster processing times compared to traditional comprehensive searches.
- 📈 The method was tested using synthetic metagenomes created from sets of bacterial genomes, demonstrating high precision even for low-abundance species and high sensitivity down to about 1x coverage.
- 🌿 HUMAnN2 has been applied to real-world data, profiling hundreds of human metagenomes, and has been shown to explain a majority of reads during the initial accelerated search tier.
- 🔬 The tool allows for the identification of metabolic pathways that are signatures for particular body areas, providing insights into the taxonomic resolution of functions within communities.
- 📊 HUMAnN2 also enables the distinction between functional potential (DNA level) and functional activity (RNA level), showing that the two are not always the same.
Q & A
What are the two primary questions in microbial community profiling from sequencing data?
-The two primary questions are 'who is there', which refers to taxonomic profiling to identify species and higher-level clades, and 'what are those species doing', which is functional profiling to identify gene families and pathway composition.
What is the computational challenge in profiling microbial communities from sequencing data?
-The computational challenge lies in the need to search short reads from shotgun sequencing against a vast database of microbial reference genomes, which is computationally intensive and prone to errors due to potential spurious mappings.
What is MetaPhlAn 2 and how does it address the taxonomic profiling challenge?
-MetaPhlAn 2 is a tool developed for taxonomic profiling that searches reads against a pre-selected set of marker genes unique to each clade or species, rather than the entire database, making the profiling process more efficient and accurate.
What is the concept behind the tool HUMAnN2 and its tiered read mapping strategy?
-HUMAnN2 implements a tiered read mapping strategy, which starts with rapidly identifying species in a community by mapping reads against species-specific marker genes, followed by a custom database search of the pan-genomes of detected species, and finally a comprehensive translated search for any remaining reads.
How does HUMAnN2 improve the accuracy and speed of functional profiling?
-HUMAnN2 improves accuracy by searching reads only against the pan-genomes of species detected in the sample, which reduces spurious mappings. It improves speed because most reads are explained by this reduced database, so the metagenome is processed much faster than with a comprehensive search against an exhaustive database.
What is the advantage of HUMAnN2's tiered search over traditional comprehensive searches?
-HUMAnN2's tiered search is more specific and sensitive, placing reads in the correct locations, and thus increasing overall accuracy. It also processes the metagenome faster due to the use of a reduced pan-genome database (about a 7x speedup in the synthetic example).
How does HUMAnN2 provide species-level resolution in functional profiling?
-HUMAnN2's tiered search allows for the reconstruction of functions on a species-by-species basis within the community, providing insights into which species are performing specific functions, even for low-abundance species.
How are synthetic metagenomes or metatranscriptomes used to evaluate HUMAnN2's performance?
-Synthetic metagenomes or metatranscriptomes are created by taking sets of bacterial genomes and pulling synthetic sequencing reads from them, allowing for the evaluation of HUMAnN2's accuracy and performance against expected profiles.
What are the different patterns of species contribution to a conserved pathway observed in the human microbiome using HUMAnN2?
-Different patterns include a complex attribution where multiple species contribute to a conserved pathway in varying mixtures across individuals, a per-person dominant attribution where one species dominates the contribution to the pathway for each individual, and a universal dominant pattern where one species consistently provides the pathway across the population.
How does HUMAnN2 distinguish between functional potential and functional activity in a community?
-HUMAnN2 can analyze both DNA and RNA data, allowing it to distinguish between the functional potential (relative copy numbers of encoded pathways) and functional activity (relative expression of those pathways) within a community, showing that the two are not always the same.
How can one access and learn more about HUMAnN2 and its related tools?
-HUMAnN2 can be found by searching its name online, and it is installable from source, via pip as a Python package, and via Homebrew. There is a detailed user manual and an active user group on Google Groups for further support and information.
Outlines
🌱 Advances in Microbial Community Profiling
Franzosa discusses the lab's work on profiling microbial communities from shotgun sequencing data, focusing on taxonomic and functional profiling. Taxonomic profiling identifies species and clades, while functional profiling examines gene families and pathways within the community. The challenge lies in the computational intensity and potential for error due to the vast size of data and databases. The lab developed MetaPhlAn 2 to address taxonomic profiling by searching reads against a reduced, pre-selected set of marker genes unique to each clade or species, resulting in faster and more accurate taxonomic composition profiling.
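The marker-gene idea can be illustrated with a minimal Python sketch. It is not MetaPhlAn 2's implementation: the marker table, the species names, and the `min_reads` threshold are hypothetical, and read mapping is assumed to have already produced one marker ID per mapped read.

```python
from collections import Counter

# Hypothetical toy marker table: marker gene ID -> the single species it is unique to.
MARKER_TO_SPECIES = {
    "marker_A1": "species_yellow",
    "marker_A2": "species_yellow",
    "marker_B1": "species_blue",
    "marker_C1": "species_green",
}

def detect_species(read_hits, min_reads=2):
    """Return {species: read count} for species whose markers recruited
    at least `min_reads` reads. `read_hits` holds one marker ID per mapped read.
    The threshold is an illustrative assumption, not a documented default."""
    counts = Counter(
        MARKER_TO_SPECIES[m] for m in read_hits if m in MARKER_TO_SPECIES
    )
    return {species: n for species, n in counts.items() if n >= min_reads}

# Three reads hit yellow markers, one hits a blue marker, one hits nothing known.
hits = ["marker_A1", "marker_A2", "marker_A1", "marker_B1", "unknown"]
detected = detect_species(hits)
print(detected)
```

Because each marker belongs to exactly one species, a hit acts as the "name tag" described in the talk; the read-count threshold simply guards against calling a species from a single stray read.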
🔍 HUMAnN2: Tiered Read Mapping for Enhanced Functional Profiling
The talk introduces HUMAnN2, a method that improves the speed and accuracy of functional profiling by implementing tiered read mapping. The process begins with a rapid identification of species using a small, specific dataset of marker genes. This is followed by a custom database search using the pan-genomes of detected species, excluding genomes with no evidence of presence. The final tier is a comprehensive translated search of the still-unmapped reads against a protein database. HUMAnN2's approach is shown to be more accurate and faster, with the ability to reconstruct functions on a species-by-species basis, providing both community totals and species-specific insights.
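A toy Python sketch of the tiered search follows. It assumes the species have already been detected in the first tier, and it replaces real read alignment with exact string lookup; the function name, the data layout, and the sequences are all hypothetical and serve only to show how reads flow from one tier to the next.

```python
def tiered_search(reads, detected_species, pangenomes, protein_db):
    """Assign reads to (gene, species) using progressively broader databases.

    pangenomes: {species: {sequence: gene}}, protein_db: {sequence: gene}.
    Exact lookup stands in for alignment (an assumption of this sketch).
    """
    # Tier 2 database: concatenate the pan-genomes of detected species only.
    custom_db = {}
    for species in detected_species:
        for seq, gene in pangenomes[species].items():
            custom_db[seq] = (species, gene)

    stratified = {}  # (gene, species or "unclassified") -> read count
    unmapped = []    # reads set aside for possible downstream assembly
    for read in reads:
        if read in custom_db:            # tier 2: species-resolved hit
            species, gene = custom_db[read]
            key = (gene, species)
        elif read in protein_db:         # tier 3: comprehensive search
            key = (protein_db[read], "unclassified")
        else:
            unmapped.append(read)
            continue
        stratified[key] = stratified.get(key, 0) + 1
    return stratified, unmapped

pangenomes = {
    "blue": {"ACGT": "geneX", "TTGA": "geneY"},
    "orange": {"GGCC": "geneX"},
    "green": {"CCAA": "geneZ"},  # not detected, so excluded from tier 2
}
protein_db = {"CCAA": "geneZ", "AATT": "geneW"}
reads = ["ACGT", "GGCC", "ACGT", "CCAA", "NNNN"]

profile, leftover = tiered_search(reads, ["blue", "orange"], pangenomes, protein_db)
print(profile, leftover)
```

In this example the read matching the undetected green species is still counted, but only as unclassified abundance from the final tier, which mirrors the stratified output described in the talk.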
🧬 Synthetic Metagenomes and Real-World Data Analysis
To evaluate HUMAnN2's effectiveness, synthetic metagenomes were created by sampling synthetic sequencing reads from sets of bacterial genomes. The method was tested on a challenging dataset of 20 common gut species with a roughly thousand-fold abundance range and several congeneric species. HUMAnN2 demonstrated higher specificity and sensitivity compared to a traditional comprehensive search, resulting in improved accuracy and about a 7x speedup. The method was also applied to real-world data from the Human Microbiome Project, showing similar performance with faster processing times and the ability to identify signature metabolic pathways conserved in specific body areas.
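One common way to score a profile against the expected (gold standard) profile is sketched below in Python. The talk does not spell out the exact definitions used in the benchmark, so this set-based version and the example gene and species names are assumptions for illustration.

```python
def precision_sensitivity(expected, observed):
    """Compare sets of (gene, species) calls.

    precision   = fraction of observed calls that were expected
    sensitivity = fraction of expected calls that were observed
    """
    expected, observed = set(expected), set(observed)
    true_positives = len(expected & observed)
    precision = true_positives / len(observed) if observed else 0.0
    sensitivity = true_positives / len(expected) if expected else 0.0
    return precision, sensitivity

# Hypothetical profiles: one expected call is missed, one spurious call is made.
expected = {("geneX", "blue"), ("geneY", "blue"), ("geneX", "orange"), ("geneZ", "orange")}
observed = {("geneX", "blue"), ("geneY", "blue"), ("geneX", "orange"), ("geneQ", "orange")}

precision, sensitivity = precision_sensitivity(expected, observed)
print(precision, sensitivity)
```

A spurious mapping lowers precision, while a read that should have mapped to a gene but landed elsewhere lowers sensitivity, which matches the two error types described for the comprehensive search.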
🌐 HUMAnN2's Broader Impact on Microbiome Research
HUMAnN2 has been used to analyze hundreds of human metagenomes, revealing distinct metabolic pathway signatures for different body areas. The method allows for species-level resolution of function, showing different patterns of species contribution to conserved pathways across individuals. It also distinguishes between functional potential and activity within a community. HUMAnN2 is available for use and is part of a suite of tools developed by the Huttenhower lab for microbiome analysis, which will be further discussed in upcoming presentations.
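The three attribution patterns named in the talk (complex, per-person dominant, universal dominant) can be told apart from species-stratified pathway abundances. The Python sketch below uses a hypothetical rule and a hypothetical 0.8 dominance threshold; it is an illustration, not the analysis the speakers performed, and all species labels are placeholders.

```python
def attribution_pattern(samples, dominance=0.8):
    """Classify how species contribute to one pathway across people.

    samples: list of non-empty dicts, each mapping species -> pathway
    abundance in one person. The rule and threshold are assumptions.
    """
    dominant_species = []
    for contributions in samples:
        total = sum(contributions.values())
        top_species, top_value = max(contributions.items(), key=lambda kv: kv[1])
        # Any person without a clearly dominant contributor makes the pattern complex.
        if total == 0 or top_value / total < dominance:
            return "complex"
        dominant_species.append(top_species)
    if len(set(dominant_species)) == 1:
        return "universal dominant"
    return "per-person dominant"

gut = [{"sp_a": 3, "sp_b": 2, "sp_c": 1}, {"sp_b": 4, "sp_d": 2}]
vaginal = [{"lactobacillus_sp1": 9, "other": 1}, {"lactobacillus_sp2": 19, "other": 1}]
skin = [{"p_acnes": 10}, {"p_acnes": 9, "other": 1}]

print(attribution_pattern(gut))      # mixtures differ from person to person
print(attribution_pattern(vaginal))  # one dominant species per person, differing between people
print(attribution_pattern(skin))     # the same dominant species in everyone
```

The rule is deliberately strict: a single person without a dominant contributor is enough to call the pattern complex. A real analysis would more likely summarize dominance across the cohort than apply a hard cutoff.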
📚 Availability and Further Resources for HUMAnN2
HUMAnN2 is available now: it can be found with a simple web search and is installable from source, via pip as a Python package, or via Homebrew. The lab provides a detailed user manual and an active user group on Google Groups for support. The presentation also highlights the broader ecosystem of tools for microbiome analysis developed by the Huttenhower lab, with a technology track presentation planned for the following day. The team acknowledges the contributions of collaborators and the Human Microbiome Project for the valuable data used in their research.
Keywords
💡Microbial Communities
💡Shotgun Sequencing
💡Taxonomic Profiling
💡Functional Profiling
💡MetaPhlAn 2
💡Tiered Read Mapping
💡Pan-genome
💡HUMAnN2
💡Synthetic Metagenomes
💡Metabolic Pathways
Highlights
Franzosa discusses advancements in profiling microbial communities from shotgun sequencing data.
Two primary questions in microbial community profiling: taxonomic profiling and functional profiling.
Taxonomic profiling identifies species and clades present in a microbial community.
Functional profiling focuses on gene families and pathway composition within the community.
MetaPhlAn 2 is a tool for taxonomic profiling using pre-selected marker genes.
MetaPhlAn 2's efficiency comes from reduced database size and specificity of marker genes.
HUMAnN2 introduces a tiered read mapping strategy for functional profiling.
Tiered read mapping starts with rapid species identification using marker genes.
The second tier involves building a custom database from detected species' pan-genomes.
Remaining reads are searched against a comprehensive protein database in the final tier.
HUMAnN2 provides functional profiles stratified by contributing organism and unclassified abundance.
HUMAnN2 can quantify gene families and collapse their abundance into pathways.
Synthetic metagenomes are used to validate HUMAnN2's accuracy and performance.
HUMAnN2 outperforms a traditional comprehensive search in precision, sensitivity, and speed.
HUMAnN2 enables species-level resolution of functional profiles in microbial communities.
The method has been applied to hundreds of human metagenomes from the Human Microbiome Project.
HUMAnN2 can identify metabolic pathway signatures specific to different body areas.
Species-level profiling reveals different patterns of contribution to conserved pathways.
RNA data analysis with HUMAnN2 shows a distinction between functional potential and activity.
HUMAnN2 is available as a Python package and part of a broader suite of microbiome analysis tools.
The Huttenhower lab offers a technology track presentation for an overview of their tools.
Transcripts
i'm eric franzosa talking about some of the
work
in our lab profiling microbial
communities from shotgun sequencing data
and we'll be talking about some recent
advancements in that area today
so when we're talking about the
microbial community there's two primary
questions or types of questions we're
trying to answer from sequencing data
one is this question of who is there
which is the issue of taxonomic
profiling of identifying the species and
higher level clades that are present in
that microbial community
and the second question which will be
the closer focus for today
is a question of what those species are
doing which is functional profiling
identifying the gene families
and pathway composition of that
community and both of these because
they're starting from sequence
data are classic bioinformatics problems
in sequence search so that's where we
begin our story um so to actually do
this to actually do either of these
types of profiling
we're interested in searching short
reads from a shotgun sequenced metagenome or
metatranscriptome
against a vast database of microbial
reference genomes
and as you can imagine as the size of
these data metagenomes
increases as the size of this database
increases with new isolate genomes being
sequenced every year
this is a very computationally intensive
problem and there's also a lot of
opportunity for error here if we're
spuriously mapping
reads where they don't belong and so
this is really where we find ourselves
in trying to solve these profiling
questions
previously we developed a technique for
taxonomic profiling to alleviate some of
those issues
and more specifically what we're doing
there in this tool called metaphlan 2
is instead of searching reads against
the entire database
is to search them against a pre-selected
set of marker genes that are unique for
each clade or species
across the community so for example here
we have isolated this gene here that's
well conserved within the yellow species
we always see it in isolates of that
species
but it's not seen anywhere else so if
that gene recruits a read if a read maps
to that gene in the community
it's sort of like a little name tag
telling us that the yellow species
is there and we can use this technique
to very efficiently and accurately
profile the taxonomic composition of a
community this is a reduced database so
it gives us a nice speed
uh bonus and we've also pre-selected
these genes to be very specific so we
know when they recruit reads
that they're being assigned
to the correct place
the issue in functional
profiling is we're not interested in
just a subset of genes but rather we're
interested in all the genes and pathways
in the community
but in today's talk what i would like to
go over is how we can leverage this
idea
of being able to rapidly and accurately
identify the species in a community
in order to improve the accuracy speed
and resolution of functional profiling
so the method that we developed for that
is called humann2 and it implements a
strategy called
tiered read mapping and i'll go through
what that means now
so the idea here is that we're starting
from a shotgun sequenced
metagenome or meta transcriptome from a
microbial community and then we're going
to search these reads through a set of
tiers of different searches against
different databases
the first search tier is what i just
described in the previous slide
we're going to attempt to rapidly
identify the species in this community
by mapping these reads against a small
and highly specific data set
of species-specific marker genes and so
you can see here in this example
that reads are being recruited to this
marker gene from the blue species
and the orange species here indicating
that they are likely to be present in
this community
but not to this green species indicating
that it's likely absent from that
community
and this is our first search tier the
second search tier which is really the
meat of this whole process then
is to build a custom database for this
sample by concatenating the pan genomes
of species that we detected in this
taxonomic prescreen step
so now what we're going to do is do a
detailed search of all the remaining
reads
against the genomes or pan genomes of
species we believe to be present in the
community
we're throwing away this genome here
we're not including this genome here
because we had no
evidence that it was actually present in
the community and although i'm just
showing one here being excluded
in reality this process is excluding
many many pan genomes that we won't
search through in this second tier of
the search
in the last tier of the search anything
that doesn't map in this process we'll
then
let flow into a more traditional
comprehensive search strategy so we'll
try to explain as much as we can
as quickly and as specifically as
possible and then what's left over we'll
take a
traditional approach and just search
comprehensively by translated search
against a protein database at the very
end of this some reads will still map
nowhere they don't map to any reference
and these are set aside for possible
assembly downstream and outside of
humann2
so the end result of this tiered search
are functional profiles of a metagenome
or meta transcriptome looking something
like this
where we have a particular function in
this case a gene family
and it's stratified by both contributing
organism as well as unclassified
abundance that we couldn't assign to a
particular species
that adds up to a total here this is for
gene families
once we've actually quantified the gene
families in the community
for genes like imp dehydrogenase that
participate in a metabolic pathway we
can collapse the abundance of multiple
genes
into a smaller subset of pathways which
is
more tractable to work with downstream
and we end up with similar looking data
where we get for each pathway
an abundance at the community level as
well as stratified by the species that
contributed to that pathway
and a measure of pathway coverage which
is a measure of our confidence that the
pathway is actually complete within this
particular sample
so these are what the outputs look like
for a typical run of humann2 on a
metagenome or metatranscriptome
to evaluate that this actually works and
it's behaving the way that we expect it
to
we were able to create synthetic
metagenomes or meta transcriptomes
by taking sets of bacterial genomes and
pulling synthetic sequencing reads from
them
and so in the example i'll go over here
we have a selection of 20 bacterial
species that are the most commonly
occurring species in the human gut
microbiome
and what we've done is to sample reads
i think the color there just shifted if
there's anything we're able to do
with that on the projector if not you
can adjust your eyes
um so we have sampled these reads in a
staggered composition here
such that the most abundant species in
this
synthetic metagenome is about a thousand
times more abundant than the least
abundant species and so this makes for a
challenging
problem here in that the species have a
really broad range of
abundance also you can see
challenging us here the fact that we
have
congeneric species multiple species
within the same genus
and so there's a lot of homology we
expect among these congeneric species
which can make
mapping reads to specific species
within that genus more difficult
so once we have this synthetic
metagenome we can create an expected
profile of what genes and pathways we
expect to observe
and then analyze this metagenome using
different methods and see how well they
do
when we actually analyze this using a
traditional method a traditional
comprehensive search
we actually see a lot of error due to
spurious mapping we have
reads that are mapping where they're not
supposed to across those broad databases
which hurts our precision
as well as reads that were supposed to
map to a gene and wound up mapping
somewhere else which hurts our
sensitivity
in contrast humann2's tiered search is
both more specific
and more sensitive in terms of putting
reads in the right place which gives us
a nice boost in overall accuracy
in addition because the tiered search is
trying to explain as much as possible
using that reduced pan genome database
it tends to process the metagenome a lot
faster than the comprehensive search
you're spending more time working with a
small database than you are the large
database with the tiered search
and so in this particular synthetic
example it was about a 7x speedup
but the last thing which is really one
of the key advantages of humann2 here is
that in addition to getting
us to community totals of different
functions which both methods can do
humann2's tiered search is able to
reconstruct functions on a species by
species basis within the community
and what we can see here is our
sensitivity for those functions across
species is very very high
down to about 1x coverage once we get
below 1x coverage
you're actually not sampling the entire
genome anymore and so while the
gold standard says you should be able to
find everything
because reads weren't necessarily
sampled from every gene we don't see
them
however it's critical to note that our
precision remains very high even for
these low abundance species meaning that
their genomes are in this database
they're in this custom database
but they're not recruiting reads that
they're not supposed to so we were very
happy with the overall performance of
the method here
in terms of performance on real world
data we've used humann2 to profile
hundreds of human metagenomes from the
human microbiome project
and there we tend to see similar
performance that we're able to move very
quickly through these metagenomes
in the pan genome search stage about an
order to two orders of magnitude faster
than in the translated search
now that doesn't work that well for us
if we don't end up explaining a lot of
reads in pan genome search
and indeed what we find is that up to
about 60 percent of reads in a typical
metagenome are explained by the pangenome
mapping during that fast step
with about 15 percent of additional
reads explained when we then
take the rest of the reads and push them
through the translated search
so we're really explaining the majority
of what we can explain during this
accelerated
initial tier of the search
so that's that's some benchmarking and
accuracy stats for you
moving into some actual science we've
then looked at the profiles that humann2
produced
from those hmp metagenomes and we've
isolated metabolic pathways that we call
signatures for particular body areas
meaning that they're really well
conserved within a particular body area
and tend to be absent from other body
areas so in this example at the top
rhamnose degradation we see that it's
quite abundant and conserved in
gut metagenomes across individuals and
fairly rare at these three other
microbiome body sites so this is sort of
a grand overview
if we zoom in on that top example here
to see how this looks sample by sample
we can see that indeed humann2 is
showing us a lot of very consistent
abundance for this pathway
of about six parts per thousand across
the gut metagenomes and then it drops
off very quickly thereafter that we
don't really see much abundance for
those pathways at the other
sites in addition this is a little
tricky to see here but this light gray
down here is the
unclassified amount of the uh the
pathway that's
identified during the translated search
the rest of this in darker gray
outside that box is actually being
assigned to a particular species so in this
particular example
not only are we seeing this pathway
consistently across gut metagenomes but
we're able to assign it
assign the majority of the copies of
that pathway to particular species
and when we actually take those signature
pathways and dig into that species level
attribution
we see some interesting patterns so for
example if we zoom in on what's going on
in that little box there for the gut
metagenomes and take a look at the
species attribution
we can see different patterns of how
species contribute to a conserved
pathway or how a pathway is conserved
across different individuals
in the case of this gut pathway rhamnose
degradation
the attribution is actually relatively
complex that although the total is
fairly well conserved across individuals
we see very different mixtures of
species contributing the pathway from
one person to another
most individuals have a handful of
species that are contributing
and those aren't necessarily the same
from one person to the next
suggesting that the overall abundance of
this pathway seems to be more conserved
than its taxonomic attribution
i'll contrast that with another
mechanism where we can observe a per
person
dominant attribution of the pathway so
for example this is peptidoglycan
biosynthesis from the vaginal microbiome
here again we see a relatively constant
abundance across individuals
but a very different pattern of
attribution that each individual is
dominated by about
one species mostly in the genus
lactobacillus
and so while they all wind up with about
the same abundance of the pathway it
tends to be contributed by just
one species per person that may differ
between people
a last mechanism is a universal dominant
pattern of attribution and
an example of this is trehalose
degradation on the skin
this is a pathway that's provided by
propionibacterium acnes it's not really
seen
at other body sites and on the skin this
pathway is fairly consistently
and completely provided by just
propionibacterium
acnes across the population so unlike
the previous two examples
where different species could contribute
the pathway in different mixtures
here on the skin we're really just
seeing this one very common skin bug
p. acnes contributing this
signature pathway for the skin across
individuals
and so humann2's tiered search combined
with the species level profiling
allows us to see this sort of taxonomic
resolution to function
that we haven't been able to do in
previous approaches
as a final biological example we'll move
outside of the human microbiome project
to another cohort that
two of my colleagues jason and galeb
work on
this is a cohort of health professionals
within the boston area where we have
both
metagenomes and metatranscriptomes of
the human gut
and jason and galeb have profiled these
samples using humann2
and when we look at the dna level
contributions for for pathways in this
uh
group we see a lot of things that are
similar to that first mechanism i showed
a
complex attribution where a particular
pathway is relatively conserved in the
gut but can be contributed by multiple
organisms per person and those
potentially differ between people
so this is looking at the the dna level
we see that this is also a fairly
conserved pattern between individuals of
this mixture of bugs
when we look at the rna data however we
see a very different picture
that there's more of a gradient of some
people having a complex attribution
pattern in the rna pool
whereas in others the rna pool is
completely dominated by a single species
faecalibacterium prausnitzii
and so here what we're seeing from humann2
is this ability to distinguish
the functional potential of a community
in this case
numbers or relative copy numbers of
encoded pathways across bugs
with functional activity the actual
relative expression of that
that pathway within a community and see
that the two are not always the same
so in summary humann2 implements this
tiered approach which is a new approach
to functional profiling that aims to
explain as many reads as possible with
progressively
broader and less specific databases this
approach is more accurate and a lot
faster than the traditional approach of
just doing a comprehensive search
against an exhaustive database
and lastly we get stratification of our
results by species for free in this
process which allows us to access both
those questions of who's there
and what they're doing at the same time
which is increasingly what we want to
know not just what a community is doing
but what species are actually performing
those functions in the community
humann2 is available now if this is
something that's of interest to you it's
the first hit if you google humann2
it's installable via source via pip as a
python package and
via homebrew we have a fairly detailed
user manual as well as a very active
user group on google groups where you
and i can converse by email if you'd
like
and so i'd encourage you to try it out
humann2 is part of a broader menagerie
of tools that the huttenhower lab has
put together for analyzing microbiomes
both in terms of profiling data
profiling metagenomes from raw
sequencing as well as doing downstream
statistical analysis on those results if
you'd like to learn more about that
overall system
we'll be doing a technology track
presentation tomorrow evening that will
give more of a broader survey of these
tools than i've done now
and also if you stay tuned my colleague
ali will be presenting one of our
statistical methods for the analysis of
paired high-dimensional data sets
so a big thanks to the lab especially
the humann2 team highlighted there in
green for
having a lot of fun with this project
our collaborators on humann2 and
also the entire human microbiome project
for providing a lot of excellent data
for us to work with
thank you