Tuning OTel Collector Performance Through Profiling - Braydon Kains, Google
Summary
TL;DR: In this talk, Braydon Kains, a contributor to OpenTelemetry, discusses performance tuning of the OpenTelemetry Collector through profiling. He introduces profiling as a means to measure program activity at specific locations, akin to time-series data points. Using pprof, the profiling tooling built into Go, he demonstrates how to analyze and improve the Collector's performance, addressing issues like a suspected memory leak and inefficient process metric collection. The talk showcases the utility of profiling tools for developers and users alike, emphasizing their accessibility and the potential for continuous profiling to enhance understanding and optimization of applications.
Takeaways
- 🔧 The speaker, Braydon Kains, discusses the use of profiling for performance tuning of the OpenTelemetry Collector and emphasizes the accessibility of profiling tools.
- 📈 Profiling is likened to taking measurements at different locations in a program over time, similar to how metrics are measured at different timestamps.
- 📊 Profiling formats such as pprof (which OpenTelemetry profiles are based on) support multiple types of signals and measurements, including CPU and memory profiling.
- 🛠️ The OpenTelemetry Collector is written in Go, which has built-in support for pprof, simplifying the process of profiling.
- 🔬 Case studies are presented to demonstrate the use of profiling in identifying performance issues, such as a potential memory leak in the Prometheus receiver.
- 🔍 The flame graph view in the pprof web UI provides a visual representation of memory allocation and can help pinpoint the areas of a program consuming the most resources.
- 🔄 The speaker identifies potential cardinality leakage feeding the Cumulative to Delta processor, suggesting a need to configure cache eviction.
- ⚙️ Process metrics collection on Windows is identified as inefficient due to the high cardinality and system call costs, leading to a quest for optimization.
- 💡 Windows Management Instrumentation (WMI) is explored as a more efficient method for retrieving parent process IDs, reducing the work done in a single scrape.
- 🚀 The speaker shares ongoing work to improve the OpenTelemetry Collector's process scraping efficiency on Windows, with a PR pending merge.
- 🌐 The talk concludes with an encouragement for OpenTelemetry developers and users to utilize profiling tools to understand and improve their collector's performance.
Q & A
What is the main topic of the talk given by Braydon Kains?
-The main topic of the talk is tuning the OpenTelemetry Collector's performance through profiling, specifically using the pprof tooling built into Go to analyze performance problems.
What is the purpose of profiling in the context of the talk?
-Profiling is used to measure and analyze the performance of a program at different locations, similar to how metrics are used to measure something over time. It helps to identify performance issues such as memory leaks or CPU usage inefficiencies.
What is the pprof format and why is it significant in the talk?
-pprof is a profiling format that supports multiple types of signals and measurements. It is significant because it is the format OpenTelemetry profiles are based on, and its tooling is built into Go, the language the OpenTelemetry Collector is written in.
What is the difference between a metric and a profile in the context of performance monitoring?
-A metric is a time series data point, a measurement of something at a specific time. A profile, on the other hand, is a measurement taken at a specific location in a program, and when aggregated, it provides insights into what the program was doing over the measured locations.
What is Braydon Kains' role in the OpenTelemetry project?
-Braydon Kains is a code owner on the host metrics receiver and a member of the System Metrics Semantic Conventions working group. He mainly focuses on system metrics and works on the Google Cloud Ops Agent.
What was the issue with the Prometheus receiver that Braydon Kains investigated?
-The issue was a suspected memory leak in the Prometheus receiver, where the memory usage was constantly growing over time.
What is a flame graph and how is it used in the context of the talk?
-A flame graph is a visualization tool used in profiling to show where different allocations are happening in the program. It helps identify areas of the program that are using more resources, which can be indicative of performance issues.
What was the conclusion from the memory profiling of the Prometheus receiver?
-The conclusion was that the Prometheus receiver itself was not allocating a significant amount of memory. The growth in memory usage was attributed to the Cumulative to Delta processor's cache, suggesting a cardinality leakage issue upstream.
What is the significance of the Cumulative to Delta processor in the context of the memory leak investigation?
-The Cumulative to Delta processor was identified as the main consumer of memory, suggesting that it was storing too much data without evicting old entries, producing leak-like memory growth.
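The eviction feature referred to here is exposed as configuration on the processor. A sketch of how it might look — the `max_staleness` option name is taken from the contrib processor's documentation at the time of writing, so double-check it against your Collector version:

```yaml
processors:
  cumulativetodelta:
    # Evict a tracked metric identity if no new point for it has been
    # seen within this window; this bounds memory when a pipeline leaks
    # cardinality, at the cost of restarting deltas for stale series.
    max_staleness: 10m
```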
What was the second case study presented in the talk?
-The second case study focused on improving the efficiency of process metric collection, specifically addressing the high CPU usage when getting the parent process ID for every process on Windows using the host metrics receiver.
What was the solution proposed to reduce CPU usage in the process metric collection on Windows?
-The proposed solution was to use Windows Management Instrumentation (WMI) to retrieve the parent process IDs for all processes in a single query, instead of calling the Win32 API's CreateToolhelp32Snapshot per process, which was found to be inefficient.
What are the potential benefits of using the OpenTelemetry profile format across different languages?
-The potential benefits include a standardized way to profile applications, allowing for consistent tooling and analysis methods across different programming languages, making it easier to identify and solve performance issues.
Outlines
🔍 Introduction to Profiling for Performance Tuning
The speaker, Braydon Kains, introduces the topic of using profiling for performance tuning of the OpenTelemetry Collector. He emphasizes the importance of understanding profiling signals and encourages the audience to explore the profiling tools that are already available. Braydon offers a simple analogy for profiling: where a metric is a measurement taken at a point in time, a profile is a measurement taken at a location in a program, aggregated across locations to build a picture of the program's behavior. He also surveys profiling formats, focusing on pprof and its extension, OpenTelemetry profiles, which are backwards compatible with existing tooling; pprof support is built into Go, the language in which the OpenTelemetry Collector is written.
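The pprof extension mentioned above takes only a few lines of Collector configuration to enable. A minimal sketch — option names follow the contrib extension's README at the time of writing, and the port shown is its commonly documented default, so verify both against your Collector version:

```yaml
extensions:
  pprof:
    endpoint: localhost:1777  # HTTP server exposing /debug/pprof/*

service:
  extensions: [pprof]

# With the Collector running, fetch a profile using the tooling that
# ships with Go, e.g.:
#   go tool pprof http://localhost:1777/debug/pprof/heap
```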
🔎 Case Study: Investigating Memory Leaks in Prometheus Receiver
The speaker presents a case study involving a suspected memory leak in the Prometheus receiver of the OpenTelemetry Collector. He describes using pprof to take hourly profiles to understand memory allocation over time. Initially, the Prometheus receiver's memory allocation was not significantly large, but over a period of five hours the heap grew substantially. The speaker identifies the Cumulative to Delta processor as the dominant consumer of memory, and its usage continued to grow across profiles. This led to the hypothesis of cardinality leakage within the metric pipelines. The speaker suggests configuring the processor's cache eviction to bound memory growth and discusses the importance of understanding and addressing cardinality leakage.
🚀 Improving Efficiency in Process Metric Collection
Braydon Kains discusses his efforts to improve the efficiency of process metric collection, which is known for its high cardinality and costly system calls. He presents a CPU profile of the process scrape on Windows, revealing that 88% of the sampled time was spent retrieving the parent process ID for each process. The method used, the Win32 API's CreateToolhelp32Snapshot, was found to be inefficient because it captures extensive data for each process. Braydon explores an alternative approach using Windows Management Instrumentation (WMI) to perform a single query for all process information, reducing the sampled time for a single scrape from 740 milliseconds to 220 milliseconds. He also mentions the possibility of further reducing scrape time by disabling the parent PID attribute when it is not required, bringing the scrape down to roughly 90 milliseconds.
🌟 Conclusions and the Future of Profiling with OpenTelemetry
In conclusion, the speaker emphasizes that profiling is not magic but a skill accessible to everyone. He encourages developers and users to leverage profiling tools to understand and improve their applications. Braydon highlights the usefulness of the pprof extension both for targeted issue investigation and for sharing profiles with maintainers when reporting issues, and notes that many vendors offer continuous profiling solutions. He expresses excitement for the adoption of OpenTelemetry profiles across tooling and languages, envisioning a future where one profiling standard can be tooled uniformly across programming environments.
Keywords
💡otel Collector
💡profiling
💡pprof
💡otel profiles
💡flame graph
💡memory leak
💡cardinality leakage
💡process metric collection
💡CPU profile
💡Windows Management Instrumentation (WMI)
💡continuous profiling
Highlights
Introduction to tuning the OpenTelemetry Collector performance through profiling.
Emphasis on the accessibility of profiling tools for performance analysis.
Speaker's introduction: Braydon Kains, a contributor to the OpenTelemetry Collector and system metrics.
Explanation of profiling as a measurement at a location in a program over time.
Overview of profiling formats, including pprof and OTel profiles.
pprof's support for multiple signal types, including CPU and memory profiling.
Built-in profiling tools in Go for OpenTelemetry Collector.
Demonstration of using the pprof extension with the OpenTelemetry Collector.
Case study on Prometheus receiver performance and potential memory leak.
Use of pprof to diagnose and analyze memory allocation issues.
Investigation of the Cumulative to Delta processor's memory usage over time.
Hypothesis of cardinality leakage in metric pipelines.
Efficiency improvements in process metric collection.
CPU profiling of process scraping on Windows and identification of inefficiencies.
Implementation of Windows Management Instrumentation for more efficient process metadata retrieval.
Performance improvement by disabling parent PID in process scraping.
Conclusion emphasizing the power of profiling tools for developers and users.
Encouragement for using profiling to understand and report issues in collectors.
Anticipation for the proliferation of OpenTelemetry profiling in various tooling.
Transcripts
All right, so I'm going to be giving a talk about tuning the OTel Collector's performance through profiling. This is specifically about some of the work I've been doing using pprof, which is built into Go, to make some minor improvements and analyze some performance problems with the OpenTelemetry Collector. This is sort of a two-pronged talk: one prong is to build excitement for the profiling signal in general, if you're not familiar with it, but also to show that these tools already exist if you're interested in getting into profiling. It's all very accessible.

I'm Braydon Kains. You might know me through GitHub as braydonk — or, if you're a big meanie head, "Brayon". I work on the OpenTelemetry Collector and semantic conventions, mainly focused on system metrics, since we use them heavily in the Google Cloud Ops Agent, which is my team's tool. I'm a code owner on the host metrics receiver, and I'm a member of the System Metrics Semantic Conventions working group.
I'm going to start by grossly oversimplifying what a profile is, to explain it if you've never heard of it. This is the analogy that cracked it for me. If you think of a metric, usually it's a time-series data point: a measurement of something at a timestamp. If you map that data over a span of time, you can start to paint a nice picture of what was going on with that measurement over that span of time. The way I like to think of profiling is kind of like that, except instead of a measurement taken at a point in time, it's a measurement taken at a location in a program. When you map that over all the locations available to the program, you can start to paint a picture of what your program was doing with the measurement you're taking, whether that be CPU- or memory-based.

These are some of the profiling formats you might be familiar with. My first time using profiling was actually with Callgrind — a tool under Valgrind — and I had no idea what any of it meant. pprof is the format I'm going to be talking about today, and OTel profiles are an extension of pprof; the current proposed version of the data model is actually a straight extension of pprof, backwards compatible with the old format.

The reason I'm talking about pprof today is mainly that, for one thing, it supports multiple types of signals and measurements. Callgrind, for example, is focused specifically on CPU profiling, and Linux kernel perf is another popular one that's very focused on the CPU side of things, but pprof does both, and we're going to look at both today. And, as I already mentioned, it's the format that OTel profiles are based on. The tools to use it are built right into Go, which the OpenTelemetry Collector is written in — you might have applications written in it as well — and I'm going to be demonstrating a little bit of that today.

To use pprof with the OTel Collector, you can use the pprof extension. This automatically configures the things you would otherwise need to configure manually — which is not all that hard, really, but it's even more convenient that it's built in. It will actually spin up a pprof server that you can query yourself with the pprof tool to get a specific type of profile at a certain time. You do that with this command. If you have Go installed, this is already there — you don't have to do any extra `go install` or anything like that; it's built into the installation. And that's about it. You have to install Graphviz to get the graphs and such, but for the most part, this is all you need to do to get started.
I'm going to be looking at a couple of case studies, and because I know I'm playing us into the break and everybody's excited for coffee, I'm going to try to be a little bit brief, but hopefully still get all the information we need out of it.

This issue came to my attention when I was talking internally about Prometheus receiver performance work I was trying to do. The issue was opened by Enrique, whom you might know from his YouTube channel, "Is it Observable". He was doing performance testing of different Prometheus scrapers, and he was having issues with the Prometheus receiver: it looked like it was leaking memory, because memory usage was constantly growing over time. So I opened up pprof to try to get some information.

When you open up the pprof web UI, this is roughly what you get. The default view here is a graph, which is pretty good for memory — maybe not so much for CPU profiles — and you can see where different allocations are happening and where different spots are holding memory. Today we're actually going to look specifically at the flame graph visualization. (It's so zoomed in — this is going to be fun, since I'm on a lower resolution.)

What I had Enrique do was take profiles every hour over a period of time, since I couldn't really replicate his setup — sort of budget continuous profiling. In this first profile, we're working with half a gig of heap space. When you look at a pprof memory profile, you're looking specifically at the heap. More memory ends up getting used than that: if you read the number from your system, you'll get a different number than from the profile, because the heap is only one part of the memory map. But the heap is what matters when we're talking about memory leaks, because the region of memory that grows in a leak is usually going to be the heap.
For an issue called "Prometheus receiver memory leak", the spot where the Prometheus receiver is actually allocating memory is really not that big, and I was a bit surprised by that — but this is pretty early in the measurement. The biggest thing we see right now is actually the Cumulative to Delta processor. It works by storing an original point of the metric to do a delta calculation against, because that's how delta metrics work. So for the first profile, early in the run, we'd expect this to be relatively large: if you're converting a lot of metrics, there needs to be an original point stored for each metric identity to calculate against. Maybe that's okay for the first profile. What I expected to see in the later profile — I have another from about five hours later — was the Prometheus receiver's region of memory growing.

If we look at the profile from five hours later, it has grown quite a bit — we're up to two and a half gigs of heap space — but the shape is basically the same. The Cumulative to Delta processor is still taking up the most, and the numbers themselves have continued to grow. When I thought I had more time, I was going to take us all on an adventure through how this works inside the Cumulative to Delta processor, but we don't have time. So, basically, what this led me to believe is that there is some manner of cardinality leakage. This sync.Map LoadOrStore — this map is a store of different metric identities, like the name, the labels, and the label values, so that every unique time series has its own original data point stored. This region of memory continued to grow, and through every profile it was consistently the one growing, which led me to believe that one of the metric pipelines has some manner of cardinality leakage. The Prometheus endpoints he was scraping were popular community exporters, and I don't really know which one is the culprit — we haven't figured that out yet. But there is a feature in the Cumulative to Delta processor that will evict old entries if an identity hasn't been seen in a certain amount of time, so I'm having him configure that, hoping to see that region of memory not grow too much if we do have cardinality leakage in one of the pipelines. That's a good lesson about using Cumulative to Delta: make sure you're not leaking too much cardinality, because this can happen — or at least make sure you configure the cache eviction.
Oops, that's the wrong one. The second case study we're going to look at: I'm a host metrics code owner, and one of the crusades I've been on is making process metric collection more efficient, because process metrics are very high cardinality, and to get a lot of the information you need, you have to make system calls, which is very expensive. We had these two issues; we're only going to look at one of them today, as I don't have time. If you want to talk about the first one, where we were looking at the host metrics receiver on Linux — the fix for that has actually landed — come find me after. I'm going to look at the slightly more interesting one, the second issue, related to process scraping on Windows. The challenge of the host metrics receiver, of course, is that a lot of the process metrics are the same across platforms, but the implementation of how you get them is completely different.

I'm going to look at two profiles. This first one — now we're into CPU profiles; the last one was a memory profile — is a CPU profile of one scrape. When you take a CPU profile, you run it over a certain amount of time, and the profile samples events at a certain rate. The 740 milliseconds we see here is how much on-CPU work was sampled, and it's representative of one process scrape, which scrapes all the processes on the system and records all their metrics. That usually happens on an interval — about a minute in this case — and I sampled for 40 seconds, so that's what we're looking at.
Beware of the jump scare. The width of each section is what proportion of the work was being taken there, and — I'm a little squished because of the zoom, so you can't see it — 88% of the time was spent getting the parent process ID for every process in the scrape. When I saw this, I nearly jumped out of my chair, because it really shouldn't be that ridiculous. But it turns out that in the Win32 API, this call, CreateToolhelp32Snapshot, is — as far as I can see, and as far as the gopsutil maintainers can see — the best way to get the parent process ID for a process. But this snapshot captures tons and tons of information, including heap and thread space, for the entire process, and it's doing this for every process. That is extremely expensive, and I went hunting for a better way to do it.

There are a lot of Microsoft people here, so they're going to know this part. The best other way I could see to get the parent process ID was through Windows Management Instrumentation. It has a SQL-like query language where you can query information about the system, and specifically about processes, and I'm already using it on an old metric I implemented to get the handles belonging to a process. The only really good way I could see to get that information in the Win32 API was an unsupported NtQuery-style system-information call — I forget exactly what it's called, but it's an unsupported Win32 API, and we weren't really super excited about using that.
So I started using Windows Management Instrumentation to get the information for every process in one query, and then organizing that information as we're scraping processes and doing this get-process-metadata work. That led to this second profile, of the new version I came up with, and we're down to only 220 milliseconds of work — we cut quite a bit out of a single scrape by doing this in a WMI query. The query is still the most expensive part of the process scrape, but this actually has a sort of back-door improvement too: if you already use the process.handles metric — if you've enabled that — then the information all comes in one query, and it's not more expensive to get that second metric. So this is a pretty good improvement. The PR for it is open; I haven't got it merged yet, but I'm hoping we can get it merged.

The other thing I did was make it so that if you disable the parent PID — if you just don't really care about that resource attribute — you can delete it, and what you get is... let's see... you get down to 90 milliseconds. So if you don't want the parent PID when scraping processes on Windows — if you don't care about that — you can get much more efficient just by disabling it, once the PR is merged. Hopefully that will be soon.
I'm pretty close to being out of time, I'm sure, so I'm just going to go quickly through some of the conclusions. The main thing I want you to take away is that this isn't magic — anybody has the power to understand this, if I have the power to understand this — and the tools are readily available to everyone. So if you're an OTel Collector developer, or really if you work in any language, there are probably similar solutions for this. We don't have to wait for OTel profiling to be ready to start solving some of the problems that profiling is good at. And if you have a collector — even if you're not a developer of the Collector, but you want to understand a little bit more about what your collector is doing, or you want to report a GitHub issue — using the pprof extension, and understanding how to show screenshots or send profiles along to maintainers, is very useful.

In this case I was looking at individual problems — trying to target specific issues at specific times, manually taking profiles — but there are a lot of solutions out there for continuous profiling. I tried to focus this talk only on generic solutions, nothing vendor-specific, but lots of vendors have great tools: even better flame graph views, continuous profiling over time. If you want to look at it, you can use it a lot like tracing, but get more granular information about your program instead of about your whole system.

And I'm excited for OTel profiling — basically, that's the end of it. I'm really excited for this to proliferate through other tooling, so that we can use the same profiling standard and tool it the same way across different languages. I think that's very exciting. That's everything I've got. I don't know if there's time for questions if we're on break, but you can come find me afterwards. Thanks.

[Applause]