Using Native OpenTelemetry Instrumentation to Make Client Libraries Better - Liudmila Molkova
Summary
TL;DR: Liudmila Molkova from Microsoft discusses the importance of observability in Azure SDKs for library owners, who often lack visibility into their libraries' post-release behavior. She emphasizes the need for detailed telemetry to diagnose issues efficiently. Molkova illustrates how OpenTelemetry can be leveraged during development, integration testing, and performance testing to optimize libraries and improve user experience. She concludes by highlighting the necessity of embracing network issues and the value of user feedback for refining library instrumentation.
Takeaways
- Liudmila Molkova is a new member of the OpenTelemetry technical committee and a maintainer of the OpenTelemetry semantic conventions.
- OpenTelemetry is used in Azure SDKs to improve observability, which is typically considered from the user's perspective, but library owners also need observability to understand what happens after their libraries are released.
- Library owners often lack visibility into their libraries' performance and usage post-release due to privacy concerns and the absence of self-collected telemetry.
- Developers can act as 'user zero' by collecting and analyzing telemetry during the development and testing of their libraries to gain insights into their performance and identify areas for improvement.
- Observability during development time is crucial, as developers have the context and control to make meaningful changes and optimizations based on the telemetry data.
- OpenTelemetry can help identify inefficiencies in library operations, such as unnecessary HTTP requests or authentication issues, by analyzing traces and logs.
- Integration testing can be improved with observability, as it helps pinpoint the causes of flakiness and bugs in retry policies and configurations.
- Performance testing benefits from OpenTelemetry by allowing developers to simulate realistic scenarios, including network issues, and to monitor the service under load.
- Telemetry data from performance and reliability testing can reveal insights such as excessive buffer allocation, thread pool size misconfiguration, and memory leaks.
- Observability helps in debugging and fixing issues that arise during testing, leading to better performance and reliability of the libraries.
- Library owners should be their own 'user zero' to understand their libraries deeply, but they also need feedback from actual users to refine the telemetry and ensure it is useful for end users.
Q & A
Who is Liudmila Molkova and what is her role at Microsoft?
-Liudmila Molkova works at Microsoft. She is a new member of the OpenTelemetry technical committee and a maintainer of the OpenTelemetry semantic conventions.
What is the primary focus of Liudmila Molkova's talk?
-The talk focuses on how OpenTelemetry is used in Azure SDKs to improve their observability, and on the importance of observability for library owners.
What does Liudmila Molkova suggest about the observability of libraries after they are released?
-She suggests that library owners often lack visibility into what happens to their libraries after release. They typically do not collect telemetry for themselves due to privacy concerns and the large volume of data involved.
Why is detailed telemetry important for library owners?
-Detailed telemetry is important for library owners because it helps them understand the issues users face, avoid back-and-forth communication, and collect comprehensive data to reproduce and fix issues efficiently.
How can library developers use telemetry during the development phase?
-Library developers can use telemetry during the development phase to collect feedback, analyze data, and optimize their libraries. They can be the 'users' who decide how to collect and analyze telemetry data.
What is an example of how telemetry can help in identifying issues during the development of a library?
-An example given is a complex operation that downloads multiple layers of an image from a container registry, where the trace showed repeated 401 (Unauthorized) responses on every chunk. This telemetry let the developers ask why the token was not being reused and optimize the authentication flow.
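A minimal sketch of how a library can natively produce such spans, using the real OpenTelemetry Java API; the chunked-download flow, class, and helper methods here are hypothetical illustrations, not the actual Azure SDK code:

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public class ChunkedDownload {
    // Hypothetical instrumentation scope name.
    private static final Tracer TRACER =
        GlobalOpenTelemetry.getTracer("com.example.containerregistry");

    // Wraps each chunk request in its own span, so repeated 401s show up
    // in the trace as a group of error spans, one per chunk.
    byte[] downloadChunk(String blobUrl, long offset, long length) {
        Span span = TRACER.spanBuilder("downloadChunk").startSpan();
        try (Scope ignored = span.makeCurrent()) {
            HttpResult result = getRange(blobUrl, offset, length);
            span.setAttribute("http.response.status_code", result.status());
            if (result.status() == 401) {
                // A 401 on a later chunk suggests the token from the first
                // chunk was not reused -- the inefficiency seen in the talk.
                span.setStatus(StatusCode.ERROR, "unauthorized");
            }
            return result.body();
        } finally {
            span.end();
        }
    }

    // Hypothetical HTTP helper and result holder.
    record HttpResult(int status, byte[] body) {}

    private HttpResult getRange(String url, long offset, long length) {
        return new HttpResult(200, new byte[0]); // placeholder
    }
}
```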
What role does observability play in integration testing?
-In integration testing, observability helps in debugging tests and identifying bugs in retry policies and configurations. It is crucial for understanding the root cause of test flakiness.
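One concrete way to apply this, sketched under assumptions (the runner, scope name, and test body are placeholders, not an Azure SDK utility): wrap each integration test in a root span and print the trace ID on failure, so every log line and child span from that exact run can be pulled up with a single ID.

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public class TracedTestRunner {
    // Hypothetical instrumentation scope name.
    private static final Tracer TRACER =
        GlobalOpenTelemetry.getTracer("com.example.integration-tests");

    // Runs a test inside a root span; on failure, the printed trace ID
    // groups every log line and child span from that exact run.
    static void runTraced(String testName, Runnable testBody) {
        Span root = TRACER.spanBuilder(testName).startSpan();
        try (Scope ignored = root.makeCurrent()) {
            testBody.run();
        } catch (Throwable t) {
            root.recordException(t);
            System.err.printf("FAILED %s, traceId=%s%n",
                testName, root.getSpanContext().getTraceId());
            throw t;
        } finally {
            root.end();
        }
    }
}
```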
How can performance testing benefit from OpenTelemetry?
-Performance testing can benefit from OpenTelemetry by providing detailed insights into network issues, resource utilization, and system behavior under load. It allows for more realistic testing scenarios and easier identification of performance bottlenecks.
What are some of the performance improvements identified through telemetry in the talk?
-Some performance improvements identified include right-sizing buffer allocations, matching the thread pool size to the configured concurrency, and fixing a message-prefetching bug that in one case reduced memory usage a thousandfold.
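The buffer finding reduces to a one-line difference; this illustration uses made-up sizes and names, not the actual SDK code:

```java
import java.nio.ByteBuffer;

public class BufferSizing {
    // Before: always allocating 1 MB regardless of payload size, which
    // under sustained load showed up as high CPU, high memory, and low
    // throughput in the long-running tests.
    static ByteBuffer allocateFixed() {
        return ByteBuffer.allocate(1024 * 1024);
    }

    // After: allocating only what the payload needs (knownContentLength
    // is a hypothetical parameter), removing the excessive allocations.
    static ByteBuffer allocateRightSized(int knownContentLength) {
        return ByteBuffer.allocate(knownContentLength);
    }
}
```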
What is the importance of being the 'user zero' for library developers?
-Being 'user zero' allows library developers to gain firsthand experience with their libraries, collect telemetry data, and understand user needs. However, it's also important to gather feedback from 'user one', 'user two', and beyond to correct initial mistakes and improve the library further.
How does OpenTelemetry help in long-term performance and reliability testing?
-OpenTelemetry helps in long-term performance and reliability testing by providing detailed telemetry data over an extended period. This data allows developers to pinpoint issues and understand system behavior under various conditions, including regular network issues.
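A sketch of how such long-running tests can report metrics through the real OpenTelemetry Java metrics API; the meter scope, metric names, and attributes are hypothetical:

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.metrics.DoubleHistogram;
import io.opentelemetry.api.metrics.LongCounter;
import io.opentelemetry.api.metrics.Meter;

public class StressTestMetrics {
    private static final Meter METER =
        GlobalOpenTelemetry.getMeter("com.example.stress-test"); // hypothetical scope

    private static final DoubleHistogram LATENCY = METER
        .histogramBuilder("test.operation.duration") // hypothetical name
        .setUnit("s")
        .build();

    private static final LongCounter ERRORS = METER
        .counterBuilder("test.operation.errors") // hypothetical name
        .build();

    // Records one operation; over a multi-day run these series let you
    // pinpoint exactly when latency spiked or errors clustered.
    static void record(double seconds, boolean failed, String operation) {
        Attributes attrs = Attributes.builder()
            .put("operation.name", operation)
            .build();
        LATENCY.record(seconds, attrs);
        if (failed) {
            ERRORS.add(1, attrs);
        }
    }
}
```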
Outlines
Observability in Library Development
Liudmila Molkova, a member of the OpenTelemetry technical committee at Microsoft, discusses the importance of observability not just for application users but also for library owners. She emphasizes the lack of visibility into what happens to libraries post-release and the challenges of collecting telemetry data due to privacy concerns and data volume. Molkova suggests that library developers can act as users to collect and analyze telemetry, using development time as an optimal period for observability because the developer still has control over and context for the code. She illustrates this with examples of complex operations in Azure SDKs, showing how detailed telemetry can help identify and optimize issues like authentication flows and redirects.
Leveraging Observability for API Improvements
The speaker uses the example of an API designed for downloading content to show how observability can lead to performance improvements: a trailing HTTP request that only confirms the end of the stream could be avoided, cutting the operation time in half. Molkova stresses the value of library developers understanding the inner workings of their APIs and using this knowledge to guide users effectively. She also touches on the complexity hidden within libraries, such as retry policies, buffering, caching, and connection management, and how observability helps in integration testing to identify and fix flaky tests.
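A sketch of the shape of that inefficiency, with hypothetical helpers and response fields (the actual Azure API and headers differ): the loop keeps issuing ranged GETs until the server answers 416 (Range Not Satisfiable), so a fully downloaded blob still costs one extra round trip unless the total length from an earlier response is used to stop early.

```java
public class RangedDownload {
    // Hypothetical minimal response holder; totalLength < 0 means unknown.
    record Response(int status, byte[] body, long totalLength) {}

    static byte[] downloadAll(String url, int chunkSize) {
        java.io.ByteArrayOutputStream out = new java.io.ByteArrayOutputStream();
        long offset = 0;
        while (true) {
            Response r = getRange(url, offset, chunkSize); // hypothetical HTTP call
            if (r.status() == 416) {
                break; // the extra request: we already had all the bytes
            }
            out.writeBytes(r.body());
            offset += r.body().length;
            // If the response carried the total length (e.g. in a
            // Content-Range header), stop here and skip the 416 round trip.
            if (r.totalLength() >= 0 && offset >= r.totalLength()) {
                break;
            }
        }
        return out.toByteArray();
    }

    static Response getRange(String url, long offset, int chunkSize) {
        // Placeholder: a real version would send an HTTP GET with a Range
        // header and parse Content-Range for the total length.
        return new Response(416, new byte[0], -1);
    }
}
```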
Enhancing Performance and Reliability Through Observability
This section delves into how performance testing changes with OpenTelemetry. Traditional benchmarking is expanded by embracing real-world scenarios, including network issues, to test libraries more effectively. Molkova illustrates how detailed telemetry can uncover performance bottlenecks, such as excessive buffer allocation and improper thread pool sizing. She shares specific examples of performance improvements and memory usage optimizations discovered through long-term monitoring and analysis of telemetry data.
The Importance of Real-World Testing and User Feedback
In the final section, Molkova emphasizes the need for developers to test their libraries in real-world conditions to expose and address network issues that aren't apparent in controlled environments. She advocates for a level of observability high enough to debug and understand test flakiness and performance issues. She also discusses the iterative process of library instrumentation, suggesting that while developers can be 'user zero' and provide deep telemetry, feedback from actual users is crucial for refining and correcting initial implementations. Molkova concludes by highlighting the benefits of chaos engineering and long-term testing with OpenTelemetry for pinpointing issues over extended periods.
Keywords
Observability
OpenTelemetry
Semantic Conventions
Azure SDKs
Telemetry
Library Owners
Integration Testing
Performance Testing
Trace
Metrics
User Feedback
Highlights
Liudmila Molkova introduces herself as a new member of the OpenTelemetry technical committee and a maintainer of the OpenTelemetry semantic conventions.
She discusses the importance of observability in Azure SDKs and the challenges faced by library owners in understanding the post-release impact of their libraries.
Molkova notes that library owners' observability tools today amount to GitHub issues, bug tracker systems, live debugging sessions, and user-supplied logs.
The presentation highlights the difference between user observability and library owner observability, and the lack of detailed telemetry for library owners.
Molkova shares insights on how library developers can use their position as users of their own libraries to collect and analyze telemetry data for improvements.
Development time is identified as an optimal period for observability because the developer still has intimate knowledge of the code and control over the setup.
A complex trace example is presented, showing the many spans of a layered download operation and the potential for developers to identify and optimize inefficiencies.
The talk discusses the use of logs and traces to understand and improve library performance, including the identification of unnecessary HTTP requests.
Molkova explains how observability helps in integration testing by identifying flakiness and bugs in retry policies and configurations.
Performance testing is discussed, with a focus on how OpenTelemetry can provide far deeper insight than traditional benchmarking.
The presentation shares examples of performance issues discovered through OpenTelemetry, such as excessive buffer allocation and thread pool size misconfiguration.
Molkova describes a significant memory usage issue caused by improper prefetching in messaging libraries, which was resolved through OpenTelemetry insights.
The importance of being 'user zero' is emphasized, as it allows library developers to understand and improve their libraries from a user's perspective.
The need for user feedback beyond 'user zero' is discussed, to refine and correct initial telemetry implementations for broader user utility.
Molkova concludes by stressing the importance of embracing network issues and maintaining high observability for debugging and improving software development and testing.
Transcripts
So, I'm Liudmila Molkova, I work at Microsoft. I'm a new member of the OpenTelemetry technical committee and a maintainer of the OpenTelemetry semantic conventions. Today I'm going to share how we use OpenTelemetry in our Azure SDKs to make them better.

When we think about observability, we tend to think about it as something intended for users, for somebody who works on the application or for somebody who runs it. Effectively, they decide which backend to use, they decide how to configure it, they can add data, they can remove data; it's their application. But what about library owners? Do we have any observability? Do we know what happens to our libraries after we release them? We don't collect telemetry for ourselves; I mean, there are privacy concerns, we would need consent, and the volume of data is enormous. So no, we don't know. And do we know if it works at all? Does it do the intended thing? Maybe. Sometimes.

So we do have some observability, but our observability is quite different. Our observability tools are GitHub issues, or maybe some bug tracker system. We do live debugging sessions with our users, we have logs, we ask users for repros. And when an issue happens, we want the impossible, right? We want detailed telemetry, because we don't want to go back and forth; we want everything, and we want it to be on by default, because we don't want you to have to reproduce issues; things should work right away. So, okay: it's every piece of telemetry possible, it's always on, it costs you nothing, it does not affect performance, and the main thing is, we want to access it on your behalf. So I guess we're out of luck, there is no hope for us, right? We cannot get it. Well... yes or no.
One thing we can do: we are the users of our libraries, right? We develop them, we test them, we test them in all different ways. So we can be the users who collect this feedback, we can be the users who decide how to collect telemetry, and we can be the users who know how to analyze this data. Let me give you some examples. There is no better time for observability than development time, right? I'm still in the context, I still know what the code is supposed to do, I didn't forget it yet. I know the setup, I control it, I can change things.
So let's see. Well, you probably can't see it, but anyway: what you're looking at is a very complicated trace, there are about 90 spans. This is part of a complex operation: it downloads multiple layers of an image from the container registry, and there are a bunch of things going on at the same time. There is authentication, there are multiple layers, and there is chunking, and it kind of looks repetitive. I'm not sure if you see it, but what I see is groups of spans, and some of them return 401. If I'm a developer who works on this library, I really want to see... what do you see? Do you see the red things? Right, yeah, awesome. So, errors. These are 401s, there are about four of them, and they are on every chunk I'm downloading. So if I'm a developer working on this library, I'm like: why? If it wasn't part of the normal authentication flow, couldn't I reuse the token on the second chunk? It should have worked if it worked the first time, right? So I can go and optimize. And then there are actually groups of redirects, and they start raising questions: do I need to redirect on every chunk? Can I optimize it? Maybe yes, maybe no, but effectively I know that there is something in the library I don't really like. And somebody can tell you: okay, I can use logs instead. Right, there is the same information, the same information you saw on the trace, it's just in logs. And well, with this one or that one, you decide.
Okay, so another example. There is a much simpler API, it just downloads something, and it has two HTTP requests underneath. The first one downloads everything; the second one has an error, it returns 416, Range Not Satisfiable. So I downloaded everything, and then I made another request just to verify: okay, this is the end of the stream. Again, as a developer who works on this library, I'm like: why do I make this extra request? Can I avoid it? In this particular case it would cut the operation time in half, a twofold improvement. In this particular case, the API I'm using is intended for cases where somebody can keep uploading stuff, so when I've done the first request I might not know whether it's the end. But as a user looking at it, I can ask why it happens, go and read the documentation, and the documentation will tell me: oh, you should probably use a different API if you can, the simpler one. And as an owner of this library, I can go and document things. I can say: okay, this API is specific, don't use it for simple downloads.
So the point here is that even if you think about a library as a thin wrapper, in fact it does a bunch of interesting things under the hood, and they are under the hood even for the library developers; it's part of some core logic. You might configure your retry policy and authentication policy in different orders, but effectively the things that happen under the hood are retries, content buffering, chunking, what not, some caching, and connection management. So it is complicated.
And now we come to an interesting problem where observability really shines: integration testing. We tend to think about integration tests as something inherently flaky; okay, it failed again, let me restart the test. But why? (I hear something talking... oh, I see, sorry.) Okay, anyway: we tend to think about integration tests as something that is inherently flaky, but why? Yes, network issues happen, but we should have a retry policy in place. Did we retry? Did we have the proper configuration? Maybe we had a five-minute timeout. They shouldn't be flaky, and when you have flakiness in your integration tests, it's a good sign that you have a bug. Why don't we debug them? Why don't we fix them? Because it's hard. The volume of these logs, these beautiful logs I showed a few slides before, is enormous, and those were at least grouped by trace ID; our logs in the CI system, if you have them at all, can be terrible. So the time when you do integration testing is the best time to use observability: to debug these tests and to actually find the bugs in your retry policy. These are the worst bugs to have, because it's very hard to detect them. Effectively, by adding telemetry to the libraries themselves, we help both sides: we help ourselves understand what our libraries do and fix issues, and we help users at the same time.
Okay, so the next part is performance testing. How did our testing look before OpenTelemetry? Well, effectively, it's benchmarking. We get a little bit more data than that, but effectively we get a number: okay, this was your throughput. If there were network issues during the test, we would see a regression and we would spend days investigating why it happened, but effectively the test is not valid in the presence of normal cloud, or real-life, errors, so we tend to sanitize these tests as much as possible. What changes with OpenTelemetry? Well, of course we can still do benchmarking, but that's kind of boring; we can do much more. We can embrace these network issues, we can even simulate them. We can test our libraries in a realistic scenario, the way users use them, not in a perfect world. And in order to do this we need to apply some real load, we need to inject some failures, and we need to run it for a while. At this point it becomes a service: the stress test or reliability test is just a service that you monitor, similarly to anything else. You enable the same observability you would want your users to enable, and you can collect all the data you want.
And how might it look? I'm pretty sure you can't see it, but we have a beautiful dashboard for the test. It has all the boring stuff, the latency, error rate, throughput; we have even more boring stuff, some CPU and memory metrics, and so on. But we have much more; it's just OpenTelemetry, so you go ahead and look at traces, and if you have continuous profiling enabled it becomes even better. I want to share some examples of things we were able to find with these tests. Even though they rely on some basic metrics, finding them, detecting them, and solving them would not be possible without all the richness of different signals we get with OpenTelemetry.
So the first one: we allocated buffers of excessive size. We could have allocated the precise size, which is small, but we said, okay, we will always allocate one megabyte for this. What happens? High CPU, high memory, lower throughput than we expected. We take a memory dump, we see all the buffers, we fix it, and we get much higher throughput. It's all possible because we run the test for a long time and can compare easily.
Then the other story: the thread pool size. Our messaging libraries allow you to configure concurrency, and a user can come and say, okay, I want 500 messages processed in parallel. But what happens if you don't configure your thread pool size accordingly? Your concurrency is wasted; you don't have the threads to accommodate it. And you see low throughput, but also low resource utilization; you're underutilizing your resources. In this case you go and check the number of threads and, boom, it scales linearly.
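A minimal sketch of the mismatch described here, with made-up numbers (real pool sizing also depends on whether the work is CPU- or IO-bound):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ConcurrencyConfig {
    // If the user asks for 500 messages in parallel but the pool has only
    // a few threads, the configured concurrency is wasted: throughput
    // stays low while CPU and memory sit underutilized. Sizing the pool
    // to match the requested concurrency is what made throughput scale.
    static ExecutorService poolFor(int configuredConcurrency) {
        return Executors.newFixedThreadPool(configuredConcurrency);
    }
}
```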
And this one is my favorite of all time. It shows a fix that reduces memory usage about a thousandfold. Hard to imagine, but it's a great argument for people who say all the problems come from the network and your code just cannot do something that stupid. Well, it can. There are two bugs here. What happens: our messaging libraries allow you to prefetch stuff, so you process a batch of messages, and in parallel they go to the broker and prefetch a few more; then, when you come back and finish processing, you get the next batch right away, you don't need to wait for it. Okay, so we configure a thousand messages to be prefetched, we start the test, memory grows exponentially, boom, out of memory. We look at the memory dump and there are four million of these messages in there. So one bug is on us. The second bug, well, it's also on us, but I want to blame the framework. What we see here is Reactor, my favorite framework on Earth. What it does is prefetch on your behalf, by default, so there is this ', 0' thing on line 23 which disables the default prefetching.
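The talk doesn't show the exact fix; as an illustration of the idea in Reactor (assuming reactor-core on the classpath; the source and numbers are made up), the `limitRate` operator bounds how many items are requested ahead of processing, so a consumer can't silently pile up thousands of prefetched messages in memory:

```java
import reactor.core.publisher.Flux;

public class BoundedPrefetch {
    public static void main(String[] args) {
        // Stand-in for a stream of messages from a broker.
        Flux<Integer> messages = Flux.range(1, 1_000_000);

        messages
            // Request at most 32 messages ahead of processing instead of
            // letting a large default prefetch buffer them all in memory.
            .limitRate(32)
            .doOnNext(BoundedPrefetch::process)
            .blockLast();
    }

    static void process(int message) {
        // Placeholder for real message handling.
    }
}
```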
With this, I want to summarize. People who don't have observability think they know their code. They don't; they just don't know, and they don't have any evidence that they don't. To actually improve SDKs we need to embrace network issues. When we develop stuff we rarely have any network problems, and we don't have the scale of production, which would expose them, so we need to make an extra effort to actually run our stuff in a real environment, exposed to these network issues. And we need the level of observability that helps us debug these issues, to understand what happened: was this test flaky because it was just unlucky, or does our retry policy not work correctly? And when we instrument libraries, we use this telemetry ourselves, and we end up with the same telemetry our users would need, because the volume is the same; we have an enormous number of tests running, we have all this performance and reliability testing. If this telemetry doesn't answer the question, or if it's too verbose for us, it's most likely also too verbose for our users and doesn't answer their questions either. Okay, that's it, thank you for coming to my talk.

[Applause]
(Q&A) Yeah, a user you can work closely with, who can provide detailed feedback, is awesome. But what I'm trying to say is: you are the user, right? You can be your user zero. With library instrumentations, library owners tend to provide very deep telemetry focused on their specific thing, and they need user feedback to actually create something that is useful for end users. So I would say yes, you should be your user zero, but you need user one, two, and three to actually correct the mistakes you made at first.

(On how to simulate failures) Oh, that's a cool one. We tried to use Chaos Mesh. I wouldn't say it was a success, but it does let you create some chaos. It's hard to control, and it's hard to push in multiple directions, but mostly it's like this: you take something, you give it a very small CPU and memory quota, and you try to load it as much as you can. When you see a bottleneck, you try to fix it and understand where it comes from. Even with this, just by running at maximum capacity you're exposing it to a lot of stuff, and by running it for, let's say, days, you get just regular network issues. Where OpenTelemetry is helpful is that after you run it for days, you can actually pinpoint the time and the problem; without it, that wouldn't be possible.