Using Native OpenTelemetry Instrumentation to Make Client Libraries Better - Liudmila Molkova

CNCF [Cloud Native Computing Foundation]
29 Jun 2024 · 18:49

Summary

TL;DR: Liudmila Molkova from Microsoft discusses the importance of observability in Azure SDKs for library owners, who often lack visibility into their libraries' post-release behavior. She emphasizes the need for detailed telemetry to diagnose issues efficiently. Molkova illustrates how OpenTelemetry can be leveraged during development, integration testing, and performance testing to optimize libraries and improve the user experience. She concludes by highlighting the necessity of embracing network issues and the value of user feedback for refining library instrumentation.

Takeaways

  • πŸ˜€ Liudmila Molkova is a new member of the OpenTelemetry technical committee and a maintainer of the OpenTelemetry semantic conventions.
  • πŸ” OpenTelemetry is used in Azure SDKs to improve observability, which is typically considered from the user's perspective, but library owners also need observability to understand what happens after their libraries are released.
  • πŸ€” Library owners often lack visibility into their libraries' performance and usage post-release due to privacy concerns and the absence of self-collecting telemetry.
  • πŸ› οΈ Developers can act as 'user zero' by collecting and analyzing telemetry during the development and testing of their libraries to gain insights into their performance and identify areas for improvement.
  • πŸ“ˆ Observability during development time is crucial as developers have the context and control to make meaningful changes and optimizations based on the telemetry data.
  • πŸ”„ OpenTelemetry can help identify inefficiencies in library operations, such as unnecessary HTTP requests or authentication issues, by analyzing traces and logs.
  • 🧩 Integration testing can be improved with observability, as it helps pinpoint the causes of flakiness and bugs in retry policies and configurations.
  • πŸš€ Performance testing benefits from OpenTelemetry by allowing developers to simulate realistic scenarios, including network issues, and to monitor the service under load.
  • πŸ“Š Telemetry data from performance and reliability testing can reveal insights such as excessive buffer allocation, thread pool size misconfiguration, and memory leaks.
  • πŸ”§ Observability helps in debugging and fixing issues that arise during testing, leading to better performance and reliability of the libraries.
  • πŸ“ Library owners should be their own 'user zero' to understand their libraries deeply, but also need feedback from actual users to refine the telemetry and ensure it is useful for end users.

Q & A

  • Who is Liudmila Molkova and what is her role at Microsoft?

    -Liudmila Molkova works at Microsoft, is a new member of the OpenTelemetry technical committee, and maintains the OpenTelemetry semantic conventions.

  • What is the primary focus of Liudmila Molkova's talk?

    -Liudmila Molkova's talk focuses on how OpenTelemetry is used in Azure SDKs to improve their observability and the importance of observability for library owners.

  • What does Liudmila Molkova suggest about the observability of libraries after they are released?

    -Liudmila Molkova suggests that library owners often lack visibility into what happens to their libraries after release. They typically do not collect telemetry for themselves due to privacy concerns and the large volume of data involved.

  • Why is detailed telemetry important for library owners?

    -Detailed telemetry is important for library owners because it helps them understand the issues users face, avoid back-and-forth communication, and collect comprehensive data to reproduce and fix issues efficiently.

  • How can library developers use telemetry during the development phase?

    -Library developers can use telemetry during the development phase to collect feedback, analyze data, and optimize their libraries. They can be the 'users' who decide how to collect and analyze telemetry data.

  • What is an example of how telemetry can help in identifying issues during the development of a library?

    -An example given is the observation of a complex operation downloading multiple layers of an image from a container registry, where repeated 401 errors were detected. This telemetry data allowed developers to identify and optimize the authentication flow.

  • What role does observability play in integration testing?

    -In integration testing, observability helps in debugging tests and identifying bugs in retry policies and configurations. It is crucial for understanding the root cause of test flakiness.

  • How can performance testing benefit from OpenTelemetry?

    -Performance testing can benefit from OpenTelemetry by providing detailed insights into network issues, resource utilization, and system behavior under load. It allows for more realistic testing scenarios and easier identification of performance bottlenecks.

  • What are some of the performance improvements identified through telemetry in the script?

    -Some performance improvements identified include reducing buffer allocation size, optimizing thread pool size, and fixing bugs related to message prefetching, which in one case resulted in a thousandfold reduction in memory usage.

  • What is the importance of being the 'user zero' for library developers?

    -Being 'user zero' allows library developers to gain firsthand experience with their libraries, collect telemetry data, and understand user needs. However, it's also important to gather feedback from 'user one', 'user two', and beyond to correct initial mistakes and improve the library further.

  • How does OpenTelemetry help in long-term performance and reliability testing?

    -OpenTelemetry helps in long-term performance and reliability testing by providing detailed telemetry data over an extended period. This data allows developers to pinpoint issues and understand system behavior under various conditions, including regular network issues.

Outlines

00:00

πŸ”¬ Observability in Library Development

Liudmila Molkova, a member of the OpenTelemetry technical committee at Microsoft, discusses the importance of observability not just for application users but also for library owners. She emphasizes the lack of visibility into what happens to libraries post-release and the challenges of collecting telemetry data due to privacy concerns and data volume. Molkova suggests that library developers can act as users to collect and analyze telemetry, using development time as an optimal period for observability due to better control and context of the code. She illustrates this with examples of complex operations in Azure SDKs, showing how detailed telemetry can help identify and optimize issues like authentication flows and redirects.
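
As a rough illustration of the direction such a trace suggests (this is not code from the talk, and the class and method names below are invented): the repeated 401s point at re-authenticating on every chunk, so caching the token after the first challenge lets the remaining chunks and layers reuse it.

```java
import java.time.Instant;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch: answer the 401 challenge once, then reuse the cached token
// for subsequent chunk downloads instead of repeating the whole auth flow.
final class CachedTokenCredential {
    record Token(String value, Instant expiresAt) {}

    private final AtomicReference<Token> cache = new AtomicReference<>();

    String getToken() {
        Token cached = cache.get();
        if (cached != null && Instant.now().isBefore(cached.expiresAt().minusSeconds(60))) {
            return cached.value(); // still valid: reuse it for the next chunk
        }
        Token fresh = fetchTokenFromAuthService(); // one challenge/exchange, not one per chunk
        cache.set(fresh);
        return fresh.value();
    }

    private Token fetchTokenFromAuthService() {
        // ... perform the challenge/response flow against the registry's auth endpoint ...
        return new Token("example-token", Instant.now().plusSeconds(300));
    }
}
```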

05:00

πŸ› οΈ Leveraging Observability for API Improvements

The speaker uses the example of an API designed for downloading content to highlight how observability can lead to performance improvements. They point out unnecessary HTTP requests that could be optimized, reducing operation time and improving efficiency. Molkova stresses the value of library developers understanding the inner workings of their APIs and using this knowledge to guide users effectively. She also touches on the complexity hidden within libraries, such as retry policies and connection management, and how observability can help in integration testing to identify and fix flaky tests.
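
A minimal sketch of using that observability in a flaky integration test, assuming the test process already has an OpenTelemetry tracer configured (the helper class below is hypothetical, not part of any SDK): wrap the operation under test in a span and print its trace ID on failure, so the failing CI run can be looked up in the trace backend instead of simply being restarted.

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.context.Scope;

public class FlakyTestHelper {
    public static void runWithTraceId(String testName, Runnable testBody) {
        Span span = GlobalOpenTelemetry.getTracer("integration-tests").spanBuilder(testName).startSpan();
        try (Scope ignored = span.makeCurrent()) {
            testBody.run();
        } catch (RuntimeException e) {
            span.recordException(e);
            span.setStatus(StatusCode.ERROR);
            // The trace ID ties this failure to every client span and log the library emitted.
            System.err.println("Test " + testName + " failed, trace_id=" + span.getSpanContext().getTraceId());
            throw e;
        } finally {
            span.end();
        }
    }
}
```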

10:02

πŸ“ˆ Enhancing Performance and Reliability Through Observability

This section delves into how performance testing can be revolutionized with OpenTelemetry. Traditional benchmarking is expanded upon by embracing real-world scenarios, including network issues, to test libraries more effectively. Molkova illustrates how detailed telemetry can uncover performance bottlenecks, such as excessive buffer allocation and improper thread pool sizing. She shares specific examples of performance improvements and memory usage optimizations discovered through long-term monitoring and analysis of telemetry data.
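
The thread-pool finding boils down to matching executor size to the concurrency the user asked for. A hedged sketch in plain Java (the option and method names are illustrative, not the actual SDK configuration):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ConcurrencyConfig {
    // If the user asks for N messages processed in parallel, the executor backing that work
    // needs roughly N threads; otherwise the configured concurrency is wasted and both
    // throughput and resource utilization stay low.
    public static ExecutorService executorFor(int maxConcurrentMessages) {
        int threads = Math.min(maxConcurrentMessages, Runtime.getRuntime().availableProcessors() * 32);
        return Executors.newFixedThreadPool(threads);
    }
}
```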

15:03

🌐 The Importance of Real-World Testing and User Feedback

In the final part of the talk, Molkova emphasizes the need for developers to test their libraries in real-world conditions to expose and address network issues that aren't apparent in controlled environments. She advocates for high levels of observability to debug and understand test flakiness and performance issues. The speaker also discusses the iterative process of library development, suggesting that while developers can be 'user zero' to provide deep telemetry insights, feedback from actual users is crucial for refining and correcting initial implementations. Molkova concludes by highlighting the benefits of chaos engineering and long-term testing with OpenTelemetry for pinpointing issues over extended periods.

Keywords

πŸ’‘Observability

Observability in the context of the video refers to the ability to understand the internal state of a system through its outputs, without needing to access the system's internals directly. It is a key concept in monitoring and maintaining the health of applications and libraries. In the video, the speaker discusses how observability is typically considered from the user's perspective but also emphasizes the importance of library owners having observability into their own libraries to understand their impact after release.

πŸ’‘OpenTelemetry

OpenTelemetry is an open-source observability framework for cloud-native software. It provides a set of APIs, libraries, agents, and instrumentation to help capture distributed traces and metrics from applications. The speaker mentions OpenTelemetry as a tool used in Azure SDKs to enhance their capabilities and discusses its role in improving observability for both developers and library owners.
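
As a minimal sketch of what native instrumentation plus "being your own user zero" can look like in Java (package, class, and operation names are made up; this is not the Azure SDK code): during development the library author can point the OpenTelemetry SDK at a simple logging exporter and wrap an operation in a client span.

```java
import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.SpanKind;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;
import io.opentelemetry.exporter.logging.LoggingSpanExporter;
import io.opentelemetry.sdk.OpenTelemetrySdk;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.SimpleSpanProcessor;

public class DevTimeTelemetry {
    public static void main(String[] args) {
        // Print every finished span to the console: good enough for "user zero" debugging.
        OpenTelemetry otel = OpenTelemetrySdk.builder()
                .setTracerProvider(SdkTracerProvider.builder()
                        .addSpanProcessor(SimpleSpanProcessor.create(LoggingSpanExporter.create()))
                        .build())
                .build();

        Tracer tracer = otel.getTracer("com.example.myclient");

        // Wrap a library operation in a client span, the way a native instrumentation would.
        Span span = tracer.spanBuilder("download_blob").setSpanKind(SpanKind.CLIENT).startSpan();
        try (Scope ignored = span.makeCurrent()) {
            // ... perform the actual download, emitting child spans per HTTP request ...
        } finally {
            span.end();
        }
    }
}
```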

πŸ’‘Semantic Conventions

Semantic conventions in the context of OpenTelemetry are a set of guidelines that define how to instrument code to produce consistent, meaningful, and machine-readable telemetry data. The speaker identifies themselves as a maintainer of these conventions, indicating their role in shaping the standards for how telemetry data should be structured and interpreted.
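
In practice, following the conventions mostly means using the standard attribute names on spans and metrics so that any backend can interpret them. A small hedged example for an HTTP client span (host, path, and tracer name are placeholders):

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.SpanKind;
import io.opentelemetry.api.trace.StatusCode;

public final class SemconvExample {
    public static void recordChunkRequest(int statusCode) {
        Span span = GlobalOpenTelemetry.getTracer("com.example.myclient")
                .spanBuilder("GET") // HTTP client spans are named after the method (plus route, if known)
                .setSpanKind(SpanKind.CLIENT)
                .startSpan();
        try {
            // Standard attribute names from the HTTP semantic conventions.
            span.setAttribute("http.request.method", "GET");
            span.setAttribute("server.address", "myregistry.example.com");
            span.setAttribute("url.full", "https://myregistry.example.com/v2/sample/blobs/0");
            span.setAttribute("http.response.status_code", statusCode);
            if (statusCode >= 400) {
                span.setStatus(StatusCode.ERROR);
            }
        } finally {
            span.end();
        }
    }
}
```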

πŸ’‘Azure SDKs

Azure SDKs are a collection of libraries used to interact with Microsoft Azure services. In the video, the speaker discusses how OpenTelemetry is used within these SDKs to improve their observability, allowing for better monitoring and understanding of their performance and issues.

πŸ’‘Telemetry

Telemetry is the process of collecting and transmitting measurements and other data at a distance. In the video, telemetry is discussed as a critical component for observability, with the speaker highlighting the need for detailed telemetry data to diagnose and optimize library performance.

πŸ’‘Library Owners

Library owners are the developers responsible for creating and maintaining software libraries. The video emphasizes the unique observability challenges faced by library owners, who may not have direct insight into how their libraries perform after being released and integrated into other applications.

πŸ’‘Integration Testing

Integration testing is the phase of software testing where individual units are combined and tested as a group to determine if they work together correctly. The speaker discusses the use of observability in debugging integration tests, which are often considered flaky due to network issues or configuration problems.
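
One way to make such tests less flaky and more diagnosable, sketched here under the assumption that the client accepts an injected OpenTelemetry instance: use the SDK's in-memory test exporter and assert on the spans the test actually produced, for example how many HTTP attempts a single logical call made. The threshold below is illustrative.

```java
import io.opentelemetry.sdk.OpenTelemetrySdk;
import io.opentelemetry.sdk.testing.exporter.InMemorySpanExporter;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.data.SpanData;
import io.opentelemetry.sdk.trace.export.SimpleSpanProcessor;

import java.util.List;

public class RetryPolicyTest {
    public static void main(String[] args) {
        InMemorySpanExporter exporter = InMemorySpanExporter.create();
        OpenTelemetrySdk otel = OpenTelemetrySdk.builder()
                .setTracerProvider(SdkTracerProvider.builder()
                        .addSpanProcessor(SimpleSpanProcessor.create(exporter))
                        .build())
                .build();

        // ... run the client operation under test with `otel` injected into the client ...

        List<SpanData> spans = exporter.getFinishedSpanItems();
        long httpAttempts = spans.stream().filter(s -> s.getName().startsWith("GET")).count();
        if (httpAttempts > 4) {
            throw new AssertionError("retry policy made too many attempts: " + httpAttempts);
        }
    }
}
```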

πŸ’‘Performance Testing

Performance testing is the process of validating that a system performs well under a particular load. The video describes how OpenTelemetry can enrich performance testing by providing detailed insights into how libraries perform under stress and how they handle network issues.
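
A minimal sketch of the "a stress test is just a service you monitor" idea from the talk (instrument names and the operation being exercised are made up): the load loop records its own latency histogram and error counter through the OpenTelemetry metrics API, and whatever metric exporter is configured elsewhere turns these into the latency, error-rate, and throughput dashboard described above.

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.metrics.DoubleHistogram;
import io.opentelemetry.api.metrics.LongCounter;
import io.opentelemetry.api.metrics.Meter;

public class LoadLoop {
    public static void main(String[] args) {
        Meter meter = GlobalOpenTelemetry.getMeter("stress-test");
        DoubleHistogram latency = meter.histogramBuilder("test.operation.duration").setUnit("s").build();
        LongCounter errors = meter.counterBuilder("test.operation.errors").build();

        for (int i = 0; i < 1_000_000; i++) {
            long start = System.nanoTime();
            try {
                callLibraryOperation(); // the client-library call under load
            } catch (Exception e) {
                errors.add(1);
            } finally {
                latency.record((System.nanoTime() - start) / 1e9);
            }
        }
    }

    private static void callLibraryOperation() {
        // ... send/receive a message, download a blob, etc. ...
    }
}
```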

πŸ’‘Trace

In the context of observability, a trace is a way to visualize and analyze the path that a request takes through a system, often involving multiple services or components. The speaker uses the term 'trace' to describe the detailed breakdown of operations within a library, which can help identify inefficiencies or errors.

πŸ’‘Metrics

Metrics are quantitative measurements that provide insights into the performance of a system. In the video, metrics such as latency, error rate, and throughput are mentioned as part of the observability data that can be used to monitor and improve the performance of Azure SDKs.

πŸ’‘User Feedback

User feedback is essential for understanding how software is used and for identifying areas for improvement. The speaker suggests that library owners can act as their own 'user zero' to collect and analyze telemetry data, but also emphasizes the importance of gathering feedback from actual users to refine and validate the telemetry.

Highlights

Liudmila Molkova introduces herself as a new member of the OpenTelemetry technical committee and a maintainer of the OpenTelemetry semantic conventions.

She discusses the importance of observability in Azure SDKs and the challenges faced by library owners in understanding the post-release impact of their libraries.

Molkova emphasizes that library owners rely on their own kind of observability tools, such as GitHub issues and bug-tracker systems, to improve their libraries.

The presentation highlights the difference between user observability and library owner observability, and the lack of detailed telemetry for library owners.

Molkova shares insights on how library developers can use their position as users of their own libraries to collect and analyze telemetry data for improvements.

Development time is identified as an optimal period for observability due to the developer's intimate knowledge of the code and setup.

A complex trace example is presented, showing multiple layers of an operation and the potential for developers to identify and optimize inefficiencies.

The talk discusses the use of logs and traces to understand and improve library performance, including the identification of unnecessary HTTP requests.

Molkova explains how observability can help in integration testing by identifying flakiness and bugs in retry policies and configurations.

Performance testing is discussed, with a focus on how OpenTelemetry can provide more in-depth insights than traditional benchmarking.

The presentation shares examples of performance issues discovered through OpenTelemetry, such as excessive buffer allocation and thread pool size misconfiguration.

Molkova describes a significant memory usage issue caused by improper prefetching in messaging libraries, which was resolved through OpenTelemetry insights.

The importance of being 'user zero' for library developers is emphasized, as it allows them to understand and improve their libraries from a user's perspective.

The need for user feedback beyond 'user zero' is discussed, to refine and correct initial telemetry implementations so they are useful to a broader set of users.

Molkova concludes by stressing the importance of embracing network issues and of a high level of observability for debugging and improving library development and testing.

The talk ends with applause, followed by a brief Q&A on being 'user zero' and on simulating failures with chaos-engineering tooling.

Transcripts

[00:00] So, I'm Liudmila Molkova, I work at Microsoft. I'm a new member of the OpenTelemetry technical committee and a maintainer of the OpenTelemetry semantic conventions. Today I'm going to share how we use OpenTelemetry in our Azure SDKs to make them better.

[00:21] When we think about observability, we tend to think about it as something intended for users: for somebody who works on the application, or for somebody who runs it. Effectively, they decide which backend to use, they decide how to configure it, they can add data, they can remove data; it's their application. But what about library owners? Do we have any observability? Do we know what happens to our libraries after we release them?

[00:55] We don't collect telemetry for ourselves. There are privacy concerns, we would need consent, and the volume of data is enormous. So no, we don't know. And do we know if it works at all, whether it does the intended thing? Maybe, sometimes.

[01:24] So we do have some observability, but it is quite different. Our observability tools are GitHub issues, or maybe some bug-tracker system. We do live debugging sessions with our users, we have logs, and we ask users for repros.

[01:43] And when an issue happens, we want the impossible. We want detailed telemetry, because we don't want to go back and forth; we want everything. We want it to be on by default, because we don't want you to have to reproduce issues; things should work right away. So: every piece of telemetry possible, always on, costing you nothing, not affecting performance, and, the main thing, we want to access it on your behalf. So I guess we're out of luck, there is no hope for us, we cannot get it. Well, yes or no.

[02:30] One thing we can do: we are the users of our libraries. We develop them, we test them, and we test them in all different ways. So we can be the users who collect this feedback, the users who decide how to collect telemetry, and the users who know how to analyze this data. Let me give you some examples.

[02:56] There is no better time for observability than development time. I'm still in the context, I still know what the code is supposed to do, I didn't forget it yet; I know the setup, I control it, I can change things. So let's see. You probably can't read it from there, but what you're looking at is a very complicated trace; there are about 90 spans. It's part of a complex operation that downloads multiple layers of an image from a container registry, and there is a bunch of things going on at the same time: there is authentication, there are multiple layers, and there is chunking. And it looks repetitive. What I see is groups of spans, and some of them return 401.

[03:58] If I'm a developer who works on this library, I really want to see what you see. Do you see the red things? Right, awesome. Those are errors, 401s; there are about four of them, and they are on every chunk I'm downloading. So as the developer of this library I ask: why? If this wasn't part of the normal authentication flow, couldn't I reuse the token on the second chunk? It should have worked if it worked the first time. So I can go and optimize. And then there are groups of redirects, and they start raising questions: do I need to redirect on every chunk, can I optimize it? Maybe yes, maybe no, but effectively I now know there is something in the library I don't really like.

[04:52] And somebody can tell you: okay, I can use logs instead. There is the same information as you saw on the trace, it's just in logs. Well, you can work with this or with that; you decide.

[05:09] Another example. There is a much simpler API; it just downloads something, and it has two HTTP requests underneath. The first one downloads everything; the second one has an error, it returns a 416 out-of-range response. So I downloaded everything and then made another request just to verify that this is the end of the stream. Again, as a developer who works on this library, I ask: why do I make this extra request, can I avoid it? In this particular case it would cut the operation in half; a two-times improvement. Here the API I'm using is intended for cases when somebody can keep uploading, so after the first request I might not know whether I've reached the end. But as a user looking at this, I can ask why it happens, go and read the documentation, and the documentation will tell me: you should probably use the simpler API if you can. As the owner of this library, I can go and document things; I can say this API is specific, don't use it for simple downloads.

[06:30] So the point here is that even if you think about a library as a thin wrapper, in fact it does a bunch of interesting things under the hood, and they stay under the hood even for the library developers, because it's part of shared core logic. You might configure your retry policy and authentication policy in different orders, but the things that happen under the hood are retries, content buffering, chunking, some caching, and connection management. It is complicated.

[07:14] And now we come to an interesting problem where observability really shines: integration testing. We tend to think about integration tests as something inherently flaky; okay, it failed again, let me restart the test. But why? Yes, network issues happen, but we should have a retry policy in place. Did we retry? Did we have the proper configuration? Maybe we had a five-minute timeout. So the tests shouldn't be flaky, and flakiness in your integration tests is a good sign that you have a bug. Why don't we debug them, why don't we fix them? Because it's hard. The volume of these logs, those beautiful logs I showed a few slides before, is enormous, and those were grouped by trace ID; the logs in the CI system, if you have them at all, can be terrible. So integration testing is the best time to use observability to debug these tests and actually find the bugs in your retry policy. These are the worst bugs to have, because they are very hard to detect. And by adding telemetry to the libraries themselves we help both sides: we help ourselves understand what our libraries do and fix issues, and we help users at the same time.

[09:09] Okay, so the next part is performance testing. How did our testing look before OpenTelemetry? Effectively, it's benchmarking. We get a little bit more data than this, but essentially we get a number: okay, this was your throughput. If there were network issues during the test, we would see a regression and spend days investigating why it happened, but effectively the test is not valid in the presence of normal cloud or real-life errors, so we tend to isolate these tests as much as possible.

[09:55] What changes with OpenTelemetry? Of course we can still do benchmarking, but that's kind of boring; we can do much more. We can embrace these network issues, we can even simulate them, and we can test our libraries in realistic scenarios: in the same place, and in the same way, users use them. To do this we need to apply some real load, inject some failures, and run it for a while, and at this point the test becomes a service. A stress test or reliability test is just a service that you monitor, similarly to anything else: you enable the same observability you would want your users to enable, and you collect all the data you want.

[10:47] And how might it look? I'm pretty sure you can't see it, but we have a beautiful dashboard for the test. It has all the boring stuff: latency, error rate, throughput. We have even more boring stuff: CPU and memory metrics and so on. But we have much more; it's just OpenTelemetry, so you go ahead and look at traces, and if you have continuous profiling enabled it becomes even better.

[11:17] I want to share some examples of things we were able to find with these tests. Even though they rely on some basic metrics, detecting and solving them would not have been possible without the richness of the different signals we get with OpenTelemetry.

[11:45] The first one: we allocated buffers of excessive size. We could have allocated the precise size, which is small, but we said we would always allocate one megabyte. What happens? High CPU, high memory, and lower throughput than we expected. We take a memory dump, we see all the buffers, we fix it, and we get much higher throughput. It's all possible because we run the test for a long time and can compare easily.

[12:16] Then the other story is thread pool size. Our messaging libraries allow you to configure concurrency, and a user can come and say: I want 500 messages processed in parallel. But what happens if you don't configure your thread pool size accordingly? Your concurrency is wasted, you don't have the threads to accommodate it, and you see low throughput but also low resource utilization; you under-utilize your machine. In this case you go and check the number of threads and, boom, it scales linearly.

[13:03] And this one is my favorite of all time. This is the fix that reduces memory usage about a thousand times. Hard to imagine, but it's a great argument for people who say that all the problems come from the network and that your code just cannot do something so stupid. Well, it can.

[13:30] There are two bugs here. Our messaging libraries allow you to prefetch: you process a batch of messages, and in parallel the client goes to the broker and prefetches a few more, so when you finish processing you get the next batch right away; you don't need to wait for it. So we configure a thousand messages to be prefetched, we start the test, memory grows exponentially, and boom, out of memory. We look at the memory dump and there are four million of these messages in there. So one bug is on us. The second bug, well, it's also on us, but I want to blame the framework. What you see here is Reactor, my favorite framework on Earth. What it does is prefetch on your behalf by default, so there is this ", 0" argument on line 23 that disables the default prefetching.

[14:45] With this, I want to summarize. People who don't have observability think they know their code. They don't; they just don't know, and they don't have any evidence that they don't. To actually improve SDKs we need to embrace network issues. When we develop things we rarely have any network problems, and we don't have the production scale that exposes them, so we are not exposed to them. We need to make an extra effort to run our code in a real environment, exposed to these network issues, and we need a level of observability that helps us debug them: to understand what happened, whether a test was flaky because it was genuinely unlucky or because our retry policy doesn't work correctly.

[15:55] And when we instrument libraries and use this telemetry ourselves, we end up with the same telemetry our users would need, because the volume is the same: we have an enormous number of tests running, plus all this performance and reliability testing. If this telemetry doesn't answer our questions, or if it's too verbose for us, it's most likely also too verbose for our users and doesn't answer their questions either.

[16:30] Okay, that's it. Thank you for coming to my talk.

[Applause]

[16:43] (Answering an audience question) Yes, a user you can work closely with, who can provide detailed feedback, is awesome. But what I'm trying to say is: you are the user; you can be your own user zero. With library instrumentations, the library owners tend to provide very deep telemetry focused on their specific thing, and they need user feedback to actually create something that is useful for end users. So I would say yes, you should be user zero, but you need users one, two, and three to actually correct the mistakes you made first.

[17:39] (On how to simulate failures) We tried to use Chaos Mesh. I wouldn't say it was a complete success; it allows you to create some chaos, but it's hard to control and hard to apply in multiple directions. Mostly it's like this: you take something, you give it a very small CPU and memory quota, and you try to load it as much as you can. When you see a bottleneck, you try to fix it and understand where it comes from. Even just by running it at maximum capacity you're exposing it to a lot, and by running it for, say, days, you get regular network issues. Where OpenTelemetry is helpful is that after you run it for days you can actually pinpoint the time and the problem; without it, that wouldn't be possible.


Related Tags
OpenTelemetry, Azure SDKs, Observability, Performance, Integration Testing, Developer Insights, Library Maintenance, Technical Committee, Feedback Loop, Debugging Tools