GopherCon 2020: Ted Young - The Fundamentals of OpenTelemetry

Gopher Academy
22 Dec 2020 · 28:38

Summary

TL;DR: In this talk, Ted Young introduces OpenTelemetry, an observability platform for monitoring distributed systems. He explains what telemetry is, walks through OpenTelemetry's extensible components, and describes the signals it emits, such as distributed traces and metrics. Young demonstrates how to instrument code with OpenTelemetry, emphasizing context propagation as the core mechanism behind distributed tracing. He also discusses the importance of standardization in the telemetry ecosystem and provides practical examples and resources for getting started with OpenTelemetry in various programming languages.

Takeaways

  • πŸ“š Open Telemetry is defined as an observability platform with extensible components for monitoring distributed systems.
  • 🌐 It unifies signals like distributed tracing, metrics, and system resources, providing the context needed to correlate them.
  • πŸ› οΈ Open Telemetry includes a data processing facility for data format transformation, manipulation, and distribution to multiple consumers.
  • πŸ”Œ The Open Telemetry SDK is installed in every service of a deployment, implementing the Open Telemetry API for instrumentation.
  • πŸ”— Open Telemetry Collector is a data pipelining service that can translate between various formats like OTLP, Zipkin, Jaeger, and Prometheus.
  • πŸ† Open Telemetry focuses on standardization for describing distributed systems in cloud environments, rather than standardizing data analysis tools.
  • πŸ“ It is designed via specification, a language-neutral document that allows building consistent implementations across different software ecosystems.
  • πŸ”§ Open Telemetry can be easily installed and configured with minimal code or command line arguments, supporting languages like Java, JavaScript, Python, and Go.
  • πŸ”‘ Context propagation is central to Open Telemetry's architecture, allowing the flow of execution and metadata across services in a transaction.
  • πŸ“ˆ The use of semantic conventions in Open Telemetry helps standardize the description of system components for better data analysis and understanding.
  • πŸ›‘ Baggage headers in Open Telemetry allow for the propagation of arbitrary key-value pairs, useful for passing correlations without additional system load.

Q & A

  • What is the definition of telemetry according to the Cambridge Dictionary?

    -Telemetry is defined as the science or process of collecting information about objects that are far away and sending the information somewhere electronically.

  • What is Open Telemetry and what does it aim to achieve?

    -Open Telemetry is an observability platform consisting of extensible components that can be used together or apart. It aims to standardize the language for describing how distributed computers operate in a cloud environment, allowing for better observability and analysis tools without the need to reinvent the telemetry ecosystem.

  • What are the three main types of signals emitted by Open Telemetry?

    -Open Telemetry emits distributed tracing, metrics, and system resources as its main types of signals.

  • What is the role of the Open Telemetry SDK in a service?

    -The Open Telemetry SDK, referred to as the client, implements the Open Telemetry API. It allows applications, frameworks, and libraries to use this instrumentation API to describe the work they are doing and then sends the data to a data pipelining service called the collector.

  • What is the purpose of the Collector in Open Telemetry?

    -The Collector in Open Telemetry is a data pipelining service that receives data from the SDK and can translate between various data formats, including OTLP, Zipkin, Jaeger, and Prometheus.

  • Why does Open Telemetry not provide its own backend or analysis tool?

    -Open Telemetry does not provide its own backend or analysis tool because its primary focus is on standardization efforts for describing distributed systems in cloud environments, rather than standardizing data analysis methods.

  • How does Open Telemetry ensure consistency and interoperability across different implementations?

    -Open Telemetry is designed via a specification, which is a language-neutral document that describes everything needed to build an implementation of Open Telemetry, ensuring consistency and interoperability.

  • What is context propagation in Open Telemetry and why is it important?

    -Context propagation is the core concept behind Open Telemetry's architecture. It involves sending the contents of the context object as metadata on network requests, allowing the flow of execution and key-value pairs to be tracked across services, which is essential for distributed tracing.

  • What are the primary HTTP headers used for trace context in Open Telemetry?

    -The primary HTTP headers used for trace context in Open Telemetry are 'traceparent' and 'tracestate', which contain information about the trace and span IDs, as well as any additional implementation-specific details.

  • How can baggage headers be used in Open Telemetry to improve observability?

    -Baggage headers in Open Telemetry allow users to pass arbitrary key-value pairs that can be used for correlation purposes. They can be propagated along with the context and used to index spans with additional metadata, such as project IDs, to identify usage patterns or troubleshoot issues.

  • What are semantic conventions in Open Telemetry and why are they important?

    -Semantic conventions in Open Telemetry are standard resources and trace attributes used to describe a system. They are important because they help analysis tools understand the information by providing a standardized way of reporting data, such as hostname, operating system, and other system characteristics.

Outlines

00:00

🌐 Introduction to OpenTelemetry

Ted Young introduces OpenTelemetry, an observability platform designed to collect and process signals from distributed systems. He explains that OpenTelemetry is not just about data generation but also provides a data processing facility. The platform is extensible and can be used with various data formats and protocols, including its own OTLP. The goal of OpenTelemetry is to standardize the language for describing cloud-based distributed systems, allowing new analysis tools to be developed easily. The talk also covers installing the OpenTelemetry SDK and using the Collector for data pipelining, emphasizing the ease of setup and support for multiple programming languages.
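The SDK-to-Collector wiring described here is easiest to see in code. Below is a minimal sketch (not code from the talk) of configuring the Go SDK to batch spans and export them over OTLP to a Collector assumed to be listening on localhost:4317; the service name and package versions are illustrative.

```go
package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.21.0"
)

func main() {
	ctx := context.Background()

	// Exporter plugin: ship spans over OTLP/gRPC to a collector
	// assumed to be listening on localhost:4317.
	exp, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint("localhost:4317"),
		otlptracegrpc.WithInsecure(),
	)
	if err != nil {
		log.Fatal(err)
	}

	// The SDK: batches spans and attaches a resource describing this service.
	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exp),
		sdktrace.WithResource(resource.NewWithAttributes(
			semconv.SchemaURL,
			semconv.ServiceNameKey.String("hello-server"), // illustrative service name
		)),
	)
	defer func() { _ = tp.Shutdown(ctx) }() // flush buffered spans on exit

	// Register globally so instrumentation libraries can find it.
	otel.SetTracerProvider(tp)
}
```

From there, the Collector can translate OTLP into Zipkin, Jaeger, or Prometheus formats for whichever back end you use.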

05:01

πŸ“Š Observing and Analyzing System Transactions

This paragraph delves into the observability of system transactions, using a classic LAMP stack example. It discusses the importance of understanding latency, errors, and the sequence of operations within a transaction. The call graph is introduced as a way to visualize the time spent in each operation and the network calls between services. The paragraph highlights the need for context propagation in distributed tracing to correlate logs and identify patterns in latency and error rates. It also touches on the challenges of scaling and the importance of indexing logs with a single transaction ID for efficient debugging.

10:02

πŸ”„ Context Propagation in OpenTelemetry

The core concept of context propagation in OpenTelemetry is explained: it is what lets traces follow the flow of a transaction across services. Context is propagated on network requests by sending metadata in HTTP headers, following agreed-upon standards such as the W3C trace context headers. Baggage headers are also introduced, which allow arbitrary key-value pairs to be transmitted for user-defined correlations. The section concludes with a practical example of setting up an OpenTelemetry HTTP server, including configuration of the service name, access token, propagators, and resources.
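As a small illustration of the mechanics summarized above, the sketch below (written against the current otel-go API, not taken from the talk) manually injects the active context into HTTP headers and extracts it again, which is what the HTTP instrumentation libraries do automatically. A populated W3C traceparent header has the shape 00-&lt;trace-id&gt;-&lt;span-id&gt;-&lt;flags&gt;.

```go
package main

import (
	"context"
	"fmt"
	"net/http"

	"go.opentelemetry.io/otel/propagation"
)

func main() {
	// Composite propagator: W3C trace context plus baggage headers.
	prop := propagation.NewCompositeTextMapPropagator(
		propagation.TraceContext{},
		propagation.Baggage{},
	)

	// Client side: inject the context into the outgoing request headers.
	req, _ := http.NewRequest("GET", "http://localhost:8080/hello", nil)
	ctx := context.Background() // in real code this context carries an active span
	prop.Inject(ctx, propagation.HeaderCarrier(req.Header))

	// With a recording span in the context, the header looks roughly like:
	//   traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
	// With the bare context used here, nothing is injected and this prints "".
	fmt.Println(req.Header.Get("traceparent"))

	// Server side: extract the same values back into a fresh context, which
	// then parents any spans the server starts.
	serverCtx := prop.Extract(context.Background(), propagation.HeaderCarrier(req.Header))
	_ = serverCtx
}
```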

15:03

πŸ› οΈ Implementing Open Telemetry in Go

The paragraph demonstrates the implementation of Open Telemetry in a Go programming environment. It covers the setup of an HTTP server with Open Telemetry's instrumentation, including the creation of a tracer, the addition of a simple handler, and the use of semantic conventions to standardize service indexing. The script also shows how to add custom attributes to spans and how to create child spans and events for more detailed tracing. The importance of ending spans to avoid leaks and recording errors and events is emphasized, along with the use of debug logs for troubleshooting.

20:05

πŸ”— Context Propagation and Baggage in Client Requests

This section illustrates the use of OpenTelemetry in an HTTP client, showing how context propagation works across network requests. It explains how wrapping the HTTP client's transport with OpenTelemetry's instrumentation enables tracing. It also introduces baggage, which allows additional data, such as a project ID, to be propagated from the client to the server to avoid unnecessary database calls. The talk includes code for creating an HTTP client, making requests, and extracting baggage values to enrich trace information.

25:06

πŸ“ˆ Advanced Tracing Techniques and Rollout Strategy

The final section discusses advanced tracing techniques, such as creating a master span to group multiple HTTP requests into a single trace. It also addresses practical considerations for rolling out OpenTelemetry within an organization, emphasizing the importance of choosing production-ready languages and gaining internal buy-in. The talk suggests starting with a single high-value transaction to demonstrate the benefits of OpenTelemetry and then expanding from there. It concludes with resources for further learning, including a documentation site and community engagement through GitHub and social media.

Keywords

πŸ’‘OpenTelemetry

OpenTelemetry is an observability platform for cloud-native software, consisting of APIs, libraries, agents, and instrumentation that can be used to collect distributed traces, metrics, and logs from applications. It is central to the video's theme as it is the main subject being discussed and demonstrated. The talk presents OpenTelemetry as a standardized way to describe how distributed systems operate in a cloud environment.

πŸ’‘Telemetry

Telemetry, as defined by the Cambridge Dictionary, is the science or process of collecting information about objects that are far away and sending the information somewhere electronically. In the context of the video, telemetry is foundational to OpenTelemetry, which extends this concept to the collection and correlation of signals across distributed systems.

πŸ’‘Distributed Tracing

Distributed tracing is a method for tracking the flow of requests through a distributed system. It is a key concept in the video, as OpenTelemetry is used to create a detailed view of transactions in a distributed environment. The talk uses the example of a mobile client uploading a photo to illustrate how distributed tracing exposes latency and errors in a system.

πŸ’‘Metrics

Metrics in the context of OpenTelemetry are the numerical measurements collected to provide insight into the behavior of a system or its subcomponents. The talk mentions metrics as one of the signals emitted by OpenTelemetry, used to monitor system performance and health.

πŸ’‘Observability

Observability in the video refers to the ability to understand the internal state of a system by observing its outputs or emitted signals. OpenTelemetry aims to advance the field of observability by providing a robust telemetry pipeline for distributed systems, allowing new analysis tools to be built quickly and easily.

πŸ’‘Context Propagation

Context propagation is the mechanism by which the state of a transaction is passed along as it moves through various services in a distributed system. It is a core concept in the video, as it allows data to be correlated across different components of a system. The talk explains how context propagation uses HTTP headers to maintain the flow of execution.

πŸ’‘Trace Context

Trace context is a set of HTTP headers used to carry trace identifiers across services. It is a specific application of context propagation mentioned in the talk, where the 'traceparent' and 'tracestate' headers propagate trace and span IDs, enabling correlation of operations across a distributed system.

πŸ’‘Baggage

Baggage in OpenTelemetry refers to arbitrary key-value pairs that can be propagated alongside trace context, allowing additional data to be passed through a system. The talk gives the example of using baggage to pass a project ID from a client to a server, which can be useful for indexing and analyzing data without additional database calls.

πŸ’‘Instrumentation

Instrumentation in the video is the process of adding code to an application to enable the collection of telemetry data. It is a key part of setting up OpenTelemetry; the talk describes how to instrument an HTTP server and client to generate and propagate telemetry data.

πŸ’‘Semantic Conventions

Semantic conventions in OpenTelemetry are standardized resource and trace attributes that describe a system. The talk emphasizes the importance of these conventions for ensuring that analysis tools can understand the collected data, with 'host.name' given as an example of a standardized attribute.

πŸ’‘Collector

The Collector in OpenTelemetry is a data pipelining service that receives telemetry data from the SDK and can translate between various data formats. While not the main focus of the talk, the Collector is mentioned as part of the overall architecture of OpenTelemetry, highlighting its role in the data pipeline.

Highlights

OpenTelemetry is an observability platform for collecting signals from distributed systems.

It provides the context needed to correlate across distributed tracing, metrics, and system resources.

OpenTelemetry includes a data processing facility for changing data formats and manipulating data.

The OpenTelemetry SDK is used for instrumenting applications and frameworks.

OTLP is OpenTelemetry's own data protocol, but the Collector can also translate between other formats such as Zipkin and Prometheus.

OpenTelemetry does not provide its own backend or analysis tool, focusing instead on standardization for cloud environments.

The project is designed via a specification to ensure consistency and interoperability across implementations.

OpenTelemetry can be installed and configured with minimal code or command-line arguments.

Languages recommended for production-ready beta include Java, JavaScript, Python, and Go.

Observability concepts are introduced through an example transaction in a distributed system: a mobile client uploading a photo.

Call graphs represent operations and network calls to provide insight into transaction latencies.

Errors in transactions can be identified and debugged using OpenTelemetry's tracing system.

Context propagation is central to OpenTelemetry's architecture, allowing powerful indexing of transactions.

Tracing headers such as trace context and baggage headers carry context propagation across services.

OpenTelemetry's API allows creating spans, setting attributes, recording errors, and adding events.

Demonstration of setting up an OpenTelemetry HTTP server and client with context propagation.

Using baggage to propagate additional data, such as project IDs, can enhance tracing without additional server-side calls.

Traces provide a more efficient way to investigate transactions than traditional logging.

Strategies for rolling out OpenTelemetry in an organization include choosing production-ready languages and getting organizational buy-in.

opentelemetry.lightstep.com offers resources, guides, and documentation for getting started with OpenTelemetry.

Transcripts

[00:06] Oh hey! My name's Ted Young, and my pandemic haircut is a hat. Fundamentals of OpenTelemetry. But what even is OpenTelemetry? Come to think of it, what's telemetry? The Cambridge Dictionary defines telemetry as the science or process of collecting information about objects that are far away and sending the information somewhere electronically. OpenTelemetry is an observability platform: a set of extensible components that can be used together or Γ  la carte. OpenTelemetry emits a variety of signals, with distributed tracing, metrics, and system resources being the most important. Rather than keep these signals separate, OpenTelemetry braids them together and provides the context you need to correlate across them in your back end. In addition to data generation, OpenTelemetry provides a data processing facility. This allows you to change data formats, manipulate your data, scrub it, and tee it off to multiple consumers: everything you would need in a modern, robust telemetry pipeline for your distributed system.

[01:17] In every service in your deployment, install the OpenTelemetry client. We refer to the client as the OpenTelemetry SDK. The SDK in turn implements the OpenTelemetry API. Your applications, frameworks, and libraries use this instrumentation API to describe the work that they are doing. The SDK then uses an exporter plugin to send the data to a data pipelining service called the Collector. OpenTelemetry comes with its own data protocol called OTLP, but the Collector can translate between a variety of formats, including Zipkin, Jaeger, and Prometheus.

[01:54] Notably, OpenTelemetry does not provide its own back end or analysis tool. This is because at the heart of OpenTelemetry is a standardization effort. The goal is to come up with a universal language for describing how distributed computers operate in a cloud environment. The goal is not to standardize how we analyze this data. Instead, OpenTelemetry hopes to push the field of observability forwards by allowing new analysis tools to be built quickly and easily, without the need to reinvent this entire telemetry ecosystem.

[02:31] Speaking of software ecosystems, how does OpenTelemetry keep track of all of this code? To ensure that different implementations remain consistent with each other and continue to interoperate, OpenTelemetry is designed via specification. This specification is a language-neutral document which describes everything you would need to build your own implementation of OpenTelemetry.

[02:55] Before we dive into the details, I do want to point out that it is easy to install OpenTelemetry. OpenTelemetry can be packaged up into distros that make the configuration and installation only a few lines of code, or in some cases just a command line argument. At the time of this recording I recommend four languages for production-ready beta: Java, JavaScript, Python, and of course Go. And I've written some easy quick start guides over at otel.lightstep.com. You can always check there to get the latest information about production-ready OpenTelemetry.

[03:29] Okay, so in this next section we're going to do a quick overview of the basic concepts behind OpenTelemetry. We're going to start with what it is that we're actually trying to observe, cover the fundamental concepts behind how OpenTelemetry approaches observability, and look at how to set up and deploy OpenTelemetry in your production environment.

[03:52] Okay, so let's look at an example application to get an understanding of the kind of transactions we're talking about here. Let's say you have a mobile client that wants to let you upload a photo with a caption. That client is going to connect to a server, but of course it's not going to be one server, it's going to be a bunch of servers. Let's say the first thing that it hits is a reverse proxy, and then that reverse proxy wants to authenticate you, so it calls out to an authentication service. Once that comes back as A-OK, it then uploads your image to a scratch disk. Once the image is successfully uploaded, it calls out to an application with the location of that image. That application then uploads the image to cloud storage and stores the location of that image and the caption in a SQL database via a data service, which then, you know, holds the cache to that information in Redis. Why is this application built like this? Who knows. Someone said build it this way, and someone else said okay. But seriously, I feel like I've been looking at applications basically like this one for about 20 years. This is your classic LAMP stack sort of setup, and honestly they were just as annoying to observe back then as they are today.

[05:14] So this view here represents the transaction as a service diagram. While this is a useful display for getting an overall sense of which services were involved in a transaction, it doesn't tell you a lot of details, such as how long the transaction took, where all that time went, and what order operations were called in. To get a deeper understanding of this transaction, let's represent it as a call graph. In this diagram each operation is represented by a line, and the length of the line represents the amount of time spent in that operation. The operations are then connected together via network calls, represented by the arrows. So here we can see the client span talking to the reverse proxy, talking to the auth server, and so on and so forth.

[06:03] As an operator, there's a number of things we care about when we look at a diagram like this. First and foremost, we care about latency. We want to understand where the time was spent in the system. "Why is it slow?" can be such a cranky question to answer. For example, the client span in this case took the most amount of time, but that doesn't tell you very much. Instead, you want to find out where the system was doing work and where the system was waiting on work to be done. The combination there will tell you where you need to focus your efforts if you're actually going to improve the latency in this system. In this case, the most amount of time was spent uploading the file to scratch disk and then uploading it again to cloud storage. So if you tried to optimize anything else, you really wouldn't be moving the needle very far, because the overall amount of time you would be affecting would be minimal.

[07:02] Next, of course, we care about errors. When a transaction is failing, we want to be able to quickly identify which service actually had the error, and which services were simply propagating an error downstream or responding to that error. Once you've identified an error, you're going to want to debug it, and in order to do that you're going to want more fine-grained detail. In OpenTelemetry's tracing system we call these events, but you can think of them as logs, because that's basically what they are. However, there is one major difference between tracing and logging: with tracing, every operation can be associated with a set of key-value pairs that allow you to identify patterns in your latency and error rates. Being able to quickly or automatically identify that certain errors are associated with certain routes or regions or hosts, or that certain latency patterns are associated with certain clients or certain project IDs, is OpenTelemetry's killer feature.

[08:06] While the transaction we've been looking at is simple enough, the real issue here is scale. As your system grows and grows, the percentage of logs associated with any particular transaction shrinks and shrinks, and it becomes more and more difficult to find the logs that are associated with a particular transaction. For example, we know that the reverse proxy talks to an application server, but what if there are 50 application servers? How do you know where to look to find those logs? Obviously you're going to need to index these logs, but index them with what? The ideal index would be a single transaction ID that was stapled to every log in the transaction, so that if you found one log you would quickly be able to search for all the other logs associated with it. And that, in a nutshell, is distributed tracing.

[09:02] So how does OpenTelemetry provide all of this awesome indexing? The answer is context propagation. Context propagation is the core concept behind OpenTelemetry's architecture. If you can understand context propagation, then everything else about OpenTelemetry will fall into place. So imagine we have two servers, and they're connected together by a network request. All of OpenTelemetry's indices are stored in an object called the context. This context object follows the flow of execution throughout your program. When your transaction moves from one service to the next via a network call, all of these key-value pairs must come along for the ride. Sending along the contents of the context object as metadata on the request is called propagation. When using HTTP, the contents of the context object on the client side are injected into the HTTP request as HTTP headers. Then, on the server side, the same values are extracted from the headers and deserialized into a new context object, which continues to follow the flow of execution.

[10:20] Now, obviously this only works if both the client and the server agree on which HTTP headers are going to be used. To make this more effective, we're working through the W3C to add a set of tracing headers to the official HTTP spec. There are a number of tracing headers out there in the wild, but let's have a look at these, since they're going to be the standard going forwards. The primary tracing headers are called trace context. Trace context consists of two headers: traceparent and tracestate. Traceparent contains two IDs: one represents the overall transaction, and that's called the trace ID; the other represents the parent operation, which is called the span ID. Traceparent also includes a sampling flag to let you know whether tracing is enabled or not. The tracestate header contains any additional implementation-specific details that a particular tracing system might need to propagate. In addition to trace context there are also baggage headers. Baggage headers are literally arbitrary key-value pairs that you can use as an end user to pass your own correlations down the line. We'll see how that's useful in a bit.

[11:39] Okay, so enough talk. Let's write some code. For this example we're going to make a simple hello-world HTTP server. Let's get started by first installing OpenTelemetry. Now, an important thing to understand about OpenTelemetry is that it's a framework, and most of the configuration you do is about connecting it to different backends. But once you've picked the back end you want to connect to, most of that configuration becomes boilerplate. To help with that, we've created a concept called OpenTelemetry distros. Since we're going to be connecting to Lightstep in this example, let's grab the Lightstep distro.

[12:15] The first required piece of configuration is the service name. This lets you know where all the data is going to be originating from; let's call it hello-server. Then, in order to connect to Lightstep, you're going to need an access token. So that's just a little doodad we're going to grab from Lightstep and paste that crazy thing into here. And then, to show some optional configuration, first let's define which propagators we're going to use. We talked about trace context before, but for this example let's switch to B3, which are the Zipkin headers. You may encounter these if you're already using a tracing system.

[12:53] The final bit of configuration we want to point out is resources. Resources are what you use to index your services. So the same way traces have indices and operations have indices, services can also have indices. In OpenTelemetry we have this concept called semantic conventions. Semantic conventions are standard resource and trace attributes that you can add to describe your system. These conventions are defined in the specification; we have everything that you might expect: operating systems, containers, processes. Let's grab hostname and add that. The reason why it's important to standardize these conventions is so that your analysis tools can actually understand the information. If we reported hostname in some cases as host.name, and in other cases as host-name or just host, it would be much harder to do something useful with that data. Lastly, you need to add a call to shutdown at the end to ensure that everything gets flushed when your program exits. And that's all the setup we need to do. From here on out we'll only be interacting with the OpenTelemetry API.
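The talk does all of this through the Lightstep launcher distro, whose option names are specific to that package. Here is a rough vanilla-SDK equivalent of the same steps (service name, B3 propagators, a host.name resource attribute, and a shutdown call that flushes on exit); this is a sketch, not the code shown on screen, and it prints spans to stdout instead of exporting to Lightstep.

```go
package main

import (
	"context"
	"log"
	"os"

	"go.opentelemetry.io/contrib/propagators/b3"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/stdout/stdouttrace"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.21.0"
)

func main() {
	ctx := context.Background()

	// Stand-in exporter: the talk exports to Lightstep, here we just
	// print spans so the example stays self-contained.
	exp, err := stdouttrace.New(stdouttrace.WithPrettyPrint())
	if err != nil {
		log.Fatal(err)
	}

	// Resource: semantic-convention attributes that index this service.
	host, _ := os.Hostname()
	res := resource.NewWithAttributes(
		semconv.SchemaURL,
		semconv.ServiceNameKey.String("hello-server"),
		semconv.HostNameKey.String(host),
	)

	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exp),
		sdktrace.WithResource(res),
	)
	otel.SetTracerProvider(tp)

	// Propagators: B3 (Zipkin-style) headers instead of W3C trace context.
	otel.SetTextMapPropagator(b3.New())

	// Shutdown flushes any buffered spans when the program exits.
	defer func() {
		if err := tp.Shutdown(ctx); err != nil {
			log.Println(err)
		}
	}()
}
```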

[14:09] The first thing we'll want to do is create a tracer. Tracers should be created at the package level and named after the package that they're instrumenting. This allows you to attribute every span created to the package that created it. Next up, we're going to create a simple hello-world handler. All this handler is going to do is sleep for 30 milliseconds, to pretend like it's doing work, and then write out "hello world". Super basic. Next up, we need to install instrumentation libraries. In most languages these libraries can be installed automatically, but in Go we don't really like any of that spooky automatic stuff; instead, we prefer to copy-paste. OpenTelemetry comes with a variety of instrumentation libraries, including those that cover all of the core HTTP and networking libraries within the standard library, as well as a number of common frameworks. So in this case we're taking our HTTP handler, wrapping it in an instrumented HTTP handler, and adding that to our service. And that's all it takes to add basic instrumentation and context propagation to an HTTP server in Go.
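A minimal sketch of that wiring, assuming a tracer provider has already been configured as above; the module path and operation name are illustrative:

```go
package main

import (
	"log"
	"net/http"
	"time"

	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
	"go.opentelemetry.io/otel"
)

// Package-level tracer, named after the package that owns these spans.
var tracer = otel.Tracer("github.com/example/hello-server") // hypothetical module path

func helloHandler(w http.ResponseWriter, r *http.Request) {
	time.Sleep(30 * time.Millisecond) // pretend to do some work
	_, _ = w.Write([]byte("hello world"))
}

func main() {
	// Wrapping the handler gives us a server span per request plus
	// context extraction from the incoming headers.
	wrapped := otelhttp.NewHandler(http.HandlerFunc(helloHandler), "hello")
	http.Handle("/hello", wrapped)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```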

[15:30] So let's have a look at this, let's fire it up. Okay, so let's start our HTTP server, then hop into our browser and hit refresh a whole bunch of times on this endpoint to ensure that we generate some data, and then go look in our back end to see if we're getting anything. And sure enough, we've got spans coming in, and if you click into one of these you'll see the world's simplest trace. It's a single span, but look at how rich that span is with indices and data: those are the semantic conventions I was referring to earlier.

[16:04] Okay, now let's add some of our own data. The first thing we need to do is get a handle on the current span; to do that, extract it from the context. This is the same span that was set up by our server instrumentation earlier. Let's add an attribute describing the route for this endpoint, using one of OpenTelemetry's semantic conventions. In general, you should prefer decorating these existing spans with attributes and events rather than creating child spans. Okay, let's run our server again and see how it looks. This time I'm going to enable the debug logs; these are useful for diagnosing a configuration error, so I just wanted to point them out. Okay, so let's generate some data again and go have a look and see what we've got. We should be able to see that the spans coming in now have an http.route key associated with them, and sure enough, there it is. This is really useful, because you could now query by the http.route key and make apples-to-apples comparisons across different latencies for the same route.
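Extending the handler from the previous sketch, this is roughly what that looks like; the semconv helper writes the standardized http.route attribute key, and the route value is illustrative:

```go
package main

import (
	"log"
	"net/http"
	"time"

	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
	semconv "go.opentelemetry.io/otel/semconv/v1.21.0"
	"go.opentelemetry.io/otel/trace"
)

func helloHandler(w http.ResponseWriter, r *http.Request) {
	// Grab the server span that otelhttp.NewHandler already started
	// for this request; it lives in the request context.
	span := trace.SpanFromContext(r.Context())

	// Decorate the existing span with the semantic-convention route
	// attribute rather than starting a new child span.
	span.SetAttributes(semconv.HTTPRouteKey.String("/hello"))

	time.Sleep(30 * time.Millisecond) // pretend to do some work
	_, _ = w.Write([]byte("hello world"))
}

func main() {
	http.Handle("/hello", otelhttp.NewHandler(http.HandlerFunc(helloHandler), "hello"))
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```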

[17:18] Okay, so let's try creating our own child span. To do that, take the tracer that we created for our package and use it to call start on the current context, along with a name for our new operation. This will return a new context, with our new child span set as active within it. Remember to always end your spans, so that you can correctly record the latency and avoid creating a leak. If an error occurs, you can call record error to log it as an event. If this error indicates that the entire operation is an error, you also need to change the span status to error; otherwise this error will just be recorded as an event. You can also add regular events. Events are effectively really awesome structured logging: each event has a timestamp, a message, and a set of key-value pairs, and of course it's contextualized by the span and the trace that it's occurring inside of. Okay, so make sure to pass in that context and we're good to go.

[18:22] So let's start our server up again and generate some more data. We should be seeing errors show up in the explorer now, and sure enough, there they are. If you click into one of these, you'll see that the trace is now a little bit more complicated. We've got two spans: the parent span generated by the automatic instrumentation, and the child span that we created ourselves. You can see that the child span has been marked as an error and contains the event that we added; you can see that listed over in the corner. You may notice that the child span has the same duration as the parent span. This implies that all the work is being done in the child span, and if you look at our code, sure enough, that's where the sleep is. So if we add another sleep to imply we're doing work in the parent span, we'll get a more interesting trace. Just hitting refresh here, sending our data again, and we see some new data coming in; if you have a look here, you can now see that roughly half of the work is being done in the parent span and half of the work is being done in the child span.

[19:32] So we've covered six basic commands: getting the current span from the context, setting an attribute, creating a child span, recording an error, setting a status, and adding an event. And that's all you need to know about the OpenTelemetry API as an end user.
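Here are those six calls sketched together against the current otel-go API; the span name, attribute keys, and the simulated error are illustrative, not from the talk:

```go
package main

import (
	"context"
	"errors"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/codes"
	"go.opentelemetry.io/otel/trace"
)

// Package-level tracer; with no SDK configured this is a no-op tracer,
// so the sketch still runs on its own.
var tracer = otel.Tracer("github.com/example/hello-server") // hypothetical name

func main() {
	doWork(context.Background())
}

func doWork(ctx context.Context) {
	// 1. Get the current span from the context.
	parent := trace.SpanFromContext(ctx)

	// 2. Set an attribute on it.
	parent.SetAttributes(attribute.String("example.key", "example-value"))

	// 3. Create a child span; this returns a new context with the child active.
	ctx, child := tracer.Start(ctx, "child-operation")
	defer child.End() // always end spans: records latency and avoids leaks
	_ = ctx           // pass this context on to any downstream calls

	// Simulate a failure so there is something to record.
	err := errors.New("something went wrong")

	// 4. Record the error as an event on the span.
	child.RecordError(err)

	// 5. If the whole operation failed, also set the span status to error;
	//    otherwise the error is just an event on an otherwise OK span.
	child.SetStatus(codes.Error, err.Error())

	// 6. Add a regular event: a timestamped, structured log line that is
	//    contextualized by the span and trace it occurs inside of.
	child.AddEvent("cache miss", trace.WithAttributes(
		attribute.String("cache.key", "user:42"),
	))
}
```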

[19:55] Okay, but this is supposed to be distributed tracing. Let's create an HTTP client and watch that context propagation flow through our system. First I'm going to copy-pasta over all of our OpenTelemetry setup code. I'm not going to set up resources this time, just for expediency, but you really should do that in a production scenario. This is going to be a very simple client: all it's going to do is spin up, make five HTTP requests in a row, and then shut back down. So first we're going to create an HTTP client, and then create an HTTP request. Handle the error; we're just going to panic in this case, no big deal. Then do the request on the client, again panic with the error, and then close the body of the response, just to make sure we do this cleanly. And then in our main we're going to make a for loop, make five requests in a row, and that's that. To instrument our client and enable context propagation, we wrap our HTTP client's transport in an OpenTelemetry instrumentation library.

[21:08] Okay, so let's get this client running. We're just going to run it a number of times, then look in our explorer again and see if those spans are showing up. And sure enough, we now see a more complicated trace. In addition to those server-side spans we saw before, we can see they now have a client span set up as their parent, and again, notice that this client span is decorated with a bunch of HTTP information. In addition to that, there's also an attribute called instrumentation name, which points to the origins of this particular span; in this case we can see it came from the otelhttp library. That's useful information if you're trying to debug your tracing: for example, I can click through on instrumentation name and do a query and find all the spans that are being generated by this particular instrumentation package across my system.
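A sketch of that client; the only instrumentation-specific line is wrapping the default transport, which both creates a client span per request and injects the propagation headers:

```go
package main

import (
	"context"
	"io"
	"net/http"

	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
)

func makeRequest(ctx context.Context, client *http.Client) error {
	req, err := http.NewRequestWithContext(ctx, "GET", "http://localhost:8080/hello", nil)
	if err != nil {
		return err
	}
	res, err := client.Do(req)
	if err != nil {
		return err
	}
	defer res.Body.Close()
	_, err = io.Copy(io.Discard, res.Body) // drain the body so the request finishes cleanly
	return err
}

func main() {
	// Wrap the transport in the otelhttp instrumentation library.
	client := &http.Client{Transport: otelhttp.NewTransport(http.DefaultTransport)}

	for i := 0; i < 5; i++ {
		if err := makeRequest(context.Background(), client); err != nil {
			panic(err) // the talk just panics on error, too
		}
	}
}
```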

[22:01] Now that we have context propagation flowing, let's add some baggage. Baggage is a really cool feature. Let's imagine that we have a project ID that's available on our client but would be expensive to grab on the server; perhaps it would be an extra database call that we don't want to make. So rather than doing that, we can flow our project ID from the client to the server using baggage. To do that, I'm grabbing our context object, attaching the baggage to the context object using context-with-baggage-values, and then flowing that context into our request. Then, on the server side, I'm grabbing our context and pulling the baggage off of it: we take the context object, call baggage value on it, and extract the project ID. We can then take that project ID and use it as a correlation by setting it as a span attribute. It's also very useful to use these correlations with metrics, though we're not getting into metrics in this talk.

[23:20] Okay, so let's reboot our server and have a look at this new data. Oh, running the client again; you'll get used to that. Looking in the explorer and clicking on one of our internal spans, and sure enough, there's project ID. And if we were to add more services downstream of this one, that baggage would continue to propagate. Indexing your spans with concepts such as project ID can be very useful, because they identify potential usage patterns. For example, it would be an important insight to understand that the errors and latency issues you are seeing in your system were actually correlated with just a handful of accounts. One cautionary note: baggage values aren't free, as they have to be propagated downstream. Each value you add increases the size of every HTTP request, so you should use them sparingly.
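A sketch of that round trip using the current otel-go baggage package (the beta API in the talk used slightly different helper names, such as context-with-baggage-values). These are drop-in fragments for the client and server from the earlier sketches; the project.id key is illustrative, and baggage only propagates if a baggage propagator is registered alongside the trace propagator.

```go
import (
	"context"
	"net/http"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/baggage"
	"go.opentelemetry.io/otel/trace"
)

// Client side: attach a project ID to the context as baggage, so the
// instrumented transport propagates it on the outgoing request.
func withProjectID(ctx context.Context, projectID string) context.Context {
	member, _ := baggage.NewMember("project.id", projectID)
	bag, _ := baggage.New(member)
	return baggage.ContextWithBaggage(ctx, bag)
}

// Server side: pull the value back off the incoming context and index
// the current span with it, avoiding an extra database lookup.
func recordProjectID(r *http.Request) {
	projectID := baggage.FromContext(r.Context()).Member("project.id").Value()
	span := trace.SpanFromContext(r.Context())
	span.SetAttributes(attribute.String("project.id", projectID))
}
```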

[24:23] Okay, so let's make this trace a little more interesting. We're going to go back into our HTTP client and connect all of these requests together into a single trace. The first thing we're going to do is flow our context all the way through makeRequest, and then we're going to create a master span. In order to create a span, we're going to need a tracer handle, so we do that at the package level and name the tracer after the package, so that we'll know where this span originated from. Once we've done that, we start the span, name it whatever, something something, and make sure we end the span; that's always a gotcha, you don't want to leak. And then let's have a look: we should now see a single trace with five HTTP requests attached to it, and sure enough, there it is.
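A sketch of that change to the client's main; the operation name "client-batch" stands in for whatever the span was called in the demo:

```go
package main

import (
	"context"
	"net/http"

	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
	"go.opentelemetry.io/otel"
)

// Package-level tracer named after the package, so we can tell where
// the master span originated from.
var tracer = otel.Tracer("github.com/example/hello-client") // hypothetical module path

func main() {
	client := &http.Client{Transport: otelhttp.NewTransport(http.DefaultTransport)}

	// One parent ("master") span that all five request spans hang off of.
	ctx, span := tracer.Start(context.Background(), "client-batch")
	defer span.End() // the usual gotcha: end it, or the span leaks

	for i := 0; i < 5; i++ {
		// Flow the context all the way through, so each client span
		// created by the instrumented transport becomes a child.
		req, err := http.NewRequestWithContext(ctx, "GET", "http://localhost:8080/hello", nil)
		if err != nil {
			panic(err)
		}
		res, err := client.Do(req)
		if err != nil {
			panic(err) // a running hello-server is assumed here
		}
		res.Body.Close()
	}
}
```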

[25:21] So I hope this shows, if anything, what a time saver tracing can be over regular logging. You'll notice that at no point during this demonstration did I go in and start filtering by request ID, then find some other ID, add that to the filter, and build up a query just in order to get a view of the transaction. I started with a view of the transaction, and from there pivoted into investigating various ways this transaction relates to other transactions. Being able to move naturally into that workflow, without having to slow yourself down and build up these queries all the time, has proven invaluable, especially at the beginning of an incident when I'm trying to cast a wide net and diagnose the problem, or when there are no active problems but I want to proactively investigate latency issues and improve the overall user experience for my application.

[26:14] Okay, we're getting close to the end, and before we go I'd like to do a quick review of rolling out OpenTelemetry. If you're thinking about rolling out OpenTelemetry in an organization, the first thing to double check is that the languages that you're using are ready for production. As I mentioned before, at this time I recommend Java, JavaScript, Python, and Go. Erlang is also getting ready; there are a number of them getting ready. The best way to find out the current state of any OpenTelemetry project is to ask the special interest group that's working on it. You can find them by checking out the GitHub repo for each one of these projects, or by going to opentelemetry.io, because that links out to all of that, including calendars for meetings, Gitter rooms, everything you need in order to get involved in the community. Since we're in beta, I really recommend doing that.

[27:00] The next thing you need to do is get buy-in within your organization. I don't recommend trying to boil the ocean, especially if you have a number of service teams that you're going to have to go to and ask to do the work of setting up distributed tracing. If installing OpenTelemetry everywhere is looking like it might be a lot of work, the best thing to do is to pick a particular pain point: find one high-value transaction that you'd like to understand, say, the latency or error rates of, and instrument all of the services necessary to understand that transaction. Once you've got that one transaction implemented from start to finish, you'll be able to really see OpenTelemetry work, with everything installed properly and all the data being correct, rather than a sort of scattershot approach where maybe one team installs it and then another team installs it, but it's not part of an organized effort. Once you've instrumented that particular transaction, you can expand out from there: look for outliers and other low-hanging fruit.

[28:02] If you're looking for more information about getting started, I'm putting a documentation site together at opentelemetry.lightstep.com. This is where you'll find getting-started guides, guides to our own distros (the OpenTelemetry launchers), and coming soon you'll be seeing cookbooks, deep dives, et cetera. I really want to make it an excellent resource for anyone trying to use OpenTelemetry. If you'd like to keep track of all of this, I post regular updates on Twitter at @tedsuo, so you can follow me there or send me a DM. Thanks for watching, and I hope you get involved in OpenTelemetry!
