How OpenTelemetry Helps Generative AI - Phillip Carter, Honeycomb

CNCF [Cloud Native Computing Foundation]
29 Jun 2024 · 24:08

Summary

TL;DR: Phillip from Honeycomb's product team discusses the role of OpenTelemetry in improving generative AI applications. He emphasizes the importance of observability for understanding user interactions and model performance, and highlights the challenges of managing costs and ensuring reliability given AI's unpredictable nature. The talk covers the practical aspects of building AI applications, the use of language models, and the significance of context and prompting techniques. It also touches on ongoing work within the OpenTelemetry community to standardize tracing and metrics for AI applications.

Takeaways

  • πŸ˜€ Open Telemetry is a crucial tool for improving generative AI applications, despite being considered one of the least interesting parts of the project by the speaker.
  • πŸ” The speaker emphasizes the importance of observability in AI, especially in understanding user inputs and outputs to make AI applications more reliable.
  • πŸ’‘ AI and generative models are becoming more accessible and affordable, shifting the bottleneck from large tech companies to the broader developer community.
  • πŸš€ The speaker discusses the challenges of managing costs and understanding model performance when building AI applications.
  • πŸ”‘ The key to building successful AI applications is understanding the right prompting techniques and knowing when to fine-tune or train your own language models.
  • πŸ“ˆ The speaker highlights the significance of having good data to feed into the AI models, as well as the right context for user inputs to produce accurate outputs.
  • πŸ› οΈ The process of building AI applications involves a stack of operations before calling a language model, including search services and retrieval-augmented generation.
  • πŸ”¬ Observability is akin to a tracing problem, where capturing the entire flow from user input to model output is essential for analysis and improvement.
  • πŸ“Š Metrics like latency, error rates, and cost are important but often easier to manage compared to ensuring the right data is fed into the model and the model behaves correctly.
  • πŸ“ The speaker suggests logging extensive information about the AI application process, including prompts, model responses, and post-processing steps for better debugging and improvement.
  • 🌐 OpenTelemetry is adding semantic conventions for AI applications, aiming to standardize how operations and data are represented in traces and logs (a sketch of this kind of instrumentation follows this list).
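To make that concrete, here is a minimal sketch (not code from the talk) of manually recording a prompt, response, and post-processing outcome on a span with the OpenTelemetry Python API. The gen_ai.request.model attribute name comes from the experimental GenAI semantic conventions; the app.* attribute names, the prompt template, and the call_model stub are hypothetical stand-ins.

```python
from opentelemetry import trace

tracer = trace.get_tracer("example.genai")

def call_model(prompt: str) -> str:
    # Placeholder for a real provider SDK call (e.g. a chat-completion request).
    return '{"query": "..."}'

def answer_question(user_input: str) -> str:
    with tracer.start_as_current_span("llm.chat") as span:
        # gen_ai.request.model is taken from the experimental GenAI semantic
        # conventions; the app.* attribute names below are illustrative only.
        span.set_attribute("gen_ai.request.model", "gpt-3.5-turbo")
        prompt = f"Answer the user's question.\n\nQuestion: {user_input}"
        span.set_attribute("app.llm.prompt", prompt)   # a large prompt may fit better in a correlated log or event
        response = call_model(prompt)
        span.set_attribute("app.llm.response", response)
        cleaned = response.strip()                     # stand-in for the deterministic post-processing the talk describes
        span.set_attribute("app.postprocessing.changed", cleaned != response)
        return cleaned
```

Whether large payloads such as full prompts belong on the span itself or in a correlated log or event is exactly the kind of question the semantic-conventions work discussed later is trying to settle.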

Q & A

  • What is the speaker's name and what team does he work for at Honeycomb?

    -The speaker's name is Phillip, and he works on the product team at Honeycomb.

  • What is the main topic of Phillip's talk?

    -The main topic is how OpenTelemetry helps generative AI, although he mentions that he won't be discussing OpenTelemetry itself in depth.

  • What does Phillip consider to be the least interesting part of the project?

    -Phillip considers OpenTelemetry the least interesting part of building these applications, which he sees as a goal of the project: it should just work and be helpful without being the main focus.

  • What is the purpose of good observability in the context of generative AI applications?

    -Good observability is important for understanding what users are inputting, what the outputs look like, and how to improve the AI based on real-world usage.

  • What is the current state of AI in terms of accessibility and cost?

    -AI, particularly powerful machine learning models, is becoming more accessible and affordable for a broader audience, with costs decreasing over time.

  • What challenges do developers face when managing generative AI applications?

    -Developers face challenges such as managing costs, understanding model performance, and determining the right kind of application to build.

  • What is the significance of the 'killer apps' mentioned in the script?

    -The 'killer apps' like ChatGPT and GitHub Copilot represent successful applications of AI, but they also signal a competitive landscape and the need for innovation beyond chat and code-completion apps.

  • What does Philip mean by 'inscrutable black boxes' in the context of generative AI?

    -By 'inscrutable black boxes,' Phillip refers to the non-deterministic nature of AI models, which makes their outputs difficult to understand and predict.

  • What is the importance of understanding user behavior and inputs in AI application development?

    -Understanding user behavior and inputs is crucial for improving AI applications, as it helps developers to refine prompts, model usage, and overall application performance.

  • What role does OpenTelemetry play in addressing the challenges faced by developers in AI applications?

    -OpenTelemetry provides observability into the AI application's behavior, helping developers trace and understand the flow of data and the impact of each component.

  • What is the 'golden triplet' that Phillip mentions for analyzing AI applications?

    -The 'golden triplet' is the combination of inputs, errors, and responses for each user request, which is essential for evaluating and improving AI application performance.
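As a rough illustration (assuming a Python service and the OpenTelemetry API, not anything shown in the talk), the triplet can be recorded on one root span per request; the span name, the app.* attributes, and the run_pipeline stub below are invented.

```python
from opentelemetry import trace

tracer = trace.get_tracer("example.golden-triplet")

def run_pipeline(user_input: str) -> str:
    # Placeholder for the real retrieval + model call + post-processing pipeline.
    return "{}"

def handle_request(user_input: str) -> str:
    # One root span per user request carrying the triplet: input, error, response.
    with tracer.start_as_current_span("genai.request") as span:
        span.set_attribute("app.user_input", user_input)
        try:
            response = run_pipeline(user_input)
            span.set_attribute("app.response", response)
            return response
        except Exception as exc:
            span.set_attribute("app.error", str(exc))
            span.record_exception(exc)
            raise
```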

Outlines

00:00

πŸ€– Introduction to Generative AI and Open Telemetry

Phillip, a product team member at Honeycomb, introduces the topic of generative AI and its integration with OpenTelemetry. He emphasizes that the project's goal is to operate quietly in the background, benefiting users without being a central focus. The talk is based on his experience improving an AI feature by observing how users actually interacted with it after release. Observability plays a key role in understanding user inputs and system outputs, which is crucial for improving AI applications. The discussion touches on the accessibility of powerful machine learning models, the challenges of managing costs and performance, and the tension between creative outputs and users' expectations of reliability.

05:01

πŸ” The Role of Observability in AI Application Development

This section delves into the importance of observability in developing AI applications, particularly focusing on the input-output dynamics of generative AI models. It outlines the process of gathering contextual information to enhance the model's output, mentioning the concept of retrieval-augmented generation (RAG). The speaker discusses the challenges of ensuring the right data is fed to the model and monitoring the model's behavior with the correct inputs. The section also touches on the less critical but still important aspects like latency, error rates, and cost, suggesting that these are usually easier to manage compared to the core data handling and model behavior.

10:03

πŸ“ˆ Tracing and Observability in AI Application Management

The speaker presents a simplified diagram of a typical generative AI application, highlighting the complexity of the processes involved in input handling and output generation. He discusses the use of tracing to monitor the end-to-end flow of user interactions with the AI system. The importance of capturing detailed information about the system's operations, such as input prompts, model responses, and post-processing steps, is emphasized. The section also introduces the use of OpenTelemetry for observability, suggesting that while it may not be the most exciting part of the project, it is a crucial and well-suited tool for the job.
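As a sketch of treating the flow as a tracing problem, the following wires up the OpenTelemetry Python SDK with a console exporter and emits one child span per stage; the stage names, attributes, and stand-in values are hypothetical rather than taken from Honeycomb's application.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

# Print finished spans to stdout so the end-to-end trace is visible locally.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("example.pipeline")

def handle_query(user_input: str) -> str:
    with tracer.start_as_current_span("genai.handle_query"):
        with tracer.start_as_current_span("retrieval.search") as search_span:
            docs = ["doc-a", "doc-b"]                  # stand-in for vector / keyword search results
            search_span.set_attribute("retrieval.documents.count", len(docs))
        with tracer.start_as_current_span("llm.call") as llm_span:
            llm_span.set_attribute("gen_ai.request.model", "gpt-3.5-turbo")
            raw = ' {"answer": "..."} '                # stand-in for the provider response
        with tracer.start_as_current_span("postprocess"):
            result = raw.strip()                       # stand-in for deterministic clean-up steps
        return result

handle_query("what is my error rate?")
```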

15:04

πŸ“ Open Telemetry's Application in AI and Ongoing Developments

This part of the talk discusses the practical application of OpenTelemetry in AI systems, focusing on the need for detailed logging and analysis of user inputs, errors, and model responses. The speaker shares his experience using OpenTelemetry for pattern recognition and improvement of AI applications. He mentions the ongoing work in the OpenTelemetry community to define standards for instrumenting AI applications, including the handling of prompts, responses, and other metadata. The potential for auto-instrumentation in the future is also highlighted, suggesting that OpenTelemetry will become an even more integral part of AI application development.

20:05

πŸ”§ Debugging and Data Collection for AI Model Improvement

The final paragraph discusses the intricacies of debugging AI models and the importance of data collection for model improvement. The speaker talks about the challenges of dealing with black box AI models and the strategies used to understand and improve their outputs. He describes the process of building databases of examples for few-shot prompting and the use of CSV files to analyze patterns in user behavior and system responses. The section also touches on the practical aspects of data volume and the cost of observability systems, suggesting that while there are challenges, they are manageable and often lead to better system performance and understanding.
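For the few-shot example database, here is a toy sketch under stated assumptions: the example pairs and query strings are invented, and a naive word-overlap score stands in for the real search techniques the speaker mentions for picking which examples to embed in a given prompt.

```python
# Hand-written (input, ideal output) pairs; in practice these come from
# annotated production requests, and the outputs here are made up.
EXAMPLES = [
    {"input": "what is my error rate",       "output": "COUNT WHERE status_code >= 500"},
    {"input": "slowest endpoints last hour", "output": "P99(duration_ms) GROUP BY endpoint"},
    {"input": "requests per service",        "output": "COUNT GROUP BY service.name"},
]

def pick_examples(user_input: str, k: int = 2):
    # Naive relevance score: count of shared words with the stored example input.
    words = set(user_input.lower().split())
    scored = sorted(EXAMPLES,
                    key=lambda ex: len(words & set(ex["input"].split())),
                    reverse=True)
    return scored[:k]

def build_prompt(user_input: str) -> str:
    shots = "\n".join(f"Q: {ex['input']}\nA: {ex['output']}" for ex in pick_examples(user_input))
    return f"{shots}\nQ: {user_input}\nA:"

print(build_prompt("what's the error rate for my API"))
```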

Keywords

πŸ’‘Open Telemetry

Open Telemetry is an observability framework for cloud-native software, aiming to standardize the capture, transmission, and utilization of telemetry data (metrics, logs, and traces). In the context of the video, it is highlighted as a crucial tool for understanding and improving the performance of generative AI applications. The speaker mentions that while Open Telemetry is a foundational aspect, it is not the most interesting part of the discussion, which is the application of AI and its optimization.

πŸ’‘Generative AI

Generative AI refers to artificial intelligence systems that can generate new content, such as text, images, or code. The video discusses how generative AI has become more accessible and powerful, leading to a shift in the tech world. The talk mentions GPT-4 as an example of a generative AI model used behind an API.

πŸ’‘Observability

Observability in the context of software refers to the ability to understand the internal state of a system through external observations. The script emphasizes the importance of observability for monitoring user inputs, outputs, and the behavior of AI models in production. It is key for improving AI applications by providing insights into user interactions and system performance.

πŸ’‘Language Models

Language models are a type of generative AI that can understand and generate human-like text. The talk discusses the challenges of using language models, such as their non-deterministic nature and the difficulty of predicting their outputs. The speaker mentions GPT-3.5 as the model his team has been running in production.

πŸ’‘Prompting Techniques

Prompting techniques are methods used to guide the output of a language model by providing it with specific inputs or 'prompts.' The script mentions that there are numerous prompting techniques, which can affect the performance of AI models differently, and finding the right technique is part of the challenge in optimizing generative AI applications.

πŸ’‘Retrieval-Augmented Generation (RAG)

Retrieval-Augmented Generation is a technique where a language model is fed additional data at request time so that it can act as if it had been trained on that data. The talk describes RAG as a way to enhance the capabilities of language models without retraining, allowing them to produce more contextually relevant outputs.
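A bare-bones sketch of the RAG pattern as described here; the retrieve and call_model parameters are placeholders for a real search service and provider SDK, and the prompt template is invented.

```python
def answer_with_rag(user_input: str, retrieve, call_model) -> str:
    # retrieve(question) -> list of context strings; call_model(prompt) -> str.
    context = "\n".join(retrieve(user_input))
    prompt = (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {user_input}\nAnswer:"
    )
    return call_model(prompt)

# Toy usage with fake dependencies standing in for real ones:
print(answer_with_rag(
    "how do I find slow requests?",
    retrieve=lambda q: ["Events carry a duration_ms field you can filter on."],
    call_model=lambda p: "(model output would go here)",
))
```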

πŸ’‘Vector Search

Vector search is a method of information retrieval that involves representing data points in a vector space to find the closest matches. The video script discusses the importance of vector search in gathering contextual information to aid AI models in producing accurate responses. It is part of the process of building a 'context package' for AI model inputs.
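A toy illustration of the idea: hand-made three-dimensional "embeddings" stand in for a real embedding model and vector database, and only the cosine-similarity ranking step is meant to be representative.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Toy "embeddings"; a real system would compute these with an embedding model
# and store them in a vector database rather than a Python dict.
DOCS = {
    "schema: duration_ms, status_code": [0.9, 0.1, 0.0],
    "schema: user_id, cart_total":      [0.1, 0.8, 0.2],
}

def nearest(query_vec, k=1):
    # Rank documents by similarity to the query vector and keep the top k.
    return sorted(DOCS, key=lambda d: cosine(query_vec, DOCS[d]), reverse=True)[:k]

print(nearest([0.85, 0.2, 0.05]))
```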

πŸ’‘Traces

In observability, traces are end-to-end records of requests or transactions within a system. The script explains that traces are used to capture the flow of data and operations leading up to a language model call, which is essential for understanding and improving the performance of AI applications.

πŸ’‘Auto Instrumentation

Auto instrumentation refers to the automatic collection of telemetry data by libraries or frameworks without requiring manual code changes. The video mentions ongoing work in OpenTelemetry to create auto-instrumentation for AI applications, which would simplify the process of gathering observability data.
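The talk describes today's approach as patching from the outside, for example wrapping the provider SDK's calls. Below is a miniature sketch of that idea; FakeClient and the span and attribute names are invented, and real auto-instrumentation packages are considerably more careful than this.

```python
import functools
from opentelemetry import trace

tracer = trace.get_tracer("example.autoinstrument")

def instrument(fn):
    # In miniature, what auto-instrumentation does: wrap an SDK call so that a
    # span is emitted without the application code changing.
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        with tracer.start_as_current_span(f"llm.{fn.__name__}") as span:
            if "model" in kwargs:
                span.set_attribute("gen_ai.request.model", kwargs["model"])
            return fn(*args, **kwargs)
    return wrapper

class FakeClient:
    # Stand-in for a real provider SDK client.
    def complete(self, prompt: str, model: str = "gpt-3.5-turbo") -> str:
        return "ok"

FakeClient.complete = instrument(FakeClient.complete)  # patch "from the outside"
print(FakeClient().complete("hello", model="gpt-3.5-turbo"))
```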

πŸ’‘Evaluation

In the context of machine learning, evaluation refers to the process of assessing a model's performance against a set of data. The script discusses using evaluation to compare the actual outputs of an AI model with the expected outputs, which helps in identifying areas for improvement and debugging the model.
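A toy version of that loop, assuming exact-match scoring and an invented run_system placeholder; real evaluation sets are larger and usually scored less strictly than this.

```python
# Annotated (input, expected output) pairs, in the spirit of the talk's
# "this is what the output was, this is what it should be" annotations.
EVAL_SET = [
    {"input": "what is my error rate", "expected": "COUNT WHERE status_code >= 500"},
    {"input": "requests per service",  "expected": "COUNT GROUP BY service.name"},
]

def run_system(user_input: str) -> str:
    # Placeholder for the real retrieval + LLM pipeline under test.
    return "COUNT WHERE status_code >= 500"

def evaluate():
    passed = sum(run_system(case["input"]) == case["expected"] for case in EVAL_SET)
    print(f"{passed}/{len(EVAL_SET)} cases passed")

evaluate()
```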

πŸ’‘Debuggability

Debuggability is the ability to identify and fix issues within a system. The video script touches on the challenges of debugging complex AI models, which are often treated as 'black boxes.' It also mentions the potential for more sophisticated techniques to improve debuggability in the future.

Highlights

OpenTelemetry's role in generative AI is less about the technology itself and more about its seamless integration and utility.

Observability is crucial for understanding user interaction with generative AI applications and improving AI features based on real-world usage.

Generative AI models, while powerful, are often non-deterministic and can be challenging to manage for consistent output quality.

The accessibility of advanced machine learning models has democratized AI development, making it more widely available to developers.

Managing costs and understanding model performance are key challenges in the practical application of generative AI.

The importance of selecting the right application for AI to ensure its success and avoid competing with established market leaders.

Generative AI applications often involve a combination of user input, context gathering, and language model calls to produce output.

The concept of retrieval-augmented generation (RAG) allows for leveraging existing language models with contextual data to enhance responses.

Two key questions in building AI applications are ensuring the right data is fed to the model and verifying the model's correct behavior with the right inputs.

Latency and error rates are often secondary concerns in AI applications due to user expectations and the nature of language models.

The importance of logging and tracing in understanding the flow of data and decisions leading up to a language model call.

OpenTelemetry's fit-for-purpose nature makes it a suitable tool for tracing and observability in generative AI applications.

The ongoing work in the OpenTelemetry LLM semantic conventions working group to standardize the instrumentation of AI applications.

The potential for auto-instrumentation in AI SDKs to simplify the implementation of OpenTelemetry for AI applications.

The use of the 'golden triplet' of inputs, errors, and responses to analyze and improve AI application performance.

The practical approach to debugging and improving AI models by capturing detailed logs and understanding user behavior patterns.

The consideration of sampling strategies and cost management when dealing with large volumes of data in AI applications.
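One simple way to act on that, sketched under assumptions not in the talk: bucket requests into categories, keep every trace for rare or failing categories, and sample the high-volume common ones aggressively.

```python
import random

# Illustrative sample rates per input category; the categories and rates are
# invented, and a real system would derive them from observed traffic.
SAMPLE_RATES = {"common": 0.05, "rare": 1.0, "error": 1.0}

def should_keep(category: str) -> bool:
    return random.random() < SAMPLE_RATES.get(category, 1.0)

kept = sum(should_keep("common") for _ in range(1000))
print(f"kept roughly {kept} of 1000 common-category traces")
```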

Current research into true debuggability of AI models and the potential for more sophisticated understanding of model decisions.

The comparison of AI observability challenges with regular system observability and the focus on pattern recognition for improvement.

Transcripts

play00:00

so hi uh my name is Philip I work for

play00:02

honeycomb um I'm on the product team uh

play00:05

I um talking about a fun topic uh how

play00:09

open Telemetry helps generative AI I'm

play00:11

not going to talk about open Telemetry

play00:12

very much though um and that's because

play00:14

open Telemetry is like one of the least

play00:15

interesting parts of all of this which I

play00:17

think is kind of like a goal of the

play00:19

project if you will just kind of works

play00:21

and is really helpful for people um so

play00:23

that's what this is going to be about uh

play00:25

this is based on uh last year uh early

play00:28

last year um I did the old around and

play00:31

find out thing um where in the course of

play00:34

around I found out a whole lot about how

play00:36

you can make AI better when you uh build

play00:38

a feature and then release it to all of

play00:40

your users uh and um find out what they

play00:43

actually want to do with it uh once it's

play00:45

live um and it turns out having good

play00:48

observability for things like what are

play00:51

people putting into this input box what

play00:53

is the output look like what do we do

play00:56

about that um it's kind of a a good good

play00:59

use for observability so let's get into

play01:01

it um so a is all the hype AI is all the

play01:05

hype these days um this talk is not

play01:07

going to be focused on infrastructural

play01:09

level stuff so this is not about like

play01:10

monitoring your gpus um or anything like

play01:13

that if you're like working in the cloud

play01:16

offering gen Services you might care

play01:18

more about that you might care more

play01:19

about how you do inference um monitoring

play01:22

all of those are like completely other

play01:23

talks that we could give uh this is for

play01:26

the majority of people out there who are

play01:28

building applications that use some

play01:31

generative AI model like

play01:33

GPT-4 um behind an API and they want to

play01:37

just make it good because what's really

play01:39

cool is like it's kind of got into this

play01:42

world where like quite literally the

play01:43

world's most powerful machine learning

play01:45

models are broadly available for anyone

play01:47

to use at a relatively cheap price and

play01:50

getting a lot faster and a lot cheaper

play01:51

by the like month basically um so like

play01:54

this bottleneck where like you could

play01:56

maybe build something really cool using

play01:58

AI um was like stuck in the likes of

play02:02

Amazon and Google and meta and Microsoft

play02:05

and all that like you just couldn't do

play02:07

that as a normal developer now that's

play02:09

not the case anymore but that doesn't

play02:11

mean that like everything is all magic

play02:12

and sugar and all of that um there's

play02:16

there's a lot of problems that people

play02:18

have um when it comes to managing costs

play02:21

when it comes to understanding how these

play02:22

models even perform when it comes to

play02:24

figuring out what the right kind of

play02:25

application you're going to build is um

play02:28

there's like killer apps already like

play02:30

ChatGPT and GitHub Copilot but like chances

play02:32

are if you're going to create like a

play02:34

chat wrapper it's not going to do very

play02:36

well if you're going to try to create

play02:38

like a little code completion thing um

play02:40

Good Luck competing against GitHub uh

play02:42

they've got like a five-year head start

play02:44

on you so um you know it's great but

play02:47

like there's a lot of opportunities out

play02:48

there that are just outside of chat apps

play02:50

and outside of little tab completion

play02:52

code helpers um but I think like it's

play02:55

safe to say that this sort of broke the

play02:57

tech world uh and it's still a little

play02:58

bit broken and we're probably due for

play03:00

one of those like Gartner hype cycle

play03:02

troughs of disillusionment pretty soon

play03:05

um but like the the the like it's

play03:07

fundamentally changed now like we have

play03:08

fundamental Computing capabilities that

play03:10

we just didn't have before um so what

play03:14

does that mean well they are powerful

play03:17

but inscrutable black boxes um It Is by

play03:20

design that language models that

play03:22

generative AI uh in general is either

play03:26

non-deterministic by Design or even if

play03:27

you turn down the temperature conf which

play03:30

is a value that you can that you can use

play03:32

um down to zero uh depending on the

play03:33

model you're using it's still

play03:35

non-deterministic and like there's a lot

play03:36

of variants there where some smaller

play03:38

models actually are deterministic and

play03:39

all of that but like the point is if you

play03:41

want to generate so-called creative

play03:43

responses to Things based off of inputs

play03:45

that come in you don't want something

play03:47

that produces boring outputs usually

play03:50

like that you just don't use AI if

play03:52

that's the case like you want something

play03:53

that produces really interesting outputs

play03:55

that are interesting but like now your

play03:58

users kind of do expect some degree of

play04:00

reliability and you have an inscrutable

play04:02

black box that like you try to prompt it

play04:05

and like good luck trying to understand

play04:06

what the best prompting technique is

play04:08

there's like 20 or 30 of them that are

play04:10

probably helpful and some are going to

play04:12

regress certain things and make other

play04:13

things better and like you're not going

play04:15

to know up front which one is the right

play04:16

one um you're going to have totally

play04:19

different behavior in production

play04:20

compared to development because your

play04:21

users are going to do things you could

play04:23

never possibly expect and you're going

play04:25

to have to learn the hard way that like

play04:27

you got to do something different you

play04:28

can't just like write some unit test and

play04:30

hope it gets better that's not going to

play04:31

happen if you just try to say well it

play04:33

looks good on my machine let's throw it

play04:34

in production like it's going to it's

play04:36

going to produce garbage it's not going

play04:38

to be any good and you're not going to

play04:39

be able to keep that feature in

play04:40

production so um let's stare at a

play04:43

diagram for a little bit this is

play04:44

basically every geni app today uh it's

play04:47

massively oversimplified of course um

play04:50

but outside of like the super boring

play04:52

useless chat apps that people write um

play04:54

when they're not named open AI um

play04:57

there's some form of input generally

play04:59

producing some kind of output it's

play05:01

almost always some kind of Json and the

play05:03

things that happen in between are really

play05:04

interesting there's one or more language

play05:06

model calls it's usually only one

play05:08

because that's usually all that you need

play05:09

but there's this whole stack of stuff

play05:11

beforehand um called search Service uh I

play05:15

pulled this this diagram out there where

play05:16

it talks about Vector search a lot of

play05:18

the AI world is sort of relearning that

play05:20

Vector search is not the only kind of

play05:21

search that you can do that's really

play05:22

helpful the goal is that you want to

play05:25

take user input gather a whole bunch of

play05:27

contextual information about like what

play05:29

could be helpful to produce an answer to

play05:31

the question that they have or the

play05:33

output that you're trying to achieve and

play05:35

gather as much of that as possible and

play05:37

produce um uh a a context package if you

play05:39

will it's uh it's called retrieval

play05:42

augmented generation or rag it was this

play05:44

really cool behavior that some meta

play05:45

researchers found in 2020 where they

play05:48

figured out that language models if they

play05:50

are not trained on a certain kind of

play05:51

data but you feed in that data on a

play05:54

request to it they can kind of act like

play05:56

they were trained on that data and

play05:57

there's a little bit of wiggle room

play05:58

there but like it's really cool because

play06:00

you don't need to train your own

play06:02

language model you can use an

play06:03

off-the-shelf language model and pass in

play06:05

a whole bunch of stuff and produce

play06:06

useful things and so this is what almost

play06:08

everybody who is building AI apps today

play06:10

is building some form of this diagram

play06:14

okay so there's really two key questions

play06:18

that you need to answer when you're

play06:19

building this stuff and you want to make

play06:20

it better um notice I don't have the

play06:23

words latency error rate CPU statistic

play06:26

GPU blah blah blah whatever uh it's is

play06:29

the data right like is the right data

play06:30

being fed to the model in the first

play06:32

place like if I'm gathering context for

play06:34

somebody's user input am I actually

play06:36

Gathering the right context uh I talked

play06:38

to someone last year who has a version

play06:41

of that diagram where that search

play06:42

service is actually six different

play06:44

databases and one of the questions they

play06:46

have is okay based off of the user's

play06:47

input are we calling the right database

play06:50

or not um how many should we call how do

play06:53

we merge those results together what

play06:55

sorts of search um systems do we do we

play06:57

actually have and like can I

play06:59

systematically show that on like classes of

play07:02

inputs I can produce context packages if

play07:06

you will that are actually right for

play07:08

that kind of input how do I like measure

play07:11

that and see if it's continuing to

play07:13

improve over time and like not

play07:15

progressing over time similarly on the

play07:18

model side how do you know that it's

play07:19

behaving correctly when you actually do

play07:21

have the right inputs right so like

play07:24

assume that you have retrieval which is

play07:26

a really hard problem in a lot of cases

play07:28

solved is it actually still doing the

play07:30

right thing like are you using the right

play07:31

prompting techniques are you at a point

play07:33

where you need to actually fine-tune a

play07:35

language model are you at a point where

play07:37

God help you you have to train your own

play07:38

language model I certainly hope that's

play07:40

not the case and similarly can you

play07:42

systematically show that you're making

play07:43

progress when you go to production and

play07:46

people are inputting all kinds of weird

play07:47

stuff and there's like weird outputs and

play07:50

you start thinking that you're fixing

play07:51

those outputs are you a actually fixing

play07:54

those outputs and B are you not

play07:56

regressing the stuff that was already

play07:57

working in the first place these are

play07:59

like these are really important things

play08:02

um there's some other stuff that doesn't

play08:03

really matter as much but it still kind

play08:05

of matters around like latency and error

play08:06

rates um I'm labeling them this way

play08:09

because to be frank they're usually

play08:11

pretty easy to solve in part because

play08:13

users don't expect language models to be

play08:15

instantaneous and so if it takes like

play08:17

one or two seconds to produce a response

play08:18

it's usually fine uh these things are

play08:21

getting hugely better over time uh when

play08:24

we released our our application early

play08:26

last year um average response times were

play08:28

like 5 Seconds and now it's down to like

play08:30

1.5 seconds uh through like no no action

play08:34

of our own um cost is also something

play08:38

that like I mean in this economy

play08:40

everybody's worried about cost but like

play08:42

let's be real like most organizations

play08:44

have budget for AI and they're willing

play08:46

to spend it uh and you really don't need

play08:48

the most powerful models to achieve most

play08:49

outcomes that you're looking for uh

play08:51

we've been live in production with GPT-3.5

play08:53

since May of last year and have had

play08:55

no need to change it uh if we can do it

play08:58

you probably can do um and

play09:02

hallucinations like see previous slide

play09:04

people talk about oh I don't want the AI

play09:06

app to hallucinate but like it's not

play09:08

about hallucinations it's am I feeding

play09:10

the right information to the model and

play09:13

am I producing the right output based

play09:14

off of that right information that I'm

play09:16

feeding in the first place and can I

play09:18

actually systematically show that over

play09:19

time this is just the core of making

play09:22

these apps more

play09:23

reliable and so like a way that that

play09:25

might look is you could imagine you have

play09:28

a whole bunch of info like you want to

play09:29

log like a full prompt that you build up

play09:31

programmatically maybe you have a whole

play09:33

bunch of steps that lead up to that um

play09:35

in the the application that we built

play09:37

last year there's actually on the order

play09:39

of about 38 distinct operations that

play09:41

happen Upstream of a language model call

play09:43

so like logging all of that stuff and

play09:46

tracking that understanding your latency

play09:48

like your status code what your error

play09:50

was like your usage if you're doing any

play09:52

post-processing on the Json like what

play09:53

postprocessing steps you're actually

play09:55

doing your diagram kind of looks like

play09:57

this um and uh it just involves

play10:00

Gathering user input contextual

play10:02

information request to a service

play10:04

sometimes you may do multiple searches

play10:06

sometimes you may have to rerank search

play10:08

results based off of like different

play10:11

techniques that could work better and

play10:12

certain C like certain inputs May lend

play10:14

themselves better to a different like

play10:16

search um system like these are kind of

play10:18

complicated things um and like you

play10:20

eventually get to the point where you're

play10:21

calling an llm and you want to have like

play10:23

okay what was the input what was the

play10:24

output but like there's this whole

play10:25

system that you're trying to gather

play10:27

information about and post processing

play10:29

steps can often be a rather large um set

play10:32

of things like um speaking again from

play10:35

from production we have about two dozen

play10:36

or so possible post-processing steps

play10:38

that can occur where a language model

play10:40

gets something mostly right and that

play10:42

mostly right is actually something that

play10:44

we can deterministically check and

play10:46

either insert or remove data from like

play10:48

the response that we get like this is

play10:51

when you're in production you're trying

play10:52

to make stuff better for your users you

play10:53

find out all of this fun stuff where you

play10:55

can make this stuff actually work um so

play10:59

sounds an awful lot like a tracing

play11:00

problem right I got all this stuff

play11:02

happening Upstream of this black box

play11:04

maybe it involves some other black boxes

play11:06

maybe it involves a whole bunch of calls

play11:07

to language models maybe it eventually

play11:09

calls a language model maybe it calls a

play11:10

language model 20 times maybe it calls

play11:12

it five times who knows who cares I do

play11:14

something afterwards like there's all

play11:16

these words like Services flying around

play11:18

this is literally just a tracing problem

play11:20

this is an observability problem so this

play11:21

is where I talk about open Telemetry um

play11:24

and as I said the otel part is like one

play11:26

of the least interesting Parts but I

play11:28

think that's great because

play11:29

uh it's actually quite fit for purpose

play11:32

here um what do you want to capture well

play11:35

traces yay uh you have like an end to

play11:38

end flow like a user types in a thing

play11:40

and they like click a button or they hit

play11:42

enter like what are all the different

play11:44

things that are actually hit how do you

play11:46

capture that well you use traces to like

play11:48

tie all of that together now it gets a

play11:50

little bit more um into the Weeds about

play11:52

if you want to capture a whole bunch of

play11:54

information in that Trace data or if you

play11:56

want to capture like for example like a

play11:58

full prompt text or LM response

play12:00

depending on its size like that may be

play12:01

more fit for a log that you then

play12:03

correlate with the trace it's kind of up

play12:05

to you it's kind of up to like what you

play12:07

use for your tracing backend to analyze

play12:09

this data in the first place um you want

play12:11

to capture information about

play12:12

post-processing results um and you can

play12:16

also aggregate some metrics around

play12:18

things like latency and cost your

play12:19

typical error rate just typical boring

play12:21

stuff you can throw up on a dashboard to

play12:22

sort of say okay like I know that

play12:23

generally speaking it's doing all right

play12:26

um there is literally nothing as far as

play12:28

I can tell at least in my experience an

play12:30

open Telemetry that prevents you from

play12:32

doing this today um they're depending on

play12:35

the language you're using maybe like for

play12:37

example go with logs is like not as far

play12:39

along with Java with logs so if you have

play12:40

a Java app it's going to be a lot easier

play12:42

than if you use go or something but like

play12:44

fundamentally all the places are there

play12:45

for you to be able to do this um and so

play12:49

then you get into the fun stuff like

play12:50

actually analyzing this information um I

play12:54

have found I I I I put it in quotes I

play12:56

called it the golden triplet I don't

play12:58

know if it's actually that um inputs

play13:00

errors and responses for each request

play13:01

that a user gives and uh like if I have

play13:04

an agent or a chain or something like

play13:06

maybe there's there's like a correlation

play13:08

ID that I that I tie to like that

play13:09

particular thing that I'm doing or maybe

play13:11

it's represented as multiple traces that

play13:12

are linked together via span links um

play13:15

again oel fit for purpose for this kind

play13:17

of stuff um and I just look at patterns

play13:19

of inputs and outputs like in the

play13:20

natural language query feature that we

play13:22

built last year it was somebody that

play13:24

like for example Honeycombs back end is

play13:26

like strangely complicated to ask what

play13:28

an error rate is if you don't have a

play13:29

metric about that um and people were

play13:32

asking for what's my error rate and it's

play13:34

like well crap actually that is like

play13:36

weirdly unanswerable in certain ways so

play13:38

like what do we even do um when when

play13:41

like this is a common thing they want to

play13:42

do so like we were failing in like a

play13:45

category like you can imagine all the

play13:46

different ways that somebody might

play13:47

phrase what is my error rate um doesn't

play13:50

matter how they phrased it the category

play13:52

of input led to a category of output

play13:54

that just sucked and so we're like great

play13:57

this is like a class of bug that can now

play13:59

try to solve for and we can dig into

play14:02

some of those requests be like okay

play14:03

these are all the decisions that we made

play14:05

Upstream with a language model call this

play14:07

is what the language model actually

play14:08

produced these are the post-processing

play14:10

steps where like we accidentally just

play14:12

removed a bunch of stuff that we

play14:13

shouldn't have removed and there was

play14:14

like a bug in that that was unrelated to

play14:16

the language model it was just us being

play14:17

dumb um and uh brought it into

play14:21

development and just said great I have

play14:23

like concrete what is actually happening

play14:25

here and I can start then annotating

play14:28

outputs and saying this is what the

play14:29

output was this is what the output

play14:31

should be and that's called an

play14:33

evaluation if you're in the ml world and

play14:35

you start building up sets of these

play14:37

evaluations and then you can start

play14:38

systematically actually fixing this

play14:40

stuff and making it better and it makes

play14:42

these inscrutable black boxes tangible

play14:44

and actionable rather than just throw

play14:47

stuff at the wall and hope it sticks so

play14:50

what's open Telemetry doing to help well

play14:51

aside from being mostly fit for purpose

play14:54

um there is work going on in the uh llm

play14:57

semantic conventions working group on

play14:59

slack uh this is where it turns out

play15:02

there's a whole lot of common operations

play15:04

in this kind of application that you're

play15:05

building when you're talking about

play15:06

Vector databases you're talking about

play15:09

calling different language models whether a

play15:10

language model is like a single shot

play15:12

sort of thing or if it's a part of an

play15:14

agent um like there's names that you can

play15:16

assign to this kind of stuff and names

play15:18

that we are uh assigning to like what

play15:20

should live on a span versus like should

play15:23

this be captured in an event that's

play15:24

correlated to a span and like what

play15:27

should the default be should this data

play15:29

captured or should it be redacted by

play15:30

default and can you turn it on what does

play15:32

that mechanism look like um and we're

play15:34

working with the uh OpenLLMetry folks

play15:37

who have taken a spike at like let's

play15:39

build a bunch of Auto instrumentations

play15:41

for this stuff and see what it actually

play15:42

looks like and working with them to say

play15:45

okay based off of that this works this

play15:47

one doesn't work this one works really

play15:49

well this one yeah maybe I don't know

play15:51

and see if we can formalize that into a

play15:53

spec so it's very much underway right

play15:55

now um there are pieces that are like

play15:59

pretty like I don't want to say it's

play16:01

stable it's like totally experimental

play16:03

but like you could reasonably build

play16:04

instrumentations off of what's been

play16:05

defined today uh but there's a lot more

play16:07

work to come and uh we really I would

play16:10

really encourage anyone who's interested

play16:11

in this space to uh uh engage in this

play16:14

area um especially if you're working for

play16:17

any of the tech companies that is

play16:18

involved in building models because

play16:20

y'all models have like weird ways to

play16:22

capture inputs and outputs and like we

play16:24

like standards and stuff so it'd be

play16:26

great if we could figure out the best

play16:28

possible way way to represent stuff

play16:29

instead of treating OpenAI as a de facto

play16:32

standard for example um so this is

play16:35

what's going on otel is like good enough

play16:37

for you to use today you got to do a

play16:39

little bit more manual instrumentation

play16:40

but like chances are uh with the budget

play16:42

that's being assigned to these

play16:43

applications you're going to have the

play16:44

time to do that uh and there's going to

play16:47

be more Auto instrumentations coming

play16:48

there's good um uh good spec level stuff

play16:52

being defined right now and I think in

play16:53

the near future you could see this as

play16:55

being as commonplace in otel as like

play16:59

database stuff or HTTP stuff um and

play17:02

hopefully without too much churn in the

play17:04

spec itself so that's what I

play17:10

got uh right now this is all patching

play17:12

from the outside um so like I've written

play17:15

like a library for like python that that

play17:18

just calls like the open wraps the open

play17:19

AI calls for example from the python SDK

play17:22

um the OpenLLMetry project does similar

play17:24

sorts of things um what we are hoping as

play17:26

a part of this that um like the AI

play17:30

providers in their sdks just have the

play17:32

otel apis and just you know like it's

play17:35

just producing like no-op spans for

play17:37

example so it doesn't impact anyone

play17:38

unless they turn it on um kind of going

play17:40

again with sort of the goal of OTel

play17:41

where like instrumentation is everywhere

play17:43

and then it just you can just turn it on

play17:45

and and it's available so um but first I

play17:48

think like we need to lay some of the

play17:49

groundwork because they're going to have

play17:51

immediately questions like hey should I

play17:53

like put the prompt in the span or

play17:56

should I create an event or like should

play17:57

I even do that like what do I I do uh

play17:59

and that's where like us nailing this

play18:01

down on the spec Level side from the

play18:03

open Telemetry project standpoint is

play18:04

really going to help them out yeah so so

play18:07

that yeah for anyone who didn't hear the

play18:09

question this is so I gave input an

play18:10

example of like user input possible

play18:12

error and LM output as something that

play18:14

you could look at what are some other

play18:16

examples of things um so uh some some

play18:19

examples that I can tell you by way of

play18:23

um example from one of Honeycomb's

play18:25

features is so like it's like natural

play18:27

language to querying tool um so you need

play18:30

to query a data set that data set has a

play18:33

schema that schema can be massive and

play18:35

you can't just include literally every

play18:37

single name of everything in the schema

play18:38

inside of every request that you make to

play18:40

the model so there's this problem of

play18:42

like okay what subset do we actually

play18:43

pick which one is the most appropriate

play18:45

subset so um we have like like there's

play18:49

there's text based search and there's

play18:51

Vector search and there's like which one

play18:52

did we pick um which subset from each

play18:55

did we end up picking what was the

play18:56

actual like result that we gave so like

play18:59

you know I'm I'm like you know you you

play19:00

don't want to capture like the ACT if

play19:02

you use Vector embeddings you don't want

play19:03

to capture the actual Vector embeddings

play19:04

because they're massive and like you're

play19:06

not going to be able to interpret them

play19:07

but you want to distill that down a

play19:09

little bit uh there's also other

play19:10

contextual things so like for example in

play19:13

in our application um each request that

play19:15

a user makes may be different to other

play19:18

requests so like if you're talking to a

play19:20

different data set that's a different

play19:22

schema involved so you want to capture

play19:23

some information about what that what

play19:25

what's actually going on there so you

play19:26

can distinguish between what are my

play19:28

errors for this data set versus what are

play19:30

my errors for this data set and like do

play19:32

we perform better on one or the other

play19:33

and does that tell us like okay is that

play19:35

a problem with our prompting or is that

play19:37

a problem with how we do retrieval

play19:38

across different data sets um another

play19:40

thing is there's other like specific

play19:43

stuff that you pull in so a very common

play19:44

prompting technique is called um like

play19:47

few shot prompting where you sort of

play19:49

embed little examples inside of the

play19:50

prompt that you send as a part of a

play19:52

request uh you can actually create a

play19:54

database of those examples of like

play19:56

well-known okay given some like

play19:59

representation of what retrieval data

play20:01

looks like a user's input and what the

play20:03

ideal output for that input should look

play20:05

like based off of that data you can

play20:06

build up a whole a whole database of

play20:08

that you can also do search techniques

play20:10

on which pieces of that you actually

play20:12

pull in on a per request basis so if you

play20:14

have like 50 few-shot examples that are

play20:16

all like generally really good which

play20:19

three to five are going to be the most

play20:20

helpful for this specific request

play20:22

capture that information and then like

play20:24

basically what you end up with is you

play20:27

end up with like a really really big

play20:28

grouping like imagine like a big CSV

play20:30

with just tons and tons of columns and

play20:31

you're like okay for each request here's

play20:33

all the stuff that was interesting about

play20:35

that and now you get into like okay what

play20:37

are the patterns in each of those um

play20:39

user behaviors uh fun fact if you are

play20:41

working with an ml engineer they're

play20:42

going to want that CSV uh and they're

play20:44

going to want as many columns in it as

play20:46

possible because that's going to help

play20:47

their job if they're building like

play20:48

evaluation sets it's going to make them

play20:50

more um like I don't know I've talked

play20:52

with a bunch of ml engineers and they're

play20:53

like please load that CSV up with as

play20:55

much data as you possibly can like err

play20:57

on the side of too much data because

play20:59

it's probably not even enough um so like

play21:02

I don't know that that hopefully that's

play21:03

helpful um well gigabytes per

play21:07

hour I don't know it kind of depends on

play21:09

the application I would say that first

play21:11

chances are your prompts don't need to

play21:13

be as big as they are and your responses

play21:14

may not necessarily be that big either

play21:17

um but like this I think is not too

play21:20

different from any other observability

play21:21

problem regarding sampling like chances

play21:23

are that for some system there's going

play21:27

to be some like Pareto distribution of

play21:28

like the kinds of inputs that people

play21:30

actually want to ask about like if it's

play21:32

a natural language tool for like

play21:33

Prometheus for example um like 80% of

play21:37

the questions that people are going to

play21:38

ask are going to follow a pretty similar

play21:40

kind of pattern and so you could sample

play21:41

that much more aggressively than others

play21:43

and so there's ways that you could

play21:44

actually detect that um there are other

play21:47

like to be frank like some observability

play21:50

systems are a lot cheaper than others

play21:52

and like it's a great opportunity to be

play21:53

like oh wow maybe my bill's a little too

play21:55

high right now and um maybe per gigabyte

play21:59

pricing is not the right pricing scheme

play22:00

for what I'm trying to deal with um I

play22:04

think it kind of depends there but like

play22:06

I don't think we're really at a point

play22:08

where we're going to be limited by that

play22:10

unless you're at the like Amazon

play22:13

Microsoft scale of like oh I have a

play22:15

million users who are doing this well I

play22:17

don't know it's just going to be

play22:18

expensive operating at that scale is

play22:19

expensive um I think today yes there's

play22:22

like there is certainly some active

play22:24

research being done around like true

play22:25

debuggability into these things but like

play22:28

I I I think some of that could also just

play22:29

end up being like incomprehensible where

play22:33

like a model like GPT-3.5 even is just

play22:36

there's so many like activations of

play22:39

different layers that are going on that

play22:41

like it that may not even be helpful um

play22:43

or it may just be too hard to sort of

play22:45

wrangle around now I know that there are

play22:47

certain um things that you can do like

play22:49

you can um you can ask it to generate

play22:51

multiple responses and you have a system

play22:53

that picks which response you want and

play22:55

there's these there's these things that

play22:57

are called log probabilities that assign

play22:59

like okay the the probability of like

play23:02

this token like this and and it'll

play23:04

basically say like here's like a set of

play23:05

tokens that we were going to generate

play23:07

and these were the probabilities that

play23:08

were assigned to them and that's why

play23:10

this one was was was chosen now it

play23:12

doesn't tell you the actual

play23:13

decision-making process that led into

play23:15

that but that can inch you a little bit

play23:17

closer to that um to be honest I've not

play23:19

really run into anyone who's like really

play23:21

used that stuff a whole lot uh like I

play23:24

know that it exists but like it's um

play23:27

you're you're getting pretty

play23:28

sophisticated and you're debugging it at

play23:30

that point and I would say that most

play23:32

people are just not at that point yet um

play23:35

if ever um I think like also it's like

play23:38

regular observability of systems is like

play23:40

well sometimes we're working with stuff

play23:41

that are black boxes that kind of suck

play23:43

um sometimes and do weird stuff in

play23:45

production that you can never reproduce

play23:46

locally uh and then you are still still

play23:49

in this place of like okay like what

play23:51

patterns are leading to these outputs

play23:53

and what can I do with that info um and

play23:56

uh I think we're there right now

play23:59

now um we might get somewhere in the

play24:01

future where you could more like

play24:03

fine-tune debug something but um

play24:05

probably not for a while

Related Tags
OpenTelemetry, Generative AI, Product Team, Observability, AI Models, API Usage, User Input, Model Performance, Cost Management, AI Applications, Developer Tools