Mastering Summarization Techniques: A Practical Exploration with LLM - Martin Neznal

Productboard
9 Nov 2023 · 30:14

Summary

TLDR: The speaker discusses using large language models like GPT for text summarization and other natural language processing tasks. He outlines common issues when deploying these models in production, such as poor-quality output, instability, and evolving model versions. The talk then covers techniques that improve summarization quality, including cleaning the input data and careful prompting, as well as methods for evaluating summary quality. The speaker concludes by describing the challenges of scaling multiple production NLP services that rely on a single provider's API.

Takeaways

  • 😊 The talk focused on using large language models like GPT for text summarization and other natural language tasks
  • 📝 Cleaning and processing input text before feeding it into models improves summarization quality
  • 💡 Careful prompting, including context, instructions, and examples, significantly impacts model performance
  • 🔎 There are various methods to evaluate summarization quality, from reference-based to annotation-based
  • 🤖 The OpenAI API provides high-quality summaries, but has downsides like rate limits, frequent changes, and outages
  • ⏱ Deploying summarization at scale has challenges around processing speed, errors, and rate limits
  • 🔎 Regularly evaluating new language models is key to maintaining optimal production systems
  • 😕 Relying solely on one provider like OpenAI has risks, so backup plans should be considered
  • 🔒 Managing customer data privacy with third-party models requires transparency and secure pipelines
  • 📚 For free alternatives, the right choice depends on the specific use case; pretrained open-source models matched to the task can work well

Q & A

  • What were some of the initial challenges faced when deploying large language models into production?

    -Some initial challenges were getting low quality or nonsense summaries, figuring out which model works best for each use case, handling instability and outages of providers like OpenAI, and dealing with constantly evolving models.

  • What are two main categories of problems encountered with using large language models?

    -The two main problem categories are: 1) Quality of results - unclear how to achieve the best results for each model and use case. 2) ML engineering - issues like outages, instability, and models rapidly evolving over time.

  • How can preprocessing and cleaning of input text improve summarization results?

    -Preprocessing to remove irrelevant text, filter common/uncommon n-grams, select key sentences etc. helps GPT focus on the most salient parts of the document for better summarization.

  • How does prompting help in generating high quality summaries using GPT?

    -Prompting provides critical context about the purpose, reader, expected structure/length. It also includes examples and clear instructions of what content to include/exclude. This guides GPT to produce more accurate summaries.

  • What are some common methods used to evaluate quality of AI-generated summaries?

    -Reference-based (compare to human summary), pseudo-reference based (compare to auto-generated summary of key points), and annotation-based (manually label and validate summary quality).

  • What are some advantages and disadvantages of using OpenAI APIs in production?

    -Advantages are high output quality and model selection flexibility. Disadvantages are low rate limits, instability, frequent changes, and outages.

  • How often does the author's team evaluate new language models for production use?

    -The author's team re-evaluates the landscape of new language models, their quality, APIs, costs etc. every quarter to determine the optimal model for production.

  • What were some complications faced in deploying summarization models at scale?

    -Issues faced were OpenAI API slowness and errors, hitting rate limits quickly, instability requiring model switching, and prioritizing requests across multiple concurrent services and new customers.

  • What other NLP services does the author's company offer beyond summarization?

    -Other services offered are topic and entity summarization, real-time conversational summarization, embedding generation for search, sentiment analysis, and more.

  • What is the current challenge the engineering team is working on?

    -They are working on a middleware to optimally manage and prioritize requests across their various NLP services into the OpenAI APIs to maximize throughput.

Outlines

00:00

😀 Introducing the topic of mastering summarization techniques

The speaker introduces the topic of the talk - mastering summarization techniques using large language models. He discusses the hype around large language models and some of the challenges with using them, such as unpredictable quality of results, constantly changing models, and infrastructure/engineering challenges.

05:01

😊 Improving summarization quality with data processing and prompting

The speaker explains two key techniques that helped improve the quality of GPT-generated summaries: 1) Cleaning and processing the input text to remove noise, filter out certain n-grams, etc. 2) Carefully crafting prompts to provide context, instructions, and examples to GPT on what is needed.

10:01

😃 Evaluating the quality of AI-generated summaries

The speaker discusses different methods to evaluate the quality of GPT-generated summaries, including: 1) Reference-based methods like BLEU and ROUGE that compare to human summaries 2) Pseudo-reference methods that compare to auto-generated reference summaries 3) Annotation methods where humans manually assess and label summary quality.

15:04

🤔 Comparing capabilities of different language models

The speaker compares different large language models like GPT, Jurassic, Anthropic Claude etc. in terms of quality, cost, limits etc. For their use cases, OpenAI provided the best balance but they reevaluate models quarterly as the landscape keeps changing.

20:05

😊 Sharing experience deploying summarization in production

The speaker shares challenges faced while deploying GPT summarization in production - dealing with OpenAI outages, rate limits, scaling requests optimally. He discusses the need for a middleware to manage requests across services using OpenAI.

25:06

🙂 Discussing current challenges and future work

The speaker concludes by listing their other production services using LLMs (real-time streaming, embeddings etc.) and the challenge of making these services aware of each other to optimize OpenAI API usage. He invites interested folks to join them in solving these problems.

30:06

😊 Wrapping up main points covered in the talk

The speaker wraps up by highlighting the key points covered in his talk: techniques for summarization using GPT, evaluating quality of summaries, comparing different language models, and experience deploying summarization in production.

Keywords

💡Summarization

The main technique being discussed in the video. Summarization refers to using AI models like GPT to automatically generate summaries of documents and texts. The speaker explores challenges with getting good quality summaries from large language models, evaluating the summaries, and deploying summarization systems into production.

💡GPT

GPT stands for Generative Pre-trained Transformer. It is a family of large language models developed by OpenAI that can generate human-like text for a variety of applications like summarization, question answering, etc. The speaker uses models like GPT-3.5 and GPT-4 to generate summaries.

💡Prompting

Prompting refers to providing instructions and examples to GPT models to help guide their text generation. The speaker emphasizes how important prompting is to get high quality output from GPT when doing summarization or other natural language tasks.

💡Evaluation

Assessing the quality of automatically generated summaries using methods like comparison to human references, semantic similarity of key concepts, and manual annotation. This allows for iterating on and improving the models.

💡Production

The speaker discusses challenges with deploying summarization models into real-world production systems at their company to serve customer needs. This includes handling failures, rate limits, changes in models over time, etc.

💡OpenAI

OpenAI is an AI research company that has developed popular large language models like GPT-3. The speaker relies extensively on OpenAI's API for running production summarization workloads because of the high output quality, ease of use, and low cost.

💡Alternative models

Besides OpenAI, the speaker analyzes tradeoffs with alternative large language models, such as open source models and models focused on specific tasks. The choice depends on use case needs like precision vs. cost.

💡Multitask orchestration

The challenge of optimally routing production traffic for multiple concurrent NLP services (like summarization, search, etc.) to OpenAI while respecting constraints like rate limits and latency requirements.

💡Data privacy

An important consideration raised about how to process customer data securely when using third-party large language models like OpenAI's API.

💡Model evolution

The speaker emphasizes the need to continually evaluate new and improved large language models as they are released for potential integration into production NLP applications.

Highlights

There is hype around large language models, but actually using them can be challenging

Key problems when using LLMs are unpredictable quality and ML engineering issues

Cleaning and processing input text before feeding it to the LLM improves results

Prompting is critical for getting good predictions from LLMs

Reference-based, pseudo-reference-based, and annotation-based methods can evaluate LLM summary quality

OpenAI APIs have good quality but can have rate limits, changes, and outages

We regularly evaluate new LLM models for production use cases

Deploying LLMs has many real-world complexities to handle

We use GPT-3.5 for most production summarization

Relying solely on one LLM provider has risks

Other LLM models can have advantages for specific use cases

We tell customers that OpenAI does not train on data sent through its API

For free summarization, an open-source LLM matched to the use case would be used

Each customer's data is processed in a separate environment, so customers' data is never mixed

We do not currently evaluate the quality of document embeddings

Transcripts

Hi everyone. I prepared for you the topic of mastering summarization techniques: a practical exploration with large language models. This talk will be mainly about summarization, but I think it is applicable not only to summarization but to many different NLP tasks, and through summarization I basically want to show you our story of how we first deployed our models into production using large language models.

I would like to start with the hype around large language models. I suppose all of you here have seen it and know about it; everyone is talking about it. Two days ago OpenAI had a big keynote presentation during which they presented a lot of new stuff that is happening. So there is a super big hype around it, but when it comes to actual usage, is it really that easy? Can we just connect some NLP API to our text and get summaries, get topics, get sentiment, get whatever we want?

We actually started using large language models almost two years ago. We started with GPT and we wanted to get summaries. This was one of the first examples that we got as a summary: it is a recipe for how to make scrambled eggs. But we didn't feed that document to GPT. We had a feedback document, some of our customers had some problem, and we wanted to summarize that document, but GPT generated this summary.

There are many other problems with large language models in general. Here I group them into two categories. The first is the data science point of view, basically the quality of the results. I think it's better than it was a year or two ago, but it's still unclear how to get the best summaries, how to achieve the best results when you use these models. There are many different models you can use, and it's not that straightforward to know which model is best for which use case, and so on.

The second type of problem is related to ML engineering. For that I just wanted to show you what happened two or three hours ago: there was a big outage of OpenAI. For one and a half hours the OpenAI APIs, for both GPT-3.5 and GPT-4, weren't working at all, and this affected all our services; we have about five different models in production. This is, for example, what we were getting. This is just an internal dashboard that we have, and I'm showing you some examples of the errors we were getting, because of which we were basically unable to generate these predictions in our production. Back to the presentation.

Another problem in this era is that these models are constantly changing and evolving. Some people will tell you that you have to deploy your own open source model and run it on your own infrastructure, even though that may not be the right choice for you, because using OpenAI is cheap. So you have to think about it from these points of view.

So much for the hype. Now, with summarization, I would like to tell you how we are using GPT for summarization and for other tasks at Productboard. I want to show it on an example. This is a feedback document from the OpenAI community forum: someone is having a problem with the OpenAI website. When we just naively ask GPT to generate a summary, when we just say "GPT, summarize this", this is the summary we get. It's not optimal, because it's in the first person, and so on. But when we simply work with the prompt and ask in a better way, we receive a much better summary: it's more concise, it's in the third person, and so on. By the way, can you hear me well? I'm not sure. Yeah?

Cool. So, I would like to mention a few steps that helped us when using GPT. I will not cover all of them, that would be a separate topic for a complete talk, but I want to mention the two most important things that helped us. The first is processing and cleaning of the text that we feed into GPT. GPT itself doesn't require text in a human-readable form, so what you can do is clean the data and process it: remove all system text, and you can also do some n-gram filtering, removing n-grams that occur either very often or very rarely, and so on. Or you can apply more advanced methods. For example, when you are facing multi-document summarization, let's say you want to summarize thousands of documents and generate one summary for all of them, you can select the most salient sentences by some approach and generate summaries only for those sentences.
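
As a rough illustration of that kind of cleanup, a minimal sketch in Python might look like the following; the boilerplate patterns, the sentence splitting, and the document-frequency threshold for n-grams are illustrative assumptions, not the actual Productboard pipeline.

```python
import re
from collections import Counter
from itertools import chain

# Illustrative patterns for signatures and other system text; tune for your data.
BOILERPLATE = re.compile(r"^(sent from my|on .+ wrote:|-{2,}|unsubscribe)", re.IGNORECASE)

def clean_document(text: str) -> str:
    """Drop empty lines and obvious system/signature text before sending a document to the model."""
    lines = (ln.strip() for ln in text.splitlines())
    return "\n".join(ln for ln in lines if ln and not BOILERPLATE.match(ln))

def ngrams(tokens: list[str], n: int = 3) -> list[tuple[str, ...]]:
    return list(zip(*(tokens[i:] for i in range(n))))

def drop_templated_sentences(docs: list[str], n: int = 3, max_doc_ratio: float = 0.5) -> list[list[str]]:
    """Remove sentences whose n-grams show up in most documents, a rough proxy for templated text."""
    split = lambda doc: [s for s in re.split(r"(?<=[.!?])\s+", doc) if s]
    doc_sents = [split(d) for d in docs]
    # Count in how many documents each n-gram occurs.
    doc_freq = Counter(chain.from_iterable(
        set(chain.from_iterable(ngrams(s.lower().split(), n) for s in sents))
        for sents in doc_sents))
    limit = max_doc_ratio * len(docs)
    def keep(sentence: str) -> bool:
        grams = ngrams(sentence.lower().split(), n)
        return not grams or sum(doc_freq[g] > limit for g in grams) / len(grams) < 0.5
    return [[s for s in sents if keep(s)] for sents in doc_sents]
```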

That was the first thing that helped us. But the second thing, and the one that really helped us the most in achieving good summaries, and not only summaries but any prediction we generate, was prompting. I think it's a fairly well-known fact that prompting is really important, and I just want to quickly summarize how we actually prompt GPT. It is really important to give it the context: basically tell GPT why you are asking it to do something, who will be reading the output, how it should look, and so on. Give it a really clear definition of what you want to happen, what you want it to generate and what you don't want it to generate. You can also use quotes to separate the instructions from the input, and you can try few-shot prompting to give it some examples, in this case examples of good summaries.

This is just an example of what we use in production for generating some summaries. You can see that we are telling it: "You are an assistant helping product managers summarize feedback from various customers." This is the context. The people who usually read our summaries are product managers in different companies, so this should help GPT know what the output should look like. You can also give it specific instructions. Here it's quite a short list, but you can easily have tens of different instructions; in this case we are telling it that the output shouldn't be in bullet points or an ordered list, and should be at most two sentences. So this is basically the idea.
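
A sketch of what such a production prompt can look like in code, using the openai Python client (the v1 interface); the exact instruction wording and the gpt-3.5-turbo model choice are illustrative, and the few-shot example is optional.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You are an assistant helping product managers summarize feedback from various customers.\n"
    "Instructions:\n"
    "- Write at most two sentences.\n"
    "- Do not use bullet points or ordered lists.\n"
    "- Write in the third person; never copy the author's first-person voice.\n"
    "- Only include information present in the feedback; do not invent details."
)

def summarize(feedback: str, example_input: str | None = None, example_summary: str | None = None) -> str:
    """Build the prompt: context and instructions, an optional few-shot example, then the delimited input."""
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    if example_input and example_summary:
        # Few-shot example showing what a good summary looks like.
        messages.append({"role": "user", "content": f'Summarize the feedback between triple quotes.\n"""{example_input}"""'})
        messages.append({"role": "assistant", "content": example_summary})
    messages.append({"role": "user", "content": f'Summarize the feedback between triple quotes.\n"""{feedback}"""'})
    response = client.chat.completions.create(model="gpt-3.5-turbo", messages=messages, temperature=0.2)
    return response.choices[0].message.content
```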

So now you roughly know what you can use to generate good summaries. Let's say you have those summaries, and let's say you have a system that is generating tens of thousands of them. Can you somehow assess their quality? Can you somehow know which summaries are good and which ones you don't want to show to customers? There are many different methods for that; here I prepared a brief list of them.

The first group of methods is called reference-based evaluation. The idea is that you generate the summary using the AI method, in this case GPT, then you ask a human to actually read the document and write a summary, and then you compare the two. Based on their similarity you measure how good the summary generated by the AI is.

Another set of methods is pseudo-reference-based evaluation. The idea is similar, but in this case you don't have to ask a human to write the summary: you generate the reference summary automatically, for example by taking the most important, most salient sentences in the document, and then you compare those artificially created summaries. The advantage is that it doesn't require any input from humans, but it's not that precise.

The last set of methods is annotation-based evaluation. The idea is that you generate the summaries, then you read those summaries and the input documents and validate the quality, so you basically assess the quality with labels, for example that a summary is good, okay, or bad. You generate summaries for, say, one hundred examples, read them, evaluate them, and when you sum it up you know what the quality is.

I will just briefly name some examples of these methods. The first two, from reference-based evaluation, are methods called BLEU and ROUGE. These are quite simple methods; they have existed for something like 20 or 30 years, I think. The idea is that you compare the number of matching words: you take the words in the reference summary and the words in the summary generated by GPT, count the matching words, and divide that count either by the number of words in the reference summary or by the number of words in the machine-generated summary. So these are roughly the recall and precision of the summary. As you can probably imagine, these methods are quite simple: they only match when the words are literally the same, and if there is only semantic similarity they cannot capture it, so their usage is quite limited.
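
The core counting step behind those word-overlap metrics can be sketched in a few lines; real ROUGE and BLEU implementations add stemming, higher-order n-grams, and length handling, so treat this only as the basic idea.

```python
from collections import Counter

def overlap_scores(reference: str, candidate: str) -> dict[str, float]:
    """Unigram recall (ROUGE-1-style) and precision (BLEU-1-style) between two summaries."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped count of words that appear in both summaries.
    matches = sum(min(cand_counts[w], ref_counts[w]) for w in cand_counts)
    return {
        "recall": matches / max(sum(ref_counts.values()), 1),      # matches / words in reference
        "precision": matches / max(sum(cand_counts.values()), 1),  # matches / words in candidate
    }

human = "Users cannot reset their password from the mobile app."
generated = "The customer reports that password reset fails in the mobile app."
print(overlap_scores(human, generated))
```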

On the other hand, we have another method called BERTScore, for example. The idea of this method is that it uses embeddings and then compares the similarity of those embeddings. The advantage of such methods is that they are more precise and the results are better, but they are more time-consuming: it may take longer to actually compute those similarities and get the overall quality of the summary.

For pseudo-reference-based evaluation, what you can do is somehow find the important sentences in the input document and use them to build the reference summary. You can select the important sentences manually, you can use TextRank, basically whatever method can find you the important parts of the document, and then compare that important part of the document with the summary you generated and use that as the quality of the summary. As you can probably imagine, this is not very precise, but the advantage is that it doesn't require any human input, and you can use it to assess the quality of thousands of summaries.
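
A sketch of that embedding-based comparison, assuming the OpenAI embeddings endpoint; the model name and the 0.75 threshold are illustrative, and the same cosine check works whether the reference is a human summary, a pseudo-reference built from salient sentences, or the input document itself.

```python
import math
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> list[list[float]]:
    """Embed a batch of texts; the model name is an assumption, any embedding model works."""
    resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return [item.embedding for item in resp.data]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def summary_looks_on_topic(pseudo_reference: str, generated_summary: str, threshold: float = 0.75) -> bool:
    """Flag summaries that drift away from the salient content of the input document."""
    ref_vec, gen_vec = embed([pseudo_reference, generated_summary])
    return cosine(ref_vec, gen_vec) >= threshold
```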

So that was about the methods. This is again an area for a completely separate presentation, so I only briefly described some of them. I was describing this for summaries, but I think it is applicable to almost any NLP task where you want to validate the quality of NLP predictions.

At Productboard we actually use a combination of the second and third methods. We use the third one, the annotation-based method, when we want to assess the quality of a new prompt. Let's say we want to test a new prompt, a new variant: we generate results based on the new prompt, actually read those summaries, compare them with the input documents and label them. Then we have a number at the end, we can compare it with the current result that is in production, and replace the prompt if the new results are better.

The second method we use in production when we want to assess the quality of thousands of summaries. We generate tens of thousands of summaries daily and we obviously cannot read them all, but we don't want to send our customers a recipe for how to make scrambled eggs. That's where it comes in: we use these techniques to check automatically whether the summary is talking about a similar thing as the input document. So that was about generating summaries and how to assess their quality.

Now I would like to step slightly away and go to the large language model space, and basically discuss what language models exist. I will not describe all of them. As you all probably know, there are open source models and paid ones; here is a list, and it is changing basically every day. For example Grok, I think it was announced last week by Elon Musk. So this is changing all the time, and when you have production jobs or production models that are generating NLP predictions, you want to have the best predictions, so you need to evaluate the quality and the parameters of these models quite often.

Here I just wanted to show example summaries from these models. This was about a year ago, so I don't know if it's still the case, but the summaries from the open source models were not great. For example, this is a summary from the FC model, and it's, yeah, nonsense.

So this was the list of models. As was probably visible from my talk, we use OpenAI for all our use cases. It has some pros and cons. The advantages of OpenAI, I would say, are that the quality of the results is the best and we can also choose which models we want to use: if we have something that needs to be super precise, we use GPT-4; for use cases that don't have to be super precise, we can use GPT-3.5, and so on. The other advantage is that we don't need to serve and maintain it on our own. We are actually a small team, only three ML engineers, and we don't have the capacity to deploy a model in our own architecture and take care of it, especially as usage of OpenAI is quite cheap and relatively stable.

The disadvantages are, for example, the rate limits: if you want to use OpenAI to generate quite a lot of predictions, it's not always that easy. I think the base limits for OpenAI are quite low, so if you have, for example, three use cases in production that are using OpenAI, you might struggle with the rate limits, and your pipelines will sit in a queue waiting for space to actually generate those predictions. The other problems are that it's changing basically every day and that there are outages, as we saw just a few hours ago. These are the cons of OpenAI.

This basically means that regularly, every quarter, we sit together and look at the new models that were launched. We see how we can actually use them, whether they have an API, what the rate limits are, what the performance of the models is, and what the quality of, let's say, summaries and other predictions is. We do this to have the best possible model in production. Currently we are using OpenAI, and I would say that for at least the next few quarters we will keep using it, but who knows, maybe in a year we will migrate to some new model.

At the end, I would like to share with you the story of how we deploy summarization in the wild, optimally. It sounds like an easy thing, right? Let's say there is a customer, and this customer wants you to generate thousands of summaries: they have thousands of documents and they want those documents summarized. So you feed those documents into your ML pipeline, they get processed, the prompt is created, you feed it into OpenAI, and then you send the results to production. Yeah, that is the ideal situation. In a real scenario it doesn't work like that. OpenAI is quite slow, so you would like to generate the summaries in parallel. That doesn't quite work either, because what happens quite often with OpenAI is that there are errors or some minor outage, so some summaries are not processed and you need to retry. Sometimes one of the models isn't working properly, so you need to switch to a different GPT model. There are the rate limits I mentioned, so you cannot feed all those thousands of requests to GPT in parallel, because it would crash; you basically have to feed them in as soon as the rate limits free up and there is space. And all of this gets even more complicated when a new customer comes in at the same time and wants more and more predictions generated.
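
In practice this means wrapping every call in retries and fallbacks. A minimal sketch with the openai v1 Python client might look like this; the model order, retry count, and backoff schedule are illustrative assumptions.

```python
import time
from openai import OpenAI, APIError, APITimeoutError, RateLimitError

client = OpenAI()

def robust_completion(messages: list[dict], models: tuple[str, ...] = ("gpt-3.5-turbo", "gpt-4"),
                      max_retries: int = 5) -> str:
    """Try the preferred model first, back off on rate limits and errors, then fall back to the next model."""
    for model in models:
        delay = 1.0
        for _ in range(max_retries):
            try:
                resp = client.chat.completions.create(model=model, messages=messages)
                return resp.choices[0].message.content
            except RateLimitError:
                time.sleep(delay)   # wait for rate-limit headroom to free up
                delay *= 2          # exponential backoff
            except (APIError, APITimeoutError):
                time.sleep(delay)   # transient error or minor outage: retry the same model
                delay *= 2
    raise RuntimeError("all models failed; re-queue the document for a later run")
```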

So that was summarization. I would like to briefly describe the problems we are working on now. Summarization is only one of the services we have: we summarize single documents, we also summarize topics and other entities, and we also have a real-time streaming service. This streaming service works a bit like ChatGPT when you are chatting with it; you can use it when, say, you have some long document and you want to generate the pain points from that document, and it then streams the results into Productboard. We also have an embedding service that generates embeddings, and we use those embeddings for semantic search, for topics, and for other features. All of this is nice.
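
A sketch of how a streaming call like the one behind that real-time service can be driven with the openai v1 client, so results can be pushed to the UI token by token; the prompt and model name are illustrative.

```python
from openai import OpenAI

client = OpenAI()

def stream_pain_points(document: str):
    """Yield the model's answer incrementally so the UI can render it while it is being generated."""
    stream = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": f"List the main pain points in this feedback:\n{document}"}],
        stream=True,
    )
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            yield delta

# Example: print tokens as they arrive.
for piece in stream_pain_points("The export to CSV times out for large workspaces..."):
    print(piece, end="", flush=True)
```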

But then, how do you tell one system to be aware of another initiative? How do you tell the summarization service that it should wait for a while because there is something more important in the streaming service, where some user is waiting for the prediction right now, while summaries don't have to be there within one second, because the user doesn't mind that much? This is, for example, our current challenge that we are working on: to prepare this box, some middleware, that will take the requests from all the services and feed them optimally to OpenAI and then into Productboard.
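
As a toy illustration of that middleware idea, a single priority queue in front of the OpenAI calls could let interactive streaming requests jump ahead of background batch summaries while a crude global budget throttles dispatch; the priority levels, the requests-per-minute number, and the worker shape are all assumptions, not the actual Productboard service.

```python
import heapq
import itertools
import time
from dataclasses import dataclass, field
from typing import Callable

@dataclass(order=True)
class Job:
    priority: int                      # 0 = interactive streaming, 1 = batch summaries
    seq: int                           # tie-breaker: FIFO within a priority level
    call: Callable[[], None] = field(compare=False)

class OpenAIDispatcher:
    """Single place through which every service sends its OpenAI requests."""

    def __init__(self, requests_per_minute: int = 60):
        self.queue: list[Job] = []
        self.counter = itertools.count()
        self.min_interval = 60.0 / requests_per_minute

    def submit(self, call: Callable[[], None], priority: int) -> None:
        heapq.heappush(self.queue, Job(priority, next(self.counter), call))

    def run(self) -> None:
        while self.queue:
            job = heapq.heappop(self.queue)   # most urgent job first
            job.call()                        # the actual OpenAI request lives in the closure
            time.sleep(self.min_interval)     # crude global rate limiting
```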

So that was it from me. Here I have a summary generated by GPT; I just pasted the document into GPT. I've discussed summarization techniques using the GPT model, how to actually evaluate the quality of summaries, what other models we can use and what their advantages and disadvantages are, and how we actually deployed it into production. That was it for me, thank you for your attention. And I just want to tell you that if, for example, the problem I was describing two minutes ago is something you would like to help us solve, we are hiring, so let us know; we are looking forward to hearing from you. Thank you.

I think now is the time for questions.

"Do you have your own workforce to evaluate summaries when doing the supervised evaluation, or do you outsource it? If you outsource, what do you use and how satisfied are you with the provider?" Yeah, unfortunately we use our own workforce. We ask our product managers and our support team for help, and it's not that we would be validating tens of thousands of summaries this way, so we are able to manage it on our own.

I will go to the next one: "Which GPT model version are you using? Did you do some cost-benefit analysis?" We are using GPT-3.5; there were multiple variants when it was launched and I'm not sure exactly which one it is, but for the majority of our initiatives we use this one. We also have semantic search in production: you write some input and it is able to find you feedback that is similar to the input you wrote. For that we actually use GPT-4, because it helps us get the best results: we use an embeddings model to generate the embeddings, but when we search over those embeddings we use GPT-4, because it helps us get the best results. I hope this answers the question, but feel free to ask a follow-up on that.
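
A sketch of the embedding-based search described in that answer: embed the feedback corpus once, embed the query, and rank by cosine similarity. The embedding model name is an assumption, and the step of handing the top hits to GPT-4 for the final ranking or answer is left out.

```python
import math
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> list[list[float]]:
    resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return [d.embedding for d in resp.data]

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b)) / (
        math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

feedback_notes = ["Export to CSV is broken", "Mobile app crashes on login", "Please add dark mode"]
index = list(zip(feedback_notes, embed(feedback_notes)))   # embed the corpus once, keep vectors with the text

def search(query: str, top_k: int = 2) -> list[str]:
    """Return the feedback notes most semantically similar to the query."""
    q_vec = embed([query])[0]
    ranked = sorted(index, key=lambda pair: cosine(q_vec, pair[1]), reverse=True)
    return [text for text, _ in ranked[:top_k]]

print(search("sign-in problems on the phone"))
```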

Yeah, I will just repeat the question: whether we tried to compare GPT-4 with GPT-3.5. We did compare them, and the performance of GPT-4 was better, but, for example for summarization, the price (I'm not sure now, because I think the price changed on Tuesday) was something like 20 times higher for GPT-4 than for GPT-3.5, and it's not as if the predictions were 20 times better. So we were completely fine with the current version.

"Do you think that in day-to-day work people would also benefit from enhanced prompting, I mean giving context, a concrete request?" I'm not sure if I'm the right one to respond to this question, but I think that obviously, if we explained why we want certain things to happen, then probably yes. A follow-up question: "Would people benefit from introducing a quality metric, like 'a prompt to your colleague does not satisfy; enhance the prompt', rephrasing like Grammarly but for prompt quality, or would it be too much?" Well, um...

Yes: "For which languages are we providing the solution?" The majority of customers that use Productboard use it in English, but we have customers that use it only in French, Russian, maybe Czech, I think. The solution we have works in all languages (the majority of these use cases work in all languages), however the prediction we generate is always in English. So let's say you have some text that you want summarized and it's in French: we generate the summary in English. So the majority of it is in English. We have some initiatives where we don't support other languages; I'm not sure, but I think it was sentiment analysis, for which we use a pretrained model that is English-only.

"OpenAI models are dominant in the field due to their brand recognition and ease of implementation. Is there any risk in relying on a single model provider? Do different models from different companies have advantages for certain use cases?" To the first question: there definitely is a risk. For example, during the outage that was happening a few hours ago we had no backup, and if that outage lasted two days, we would be in trouble, because we don't have a backup. So it is a risk. We have a way to solve it in mind, but it hasn't been a priority for us. And for OpenAI, I would say we are quite a small customer, we are not, I don't know, Notion, for example, which might be using it quite a lot, so I think a long outage would be quite a big problem for OpenAI itself, and I don't think something like that has a big probability of happening. But we definitely have it in mind and we are thinking about it. To the second question, I would say yes, and it's not only about models from different companies but also about open source models: when you have a model that is trained, for example, to summarize conversations, it can perform in a similar way to OpenAI for summarizing conversations. You cannot use that model to generate summaries in any other domain, but if you use it for conversations, the results are good. So definitely, some other models have advantages for specific use cases.

"Have you tried to improve the prompt using meta-prompting? For instance, you have a summary score which you are trying to optimize, you include it in the prompt, and iteratively the model converges to a better prompt." No, we haven't tried it, but yeah, it's a good question.

I don't know how much time we have. Cool. "How do you manage to process data from your customers in a private and safe way, considering third-party large language models?" That was one of the main problems our customers had with large language models. They were not happy, as, for example, OpenAI had some data leaks and so on, and they asked what will happen to their data, whether OpenAI is training on it. Currently we tell those customers what OpenAI states publicly: that they do not train their models on the data we send through the API. So we tell this to the customers, and I hope that basically answers the question. Also, when we have pipelines (we have thousands of customers, for example), we always process data from each customer in a separate environment, so we don't mix our customers' data.

"How do you evaluate the quality of document embeddings?" Good question: we are not doing it. I would say the approaches I was mentioning could be applicable in some way, but yeah, we are not evaluating it.

Maybe time for one last question. Cool. "If paying for a service is not an option, what free LLM would you use for text summarization?" I would say it depends on the specific use case. If I needed to summarize books, or if I needed to summarize conversations, I would use some pretrained open source model; I don't recall the name right now, I think I have some of them on my slides. If I wanted the model to be as general as OpenAI, I'm not sure, and I don't have a recommendation in that field.

Cool, that's it for me, thank you.