Goldfish Bowl RAG Intro

Royal English
6 Sept 2024 · 28:57

Summary

TLDR This video gives an overview of the RAG (Retrieval Augmented Generation) project, in which a model is trained using additional context in the form of reference texts. The goal is to teach the model to combine information from different sources effectively when answering questions. The project involves writing unique prompts grounded in the reference texts and rating two generated responses on a scale of 1 to 3 against five criteria. The core requirements for prompts are that they use the reference texts and avoid generic topics. The video walks through examples of good and bad prompts, as well as the methodology for rating responses.

Takeaways

  • 🐟 RAG (Retrieval Augmented Generation) is a method in which a model is trained to use additional context from reference texts to answer prompts.
  • 🔍 In the Goldfish Bowl project, participants write prompts based on reference texts in order to teach the model to combine different pieces of information effectively.
  • 📝 Prompts must be grounded in the reference texts and specific enough to avoid generic or poorly targeted requests.
  • 🚫 Using chatbots or other LLMs (Large Language Models) to write prompts is prohibited and can get a participant banned.
  • ✅ Good prompts are grounded in the reference texts and include a main request plus an additional constraint.
  • 📑 Each prompt needs between two and ten reference texts attached, each at least 150 words long.
  • ✅ Model responses are rated on a scale of 1 to 3 in five categories: reference text grounding, truthfulness, helpfulness, instruction following, and writing style.
  • 🔍 Reference text grounding and truthfulness are the most important criteria when choosing the preferred response.
  • 📝 In multi-turn tasks, the preferred response is the one that best satisfies the prompt and uses the reference texts to continue the dialogue.
  • 👍 Every response must be checked and rated carefully in order to teach the model to use reference texts correctly and give accurate answers.
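The takeaways above can be made concrete with a minimal sketch of how a RAG prompt is assembled before it reaches the model. The function name and prompt template here are illustrative assumptions, not the project's actual tooling:

```python
# Minimal RAG prompt assembly: the model is asked to answer using ONLY
# the reference texts supplied alongside the prompt. All names here are
# illustrative, not the project's real interface.

def build_rag_prompt(reference_texts: list[str], user_prompt: str) -> str:
    """Prepend numbered reference texts to the user's prompt."""
    blocks = [
        f"[Reference text {i}]\n{text}"
        for i, text in enumerate(reference_texts, start=1)
    ]
    context = "\n\n".join(blocks)
    return (
        f"{context}\n\n"
        "Answer the following using only the reference texts above.\n"
        f"{user_prompt}"
    )

prompt = build_rag_prompt(
    ["EU regulators introduced new crypto trading rules this quarter..."],
    "How might the new EU rules affect my Bitcoin holdings?",
)
```

The point of the structure is the one the video keeps stressing: the question only makes sense against this particular set of reference texts.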

Q & A

  • What does the acronym RAG stand for?

    -RAG stands for 'retrieval augmented generation': the model is given additional context via reference texts and trained to answer on the basis of that context.

  • What are the key requirements for writing a prompt for the RAG project?

    -The prompt must be grounded in the reference texts, be at least 10 words long, contain no pleasantries, be sufficiently complex with a main request and additional constraints, avoid being contrived, and not use word or sentence count constraints.

  • What is a reference text and what is its role in the RAG project?

    -A reference text is additional information supplied to the model to help it answer the prompt. The model uses this context-specific text to generate its response.

  • What criteria are used to rate model responses in the RAG project?

    -Responses are rated on five criteria: reference text grounding, truthfulness, helpfulness, instruction following, and writing style.

  • How is 'reference text grounding' defined for rating purposes?

    -A response is grounded in the reference texts if every claim is directly based on information from the reference texts and no information from other sources is included.

  • What does the 'truthfulness' category mean when rating responses?

    -'Truthfulness' assesses whether the claims in a response are correct: whether they are based on accurate, verifiable facts.

  • What is the purpose of the 1-to-3 rating scale for each of the five criteria?

    -The scale provides a fine-grained measure of response quality on each criterion, which allows a better-founded, more precise comparison between the two responses.

  • Why is the Likert score important when selecting the preferred response?

    -The Likert score indicates which of the two responses is preferred and expresses how much better one is than the other, on a scale from 'much better' to 'about the same'.

  • What does 'instruction following' mean in the rating, and how does it affect the result?

    -Instruction following assesses how well the model understands and fulfils the requirements of the user's prompt, including its constraints and main request.

  • What is the sequence of steps for creating and rating a prompt in the RAG project?

    -The sequence is: write a prompt grounded in the reference texts, obtain two model responses, rate each response on the five criteria, choose the preferred response using the Likert score, and justify the choice.
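The rating workflow in the last answer can be sketched as a small data model. The field names and the direction of the 1-5 Likert encoding are assumptions for illustration, not the project's actual schema:

```python
# One rated turn: two responses scored 1-3 on five criteria, then a
# 1-5 Likert preference with a justification. Names are illustrative.
from dataclasses import dataclass

CRITERIA = (
    "reference_grounding", "truthfulness", "helpfulness",
    "instruction_following", "writing_style",
)

@dataclass
class TurnRating:
    ratings_a: dict  # criterion -> 1 (major issues), 2 (minor), 3 (none)
    ratings_b: dict
    likert: int      # e.g. 1 = A much better ... 5 = B much better
    justification: str

    def validate(self) -> bool:
        """Check every criterion is scored 1-3 and the Likert score is justified."""
        for scores in (self.ratings_a, self.ratings_b):
            if set(scores) != set(CRITERIA):
                return False
            if not all(v in (1, 2, 3) for v in scores.values()):
                return False
        return 1 <= self.likert <= 5 and bool(self.justification.strip())
```

A record like this makes the guideline enforceable: a preference without a justification, or a score outside 1-3, fails validation.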

Outlines

00:00

🐟 Introduction to the RAG (Retrieval Augmented Generation) project

This video introduces the RAG project, where RAG stands for 'retrieval augmented generation'. The project trains a model to use additional context, in the form of reference texts, to answer questions. As a motivating example, a company could use such a model to generate reports and answer questions using only its own internal data. The video also covers common errors and important points to keep in mind when working with the model.

05:01

📝 Writing prompts for training the RAG model

The second part of the video focuses on writing prompts, which must be grounded in the reference texts. Prompts should be specific, contain no pleasantries, and be sufficiently complex, with a main request and additional constraints. Using ChatGPT or other LLMs to create prompts is prohibited and can get a participant banned from the project. Examples of good and bad prompts are given, along with ways to improve them.

10:02

📚 Using reference texts and rating model responses

The third part discusses how reference texts are used to answer prompts and how to rate the two generated responses. Each is rated on a scale of 1 to 3 against five criteria: reference text grounding, truthfulness, helpfulness, instruction following, and writing style. The video explains in detail how to check every claim in a response against the reference texts and for factual accuracy.

15:04

🔍 Checking the accuracy and completeness of responses

The fourth part covers how to verify the accuracy and completeness of model responses. It stresses checking each claim both against the reference text and for truthfulness using Google or other sources, and walks through example responses and their ratings against the stated criteria.

20:05

📊 Rating and preference between model responses

The fifth part focuses on choosing the preferred response of the two generated. A 5-point scale is introduced for expressing preference, and the choice should rest on the key criteria, above all reference text grounding and truthfulness. The video also explains how to justify the preference with examples and details from the responses.

25:05

🗣️ Multi-turn dialogues and using preferred responses

The final part covers building multi-turn dialogues using the preferred responses. It gives advice on writing natural, relevant follow-up prompts grounded in the reference texts and stresses that using the reference texts correctly is essential to training the RAG model successfully.
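The multi-turn flow described above, where the preferred response from each turn becomes the context for the next, can be sketched like this; it is a simplified illustration, not the project's actual task interface:

```python
# Multi-turn conversation: each turn is a prompt/response pair, and only
# the PREFERRED response from each turn is carried forward as context.
# Field names are illustrative.

def conversation_context(turns: list[dict]) -> list[dict]:
    """Flatten turns into a message list, keeping only preferred responses."""
    messages = []
    for turn in turns:
        messages.append({"role": "user", "content": turn["prompt"]})
        # turn["preferred"] selects which of the two responses continues
        messages.append({"role": "assistant",
                         "content": turn["responses"][turn["preferred"]]})
    return messages

history = conversation_context([
    {"prompt": "Tell me about Sweden.",
     "responses": ["Sweden is a Nordic country...", "Sweden, officially..."],
     "preferred": 0},
])
```

This mirrors the video's definition of a turn as a prompt-response pair: a follow-up prompt is written against the history produced here.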


Keywords

💡RAG

RAG (Retrieval Augmented Generation) is a training method that gives the model additional context, in the form of reference texts, to improve its answers. In the context of the video this means the model uses the supplied reference texts to generate more accurate and informative responses to the user's prompt. An example from the script is using internal company data to answer questions about sales or inventory.

💡Reference Text

A reference text is additional information supplied to the model so that it can use it when generating responses. The video stresses that reference texts are central to training the RAG model, because they tell the model what information to use when answering a specific prompt.

💡Prompt

A prompt is the request or question the user puts to the model. The video notes that the prompt must be grounded in the reference texts and make use of them to shape the response. It is the key element of the RAG project, since it determines how the model reacts to user input.

💡Turn

A turn is one prompt-response pair in the dialogue between the user and the model. The video notes that tasks specify a minimum and maximum number of turns, which shapes the structure of the dialogue used to train the model.

💡Likert Score

The Likert score is the rating used to express a preference between the two model responses. The video explains that the chosen response receives the better Likert score, which feeds into further training of the model. It is the key element of the response selection and rating process.

💡Rubric

A rubric is the set of criteria used to judge the quality of a model response. The video covers the aspects rated for each response: reference text grounding, truthfulness, helpfulness, instruction following, and writing style.

💡Truthfulness

Truthfulness is the rubric criterion that assesses whether the response's claims match the facts and the information in the reference texts. The video stresses verifying every claim the model makes in order to ensure training quality.

💡Helpfulness

Helpfulness is the criterion that measures how well the response satisfies the user's request. The video says a response should be not only accurate but fully satisfying for the user: it should include all relevant information and follow the prompt's instructions.

💡Instruction Following

Instruction following is the criterion that checks how well the model understands and carries out the user's request. The video notes that responses must satisfy every aspect of the prompt, including any constraints or specific requirements.

💡Writing Style and Tone

Writing style and tone is the criterion covering the quality of expression and formatting of the response. The video stresses that responses should be not only informative but also well structured and engaging for the user.

Highlights

RAG (Retrieval Augmented Generation) is an AI technique that uses reference texts to provide context for generating responses.

The project involves training a model to answer prompts using specific reference texts, ensuring the responses are grounded in the provided information.

Reference texts are crucial as they guide the model to generate responses that are relevant and accurate to the given prompts.

Prompts must be based on the reference text and should not be general or contrived; they should emulate real user queries.

Avoiding sensitive topics and ensuring prompts are specific and complex is key to effective RAG training.

The project requires participants to write prompts that are at least 10 words, with a minimum and maximum number of turns specified.

Prompts should be free from pleasantries and word/sentence count constraints to mimic natural user interactions.

Examples of bad prompts include those that are too broad, contain spelling mistakes, or lack specificity.

Good prompts are specific, grounded in reference texts, and avoid summary requests or word count constraints.

The model generates two responses for each prompt, which are then reviewed and rated based on five criteria.

Criteria for rating responses include reference text grounding, truthfulness, helpfulness, instruction following, and writing style and tone.

Reference text grounding is critical; all claims in the response must be directly grounded in the reference text provided.

Truthfulness ensures that the claims made in the response are accurate and verifiable, avoiding unfounded statements.

Helpfulness evaluates how well the model answers the user's request, with a focus on providing relevant and sufficient information.

Instruction following assesses the model's ability to understand and adhere to the constraints and requirements of the user's prompt.

Writing style and tone are the least important criteria but still contribute to the overall quality of the response.

A five-point Likert scale is used to indicate preference between the two model responses, with justifications required for the selection.

Justifications must align with the Likert score, providing specific examples and details to explain the rating choices.

For multi-turn tasks, the preferred response from the previous turn is used as context for the next turn, maintaining a natural conversation flow.

The importance of utilizing reference texts in prompts cannot be overstated, as it is fundamental to the success of the RAG project.
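The prompt and reference-text requirements collected above (a prompt of at least 10 words; two to ten reference texts of at least 150 words each, totalling 500-2500 words) can be checked mechanically. This is a sketch under those stated limits, not the project's actual validation code:

```python
# Validate task inputs against the limits stated in the guidelines.
# Thresholds come from the video; function names are illustrative.

def word_count(text: str) -> int:
    return len(text.split())

def validate_task(prompt: str, reference_texts: list[str]) -> list[str]:
    """Return a list of guideline violations (empty list = OK)."""
    problems = []
    if word_count(prompt) < 10:
        problems.append("prompt must be at least 10 words")
    if not 2 <= len(reference_texts) <= 10:
        problems.append("need between 2 and 10 reference texts")
    for i, text in enumerate(reference_texts, start=1):
        if word_count(text) < 150:
            problems.append(f"reference text {i} is under 150 words")
    total = sum(word_count(t) for t in reference_texts)
    if not 500 <= total <= 2500:
        problems.append("total reference text length must be 500-2500 words")
    return problems
```

Note that these checks only cover the measurable limits; grounding the prompt in the reference texts, the requirement the video calls paramount, still needs human judgment.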

Transcripts

play00:05

George and I am a member of the Goldfish

play00:07

Bowl rag team thank you for joining the

play00:11

project and taking the time to watch

play00:13

this introductory video in this video

play00:16

I'm going to be giving a bit of an

play00:17

overview as to what we are doing in this

play00:20

project and why specifically how we are

play00:22

going to do it and then I will also go

play00:25

over some common errors we've been

play00:27

seeing as well as just some important

play00:29

things to keep in mind with some tips so

play00:33

first and foremost as the is in the name

play00:36

this is a rag project rag is an acronym

play00:40

for retrieval augmented generation so

play00:43

what that means is we are going to be

play00:46

using the model in a very specific way

play00:48

in order to train it to do rag that

play00:51

means we are giving the model additional

play00:54

context via reference texts to help it

play00:57

answer our prompt and so imagine you

play01:00

have a company and your company has a

play01:02

large language model it would be great

play01:05

if you could give that large language

play01:06

model all of your internal company data

play01:09

like sales projections supply

play01:12

chain data uh inventory calculations and

play01:16

then use that large language model to

play01:19

answer questions like create a business

play01:21

report and it would only use the

play01:23

information in that material you gave it

play01:26

so that's essentially the context of rag

play01:30

we are going to be prompting the model

play01:32

giving it what we call reference text or

play01:35

additional

play01:36

information and then the model will use

play01:38

that information specifically to answer

play01:41

our question so we are teaching the

play01:43

model how to do that what information to

play01:46

use and how to combine information

play01:50

effectively across different reference

play01:52

texts so that is a brief overview as to

play01:55

rag it is a super exciting space right

play01:58

now which makes this project all the

play02:00

more exciting as

play02:01

well so in this project specifically we

play02:05

are going to be giving the model a

play02:07

prompt with a set of reference text the

play02:10

model will generate two responses we

play02:13

will be reviewing each response uh and

play02:16

rating it according to five different

play02:18

criteria then we will decide which of

play02:21

these models we prefer and we will

play02:23

indicate that preference with what is

play02:25

called a Liker to score we are going to

play02:28

justify our preference and explain why

play02:30

we are rating and then we will

play02:33

potentially continue the conversation in

play02:35

another

play02:36

turn so to go into more specifics step

play02:39

one of this project is to write a prompt

play02:42

that utilizes reference text some key

play02:44

requirements The Prompt needs to be at

play02:46

least 10 words we give a list of three

play02:49

suggested topics for inspiration

play02:51

although they are not

play02:55

mandatory the tasks will also be given a

play02:58

specified minimum and maximum number of

play03:00

turns a turn is a prompt response pair

play03:03

so if I ask a model tell me about Sweden

play03:08

and then it tells me about Sweden and

play03:10

then I ask give me an itinerary for

play03:13

Sweden and then it responds again that'

play03:15

be two turns in that conversation

play03:17

there's two prompt response

play03:19

pairs what's really important in this

play03:23

project is that the prompts utilize the

play03:26

reference

play03:27

text I will reiterate the prompts need

play03:31

to be based on the reference

play03:34

text so what that means is that prompts

play03:37

cannot be general for example to say

play03:41

summarize these reference texts is a

play03:43

terrible prompt the prompts must be

play03:46

based on the reference text and use that

play03:48

information in a very specific way I

play03:52

will go over that in more detail but

play03:54

this is Paramount to the project the

play03:57

prompts need to be based on the

play03:59

reference text in addition we have other

play04:03

guidelines for the prompts they should

play04:05

really emulate how a user would use the

play04:07

model and not be contrived they should

play04:10

not have any pleasantries no hey yo

play04:13

thank you I appreciate it nothing like

play04:16

that they should be sufficiently complex

play04:19

with a main request and additional

play04:22

constraint so this example here

play04:25

highlights a few important things but

play04:28

the prompts tell me about George

play04:29

Washington in exactly 10 sentences is

play04:32

terrible it is bad for a number of

play04:35

reasons this is contrived a real user

play04:38

would not care very much about a

play04:40

specific sentence count and on that note

play04:43

we want to avoid any constraints that

play04:45

have word counts or sentence counts this

play04:48

is also a summary and it's a very basic

play04:51

request a good example would be the

play04:54

following explain the new regulatory

play04:56

changes for cryptocurrency trading that

play04:58

were implemented in the EU in the past

play05:01

three months and how it might affect my

play05:03

investments in Bitcoin and

play05:05

ethereum this prompt is incredibly

play05:09

specific imagine ref texts that explain

play05:11

the regulatory changes maybe an analysis

play05:14

of it as well as just the

play05:16

facts the current state and maybe future

play05:19

predictions of Bitcoin how these relate

play05:22

to it this is a great way to teach the

play05:24

model how to do

play05:26

rag when we are doing rag we want to

play05:29

make sure to avoid sensitive topics so

play05:32

nothing inappropriate or hot button or

play05:35

contentious finally you cannot use chat

play05:38

GPT or other llms to create prompts or

play05:41

if we catch you doing it you will be

play05:42

banned for the

play05:44

project now I will go over a few more

play05:48

examples here are some bad examples

play05:50

first hey I'm doing a school project on

play05:53

DH Lawrence who was he first and

play05:56

foremost it starts with a

play05:58

pleasantry second of all it is a simple

play06:00

summerization

play06:02

request this example here is also

play06:05

incredibly contrived and unnatural with

play06:07

all of the spelling mistakes overly

play06:09

informal tone and lack of specificity of

play06:13

the

play06:14

request now we will go over two examples

play06:17

of making prompts

play06:19

better here you can see a prompt yo can

play06:22

you give me a list of athletes who left

play06:24

has a great impact of

play06:27

sports we can remove the pleasant tree

play06:29

become more specific and become more

play06:31

grounded in the reference

play06:34

text so a similar but significantly

play06:37

better example would be who are some of

play06:40

the most impactful athletes in Olympic

play06:42

history what are their greatest

play06:43

achievements and most memorable

play06:46

moments another similar prompt compare

play06:50

the impacts accomplishments and Legacies

play06:52

of top Olympic athletes you s bolt

play06:54

Michael Phelps and Simone biles you

play06:56

could imagine reference text here maybe

play06:58

perhaps one on bolt one on Michael

play07:01

Phelps one on Simone biles one on a

play07:05

analysis of top moments in the Olympics

play07:07

one specifically on the 2012 Olympics so

play07:11

on and so forth but these are specific

play07:14

to the ref text and you can imagine a

play07:16

real user writing

play07:18

this moving on we have another bad

play07:21

example you are a 9-year-old who's picky

play07:25

who's a picky eater that's going to

play07:27

Japan for 2.5 weeks what's your dream

play07:29

itinerary look like incredibly

play07:32

broad incredibly

play07:35

contrived so we can make this better by

play07:37

adulating a real user becoming more

play07:39

specific and becoming more grounded in

play07:41

the reference text a better example

play07:44

would be I'm going to Japan with my

play07:46

nine-year-old daughter for two and a

play07:47

half weeks she's a picky eater though

play07:50

what are some good types of foods for

play07:51

her that are still authentic you have a

play07:54

number of constraints your nine-year-old

play07:56

daughter who's a picky eater your time

play07:58

frame good types of foods that are

play08:01

authentic but for picky eaters would be

play08:04

very specific to the reference

play08:06

text an even better example I'm going to

play08:09

K in Osaka a specific region with my

play08:12

nine-year-old daughter hicker for two

play08:14

and a half weeks what are some good

play08:16

restaurants where I can get an authentic

play08:18

local meal but where she can have some

play08:20

options now we are going from types of

play08:23

foods to specific restaurants in a very

play08:25

specific area and the constraints of the

play08:28

user getting authentic local

play08:30

meal but the daughter having options as

play08:33

a picky

play08:34

eater so what do good examples look like

play08:38

good examples have responses that use

play08:40

the reference texts to fully answer the

play08:42

questions of the

play08:44

prompt example one I'm thinking about

play08:48

taking my 13-year-old daughter to see

play08:49

the movie Inside Out too because I think

play08:52

it is about dealing with emotions during

play08:53

puberty I want to have a conversation

play08:56

with her

play08:57

beforehand what should I say

play09:00

you can imagine the reference text about

play09:02

the plot about reviews perhaps even

play09:04

advice on how to speak to your

play09:06

13-year-old

play09:07

daughter and as a result the model will

play09:10

use all of this in order to answer it'll

play09:12

get the context of the movie how it

play09:15

relates to dealing with emotions during

play09:17

puberty and perhaps information about

play09:20

how to approach this conversation based

play09:22

specifically on the reference text

play09:24

another example is the Yankees have not

play09:27

been great for the past few seasons

play09:28

despite having Aon judge and Garrett

play09:30

Cole why haven't they been able to live

play09:33

up to their potential you would imagine

play09:35

reference text here about the Yankees

play09:37

recent performance perhap stats about

play09:39

Aaron judge and Gary Cole analysis of

play09:42

their past

play09:43

Seasons you know you can really have a

play09:45

lot of different analysis different

play09:46

opinions as well as combining that with

play09:49

facts like statistics from those past

play09:52

Seasons again the model will incorporate

play09:54

all of this decide what is relevant

play09:57

extract it combine it use it to fully

play09:59

answer and satisfy this

play10:02

prit a final example based on the recent

play10:05

stock market volatility and the

play10:07

expectation for the FED to cut rates in

play10:09

their next meeting should I be looking

play10:11

to invest in more equities or treasuries

play10:14

you could imagine here the reference

play10:16

text would be about recent Market

play10:17

movements analysis of the stock market

play10:19

and perhaps news or even predictions

play10:22

about the Federal

play10:23

Reserve so reiterating the most

play10:27

important fact and requirement about

play10:30

writing a prompt in this rag project is

play10:33

that it utilizes the reference text in a

play10:35

very specific way here is what it will

play10:37

look like you will have suggested topics

play10:41

you will have the box for your prompt

play10:43

and now is the part about adding

play10:44

reference text I'll note that you can

play10:47

actually find the reference text before

play10:48

writing your prompt sometimes it is

play10:50

helpful to come up with a category or a

play10:52

topic maybe find reference text or find

play10:55

your prompt around

play10:57

them but again the model we use the

play10:59

reference text to answer the prompt and

play11:01

as a rule of thumb if the prompt can be

play11:04

asked on any set of reference texts it

play11:06

is a bad prompt like a summary I will

play11:10

repeat the rule of thumb is that if a

play11:12

prompt can be asked on any set of

play11:14

reference text then it is a bad prompt

play11:18

in order to add your reference text you

play11:19

will click on the purple plus sign in

play11:21

this box here to add between two and 10

play11:25

reference texts and their

play11:27

URLs each reference text needs to be at

play11:30

least 150 words and the total length

play11:32

should be between 500 and 2500

play11:36

words you also May split up and reorder

play11:40

your reference text what that means is

play11:42

that a single URL and serve as multiple

play11:45

reference texts perhaps you upload the

play11:48

first paragraph and the second paragraph

play11:50

as the content with the URL as one

play11:52

reference text and the next one you can

play11:54

upload the last paragraph and the same

play11:58

URL as the next reference text these can

play12:02

be formatted as either markdown or raw

play12:07

formatting so now that we've The Prompt

play12:09

that utilizes the reference text and the

play12:11

reference text the model will give us

play12:13

two

play12:14

responses we will be rating those

play12:16

responses on a scale of one to three

play12:18

where one is major issues two is minor

play12:21

issues and three is no issues on the

play12:24

following five criteria ref text

play12:26

grounding truthfulness help

play12:29

instruction following finally writing

play12:32

style and

play12:34

tone we will go over the rubric now with

play12:36

some

play12:38

examples so what we mean by reex

play12:40

grounding is the question are each of

play12:42

the claims and the response grounded in

play12:44

the ref text reorienting oursel to the

play12:48

goal of this project is teaching the

play12:49

model rag to use the reference text to

play12:52

answer our question that means the model

play12:54

does a good job if it is using that

play12:57

information and only that information

play12:59

that means the model is doing a bad job

play13:02

if it is using information that is

play13:04

completely unrelated and outside of the

play13:06

reference text so major issues would be

play13:10

if core response statements are not

play13:12

directly round grounded in the reference

play13:14

text minor issues would be some

play13:17

statements are not directly grounded or

play13:19

inferred from the reference text or

play13:21

supplementary to the

play13:23

response no issues would be that all the

play13:25

response statements are grounded in the

play13:27

reference text so let's look at an

play13:29

example The Prompt here is I'm looking

play13:32

for the best noise cancelling headphones

play13:33

with the best sound quality what do you

play13:35

recommend the reference text is this

play13:38

following segment from an article about

play13:40

noise cancelling

play13:42

headphones imagine if this were the

play13:44

model response it says the Sony wh1000

play13:48

xm4 Sony

play13:50

w1004 seems to be the best all-around

play13:52

Choice balancing performance and value

play13:54

however if budget isn't concern you want

play13:55

the absolute best in noise cancellation

play13:57

and sound quality the Bose quiet Comfort

play14:00

Ultra headphones might be an ideal chair

play14:02

Suite so how do I evaluate this I need

play14:06

to go Claim by claim in this response

play14:09

and make sure that it comes from the

play14:11

reference text so the claim that the

play14:14

Sony headphones are the best allaround

play14:17

choice is in the reference text the

play14:21

statement that the Sony headphones

play14:23

balance performance and

play14:26

value comes from the reference text on

play14:28

bewall and quality design and value is a

play14:32

quote however if budget isn't concern

play14:34

you want the absolute

play14:36

best the Bose quiet Comfort Ultra might

play14:39

be ideal choice for you we can see in

play14:41

the reference text the Bose quiet

play14:43

Comfort Ultra headphones are the best

play14:45

premium noise cancelling headphones

play14:47

don't mind spending a bit extra to get

play14:49

the best these offer the best do so this

play14:53

would be scored a three on raex

play14:55

grounding there are no issues every

play14:58

statement every claim in this response

play15:00

is directly grounded in the reference

play15:03

text now we can look at response

play15:06

B the Sony wh1000 xm4 improves on last

play15:10

year's model the wh1000

play15:15

xm3 this statement right here is not

play15:18

directly grounded in the reference text

play15:21

we can see in the reference text

play15:24

that these are the newest release since

play15:27

the Sony wh1000

play15:29

xm3 but in nowhere in the reference text

play15:33

does it say it is last year's model that

play15:36

is

play15:39

inferred so this would be rated A2 minor

play15:42

issues on reex grounding as this claim

play15:45

that it came out last year is inferred

play15:48

from the reference

play15:51

text so if you aren't confident that an

play15:53

inference like this could be made air on

play15:55

the side of caution and be strict it is

play15:58

important that if you between two scores

play16:00

on this project choose the lower one be

play16:04

strict now let's imagine this the

Now let's imagine the response says that the Sony headphones are the best all-around choice, balancing performance and value, with a Speak-to-Chat technology that automatically reduces volume during conversations. That feature is nowhere to be found in the reference text; it is completely outside of scope. Hence this would score a 1, or major issues, on ref-text grounding. This is a major fail, because a core part of the reasoning in the response fails to use the information from the reference text; it is outside of it.

Another example: in this one the response says, "However, if budget isn't a concern and you want the best, go with the Sennheiser Momentum 4 Wireless." Again, this is a major issue, as the Sennheiser Momentum 4 Wireless appears nowhere in the reference text. It is outside of scope; there is a major grounding issue.
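The 1-3 ref-text grounding rubric walked through above can be sketched in code. This is a hypothetical illustration, not project tooling: the claim labels are assigned by a human rater who checks each claim against the reference text, and only the score mapping is automated here.

```python
def grounding_score(claims):
    """Map per-claim labels to the 1-3 ref-text grounding score.

    Each claim is hand-labeled by the rater:
      "grounded" - stated directly in the reference text
      "inferred" - a reasonable inference, but not stated directly
      "outside"  - appears nowhere in the reference text (out of scope)
    Per the guidance above, when in doubt be strict and score lower.
    """
    if any(label == "outside" for label in claims):
        return 1  # major issues: a claim falls outside the reference text
    if any(label == "inferred" for label in claims):
        return 2  # minor issues: a claim is only inferred, not directly grounded
    return 3      # no issues: every claim directly grounded

# The worked examples above:
assert grounding_score(["grounded", "grounded"]) == 3  # response A (Bose quotes)
assert grounding_score(["grounded", "inferred"]) == 2  # "last year's model" inference
assert grounding_score(["grounded", "outside"]) == 1   # Speak-to-Chat feature
```

Note how "outside" dominates "inferred": one out-of-scope claim is enough for a major-issues score, matching the examples above.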

Now that we understand ref-text grounding, we can go to the next category, which is similar: truthfulness. Here we are asking: is each of the claims in the response correct or incorrect? Major issues means one or more major claims contain meaningful inaccuracies or unfounded claims, making the response unhelpful to the user. A minor issue would be that the major claims are factual and accurate, but perhaps minor claims contain meaningful inaccuracies or unfounded claims. No issues means all claims are accurate based on reputable web evidence.

So we'll look at an example. The example prompt is: what are the impacts of advancements in renewable energy on the economy? Here would be the reference text. Response A breaks the impacts into three categories: job creation, infrastructure investment, and a reduction in energy costs. We can see that the solar technology mentioned here appears in the reference text; the wind power also appears in the reference text; the battery storage also appears in the reference text. Increased job opportunities in engineering, manufacturing, and installation roles all appear in the reference text. The section on infrastructure investment, "the growth of the renewable energy sector has stimulated investments in infrastructure," is also directly grounded in the reference text; let us find exactly where: here, "stimulating investments in infrastructure." Finally, the reduction in energy costs, "a decrease in energy costs over time," comes directly from the reference text. So there are no issues here; these are all true statements.

Now imagine there was a response B, where it says advancements in renewable energy are expected to generate 1,342,049 jobs by 2030. This is a minor truthfulness error because of an unverifiable claim: the number of jobs is unfounded, and while it makes the response a little less accurate overall, it is still somewhat helpful to the user. So this is a minor unfounded claim. The way you would verify this is to Google it: say you Google something like "job projections from renewable energy in 2030." I'll give you a hint: it is not this incredibly specific number. While we are talking about truthfulness, this claim is also not grounded; I would point out that we did look at this response for groundedness, but we are now in the category of truthfulness. So it is actually important in this section to check every single claim with the copilot and with Google, and I'll get into the copilot momentarily.

Imagine here was response C: imagine the response said that high-efficiency photovoltaic cells are built primarily from copper. Let's Google that: "what are high-efficiency photovoltaic cells made from?" Silicon, by far. Or: "are high-efficiency photovoltaic cells made from copper?" Let's see. And here is actually a very interesting case, because the AI Overview here is incorrect: we can go down to the Department of Energy result, for instance, and see that they are made out of silicon.

In response D we have the statement that the growth of the renewable energy sector has caused massive failures in the financial sectors in recent years. This is categorically false. A major claim, one of three major claims in this response, is false, so this would score a 1 on truthfulness.
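The truthfulness rubric and the responses B through D above can be sketched the same way. Again this is a hypothetical illustration: the rater labels each claim by verifying it against Google or reputable web evidence, the split into major and minor claims is the rater's judgment, and only the label-to-score mapping is code.

```python
def truthfulness_score(major_claims, minor_claims=()):
    """Map hand-verified claim labels to the 1-3 truthfulness score.

    Labels: "accurate", "unfounded" (unverifiable), or "false".
    """
    if any(label in ("false", "unfounded") for label in major_claims):
        return 1  # major issues: a major claim is inaccurate or unfounded
    if any(label in ("false", "unfounded") for label in minor_claims):
        return 2  # minor issues: only minor claims are inaccurate or unfounded
    return 3      # no issues: all claims accurate per reputable evidence

# The worked examples above:
assert truthfulness_score(["accurate"] * 3) == 3                   # response A
assert truthfulness_score(["accurate"], ["unfounded"]) == 2        # B's job figure
assert truthfulness_score(["accurate", "accurate", "false"]) == 1  # response D
```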

The next criterion is helpfulness, which is: how well does the model answer the user's request? If it leaves out relevant content, or there is an excessive amount of irrelevant content, that would be a major issue. If it is missing a little bit of relevant content, or has a little bit of irrelevant content, that would be a minor issue. And if it is fully satisfying, then there are no issues.

This is also related to instruction following, which is: how well does the model understand the requirements of the user's prompt? A major issue here would be the response ignoring or violating key parts of the prompt, like the constraints, making the response useless. A minor issue would be that the response follows most of the instructions but misses certain elements. And finally, no issues: all instructions are followed.

Now we'll look at an example. Say the prompt is to give me a five-step recipe to bake a chocolate cake. If the response is an eight-step recipe to bake a chocolate cake, that is still very helpful: there is no relevant information missing, and there is no excessive amount of irrelevant information in here; this is a great eight-step recipe to bake a cake. However, on instruction following it would be scored a 2, for minor issues, because it is eight steps instead of five. In response B we have a five-step recipe to bake croissants. This is incredibly unhelpful, so it is a 1 on helpfulness, and it does not follow the instructions at all, so it is a 1 there as well. Our final category is writing style and tone, which is just how well the response is worded and formatted.
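Putting the five categories together, each response gets one 1-3 score per criterion. A hypothetical sketch of that per-response record (field names are illustrative, not from the project tooling):

```python
# The shared 1-3 scale used by every criterion described above.
SCALE = {1: "major issues", 2: "minor issues", 3: "no issues"}

def rate_response(grounding, truthfulness, helpfulness,
                  instruction_following, writing_style):
    """Bundle the five criterion scores for one response."""
    ratings = {
        "ref_text_grounding": grounding,
        "truthfulness": truthfulness,
        "helpfulness": helpfulness,
        "instruction_following": instruction_following,
        "writing_style": writing_style,
    }
    assert all(score in SCALE for score in ratings.values())
    return ratings

# The eight-step cake recipe above: fully helpful, but minor
# instruction-following issues because it is not five steps.
cake = rate_response(3, 3, 3, 2, 3)
assert SCALE[cake["instruction_following"]] == "minor issues"
```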

So, to reiterate: it is extremely, extremely important to verify and double-check every claim made in the model response; that is what will be used to measure ref-text grounding and truthfulness. You go claim by claim: whether the claim is in the reference text determines how we rate it on ref-text grounding, and we also use Google, or the factuality copilot which you can see here, to verify it is accurate.

So once we've rated the five criteria for both responses, we will be selecting which one we prefer, and we will do that using a one-to-five Likert scale. You can see that on the ends of the spectrum we have the responses being much better, then slightly better, and then neutral in the middle. You can think about it like a scale: the further away from the middle we go, the bigger the difference is. A 1 would signify that response one is much better; a 2 would signify that response one is slightly better; a 3 would signify a neutral preference, or that they are the same; a 4 would signify that response two is slightly better; and a 5 would signify that response two is much better.

It is extremely important to note the following: you should only select a 3 if the ratings are identical for both responses; otherwise, there is a preference. When evaluating your preference, please keep in mind that the most important criteria, by far, are ref-text grounding and truthfulness, given that this is a RAG project, followed by helpfulness, then instruction following; writing style and tone is the least important.
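The Likert scale and the only-3-on-identical-ratings rule above can be sketched as follows. This is a hypothetical illustration; the criterion names and ordering mirror the priority stated above, but the code is not part of the project tooling.

```python
LIKERT_LABELS = {
    1: "response 1 is much better",
    2: "response 1 is slightly better",
    3: "neutral (identical ratings only)",
    4: "response 2 is slightly better",
    5: "response 2 is much better",
}

CRITERIA = [  # most to least important, per the guidance above
    "ref_text_grounding",
    "truthfulness",
    "helpfulness",
    "instruction_following",
    "writing_style",
]

def neutral_allowed(ratings_1, ratings_2):
    """A Likert 3 is only valid when all five ratings are identical."""
    return all(ratings_1[c] == ratings_2[c] for c in CRITERIA)

# Identical ratings permit a 3; any difference means there is a preference.
same = {c: 3 for c in CRITERIA}
worse = dict(same, truthfulness=2)
assert neutral_allowed(same, same)
assert not neutral_allowed(same, worse)
```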

After we give our Likert score, we will give a justification. The justification is incredibly important for reviewers, the project team, and the customer to better understand the preference, and to understand your thinking, your logic, and why you chose the score you did. It should use examples and details to highlight key differences between the responses, focusing on the most major issues. It must, must, must align with your Likert score. A good practice: if your Likert score is 1, you should start your justification with "Response one is much better because..."; if your Likert score is 4, you should start it with "Response two is slightly better because...". The key here is to give specific details, examples, even quotes, to explain your thought process.
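The good practice above, matching the justification's opening phrase to the Likert score, can be sketched as a simple alignment check. This is a hypothetical illustration of the convention, not a required validator.

```python
# Opening phrases that match each Likert score, per the good practice above.
# A 3 (neutral) has no fixed opener.
OPENERS = {
    1: "Response one is much better because",
    2: "Response one is slightly better because",
    4: "Response two is slightly better because",
    5: "Response two is much better because",
}

def justification_aligned(likert, justification):
    """True if the justification opens with the phrase matching the score."""
    opener = OPENERS.get(likert)
    return opener is None or justification.startswith(opener)

assert justification_aligned(1, "Response one is much better because every claim is grounded.")
assert not justification_aligned(4, "Response one is much better because of its style.")
```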

Finally, for multi-turn tasks, we will use the preferred response: for instance, if the Likert score was a 4, response two is preferred, so we will use that response as the context for the next turn. We will continue the conversation in a way that flows naturally, as if you were speaking to another human, or speaking to a model in real life, making conversation. It is incredibly important to reiterate again: all prompts must use the reference text.
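The multi-turn flow just described, carrying the preferred response forward as context, can be sketched as follows. A hypothetical illustration; the tie-handling choice for a Likert 3 is an assumption, since with identical ratings either response could reasonably be carried forward.

```python
def preferred_response(likert, response_1, response_2):
    """Select the response to carry into the next turn.

    Likert 1-2 prefer response 1; 4-5 prefer response 2. A 3 means the
    ratings were identical, so we arbitrarily keep response 1 (assumption).
    """
    return response_2 if likert >= 4 else response_1

# Example: a Likert of 4 means response 2 is slightly better,
# so it becomes the context for the next turn.
history = []
r1 = "answer about food festivals"
r2 = "better answer citing the reference text"
history.append(preferred_response(4, r1, r2))
assert history == ["better answer citing the reference text"]
```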

If the next turn asks about how the previous response relates to topics outside of the reference text, that is bad. If instead it asks about different parts of the reference text, or for more information, that would be good. A bad example here is something that is contrived and not conversational, goes off track, and is not based on the reference text.

A good example, assuming a reference-text set about festivals in Italy, would be a first prompt like "I'm planning a trip to Italy and I'd love to know about some popular food festivals there." The turn-two prompt could be "Give me some highlights about the Sagra del Tartufo in Alba," and the third prompt could be "I'm debating between the Chianti Classico wine festival and the sagra, but I can't do both; help me decide." Assuming that all of these festivals are in the reference text, all of these prompts would utilize those reference texts in a very specific way and be exceptional. You repeat these steps until the desired number of turns is reached.

So please, please, please read this document; it is full of good information, including the examples of justifications you can see here, and we also have a cheat sheet at the top. But I will broadly say again that we are teaching this model how to do RAG: to use our reference texts to answer our prompts. That means the prompts need to be good; they need to utilize the reference text, or else they are useless. And it is incredibly, incredibly important to go slowly and be diligent with your ratings. We can only train a model on how to do RAG if we can correctly evaluate whether it is grounded in the reference text and whether it is truthful. So go claim by claim, double-checking and verifying: on one hand, is it in the reference text, for the grounding category; on the other hand, is it true, for the truthfulness category.

So, we look forward to working with you on this project. Thank you very much for taking the time to watch this video, and please reach out via Discourse if there are any questions.
