Stanford CS25: V3 I Retrieval Augmented Language Models

Stanford Online
25 Jan 2024 · 79:26

Summary

TLDR This lecture takes a deep dive into the state of the art and the open challenges of retrieval-augmented language models (Retrieval-Augmented Generation, RAG). The speaker first reviews the history and development of language models and explains why retrieval augmentation matters for language understanding and generation. He then walks through how RAG works, its advantages, and its remaining problems, such as hallucination, attribution, and staleness. He also discusses how to optimize a RAG system by making the retriever and the generator work together, and closes with directions for future research, including multimodality and end-to-end optimization of the whole system.

Takeaways

  • 🎓 The guest lecturer is the CEO of Contextual AI and an adjunct professor in Symbolic Systems at Stanford, with a deep background in machine learning and natural language processing (NLP).
  • 📈 The age of language models has arrived, but enterprise adoption still faces challenges around accuracy, hallucination, attribution, and keeping data up to date.
  • 🔍 Retrieval augmentation is an active research direction that strengthens language models by coupling them to an external memory (such as a document database).
  • 🌐 With external retrieval, a language model can operate in an "open-book exam" fashion: it does not have to memorize everything in its parameters, which improves efficiency and flexibility.
  • 🔄 RAG (Retrieval-Augmented Generation) is an architecture that combines a retriever and a generator and can retrieve relevant information dynamically while generating text.
  • 📚 The lecture covers concrete techniques such as TF-IDF, BM25, and DPR, which are widely used in document and information retrieval.
  • 🔧 Building a retrieval-augmented language model means thinking about how to optimize the whole system, including the retriever, the generator, and the query encoder.
  • 🔍 The lecture stresses that evaluating a retrieval-augmented system requires considering training data, task type, and update strategy, among other factors.
  • 🌟 Future directions include multimodal retrieval, system-level optimization, and better ways to control and measure "hallucination" in language models.
  • 🚀 The lecture offers some forward-looking ideas about current language models, including using retrieval augmentation to improve efficiency, generating better training data, and exploring new hardware and software architectures.

Q & A

  • What is a language model?

    -A language model computes the probability of a sequence of words and can be used to predict the next word or to generate text. Language models are used throughout natural language processing, for example in machine translation and speech recognition. (The chain-rule factorization behind this is written out after this Q&A list.)

  • How long have language models been around?

    -The concept of a language model is not a recent invention; it has existed for decades. The earliest neural network language model dates back to 1991.

  • What is RAG (Retrieval-Augmented Generation)?

    -RAG combines retrieval with generation: it retrieves relevant information to support a generation task such as text generation. Such a model typically has a retriever and a generator; the retriever finds relevant information in a large collection, and the generator produces a response or text based on that information.

  • How does RAG address the static nature of language models?

    -RAG introduces an external memory (such as a document database), so the model can access current information instead of being limited to what it saw at training time. This keeps the information fresh and accurate.

  • How does RAG reduce hallucination during generation?

    -RAG retrieves real-world data and uses it as context, so the generated content is grounded in reliable sources, which reduces the chance that the model makes things up.

  • In a RAG model, what is the distinction between parametric and non-parametric?

    -The parametric approach stores and processes all knowledge in the model's parameters (for example, the weights of a neural network). The non-parametric approach instead uses external data (such as retrieved documents) to support decisions and generation, rather than relying only on the model's internal parameters.

  • What is TF-IDF?

    -TF-IDF (Term Frequency-Inverse Document Frequency) is a common weighting scheme in information retrieval and text mining. It scores how important a word is to a document by combining the word's frequency within the document (TF) with the inverse of how often the word appears across all documents (IDF).

  • What is BM25?

    -BM25 is a ranking function based on the probabilistic retrieval framework, used to estimate how relevant a document is to a user query. It is one of the most widely used ranking algorithms in information retrieval. (A minimal scoring sketch appears after this Q&A list.)

  • In a RAG model, what is the closed-book vs. open-book exam analogy?

    -A closed-book exam corresponds to a traditional language model, which has to memorize all knowledge in its parameters. An open-book exam corresponds to RAG: the model can consult external information while generating an answer, just as a student can consult books and notes during an open-book exam.

  • How do the retriever and the generator work together in a RAG model?

    -The retriever first retrieves relevant information or documents based on the input question or context, and then passes them to the generator as context. The generator produces the answer or text conditioned on that context. Working together, the two components let the model combine its internal parameters with external information. (A minimal pipeline sketch appears after this Q&A list.)

  • What are the advantages of RAG in practice?

    -RAG combines the strengths of retrieval and generation: it can provide more accurate and up-to-date information and reduce hallucination. In addition, the external memory can be updated and revised, so the system can adapt to changing data and knowledge.
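
As a short addendum to the Q&A above: "computing the probability of a sequence" concretely means the standard chain-rule factorization that the lecture alludes to (generic notation, not taken from the slides):

```latex
P(w_1, \dots, w_n) = \prod_{t=1}^{n} P\big(w_t \mid w_1, \dots, w_{t-1}\big)
```

A language model parameterizes each conditional term; generation simply samples the next token from it repeatedly.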
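
The TF-IDF and BM25 answers can be made concrete with a small sketch. Below is a minimal, self-contained Python version of the standard BM25 scoring formula (the k1 and b defaults are the conventional ones; none of this code comes from the lecture itself):

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
    """Score one document against a query with BM25 (higher = more relevant)."""
    N = len(corpus)                                   # number of documents
    avgdl = sum(len(d) for d in corpus) / N           # average document length
    tf = Counter(doc_terms)                           # term frequencies in this document
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)      # how many documents contain the term
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)   # rare terms count more
        f = tf[term]
        score += idf * (f * (k1 + 1)) / (f + k1 * (1 - b + b * len(doc_terms) / avgdl))
    return score

corpus = [["retrieval", "augmented", "generation"],
          ["language", "models", "predict", "tokens"],
          ["dense", "retrieval", "uses", "vectors"]]
print(bm25_score(["retrieval", "vectors"], corpus[2], corpus))
```

Tuning k1 and b is essentially what distinguishes BM25 from plain TF-IDF weighting.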
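
Finally, a minimal sketch of the retriever-generator flow described above. The word-overlap retriever stands in for BM25 or DPR, and generate() is a placeholder for whatever language model is used; every name here is hypothetical:

```python
def retrieve(query, index, k=2):
    """Toy retriever: rank documents by word overlap with the query."""
    q = set(query.lower().split())
    ranked = sorted(index, key=lambda doc: len(q & set(doc.lower().split())), reverse=True)
    return ranked[:k]

def generate(prompt):
    """Placeholder for a language model call (e.g. an API request)."""
    return f"<answer conditioned on: {prompt[:60]}...>"

def rag_answer(question, index):
    docs = retrieve(question, index)
    context = "\n".join(docs)                         # retrieved documents become the context
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return generate(prompt)

index = ["Warsaw is the capital of Poland.",
         "Retrieval augmentation couples a language model to an external memory."]
print(rag_answer("What is the capital of Poland?", index))
```

In the "frozen RAG" setting discussed in the lecture, neither component is trained; the retrieved documents are simply pasted into the prompt and everything relies on in-context learning.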

Outlines

00:00

🎓 Lecture introduction and background

This is the last lecture of the quarter. The guest is the CEO of Contextual AI and an adjunct professor in Symbolic Systems at Stanford; he was previously head of research at Hugging Face and, before that, a research scientist at Facebook AI Research. He holds a PhD and a master's degree from the University of Cambridge and a master's in logic from the University of Amsterdam, and studied philosophy and cognitive AI as an undergraduate. His work focuses on machine learning and natural language processing (NLP), in particular on developing better models for language understanding and generation and better tools for evaluation.

05:00

🤖 The development and current state of language models

The lecture starts with the history and development of language models, pointing out that this is a very old concept that was not invented by OpenAI. Trained on large amounts of data, modern language models show strong capabilities, but they also have problems, such as the factuality and freshness of what they generate. To address these problems, researchers are exploring how to improve language models with external memory and retrieval augmentation (RAG).

10:03

🔍 Basic concepts and architecture of retrieval augmentation

The lecture discusses the basic concepts and architecture of retrieval augmentation in detail, including the roles of the generator, the retriever, and the query encoder. Passing retrieved external information to the language model as context improves accuracy and freshness. The lecture also looks at how these components are treated differently at training and test time, and how optimizing the whole system improves performance.

15:04

📚 From sparse to dense: the evolution of retrieval methods

The lecture reviews the evolution from sparse retrieval (such as TF-IDF and BM25) to dense retrieval (vector-based methods such as DPR and ORQA). Dense retrieval captures semantic similarity through vector representations, while sparse retrieval relies on word-frequency counts. Modern retrieval systems often take a hybrid approach that combines the strengths of sparse and dense retrieval to improve accuracy and efficiency, as the sketch below illustrates.
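
One common recipe for the hybrid search mentioned above is to normalize the sparse and dense scores and fuse them with a weight. The sketch below uses made-up scores and a hand-picked alpha, so it only shows the shape of the idea (reciprocal rank fusion is another popular choice):

```python
def hybrid_rank(sparse_scores, dense_scores, alpha=0.5):
    """Fuse min-max-normalized sparse (e.g. BM25) and dense (e.g. DPR) scores per document."""
    def normalize(scores):
        lo, hi = min(scores.values()), max(scores.values())
        return {doc: (s - lo) / (hi - lo + 1e-9) for doc, s in scores.items()}
    s, d = normalize(sparse_scores), normalize(dense_scores)
    fused = {doc: alpha * s.get(doc, 0.0) + (1 - alpha) * d.get(doc, 0.0)
             for doc in set(s) | set(d)}
    return sorted(fused, key=fused.get, reverse=True)

sparse = {"doc1": 12.3, "doc2": 7.1, "doc3": 0.4}    # e.g. BM25 scores
dense  = {"doc1": 0.62, "doc2": 0.81, "doc3": 0.77}  # e.g. dot-product scores
print(hybrid_rank(sparse, dense))
```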

20:06

🌟 Optimizing RAG and future directions

The lecture explores how to optimize a RAG system, including how to train the retriever better, how to update the document encoder, and how end-to-end training improves overall performance. It also raises open questions: how to train large-scale RAG systems more efficiently, how to generate better training data, how to measure how well a RAG system works, and how to bring multimodality into RAG. (A sketch of one retriever-training loss follows below.)
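
One concrete example of "training the retriever for the generator" is the RePlug-style loss covered in the lecture: treat the retriever's scores over the top-k documents as one distribution, treat the language model's likelihood of the correct answer given each document as another, and minimize the KL divergence between them. The PyTorch snippet below uses placeholder scores and only illustrates the loss term, not the full training loop:

```python
import torch
import torch.nn.functional as F

# Retriever similarity scores for the top-k retrieved documents (trainable).
retriever_scores = torch.tensor([2.1, 1.3, 0.2], requires_grad=True)

# Log-likelihood of the correct answer when each document is given as context,
# obtained from the (frozen) language model; no gradient flows through it.
lm_log_likelihoods = torch.tensor([-1.2, -0.4, -3.0])

log_p_retriever = F.log_softmax(retriever_scores, dim=0)   # retrieval distribution
q_lm = F.softmax(lm_log_likelihoods, dim=0)                # LM preference over documents

# Push the retriever toward the documents the language model found most useful.
loss = F.kl_div(log_p_retriever, q_lm, reduction="sum")
loss.backward()
print(loss.item(), retriever_scores.grad)
```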

25:06

💡 Q&A: questions and discussion

In the Q&A, the speaker answers questions about RAG systems, model architecture, training strategies, and hardware optimization. The discussion covers how different training methods and hardware support can improve RAG performance, and how to handle the factuality and creativity of model outputs.

Keywords

💡Language model

A language model is an algorithm in artificial intelligence for processing and understanding natural language. The video notes that we are in the age of language models; these models predict the next word or phrase and are the foundation of natural language processing. For example, the video points out that OpenAI did not invent language models; the concept has existed for decades.

💡NLP

Natural language processing (NLP) is a branch of computer science and artificial intelligence focused on enabling computers to understand and interpret human language. The video mentions that the speaker's work centers on machine learning and NLP, in particular on building better models for language understanding and generation.

💡Multimodality

Multimodality means combining different types of information, such as text, images, and audio, to improve a model's understanding and generation. The speaker mentions multimodality as a discussion topic, hinting that it is one of the current research hot spots.

💡Retrieval augmentation

Retrieval augmentation couples a language model to a retrieval system, strengthening the model's knowledge and answering ability by retrieving relevant information. The speaker emphasizes that retrieval augmentation is one of the hot topics in the field and discusses how it can address several common problems of language models.

💡User interface

The user interface (UI) is the medium through which people interact with a computer system, including how a language model takes input and returns output. The video notes that early language models required convoluted prompts to work, and that improving the user interface was key to making language models easy to use.

💡Generative adversarial network (GAN)

A generative adversarial network (GAN) is a deep learning model made up of a generator and a discriminator, used to generate new data that resembles real data. The video does not mention GANs directly, but related ideas come up, such as generating text and improving a model's generation ability.

💡Open-domain question answering

Open-domain question answering is the ability to answer arbitrary questions without restricting to a particular domain. The video mentions the open-domain QA paradigm and how retrieval augmentation improves language model performance in this area.

💡Sparse representation

A sparse representation is one in which most elements are zero and only a few are non-zero. The video mentions sparse retrieval methods such as TF-IDF and BM25, which retrieve information by measuring the overlap between a document and a query.

💡Dense representation

A dense representation is one in which every element is non-zero and carries meaning. The video mentions dense retrieval, such as the Dense Passage Retriever (DPR), which uses vector representations and retrieves by computing the similarity between dense vectors.
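
To make the dense retrieval described here concrete, the NumPy sketch below scores documents by a dot product with the query vector, i.e. maximum inner product search. In practice the embeddings come from a trained encoder such as DPR and the search is done approximately with an ANN library such as FAISS; both are mocked here with random vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
doc_embeddings = rng.normal(size=(1000, 768))      # stand-in for document encoder outputs
doc_embeddings /= np.linalg.norm(doc_embeddings, axis=1, keepdims=True)

def dense_search(query_embedding, k=5):
    """Brute-force maximum inner product search (an ANN index would approximate this)."""
    scores = doc_embeddings @ query_embedding       # one dot product per document
    top = np.argsort(-scores)[:k]
    return top, scores[top]

query = rng.normal(size=768)
query /= np.linalg.norm(query)
top_ids, top_scores = dense_search(query)
print(top_ids, top_scores)
```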

💡Context understanding

Context understanding is a model's ability to grasp the background, setting, and relevance of a given text or piece of information. The video stresses its importance for improving language model performance, especially in retrieval augmentation and multimodal applications.

💡Model evaluation

Model evaluation is the measurement and analysis of a machine learning model's performance. The video highlights the importance of evaluating language models, especially their accuracy and reliability in real applications.

Highlights

We are in the age of language models; the concept was not invented by OpenAI but has existed for decades.

The basic idea of a language model is intuitive: factorize the probability of the input sequence and predict the next token.

ChatGPT improved the user interface of language models, letting people interact with them much more naturally.

Current challenges for language models include hallucination, attribution, and stale information.

Retrieval augmentation is an emerging language model architecture that strengthens the model by coupling it to an external memory.

RAG (Retrieval-Augmented Generation) is an architecture that combines a retriever and a generator and lets the whole system be optimized end to end.

Optimizing a RAG system can yield more efficient language models that consume fewer compute resources.

When training a RAG system, different strategies can be used to update the retriever and the generator depending on the application.

A RAG system can use active retrieval to decide when to retrieve, allocating compute more intelligently.

By using different indices and datasets, a RAG system can be adapted to different domains and needs.

The development of RAG highlights the importance of optimizing at the system level rather than at the level of a single model.

Future RAG systems may include multimodal capabilities, for example combining visual information to strengthen the language model.

A key research direction for RAG is how to generate better training data to improve model performance.

Measuring a RAG system is challenging because one has to evaluate both retrieval accuracy and the language model's output.

The growth of RAG may drive the emergence of dedicated retrieval hardware to make retrieval more efficient.

The trend in RAG is a shift from single models to system-level optimization, striking the best balance between cost and quality.

Transcripts

play00:05

hey guys welcome to the our last lecture

play00:08

um of this quarter and we're very happy

play00:12

to have Douwe here he's the CEO of

play00:16

contextual AI um the Enterprise llm

play00:19

company as well as an Adjunct professor

play00:22

in symbolic systems here at Stanford and

play00:25

previously he was the head of research

play00:26

at Hugging Face and before that a

play00:28

research scientist Facebook AI research

play00:32

uh he received his PhD and masters from

play00:34

the University of Cambridge um as well

play00:36

as a master in logic from the University

play00:38

of Amsterdam and studied philosophy and

play00:40

cognitive AI in undergrad um and his

play00:43

work focuses on machine learning as well

play00:45

as NLP specifically on developing better

play00:48

models for language understanding and

play00:50

generation and better tools for

play00:52

evaluation and Ben yeah give it up for

play00:56

Douwe

play00:58

right thank you so I guess I have to

play01:00

sort of stand here in the corner so

play01:02

people can see me on the zoom as well um

play01:06

yeah thanks so much for having me here

play01:09

um so I asked Steph what I should talk

play01:11

about there were a couple of things I

play01:13

could talk about multimodality or

play01:15

evaluation uh and this was the preferred

play01:18

topic I guess uh because the others were

play01:20

already covered um so yeah I'm I'm very

play01:24

happy to talk to you about everything

play01:25

retrieval augmentation um I think this

play01:28

is really one of the topics right now in

play01:31

our field um so I I'll just give you an

play01:34

overview of what's been happening and

play01:36

what I think are the interesting

play01:37

questions to think about um so first of

play01:41

all obviously in case you've missed it

play01:43

we are in the age of language models um

play01:46

and I just wanted to do a quick poll

play01:48

here in this not not super big audience

play01:51

I guess there's more people on the zoom

play01:52

but uh who invented language

play01:55

models if if you thought open AI then

play01:58

I'm angry with you right so so actually

play02:01

uh this is a very very old idea so the

play02:04

idea is just you you take a sequence and

play02:06

you factorize out the token

play02:07

probabilities right and um so it wasn't

play02:11

invented by open AI it's not like a few

play02:14

years old it's actually several decades

play02:16

old uh so I'm bringing this up because I

play02:19

was talking to someone and they were

play02:20

like open AI invented language models

play02:22

and I was like you're kidding me right

play02:24

um so um I I I went back to the

play02:27

literature and this is the oldest one I

play02:29

could find actually 1991 first neural

play02:31

language model um there's a very nice

play02:34

paper from 2003 from

play02:36

Bengio where they actually have like

play02:38

word embeddings and everything already

play02:40

in there uh so obviously these are LMS

play02:43

not llms um and as it turns out if you

play02:46

make them really big and you

play02:48

parameterize them with these massive uh

play02:50

neural Nets then you get something

play02:52

really powerful that really shows

play02:53

emergent uh properties right and that's

play02:55

why we're all excited in this stuff um

play02:59

so if we think about this from like a

play03:01

classic CS perspective there's input

play03:03

output right there's this kind of thing

play03:05

in the middle it's the generator so uh

play03:07

we take a sequence the input sequence

play03:10

and then the the task of the model is to

play03:12

predict the next token very very simple

play03:15

model um and and so you know that's why

play03:18

it was so easy to come up with this in

play03:20

1991 already because it's like the idea

play03:22

is very intuitive but for a long time

play03:25

what was really broken with this was the

play03:27

user interface um and and this I think a

play03:31

lot of people kind of misunderstand what

play03:33

ChatGPT was about that's really what

play03:35

ChatGPT fixed right so that in

play03:38

initially you had to come up with these

play03:39

very weird prompts in order to get your

play03:41

language model to do what you wanted it

play03:43

to do uh and humans are terrible at this

play03:46

right so so we're much better at sort of

play03:48

telling people or things around us what

play03:50

we want right so if we have a dog we say

play03:52

sit we don't prompt it in a very weird

play03:54

way so that it sits right and it's the

play03:57

same with the language model if you

play03:58

wanted to generate some rap lyrics in the

play04:01

style of a pirate or Shakespeare or

play04:03

something then you tell it generate some

play04:05

rap lyrics in the style of a pirate right

play04:07

so that kind of instruction data

play04:10

actually turns out to be super super

play04:12

rare in just web data so what you need

play04:14

to do is you need to fix the user

play04:16

interface to the language model and the

play04:18

the classic recipe for doing that is the

play04:21

the sequence basically that ChatGPT

play04:22

used right so you prompt the model in a

play04:24

specific way you instruction finetune the

play04:26

model and you do some alignment RLHF

play04:29

uh whatever you do on top of that so

play04:31

that's the first thing so now you have a

play04:33

working language model with a working

play04:36

user interface so are we done then um

play04:40

obviously we're not right so so right

play04:42

now language models are are kind of

play04:43

taking the World by storm but if you

play04:45

talk to anyone especially in an

play04:47

Enterprise for example where they have

play04:48

very strict uh accuracy requirements

play04:51

they will tell you that they can't

play04:53

really productionize this yet um and the

play04:55

reason is because there are all these

play04:57

familiar problems probably a bunch of

play04:58

you are working on these problems right

play05:00

now uh around

play05:02

hallucination um so these models they

play05:04

kind of make up stuff very often with

play05:05

very high confidence which is uh even

play05:08

more scary in a way attribution so we

play05:11

don't really know why these models are

play05:12

saying what they're saying staleness

play05:15

they go out of date and so this was a

play05:17

big problem with sort of chat GPT not

play05:19

knowing anything that happened after a

play05:20

certain cut off date and they keep

play05:22

updating it every once in a while but

play05:24

you want to have a system that's always

play05:25

completely up to date that never goes

play05:27

stale um you want to be able to to

play05:29

revise the information in the system so

play05:32

uh if you're uh a European organization

play05:34

you have to worry about GDPR uh which

play05:36

means that you need to be able to remove

play05:38

information from the language model or

play05:40

maybe revise facts uh which we don't

play05:42

really know how to do right so again

play05:44

this is a very interesting uh area of

play05:46

study for a lot of folks model editing

play05:49

um but so this is something that we

play05:51

really want to be able to fix and then

play05:53

there's this big question of how do you

play05:55

customize these models uh so different

play05:58

people have different use cases you have

play06:00

different data if you're a company or if

play06:01

you want to have a language model on

play06:03

your own data how do you make it work on

play06:05

your own data so one of the solutions uh

play06:08

that everybody has started using right

play06:11

now is to couple it to an external

play06:12

memory so that's really just rag right

play06:15

the uh we we can this whole lecture is

play06:17

basically about rag uh but the way to

play06:20

understand uh what is going on here is

play06:23

uh we have this generator just like

play06:25

before we have the input and a prompt just

play06:27

like before but now uh instead of just

play06:29

those two things we give this additional

play06:32

context so we contextualize the language

play06:34

model using things we retrieve and and

play06:37

the retriever uh is is very often pretty

play06:40

simple it's just a query and a document

play06:42

encoder um and then you get a bunch of

play06:44

documents you give them as context

play06:46

through the model so super simple

play06:49

architecture um and I think it's useful

play06:53

to think about it from the perspective

play06:54

of of these two separate paradigms uh so

play06:57

if you've ever taken an exam I'm sure

play06:59

you have right uh you can have a closed

play07:01

book exam where you have to memorize all

play07:03

of this so you have to cram all the

play07:04

knowledge into your parameters your

play07:07

neurons uh or you have an open book exam

play07:09

where you have all of this information

play07:11

in the book that you can access when you

play07:12

do the exam uh so it's a very similar

play07:15

thing with rag right you can just make

play07:17

it an open book setting where you give

play07:18

it access to this external information

play07:21

Wikipedia or something else or basically

play07:23

the entire internet uh and then have the

play07:26

language model do its job without having

play07:27

to memorize all of it in its

play07:30

parameters um so the other I think

play07:32

useful distinction here is that uh

play07:35

cramming everything into your parameters

play07:36

that's the parametric approach right so

play07:39

U what we're doing with rag is we're

play07:40

adding this non-parametric retrieval

play07:43

component um so uh you might call this

play07:45

semi- parametric um if you want to give

play07:48

this a

play07:49

name all right so why why does that

play07:52

actually solve these issues and so the

play07:55

answer is basically that if you have

play07:57

this separate Index right this separate

play07:59

retriever you can swap it in you can

play08:01

swap it out you can replace it with a

play08:03

new index so you can really customize it

play08:06

and so you can customize your language

play08:07

model system for what the user really

play08:10

wants to see um and then obviously you

play08:13

can update this index so um it doesn't

play08:15

really go still and you can revise it if

play08:17

everything goes wrong if anything goes

play08:20

wrong uh the other thing you get is

play08:22

grounding right so that that's initially

play08:24

why I became interested in this kind of

play08:26

architecture because I was thinking a

play08:27

lot about grounding and multimodal and

play08:29

things like that and actually one really

play08:31

nice way to ground things is to find

play08:33

some other information that you can

play08:35

ground your generation in so you really

play08:37

want the language model to only say

play08:39

things that it has evidence for in this

play08:41

outer piece of text or even multimodal

play08:44

data that it retriev separately so if

play08:46

you do that then you get less

play08:47

hallucination because you can always

play08:49

point back to your Source it's always

play08:50

grounded in your Source um and you get

play08:53

attribution because you know why the

play08:54

model is saying what it's saying it's

play08:56

because it found this thing here is

play08:59

that

play09:02

all right so um for the rest of this

play09:05

lecture we're going to talk about this

play09:06

this basic architecture um and so it

play09:10

kind of looks like a pretty simple thing

play09:12

right uh but there are actually lots and

play09:13

lots of questions you can ask about what

play09:16

what this system should really look like

play09:18

um and like this this doesn't even cover

play09:20

like half the questions you can ask so

play09:23

it really is about how how do we

play09:25

optimize this entire system right so we

play09:28

have the separate components the

play09:29

retriever the generator and then um

play09:32

there are things like this query encoder

play09:34

how do we encode queries how do we uh do

play09:37

the retrieval do we update the document

play09:39

encoder how do we actually uh Define a

play09:42

document right is it like a full

play09:44

document or is it a paragraph or a chunk

play09:46

or a sentence or a couple of words um so

play09:48

there are lots of questions to ask and

play09:51

and uh as you'll see there are lots of

play09:53

possible answers to these questions as

play09:55

well um so this is what we'll we'll

play09:58

cover

play10:00

um so there are lots of

play10:03

architectures um going into these

play10:05

questions and I think as we go through

play10:07

them it's useful for you to think about

play10:10

what happens during training time and

play10:12

what happens during test time right so

play10:14

during training time is really uh okay

play10:16

we have this language model we have this

play10:18

retriever um which one do we update how

play10:21

do we update them how do we train this

play10:23

entire system do we maybe not train it

play10:25

at all uh do we pre-train it from

play10:28

scratch do we initially I it with uh

play10:30

components that were already separately

play10:32

trained these are the kinds of questions

play10:34

that that you have to answer if you want

play10:35

to design a system like this and then

play10:38

during test time uh you have this entire

play10:41

system right so actually multiple models

play10:43

in a way uh that are working together um

play10:47

so so there's also different things you

play10:48

can do there right so give it different

play10:50

indices during test time or uh

play10:52

manipulate kind of how you're sampling

play10:54

things like

play10:55

that so um the starting point for all of

play10:59

this stuff I think if you ask someone

play11:00

now like what is rag they will think of

play11:02

this thing um so this is frozen rag

play11:06

basically uh there's no training here at

play11:09

all so going back to this question of

play11:10

train time test time there's only test

play11:12

time here train time happen separately

play11:14

with these kind of blackbox models that

play11:16

we don't necessarily have control over

play11:18

right so there's this document embedding

play11:20

model uh whatever is currently at the

play11:23

top of some open source uh leaderboard

play11:26

uh you use that to oop sorry uh to get

play11:29

some vectors that you then use to create

play11:32

this Vector database and then the vector

play11:34

database just does search and it gives

play11:36

the information from the search to the

play11:38

language model and it just passes it as

play11:41

uh as the context right so this is this

play11:44

only works because of in context

play11:46

learning um and you know I think as a as

play11:50

a machine learner myself this feels very

play11:52

inelegant um so what what this lecture

play11:55

is about is can we do better than than

play11:57

this Frozen

play11:59

thing um so let's let's start from the

play12:03

the left side of this like okay if we

play12:05

want to outperform this Frozen thing

play12:07

itself with just the vector database

play12:09

like what would that look like from a

play12:11

retrieval

play12:12

perspective um and the starting point

play12:15

for everything retrieval is is tfidf

play12:17

does everybody know what tfidf is no

play12:22

okay so so tfidf is basically a sparse

play12:25

retrieval method where you have a score

play12:27

function uh that that looks at documents

play12:30

and queries so D and Q and then there

play12:33

are basically two terms that matter one

play12:35

is the TF the term frequency and the

play12:37

other is the IDF the inverse document

play12:39

frequency so this inverse document

play12:41

frequency is actually a really nice idea

play12:43

from Karen Spärck Jones really underrated

play12:45

researcher she's done some amazing work

play12:48

um but the basic idea is that you want

play12:51

to look at the words that are very

play12:53

special so that don't occur in lots of

play12:54

different documents and so the overlap

play12:56

between the word 'the' doesn't really

play12:58

matter right like 'the' occurs

play13:00

everywhere so you want to have sort of

play13:02

the special words so that's what what

play13:04

tfidf does in a nutshell it gives you a

play13:06

score for document query overlap and

play13:10

then you can do all kinds of things here

play13:12

with how how you weigh it so there's all

play13:14

these weird different parameters like

play13:15

this B and things like that that allow

play13:18

you to make it better than just having

play13:20

the the tfidf score so there's a couple

play13:22

of tweaks you can do there so bm25

play13:25

actually in case you're wondering stands

play13:27

for best match 25

play13:29

so I I try to discover like where does

play13:31

the 25 actually come from uh that's

play13:34

because the preceding 24

play13:37

experiments failed right so it's

play13:39

literally the 25th one that seemed to

play13:41

work and that's why it's called

play13:42

bm25 it's bizarre right but um um so so

play13:46

this is sparse retrieval it's just

play13:48

counting words right so you have this

play13:50

massive massive Vector of all these word

play13:53

occurrences it's sparse because most

play13:55

words never occur right so it's sort of

play13:57

like a vector of uh vocabulary size

play14:01

dimensions so most of that is obviously

play14:03

zero um but so that's actually kind of a

play14:06

nice property if you want to do fast

play14:08

search on a CPU right because on a CPU

play14:10

sparse uh dot product is very easy to

play14:13

compute so um this is used in in the

play14:16

system called uh DrQA which is really

play14:19

one of the first neural instances of

play14:22

this open domain sort of open book

play14:24

question answering Paradigm um so you

play14:27

have a question like how many of

play14:29

warsaw's inhabitants blah blah uh so you

play14:32

want to ask basically Wikipedia what the

play14:34

answer is for this so then you have this

play14:36

document retriever based on the sparse

play14:38

so bm25 I think in this case uh

play14:41

retrieval methods you pass that to um at

play14:44

this I think this was still a biLSTM at

play14:47

the time um a document reader model and

play14:50

then that model gives you the answer um

play14:54

so this I think is really the first

play14:56

instance of having sort of this

play14:57

separation between a retrieval and a

play14:59

generator system that you use for

play15:02

answering complicated questions based on

play15:03

sort of open domain

play15:05

knowledge um so after the sparse stuff um

play15:10

there was a bunch of work on dense

play15:11

retrieval and and so the advantage of

play15:14

dense retrieval so this is just like

play15:16

word embeddings basically vectors right

play15:18

they're they're dense now no longer

play15:19

sparse so they're much uh smaller in

play15:22

terms of dimensionality and the nice

play15:24

advantage of of dense retrieval is that

play15:27

it's not really about specific words

play15:28

right so uh if there're synonyms you can

play15:31

still um find the relevant document uh

play15:35

which you couldn't really do with a

play15:36

sparse representation right so that's

play15:38

really the advantage of dense is that you

play15:40

get like semantic

play15:41

similarity um so you can do this over

play15:45

word embeddings that doesn't really work

play15:46

all that well but uh at the time that

play15:49

people started thinking about this BERT

play15:50

was already out there and BERT is really

play15:52

great for giving you a vector

play15:53

representation for an entire sequence of

play15:55

words right so a sentence representation

play15:57

or a passage representation

play15:59

so there are all these cool systems like

play16:01

ORQA and uh DPR the dense passage

play16:04

retriever where um they essentially use

play16:08

the retrieval as a kind of latent

play16:09

variable in the system U and and the way

play16:12

to get the latent variable to to work to

play16:14

be good enough essentially to train the

play16:16

entire system is to pre-train the

play16:19

retriever on uh relevant information so

play16:21

for ORQA they do something called inverse

play16:24

cloze uh so they do kind of a cloze task

play16:27

where you want to find

play16:29

um passages that are sort of relevant to

play16:31

the preceding passage and in DPR they

play16:34

just train it on on a supervised thing

play16:36

but really the core idea here is that uh

play16:38

as you can see in this graph here you

play16:40

can do better than bm25 if you add lots

play16:43

of documents and the way you compute

play16:45

this score function is much simpler it's

play16:47

just a dot

play16:48

product right um so the nice thing about

play16:52

dot products is that you can do them very

play16:55

very efficiently on the GPU as well um

play16:58

if you uh know what you're doing so what

play17:01

you really want to get at is maximum in

play17:04

product search mips right this is one of

play17:05

the kind of core ideas of a lot of this

play17:07

stuff um and you can do MIPS with ANN

play17:12

approximate nearest neighbor search um and

play17:14

so there's this this really uh brilliant

play17:17

piece of work out of there for my

play17:19

colleagues at the time uh called FAISS

play17:22

which really underlies all of these uh

play17:24

modern Vector databases right so like

play17:27

all the popular ones they sort of

play17:28

re-implementations of this FAISS idea one

play17:30

is in like rust one is in go but it's

play17:32

all basically the same idea it's just

play17:34

FAISS um and so so FAISS really powers a

play17:37

lot of this stuff um and whenever

play17:40

somebody tells you something about a

play17:41

vector database just think about FAISS

play17:44

very fast dot

play17:46

product um so obviously you can go

play17:49

beyond dot product yes what is it what is

play17:53

FAISS um so so it's an open source

play17:56

library Facebook AI Similarity

play18:02

search no so it's just basic off the

play18:04

shelf Ann

play18:09

algorithms yeah so so there are all

play18:12

kinds of different I don't know if you

play18:13

do you know what like product

play18:14

quantization is and things like that so

play18:17

there they're basically so you have a

play18:18

bunch of vectors uh and you can just

play18:21

compute the full dot product which is

play18:23

sort of inefficient right so what you

play18:25

can do is try to compress uh subspaces

play18:28

of the vector and then just look at the

play18:30

kind of

play18:31

centroids um so this so you can quantize

play18:34

sub vectors of the full vector and then

play18:36

do much faster search over just the

play18:41

centroids it's good question any other

play18:46

questions um all right so so about this

play18:49

dot product idea right so so what we

play18:52

have here is some people call this a

play18:54

Siamese Network I guess it is right so

play18:56

you have two different BERT models uh or

play18:59

whatever your encoder is here and then

play19:00

at the end you get these two vectors and

play19:02

then you just do do product so you get

play19:04

one single score but you can do all

play19:06

kinds of much fancier things if you if

play19:08

you're willing to give up on this bi-

play19:10

encoder uh approach right um so really

play19:13

nice example from from one of your

play19:15

colleagues here at Stanford uh is

play19:17

ColBERT um so what this does is is late

play19:21

interaction uh so so instead of just

play19:24

having this dot product here you have a

play19:26

kind of more complicated uh

play19:28

version of computing a score where you

play19:30

aggregate over sort of Maximum

play19:32

similarity scores between different

play19:34

words so I only recently actually

play19:36

discovered that this is called ColBERT

play19:38

because of the late night show Colbert

play19:40

so it's sort of Omar's joke actually

play19:43

this name but just just so you know if

play19:45

you run into it um so um but but I think

play19:51

if if we look at kind of where the

play19:52

state-of-the-art has has been going now

play19:55

one of the nice things about these

play19:56

Vector databases is that they're super

play19:58

efficient right so dot product is much

play20:00

more efficient than this late

play20:01

interaction stuff especially if you do

play20:03

the approximate nearest neighbor search

play20:05

um but there's been some really cool

play20:07

work so things like SPLADE uh they

play20:11

basically have sparse meets dense in

play20:14

a way so one of the big problems as I

play20:15

said with sparse is that you can't really

play20:17

handle synonyms and things like that but

play20:19

what you could do is take a dense model

play20:22

like a BERT model look at kind of this

play20:24

this one word in your sequence try to

play20:27

see which other words in the same slot

play20:29

so that gives you the synonyms uh so now

play20:32

you can give all these synonyms to a

play20:34

sparse uh vector and then you can just

play20:36

do sparse dot product and so have a much

play20:39

much more efficient way to do search uh

play20:42

without sort of giving up on all the the

play20:44

cool stuff that you get from a dense

play20:46

representation um so that's one thing

play20:49

and this other idea I really like uh is

play20:51

called Dragon um so this I think is

play20:54

really the best generalized

play20:57

dense retriever so if you want to take

play20:58

something off the shelf right now and

play20:59

just go to hugging face or something

play21:01

then this dragon or Dragon plus is

play21:03

probably the thing you want to use for a

play21:05

dense Retriever and the way they train

play21:07

this is is through this Progressive data

play21:10

augmentation strategy to make them the

play21:12

model better and better over time by

play21:13

sampling very difficult negatives um and

play21:16

that gives you very good uh

play21:19

representations um and and so the other

play21:21

thing about this I think this is the

play21:23

only only sort of final point about uh

play21:26

retrieval in general is that is that

play21:27

what we see happening right now if you

play21:29

look at sort of the developer Community

play21:31

around rag is that they're all doing

play21:32

hybrid search right now uh so you can

play21:35

actually just combine the search results

play21:37

from your sparse bm25 or whatever thing

play21:40

or SPLADE and you can combine them with

play21:42

your dragon uh and then you get uh this

play21:45

ranking that works even better uh so

play21:47

then you kind of get Best of Both Worlds

play21:48

but then you get all these questions

play21:50

about how do you combine the

play21:52

results um any any questions on on this

play21:56

part oh can you hear me

play21:59

yes oh sorry um on the earlier slide uh

play22:02

was there has there been any work on um

play22:04

Benchmark how much less hallucination

play22:07

rag incurs over a closed book question

play22:10

answering for example directly asking

play22:12

the large language model the question

play22:14

has there been any benchmarking studies

play22:16

in this yeah so there there's a great

play22:18

paper if I can say so myself on the fact

play22:21

that retrieval augmentation reduces

play22:23

hallucination uh it's from 2021 I think

play22:26

um so so yeah you can just F find if you

play22:29

literally look for retrieval

play22:30

augmentation reduces hallucination then

play22:32

you'll find the paper uh thank

play22:43

you yeah so so uh very often you want to

play22:47

have um an very precise word overlap for

play22:51

things where you don't want to have the

play22:53

synonyms or the kind of nearest

play22:54

neighbors right so um if there's like a

play22:57

brand name name or or something like

play22:59

that then like let's say the brand is

play23:01

apple right you don't want to find stuff

play23:03

about pairs right so that's what you

play23:05

would do with a dense retriever um so so

play23:08

it really kind of depends on what you

play23:11

want to use it for that's why hybrid is

play23:13

probably the way to

play23:14

go it's a good

play23:17

question with the

play23:19

dance it's

play23:21

um it's contextualized that but

play23:24

shouldn't it realize Apple the company

play23:26

would be different from no so so if they

play23:29

were actually contextualized then yes

play23:31

but but very often it's a a frozen

play23:33

retrieval system right that's one of the

play23:35

problems with all the Frozen rag

play23:41

stuff I might be missing very

play23:44

be referring to the vectors that

play23:48

you're the vectors that you're using is

play23:52

or uh no so so the the the the sort of

play23:58

document and the query that they're the

play24:00

same right so they're either sparse or

play24:02

they're dense but so if they're sparse

play24:04

the components of the vector are are

play24:06

literally the other

play24:09

work you just Oneal when

play24:12

you're the thing that

play24:16

creates uh how are you getting so it's

play24:20

literally counts right so so basically

play24:23

it's a one big Matrix of documents as

play24:26

rows and the columns are the words in

play24:28

the documents and then you just count

play24:30

how often a word occurs in a document

play24:33

right so that's as

play24:35

far also

play24:39

refering yeah and so so in the field we

play24:42

call them sparse sparse embeddings or

play24:45

sparse retrieval because most of that

play24:47

vector is zero right because most words

play24:50

don't occur in that

play24:53

document does that make sense

play24:56

yeah

play24:58

cool um so um let's talk about uh doing

play25:04

slightly better so so going back to

play25:05

Stephen's question about okay we we have

play25:07

this kind of retrieval thing but like

play25:09

how do we actually make this retriever

play25:11

good for the context that is going to be

play25:13

used in right so can we contextualize

play25:15

the retriever for the generator uh even

play25:18

if it's it's a generator where we might

play25:20

not have access to the weights so it

play25:22

could be a GPT-4 model we just send it to

play25:24

some API we get some stuff back um

play25:28

and so uh one paper I really like is

play25:30

called replug um so just just to kind of

play25:33

explain what this looks like so you have

play25:35

this context you have a retriever that

play25:37

we do the the standard retrieval set

play25:39

with this is a dense retriever um and

play25:42

now sorry um and now you uh compute the

play25:45

the likelihood so basically just

play25:47

normalize the scores that you get for

play25:49

for the topk documents to get a

play25:52

distribution here and then uh you give

play25:54

each one of the retrieve documents

play25:57

separately to this generator to your

play25:59

language model so you can look at the

play26:02

perplexity of the correct answer for

play26:04

that language model right so now we have

play26:06

these two probability distributions or

play26:08

two likelihoods essentially and we can

play26:10

minimize the KL Divergence to make sure

play26:13

that we can actually uh retrieve the

play26:15

documents that lead to the lowest

play26:17

perplexity on the right answer for the

play26:19

language model um so super simple idea

play26:23

uh works really really well uh and the

play26:26

nice thing about this is is completely

play26:28

uh agnostic of what happens Upstream

play26:30

right so this will work for any sort of

play26:32

encoder decoder for any language model

play26:35

um what what you need is a perplexity

play26:38

score uh but for most language models

play26:40

you can get that not necessarily all of

play26:42

them so that's one thing and then

play26:44

there's this other really nice approach

play26:47

um what you what parameters are you

play26:50

changing so so in the retriever you're

play26:53

you're literally updating the uh the the

play26:56

dense representations

play26:58

right so your encoder basically for your

play27:00

dense representation that's good

play27:01

question we'll get more um so there's

play27:05

this another paper uh on in context

play27:07

retrieval augmented language models

play27:09

where the whole paper is basically about

play27:12

just doing bm25 and just giving stuff

play27:15

directly to the context of the language

play27:16

model and things kind of work so it's

play27:18

it's sort of Frozen rag but even even

play27:21

more primitive in a way where the the

play27:23

retriever is uh this very old sparse

play27:26

algorithm but it works really really

play27:27

well um but then they have this really

play27:30

awesome section where they they show

play27:32

that you can just have this uh ranker on

play27:35

top of the bm25 results um and you can

play27:38

backprop into this ranker so now you

play27:40

still keep the language model completely

play27:42

fixed uh so that's sort of this part of

play27:45

the the loss here uh so you have kind of

play27:47

a stop gradient on the parameters theta

play27:49

that's just your language model but now

play27:51

you have this uh this kind of rank

play27:54

function here that you can backprop

play27:56

into right so that's your ranker is

play27:58

basically can be a BERT model or anything

play28:00

like that that works on top of the

play28:01

things you initially retrieve from your

play28:03

bm25 and now you have this BERT re-

play28:05

ranker that you can backprop into um so

play28:09

this also works really really nice so

play28:11

we're slowly progressing towards having

play28:13

a system that is much more optimized for

play28:16

being properly uh retrieval augmented in

play28:19

a way where it's useful and and

play28:20

contextualized for what you want to use

play28:22

it

play28:23

for um so uh yeah just to point out kind

play28:26

of what that looks like with this ranker

play28:28

so you just have this extra step

play28:29

essentially right so we have our

play28:31

retriever then we have our ranker then

play28:33

we have our generator and our

play28:38

output no not

play28:41

necessarily um so so so for this one you

play28:44

do yeah but so for replug you don't

play28:47

right yeah yeah yeah yeah yeah so

play28:52

basically yeah you need to get do apis

play28:54

provide not all of them um some of them

play28:57

do right but but yeah there are all

play28:59

kinds of tricks you can do on top of

play29:01

that

play29:02

yeah um so

play29:04

so basically the question is how do we

play29:07

get sort of gradients flowing into this

play29:09

right so if you don't actually have

play29:10

access to the full parameters of model

play29:13

so that you can backprop all the way

play29:14

through it then you can uh do a

play29:17

reinforce style loss on on the retrieval

play29:20

and then you just pass the kind of log

play29:22

likelihood if you if you have access to

play29:23

that or some other kind of blackbox

play29:26

function

play29:31

all right so um I the next thing you can

play29:35

do uh is to optimize both the Retriever

play29:38

and the generator um and and so this

play29:41

really uh start starts getting to the

play29:43

the proper kind of contextualization of

play29:45

the entire architecture where you want

play29:47

everything to work together right so

play29:49

rather than having this Frozen thing

play29:50

where everything is basically not aware

play29:52

that the other part exists right it's

play29:54

like two halves of the brain they're not

play29:55

talking to each other one is your

play29:57

retriever that is your language model

play29:58

there's no connection they're just like

play30:00

sort of like something is thrown over

play30:01

the fence and then you hope for the best

play30:03

uh so instead of that we have everything

play30:05

much closer and learning together um so

play30:09

um one of the the first um ways of doing

play30:13

this with a generator uh was rag

play30:15

retrieval augmented generation uh which

play30:17

we did at FAIR in 2020 um and and it's

play30:22

very similar to what we've already seen

play30:23

we basically have this retriever here

play30:25

that works over different documents you

play30:27

get some score function uh that gets

play30:29

given to this generator um that that

play30:32

generates answer and now you want to

play30:34

backprop all the way and update your

play30:36

generator as well right so in the

play30:38

previous two architectures we saw you

play30:40

keep the generator fixed you backdrop

play30:42

into your retriever but here we update

play30:45

everything well not exactly everything

play30:47

as you'll see but we'll we'll also

play30:49

update the the part of the Retriever and

play30:52

the

play30:53

generator um so in this rag model uh we

play30:56

actually have two different ways of

play30:58

doing this and this this is probably

play31:00

something that when we talk about this

play31:03

uh if you think about this long enough

play31:04

then you'll you'll think like okay but

play31:06

when actually do I need to retrieve like

play31:08

do I do I retrieve every time I generate

play31:11

a new token or do I just retrieve once

play31:13

and then generate an entire sequence

play31:16

right or maybe I want to retrieve every

play31:18

end uh tokens right so these are hyper

play31:21

prams or maybe I want to learn when to

play31:22

retreat as as we'll see that's also

play31:24

something people have done um so are are

play31:27

two different ways to do it um and and

play31:30

what we do in this paper basic the whole

play31:32

point of the paper is that this Frozen

play31:34

thing doesn't really work all that well

play31:37

right so I think what people Call Rag

play31:39

now is is usually refer refers to the

play31:42

Frozen thing uh but the whole paper

play31:44

basically would never have been accepted

play31:46

anywhere if we had just done the Frozen

play31:47

thing right the whole point of the paper

play31:49

is that you want to uh optimize it and

play31:52

so at my company contextual we call this

play31:55

Frozen thing Frankenstein's monster

play31:57

because it's really like you Cobble

play31:58

together these different pieces right

play32:00

you sort of yeah it's it's really like

play32:02

Frankenstein you just put it together

play32:04

and then it sort of walks you know uh

play32:05

but it doesn't really have a soul it

play32:07

doesn't really actually work it's not

play32:08

the real thing um so that's great for

play32:12

for everyone here I think because there

play32:14

are so many opportunities to do better

play32:15

than what what most people are using

play32:17

right

play32:18

now um so one of the limitations of of

play32:22

the original rag architecture is that it

play32:25

only supports a very small k okay but so

play32:28

if you have lots and lots of documents

play32:30

uh then the problem is that you have to

play32:32

fit all of them in the context but how

play32:34

do you really get that uh to fit right

play32:38

so one thing you can do is you you first

play32:41

encode uh things so that you get one

play32:43

single representation or only the few s

play32:46

of top level representations then you

play32:48

concatenate those and then you just feed

play32:50

them to the decoder so this is FID

play32:52

Fusion-in-Decoder um and as you can see

play32:55

this scales to a much higher uh number of

play32:58

of passages uh and that uh leads to

play33:01

corresponding improvements in uh the

play33:04

scores that you care

play33:06

about uh so that's a really cool idea

play33:08

and so so we're we're slowly moving

play33:10

towards more decoder only architectures

play33:13

right so in RAG we have this BART model

play33:15

it's sort of an encoder decoder

play33:16

architecture but here you just have this

play33:18

decoder that does some fancy attention

play33:21

over stuff that you retrieved before um

play33:24

and and so another like pure decoder

play33:28

language model architecture um is this

play33:31

one

play33:32

kNN-LM which I think is is very elegant in

play33:35

its simplicity so it's basically you

play33:37

just have a normal language model but uh

play33:40

you interpolate the normal language

play33:42

model weights with uh things that you

play33:45

retrieved um so basically you have some

play33:48

sort of prompt right so like Obama's

play33:50

birthplace is you go to your big Corpus

play33:52

you find similar things you look at the

play33:55

words that come next to the similar

play33:57

things uh you uh rank that thing you

play34:00

sample your top K you renormalize that

play34:03

so now you have a bunch of scores and

play34:05

now you can just interpolate between

play34:07

your retrieved kind of non-parametric

play34:10

memory scores and your parametric

play34:12

language model scores so this is very

play34:14

late Fusion in a sense right you at the

play34:16

very end you combine these two uh and it

play34:18

allows you to re reweight the pure

play34:20

language model probabilities or

play34:22

likelihoods um so this works really well

play34:25

and it scales especially well if you

play34:27

have a huge uh retrieval Corpus so if

play34:30

you have trillions and trillions of

play34:32

tokens in there you could have a much

play34:34

smaller language model that does not

play34:36

that much heavy lifting because you can

play34:37

really rely on this big Source Corpus

play34:40

that you're working from and so that

play34:42

idea was uh exploited by this paper

play34:45

called retro out of Deep Mind where uh

play34:49

they showed that you can have a 25 times

play34:51

smaller retrieval augmented language

play34:53

model trained from scratch so really

play34:55

pre-trained uh entirely from scratch

play34:57

that outperforms this 25 times bigger uh

play35:00

language model on the same data in terms

play35:02

of perplexity which is pretty impressive

play35:05

right so this architecture is much more

play35:06

efficient than a parametric model

play35:09

because you can rely on this external

play35:11

memory so if your external memory is big

play35:13

enough uh you can get pretty huge gains

play35:17

so there was a lot of excitement about

play35:19

retro when it was announced uh but it's

play35:21

a deep mind paper so there's really no

play35:23

open source nothing really to validate

play35:26

that this actually Works um and so very

play35:29

recently there has been a bit of work

play35:31

from Nvidia called retro

play35:33

Plus Plus (Retro++) um where they have this hybrid

play35:36

between the Retro architecture and then

play35:39

they do basically Rags sort of they put

play35:41

the top one or the topk results in the

play35:44

context of the language model after all

play35:46

so it's sort of a crossover between Rag

play35:48

and retro and they show some really nice

play35:51

results here but I I think it's sort of

play35:53

pointing to this uh big flaw I think is

play35:56

that why is there still no good open

play35:58

source retro

play35:59

model that probably tells you something

play36:02

about whether it actually really works I

play36:04

I spent a lot of time in my career

play36:06

trying to reproduce deep mind papers

play36:08

that didn't necessarily always work uh

play36:11

and so I I think the the same is true

play36:14

for retro um and that's why we need to

play36:17

do this in context rag on top of retro

play36:19

to actually get it to

play36:21

work but could it just be a true book

play36:24

thing because you're searing onook

play36:28

yeah but so

play36:31

that no so the the doing retrieval over

play36:34

that to over that big Corpus is not that

play36:37

difficult actually yeah um so so they're

play36:40

even like distributed FAISS packages you

play36:43

can just do everything yourself so yeah

play36:46

so in terms of compute it's it's actually

play36:48

not that hard anymore to to reproduce

play36:50

something like this uh but I've tried

play36:53

several times and it it's not really

play36:55

reproducible

play36:57

so the only way to get it to work is if

play36:58

you do this in context rag on top of the

play37:00

Retro thing and then as you can see here

play37:02

in the results then it actually gives

play37:04

you a gain over the pure GPT model right

play37:06

so it starts from a GPT and then they

play37:08

kind of retrofit as they call it the GPT

play37:12

model so in short I think there's still

play37:14

a lot of work to be done in pre-training

play37:16

these systems really from scratch uh and

play37:18

retro kind of showed that it might be

play37:20

possible but we don't necessarily know

play37:22

exactly how to do it the right way and

play37:24

this is really one of the interesting

play37:26

open

play37:27

questions um any questions on

play37:33

that

play37:38

online no okay then we'll move on um so

play37:45

um let's go all the way with the

play37:47

contextualization now right so so with

play37:50

retro and with rag what we actually did

play37:53

is we only updated the query encoder uh

play37:56

so updating the document encoder is very

play38:00

expensive so one of the first papers

play38:03

actually kind of the the OG of the the

play38:05

non-frozen dense retrieval augmented

play38:07

methods is this uh paper called REALM

play38:10

this is really like Visionary work this

play38:12

was basically the first uh uh kind of

play38:16

version that did this properly where

play38:18

they updated it all the way including

play38:20

the document encoder um so can can

play38:23

someone explain to me why it's expensive

play38:25

to update the document en

play38:30

coder so let's say we have a trillion

play38:32

tokens in our Corpus right and now so

play38:36

now we go all the way so we basically do

play38:38

a forward pass we get a gradient at the

play38:40

end now we back propagate the gradient

play38:42

through the retriever we update the

play38:44

query encoder now we have to update the

play38:46

document encoder so what do we then need

play38:48

to do after we've updated the document

play38:50

encoder we need to re-encode the entire

play38:53

internet right so basically every single

play38:56

gradient update we have to re-encode

play38:58

whatever our index is which so if this

play39:01

is like trillions of tokens it's like

play39:02

re-encoding the internet after every

play39:04

batch update so that's not very

play39:12

efficient

play39:15

change

play39:17

Stuff AC have

play39:20

some

play39:23

predictable

play39:25

yeah

play39:27

yeah that's one one way to do it uh so

play39:29

so there there are a bunch of different

play39:30

ways to update the the document encoder

play39:33

so what they do in realm is they

play39:35

basically do it for Te batches then they

play39:39

stop they re-encode the entire internet

play39:41

and then they train again uh so it's

play39:43

sort of asynchronous updates they have

play39:45

this very fancy sort of sharding

play39:47

mechanisms where they take down uh

play39:50

certain parts of their entire index uh

play39:52

and then update them kind of on the Fly

play39:55

uh so you can do it is just very

play39:57

expensive so one one of the things that

play39:59

a lot of people have been thinking about

play40:00

not exactly theora idea but but similar

play40:02

versions of that um are around like can

play40:06

can you make it more efficient so that

play40:07

you don't have to do do this

play40:11

asynchronously um so one of the

play40:13

downsides of this realm uh architecture

play40:16

is that it's really just a BERT model

play40:18

but then you do this retrieval

play40:19

augmentation on a BERT model with other

play40:21

BERT models so it's not really

play40:22

generative it's not really gen in the

play40:25

modern Paradigm but if you want to read

play40:27

like one paper uh on this topic like

play40:30

this is a very good one to

play40:31

read uh the other one that is is really

play40:34

really good to read uh is this paper

play40:37

called Atlas uh so Atlas is um uh so

play40:41

this is out of FAIR um with a bunch of

play40:44

folks the folks who did like Rag and the

play40:46

folks who did FID and uh a really a

play40:49

brilliant set of people and and this is

play40:51

really a comprehensive uh analysis of

play40:54

everything that's happening in this Arch

play40:56

ecture so the first question they really

play40:58

look at is how do we train this

play41:00

retriever so we've seen a couple of

play41:01

versions of this um but uh which one

play41:05

actually works better they haven't

play41:06

really been compared in a head-to-head

play41:08

setting uh so one thing is we have this

play41:10

FiD-style attention distillation uh so

play41:14

that's really too complicated to go uh

play41:16

into detail here but the others are

play41:18

actually very simple um so one is this

play41:21

loss we've basically seen before right

play41:24

uh so we've seen this I think with the

play41:26

in context rag one right so we have a

play41:28

stop gradient on the language model and

play41:30

then we update the retriever the other

play41:32

one is what we've seen with replug so

play41:35

this is basically exactly the replug

play41:37

loss right so we have the K Divergence

play41:39

of the um the documents and and sort of

play41:43

the Improvement that you see when you

play41:44

give it that document uh the other thing

play41:47

they have is basically the inverse of

play41:49

that one so if I take this one document

play41:52

out how does that affect my uh my

play41:55

perplexity of the language model right

play41:58

um and so this one I think is actually

play42:01

quite elegant because that really gets

play42:03

to like how valuable is this one single

play42:05

document for me answering this question

play42:08

correctly um so uh they compare all of

play42:12

these different versions and uh what you

play42:14

can see is that uh the the kind of

play42:17

replug style loss and this leave one out

play42:19

loss they perform a lot better than all

play42:21

of these others so this fixed retriever

play42:23

or no joint pre-training these are

play42:25

really kind of the Baseline sort of

play42:27

Frozen rag models or close book uh and

play42:30

as you can see you can do really a lot

play42:32

better uh if you optimize things and so

play42:35

this leave one outing is probably the

play42:38

best I would say um so then the other
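To make those two winning losses concrete, here is a minimal sketch (my own illustration, not the Atlas code), assuming a toy setup where we already have retriever scores for k documents and the frozen language model's negative log-likelihood of the gold answer with and without each document in context:

```python
# Sketch of two retriever-training losses: a REPLUG-style KL loss and a simplified
# leave-one-out variant. The LM NLL values are assumed to come from a frozen LM.
import torch
import torch.nn.functional as F

def replug_style_loss(retriever_scores, lm_nll_per_doc):
    """KL between the retriever's distribution over k documents and a target
    distribution derived from how much the frozen LM likes each document."""
    log_p_retriever = F.log_softmax(retriever_scores, dim=-1)      # [k] log-probs
    p_lm = F.softmax(-lm_nll_per_doc.detach(), dim=-1)             # [k], stop-grad on LM
    return F.kl_div(log_p_retriever, p_lm, reduction="batchmean")

def leave_one_out_loss(retriever_scores, lm_nll_with_doc, lm_nll_without_doc):
    """Target distribution from the change in LM NLL when a document is removed:
    documents whose removal hurts the most should be ranked highest."""
    gain = (lm_nll_without_doc - lm_nll_with_doc).detach()         # higher = more useful
    log_p_retriever = F.log_softmax(retriever_scores, dim=-1)
    p_target = F.softmax(gain, dim=-1)
    return F.kl_div(log_p_retriever, p_target, reduction="batchmean")

# Toy usage: 4 retrieved documents for one query.
scores = torch.randn(4, requires_grad=True)
nll_with = torch.tensor([2.1, 1.3, 3.0, 2.7])
nll_without = torch.tensor([2.5, 2.4, 3.1, 2.8])
loss = replug_style_loss(scores, nll_with) + leave_one_out_loss(scores, nll_with, nll_without)
loss.backward()   # gradients flow only into the retriever scores
```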

Then the other question is how you actually train that entire system: what data or what tasks do you train it on? They also experiment with a bunch of different versions. One is doing prefix LM, if you're familiar with that: they basically take a chunk that occurs somewhere on the internet and then predict the next chunk from that chunk. So it's really sentence-to-sentence, a bit like skip-thought back in the day, but now you have this retrieval step in between when you predict the next sentence. Then they also do T5-style denoising, which is masked language modeling if you're familiar with T5, and then they have this title-to-section generation piece. I think the takeaway from this table is basically that whatever you do here (they're using a T5 model) needs to be the same thing your language model expects; for T5, that's a T5-style loss.
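As a rough illustration of the prefix-LM setup (my own toy construction, not the paper's pipeline; `retrieve` is an assumed interface), the training pair is just retrieved neighbours plus the first chunk as input, and the next chunk as target:

```python
# Build a "prefix LM" style training example for a retrieval-augmented model:
# split a passage into a prefix chunk and a continuation chunk, retrieve neighbours
# for the prefix, and train the model to generate the continuation.
from typing import Callable, List, Tuple

def make_prefix_lm_example(
    passage: str,
    retrieve: Callable[[str, int], List[str]],   # assumed: returns k neighbour texts
    k: int = 4,
) -> Tuple[str, str]:
    words = passage.split()
    mid = len(words) // 2
    prefix = " ".join(words[:mid])
    continuation = " ".join(words[mid:])
    neighbours = retrieve(prefix, k)
    # Model input interleaves retrieved context with the prefix; the target is
    # simply the next chunk of the original passage.
    model_input = "\n\n".join(neighbours) + "\n\n" + prefix
    return model_input, continuation

# Toy usage with a dummy retriever.
dummy_retrieve = lambda q, k: [f"[neighbour {i} for: {q[:20]}...]" for i in range(k)]
x, y = make_prefix_lm_example("the quick brown fox jumps over the lazy dog " * 4, dummy_retrieve)
```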

Then the final question they look into, going back to what we talked about, is how exactly we update this retriever. Do we have to update the document encoder, do we maybe have to do some sort of re-ranking, or do we maybe just update the query encoder? Quite surprisingly, I think they find that just updating the query encoder, like in the original RAG paper, is actually already basically good enough in many cases. That's nice, because it's much more efficient if you don't have to update your documents all the time. I think the real question here, though, is how good your document representation is to begin with: you need a very, very high-quality embedding model for this to work. If you don't have that, then this will not work; but if you do have it, then you get a very nice kind of query-side fine-tuning setup.
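A minimal sketch of what query-side fine-tuning could look like, assuming a precomputed, frozen document index and a REPLUG-style training signal from a frozen language model (all names here are illustrative):

```python
# Query-side fine-tuning sketch: the document embeddings are frozen, only the query
# encoder receives gradients, driven by which retrieved documents the frozen LM prefers.
import torch
import torch.nn as nn
import torch.nn.functional as F

doc_embeddings = torch.randn(10_000, 256)          # precomputed, frozen index
query_encoder = nn.Sequential(nn.Linear(256, 256), nn.Tanh(), nn.Linear(256, 256))
optimizer = torch.optim.AdamW(query_encoder.parameters(), lr=1e-4)

def train_step(query_features, lm_nll_for, k=8):
    q = query_encoder(query_features)               # [256]
    scores = doc_embeddings @ q                     # [10_000] cheap dot products
    top_scores, top_idx = scores.topk(k)
    nll = lm_nll_for(top_idx)                       # [k], from the frozen LM (assumed callable)
    target = F.softmax(-nll.detach(), dim=-1)       # prefer documents the LM finds helpful
    loss = F.kl_div(F.log_softmax(top_scores, dim=-1), target, reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()                                 # only the query encoder is updated
    optimizer.step()
    return loss.item()
```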

So the Atlas paper is about trying to do few-shot, sort of language-modeling tasks; the few-shot part is about how many examples are given in the context. The main takeaway here is that if you compare the closed-book equivalent model to the retrieval-augmented model, you see very big improvements. That's really the only takeaway of this entire section, but I think it's really saying something in terms of what we should be thinking about.

How much time do I have still? Okay, okay, all right. Other questions?
Are the documents in the training step the same as [the ones used at test time]?

Yeah, so they can be different. In Atlas they basically try everything; they also try to see what happens if you train this on Wikipedia but swap in a Common Crawl index. And I think in Atlas, but also in Retro, the finding is just: the more, the better. It's really just that the bigger your index, the more likely you are to find the exact right thing and then make the right prediction.

Any other questions on this?
Oh yeah, sorry, this is a question about the generator in the RAG system. Recently I saw the Mistral 7B paper; it introduces a lot of these new architectural changes, like sliding-window attention to handle longer sequences at a smaller cost, and grouped-query attention for faster inference. I'd like to know your thoughts on designing a generator specifically for RAG, leveraging for example where Mistral 7B currently is, because I could see how the sliding-window attention could be adapted to the RAG case.

Yeah, so maybe your read on what makes Mistral special is a bit different from mine. I don't think the sliding-window attention thing is actually that interesting; the reason Mistral works so well is that it's trained on a lot of data, and you can do that more efficiently because you have sliding-window attention, so you don't need to attend to everything. But to answer your question, I guess you're asking about the architecture of the generator if you know there's going to be a retriever, and I think that's basically what Retro tried to do. Retro (actually, some of the people on the Retro paper are at Mistral now) has this chunked cross-attention idea: you basically have a language model, but the things you retrieve get integrated into the model not through the standard attention mechanism, but through this slightly different chunked cross-attention.
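A very simplified, shape-level sketch of that chunked cross-attention idea (illustrative only; it omits Retro's causal offsets and other details): split the decoder states into chunks and let each chunk cross-attend to its own retrieved neighbours.

```python
# Shape-level illustration: per-chunk cross-attention from decoder states to the
# encoded retrieved neighbours of that chunk.
import torch
import torch.nn as nn

d, chunk_len, n_chunks, n_neigh, neigh_len = 64, 16, 4, 2, 32
hidden = torch.randn(1, n_chunks * chunk_len, d)                 # decoder hidden states
neighbours = torch.randn(1, n_chunks, n_neigh * neigh_len, d)    # encoded retrievals per chunk

cross_attn = nn.MultiheadAttention(embed_dim=d, num_heads=4, batch_first=True)

chunks = hidden.view(1, n_chunks, chunk_len, d)
out = []
for c in range(n_chunks):
    q = chunks[:, c]                 # [1, chunk_len, d] queries from this chunk
    kv = neighbours[:, c]            # [1, n_neigh * neigh_len, d] its retrieved neighbours
    attended, _ = cross_attn(q, kv, kv)
    out.append(attended)
mixed = torch.cat(out, dim=1)        # back to [1, seq_len, d]
```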

Oh, okay. So the sliding-window-attention point I was trying to get at was that it uses a fixed window: whenever you're doing the query-key computation in the attention, with the query vectors and the key vectors, you're using a fixed attention window. My idea was to use a dynamic window instead, because in the RAG case, if you use a fixed window when you're doing attention, it's possible you're only looking at a fixed span of information. So maybe you could adapt Mistral to make it better for the RAG case, for example by making the fixed window size a dynamic window.

Yeah, I think it's an interesting idea. For me, what Mistral is doing with the sliding window is basically like a convnet, right? We had all these convolutional, lightweight-convolution nets, where you would have word embeddings, do convolutions over them, and then pool, and you would still get the information out. So it's not that the sliding window prohibits you from looking earlier; it's just that that happens higher up in your Transformer.

Yeah, so I think that definitely is an interesting direction to think in.

Yeah. So I think it's not too crazy to ask: are there any architectural changes we can introduce into these 7-billion-parameter models so that they could be better adapted to the RAG case?

Yeah, there might be. I think one question is just how you do the attention over the things you've retrieved, which I think is what you're getting at. Yeah, thanks.
Thanks. So, just to make sure I understand: in this Retro model you're retrieving for each block, and when you talk about putting the retrieval in the context, are you saying you only do it at the beginning, you don't do it [again later]?

Yeah, so in this case it's not exactly every layer; it's every token, every step basically, not every block. So it's not every layer that you do the retrieval? Right, every step. So this is kind of like what RAG-Token does: you retrieve, you generate, and then you can retrieve again; or in the case of Retro you can generate a chunk and then retrieve chunks again. If you look at the in-context case, you retrieve once at the beginning and then you give that to the model.

[Audience follow-up, partially inaudible.]

Yeah, but so with the in-context thing: here you don't actually give it as context at all, like directly to the model; here you let the decoder kind of attend over it.

Yeah, so I don't think cross-attention really works for that. Yeah. Other questions?
Yeah, we [saw that] training the retriever is not so necessary because of the large [index]. So I'm wondering, in what cases do you really need to update it, or update it anyway?

Yeah, so you do want to update the retriever, but only part of the retriever needs to be updated for a lot of these cases. These are very specific datasets, right: Natural Questions, Wizard of Wikipedia, and FEVER, so really very knowledge-intensive tasks. In that case, if you already have a very good system like DPR that is specifically pre-trained for those tasks, then you only need to update the query encoder. But I would expect that if you move beyond this to general language-modeling things, like Retro, then you probably do want to update the document encoder, at least in a way where you can scale it.
[Audience follow-up, partially inaudible; roughly: as long as we have good representations of the documents from good embedding models, is that enough?]

Yeah, but you need to learn how to query into that index, right? If you don't do that, then you don't get really good performance. That's sort of your closed-book performance: if you just have the language model, what does the parametric model on its own, without the retrieval, actually know? As you can see, there are pretty big gaps there.

Right. Other questions, otherwise I'll keep going. Oh, hello, yeah, go for it.
questions no uh hello yeah go for it a

play53:21

quick question like so uh what about

play53:24

like more here at retrieval like I

play53:26

suppose there will be messes trying to

play53:28

not just retrieve a single chunk but

play53:30

some kind of like groups of chunks or

play53:31

something or summarized versions there

play53:34

there's been some interesting work on on

play53:36

doing that uh where you first tried to

play53:38

find so you can have multiple indices

play53:40

and they can kind of cascade right so

play53:41

first you want to find the relevant

play53:43

document so you have some document

play53:44

representation and then within that

play53:46

document you want to find the relevant

play53:48

chunk uh so you can do it sort of that

play53:50

direction you can also do it in reverse

play53:52

I think I I have something on the slide

play53:54

there where you can find the chunk and

play53:56

then sort of expand uh the context

play53:59

around it and then give that to the

play54:00

language model um so I think yeah there

play54:04

are all kinds of interesting things you

play54:05

can do

play54:07

there cool H thanks I guess another

play54:10

thing just like do can you compare rag

play54:13

versus like long context L efforts so

play54:16

there are lot of things like on around

play54:18

just having really long context and

play54:20

extreme it could replace rag but I know

play54:22

like if your takes yeah so so my my uh

play54:26

so everybody understands this question

play54:28

right so there there's there's a trend

play54:30

where we want to have very long context

play54:32

language model so that basically you can

play54:34

like take Harry Potter or something just

play54:36

put it into context and then ask a

play54:38

question like what is the name of like

play54:40

Harry Potter's owl or something right

play54:42

and then it can just attend over the

play54:43

entire thing um so attending over all of

play54:47

Harry Potter to answer that one question

play54:49

is super inefficient right uh so most of

play54:52

Harry Potter has nothing to do with the

play54:54

AL uh so but you are still kind of

play54:56

reading it if you do it with the long

play54:58

context window um so that's why I think

play55:01

the doing it the rag way where you have

play55:02

this non-parametric component is a much

play55:05

more efficient way to solve this problem

play55:07

and if you actually look at the

play55:09

literature on Long context Windows uh

play55:11

the way they they solve the problem of

play55:14

scaling the attenion mechanism is by

play55:16

making it very sparse uh so they're

play55:19

basically turning it so that's a

play55:20

different kind of spars but they're

play55:22

turning it into a non-parametric

play55:23

retrieval problem uh kind of behind the

play55:26

scenes so they're not they're not

play55:27

actually all that different if you want

play55:29

to scale long context then you're going

play55:30

to move towards a rag style

play55:34

architecture good

play55:38

thanks all right um so let's talk about

play55:41

some other interesting questions so one

play55:44

thing and I already alluded to this is

play55:47

when do we actually retrieve so very if

play55:49

we're doing like if we want to uh like

play55:51

retrieve every token that's also very

play55:54

inefficient because I probably don't

play55:56

have to retrieve to generate

play55:58

the right I can probably do that on my

play56:00

own with the language model is of a

play56:02

wayte to go and retrieve stuff but if I

play56:05

only retrieve once at the beginning of

play56:07

the sequence that's probably also not

play56:08

great right so so what we ideally want

play56:11

to be able to do is to say okay

play56:13

sometimes I want to retrieve sometimes I

play56:15

don't want to retrieve and I'm going to

play56:16

learn when I want to kind of expend the

play56:19

the compute Budget on doing the

play56:21

retrieval um so a nice paper where they

play56:24

have a stab at this is called flare for

play56:26

active retrieval augmentation where they

play56:28

basically have the language model decide

play56:31

uh when it should do a search and what

play56:33

it should do to search for um so so I I

play56:37

think this fits in a general Trend that

play56:39

you can see in the field around kind of

play56:41

Agents right so we can talk a little bit

play56:43

more about that too um so this other uh

play56:47

question that that I think we also kind

play56:49

of covered already here is how do we

play56:51

train this at scale right so we can do

play56:52

these asynchronous updates we can do

play56:54

reer rankers we can do query side only

play56:57

there's this really nice paper uh which

play56:59

is quite close I think to the idea you

play57:01

proposed uh where you first use bm25 to

play57:05

create a a batch basically where

play57:07

everything is very similar uh in terms

play57:10

of what you've retrieved and now you uh

play57:13

have this kind of inbatch update so it's

play57:16

it's sort of like a ranker where you

play57:17

encode the information that is just in

play57:19

your batch using this other model and

play57:22

now you can update this model on the fly

play57:24

so you don't have to worry too much

play57:25

about doing the full kind of documents

play57:27

side update um and again here what

play57:30

really matters is like how big is your

play57:32

index if you have an amazing index you

play57:33

can basically solve any problem just by

play57:35

looking it up right so rather than

play57:38

cramming it into your parameters you can

play57:40

just find it

play57:43

um this is a really nice paper uh called

play57:46

Silo so one one of the interesting

play57:48

things I think that's going to happen in

play57:50

the next year or two around language

play57:53

models is there and you've seen this

play57:54

already there's a bunch of like lawsuits

play57:56

against open Ai and other places around

play57:58

where does the data exactly come from um

play58:02

so one uh very elegant solution I think

play58:04

is to have a rag system that you train

play58:06

on data that you know is safe so you can

play58:09

train that thing on Wikipedia But now

play58:12

during test time you can give it a data

play58:14

store that has maybe slightly riskier uh

play58:17

information in it so this massive index

play58:20

of all the stuff on the internet

play58:21

including some things that are maybe um

play58:25

risk uh you can still have them in your

play58:27

index but your language model uh your

play58:29

retrieval augmented language model I

play58:31

should say you know that that thing is

play58:33

safe because it was strin on data that

play58:34

is public domain uh so that's what they

play58:36

do in Silo and they show that that works

play58:38

really well so that's uh one possible

play58:42

solution to to a lot of the the kind of

play58:44

compliance and legal risk around

play58:45

language model

play58:48

deployments um there's a great paper and

play58:51

also from one of your colleagues um

play58:54

around uh contexts getting lost in the

play58:57

middle I think this is also kind of a

play58:58

fascinating phenomenon this is on a

play59:00

frozen rag system um but U language

play59:05

models are very similar to humans in

play59:07

what things they pay attention to so if

play59:09

you give them a bunch of things that you

play59:11

retrieved what what they will look at

play59:13

are like the first things you list and

play59:15

the last things you list and they will

play59:16

sort of ignore the middle um so if it

play59:19

actually respected the rank function

play59:21

then then this curve would go down all

play59:23

the way right but it sort of go goes up

play59:26

um so I I I think that's a a very

play59:28

interesting observation which kind of

play59:30

shows that how brittle uh these these

play59:33

systems can be right so if you have a

play59:35

frozen rag system it can be very very

play59:37

brittle where like the order of the

play59:39

retreat context matters a lot in whether

play59:41

you get the right answer or

play59:44

not work on treating this as re problem

play59:48

sense

play59:50

ofor like specifically going for

play59:53

interpration out VOR that's going to

play59:56

inter prodct with just the right maybe

play60:00

you can tune for the particular

play60:04

dat yeah so what what I just described

play60:06

someone asked like how how do you

play60:08

actually so I said there are other ways

play60:10

to do this and then the question was how

play60:12

do you do that so the way you do that is

play60:13

using reinforce um so yeah there has

play60:17

been work on doing that um so some of

play60:20

the older papers were playing with this

play60:21

but one one of the big problems with uh

play60:25

so I think the replug solution isort of

play60:27

more elegant uh for solving that problem

play60:31

because you actually of use signal from

play60:33

the language model and if you just do

play60:34

reinforce it's very high variant so

play60:36

you're uh it's it's going to be super

play60:38

finicky if you don't want to destroy

play60:40

your

play60:42

index but people have tried it

play60:47

though um so um uh there's some some

play60:51

really nice work from open AI where they

play60:54

they basically basically show and again

play60:55

we're sort of like thinking more and

play60:57

more about agents here right uh where

play61:00

they show something very similar to the

play61:02

flare result from earlier with active

play61:03

retrieval that doesn't necessarily have

play61:05

to be some index that you own it can be

play61:07

just some some web search right um and

play61:10

obviously in this case you don't really

play61:12

have access to the web search

play61:13

necessarily so Bing or whatever they use

play61:15

here is not going to update its

play61:17

parameters uh but I just wanted to kind

play61:19

of put this in your mind like this is

play61:21

another thing you can do right and if we

play61:24

take this really to the general form uh

play61:27

then you can think of language models as

play61:29

just tool users um so rather than just

play61:32

retrieval augmenting language models we

play61:34

can tool augment language models and

play61:36

retrieval is just one of the many tools

play61:38

that language models have access to we

play61:40

can have uh rankers and things on top of

play61:43

the outputs of these tools um and so one

play61:45

of the the big questions I think uh is

play61:48

how do you actually get the system to to

play61:50

learn stuff right so we're going to need

play61:52

our help if we want this system to

play61:54

really learn learn how to take these

play61:55

actions uh

play61:57

properly

play61:58

um um and and so yeah this has been

play62:01

taken to to the extreme in this uh sort

play62:04

of self rag architecture where they have

play62:06

this sort of retrieval step and it's

play62:07

active and then you criticize it and

play62:09

then you uh basically do some natural

play62:11

language inference uh and all of that

play62:13

just with one language model to answer

play62:16

uh the

play62:17

questions um so the other missing piece

play62:20

so I'm just kind of going through a

play62:22

bunch of open questions uh that that

play62:24

people have looked at uh but feel free

play62:26

to interrupt me if there's anything you

play62:27

want to know um but so instruction

play62:30

tuning we established at the beginning

play62:32

of the lecture that this is pretty

play62:33

important for getting things to work so

play62:35

fixing the user interface um but the

play62:39

instruction tuning has almost always

play62:41

only happened on the language model and

play62:43

not on the entire system so I think one

play62:45

of the interesting uh things that people

play62:47

are looking at now with with things like

play62:49

RIT and instruct retro is how can we

play62:51

instruction fine to an entire retrieval

play62:53

augmented system so all the way into the

play62:55

retrieval step can we generate data so

play62:58

that that also follows the instructions

play63:00

properly which currently doesn't happen

play63:02

in any of these model

play63:04

architectures um and then finally I I

play63:07

think I would be remiss if I if I didn't

play63:09

really talk about what people call

play63:11

Advanced rag so so like the developer

play63:13

Community has been really doing some

play63:15

awesome stuff uh so like Frameworks like

play63:18

llama index and Lang chain and there's

play63:19

all these open source Vector databases

play63:21

like groma and wv8 and they're all sort

play63:24

of about making rag really easy but this

play63:26

is all Frozen rag right but even with

play63:29

frozen rag you can really do incredible

play63:31

things um so uh we mentioned some of

play63:34

these already so child parent recursive

play63:36

retriever so you find small small parts

play63:38

and then you give the big parts around

play63:40

it to the language model you can do

play63:42

hybrid search where we use reciprocal

play63:44

rank Fusion so we have like different

play63:45

search results that we then combine

play63:48

before we give the final thing to the

play63:49

language model there's zero shot like

play63:52

large language model ranker so basically

play63:54

the score function is not doesn't come

play63:56

from your retrieval it comes directly

play63:58

from the language model um and then uh

play64:01

hypothetical document and Bets which I

play64:02

think is a really cool idea so you just

play64:05

uh basically you fix hallucination

play64:07

through hallucination uh so you get a

play64:10

question then you let the language model

play64:12

hallucinate a bunch of possible answers

play64:14

then you go and search for nearest

play64:16

neighbors to the possible answers and

play64:17

you give those as context and then it

play64:19

gives the right answer based on that

play64:21

right so it's really like hallucinating

play64:23

answers and I think it's a brilliant

play64:26

solution um so there's a lot of stuff

play64:28

happening in in the kind of Frozen rack

play64:31

Community uh to that I think is very

play64:33

interesting to look at um so uh just to

play64:37

wrap up kind of looking at the future of

play64:40

this stuff uh there are still lots of

play64:42

very interesting open questions so if

play64:44

you're a student thinking about how to

play64:46

solve any of these I think you can have

play64:49

quite a lot of impact um so how how

play64:53

exactly do we do like pre-training of

play64:55

this architecture and do we even need to

play64:56

pre-train I think even retro kind of

play64:59

shows that you don't necessarily have to

play65:00

pre-train so but maybe there's something

play65:02

wrong with how we um how we do that what

play65:05

do skating laws look like so I think

play65:07

there's a really interesting question

play65:08

here around if I have a huge index and a

play65:11

very rich encoder of all the information

play65:13

in that index maybe I can move so

play65:16

basically decouple all the memorization

play65:18

to this index so I have a language model

play65:20

that doesn't know anything it just

play65:22

speaks English it just sort of re on top

play65:24

but it has no knowledge because that

play65:26

always comes from this retriever if you

play65:28

can do something like that then you get

play65:29

very interesting scaling tradeoffs right

play65:31

so you can have a tiny language model

play65:33

and and do your retrieval uh to do a lot

play65:36

of the heavy lifting with your retrieval

play65:38

which is nice because that's a cach

play65:40

computation right so you can just you

play65:42

already have the the embeddings you just

play65:44

need to do the dop product so it's much

play65:46

more efficient than kind of self

play65:48

attention in the language model um can

play65:51

we move Beyond bu encoder so Vector

play65:53

databases um I I like people who build

play65:56

Vector databases but I'm not sure how

play65:58

long we're going to keep Vector

play66:00

databases um because u i I think rer

play66:04

rankers probably work just as well and

play66:06

bm25 is much more efficient than a

play66:08

vector database um so I I don't really

play66:13

see why we need dedicated Vector

play66:15

databases and so what we're seeing but

play66:17

maybe this is a bit of a critique of uh

play66:20

maybe silicon value investment

play66:22

strategies and things like that but a

play66:23

lot of these

play66:24

um um Vector database companies are

play66:27

basically becoming database companies

play66:28

now so they are adding all this Spar

play66:30

stuff because the the densing is not

play66:32

enough um and as it turns out there are

play66:34

a lot of pretty good uh sparse databases

play66:38

out there already like postgress and

play66:39

things like that and they're also all

play66:41

adding vectors uh to their databases so

play66:45

uh I think that's all going to kind of

play66:46

coales into

play66:50

databases um so um I think there are so

play66:54

interesting things to look at for kind

play66:56

of the data so alluding to this

play66:57

instruction problem can we generate much

play67:00

better data for training rag systems

play67:03

synthetically uh and then I think

play67:05

there's this massive open question

play67:06

around how we actually measure whether

play67:08

the rag system is any good so right now

play67:10

we just look at Downstream performance

play67:13

um um which is sort of okay but if you

play67:15

mess up the retrieval it's very hard to

play67:17

measure um but how to how to measure

play67:20

whether your retrieval is right is also

play67:22

very difficult so there are some

play67:23

Frameworks where they try to take like

play67:25

the harmonic mean of your retrieval

play67:27

accuracy and your language model

play67:29

accuracy uh but I think those are also

play67:31

very shy because we don't really have

play67:33

very good uh data sets to measure that

play67:35

on so I think that's that's a very cool

play67:37

problem to work on as well um so the

play67:41

other problem that I personally am

play67:43

always very excited about is

play67:45

multimodality um and so why would we

play67:48

stop with rack systems with just text

play67:51

right so you can do the same thing with

play67:53

images uh you can augment language

play67:55

models with vision so we did this work

play67:57

on lens where we have a language model

play68:00

enhanced to see uh where you can just

play68:02

give kind of a computer vision pipeline

play68:05

just like a retrieval Pipeline and give

play68:07

that to a frozen language model and pass

play68:09

it to the context and that system

play68:11

actually is an amazing visual question

play68:13

answering system it's close to

play68:15

state-of-the-art uh sort of flamingo

play68:17

from Deep Mind which is also very hard

play68:19

to reproduce because there's no open

play68:21

source version of that um

play68:24

so so we've done some early work on this

play68:26

in in 2021 uh where we have this cross

play68:29

modal retrieval and there's some uh more

play68:32

recent workout of fair where they also

play68:34

look at this so I think that's really

play68:36

like if you look at the trend in the

play68:37

field like multimodality with GPD 4V and

play68:40

things like that is really a Hot Topic

play68:41

so everything is kind of going in that

play68:43

direction uh so it's an interesting

play68:45

thing to think

play68:47

about um so overall I think um it would

play68:51

be nice if everybody sort of moves away

play68:53

from from rag 1.0 to Frozen Frankenstein

play68:56

Rag and moves towards this much more

play68:58

kind of optimized version rag 2.0 so

play69:01

it's really about systems over models

play69:03

right it's not just your language model

play69:05

and your Retriever and they're kind of

play69:06

separate it's about thinking from the

play69:08

from a systems perspective about the

play69:10

entire thing and the problem you're

play69:11

trying to solve and so I think that

play69:14

really is the way that in deep learning

play69:16

things have always progressed where if

play69:17

you optimize the system end to end

play69:20

that's always going to win out like back

play69:21

in the day in computer vision or NLP we

play69:23

have like parsers and scam parsers and

play69:25

all this kind of stuff and all that just

play69:27

doesn't exist anymore now because we

play69:30

optimize the system end to endend U so

play69:32

that's what's going to happen here too U

play69:35

so if we take that to the extreme like

play69:36

there's a chunker thing in your

play69:38

documents right like put cutting it up

play69:39

into pieces like you could backdrop into

play69:41

that like why not somebody should really

play69:44

do that um and so yeah I I think like

play69:48

trading off cost and quality uh and zero

play69:50

shop domain generalization that's really

play69:52

like where this stuff is going to come

play69:53

in so language models right now they're

play69:55

amazing but very often they're way too

play69:57

expensive for being deployed somewhere

play69:59

where you can actually make money from

play70:01

them if you're in a company um so what

play70:03

you want to do is make it much more

play70:05

efficient and have the right cost

play70:07

quality tradeoff and the the easiest way

play70:09

I can think of is to do it through

play70:10

retrieval augmentation but obviously I'm

play70:12

I'm very biased um so uh yeah that that

play70:16

was all I had actually um so if you're

play70:18

interested in this I'm I'm at Stanford

play70:20

so I can work with you on research

play70:23

projects on these topics or if you want

play70:25

you can also join contextual because we

play70:27

work on this stuff every day thank

play70:30

you well um sorry I had a question from

play70:35

earlier yeah I think you said something

play70:37

really uh really I think really super

play70:40

helpful earlier about Mel 7B you talked

play70:42

about you compared the sliding window

play70:44

attention to convolutional neural

play70:46

networks and I do see the parallel

play70:48

because with convolutional neural

play70:49

networks you have uh several layers of

play70:51

several different layers of

play70:52

convolutional layers and the top

play70:54

convolution layers are able to see um a

play70:57

larger receptive field than the bottom

play70:58

convolution layers and um and with

play71:01

convolution layers you're able to tune

play71:03

the um filter sizes and the stride so

play71:07

you're able to see a different receptive

play71:09

field and I was wondering if you could

play71:11

see that same innovation in mistal 7B by

play71:14

tuning um because you have different

play71:16

Transformer layers and each Transformer

play71:18

layer will have a span over a different

play71:19

set of tokens and if you can tune I

play71:21

guess the Transformer architecture the

play71:23

way you tune those convolution layers

play71:25

the filter sizes the receptive field

play71:27

perhaps we can do some optimization in

play71:29

the Transformer realm that we have

play71:31

already done in convolution layers yeah

play71:34

I I think that so that's a good idea

play71:36

there's there's a great paper on light

play71:38

convolutions I think from Michael Ali

play71:40

and David G and a bunch of people where

play71:43

it's basically uh this this came out at

play71:46

exactly the same time as the Transformer

play71:48

and the Transformer is slightly more

play71:49

optimized for GPU computation but the

play71:52

the computional model was actually

play71:54

slightly better than the Transformer um

play71:57

so I it's definitely worth exploring

play72:00

okay cool

play72:04

thanks advant the re ranker

play72:07

with that does that

play72:12

advantages TR that yeah so it depends on

play72:15

the problem I I I think what you

play72:17

probably want to do is is sort of cast a

play72:19

white net with bm25 and then just narrow

play72:23

it down with then search uh so you you

play72:25

often see that kind of as a two-stage

play72:27

process where the first one is kind of

play72:28

noisy you can add noise actually to your

play72:31

retrieval and then you use the dense one

play72:33

to filter it

play72:35

down yeah everyone's trying to maybe

play72:39

adap their models to

play72:42

own domain specific area like I think

play72:46

there are many two ways project one way

play72:48

is to use instru tuning in learning way

play72:52

or B tuning like

play72:53

meth and another way is just the main

play72:56

topic of this lecture is using rual or

play73:01

so I'm Wonder besides the low cost

play73:03

advantage of theal AED way do you think

play73:07

the capacity or the quality of augmented

play73:11

can be with those

play73:13

T learning yeah so I I think actually

play73:17

what what's going to happen is that all

play73:19

of this will come together right so so

play73:22

if you train things like end to end rag

play73:25

2.0 style then you can also fine-tune

play73:27

that system on some use case end to

play73:30

endend right so what why would you just

play73:33

take the retrieval augmented system if

play73:35

you can also F tune it on the thing you

play73:37

care about so I think in the end

play73:38

everybody's going to do all of those

play73:40

things and then there's questions like

play73:42

how do you do that efficiently so that's

play73:43

why you would use adapter or things like

play73:48

that think there was another

play73:52

question I'm curious about Hardware you

play73:54

say it's going to become database kind

play73:56

of thing respons database but what about

play74:00

retrieval hardware and you SM because

play74:05

we've thought so much of the you know

play74:07

the Le part but what about because it's

play74:11

hug trillions said so you have any ideas

play74:15

just a database problem so I don't know

play74:17

if I'm allowed to say this exactly

play74:19

actually but uh so one of the the

play74:23

biggest chip manufacturers that recently

play74:26

their stock has done really well they

play74:27

have some dedicated retrieval Hardware

play74:30

coming out I think soon or it might

play74:31

already be

play74:33

out um so yeah so yeah that

play74:37

like very efficient uh dense retrieval

play74:40

is a very big

play74:46

business are

play74:51

questions Sol

play74:58

um yes I I think I think so if you take

play75:01

it to the extreme so one of the big

play75:03

problems right now is that that if you

play75:05

contextualize an existing language model

play75:07

that already

play75:08

hallucinates then then it's going to be

play75:10

kind of hard to get rid of the

play75:11

hallucination right so if you do replug

play75:13

on

play75:14

gp4 gp4 might still hallucinate so you

play75:18

it could basically just ignore all the

play75:19

stuff you retrieved and just do whatever

play75:21

it wants anyway uh so that's one of the

play75:23

reasons why you want to train the system

play75:25

end to end and if you take that to the

play75:26

extreme where like I said right if you

play75:28

can just have the language model only

play75:31

reason and speak so it knows English and

play75:33

reasoning but it has no knowledge which

play75:35

all comes from somewhere else then then

play75:38

you can't lose an so it's really all

play75:40

grounded in whatever is in your

play75:47

index but they're so they're they're

play75:49

about hallucination I I'm sort of

play75:51

frustrated that a lot of people in the

play75:53

field misunderstand what hallucination

play75:55

even means right so a lot of people are

play75:57

conflating hallucination with

play75:58

correctness or incorrectness so they're

play76:00

like oh the model made a mistake it

play76:02

hallucinated it's like no it made a

play76:04

mistake that's different from

play76:06

hallucination hallucination I think is

play76:07

very specific kind of I retrieved

play76:10

something so I have some sort of

play76:11

counterfactual ground truth and what I'm

play76:14

saying uh does not correspond to that

play76:16

ground

play76:17

truth um and so yeah I think there's a

play76:22

bunch of folks that stand for also

play76:23

working on better like measurements of

play76:25

hallucination and definitions and things

play76:27

like

play76:30

that understanding correctly your of

play76:33

hallucination only sense in

play76:36

cont yeah of some ground truth right so

play76:40

so Hallucination is is really like there

play76:43

there is something that is true right so

play76:45

so if we're talking about like

play76:47

hallucination yeah so if we're talking

play76:48

about just general parametric language

play76:50

models then sort of the ground truth is

play76:52

whatever we can consider to be true

play76:56

right but we had to word for like

play76:59

language models making mistakes before

play77:01

it was called making

play77:06

mistakes yeah

play77:08

ground I guess you're solving the house

play77:12

question on that path are you working on

play77:15

on

play77:17

ground you

play77:19

know never been president everything

play77:26

this yeah so so I I like the sort of

play77:29

Silo mention there as well so I I think

play77:32

the whole point is that you can you can

play77:35

have different indices and different

play77:36

definitions of ground truth and so um I

play77:39

think you could say I only trust the

play77:42

archive or I only trust like peer review

play77:44

papers and not just archive uh and so

play77:47

you can make decisions in your

play77:49

architecture during test time about what

play77:50

You' Define as ground truth

play77:53

and I also think actually that uh and

play77:57

there's a bunch of work I think

play77:58

happening on this right now you can

play77:59

control for how how grounded you want to

play78:01

be in your ground TR so uh that's

play78:05

another kind of misconception about

play78:06

hallucinations like sometimes

play78:08

hallucinations are actually good right

play78:10

if you have a creative writing assistant

play78:11

and you wanted to come up with some cool

play78:13

new ideas you want the language model to

play78:15

hallucinate uh so I I think what you

play78:18

want to have is kind of a tunable knob

play78:19

where you say like now you can

play78:21

hallucinate and now maybe you should

play78:22

like really tell me the truth

play78:30

only anything

play78:38

else control

play78:41

yeah yeah so but the temperature that's

play78:44

just about how you sample right so how

play78:46

flat your your distribution is

play78:50

sample

play78:51

yeah

play78:53

yes but so even if you have a low

play78:55

temperature it can still come up with

play78:57

random stuff right so it just says that

play79:00

then you're very likely to do like

play79:01

greedy sampling um so so I I think what

play79:05

you want to get at is is something more

play79:07

sophisticated than
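To see why temperature is about the shape of the sampling distribution rather than grounding, here is the whole mechanism in a few lines: it rescales the logits before the softmax and never consults any evidence.

```python
# Temperature-scaled sampling: low temperature concentrates probability on the top
# token (near-greedy), high temperature flattens the distribution.
import numpy as np

def sample_with_temperature(logits: np.ndarray, temperature: float, rng=np.random) -> int:
    scaled = logits / max(temperature, 1e-6)
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))

logits = np.array([2.0, 1.5, 0.2])
print(sample_with_temperature(logits, temperature=0.1))   # almost always token 0
print(sample_with_temperature(logits, temperature=2.0))   # much flatter, more random picks
```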

Lots of interesting questions. Yeah, I liked the questions. Thanks again for the great talk.

Thank you.