Stanford CS25: V3 I Retrieval Augmented Language Models
Summary
TLDR: This lecture takes a deep dive into the frontier of retrieval-augmented language models (RAG) and the challenges they face. The speaker first reviews the history and development of language models and emphasizes the importance of retrieval augmentation for improving language understanding and generation. He then analyzes in detail how RAG works, its advantages, and its open problems, such as hallucination, attribution, and staleness. He also discusses how to optimize RAG systems by making the retriever and generator work together, and closes with an outlook on future research directions, including multimodality and end-to-end system optimization.
Takeaways
- 🎓 The guest lecturer is the CEO of Contextual AI and an adjunct professor in Symbolic Systems at Stanford, with a deep background in machine learning and natural language processing (NLP).
- 📈 The age of language models has arrived, but enterprise deployments still face challenges around accuracy, hallucination, attribution, and data freshness.
- 🔍 Retrieval augmentation is an active research direction that strengthens language models by coupling them to an external memory (such as a document index).
- 🌐 With external retrieval, a language model can operate "open book": it does not have to memorize everything in its parameters, which improves efficiency and flexibility.
- 🔄 RAG (Retrieval-Augmented Generation) is an architecture that combines a retriever with a generator and can dynamically retrieve relevant information during generation.
- 📚 The lecture covers concrete techniques such as TF-IDF, BM25, and DPR, which are widely used in document and information retrieval.
- 🔧 Building a retrieval-augmented language model means optimizing the whole system, including the retriever, the generator, and the query encoder.
- 🔍 Evaluating retrieval-augmented systems requires considering training data, task type, and update strategy.
- 🌟 Future directions include multimodal retrieval, system-level optimization, and better ways to control and measure hallucination.
- 🚀 The lecture offers some forward-looking ideas, including using retrieval augmentation to improve efficiency, generating better training data, and exploring new hardware and software architectures.
Q & A
What is a language model?
-A language model computes the probability of a sequence of words; it can be used to predict the next word or to generate text. Language models are used throughout natural language processing, for example in machine translation and speech recognition.
How old are language models?
-The concept of a language model is not new; it has existed for decades. The earliest neural language model dates back to 1991.
What is RAG (Retrieval-Augmented Generation)?
-RAG is a model that combines retrieval with generation, using retrieved information to support generation tasks such as text generation. It usually consists of a retriever and a generator: the retriever finds relevant information in a large corpus, and the generator produces a response or text conditioned on that information.
How does RAG address the staleness of language models?
-RAG introduces an external memory (such as a document index), so the model can access up-to-date information instead of being limited to its training data, keeping answers timely and accurate.
How does RAG reduce hallucination during generation?
-RAG retrieves real-world data and supplies it as context, so generated content can be grounded in reliable sources, reducing the chance that the model makes things up.
In RAG, what is the distinction between parametric and non-parametric?
-Parametric means the model stores and processes all knowledge in its parameters (e.g. the weights of a neural network). Non-parametric means the model relies on external data (such as retrieved documents) to support decisions and generation, rather than only on its internal parameters.
What is TF-IDF?
-TF-IDF (Term Frequency-Inverse Document Frequency) is a common weighting scheme in information retrieval and text mining. It scores how important a word is to a document by combining the word's frequency in the document (TF) with the inverse of its frequency across all documents (IDF).
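As a rough sketch, one common form of the score, with N the number of documents in the corpus and df(w) the number of documents containing word w, is:

$$\mathrm{tfidf}(w, d) = \mathrm{tf}(w, d) \times \log \frac{N}{\mathrm{df}(w)}$$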
What is BM25?
-BM25 is a ranking function based on the probabilistic retrieval framework that estimates the relevance of a document to a query. It remains one of the most widely used ranking algorithms in information retrieval.
In RAG, what is the closed-book vs. open-book exam analogy?
-A closed-book exam is like a traditional language model, which must memorize all knowledge in its parameters. An open-book exam is like RAG: the model can consult external information while generating an answer, just as you can consult books and notes during an open-book exam.
How do the retriever and the generator in RAG work together?
-The retriever first finds information or documents relevant to the input question or context, and that material is passed to the generator as context. The generator then produces the answer or text based on it. Working together, the two let the model combine internal parameters with external information to complete the task.
What are the practical advantages of RAG?
-RAG combines the strengths of retrieval and generation: it can provide more accurate and up-to-date information and reduce hallucination. Its external memory can also be updated and revised, so the system can adapt to changing data and knowledge.
Outlines
🎓 Introduction and background
This is the final lecture of the quarter. The guest is the CEO of Contextual AI, an adjunct professor in Symbolic Systems at Stanford, and previously head of research at Hugging Face and a research scientist at Facebook AI Research. He holds a PhD and a master's degree from the University of Cambridge and a master's in logic from the University of Amsterdam, and studied philosophy and cognitive AI as an undergraduate. His work focuses on machine learning and natural language processing (NLP), in particular on developing better models for language understanding and generation and better tools for evaluation.
🤖 Language models: development and current state
The lecture begins with the history of language models, noting that this is a decades-old idea and was not invented by OpenAI. Trained on large amounts of data, modern language models show remarkable capabilities, but they also have problems, such as the truthfulness and freshness of what they generate. To address these problems, researchers are exploring external memory and retrieval augmentation (RAG).
🔍 Retrieval augmentation: basic concepts and architecture
The lecture discusses the basic concepts and architecture of retrieval augmentation, including the roles of the generator, the retriever, and the query encoder. Passing retrieved external information to the language model as context improves accuracy and freshness. The lecture also covers how these components are treated differently at training time and test time, and how optimizing the whole system improves performance.
📚 From sparse to dense: the evolution of retrieval methods
The lecture traces the evolution from sparse retrieval (TF-IDF, BM25) to dense retrieval (vector-based methods such as DPR and ORQA). Dense retrieval captures semantic similarity through vector representations, while sparse retrieval relies on word-frequency counts. Modern retrieval systems often take a hybrid approach, combining the strengths of sparse and dense retrieval to improve accuracy and efficiency.
🌟 Optimizing RAG and future directions
The lecture explores how to optimize RAG systems: how to train the retriever better, how to update the document encoder, and how end-to-end training improves overall performance. It also raises open questions, such as how to train large-scale RAG systems more efficiently, how to generate better training data, how to measure a RAG system's effectiveness, and how to bring multimodality into RAG.
💡 Q&A: questions and discussion
In the Q&A, the guest answers questions about RAG systems, model architecture, training strategies, and hardware optimization, discussing how different training methods and hardware support can improve RAG performance, and how to handle the truthfulness and creativity of model outputs.
Mindmap
Keywords
💡Language model
💡NLP
💡Multimodality
💡Retrieval augmentation
💡User interface
💡Generative Adversarial Network (GAN)
💡Open-domain question answering
💡Sparse representations
💡Dense representations
💡Contextual understanding
💡Model evaluation
Highlights
We are in the age of language models; the concept was not invented by OpenAI but has existed for decades.
The basic idea of a language model is intuitive: factorize the probability of an input sequence and predict the next token.
ChatGPT fixed the user interface to language models, letting people interact with them more naturally.
Current language models face challenges including hallucination, attribution, and stale information.
Retrieval augmentation is an architecture that strengthens a language model by coupling it to an external memory.
RAG (Retrieval-Augmented Generation) combines a retriever and a generator, and the whole system can be optimized end to end.
Optimizing a RAG system can yield more efficient language models that consume less compute.
When training a RAG system, different strategies can be used to update the retriever and the generator for different application scenarios.
A RAG system can use active retrieval to decide when to retrieve, allocating compute more intelligently.
By swapping indexes and datasets, a RAG system can adapt to different domains and needs.
The development of RAG highlights the importance of optimizing at the system level rather than the single-model level.
Future RAG systems may include multimodal capabilities, such as combining visual information to strengthen language models.
A key research direction for RAG is generating better training data to improve model performance.
Measuring RAG performance is hard: one must evaluate both retrieval accuracy and the language model's output.
The development of RAG may drive the emergence of dedicated retrieval hardware to make retrieval more efficient.
The trend in RAG is a shift from single models to system optimization, striking the best balance between cost and quality.
Transcripts
hey guys welcome to our last lecture
um of this quarter and we're very happy
to have Douwe here he's the CEO of
Contextual AI um the enterprise LLM
company as well as an adjunct professor
in Symbolic Systems here at Stanford and
previously he was the head of research
at Hugging Face and before that a
research scientist at Facebook AI Research
uh he received his PhD and master's from
the University of Cambridge um as well
as a master's in logic from the University
of Amsterdam and studied philosophy and
cognitive AI in undergrad um and his
work focuses on machine learning as well
as NLP specifically on developing better
models for language understanding and
generation and better tools for
evaluation um yeah give it up for
Douwe Kiela
right thank you so I guess I have to
sort of stand here in the corner so
people can see me on the zoom as well um
yeah thanks so much for having me here
um so I asked Steph what I should talk
about there were a couple of things I
could talk about multimodality or
evaluation uh and this was the preferred
topic I guess uh because the others were
already covered um so yeah I'm I'm very
happy to talk to you about everything
retrieval augmentation um I think this
is really one of the topics right now in
our field um so I I'll just give you an
overview of what's been happening and
what I think are the interesting
questions to think about um so first of
all obviously in case you've missed it
we are in the age of language models um
and I just wanted to do a quick poll
here in this not not super big audience
I guess there's more people on the zoom
but uh who invented language
models if if you thought OpenAI then
I'm angry with you right so so actually
uh this is a very very old idea so the
idea is just you you take a sequence and
you factorize out the token
probabilities right.
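As a formula, the factorization he's describing, for a sequence of tokens x_1, ..., x_T, is

$$p(x_1, \dots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1})$$

where the model's job is to predict each next token from the prefix.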
and um so it wasn't invented by OpenAI it's not like a few
years old it's actually several decades
old uh so I'm bringing this up because I
was talking to someone and they were
like OpenAI invented language models
and I was like you're kidding me right
um so um I I I went back to the
literature and this is the oldest one I
could find actually 1991 first neural
language model um there's a very nice
paper from 2003 from
Bengio where they they actually have like
word embeddings and everything already
in there uh so obviously these are LMs
not LLMs um and as it turns out if you
make them really big and you
parameterize them with these massive uh
neural Nets then you get something
really powerful that really shows
emerging uh properties right and that's
why we're all excited in this stuff um
so if we think about this from like a
classic CS perspective there's input
output right there's this kind of thing
in the middle it's the generator so uh
we take a sequence the input sequence
and then the the task of the model is to
predict the next token very very simple
model um and and so you know that's why
it was so easy to come up with this in
1991 already because it's like the idea
is very intuitive but for a long time
what was really broken with this was the
user interface um and and this I think a
lot of people kind of misunderstand what
ChatGPT was about that's really what
ChatGPT fixed right so that
initially you had to come up with these
very weird prompts in order to get your
language model to do what you wanted it
to do uh and humans are terrible at this
right so so we're much better at sort of
telling people or things around us what
we want right so if we have a dog we say
sit we don't prompt it in a very weird
way so that it sits right and it's the
same with the language model if you
wanted to generate some rap lyrics in the
style of a pirate or Shakespeare or
something then you tell it generate some
rap lyrics in the style of a pirate right
so that kind of instruction data
actually turns out to be super super
rare in just web data so what you need
to do is you need to fix the user
interface to the language model and the
the classic recipe for doing that is the
the sequence basically that ChatGPT
used right so you prompt the model in a
specific way you instruction finetune the
model and you do some alignment RLHF
uh whatever you do on top of that so
that's the first thing so now you have a
working language model with a working
user interface so are we done then um
obviously we're not right so so right
now language models are are kind of
taking the World by storm but if you
talk to anyone especially in an
Enterprise for example where they have
very strict uh accuracy requirements
they will tell you that they can't
really productionize this yet um and the
reason is because there are all these
familiar problems probably a bunch of
you are working on these problems right
now uh around
hallucination um so these models they
kind of make up stuff very often with
very high confidence which is uh even
more scary in a way attribution so we
don't really know why these models are
saying what they're saying staleness
they go out of date and so this was a
big problem with sort of ChatGPT not
knowing anything that happened after a
certain cut off date and they keep
updating it every once in a while but
you want to have a system that's always
completely up to date that never goes
still um you want to be able to to
revise the information in the system so
uh if you're uh a European organization
you have to worry about GDPR uh which
means that you need to be able to remove
information from the language model or
maybe revise facts uh which we don't
really know how to do right so again
this is a very interesting uh area of
study for a lot of folks model editing
um but so this is something that we
really want to be able to fix and then
there's this big question of how do you
customize these models uh so different
people have different use cases you have
different data if you're a company or if
you want to have a language model on
your own data how do you make it work on
your own data so one of the solutions uh
that everybody has started using right
now is to couple it to an external
memory so that's really just RAG right
uh and this whole lecture is
basically about RAG uh but the way to
understand uh what is going on here is
uh we have this generator just like
before we have the input and a prom just
like before but now uh instead of just
those two things we give this additional
context so we contextualize the language
model using things we retrieve and and
the retriever uh is is very often pretty
simple it's just a query and a document
encoder um and then you get a bunch of
documents you give them as context
through the model so super simple
architecture um and I think it's useful
to think about it from the perspective
of of these two separate paradigms uh so
if you've ever taken an exam I'm sure
you have right uh you can have a closed
book exam where you have to memorize all
of this so you have to cram all the
knowledge into your parameters your
neurons uh or you have an open book exam
where you have all of this information
in the book that you can access when you
do the exam uh so it's a very similar
thing with rag right you can just make
it an open book setting where you give
it access to this external information
Wikipedia or something else or basically
the entire internet uh and then have the
language model do its job without having
to memorize all of it in its
parameters um so the other I think
useful distinction here is that uh
cramming everything into your parameters
that's the parametric approach right so
U what we're doing with rag is we're
adding this non-parametric retrieval
component um so uh you might call this
semi-parametric um if you want to give
this a
name all right so why why does that
actually solve these issues and so the
answer is basically that if you have
this separate Index right this separate
retriever you can swap it in you can
swap it out you can replace it with a
new index so you can really customize it
and so you can customize your language
model system for what the user really
wants to see um and then obviously you
can update this index so um it doesn't
really go stale and you can revise it
if anything goes
wrong uh the other thing you get is
grounding right so that that's initially
why I became interested in this kind of
architecture because I was thinking a
lot about grounding and multimodal and
things like that and actually one really
nice way to ground things is to find
some other information that you can
ground your generation in so you really
want the language model to only say
things that it has evidence for in this
other piece of text or even multimodal
data that it retrieved separately so if
you do that then you get less
hallucination because you can always
point back to your Source it's always
grounded in your Source um and you get
attribution because you know why the
model is saying what it's saying it's
because it founded this thing here is
that
all right so um for the rest of this
lecture we're going to talk about this
this basic architecture um and so it
kind of looks like a pretty simple thing
right uh but there are actually lots and
lots of questions you can ask about what
what this system should really look like
um and like this this doesn't even cover
like half the questions you can ask so
it really is about how how do we
optimize this entire system right so we
have the separate components the
retriever the generator and then um
there are things like this query encoder
how do we encode queries how do we uh do
the retrieval do we update the document
encoder how do we actually uh Define a
document right is it like a full
document or is it a paragraph or a chunk
or a sentence or a couple of words um so
there are lots of questions to ask and
and uh as you'll see there are lots of
possible answers to these questions as
well um so this is what we'll we'll
cover
um so there are lots of
architectures um going into these
questions and I think as we go through
them it's useful for you to think about
what happens during training time and
what happens during test time right so
during training time is really uh okay
we have this language model we have this
retriever um which one do we update how
do we update them how do we train this
entire system do we maybe not train it
at all uh do we pre-train it from
scratch do we initialize it with uh
components that were already separately
trained these are the kinds of questions
that that you have to answer if you want
to design a system like this and then
during test time uh you have this entire
system right so actually multiple models
in a way uh that are working together um
so so there's also different things you
can do there right so give it different
indices during test time or uh
manipulate kind of how you're sampling
things like
that so um the starting point for all of
this stuff I think if you ask someone
now like what is rag they will think of
this thing um so this is frozen rag
basically uh there's no training here at
all so going back to this question of
train time test time there's only test
time here train time happens separately
with these kind of blackbox models that
we don't necessarily have control over
right so there's this document embedding
model uh whatever is currently at the
top of some open source uh leaderboard
uh you use that to oop sorry uh to get
some vectors that you then use to create
this Vector database and then the vector
database just does search and it gives
the information from the search to the
language model and it just passes it as
uh as the context right so this is this
only works because of in context
learning.
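A minimal sketch of that frozen pipeline, with hypothetical embed, vector_db, and llm components standing in for whatever off-the-shelf models and vector database are used:

```python
def frozen_rag(question, vector_db, embed, llm, k=5):
    """Frozen RAG: no training anywhere, just retrieve-then-read."""
    query_vec = embed(question)               # frozen embedding model
    docs = vector_db.search(query_vec, k)     # vector DB does the ANN search
    context = "\n\n".join(docs)
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return llm(prompt)                        # in-context learning does the rest
```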
um and you know I think as a machine learner myself this feels very
inelegant um so what what this lecture
is about is can we do better than than
this Frozen
thing um so let's let's start from the
the left side of this like okay if we
want to outperform this Frozen thing
itself with just the vector database
like what would that look like from a
retrieval
perspective um and the starting point
for everything retrieval is TF-IDF
does everybody know what TF-IDF is no
okay so so TF-IDF is basically a sparse
retrieval method where you have a score
function uh that that looks at documents
and queries so D and Q and then there
are basically two terms that matter one
is the TF the term frequency and the
other is the IDF the inverse document
frequency so this inverse document
frequency is actually a really nice idea
from Karen Spärck Jones a really underrated
researcher she's done some amazing work
um but the basic idea is that you want
to look at the words that are very
special so that don't occur in lots of
different documents and so the overlap
between the word "the" doesn't really
matter right like "the" occurs
everywhere so you want to have sort of
the special words so that's what what
TF-IDF does in a nutshell it gives you a
score for document query overlap and
then you can do all kinds of things here
with how how you weigh it so there's all
these weird different parameters like
this B and things like that that allow
you to make it better than just having
the TF-IDF score so there's a couple
of tweaks you can do there so BM25
actually in case you're wondering stands
for Best Match 25
so I I try to discover like where does
the 25 actually come from uh that's
because the preceding 24
experiments failed right so it's
literally the 25th one that seemed to
work and that's why it's called
BM25 it's bizarre right.
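To make the scoring concrete, here is a minimal BM25 sketch over a toy corpus, assuming whitespace tokenization, with k1 and b the usual free parameters mentioned above:

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat".split(),
    "warsaw is the capital of poland".split(),
    "how many inhabitants does warsaw have".split(),
]
N = len(docs)
avgdl = sum(len(d) for d in docs) / N
df = Counter(w for d in docs for w in set(d))    # document frequency per word

def bm25(query, doc, k1=1.5, b=0.75):
    tf = Counter(doc)                             # term frequencies in this doc
    score = 0.0
    for w in query:
        if w in tf:
            idf = math.log((N - df[w] + 0.5) / (df[w] + 0.5) + 1)  # rare words matter more
            score += idf * tf[w] * (k1 + 1) / (
                tf[w] + k1 * (1 - b + b * len(doc) / avgdl))
    return score

ranked = sorted(docs, key=lambda d: -bm25("warsaw inhabitants".split(), d))
```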
um so so this is sparse retrieval it's just
counting words right so you have this
massive massive Vector of all these word
occurrences it's sparse because most
words never occur right so it's sort of
like a vector of uh vocabulary size
dimensions so most of that is obviously
zero um but so that's actually kind of a
nice property if you want to do fast
search on a CPU right because on a CPU
sparse uh dot product is very easy to
compute so um this is used in in the
system called uh DrQA which is really
one of the first neural instances of
this open domain sort of open book
question answering Paradigm um so you
have a question like how many of
Warsaw's inhabitants blah blah uh so you
want to ask basically Wikipedia what the
answer is for this so then you have this
document retriever based on the sparse
so BM25 I think in this case uh
retrieval methods you pass that to um
I think this was still a BiLSTM at
the time um a document reader model and
then that model gives you the answer um
so this I think is really the first
instance of having sort of this
separation between a retrieval and a
generator system that you use for
answering complicated questions based on
sort of open domain
knowledge um so after the sparse stuff um
there was a bunch of work on dense
retrieval and and so the advantage of
dense retrieval so this is just like
word embeddings basically vectors right
they're they're dense now no longer
sparse so they're much uh smaller in
terms of dimensionality and the nice
advantage of of dense retrieval is that
it's not really about specific words
right so uh if there are synonyms you can
still um find the relevant document uh
which you couldn't really do with a
sparse representation right so that's
really the advantage of dense is that you
get like semantic
similarity um so you can do this over
word embeddings that doesn't really work
all that well but uh at the time that
people started thinking about this BERT
was already out there and BERT is really
great for giving you a vector
representation for an entire sequence of
words right so a sentence representation
or a passage representation
so there are all these cool systems like
ORQA and uh DPR the dense passage
retriever where um they essentially use
the retrieval as a kind of latent
variable in the system um and and the way
to get the latent variable to to work to
be good enough essentially to train the
entire system is to pre-train the
retriever on uh relevant information so
for ORQA they do something called
inverse cloze uh so they do kind of a cloze task
where you want to find
um passages that are sort of relevant to
the preceding passage and in DPR they
just train it on on a supervised thing
but really the core idea here is that uh
as you can see in this graph here you
can do better than BM25 if you add lots
of documents and the way you compute
this score function is much simpler it's
just a dot
product right um so the nice thing about
dot products is that you can do them very
very efficiently on the GPU as well um
if you uh know what you're doing so what
you really want to get at is maximum
inner product search MIPS right this is one of
the kind of core ideas of a lot of this
stuff um and you can do MIPS with ANN
approximate nearest neighbor search um
so there's this this really uh brilliant
piece of work out of there for my
colleagues at the time uh called phas
which really underlies all of these uh
modern Vector databases right so like
all the popular ones they're sort of
re-implementations of this FAISS idea one
is in like Rust one is in Go but it's
all basically the same idea it's just
FAISS um and so so FAISS really powers a
lot of this stuff um and whenever
somebody tells you something about a
vector database just think about FAISS
very fast dot
product um so obviously you can go
beyond dot product yes what is it what is
FAISS um so so it's an open source
library Facebook AI Similarity
Search no so it's just basic off the
shelf ANN
algorithms yeah so so there are all
kinds of different I don't know if you
do you know what like product
quantization is and things like that so
there they're basically so you have a
bunch of vectors uh and you can just
compute the full dot product which is
sort of inefficient right so what you
can do is try to compress uh subspaces
of the vector and then just look at the
kind of
centroids um so this so you can quantize
sub vectors of the full vector and then
do much faster search over just the
centroids it's a good question.
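A minimal sketch of MIPS with FAISS, assuming dense document and query vectors are already computed; IndexFlatIP is the exact inner-product index, and the quantized variants trade accuracy for speed:

```python
import numpy as np
import faiss  # Facebook AI Similarity Search

d = 128                                                  # embedding dimensionality
doc_vecs = np.random.rand(10000, d).astype("float32")    # document embeddings
query_vec = np.random.rand(1, d).astype("float32")       # query embedding

index = faiss.IndexFlatIP(d)          # exact maximum inner product search
index.add(doc_vecs)                   # this is the "vector database"
scores, ids = index.search(query_vec, 5)  # top-5 documents by dot product
```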
any other questions um all right so so about this
dot product idea right so so what we
have here is some people call this a
Siamese Network I guess it is right so
you have two different BERT models uh or
whatever your encoder is here and then
at the end you get these two vectors and
then you just do dot product so you get
one single score but you can do all
kinds of much fancier things if you if
you're willing to give up on this
bi-encoder uh approach right um so really
nice example from from one of your
colleagues here at Stanford uh is
ColBERT um so what this does is is late
interaction uh so so instead of just
having this dot product here you have a
kind of more complicated uh
version of computing a score where you
aggregate over sort of Maximum
similarity scores between different
words so I only recently actually
discovered that this is called ColBERT
because of the late-night show Colbert
so it's sort of Omar's joke actually
this name but just so you know in case
you run into it.
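A minimal sketch of the late-interaction (MaxSim) score, assuming per-token embeddings for the query and the document are already computed:

```python
import numpy as np

def maxsim_score(query_tokens, doc_tokens):
    """ColBERT-style late interaction: for each query token, take its
    best-matching document token, then sum over query tokens."""
    sims = query_tokens @ doc_tokens.T      # (num_query, num_doc) similarities
    return sims.max(axis=1).sum()

q = np.random.rand(5, 128)                  # 5 query token embeddings
d = np.random.rand(80, 128)                 # 80 document token embeddings
score = maxsim_score(q, d)
```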
if if we look at kind of where the
state-of-the-art has has been going now
one of the nice things about these
Vector databases is that they're super
efficient right so dot product is much
more efficient than this late
interaction stuff especially if you do
the approximate nearest neighbor search
um but there's been some really cool
work so things like SPLADE uh they
basically have sparse meets dense
a way so one of the big problems as I
said with spars is that you can't really
handle synonyms and things like that but
what you could do is take a dense model
like a BERT model look at kind of this
this one word in your sequence try to
see which other words fit in the same slot
so that gives you the synonyms uh so now
you can give all these synonyms to a
sparse uh vector and then you can just
do sparse dot product and so have a much
much more efficient way to do search uh
without sort of giving up on all the the
cool stuff that you get from a dense
representation um so that's one thing
and this other idea I really like uh is
called Dragon um so this I think is
really the the the best generalized D
dense retriever so if you want to take
something off the shelf right now and
just go to hugging face or something
then this dragon or Dragon plus is
probably the thing you want to use for a
dense Retriever and the way they train
this is is through this Progressive data
augmentation strategy to make the
model better and better over time by
sampling very difficult negatives um and
that gives you very good uh
representations um and and so the other
thing about this I think this is the
only only sort of final point about uh
retrieval in general is that is that
what we see happening right now if you
look at sort of the developer Community
around rag is that they're all doing
hybrid search right now uh so you can
actually just combine the search results
from your sparse BM25 or whatever thing
or SPLADE and you can combine them with
your DRAGON uh and then you get uh this
ranking that works even better uh so
then you kind of get Best of Both Worlds
but then you get all these questions
about how do you combine the
results.
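One common way to combine the two result lists, sketched here, is reciprocal rank fusion, which needs only the ranked lists rather than comparable scores:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """rankings: a list of ranked doc-id lists, one per retriever."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

bm25_results = ["d3", "d1", "d7", "d2"]    # from the sparse retriever
dense_results = ["d1", "d3", "d9", "d2"]   # from the dense retriever
fused = reciprocal_rank_fusion([bm25_results, dense_results])
```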
um any questions on this part oh can you hear me
yes oh sorry um on the earlier slide uh
was there has there been any work on um
Benchmark how much less hallucination
rag incurs over a closed book question
answering for example directly asking
the large language model the question
has there been any benchmarking studies
in this yeah so there there's a great
paper if I can say so myself on the fact
that retrieval augmentation reduces
hallucination uh it's from 2021 I think
um so so yeah you can just find it if you
literally look for retrieval
augmentation reduces hallucination then
you'll find the paper uh thank
you yeah so so uh very often you want to
have um a very precise word overlap for
things where you don't want to have the
synonyms or the kind of nearest
neighbors right so um if there's like a
brand name or something like
that then like let's say the brand is
apple right you don't want to find stuff
about pears right so that's what you
would do with a dense retriever um so so
it really kind of depends on what you
want to use it for that's why hybrid is
probably the way to
go it's a good
question with the
dance it's
um it's contextualized that but
shouldn't it realize Apple the company
would be different from no so so if they
were actually contextualized then yes
but but very often it's a a frozen
retrieval system right that's one of the
problems with all the Frozen rag
stuff I might be missing very
B refering to the factors that
you're factors that you're using is
or uh no so so the the the the sort of
document and the query that they're the
same right so they're either sparse or
they're dense but so if they're sparse
the components of the vector are are
literally the other
work you just Oneal when
you're the thing that
creates uh how are you getting so it's
literally counts right so so basically
it's a one big Matrix of documents as
rows and the columns are the words in
the documents and then you just count
how often a word occurs in a document
right so that's as
far also
refering yeah and so so in the field we
call them sparse sparse embeddings or
sparse retrieval because most of that
Vector is zero right because most wordss
don't occur in that
document does that make sense
yeah
cool um so um let's talk about uh doing
slightly better so so going back to
Stephen's question about okay we we have
this kind of retrieval thing but like
how do we actually make this retriever
good for the context that is going to be
used in right so can we contextualize
the retriever for the generator uh even
if it's it's a generator where we might
not have access to the weights so it
could be a GPT-4 model we just send it to
some API we get some stuff back um
and so uh one paper I really like is
called REPLUG um so just just to kind of
explain what this looks like so you have
this context you have a retriever that
we do the the standard retrieval set
with this is a dense retriever um and
now sorry um and now you uh compute the
the likelihood so basically just
normalize the scores that you get for
for the top-k documents to get a
distribution here and then uh you give
each one of the retrieve documents
separately to this generator to your
language model so you can look at the
perplexity of the correct answer for
that language model right so now we have
these two probability distributions or
two likelihoods essentially and we can
minimize the KL Divergence to make sure
that we can actually uh retrieve the
documents that lead to the lowest
perplexity on the right answer for the
language model um so super simple idea
uh works really really well.
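A minimal sketch of that training signal, assuming retriever scores for k documents and, for each document, the frozen language model's log-likelihood of the gold answer with that document in context:

```python
import torch
import torch.nn.functional as F

k = 8
retriever_scores = torch.randn(k, requires_grad=True)   # from the dense retriever
lm_log_likelihoods = torch.randn(k)                      # from the frozen LM, no grad

retriever_dist = F.log_softmax(retriever_scores, dim=0)  # retrieval likelihood
lm_dist = F.softmax(lm_log_likelihoods, dim=0)           # LM preference over docs

# minimize a KL divergence between the two so the retriever favors
# documents that lower the LM's perplexity on the right answer
loss = F.kl_div(retriever_dist, lm_dist, reduction="sum")
loss.backward()   # gradients flow into the retriever only
```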
uh and the nice thing about this is it's completely
uh agnostic of what happens Upstream
right so this will work for any sort of
encoder decoder for any language model
um what what you need is a perplexity
score uh but for most language models
you can get that not necessarily all of
them so that's one thing and then
there's this other really nice approach
um what you what parameters are you
changing so so in the retriever you're
you're literally updating the uh the the
dense representations
right so your encoder basically for your
dense representation that's a good
question we'll get to more um so there's
this other paper uh on in-context
retrieval augmented language models
where the whole paper is basically about
just doing BM25 and just giving stuff
directly to the context of the language
model and things kind of work so it's
it's sort of Frozen rag but even even
more primitive in a way where the the
retriever is uh this very old sparse
algorithm but it works really really
well um but then they have this really
awesome section where they they show
that you can just have this uh ranker on
top of the BM25 results um and you can
backprop into this ranker so now you
still keep the language model completely
fixed uh so that's sort of this part of
the the loss here uh so you have kind of
a stop gradient on the parameters theta
that's just your language model but now
you have this uh this kind of rank
function here that you can backprop
into right so that's your ranker is
basically can be a BERT model or anything
like that that works on top of the
things you initially retrieve from your
BM25 and now you have this BERT
reranker that you can backprop into um so
this also works really really nicely so
we're slowly progressing towards having
a system that is much more optimized for
being properly uh retrieval augmented in
a way where it's useful and and
contextualized for what you want to use
it for.
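A minimal sketch of that setup, with the language model under a stop-gradient and gradients flowing only into the reranker scores:

```python
import torch

k = 16
rerank_scores = torch.randn(k, requires_grad=True)   # from a trainable BERT reranker
with torch.no_grad():                                # frozen language model
    lm_gold_loglik = torch.randn(k)                  # answer log-likelihood per candidate

weights = torch.softmax(rerank_scores, dim=0)
loss = -(weights * lm_gold_loglik).sum()  # maximize expected LM log-likelihood
loss.backward()                           # gradients reach the reranker only
```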
um so uh yeah just to point out kind
of what that looks like with this ranker
so you just have this extra step
essentially right so we have our
retriever then we have our ranker then
we have our generator and our
output no not
necessarily um so so so for this one you
do yeah but so for replug you don't
right yeah yeah yeah yeah yeah so
basically yeah you need to get do apis
provide not all of them um some of them
do right but but yeah there are all
kinds of tricks you can do on top of
that
yeah um so
so basically the question is how do we
get sort of gradients flowing into this
right so if you don't actually have
access to the full parameters of the model
so that you can backprop all the way
through it then you can uh do a
REINFORCE-style loss on on the retrieval
and then you just pass the kind of log
likelihood if you if you have access to
that or some other kind of blackbox
function
all right so um the next thing you can
do uh is to optimize both the Retriever
and the generator um and and so this
really uh start starts getting to the
the proper kind of contextualization of
the entire architecture where you want
everything to work together right so
rather than having this Frozen thing
where everything is basically not aware
that the other part exists right it's
like two halves of the brain they're not
talking to each other one is your
retriever that is your language model
there's no connection they're just like
sort of like something is thrown over
the fence and then you hope for the best
uh so instead of that we have everything
much closer and learning together um so
um one of the the first um ways of doing
this with a generator uh was RAG
retrieval augmented generation uh which
we did at FAIR in 2020 um and and it's
very similar to what we've already seen
we basically have this retriever here
that works over different documents you
get some score function uh that gets
given to this generator um that that
generates answer and now you want to
backprop all the way and update your
generator as well right so in the
previous two architectures we saw you
keep the generator fixed you backprop
into your retriever but here we update
everything well not exactly everything
as you'll see but we'll we'll also
update the the part of the Retriever and
the
generator um so in this rag model uh we
actually have two different ways of
doing this and this this is probably
something that when we talk about this
uh if you think about this long enough
then you'll you'll think like okay but
when actually do I need to retrieve like
do I do I retrieve every time I generate
a new token or do I just retrieve once
and then generate an entire sequence
right or maybe I want to retrieve every
n uh tokens right so these are
hyperparameters or maybe I want to learn when to
retrieve as as we'll see that's also
something people have done um so there
are two different ways to do it um and and
what we do in this paper basically the whole
point of the paper is that this Frozen
thing doesn't really work all that well
right so I think what people call RAG
now usually refers to the
Frozen thing uh but the whole paper
basically would never have been accepted
anywhere if we had just done the Frozen
thing right the whole point of the paper
is that you want to uh optimize it and
so at my company Contextual we call this
Frozen thing Frankenstein's monster
because it's really like you Cobble
together these different pieces right
you sort of yeah it's it's really like
Frankenstein you just put it together
and then it sort of walks you know uh
but it doesn't really have a soul it
doesn't really actually work it's not
the real thing um so that's great for
for everyone here I think because there
are so many opportunities to do better
than what what most people are using
right
now um so one of the limitations of of
the original rag architecture is that it
only supports a very small k okay but so
if you have lots and lots of documents
uh then the problem is that you have to
fit all of them in the context but how
do you really get that uh to fit right
so one thing you can do is you you first
encode uh things so that you get one
single representation or only the few s
of top level representations then you
concatenate those and then you just feed
them to the decoder so this is FID
Fusion in decoder um and as you can see
the scales to a much higher uh number of
of passages uh and that uh leads to
corresponding improvements in uh the
scores that you care
about uh so that's a really cool idea
and so so we're we're slowly moving
towards more decoder only architectures
right so in RAG we have this BART model
it's sort of an encoder decoder
architecture but here you just have this
decoder that does some fancy attention
over stuff that you retrieved before um
and and so another like pure decoder
language model architecture um is this
one
kNN-LM which I think is is very elegant in
its simplicity so it's basically you
just have a normal language model but uh
you interpolate the normal language
model weights with uh things that you
retrieved um so basically you have some
sort of prompt right so like Obama's
birthplace is you go to your big Corpus
you find similar things you look at the
words that come next to the similar
things uh you uh rank that thing you
sample your top K you renormalize that
so now you have a bunch of scores and
now you can just interpolate between
your retrieved kind of non-parametric
memory scores and your parametric
language model scores.
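A minimal sketch of the interpolation, with toy numbers standing in for the real distributions:

```python
import torch

vocab_size = 50000
p_lm = torch.softmax(torch.randn(vocab_size), dim=0)  # parametric LM distribution
p_knn = torch.zeros(vocab_size)
p_knn[1234] = 0.7   # tokens that followed the retrieved nearest neighbors
p_knn[5678] = 0.3

lam = 0.25                                 # interpolation weight, a hyperparameter
p_final = lam * p_knn + (1 - lam) * p_lm   # late fusion of the two distributions
```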
so this is very late fusion in a sense
right at the very end you combine these two uh and it
allows you to re reweight the pure
language model probabilities or
likelihoods um so this works really well
and it scales especially well if you
have a huge uh retrieval Corpus so if
you have trillions and trillions of
tokens in there you could have a much
smaller language model that does not
that much heavy lifting because you can
really rely on this big Source Corpus
that you're working from and so that
idea was uh exploited by this paper
called Retro out of DeepMind where uh
they showed that you can have a 25 times
smaller retrieval augmented language
model trained from scratch so really
pre-trained uh entirely from scratch
that outperforms this 25 times bigger uh
language model on the same data in terms
of perplexity which is pretty impressive
right so this architecture is much more
efficient than a parametric model
because you can rely on this external
memory so if your external memory is big
enough uh you can get pretty huge gains
so there was a lot of excitement about
retro when it was announced uh but it's
a DeepMind paper so there's really no
open source nothing really to validate
that this actually Works um and so very
recently there has been a bit of work
from Nvidia called Retro++ um
where they have this hybrid
between the Retro architecture and then
they do basically RAG sort of they put
the top one or the topk results in the
context of the language model after all
so it's sort of a crossover between Rag
and retro and they show some really nice
results here but I I think it's sort of
pointing to this uh big flaw I think is
that why is there still no good open
source retro
model that probably tells you something
about whether it actually really works I
I spent a lot of time in my career
trying to reproduce DeepMind papers
that didn't necessarily always work uh
and so I I think the the same is true
for retro um and that's why we need to
do this in context rag on top of retro
to actually get it to
work [partially inaudible audience
question about whether it could just be
a compute issue with searching the big corpus]
yeah but so
that no so the the doing retrieval over
that to over that big Corpus is not that
difficult actually yeah um so so they're
even like distributed FAISS packages you
can just do everything yourself so yeah
so in terms of comput it's it's actually
not that hard anymore to to reproduce
something like this uh but I've tried
several times and it it's not really
reproducible
so the only way to get it to work is if
you do this in context rag on top of the
Retro thing and then as you can see here
in the results then it actually gives
you a gain over the pure GPT model right
so it starts from a GPT and then they
kind of retrofit as they call it the GPT
model so in short I think there's still
a lot of work to be done in pre-training
these systems really from scratch uh and
retro kind of showed that it might be
possible but we don't necessarily know
exactly how to do it the right way and
this is really one of the interesting
open
questions um any questions on
that
online no okay then we'll move on um so
um let's go all the way with the
contextualization now right so so with
retro and with rag what we actually did
is we only updated the query encoder uh
so updating the document encoder is very
expensive so one of the first papers
actually kind of the the OG of the the
non-frozen dense retrieval augmented
methods is this uh paper called REALM
this is really like Visionary work this
was basically the first uh uh kind of
version that did this properly where
they updated it all the way including
the document encoder um so can can
someone explain to me why it's expensive
to update the document
encoder so let's say we have a trillion
tokens in our Corpus right and now so
now we go all the way so we basically do
a forward pass we get a gradient at the
end now we back propagate the gradient
through the retriever we update the
query encoder now we have to update the
document encoder so what do we then need
to do after we've updated the document
encoder we need to re-encode the entire
internet right so basically every single
gradient update we have to re-encode
whatever our index is which so if this
is like trillions of tokens it's like
re-encoding the internet after every
batch update so that's not very
efficient
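A minimal sketch of the asynchronous-update idea, with toy stubs standing in for the real document encoder and training step:

```python
import numpy as np

corpus = ["doc one", "doc two", "doc three"]

def encode(doc):          # stub for the document encoder being trained
    return np.random.rand(128)

def train_step(index):    # stub: would update query encoder / LM / doc encoder
    pass

index = np.stack([encode(d) for d in corpus])   # initial index
for step in range(1, 2001):
    train_step(index)                           # index goes stale between refreshes
    if step % 500 == 0:                         # every T batches...
        index = np.stack([encode(d) for d in corpus])  # ...re-encode everything
```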
[partially inaudible audience suggestion
about only re-encoding documents that
change or updating on some predictable
schedule]
yeah that's one one way to do it uh so
so there there are a bunch of different
ways to update the the document encoder
so what they do in REALM is they
basically do it for T batches then they
stop they re-encode the entire internet
and then they train again uh so it's
sort of asynchronous updates they have
this very fancy sort of sharding
mechanisms where they take down uh
certain parts of their entire index uh
and then update them kind of on the Fly
uh so you can do it is just very
expensive so one one of the things that
a lot of people have been thinking about
not exactly your idea but but similar
versions of that um are around like can
can you make it more efficient so that
you don't have to do do this
asynchronously um so one of the
downsides of this REALM uh architecture
is that it's really just a BERT model
but then you do this retrieval
augmentation on a BERT model with other
BERT models so it's not really
generative it's not really gen AI in the
modern Paradigm but if you want to read
like one paper uh on this topic like
this is a very good one to
read uh the other one that is is really
really good to read uh is this paper
called Atlas uh so Atlas is um uh so
this is out of FAIR um with a bunch of
folks the folks who did like Rag and the
folks who did FiD and uh really a
brilliant set of people and and this is
really a comprehensive uh analysis of
everything that's happening in this
architecture so the first question they really
look at is how do we train this
retriever so we've seen a couple of
versions of this um but uh which one
actually works better they haven't
really been compared in a head-to-head
setting uh so one thing is we have this
FiD-style attention distillation uh
that's really too complicated to go uh
into detail here but the others are
actually very simple um so one is this
loss we've basically seen before right
uh so we've seen this I think with the
in context rag one right so we have a
stop gradient on the language model and
then we update the retriever the other
one is what we've seen with REPLUG so
this is basically exactly the REPLUG
loss right so we have the KL divergence
of the um the documents and and sort of
the Improvement that you see when you
give it that document uh the other thing
they have is basically the inverse of
that one so if I take this one document
out how does that affect my uh my
perplexity of the language model right
um and so this one I think is actually
quite elegant because that really gets
to like how valuable is this one single
document for me answering this question
correctly um so uh they compare all of
these different versions and uh what you
can see is that uh the the kind of
REPLUG-style loss and this leave-one-out
loss they perform a lot better than all
of these others so this fixed retriever
or no joint pre-training these are
really kind of the Baseline sort of
frozen RAG models or closed book uh and
as you can see you can do really a lot
better uh if you optimize things and so
this leave-one-out thing is probably the
best I would say um so then the other
question is how do you actually like
train that entire system like what data
or what tasks do you train this on so
they also uh experiment with a bunch of
different versions uh so one is uh doing
prefix LM if you're familiar with that
uh so they basically take a chunk that
occurs somewhere on the internet and
then they predict the next Chunk from
that chunk right so it's really like
sentence to sentence so maybe like skip
thought back in the day but now you have
this retrieval step where you predict
the next sentence uh then they just do
T5-style sort of denoising so that's
masked language modeling if you're
familiar with T5 um and then they have
this title to section generation piece
so um I think the takeaway from this
table is basically that whatever you do
here so they're using a T5 model so
whatever you do here needs to be the
same that your uh language model expects
um so for T5 that's T5 style
loss um and then uh the the the next
sort of final question that they look
into going back to to what we talked
about how exactly do we update this
retriever uh so do we have to update the
document encoder or do we maybe have to
do some sort of reranking uh or do we
maybe just update the query um and and
quite surprising L I think they find
that just updating the query so like in
the original rad paper is actually
already basically good enough in many
cases so so that's nice because it's
much more efficient if you don't have to
update your documents all the time uh I
think the the real question here though
is like uh how good is your document
representation to begin with so you need
to have very very high quality embedding
model for this to work if you don't have
that then this will not work but if you
do have that then you get a very nice
kind of query side fine-tuning
thing um so the the Atlas paper is about
trying to do few-shot um sort of language
modeling tasks so it's how how many
examples are given in the
context um yeah so so the main takeaway
um here is that if you compare like the
Close book equivalent model to the
retrieval augmented model uh you see
very big
improvements that's really the only
takeaway of of this entire
section um but I I think that that's
really saying something uh in terms of
what we should be thinking about um how
how much time do I have
in
still okay okay all right other
questions are the documents in the
training step same as
yeah so they can be different um so
in Atlas they basically try
everything uh so they also try to see
what happens if I train this on
Wikipedia But I swap in like a sort of
Common Crawl index um and I think so
in Atlas but also in Retro the main
finding is just the more the better uh
so it's really just like the bigger your
index the more likely you're you are to
find the exact right thing um and then
make the right
prediction any other questions on this
oh yeah uh sorry I this is a question
about the generator in the I guess uh
the rag system so um recently I saw a
paper on Mistral 7B so it introduces a
lot of these uh new architectural
changes like the sliding window
attention to handle longer sequences
at a smaller cost and the group query
attention for faster inference I'd like
I'd like to like know your thoughts on
designing a generator specifically for
rag uh leveraging for example where
Mistral 7B currently is because for
example like the sliding window
attention I could see how that could be
adapted to the rag
case yeah so so maybe your read on sort
of what makes mol's special is a bit
different from mine so I I don't think
that the sliding attention window thing
is actually that interesting the reason
mrol works so well is because it's
trained on a lot of data uh and you can
do that more efficiently because you
have sliding window attention so you
don't need to attend to everything um
but uh so to answer your question I I
guess you're asking sort of about the
architecture of the generator if you
know that there's going to be a
retriever so I I I think uh that's
basically what retro tried to do right
so um retro actually some of the people
on the Retro paper are at Mistral now uh
so they they have this uh chunked cross
attention idea here so you basically
have a language model but the way it
does the attention over the things you
retrieve in your retro um architecture
uh you they they kind of get integrated
into a model not using the standard
attention mechanism but using this
slightly different chunked cross
attention oh okay so I think the the
sliding window Attention Point I was
trying to get get at was that uh it uses
a fixed window so that whenever you're
doing the query key computation in the
attention with the query vectors and the
key vectors you're using a fixed window
attention so I think my idea was to
actually one use a dynamic window
because for example the rag case um if
you use a fixed window when you're doing
attention it it is possible that you
actually are leaving you you're only
looking at a fixed uh span of
information so if you could maybe adapt
Mistral so that you could make it better
for the RAG case and and for example
the making the fixed window size the
dynamic window uh yeah yeah I think it's
an interesting idea so so for me uh the
the what Mistral is doing with with the
sliding window that's basically like a
convnet right so we had all these
convolutional like lightweight convnets
where we would have word embeddings and
you would do convolutions over it and
then pull uh and then you would still
get the information out so it's not that
the sliding window prohibits you from
looking earlier it's just that that
happens higher up in your Transformer
sort of yeah
yeah so I think that definitely is an
interesting direction to to think in
yeah yeah so I think um it's like not
too crazy to say are there any
architectural changes that we can
introduce into these 7 billion parameter
models so that they could be better
adapted to the rag case
yeah so there there there might be yeah
I I think one one question is just how
do you how do you do the attention over
things you've retrieved which I think is
what
you're yeah
thanks so just to make sure I understand
so I mean in this retro model you're
retrieving in each
block and when you talk about putting
the retrieval in the context are you
saying that you only do it at the
beginning you don't do it
yeah so so in context so this is it's
not exactly every layer sort of so it's
every token right so every um every step
basically not every block so doesn't
make sense so it's not every layer that
you do to retrieval yeah so every step
right um so so this is kind of like like
what rag token is so you retrieve every
token you so you generate and then you
can retrieve again or in the case of
retro you can generate like a chunk and
then you retrieve chunks again uh if you
look at the in context case you retrieve
once at the beginning and then you give
it [partially inaudible follow-up]
yeah but so the in-context
thing um so so here you don't actually
give it as context at all like directly
to the model right so here you get you
let the decoder kind of tend over
it
yeah so I don't think cross attention
really works yeah
yeah other
questions [partially inaudible audience
question suggesting that training the
retriever may not be necessary and
asking which cases really need the
updates]
yeah so you do want to update the
retriever right but but only part of the
retriever is necessary to be updated for
a lot of these these cases um but so so
I I think it uh so these are very
specific data sets right Natural
Questions Wizard of Wikipedia and FEVER
so they're really very uh kind of
knowledge-intensive tasks uh so in that
case if you already have a very good
system like DPR that is specifically
pre-trained for those tasks then you
only need to update the query encoder
but so I would expect that if you move
Beyond this to kind of General language
modeling things like like retro then you
probably do want to update the document
encoder at least in a way where you can
scale
it [partially inaudible audience
follow-up about whether good
off-the-shelf document embeddings from
strong models are enough]
yeah but so you need to learn how
to kind of query into that Index right
so if you if you don't do that uh then
then yeah you don't get really good
performance so that's sort of like your
close book performance right if you just
have the language model and you're just
like what what does the parametric model
on its own without the retrieval what
does it actually know as you can see
there there are pretty big gaps there
right other questions otherwise I will
cover other
questions no uh hello yeah go for it a
quick question like so uh what about
like more hierarchical retrieval like I
suppose there will be methods trying to
not just retrieve a single chunk but
some kind of like groups of chunks or
something or summarized versions there
there's been some interesting work on on
doing that uh where you first tried to
find so you can have multiple indices
and they can kind of cascade right so
first you want to find the relevant
document so you have some document
representation and then within that
document you want to find the relevant
chunk uh so you can do it sort of that
direction you can also do it in reverse
I think I I have something on the slide
there where you can find the chunk and
then sort of expand uh the context
around it and then give that to the
language model um so I think yeah there
are all kinds of interesting things you
can do
there cool H thanks I guess another
thing just like can you compare RAG
versus like long context LLM efforts so
there are a lot of things like around
just having really long context and in the
extreme it could replace RAG I'd like to know
your takes yeah so so my my uh
so everybody understands this question
right so there there's there's a trend
where we want to have very long context
language model so that basically you can
like take Harry Potter or something just
put it into context and then ask a
question like what is the name of like
Harry Potter's owl or something right
and then it can just attend over the
entire thing um so attending over all of
Harry Potter to answer that one question
is super inefficient right uh so most of
Harry Potter has nothing to do with the
AL uh so but you are still kind of
reading it if you do it with the long
context window um so that's why I think
the doing it the rag way where you have
this non-parametric component is a much
more efficient way to solve this problem
and if you actually look at the
literature on Long context Windows uh
the way they they solve the problem of
scaling the attention mechanism is by
making it very sparse uh so they're
basically turning it so that's a
different kind of sparsity but they're
turning it into a non-parametric
retrieval problem uh kind of behind the
scenes so they're not they're not
actually all that different if you want
to scale long context then you're going
to move towards a rag style
architecture good
thanks all right um so let's talk about
some other interesting questions so one
thing and I already alluded to this is
when do we actually retrieve so very if
we're doing like if we want to uh like
retrieve every token that's also very
inefficient because I probably don't
have to retrieve to generate
the right I can probably do that on my
own with the language model is of a
wayte to go and retrieve stuff but if I
only retrieve once at the beginning of
the sequence that's probably also not
great right so so what we ideally want
to be able to do is to say okay
sometimes I want to retrieve sometimes I
don't want to retrieve and I'm going to
learn when I want to kind of expend the
the compute Budget on doing the
retrieval um so a nice paper where they
have a stab at this is called FLARE for
active retrieval augmentation where they
basically have the language model decide
uh when it should do a search and what
it should do to search for.
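A minimal sketch of active retrieval in that spirit; the llm and search objects and the confidence test are hypothetical stand-ins for however generation and uncertainty estimation are actually done:

```python
def active_rag(question, llm, search, threshold=0.8, max_steps=20):
    context, answer = "", ""
    for _ in range(max_steps):
        draft, confidence = llm.next_sentence(question, context, answer)
        if confidence < threshold:               # uncertain: spend budget on retrieval
            context = "\n".join(search(draft))   # query with the tentative sentence
            draft, _ = llm.next_sentence(question, context, answer)
        answer += draft
        if llm.is_done(answer):
            break
    return answer
```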
um so so I think this fits in a general trend that
you can see in the field around kind of
Agents right so we can talk a little bit
more about that too um so this other uh
question that that I think we also kind
of covered already here is how do we
train this at scale right so we can do
these asynchronous updates we can do
rerankers we can do query-side only
there's this really nice paper uh which
is quite close I think to the idea you
proposed uh where you first use BM25 to
create a a batch basically where
everything is very similar uh in terms
of what you've retrieved and now you uh
have this kind of in-batch update so it's
it's sort of like a ranker where you
encode the information that is just in
your batch using this other model and
now you can update this model on the fly
so you don't have to worry too much
about doing the full kind of document-
side update um and again here what
really matters is like how big is your
index if you have an amazing index you
can basically solve any problem just by
looking it up right so rather than
cramming it into your parameters you can
just find it
um this is a really nice paper uh called
Silo so one one of the interesting
things I think that's going to happen in
the next year or two around language
models is there and you've seen this
already there's a bunch of like lawsuits
against OpenAI and other places around
where does the data exactly come from um
so one uh very elegant solution I think
is to have a rag system that you train
on data that you know is safe so you can
train that thing on Wikipedia But now
during test time you can give it a data
store that has maybe slightly riskier uh
information in it so this massive index
of all the stuff on the internet
including some things that are maybe um
risk uh you can still have them in your
index but your language model uh your
retrieval augmented language model I
should say you know that that thing is
safe because it was strin on data that
is public domain uh so that's what they
do in Silo and they show that that works
really well so that's uh one possible
solution to to a lot of the the kind of
compliance and legal risk around
language model
deployments um there's a great paper and
There's a great paper, also from one of your colleagues, about contexts getting lost in the middle. I think this is a fascinating phenomenon. It's on a frozen RAG system, but language models turn out to be very similar to humans in what they pay attention to: if you give them a bunch of retrieved passages, what they actually look at are the first things you list and the last things you list, and they more or less ignore the middle. If the model actually respected the rank function, the accuracy curve would go down monotonically with position, but instead it goes back up at the end. I think that's a very interesting observation, and it shows how brittle these systems can be: with a frozen RAG system, the order of the retrieved contexts matters a lot for whether you get the right answer or not.
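A small mitigation people use follows directly from the finding; this sketch interleaves a best-first ranking so the strongest passages land at the edges of the prompt and the weakest end up in the middle:

```python
def reorder_for_lost_in_the_middle(passages_ranked_best_first):
    """Place top-ranked passages at the start and end of the context."""
    front, back = [], []
    for i, passage in enumerate(passages_ranked_best_first):
        (front if i % 2 == 0 else back).append(passage)
    return front + back[::-1]

# ["p1", "p2", "p3", "p4", "p5"] -> ["p1", "p3", "p5", "p4", "p2"]
# best passage first, second-best last, weakest buried in the middle
```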
[Audience] Has there been work on treating this as a reinforcement learning problem, specifically training the retriever so that its output interacts well with the generator, maybe tuned for a particular dataset?

Yeah. So, for what I just described, someone asked how you would actually do that; the way you do it is with REINFORCE, and there has been work on that. Some of the older papers played with it. But one of the big problems is variance: I think the REPLUG solution is more elegant for this, because you actually use signal from the language model, whereas if you just do REINFORCE it's very high variance, so it's going to be super finicky if you don't want to destroy your index. But people have tried it, though.
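For completeness, a minimal REINFORCE sketch for the retriever; this is an illustration of why the estimator is high-variance, not a recommended recipe, and all names are hypothetical:

```python
import torch

def reinforce_step(retrieval_logits, sampled_idx, reward, baseline):
    """One REINFORCE update: push up the log-prob of the sampled passage,
    weighted by how much the downstream reward beat a baseline."""
    log_probs = torch.log_softmax(retrieval_logits, dim=-1)
    # Single-sample gradient estimate: the variance is large, which is why
    # a learned baseline (or a REPLUG-style LM-likelihood target) is needed
    # to keep training from thrashing the index.
    loss = -(reward - baseline) * log_probs[sampled_idx]
    return loss
```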
There's some really nice work from OpenAI, and again we're thinking more and more about agents here, where they show something very similar to the FLARE result from earlier with active retrieval: the thing you query doesn't have to be an index you own, it can just be a web search. Obviously in that case you don't really have access to the search engine itself; Bing, or whatever they use, is not going to update its parameters for you. But I wanted to put this in your mind as another thing you can do. And if we take this to its most general form, you can think of language models as tool users: rather than just retrieval-augmenting language models, we can tool-augment language models, and retrieval is just one of the many tools the model has access to. We can have rankers and other components on top of the outputs of these tools. One of the big questions, I think, is how you actually get the system to learn; it's going to need our help if we want it to really learn how to take these actions properly.
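As a sketch of that framing, here is a minimal tool-dispatch loop. The `lm.decide` interface and the tool registry are hypothetical; the point is only that retrieval becomes one entry among many:

```python
def tool_loop(lm, tools, user_input, max_steps=5):
    """Let the model emit tool calls; execute them and feed results back."""
    transcript = user_input
    for _ in range(max_steps):
        action = lm.decide(transcript)   # e.g. {"tool": "search", "arg": "..."}
        if action["tool"] == "answer":
            return action["arg"]
        # Run the chosen tool (retrieval, calculator, web search, ...).
        result = tools[action["tool"]](action["arg"])
        transcript += f"\n[{action['tool']} result] {result}"
    return lm.decide(transcript).get("arg", "")

# tools = {"search": my_retriever, "calculator": my_evaluator, ...}
```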
This has been taken to the extreme in the Self-RAG architecture, where you have a retrieval step that is active, then you criticize what you retrieved, then you basically do some natural language inference, and all of that happens within a single language model to answer the question.
The other missing piece, and I'm just going through a bunch of open questions that people have looked at, so feel free to interrupt me if there's anything you want to know, is instruction tuning. We established at the beginning of the lecture that this is pretty important for getting things to work, for fixing the user interface. But instruction tuning has almost always happened only on the language model, not on the entire system. So one of the interesting things people are looking at now, with works like RA-DIT and InstructRetro, is how to instruction-fine-tune an entire retrieval-augmented system, all the way into the retrieval step. Can we generate data so that the retrieval also follows the instructions properly? That currently doesn't happen in any of these model architectures.
Finally, I'd be remiss if I didn't talk about what people call advanced RAG. The developer community has been doing some awesome stuff: frameworks like LlamaIndex and LangChain, and all these open-source vector databases like Chroma and Weaviate, are all about making RAG really easy. This is all frozen RAG, but even with frozen RAG you can do incredible things. We mentioned some of these already. There's the child-parent recursive retriever, where you match on small chunks and then give the larger chunks around them to the language model. You can do hybrid search with reciprocal rank fusion, where you take several different search result lists and combine them before handing the final set to the language model. There are zero-shot large-language-model rankers, where the relevance score doesn't come from your retriever at all but directly from the language model.
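Reciprocal rank fusion is simple enough to show in full. This is the standard formulation, with the constant 60 being the conventional default from the original RRF paper; the input ranking lists are whatever your sparse and dense retrievers return:

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Combine ranked lists by summing 1 / (k + rank) per document id."""
    scores = {}
    for ranking in ranked_lists:                  # e.g. [bm25_ids, dense_ids]
        for rank, doc_id in enumerate(ranking):   # rank 0 is the best hit
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# fused = reciprocal_rank_fusion([bm25_results, dense_results])
```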
And then there are hypothetical document embeddings, HyDE, which I think is a really cool idea: you basically fix hallucination through hallucination. You take a question, let the language model hallucinate a bunch of possible answers, then search for the nearest neighbors of those possible answers, give those real documents as context, and the model produces the right answer based on that. It really is hallucinating answers in order to find the truth, and I think it's a brilliant solution. So there is a lot happening in the frozen RAG community that is very interesting to look at.
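A sketch of the HyDE recipe, with `lm`, `embed`, and `index` as hypothetical interfaces:

```python
import numpy as np

def hyde_retrieve(lm, embed, index, question, n_hypotheses=4, k=5):
    # 1. Let the model hallucinate a few plausible answer passages.
    hypotheses = [lm.generate(f"Write a passage answering: {question}")
                  for _ in range(n_hypotheses)]
    # 2. Average their embeddings into a single query vector.
    query_vec = np.mean([embed(h) for h in hypotheses], axis=0)
    # 3. The nearest *real* documents to the hallucinations become the context.
    real_docs = index.search(query_vec, k=k)
    prompt = ("Context:\n" + "\n".join(real_docs) +
              f"\n\nQuestion: {question}\nAnswer:")
    return lm.generate(prompt)
```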
Just to wrap up, looking at the future of this stuff: there are still lots of very interesting open questions, so if you're a student thinking about how to solve any of these, I think you can have quite a lot of impact. How exactly do we pre-train this architecture, and do we even need to pre-train? I think RETRO already suggests that you don't necessarily have to, but maybe there's something wrong with how we do that. What do scaling laws look like? There's a really interesting question here: if I have a huge index and a very rich encoder of all the information in that index, maybe I can decouple all the memorization into the index. Then I'd have a language model that doesn't know anything; it just speaks English and reasons on top, but it has no knowledge of its own, because the knowledge always comes from the retriever. If you can do something like that, you get very interesting scaling trade-offs: you can have a tiny language model and let retrieval do a lot of the heavy lifting. That's nice because retrieval is a cached computation; you already have the embeddings, you just need to do the dot product, which is much more efficient than self-attention in the language model.
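A back-of-the-envelope illustration of why that retrieval step is cheap: the document embeddings are computed once, offline, so each query costs one matrix-vector product over the cached index, with no attention over raw tokens:

```python
import numpy as np

# Cached offline, once per index build (sizes here are arbitrary).
doc_embs = np.random.randn(100_000, 768).astype(np.float32)

# Per query: a single matrix-vector product plus a top-k selection.
query = np.random.randn(768).astype(np.float32)
scores = doc_embs @ query               # one dot product per document
top_k = np.argsort(-scores)[:10]        # indices of the 10 best documents
```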
Can we move beyond bi-encoders? On vector databases: I like the people who build vector databases, but I'm not sure how long we're going to keep them around, because I think re-rankers probably work just as well, and BM25 is much more efficient than a vector database, so I don't really see why we need dedicated vector databases. Maybe this is a bit of a critique of Silicon Valley investment strategies, but what we're seeing is that a lot of these vector database companies are basically becoming database companies now: they're adding all the sparse stuff because dense alone is not enough. And as it turns out, there are a lot of pretty good sparse databases out there already, like Postgres, and they are all adding vectors to their databases too. So I think this is all going to coalesce into databases.
I also think there are interesting things to look at on the data side. Alluding to the instruction problem: can we synthetically generate much better data for training RAG systems? And then there's this massive open question of how we actually measure whether a RAG system is any good. Right now we just look at downstream performance, which is sort of okay, but if you mess up the retrieval it's very hard to attribute the failure, and measuring whether your retrieval itself is right is also very difficult. There are some frameworks that try to take, for example, the harmonic mean of your retrieval accuracy and your language model accuracy, but I think those are still shaky, because we don't really have good datasets to measure this on. So that's a very cool problem to work on as well.
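The combined metric alluded to looks roughly like this; it is illustrative, since the frameworks in question define the component accuracies differently:

```python
def rag_score(retrieval_acc, generation_acc):
    """Harmonic mean: the score collapses if either component fails."""
    if retrieval_acc == 0 or generation_acc == 0:
        return 0.0
    return 2 * retrieval_acc * generation_acc / (retrieval_acc + generation_acc)

# rag_score(0.9, 0.5) == 0.6428...
# a weak component drags the whole system score down
```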
The other problem that I personally am always very excited about is multimodality. Why would we stop at RAG systems over just text? You can do the same thing with images; you can augment language models with vision. We did this work on LENS, a language model enhanced to see, where you give a frozen language model a computer vision pipeline, just like a retrieval pipeline, and pass its outputs into the context. That system is actually an amazing visual question answering system, close to the state of the art, which is Flamingo from DeepMind; Flamingo is also very hard to compare against because there's no open-source version of it. We did some early work on this in 2021 with cross-modal retrieval, and there's some more recent work out of FAIR where they also look at this. If you look at the trend in the field, multimodality, with GPT-4V and things like that, is really a hot topic; everything is going in that direction, so it's an interesting thing to think about.
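A simplified sketch of the LENS-style move; this is my simplification and the module names are hypothetical. Verbalize the outputs of off-the-shelf vision models and hand them to a frozen language model as context, which is structurally the same trick as passing retrieved passages:

```python
def lens_answer(lm, tagger, captioner, image, question):
    """Vision modules produce text; a frozen LM reasons over that text."""
    tags = tagger(image)          # e.g. ["dog", "frisbee", "park"]
    caption = captioner(image)    # e.g. "a dog catching a frisbee"
    prompt = (f"Tags: {', '.join(tags)}\n"
              f"Caption: {caption}\n"
              f"Question: {question}\nAnswer:")
    return lm.generate(prompt)
```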
Overall, I think it would be nice if everybody moved away from RAG 1.0, the frozen Frankenstein's-monster RAG, towards the much more optimized RAG 2.0. It's really about systems over models: it's not just your language model and your retriever as separate pieces, it's about thinking from a systems perspective about the entire pipeline and the problem you're trying to solve. That really is how deep learning has always progressed: if you optimize the system end to end, that always wins out. Back in the day in computer vision and NLP we had parsers and all kinds of specialized pipeline components, and all of that just doesn't exist anymore, because we optimize the system end to end. That's what is going to happen here too. If we take that to the extreme: there's a chunker in your pipeline cutting your documents into pieces; you could backprop into that. Why not? Somebody should really do that. And I think trading off cost and quality, and zero-shot domain generalization, is really where this stuff is going to matter. Language models right now are amazing, but very often they're way too expensive to deploy anywhere you can actually make money from them if you're in a company. What you want is to make the system much more efficient and hit the right cost-quality trade-off, and the easiest way I can think of to do that is through retrieval augmentation; but obviously I'm very biased. That was all I had, actually. If you're interested in this, I'm at Stanford, so I can work with you on research projects on these topics, or if you want, you can also join Contextual, because we work on this stuff every day. Thank you.
[Audience] Sorry, I had a question from earlier. I think you said something really super helpful about Mistral 7B: you compared its sliding window attention to convolutional neural networks. I do see the parallel, because with CNNs you have several different convolutional layers, and the top layers see a larger receptive field than the bottom layers. With convolution layers you can tune the filter sizes and the stride, so you control the receptive field. I was wondering if you could see the same innovation in Mistral 7B: each Transformer layer spans a different set of tokens, and if you could tune the Transformer architecture the way you tune convolution layers, the filter sizes and receptive fields, perhaps we could do the same optimization in the Transformer realm that we have already done with convolutions.

Yeah, I think that's a good idea. There's a great paper on lightweight convolutions, I believe from Michael Auli, David Grangier, and colleagues, that came out at almost exactly the same time as the Transformer. The Transformer was slightly more optimized for GPU computation, but the convolutional model was actually slightly better than the Transformer. So it's definitely worth exploring.

Okay, cool, thanks.
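For intuition, here is what a banded, sliding-window attention mask looks like, with a per-layer window schedule playing the role of tunable filter sizes. The schedule below is hypothetical, not Mistral's actual configuration:

```python
import torch

def sliding_window_mask(seq_len, window):
    """Causal mask restricted to the last `window` tokens, like a 1D band."""
    i = torch.arange(seq_len).unsqueeze(1)   # query positions
    j = torch.arange(seq_len).unsqueeze(0)   # key positions
    return (j <= i) & (j > i - window)

# Per-layer window sizes, tuned the way one would tune CNN filter sizes;
# the effective receptive field compounds with depth, as in a CNN stack.
masks = [sliding_window_mask(4096, w) for w in (512, 1024, 2048, 4096)]
```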
[Audience] What's the advantage of a re-ranker over just doing dense retrieval directly?

It depends on the problem, but I think what you probably want to do is cast a wide net with BM25 and then narrow it down with dense search. You often see this as a two-stage process, where the first stage is noisy, and you can actually even add noise to your retrieval there, and then you use the dense model to filter it down.
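A sketch of that two-stage recipe, with `bm25`, `embed`, and `cosine` as hypothetical stand-ins for a sparse index and a dense encoder:

```python
def two_stage_retrieve(query, bm25, embed, cosine, wide_k=100, final_k=5):
    # Stage 1: cheap, high-recall, possibly noisy sparse retrieval.
    candidates = bm25.search(query, k=wide_k)
    # Stage 2: precise but expensive dense re-scoring of the candidates only.
    q_vec = embed(query)
    reranked = sorted(candidates,
                      key=lambda doc: cosine(q_vec, embed(doc)),
                      reverse=True)
    return reranked[:final_k]
```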
[Audience] Everyone is trying to adapt models to their own domain-specific area. I think there are two main ways to do that: one is instruction tuning or fine-tuning, including parameter-efficient methods, and the other is the main topic of this lecture, retrieval augmentation. Besides the low-cost advantage of the retrieval-augmented way, do you think its capacity or quality can match those tuning approaches?

Yeah, I actually think what's going to happen is that all of this will come together. If you train things end to end, RAG 2.0 style, then you can also fine-tune that whole system on some use case end to end. Why would you take only the retrieval-augmented system if you can also fine-tune it on the thing you care about? I think in the end everybody is going to do all of those things, and then there are questions about how to do that efficiently; that's where you'd use adapters and things like that. I think there was another question.
[Audience] I'm curious about hardware. You said this is all going to become a database kind of problem, but what about retrieval hardware? We've thought so much about the language model side, but the index can be huge, trillions of tokens. Do you have any ideas, or is it just a database problem?

I don't know if I'm allowed to say this exactly, but one of the biggest chip manufacturers, whose stock has recently done really well, has some dedicated retrieval hardware coming out soon, or it might already be out. So yes: very efficient dense retrieval is a very big business.

[Audience question, partially inaudible, about whether retrieval augmentation can solve hallucination.]
Yes, I think so, if you take it to the extreme. One of the big problems right now is that if you contextualize an existing language model that already hallucinates, it's going to be hard to get rid of the hallucination. If you do REPLUG on GPT-4, GPT-4 might still hallucinate: it can basically just ignore all the stuff you retrieved and do whatever it wants anyway. That's one of the reasons you want to train the system end to end. And if you take that to the extreme, like I said, where the language model only reasons and speaks, so it knows English and reasoning but has no knowledge of its own, because the knowledge all comes from somewhere else, then you can't really hallucinate; everything is grounded in whatever is in your index.

While we're on hallucination: I'm somewhat frustrated that a lot of people in the field misunderstand what hallucination even means. A lot of people conflate hallucination with correctness or incorrectness; they say the model made a mistake, so it hallucinated. No: it made a mistake, and that's different from hallucination. Hallucination, I think, is something very specific: I retrieved something, so I have some counterfactual ground truth, and what I'm saying does not correspond to that ground truth. There are a bunch of folks at Stanford also working on better measurements and definitions of hallucination and things like that.

[Audience] If I understand correctly, your notion of hallucination only makes sense relative to some ground truth?
Yeah, relative to some ground truth. Hallucination really presupposes that there is something that is true. If we're talking about general parametric language models, then the ground truth is whatever we collectively consider to be true. But we already had a word for language models making mistakes before; it was called making mistakes.
[Audience question, partially inaudible: I guess you're addressing the hallucination question along that path. Are you working on grounding, for example checking claims like someone never having been president?]
know never been president everything
this yeah so so I I like the sort of
Silo mention there as well so I I think
the whole point is that you can you can
have different indices and different
definitions of ground truth and so um I
think you could say I only trust the
archive or I only trust like peer review
papers and not just archive uh and so
you can make decisions in your
architecture during test time about what
You' Define as ground truth
and I also think actually that uh and
there's a bunch of work I think
happening on this right now you can
control for how how grounded you want to
be in your ground TR so uh that's
another kind of misconception about
hallucinations like sometimes
hallucinations are actually good right
if you have a creative writing assistant
and you wanted to come up with some cool
new ideas you want the language model to
hallucinate uh so I I think what you
want to have is kind of a tunable knob
where you say like now you can
hallucinate and now maybe you should
like really tell me the truth
only anything
[Audience] Couldn't you control that with the sampling temperature?

Yeah, but the temperature is just about how you sample, how flat your distribution is. Even if you have a low temperature, the model can still come up with random stuff; a low temperature just means you're very likely to be doing near-greedy sampling. So I think what you want to get at is something more sophisticated than that.
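To spell out what temperature actually does, a minimal sampling sketch: it flattens or sharpens the distribution over next tokens, but every token keeps nonzero probability, which is why temperature alone is a blunt instrument for controlling groundedness.

```python
import numpy as np

def sample_with_temperature(logits, temperature=1.0,
                            rng=np.random.default_rng()):
    """Scale logits by 1/temperature, softmax, then sample a token index."""
    scaled = np.asarray(logits, dtype=np.float64) / max(temperature, 1e-6)
    probs = np.exp(scaled - scaled.max())   # subtract max for stability
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

# temperature -> 0 approaches greedy argmax; temperature > 1 flattens probs,
# but no setting ever rules a token out entirely.
```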
[Host] Lots of interesting questions. Thanks again for the great talk!