Trying to make LLMs less stubborn in RAG (DSPy optimizer tested with knowledge graphs)
Summary
TLDR: This video explores how to improve the accuracy of large language models by balancing external information with internal knowledge. The research discussed finds that although RAG (Retrieval-Augmented Generation) can improve accuracy, its effectiveness depends on the model's confidence and the prompting technique. The video also examines how different prompting techniques affect whether language models follow external knowledge, and shows how an entity linker can validate information and reduce hallucinations. Finally, it demonstrates how to improve retrieval by optimizing prompts and integrating a knowledge graph, and how the DSPy optimizer and custom prompt templates can make a model stick more closely to knowledge-graph data.
Takeaways
- 🧠 The study shows that large language models tend to fall back on their own internal knowledge when handling external information, especially when they are not confident the external information is correct.
- 🔍 The study compares GPT-4, GPT-3.5, and Mistral 7B, finding GPT-4 the most reliable at using external information, followed by GPT-3.5, with Mistral 7B third.
- 📈 Although RAG (Retrieval-Augmented Generation) can improve accuracy, its effectiveness depends on the model's confidence and the prompting technique.
- 🤖 A viewer comment inspired the investigation: language models may believe their internal knowledge is more correct than external knowledge, which affects how they answer questions.
- 🔄 Different prompting frameworks, such as LangChain and LlamaIndex, can influence how language models follow external knowledge.
- 🛠️ DSPy, a framework for automatically tuning prompts, is proposed as a way to make language models follow external knowledge better.
- 🔗 Entity linking validates information by mapping words in text to entities in a knowledge graph and verifying the validity of answers.
- 📚 A knowledge graph can provide the basis for validating answers by linking facts back to a verified knowledge graph.
- 🔧 Adding an entity linker to an LLM system can check whether answers are correct and help filter out information the model hallucinated.
- 📈 Integrating entity-type checks and knowledge-graph data into the DSPy RAG pipeline can improve output accuracy.
- 📝 The optimized program is not always reliable; sometimes prompts must be tweaked manually to make the language model strictly follow the external knowledge in the knowledge graph.
Q & A
Why might large language models (LLMs) still fail to incorporate information correctly even when given knowledge-graph data?
-According to the video, LLMs may tend to rely on their own internal knowledge even when the external source is more accurate. This can stem from insufficient trust in the external knowledge, and is influenced by the model's confidence and the prompting technique.
How did the research mentioned in the video compare the reliability of different language models when handling external information?
-The study compared GPT-4, GPT-3.5, and Mistral 7B. GPT-4 was found to be the most reliable at using external information, followed by GPT-3.5, with Mistral 7B last. All models tended to stick to their own knowledge when confident that the external knowledge was less correct.
What is a RAG system, and what role does it play in improving language-model accuracy?
-RAG (Retrieval-Augmented Generation) improves a language model's accuracy by grounding it in external knowledge sources. Its effectiveness, however, depends on the model's confidence and the prompting technique.
What is the DSPy framework mentioned in the video, and how does it help improve LLM pipelines?
-DSPy is a modular framework for improving LLM pipelines. It includes an optimizer that applies bootstrapping to create and refine examples, automatically producing prompts that improve themselves against a specified metric.
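The bootstrapping idea can be sketched in plain Python. This is an illustration of the concept only, not the actual DSPy API; every name below (`bootstrap_demos`, the toy training set, the toy metric) is made up for the sketch:

```python
import random

def bootstrap_demos(train_set, metric, n_rounds=20, n_demos=2, seed=0):
    """Repeatedly sample candidate few-shot demos and keep the
    best-scoring set under `metric` -- the core bootstrapping loop."""
    rng = random.Random(seed)
    best_demos, best_score = [], float("-inf")
    for _ in range(n_rounds):
        demos = rng.sample(train_set, n_demos)
        # Score the candidate demo set against the whole training set.
        score = sum(metric(example, demos) for example in train_set)
        if score > best_score:
            best_demos, best_score = demos, score
    return best_demos, best_score

# Toy data and metric: a demo "helps" an example if they share a topic.
train = [{"q": "Who founded SpaceX?", "topic": "space"},
         {"q": "Who founded Tesla?", "topic": "cars"},
         {"q": "Who founded Neuralink?", "topic": "brain"}]
toy_metric = lambda ex, demos: sum(d["topic"] == ex["topic"] for d in demos)

demos, score = bootstrap_demos(train, toy_metric)
```

In real DSPy the loop is richer (it runs the program to generate traces and keeps only demos that pass the metric), but the sample-score-keep pattern is the same.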
What role does entity linking play in LLM systems?
-Entity linking validates information by mapping and identifying words in text as entities in a knowledge graph. It can check that an answer has the correct type, and it can help filter out information the LLM fabricated while hallucinating.
How does an entity linker improve LLM output?
-Adding an entity linker to an LLM system makes it possible to verify the type of an answer and ensure the answer is consistent with the knowledge graph, reducing hallucinations and improving output accuracy.
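A minimal sketch of this validation step, assuming a linker service. The `link_entity` stub and its tiny dictionary stand in for a real linker such as the Diffbot Natural Language API; all names here are illustrative, not the video's code:

```python
# Fake knowledge-graph lookup standing in for a real entity linker.
KG = {
    "PayPal": {"type": "Organization", "uri": "kg:PayPal"},
    "Elon Musk": {"type": "Person", "uri": "kg:Elon_Musk"},
}

def link_entity(name):
    """Return the linked KG entity, or None if the name cannot be
    mapped back to the knowledge graph (a hallucination signal).
    Real linkers also return a confidence score."""
    return KG.get(name)

def validate_answer(answer, expected_type):
    """Accept an answer only if it links to a KG entity of the right type."""
    entity = link_entity(answer)
    if entity is None:
        return False  # unlinkable: possibly made up (e.g. "Pied Piper")
    return entity["type"] == expected_type
```

For example, `validate_answer("Pied Piper", "Organization")` returns `False` because the made-up company cannot be linked, while `validate_answer("Elon Musk", "Person")` returns `True`.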
Why integrate knowledge-graph data when designing a custom DSPy RAG pipeline?
-Knowledge graphs organize information by connecting data points, so they provide more comprehensive context. This helps the language model retrieve relevant information from the vector database using a refined query, and finally enhance the answer by combining the vector-database results with knowledge-graph metadata.
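The three stages just described can be sketched with stub functions. This is a library-free illustration under stated assumptions: the real pipeline uses DSPy modules, an LLM, and a vector store, and every function and data value below is hypothetical:

```python
def refine_query(question, kg_facts):
    """Stage 1: enrich the raw question with knowledge-graph relations."""
    return question + " | context: " + "; ".join(kg_facts)

def retrieve(query, vector_db):
    """Stage 2: keyword overlap standing in for vector similarity search.
    Require at least two shared words to count a document as relevant."""
    words = set(query.lower().split())
    return [doc for doc in vector_db
            if len(words & set(doc.lower().split())) >= 2]

def answer(question, kg_facts, vector_db):
    """Stage 3: combine retrieved passages with knowledge-graph metadata."""
    query = refine_query(question, kg_facts)
    return {"passages": retrieve(query, vector_db), "kg_facts": kg_facts}

result = answer(
    "Who founded SpaceX?",
    ["Elon Musk -> FOUNDER_OF -> SpaceX"],
    ["Elon Musk founded SpaceX in 2002.", "PayPal was founded in 1998."],
)
```

The design point is the ordering: the KG refines the query *before* retrieval, so the vector search already benefits from the graph's connections, and the KG facts are carried through to the final answer for validation.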
What are the two metrics used in the DSPy RAG pipeline, and how do they work?
-The two metrics are: (1) check whether the entity type of the answer matches the type from the entity linker, and (2) assess whether the answer is consistent with the knowledge-graph data. If the entity type matches, or the answer is a yes/no answer (which needs no entity type), the score increases by one; the answer is then assessed for consistency with the knowledge-graph context.
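The two-part scoring can be sketched as a plain function. The names are illustrative, not the video's code, and substring containment here stands in for the pipeline's real consistency check:

```python
YES_NO = {"yes", "no"}

def kg_metric(answer, linked_type, expected_type, kg_context):
    """Score 0-2: one point for the entity-type check, one for
    consistency with the knowledge-graph context."""
    score = 0
    # Metric 1: type matches, or it's a yes/no answer (no type needed).
    if answer.strip().lower() in YES_NO or linked_type == expected_type:
        score += 1
    # Metric 2: the answer must be consistent with the KG context
    # (substring containment stands in for an LLM-judged check).
    if answer.strip().lower() in kg_context.lower():
        score += 1
    return score
```

For instance, `kg_metric("Elon Musk", "Person", "Person", "Elon Musk is the sole founder of SpaceX")` scores 2: the linked type matches and the answer appears in the KG context.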
Why might the optimized program fail to make the language model stick to external knowledge, even when ground truth is provided?
-Even with the ground truth provided, the optimization may fail because language models are inherently unpredictable and a self-directed prompt pipeline may not be reliable enough.
What advantages does manually tweaking prompts have over an automated prompt pipeline for making language models stick to external knowledge?
-Manually tweaked prompts can instruct the model more explicitly to strictly follow the ground truth from the knowledge graph, especially when its internal patterns conflict with the external knowledge. This reduces hallucinations and yields answers fully aligned with the knowledge-graph data.
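An illustrative template along these lines — the wording is ours, not the exact template from the video:

```python
# A "forceful" prompt template that tells the model to prefer the
# knowledge-graph ground truth over its internal knowledge.
STRICT_TEMPLATE = """Answer the question using ONLY the knowledge-graph context below.
If your internal knowledge conflicts with this context, follow the context.
If the context does not contain the answer, say you don't know.

Knowledge-graph context:
{kg_context}

Question: {question}
Answer:"""

def build_prompt(question, kg_context):
    return STRICT_TEMPLATE.format(question=question, kg_context=kg_context)

prompt = build_prompt(
    "Who co-founded SpaceX with Elon Musk?",
    "SpaceX -> FOUNDER -> Elon Musk (sole founder; no co-founders)",
)
```

The difference from a weaker template is the explicit conflict rule: without it, the model's internal pattern ("SpaceX has co-founders") can override the supplied ground truth.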
What is the "graph RAG" mentioned at the end of the video, and how does it differ from the DSPy RAG?
-"Graph RAG" is the topic of the next video. The script gives no details, but it is presumably a different RAG approach for combining knowledge graphs and vector databases, with different optimization and processing than the DSPy RAG pipeline shown here.
Outlines
🧠 Reducing hallucinations and balancing knowledge in large language models
Retrieval-Augmented Generation (RAG) is regarded as an effective way to reduce hallucinations in large language models, yet a previous video showed some illogical reasoning. A viewer asked whether these models ignore external knowledge because of their internal knowledge base. This section discusses a timely study comparing how GPT-4, GPT-3.5, and Mistral 7B balance external information with internal knowledge. All models tend to stick to their own knowledge when confident the external knowledge is less correct, and different prompting techniques influence how well models follow external knowledge. The section also explores how entity linkers validate information and filter hallucinations, and how to integrate an entity linker into an LLM pipeline to improve output quality.
🔍 Optimizing the combination of knowledge graphs and vector databases
This section revisits the design of the custom DSPy RAG pipeline that integrates knowledge-graph data. First, the knowledge graph refines the original question; the language model then retrieves relevant information from the vector database using the refined query, and finally enhances the answer by combining vector-database results with knowledge-graph metadata. Two metrics drive the DSPy optimizer: whether the entity type of the answer matches the knowledge graph, and whether the answer is consistent with the knowledge-graph content. These metrics assess how strictly the pipeline follows the knowledge-graph data. The section also shows how the knowledge graph exposes the detailed relationship between two people, with evidence, retrievable by calling the Diffbot Natural Language API.
🛠️ Manually tweaking prompt templates so the model follows external knowledge
This section discusses manually tweaking prompt templates to make the language model follow the external knowledge in the knowledge graph. Different templates produce different outputs: a template without an explicit instruction to strictly follow the knowledge graph led to hallucinated output, while an improved template requiring the model to strictly follow the knowledge graph when its internal patterns conflict with the external knowledge produced output fully aligned with the knowledge-graph data. The author closes with a take on DSPy: an elite framework with a steep learning curve, and programmers who use it successfully may well be in the top 5%.
Mindmap
Keywords
💡Hallucination
💡Knowledge Graph
💡Entity Linking
💡Language Model (LM)
💡Retrieval-Augmented
💡Self-Improving Prompts
💡Bootstrapping
💡Vector Database
💡Optimizer
💡Ground Truth
Highlights
Retrieval-Augmented Generation (RAG) is regarded as an effective way to reduce hallucinations in large language models.
A previous video found that even when given knowledge-graph (KG) data, large language models (LLMs) may still fail to incorporate the information correctly.
A viewer comment suggested that LLMs may consider their internal knowledge more correct than external knowledge and therefore not follow it.
A study of how language models balance external information with internal knowledge found GPT-4 the most reliable, followed by GPT-3.5 and Mistral 7B.
All models tend to stick to their own knowledge if they are confident the external knowledge is incorrect.
While RAG can improve accuracy, its effectiveness depends on the model's confidence and the prompting technique.
The study did not cover pairing LLMs with different embedding models.
Different prompting techniques can influence how LLMs follow external knowledge.
DSPy is a modular framework for improving LLM pipelines, with an optimizer that applies bootstrapping.
Bootstrapping repeatedly samples and tests data to find the best-performing patterns.
Entity linking validates information by mapping words in text to entities in a knowledge graph.
The Diffbot Knowledge Graph is used because it has one of the largest networks of verified information sources.
An entity linker can prevent LLMs from producing fabricated information.
The updated DSPy RAG pipeline adds an answer-type validity check to improve output.
Entity-type checks are combined with guiding the language model to strictly follow the knowledge graph's external data.
A custom DSPy RAG pipeline is designed to integrate with knowledge-graph data.
Two metrics drive the DSPy optimizer: an entity-type check and consistency with the knowledge-graph data.
The knowledge graph clearly shows the relationship between two people, along with its evidence.
The knowledge graph serves as the ground truth for the reasoning step.
Automated prompt pipelines can be unreliable; self-directed reasoning by language models is unpredictable.
Prompts were tweaked manually to make the language model follow external knowledge more strictly.
DSPy has a steep learning curve, and manually customizing prompt templates may still be needed.
Transcripts
Retrieval-Augmented Generation (RAG) has been seen as an effective approach to reduce hallucination in large language models. However, in our previous videos we saw something weird: this reasoning does not really make sense to me at all. Even if we provide our KG data, it somehow still isn't incorporated correctly. Even when given the correct context and the ground truth, the LLMs somehow still didn't want to follow it. And we got this interesting comment from our viewer: what if these LLMs think they already know the answer to the question, because they think their internal knowledge is more correct than the external knowledge from our RAG system?

This research, "How faithful are RAG systems?", came just in time. It looks at how language models balance external information with their internal knowledge. It compares GPT-4, GPT-3.5, and Mistral 7B. GPT-4 is found to be the most reliable model when using external information, with GPT-3.5 second and Mistral 7B in third place. Regardless of their differences, all models tend to stick to their own knowledge if they're confident that the external knowledge is less correct. The study concludes that while RAG can still enhance accuracy, its effectiveness depends on the model's confidence and the prompting technique.

It's insightful to know LLMs' tendency to fall back on their internal patterns. I was also trying to find out from the study whether pairing LLMs with different embedding models would make any difference, as we previously found that different combinations of language models and embedding models can lead to quite different results — such as "who are the other founders? Elon co-founded SpaceX with PayPal." That's weird. The study doesn't seem to include such information, but it highlighted that the choice of prompting technique can influence how LLMs follow external knowledge, as LangChain and LlamaIndex were used in this particular study. I wonder if DSPy, which is a framework for auto-tuning prompts, can effectively make the different pairings of LLMs and embedding models follow external knowledge better in the RAG system.

Just to recap: DSPy is a modular framework to improve LLM pipelines. It has an optimizer that applies bootstrapping to create and refine examples. Bootstrapping is a technique that repeatedly samples and tests data to find the best-performing patterns. This process automatically creates self-improving prompts based on specific metrics.

Before we go into the details of setting metrics for the optimizer in the pipeline: remember seeing the strange answer of PayPal being a co-founder of SpaceX? We later found out that the entity-linking feature from the Diffbot Natural Language API can prevent exactly this from happening. Entity linking is a process to validate information by mapping and identifying words in your text as entities in a knowledge graph, and in this video we're using the Diffbot Knowledge Graph because it has one of the largest networks of verified information sources. So the Natural Language API can be used to extract entities and relationships, but it can also verify how valid an answer is by linking facts back to the underlying knowledge graph. PayPal is categorized under these entity types, each with a confidence score, and here you see the clickable link, which leads further to the page about PayPal and all the other relevant information here — which means this information is valid. But let's type "Pied Piper". Pied Piper is a made-up company name from the sitcom Silicon Valley. While the NLP can correctly identify that this is probably an organization name, there's no link here, which means it can't be mapped back to the Diffbot Knowledge Graph, because it's not a valid company — well, they do have a valid LinkedIn page with those fake profiles, Richard Hendricks, Jared Dunn, and "Ed Chambers".

Adding an entity linker to LLM-based systems looks like this. This validation step not only checks for correct answer types but also helps filter out made-up information when LLMs are hallucinating. So now that we've seen what an entity linker can do for us, we updated our DSPy RAG pipeline with this answer-type validity check to see if it can improve the output for the question we saw last time. Let's look at the obvious difference between our basic RAG and the one with the entity linker. When we add the step to check entity types, the result is closer to the right answer, and when we look at the reasoning part, the language model was guided to find specifically the information regarding a person: as you can see here, Elon Musk is a person, and Jim Cantrell is also a person. The correct answer should be that Elon is the sole founder of SpaceX, but at least this is closer to the right answer compared to the previous one with PayPal, which to some degree proves that this is a validation step we should consider integrating into our LLM pipeline.

So what we're going to do is have both entity-type checks and guidance for the language model to stick strictly to the external data from our knowledge graph. Here's a brief recap of how we previously designed our custom DSPy RAG pipeline integrating knowledge-graph data: first, it refines the original question with our knowledge graph, because knowledge graphs can provide more comprehensive context as they organize data points with connections; then the language model retrieves relevant information from our vector database based on this refined query; and finally, it enhances the answer by combining information from the vector database with metadata from our knowledge graph.

As we mentioned previously, we need metrics for the DSPy optimizer. The two metrics we're using here: the first checks the entity type, and the second assesses how faithfully the pipeline follows the data from the knowledge graph. If the entity type of the answer is matched by our entity linker, or it's a yes/no type of answer for which an entity type is not required, we increase the score by one and move on to the second metric, which assesses whether the answer aligns with, or is consistent with, the context from our knowledge graph.

Now we have our metrics ready, and we also have our training dataset with just a few examples. Besides the question, you might be confused about why the page context and the answer are pretty much the same. The intention here is that the final answer should strictly align with the ground truth provided by our knowledge graph, because — if you still remember — in our DSPy RAG pipeline with knowledge graphs, there's a step where the language model retrieves and generates an answer from our vector database, but that vector-based answer should further be validated as consistent with information from the knowledge graph. That's why you see here a custom metric in which the context and the correct answer should always align.

The nice thing about knowledge graphs is that you can actually see the details of the relationship between these two people, with the evidence here. This is the evidence from the article that showcases the interaction between Elon Musk and Mark Zuckerberg, and you can easily get this as a property of the relationships in your knowledge graph just by calling the Diffbot Natural Language API — and it's free. So now you see what we have in our knowledge graph, which serves as the ground truth. When we go into the reasoning part here, the knowledge-graph context is now being considered, and we get this enriched query. Let's see what the enhanced output looks like: you can see that more specific passages regarding the martial-arts match are being retrieved, and it further incorporates how their relationship evolves. This is probably a clear example to illustrate what it means to enhance a language model's retrieval ability by bringing in knowledge graphs, and I think it's also a good example of how knowledge graphs can be combined with vector-based RAG.

Here's what the ground truth looks like in our knowledge graph: there's only one relationship regarding who founded SpaceX, and it tells us that Elon Musk is the sole founder. This relationship is supported by the evidence here, from the text of the Wikipedia page, and that's why we get an empty list — the ground truth is that no co-founders of SpaceX should be returned. Here we provided the ground truth, but the optimized program did not seem to successfully make GPT-3.5 stick to the external knowledge; see what it did. We can also look at the third program that we optimized: the answer was not even close to being relevant. So here's an example of how the automated prompt pipeline just went a little too far. This is our original question, and we also provided the ground truth, yet our original query somehow got optimized into "who else has co-founded companies with Elon Musk?" — SpaceX, for some reason, just got dropped. Even if we now have an optimizer in the pipeline, a prompt pipeline self-directed by the language models themselves may not be that reliable; plus, by nature they're just unpredictable. So to what degree should we rely on their self-directed reasoning ability?

Now that we've seen that the performance of the DSPy pipeline can be quite language-model- or embedding-model-dependent, for the less-performing combinations (such as GPT-3.5 paired with the ada-002 embeddings) I'm coming back to manually tweaking prompts to make language models stick more closely to the external knowledge, as the research "How faithful are RAG systems?" also pointed out that the forcefulness of prompt templates can influence the degree to which language models follow external knowledge. This is our first prompt template, where the language model was not specifically instructed to strictly follow the ground truth from the knowledge graph, and hallucination is obvious here — well, this could be some existing pattern in the language model. If we go on to the next template, where we specifically instruct the LLM to strictly follow the ground truth in the knowledge graph, especially if there are conflicts between its internal patterns and the external knowledge, as you can see here — see what we got: this perfectly aligns with our KG data. In this specific use case, if we want language models to stick strictly to external knowledge, manually customizing prompts may still be needed; you can literally see the difference between this answer and the previous one.

In my personal opinion, DSPy is an elite framework — at least for me, the learning curve was quite steep throughout the process. So if you're having a great time with it and reaching better results, congratulations: you're probably being verified as one of the top 5% elite programmers by this elite framework. Good for you! But for now we're taking a break from this, because we're moving on to graph RAG. I'll see you in the next one.