Trying to make LLMs less stubborn in RAG (DSPy optimizer tested with knowledge graphs)

Diffbot
26 May 2024 · 11:14

Summary

TLDR: This video explores how to improve the accuracy of large language models by combining external information with internal knowledge. The research finds that while RAG (Retrieval-Augmented Generation) can improve accuracy, its effectiveness depends on the model's confidence and the prompting technique. The video also discusses how different prompting techniques affect whether a language model follows external knowledge, and shows how an entity linker can validate information and reduce hallucinations. It then demonstrates how optimized prompts and an integrated knowledge graph improve the model's retrieval ability, and how the DSPy optimizer and custom prompt templates make the model adhere more closely to knowledge-graph data.

Takeaways

  • 🧠 Research shows that large language models tend to fall back on their own knowledge when weighing external information against internal knowledge, especially when they are not confident that the external information is correct.
  • 🔍 The study compared GPT-4, GPT-3.5, and Mistral 7B. GPT-4 was the most reliable at using external information, followed by GPT-3.5, with Mistral 7B in third place.
  • 📈 Although RAG (Retrieval-Augmented Generation) can improve accuracy, its effectiveness depends on the model's confidence and the prompting technique.
  • 🤖 A viewer comment inspired the investigation: language models may believe their internal knowledge is more correct than external knowledge, which affects how they answer questions.
  • 🔄 Different prompting techniques, such as those used by LangChain and LlamaIndex, can influence how language models follow external knowledge.
  • 🛠️ The DSPy framework can automatically tune prompts to improve how well language models follow external knowledge.
  • 🔗 Entity linking validates information by mapping words in text to entities in a knowledge graph and verifying the answer.
  • 📚 A knowledge graph provides a basis for validating answers by linking facts back to a verified knowledge graph.
  • 🔧 Adding an entity linker to an LLM system can check answer correctness and help filter out information the model hallucinates.
  • 📈 Integrating entity-type checks and knowledge-graph data into the DSPy RAG pipeline can improve output accuracy.
  • 📝 The optimized program is not always reliable; sometimes prompts must be tuned manually to make the language model strictly follow the external knowledge from the knowledge graph.

Q & A

  • Why might large language models (LLMs) still fail to integrate information correctly even when given knowledge-graph data?

    - According to the video, LLMs may fall back on their internal knowledge even when the external source is more accurate. This can stem from insufficient trust in the external knowledge, as well as the model's confidence level and the prompting technique used.

  • How did the study compare the reliability of different language models at handling external information?

    - The study compared GPT-4, GPT-3.5, and Mistral 7B. GPT-4 was the most reliable when using external information, followed by GPT-3.5, then Mistral 7B. All models tended to stick to their own knowledge when they were confident the external knowledge was less correct.

  • What is a RAG system, and what role does it play in improving language-model accuracy?

    - RAG (Retrieval-Augmented Generation) improves a language model's accuracy by grounding it in external knowledge sources. However, its effectiveness depends on the model's confidence and the prompting technique.

  • What is the DSPy framework mentioned in the video, and how does it help improve LLM pipelines?

    - DSPy is a modular framework for improving LLM pipelines. It includes an optimizer that applies bootstrapping to create and refine examples, a process that automatically produces self-improving prompts based on specific metrics.

  • What role does entity linking play in LLM systems?

    - Entity linking validates information by mapping and identifying words in text against entities in a knowledge graph. It can check whether an answer has the correct type, and it helps filter out information that LLMs fabricate when hallucinating.

  • How does an entity linker improve LLM output?

    - Adding an entity linker to the LLM system lets you verify the answer's type and check that the answer is consistent with the knowledge graph, reducing hallucinations and improving output accuracy.

  • Why integrate knowledge-graph data when designing a custom DSPy RAG pipeline?

    - Knowledge graphs organize information by connecting data points, so they provide more comprehensive context. This helps the language model retrieve relevant information from the vector database using a refined query, and finally enhance the answer by combining vector-database results with knowledge-graph metadata.

  • What are the two metrics used in the DSPy RAG pipeline, and how do they work?

    - The first metric checks whether the answer's entity type matches a type from the entity linker; the second assesses whether the answer is consistent with the knowledge-graph data. If the entity type matches, or the answer is a yes/no answer for which no entity type is required, the score increases by one; the answer is then checked for consistency with the knowledge-graph context.

  • Why might the optimized program fail to make the language model stick to external knowledge even when the ground truth is provided?

    - Even with the ground truth provided, the optimized program may fail because language models are inherently unpredictable and a self-directed prompt pipeline may not be reliable enough.

  • What advantages does manually tuning prompts have over an automated prompt pipeline for making language models stick to external knowledge?

    - Manual prompt tuning can instruct the language model explicitly to strictly follow the ground truth from the knowledge graph, especially when its internal patterns conflict with the external knowledge. This reduces hallucinations and yields answers that align fully with the knowledge-graph data.

  • What is the "graph RAG" mentioned at the end of the video, and how does it differ from DSPy RAG?

    - "Graph RAG" is the topic of the next video. The script does not go into detail, but it is presumably a different RAG approach for combining knowledge graphs and vector databases, with different optimization and processing methods than the DSPy RAG pipeline shown here.
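The two-metric scoring described in the Q&A above can be sketched in plain Python. The entity-linker lookup and the knowledge-graph consistency check are stubbed out here (`linked_types` and the substring check are illustrative placeholders, not the Diffbot API or the exact metric from the video):

```python
def score_answer(answer, expected_type, kg_context, linked_types):
    """Score an answer on the two metrics from the video:
    1) entity-type match (or a yes/no answer needing no type),
    2) consistency with the knowledge-graph context."""
    score = 0
    # Metric 1: the entity type matches, or no type is required.
    is_yes_no = answer.strip().lower() in {"yes", "no"}
    if is_yes_no or expected_type in linked_types.get(answer, set()):
        score += 1
    # Metric 2 (placeholder): treat the answer as consistent if it
    # appears verbatim in the knowledge-graph context.
    if answer.lower() in kg_context.lower():
        score += 1
    return score

# Toy example: the KG records Elon Musk as the sole founder of SpaceX.
kg_context = "Elon Musk is the sole founder of SpaceX."
linked = {"Elon Musk": {"person"}, "PayPal": {"organization"}}
print(score_answer("Elon Musk", "person", kg_context, linked))  # 2
print(score_answer("PayPal", "person", kg_context, linked))     # 0
```

A real metric would call the entity linker and compare against the KG facts instead of doing a substring match, but the control flow (type check first, then consistency check) follows the description above.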

Outlines

00:00

🧠 Reducing hallucinations and balancing knowledge in LLMs

Retrieval-augmented generation (RAG) is considered an effective way to reduce hallucinations in large language models, yet previous videos showed some illogical reasoning. A viewer asked whether these models ignore external knowledge because of their internal knowledge base. This segment discusses a timely study comparing how GPT-4, GPT-3.5, and Mistral 7B balance external information against internal knowledge. The study found that all models tend to stick to their own knowledge when they are confident the external knowledge is less accurate, and that different prompting techniques influence how models follow external knowledge. The segment also explores how an entity linker can validate information and filter hallucinations, and how to integrate one into an LLM pipeline to improve output quality.

05:01

🔍 Combining and optimizing knowledge graphs with vector databases

This segment reviews the design of the custom DSPy RAG pipeline that integrates knowledge-graph data. First, the original question is refined using the knowledge graph; the language model then retrieves relevant information from the vector database based on the refined query, and finally enhances the answer by combining vector-database results with knowledge-graph metadata. Two metrics drive the DSPy optimizer: whether the answer's entity type matches the knowledge graph, and whether the answer is consistent with the content the knowledge graph provides. These metrics assess how strictly the pipeline follows the knowledge-graph data. The segment also shows how a knowledge graph can supply detailed evidence of the relationship between two people, retrievable by calling the Diffbot Natural Language API.

10:02

🛠️ Manually tweaking prompt templates to follow external knowledge

This segment shows how manually adjusting prompt templates improves how well the language model follows the external knowledge in the knowledge graph. Different templates produce different outputs: a template that does not explicitly instruct the model to strictly follow the knowledge graph leads to hallucinated output, while an improved template that demands strict adherence when internal patterns conflict with external knowledge yields output that aligns perfectly with the knowledge-graph data. Finally, the author calls DSPy an "elite" framework with a steep learning curve: programmers who use it successfully may well be in the top 5%.
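The three-step flow described above (refine with the KG, retrieve from the vector store, combine) could be sketched roughly as follows. The `kg`, `vector_db`, and `llm` callables are hypothetical stand-ins for the real components, not the pipeline from the video:

```python
def rag_with_kg(question, kg, vector_db, llm):
    """Sketch of the pipeline: KG-refined query -> vector retrieval
    -> answer enhanced with KG metadata."""
    # Step 1: refine the question using knowledge-graph context.
    kg_facts = kg(question)
    refined_query = f"{question} (context: {kg_facts})"
    # Step 2: retrieve passages from the vector database.
    passages = vector_db(refined_query)
    # Step 3: generate the answer from passages plus KG metadata.
    return llm(question=question, passages=passages, kg_facts=kg_facts)

# Toy stand-ins to show the data flow.
kg = lambda q: "Elon Musk founded SpaceX (sole founder)."
vector_db = lambda q: ["SpaceX was founded in 2002 by Elon Musk."]
llm = lambda question, passages, kg_facts: f"Based on the KG: {kg_facts}"
print(rag_with_kg("Who founded SpaceX?", kg, vector_db, llm))
```

In the actual video the steps are DSPy modules and the optimizer tunes their prompts, but the data flow between the three stages is what this sketch illustrates.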
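The weak and "forceful" templates contrasted above might look something like the following. The exact wording from the video is not shown in this summary, so this is an illustrative reconstruction of the two styles:

```python
# A weak template: no instruction about conflicts with internal knowledge.
WEAK_TEMPLATE = """Answer the question using the context below.
Context: {kg_context}
Question: {question}"""

# A forceful template: demands strict adherence to the knowledge graph.
STRICT_TEMPLATE = """Answer the question using ONLY the knowledge-graph
context below. If your internal knowledge conflicts with this context,
you MUST follow the context and ignore your internal knowledge.
Context: {kg_context}
Question: {question}"""

prompt = STRICT_TEMPLATE.format(
    kg_context="Elon Musk is the sole founder of SpaceX.",
    question="Who founded SpaceX?",
)
print(prompt)
```

The only difference is the explicit conflict-resolution instruction, which is what the video credits with eliminating the hallucinated co-founders.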

Keywords

💡Hallucination

Hallucination refers to a false perception produced without external stimulus. In the video it is a metaphor for a large language model (LLM) generating inaccurate or untrue information in the absence of correct external information: for example, a model sticking to its own knowledge and producing flawed reasoning even when given the correct context and ground truth.

💡Knowledge Graph

A knowledge graph is a structured semantic knowledge base that stores relationships between entities in graph form. In the video, the knowledge graph supplies external information that helps the language model answer questions more accurately, for example by verifying whether a generated answer is consistent with known facts.

💡Entity Linking

Entity linking is the process of mapping words in text to entities in a knowledge graph. In the video it is used to validate information: words in the text are matched against knowledge-graph entities to determine whether an answer is valid. For example, when "PayPal" is mentioned, entity linking can confirm whether it matches the corresponding entity in the knowledge graph.

💡Language Model (LM)

A language model is a machine-learning model for generating or understanding natural-language text. The video examines how language models handle and integrate external knowledge, comparing how different models (GPT-4, GPT-3.5, and Mistral 7B) perform when using external information.

💡Retrieval-Augmented

Retrieval augmentation combines external information retrieval with a language model to improve performance. In the video it is used to reduce the language model's hallucinations, with a knowledge graph and a vector database combined to improve accuracy.

💡Self-Improving Prompts

Self-improving prompts are prompts automatically created and refined according to specific metrics to improve a language model's performance. The video uses an optimizer that applies bootstrapping to automatically create such metric-driven prompts.

💡Bootstrapping

Bootstrapping is a technique that repeatedly samples and tests data to find the best-performing patterns. In the video, the DSPy optimizer uses bootstrapping to create and refine self-improving prompts.

💡Vector Database

A vector database stores data as vectors, enabling efficient similarity search. In the video it is used to retrieve information relevant to queries refined with the knowledge graph.

💡Optimizer

An optimizer is a tool that improves system performance by automatically adjusting parameters. In the video, the optimizer in the DSPy pipeline uses bootstrapping to create and refine prompts, improving the language model's accuracy.

💡Ground Truth

Ground truth is the accurate reference information used to validate a model's output. In the video, the knowledge graph provides the ground truth for assessing whether generated answers are accurate; for example, it records that Elon Musk is the sole founder of SpaceX.
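As a toy illustration of the idea (this is not the Diffbot API; the `KG` dict is a hypothetical stand-in for a real knowledge graph):

```python
# A miniature "knowledge graph": entity name -> (types, canonical URI).
KG = {
    "PayPal": ({"organization"}, "https://example.org/kg/PayPal"),
}

def link_entity(mention):
    """Return (types, uri). A uri of None means the mention was
    recognized as a plausible name but could not be linked back
    to the knowledge graph, so it stays unverified."""
    if mention in KG:
        return KG[mention]
    return ({"organization"}, None)  # plausible type, but no KG link

types, uri = link_entity("PayPal")
print(uri is not None)  # True: verified against the KG
types, uri = link_entity("Pied Piper")
print(uri is not None)  # False: fictional company, no link back
```

This mirrors the video's demo: "PayPal" resolves to a linked entity page, while the fictional "Pied Piper" gets an entity type guess but no link, so it can be flagged as unverified.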
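A drastically simplified sketch of the idea behind bootstrapped example selection (not DSPy's actual implementation): repeatedly sample candidate demonstrations, score them with a metric, and keep the best-scoring set.

```python
import random

def bootstrap_demos(examples, metric, n_demos=2, rounds=20, seed=0):
    """Repeatedly sample candidate demo sets and keep the one that
    scores best under the metric (a toy stand-in for a DSPy-style
    bootstrapping optimizer)."""
    rng = random.Random(seed)
    best_set, best_score = None, float("-inf")
    for _ in range(rounds):
        candidate = rng.sample(examples, n_demos)
        score = sum(metric(ex) for ex in candidate)
        if score > best_score:
            best_set, best_score = candidate, score
    return best_set

# Toy metric: prefer demos whose answer matches the KG ground truth.
examples = [
    {"q": "Who founded SpaceX?", "a": "Elon Musk", "kg": "Elon Musk"},
    {"q": "Who founded SpaceX?", "a": "PayPal", "kg": "Elon Musk"},
    {"q": "Founder of Tesla?", "a": "Elon Musk et al.", "kg": "Elon Musk et al."},
]
metric = lambda ex: 1 if ex["a"] == ex["kg"] else 0
best = bootstrap_demos(examples, metric)
print([ex["a"] for ex in best])
```

Real bootstrapping also generates the demonstrations by running the pipeline itself, but the sample-score-keep loop is the core mechanism the keyword refers to.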

Highlights

Retrieval-augmented generation (RAG) is seen as an effective way to reduce hallucinations in large language models.

Previous videos showed that LLMs may still fail to integrate information correctly even when given knowledge-graph (KG) data.

A viewer comment suggested that LLMs may consider their internal knowledge more correct than external knowledge and therefore ignore it.

The study examined how language models balance external information with internal knowledge, finding GPT-4 the most reliable, followed by GPT-3.5 and Mistral 7B.

All models tend to stick to their own knowledge if they are confident the external knowledge is incorrect.

The study shows that while RAG can improve accuracy, its effectiveness depends on the model's confidence and the prompting technique.

The study did not cover the effect of pairing different embedding models with the LLMs.

Different prompting techniques can influence how LLMs follow external knowledge.

DSPy is a modular framework for improving LLM pipelines, with an optimizer that applies bootstrapping.

Bootstrapping is a technique that repeatedly samples and tests data to find the best-performing patterns.

Entity linking is the process of validating information by mapping words in text to entities in a knowledge graph.

The Diffbot Knowledge Graph is used because it has the largest network of verified information sources on Earth.

An entity linker can prevent LLMs from producing fabricated information.

The updated DSPy RAG pipeline includes an answer-type validity check to improve output.

Entity-type checks are combined with guiding the language model to strictly follow the external knowledge-graph data.

A custom DSPy RAG pipeline is designed to integrate with knowledge-graph data.

Two metrics are used to evaluate the DSPy optimizer: an entity-type check and consistency with the knowledge-graph data.

The knowledge graph clearly shows the relationship between two people, along with supporting evidence.

The knowledge graph serves as the ground truth for the reasoning step.

The automated prompt pipeline can be unreliable, and the self-directed reasoning ability of language models is unpredictable.

Prompts are tuned manually so the language model follows external knowledge more strictly.

DSPy has a steep learning curve, and manually customizing prompt templates may still be necessary.

Transcripts

00:00

Retrieval-augmented generation (RAG) has been seen as an effective approach to reduce hallucination in large language models. However, in our previous videos we saw something weird: this reasoning does not really make sense to me at all. Even if we provide our KG data, it somehow still doesn't incorporate it correctly. Even providing the correct context and the ground truth to LLMs, they somehow still didn't want to follow it. And we got this interesting comment from our viewer: what if these LLMs think they already know the answer to the question, because they think their internal knowledge is more correct than the external knowledge from our RAG system? Then this research, "How Faithful are RAG Systems", came just in time. It looks at how language models balance external information with their internal knowledge, comparing GPT-4, GPT-3.5, and Mistral 7B. GPT-4 is found to be the most reliable model when using external information, with GPT-3.5 second and Mistral 7B in third place. Regardless of their differences, all models tend to stick to their own knowledge if they're confident that the external knowledge is less correct. The study concludes that while RAG can still enhance accuracy, its effectiveness depends on the model's confidence and the prompting technique.

01:15

It's insightful to know LLMs' tendency to fall back on their internal patterns. I was also trying to find out in the study whether pairing LLMs with different embedding models would make any difference. As we previously found out, different combinations of language models and embedding models can lead to quite different results, such as: "Who are the other founders? Elon co-founded SpaceX with PayPal." That's weird. The study doesn't seem to include such information, but it highlighted that different prompting techniques can influence how LLMs follow external knowledge, as LangChain and LlamaIndex were used in this particular study. I wonder if DSPy, which is a framework for auto-tuning prompts, can effectively make the different pairings of LLMs and embedding models better follow external knowledge in the RAG system. Just to recap: DSPy is a modular framework to improve LLM pipelines. It has an optimizer that applies bootstrapping to create and refine examples. Bootstrapping is a technique that repeatedly samples and tests data to find the best-performing patterns. This process automatically creates self-improving prompts based on specific metrics.

02:24

Before we go into the details of setting metrics for the optimizer in the pipeline: remember seeing the strange answer of PayPal being a co-founder of SpaceX previously? We later found out that the entity-linking feature of the Diffbot Natural Language API can prevent exactly this from happening. Entity linking is a process to validate information by mapping and identifying words in your text to entities in a knowledge graph, and in this video we're using the Diffbot Knowledge Graph because it has the largest network of verified information sources on planet Earth. So the Natural Language API can be used to extract entities and relationships, but it can also verify how valid an answer is by linking facts back to the Diffbot Knowledge Graph. PayPal is categorized under these entity types, with a confidence score, and here you see the clickable link which leads further to the page about PayPal and all the other relevant information, which means this information is valid. But let's type "Pied Piper". Pied Piper is a made-up company name from the sitcom Silicon Valley. While the NLP can correctly identify that this is probably an organization name, there's no link here, which means it can't be mapped back to the Diffbot Knowledge Graph, because it's not a valid company. (Well, they do have a valid LinkedIn page, complete with fake profiles such as Richard Hendricks and Jared Dunn.)

03:55

Adding an entity linker to LLM-based systems looks like this. This validation step not only checks for correct answer types but also helps filter made-up information when LLMs are hallucinating. So now we kind of see what an entity linker can do for us. We updated our DSPy RAG pipeline with this answer-type validity check to see if it can improve the output for the question we saw last time. Let's look at the obvious difference between our basic RAG and the one with the entity linker. When we add the step to check entity types, the result is closer to the right answer, and when we look at the reasoning part, the language model was guided to find specifically information regarding a person: as you can see here, Elon Musk is a person, and Jim Cantrell is also a person. The correct answer should be that Elon is the sole founder of SpaceX, but at least this is closer to the right answer compared to the previous one with PayPal, which to some degree proves this is a step we should consider integrating into our LLM pipeline as a validation step.

05:01

So what we're going to do is have both entity-type checks and guidance for the language model to stick strictly to the external data from our knowledge graph. Here's a brief recap of how we previously designed our custom DSPy RAG pipeline integrating knowledge-graph data. First, it refines the original question with our knowledge graph, because knowledge graphs can provide more comprehensive context as they organize data points with connections. Then the language model retrieves relevant information from our vector database based on this refined query. Finally, it enhances the answer by combining information from the vector database with metadata from our knowledge graph. As we mentioned previously, we need metrics for the DSPy optimizer. The two metrics we're using here: the first checks the entity type, and the second assesses how faithfully the pipeline follows the data from the knowledge graph. If the entity type of the answer is matched in our entity linker, or it's a yes/no type of answer for which an entity type is not required, we increase the score by one, then move on to the second metric, which assesses whether the answer aligns with the context from our knowledge graph.

06:17

So now we have our metrics ready, and we also have our training dataset with just a few examples. Besides the question, you might be confused why the context and the answer are pretty much the same. The intention here is that the final answer should strictly align with the ground truth provided by our knowledge graph, because, if you still remember, in our DSPy RAG pipeline with knowledge graphs there's a step where the language model retrieves and generates an answer from our vector database, but that vector-based answer should further be validated as consistent with the information from the knowledge graph. That's why you see a custom metric here: the correct answer should always align with this context.

07:03

The nice thing about knowledge graphs is you can actually see the details of the relationship between these two people, with the evidence here. This is the evidence from the article that showcases the interaction between Elon Musk and Mark Zuckerberg, and you can easily get this as the property for relationships in your knowledge graph just by calling the Diffbot Natural Language API, and it's free. So now you see what we have in our knowledge graph, which serves as the ground truth. When we go into the reasoning part here, the knowledge-graph context is being considered, and now we have this enriched query. Let's see what the enhanced output looks like: more specific passages regarding the martial-arts match are retrieved, and it further incorporates how their relationship evolves. This is probably a clear example of what it means to enhance a language model's retrieval ability by bringing in knowledge graphs, and I think this is also a good example of how knowledge graphs can be combined with vector-based RAG.

08:07

So here's what the ground truth looks like in our knowledge graph. There's only one relationship regarding who founded SpaceX, and it tells us that Elon Musk is the sole founder. This relationship is supported by the evidence here from the text of the Wikipedia page, and that's why we get an empty list: this is the ground truth, and no co-founders of SpaceX should be returned. We provided the ground truth, but the optimized program did not seem to successfully make GPT-3.5 stick to the external knowledge; see what it did. We can also look at the third program that we optimized: the answer was not even close to being relevant. So here's an example of how the automated prompt pipeline just went a little bit too far. This is our original question, and we also provided a ground truth, yet our original query somehow got optimized into "who else has co-founded companies with Elon Musk": SpaceX, for some reason, just got dropped. Even with an optimizer in the pipeline, a prompt pipeline self-directed by language models themselves may not be that reliable; plus, by nature they're just unpredictable. So to what degree should we rely on their self-directed reasoning ability?

09:27

So now that we've seen the performance of the DSPy pipeline can be quite language-model or embedding-model dependent (the less-performing combinations being Llama 3 with one embedding model, and GPT-3.5 plus ada-002), I'm coming back to manually tweaking prompts to make language models stick more to the external knowledge, as the research "How Faithful are RAG Systems" also pointed out that the forcefulness of prompt templates can influence the degree to which language models follow external knowledge. This is our first prompt template, where the language model was not specifically instructed to strictly follow the ground truth from the knowledge graph, and hallucination is obvious here; this could be some existing pattern in the language model. If we go on to the next template, where we specifically instruct the LLM to strictly follow the ground truth in the knowledge graph, especially if there are conflicts between its internal patterns and the external knowledge, see what we got: this perfectly aligns with our KG data. In this specific use case, if we want language models to stick strictly to external knowledge, manually customizing prompts may still be needed. You can literally see the difference between this answer and the previous one.

10:49

In my personal opinion, I think DSPy is an elite framework; at least for me, throughout the process the learning curve was quite steep. So if you're having a great time with it and reaching better results, congratulations: you have probably been verified as one of the top 5% elite programmers by this elite framework. Good for you. But for now we're taking a break from this, because we're moving on to graph RAG. So I'll see you in the next one.


Related Tags
language models, knowledge graphs, accuracy, external information, internal knowledge, prompting techniques, optimization frameworks, entity linking, information validation, auto-tuning