Building Corrective RAG from scratch with open-source, local LLMs

LangChain
16 Feb 2024 · 26:00

Summary

TL;DR: The transcript discusses building self-reflective retrieval-augmented generation (RAG) applications using open-source, local models. It highlights the concept of self-reflection in RAG, where the system grades the relevance of retrieved documents and performs knowledge refinement. The speaker introduces LangGraph, a tool for implementing these ideas locally, and demonstrates its use with a local LLM (Mistral 7B via Ollama) and a CPU-optimized embedding model. The process involves creating a local index, grading documents, and using conditional logic to decide whether to perform web searches or generate responses. The transcript emphasizes the potential of logical flows and local models for complex reasoning tasks without the need for an agent.

Takeaways

  • The concept of self-reflection in RAG (Retrieval-Augmented Generation) is gaining popularity, allowing for more dynamic and relevant information retrieval and generation based on feedback loops.
  • The 'Corrective RAG' (C-RAG) paper demonstrates a straightforward approach to self-reflection, grading retrieved documents and refining knowledge based on their relevance and correctness.
  • Implementing self-reflective RAG apps can be achieved using open-source and local models, which run efficiently on a laptop without the need for large-scale, API-gated models.
  • The LangChain team has developed a tool called LangGraph which facilitates the implementation of self-reflective RAG using local LLMs (large language models).
  • For local information retrieval, GPT4All embeddings from Nomic are suggested for their CPU optimization and effectiveness.
  • The process of building a RAG app involves creating a graph of logical steps, where each node represents a specific operation or function, and the state is propagated through these steps.
  • The use of Ollama with the Mistral 7B model is highlighted for its ability to run models locally and its support for JSON mode, which structures the model's output for easy interpretation and flow control.
  • The concept of logical gates is introduced, where the output from one step (e.g., document grading) determines the next step in the process (e.g., appending relevant documents or performing a web search).
  • The demonstration showcases a multi-step logical flow in action, including retrieval, grading, web search, question transformation, and generation, all running locally and seamlessly integrated.
  • The potential of using local models in a constrained, step-by-step manner is emphasized over using them as agent executors, which leads to more reliable and effective logical reasoning.

Q & A

  • What is the main focus of the LangChain team's discussion?

    -The main focus of the LangChain team's discussion is building self-reflective RAG (retrieval-augmented generation) applications from scratch, using only open-source and local models that run strictly on a laptop.

  • What is the significance of self-reflection in RAG research?

    -Self-reflection in RAG research is significant as it allows the system to perform retrieval based on a question from an index, assess the relevance or quality of the retrieved documents, and perform reasoning to potentially retry various steps, leading to more accurate and refined outputs.

  • How does the concept of self-reflection improve the RAG process?

    -The concept of self-reflection improves the RAG process by allowing the system to not just perform a single-shot retrieval and generation but to also self-reflect, reason, and retry steps from alternative sources, leading to enhanced accuracy and relevance in the final output.

  • What is the role of local LLMs in the discussed approach?

    -Local LLMs play a crucial role in the discussed approach as they are smaller and more manageable models that run locally on a system, allowing for efficient and fast processing without relying on API-gated, large-scale models.

  • How does the 'Corrective RAG' paper contribute to the self-reflection idea?

    -The 'Corrective RAG' paper contributes to the self-reflection idea by demonstrating a method where the system performs retrieval, grades the documents based on relevance, refines knowledge when documents are correct, and performs a web search to supplement retrieval when documents are ambiguous or incorrect.

  • What is the benefit of using open-source tools like Ollama and LangGraph for local model implementation?

    -The benefit of using open-source tools like Ollama and LangGraph for local model implementation is that they provide an easy, efficient, and seamless way to run models locally, enabling users to leverage powerful machine-learning capabilities without the need for extensive infrastructure or API access.

  • How does the use of GPT4All embeddings from Nomic enhance the local indexing process?

    -The use of GPT4All embeddings from Nomic enhances the local indexing process by providing a CPU-optimized, contrastively trained embedding model that works well locally, ensuring fast and efficient document indexing without relying on external APIs or cloud services.
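A rough sketch of that indexing step, assuming the langchain_community integrations for WebBaseLoader, GPT4AllEmbeddings, and Chroma; the blog-post URL here is an assumption (the autonomous-agents post referenced in the video), and exact import paths may vary by package version:

```python
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.embeddings import GPT4AllEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load the blog post used as the local index (URL assumed).
docs = WebBaseLoader("https://lilianweng.github.io/posts/2023-06-23-agent/").load()

# Split into ~500-token chunks, roughly the parameters mentioned in the video.
splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=500, chunk_overlap=0
)
splits = splitter.split_documents(docs)

# Embed locally with GPT4All embeddings (CPU-optimized, no API key) and
# store the chunks in a local Chroma collection.
vectorstore = Chroma.from_documents(
    documents=splits,
    collection_name="rag-chroma",
    embedding=GPT4AllEmbeddings(),
)
retriever = vectorstore.as_retriever()

# Quick sanity check, mirroring the video.
relevant = retriever.get_relevant_documents("agent memory")
```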

  • What is the purpose of the conditional edge in the logical flow of the RAG process?

    -The purpose of the conditional edge in the logical flow of the RAG process is to make decisions based on the output of certain nodes, such as the grading step, to determine the next course of action, like whether to append a relevant document or perform a web search to supplement the retrieval.
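As an illustration, a conditional edge in LangGraph is just a function that reads the state and returns the name of the next node to visit; the 'run_web_search' key and the node names below are assumptions based on the flow described in the video, not the exact code shown:

```python
def decide_to_generate(state: dict) -> str:
    """Conditional edge: pick the next node based on the grading result."""
    # "run_web_search" is the flag the grading node is assumed to write.
    if state["keys"].get("run_web_search") == "Yes":
        # At least one document was irrelevant: rewrite the query, then web search.
        return "transform_query"
    # All retrieved documents were relevant: answer directly.
    return "generate"
```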

  • How does the JSON mode in Ollama help in constraining the output of the local LLM?

    -The JSON mode in Ollama helps in constraining the output of the local LLM by enforcing a specific output format, such as a binary yes/no score in JSON, which makes it easier to interpret and process the model's output within the logical flow of the RAG application.
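A sketch of that grading chain, assuming ChatOllama from langchain_community with a locally pulled mistral:instruct model; the prompt wording is paraphrased from the video, and retriever refers to the indexing sketch above:

```python
from langchain_community.chat_models import ChatOllama
from langchain_core.output_parsers import JsonOutputParser
from langchain_core.prompts import PromptTemplate

# format="json" tells Ollama to constrain the model's output to valid JSON.
grader_llm = ChatOllama(model="mistral:instruct", format="json", temperature=0)

grade_prompt = PromptTemplate(
    template=(
        "You are a grader assessing the relevance of a retrieved document "
        "to a user question.\n\nDocument:\n{document}\n\nQuestion: {question}\n\n"
        "Give a binary score 'yes' or 'no' to indicate whether the document is "
        "relevant. Provide the score as JSON with a single key 'score' and no "
        "preamble or explanation."
    ),
    input_variables=["document", "question"],
)

retrieval_grader = grade_prompt | grader_llm | JsonOutputParser()

# Grade one retrieved document, as in the video.
question = "agent memory"
doc = retriever.get_relevant_documents(question)[0]
print(retrieval_grader.invoke({"document": doc.page_content, "question": question}))
# e.g. {'score': 'yes'}
```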

  • What is the key takeaway from the discussion on building logical flows using local models and LangGraph?

    -The key takeaway is that building logical flows using local models and LangGraph allows for the creation of reliable and efficient RAG applications by breaking down the process into a series of logical steps, each performed by the local model, without the need for a complex agent executor. This approach enhances the reliability and manageability of the system.
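For orientation, the overall wiring might look roughly like the following sketch with the langgraph package; the node names and the GraphState, node, and conditional-edge functions are assumptions matching the flow described here (the state and node functions are sketched further down, in the Outlines and Keywords sections), not the exact code from the video:

```python
from langgraph.graph import StateGraph, END

workflow = StateGraph(GraphState)

# One function per node in the diagram.
workflow.add_node("retrieve", retrieve)
workflow.add_node("grade_documents", grade_documents)
workflow.add_node("transform_query", transform_query)
workflow.add_node("web_search", web_search)
workflow.add_node("generate", generate)

# Straight edges follow the diagram; the conditional edge routes on the grading result.
workflow.set_entry_point("retrieve")
workflow.add_edge("retrieve", "grade_documents")
workflow.add_conditional_edges(
    "grade_documents",
    decide_to_generate,
    {"transform_query": "transform_query", "generate": "generate"},
)
workflow.add_edge("transform_query", "web_search")
workflow.add_edge("web_search", "generate")
workflow.add_edge("generate", END)

app = workflow.compile()
```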

Outlines

00:00

Introduction to Self-Reflective RAG Apps

Lance from the LangChain team introduces the concept of building self-reflective RAG apps from scratch using open-source and local models. He discusses the trend of self-reflection in RAG research, where the system performs retrieval based on a question, grades the relevance of documents, and refines its process based on the quality of generations. Lance highlights the importance of feedback and retry mechanisms in self-reflective RAG, and introduces the 'Corrective RAG' paper as an example of this approach. He also mentions the use of LangGraph for implementing these ideas effectively with smaller local models.

05:00

Setting Up Local LLMs with Ollama

Lance explains how to set up local LLMs using Ollama, a tool that allows for easy model deployment on various platforms. He walks through downloading the Ollama application, selecting a model from the model list, and preparing the environment for model usage. Lance chooses the Mistral Instruct model, a 7-billion-parameter instruct model, and demonstrates how to pull it locally with Ollama. He also discusses the use of Nomic embeddings for local retrieval and the creation of a local index for performing RAG on a specific blog post.
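A minimal sketch of that setup, assuming the Ollama CLI and the ChatOllama integration from langchain_community (the exact model tag may differ from the one shown in the video):

```python
# Pull the model once with the Ollama CLI (in a terminal):
#   ollama pull mistral:instruct
from langchain_community.chat_models import ChatOllama

local_llm = "mistral:instruct"

# Plain chat model, used for query rewriting and final generation.
llm = ChatOllama(model=local_llm, temperature=0)

# The same model with JSON mode enforced, used for document grading.
llm_json = ChatOllama(model=local_llm, format="json", temperature=0)
```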

10:00

Building a Retrieval and Grading System

In this section, Lance details the process of building a retrieval and grading system for the RAG app. He uses GPT4All embeddings from Nomic and Chroma, a local vector store, to create a retriever. Lance demonstrates how to retrieve relevant documents based on a query and how to grade these documents using the local LLM. He emphasizes the use of JSON mode in Ollama for structuring the model's output to facilitate downstream processing in the graph.

15:01

Defining the Logical Flow of the RAG Graph

Lance outlines the logical flow of the RAG graph, explaining how state is transformed at each node. He describes the state as a dictionary containing keys relevant to RAG, such as the question, the appended documents, and eventually the generation. He also discusses the conditional edge in the graph that decides the next step based on the grading results. Lance highlights the convenience of Ollama's JSON mode in enforcing structured output for logical reasoning in the graph.
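A minimal sketch of that state definition, assuming the TypedDict style used in LangGraph examples; the single 'keys' dictionary and its entries are assumptions consistent with the description above:

```python
from typing import Dict, TypedDict

class GraphState(TypedDict):
    """State passed between nodes in the graph.

    keys holds the RAG-relevant entries, e.g.
      question:       the (possibly rewritten) user question
      documents:      the retrieved / filtered documents
      run_web_search: "Yes" or "No", set by the grading node
      generation:     the final answer
    """
    keys: Dict[str, object]
```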

20:02

Implementing Functions for Each Node and Conditional Edge

Lance provides a walkthrough on implementing functions for each node and conditional edge in the RAG graph. He explains how each node modifies the state and how functions are defined for retrieval, grading, query transformation, web search, and generation. He demonstrates how the grading function filters relevant documents and triggers a web search when necessary. Lance also shows how the decide-to-generate function acts as a conditional edge, determining the next node to traverse based on the search flag.
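Two of those node functions, sketched under the same assumptions as the earlier snippets; retriever, retrieval_grader, and the 'keys' state layout are illustrative placeholders rather than the exact code from the video:

```python
def retrieve(state: dict) -> dict:
    """Node: read the question from state, run retrieval, write documents back."""
    question = state["keys"]["question"]
    documents = retriever.get_relevant_documents(question)
    return {"keys": {"question": question, "documents": documents}}


def grade_documents(state: dict) -> dict:
    """Node: grade each document; keep relevant ones, flag a web search otherwise."""
    question = state["keys"]["question"]
    documents = state["keys"]["documents"]

    filtered_docs = []
    run_web_search = "No"
    for doc in documents:
        grade = retrieval_grader.invoke(
            {"document": doc.page_content, "question": question}
        )["score"]
        if grade == "yes":
            filtered_docs.append(doc)   # relevant: keep it
        else:
            run_web_search = "Yes"      # irrelevant: drop it and trigger a web search
    return {
        "keys": {
            "question": question,
            "documents": filtered_docs,
            "run_web_search": run_web_search,
        }
    }
```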

25:05

Testing the RAG App with Different Queries

Lance tests the RAG app with different queries, one relevant to the context and another not in the index. He shows how the app performs retrieval, grading, and generation for a question about agent memory, which is successfully answered using the blog-post index. For a question about AlphaCodium, which is not in the context, the app correctly identifies the irrelevance and performs a web search to supplement the answer. Lance emphasizes the reliability of using local models for logical reasoning tasks and the effectiveness of constraining the model to perform specific tasks at each step of the graph.
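Running the compiled graph might then look roughly like this; a hedged sketch that reuses the app variable and 'keys' state layout from the earlier sketches rather than the exact notebook code:

```python
# Question the blog-post index can answer: stays on the upper branch.
inputs = {"keys": {"question": "Explain how the different types of agent memory work"}}
for step in app.stream(inputs):
    for node_name, node_state in step.items():
        print(f"Finished node: {node_name}")
print(node_state["keys"]["generation"])

# Question outside the index: grading fails and the web-search branch runs.
inputs = {"keys": {"question": "Explain how AlphaCodium works"}}
print(app.invoke(inputs)["keys"]["generation"])
```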

Conclusion and Encouragement for Local LLM Usage

Lance concludes by encouraging the use of local models for complex logical reasoning tasks. He suggests that for certain problems, a state machine or a graph with a series of logical steps might be more effective than using an agent. He highlights the benefits of constraining the local model to perform small tasks at each step, which he finds more reliable for logical reasoning. Lance assures that the code for the RAG app will be shared and encourages others to experiment with this approach.


Keywords

self-reflective

In the context of the video, 'self-reflective' refers to the ability of a system or model to evaluate its own performance and make adjustments based on that evaluation. This concept is integral to the theme of building advanced language models that can perform tasks like retrieval and generation more effectively by reflecting on the relevance and quality of their outputs. An example from the script is the idea of performing retrieval based on a question and then grading the retrieved documents relative to the question, which embodies self-reflection as the system assesses the usefulness of its own retrieved information.

RAG (Retrieval-Augmented Generation)

RAG is a machine learning technique that combines information retrieval with text generation. It is central to the video's discussion about creating intelligent systems that can perform complex tasks. The script describes RAG as a process where a model first retrieves relevant documents based on a question and then generates responses using those documents. The video emphasizes the importance of self-reflection in the RAG process to improve the quality of the generated content.

open source

The term 'open source' relates to software or models that are freely available for use, modification, and distribution. In the video, the emphasis on open-source models highlights the accessibility and collaborative nature of the technology being discussed. The script mentions using open-source models like Mistral for local running, which underscores the theme of leveraging community-driven resources to build sophisticated applications.

local models

Local models refer to machine learning models that run on an individual's local machine rather than on a remote server. The video script discusses the benefits of using local models for tasks like RAG, emphasizing the importance of reducing reliance on API-gated models and focusing on self-sufficient, locally hosted solutions. This approach is illustrated in the script by downloading and running the Mistral 7B model locally using Ollama, a local model management tool.

LangChain

LangChain is a framework mentioned in the video that is used to build self-reflective RAG applications. It is significant because it facilitates the integration of various components, such as local models and retrieval mechanisms, into a cohesive and effective application. The script describes using LangChain, together with its LangGraph extension, to implement a corrective RAG approach, which involves iterative refinement of the retrieved information and generation process.

knowledge refinement

Knowledge refinement is the process of improving the quality or relevance of information within a system. In the context of the video, it is part of the self-reflection mechanism where documents deemed correct by the model undergo further analysis to compress and retain only the most relevant information. This concept is crucial for enhancing the efficiency and accuracy of the RAG process, as it allows the model to focus on the most pertinent data when generating responses.

web search

Web search is the act of querying the internet to find information. In the video, web search is used as a mechanism to supplement the retrieval process when the initial documents retrieved are deemed irrelevant or ambiguous. This approach demonstrates the video's theme of combining local model capabilities with external data sources to improve the overall performance and reliability of the RAG system.
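The video uses Tavily for this supplementary search; a minimal sketch of a web-search node under that assumption (parameter names such as max_results may vary across versions of the Tavily integration, and the state layout matches the earlier sketches):

```python
from langchain_community.tools.tavily_search import TavilySearchResults
from langchain_core.documents import Document

# Requires a TAVILY_API_KEY environment variable.
web_search_tool = TavilySearchResults(max_results=3)

def web_search(state: dict) -> dict:
    """Node: search the web for the (rewritten) question and append the results."""
    question = state["keys"]["question"]
    documents = state["keys"]["documents"]

    results = web_search_tool.invoke({"query": question})
    web_content = "\n".join(r["content"] for r in results)
    documents.append(Document(page_content=web_content))
    return {"keys": {"question": question, "documents": documents}}
```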

query rewrite

Query rewrite refers to the process of modifying a search query to improve the results it yields. In the context of the video, when the initial retrieval does not produce satisfactory results, the system performs a query rewrite to better refine the search and retrieve more relevant information. This concept is integral to the self-reflective nature of the RAG application being discussed, as it allows the system to adapt and improve its performance based on the initial outcomes.
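A sketch of such a query-rewriting chain using the same local model; the prompt wording is paraphrased rather than the exact prompt from the video, and llm is the plain (non-JSON) ChatOllama instance from the setup sketch above:

```python
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate

rewrite_prompt = PromptTemplate(
    template=(
        "You are generating a question that is well optimized for web search.\n"
        "Look at the input and reason about the underlying semantic intent.\n\n"
        "Initial question: {question}\n\n"
        "Provide an improved question with no preamble or explanation:"
    ),
    input_variables=["question"],
)

# Plain (non-JSON) mode is fine here; we only need rewritten text back.
question_rewriter = rewrite_prompt | llm | StrOutputParser()

better_question = question_rewriter.invoke({"question": "Explain how AlphaCodium works"})
```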

logical flow

Logical flow describes the sequence of steps or operations in a process. In the video, the logical flow is critical for the RAG application's performance, as it outlines the order in which tasks such as retrieval, grading, and generation are executed. The script details a multi-step logical flow that includes conditional branching based on the outcomes of previous steps, which is essential for the self-reflective capabilities of the system.

state machine

A state machine is a computational model that manages its behavior based on a set of states and the rules for transitioning between those states. In the video, the concept of a state machine is used to illustrate how the RAG application progresses through its logical flow. The script describes using a state machine-like approach to control the sequence of operations in the RAG process, with the ability to loop back to earlier stages or move to subsequent steps based on the outcomes of each operation.

Highlights

Building self-reflective RAG (Retrieval-Augmented Generation) apps from scratch using open source and local models.

Utilizing recent trends in self-reflection within RAG research to improve the quality of document retrieval and generation.

Implementing the idea of self-reflection in RAG by performing retrieval, grading documents, and potentially retrying steps based on relevance and quality.

The introduction of the corrective RAG (C-RAG) paper, which has gained attention and presents a straightforward approach to enhancing RAG.

Using LangGraph, a tool recently developed by the LangChain team, which works well with smaller, local LLMs and offers an alternative to relying on large-scale, API-gated models.

The process of running LLMs locally, with a focus on Ollama as a simple and efficient way to run models on personal devices.

Downloading and using the open-source Mistral model as a demonstration of how to work with local models for RAG applications.

Creating a local index for RAG using a blog post on autonomous agents and splitting it into chunks for efficient retrieval.

Employing GPT4All embeddings from Nomic, a CPU-optimized embedding model that runs locally without the need for an API.

Using Chroma, an open-source local vector store, to facilitate efficient document retrieval and indexing.

Defining a logical flow for RAG that involves a series of steps including retrieval, grading, decision-making, query transformation, web search, and generation.

The use of conditional edges in the logical flow graph, which allows for dynamic decision-making based on the output of previous steps.

Ollama's JSON mode is highlighted as a crucial tool for structuring model output in a way that can be reliably interpreted by subsequent steps in the logical flow.

A detailed example of how to build a RAG app using local models, including defining graph states, implementing functions for each node, and connecting nodes through edges.

The demonstration of a multi-step logical flow working effectively with local models, showing the potential for reliable and efficient RAG applications without the need for large-scale models.

The encouragement to consider the use of state machines or graph-based logical flows instead of agent-based executors for certain tasks, as it can be more reliable and manageable.

Transcripts

play00:02

hi this is Lance from Lang chain team

play00:04

I'm going to talk about building a

play00:07

self-reflective rag apps from scratch

play00:10

using only open source and local models

play00:13

um that run strictly on my

play00:16

laptop now one of the most interesting

play00:20

Trends in the rag research and a lot of

play00:23

like methods that become pretty popular

play00:25

in recent U months and weeks is this

play00:28

idea of self-reflection

play00:30

so when you do rag you perform retrieval

play00:33

based upon a question from an index and

play00:36

this idea of self-reflection is saying

play00:39

based upon for example the relevance of

play00:41

the retriev documents to my question or

play00:44

based upon you know the quality the

play00:46

generations relative to my question or

play00:48

the generations relative to the

play00:50

documents I want to make I want to

play00:52

perform some kind of reasoning and

play00:54

potentially feed back and retry various

play00:57

steps so that's kind of the big idea and

play00:59

there's a there's a few really

play01:00

interesting papers that implement this

play01:03

and what I want to kind of show is that

play01:07

implementing these ideas using something

play01:11

that we've developed recently called

play01:13

langra is a really nice approach um and

play01:17

it works really well with local llms

play01:19

that are much smaller for example than

play01:21

you know API uh gated very large scale

play01:25

Foundation

play01:27

models um and so we're going to look at

play01:30

particular paper called corrective rag

play01:32

or C rag now this paper is kind of um

play01:36

there's been some attention on for

play01:38

example Twitter about this work uh it's

play01:41

a really neat

play01:42

paper and the idea is actually pretty

play01:45

simple and straightforward if you go

play01:47

down to the figure

play01:48

here you're going to do perform

play01:52

retrieval and you're going to grade the

play01:54

documents relative to the the question

play01:57

so you're kind of doing a relevance

play01:59

grading

play02:00

and there's some theistic like basically

play02:02

if the documents are deemed correct they

play02:05

actually do some uh knowledge refinement

play02:08

where they further strip the documents

play02:10

to compress relevant chunks within the

play02:13

documents and retain them um and if the

play02:17

documents are either deemed ambiguous

play02:19

relative to the query or incorrect it

play02:22

performs a web search and supplements

play02:24

retrieval with the Webster so that's

play02:26

kind of the big idea but it's a nice

play02:28

illustration of this General principle

play02:30

of don't just do rag as kind of like a

play02:33

you know a singleshot process where you

play02:35

perform retrieval and then go to

play02:36

generation you can actually perform

play02:38

self-reflection and reasoning you can

play02:40

retry you can uh retrieve from

play02:43

alternative sources and so forth that's

play02:45

kind of the big

play02:47

idea now in our build here we're going

play02:49

to make some minor simplifications um

play02:53

here's kind of a layout of the graph

play02:55

that we're interested in we're going to

play02:57

perform retrieval and for that we're

play02:59

going to use no embeddings which run

play03:01

locally um we're going to build a node

play03:03

for grading those documents relative to

play03:06

the question to say are they relevant or

play03:08

not and if any documents are deemed

play03:11

irrelevant we'll go ahead and do a query

play03:14

rewrite web search and we'll s go ahead

play03:17

to generation based upon the web search

play03:19

results so that's the

play03:21

flow now first things first is how do I

play03:25

get started running LMS locally and and

play03:28

kind of where do I go and where I often

play03:30

direct people and what I found to be

play03:32

really useful is

play03:34

AMA um it is a really nice way to run

play03:38

models locally uh for example on your

play03:41

Mac laptop very easily and they are

play03:42

launching support for various other

play03:44

platforms as well um and so basically if

play03:47

you go to their website it's very simple

play03:49

you simply download their application um

play03:52

you can see it's running here on my

play03:54

machine um and once you have it

play03:58

downloaded you all you need to do is you

play04:01

can go to their model list and you can

play04:03

kind of search around so you can

play04:06

actually look I think it's sorted by

play04:07

popularity so you can see mraw obviously

play04:09

a really interesting open source model

play04:12

um is kind of one of the top so you can

play04:15

see it has like 210,000 polls it's one

play04:18

of the top models I click on this and

play04:21

where this takes me is a model page I

play04:23

can look at this

play04:25

tags uh Tab and this basically shows me

play04:30

um a bunch of model versions that I can

play04:32

really easily uh just like download and

play04:35

run and we'll we'll show how to do that

play04:37

here very

play04:38

shortly um what I'm going to do is I'm

play04:42

going to choose mrol instruct so that is

play04:45

their 7 billion uh parameter instruct

play04:48

model and so all I would do I'm going to

play04:50

go over to my notebook

play04:52

here so I have an empty notebook and all

play04:56

I've done is I've already uh done a few

play04:58

pip installs

play05:00

and I've also set uh a few environment

play05:03

variables to use Langs Smith and we'll

play05:04

see why that's useful later that's

play05:07

really all I've

play05:08

done now I'm going to put a note here to

play05:13

uh for

play05:15

olama and what I'm going to do is this

play05:18

AMA pull the model I want and you just

play05:22

run

play05:23

that so normally this will take a little

play05:26

bit because you're actually pulling the

play05:27

model and typically it's like a couple

play05:29

gigs I actually already have this model

play05:31

so it's faster um it's actually already

play05:34

done but that's really all you do okay

play05:36

so that's kind of like step one and then

play05:39

what we're going to do is I'm just going

play05:40

to create this variable local

play05:43

llm

play05:46

um that I am going to yeah so I'm just

play05:50

going to Define this variable Mr all

play05:52

instruct because this is the model that

play05:54

I download using a llama pull miston

play05:57

struct that's all that's going on here

play05:59

so this is be the llm I'm going to work

play06:00

with I've pulled this so it's local on

play06:02

my system it's available V Lama which is

play06:04

basically running in the background on

play06:06

my system and you can see is really

play06:08

seamless and easy to

play06:10

use now the first thing I want to do for

play06:14

this approach is I'm going to call this

play06:17

um

play06:19

index so because this was a corrective

play06:24

rag approach I need an index that I care

play06:26

about that I'm actually performing rag

play06:28

on and so here I'm going to use uh a

play06:32

particular blog post that I like on

play06:35

agents and we can like pull it up here

play06:37

and have a look let pull it up over here

play06:40

actually so this is a pretty neat blog

play06:42

post on autonomous agents it's like

play06:45

pretty long and mey so it's kind of like

play06:46

a good Target for performing retrieval

play06:49

on lots of details here uh really neat

play06:52

really detailed blog post so what I'm

play06:55

going to do is I'm going to load it here

play06:59

I'm going to split it and I'm going to

play07:01

use a chunk size of 500 tokens um these

play07:04

are kind of somewhat arbitrary

play07:05

parameters you can play with these as

play07:07

you want the point is here I'm just B

play07:08

building a a quick local index um so I

play07:12

load it I split it into chunks now this

play07:15

is the interesting bit I'm going to use

play07:17

GPT for all embeddings from nomic which

play07:20

is let's actually pull up the link here

play07:22

I had it available here so

play07:26

these are um you can see right here it

play07:31

is a CPU uh CPU optimized cont

play07:35

contrastively trained s basically eser

play07:38

model um so you can like drill into

play07:40

sentence Transformers so you can see um

play07:44

yep so there it is the initial work is

play07:46

described in our paper espert basically

play07:49

so the key point is this this is a

play07:52

locally running CPU optimized embedding

play07:54

model that works quite well I found um

play07:57

runs on your system no AP I nothing so

play08:00

it's pretty nice runs fast so we're

play08:02

going to go ahead and use that um from

play08:05

our friends at

play08:06

nomic and I'm also going to use chroma

play08:08

which is an open source local Vector

play08:10

store that's really easy to spin up runs

play08:12

locally and all I'm doing is I'm taking

play08:14

my documents um I'm going to define a

play08:17

new collection taking my embedding model

play08:19

GPD for all embeddings I'm going to

play08:20

create a retriever from this so there we

play08:23

go okay it shows you some uh you some

play08:27

parameters so cool I have a retri so we

play08:29

can actually call we can say get

play08:31

relevant

play08:32

documents um and I can say something

play08:34

about like um let's say like agent

play08:37

memory or something you know let's just

play08:40

test and okay cool look at that so it's

play08:41

nice and quick we get a bunch of

play08:43

documents out that relate to memory so

play08:45

yeah you can see memory stream like the

play08:47

documents are are sayane so it looks

play08:49

like everything's kind of working here

play08:51

so that's great we have a

play08:53

retriever

play08:55

now let's think a little bit about what

play08:58

we want to do next next so when I do

play09:01

these kinds of uh kind of logical rag

play09:05

flows but as graphs first I always try

play09:08

to lay out the

play09:11

logic and um let me move this up here I

play09:16

try to lay out kind of The Logical

play09:19

steps um and in each logical

play09:23

step what's happening is I'm

play09:26

transforming state so in these Gra

play09:29

really all you're doing is you're

play09:30

defining a stake that you're just

play09:32

modifying throughout the flow of the

play09:34

graph now in this case because we're

play09:38

we're interested in rag our state is

play09:40

just going to be a dictionary and that

play09:42

dictionary you can see I actually kind

play09:43

of schematically uh laid it out here

play09:46

it's just going to contain a few keys

play09:48

that are things relevant to rag it's

play09:50

going to be like a question then it's

play09:51

going to be you pen documents to your

play09:53

dick um and then eventually you're

play09:56

independent generation so that's really

play09:58

all that's going on in terms of like how

play10:00

your State's being propagated through

play10:02

the graph and at every node you're

play10:05

making some modification to State that's

play10:07

the key point so you're basically going

play10:09

to do you start with a question from the

play10:11

user you perform retrieval relevant to

play10:13

the question um you're then going to

play10:16

grade the documents so you're going to

play10:17

do a modification of the documents then

play10:20

you're going to make a decision are they

play10:22

relevant or not if they're not relevant

play10:24

um you're going to transform the query

play10:26

so you modify the question do a web

play10:28

search the final step is a generation

play10:31

based upon the D documents so that's

play10:32

your

play10:33

flow now what I want to call out here is

play10:37

there's one very important what we call

play10:39

conditional Edge where depending upon

play10:41

the results of the grading step I want

play10:44

to do one thing or another so I'm going

play10:45

to make a decision so I want to show you

play10:48

something that's very

play10:50

convenient um that we can use with olama

play10:55

to help us

play10:56

here so this

play10:59

is I'm going to kind of make a note here

play11:02

um to note what I'm going to highlight

play11:06

so this is AMA Json

play11:09

mode so the basic logic of that

play11:13

conditional Edge decide to generate is

play11:15

going to be something like this I

play11:16

already have this prompt laid out um but

play11:19

it's basically going to be I'm going to

play11:21

take a document and I'm going to take my

play11:24

question and I'm going to do some kind

play11:26

of comparison to say is the document

play11:27

relevant to the question that's really

play11:29

what I want to do but here's the catch

play11:32

because I want that edge to process very

play11:34

particular output either yes or no I

play11:36

want to make sure that my output is

play11:39

structured in a way that can reliably be

play11:41

interpreted Downstream in my in my graph

play11:45

this is where Json mode from Alama is

play11:47

really useful and you can see all I do

play11:50

is now I'm I'm importing chat llama this

play11:54

is going to reference that local model

play11:57

that I specified up here Mr instruct

play12:00

which I've downloaded so I have the

play12:01

model

play12:02

locally and I'm just flagging this um

play12:06

format Json to tell the model to Output

play12:09

Json

play12:10

specifically and what I'm going to do in

play12:13

my prompt here I'm basically saying you

play12:16

know you're a grer um here's the

play12:19

documents here's the question and here's

play12:20

the catch give a binary score yes no um

play12:25

and provide it as Json with a single key

play12:28

score and no Preamble no explanation so

play12:30

I kind of explain in the prompt what I

play12:33

want and when I call this with Json mode

play12:37

uh it will enforce that Json is returned

play12:40

and hopefully with this single key we

play12:41

expect score and either binary yes no

play12:45

and when I'm going to run that as a

play12:46

chain so I'm going to supply that prompt

play12:49

to my llm and I'm going to then parse

play12:51

that Json string out into a Json object

play12:54

which I can work with so let's try that

play12:57

we're going to try to run this chain we

play12:59

defined we're going to run retrieval on

play13:02

here's a here's a question here's our

play13:04

docs let's grade one of the docs using

play13:07

basically passing question and one

play13:09

document and we're going to take the

play13:10

page content from the document which is

play13:12

like basically all the text and we're

play13:14

going to run

play13:15

this so let's test that quickly and it

play13:18

is still running now it's finished let's

play13:20

check the output here we can see so we

play13:23

get a Json back which just is the score

play13:25

yes no so that's exactly right that's

play13:27

what we want and we can actually look

play13:29

under the hood here at

play13:33

um yeah so we can actually look under

play13:36

the hood in Langs Smith at that grading

play13:39

process and we can see here that our

play13:42

prompt got populated with um the context

play13:47

so here is the

play13:49

document um

play13:52

and um right here was a question here is

play13:55

a document and um the task was of course

play14:00

to grade it so we can see here's like

play14:03

the full prompt you're a grader

play14:05

assessing the relevance retri document

play14:06

here's a document and then here's the

play14:09

model output score yes so this is really

play14:11

nice we've basically enforced the output

play14:15

from our local

play14:16

llm um using Json mode so we know every

play14:20

time it's going to Output binary yes no

play14:23

score as a Json object which we again

play14:25

extract so that's a very key point that

play14:27

I just wanted to flag it's a very nice

play14:29

thing that ama offers that's extremely

play14:32

helpful when building out uh

play14:34

particularly these kind of logical

play14:36

graphs where you really want to

play14:37

constrain the flow at certain

play14:40

edges so that's kind of the like really

play14:44

key thing I wanted to highlight a lot of

play14:47

the rest of this is actually pretty

play14:49

straightforward so let's now Define our

play14:52

graph State this is the dictionary that

play14:54

we're going to basically pass between

play14:55

our nodes so this is just some code I'm

play14:57

going to copy over

play14:59

this is defining your graph State you're

play15:00

just saying it's a dict that's all

play15:02

there's really to that um now here is

play15:06

where I'm going to copy over some code

play15:10

that basically implements a function for

play15:14

every node and every conditional Edge in

play15:18

our graph so if you remember we can kind

play15:20

of go over and look our graph is laid

play15:23

out like this and all we're doing is for

play15:27

every node drawn we're going to find a

play15:29

corresponding function here which

play15:32

performs some operation so retrieve is

play15:35

basically just doing we had our retri

play15:37

defined get relevant documents and write

play15:39

them out to state so again we take a

play15:42

question in so if you look here we

play15:46

basically have this state dict passed

play15:48

into the function we extract the state

play15:51

dict here uh we extract the question

play15:53

from the state dict we do retrieval and

play15:56

we write that state dict back out to the

play15:58

so you think about every node is just

play16:00

doing some modification on the state

play16:03

reading it in doing something writing it

play16:06

back out that's really all that's going

play16:07

on and we can really just march across

play16:10

our little like diagram here and see how

play16:14

um basically each one of these nodes is

play16:17

implemented as a function and again you

play16:20

can see in every case we're using uh for

play16:24

example cadow llama in some of these

play16:26

cases we don't need Json mode so if

play16:28

we're just doing like a generation step

play16:31

um as you can see here we don't need

play16:32

Json mode for the grading we do so we're

play16:35

actually going to implement here the

play16:37

same thing we just showed um chat AMA

play16:40

using Json mode and what's going to

play16:42

happen is we can see we generate our

play16:44

score every time and then we can extract

play16:47

our grade from that

play16:49

Json and then we know the grade is going

play16:52

to constrained to the output yes or no

play16:55

then here's the key point we do some

play16:58

logical reasoning on that um to say for

play17:02

example if the grade is yes um then

play17:05

we're going to um like append the

play17:08

document it's relevant if not then what

play17:11

we're going to do is we're going to

play17:13

filter that document out and we're also

play17:15

going to set this flag to search perform

play17:18

web search as yes so what really

play17:21

happening here is we are kind of

play17:25

applying a kind of a logical gate to say

play17:29

if any document is scored as relevant

play17:32

then we just add it to our final list of

play17:34

filter documents if not we're going to

play17:36

go ahead and do a web search and we're

play17:38

going to set the search flag to be yes

play17:40

and we're not going to include that

play17:41

document in the output and you can see

play17:43

here we return a dictionary which

play17:45

contains our filter documents our

play17:47

question and then that flag to run web

play17:50

search yes or no you can see it was

play17:52

default no but if we ever encounter an

play17:54

irrelevant document we change that to

play17:56

yes so that's really all that's going

play17:58

going on here um you can see we do our

play18:01

queer transform down here again we just

play18:03

use um Mr all again here is like a a

play18:07

transform prompt but you kind of get the

play18:09

idea um web search node we use Tav web

play18:13

search here it's really kind of a nice

play18:14

quick way to perform web searches um and

play18:18

you can see we just supplement the

play18:19

documents with the web search results

play18:21

and then this was kind of the final step

play18:24

where we wrote out yes or no to our

play18:27

search key and depending upon the state

play18:31

which we can read in here we make

play18:33

decision to uh basically either return

play18:37

transform query or return generate which

play18:41

will basically that's determining the

play18:43

next uh node to go to um so this decide

play18:48

to generate is our conditional Edge

play18:50

that's actually right here and so it's

play18:53

looking at the results that we wrote out

play18:56

from grade documents in particular

play18:58

that uh search yes or no key in our dict

play19:05

and it's then going to basically

play19:07

determine the next node to Traverse to

play19:09

that's really all we're doing here so

play19:11

that's kind of nice now what we're going

play19:13

to do is we kind of copied over all

play19:17

these um these functions we then can go

play19:22

ahead and run that and now we just lay

play19:26

out our graph so again

play19:29

our graph was kind of explained here and

play19:33

here's where we actually just lay out

play19:35

the full kind of graph

play19:37

organization um how we're going to

play19:39

connect each node so we add the nodes

play19:41

first we set our entry point and then we

play19:44

add the edges accordingly between the

play19:46

nodes and basically the logic here just

play19:49

Maps over to our diagram here that's

play19:51

really all that's

play19:53

happening

play19:55

um

play19:57

cool

play19:59

so I'm going to go ahead and go

play20:02

down and now let's kind of see this all

play20:06

working together so I'm going to go

play20:07

ahead and compile My

play20:09

Graph and I'm going to go ahead and ask

play20:12

a question explain how the different

play20:14

types of agent memory work and what I'm

play20:18

going to do let's go back to our D

play20:20

diagram so we can kind of reference that

play20:22

I'm going to call this and I'm actually

play20:25

just going to like this will like

play20:26

Traverse every step along the away and

play20:28

it'll print out something to explain

play20:30

what's happening so you can see I

play20:32

perform retrieval and now I'm doing my

play20:34

grading steps and this is all running

play20:36

locally um and they were all deemed

play20:39

relevant so then I'm going to go ahead

play20:41

and

play20:42

generate and it's running right

play20:44

now and there we go so we can go over to

play20:48

Lang Smith and let's actually have a

play20:51

look at what happened under the hood so

play20:52

this is what just ran so we can see that

play20:56

at each one of these steps

play20:58

we called Shadow llama with our mraw 7B

play21:02

model that's running

play21:04

locally um and this is our grading step

play21:07

so this was each document being graded

play21:10

um so again like look at this so it

play21:12

outputs a binary score yes no as a dict

play21:15

that's great um so this has a bunch more

play21:18

down here so these are all of our

play21:20

documents uh graded and now here is that

play21:23

final llm call which basically packed

play21:26

that all into our rag prompt you're an

play21:28

assistant for question answering task

play21:30

use a following to answer the question

play21:32

here's all up our docs here's the answer

play21:35

so that's pretty cool um we can see that

play21:37

this uh multi-step logical flow all

play21:41

works um now let's try something kind of

play21:45

interesting I'm going to ask a question

play21:46

that I know is not in the context and

play21:50

see if it will kind of perform that

play21:52

default to do web search so um I'm going

play21:56

to say Explain how how uh

play22:00

Alpha codium works so this is a recent

play22:04

paper that came out that's not relevant

play22:06

at all to this blog post so I know uh

play22:09

that retrieval should not be considered

play22:11

relevant and let's go ahead and run that

play22:14

and convince oursel that that's true so

play22:16

good this is perfect so the greater is

play22:19

determining these documents are not

play22:20

relevant and so it should be making that

play22:23

decision to perform web search so it it

play22:25

should be kind of going to this lower

play22:27

branch

play22:28

transform the query run web search and

play22:31

looks like that all ran so it tells us

play22:33

Alpha coding is an open source AI coding

play22:35

generation tool developed by Cod M this

play22:38

is perfect that's exactly what it is and

play22:40

we can actually go into Langs Smith and

play22:42

again see what happened here so you can

play22:46

see here the trace is a little bit more

play22:48

extensive because all of our grades are

play22:51

incorrect so or irrelevant again we get

play22:54

the nice Json

play22:55

out um

play22:58

and okay so this is pretty cool so this

play23:01

was our question rewriting node so

play23:04

basically provid an improved input

play23:06

question without any Preamble so what is

play23:08

the mechanism behind Alpha codium

play23:10

functionality so it modifies the

play23:12

question we use Tali search right here

play23:15

so it basically does retrieval it

play23:17

searches for Stuff related to Alpha

play23:19

codium so that's great and then we

play23:22

finally passed that to our our model for

play23:24

Generation based on this new context and

play23:28

there we go Alpha codom Source AI code

play23:30

assistant tool um so that kind of gives

play23:33

you the main idea and the key point is

play23:37

this is all running locally again I used

play23:40

GPT for all embeddings for indexing up

play23:43

at the top right here and I used AMA

play23:48

with mrol 7B instruct um and Json mode

play23:53

for that one crucial step where I need

play23:56

to constrain the output to be kind of a

play23:57

score of yes no um and for other things

play24:01

I just use the model without Json mode

play24:03

to do perform Generations like to

play24:05

question rewrite or to do the final

play24:08

generation so in any case I hope this

play24:10

gives you kind of an overview of how to

play24:12

think about building logical uh flows

play24:15

doesn't have to be rag but rag is a

play24:17

really good kind of use case uh for this

play24:20

using local models and Lang graph and

play24:24

the thing I want to kind of leave you

play24:25

with is there is a lot of interest in

play24:28

complex logical reasoning using local

play24:30

llms and a lot of you know focus on

play24:32

using agents and I do want to kind of

play24:35

encourage you to think about depending

play24:37

on the problem you're trying to solve

play24:39

you may or may not actually need an

play24:40

agent it's possible that kind of

play24:42

implementing a state machine or a graph

play24:44

kind of as shown here with some series

play24:46

of logical steps this can incorporate

play24:49

Cycles or Loops back to like prior

play24:51

stages we have some more complex

play24:53

examples that show that um this actually

play24:56

can work really well with local mod

play24:57

models because a local model is only

play25:00

performing a step um within each node so

play25:05

you're kind of constraining it to like

play25:07

just do this little thing just do this

play25:09

little thing like just rewrite the

play25:11

question just grade the document rather

play25:14

than using the local llm um as like you

play25:18

know an agent executor that has to make

play25:21

all these decisions kind of jointly um

play25:25

or kind of in a less controlled workflow

play25:30

where for example like the the The

play25:32

Ordering of these various tasks can be

play25:34

determined arbitrarily by the agent here

play25:37

we really nicely constrain The Logical

play25:40

flow and let the local model just do

play25:43

little tasks at each step and I've just

play25:45

found it to be a lot more reliable and

play25:48

really useful for these kinds of like

play25:49

logical reasoning tasks um so hopefully

play25:52

this is helpful give it a try um and

play25:55

we'll make sure all this code is is

play25:56

easily shared thank thank

play25:58

you
