What is RAG? (Retrieval Augmented Generation)

Don Woodlock
18 Jan 2024 · 11:37

Summary

TLDR: The video explains retrieval augmented generation (RAG), a popular AI architecture that lets large language models answer questions using an organization's proprietary content. It works by breaking the organization's content into chunks, vectorizing those chunks so the most relevant passages can be found, and packaging those passages with the user's question into a prompt that is fed to the language model to generate an answer. RAG leverages the power of language models while tailoring their responses to an organization's unique content. The video aims to clarify this increasingly common AI pattern, which creates chatbot-like experiences for employees and customers.
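
To make the pattern concrete before the details, here is a minimal, self-contained sketch of the whole loop in Python. The hash-based embed() is a toy stand-in for a real language-model encoder, and the content chunks are invented examples; only the overall retrieve-augment-generate shape reflects the video.

```python
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy bag-of-words embedding; a real system would use an LLM encoder."""
    vec = np.zeros(dim)
    for word in text.lower().split():
        vec[hash(word) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

# Invented example content, pre-chunked and pre-vectorized ("indexing").
chunks = [
    "Visitor parking is free in the garage on Elm Street.",
    "Knee surgery patients should fast for 12 hours before arrival.",
]
chunk_vecs = [embed(c) for c in chunks]

# Retrieve: compare the question vector against every chunk vector.
question = "do you have parking"
q_vec = embed(question)
best = max(range(len(chunks)), key=lambda i: q_vec @ chunk_vecs[i])

# Augment: build "the prompt before the prompt", then append the question.
prompt = f"Answer using only this content:\n{chunks[best]}\n\nQuestion: {question}"
print(prompt)  # this assembled prompt is what gets sent to the language model
```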

Takeaways

  • 😀 RAG stands for Retrieval Augmented Generation and is a popular AI solution pattern
  • 📚 RAG leverages large language models to create chatbot-like experiences using your own content
  • 🤔 RAG works by retrieving the most relevant content from your documents/website to augment the language model's generation
  • 🔎 Your content is broken into chunks which are vectorized - this allows finding content relevant to the user's question
  • ⚙️ The relevant content chunks are added to the prompt sent to the language model to provide context for answering the question
  • 🔢 Vector similarities between the user question and content chunks allow retrieving the most relevant information
  • 🗣 A prompt with instructions, relevant content chunks, and the user question is sent to the language model to generate an answer
  • 💡 RAG provides a better experience than just listing relevant documents - it creates a new tailored answer
  • ✨ Most large language model projects leverage some form of RAG architecture to work with custom content
  • 📝 Prompt engineering is important for providing good instructions and content to the language model

Q & A

  • What is the key benefit of using a RAG architecture?

    -A RAG architecture allows you to leverage the power of large language models to answer questions and generate content, while using your own proprietary content rather than just what's available on the public internet.

  • What are some common use cases for a RAG system?

    -Common use cases include customer chatbots, internal company FAQs, documentation search, and automated ticket resolution by searching past tickets and solutions.

  • How does the system know which content is relevant to the user's question?

    -The system vectorizes all the content, giving each chunk a vector of numbers representing its essence. The user's question is also vectorized. Then the vectors are compared to find the chunks most similar to the question vector.

  • What is a prompt in this context and why is it important?

    -A prompt provides instructions and context to the language model. Careful prompt engineering, especially the content before the actual question, helps the model generate better, more tailored responses.

  • What types of content can I use in a RAG system?

    -You can use any text content - website pages, PDF documents, internal databases, past tickets or cases, etc. The system will index and vectorize the content for easy retrieval.

  • Does a RAG system replace search engines and knowledge bases?

    -No, RAG systems complement search and knowledge bases by taking content, aggregating it, and generating tailored responses. So they enhance rather than replace existing systems.

  • What are vectors in this context and how are they created?

    -Vectors are numeric representations of chunks of text, encoding their semantic meaning. They are generated by passing content through a language model encoder which outputs the vector.

  • What components make up a RAG system?

    -The main components are the content indexer, vector database, retriever, language model (encoder and decoder), and prompt generator; a sketch of how these fit together follows this Q&A section.

  • Is a separate language model required for encoding versus decoding?

    -No, the same language model can be used for both encoding content to vectors and decoding prompts to generate responses. But separate models can be used as well.

  • How much content is needed to create an effective RAG system?

    -There's no fixed amount - start with whatever content you have. The more high-quality, relevant content you can provide, the better the generated responses will be.
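
Picking up the components question above, here is a hypothetical sketch of how those pieces could be wired together. The Protocol interfaces and method names are illustrative assumptions, not a standard API.

```python
from typing import Protocol

class Encoder(Protocol):          # turns text into vectors
    def encode(self, text: str) -> list[float]: ...

class VectorDB(Protocol):         # stores vectors, finds nearest chunks
    def add(self, vector: list[float], chunk: str) -> None: ...
    def search(self, vector: list[float], k: int) -> list[str]: ...

class LLM(Protocol):              # generates the final answer
    def complete(self, prompt: str) -> str: ...

def rag_answer(question: str, enc: Encoder, db: VectorDB, llm: LLM) -> str:
    relevant = db.search(enc.encode(question), k=5)              # retriever
    prompt = "\n\n".join([*relevant, f"Question: {question}"])   # prompt generator
    return llm.complete(prompt)                                  # generation
```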

Outlines

00:00

😃 Introducing the RAG architecture for leveraging large language models on custom content

The paragraph introduces the idea of the retrieval augmented generation (RAG) architecture, which allows using large language models to generate answers and content based on a company's own custom documents and data sources, similar to how ChatGPT leverages internet content. It explains why RAG is popular for building chatbots, Q&A systems, and more on top of proprietary data.

05:02

😉 Explaining the components of a RAG system

The paragraph provides an overview of the components that make up a RAG system: the user question, content sources like a website or documents, instructions that frame the context for the language model, relevant content extraction using vector similarities, assembling the prompt with the extracted content, sending it to the language model, and getting the final generated response.

10:04

🤓 How content chunks are vectorized for retrieval in the RAG architecture

The paragraph focuses on how content is vectorized in chunks, such as paragraphs, to enable quick retrieval of the pieces most relevant to a given user question. It explains how vectors represent the essence of chunks numerically, with similar chunks having similar vector representations, which allows matching user questions to related content.
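
A naive illustration of the chunking step: split on blank lines and pack paragraphs up to a size limit. Real systems use smarter splitters, but the idea is the same; the 500-character default is an arbitrary illustrative choice.

```python
# Naive paragraph chunker: split on blank lines, then pack paragraphs
# together until a character limit is reached.
def chunk_document(text: str, max_chars: int = 500) -> list[str]:
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += para + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks

doc = "Our clinic offers free parking.\n\nKnee surgery requires a 12-hour fast."
print(chunk_document(doc, max_chars=40))
# ['Our clinic offers free parking.', 'Knee surgery requires a 12-hour fast.']
```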

Keywords

💡language model

A language model is a key component of systems that can generate human-like text. As described in the video, large language models like ChatGPT can ingest content from across the internet and generate new text that seems to answer users' questions. Language models power the text generation capabilities of these systems.

💡prompt

A prompt is the input text that is fed into a language model to get it to generate an output text. As described in the video, prompts can include instructions for the language model, content for it to ingest and incorporate, and the actual user question to be answered. Careful engineering of prompts can improve the quality of language model outputs.

💡vector

In the context of this video, a vector is a numeric representation of a chunk of text, generated by passing the chunk through a language model. Vectors are constructed so that passages on similar topics have similar numeric values, which enables quickly finding content relevant to a user's question to augment prompts.
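
As one concrete option (the video doesn't name a specific encoder), the open-source sentence-transformers library produces exactly this kind of vector:

```python
# One way to turn chunks into vectors, using the open-source
# sentence-transformers library (an example encoder; any embedding model works).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # small general-purpose encoder
chunks = ["Free parking is available in the Elm Street garage.",
          "Knee surgery patients should fast for 12 hours."]
vectors = model.encode(chunks, normalize_embeddings=True)
print(vectors.shape)  # (2, 384): one 384-dimensional vector per chunk
```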

💡retrieve

Retrieval refers to the automated searching and selection of relevant content from a knowledge base to augment a prompt. For example, finding paragraphs about parking from a website to enhance a prompt to answer user questions about parking.
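
A minimal illustration of retrieval by similarity, with made-up three-dimensional vectors standing in for real embeddings (which typically have hundreds of dimensions):

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: 1.0 means identical direction, 0.0 unrelated."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

question_vec = np.array([0.9, 0.1, 0.3])        # vector for "do you have parking"
chunk_vecs = {
    "parking info": np.array([0.8, 0.2, 0.4]),  # similar topic, similar numbers
    "knee surgery": np.array([0.1, 0.9, 0.2]),  # different topic, distant vector
}
ranked = sorted(chunk_vecs, key=lambda k: -cosine(question_vec, chunk_vecs[k]))
print(ranked)  # ['parking info', 'knee surgery']
```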

💡augment

Augmentation refers to enhancing or improving a system's capabilities, in this case using retrieved content to improve a language model's ability to generate high quality, customized responses to users' questions.

💡generation

Generation refers to a language model's capacity to create new text that seems to appropriately answer questions and follow instructions. Retrieving custom content to augment prompts enhances this text generation.

💡vector database

A vector database stores numeric vector representations of text passages, enabling efficient similarity searches to find content for prompt augmentation. As described in the video, user questions can also be vectorized to find relevant vectors.
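
A minimal in-memory stand-in for a vector database, using brute-force cosine search. Dedicated vector stores add indexing for scale, but the add/search interface is conceptually similar:

```python
import numpy as np

class TinyVectorDB:
    """Toy vector store: keeps normalized vectors and does brute-force search."""
    def __init__(self):
        self.vectors, self.chunks = [], []

    def add(self, vector: np.ndarray, chunk: str) -> None:
        self.vectors.append(vector / np.linalg.norm(vector))
        self.chunks.append(chunk)

    def search(self, query: np.ndarray, k: int = 5) -> list[str]:
        q = query / np.linalg.norm(query)
        scores = np.array([q @ v for v in self.vectors])  # cosine similarities
        return [self.chunks[i] for i in np.argsort(-scores)[:k]]
```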

💡prompt engineering

Prompt engineering involves carefully structuring prompts to language models, including instructions, content, and questions, in order to guide the model to generate high quality outputs. This allows customizing systems for specific applications.
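
A sketch of such a prompt being assembled; the instruction wording and content chunks are invented for illustration, not a prescribed format:

```python
# Assemble "the prompt before the prompt": instructions, retrieved content,
# then the user's question at the end.
instructions = (
    "You are a contact center specialist for a hospital. "
    "Answer patient questions politely, using only the content provided."
)
retrieved_chunks = [
    "Free visitor parking is available in the Elm Street garage.",
    "Valet parking is offered at the main entrance on weekdays.",
]
question = "Do you have parking?"

prompt = "\n\n".join([instructions, *retrieved_chunks, f"Question: {question}"])
print(prompt)  # this full package is what the language model actually sees
```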

💡RAG architecture

RAG stands for 'Retrieval Augmented Generation' and refers to the overall system design of using retrieval/search to select relevant content, in order to augment/enhance the text generation capabilities of a language model to suit custom needs.

💡content ingestion

Content ingestion refers to the language model's ability to take in or encode large volumes of text content, which allows it to then generate high quality text as if it has understood that content. Retrieving custom content facilitates ingestion.

Highlights

Explains the retrieval augmented generation (RAG) architecture which leverages large language models on custom content

RAG systems can answer questions or generate content using a company's custom documents instead of just generic internet content

Prompt engineering, the instructions placed before the actual user question, lets RAG systems guide the language model's response

Relevant custom content chunks are embedded into numeric vectors, allowing matches against the user's question vector

The top content chunks matching the question vector are added to the prompt to augment the language model's generation

Retrieval refers to finding relevant custom content chunks, and augmentation refers to enhancing the language model's generation

RAG is a popular solution pattern for creating chatbot-like experiences from custom content with large language models

Example use cases include customer chatbots, product documentation answers, and internal ticket resolution suggestions

Prompt engineering provides instructions to the language model, such as branding guidelines or tone of voice

The retrieved content chunks are placed in the prompt ahead of the actual user question

Paragraph chunks allow breaking down large content sources into smaller pieces for matching

Similar paragraph chunks have similar numeric vector representations for easily finding relevant matches

The user's question is also vectorized to find the top matching content vector chunks

Pairing the most relevant custom content with the question yields better responses than a generic language model alone

The whole RAG architecture greatly enhances the user experience over just providing raw content results

Transcripts

00:00

Hello everyone, welcome to my video series. I'm rotating through three different types of topics: educational topics, use case topics, and bias, ethics, and safety topics. We're now on the education rotation, and today what I wanted to talk about is: what is retrieval augmented generation, or RAG? You may think I'm going into some nook and cranny of the AI field, but this is a very important and popular solution pattern that I see being used over and over again for leveraging large language models, so I thought I would explain it to you. The thing this is used for is basically systems that leverage large language models, but on your own content.

00:55

Let me describe that. Think of the ChatGPT experience relative to the search engine experience we had before. If you ask a question like "what color is the sky" or "how do I fix this plumbing issue," a search engine would go out, or appear to go out, search the internet, find relevant content, and then just list that content, list those links. Then you as a user would need to click on the links that seem right, read them, digest them, and figure out the answer to your question. What a large language model does is it seems to do that first part, leveraging the content of the whole internet, but instead of just listing that content it digests it, combines it, assembles it together, and answers your question; it generates an answer. So it's a whole lot better. Search engines have been great, but this takes the whole experience to another level. In addition to question answering, you can also give it instructions, like "write me this document" or "write me a lesson plan to teach geometry to seventh graders," and it will do something similar: it will assemble content it has seen that talks about geometry or seventh graders or how to do lesson plans, pull that together, and write out a lesson plan. So it's a much better experience than just getting the raw content from the internet; it really creates something new from it.

02:30

Now let's say you want that same experience, but on your own content. It might be a chatbot on your website. You might have a library of PDF documents, say the documentation for one of your products, and instead of just linking the user to sections of the documentation you want to actually answer their question. It might be your service ticketing system, so when a new issue comes in you could ask, "how would I resolve this issue?" and it can assemble past similar issues and come up with a new solution based on them. So this is an incredible experience that these large language models offer, but how can you create that experience on your own content, which might not be available on the internet or to these large language models? The solution is the RAG architecture, this retrieval augmented generation architecture. So now I'm going to do my best to explain that to you.

03:28

Let's say you have a user, and I'm going to use the example of a patient chatbot. The content source is going to be the content from your website, let's say, or it could be content from PDF documents or whatever, but you want this to be the content that answers the patient's questions. So if the patient has a question like "how do I prepare for my knee surgery," instead of just going to ChatGPT and getting a generic answer, you'd like to provide an answer from your health system. Or for a question like "do you have parking," you'd like to provide an answer for your health system, for the office where the patient is seen. So that's the scenario, and the patient's question is going to be "do you have parking." You can imagine that question being bundled up into what's called a prompt, and I'll describe this more later. That prompt is sent to a large language model, and the large language model will come up with a response to that question.

04:48

Now, if you just wanted to use ChatGPT, let's say, or some other LLM without any extra content, you could just use this flow: "how do I prepare for my knee surgery" or "do you have parking," put that into a prompt, send it to the large language model, and get a response back. But what we want to do is enhance this experience with our own content. So let's say here is your content source; again, this might be all the content of your website, or PDF documents, or an internal ticketing system, or databases, that sort of thing. And what you'd like to do is something called "the prompt before the prompt." In these systems you don't just send the user question to the large language model; you usually have some level of instructions. The instructions might be: "You are a contact center specialist working for a hospital, answering patient questions that come in over the internet. Please be nice to the patients, and responsive, and folksy, because that fits with our brand." Instructions like that are sometimes sent with the prompt. Additionally, you want to provide the information that the LLM needs to answer the question, so what you'd ideally like is information from your website to be included here and sent to the LLM as well. So the full prompt might be your instructions, then something like "please use this content to answer the patient question at the end," then a bunch of information about parking or knee surgery or whatever the patient asked (that's the prompt before the prompt), then the question. You send that whole package to the LLM, and the LLM will give a great response based on your content. With me so far? So this notion is the prompt before the prompt, and that's why prompt engineering and these types of things are a big field right now, because you can really hone these systems by doing a better and better job with the actual prompt before the prompt.

07:10

Now, the last trick here is that your website or your content is huge, and it talks about all kinds of topics beyond parking and beyond knee surgery, so you really want to somehow pull out only the parts of your content that are relevant to the patient's question. This is another tricky part of this whole RAG architecture, and the way that works is that you take all your content and break it into chunks, or these systems will break it into chunks. A chunk might be a paragraph of content, or a couple of paragraphs, or a page, something like that. Then those chunks are sent to a large language model (it could be the same one or a different one), and they are turned into a vector. So each paragraph, each chunk, will have a vector, which is just a series of numbers, and you can think of that series of numbers as the numeric representation of the essence of that paragraph. What's different about these numbers is that they're not random numbers: paragraphs that talk about a similar topic have close-by numbers; they almost have the same vectors. So it's a numeric version of the paragraph, but such that similar paragraphs on similar topics will have similar vectors, similar numbers.

08:46

So that means that when a user asks a question like "do you have parking," that is also sent to the LLM in real time, right after the user asks the question, and that comes up with a vector as well; you could think of that as the question vector. Then what happens is we do a quick mathematical comparison between the vector of the question and the vectors of your content, and pick, say, the top five documents that are closest to this question. So "do you have parking" will be a vector, then you have all your content, and it's going to try to find the five documents that talk the most about parking, basically. It'll find those documents, grab the paragraphs associated with them, and use them here. Those will be the subset of your content that is used as part of the prompt before the prompt.

09:51

So this whole concept is vectorizing your content. Typically those vectors are then stored in something called a vector database, which is basically a representation of your content in this numeric form. And then this system that you build, this RAG system, will take the question, retrieve the most relevant content, make that part of the prompt before the prompt, send that to the LLM, and then you'll get a good response back. So it's a little bit confusing, but it's actually not that confusing; I just made it more confusing with this horrible drawing. But this whole thing is what is called RAG. Retrieval: you're retrieving the relevant documents from your content. Augmented: you're augmenting the LLM's ability to do generative AI based on the documents that you retrieve. That's why it's retrieval augmented generation.

10:55

Okay, so I hope that made sense. Like I said, this is a very popular solution pattern that I'm seeing over and over again. In fact, the majority of LLM projects that I see are this kind of thing: using my content, packaging that up with an LLM system, to create a kind of ChatGPT-like experience for my employees or for my customers or my users. And it works extremely well; that's why it's so popular. So I hope that was interesting and educational and made sense. If you have any questions, please leave them for me in the comments. Thank you very much.