Retrieval Augmented Generation for Navigating Large Enterprise Documents

Google Cloud Events
28 Feb 2024 · 42:01

Summary

TL;DR: The Google Cloud Community session featured the Generali Italia team discussing their experience with developing a RAG-based application for navigating complex enterprise documents. The team highlighted the challenges of information retrieval in a heavily regulated industry with extensive documentation. They detailed their approach using large language models, the process of embedding and retrieval, and the importance of in-context learning. The session included a live demonstration and Q&A, emphasizing the team's innovative use of AI to enhance document accessibility and information retrieval within their organization.

Takeaways

  • πŸ“ˆ The Generali Italia team developed a RAG (Retrieval-Augmented Generation) based application for navigating complex enterprise documents.
  • πŸš€ The project aimed to leverage AI advancements to simplify the information retrieval process within a large volume of technical and regulatory documentation.
  • πŸ“š The team faced challenges with over 400 documents totaling more than 5,000 pages, which would take over 100 hours to read.
  • πŸ” Information retrieval was identified as a key field to assist with the challenges, involving searching for documents, information within documents, or the documents themselves.
  • πŸ’‘ The team utilized large language models (LLMs) and generative AI to surpass the state of the art in understanding language and generating meaningful conversations.
  • 🧠 In-context learning was employed to reduce hallucinations in AI responses by providing the model with relevant context from the documents.
  • πŸ“Š The team conducted experiments with default parameters and later introduced custom strategies for document chunking and hyperparameter tuning.
  • πŸ“ˆ They created a synthetic dataset for evaluation purposes due to the lack of an existing validation set, extracting questions and answers using a large language model.
  • πŸ”§ The experimentation involved tools like Vertex AI Platform, various LLMs, and a vector database for storing embeddings.
  • πŸ“ The architecture included an ingestion phase and an inference phase, with the latter involving user interaction and frontend services.
  • πŸ”„ The team plans to experiment with new foundation models and Vertex AI Vector Search, as well as work on LLMOps for RAG to better handle newly added documents.

Q & A

  • What was the main challenge Generali Italia faced with their documentation?

    -The main challenge was the continuous growth of textual data and knowledge, which made it difficult to extract information efficiently from a large volume of documents, leading to significant time consumption.

  • How did Generali Italia leverage AI to simplify the information retrieval process?

    -They defined a perimeter of relevant business documents and used large language models within a retrieval-augmented generation (RAG) based solution to develop a document Q&A application.

  • What is information retrieval and how does it assist in addressing the challenges faced by Generali Italia?

    -Information retrieval is the science of searching for information in documents or for documents themselves. It assists by providing methodologies to efficiently locate and extract the needed information from vast document collections.

  • What is the role of the embedding model in the RAG architecture?

    -The embedding model, which is a large language model itself, takes text as input and returns a list of numbers (vector). It helps in creating context embeddings that are used to find similar information chunks for answering user queries.
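
A minimal sketch of that idea in Python, assuming a generic `embed()` placeholder for whichever embedding model is used (the session mentions a multilingual embedding model on Vertex AI); the function names and the example query are illustrative:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def embed(text: str) -> np.ndarray:
    """Placeholder: wrap the embedding model here (text in, vector out)."""
    raise NotImplementedError

# The chunks and the user question are embedded with the same model, and the
# chunks whose vectors are most similar to the question vector become the context:
# chunk_vectors = [embed(c) for c in chunks]
# query_vector = embed("What does the policy say about premiums?")
# ranked = sorted(range(len(chunks)),
#                 key=lambda i: cosine_similarity(query_vector, chunk_vectors[i]),
#                 reverse=True)
```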

  • How did Generali Italia handle the lack of a validation dataset for their RAG system?

    -They created a synthetic dataset by extracting paragraphs from each document and using a large language model to generate questions and answers, which were then used for validation and performance evaluation.
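
A minimal sketch of that generation step, assuming a generic `llm(prompt)` callable standing in for the model actually used; the prompt wording and the `generate_qa_pairs` helper are illustrative, not the team's exact code:

```python
import json

def llm(prompt: str) -> str:
    """Placeholder for a large language model call (e.g. a Vertex AI text model)."""
    raise NotImplementedError

def generate_qa_pairs(paragraph: str) -> list[dict]:
    """Ask the LLM for three question/answer pairs grounded in a single paragraph."""
    prompt = (
        "Read the paragraph below and write three questions that it answers, each "
        "with its answer, as a JSON list of {\"question\": ..., \"answer\": ...} objects.\n\n"
        f"Paragraph:\n{paragraph}"
    )
    return json.loads(llm(prompt))

# validation_set = [qa for paragraph in paragraphs for qa in generate_qa_pairs(paragraph)]
```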

  • What are the key metrics used to evaluate the performance of the RAG-based application?

    -Key metrics include Mean Reciprocal Rank (MRR), Mean Average Precision (MAP) at a given cut-off K, and Recall, together with ROUGE and BERTScore for comparing the quality of generated responses against the reference answers.
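
A minimal sketch of the retrieval-side metrics, with hypothetical chunk-id lists as input; MRR over the whole evaluation set is the mean of the per-question reciprocal ranks, and "recall at 15 documents" corresponds to `recall_at_k(..., k=15)`:

```python
def reciprocal_rank(retrieved_ids: list[str], relevant_id: str) -> float:
    """1/rank of the first relevant chunk in the result list, 0 if it is missing."""
    for rank, chunk_id in enumerate(retrieved_ids, start=1):
        if chunk_id == relevant_id:
            return 1.0 / rank
    return 0.0

def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Share of the relevant chunks that appear in the top-k retrieved results."""
    if not relevant_ids:
        return 0.0
    return len(set(retrieved_ids[:k]) & relevant_ids) / len(relevant_ids)
```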

  • What was the significance of the research paper 'Lost in the Middle' in the context of Generali Italia's RAG system?

    -The paper provided insights into how large language models use the information from the context provided. This led Generali Italia to introduce a re-ranking layer to optimize the organization of information presented to the LLM for better performance.

  • How did Generali Italia ensure the scalability and reliability of their RAG-based application?

    -They utilized the Vertex AI platform for experimentation and model training, which ensured scalability and reproducibility. Additionally, they used Google's infrastructure for the reliability of their product.

  • What was the outcome of the experiments with custom chunking strategies and hyperparameter tuning?

    -The experiments resulted in improved performance, with the best chunk size identified as 1,000 characters and a recall of 80% at 15 documents, along with a question-answer accuracy of 73%.

  • How did Generali Italia address the need to explain acronyms and insurance definitions to users?

    -They added custom chunks to their collection that explained acronyms and insurance definitions, which improved the chatbot's ability to answer questions related to these topics, despite a slight decrease in overall metrics.

  • What are the next steps for Generali Italia's RAG-based application?

    -The next steps include testing new foundation models like Gemini Pro, using Vertex AI's side-by-side evaluation pipeline to compare different models, and exploring Vertex AI Vector Search as a more efficient vector database solution.

Outlines

00:00

πŸ“’ Introduction and Overview

The video begins with Ivan, a Google Cloud customer engineer, welcoming the audience to a session of the Google Cloud Community. He introduces the Generali Italia team, including Ian, a tech lead data scientist, and Domenico, a tech lead machine learning engineer. They will share their experience in developing a RAG-based application for navigating complex enterprise documents. The agenda includes presenting the scenario of the application's development, discussing the choice of large language models, detailing the deployment process, and concluding with a live demonstration and Q&A session.

05:03

πŸ“ˆ The Evolution of Information Retrieval

The presentation continues with a historical overview of information retrieval, from its theoretical foundations in the 1950s to the integration of machine learning in the 2000s. It highlights the current era's focus on large language models (LLMs) and generative AI. The team explains the concept of in-context learning, which reduces hallucinations in AI responses by providing contextual information. They describe the process of embedding documents into a vector database for efficient information retrieval and the importance of metrics for evaluating the system's performance.

10:05

🧠 Deep Dive into the RAG-Based Architecture

The team delves into the specifics of the RAG-based architecture, discussing the process of creating a context section for the large language model. They explain how documents are split into paragraphs and then into chunks, which are embedded and stored in a vector database. The user's query is processed in the same way, and the model uses the embeddings to search the database for relevant information. The retrieved documents are then used to construct a prompt for the large language model to answer the user's question accurately.
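
A compact sketch of that retrieve-then-generate loop, assuming placeholder `embed`, `vector_db.search`, and `llm` callables rather than the team's actual services:

```python
def answer_question(question: str, vector_db, embed, llm, top_k: int = 10) -> str:
    """Retrieve the most similar chunks and ask the LLM to answer only from them.

    `embed`, `vector_db.search` and `llm` stand in for the embedding model,
    the vector database client and the large language model, respectively.
    """
    query_vector = embed(question)
    chunks = vector_db.search(query_vector, limit=top_k)  # most similar chunks first

    context = "\n\n".join(chunk.text for chunk in chunks)
    prompt = (
        "Answer the question using only the information in the context below. "
        "If the context does not contain the answer, say that you do not know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    return llm(prompt)
```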

15:07

πŸ” Experimentation and Metrics

The team shares their initial experiments with the RAG-based system, using default parameters and a synthetic dataset generated by a large language model. They discuss the challenges of lacking evaluation metrics and the strategies they employed to create a validation set. The introduction of metrics such as mean reciprocal rank, mean average precision, and recall at a given cut-off, as well as ROUGE and BERTScore, allowed them to evaluate and improve the system's performance. The team also experimented with custom chunking strategies and tuning hyperparameters to enhance the system's recall and question-answering accuracy.
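
The splitting strategy described here (paragraphs first, then fixed-length sub-chunks of roughly 1,000 characters) can be sketched as below; the team mentioned using a LangChain splitter in practice, so this is only an illustration of the logic:

```python
def split_document(text: str, max_chars: int = 1000) -> list[str]:
    """Split by paragraph first, then cut long paragraphs into ~max_chars sub-chunks."""
    chunks: list[str] = []
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        if len(paragraph) <= max_chars:
            chunks.append(paragraph)
        else:
            # Naive fixed-length cut; a real splitter would prefer sentence boundaries.
            chunks.extend(paragraph[i:i + max_chars]
                          for i in range(0, len(paragraph), max_chars))
    return chunks
```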

20:09

πŸ”§ Enhancing the Information Retrieval Process

The team focuses on enhancing the information retrieval process by moving from a simple embedding search to a combined method that includes both embedding techniques and classical search methods like BM25. They discuss the impact of this change on system performance, including a significant boost in recall and QA accuracy. The team also explores research findings on how large language models use context information and introduces a re-ranking layer to organize the provided information more effectively. They discuss the integration of new features, such as increasing the document collection and testing different models like PaLM and Gemini Pro.
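
A hedged sketch of the two ideas in this step: merging the dense (embedding) and BM25 result lists, shown here with reciprocal rank fusion as one common fusion rule (the session does not specify the exact formula used), and re-ordering the merged list so the strongest chunks sit at the start and end of the prompt, as suggested by the "lost in the middle" findings:

```python
def reciprocal_rank_fusion(dense_ids: list[str], bm25_ids: list[str], k: int = 60) -> list[str]:
    """Merge two ranked lists of chunk ids into one, best first."""
    scores: dict[str, float] = {}
    for ranking in (dense_ids, bm25_ids):
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def reorder_for_prompt(ranked_ids: list[str]) -> list[str]:
    """Put the best chunks at the beginning and end of the context, weaker ones in the middle."""
    front: list[str] = []
    back: list[str] = []
    for i, chunk_id in enumerate(ranked_ids):
        (front if i % 2 == 0 else back).append(chunk_id)
    return front + back[::-1]
```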

25:13

🎯 Final Architecture and Takeaways

The team presents the final architecture of their product, which includes an ingestion phase with Cloud Storage and Vertex AI Pipelines for data processing, and an inference phase where the user interacts with the frontend service. They discuss the importance of the Google Cloud infrastructure for scalability and reliability. The team shares their key takeaways from the project, such as increasing document accessibility, experimenting with cutting-edge AI technologies, and the potential for future improvements. They also outline next steps, including testing new foundation models and exploring Vertex AI Vector Search.

30:15

πŸ’¬ Q&A and Live Interaction

The session concludes with a Q&A segment, where the team addresses questions from the audience. Topics covered include building a consistent framework for a RAG-based system without customer Q&A data, handling document updates, chunking strategies, storing chunks with metadata, and handling follow-up questions that lack context. The team provides insights into their approach to these challenges and shares best practices for similar scenarios.
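
For the follow-up-question case, one of the strategies mentioned (rewriting the follow-up into a standalone question using the previous conversation) might look like the sketch below; the team noted this feature was not yet integrated, and the prompt wording here is illustrative:

```python
def condense_follow_up(history: list[tuple[str, str]], follow_up: str, llm) -> str:
    """Rewrite a context-free follow-up into a standalone question using the chat history."""
    transcript = "\n".join(f"User: {q}\nAssistant: {a}" for q, a in history)
    prompt = (
        "Given the conversation below, rewrite the final user question so that it can be "
        "understood on its own, without the conversation.\n\n"
        f"{transcript}\n\nFinal user question: {follow_up}\n\nStandalone question:"
    )
    return llm(prompt)
```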

Keywords

πŸ’‘Google Cloud Community

The Google Cloud Community is a platform where professionals and enthusiasts come together to share experiences, learn, and discuss various topics related to Google Cloud services. In the context of the video, it is the setting where the presentation about developing and deploying a RAG (Retrieval-Augmented Generation) based application is being held.

πŸ’‘RAG (Retrieval-Augmented Generation)

RAG is an AI architecture that combines retrieval mechanisms with generative models to improve the performance of language models. It retrieves relevant information from a database and uses it to generate responses. In the video, the RAG approach is used to navigate complex enterprise documents for an insurance company, Generali Italia.

πŸ’‘Information Retrieval

Information retrieval is the process of finding, organizing, and presenting information relevant to a user's needs from a collection of data. In the video, it is a critical component in addressing the challenges of extracting information from a vast array of enterprise documents.

πŸ’‘Large Language Models (LLMs)

Large Language Models (LLMs) are machine learning models trained on vast amounts of data, capable of understanding language, generating new content, and maintaining meaningful conversations. They are a key component in the RAG architecture, used to generate responses based on retrieved information.

πŸ’‘In-Context Learning

In-context learning is a technique where a large language model is prompted to use all the information from a given context to answer a question. It helps reduce hallucinations by ensuring the model's response is grounded in the provided context, which is crucial for accurate information retrieval.
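
A small illustration of the idea, mirroring the kind of prompt shown in the session (the wording and the example facts here are made up):

```python
question = "What colour is the speaker's T-shirt?"
context = "The speaker is wearing a blue T-shirt and grey trousers today."

prompt = (
    "Answer the question using only the information in the context. "
    "If the context does not contain the answer, say you do not know.\n\n"
    f"Context: {context}\n"
    f"Question: {question}\n"
    "Answer:"
)
# Without the context the model may hallucinate an answer; with it, the answer
# is grounded in the provided text, which is the point of in-context learning.
```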

πŸ’‘Vector Database

A vector database is a type of database that stores and retrieves vector representations (embeddings) of data points, such as text paragraphs. It is used in the RAG architecture to store embeddings of document chunks, enabling efficient searching for similar information.

πŸ’‘Embedding Model

An embedding model is a type of machine learning model that converts text into numerical vectors, which can be understood and compared by other algorithms. In the video, it is used to transform document paragraphs into embeddings that are stored in the vector database for information retrieval.

πŸ’‘Synthetic Data Set

A synthetic data set is a collection of data that is artificially generated, often used when real data is scarce or unavailable. In the video, a synthetic data set is created to serve as a validation set for evaluating the RAG application's performance.

πŸ’‘QA (Question Answering)

Question Answering (QA) is the process of generating answers to questions based on a given context or knowledge base. In the video, QA is a primary function of the RAG application, which is designed to answer questions using information retrieved from enterprise documents.

πŸ’‘Evaluation Metrics

Evaluation metrics are quantitative measures used to assess the performance of a model or system. In the context of the video, they are crucial for measuring the accuracy, recall, and relevance of the RAG application's responses.
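
Besides the classic metrics, the session also described a QA-accuracy check in which an LLM judges whether the predicted answer matches the reference answer. A minimal sketch, assuming a generic `llm` callable and an illustrative grading prompt:

```python
def qa_correct(question: str, reference: str, predicted: str, llm) -> bool:
    """Ask an LLM to judge whether the predicted answer conveys the reference answer."""
    prompt = (
        "You are grading a question answering system.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Predicted answer: {predicted}\n"
        "Does the predicted answer convey the same information as the reference? "
        "Reply with exactly YES or NO."
    )
    return llm(prompt).strip().upper().startswith("YES")

# QA accuracy is the fraction of evaluation questions for which qa_correct(...) is True.
```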

Highlights

The session focuses on the experience of developing and deploying a RAG (Retrieval-Augmented Generation) based application for navigating complex enterprise documents.

The Google Cloud Community session introduces the team from Generali Italia, who share their insights on leveraging AI for technological innovation in the insurance industry.

The insurance industry's significant reliance on documentation presents challenges in information retrieval, with over 400 documents and 5,000 pages requiring over 100 hours to read.

The challenges faced include the continuous growth of textual data, the time it takes to extract information from text, and information access from multiple databases.

Information retrieval is defined and its evolution from the 1950s to the adoption of large language models (LLMs) in the 2020s is discussed.

The presentation covers the RAG architecture, which combines generative AI with information retrieval to surpass the state of the art in language understanding and generation.

In-context learning is introduced as a method to reduce hallucinations in large language models by providing contextual information for accurate responses.

The process of creating a context section by embedding documents and retrieving similar information through a vector database is explained.

The importance of evaluation metrics for determining the direction and success of the RAG application development is emphasized.

A synthetic dataset was created due to the unavailability of an existing validation dataset, using a large language model to generate questions and answers from document paragraphs.

Metrics such as mean reciprocal rank, mean average precision, and recall at a given cut-off, as well as ROUGE and BERTScore, were used to evaluate the performance of the RAG application.

The experimentation phase included custom splitting strategies for documents and tuning hyperparameters like model temperature and chunk lengths for in-context learning.

The team experimented with increasing the number of documents in the collection set and migrating between models, moving from PaLM 1 to PaLM 2 and currently testing Gemini Pro, for improved performance.

The architecture of the RAG application includes an ingestion phase for data processing and an inference phase for user interaction and prompt engineering.

The session concludes with a live demonstration of the application's user interface, showcasing its ability to answer questions using relevant documents.

The project increased the accessibility of documents within the company, allowed experimentation with cutting-edge AI technologies, and relied on Google's infrastructure for scalability and reliability.

Future steps include trying new foundation models, utilizing Vertex AI's side-by-side evaluation pipeline for model comparison, and working on LLMOps for RAG applications to handle new documents effectively.

Transcripts

play00:03

hello everyone and welcome to this new

play00:05

session of the Google Cloud Community

play00:08

today we have the pleasure to have the

play00:10

generally Italia Italy team uh who will

play00:14

share their experience in developing and

play00:17

deploying a rag based application to

play00:20

navigate complex uh Enterprise

play00:24

documents before to before to start let

play00:27

me to introduce uh myself uh I am Ivan

play00:31

nardini I'm a customer engineer at

play00:34

Google cloud and I supported generally

play00:36

in implementing the this generative AI

play00:39

application and together with me today

play00:41

we have Ian and Domino Ian Domino would

play00:45

you like to introduce yourself oh for

play00:48

sure thank you and welcome everybody my

play00:51

name is Ian a tech lead data scientist

play00:54

in general Italia and I support the

play00:56

development of artificial intelligence

play00:58

and machine learning Solutions

play01:01

hello I'm domano and Tech lead machine

play01:03

learning for General

play01:06

Italian okay so let's uh take a look at

play01:10

the agenda

play01:12

then so first of all we will Begin by

play01:16

presenting the scenario in which the

play01:19

document Q&A application was developed

play01:22

and next we will explore why generally

play01:24

choose to address these challenges using

play01:27

large language models within a rag based

play01:30

solution following that we will Deep

play01:32

dive uh into the process that enabled

play01:36

the generally team to successfully

play01:38

deploy the rug uh based llm application

play01:41

into production and in particular they

play01:43

will share with you some details about

play01:45

the experiments they conducted in terms

play01:47

of chuning Le lexical search and ranking

play01:51

strategies that uh ultimately lead to

play01:54

the deployment of the application and

play01:56

finally we will conclude the session

play01:58

with a live demonstration some takeaways

play02:01

and as always the Q&A session so with

play02:04

that Ian Domino the stage is

play02:09

yours okay so thank you even again uh

play02:13

let's start with the general business

play02:16

case uh generally is investing in

play02:20

technological innovation but as an

play02:22

insurance company documentation is

play02:24

always an important and significant

play02:26

component for our business uh the

play02:30

industry is Guided by technical

play02:32

regulation all accompanied by

play02:34

documentation and we have documents such

play02:37

as policy statement with terms and

play02:40

conditions premium

play02:42

statement um risk assessment reports or

play02:45

internal company knowledge uh with

play02:47

documents such as uh machine learning

play02:50

model documentation corporate

play02:52

regulations information related to uh

play02:55

legal entities and so on and uh going

play02:58

through all these documents in in order

play03:00

to find the right information is very

play03:02

important uh recognizing this problem we

play03:05

saw an opportunity um and the question

play03:08

we asked ourself was how we can Leverage

play03:12

The advancement in AI to simplify the

play03:15

information retrieval

play03:18

process so we defined first of all a

play03:21

perimeter of relevant business documents

play03:24

to focus on in order to understand the

play03:26

complexity of this challenge Uh current

play03:29

we have more than 400 documents at our

play03:32

disposal uh toiling more than 5,000

play03:36

pages and this means that it will take

play03:38

around more than 100 hours to read them

play03:42

all uh in front of these numbers we

play03:46

stand before three significant

play03:48

challenges uh firstly there is a

play03:51

continuous growth of available textual

play03:54

data and knowledge uh so this will um um

play03:59

expand our resources and this growth

play04:02

introduce another additional challenge

play04:05

the T the time it takes to extract

play04:08

information from from text the third

play04:11

challenge is related to information

play04:14

access uh from multiple different uh

play04:18

data sources located in different

play04:21

databases and the field of information

play04:23

retrieval uh can assist us with all

play04:26

these

play04:28

challenges first of all

play04:30

uh what is information retrieval uh so

play04:33

uh information retrieval is the science

play04:35

of searching for document searching for

play04:38

information in a document or searching

play04:40

for document themselves so lens embark

play04:42

on a journey through the timeline of

play04:44

information retrieval just few concept

play04:47

we me in the 1950s a decade Market by

play04:50

theoretical Foundation of uh this field

play04:54

uh with Concept like index indexing and

play04:57

search algorithm and fast forward to the

play04:59

1980s we saw the Advent of vector space

play05:02

model entering in um in the 19 um in in

play05:10

in two in the 2000 we see the

play05:12

integration of machine learning and

play05:14

search engines into uh this field and

play05:19

now in 2020s particularly in 2023 we are

play05:23

in a period with the adoption of uh

play05:26

llms so uh over all there are numerous

play05:30

information retrieval

play05:32

methodologies which have been um

play05:35

categorized here with three main

play05:38

categories uh there are the C sear

play05:41

mechanism that are based on precise

play05:44

identification of words or vectors then

play05:47

we have probabilistic approaches and

play05:50

those based on machine learning um and

play05:53

finally there are uh there are more

play05:55

advanced methods that could be an emble

play05:58

of previous ones or other architecture

play06:01

but here we will talk about rag

play06:03

architecture which is based on

play06:06

generative Ai and here we are with the

play06:09

generative AI part of this presentation

play06:12

uh so generative AI have recently shown

play06:15

to surpass the state of Art in terms of

play06:17

performance of um understanding language

play06:21

generating new contents and maintaining

play06:23

a meaningful

play06:25

conversation uh a large language model

play06:27

is a machine learning model that's is is

play06:29

trained on a vast amount of data and

play06:32

they found a very pleasant definition on

play06:34

internet uh to describe them they are

play06:37

large autoc comption systems so in this

play06:40

slide you can see a a sentence uh and

play06:44

this is an input of an nlm that from now

play06:47

on we will call a prompt so slow and

play06:51

steady wins from a grammatical

play06:53

standpoint there are many combination

play06:56

that could be used to finish this

play06:58

sentence but is a essentially only one

play07:00

way to conclude it and it's the

play07:05

race okay so one important concept for

play07:09

building rug is the in context learning

play07:11

let's see it in action for example we

play07:14

can try to ask to our large language

play07:16

model what is the color of your T-shirt

play07:19

and if we see the response you will see

play07:22

that uh the the t-shirt is red but this

play07:26

is wrong this is what we usually call

play07:30

hallucination this because the large

play07:32

language models don't know the answer to

play07:35

this question but try anyway to uh find

play07:39

an answer for that question so one way

play07:42

to uh reduce this type of hallucinations

play07:46

is the in context learning the in

play07:48

context learning is a prompt like this

play07:50

where we say to the large language model

play07:53

that uh you need to answer to that

play07:56

question using all the information that

play07:58

comes from uh the context um windows so

play08:03

in this example we are seeing that in

play08:05

the context section uh what even is

play08:08

wearing and in this case we sub if we

play08:12

submit this prompt to the large language

play08:14

model we will see that it will answer

play08:17

correctly so the idea here is to insert

play08:21

in this context section all the

play08:23

information that comes from our

play08:25

documents and in this way the our large

play08:28

language model we will able to answer to

play08:31

the um to our uh question using

play08:35

information from our documents so the

play08:38

first step we need to

play08:39

do is to uh create the context SE

play08:43

section so we start from our document

play08:46

database we split each document in

play08:49

different par paragraph or chunks and we

play08:52

pass these chunks to an embedding model

play08:56

the embedding model is a large langage

play08:58

model itself but but takes as input uh

play09:01

text and returns a list of number a

play09:04

vector so for each paragraph at this

play09:06

point we will have the corrective

play09:09

embedding here the idea is that the

play09:11

paragraph with same information or with

play09:14

similar information we will have also

play09:17

similar embedding and this similarity

play09:19

between embeddings can be calculated

play09:22

mathematically using some distance

play09:25

Matrix like the coin similarity and uh

play09:29

at this point we can take all these

play09:31

embeddings and store it in a vector

play09:33

database that are some database built do

play09:36

for retrieving and storing uh these

play09:40

vectors and at this point we can go to

play09:44

the retrieving generate steps so we have

play09:47

an user that use a query this query is

play09:50

processed using the same text Em text

play09:52

embedding model so we La as output uh

play09:56

the embedding of the user question and

play09:58

we can use this uh user embedding to St

play10:02

to search through the vector database

play10:04

all the information all the uh chunks

play10:08

that have similar information the idea

play10:11

is that these uh context that we are

play10:14

retrieving uh contains the information

play10:17

to answer that question so at this point

play10:20

we can take all this information the

play10:22

user question and the documents

play10:24

retrieved and put it in a prompt like

play10:27

the prompt we have seen in the in

play10:29

context learning and we can submit the

play10:31

prompt to the large language model that

play10:33

at this point we will answer to the user

play10:36

question here we can see a summary of

play10:39

this process so we have a question

play10:41

generated by a user we uh with a chatbot

play10:46

and a user interface we retrieve all the

play10:49

knowledge that we need to answer that

play10:51

question this information are uh used by

play10:55

the large language model to answer to

play10:58

the user uh using the internal reg of

play11:01

our

play11:05

documents so uh we can say that is not

play11:08

too complex to create an architecture of

play11:11

this kind we have conducted some initial

play11:14

experiments and these activities were

play11:17

conducted with default parameters

play11:19

leaving the splitting Methods at theault

play11:22

and even the Chun CLS uh Chun CLS at

play11:25

default and even the information

play11:27

retrieval process unchanged but we

play11:29

didn't venture to making any kind of

play11:32

sophistication within it and we obtained

play11:34

approximately 45,000 chance for the

play11:37

database in order to represent the

play11:40

entire document collection but we came

play11:43

to realize something very important we

play11:46

lacked of um evaluation metrics uh we

play11:50

didn't have a sort of compass that would

play11:52

allow us in order to uh to determine if

play11:55

we are moving in the right direction or

play11:57

in the wrong one uh and essentially

play12:00

there are two possible strategies the

play12:02

first one is having access to an

play12:04

existing data set uh that can serve as a

play12:07

validation Set uh and the second

play12:10

strategy is to create a new data set the

play12:13

first option was not feasible uh that

play12:16

because a a validation database was not

play12:19

available so we decided to create a new

play12:22

synthetic data set um we generated a

play12:26

synthetic data set using a large

play12:28

language model and first of all we

play12:31

extracted paragraphs from each document

play12:34

and then we have sent them uh into the

play12:36

large language model asking it directly

play12:39

to identify three questions and three

play12:42

Associated answers and by doing that we

play12:45

have obtained pairs of question and

play12:48

answers

play12:49

pairs so uh in this slide you can see

play12:53

the situation an LA uh the first part on

play12:57

the left is what we have just just

play12:59

discussed we have a huge amount of

play13:01

questions uh and we injected all these

play13:04

questions in the in the rag architecture

play13:07

presented by Dominico and as a result we

play13:10

have obtained pairs of real and

play13:12

predicted answer on which we can perform

play13:16

comparisons at this point we need

play13:18

metrics in order to evaluate the quality

play13:21

of our work and the introduction of

play13:25

metrics is is crucial for evaluating the

play13:28

performance of the model comparing

play13:30

predicted answers against Real answers

play13:33

in term of accuracy recall relevance and

play13:36

the the overall model ability in doing

play13:40

this task and metrics help us even to

play13:44

understand the variation in performance

play13:46

when we introduce a new fature uh into

play13:49

the system on the left you can see the

play13:51

metrix for the information retrial

play13:54

process uh on the right you can see the

play13:56

metrics related for um evaluating the

play14:00

the quality of responses generated we

play14:03

have the mean reciprocal rank which is

play14:06

the metric that is determined which

play14:08

determines uh if the chunk is placed at

play14:11

the top in the middle or at the bottom

play14:14

um of all the documents returned then we

play14:17

have metrics like the mean average

play14:19

precision and recall at a given cut off

play14:22

of K uh and that refers of the accuracy

play14:27

of the chunks over all the retriever

play14:29

chunks or over all the total number of

play14:32

current chunks in the data set I know

play14:34

that it's a little bit

play14:37

complicated um on the other side we have

play14:40

metrics such as the Rouge and this is a

play14:43

metric that is related to the machine

play14:45

translation field so moving from one

play14:48

language to another and this Bas this

play14:51

metric is based on the overlap of terms

play14:55

uh we have also the bir score that is

play14:57

another metric that we use in order to

play15:00

understand a comparison of sentence

play15:03

embeddings between predicted and real

play15:06

answer and finally last but not least um

play15:10

something new that is the QA evaluation

play15:14

and that that is a metric based on llm

play15:17

so essentially we inject the question

play15:21

the real answer and the predicted answer

play15:23

into the large language model and the

play15:25

large language model's task is to

play15:27

determine if we are um if if the if

play15:31

their answer is is correct or not uh so

play15:37

we decided to ah sorry sorry this part

play15:40

is

play15:42

for so before ding into the experiment

play15:46

let's check out the tool that we used so

play15:50

as experimentation layer we use the

play15:53

vertex platform verx AI platform inside

play15:55

the Google cloud and it's a useful tool

play15:59

because it allows us to scale the number

play16:01

of experiment and also the resources

play16:04

that we are using for that experiment it

play16:06

ensure the uh experiment reproducibility

play16:10

and finally it save for us automatically

play16:12

all the artifact that we we are

play16:15

generating through our uh experiments

play16:19

for the large language model layer we

play16:21

use the all the jamy and palm models and

play16:25

also the embedding multilingual the

play16:28

embedding model multilingual that works

play16:30

well with the Italian language that are

play16:32

the original language of our

play16:35

documents as chain layer we used lung

play16:38

chain this because it has some function

play16:41

for reading and processing uh PDFs and

play16:45

it has also function for the information

play16:47

retrieval step and finally as storage

play16:50

layer we use quadrum that is a vector

play16:53

database where we store our EMB bendings

play16:56

and we use this because it's as a very

play16:59

fast algorithm for searching through

play17:01

vectors and also an EAS installation on

play17:06

kubernetes so now that we have the

play17:09

metrix the test set and the tools that

play17:12

we can use we can start really with the

play17:14

experiments and so with our second

play17:17

experiment we implemented a

play17:21

um custom splitting strategies with our

play17:24

documents uh in particular we starting

play17:27

split by

play17:28

paragraph and then we split the

play17:31

paragraph in sub chunks using a a

play17:35

specific length uh on the other side we

play17:39

also tuned all all the other hper

play17:42

parameters that we had like the

play17:43

temperature of the model The Prompt the

play17:47

length of the various chunks and for

play17:50

example also the number of chunks that

play17:53

we insert in our prompt for the in

play17:55

context learning so after this this all

play17:58

this tuning we had that the best chunk

play18:01

size it's 1,000 Char for our chunks and

play18:05

we generated with this approach uh

play18:08

13,000 chunks for for the metric

play18:11

standoff point we had a recall at 15

play18:13

documents of 80% and a question answer

play18:17

um of uh 7

play18:22

73% on the right we can see a plot where

play18:25

we can see how the recall increase when

play18:28

we increase the number of chunks that we

play18:30

insert in our prompt so we can see that

play18:33

after the 10 10 chunks that we insert we

play18:37

had a plateau in our recall

play18:41

Cur okay for the next experiment we

play18:44

added some custom chunks in our

play18:47

collection this because a lot of our

play18:50

documents have acronyms and insurance

play18:55

defs and a lot of time we have question

play18:58

from the user where they need to where

play19:01

they ask the meaning of these acronyms

play19:03

of or some insurance definition for

play19:07

example what is the definition of c so

play19:10

we want that the chatbot is able to

play19:13

answer to this question and for doing

play19:15

this we added manually some chunks where

play19:18

we explain what that acronyms means or

play19:21

what that insurance definitions uh means

play19:25

and in this manner we generated another

play19:28

4,000

play19:30

chunks and for the metric standoff point

play19:33

with this experiment we had a recall of

play19:36

78% and a question answer of

play19:39

72% uh as you can see these metrics are

play19:42

a little bit

play19:44

lower if we compare it with the previous

play19:47

experiment but we choose to take this as

play19:50

best experiment just because we want

play19:52

that our chatboard can be able to

play19:54

explain acrs to the user or some

play19:59

Insurance

play20:01

definitions so uh we have seen after

play20:06

this experimentation and introducing the

play20:09

fact the the aspects related to

play20:11

definitions and so on that there is a

play20:13

direct correlation between the

play20:15

information retrieved by uh the vector

play20:18

database and then then the information

play20:20

used by the large language model uh but

play20:23

simply if the information from the

play20:26

vector database is incorrect thear wedge

play20:28

model cannot generate an accurate HW so

play20:32

we understood that we need to focus um

play20:36

our efforts and time on the information

play20:39

retrieval process um enhancing the

play20:42

quality of the chunks we identify and we

play20:45

move we moved from a simple let's say

play20:48

DSE search through embedding method to a

play20:51

neing method that combines embedding

play20:54

techniques with the classical uh bm25

play20:58

and these search U methods are defined

play21:01

as a mix mixture of search and then sear

play21:05

and then search and we left the other

play21:08

parameters unchanged and as you can see

play21:11

at the bottom of this slide we still

play21:13

achieved a significant performance uh

play21:16

boost in terms of recall and even in in

play21:20

the QA

play21:22

accuracy but moving forward we also

play21:26

ventured a little bit in to the research

play21:29

field and last year uh there was an

play21:31

interesting paper uh titled lost in the

play21:35

middle uh I I think that it was

play21:37

published in in November uh and it's a

play21:40

paper where a group of researchers aimed

play21:42

to understand how a large language

play21:44

models use the information from the

play21:47

context provided and this research team

play21:50

um have found that there is a um a a

play21:55

correlation um of where the the chunk

play21:59

the cor chunk is used by the large Lang

play22:01

wi model compared to the overall number

play22:04

of chunks in the prompt here in this

play22:07

graph we um we have the what socalled

play22:11

u-shaped Cur which essentially represent

play22:14

the reduction of performances in in

play22:16

relation to the um placement of corrent

play22:19

information across uh the r return

play22:22

document set and this consideration lead

play22:25

us to think about how the information is

play22:29

provided and is organized to llm so

play22:32

that's why we introduced the new layer

play22:35

or re ranking layer that is capable of

play22:38

sorting uh the information that is the

play22:41

most accurate information that might be

play22:43

used by the large language model to

play22:46

either at the top or at bottom of all

play22:49

the documents into the

play22:52

prompt so the last phase of our

play22:55

experimentation was the integration of

play22:58

two new features first of all we

play23:00

increased we increased the number of

play23:02

documents uh into the collection set and

play23:05

the second was um to moving from Palm

play23:08

one to Palm two and currently we are

play23:11

testing uh even Gemini Pro in terms of

play23:15

performances and as you can see at the

play23:17

bottom of these slides there is a recall

play23:20

reduction uh in term of performances and

play23:23

we must acknowledge uh some change in in

play23:26

terms of metrics so this is motivated by

play23:30

the introduction of new documents

play23:32

without fine-tuning the input

play23:35

preprocessing uh pipeline it's very

play23:37

important to adjust the uh the input

play23:40

pipeline um to the to the documents that

play23:44

you are going to use so we have also a

play23:47

supplementary um documents that contains

play23:50

definition within them and which can be

play23:53

used during the QA

play23:55

evaluation um as chunks

play23:58

for the questions and this could be a

play24:01

motivation for the recall reduction on

play24:04

the other sides as you can see uh from

play24:08

by migrating from Palm one to Palm Tre

play24:11

we obtained an improvement in terms of

play24:14

QA

play24:16

accuracy uh oh demo part okay so having

play24:20

discussed this

play24:22

architecture um experiments as we have

play24:25

just seen and many other things

play24:28

um let's see briefly uh just this

play24:31

application how it works uh give me one

play24:34

second to to upload the the the

play24:44

videos Okay so this um this is the the

play24:48

the user interface as you can see there

play24:52

the the user can insert a question uh

play24:55

into the the the tab and uh in this case

play25:00

we are asking to the system um explain

play25:02

the generally strategy in terms of

play25:04

environmental

play25:06

sustainability and as you can see there

play25:09

uh the answer is composed of two parts

play25:12

the first part is related to uh the real

play25:15

answer uh for the user query and the

play25:18

second part is related to um the

play25:21

document that is used by that is used

play25:25

for for providing the the the answer

play25:28

so that's is because we strongly believe

play25:31

that the user should have the

play25:33

opportunity to understand which document

play25:36

uh was used in order to extract the the

play25:40

information even in this case this is

play25:43

another question related to AI ethics

play25:46

from generali and even this case the the

play25:50

system generates uh the the answer to

play25:53

the question and then the the

play25:56

sources

play26:01

okay so this is our fin architecture

play26:04

that we build for our product and this

play26:06

is divided we have seen in two phases

play26:09

the ingestion pH phase where we start

play26:12

from a cloud storage where we store all

play26:14

our documents and then we have the

play26:16

vertex pipeline vertex pipeline are a

play26:19

tool inside vertex AI platform it

play26:23

ensure the training of models and and

play26:27

can be used also for data processing so

play26:30

we use that for doing all the data

play26:33

ingestion chunk the chunking part the

play26:36

embedding

play26:37

part and the creation of bm25 index at

play26:42

that point we have another two pipeline

play26:45

uh at step three and four that do is the

play26:47

information retrieval evaluation for

play26:50

that calculat the recall at 15 documents

play26:53

that we have seen and the part four

play26:57

where we we calculate the semantic

play27:00

evaluation and we calculate the question

play27:02

answer accuracy at this point all the

play27:06

eer parameters that we have find like

play27:08

the prompts the

play27:10

temperature are stored in an artifact

play27:13

registry on the other point we have the

play27:16

inference phase where we are user that

play27:18

interacts with a frontend service we

play27:21

have a back service that um manage all

play27:24

the prompt engineering phase and this

play27:27

can service read the information and all

play27:30

the parameter that uh you need to use

play27:33

like the temperature and the propt from

play27:35

the artifact registry we had also a no

play27:39

SQL database on top of fir store where

play27:41

we store the uh conversation between the

play27:44

user and the chatbot and this can be

play27:46

used we use it to store data man that we

play27:51

can fine tune uh in the next years uh

play27:56

the large language model

play28:00

okay so for what concerns the take homes

play28:05

with uh this project we increased the

play28:08

accessibility of our documents for all

play28:10

the company uh we had the opportunity to

play28:14

experiment with all the Google

play28:16

Foundation models so with Cutting Edge

play28:18

AI Technologies and we can rely on the

play28:22

Google infr infrastructure for the

play28:25

scalability of our experiments and also

play28:28

the reliability of our product and

play28:31

finally we are sharing a lot of

play28:33

knowledge with hian nardini and all the

play28:35

other Google Cloud

play28:38

Engineers on the other side for the next

play28:41

step uh the idea is to try the New

play28:44

Foundation model publisher like ger Pro

play28:47

1.5 try also the vertex AI AO side by

play28:52

side the pipeline these are some

play28:54

pipelines available in vertex where we

play28:56

can compare two different model and uh

play28:59

say how it answer to the same question

play29:03

so we can choose the better model and

play29:07

finally trying the vertex Vector search

play29:10

that is a vector database inside the

play29:13

Google inside the vertex this because

play29:16

the current one the vector database that

play29:18

we are using quadrant is an open source

play29:22

um tool and uh so we have also verx

play29:27

search that is implemented by default in

play29:30

vertex and finally we can also work on

play29:34

the LL lops for rug application for

play29:37

seeing for example uh What uh what to do

play29:41

when new documents are added to

play29:44

our um

play29:46

database I want to thank you also all

play29:49

the team that work on this incredible

play29:51

project and thank you you two for being

play29:54

here with

play29:56

us

play30:03

cool so thank you Ian thank you Domino

play30:06

for this great overview so as we

play30:08

promised at the beginning now it's time

play30:10

for having Q&A so we will start with

play30:12

some questions that we collect uh and

play30:15

then we will also go through the live

play30:16

questions that we just received so let's

play30:20

start with the first question which is

play30:21

something that you just touched so the

play30:24

first question is about how to build a

play30:26

consistent framework for a r based

play30:29

system when do when do you when you

play30:32

don't you do not have a possibility to

play30:35

collect hundreds of Q&A from your

play30:40

customer okay um so we have partially

play30:45

answered to this question previously um

play30:48

in our case was uh was this this

play30:53

scenario because we haven't access to

play30:56

our

play30:57

to an internal knowledge of Q&A so we

play31:01

decide to create a synthetic data set uh

play31:04

by providing uh chunks of paragraphs

play31:08

into nlm and then

play31:11

creating uh pairs of question and

play31:13

answers in in the machine learning field

play31:16

there are many possibilities in order to

play31:18

do that uh using uh llm is just one of

play31:23

them uh we suggest to have um a

play31:28

framework um with pipelines in order to

play31:31

be to to create this process in

play31:34

iterative way uh take into account that

play31:38

a synthetic data

play31:39

Generation Um is not the same as um as

play31:45

as the one that you can obtain from your

play31:47

from the the business unit or your

play31:50

customers uh so you need always to check

play31:54

the the quality of your questions and

play31:56

answers in order to have something that

play31:59

is uh quite of a good

play32:03

quality

play32:05

um nothing else I

play32:08

think okay well that's a great that's a

play32:10

great answer I think you you just take a

play32:13

it I so let's move on yeah let's move on

play32:17

the next question so what do you do if

play32:20

one of the documents uh get updated do

play32:24

you build or rebuild actually the index

play32:26

again

play32:28

okay I will say that it depends because

play32:32

if you add new documents you can just go

play32:35

through the text embedding model and

play32:38

update your index and your chat chat we

play32:42

will able to answer uh to the new

play32:46

question but maybe if you add too much

play32:49

documents uh maybe there is there are

play32:52

some more H parameters that are better

play32:55

maybe there is a better prompt a better

play32:57

length of chunks so uh my opinion is

play33:00

that if we had a lot of new documents

play33:03

maybe it's uh better to run the

play33:07

preprocessing pipeline so

play33:10

to to search if there are some better

play33:15

parameters okay that makes

play33:19

sense uh as you can see the question

play33:22

they're getting shorter and shorter

play33:23

which is good I think so uh what was

play33:26

your

play33:27

approach to

play33:29

chunking can you can you maybe live a

play33:32

little bit more yeah okay so I as I said

play33:37

previously we the first thing that we

play33:39

did is split by paragraph this because

play33:42

we want that uh a single Chun will be

play33:46

semantically different from another one

play33:49

in in this manner you don't have a chunk

play33:52

that has two different paragraph that

play33:54

can lead to some that can

play33:57

talk about different things so the first

play34:00

thing that we did is split by paragraph

play34:02

and then Subs spit um using the L chain

play34:07

tool uh the iterative

play34:12

splitter okay I hope uh it was a like

play34:19

provides the the answer that they were

play34:21

looking at so the I think this is the

play34:24

last one um let's see but

play34:27

what do you where do you store chunks of

play34:30

tax in a in a gap uh cloud storage or B

play34:36

query so as I said we use vertex

play34:40

pipeline for our experiments and vertex

play34:44

pipeline used by default save all the

play34:46

artifacts on cloud storage so I will say

play34:49

cloud

play34:50

storage

play34:52

okay you're passing the

play34:55

exam uh so the the the last no this is

play34:58

the last one this is the last question

play35:01

so they're asking for best practices for

play35:03

chunking large uh table data such as a

play35:06

complex spread spreadsheet with many

play35:09

sheets so now I don't know if you had

play35:11

this kind of data but uh maybe you face

play35:14

this uh this scenario with the other use

play35:17

case that you are working on so feel

play35:19

free to provide uh some best practices

play35:22

here okay so um maybe I can join this

play35:26

question with another that I have seen

play35:28

in in the chat uh currently we are using

play35:33

for doing that we are using an external

play35:36

uh Library which is unstructured uh in

play35:38

order to extract information from uh our

play35:42

documents um you can use um some CNN

play35:47

models in order to extract even the text

play35:50

from images uh but um you you can use um

play35:57

libraries like um Open PI Exel pandas in

play36:01

these cases or um the or even

play36:06

unstructured um it's important to take

play36:08

into account that uh it's um it's even

play36:11

important the way in which you feed the

play36:14

llm uh using this kind of data so if you

play36:18

are managing a spreadsheet you need to

play36:21

create a structure in the prompt that is

play36:25

um that is useful in order to be

play36:28

understood by the LM maybe you can

play36:30

integrate some sentences uh separators

play36:34

and change the things in order to uh

play36:37

create something that is more uh

play36:40

semantic semantically valid for the llm

play36:44

that's in my

play36:46

opinion no it makes it makes totally

play36:48

sense I don't know if Dominico you want

play36:50

to add something here or we can move on

play36:53

one of the Live question that we just

play36:56

received

play36:59

okay so one of the question that uh I

play37:02

think it's a uh it's valuable

play37:06

um they're asking about how large was

play37:10

your validation set so how many uh Q&A

play37:14

pairs you

play37:17

have okay uh so we we have around uh

play37:22

2,000 questions uh that we have used in

play37:26

order to to

play37:27

generate um this synthetic data set that

play37:31

we try to split into validation and test

play37:34

Set uh in order to be consistent in for

play37:37

the definition of hyper parameters it's

play37:40

uh we have seen that

play37:43

uh the the the

play37:46

parameters uh and the kind of rag

play37:49

architecture that we are going to use

play37:51

the the chunk lengths and so on it

play37:53

depends on the questions that uh your uh

play37:58

your rag is going to receive so it's

play38:01

very important to ask your your business

play38:04

units or your customers even what kind

play38:06

of question do you do you expect to

play38:09

generate or do do you expect to

play38:11

introduce even for um creating a

play38:17

um something that is uh okay from the

play38:21

point of view of the

play38:25

prompt okay so now let's uh let's uh uh

play38:30

talk about a couple of question let me

play38:32

ask you a couple of question uh related

play38:34

to chunks so one question is how do how

play38:38

do you store the chunks with metadata

play38:41

other than the source file

play38:47

name

play38:50

Domin okay yeah we store all this

play38:53

information uh in the vector database

play38:56

because quadrant have this feature that

play38:58

you can Store and search through um um

play39:03

metadata so is very useful to have it in

play39:06

the vector database because I've seen

play39:08

another question um that it's asking how

play39:13

do you uh use the right document of the

play39:16

right people and if you have metadata in

play39:20

your vector database you can just do the

play39:23

semantic search and then filter also on

play39:26

the on this metadata so if you you can

play39:29

add a lot of metadata on your documents

play39:32

that can be useful for your uh

play39:36

rug and let me ask let me ask the last

play39:39

question uh so how you handle scenarios

play39:44

where a user ask a followup question

play39:47

that lacks

play39:54

context I'm reading the question

play39:57

uh where okay okay I can give you the

play39:59

example by the way or you can read it if

play40:02

you

play40:06

prefer uh so uh in this case uh you

play40:09

could use multiple uh strategies you can

play40:13

even um to create a summary of the

play40:16

previous conversation and then using

play40:18

them in order to feed the The Prompt for

play40:20

generating the the the new answer or you

play40:24

can uh embed the the previous

play40:26

the entire previous conversation into uh

play40:29

The Prompt but this is based even on the

play40:33

um of

play40:34

the the length input that the large

play40:38

language model could

play40:40

accept uh currently we are still working

play40:44

on it so this feature is not integrated

play40:47

for

play40:49

us okay Dom do you

play40:54

have is exactly uh that and if you want

play40:58

in long chain these two strategies are

play41:00

already implemented so it's very easy to

play41:06

implement okay so yeah I think I think

play41:11

this is it

play41:12

so um let me just before to conclude let

play41:18

me just uh uh go back on the dck on the

play41:21

on the

play41:22

slide

play41:25

and and and yeah just one last uh

play41:28

reminder uh don't forget that we are

play41:31

going to have several events uh in March

play41:33

here you have the some of them so uh

play41:37

feel free to uh join them you have the

play41:39

link to like participate uh but for now

play41:43

uh I hope uh you enjoy the session and

play41:47

uh thank you for uh

play41:53

participating thank you thank you

play41:57

bye-bye bye-bye bye
