Tune and deploy Gemini with Vertex AI and ground with Cloud databases
Summary
TL;DR: In this Google I/O 2024 session, Benazir Fateh and Bala Narasimhan demonstrate how to use Vertex AI across the lifecycle of Google's Gemini Pro language model. They walk through fine-tuning, deploying, and evaluating models for a media application, enhancing web navigation with generative AI. Bala showcases deploying a chatbot for personalized news using a jumpstart solution built on GKE, Cloud SQL, and Vertex AI, emphasizing security, infrastructure provisioning, and observability for production readiness.
Takeaways
- The session is part of Google I/O 2024 and focuses on leveraging Vertex AI for the lifecycle of Google's Gemini Pro language model.
- Benazir Fateh and Bala Narasimhan present, with Benazir specializing in AI/ML services on Google Cloud and Bala serving as a group product manager for Cloud SQL.
- The scenario involves a media company facing challenges with customer satisfaction on its online newspaper platform, indicating a need for AI modernization.
- The team explores two GenAI applications: one for generating high-quality subhead summaries and another for a conversational interface to improve website navigation.
- The process of creating a GenAI application involves evaluating models, testing prompts, and possibly using Retrieval-Augmented Generation (RAG) or AI agents for interaction.
- Crafting the right prompt template is crucial for repeatable model output and is part of the iterative development process.
- Vertex AI Studio offers a platform for developing and refining generative models with features like rapid and side-by-side evaluation.
- Evaluation is key throughout the development lifecycle to ensure models meet requirements and perform well on customized datasets.
- Tuning the model with Vertex AI can improve performance, but careful evaluation is needed to confirm the results actually improved.
- Vertex AI provides various metrics for evaluation, including both quantitative (e.g., BLEU, ROUGE) and qualitative (e.g., fluency, coherence, safety).
- The session also covers deploying generative AI applications using a jumpstart solution with components like GKE, Cloud SQL, and Vertex AI for embeddings and LLMs.
Q & A
What is the main topic of the Google I/O 2024 session presented by Benazir Fateh and Bala Narasimhan?
-The session focuses on demonstrating how to leverage Vertex AI for the complete lifecycle of Google's Gemini Pro language model, including fine-tuning, deploying scalable endpoints, evaluating and comparing models, and grounding the GenAI application using Google Cloud databases.
What is the issue faced by the AI team in the media company scenario presented?
-The AI team in the media company is dealing with customer satisfaction issues related to their new online newspaper. Readers are spending less time on articles and the customer satisfaction score has dropped, indicating a need for modernizing the website with better content and navigation experience.
What are the two GenAI applications the AI team agrees to experiment with?
-The team decides to experiment with two GenAI applications: one to generate high-quality subhead summaries to help readers quickly decide if they want to read an article, and another to build an interface that improves website navigation in a more conversational way, providing insights into trending news.
What is a prompt template in the context of generative AI applications?
-A prompt template is a recipe that developers use to get the desired model output in a repeatable manner. It serves as a set of instructions or a simple question that guides the generative AI model to produce specific results.
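As an illustrative sketch (the template wording and variable names here are invented, not taken from the session), a prompt template can be as simple as a format string with a placeholder for the article:

```python
# A minimal prompt template: a reusable "recipe" with a placeholder for the
# article text, so the same instructions are applied to every article.
PROMPT_TEMPLATE = (
    "Summarize the following newspaper article in one to two sentences, "
    "using a catchy, reader-friendly tone.\n\n"
    "Article:\n{article}\n\nSummary:"
)

def build_prompt(article: str) -> str:
    """Fill the template with a specific article to get a concrete prompt."""
    return PROMPT_TEMPLATE.format(article=article)
```

Iterating on the template then means editing one string while the surrounding code, and the evaluation harness around it, stay unchanged.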
Why is evaluation important in the development lifecycle of a generative AI application?
-Evaluation is crucial as it serves as an interactive assistant to identify if the model, prompt, and configuration are correct and producing the desired output. It also helps in making decisions such as choosing the best model for the use case and guiding the design of augmentations.
What is the role of Vertex AI in building predictive and generative applications?
-Vertex AI provides a suite of services that allow developers to build both predictive and generative applications. It offers tools like Vertex AI Studio for developing and refining generative models, and Vertex AI Tuning for improving the performance of large language models in a managed and scalable way.
What is the purpose of the xsum dataset used in the demonstration?
-The xsum dataset is used for the experiment to test and validate different models, prompts, and configurations for the task of summarizing newspaper articles. It provides a standardized dataset to evaluate the performance of the generative AI model.
What is the significance of tuning a model in the context of generative AI?
-Tuning a model is important to improve its performance on a specific task or dataset. It allows the model to better match the tone, style, and content requirements of the application, such as generating summaries that match the publication's language style.
How does Vertex AI Tuning help in the process of improving an LLM model's performance?
-Vertex AI Tuning is a fully managed service that automates the entire tuning process based on Vertex AI Pipelines. It allows developers to monitor the progress of tuning through integration with Vertex AI Tensorboard and evaluate the tuned model to ensure it meets the desired performance criteria.
What are the different types of evaluation techniques provided by Vertex AI for monitoring models in production?
-Vertex AI offers computation-based and auto side by side evaluation techniques. Computation-based evaluation assesses the performance of a model with task-specific metrics computed on reference data. Auto side by side allows for pairwise comparison of models, such as comparing a new model with one in production.
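At its core, a pairwise comparison like auto side by side asks a judge which of two candidate responses is better for each example and aggregates the verdicts into win rates. A toy sketch of that aggregation (the judge below is an invented stand-in function, not Google's actual autorater):

```python
def side_by_side(examples, judge):
    """Tally wins and ties for model A vs. model B according to a judge function."""
    tally = {"a_wins": 0, "b_wins": 0, "ties": 0}
    for ex in examples:
        # judge returns "A", "B", or "tie" for each prompt/response pair
        verdict = judge(ex["prompt"], ex["response_a"], ex["response_b"])
        if verdict == "A":
            tally["a_wins"] += 1
        elif verdict == "B":
            tally["b_wins"] += 1
        else:
            tally["ties"] += 1
    return tally

def length_judge(prompt, a, b):
    """Toy judge: prefer the shorter (more concise) summary; tie on equal length."""
    if len(a) < len(b):
        return "A"
    if len(b) < len(a):
        return "B"
    return "tie"
```

In the real service the judge is a proprietary autorater model, but the resulting report has the same shape: per-example verdicts rolled up into an overall preference.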
What is the jumpstart solution presented by Bala Narasimhan for deploying generative AI applications?
-The jumpstart solution is a set of technology components that simplifies the deployment of generative AI applications. It includes GKE for application deployment, Cloud SQL for Postgres as a vector database, and Vertex AI for embeddings model and LLM. The solution also covers provisioning infrastructure with best practices, building and deploying applications, interacting with a chatbot, and ensuring observability in production.
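The retrieval step behind such a chatbot reduces to embedding the user's query and ranking stored article vectors by similarity. In the actual solution that ranking happens inside Cloud SQL for Postgres acting as the vector database; the plain-Python sketch below only illustrates the idea, with toy three-dimensional "embeddings" rather than real model output:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy "vector database": article id -> precomputed embedding.
articles = {
    "sports-final": [0.9, 0.1, 0.0],
    "election-recap": [0.1, 0.9, 0.2],
    "tech-launch": [0.2, 0.2, 0.9],
}

def top_k(query_embedding: list[float], k: int = 2) -> list[str]:
    """Return the ids of the k articles most similar to the query embedding."""
    ranked = sorted(
        articles,
        key=lambda aid: cosine_similarity(query_embedding, articles[aid]),
        reverse=True,
    )
    return ranked[:k]
```

The retrieved articles are then passed to the LLM as grounding context for the chat response.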
Outlines
Introduction to Google I/O 2024 and Generative AI
The session at Google I/O 2024 is introduced by Benazir Fateh, an AI/ML CE specialist on Google Cloud, who is joined by Bala Narasimhan, a group product manager for Cloud SQL. They plan to demonstrate how Vertex AI can be used throughout the lifecycle of Google's Gemini Pro language model, including fine-tuning, deploying, evaluating, and grounding with Google Cloud databases. The session aims to provide insights into building applications using Google Cloud, starting with a scenario where an AI team in a media company seeks to enhance user experience on their online newspaper website using generative AI to address issues with customer satisfaction and churn rates.
Crafting Generative AI Applications with Vertex AI
The script delves into the complexities of creating generative AI applications, emphasizing the iterative process of crafting prompt templates and evaluating model performance. It discusses the importance of selecting the right model, tuning it to fit specific datasets, and using evaluation as a guide for improvement. Vertex AI Studio is highlighted as a platform for developing and refining generative models, offering capabilities for prompt design and model evaluation. The session also touches on the challenges of productionalizing GenAI applications and the necessity for ongoing evaluation and monitoring.
Model Evaluation and Tuning with Vertex AI
This paragraph demonstrates the practical steps of using Vertex AI for model evaluation and tuning. It outlines the process of setting up a Google Cloud project, choosing a dataset, and configuring a model for experimentation. The use of the xsum dataset for training and the application of the Gemini Pro model are detailed, along with the creation of an evaluation task using Vertex AI's Python SDK. The results of the evaluation experiment are visualized, and the importance of quantitative and qualitative metrics in assessing model performance is discussed.
Tuning LLM Models and Observing Outcomes
The script explains the process of tuning a large language model (LLM) using Vertex AI Tuning, a fully managed service that automates and scales the tuning process. It describes how to use the service's SDK to pass training and validation datasets, define epochs, and set learning rate multipliers. The paragraph also discusses monitoring the tuning job's progress through Vertex AI Pipelines and Tensorboard, and evaluating the tuned model's performance against the original to determine improvements.
Evaluating Tuned Models and Production Monitoring
The paragraph discusses the importance of evaluating tuned models and monitoring models in production. It presents a scenario where a tuned model may not perform as expected due to the quality of the dataset used for tuning. The use of quantitative and qualitative metrics to evaluate the model's performance is highlighted, along with the visualization of these metrics to make informed decisions. The paragraph also introduces Vertex AI's computation-based and auto side-by-side evaluation techniques for large-scale assessment and comparison of models in production.
Deploying Generative AI Applications with Jumpstart Solution
Bala Narasimhan introduces a jumpstart solution for deploying generative AI applications, focusing on the use case of a media company wanting to provide personalized news articles through a chat interface. The solution involves GKE for application deployment, Cloud SQL for Postgres as a vector database, and Vertex AI for embeddings and LLM. The steps for provisioning infrastructure, deploying applications, interacting with a chatbot, and ensuring observability are outlined, with a demo illustrating the interaction with the chatbot and the infrastructure behind it.
Summary of Generative AI Application Deployment
The final paragraph summarizes the session's content, focusing on the jumpstart solution for deploying generative AI applications with best practices. It emphasizes the ease of deployment, the technology components involved, and the importance of observability for monitoring the application in production. The session concludes with a demonstration of the observability features provided for the vector database and the overall generative AI application.
Keywords
Google I/O 2024
AI/ML
Vertex AI
Gemini Pro language model
Fine-tuning
Cloud SQL
GKE
Evaluation metrics
Prompt template
RAG
Observability
Highlights
Introduction to Google I/O 2024 session by Benazir Fateh, focusing on leveraging Vertex AI for the lifecycle of Google's Gemini Pro language model.
Bala Narasimhan's presentation on deploying scalable endpoints and evaluating models using Vertex AI.
The scenario of a media company using Google Cloud to address customer satisfaction issues with their online newspaper.
Utilizing generative AI to enhance web navigation and improve user experience on a newspaper's website.
The process of experimenting with GenAI applications to generate high-quality subhead summaries and conversational interfaces.
The importance of crafting the right prompt template for generative AI applications and the iterative process involved.
Evaluation of different LLM models, prompts, and configurations to identify the optimal setup for a task.
The challenges of creating and maintaining a GenAI application in production, emphasizing the need for ongoing evaluation and adjustments.
Vertex AI Studio as a platform for developing and refining generative models through a user-friendly environment.
Demonstration of using Python SDK for rapid evaluation of models and prompts in Vertex AI.
The use of xsum dataset for training and evaluating the Gemini Pro model for summarization tasks.
Tuning the Gemini Pro model using Vertex AI Tuning service for improved performance on specific datasets.
Comparative evaluation of tuned vs. untuned models to assess the effectiveness of tuning on performance metrics.
The role of qualitative metrics like fluency, coherence, and safety in evaluating the quality of generated summaries.
Observability and monitoring of generative AI applications in production to ensure ongoing performance and user satisfaction.
Bala Narasimhan's overview of deploying a generative AI application for personalized news article recommendations using a chat interface.
Introduction of the jumpstart solution for easy deployment of generative AI applications with best practices.
Technology components of the jumpstart solution, including GKE, Cloud SQL for Postgres, and Vertex AI.
Provisioning infrastructure securely with best practices such as SSL, IAM-based authentication, and private IP.
Building and deploying applications into GKE as part of the jumpstart solution for generative AI.
Demonstration of interacting with a chatbot for personalized news article searches using natural language.
The importance of observability for monitoring generative AI applications in production, including Query Insights and data cache metrics.
Summary of the jumpstart solution's benefits for deploying generative AI applications with best practices and observability.
Transcripts
[MUSIC PLAYING]
BENAZIR FATEH: Hello, everyone.
And welcome to this session of Google I/O 2024.
My name is Benazir Fateh.
And I'm an AI/ML CE specialized in AI/ML services
on Google Cloud.
And today with me, I have Bala Narasimhan,
who is a group product manager for Cloud SQL.
Bala and myself are going to show you
how to leverage Vertex AI for the complete lifecycle
of Google's Gemini Pro language model.
In particular, you will learn how
to fine tune Gemini AI on Vertex AI on your data sets,
deploy scalable endpoints, and evaluate and compare models.
Finally, Bala will show you how to ground
Gemini and your GenAI application
using Google Cloud databases.
By the end of this session, you will have a better understanding
of how you can use Google Cloud for building applications.
So with that, let's get started.
Sorry.
Great, so imagine that you are part
of an AI team in a media company that works with Google Cloud.
Within your media company, you have online newspapers
that represent a core part of your business.
They have enabled your company to generate revenue
from various sources, such as subscriptions, advertising,
and sponsored content.
However, there have been some recent concerns
about customer satisfaction related to one
of these new online newspapers.
Some web traffic statistics and customer surveys
have revealed that your readers are not
as pleased with their experience navigating the newspaper's
website.
In particular, over the last three months,
it results that your newspaper website
is experiencing a constant rise in the average churn rate.
Readers, on average, are spending less than one minute
on the newspaper articles.
And the average customer satisfaction score
has dropped down to 6 out of 10.
All this analysis essentially say
that you have a great opportunity
to modernize your newspaper's website by providing
more quality content and a better navigation experience.
And that's why your AI lead has asked
you to explore generative AI.
Now, how can you use generative AI for enhancing web navigation?
After brainstorming with your AI team,
you agree to experiment with two different GenAI applications.
In the first application, you would use GenAI and Gemini
Pro on Vertex AI to generate some high quality subhead
summaries.
The idea is that if you generate more catchy subhead summaries
to summarize the article, the reader will be more--
the reader can quickly decide if they want to read this article
or not.
And the second application is about building an interface
for improving readers' website navigation in a more
conversational way.
So that you can provide direct insights into the trending news
of the day to the user.
As I said, you're part of this team
that has just been assigned to design these two GenAI
applications.
So the question is, how do you build this generative AI
application?
Well, when you start-- in general,
when you start making a GenAI application today,
there are several ingredients, there
are several moving pieces in that workflow.
You start by evaluating several different LLM
models according to the task that you
are trying to implement.
You test and validate different models.
You test and validate different prompts and also
different configuration for the model.
You do this to identify particularly
a model to use and also a prompt template and a configuration
that works best for your decided task.
Finally, depending on the generated responses
and the complexity of the application,
you may also need to ground these LLM responses using RAG.
Or you may also want to interact with some third party
systems using AI agents.
Now, what I described in words is, in general,
a high level overview of how to create a generative AI
application.
But in practice, it's way more difficult.
It's very difficult to create a GenAI application,
and to put it into production, and keep
it running in production.
So let's start from the beginning.
The first challenge is about crafting
the right prompt template.
As a developer, in order to craft a prompt template,
you seek a recipe that gets you the desired model output
in a very repeatable manner.
This recipe is called the prompt template.
Now, as you rapidly iterate through
various prompt templates, you need evaluation to identify--
evaluation will serve as your interactive assistant
to identify if you are heading in the right direction,
if your model, your prompt, and your configuration
are correct and are giving you the desired output.
Another decision you may face in this development lifecycle
is choosing the best model for your use case.
So a couple of models may meet your requirement.
But to make a good decision, you may need
to compare how they perform.
Now, normally, we see that all these large language
models are released.
And they're released with some benchmarking data sets.
But then those-- that data set and those benchmarking results
show you how that particular model
is doing on that particular data set, which
is, many times, a general data set, or a standardized data set.
But you still need to evaluate to see
if this particular model, which maybe fared great
on a particular benchmarking data set,
does it do well on your customized data set also?
Because summarizing a medical transcript
is very different from summarizing a newspaper article.
So you need to test the performance on the data that
matters to you.
And then as you find this optimal model,
you might go back and tweak either the model configuration,
the prompt template, et cetera.
So let's say-- let's--
and after you do this, after you do this initial thing,
you may ultimately, you do the evaluation,
and then you ultimately realize that maybe you need more.
Maybe the summaries are good, but the style and tone
of the generated summaries are not really
what your readers are used to.
Maybe there is a specific language
that you are used to in your publication.
So you can pick between tuning methods.
You can tune your model.
And then again, you need to evaluate that tuned model
to see that the tuned model is performing better
than the untuned model.
Next, I have a couple of options.
So once you've done this, you have tuned the model,
still, you may need more.
You may need your model to do more.
There are other options, such as you
can augment the model's knowledge
via an external source.
And again, you need evaluation to understand if this results
in a better performance or not.
Evaluation is also needed to guide decisions
on the design of this augmentation
as you try out different, like in your RAG system,
you try out different chunking sizes, for example,
for your app.
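The chunking knob mentioned here can be sketched as a naive character-window chunker with overlap (real RAG pipelines usually split on sentence or token boundaries instead; this is only an illustration of the parameter being tuned):

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size windows, with some overlap between neighbors
    so that a sentence cut at a boundary still appears whole in one chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks
```

Re-running the evaluation with different `chunk_size` and `overlap` values is exactly the kind of design decision evaluation is meant to guide.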
And finally, once you have done all this testing
and experimentation, you need to understand if you
are ready for production.
You need to evaluate on a larger data
set that will cover all corner cases to see, how well does
the final architecture perform?
And then this work doesn't end here.
We all know that once a model is--
or that application is put into production,
there will be new things that change, maybe
the way your user is asking or interacting with the application
changes, the prompts are changing-- may change.
So once in production, you will need
to observe and monitor what is happening,
fix mistakes, run evaluation on a larger
and more automated scale, and find potential improvements.
So in order to do all this and to address all these challenges,
you need skills and technology.
And talking about technology, that's why you have Vertex AI.
Vertex AI provides a suite of services,
which would allow you to build both predictive as well
as generative applications.
Let's see how you can use GenAI services
on Vertex AI for building the first application, which
is the news subhead summaries generator.
Let's start from the beginning.
With respect to crafting the prompt template
and validating different LLMs, Vertex AI
provides Vertex AI Studio.
Vertex AI Studio offers a comprehensive platform
to develop and refine generative models.
At the core of generative AI lies the concept
of a prompt, which serves as a set of instructions
or a simple question.
Vertex AI Studio will excel in providing
a user-friendly environment to design and evaluate
these prompts across different LLMs
by comparing them with either rapid evaluation capability
or side-by-side capability.
Now, you want to evaluate and compare prompts and models
in your preferred IDE.
This is the rapid evaluation SDK that you
see for model evaluation.
As you can see here, and I'll also show in the demo--
what you see here this is the Python SDK.
We also have an API for rapid evaluation.
What you create is you start with creating an experiment,
defining the metrics, the dimensions across which
you want to evaluate your task on,
and you create an evaluation task.
And this evaluation task, you provide the data set,
the metrics, the experiment name.
And then finally, you run this evaluation task.
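That flow, create an experiment, pick metrics, build an evaluation task over a dataset, then run it, can be mimicked in plain Python. This is a toy stand-in purely for illustration; the real rapid evaluation SDK lives in the `vertexai` package and has its own surface:

```python
class EvalTask:
    """Toy evaluation task: a dataset plus named metric functions (illustrative only)."""

    def __init__(self, dataset, metrics, experiment):
        self.dataset = dataset        # list of {"response": ..., "reference": ...}
        self.metrics = metrics        # metric name -> fn(response, reference) -> float
        self.experiment = experiment  # experiment name, for bookkeeping

    def evaluate(self):
        """Average each metric over the dataset, like the SDK's summary metrics."""
        return {
            name: sum(fn(row["response"], row["reference"]) for row in self.dataset)
            / len(self.dataset)
            for name, fn in self.metrics.items()
        }

def exact_match(response: str, reference: str) -> float:
    return 1.0 if response.strip() == reference.strip() else 0.0

task = EvalTask(
    dataset=[
        {"response": "A short summary.", "reference": "A short summary."},
        {"response": "Something else.", "reference": "A short summary."},
    ],
    metrics={"exact_match": exact_match},
    experiment="news-subhead-eval",
)
```

The managed service adds what this toy lacks: built-in metric implementations, autorater-based qualitative metrics, and logging of every run into Vertex AI Experiments.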
So let's see this in action.
Can we switch to the--
OK, great.
So what you see on the screen is my Google Cloud project.
And in the Google Cloud project, I am under Vertex AI.
And then under Notebooks, I have created a Colab Enterprise
notebook.
In this Colab Enterprise notebook,
I have pre-created and pre-run all the cells.
And so let's see all these things into action.
To begin with, I have done some initial plumbing.
There is, I've created--
I've mentioned what my project ID is, the region.
I've created a bucket to store all the datasets,
as well as any artifacts that get created in the workflow.
I create a service account and provide the required roles
and permissions for that service account.
I create some-- I import some necessary libraries.
And I also created some helper functions,
which will help me with visualization
or just displaying all the dataframe
with different results.
Now, the data set.
So for this particular experiment,
I decided to use xsum dataset.
So I take the xsum data set, I create, train validation
and test data frames.
And I have just renamed the two columns, the content and summary
as content and reference.
So reference here is the ground truth.
This is the ground truth summary that you will see in the-- that
you see in the xsum dataset.
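The dataset preparation just described can be sketched without pandas. The xsum records pair a `document` with a reference `summary`; the split ratios below are made up for illustration, and the demo itself works with dataframes:

```python
import random

def prepare_splits(records, seed: int = 42):
    """Rename xsum-style fields to content/reference, then split 80/10/10
    into train, validation, and test sets."""
    renamed = [
        # "summary" becomes the ground-truth "reference" column
        {"content": r["document"], "reference": r["summary"]}
        for r in records
    ]
    rng = random.Random(seed)   # fixed seed keeps the split reproducible
    rng.shuffle(renamed)
    n = len(renamed)
    train_end = int(n * 0.8)
    val_end = int(n * 0.9)
    return renamed[:train_end], renamed[train_end:val_end], renamed[val_end:]
```

Keeping a held-out test split untouched is what later allows a fair comparison between the base and the tuned model.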
Great, now, the very first thing,
so I decide that I will use Gemini 1 Pro 002 as a model
for this experiment.
I decide my temperature to be 0.1.
And I configure some safety settings.
And then I create a prompt template
that I will be using for this experiment.
So this is a very simple few lines prompt template
that I create.
It will take this article.
It's supposed to summarize this article in one to two sentences.
Great, and then I just do a simple test
with identifying that you will generate-- that Gemini
Pro generates responses.
And I see that I have the content.
I have the ground truth reference.
And then I see the summary that's generated by Gemini.
Great, now, so what next?
Let's now run an evaluation experiment using--
so currently, we have decided on the model, the prompt template,
and the configuration.
Let's evaluate this combination.
So I create similar to what I have shown on the screen.
I have an experiment name.
I have the model.
I create-- I decide which metrics
do I want to evaluate against?
And then I create this evaluation task.
And I run this evaluation task.
Now, when this evaluation task is finished,
you will see that-- you will see this View Experiment.
And then it opens because it is integrated with Vertex AI
experiments that will manage all your-- all the experiments
that you are running.
So the Experiments bar can open right in the notebook,
where I can see, OK, these are the different experiments.
And I can see the metrics right here.
But just to aid in this--
just so you guys can see it better,
let's take a look at this.
Now here, I'm under Experiments.
And I look at experiments.
So I can take a look at this experiment.
Again, what it shows is it shows the metrics, the values
of the different metrics.
It shows the model name and the parameter template that I have.
So great.
You ran an evaluation experiment.
You can see the--
you can see the values for different metrics.
You can also-- in this data frame, again,
I show-- also show, What are the different values?
and also some explanations.
And then finally, also, let's see--
let's see some visualization.
So here I visualize, so as you can see,
I had decided on two types of metrics.
There were some quantitative metrics, like BLEU and ROUGE.
Let me point to that here, BLEU and ROUGE, and also
some qualitative metrics which would identify,
how good were the model generated summaries
when it comes to safety, coherence, and fluency?
So as you can see, this evaluation result
gave me some results.
I visualize them.
Great.
How do I know, are these good enough or not?
So that's what we'll cover in the next phase.
But I also wanted to mention one more thing.
There are many different types of tasks.
We picked a summarization task.
But evaluation can work on summarization, Q&A, tool use,
text generation, et cetera, so a variety of tasks.
And then there's a variety of evaluation metrics
that you can use.
And to help you with picking, what
are the different types of metrics, one,
available on Vertex AI and, two, that you can use?
We also have, in our documentation, metric bundles.
So you can see that there is a lot of different types
of metric bundles that are available for you to use, pick
and choose for doing the evaluation.
Great, now let's-- can we move to the slide again, please?
Thank you.
So let's say that after doing the tuning and evaluation,
you found the right prompt template for generating
the new summaries.
But still the tone and content of
the generated subhead summaries doesn't really
match with what you needed.
Now, so in this case, maybe you want
to consider that you can tune the model.
So you may want to consider providing some reference
subhead summaries as part of your tuning data
set and then tune Gemini Pro in such a way
that the tuned model would be better
able to reproduce the tone--
a tone that you are-- that you want.
Now, tuning an LLM model, again, is very challenging
in terms of there are different resources that you
need to tune the model.
And also, again, this is an entire ML workflow,
tuning a model.
So how can you-- how can you do that?
Now, Vertex AI provides a service.
It's a fully managed service, which
is called Vertex AI Tuning.
This service allows you to improve
the performance of your LLM in a managed and scalable way.
This service is fully based on Vertex AI Pipelines.
So that you can automate the entire tuning process.
And also, it will allow you to monitor it because you know,
there is an--
Vertex AI Pipelines have an inbuilt integration
with Vertex AI Tensorboard, so all the training
and evaluation metrics are logged into Tensorboard.
And you can monitor the progress of tuning there.
So to tune Gemini 1 Pro using--
now, you can also use Vertex AI UI as well as the SDK.
And what you see on the screen is
an example of what the tuning SDK looks like.
And we have tried to make it as simple for you as possible.
So the tuning process, this is what that SDK looks like.
It requires you to, of course, pass the training data set
and also optimally a validation data set.
Also, the next thing you will need to define
is the number of epochs and learning rate multiplier.
Finally, you launch this tuning job that you see here.
And you will monitor its status.
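The training dataset handed to that job is a JSONL file with one example per line, each pairing an input with the desired output. A sketch of a single record is below; the field names follow one published version of the Vertex AI supervised tuning format for Gemini and should be checked against the current documentation, and the bracketed text is a placeholder, not real data:

```json
{"contents": [{"role": "user", "parts": [{"text": "Summarize the following article in one to two sentences: <article text>"}]}, {"role": "model", "parts": [{"text": "<reference subhead summary>"}]}]}
```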
Let's take a look at it again in the notebook.
Thank you.
So how to tune the Gemini 1.0 Pro 002 model.
So again, here what you'll see is
I've created some training, test, and validation datasets.
I then run this tuning job.
This is the same thing that you saw on screen in the deck, which
is I take the source model, I attach the training dataset
and the validation dataset.
I mentioned epochs and learning rate multiplier.
That's it.
And then you can kick off the job.
This is an asynchronous job.
It will continue running.
But then let's-- so I have already run this tuning job.
Let's take a look at what this tuning job--
so it will be under--
so under language, where you will see all the large language
models, you will see this supervised tuning
job listed there.
So first things, if you go under the Dataset tab,
you will see that it will-- because before the tuning
starts, it validates your datasets.
So it will show, OK, the training data set had 2,000
examples and roughly these many characters in total.
And then it gives you a small sample
of what that dataset looks like.
And also it shows some distribution.
So for example, it shows the distribution
of the input tokens per example and output tokens per example.
So this sort of validation that it does,
the tuning job does right in the beginning, it helps,
because like I said, Vertex AI tuning
is a fully managed service.
So before the tuning job starts, it
needs to identify, what are the resources it needs
to provision behind the scenes?
So it identifies, what is the size of the data set
that is going to tune with?
And of course, you have mentioned the model
that is used for tuning so it can easily
provision the resources needed behind the scenes.
And then once the tuning is in progress
and the tuning succeeds, you will
see different types of metrics.
So you will see training metrics as well as validation metrics.
Great.
So we tuned.
We ended up tuning the Gemini 1.0 Pro model on the xsum dataset.
So I took 2,000 examples from the xsum dataset with
the content and ground truth.
And I used it for tuning.
Now, how do I know if this tuned model is good enough?
Now we know, we all know that be it a machine learning
model or a generative AI model, the model
is going to be as good as the data set that you provide to it.
So if the dataset is not good, then the--
even when you tune the model, it won't give you good results.
So in this case, I know that I've used Gemini Pro 002
as a base model.
And I've used xsum for my training--
as a training data set.
So let's evaluate this tuned Gemini model.
Again, I create this--
a test data set.
I generate some summaries using the tuned Gemini.
So I see Gemini summaries.
I see the tuned Gemini summaries.
And then I run an evaluation experiment.
So again, here what you see is I have
created another experiment for comparison of these two Gemini
models.
I have Gemini Pro, I have tuned Gemini Pro,
and I use the same set of metrics.
Again, a few of these metrics
are quantitative metrics like BLEU and ROUGE.
The way quantitative metrics work is that they
look at the ground truth data that you provided
and identify how closely the generated summaries
from the model in question, Gemini Pro or tuned Gemini Pro,
compare to that ground truth data.
That's what the quantitative metrics are doing.
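To make the idea concrete, here is a simplified ROUGE-1 F1 score in plain Python: unigram overlap between a generated summary and its ground-truth reference. The real metric implementations also handle stemming, higher-order n-grams, and so on, and the example strings below are made up.

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: unigram overlap between a generated
    summary and its ground-truth reference (lowercased whitespace
    tokens; real implementations add stemming, n-grams, etc.)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Made-up example: the summary closer to the ground truth scores higher.
ground_truth = "the council approved the new budget"
base_summary = "council passes new spending plan"
tuned_summary = "the council approved a new budget"
```

A metric like this only rewards closeness to the reference, which is exactly why a model tuned on XSum-style references can score well on ROUGE while the qualitative metrics tell a different story.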
And then there are the qualitative metrics,
which is fluency, coherence, and safety.
Now, in order to evaluate for qualitative metrics,
we have, behind the scenes, an autorater model.
So the autorater model is a Google proprietary model
that identifies, irrespective of ground truth,
how good is this generated summary?
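The autorater itself is a proprietary model, but the general LLM-as-judge pattern it follows can be sketched: build a grading prompt that needs no ground truth, send it to a judge model, and parse a score out of the free-text reply. Everything below is hypothetical plumbing (the prompt wording, the `Score: N` convention, the canned reply standing in for a real judge call), not the actual autorater.

```python
import re

# Hypothetical grading prompt; the real autorater's prompt is not public.
FLUENCY_PROMPT = """You are a strict grader. Rate the FLUENCY of the
summary below on a scale of 1-5 and answer as 'Score: N'.
No ground truth is needed; judge the text on its own.

Summary: {summary}
"""

def build_judge_prompt(summary: str) -> str:
    """Fill the summary into the grading prompt."""
    return FLUENCY_PROMPT.format(summary=summary)

def parse_score(judge_reply: str) -> int:
    """Pull the numeric 1-5 rating out of the judge model's reply."""
    match = re.search(r"Score:\s*([1-5])", judge_reply)
    if not match:
        raise ValueError("no score found in judge reply")
    return int(match.group(1))

# A canned reply stands in for the real autorater call.
reply = "Score: 4. The summary reads naturally with minor awkwardness."
score = parse_score(reply)
```

The key property this pattern shares with the autorater is that the score depends only on the generated text itself, not on any reference summary.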
So again, I run this evaluation task.
I can see this experiment, like we saw before.
I can also see everything in a data frame
format, along with explanations for qualitative metrics.
Here, I also look at a few samples
to see what the score for fluency, et cetera, is.
But to get a clearer picture,
let's visualize these evaluation results.
Let's first take a look at the quantitative metrics.
So what you see here is the Gemini Pro model.
The untuned one is in blue.
And the tuned model is in red.
So what you see is for--
when it comes to quantitative metrics,
the tuned model does better.
So, you tuned this particular model
by giving it the XSum dataset,
so it does perform better.
But I do know that, while XSum is widely used,
it is a low-quality dataset,
especially for a summaries generator.
And that point really shows up here
when I look at the qualitative metrics.
For safety, both models perform equally.
They both provide safe summaries.
But when it comes to the other two metrics,
coherence and fluency, I see that the untuned version
actually gives me a better result than the tuned version.
So the tuned version performs slightly worse
than the untuned version.
Why?
Remember, these two are qualitative metrics.
They don't need any ground truth.
They are basically checking, for each generated summary:
is it coherent and is it fluent?
Because I tuned it on XSum, which is a low-quality dataset,
the tuning ended up making the model
even worse on these two metrics.
So this is your aha moment, where you realize
it's not the case that you simply tune a model
and it's going to work better.
You need to tune it.
You need to evaluate it to see if it really performed better.
Great. And then I also have some quantification
of how much each metric improved or degraded.
So we see that the BLEU and ROUGE metrics improved,
but fluency and coherence
actually decreased by this much.
Great.
Can we move on to the next?
Thank you.
So for now, let's assume that maybe you didn't end up
going with tuning.
You just ended up going with the regular Gemini Pro model.
But you made a decision on which model,
and which prompt template and configuration you
are going to use in production.
Great, and you integrated them into your newspaper's web page.
But, now this model is in production.
How can you keep monitoring the capabilities of this model?
Because we all know that once something is in production,
there are tons of things that can go wrong.
You may start seeing different types of newspaper articles,
so maybe the prompt that you first decided on
is no longer good enough for these new types of news articles
that you're seeing.
Or many other factors can change in production.
So basically, we understand that once you
put your model into production, you still need to evaluate it.
So Vertex AI provides computation-based evaluation
as well as auto side-by-side.
These are two different types of evaluation techniques
provided by Vertex AI.
What we saw earlier was rapid evaluation.
So here, I want to make one quick distinction.
Rapid evaluation is where you take a few samples,
you run the SDK, and you get your evaluation.
It's something that you do while you're experimenting.
It's really quick: you call an SDK, and you're done.
But we understand that you may need
to do a large-scale evaluation.
Computation-based and auto side-by-side metrics
allow you to assess the performance of a model
with task-specific metrics that are computed on the reference
data.
And here, you can provide a much larger number of samples.
And the other cool thing is, both for computation-based
and auto side-by-side, we have created a pipeline.
There is a pipeline SDK that is provided to you,
so you no longer have to do it in your notebook.
You can run this pipeline
and automate it on a much larger scale.
So once your models are in production,
you can schedule these pipelines, like on demand
or on a periodic recurring basis.
So you keep getting your evaluation results.
And the other thing about auto side-by-side:
what I showed you earlier was a pointwise evaluation,
where you define the metrics and, for each metric,
you get some result. Auto side-by-side
is a pairwise evaluation.
So you have a model in production,
but new models keep coming.
You want to compare how the latest model
released by Google compares with the one
that you have in production.
So you can do an auto side-by-side comparison.
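Conceptually, a pairwise evaluation reduces to a per-example verdict (which model's response was better, or a tie) that then gets aggregated into win rates. A minimal sketch of that aggregation, with made-up verdicts:

```python
from collections import Counter

def win_rates(judgments):
    """Aggregate per-example side-by-side verdicts ('A', 'B', or 'TIE')
    into overall win rates for candidate A vs. candidate B."""
    counts = Counter(judgments)
    total = len(judgments)
    return {
        "a_win_rate": counts["A"] / total,
        "b_win_rate": counts["B"] / total,
        "tie_rate": counts["TIE"] / total,
    }

# Hypothetical verdicts: production model (A) vs. newly released model (B).
rates = win_rates(["B", "B", "A", "TIE", "B"])
```

In the managed pipeline, an autorater produces these per-example judgments for you and the dataframe you get back includes both responses alongside the verdict.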
And to run this auto side-by-side evaluation,
this is what the SDK looks like.
As you'll see, this is a pipeline job.
So all you need to do is configure your training
dataset, your pipeline name, et cetera,
and then run the pipeline.
So Google Cloud provides an evaluation pipeline template,
and all you need to do is provide parameters.
Let's quickly take a look at this one in action.
Thank you.
So evaluation at scale for auto side by side.
Now, for auto side-by-side, I'm again comparing two models:
the tuned Gemini and the untuned Gemini.
It's the same thing that we did earlier,
but I'm showing how it can be done with auto side-by-side.
So I compile the pipeline.
I configure the different parameters
that I need for the summarization task,
plus some prompt parameters.
And this is my pipeline job.
So we run this pipeline.
And once this pipeline finishes,
we will see some auto side-by-side judgments.
What you see here is that it creates an entire dataframe,
so you can see, for each piece of content,
what the response was from the tuned model
as well as from the untuned model.
Great.
Can you switch back to the slide, please?
Thank you.
So at this point, we have our newspaper's subhead summaries
generator in production.
Now, what about the second GenAI application
that I earlier talked about?
Basically for the second one, we want
to build an interface for improving the reader's website
navigation in a conversational way
by providing direct insights on the trending news of the day.
And to talk about that, I would like
to invite Bala to talk more about that use case.
[APPLAUSE]
BALA NARASIMHAN: Thank you, Benazir.
As Benazir mentioned, what I'm going to show
you now is how Cymbal Media can deploy a generative AI
application so that their users can find personalized news
articles using a chat interface.
The good news is we have put together a jumpstart solution
so that you can deploy generative AI applications easily.
And I'm going to use this jumpstart solution today.
So let's look at this jumpstart solution right now.
So these are the technology components
that come together to make this jumpstart solution.
The link that you're seeing here on the right
is where you can go to get access
to this jumpstart solution as well, in case
you're interested in easily deploying a generative AI
application.
So there are three technology components
that come together to make this jumpstart solution possible.
The first is GKE, where the applications will be deployed.
The second is Cloud SQL for Postgres, which
will act as a vector database.
And the third is Vertex AI.
Vertex AI will be used both for an embeddings model
and for an LLM.
So let's look at the various steps
that you need to execute when you use the jumpstart solution.
There are essentially four steps you
need to execute in order to deploy a generative AI
application using this jumpstart solution.
The first is provisioning the infrastructure
with best practices in place.
The second is building and deploying the applications
into GKE.
And once you've done that, there's
a chatbot that's available as part of that jumpstart solution,
and you're then able to interact with it.
And then the fourth is observability,
because once you deploy your generative AI
application in production, observability
becomes really important,
so that you can monitor how it's behaving.
What I'm going to do now is take you
through each of those four steps.
And for step three and four, I have a demo for you as well.
So what do I mean when I say that you
can deploy the infrastructure using best practices?
Here's an example of that.
When you use the jumpstart solution,
we want to make security easy for you.
As you can see here, when you deploy your infrastructure using
the jumpstart solution, SSL is enabled
by default between the client and the server.
Similarly, IAM-based authentication
is enabled by default, so that you
don't need to rely on a database username and password.
And thirdly, private IP is set up by default between GKE
and your Cloud SQL for Postgres database
using Private Service Connect, which
makes the networking very easy to set up and keeps it secure.
As far as building and deploying applications in GKE
goes, there are three applications that come
with the jumpstart solution.
The first is an application that converts
your Cloud SQL for Postgres instance into a vector database.
The second is the chatbot that is going
to run in the GKE environment,
and you're able to interact with it in natural language.
And the third is an application that interacts with Vertex AI,
generates the embeddings, which you can then
store into the vector database to run your semantic searches.
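Under the hood, a semantic search against that vector database boils down to ranking stored article embeddings by their distance to the query embedding. Here is a toy sketch of the idea in plain Python, mirroring what pgvector's cosine-distance operator (`<=>`) computes in an `ORDER BY ... LIMIT k` query. The 3-dimensional vectors and article titles are made up; real Vertex AI embeddings have hundreds of dimensions.

```python
import math

def cosine_distance(a, b):
    """The quantity pgvector's `<=>` operator returns:
    1 - cosine similarity of the two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def semantic_search(query_vec, articles, k=1):
    """Rank stored article embeddings by distance to the query embedding,
    as `ORDER BY embedding <=> $1 LIMIT k` would do in Postgres."""
    ranked = sorted(articles,
                    key=lambda a: cosine_distance(query_vec, a["embedding"]))
    return [a["title"] for a in ranked[:k]]

# Toy 3-d embeddings stand in for real Vertex AI embedding vectors.
articles = [
    {"title": "Train strikes disrupt German travel", "embedding": [0.9, 0.1, 0.0]},
    {"title": "Local team wins championship", "embedding": [0.0, 0.2, 0.9]},
]
# A query embedding close to the travel article.
top = semantic_search([1.0, 0.0, 0.1], articles, k=1)
```

In the jumpstart solution, the embeddings application computes these vectors via Vertex AI and stores them in Cloud SQL, so the ranking happens inside Postgres rather than in application code.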
So now that we've deployed the infrastructure
and the applications,
we can look at how we can interact
with that chatbot using natural language for personalized
searches.
And then we can look at observability
of your GenAI application when it's running in production.
If we can jump to the demo, please.
All right, so this is the new GenAI website
that Cymbal Media has put together.
As you can see, there's a chatbot interface here.
I'm thinking of going on vacation to Germany.
And I'm looking for news articles
about travel reservations.
So I'm going to ask the chatbot to give me that.
The chatbot will return news articles on that topic,
taking into account my preferences.
As you can see here, that's exactly what has happened.
The chatbot has taken my question,
asked in natural language,
and identified a news article that makes sense
based on my preferences.
So now, let's look at the infrastructure that's
powering this application.
This is the GCP Console for Cloud SQL for Postgres, which
is acting as a vector database.
It's running Postgres 15.
You'll see that the Postgres vector database actually has
something called a data cache.
A data cache is a server-side SSD that acts as a read cache.
And it's able to accelerate your semantic search
queries because of improved read throughput and read latency.
Transitioning to observability, we
have something called Query Insights.
Query Insights gives you an application-centric view
for observability.
Here, it's showing you the top queries that are running.
It's giving you the details of that query.
This is the semantic search query
that we executed as part of that natural language
question that was asked.
It's going to give you information
about the average execution time for that semantic search.
On the right-hand side, you're seeing a visual query plan
for the same query, which gives you a visual way
to see the different steps that were
executed when that particular vector search was run.
You're also seeing that it's showing you
who the top users are of the vector database.
So you can understand, in terms of load
on the database, who are the different users that
are executing the queries?
And again, coming back to the visual query plan,
you're seeing that there was a sequential scan as part
of looking up that vector index when that semantic search
was executed.
We also provide you a rich set of metrics
so that you can monitor what's going on
in your vector database.
I talked about the data cache.
Again, the data cache is a server-side SSD
that accelerates your semantic queries.
Here, there are a couple of metrics associated
with that data cache.
The first is the size of the data cache itself,
which in our demo over here is 375 gigabytes.
There's a metric that will show you that.
There's also a metric that will show you
that at any point in time, how much of that
data cache is actually consumed.
That way, you can monitor to make sure
that your data cache is sized appropriately
for the kinds of queries that you are running
against that vector database.
So you'll see that right now.
All right, to summarize then, what I've been talking about
is a jumpstart solution that we put together
that allows you to easily deploy generative AI applications using
best practices.
We provided you a link where you can go and access
this same jumpstart solution.
And you can deploy generative AI applications easily using that.
We talked about the different technology components that
form that jumpstart solution.
And we showed you a demo of the observability
that comes along with it.
So that you can run this generative AI
application in production.
If you can transition back to the slides, please.
All right, to summarize what Benazir and I have been talking
about today, we started off with the problem statement,
which is that Cymbal Media was facing problems.
Firstly, their users were churning on their website.
Secondly, engagement on any specific news article
was going down, which meant there
was an erosion of trust between Cymbal Media and its users.
As part of that, we came up with an AI-driven solution,
which was to revamp the website using generative AI
and make it a personalized experience for their users.
And in order to enable that, we used three technologies.
We used Vertex AI.
We used Cloud SQL.
And we used GKE.
Thank you.
[MUSIC PLAYING]