Tune and deploy Gemini with Vertex AI and ground with Cloud databases

Google Cloud Tech
16 May 2024 · 38:32

Summary

TLDR: In this Google I/O 2024 session, Benazir Fateh and Bala Narasimhan demonstrate how to use Vertex AI across the lifecycle of Google's Gemini Pro language model. They walk through fine-tuning, deploying, and evaluating models for a media use case, using generative AI to improve web navigation. Bala then shows how to deploy a chatbot for personalized news using a jumpstart solution built on GKE, Cloud SQL, and Vertex AI, emphasizing security, infrastructure provisioning, and observability for production readiness.

Takeaways

  • 🌐 The session is part of Google I/O 2024 and focuses on leveraging Vertex AI for the lifecycle of Google's Gemini Pro language model.
  • 🤖 Benazir Fateh and Bala Narasimhan present, with Benazir specializing in AI/ML services on Google Cloud and Bala serving as a group product manager for Cloud SQL.
  • 📈 The scenario involves a media company facing customer satisfaction challenges on its online newspaper platform, indicating a need for AI-driven modernization.
  • 🔍 The team explores two GenAI applications: one for generating high-quality subhead summaries and another for a conversational interface to improve website navigation.
  • 🛠️ Building a GenAI application involves evaluating models, testing prompts, and possibly using Retrieval-Augmented Generation (RAG) or AI agents to interact with external systems.
  • 📝 Crafting the right prompt template is crucial for repeatable model output and is part of the iterative development process.
  • 🔧 Vertex AI Studio offers a platform for developing and refining generative models, with features such as rapid and side-by-side evaluation.
  • 🔬 Evaluation is key throughout the development lifecycle to ensure models meet requirements and perform well on customized datasets.
  • 🔄 Tuning the model with Vertex AI can improve performance, but careful evaluation is needed to confirm the results actually improved.
  • 📊 Vertex AI provides various metrics for evaluation, both quantitative (e.g., BLEU, ROUGE) and qualitative (e.g., fluency, coherence, safety).
  • 🚀 The session also covers deploying generative AI applications using a jumpstart solution with components such as GKE, Cloud SQL, and Vertex AI for embeddings and LLMs.

Q & A

  • What is the main topic of the Google I/O 2024 session presented by Benazir Fateh and Bala Narasimhan?

    -The session focuses on demonstrating how to leverage Vertex AI for the complete lifecycle of Google's Gemini Pro language model, including fine-tuning, deploying scalable endpoints, evaluating and comparing models, and grounding the GenAI application using Google Cloud databases.

  • What is the issue faced by the AI team in the media company scenario presented?

    -The AI team in the media company is dealing with customer satisfaction issues related to their new online newspaper. Readers are spending less time on articles and the customer satisfaction score has dropped, indicating a need for modernizing the website with better content and navigation experience.

  • What are the two GenAI applications the AI team agrees to experiment with?

    -The team decides to experiment with two GenAI applications: one to generate high-quality subhead summaries to help readers quickly decide if they want to read an article, and another to build an interface that improves website navigation in a more conversational way, providing insights into trending news.

  • What is a prompt template in the context of generative AI applications?

    -A prompt template is a recipe that developers use to get the desired model output in a repeatable manner. It serves as a set of instructions or a simple question that guides the generative AI model to produce specific results.

  • Why is evaluation important in the development lifecycle of a generative AI application?

    -Evaluation is crucial as it serves as an interactive assistant to identify if the model, prompt, and configuration are correct and producing the desired output. It also helps in making decisions such as choosing the best model for the use case and guiding the design of augmentations.

  • What is the role of Vertex AI in building predictive and generative applications?

    -Vertex AI provides a suite of services that allow developers to build both predictive and generative applications. It offers tools like Vertex AI Studio for developing and refining generative models, and Vertex AI Tuning for improving the performance of large language models in a managed and scalable way.

  • What is the purpose of the xsum dataset used in the demonstration?

    -The xsum dataset is used for the experiment to test and validate different models, prompts, and configurations for the task of summarizing newspaper articles. It provides a standardized dataset to evaluate the performance of the generative AI model.

  • What is the significance of tuning a model in the context of generative AI?

    -Tuning a model is important to improve its performance on a specific task or dataset. It allows the model to better match the tone, style, and content requirements of the application, such as generating summaries that match the publication's language style.

  • How does Vertex AI Tuning help in the process of improving an LLM model's performance?

    -Vertex AI Tuning is a fully managed service that automates the entire tuning process based on Vertex AI Pipelines. It allows developers to monitor the progress of tuning through integration with Vertex AI Tensorboard and evaluate the tuned model to ensure it meets the desired performance criteria.

  • What are the different types of evaluation techniques provided by Vertex AI for monitoring models in production?

    -Vertex AI offers computation-based and auto side by side evaluation techniques. Computation-based evaluation assesses the performance of a model with task-specific metrics computed on reference data. Auto side by side allows for pairwise comparison of models, such as comparing a new model with one in production.

  • What is the jumpstart solution presented by Bala Narasimhan for deploying generative AI applications?

    -The jumpstart solution is a set of technology components that simplifies the deployment of generative AI applications. It includes GKE for application deployment, Cloud SQL for Postgres as a vector database, and Vertex AI for embeddings model and LLM. The solution also covers provisioning infrastructure with best practices, building and deploying applications, interacting with a chatbot, and ensuring observability in production.

Outlines

00:00

🌐 Introduction to Google I/O 2024 and Generative AI

The session at Google I/O 2024 is introduced by Benazir Fateh, an AI/ML CE specialist on Google Cloud, who is joined by Bala Narasimhan, a group product manager for Cloud SQL. They plan to demonstrate how Vertex AI can be used throughout the lifecycle of Google's Gemini Pro language model, including fine-tuning, deploying, evaluating, and grounding with Google Cloud databases. The session aims to provide insights into building applications using Google Cloud, starting with a scenario where an AI team in a media company seeks to enhance user experience on their online newspaper website using generative AI to address issues with customer satisfaction and churn rates.

05:02

🤖 Crafting Generative AI Applications with Vertex AI

The script delves into the complexities of creating generative AI applications, emphasizing the iterative process of crafting prompt templates and evaluating model performance. It discusses the importance of selecting the right model, tuning it to fit specific datasets, and using evaluation as a guide for improvement. Vertex AI Studio is highlighted as a platform for developing and refining generative models, offering capabilities for prompt design and model evaluation. The session also touches on the challenges of productionalizing GenAI applications and the necessity for ongoing evaluation and monitoring.

10:04

🔬 Model Evaluation and Tuning with Vertex AI

This paragraph demonstrates the practical steps of using Vertex AI for model evaluation and tuning. It outlines the process of setting up a Google Cloud project, choosing a dataset, and configuring a model for experimentation. The use of the xsum dataset for training and the application of the Gemini Pro model are detailed, along with the creation of an evaluation task using Vertex AI's Python SDK. The results of the evaluation experiment are visualized, and the importance of quantitative and qualitative metrics in assessing model performance is discussed.

15:06

🔄 Tuning LLM Models and Observing Outcomes

The script explains the process of tuning a large language model (LLM) using Vertex AI Tuning, a fully managed service that automates and scales the tuning process. It describes how to use the service's SDK to pass training and validation datasets, define epochs, and set learning rate multipliers. The paragraph also discusses monitoring the tuning job's progress through Vertex AI Pipelines and Tensorboard, and evaluating the tuned model's performance against the original to determine improvements.

20:06

📊 Evaluating Tuned Models and Production Monitoring

The paragraph discusses the importance of evaluating tuned models and monitoring models in production. It presents a scenario where a tuned model may not perform as expected due to the quality of the dataset used for tuning. The use of quantitative and qualitative metrics to evaluate the model's performance is highlighted, along with the visualization of these metrics to make informed decisions. The paragraph also introduces Vertex AI's computation-based and auto side-by-side evaluation techniques for large-scale assessment and comparison of models in production.

25:10

πŸ“ Deploying Generative AI Applications with Jumpstart Solution

Bala Narasimhan introduces a jumpstart solution for deploying generative AI applications, focusing on the use case of a media company wanting to provide personalized news articles through a chat interface. The solution involves GKE for application deployment, Cloud SQL for Postgres as a vector database, and Vertex AI for embeddings and LLM. The steps for provisioning infrastructure, deploying applications, interacting with a chatbot, and ensuring observability are outlined, with a demo illustrating the interaction with the chatbot and the infrastructure behind it.

30:12

πŸ› οΈ Summary of Generative AI Application Deployment

The final paragraph summarizes the session's content, focusing on the jumpstart solution for deploying generative AI applications with best practices. It emphasizes the ease of deployment, the technology components involved, and the importance of observability for monitoring the application in production. The session concludes with a demonstration of the observability features provided for the vector database and the overall generative AI application.

Keywords

💡Google I/O 2024

Google I/O 2024 refers to the annual developer conference hosted by Google, which showcases the latest in technology and software development. In the video's context, it is the event where the session on leveraging Vertex AI for the lifecycle of Google's Gemini Pro language model is being presented, indicating the significance of the conference in the tech industry.

💡AI/ML

AI/ML stands for Artificial Intelligence and Machine Learning, which are fields of computer science that emphasize the creation of intelligent agents capable of learning from and making decisions based on data. In the video, AI/ML is central to the discussion about fine-tuning the Gemini Pro language model on Vertex AI, highlighting the importance of these technologies in advancing applications.

💡Vertex AI

Vertex AI is a suite of services provided by Google Cloud that enables developers to build, train, and deploy machine learning models. The video focuses on how Vertex AI can be used for the complete lifecycle management of Google's Gemini Pro language model, including fine-tuning, deployment, and evaluation.

💡Gemini Pro language model

The Gemini Pro language model is a large language model developed by Google, designed for natural language processing tasks. The video script discusses leveraging Vertex AI to fine-tune this model on specific datasets, indicating its role in enhancing applications with advanced language understanding capabilities.

💡Fine-tuning

Fine-tuning is a technique in machine learning where a pre-trained model is further trained on a specific dataset to adapt to a particular task. In the script, fine-tuning Gemini AI on Vertex AI is a key step to tailor the model's performance for generating high-quality subhead summaries and improving website navigation.

💡Cloud SQL

Cloud SQL is a fully managed database service for MySQL, PostgreSQL, and SQL Server that runs on Google Cloud Platform. The video mentions Cloud SQL as part of the infrastructure for deploying generative AI applications, emphasizing its role in providing a robust backend for database operations.

💡GKE

GKE stands for Google Kubernetes Engine, a service that enables you to run and manage containerized applications with Google's infrastructure. In the video, GKE is one of the technology components used in the jumpstart solution for deploying generative AI applications, highlighting its importance in container orchestration.

💡Evaluation metrics

Evaluation metrics are quantitative and qualitative measures used to assess the performance of a machine learning model. The script discusses the use of various metrics like BLEU and ROUGE for quantitative assessment, and fluency, coherence, and safety for qualitative evaluation, showing the multifaceted approach to model performance assessment.
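As a concrete illustration of the quantitative side, a reference-based metric such as ROUGE-L can be computed with the open-source rouge-score package (an assumed choice here); Vertex AI's evaluation services compute comparable metrics, and an autorater scores the qualitative dimensions.

```python
# Small sketch of a reference-based quantitative metric (ROUGE-L), using the
# open-source rouge-score package as an assumed choice. Vertex AI computes
# comparable metrics server-side during evaluation.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
reference = "The council approved the new transit plan on Tuesday."
candidate = "City council approves new transit plan."
scores = scorer.score(reference, candidate)
print(scores["rougeL"].fmeasure)  # overlap with the reference summary, 0.0-1.0
```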

💡Prompt template

A prompt template in the context of AI refers to a set of instructions or a question that guides the generative model to produce a desired output. The video script mentions crafting the right prompt template as a crucial step in developing generative AI applications, indicating its importance in directing model responses.

💡RAG

RAG stands for Retrieval-Augmented Generation, a machine learning approach that combines retrieval mechanisms with generative models to enhance performance. The script suggests grounding LLM responses using RAG, indicating a strategy for improving the relevance and accuracy of generated content.
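A minimal sketch of the pattern follows, with a placeholder retriever standing in for the pgvector similarity search used in the session's jumpstart solution; the project, model version, and prompt wording are assumptions.

```python
# Minimal sketch of Retrieval-Augmented Generation: retrieve relevant articles,
# then ground the prompt with them before calling the model. The retriever below
# is a hypothetical stand-in for a real vector-database lookup.
import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project="your-project-id", location="us-central1")  # placeholders
model = GenerativeModel("gemini-1.0-pro-002")  # assumed model version

def retrieve(query: str, k: int = 3) -> list[str]:
    """Placeholder retriever: returns canned snippets instead of a real vector search."""
    corpus = [
        "Article about rail strikes affecting travel reservations in Germany.",
        "Article about new airline routes between Europe and the US.",
        "Article about local election results.",
    ]
    return corpus[:k]

def answer_with_rag(question: str) -> str:
    context = "\n\n".join(retrieve(question))
    prompt = (
        "Answer the question using only the news articles below.\n\n"
        f"Articles:\n{context}\n\nQuestion: {question}"
    )
    return model.generate_content(prompt).text
```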

💡Observability

Observability in the context of software and systems refers to the ability to understand the internal state of a system through external observations. The video discusses the importance of observability for monitoring the performance and behavior of generative AI applications once deployed in production.

Highlights

Introduction to Google I/O 2024 session by Benazir Fateh, focusing on leveraging Vertex AI for the lifecycle of Google's Gemini Pro language model.

Bala Narasimhan's presentation on deploying scalable endpoints and evaluating models using Vertex AI.

The scenario of a media company using Google Cloud to address customer satisfaction issues with their online newspaper.

Utilizing generative AI to enhance web navigation and improve user experience on a newspaper's website.

The process of experimenting with GenAI applications to generate high-quality subhead summaries and conversational interfaces.

The importance of crafting the right prompt template for generative AI applications and the iterative process involved.

Evaluation of different LLM models, prompts, and configurations to identify the optimal setup for a task.

The challenges of creating and maintaining a GenAI application in production, emphasizing the need for ongoing evaluation and adjustments.

Vertex AI Studio as a platform for developing and refining generative models through a user-friendly environment.

Demonstration of using Python SDK for rapid evaluation of models and prompts in Vertex AI.

The use of the xsum dataset for training and evaluating the Gemini Pro model on summarization tasks.

Tuning the Gemini Pro model using Vertex AI Tuning service for improved performance on specific datasets.

Comparative evaluation of tuned vs. untuned models to assess the effectiveness of tuning on performance metrics.

The role of qualitative metrics like fluency, coherence, and safety in evaluating the quality of generated summaries.

Observability and monitoring of generative AI applications in production to ensure ongoing performance and user satisfaction.

Bala Narasimhan's overview of deploying a generative AI application for personalized news article recommendations using a chat interface.

Introduction of the jumpstart solution for easy deployment of generative AI applications with best practices.

Technology components of the jumpstart solution, including GKE, Cloud SQL for Postgres, and Vertex AI.

Provisioning infrastructure securely with best practices such as SSL, IAM-based authentication, and private IP.

Building and deploying applications into GKE as part of the jumpstart solution for generative AI.

Demonstration of interacting with a chatbot for personalized news article searches using natural language.

The importance of observability for monitoring generative AI applications in production, including Query Insights and data cache metrics.

Summary of the jumpstart solution's benefits for deploying generative AI applications with best practices and observability.

Transcripts

play00:00

[MUSIC PLAYING]

play00:04

BENAZIR FATEH: Hello, everyone.

play00:05

And welcome to this session of Google I/O 2024.

play00:09

My name is Benazir Fateh.

play00:10

And I'm an AI/ML CE specialized in AI/ML services

play00:14

on Google Cloud.

play00:15

And today with me, I have Bala Narasimhan,

play00:17

who is a group product manager for Cloud SQL.

play00:21

Bala and myself are going to show you

play00:24

how to leverage Vertex AI for the complete lifecycle

play00:28

of Google's Gemini Pro language model.

play00:31

In particular, you will learn how

play00:34

to fine tune Gemini AI on Vertex AI on your data sets,

play00:39

deploy scalable endpoints, and evaluate and compare models.

play00:45

Finally, Bala will show you how to ground

play00:48

Gemini and your GenAI application

play00:51

using Google Cloud databases.

play00:54

By the end of this session, you will have a better understanding

play00:57

of how you can use Google Cloud for building applications.

play01:02

So with that, let's get started.

play01:06

Sorry.

play01:09

Great, so imagine that you are part

play01:12

of an AI team in a media company that works with Google Cloud.

play01:17

Within your media company, you have online newspapers

play01:20

that represent a core part of your business.

play01:23

They have enabled your company to generate revenue

play01:26

from various sources, such as subscriptions, advertising,

play01:30

and sponsored content.

play01:33

However, there have been some recent concerns

play01:36

about customer satisfaction related to one

play01:39

of these new online newspapers.

play01:42

Some web traffic statistics and customer surveys

play01:45

have revealed that your readers are not

play01:48

as pleased with their experience navigating the newspaper's

play01:52

website.

play01:54

In particular, over the last three months,

play01:57

the results show that your newspaper website

play01:59

is experiencing a constant rise in the average churn rate.

play02:05

Readers, on average, are spending less than one minute

play02:08

on the newspaper articles.

play02:09

And the average customer satisfaction score

play02:13

has dropped down to 6 out of 10.

play02:17

All this analysis essentially say

play02:19

that you have a great opportunity

play02:21

to modernize your newspaper's website by providing

play02:25

more quality content and a better navigation experience.

play02:30

And that's why your AI lead has asked

play02:33

you to explore generative AI.

play02:36

Now, how can you use generative AI for enhancing web navigation?

play02:42

After brainstorming with your AI team,

play02:44

you agree to experiment with two different GenAI applications.

play02:48

In the first application, you would use GenAI and Gemini

play02:52

Pro on Vertex AI to generate some high quality subhead

play02:57

summaries.

play02:58

The idea is that if you generate more catchy subhead summaries

play03:03

to summarize the article, the reader will be more--

play03:07

the reader can quickly decide if they want to read this article

play03:09

or not.

play03:11

And the second application is about building an interface

play03:14

for improving readers' website navigation in a more

play03:18

conversational way.

play03:20

So that you can provide direct insights into the trending news

play03:24

of the day to the user.

play03:26

As I said, you're part of this team

play03:28

that has just been assigned to design these two GenAI

play03:31

applications.

play03:32

So the question is, how do you build this generative AI

play03:36

application?

play03:38

Well, when you start-- in general,

play03:40

when you start making a GenAI application today,

play03:43

there are several ingredients, there

play03:44

are several moving pieces in that workflow.

play03:47

You start by evaluating several different LLM

play03:50

models according to the task that you

play03:52

are trying to implement.

play03:54

You test and validate different models.

play03:57

You test and validate different prompts and also

play03:59

different configuration for the model.

play04:03

You do this to identify particularly

play04:06

a model to use and also a prompt template and a configuration

play04:10

that works best for your decided task.

play04:14

Finally, depending on the generated responses

play04:17

and the complexity of the application,

play04:19

you may also need to ground these LLM responses using RAG.

play04:25

Or you may also want to interact with some third party

play04:28

systems using AI agents.

play04:30

Now, what I described in words is, in general,

play04:33

a high level overview of how to create a generative AI

play04:35

application.

play04:36

But in practice, it's way more difficult.

play04:39

It's very difficult to create a GenAI application,

play04:41

and to put it into production, and keep

play04:44

it running in production.

play04:46

So let's start from the beginning.

play04:48

The first challenge is about crafting

play04:51

the right prompt template.

play04:54

As a developer, in order to craft a prompt template,

play04:57

you seek a recipe that gets you the desired model output

play05:01

in a very repeatable manner.

play05:03

This recipe is called the prompt template.

play05:07

Now, as you rapidly iterate through

play05:09

various prompt templates, you need evaluation to identify--

play05:14

evaluation will serve as your interactive assistant

play05:17

to identify if you are heading in the right direction,

play05:19

if your model, your prompt, and your configuration

play05:22

are correct and are giving you the desired output.

play05:27

Another decision you may face in this development lifecycle

play05:30

is choosing the best model for your use case.

play05:34

So a couple of models may meet your requirement.

play05:37

But to make a good decision, you may need

play05:39

to compare how they perform.

play05:41

Now, normally, we see that all these large language

play05:44

models are released.

play05:45

And they're released with some benchmarking data sets.

play05:48

But then those-- that data set and those benchmarking results

play05:52

show you how that particular model

play05:54

is doing on that particular data set, which

play05:57

is, many times, a general data set, or a standardized data set.

play06:01

But you still need to evaluate to see

play06:05

if this particular model, which maybe fared great

play06:08

on a particular benchmarking data set,

play06:10

does it do well on your customized data set also?

play06:13

Because summarizing a medical transcript

play06:16

is very different from summarizing a newspaper article.

play06:19

So you need to test the performance on the data that

play06:22

matters to you.

play06:23

And then as you find this optimal model,

play06:26

you might go back and tweak either the model configuration,

play06:30

the prompt template, et cetera.

play06:33

So let's say-- let's--

play06:35

and after you do this, after you do this initial thing,

play06:39

you may ultimately, you do the evaluation,

play06:41

and then you ultimately realize that maybe you need more.

play06:44

Maybe the summaries are good, but the style and tone

play06:48

of the generated summaries are not really

play06:50

what your readers are used to.

play06:52

Maybe there is a specific language

play06:54

that you are used to in your publication.

play06:57

So you can pick between tuning methods.

play07:00

You can tune your model.

play07:01

And then again, you need to evaluate that tuned model

play07:04

to see that the tuned model is performing better

play07:06

than the untuned model.

play07:10

Next, I have a couple of options.

play07:13

So once you've done this, you have tuned the model,

play07:16

still, you may need more.

play07:19

You may need your model to do more.

play07:21

There are other options, such as you

play07:23

can augment the model's knowledge

play07:24

via an external source.

play07:26

And again, you need evaluation to understand if this results

play07:29

in a better performance or not.

play07:32

Evaluation is also needed to guide decisions

play07:34

on the design of this augmentation

play07:36

as you try out different, like in your RAG system,

play07:40

you try out different chunking sizes, for example,

play07:42

for your app.

play07:45

And finally, once you have done all this testing

play07:47

and experimentation, you need to understand if you

play07:49

are ready for production.

play07:51

You need to evaluate on a larger data

play07:53

set that will cover all corner cases to see, how well does

play07:58

the final architecture perform?

play08:00

And then this work doesn't end here.

play08:02

We all know that once a model is--

play08:04

or that application is put into production,

play08:06

there will be new things that change, maybe

play08:09

the way your user is asking or interacting with the application

play08:13

changes, the prompts are changing-- may change.

play08:16

So once in production, you will need

play08:18

to observe and monitor what is happening,

play08:20

fix mistakes, run evaluation on a larger

play08:23

and more automated scale, and find potential improvements.

play08:29

So in order to do all this and to address all these challenges,

play08:33

you need skills and technology.

play08:35

And talking about technology, that's why you have Vertex AI.

play08:39

Vertex AI provides a suite of services,

play08:42

which would allow you to build both predictive as well

play08:45

as generative applications.

play08:47

Let's see how you can use GenAI services

play08:51

on Vertex AI for building the first application, which

play08:55

is the news subhead summaries generator.

play09:00

Let's start from the beginning.

play09:01

With respect to crafting the prompt template

play09:04

and validating different LLMs, Vertex AI

play09:07

provides Vertex AI Studio.

play09:10

Vertex AI Studio offers a comprehensive platform

play09:14

to develop and refine generative models.

play09:17

At the core of generative AI lies the concept

play09:21

of a prompt, which serves as a set of instructions

play09:24

or a simple question.

play09:26

Vertex AI Studio will excel in providing

play09:29

a user-friendly environment to design and evaluate

play09:32

these prompts across different LLMs

play09:35

by comparing them with either rapid evaluation capability

play09:38

or side-by-side capability.

play09:41

Now, you want to evaluate and compare prompts and models

play09:45

in your preferred IDE.

play09:47

This is the rapid evaluation SDK that you

play09:50

see for model evaluation.

play09:52

As you can see here, and I'll also show in the demo--

play09:54

what you see here this is the Python SDK.

play09:57

We also have an API for rapid evaluation.

play10:00

What you create is you start with creating an experiment,

play10:04

defining the metrics, the dimensions across which

play10:07

you want to evaluate your task on,

play10:09

and you create an evaluation task.

play10:11

And this evaluation task, you provide the data set,

play10:14

the metrics, the experiment name.

play10:17

And then finally, you run this evaluation task.
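Expressed with the rapid evaluation SDK, that flow might look roughly like the sketch below; the metric names, prompt template, project, and model version are illustrative assumptions rather than the exact values used in the demo.

```python
# Rough sketch of the rapid evaluation flow just described: an evaluation dataset,
# a set of metrics, an EvalTask tied to an experiment, and a run against a model
# and prompt template. Metric names, project, and model version are assumptions.
import pandas as pd
import vertexai
from vertexai.generative_models import GenerativeModel
from vertexai.preview.evaluation import EvalTask

vertexai.init(project="your-project-id", location="us-central1")  # placeholders

eval_dataset = pd.DataFrame({
    "content": ["<article text>"],            # article to summarize
    "reference": ["<ground-truth summary>"],  # ground truth from the dataset
})

eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=["bleu", "rouge_l_sum", "fluency", "coherence", "safety"],  # assumed metric names
    experiment="news-summarization-eval",
)

result = eval_task.evaluate(
    model=GenerativeModel("gemini-1.0-pro-002"),  # assumed model version
    prompt_template="Summarize this article in one to two sentences: {content}",
)
print(result.summary_metrics)
```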

play10:21

So let's see this in action.

play10:32

Can we switch to the--

play10:33

OK, great.

play10:34

So what you see on the screen is my Google Cloud project.

play10:40

And in the Google Cloud project, I am under Vertex AI.

play10:44

And then under Notebooks, I have created a Colab Enterprise

play10:47

notebook.

play10:49

In this Colab Enterprise notebook,

play10:51

I have pre-created and pre-run all the cells.

play10:55

And so let's see all these things into action.

play10:58

To begin with, I have done some initial plumbing.

play11:01

There is, I've created--

play11:02

I've mentioned what my project ID is, the region.

play11:06

I've created a bucket to store all the datasets,

play11:09

as well as any artifacts that get created in the workflow.

play11:13

I create a service account and provide the required roles

play11:17

and permissions for that service account.

play11:20

I create some-- I import some necessary libraries.

play11:24

And I also created some helper functions,

play11:28

which will help me with visualization

play11:31

or just displaying all the dataframe

play11:33

with different results.

play11:35

Now, the data set.

play11:38

So for this particular experiment,

play11:39

I decided to use xsum dataset.

play11:42

So I take the xsum data set, I create train, validation,

play11:46

and test data frames.

play11:48

And I have just renamed the two columns, the content and summary

play11:54

as content and reference.

play11:57

So reference here is the ground truth.

play11:58

This is the ground truth summary that you will see in the-- that

play12:01

you see in the xsum dataset.
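A rough sketch of that preparation step, assuming the Hugging Face datasets package and illustrative split sizes:

```python
# Rough sketch of the dataset preparation just described: load XSum, rename the
# "document"/"summary" columns to "content"/"reference", and split into
# train/validation/test frames. The split sizes are illustrative.
from datasets import load_dataset  # Hugging Face Datasets

xsum = load_dataset("xsum", split="train")
df = xsum.to_pandas().rename(columns={"document": "content", "summary": "reference"})
df = df[["content", "reference"]]

train_df = df.iloc[:2000]          # 2,000 tuning examples, as mentioned later in the session
validation_df = df.iloc[2000:2250]
test_df = df.iloc[2250:2350]
```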

play12:04

Great, now, the very first thing,

play12:06

so I decide that I will use Gemini 1.0 Pro 002 as the model

play12:11

for this experiment.

play12:14

I decide my temperature to be 0.1.

play12:17

And I configure some safety settings.

play12:21

And then I create a prompt template

play12:24

that I will be using for this experiment.

play12:26

So this is a very simple few lines prompt template

play12:29

that I create.

play12:30

It will take this article.

play12:31

It's supposed to summarize this article in one to two sentences.
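A prompt template along those lines might look roughly like the following sketch with the Vertex AI Python SDK; the wording, project ID, and model version are illustrative assumptions.

```python
# Minimal sketch of a prompt template for the subhead-summary task, assuming the
# Vertex AI Python SDK. The template wording, project, and model version are
# illustrative, not the exact values used in the demo.
import vertexai
from vertexai.generative_models import GenerationConfig, GenerativeModel

vertexai.init(project="your-project-id", location="us-central1")  # placeholders

PROMPT_TEMPLATE = (
    "Summarize the following newspaper article in one to two sentences, "
    "written as a catchy subhead:\n\n{article}"
)

model = GenerativeModel("gemini-1.0-pro-002")  # assumed model version

def generate_subhead(article: str) -> str:
    """Fill the template with an article and ask Gemini for a subhead summary."""
    response = model.generate_content(
        PROMPT_TEMPLATE.format(article=article),
        generation_config=GenerationConfig(temperature=0.1),  # low temperature, as in the demo
    )
    return response.text
```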

play12:35

Great, and then I just do a simple test

play12:37

with identifying that you will generate-- that Gemini

play12:40

Pro generates responses.

play12:43

And I see that I have the content.

play12:45

I have the ground truth reference.

play12:47

And then I see the summary that's generated by Gemini.

play12:51

Great, now, so what next?

play12:56

Let's now run an evaluation experiment using--

play12:59

so currently, we have decided on the model, the prompt template,

play13:02

and the configuration.

play13:04

Let's evaluate this combination.

play13:06

So I create similar to what I have shown on the screen.

play13:12

I have an experiment name.

play13:14

I have the model.

play13:15

I create-- I decide which metrics

play13:19

do I want to evaluate against?

play13:21

And then I create this evaluation task.

play13:23

And I run this evaluation task.

play13:25

Now, when this evaluation task is finished,

play13:28

you will see that-- you will see this View Experiment.

play13:32

And then it opens because it is integrated with Vertex AI

play13:36

experiments that will manage all your-- all the experiments

play13:39

that you are running.

play13:40

So the Experiments bar can open right in the notebook,

play13:44

where I can see, OK, these are the different experiments.

play13:47

And I can see the metrics right here.

play13:50

But just to aid in this--

play13:54

just so you guys can see it better,

play13:55

let's take a look at this.

play13:57

Now here, I'm under Experiments.

play13:59

And I look at experiments.

play14:00

So I can take a look at this experiment.

play14:03

Again, what it shows is it shows the metrics, the values

play14:06

of the different metrics.

play14:08

It shows the model name and the parameter template that I have.

play14:14

So great.

play14:17

You ran an evaluation experiment.

play14:18

You can see the--

play14:20

you can see the values for different metrics.

play14:22

You can also-- in this data frame, again,

play14:26

I show-- also show, What are the different values?

play14:29

and also some explanations.

play14:32

And then finally, also, let's see--

play14:35

let's see some visualization.

play14:37

So here I visualize, so as you can see,

play14:40

I had decided on two types of metrics.

play14:43

There were some quantitative metrics, like BLEU and ROUGE.

play14:47

Let me point to that here, BLEU and ROUGE, and also

play14:52

some qualitative metrics which would identify,

play14:55

how good were the model generated summaries

play14:59

when it comes to safety, coherence, and fluency?

play15:03

So as you can see, this evaluation result

play15:06

gave me some results.

play15:07

I visualize them.

play15:08

Great.

play15:09

How do I know, are these good enough or not?

play15:11

So that's what we'll cover in the next phase.

play15:15

But I also wanted to mention one more thing.

play15:18

There are many different types of tasks.

play15:19

We picked a summarization task.

play15:21

But evaluation can work on summarization, Q&A, tool use,

play15:25

text generation, et cetera, so a variety of tasks.

play15:28

And then there's a variety of evaluation metrics

play15:31

that you can use.

play15:32

And to help you with picking, what

play15:37

are the different types of metrics, one,

play15:39

available on Vertex AI and, two, that you can use?

play15:42

We also have, in our documentation, metric bundles.

play15:45

So you can see that there is a lot of different types

play15:47

of metric bundles that are available for you to use, pick

play15:50

and choose for doing the evaluation.

play15:54

Great, now let's-- can we move to the slide again, please?

play16:02

Thank you.

play16:04

So let's say that after doing the tuning and evaluation,

play16:08

you found the right prompt template for generating

play16:11

the new summaries.

play16:12

But still the tone and content of

play16:15

the generated subhead summaries doesn't really

play16:17

match with what you needed.

play16:20

Now, so in this case, maybe you want

play16:22

to consider that you can tune the model.

play16:25

So you may want to consider providing some reference

play16:28

subhead summaries as part of your tuning data

play16:31

set and then tune Gemini Pro in such a way

play16:35

that the tuned model would be better

play16:37

able to reproduce the tone--

play16:40

a tone that you are-- that you want.

play16:44

Now, tuning an LLM model, again, is very challenging

play16:49

in terms of there are different resources that you

play16:51

need to tune the model.

play16:53

And also, again, this is an entire ML workflow,

play16:57

tuning a model.

play16:57

So how can you-- how can you do that?

play17:00

Now, Vertex AI provides a service.

play17:03

It's a fully managed service, which

play17:04

is called Vertex AI Tuning.

play17:06

This service allows you to improve

play17:09

the performance of your LLM in a managed and scalable way.

play17:14

This service is fully based on Vertex AI Pipelines.

play17:17

So that you can automate the entire tuning process.

play17:20

And also, it will allow you to monitor it because you know,

play17:24

there is an--

play17:25

Vertex AI Pipelines have an inbuilt integration

play17:27

with Vertex AI Tensorboard, so all the training

play17:31

and evaluation metrics are logged into Tensorboard.

play17:34

And you can monitor the progress of tuning there.

play17:38

So to tune Gemini 1 Pro using--

play17:41

now, you can also use Vertex AI UI as well as the SDK.

play17:45

And what you see on the screen is

play17:47

an example of what the tuning SDK looks like.

play17:50

And we have tried to make it as simple for you as possible.

play17:54

So the tuning process, this is what that SDK looks like.

play17:58

It requires you to, of course, pass the training data set

play18:02

and also optionally a validation data set.

play18:05

Also, the next thing you will need to define

play18:08

is the number of epochs and learning rate multiplier.

play18:12

Finally, you launch this tuning job that you see here.

play18:15

And you will monitor its status.
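With the supervised tuning SDK, launching such a job might look roughly like this sketch; the module path, model version, dataset paths, and hyperparameter values are assumptions for illustration.

```python
# Rough sketch of launching a supervised tuning job for Gemini, mirroring the
# steps just described: source model, training and validation datasets, epochs,
# and learning rate multiplier. Module path, model version, dataset paths, and
# hyperparameters are assumptions for illustration.
import vertexai
from vertexai.preview.tuning import sft

vertexai.init(project="your-project-id", location="us-central1")  # placeholders

tuning_job = sft.train(
    source_model="gemini-1.0-pro-002",                            # assumed model version
    train_dataset="gs://your-bucket/xsum_train.jsonl",            # placeholder paths
    validation_dataset="gs://your-bucket/xsum_validation.jsonl",
    epochs=3,
    learning_rate_multiplier=1.0,
)

# The job runs asynchronously; check its state later and use the tuned endpoint.
tuning_job.refresh()
print(tuning_job.state, tuning_job.tuned_model_endpoint_name)
```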

play18:17

Let's take a look at it again in the notebook.

play18:22

Thank you.

play18:24

So how to tune the Gemini 1.0 Pro 002 model.

play18:29

So again, here what you'll see is

play18:31

I've created some training, test, and validation datasets.

play18:35

I then run this tuning job.

play18:38

This is the same thing that you saw on screen in the deck, which

play18:41

is I take the source model, I attach the training dataset

play18:45

and the validation dataset.

play18:46

I mentioned epochs and learning rate multiplier.

play18:49

That's it.

play18:50

And then you can kick off the job.

play18:53

This is an asynchronous job.

play18:54

It will continue running.

play18:56

But then let's-- so I have already run this tuning job.

play19:00

Let's take a look at what this tuning job--

play19:04

so it will be under--

play19:05

so under language, where you will see all the large language

play19:08

models, you will see this supervised tuning

play19:11

job listed there.

play19:13

So first things, if you go under the Dataset tab,

play19:18

you will see that it will-- because before the tuning

play19:21

starts, it validates your datasets.

play19:24

So it will show, OK, the training data set had 2,000

play19:27

examples and roughly these many characters in total.

play19:31

And then it gives you a small sample

play19:34

of what that dataset looks like.

play19:36

And also it shows some distribution.

play19:38

So for example, it shows the distribution

play19:40

of the input tokens per example and output tokens per example.
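For a similar picture before launching a job, token counts per example can be checked locally, for instance with a sketch like this (project and model version are placeholders):

```python
# Rough sketch of inspecting input-token counts per example locally, similar to
# the distribution the tuning job reports. Project and model version are placeholders.
import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project="your-project-id", location="us-central1")
model = GenerativeModel("gemini-1.0-pro-002")

articles = ["<article text 1>", "<article text 2>"]  # placeholder examples
counts = [model.count_tokens(text).total_tokens for text in articles]
print(f"min={min(counts)} max={max(counts)} mean={sum(counts) / len(counts):.1f}")
```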

play19:44

So this sort of validation that it does,

play19:47

the tuning job does right in the beginning, it helps,

play19:51

because like I said, Vertex AI tuning

play19:53

is a fully managed service.

play19:54

So before the tuning job starts, it

play19:56

needs to identify, what are the resources it needs

play19:59

to provision behind the scenes?

play20:01

So it identifies, what is the size of the data set

play20:05

that is going to tune with?

play20:06

And of course, you have mentioned the model

play20:08

that is used for tuning so it can easily

play20:11

provision the resources needed behind the scenes.

play20:16

And then once the tuning is in progress

play20:20

and the tuning succeeds, you will

play20:22

see different types of metrics.

play20:24

So you will see training metrics as well as validation metrics.

play20:31

Great.

play20:32

So we tuned.

play20:33

We ended up tuning the Gemini Pro 1 model on the xsum.

play20:37

So I took 2,000 examples from the xsum dataset with

play20:41

the content and ground truth.

play20:43

And I used it for tuning.

play20:45

Now, how do I know if this tuned model is good enough?

play20:49

Now we know, we all know that be it a machine learning

play20:52

model or a generative AI model, the model

play20:56

is going to be as good as the data set that you provide to it.

play20:59

So if the dataset is not good, then the--

play21:05

even when you tune the model, it won't give you good results.

play21:09

So in this case, I know that I've used Gemini Pro 002

play21:13

as a base model.

play21:15

And I've used xsum for my training--

play21:18

as a training data set.

play21:20

So let's evaluate this tuned Gemini model.

play21:25

Again, I create this--

play21:28

a test data set.

play21:29

I generate some summaries using the tuned Gemini.

play21:33

So I see Gemini summaries.

play21:34

I see the tuned Gemini summaries.

play21:36

And then I run an evaluation experiment.

play21:40

So again, here what you see is I have

play21:43

created another experiment for comparison of these two Gemini

play21:46

models.

play21:47

I have Gemini Pro, I have tuned Gemini Pro,

play21:50

and I use the same set of metrics.

play21:52

Again, there are-- a few of these metrics

play21:54

are quantitative metrics like BLEU and ROUGE.

play21:56

And the way quantitative metrics work is they

play21:58

will look at the ground truth data that you provided.

play22:03

And they identify how closely this model,

play22:07

the model in question, Gemini Pro or tuned,

play22:10

how well are those generated summaries comparing as opposed--

play22:14

as compared to the ground truth data.

play22:16

That's what the quantitative metrics are doing.

play22:19

And then there are the qualitative metrics,

play22:21

which is fluency, coherence, and safety.

play22:23

Now, in order to evaluate for qualitative metrics,

play22:28

we have, behind the scenes, an autorater model.

play22:30

So the autorater model is a Google proprietary model

play22:33

that identifies, irrespective of ground truth,

play22:36

how good is this generated summary?

play22:40

So again, I run this evaluation task.

play22:43

I can see this experiment, like we saw before.

play22:46

I can also see everything in a data frame

play22:49

format, along with explanations for qualitative metrics.

play22:56

I can also see them as--

play23:00

here, I also show them how to--

play23:04

I mean, again, I look at a few samples

play23:06

here to see what the score for fluency, et cetera, is.

play23:11

But let's take a look--

play23:12

to get a more clearer picture, let's take

play23:14

a look at-- let's visualize these evaluation results.

play23:18

Let's first take a look at the quantitative metrics.

play23:21

So what you see here is the Gemini Pro model.

play23:24

The untuned one is in blue.

play23:26

And the tuned model is in red.

play23:28

So what you see is for--

play23:31

when it comes to quantitative metrics,

play23:33

the tuned model does better.

play23:35

So, you trained this model.

play23:38

You tuned this particular model by giving it a xsum data set.

play23:43

So it does perform better.

play23:45

But I have-- I do know that xsum sometimes--

play23:49

I mean, xsum is widely used, but xsum is a low quality dataset,

play23:53

especially for generating summaries.

play23:55

And that point really shows up here,

play23:58

because when I look at the qualitative metrics--

play24:01

so when you see the qualitative metrics, what you'll see

play24:04

is for safety, both the models perform equally.

play24:09

They both are-- they both provide safe summaries.

play24:12

But when it comes to the other two metrics, which

play24:15

is coherence and fluency, I see that, actually,

play24:18

the untuned version gives me a better result, but not

play24:21

the tuned version.

play24:23

So the tuned version performs slightly less optimal--

play24:27

less better as compared to the untuned version.

play24:30

Why?

play24:31

So remember, these two are qualitative metrics.

play24:35

They don't need any ground truth.

play24:37

They are basically checking to see for this generated summary,

play24:40

are they coherent and are they fluent?

play24:44

The tuned version, because I tuned it

play24:46

on xsum, which is a low quality dataset, it ended up--

play24:50

for these two metrics, it ended up making the model even worse.

play24:54

So, this is your aha moment, where you say,

play24:57

it's not every time, it's not just that you tune a model

play25:00

and it's going to work better, no, that's not the case.

play25:02

You need to tune it.

play25:03

You need to evaluate it to see if it really performed better.

play25:10

Great, and then also I have some quantification

play25:12

of how good or bad did each metric--

play25:15

so we see that the BLEU and ROUGE metrics improved.

play25:19

But fluency and coherency actually

play25:21

decreased by this many percentage.

play25:24

Great.

play25:26

Can we move on to the next?

play25:28

Thank you.

play25:30

So for now, let's assume that maybe you didn't end up

play25:35

going with tuning.

play25:36

You just ended up going with the regular Gemini Pro model.

play25:39

But you made a decision on which model,

play25:41

and which prompt template and configuration you

play25:43

are going to use in production.

play25:45

Great, and you integrated them into your newspaper's web page.

play25:50

But, now this model is in production.

play25:52

How can you keep monitoring the capabilities of this model?

play25:56

Because we all know that once something is in production,

play25:59

again, there are tons of things can go wrong.

play26:02

And you may start seeing different types of newspaper

play26:08

articles that get--

play26:10

so maybe the prompt that you first decided

play26:13

is no longer good enough for these new types of news articles

play26:18

that you're seeing.

play26:19

Or many other factors can happen in production.

play26:23

So basically, we understand that once you

play26:25

put your model into production, you still need to evaluate it.

play26:29

So Vertex AI provides computation-based as well as

play26:33

auto side by side.

play26:34

These are, again, two different types of evaluation techniques

play26:38

provided by Vertex AI.

play26:40

What we saw earlier was rapid evaluation.

play26:42

So here, I want to make one quick distinction.

play26:45

Rapid evaluation is something, you take a few samples,

play26:48

you run that SDK, you get your evaluation.

play26:51

It's something that you do while you're doing experimentation.

play26:54

It's really quick.

play26:55

You call an SDK.

play26:56

And you're done.

play26:58

There are-- we understand that you may need

play27:00

to do a large-scale evaluation.

play27:02

So computation-based and auto side by side metrics,

play27:05

they allow you to assess the performance of a model

play27:08

with task-specific metrics that are computed on the reference

play27:11

data.

play27:12

And here, you can provide a much larger number of samples.

play27:16

And the other cool thing is, both for computation

play27:20

based and auto side by side, we have created a pipeline.

play27:23

Like there is a pipeline SDK that is provided to you.

play27:26

So this thing can-- you no longer

play27:28

have to do it in your notebook.

play27:29

You can run this pipeline.

play27:31

And this is-- you can automate it on a much larger scale.

play27:35

So once your models are in production,

play27:36

you can schedule these pipelines, like on demand

play27:39

or on a periodic recurring basis.

play27:41

So you keep getting your evaluation results.

play27:44

And the other thing is auto side by side,

play27:47

what I showed you earlier was a pointwise evaluation.

play27:50

You define the metrics, and for each metric,

play27:52

you get some result. Auto side by side

play27:54

is a pairwise evaluation.

play27:56

So you provide-- you have a model in production,

play27:58

but new models keep coming.

play28:00

So you want to compare how the new, the latest model that's

play28:03

released by Google, how is that comparing with the one

play28:05

that you have in production?

play28:06

So you can do an auto side by side comparison.

play28:11

And to run this auto side by side evaluation,

play28:13

this is what, again, the SDK looks like.

play28:16

And here, as you'll see, this is a pipeline job.

play28:18

So all you need to do is you need to configure your training

play28:22

dataset, your pipeline name, et cetera,

play28:24

and then you run this pipeline.

play28:27

So Google Cloud provides an evaluation pipeline template,

play28:31

and all you need to do is provide parameters.
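Submitting such an evaluation pipeline from the SDK might look roughly like the sketch below; the template URI and parameter names are placeholders, so consult the evaluation pipeline documentation for the exact schema.

```python
# Very rough sketch of launching a managed evaluation pipeline (for example,
# AutoSxS) as a Vertex AI PipelineJob. The template URI and parameter names are
# placeholders; the exact template and parameter schema come from the Google
# Cloud evaluation pipeline documentation.
from google.cloud import aiplatform

aiplatform.init(project="your-project-id", location="us-central1")  # placeholders

pipeline_job = aiplatform.PipelineJob(
    display_name="autosxs-tuned-vs-base",
    template_path="<evaluation-pipeline-template-uri>",   # placeholder template
    pipeline_root="gs://your-bucket/pipeline-root",        # placeholder bucket
    parameter_values={                                      # assumed parameter names
        "evaluation_dataset": "gs://your-bucket/eval_data.jsonl",
        "task": "summarization",
    },
)
pipeline_job.run()  # or .submit() to launch without blocking
```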

play28:33

Let's quickly take a look at this one in action.

play28:38

Thank you.

play28:39

So evaluation at scale for auto side by side.

play28:43

So again, in this one, I created--

play28:46

now, here for auto side by side, again, I'm comparing two models.

play28:50

So I'm comparing the tuned Gemini and the untuned Gemini.

play28:54

The same thing that we did earlier, but I'm doing--

play28:57

I'm also showing how it can be done with auto side by side.

play29:02

So I create the--

play29:03

I compile the pipeline.

play29:05

I configure the different parameters

play29:09

that I need for the summarization task,

play29:12

some prompt parameters.

play29:14

And this is my pipeline job.

play29:16

So we run this--

play29:18

this pipeline.

play29:19

And once this pipeline finishes, we

play29:21

will see some auto side by side judgments.

play29:24

So what you see here, so again, what you--

play29:28

it will create an entire dataframe.

play29:30

So you can create-- you can see what the response looks like

play29:33

for each piece of content.

play29:35

What was the response from the tuned model as well as response

play29:38

from the untuned model?

play29:42

Great.

play29:43

Can you switch back to the slide, please?

play29:47

Thank you.

play29:50

So at this point, we have our newspaper's subhead summaries

play29:53

generator in production.

play29:55

Now, what about the second GenAI application

play29:58

that I earlier talked about?

play30:00

Basically for the second one, we want

play30:02

to build an interface for improving the reader's website

play30:05

navigation in a conversational way

play30:08

by providing direct insights on the trending news of the day.

play30:12

And to talk about that, I would like

play30:14

to invite Bala to talk more about that use case.

play30:22

[APPLAUSE]

play30:26

BALA NARASIMHAN: Thank you, Benazir.

play30:28

As Benazir mentioned, what I'm going to show

play30:30

you now is how Cymbal Media can deploy a generative AI

play30:35

application.

play30:36

So that their users can find personalized news articles

play30:40

using a chat interface.

play30:42

The good news is we have put together a jumpstart solution.

play30:45

So that we can deploy generative AI applications easily.

play30:50

And I'm going to use this jumpstart solution today.

play30:53

So let's look at this jumpstart solution right now.

play30:56

So these are the technology components

play30:59

that come together to make this jumpstart solution.

play31:02

The link that you're seeing here on the right

play31:04

is where you can go to get access

play31:06

to this jumpstart solution as well, in case

play31:08

you're interested in easily deploying a generative AI

play31:11

application.

play31:12

So there are three technology components

play31:14

that come together to make this jumpstart solution possible.

play31:18

The first is GKE, where the applications will be deployed.

play31:22

The second is Cloud SQL for Postgres, which

play31:26

will act as a vector database.

play31:27

And the third is Vertex AI.

play31:29

Vertex AI will be used as an embeddings model as well

play31:33

as an LLM.

play31:34

So let's look at the various steps

play31:36

that you need to execute when you use the jumpstart solution.

play31:39

There's essentially four steps you

play31:41

need to execute in order to deploy a generative AI

play31:44

application using this jumpstart solution.

play31:47

The first is provisioning the infrastructure

play31:50

with best practices in place.

play31:52

The second is building and deploying the applications

play31:55

into GKE.

play31:57

And once you've done that, there's

play31:59

a chatbot that's available as part of that jumpstart solution.

play32:02

You're then able to interact with that chat bot.

play32:05

And then the fourth is basically observability,

play32:08

because once you deploy your generative AI

play32:10

application in production, observability

play32:13

becomes really important.

play32:14

So that you can monitor how it's behaving.

play32:17

What I'm going to do now is take you

play32:18

through each of those four steps.

play32:20

And for step three and four, I have a demo for you as well.

play32:24

So what do I mean when I say that you

play32:26

can deploy the infrastructure using best practices?

play32:31

Here's an example of that.

play32:33

When you use the jumpstart solution,

play32:34

we want to make security easy for you.

play32:37

So as you can see here, when you deploy your infrastructure using

play32:41

the jumpstart solution, SSL is enabled

play32:44

by default between the client and the server.

play32:46

Similarly, IAM based authentication

play32:49

is enabled by default. So that you

play32:51

don't need to rely on database username and password.

play32:54

And thirdly, private IP is set up by default between GKE

play32:59

and your Cloud SQL for Postgres database

play33:01

using private service connect, which

play33:03

makes it very easy to set up networking and makes it secure.
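As a hedged illustration of those defaults, an application might connect to the Cloud SQL for Postgres instance with IAM database authentication and private connectivity using the Cloud SQL Python Connector, roughly as sketched below; the instance name, IAM user, and database are placeholders.

```python
# Rough sketch of connecting to Cloud SQL for PostgreSQL with IAM database
# authentication over private connectivity, using the Cloud SQL Python Connector.
# The instance connection name, IAM user, and database name are placeholders.
import sqlalchemy
from google.cloud.sql.connector import Connector, IPTypes

connector = Connector()

def getconn():
    return connector.connect(
        "your-project:us-central1:your-instance",      # placeholder instance connection name
        "pg8000",
        user="app-sa@your-project.iam",                 # IAM principal instead of a DB password
        db="news",
        enable_iam_auth=True,                           # IAM-based authentication
        ip_type=IPTypes.PRIVATE,                        # private connectivity, not public IP
    )

engine = sqlalchemy.create_engine("postgresql+pg8000://", creator=getconn)
with engine.connect() as conn:
    print(conn.execute(sqlalchemy.text("SELECT version()")).scalar())
```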

play33:10

So as far as building and deploying applications in GKE

play33:13

goes, there's three applications that come

play33:16

with the jumpstart solution.

play33:18

The first is an application that converts

play33:21

your Cloud SQL for Postgres instance into a vector database.

play33:25

The second is the chat bot that is going

play33:28

to run in the GKE environment.

play33:30

And you're able to interact with that in natural language.

play33:33

And the third is an application that interacts with Vertex AI,

play33:37

generates the embeddings, which you can then

play33:39

store into the vector database to run your semantic searches.
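Sketched in code, that embed-and-store flow could look roughly like the following; the table schema, embedding model version, and SQL are assumptions, and the engine stands for a SQLAlchemy engine connected to the Cloud SQL instance (for example via the connector shown earlier).

```python
# Rough sketch of the embed-and-store flow: generate an embedding with Vertex AI
# and store/search it in a pgvector-enabled PostgreSQL table. Table name, schema,
# and embedding model version are assumptions; `engine` is a SQLAlchemy engine
# connected to the Cloud SQL instance.
import sqlalchemy
from vertexai.language_models import TextEmbeddingModel

embedding_model = TextEmbeddingModel.from_pretrained("textembedding-gecko@003")  # assumed model

def embed(text: str) -> list[float]:
    return embedding_model.get_embeddings([text])[0].values

def store_article(engine: sqlalchemy.Engine, article_id: int, body: str) -> None:
    with engine.begin() as conn:
        conn.execute(
            sqlalchemy.text(
                "INSERT INTO news_embeddings (id, body, embedding) "
                "VALUES (:id, :body, :embedding)"
            ),
            {"id": article_id, "body": body, "embedding": str(embed(body))},
        )

def semantic_search(engine: sqlalchemy.Engine, query: str, k: int = 5) -> list[str]:
    with engine.connect() as conn:
        rows = conn.execute(
            sqlalchemy.text(
                "SELECT body FROM news_embeddings "
                "ORDER BY embedding <=> (:q)::vector LIMIT :k"   # pgvector distance operator
            ),
            {"q": str(embed(query)), "k": k},
        )
        return [row[0] for row in rows]
```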

play33:44

So now that we've deployed the infrastructure

play33:47

and then we've deployed the applications,

play33:50

we can look at how we can interact

play33:52

with that chat bot using natural language for personalized

play33:56

searches.

play33:57

And then we can look at observability

play33:59

of your GenAI application when it's running in production.

play34:03

If we can jump to the demo, please.

play34:08

All right, so this is the new GenAI website

play34:11

that Cymbal Media has put together.

play34:13

As you can see, there's a chat bot interface here.

play34:16

I'm thinking of going on vacation to Germany.

play34:19

And I'm looking for news articles

play34:20

about travel reservations.

play34:22

So I'm going to ask the chat bot to give me that.

play34:27

The chat bot will return news articles on that topic,

play34:31

taking into account my preferences.

play34:39

As you can see here, that's exactly what has happened.

play34:42

The chat bot has taken my question

play34:44

that I asked in natural language, identified

play34:46

a news article that makes sense based on my preferences.

play34:49

So now, let's look at the infrastructure that's

play34:51

powering this application.

play34:53

This is the GCP Console for Cloud SQL for Postgres, which

play34:56

is acting as a vector database.

play34:58

It's running Postgres 15.

play35:00

You'll see that the Postgres vector database actually has

play35:03

something called a data cache.

play35:05

A data cache is a server side SSD that acts as a read cache.

play35:09

And it's able to accelerate your semantic search

play35:11

queries because of improved read throughput and read latency.

play35:16

Transitioning to observability, we

play35:18

have something called Query Insights.

play35:19

Query insights gives you an application-centric view

play35:22

for observability.

play35:24

Here, it's showing you the top queries that are running.

play35:26

It's giving you the details of that query.

play35:29

This is the semantic search query

play35:30

that we executed as part of that natural language

play35:34

question that was asked.

play35:35

It's going to give you information

play35:37

about the average execution time for that semantic search.

play35:41

On the right-hand side, you're seeing a visual query plan

play35:43

for the same query, which gives you a visual way

play35:46

to see the different steps that were

play35:48

executed when that particular vector search was run.

play35:52

You're also seeing that it's showing you

play35:54

who the top users are of the vector database.

play35:56

So you can understand, in terms of load

play35:58

on the database, who are the different users that

play36:00

are executing the queries?

play36:02

And here is the-- again, coming back to the visual query plan,

play36:05

you're seeing that there was a sequential scan as part

play36:07

of looking up that vector index when that semantic search

play36:10

was executed.

play36:11

We also provide you a rich set of metrics

play36:14

so that you can monitor what's going on

play36:16

in your vector database.

play36:17

I talked about the data cache.

play36:19

Again, the data cache is a server side SSD

play36:21

that accelerates your semantic queries.

play36:24

Here, there is a couple of metrics associated

play36:26

with that data cache.

play36:27

The first is the size of the data cache itself,

play36:29

which in our demo over here is 375 gigabytes.

play36:33

There's a metric that will show you that.

play36:35

There's also a metric that will show you

play36:37

that at any point in time, how much of that

play36:39

data cache is actually consumed.

play36:41

That way, you can monitor to make sure

play36:43

that your data cache is sized appropriately

play36:46

for the kinds of queries that you are running

play36:48

against that vector database.

play36:50

So you'll see that right now.

play36:58

All right, to summarize then, what I've been talking about

play37:01

is a jumpstart solution that we put together

play37:03

that allows you to easily deploy generative AI applications using

play37:08

best practices.

play37:10

We provided you a link where you can go and access

play37:13

this same jumpstart solution.

play37:15

And you can deploy generative AI applications easily using that.

play37:19

We talked about the different technology components that

play37:23

form that jumpstart solution.

play37:25

And we showed you a demo of the observability

play37:28

that comes along with it.

play37:29

So that you can run this generative AI

play37:31

application in production.

play37:33

If you can transition back to the slides, please.

play37:37

All right, to summarize what Benazir and I have been talking

play37:40

about today, we started off with the problem statement,

play37:43

which is that Cymbal Media was facing three problems.

play37:46

Firstly, their users were churning on their website.

play37:50

Secondly, engagement on any specific news article

play37:53

was going down, which means there

play37:55

was an erosion of trust between Cymbal Media and its users.

play37:59

As part of that, we came up with an AI-driven solution,

play38:02

which was to revamp the website using generative AI

play38:06

and make it a personalized experience for their users.

play38:09

And in order to enable that, we use three technologies.

play38:12

We used Vertex AI.

play38:14

We used Cloud SQL.

play38:15

And we used GKE.

play38:19

Thank you.

play38:20

[MUSIC PLAYING]