Code-First LLMOps from prototype to production with GenAI tools | BRK110
Summary
TLDR: In this video, Cassie and Dan from the Azure AI team walk through a code-first process for building LLM (large language model) applications. They cover how developers can use AI to improve customer experiences, and introduce new tools and features that address challenges such as improving model performance and debugging. The session introduces Prompty, a new tool; AI Studio integration; evaluation and automation with GitHub Actions; and local tracing and debugging with the prompt flow tracing SDK. It also covers evaluation in CI/CD environments and production monitoring with App Insights, including visualizing evaluation scores.
Takeaways
- 🌟 Introduces the talk on taking code-first LLM applications from prototype to production, presented by Cassie and Dan from the Azure AI team.
- 🛠️ Based on customer feedback, the team built new tools that address the challenges of developing LLM applications: getting models to perform well, debugging locally and in production, and handling nondeterminism.
- 🔄 The LLM application development process has three iterative loops: ideation, development, and production, each with a different focus.
- 🚀 ASOS is cited as an example of using AI to improve the customer experience. They have developed an AI-powered experience that befriends customers while learning their preferences.
- 🛠️ New tooling includes AI Studio integration in the Azure Developer CLI, Prompty, a VS Code extension, and GitHub Actions, all of which make AI more approachable for developers.
- 📝 Prompty is a language-agnostic prompt asset designed to make prompt engineering easy right inside VS Code.
- 🔧 Prompty files let developers start the prototyping loop and quickly understand how the AI responds.
- 🔄 Developers can use Prompty to generate code quickly and run prompts from languages such as C#.
- 📈 The Azure Developer CLI lets you use AI Studio templates for infrastructure as code and automation.
- 🤖 The creative agent example shows how AI uses web search and vector database search to produce an article.
- 🔧 When articles are cut off mid-sentence, developers can dig into the application code and adjust parameters.
- 📊 Evaluation metrics include relevance, fluency, coherence, and groundedness; the Prompt Flow Evaluators SDK can score application quality.
- 👀 Tracing and debugging are extremely useful for understanding how the AI model behaves and for quickly identifying and fixing issues.
- 📊 A monitoring dashboard built on App Insights lets you monitor LLM application performance in production.
- 🔄 CI/CD with GitHub Actions automatically tests changes and continuously improves application quality and performance.
Q & A
What challenges have customers reported in developing code-first language model (LLM) applications?
- Getting models to perform well, debugging both locally and in production, the nondeterministic nature of models, and new challenges for developers who are used to true/false behavior.
What are the three iterative loops in the LLM application development process?
- Ideation, where you come up with ideas, build an initial proof of concept, and check the output for a single user input; development, where you build the application out beyond the proof of concept; and production, where you move the application into the real world and use feedback from real users to keep improving quality.
How is ASOS using AI to improve the customer experience?
- ASOS has developed an AI-powered experience that befriends customers and learns their preferences.
What is the Prompty tool and what are its benefits?
- Prompty is a language-agnostic prompt asset that lets you iterate on prompts and do prompt engineering against a language model right inside VS Code.
What is the benefit of the AI Studio integration added to the Azure Developer CLI?
- You can use AI Studio templates to treat infrastructure as code and automate deployment.
How are GitHub Actions used for evaluation and automation?
- GitHub Actions are used to evaluate prompts and automate workflows, testing code and infrastructure changes so the latest version of the application is always being tested.
What is the main change AI tooling brings to developers?
- AI becomes another tool on every developer's tool belt; it moves out of the hands of researchers and data scientists and into the hands of all developers building applications.
What is prompt flow tracing and how does it work?
- Prompt flow tracing records the inputs and outputs of specific functions in your code to make application behavior easier to understand. It is built on OpenTelemetry, and you can import the Azure Monitor trace exporter to send the information to Application Insights.
What is the benefit of sending application trace information to Application Insights?
- Sending trace information to Application Insights lets developers monitor application performance, troubleshoot issues, and improve the application.
What is the Prompt Flow Evaluators SDK and how is it integrated into an application?
- The Prompt Flow Evaluators SDK scores the generated article results. Running evaluation inside the application lets you automatically assess article quality and get feedback.
What is the benefit of using GitHub Actions in a CI/CD environment?
- Every time a code change is pushed, infrastructure is provisioned and the application is deployed automatically, so you are always testing the latest version of the application.
Should you use Azure Monitor or AI Studio to monitor your application, and why?
- It is the developer's choice: Azure Monitor provides advanced monitoring capabilities, while AI Studio has evaluation and tracing built in and provides tools for comparing different runs.
What is important for understanding how prompt changes affect an application's evaluation scores?
- After a prompt change, run a batch evaluation through GitHub Actions and monitor how the application's performance changes.
Why are tracing and debugging important in AI application development?
- Tracing and debugging help you understand what is sent to the language model, what comes back, and why the application behaves the way it does.
What is the main benefit developers get from tuning prompt parameters?
- By tuning prompt parameters, developers can optimize application performance and improve the quality of the generated content.
What is the benefit of creating custom evaluators with the prompt flow SDK?
- Custom evaluators let developers tailor the evaluation process to their specific business needs.
What should you pay attention to when tuning prompt parameters?
- Set the maximum number of tokens the language model can generate and the target length of the generated article appropriately.
Outlines
😀 Introducing AI tools for developers
In this session, led by Cassie and Dan, the Azure AI team introduces new tools and capabilities. They focus on why LLM applications have been hard to build: improving model performance, debugging, and dealing with nondeterminism. The new tooling is intended to help organizations simplify the development process. Development follows three iterative loops: ideation and an initial proof of concept, a development phase, and deployment to production. The main activities of each phase are described.
🛠️ Prompt development with the Prompty tool
Prompty is a language-agnostic prompt asset that makes it easy to iterate on and develop prompts in VS Code. It consists of YAML plus a prompt template and integrates with orchestration frameworks. Prompty files let developers run quick prototypes and see how the AI responds. The Prompty VS Code extension can create new Prompty files and modify and test existing prompts.
🌐 Azure Developer CLI and AI Studio integration
AI Studio integration has been added to the Azure Developer CLI, enabling infrastructure as code and automation. AI Studio templates can be used to deploy applications quickly. Evaluation and automation with GitHub Actions is also demonstrated.
📝 An example of code development with Prompty
The session also shows concrete code development with Prompty: exporting a Prompty file, iterating on it in VS Code, and converting a Prompty into code. An example implementation of the prompt in Semantic Kernel using C# is shown.
🔍 Debugging and tracing the application
The importance of debugging and tracing is emphasized; prompt flow tracing lets you follow each step of the application. Tracing is built on OpenTelemetry and can send data to Application Insights, with a configurable sampling rate.
📈 Evaluating and improving with prompt flow
The Prompt Flow Evaluators SDK is used to evaluate the quality of generated articles. Articles are scored on metrics such as relevance, fluency, coherence, and groundedness, and the scores are used to improve application quality.
🚀 Automation with CI/CD and GitHub Actions
CI/CD automation with GitHub Actions is covered. Pushing code changes automatically provisions infrastructure and deploys the application, and an evaluation action scores the model's performance so quality can be improved.
📊 Monitoring dashboards and production visibility
Monitoring and evaluation in production are emphasized; Application Insights tracks average evaluation scores and token usage per model. The dashboard shows each model's performance and enables cost and performance optimization.
🎉 Recap of new features and what's next
The session closes with a recap of the new features and a look ahead: Prompty, prompt flow tracing and debugging, GenAI monitoring, and azd integration for AI Studio are all available to developers. Links to upcoming sessions and resources are also shared.
Keywords
💡Azure AI
💡Prompt Engineering
💡Prototyping
💡Development Tools
💡AI Studio
💡Prompt Flow
💡Tracing
💡CI/CD
💡Application Insights
💡Prompt Flow Evaluators
Highlights
Introduction to the webinar on transitioning language model (LM) applications from prototype to production using Azure AI tools.
Challenges in building LM applications, such as model performance, debugging, and nondeterministic behavior.
The importance of new tooling to assist in the development and performance of LM applications.
The three iterative loops of building LM applications: ideation, development, and production.
Step-by-step process for developing LM applications, from identifying business use cases to deploying and monitoring in production.
Case study of ASOS using AI to improve shopping experiences and create an AI-powered customer preference learning system.
Introduction of new tooling like Azure Developer CLI with AI Studio integration, Prompty, and VS Code extensions for prompt engineering.
Explanation of Prompty as a language-agnostic prompt asset for iterative prompt engineering in VS Code.
Demonstration of using Prompty files for quick iteration and prototyping of LM applications.
Showcasing the Azure Developer CLI for streamlined deployment of applications to Azure with AI Studio support.
Discussion on the integration of AI into traditional development practices, making AI a common tool for developers.
Introduction of the creative agent solution using function calling and AI search for information retrieval.
Deployment of a creative agent application using a single command with the Azure Developer CLI.
Use of GitHub Code Spaces for setting up a development environment with all dependencies installed.
Demonstration of the creative writing team application showcasing the interaction between different AI agents.
Analysis of an issue with article cutoff and exploration of potential solutions by adjusting LM parameters.
Integration of prompt flow tracing for better observability and debugging of LM applications.
Introduction of the Prompt Flow Evaluators SDK for evaluating the quality of LM outputs with scores.
Customization of evaluation metrics and prompts to tailor to specific company needs.
Use of GitHub Actions for CI/CD to automate testing and deployment of LM applications.
Monitoring and evaluation of LM applications in production using Azure Application Insights.
Closing remarks highlighting the transformative impact of the new tools and methodologies on generative AI application development.
Transcripts
Oh, they're already up. OK, Good.
Good, good, good.
So is it just one that ohh, there we go.
Alright, welcome everyone, welcome to Code-First LLMOps from Prototype to Production with GenAI Tools. I'm Cassie Breviu and I'm a Senior Technical Program Manager on the Azure AI team.
And hi everyone, I'm Dan Taylor. I'm a Principal Product
Architect working on our code first experiences in the Azure
AI platform.
So we have a lot of cool stuff to show you today. And we've been hearing from our customers that LLM applications are difficult to build. There are a lot of new challenges that we're faced with, like how to get models to perform well, debugging both locally and in production, and then also the fact that they're nondeterministic when we're used to true and false. And there are different ways that we can mitigate and work with models in order to get them to perform well.
So we decided that we needed to make new and
updated tooling to help make it real for organizations.
So if you think about the process of building LLM applications, it works a little bit differently. And we talk about these three different iterative loops. There's a lot more experimentation happening as you're developing the applications. First, you might start ideating, building an initial proof of concept. You just start with some LLM calls with some hard-coded information and some prompts, just to see what the LLM can do, to see if your ideas are possible, feasible. And then you might just start with a single user input just to see if you can get the output you're looking for. And then when you move on to building up the application beyond the proof of concept into the development phase, you'll start making sure that it works well across a wider range of inputs, that the quality is good, that it's ready for users to test it. And then as you move the application into production, that's when you put it out in the real world. And you want to use the inputs from the real world to understand how your application is performing and bring that data back into your development environment to keep improving that quality. So that's sort of the life cycle, and we'll be walking through these three phases today in our talk.
And I love these high-level slides, but I want to know what are the steps that I actually take when I'm doing this? So if we dig into that, we can take a look at the step-by-step process. We identify a business use case or a task, then we go and play around with different models and see which one is going to solve our problem best. We'll start testing it, iterating and evaluating until we think we have something that is good enough for production. Then we'll go ahead and deploy that and monitor it in production.
So a great example of a customer that's creating amazing experiences with AI is ASOS. They've been improving shoppers' experiences with innovative AI. They've developed an AI-powered experience that befriends customers to learn their preferences.
So conceptually, this all makes sense at a high level, the phases that we go through. But as a developer, I really want to know, what do the tools actually look like? What does the code look like? Could you help us dive in, Cassie?
There we go. And this is where a lot of us are at in this journey: where do I start? How do I actually build these things? And that's where the tooling that we've been working on comes in. Another interesting thing happening in the industry is that traditional development and AI development, and the AI personas, are really coming together into a single thing. And really what happens is that AI just becomes another tool in your developer tool belt. It's no longer only in the researchers' and data scientists' hands. It's now in all of our hands to start building, and the tooling should really represent that change. And so we're going to be introducing new tooling and updated tooling to support this journey. One of the existing tools is the Azure Developer CLI, but we have added AI Studio integration into it, so you can start using AI Studio templates to do infrastructure as code and automation. We have Prompty, which is a brand new tool which you may have seen in the keynote. We're going to show you some more of it today and we'll talk about that. We have a VS Code extension for Prompty that allows us to iterate really nicely in VS Code to start working on our prompts. And then we all know and love GitHub Actions, and we're gonna show how we can start using those for evaluation and automation as well.
So, introducing Prompty, in my opinion the cutest new product that we've been talking about at Build, right? Yeah. So Prompty is a language-agnostic prompt asset. It's there to help you start iterating and playing with prompts and doing that prompt engineering piece right in VS Code, and to make it easier to get started.
If we take a look at what Prompty looks like, if you look on the right, this is a Prompty file. The top part is the YAML, which gives instructions to the orchestration framework. It's integrated with Prompt Flow, Semantic Kernel, and LangChain, and they're going to take that and create the necessary parameters, which you can also override in the frameworks. And then the bottom part is your actual prompt template, which is going to grab the variables that you send in and send that to your LLM. So the way this works, it actually makes it super easy to start, because you can run this individually in VS Code, but then you can also start building on it with the different orchestration frameworks.
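As a rough illustration, a Prompty file pairs YAML front matter (model configuration, parameters, sample inputs) with a templated prompt body. The field names below are an approximation for this sketch and may differ from the shipped schema, but the shape is the same:

```yaml
---
name: customer_support
description: A hypothetical Prompty asset, for illustration only
model:
  api: chat
  configuration:
    type: azure_openai
    azure_endpoint: ${env:AZURE_OPENAI_ENDPOINT}   # your deployed endpoint
    azure_deployment: gpt-35-turbo                  # your model deployment name
  parameters:
    max_tokens: 1200
    temperature: 0.7
sample:
  firstName: Cassie
  question: Give me more information about the features.
---
system:
You are a helpful assistant for Contoso Outdoor products.
Address the user as {{firstName}}.

user:
{{question}}
```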
So this I understand: it's a single file that I can just check into my code, and I can start iterating on it and start that prototyping loop where I can just start running and executing the prompt template and get a sense for how the AI responds, without needing to, you know, install a bunch of stuff and figure out how to call all these different SDKs and things like that.
Exactly. So let's show how we can get started with Prompty.
OK, so there are actually two different ways we can get started. We can start in the chat playground in AI Studio, where we can play with different model deployments and start iterating. And this is sometimes where a lot of people kind of fall off a cliff in development. They're like, OK, I kind of have this model working, I have this prompt that I've been working with in my system message, but how do I go to code? So now you can simply click export and go to a Prompty file. That will open up the file and you can start iterating in VS Code.
Another way that you can start, and I've already installed my VS Code extension here, is I can right-click and create a new Prompty. This is going to create a default Prompty to help me get started. As you can see, it has the YAML at the top, where I need to add my endpoint and deployment, and at the bottom we have the template. So I'm going to go ahead and paste in my endpoint and deployment, and then I can just go ahead and hit play. So now I'm running a Prompty file. It's going to use my deployed endpoint, and it's going to give me a result from my LLM. So there you go.
Now let's say I want to start updating this. Here are the sample inputs that I'm using. I'm going to change the name to Cassie, and I'm going to change the question to: give me more information about the features. Now I can go ahead and hit play again, and now I'm doing iterative prompt engineering in a local playground, right in VS Code.
There you go. We were able to see that the
updates happened.
Now how do I go to the next step? How do I start creating code? I can right-click on the Prompty, and you can see we have Prompt Flow, Semantic Kernel, and LangChain. So I can right-click, create prompt flow code, and you can see it's a very small amount of code to actually start using this Prompty file. I can right-click and use C#, as I love C#, and look at how simple this is in Semantic Kernel. This is all I need for my prompt to start executing in Semantic Kernel. And then of course we have LangChain. So this is a really quick way to start creating your prompts, adding them, building out your application, and really getting started with building LLM apps.
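For reference, the generated prompt flow code in Python boils down to loading the .prompty file and calling it like a function. This is a hedged sketch, assuming the promptflow package and a local basic.prompty file that declares a question input:

```python
# Hedged sketch: execute a Prompty file from Python with the prompt flow SDK.
# Assumes `pip install promptflow` and a basic.prompty file with a `question` input.
from promptflow.core import Prompty

# Load the prompt asset; model and endpoint details come from the YAML front matter.
prompty = Prompty.load(source="basic.prompty")

# Calling the loaded asset renders the template and invokes the configured LLM.
result = prompty(question="What can you tell me about your tents?")
print(result)
```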
So a tool that I mentioned earlier too is the Azure Developer CLI. And if you're not familiar with that, it's a really amazing tool that really streamlines deploying applications to Azure. So we have new AI Studio support, as I mentioned, but we also have been working on samples. So if you go to this link, you'll find a list of samples that are using Prompty and azd that you can deploy with that single command of azd up.
It has, um, popular AI solutions, so we have things like summarization and RAG, and then we also have the one that we're going to be looking at today, which is the creative agent. So you may have heard a little bit about agentic programming or multi-agent programming. We're going to be looking at a solution that we're creating with Prompty and prompt flow.
So the way that the researcher works is it's using function calling to use the Bing API to go get results from the web. And then it's going to use AI Search and a vector database in order to get the appropriate information. And then it's going to give it to the writer agent. The writer is going to take all that information and create an article that we're going to return to the user so that they can start marketing their products. And then lastly, the editor takes a look and says yes, this is good, or no, it's not, try again.
So this is pretty cool, Cassie, but I notice there's a lot of infrastructure here I need to set up. You know, an AKS cluster, managed identity, front end, back end, AI Search, Azure OpenAI service. This would take me like a week to do if I was clicking around in the Azure portal and wiring everything up for myself.
Right. So we're actually going to spend the rest of this session just creating those resources. So you're gonna go check out the... Just kidding. We're here for a week, everyone. Yeah.
OK, let's see how we can make this better. Let's go back to how we can build this. OK, so as I was talking about, we have the collections, and here you can see this is the one that we're looking at. You can go use these right now. This Assistants API one was in a couple of sessions already. You may have seen the Contoso Chat one, which is our example. We have some examples with LangChain and then also some process automation. And of course we have C# and Python.
So the next thing we want to do is take a look at the templates, so we can do azd template list. And I just want to know what all the templates in the azd gallery are that include AI. So now I can take a look at the different ones that are available, because there's obviously more than just the ones I just showed you. And here's the creative agent one. So that's the one that we want to deploy. I've already downloaded the source here, and I'm just going to go ahead and hit azd up, select my region, and now it's going to go ahead and package and deploy all of those resources for me. But this takes a little while, so...
So I've actually already deployed this application; I ran azd up earlier so I could have my own development environment to work in. Cassie's got her own copy of the development environment. And let's go ahead and dive into the application and see what it looks like.
So one really cool thing about these templates is they also come with a dev container definition, which allows you to run them in GitHub Codespaces. So if you don't have a development environment set up, you can just open it in GitHub Codespaces, all your development dependencies are installed and ready to go, and you can just start running azd commands right away. So I've run azd up here and it's deployed that application infrastructure. And before I take a look at the application, I just want to call out a really cool thing that I love about Codespaces: sometimes it comes up with really cool names. So this code space is my crispy space yodel. And whenever I find a cool code space name, I just sort of hang on to it and keep using that one, because I become very attached to them.
So let's take a look at what got deployed to Azure by running an azd up command. You can see that we've got this resource group with Application Insights, AKS, our Azure OpenAI service, our Azure AI Search. We've got a Bing search resource, and all this stuff is wired up together and it works out of the box. So I can go to this deployed application. We can see this is our creative writing team here, where we can give it some context and some instructions for the article that we want it to write. And when you run the application, you can see how the different agents pass the information between them. We've got the researcher that runs and gets research data that passes to the writer. Then the writer outputs to the editor, and the editor decides whether to accept the article or not. And then finally we can see the output article. So we've got this article here written about camping, because everyone knows that we love camping here at Microsoft in Seattle. And the article overall looks pretty cool. It's selling some of our products a little bit, but one thing we notice is that the article is actually cut off at the end.
Yeah. And this is happening frequently. Yeah, it's been happening. It's constantly just stopping mid-sentence, which is not ideal.
Yeah. So maybe we can take a look at how
do we fix that. And so I'm gonna start to
dive into the application code so you can see how
this application actually works.
Alright, so back to my crispy space yodel.
So I'm gonna run the orchestrator that's behind the web application that you saw. So I'll just call the API agents orchestrator module, and that will run the orchestrator with some sample inputs and print out to the console as it's running. So while it's running, I'll take a look at just some of the code for you. So this is the write article function, and you can see that it calls the different agents here. It calls the research agent, then the product agent, and it sort of gets the intermediate results and passes those between the agents. One cool thing about this is each agent actually uses a Prompty file for its implementation. So we've taken that Prompty code and we've structured our code base around it and built this application up.
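In outline, that orchestration looks something like the sketch below: each step wraps a Prompty-backed agent and passes the intermediate results along. Function names and signatures here are illustrative, not the template's exact code:

```python
# Hedged outline of the write_article orchestration; the real template's
# function names and signatures may differ.
def write_article(request: str, instructions: str):
    # Researcher: uses function calling plus Bing to gather web research.
    research = research_agent(request, instructions)

    # Product lookup: vector search over the Azure AI Search product index.
    products = product_agent(request)

    # Writer: turns the research and product context into a draft article.
    article = writer_agent(request, instructions, research, products)

    # Editor: accepts the article or sends feedback for another pass.
    decision = editor_agent(article)
    return article, decision
```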
And another thing is that, you know, as a developer, I'm used to just printing stuff out to the console and using the console outputs, but it's kind of hard to read. So we actually got prompt flow tracing integrated into this. So how does that work? Well, one thing I'm doing here is I've decorated different methods in my code with this @trace decorator. And so what that does is it'll capture the inputs and outputs of specific functions that I'm calling that I want to see in my tracing tool, which I'll show you in a minute. And how do we turn on that tracing? Well, we simply import the prompt flow tracing SDK, and we can just call start_trace in our local development environment to get a local tracing tool, which I'll show you in a moment. The other really cool thing is that prompt flow tracing is built on top of OpenTelemetry, and so that means that you can import the Azure Monitor trace exporter and send all the tracing information off to Application Insights. And another thing that we've set up in this application is you can configure your sampling rate for how much of the trace data you want to send. Do you want to send 100% of the requests to Application Insights? Or you can lower that down depending on, you know, what makes sense for your production application, how much data you want to store and send off.
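Put together, the setup described here is only a handful of lines. A hedged sketch follows; the package names (promptflow-tracing for the decorator and start_trace, azure-monitor-opentelemetry-exporter for Application Insights) are my reading of the stack, so verify them against the template:

```python
# Hedged sketch of the tracing setup described above.
from promptflow.tracing import start_trace, trace

# Start the local tracing collector; it prints a local URL where you can
# browse every instrumented call, including LLM prompts and responses.
start_trace()

@trace  # record this function's inputs and outputs in the trace view
def get_research(request: str, instructions: str) -> str:
    ...

# Because prompt flow tracing is built on OpenTelemetry, the same spans can
# also be exported to Application Insights, e.g. with AzureMonitorTraceExporter
# from azure.monitor.opentelemetry.exporter, and a sampling ratio can be applied
# so only a fraction of production requests are sent.
```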
So that's really not a lot of code at all to start adding really robust tracing into our solution. Yeah, this is 35 lines, and I even, you know, check some environment variables along the way. So another really cool thing I wanted to mention that's part of prompt flow: we've got this OpenAI injector, so this will trace the OpenAI calls so we can get the prompts and responses.
Awesome. So let's go take a look at what that tracing looks like. It looks like our orchestrator has finished running. And actually, if you look at the very top of the output, it printed a little URL that we can click to view the tracing tool running locally. And interestingly, the VS Code output decided not to let me click that, but that's OK. We'll switch over to the tracing view that we have open here in the browser. So this tracing view captures all that data that's running locally. It makes it really easy to understand what's happening in the application. You can see the different parts of the agents that we've instrumented to see the inputs and the outputs. And so I want to walk through each stage, each step of this application, so you can get a little bit more of an understanding of how this application actually works.
So the first step here is that we're going to get the research. So the research takes those inputs from the user from that web page. In this case, it asks: can you find the latest camping trends and what folks are doing in the winter? And it's got some instructions for, you know, what to look up. And we noticed that it returns some Bing web search results, right? So how do we go from these inputs to Bing web search results? Well, if we go through the Prompty execution down to the actual LLM call that was made, the tracing tool shows us the actual prompt and response that was sent to the LLM.
So we can see here that it says: you're an expert researcher that helps put information together for a writer who's putting together an article. And you can see that, you know, it provides some instructions to the researcher. But then what happens at the end is that the researcher just outputs: hey, call this find information function with this query of good places to visit. So this is actually a tool that we've given the researcher, one of many tools that we've given the researcher that it can choose to execute, and the LLM is figuring out which tool to call and what arguments to pass it. A really cool thing here is that the prompt flow tracing shows us all the tools that were given to the LLM call. So that find information call is the one that it chose to use, and you can see here we give it a description of what the function is and what the different parameters are that it can call. We can see that it could have also chosen to use the find entities call, which uses the Bing Entities API to look up information about different entities. And it could also find news, so it could look up news about a particular topic. So this is all in those instructions; it depends on what the user asks for. If I said, could you show me the latest news about camping, it would have decided to call this find news function instead.
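For context, each of those tools is described to the model as a function schema. A hedged sketch in the standard OpenAI function-calling format, with descriptions paraphrased and parameter names assumed:

```python
# Hedged sketch of the researcher's tool definitions in the OpenAI
# function-calling format; descriptions and parameter names are illustrative.
tools = [
    {
        "type": "function",
        "function": {
            "name": "find_information",
            "description": "Search the web with the Bing Search API for information on a topic.",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string", "description": "The search query."}},
                "required": ["query"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "find_entities",
            "description": "Look up people, places, or things with the Bing Entities API.",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string", "description": "The entity to look up."}},
                "required": ["query"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "find_news",
            "description": "Find recent news articles about a topic with the Bing News API.",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string", "description": "The news topic."}},
                "required": ["query"],
            },
        },
    },
]
# These are passed as the `tools` argument on the chat completion call, and the
# model responds with the name of the function it wants invoked plus its arguments.
```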
So that's using tools to retrieve information by calling functions in our code. The next thing that we do is we actually do a vector search over an Azure AI Search index. So we can see this is the get products function that's called in our code after the get research function is called. And we can see here that it takes the user's input and uses an Ada-002 embedding model to generate a vector representation of that user's question. So here's that vector representation down here. And then we can see we run the search against an AI Search index and we get these articles back, which are all product information from our product database that's relevant to camping.
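The get products step described here boils down to two calls: embed the user's question, then run a vector query against the index. A rough sketch under those assumptions; the index name and vector field name are purely illustrative, and the SDK calls reflect current openai and azure-search-documents packages rather than the template's exact code:

```python
# Hedged sketch of the embed-then-vector-search step; index and field names are illustrative.
from openai import AzureOpenAI
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery

aoai = AzureOpenAI(azure_endpoint="https://<your-endpoint>.openai.azure.com",
                   api_key="<key>", api_version="2024-02-01")

# Embed the user's question with the Ada-002 embedding model.
embedding = aoai.embeddings.create(model="text-embedding-ada-002",
                                   input="camping trends for winter").data[0].embedding

# Run a pure vector query against the product index.
search = SearchClient(endpoint="https://<your-search>.search.windows.net",
                      index_name="products", credential=AzureKeyCredential("<key>"))
results = search.search(search_text=None,
                        vector_queries=[VectorizedQuery(vector=embedding,
                                                        k_nearest_neighbors=3,
                                                        fields="contentVector")])
for doc in results:
    print(doc["title"])  # "title" is a hypothetical field name
```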
So it's important to point out that we are using Bing and we're using AI Search as our information retrieval, but you can do information retrieval for any type of data source. It doesn't have to be a vector database. Maybe you're just going to pull information from SQL, or maybe you have a different endpoint that you want to pull information from. So ours is specific to our use case, and you can use this style of application in a lot of different ways, but this is not the only way you can get information into your LLM.
Exactly. And so you may need to just call an API in your application to get information about the customer, or all sorts of different things. There's lots of different information that you can feed to these LLMs in order to generate a prompt. So that brings us to the writer. If we go to the writer, let's take a look at the prompt that we actually pass to it. And we can see the way it works is that we tell it: you're an expert copywriter whose job it is to take research and generate an article. You can see all the product information that we retrieved is included here in the prompt that we send to the writer. You can see that the web search results are also included in the information that we send to the writer, as well as a few examples that show the writer: here's what you do with, you know, sample product information, here's an example that you might get, and here's how you might translate that into the article, and so on. And then finally it says: hey, write a fun and engaging article between 800 and 1000 words about the topic. And that's the initial request that the user passed in. And so then we can finally see: here's the actual article that was generated as a result. And yeah, look, it's cut off again. You know, it keeps doing that.
So what are the LLM parameters that we're sending in? You know, that's a good question. So we can see not only the prompts, but what the parameters were. And I'm looking here and it says that the max tokens are 512 tokens, but we asked it for 800 words. That's less. That seems a little bit less. That's probably... that could be a problem.
Yeah. So that's something that maybe we'll try and tweak later and fix. I'll get to that in a moment, but let's just kind of finish going through the trace here.
Finally we see that the editor gets to decide, you know, what to do, and it decides not to give any feedback; it decides the article is fine and we'll let that go back to the user.
The other thing that's really cool is, because we're using tracing, we were able to really easily see what we're sending in our prompt. So we're actually able to see the prompt and see what the words are, but we're also able to see the parameters, and it's easy to really dig in and figure out what's wrong. So we have this amazing observability, and before this it was really hard to debug LLMs. So this is a great way to really start understanding what you're sending, how it's generating things, and to figure out what's happening when you're having issues.
Yeah, and it really speeds up that iteration loop as you're experimenting with different things, because you can figure out what exactly is happening and what you might want to try next. So another thing that we have integrated into this application is that we're actually running prompt flow evaluators inline in the application on the resulting output article. And you can see that the way this works at a high level is that we collect all that information that was generated as the orchestrator was running, and we put that into a set of fields: query, context, and response, where query is the request from the user, the context is information that was generated by the system that we're going to use to ground the LLM's response, and finally, the response is the output from the LLM. So we pass these through our Prompt Flow Evaluators SDK and that gives us some evaluation scores.
Here you can see we're not doing super well on some of these scores. Yeah, and a cut-off article might be why. So if you're not familiar with what the evaluation metrics are: relevance is, is the answer relevant to the question that the user asked. Fluency is, does it flow, does it read well. Coherence is, does it hang together and make sense in answering the question. And then groundedness is, is it truthful? Is it actually giving information based on the products that are coming in, grounded in that truth, and not hallucinating or making up information that is not true?
Yeah. So you can see, you know, we're definitely using the information about products in the article, so it seems pretty grounded, but the article is cut off, and it's a little bit salesy. So I don't know how relevant it is to what I was actually asking for. But these metrics can help you get a sense of the quality of your application without having to manually go in and inspect every single input and response, and then help you figure out where you might want to focus your debugging efforts. So I'd love to actually take a moment to show you how you can integrate the Prompt Flow Evaluators SDK into your application to generate metrics like these.
So if we continue on through the end of the orchestrator, you can see there's an if statement here that says: if evaluate, go ahead and evaluate the article in the background. And we pass all the context that was generated as part of running the application to this evaluate article in the background function that I have defined here. And we can just go to definition in code and see what that looks like.
So the cool thing about the evaluation SDK too is we have all these built-in evaluators where you can just grab them and use them. But sometimes you need to create custom ones as well, which is also supported: you can create a Prompty-based custom evaluator and just import that. So maybe you want to edit it and make it more specific to your use case rather than using the out-of-the-box ones, or you want to add additional types of evaluators. You have the flexibility to use the built-in ones or create whichever ones you might need.
Yeah. So it seems like VS Code has decided not to make go to definition work at this particular moment, but that's fine. So the evaluate article in background function here just collects the data and organizes it into those three buckets I talked about earlier. And then ultimately here, what we do is we import the Prompt Flow Evals SDK and we import those built-in evaluators, so we can instantiate those in code using our model configuration, which is just a pointer to the Azure OpenAI endpoint and model we want to use. And then we just call each evaluator with the data that we want to evaluate, and it returns back a score from 1 to 5. So this is really easy for you to integrate into your application. You can use evaluators in many different ways, and it really depends on how you want to use that data to iterate and improve on your application.
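As a hedged sketch of that pattern, using the promptflow-evals preview package as I understand it (import paths, class names, and each evaluator's expected arguments may differ slightly from the template, so verify against its requirements):

```python
# Hedged sketch: scoring one query/context/response triple with built-in
# evaluators from the prompt flow evals SDK (pip install promptflow-evals).
from promptflow.core import AzureOpenAIModelConfiguration
from promptflow.evals.evaluators import (
    CoherenceEvaluator, FluencyEvaluator, GroundednessEvaluator, RelevanceEvaluator,
)

# Pointer to the Azure OpenAI deployment used as the judge model.
model_config = AzureOpenAIModelConfiguration(
    azure_endpoint="https://<your-endpoint>.openai.azure.com",
    azure_deployment="gpt-4",
)

query = "Can you find the latest camping trends?"
context = "<web research and product information gathered by the orchestrator>"
response = "<the generated article>"

# Each evaluator is a callable that returns a score on a 1-to-5 scale.
scores = {
    "relevance": RelevanceEvaluator(model_config)(question=query, answer=response, context=context),
    "fluency": FluencyEvaluator(model_config)(question=query, answer=response),
    "coherence": CoherenceEvaluator(model_config)(question=query, answer=response),
    "groundedness": GroundednessEvaluator(model_config)(answer=response, context=context),
}
print(scores)
```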
Let's see if go to definition works. It'll take me to the... OK. So what's really cool about Python is that this is all just Python scripts that are running, and you can actually go to definition on these evaluators and see what prompts are used. These are implemented using Prompty files like the ones we were showing earlier. And so you can actually take those Prompty files, copy them into your code, and make your own custom evaluator starting from the ones that we've defined at Microsoft. And actually, the tracing view will show us one of those prompts. So here, for example, this is the prompt that we use for relevance. And you can see that it gives the definition of the metric, but you can see some of the examples that it uses are kind of generic. So one of the things I might want to do, if I'm trying to tailor evaluation to my company, is to give it some examples that are more relevant to the type of information that my company might be evaluating. And I can make my own customized scores that I can use to help me gauge quality.
Which is a really hard problem to solve and this
is making it so much easier.
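One way to act on that, sketched below with assumed names and paths: since an evaluator is just a callable, you can wrap your own copy of the relevance prompt (edited with company-specific examples) in a small class and use it alongside the built-in ones. This is an illustration, not the template's code:

```python
# Hedged sketch of a custom, Prompty-backed evaluator; the file name, inputs,
# and score parsing are illustrative.
from promptflow.core import Prompty

class CustomRelevanceEvaluator:
    def __init__(self, prompty_path: str = "my_relevance.prompty"):
        # my_relevance.prompty is your edited copy of the built-in relevance
        # prompt, with the examples swapped for ones from your own domain.
        self._flow = Prompty.load(source=prompty_path)

    def __call__(self, *, question: str, answer: str, context: str) -> dict:
        raw = self._flow(question=question, answer=answer, context=context)
        return {"custom_relevance": float(raw)}

# Usage:
# scorer = CustomRelevanceEvaluator()
# scorer(question=query, answer=response, context=context)
```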
Yeah, totally. So let's go ahead and look at maybe fixing that bug that we saw earlier.
OK, so what I did here was I just made a couple of changes to my code. The first thing I did was tell it to make the article between 300 and 500 words, and then I also increased the max tokens setting to 1200. So these are parameters that we can experiment with, right? We can run a lot of evaluation to see which parameters really dial these in, and that's where that kind of experimentation comes in. You have to play with the parameters you're passing in and see how the application performs. So it's a little bit of a different style of development when you're building these generative AI applications.
So I'm gonna go ahead and check in these changes and push them off to my CI/CD environment. The template has GitHub Actions that you can configure to run on your Git repo, so when you push changes it will go ahead and run some actions. And with the GitHub Actions extension in VS Code, I can go ahead and see what actions are running and look at my history of actions. So there are two actions that are kicked off. One actually runs the azd provision and deploy commands that both Cassie and I ran. So now we've got Cassie's environment, we've got my environment, and we've got an integration environment that's all the same setup, and so we can eliminate sort of all those environmental differences when we're trying to troubleshoot and debug things, when we're chasing down bugs. So I can go ahead and look at what's happening right now when I'm running this GitHub action. And you can see here, it's just pulling down the azd CLI and it's going to provision infrastructure, and this is using Terraform, by the way. It's really advanced stuff, I love it. And then it's going to go ahead and deploy the application, so that if we make any changes to our infrastructure or any changes to our application code, the GitHub action will make sure we're testing against the latest version of everything.
Umm, and then the other action that we have is our evaluate action. So instead of evaluating inline in the application, the evaluate action will run a set of test articles through our code and give us the evaluation scores that we can use to check and see, you know, how well this code is doing. This is going to take just another minute to run here.
This is cool because now we're able to batch information in and batch tests. So we can have as many different tests or items as we want as we start batching and evaluating, and then we'll get those averages to understand how the model is performing. Yeah, and the cool thing is, because we're using an LLM to evaluate, we don't have to add ground truth. We just need the inputs, and then this will figure out the outputs, and then we use the LLM to figure out whether it's performing well or not.
Yeah. And so in another minute, when that completes, you can see it here. It just shows the different articles that we ran through our CI/CD system. So I've got a few different article requests that I'm using for testing, and the updated scores that I get after I make that change. It looks like we're getting closer to all fives across the board. Awesome. Yeah, it looks like that helped.
Um, I think that's all I wanted to show, but maybe I'll try one thing and just reload my window here in VS Code to see if I can show you the Prompty for the evaluator. Now that I'm done with my demo, I'm willing to try this out. And this is a pro tip if you're a VS Code developer. Oh, and one more fun thing: my CI/CD environment is my super duper potato. So I think that's pretty chill.
Alright, alright, so go to definition works. That's really awesome. So you can see it uses the relevance Prompty file. So this just uses prompt flow code just like we showed you. This is really easy for you to copy and implement yourself. And if we dig into the site packages, we can actually see the Prompty file that's used. And we can just copy and paste this, create our own Prompty file, create our own evaluator. It's really easy to build off of and customize.
And that's a good way to do it: grab the ones that are built in and then start iterating on those to customize them to your solution. Yeah. Alright. So we've checked this change into our CI/CD environment and I think we're ready to ship it off to production. But Cassie, why don't you tell us what's going on in production?
Cool.
So here, just like we were using App Insights locally, we're now pushing all that information, everything you saw locally, up to our cloud, which is running on our production endpoint. And we've set up a dashboard to give us a nice view of what's happening with our LLM, to allow us to monitor the production deployment. So if you take a look first, we have the average evaluation scores over time, and you can see that it started performing lower. So this is the one without the fix yet.
By the way, this is still deploying, so this is information on what was already out there, and it could be an indicator that maybe there's an issue and we need to go take a look. Another metric we have here is the average time per LLM call. I think this one is super interesting, because this GPT-4 model is our writer and this is our evaluator. And if you think about the way that these work, the GPT-4 model has a ton of tokens coming out, so it's taking longer, because every token it creates is actually an inference with the model.
That's writing the article. It's writing the article, right, whereas with the evaluators, it's only ever giving one token: a number between 1 and 5. And so that's another interesting way to start seeing where you're spending time and how much time things are taking. We also have tokens used over time, so we can see how that's trending and make decisions based on that. And then we have tokens used by model. This one's really useful for cost, right? If I have a ton of tokens being used, maybe I need to rethink what I'm doing to mitigate cost. And if you look at this one, the evaluator is using a ton of tokens. And if you remember, we're evaluating every single call that comes in, and that's four LLM calls, one LLM call for each of the evaluators. And the prompts for those are actually pretty sizable. That too, right? So it's a lot of tokens being used for evaluation. So maybe we want to think about doing a sampling of the inputs coming in for the evaluators and reduce cost there?
Yeah. So I just have that sampling parameter that I can set in my code; I can dial that down and really decrease the cost used for evaluation, but still get the same information, still understand how my evaluation is doing over time. And this visibility really helps us make some of those small tweaks to really bring down the cost of our application.
Exactly. So another cool thing with this is we can see each individual call in the operations view here from the query. So we can see all these different parameters and the different scores. And then we can grab this operation ID and do a transaction search, and now we have the exact same tracing that we had locally, on our production endpoint. So if we see an issue, we can go in and actually dig into exactly what was happening in production and get detailed information at every step. And you can see the token information in here as well. So we can see this was before we made that update.
Yeah, so what's really cool about this, by the way, is I know exactly what's happening. I just walked you through that tracing view locally; I get that same view here. So, you know, if you have something with a low evaluation score in production, you can query against that in App Insights, and then you can drill into exactly what's happening. And in each of these things, we're capturing the inputs and the outputs, so I can figure out which component of my application I want to focus on improving. And then I can take the data from Application Insights, copy that into my development environment, maybe write a little unit test, and start iterating until I get a better result, and then check that change in and push it back to production again.
Also, you may remember that I kicked off an azd up when we started this talk. If you take a look, it took about 14 minutes to deploy this entire application, and now I can go browse and I have a whole new instance set up. Wow, that's a lot of AKS clusters being told to get to work. And we can see that our agent is now deployed and working. Alright. This could have been a four-week talk.
Alright, so just a little bit of a recap, because we went over a lot of really cool new things. We have the new template scenarios. The one that you saw today is available at that link, and you can go use it right now, as well as other scenarios. They are all Prompty-enabled and have really cool cutting-edge things that you can start using, so definitely go check that out. We also saw how we can start using GitHub Actions to do batch evaluations on every check-in, so we can see how our performance is changing as we're iterating on our prompt. And since we have a Prompty, we also have that in our source, so we're actually able to look at what happened with the change in our prompt and then see that information in the evaluations.
Then we saw the amazing tracing and debugging, which is just so useful when you're trying to figure out what's happening. What are you sending to your LLM, what's coming back, why are things performing a particular way? The tracing and debugging makes a huge difference in starting to make these applications real, both when you're creating them and also in production for monitoring.
So you saw the dashboard where we were able to see the monitoring in production. And then we have these new announcements, which we've talked about: we have Prompty, we have the tracing and debugging, we have the monitoring for GenAI, and then we have the azd integration for AI Studio. Another thing is, we showed you monitoring and evaluation in App Insights using AKS. You can also do all of this in AI Studio, and it has evaluations and tracing built in there. So if you're deploying with AI Studio, you have all the same tooling there, so you can choose the way that you want to deploy for the type of solution that you're creating.
And you can actually log all of the evaluation results and tracing results to AI Studio, so you can have it there, and it's got some really good tools that you can use to drill in and compare different runs. So I think this tooling is all really game changing; it's changing how I'm approaching my development of generative AI applications, and I'm really excited for all of you to try it as well. Yeah. So thanks everyone, and happy LLMOps-ing.
Be sure to check out the Prompty session tomorrow. Yes, yeah, we're going to go. That session is going to be awesome. Yeah, so we just showed you a little bit of Prompty. There's more Prompty.
And for more information and more practical tips for building copilots, be sure to check out Jeff and Sandra's session tomorrow morning on building copilots: key lessons and best practices. And there's more about advanced RAG with Azure AI Search as well. So enjoy the rest of the conference, everyone. All types of good AI.