The New Stack and Ops for AI

OpenAI
13 Nov 2023, 34:09

Summary

TLDR This talk covers how to take AI-powered applications from the prototype stage into production. The speakers, Sherwin and Shyamal, from OpenAI's engineering and applied teams respectively, share strategies for building user-friendly experiences, handling model inconsistency, iterating on applications with evaluations, and managing scale. Concepts include controlling for uncertainty, building trustworthy user experiences, grounding the model with knowledge stores and tools, implementing evaluations, and managing cost and latency through orchestration. The talk stresses the importance of LLM Ops, the new discipline of building and operating large language model applications.

Takeaways

  • 😀 Taking a large language model (LLM) application from prototype to production calls for a guiding framework that helps developers and enterprises build and maintain model-based products.
  • 🚀 Since its launch in November 2022, ChatGPT has gone from a toy shared on social media to a capability that enterprises and developers are trying to integrate into their own products.
  • 🔒 Building a prototype is relatively simple, but taking an application from prototype into production means dealing with the challenges posed by the nondeterministic nature of these models.
  • 🛠️ A framework in the form of a stack diagram is provided, covering building the user experience, handling model inconsistency, iterating on the application, and managing scale.
  • 👥 Build a human-centric user experience by controlling for uncertainty and adding guardrails for steerability and safety to improve the quality of user interactions.
  • 🔄 Address model inconsistency with model-level features such as JSON mode and reproducible outputs, and with knowledge stores and tools.
  • 📝 Use evaluations to test and monitor model performance and make sure the application does not regress as it is deployed.
  • 💡 Manage scale, and reduce latency and cost, with strategies such as semantic caching and routing traffic to cheaper models.
  • 🌐 LLM Ops (Large Language Model Operations) is introduced as a new discipline for the unique challenges of building applications with LLMs.
  • 🛑 Building LLM-powered products calls for long-term platforms and expertise rather than one-off tools.
  • 🌟 Developers and enterprises are encouraged to build the next generation of assistants and ecosystems together and to explore new possibilities.

Q & A

  • What is Stack and Ops for AI?

    -Stack and Ops for AI is a discussion of how to take AI applications from the prototype stage into production, presented by Sherwin and Shyamal, who are from the engineering team and the applied team of the OpenAI Developer Platform respectively.

  • Why does the transition from prototype to production matter?

    -The transition matters because a prototype can demonstrate an idea quickly, but a production environment has to account for factors such as model consistency, user experience, cost, and scalability.

  • When was ChatGPT launched?

    -ChatGPT was launched in late November 2022, less than a full year before this talk.

  • When was GPT-4 released and what is it?

    -GPT-4 was released in March 2023. It is the flagship model, it had been out for less than eight months at the time of the talk, and it represents the latest advances in AI models.

  • Why does scaling nondeterministic models feel difficult?

    -Scaling nondeterministic models feels difficult because their outputs are random and hard to predict, which makes maintaining consistency and reliability in production a challenge.

  • What are a knowledge store and tools (Knowledge Store and Tools)?

    -A knowledge store and tools are a way to reduce model inconsistency by supplying the model with additional factual information, giving it a more reliable source to base its answers on.

  • What model-level features are available?

    -Model-level features include JSON mode and reproducible outputs, which help developers better control model behavior and improve output consistency.

  • What is semantic caching?

    -Semantic caching is a technique that stores responses to previous queries to reduce the number of API calls, lowering both latency and cost.

  • How do you evaluate and test the performance of an AI application?

    -You can build evaluation suites (evals) tailored to your specific use cases, combining manual and automated methods, including using AI models themselves as evaluators.

  • What is LLM Ops?

    -LLM Ops, or Large Language Model Operations, is an emerging practice covering the tooling and infrastructure needed to manage the operations of large language models end to end.

  • How can GPT-4 be used to create a training dataset for fine-tuning 3.5 Turbo?

    -You can use GPT-4 to generate outputs for a set of prompts and use those outputs as a training dataset for fine-tuning 3.5 Turbo, so that it approaches GPT-4's performance in a specific domain.

  • Why do AI applications need to build user trust?

    -Building user trust is essential because it helps users understand the AI's capabilities and limitations and ensures they have a safe, reliable experience.

  • How can design control for uncertainty in an AI application?

    -Design can control for uncertainty by keeping a human in the loop, providing feedback mechanisms, transparently communicating the system's capabilities and limitations, and building a user interface that guides the interaction.

  • Why are guardrails important for AI applications?

    -Guardrails act as constraints or preventative controls between the user experience and the model. They aim to keep harmful and unwanted content from reaching the application while adding steerability to the model in production.

  • How can a retrieval system (RAG or a vector database) enhance an AI application?

    -With a retrieval system, when a user query comes in you first send a request to the retrieval service, which returns relevant snippets; those snippets are then passed to the API together with the original query to generate a more accurate response.

  • What are the advantages of automated evals?

    -Automated evals reduce human involvement and make it possible to monitor progress and test for regressions quickly, letting developers focus on the complex edge cases needed to refine the evaluation methods.

  • Why is combining model-level features with a knowledge store important for consistency?

    -Model-level features help constrain model behavior, while a knowledge store supplies additional factual information. Together they significantly reduce the model's randomness and uncertainty and improve the application's consistency in production.

  • How can fine-tuning reduce cost and improve performance?

    -Fine-tuning GPT-3.5 Turbo produces a version of the model optimized for a specific use case, which not only lowers cost but also improves performance on that task.

  • What are observability and tracing?

    -Observability and tracing are key parts of LLM Ops. They help identify and debug failures in prompt chains and assistants, speed up handling of issues in production, and make collaboration between teams easier.

  • Why is LLM Ops described as an emerging field for the challenges of building LLM applications?

    -LLM Ops is an emerging discipline that provides the practices, tooling, and infrastructure for the unique challenges of building applications with large language models, and it is becoming a core component of many enterprise architectures and stacks.

Outlines

00:00

🚀 Taking AI applications from prototype to production

This section introduces the transition from prototype to production. Sherwin and Shyamal introduce their teams and note that the OpenAI Developer Platform's APIs are used by 2 million developers to build products. They discuss the challenges of taking AI applications from prototype into production, especially the challenge of nondeterministic models, and present a framework to guide the process. They stress the importance of user experience, model consistency, iterating with confidence, and managing scale, and outline the strategies to consider when building products on top of large language models (LLMs).

05:03

🛠️ Building a user-centric experience

This section discusses the importance of building a user-centric experience, including controlling for uncertainty and adding guardrails for steerability and safety. Two strategies are proposed: keeping a human in the loop so users can iterate and improve, and providing feedback controls that offer affordances for fixing mistakes and help build a data flywheel. It also emphasizes transparent UX: communicating the AI's capabilities and limitations to users and designing the interface to guide the interaction toward the most helpful and safest responses.

10:03

🔗 Model consistency and grounding with a knowledge store

This section addresses the problem of model consistency and how knowledge stores and tools help solve it. Two strategies are proposed: constraining model behavior at the model level, and grounding the model with a knowledge store or your own tools. It introduces two new model-level features from OpenAI: JSON mode and reproducible outputs. JSON mode constrains the model's output to valid JSON, while reproducible outputs improve consistency through a new seed parameter. These features help reduce system exceptions and improve the user experience.

15:03

🔍 Using retrieval services and microservices to improve model accuracy

This section goes further into using retrieval services and microservices to improve the model's accuracy and reduce its uncertainty. A retrieval service is used to find snippets relevant to the user's query, and those snippets are passed to the API along with the original query to generate a more accurate response. It also covers using OpenAI's function calling API to access custom microservices for up-to-date data, such as current mortgage rates. These methods help improve the model's performance for specific use cases.

20:04

📊 Improving model performance through evaluations

This section stresses the importance of evaluating model performance to keep the user experience consistent and prevent regressions. It covers the strategy of building evaluation suites for specific use cases and using evals as quantifiable experiments on model performance. It discusses how to construct eval suites with human annotation, automated evals, and model-graded evals, how to use GPT-4 as an evaluator, and the strategy of fine-tuning a 3.5 Turbo model for a specific use case.

25:04

🌐 Optimizing cost and latency with semantic caching and model routing

This section discusses managing scale, in particular cost and latency, through semantic caching and model routing. Semantic caching adds a logic layer between the application and the API to reduce the number of API calls. It also covers routing traffic to cheaper models such as 3.5 Turbo, and fine-tuning 3.5 Turbo to lower cost while maintaining quality of service.

30:08

🛤️ The rise of LLM Ops (Large Language Model Operations)

This section introduces LLM Ops, an emerging field similar to DevOps in its early days, aimed at the unique challenges of building applications with LLMs. LLM Ops spans monitoring, performance optimization, security and compliance, and managing data and embeddings. The section discusses how LLM Ops helps organizations scale to thousands of applications and millions of users and emphasizes building long-term platforms and expertise.

Keywords

💡Artificial intelligence

Artificial intelligence refers to the intelligence exhibited by machines or software systems built by humans, capable of performing tasks that normally require human intelligence, such as language understanding, learning, and reasoning. In the video, AI is at the core of the applications and user experiences being built, especially in the discussion of turning prototypes into production applications.

💡Prototype

A prototype is an early-stage model of a product used to demonstrate a concept or basic functionality. In the video, the prototype is the starting point for taking an application from concept toward production, but getting from prototype to production is the hard part.

💡Production

Production is the environment in which software or a system actually runs and serves users. The video notes that turning a prototype into a production application is a challenge that requires dealing with model uncertainty and scale.

💡Nondeterministic model

A nondeterministic model is one whose output can differ even when given the same input. The video highlights the challenges such models pose when scaling into production, especially without a guiding framework.

💡Knowledge store

A knowledge store is a system for storing, managing, and retrieving structured or semi-structured data. In the video, grounding the model with a knowledge store reduces inconsistency in its outputs and improves the accuracy of its answers.

💡API

An API (application programming interface) is a set of rules and definitions that lets different software applications interact. The video mentions that over 2 million developers use OpenAI's APIs to build products on top of its models.

💡User experience

User experience refers to how users feel and respond when interacting with a product, including usability and satisfaction. The video emphasizes building a user-centric experience and discusses optimizing it by controlling for uncertainty and adding guardrails.

💡Model consistency

Model consistency is the model's ability to produce the same or similar outputs across different times or inputs. The video discusses improving consistency with model-level features and knowledge stores.

💡Evaluation

Evaluation is the process of measuring the performance of a product or system to ensure it meets established standards and user expectations. The video describes using evals to monitor model performance and prevent regressions as you scale.

💡Semantic caching

Semantic caching is a storage technique that saves the results of previous queries to avoid repeated requests to external resources. In the video, semantic caching is one strategy for reducing API calls, latency, and cost.

💡Fine-tuning

Fine-tuning is the process of adapting a pre-trained model to a specific task or dataset to improve its performance on a particular application. The video discusses using fine-tuning to create a model version optimized for a specific use case while lowering cost.

💡LLM Ops

LLM Ops (Large Language Model Operations) is the practice, tooling, and infrastructure for managing and operating large language models. The video presents LLM Ops as a new discipline for the challenges of building LLM applications and a core component of many enterprise architectures.

Highlights

Introduction to the new Stack and Ops for AI, a discussion of going from prototype to production.

Sherwin leads the engineering team for the OpenAI Developer Platform, which builds and maintains APIs used by 2 million developers.

Shyamal shares his experience working with startups and enterprises to help them build great products and experiences on the platform.

Discussion of the importance and challenges of taking applications from the prototype stage into production.

ChatGPT has had a huge impact on the world since its launch in November 2022, yet less than a year has passed.

GPT-4 launched in March 2023, so people have had less than eight months with the flagship model.

GPT has gone from a toy on social media to a tool used in everyday work and enterprise products.

The prototype stage is simple and easy, but production brings the challenges of model nondeterminism.

A framework is introduced to help guide applications from prototype to production.

Strategies for building delightful user experiences on top of large language models.

Addressing model inconsistency with knowledge stores and tools.

Using evaluations to iterate on applications with confidence.

Managing application scale through orchestration, with cost and latency in mind.

New model-level features such as JSON mode and reproducible outputs to constrain model behavior.

Strategies for grounding the model with a knowledge store or custom tools.

How to use evaluation suites to test and track model performance.

How to manage cost and latency with semantic caching and routing to cheaper models.

Introduction of LLM Ops (Large Language Model Operations) as a new discipline for the challenges of building LLM applications.

LLM Ops as a core component of enterprise architectures and stacks, important for monitoring, performance optimization, and security compliance.

How to use GPT-4 to create training datasets for fine-tuning 3.5 Turbo, lowering cost while improving performance.

LLM Ops as a long-term investment in platforms and expertise that helps accelerate building the next generation of assistants and ecosystems.

Transcripts

play00:00

[music]

play00:13

-Hi, everyone.

play00:14

Welcome to the new Stack and Ops for AI,

play00:16

going from prototype to production.

play00:18

My name is Sherwin, and I lead the Engineering team

play00:21

for the OpenAI Developer Platform,

play00:23

the team that builds and maintains the APIs that over 2 million developers,

play00:27

including hopefully many of you, have used to build products on top of our models.

play00:30

-I'm Shyamal,

play00:31

I'm part of the Applied team where I've worked

play00:33

with hundreds of startups and enterprises

play00:35

to help them build great products and experiences on our platform.

play00:38

-Today, we're really excited to talk

play00:41

to you all about the process of taking your applications

play00:43

and bringing them from the prototype stage into production.

play00:47

First, I wanted to put things into perspective for a little bit.

play00:50

While it might seem

play00:51

like it's been a very long time since ChatGPT

play00:53

has entered our lives and transformed the world,

play00:55

it actually hasn't even been a full calendar year since it was launched.

play00:59

ChatGPT was actually launched in late November 2022,

play01:02

and it hasn't even been a full 12 months yet.

play01:04

Similarly, GPT-4 was only launched in March 2023,

play01:08

and it hasn't even been eight months

play01:10

since people have experienced our flagship model

play01:13

and tried to use it into their products.

play01:15

In this time, GPT has gone from being a toy for us

play01:18

to play around with and share on social media

play01:20

into a tool for us to use in our day-to-day lives

play01:23

and our workplaces into now a capability that enterprises,

play01:26

startups, and developers everywhere are trying

play01:28

to bake into their own products.

play01:31

Oftentimes, the first step is to build a prototype.

play01:33

As many of you probably know,

play01:35

it's quite simple and easy

play01:37

to set up a really cool prototype using one of our models.

play01:39

It's really cool to come up with a demo and show it to all of our friends.

play01:42

However, oftentimes there's a really big gap in going from there into production,

play01:47

and oftentimes it's hard to get things into production.

play01:51

A large part of this is due to the nondeterministic nature of these models.

play01:54

Scaling non-deterministic apps from prototype

play01:57

into production can oftentimes feel quite difficult

play01:59

without a guiding framework.

play02:01

Oftentimes, you might feel something like this

play02:03

where you have a lot of tools out there for you to use.

play02:06

The field is moving very quickly.

play02:07

There's a lot of different possibilities,

play02:09

but you don't really know where to go and what to start with.

play02:11

For this talk, we wanted to give you all a framework to use

play02:15

to help guide you moving your app from prototype into production.

play02:18

This framework we wanted to provide to you

play02:20

is in the form of a stack diagram that is influenced

play02:22

by a lot of the challenges that our customers have brought to us

play02:25

in scaling their apps.

play02:26

We'll be talking

play02:27

about how to build a delightful user experience on top of these LLMs.

play02:31

We'll be talking about handling model inconsistency

play02:34

via grounding the model with Knowledge Store and Tools.

play02:36

We'll be talking about how to iterate on your applications

play02:39

in confidence using Evaluations.

play02:42

Finally, we'll be talking about how to manage scale

play02:44

for your applications and thinking about cost

play02:46

and latency using orchestration.

play02:48

For each one of these,

play02:49

we'll be talking about a couple of strategies

play02:51

that hopefully you all can bring back

play02:52

and use in your own different products.

play02:56

Oftentimes first, we just have a simple prototype.

play02:58

At this point, there isn't a whole stack like what I just showed.

play03:01

There's usually just a very simple setup here

play03:03

where you have your application

play03:04

and it's talking directly with their API.

play03:06

While this works great initially,

play03:07

very quickly you'll realize that it's not enough.

play03:10

Shyamal: Let's talk about the first layer of this framework.

play03:15

Technology is as useful as the user experience surrounding it.

play03:19

While the goal is to build a trustworthy, defensive,

play03:23

and delightful user experience,

play03:25

AI-assisted copilots and assistants present a different set

play03:29

of human-computer interaction and UX challenges.

play03:33

The unique considerations of scaling applications built

play03:36

with our models makes it even more important

play03:39

to drive better and safe outcomes for users.

play03:42

We're going to talk about two strategies here

play03:44

to navigate some of the challenges that come

play03:46

with building apps on top of our models,

play03:48

which are inherently probabilistic in nature.

play03:51

Controlling for uncertainty and building guardrails

play03:54

for steerability and safety.

play03:55

Controlling for uncertainty refers to proactively

play03:59

optimizing the user experience

play04:01

by managing how the model interacts and responds to the users.

play04:06

Until now, a lot of products have been deterministic

play04:09

where interactions can happen in repeatable and precise ways.

play04:13

This has been challenging with the shift

play04:15

towards building language user interfaces.

play04:18

It has become important to design for human centricity

play04:22

by having the AI enhance and augment human capabilities

play04:26

rather than replacing human judgment.

play04:29

When designing ChatGPT,

play04:31

for example, we baked in a few UX elements

play04:34

to help guide the users and control for this inherent uncertainty

play04:38

that comes with building apps powered by models.

play04:41

The first one, depending on the use case, the first strategy here

play04:44

is to keep human in the loop and understand

play04:46

that the first artifact created with generative AI

play04:49

might not be the final artifact that the user wants.

play04:52

Giving the users an opportunity to iterate

play04:55

and improve the quality over time is important

play04:59

for navigating uncertainty and building a robust UX.

play05:02

The feedback controls, on the other hand,

play05:05

also provide affordances for fixing mistakes

play05:08

and are useful signals to build a solid data flywheel.

play05:13

Another important aspect of building transparent UX

play05:17

is to communicate the system's capabilities and limitations to the users.

play05:21

The user can understand what the AI can or cannot do.

play05:26

You can take this further by explaining

play05:28

to the user how the AI can make mistakes.

play05:30

In ChatGPT's case, this takes the form of an AI notice at the bottom.

play05:35

This sets the right expectations with the user.

play05:38

Finally, a well-designed user interface can guide user interaction

play05:43

with AI to get the most helpful and safer responses

play05:46

and the best out of the interaction.

play05:47

This can take the form of suggestive prompts in ChatGPT,

play05:52

which not only help onboard the users to this experience,

play05:56

but also provide the user an opportunity to ask better questions,

play06:00

suggest alternative ways of solving a problem, inspire,

play06:04

and probe deeper.

play06:05

All three of these strategies

play06:07

really put the users in the center and at the control

play06:10

of the experience by designing a UX

play06:13

that brings the best out of working with AI products

play06:16

and creating a collaborative and human-centric experience.

play06:21

To establish a foundation of trust and for you

play06:25

to build more confidence in deploying your GPT-powered applications,

play06:29

it's not only important to build a human-centric UX

play06:32

but also to build guardrails for both steerability and safety.

play06:38

You can think of guardrails as essentially constraints

play06:41

or preventative controls that sit between the user experience and the model.

play06:46

They aim to prevent harmful and unwanted content getting to your applications,

play06:51

to your users, and also adding steerability to the models in production.

play06:56

Some of the best interaction paradigms

play06:59

that we've seen developers build have built safety

play07:02

and security at the core of the experience.

play07:04

Some of our best models are the ones

play07:07

that are most aligned with human values.

play07:10

We believe some of the most useful and capable UX

play07:13

brings the best out of safety and steerability for better, safer outcomes.

play07:19

To demonstrate an example of this,

play07:21

let's start with a simple prompt in DALL·E.

play07:23

Very timely for Christmas,

play07:25

to create an abstract oil painting of a Christmas tree.

play07:28

DALL·E uses the model to enhance the prompt by adding more details

play07:32

and specificity around the hues, the shape of the tree,

play07:36

the colors and brush strokes, and so on.

play07:38

Now, I'm not an artist,

play07:40

so I wouldn't have done a better job at this,

play07:42

but in this case,

play07:43

I'm using DALL·E as a partner to bring my ideas to imagination.

play07:48

Now, you might be wondering, how is this a safety guardrail?

play07:51

Well, the same prompt enrichment used to create better artifacts

play07:56

also functions as a safety guardrail.

play07:58

If the model in this case detects a problematic prompt

play08:01

that violates the privacy or rights of individuals,

play08:04

it'll suggest a different prompt rather than refusing it outright.

play08:08

In this case, instead of generating an image of a real person,

play08:12

it captures the essence and then creates an image of a fictional person.

play08:18

We shared one example of a guardrail that can help

play08:22

with both steerability and safety, but guardrails can take many other forms.

play08:28

Some examples of this are compliance guardrails,

play08:31

security guardrails, and guardrails to ensure

play08:34

that the model outputs are syntactically and semantically correct.

play08:38

Guardrails become essentially important

play08:41

when you're building interfaces for highly regulated industries

play08:45

where there's low tolerance for errors and hallucination

play08:48

and where you have to prioritize security and compliance.
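
(Illustrative aside: a minimal sketch of what a guardrail layer between the user experience and the model could look like, assuming the openai Python SDK v1+; the answer_with_guardrails helper and the canned refusal messages are hypothetical, and a real system would add compliance- or domain-specific checks on top.)

```python
# Minimal guardrail sketch (assumption: openai Python SDK v1+; helper name is hypothetical).
from openai import OpenAI

client = OpenAI()

def answer_with_guardrails(user_input: str) -> str:
    # Preventative control on the way in: block clearly harmful requests.
    if client.moderations.create(input=user_input).results[0].flagged:
        return "Sorry, I can't help with that request."

    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": user_input}],
    )
    answer = response.choices[0].message.content

    # Preventative control on the way out: screen the model's answer as well.
    if client.moderations.create(input=answer).results[0].flagged:
        return "Sorry, I can't share that response."
    return answer
```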

play08:53

We built a great user experience with both steerability and safety,

play08:58

but our journey doesn't end there.

play09:02

-At this point, you've built a delightful user experience

play09:05

for all of your users that can manage around some

play09:06

of the uncertainty of these models.

play09:08

While this works really great as a prototype,

play09:11

when the types of queries

play09:13

that you'll be getting from your users are pretty constrained,

play09:15

as you scale this into production,

play09:17

you'll very quickly start running into consistency issues

play09:19

because as you scale out your application,

play09:21

the types of queries and inputs that you'll get will start varying quite a lot.

play09:24

With this, we want to talk about model consistency,

play09:27

which introduces the second part

play09:28

of our stack involving grounding the model with the knowledge store and tools.

play09:33

Two strategies that we've seen our customers adopt pretty well here

play09:36

to manage around the inherent inconsistency of these models include one,

play09:41

constraining the model behavior at the model level itself,

play09:44

and then two, grounding the model

play09:46

with some real-world knowledge using something like a knowledge store or your own tools.

play09:50

The first one of these is constraining the model behavior itself.

play09:54

This is an issue because oftentimes it's difficult

play09:58

to manage around the inherent probabilistic nature of LLMs

play10:00

and especially as a customer of our API

play10:02

where you don't have really low-level access to the model,

play10:05

it's really difficult to manage around some of this inconsistency.

play10:09

Today, we actually introduced two new model-level features

play10:13

that help you constrain model behavior,

play10:15

and wanted to talk to you about this today.

play10:18

The first one of these is JSON mode

play10:20

which if toggled on will constrain the output of the model

play10:23

to be within the JSON grammar.

play10:25

The second one is reproducible outputs

play10:27

using a new seed parameter that we're introducing into chat completions.

play10:32

The first one of these, JSON mode,

play10:34

has been a really commonly asked feature from a lot of people.

play10:37

It allows you to force the model to output within the JSON grammar.

play10:40

Often times this is really important to developers

play10:42

because you're taking the output from an LLM

play10:45

and feeding it into a downstream software system.

play10:47

A lot of times, in order to do that you'll need a common data format

play10:50

and JSON is one of the most popular of these.

play10:52

While this is great, one big downside of inconsistency here

play10:56

is when the model outputs invalid JSON

play10:58

it will actually break your system and throw an exception

play11:00

which is not a great experience for your customers.

play11:03

JSON mode that we introduce today should significantly

play11:05

reduce the likelihood of this.

play11:07

The way it works is something like this where in chat completions,

play11:10

we've added a new argument known as response_format.

play11:14

If you pass in type json_object into that parameter and you pass it into our API,

play11:20

the output that you'll be getting from our system

play11:22

or from the API will be constrained to within the JSON grammar.

play11:25

The content field there will be constrained to the JSON grammar.

play11:29

While this doesn't remove 100% of all JSON errors in our evals

play11:33

that we've seen internally,

play11:34

it does significantly reduce the error rate

play11:36

for JSON being output by this model.
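
(Illustrative aside: a request with JSON mode switched on might look like the sketch below, assuming the openai Python SDK v1+; the prompt and the extracted keys are made up for the example.)

```python
# JSON mode sketch (assumption: openai Python SDK v1+; prompt and keys are illustrative).
import json
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-3.5-turbo-1106",
    # Constrains the sampled output to valid JSON grammar.
    response_format={"type": "json_object"},
    messages=[
        {"role": "system", "content": "Extract the city and country as JSON."},
        {"role": "user", "content": "I'm flying into Tokyo next week."},
    ],
)

# Because the content is valid JSON, it can be parsed and fed to downstream systems.
data = json.loads(response.choices[0].message.content)
print(data)  # e.g. {"city": "Tokyo", "country": "Japan"}
```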

play11:38

The second thing is getting significantly more reproducible outputs

play11:42

via a seed parameter in chat completions.

play11:45

A lot of our models are non-deterministic

play11:49

but if you look under the hood,

play11:50

there are actually three main contributors

play11:52

to a lot of the inconsistent behavior happening behind the scenes.

play11:55

One of these is how the model samples its tokens based off

play11:59

of the probability that it gets.

play12:00

That's controlled by the temperature and the top P parameters

play12:03

that we already have.

play12:04

The second one is the seed parameter

play12:06

which is the random number that the model uses

play12:08

to start its calculations and base it off of.

play12:11

The third one is this thing called system fingerprint

play12:13

which describes the state of our engines that are running

play12:16

in the backend and the code that we have deployed on those.

play12:18

As those change, there will be some inherent non-determinism when that happens.

play12:21

As of today, we only give people access to temperature and top P.

play12:26

Starting today, we'll actually be giving developers access

play12:30

to the seed parameter as an input and giving developers visibility

play12:33

into system fingerprint in the responses of the chat completions model.

play12:37

In practice, it looks something like this

play12:39

where in chat completions there will now be a seed parameter

play12:43

that you can pass in which is an integer.

play12:45

If you're passing a seed like one, two, three, four, five,

play12:48

and you're controlling the temperature setting it to something like zero,

play12:51

your output will be significantly more consistent over time.

play12:55

If you send this particular request over to us five times,

play12:58

the output that you will be getting under choices

play13:01

will be significantly more consistent.

play13:03

Additionally, we're giving you access

play13:06

to the system fingerprint parameter

play13:08

which on every response from the model

play13:11

will tell you a fingerprint about our engine system under the hood.

play13:14

If you're getting the exact same system fingerprint back

play13:18

from earlier responses,

play13:18

and you passed in the same seed and temperature zero you're almost certainly

play13:22

going to get the same response.
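
(Illustrative aside: a sketch of using the seed parameter and reading back system_fingerprint, assuming the openai Python SDK v1+; the prompt is made up for the example.)

```python
# Reproducible-outputs sketch (assumption: openai Python SDK v1+; prompt is illustrative).
from openai import OpenAI

client = OpenAI()

def sample_once() -> tuple[str, str]:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo-1106",
        seed=12345,        # fixed seed for the sampler
        temperature=0,     # remove temperature-driven randomness
        messages=[{"role": "user", "content": "Name one famous landmark in Paris."}],
    )
    # system_fingerprint identifies the backend configuration that served the request.
    return response.choices[0].message.content, response.system_fingerprint

first_answer, first_fp = sample_once()
second_answer, second_fp = sample_once()

# With the same seed, temperature 0, and an unchanged fingerprint,
# the two answers should almost always match.
print(first_answer == second_answer, first_fp == second_fp)
```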

play13:27

Cool, so those are model-level behaviors

play13:29

that you can actually very quickly pick up

play13:30

and just try with even today.

play13:32

A more involved technique is called grounding the model

play13:36

which helps reduce the inconsistency of the model behavior

play13:38

by giving it additional facts to base its answer off of.

play13:40

The root of this

play13:42

is that when it's on its own a model

play13:44

can often hallucinate information as you all are aware of.

play13:47

A lot of this is due to the fact that we're forcing the model

play13:50

to speak

play13:50

and if it doesn't really know anything it will have to try

play13:52

and say something and a lot of the times it will make something up.

play13:55

The idea behind this is to ground the model

play13:58

and give it a bunch of facts so that it doesn't have nothing to go off of.

play14:01

Concretely, what we'd be doing here

play14:04

is in the input context explicitly giving the model some grounded facts

play14:07

to reduce the likelihood of hallucinations from the model.

play14:10

This is actually quite a broad sentiment.

play14:14

The way this might look in a system diagram is like this

play14:17

where a query will come in from your user,

play14:19

hits our servers and instead of first passing it over to our API,

play14:22

we're first going to do a round trip to some type of grounded fact source.

play14:25

Let's say we pass the query in there.

play14:27

Then in our grounded fact source,

play14:29

it will ideally return some type of grounded fact for us

play14:32

and then we will take the grounded fact

play14:34

and the query itself and pass it over to our API.

play14:36

Then ideally the API takes that information

play14:40

and synthesizes some type of response using the grounded fact here.

play14:43

To make this a little bit more concrete,

play14:45

one way that this might be implemented is using RAG

play14:47

or vector databases which is a very common and popular technique today.

play14:50

In this example, let's say I'm building a customer service bot

play14:53

and a user asks, how do I delete my account?

play14:55

This might be specific to my own application or my own product

play14:57

so the API by itself won't really know this.

play15:00

Let's say, I have a retrieval service like a vector database

play15:03

that I've used to index a bunch of my internal documents

play15:05

and a bunch of my FAQs about support

play15:07

and it knows about how to delete accounts.

play15:09

What I would do here first is do a query to the retrieval service

play15:12

with how do I delete my account.

play15:13

Let's say it finds a relevant snippet for me here that says,

play15:16

in the account deletion FAQ, you go to settings,

play15:19

you scroll down and click here, whatever.

play15:21

We would then pass that along with the original query to our API

play15:25

and then the API would use that fact to ground some response back to the user.

play15:29

In this case, it would say, to delete your account,

play15:31

go to settings, scroll down, click here.
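
(Illustrative aside: a sketch of that retrieval round trip, assuming the openai Python SDK v1+; retrieval_service.search() is a hypothetical stand-in for your vector database or search index.)

```python
# RAG grounding sketch (assumptions: openai Python SDK v1+;
# retrieval_service.search() is a stand-in for your vector database or search index).
from openai import OpenAI

client = OpenAI()

def answer_with_grounding(user_query: str, retrieval_service) -> str:
    # 1. Round trip to the grounded fact source first.
    snippets = retrieval_service.search(user_query, top_k=3)  # hypothetical interface
    context = "\n".join(snippets)

    # 2. Pass the grounded facts plus the original query to the model.
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "Answer using only the provided context. "
                        "If the context is insufficient, say you don't know.\n\n"
                        f"Context:\n{context}"},
            {"role": "user", "content": user_query},
        ],
    )
    return response.choices[0].message.content
```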

play15:32

This is one implementation,

play15:34

but actually, this can be quite broad

play15:35

and with OpenAI function calling in the API,

play15:38

you can actually use your own services

play15:39

and we've seen this used to great effect by our customers.

play15:42

In this case, instead of having a vector database,

play15:45

we might use our own API or own microservice here.

play15:48

In this case, let's say a customer is asking

play15:50

for what the current mortgage rates are which of course,

play15:52

even our LLMs don't know immediately because this changes all the time.

play15:55

Let's say we have a microservice

play15:57

that's doing some daily sync job that's downloading

play16:01

and keeping track of the current mortgage rates.

play16:03

In this case, we would use function calling.

play16:05

We would tell our model that it has access to this function known

play16:08

as get_mortgage_rates(), which is within our microservice.

play16:11

We'd first send a request over to the API

play16:13

and it would express its intent to call this get_mortgage_rates() function.

play16:17

We would then fulfill that intent by calling our API with get_mortgage_rates().

play16:22

Let's say it returns something like 8% mortgage rates

play16:25

for a 30-year fixed mortgage and then the rest looks very similar

play16:28

where you're passing that into the API with the original query

play16:31

and the model is then responding with a ground response,

play16:33

saying something like, not great.

play16:35

Current 30-year fixed rates are actually at 8% already.
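
(Illustrative aside: a sketch of the function-calling round trip described here, assuming the openai Python SDK v1+; get_mortgage_rates() is a stand-in for your own microservice and returns a hard-coded value.)

```python
# Function-calling sketch (assumptions: openai Python SDK v1+;
# get_mortgage_rates() stands in for your own daily-synced microservice).
import json
from openai import OpenAI

client = OpenAI()

def get_mortgage_rates() -> dict:
    return {"30_year_fixed": "8%"}  # stand-in for the real microservice call

tools = [{
    "type": "function",
    "function": {
        "name": "get_mortgage_rates",
        "description": "Return today's mortgage rates.",
        "parameters": {"type": "object", "properties": {}},
    },
}]

messages = [{"role": "user", "content": "What are current mortgage rates?"}]

# 1. The model expresses its intent to call the function.
first = client.chat.completions.create(model="gpt-4", messages=messages, tools=tools)
tool_call = first.choices[0].message.tool_calls[0]

# 2. We fulfill that intent ourselves and hand the result back to the model.
messages.append(first.choices[0].message)
messages.append({
    "role": "tool",
    "tool_call_id": tool_call.id,
    "content": json.dumps(get_mortgage_rates()),
})

# 3. The model grounds its final response in the returned rates.
final = client.chat.completions.create(model="gpt-4", messages=messages, tools=tools)
print(final.choices[0].message.content)
```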

play16:38

At a very broad level, you're using this grounded fact source in a generic way,

play16:43

to help ground the model and help reduce model inconsistency.

play16:46

I just wrote two different examples of this,

play16:48

but the grounded fact source can also be other things

play16:51

like a search index, even Elasticsearch, or some type of more general search index.

play16:56

It can be something like a database.

play16:57

It could even be something like browsing the internet

play16:59

or trying some smart mechanism that grabs additional facts.

play17:02

The main idea is to give the model something to work with.

play17:06

One thing I wanted to call out is that the OpenAI Assistants API

play17:09

that we just announced today,

play17:10

actually offers an out-of-the-box retrieval setup for you

play17:13

to use and build on top of with retrieval built right in

play17:17

in a first-class experience.

play17:18

I'd recommend checking it out.

play17:20

Shyamal: So far we talked

play17:22

about building a transparent human-centric user experience.

play17:26

Then we talked about how do you consistently deliver

play17:29

that user experience through some of the model-level features

play17:32

we released today and then by grounding the model.

play17:34

Now we're going to talk about how do we deliver

play17:38

that experience consistently without regressions.

play17:42

This is where evaluating the performance of the model

play17:45

becomes really important.

play17:46

We're going to talk about two strategies here

play17:49

that will help evaluate performance for applications built with our models.

play17:53

The first one is to create evaluation suites for your specific use cases.

play18:00

Working with many orgs, we hear time and time again

play18:04

that evaluating the model and the performance

play18:06

and testing for regressions is hard, often slowing down development velocity.

play18:12

Part of that problem is that developers don't think

play18:16

about a systematic process for evaluating the performance

play18:19

of these models and also doing evaluations too late.

play18:23

Evaluations are really the key to success here.

play18:26

Measuring the performance of the models

play18:29

on real product scenarios is really essential

play18:32

to prevent regressions and for you to build confidence

play18:35

as you deploy these models at scale.

play18:39

You can think of evals as essentially unit tests

play18:42

for the large language models.

play18:44

People often think of prompting as a philosophy,

play18:48

but it is more of a science.

play18:51

When you pair it with evaluations,

play18:52

you can treat it like a software product or delivery.

play18:56

Evals can really transform ambiguous dialogues into quantifiable experiments.

play19:02

They also make model governance, model upgrades,

play19:06

much easier setting expectations around what's good or bad.

play19:10

Capabilities, evaluations, and performance really go hand-in-hand

play19:14

and they should be the place where you begin your AI engineering journey.

play19:18

In order to build evals, let's say we start simple

play19:23

and have human annotators evaluate the outputs

play19:25

of an application as you're testing.

play19:28

A typical approach in this case

play19:30

is where you have an application with different sets of prompts

play19:33

or retrieval approaches and so on and you'd want to start

play19:36

by building a golden test data set of evals by looking

play19:40

at these responses and then manually grading them.

play19:43

As you annotate this over time,

play19:45

you end up with a test suite that you can then run in online

play19:49

or offline fashion or part of your CICD pipelines.

play19:52

Due to the nature of large language models,

play19:55

they can make mistakes, so do humans.

play19:57

Depending on your use case,

play19:59

you might want to consider building evals to test for things

play20:03

like bad output formatting or hallucinations,

play20:06

agents going off the rails, bad tone, and so on.

play20:12

Let's talk about how to build an Eval.

play20:14

Earlier this year, we open-sourced the evals framework,

play20:17

which has been an inspiration for many developers.

play20:20

This library contains a registry of really challenging evals

play20:25

for different specific use cases and verticals,

play20:27

and a lot of templates, which can come in handy

play20:29

and can be a solid starting point for a lot of you

play20:32

to understand the kind of evaluations

play20:34

and tests you should be building for your specific use cases.

play20:39

After you've built an eval suite, a good practice

play20:42

and hygiene here is to log and track your eval runs.

play20:46

In this case, for example, we have five different eval runs,

play20:49

each scored against our golden test dataset,

play20:53

along with the annotation feedback and audit of changes.

play20:57

The audit of changes could include things like changes to your prompt,

play21:01

to your retrieval strategy, few short examples,

play21:03

or even upgrade to model snapshots.

play21:07

You don't need complicated tooling to start with tracking something like this.

play21:11

A lot of our customers start with just a spreadsheet,

play21:13

but the point is each run should be stored

play21:16

at a very granular level so you can track it accordingly.
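
(Illustrative aside: granular tracking really can start as simply as a spreadsheet or CSV; the sketch below uses made-up field names and a placeholder grader.)

```python
# Eval-run logging sketch (all field names and the grader are illustrative).
import csv, datetime

golden_dataset = [
    {"input": "How do I delete my account?",
     "expected": "Go to settings, scroll down, and click delete."},
]

def log_eval_run(run_name: str, notes: str, grade_fn, path: str = "eval_runs.csv") -> None:
    scores = [grade_fn(case["input"], case["expected"]) for case in golden_dataset]
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        # One row per run: timestamp, what changed (prompt, retrieval, model snapshot), score.
        writer.writerow([
            datetime.datetime.now().isoformat(),
            run_name,
            notes,
            sum(scores) / len(scores),
        ])

# Example usage with a trivial placeholder grader.
log_eval_run("run-5", "switched to new retrieval strategy",
             grade_fn=lambda inp, expected: 1.0)
```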

play21:21

Although human feedback and user evals are the highest signal in quality,

play21:27

it's often expensive or not always practical,

play21:30

for example, when you cannot use real customer data for evals.

play21:34

This is where automated evals can help developers monitor progress

play21:39

and test for regressions quickly.

play21:43

Let's talk about model-graded evals or essentially using AI to grade AI.

play21:50

GPT-4 can be a strong evaluator.

play21:53

In fact, in a lot of natural language generation tasks,

play21:56

we've seen GPT-4 evaluations to be well correlated with human judgment

play22:01

with some additional prompting methods.

play22:04

The benefit of model-graded evals here

play22:06

is that by reducing human involvement in parts of the evaluation process

play22:10

that can be handled by language models,

play22:12

humans can be more focused on addressing some of the complex edge cases

play22:17

that are needed for refining the evaluation methods.

play22:22

Let's look at an example of what this could look like in practice.

play22:26

In this case, we have an input query and two pairs of completions.

play22:31

One that is the ground truth and one that is sampled from the model.

play22:36

The evaluation here is a very simple prompt that asks GPT-4

play22:40

to compare the factual content of the submitted answer with the expert answer.

play22:45

This is passed to GPT-4 to grade, and in this case,

play22:48

GPT-4's observation is there's a disparity

play22:51

between the submitted answer and the expert answer.

play22:55

We can take this further by improving our evaluation prompt

play22:58

with some additional prompt engineering techniques like chain of thought and so on.
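
(Illustrative aside: a minimal model-graded eval along these lines, assuming the openai Python SDK v1+; the grading prompt is a paraphrase for illustration, not the exact prompt from the talk.)

```python
# Model-graded eval sketch (assumption: openai Python SDK v1+; grading prompt is illustrative).
from openai import OpenAI

client = OpenAI()

GRADER_PROMPT = """Compare the factual content of the submitted answer with the expert answer.
Reply with exactly one word: CORRECT if they agree, INCORRECT if they disagree.

Question: {question}
Expert answer: {expert}
Submitted answer: {submitted}"""

def grade(question: str, expert: str, submitted: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # keep the grader as deterministic as possible
        messages=[{"role": "user", "content": GRADER_PROMPT.format(
            question=question, expert=expert, submitted=submitted)}],
    )
    return response.choices[0].message.content.strip()

print(grade("When was GPT-4 released?", "March 14, 2023", "Sometime in 2022"))
# Expected grade: INCORRECT
```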

play23:04

In the previous example, the eval was pretty binary.

play23:06

Either the answer matched the ground truth or it did not.

play23:10

In a lot of cases, you'd want to think about eval metrics,

play23:14

which are closely correlated with what your users would expect

play23:18

or the outcomes that you're trying to derive.

play23:21

For example, going back to Sherwin's example of a customer service assistant,

play23:26

we'd want to eval for custom metrics like the relevancy of the response,

play23:30

the credibility of the response, and so on,

play23:33

and have the model essentially score against those different metrics

play23:36

or the criteria that we decide.

play23:39

Here's an example of what that criteria or scorecard would look like.

play23:44

Here we have provided GPT-4 essentially this criteria for relevance, credibility,

play23:50

and correctness, and then use GPT-4 to score the candidate outputs.

play23:56

A good tip here is to show rather than tell,

play23:59

which basically means including examples

play24:01

of what a score of one or a five could look like,

play24:04

would really help in this evaluation process

play24:06

so that the model can really appreciate the spread of the criteria.

play24:10

In this case, GPT-4 has effectively learned an internal model of language quality,

play24:15

which helps it to differentiate between relevant text and low-quality text.

play24:20

Harnessing this internal scoring mechanism allows us

play24:23

to do auto-evaluation of new candidate outputs.
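
(Illustrative aside: one way to express such a scorecard in code, assuming the openai Python SDK v1+; the criteria, the anchor examples, and the JSON output shape are made up to illustrate the "show rather than tell" tip.)

```python
# Rubric-scoring sketch (assumption: openai Python SDK v1+; rubric and anchors are illustrative).
import json
from openai import OpenAI

client = OpenAI()

RUBRIC = """Score the candidate response from 1 to 5 on each criterion:
- relevance: does it address the user's question?
- credibility: does it avoid unsupported claims?
- correctness: is it factually accurate?

Show rather than tell: a score of 1 means off-topic or fabricated
(e.g. answering "reset your router" to an account-deletion question);
a score of 5 means fully on-topic, grounded, and accurate.

Return JSON like {"relevance": n, "credibility": n, "correctness": n}."""

def score(question: str, candidate: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4-1106-preview",
        temperature=0,
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Question: {question}\nCandidate: {candidate}"},
        ],
    )
    return json.loads(response.choices[0].message.content)
```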

play24:27

When GPT-4 is expensive or slow for evals,

play24:30

even after today's price drops, you can fine-tune a 3.5 turbo model,

play24:34

which essentially distills GPT-4's outputs

play24:38

to become really good at evaluating your use cases.

play24:41

In practice, what this means is you can use GPT-4

play24:44

to curate high-quality data for evaluations,

play24:47

then fine-tune a 3.5 judge model

play24:50

that gets really good at evaluating those outputs,

play24:52

and then use that fine-tuned model to evaluate the performance of your application.

play24:57

This also helps reduce some of the biases that come

play25:00

with just using GPT-4 for evaluations.

play25:04

The key here is to adopt evaluation-driven development.

play25:08

Good evaluations are the ones

play25:11

which are well correlated to the outcomes that you're trying

play25:14

to derive or the user metrics that you care about.

play25:17

They have really high end-to-end coverage in the case of RAG

play25:20

and they're scalable to compute.

play25:23

This is where automated evaluations really help.

play25:28

-At this point, you've built a delightful user experience,

play25:31

you're able to deliver it consistently to your users

play25:34

and you're also able to iterate on the product in confidence using evaluations.

play25:38

If you do all this right,

play25:39

oftentimes you'll find yourselves with a product that's blowing up

play25:41

and really, really popular.

play25:43

If the last year has shown us anything,

play25:45

it's that the consumer appetite and even the internal employee appetite

play25:48

for AI is quite insatiable.

play25:50

Oftentimes, you'll now start thinking about how to manage scale.

play25:54

Oftentimes, managing scale means managing around latency

play25:57

and managing around cost.

play25:59

With this, we introduce the final part of our stack,

play26:02

known as orchestration, where you can manage around scale

play26:04

by adding a couple of additional mechanisms and forks into your application.

play26:10

Two strategies that we've seen in managing costs

play26:12

and latency involve using semantic caching

play26:16

to reduce the number of round trips that you're taking

play26:18

to our API as well as routing to the cheaper models.

play26:23

The first one of these is known as semantic caching.

play26:28

What semantic caching looks like in practice from a systems perspective,

play26:31

is that you're going to be adding a new layer in your logic

play26:35

to sit between us and your application.

play26:38

In this case, if a query comes in asking when was GPT-4 released,

play26:42

you would first go to your semantic cache and do a lookup there

play26:46

and see if you have anything in your cache.

play26:49

In this case, we don't and then you would just pass this request over to our API.

play26:53

Then the API would respond to something like March 14th, 2023,

play26:57

and then you'd save this within your semantic cache,

play26:59

which might be a vector database or some other type of store.

play27:02

The main point here is you're saving the March 14th, 2023 response

play27:07

and keying it with that query of when was GPT-4 released

play27:10

and then you pass this back over to your users.

play27:13

This is fine, but let's say, a month or a week from now,

play27:17

another request comes in where a user asks GPT-4 release date?

play27:21

Now, this isn't the exact same query that you had before,

play27:24

but it is very semantically similar

play27:25

and can be answered by the exact same response.

play27:27

In this case, you would do a semantic lookup in your cache,

play27:31

realize that you have this already

play27:32

and you'd just return back to the user with March 14th, 2023.

play27:35

With this setup, you've actually saved latency

play27:38

because you're no longer doing a round trip to our API

play27:40

and you've saved costs because you're no longer hitting

play27:42

and paying for additional tokens.
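
(Illustrative aside: a toy semantic cache, assuming the openai Python SDK v1+; the in-memory list and the 0.9 cosine-similarity threshold stand in for a real vector store and a tuned cutoff.)

```python
# Semantic-cache sketch (assumptions: openai Python SDK v1+; the in-memory list and
# 0.9 similarity threshold stand in for a real vector store and tuned cutoff).
import numpy as np
from openai import OpenAI

client = OpenAI()
cache: list[tuple[np.ndarray, str]] = []  # (query embedding, cached answer)

def embed(text: str) -> np.ndarray:
    vec = client.embeddings.create(model="text-embedding-ada-002", input=text).data[0].embedding
    return np.array(vec)

def answer(query: str, threshold: float = 0.9) -> str:
    q = embed(query)
    # Semantic lookup: reuse the cached answer for a sufficiently similar past query.
    for cached_vec, cached_answer in cache:
        similarity = q @ cached_vec / (np.linalg.norm(q) * np.linalg.norm(cached_vec))
        if similarity >= threshold:
            return cached_answer  # no round trip to the API, no extra tokens

    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": query}],
    )
    result = response.choices[0].message.content
    cache.append((q, result))  # key the response by the query's embedding
    return result

print(answer("When was GPT-4 released?"))
print(answer("GPT-4 release date?"))  # semantically similar, served from the cache
```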

play27:44

While this works great,

play27:46

oftentimes, it might be a little bit difficult to manage

play27:49

and there are often even more capable ways of managing cost and latency.

play27:55

This is where we start thinking about routing to cheaper models

play27:58

and where orchestration really comes into play.

play28:00

When I talk about routing to cheaper models,

play28:03

oftentimes the first thing to think about is to go from GPT-4 into 3.5 Turbo,

play28:06

which sounds great because GPT-3.5 Turbo is so cheap, so fast,

play28:11

however, it's obviously not nearly as smart as GPT-4.

play28:14

If you were to just drag and drop 3.5 Turbo into your application,

play28:17

you'll very quickly realize that you're not delivering

play28:20

as great of a customer experience.

play28:22

However, the GPT-3.5 Turbo Finetuning API

play28:26

that we released only two months ago has already become a huge hit

play28:29

with our customers and it's been a really great way for customers

play28:32

to reduce costs by fine-tuning a custom version of GPT-3.5 Turbo

play28:36

for their own particular use case,

play28:38

and get all the benefits of the lower latency and the lower cost.

play28:41

There's obviously a full talk about fine-tuning earlier,

play28:44

but just in a nutshell, the main idea here is to take your own curated dataset.

play28:49

This might be something like hundreds or even thousands of examples at times,

play28:53

describing to the model how to act in your particular use case.

play28:57

You'd pass in that curated dataset

play28:59

into our fine-tuning API maybe tweak a parameter or two here

play29:02

and then the main output here is a custom fine-tuned version of 3.5 Turbo specific

play29:07

to you and your organization based off of your dataset.

play29:10

While this is great, oftentimes actually,

play29:13

there's a huge activation energy associated with doing this

play29:16

and it's because it can be quite expensive to generate this curated data set.

play29:20

Like I mentioned, you might need hundreds, thousands,

play29:23

sometimes even tens of thousands of examples for your use case,

play29:25

and oftentimes you'll be manually creating these yourself

play29:28

or hiring some contractors to do this manually as well.

play29:31

However, one really cool method

play29:34

that we've seen a lot of customers adopt

play29:35

is you can actually use GPT-4 to create the training dataset

play29:39

to fine-tune 3.5 Turbo.

play29:41

It's starting to look very similar

play29:42

to what Shyamal just mentioned around evals as well,

play29:45

but GPT-4 is at an intelligence level

play29:47

where you can actually just give it a bunch of prompts,

play29:50

it'll output a bunch of outputs for you here,

play29:52

and that output can just be your training set.

play29:54

You don't need any human manual intervention here.

play29:57

What you're effectively doing here

play29:58

is you're distilling the outputs from GPT-4

play30:01

and feeding that into 3.5 Turbo so it can learn.

play30:04

Oftentimes, what this does is that in your specific narrow domain,

play30:08

it helps this fine-tuned version of 3.5 Turbo be almost as good as GPT-4.
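
(Illustrative aside: a sketch of that distillation loop, assuming the openai Python SDK v1+; the prompts, system message, and file name are placeholders, and in practice the GPT-4 outputs would be curated and reviewed before training.)

```python
# Distillation sketch (assumptions: openai Python SDK v1+; prompts, system message,
# and file name are placeholders; real datasets are usually curated and reviewed).
import json
from openai import OpenAI

client = OpenAI()
SYSTEM = "You are a customer support assistant for Acme Corp."  # hypothetical use case
prompts = ["How do I delete my account?", "How do I reset my password?"]  # ...hundreds more

# 1. Use GPT-4 to generate the target outputs for your prompts.
with open("train.jsonl", "w") as f:
    for prompt in prompts:
        completion = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "system", "content": SYSTEM},
                      {"role": "user", "content": prompt}],
        ).choices[0].message.content
        # Each line is one chat-formatted training example.
        f.write(json.dumps({"messages": [
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": completion},
        ]}) + "\n")

# 2. Upload the dataset and kick off a 3.5 Turbo fine-tuning job.
training_file = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(training_file=training_file.id, model="gpt-3.5-turbo")
print(job.id)  # the resulting model is a custom 3.5 Turbo distilled from GPT-4's outputs
```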

play30:14

If you do take the effort in doing all of this,

play30:17

the dividends that you get down the line are actually quite significant,

play30:20

not only from a latency perspective, because GPT-3.5 Turbo is obviously a lot faster,

play30:24

but also from a cost perspective.

play30:26

Just to illustrate this a little bit more concretely, if you look at the table,

play30:30

even after today's GPT-4 price drops,

play30:33

a fine-tuned version of 3.5 Turbo is still 70% to 80% cheaper.

play30:38

While it's not as cheap as the vanilla 3.5 Turbo,

play30:41

you can see it's still quite a bit off from GPT-4,

play30:45

and if you switch over to fine-tuned 3.5 Turbo,

play30:49

you'll be saving on a lot of cost.

play30:54

-All right, so we talked about a framework

play30:56

that can help you navigate the unique considerations

play30:59

and challenges that come with scaling applications built with our models,

play31:03

going from prototype to production.

play31:05

Let's recap.

play31:06

We talked about how to build a useful, delightful,

play31:10

and human-centric user experience by controlling for uncertainty

play31:14

and adding guardrails.

play31:15

Then we talked about how do we deliver

play31:18

that experience consistently through grounding the model

play31:20

and through some of the model-level features.

play31:23

Then we talked about consistently delivering that experience

play31:26

without regressions by implementing evaluations.

play31:29

Then finally, we talked about considerations that come with scale,

play31:33

which is managing latency and costs.

play31:37

As we've seen,

play31:38

building with our models increases surface area for what's possible,

play31:42

but it has also increased the footprint of challenges.

play31:46

All of these strategies we talked about,

play31:48

including the orchestration part of the stack,

play31:50

have been converging into this new discipline called LLM Ops

play31:54

or Large Language Model Operations.

play31:56

Just as DevOps emerged in the early 2000s

play31:59

to streamline the software development process,

play32:02

LLM Ops has recently emerged in response to the unique challenges

play32:06

that are posed by building applications with LLMs

play32:09

and it has become a core component of many enterprise architectures and stacks.

play32:14

You can think of LLM Ops as basically the practice, tooling,

play32:18

and infrastructure that is required

play32:20

for the operational management of LLMs end-to-end.

play32:24

It's a vast and evolving field, and we're still scratching the surface.

play32:29

While we won't go into details,

play32:31

here's a preview of what this could look like.

play32:33

LLM Ops capabilities help address challenges like monitoring,

play32:37

optimizing performance,

play32:39

helping with security compliance, managing your data and embeddings,

play32:43

increasing development velocity,

play32:45

and really accelerating the process of reliable testing

play32:48

and evaluation at scale.

play32:50

Here, observability and tracing become especially important

play32:54

to identify and debug failures with your prompt chains and assistants

play32:58

and handle issues in production faster,

play33:00

making collaboration between different teams easier.

play33:03

Gateways, for example, are important to simplify integrations and

play33:07

can help with centralized management of security, API keys, and so on.

play33:13

LLM Ops really enable scaling to thousands of applications

play33:18

and millions of users, and with the right foundations here,

play33:22

organizations can really accelerate their adoption.

play33:24

Rather than one-off tools,

play33:26

the focus should be really developing

play33:29

these long-term platforms and expertise.

play33:31

Just like this young explorer standing at the threshold,

play33:35

we have a wide field of opportunities in front of us

play33:39

to build the infrastructure and primitives

play33:42

that stretch beyond the framework we talked about today.

play33:46

We're really excited to help you build the next-generation assistants

play33:50

and ecosystem for generations to come.

play33:53

There's so much to build and discover,

play33:56

and we can only do it together.

play33:58

Thank you.

play33:58

[applause]

play34:01

[music]


Related Tags
Artificial Intelligence, Prototyping, Production Deployment, User Experience, Model Consistency, Knowledge Store, Evaluation and Testing, Cost Management, Latency Optimization, LLM Ops, Technical Framework