LlamaIndex Webinar: Build an Open-Source Coding Assistant with OpenDevin

LlamaIndex
22 May 2024 · 53:30

Summary

TLDR This webinar features the open-source project OpenDevin, which aims to build an open-source version of Devin, the autonomous AI software engineer, developed collaboratively by the community. Robert, a core contributor, gives a demo and leads a discussion, presenting the effort to give developers the tools to run autonomous software engineers safely and monitor what they are doing. The team also aims to turn software development into a more creative and engaging task. The project is released under the MIT license and, in just two months, has attracted 116 contributors and more than 24,000 GitHub stars, remarkable progress in a short time.

Takeaways

  • 😀 OpenDevin is a project to build a fully open-source autonomous AI software engineer.
  • 😀 The focus is on autonomous software agents that write code and build entire software projects.
  • 😀 Agents can autonomously carry out tasks such as editing files in a codebase, running tests, debugging, and applying fixes.
  • 😀 116 unique contributors have taken part so far, and more than 700 pull requests have been merged.
  • 😀 Compared with tools like ChatGPT and GitHub Copilot, agents work with existing codebases and embed directly into the developer workflow.
  • 😀 Agents can find and edit the right files within a large codebase.
  • 😀 Automated software development agents can rewrite code, run tests, and fix code.
  • 😀 Agents run inside a sandbox to keep things transparent and safe.
  • 😀 Agent capability is measured with the SWE-bench Lite benchmark.
  • 😀 The OpenDevin interface combines a chat window, terminal, code editor, and web browser into an integrated environment for developers to work with the agent.

Q & A

  • What is OpenDevin?

    - OpenDevin is an open-source project for an autonomous AI software engineer. It is provided under the MIT license, so the community can fully understand it and take part in its development.

  • What is OpenDevin aiming for?

    - OpenDevin aims to provide an open-source platform for building and running software development agents. Concretely, it offers tools for building software together with AI, helping developers generate and debug code autonomously.

  • What are OpenDevin's main features?

    - Its main features are that agents play nicely with existing codebases, embed directly into the developer workflow, and can run a debug-and-fix loop. Agents also have access to a web browser, the command line, and the file system, so they can carry out a wide range of tasks autonomously.

  • How does an OpenDevin agent work?

    - An OpenDevin agent is built around an LLM (large language model). It receives a task from the user and carries it out using external data, the codebase, and a runtime environment. The agent works in a loop: it generates a prompt, takes an action, and uses the resulting observation to decide the next step.
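That prompt/action/observation loop can be sketched in a few lines of Python. This is a minimal illustration, not OpenDevin's actual interfaces: `llm` and `runtime` are stand-in callables supplied by the caller.

```python
# Minimal sketch of the agent loop: the LLM proposes an action, the runtime
# executes it, and the observation is folded back into the state.
# `llm` and `runtime` are stand-in callables, not OpenDevin APIs.

def run_agent(task, llm, runtime, max_steps=10):
    state = [{"role": "user", "content": task}]   # conversation/state history
    for _ in range(max_steps):
        action = llm(state)                       # e.g. {"action": "run", "args": "pytest"}
        if action.get("action") == "finish":
            return action.get("result")
        observation = runtime(action)             # command output, file contents, ...
        state.append({"action": action, "observation": observation})
    return None  # step budget exhausted
```

All of the agent-specific design lives in how `llm(state)` turns the accumulated history into the next prompt.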

  • How is agent performance evaluated?

    - Agent performance is evaluated against the SWE-bench benchmark, which tests the ability to solve real GitHub issues: each instance pairs an issue with the PR that fixed it, and the agent's changes are judged by whether the tests from that PR pass.
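Schematically, the SWE-bench check for a single instance looks like the sketch below. `run_tests` and `apply_patch` are stand-in callables here; the real harness shells out to git and the repository's test runner.

```python
# Schematic SWE-bench check: an instance is "resolved" if the gold tests
# fail before the agent's patch and pass after it is applied.
# `run_tests` and `apply_patch` are stand-in callables.

def evaluate_instance(run_tests, apply_patch, agent_patch):
    if run_tests():                 # tests must fail before the fix
        return False                # invalid instance: nothing to repair
    apply_patch(agent_patch)        # apply the agent's proposed change
    return run_tests()              # resolved iff the gold tests now pass

def score(results):
    """Fraction of instances resolved (e.g. about 0.21 for OpenDevin on SWE-bench Lite)."""
    return sum(results) / len(results)
```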

  • How good is the current OpenDevin agent?

    - The current OpenDevin agent solves about 21% of problems on the SWE-bench Lite evaluation, a result that beats the best existing agents.

  • How do OpenDevin agents integrate into the developer workflow?

    - OpenDevin agents can be integrated into a range of development environments, including the web interface, a VS Code plugin, and the command line. It is also possible to hook the agent directly into GitHub PRs and issue comments.

  • How can I take part in the OpenDevin project?

    - You can take part by opening pull requests on the GitHub repository, or by creating issues with proposals and feedback. Picking up existing open issues is also welcome.

  • What are the future development plans for OpenDevin?

    - Future plans include improving agent performance, adding new agents, and building a more general architecture. Work is also under way on improved interfaces so the agent can be used across different development environments.

  • What are the minimum compute requirements for using OpenDevin?

    - OpenDevin itself does not need much compute; the heavy computation is delegated to an external large language model (LLM). Using a powerful LLM such as GPT-4 or Claude greatly improves agent performance.

Outlines

00:00

🤖 Introducing the open-source project OpenDevin

This section introduces OpenDevin, the open-source project featured in the webinar. OpenDevin aims to build an open-source version of the autonomous AI software engineer. It is developed collaboratively by the community and released under the MIT license. The project aims to make software development a more creative and engaging task, and is driven entirely by volunteers.

05:00

🔧 OpenDevin's vision and capabilities

OpenDevin's vision is to provide a platform for building and running software development agents. It offers tools for software development powered by AI and LLMs, aiming to be easy to use for researchers, agent builders, and the software developers who use those agents. The project is moving very fast, with many stars and contributors on GitHub.

10:01

🛠️ The agent's role and capabilities

This section explains what capabilities an agent should have and why they matter. Agents work in harmony with existing codebases and embed into the developer workflow to boost productivity. They can also drive iterative processes such as debugging and running tests, and can break large tasks into small steps to get them done.

15:03

🔬 How agents work, and the framework

This section explains how an agent operates: it receives a task from the user and carries it out using external data, the LLM, and a runtime. It also introduces the micro-agent framework, which simplifies complex logic and lets you define agents in Markdown.

20:05

📊 Evaluating and benchmarking agents

SWE-bench is introduced as the benchmark for judging the quality of software development agents. It tests the ability to solve real-world issues from GitHub, and OpenDevin scores highly on it.

25:06

🌐 OpenDevin's interface and demo

This section describes OpenDevin's user interface: a development environment made up of a chat window, terminal, code editor, and web browser. A demo then shows OpenDevin identifying and fixing a problem in a codebase.

30:06

👥 Community and call for contributions

Finally, viewers are encouraged to join and contribute. OpenDevin is run in a decentralized way, and anyone can contribute to the project. The webinar calls on participants to get involved and bring new ideas and features.

Keywords

💡 Open source

Open source refers to licensing that makes a program's source code public so that anyone can use, modify, and distribute it. The video stresses that OpenDevin is an MIT-licensed open-source project, and discusses how the open-source community sustains the project and keeps it transparent.

💡 Autonomous AI

Autonomous AI refers to artificial intelligence that operates on its own to carry out tasks. The video explains that OpenDevin aims to act as an autonomous software engineer: an AI that can generate, test, and fix code automatically.

💡 Agent

An agent is a software program designed to carry out a specific task. The video describes how OpenDevin's agents can generate, test, and fix code. Agents act on the user's instructions and support the software development process.

💡 Context window management

Context window management is how an AI retains and uses the information it needs to carry out a task. The video explains that OpenDevin's agents manage information such as code edits and test results and use it to decide the next action.
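A minimal sketch of that kind of budgeting: keep the task plus the most recent events that fit. The word-count "tokenizer" below is a stand-in for a real one.

```python
# Sketch of context-window budgeting: keep the task and the most recent
# observations, dropping the oldest events once a token budget is exceeded.
# Token counting here is a naive word count, not a real tokenizer.

def build_context(task, history, budget=1000):
    count = lambda text: len(text.split())
    kept, used = [], count(task)
    for event in reversed(history):       # newest events are most relevant
        cost = count(event)
        if used + cost > budget:
            break
        kept.append(event)
        used += cost
    return [task] + list(reversed(kept))  # chronological order, task first
```

Real agents mix in more than raw history (plans, summaries, retrieved code), but the budgeting problem is the same.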

💡 SWE-bench

SWE-bench is a benchmark for measuring the performance of AI agents. The video notes that OpenDevin achieved a 21% score on this benchmark, outperforming other agents.

💡 Micro-agent

A micro-agent is a small agent specialized for a specific task. The video explains how OpenDevin uses micro-agents to handle specific tasks efficiently, for example generating commit messages or investigating a GitHub repository.
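As a rough illustration of the "prompt template plus small action vocabulary" idea behind micro-agents (OpenDevin's real micro-agents are Markdown files; every field name and wording below is hypothetical, not the project's actual schema):

```python
# Illustrative micro-agent definition: a role description, an allowed
# action vocabulary, and a prompt template rendered at each loop turn.
# Field names are hypothetical, not OpenDevin's real schema.

COMMIT_AGENT = {
    "role": "You are a software engineer. Write a good commit message "
            "describing the staged changes.",
    "actions": ["run", "reject", "finish"],   # allowed action types
    "template": (
        "{role}\n\n"
        "History of your last actions and observations:\n{history}\n\n"
        "Respond with JSON: {{\"action\": one of {actions}, \"args\": ...}}"
    ),
}

def render_prompt(agent, history):
    # Fill the template with the agent's role and its recent history.
    return agent["template"].format(
        role=agent["role"],
        history="\n".join(history),
        actions=agent["actions"],
    )
```

The rendered prompt is then run in a loop against the LLM until the agent emits a `finish` (or `reject`) action.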

💡 GPT-4

GPT-4 is a large language model developed by OpenAI. The video notes that OpenDevin's agents use GPT-4 for code generation and error fixing; using a high-performance language model improves the agent's accuracy and efficiency.

💡 Community-driven

Community-driven means a project advances through the contributions of many volunteers and users. The video emphasizes that OpenDevin is fully community-driven, with many contributors taking part voluntarily.

💡 Developer workflow

A developer workflow is the sequence of steps and processes in software development. The video describes how OpenDevin integrates directly into the developer workflow and streamlines the development process, especially by automating code editing and testing.

💡 Vector database

A vector database stores data as vectors, enabling efficient search and similarity matching. The video describes how OpenDevin can index a codebase into a vector database and dynamically retrieve relevant code snippets.
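As a toy illustration of that retrieval idea (not OpenDevin's implementation): embed each snippet, then rank snippets by cosine similarity to the query. The bag-of-words "embedding" below is a stand-in; a real setup would use an embedding model and a vector store.

```python
# Toy codebase retrieval: embed snippets, rank by cosine similarity.
# The bag-of-words "vector" is a stand-in for a learned embedding.
import math
from collections import Counter

def embed(text):
    return Counter(text.lower().split())      # toy bag-of-words vector

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)         # Counter returns 0 for missing keys
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query, snippets, k=2):
    q = embed(query)
    ranked = sorted(snippets, key=lambda s: cosine(q, embed(s)), reverse=True)
    return ranked[:k]
```

The retrieved snippets are what gets pulled into the agent's context window instead of the whole codebase.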

Highlights

OpenDevin is a fully open-source autonomous AI software engineer project; its defining feature is that the community can fully understand and build it.

The project aims to provide an open platform that supports building and running software development agents.

OpenDevin is run under the MIT license, and all of the work is done by the community.

The project has 116 unique contributors, more than 700 merged PRs, and over 24,000 stars on GitHub.

Agents play nicely with existing codebases and embed directly into the developer workflow, enabling efficient debugging and testing.

Agents have no long-term memory; they operate on the current context window plus the knowledge acquired in pretraining.

Micro-agents are designed to handle specific tasks in small units and are defined in templated Markdown.

On the SWE-bench Lite benchmark, state-of-the-art agents show a 15 to 20% success rate; OpenDevin's latest evaluation reached 21%.

OpenDevin's user interface includes a chat window, terminal, code editor, web browser, and Jupyter notebook, supporting a range of development setups.

OpenDevin provides tools so that agents run safely and with transparency into what they are doing.

Planned performance improvements include integrating a browser agent and indexing the entire codebase into a vector database.

OpenDevin is designed so agents can run behind a variety of interfaces; command-line and GitHub plugin integrations are planned.

For agent builders, it provides all the underlying infrastructure, along with a pipeline that makes agent evaluation easy.

OpenDevin is pushing forward both agent development and the integration of agents into user workflows, aiming to dramatically improve developer productivity in the future.

By strengthening its agents and extending the platform, OpenDevin aims to automate more software development tasks and lighten the developer's load.

Transcripts

play00:00

hey everyone uh welcome to another

play00:02

episode of The Llama index webinar

play00:04

series uh today we have a special guest

play00:06

with us and we're excited to feature

play00:08

open Devon um and so open Devon is an

play00:11

open source project uh that you know is

play00:14

trying to build uh an open source

play00:16

version of Devon the autonomous AI

play00:18

software engineer um so it's very

play00:20

exciting because you know this is

play00:22

actually completely open source and so

play00:23

therefore the community is actually able

play00:25

to understand what it takes to actually

play00:27

build a fully autonomous coding

play00:29

assistant obviously there's a lot of

play00:31

interest in agents these days especially

play00:33

around assistants that can autonomously

play00:35

you know code like do tasks for you and

play00:38

then actually build an entire software

play00:40

project um and so excited to feature

play00:42

Robert one of the core contributors to

play00:44

open open Deon um who will be walking us

play00:47

through it seems like a few slides as

play00:49

well as a demo um and we'll have a

play00:51

pretty fun discussion on you know agents

play00:53

how to build them as well as the future

play00:55

and what's coming next so without

play00:57

further Ado I'll pass it to Robert

play01:00

awesome thanks a bu Jerry I'm super

play01:02

excited to be here uh and you know thank

play01:04

you to the to the Llama index team as a

play01:05

whole for uh for having

play01:07

us um so yeah like Jerry said you know

play01:09

we are we're building open Devon it's a

play01:12

it's an MIT licensed open source project

play01:15

uh we were really inspired when we saw

play01:17

the first uh demo of De Devon um and uh

play01:21

you know a community formed very quickly

play01:24

uh in hopes of basically

play01:26

recreating uh you know what we saw in

play01:28

that demo because it was so exciting and

play01:29

so so promising uh for us as Engineers

play01:32

to to see just how powerful that tool

play01:35

was

play01:37

um so you know over the last two months

play01:40

or so as we've been building that that

play01:42

Vision has uh has deepened a bit um we

play01:45

we consider ourselves an open source

play01:46

platform for both building and running

play01:48

software development agents so basically

play01:50

anything that's that's interfacing with

play01:52

AI with llms to build software uh we

play01:55

want to be able to provide uh you know a

play01:58

set of uh tools that uh academics and

play02:02

agent Builders can leverage in order to

play02:05

uh see their agents kind of come to life

play02:06

actually work on software things like

play02:08

that and then we also want to provide

play02:10

agent users so you know your average

play02:12

software developer who wants to run an

play02:14

autonomous software engineer the tools

play02:16

that they need to run those agents

play02:18

safely uh to understand what they're

play02:19

doing to have some transparency into

play02:21

what the agent's doing how it's running

play02:23

you know where it might need to be

play02:24

redirected things like

play02:27

that uh so big big emphasis on open you

play02:30

know like I said this is an MIT licensed

play02:32

project uh we're totally Community

play02:34

Driven right now uh everybody who's

play02:37

working on the project is is a volunteer

play02:39

uh you know working working out in the

play02:41

open um you know we have a bunch of

play02:45

academic folks uh we have a bunch of

play02:48

average developers like myself folks who

play02:51

uh maybe aren't as uh as deeply uh uh

play02:55

into the academic side of agent building

play02:56

but are very interested in uh software

play02:59

development application development as

play03:01

well as you know just a bunch of end

play03:02

users who you know may or may not be

play03:04

software Engineers themselves but who

play03:05

are interested in using autonomous

play03:07

software agents to uh help build

play03:09

software uh and all of us are working

play03:11

together really to try and figure out uh

play03:14

you know what is the best uh end user

play03:17

experience that we can drive for here

play03:18

and what is the best way we can get

play03:20

there

play03:21

technologically um and all of us really

play03:24

have this goal of making software making

play03:26

the development of software a more

play03:28

creative engaging task you know today

play03:30

there's a lot of um you know just slog

play03:34

involved in uh in you know getting

play03:36

something done when writing software uh

play03:39

and being able to uh push away all the

play03:42

parts that are kind of annoying and rote

play03:44

to an autonomous engineer and really be

play03:47

able to focus uh as humans on the

play03:49

creative part of the task on you know

play03:51

what is what does the end user want what

play03:52

do I want this software to be like and

play03:54

do uh if I can focus on that aspect of

play03:57

things that makes the job of writing

play03:58

software a lot more fun

play04:00

so that's really what we're what we're

play04:02

aiming for and and again we want to do

play04:03

this all all in the open uh we believe

play04:06

that software development is generally

play04:08

uh uh lends itself to being done in the

play04:11

open in an open source way and so we're

play04:13

very excited about uh the open source

play04:15

Community

play04:18

here just some kind of quick stats on

play04:20

the project uh we're very early on we're

play04:22

only about two months old uh we already

play04:24

have 116 unique contributors to the

play04:26

project uh We've merged over 700 PRS

play04:30

uh we have over 24,000 stars on GitHub

play04:33

um this is a huge amount of uh of

play04:36

progress for an open source project

play04:38

that's this young um and uh you know

play04:41

even even Beyond these numbers just the

play04:43

the quality of the application that

play04:45

we've been able to build and uh the the

play04:48

uh quality of the agent that we have

play04:49

running in there which we'll talk a

play04:50

little bit about later on thanks to some

play04:52

academic contributions uh really blown

play04:54

me away I uh I would not have expected

play04:57

to see an open source project be able to

play04:58

move this quickly

play05:00

and build this much um and it's been

play05:02

really really impressive to see and

play05:04

really exciting to just be a part

play05:07

of well so you know why why are we

play05:09

building this can't can't chat GPT

play05:11

already write code isn't this already

play05:13

possible and and that's definitely true

play05:15

chat GPT can can write bits of code uh

play05:17

it can even you know inside of chat GPT

play05:20

it can run them to check them uh there's

play05:22

there's a pretty good uh chat gbt is

play05:25

actually exceptionally good as well as

play05:26

the other large LS are exceptionally

play05:28

good at writing code

play05:31

um we also have a GitHub co-pilot uh

play05:34

which hopefully you all are familiar

play05:36

with this is probably you know within my

play05:38

workflow the thing that has given me the

play05:39

biggest productivity boost uh is having

play05:41

this uh llm driven code completion right

play05:44

there at my fingertips inside of my

play05:46

development workflow um I I absolutely

play05:48

love co-pilot it's a huge uh kind of

play05:51

design inspiration for uh our long-term

play05:53

Vision around open Devon uh this idea of

play05:56

something that really integrates

play05:57

directly into the developer workflow in

play06:00

a way that's uh invisible when it's not

play06:02

necessary but is always kind of there

play06:04

when you when you need it when it can be

play06:05

helpful and it's just you know you hit

play06:06

the Tab Key and you know you're just

play06:08

moving that much faster um I'm a huge

play06:11

huge fan of

play06:12

co-pilot uh but chat GPT and uh you know

play06:15

just a raw llm and co-pilot they don't

play06:18

they don't really uh you know satisfy

play06:20

every need here there's some stuff that

play06:22

that agents can do which make this much

play06:25

much

play06:26

better uh so agents play nicely with

play06:28

existing code BAS

play06:30

um you know the the tough thing about

play06:32

say working with uh you know if you're

play06:34

just like working directly inside of

play06:36

chat GPT uh it's really great at writing

play06:38

Green Field code if you're just like hey

play06:40

I need an algorithm that does X uh it'll

play06:42

write that function for you but if

play06:43

you're like hey I have this 100,000 line

play06:46

codebase and uh I need you to find the

play06:48

right file to edit within that codebase

play06:51

uh it's not really capable of doing that

play06:52

you can't really just like dump your

play06:54

whole codebase into the context window

play06:56

and have it try and pick out the right

play06:57

spot it's probably too big of a codebase

play06:59

for that context window um and it's

play07:01

going to cost you a lot of money to do

play07:03

that um agents can also embed directly

play07:07

into the developer workflow kind of

play07:08

similar to what we saw with with

play07:10

co-pilot uh I don't know if you all have

play07:12

had this experience but very frequently

play07:14

I have uh when writing code I will uh

play07:17

drop into chat GPT to try and you know

play07:19

write some function uh and then I copy

play07:21

the contents out of chat GPT into my uh

play07:24

code editor and then I run it and then I

play07:27

paste an error message back into chat

play07:28

GPT and there's this like really

play07:30

annoying copy paste going back and forth

play07:32

and uh with an agent it can really just

play07:35

embed into your existing system it can

play07:37

write code directly to your file system

play07:38

it can run the test for you uh so it's

play07:40

really it's a really neat way to kind of

play07:42

embed into the developer

play07:44

Loop uh it can run in this in this debug

play07:47

fix Loop which is something that like

play07:48

co-pilot doesn't really do well um right

play07:51

it can it can edit the code uh it can

play07:53

run tests it can you know run the code

play07:55

itself to see what the result is uh you

play07:58

know notice any problems and then fix

play08:00

itself uh and really it it works like a

play08:02

human engineer it can have access to a

play08:04

web browser to look at

play08:06

documentation um you know it's it's

play08:08

instead of doing this like kind of

play08:09

step-by-step process that you might do

play08:11

with chat GPT where you're picking off

play08:13

these small little bite-sized tasks you

play08:15

can feed an agent a very large unbounded

play08:19

task and it can break it down into

play08:21

manageable steps and make several back

play08:23

and forth trips to the large language

play08:25

model to achieve those you know more

play08:27

bite-sized tasks

play08:32

uh so overall you how do how do agents

play08:34

work um I see them as kind of a hub

play08:36

between several different uh uh data

play08:39

sources and um interaction points um so

play08:43

at the top here you have you have the

play08:44

user who's you know ultimately you know

play08:47

driving driving the agent deciding you

play08:48

know what do I want this agent to work

play08:49

on uh so they'll send the agent a task

play08:52

like you know please fix the test to my

play08:53

codebase please add this new feature

play08:56

whatever it is the agent can also pull

play08:58

in external data U so that might be uh

play09:01

you know code from your code base it

play09:03

might be a data from an external uh like

play09:06

knowledge base something like that uh

play09:08

could be external documentation uh un

play09:10

likee you know the API docs for a

play09:11

particular API that's that's

play09:14

out um it has access to a large language

play09:17

model and this is you know really the

play09:19

core uh you know it's kind of like the

play09:21

CPU for the agent right it's it's what's

play09:23

uh kind of driving the atomic

play09:25

instructions that the that the agent you

play09:26

know the atomic operations the agent is

play09:28

taking all goes through the

play09:30

llm uh and then it has access to some

play09:32

kind of runtime right it has access to a

play09:35

web browser to a command line to a file

play09:37

system where it can actually uh you know

play09:40

do the things that the llm is telling it

play09:42

to

play09:45

do here's here's kind of some pseudo

play09:47

code for you know what this what this

play09:49

Loop looks like um so it's it's

play09:52

basically a you know a loop that's built

play09:54

on top of the llm um so basically we

play09:57

take the current state of things uh

play09:58

which at the is probably just you know

play10:00

whatever the prompt the user gave um we

play10:04

send some kind of prompt to the llm and

play10:05

this is kind of where uh the the core um

play10:10

specifics of a particular agent live is

play10:12

is within this you know how do we

play10:13

generate this

play10:15

prompt uh we take the response from the

play10:17

llm and come up with some action to take

play10:20

uh we take that action in the runtime

play10:22

somehow uh and come up with an

play10:24

observation so that might be the

play10:25

contents of a file that might be uh the

play10:27

HTML output or a screenshot of a page it

play10:29

might be uh the result of running a

play10:31

command on the command

play10:33

line and then we update our state with

play10:35

that

play10:36

observation um so we basically add add

play10:39

you know the output of that command or

play10:40

the contents that file to the state and

play10:42

then we loop back to the beginning we

play10:44

generate a new prompt to send of the

play10:45

LM given the output of that most recent

play10:48

command or the results of that most

play10:49

recent action the LM can take one step

play10:52

forward to generate a new

play10:57

action uh at the end of the day it's

play10:59

it's really about context window

play11:01

management right so the llm has uh

play11:04

basically no no long-term memory it

play11:06

knows exactly what you're putting into

play11:07

the context window right now it has its

play11:09

World Knowledge that it's that it's

play11:10

learned in training uh and it has to use

play11:13

those two things to figure out how to

play11:15

advance uh Advance one step forward

play11:17

towards our goal um and so you as the as

play11:20

the agent Builder basically need to

play11:22

figure out how do I intelligently based

play11:24

on you know what what the LM has done so

play11:26

far or what the agent has done so far uh

play11:28

and based on what I know about the code

play11:30

base but I know about the user's

play11:31

intention what do I stick into the

play11:33

context window to move one step closer

play11:35

towards the goal uh that could be Co

play11:37

code it could be command output it could

play11:39

be a history of the actions that the

play11:41

agent has taken it could be a working

play11:43

plan that the agent came up with at the

play11:45

beginning of the process uh could be any

play11:47

number of things that you pull into that

play11:48

context window this is really where the

play11:51

like uh all the all the work in uh

play11:54

designing an agent goes into is you know

play11:56

at each turn of that Loop what am I

play11:57

going to stick into the context window

play11:59

for the llm in order to drive things

play12:01

forward One

play12:04

Step uh to kind of help with this uh

play12:07

with this you know building agents and

play12:09

um you know help with this Loop uh what

play12:11

we've done is is um we tried to abstract

play12:14

away as much of this as possible into a

play12:16

framework we that we call

play12:18

microagents um our our best agents are

play12:21

not microagents to be clear our best

play12:23

agents are written in Python they

play12:24

involve some complex Logic for moving

play12:27

from state to state as they progress to

play12:29

towards their task um but microagents

play12:31

are a very powerful way to take on nice

play12:34

little bite-sized tasks um in a way that

play12:37

uh kind of drives that that Loop that we

play12:39

just talked about of uh you know make a

play12:42

prompt take an action and then make a

play12:43

new prompt and so you can see here uh

play12:47

this is a micro agent that is actually

play12:49

part of the open Devon system today uh

play12:51

for generating a good commit message

play12:53

based on what's in the git staging area

play12:56

um so you can see we have this long form

play12:58

this long kind of prompt uh telling the

play13:00

agent you know what its job is um you

play13:03

know it's a software engineer its goal

play13:04

is to write a good commit message here's

play13:06

how it's going to find the code changes

play13:08

to describe uh and then you can see

play13:10

we've templated out a bunch of kind of

play13:12

the boiler plate that goes into uh

play13:15

creating an agent um so at each turn of

play13:17

the cycle the agent is going to get some

play13:20

instructions on uh you know what um what

play13:23

its history is is uh you know how to

play13:25

interpret this history Json that it's

play13:27

going to get and it's going to get a

play13:29

history of the last 10 things that it's

play13:30

done as well as the observations of

play13:32

those things so that might be you know

play13:33

it ran get diff and this was the result

play13:35

of the get diff command it uh opened up

play13:38

a particular file to read the contents

play13:40

and uh this is what it saw inside that

play13:42

file uh so it's going to see you know

play13:44

step by step what are the last 10 things

play13:45

that it's done uh and then it has three

play13:48

actions that are available to it it can

play13:49

run commands it can reject the task and

play13:52

say you know there's nothing in the get

play13:53

staging area or there's you know I'm not

play13:54

in a get repo I can't do this or I can

play13:57

finish and say you know here's what the

play13:59

messages and then it gets some

play14:01

instructions on uh basically telling it

play14:04

you have to respond to Json format uh

play14:07

the Json has to be in this structure it

play14:09

has to represent an action um and so

play14:11

basically you know this this these

play14:13

templates take care of a lot of kind of

play14:15

the boo plate of uh writing an agent um

play14:18

and this uh this will basically this

play14:21

prompt will run in a loop against the

play14:22

llm uh in order to drive towards this

play14:25

task of uh creating a commit message

play14:31

well so once once we you've got an agent

play14:32

built uh a really good question is how

play14:35

good is that agent um so a lot of work

play14:38

has been done in Academia to measure uh

play14:41

agent quality basically how how well

play14:43

does an agent uh solve uh tasks uh

play14:46

related to software

play14:48

engineering uh the biggest Benchmark

play14:50

here is called sbench it came out of

play14:52

Princeton uh and basically tests the

play14:55

ability of an agent to solve real world

play14:57

issues on GitHub um so these folks uh

play15:00

pulled you know over 2,000 issue issue

play15:02

PR pairs from popular python

play15:04

repositories and basically what they did

play15:06

was they looked for PRS where the pr uh

play15:10

added unit tests uh and added some code

play15:13

changes to address a particular issue

play15:14

that was available on on the repo um and

play15:17

so this makes it really easy to verify

play15:19

that the agent is doing the right thing

play15:21

basically you um you clone the repo at

play15:24

the place where it was before the issue

play15:26

was fixed you add the unit tests that

play15:28

were added within that PR uh you run the

play15:31

test once they fail predictably uh and

play15:34

then you tell the agent hey go and uh

play15:36

address the issue uh you know as it's

play15:39

written you know from GitHub like that's

play15:41

that's the prompt that's fed to the

play15:42

agent and then the agent is allowed to

play15:44

do its thing it's allowed to edit code

play15:46

it's allowed to browse the web Etc it

play15:48

makes a set of changes uh we apply those

play15:50

changes to the repo uh we run the test

play15:53

to see if they pass and if they pass you

play15:54

get a thumbs up if they don't pass you

play15:56

get a thumbs down uh and what we see is

play15:59

that uh on on uh Alo so to to continue

play16:03

forward a little bit there the sbench

play16:05

folks also released a subset of those

play16:07

issues uh of 300 issues that are a

play16:09

little bit more self-contained a little

play16:10

bit easier for agents to tackle uh and

play16:14

uh you know it's a bit smaller of a set

play16:15

this is actually a very expensive

play16:17

evaluation to run on say GPT 4 uh cost

play16:20

several thousand dollars you know

play16:22

roughly speaking depending on your agent

play16:23

and how it behaves uh to run a full

play16:26

evaluation so swe bench light uh for for

play16:28

for our eval has cost you know roughly

play16:30

$500 $600 to run uh so we've been

play16:33

running swe Bunch light to test our

play16:35

agents uh and roughly what you see is

play16:37

that uh state-of-the-art agents right

play16:39

now score around uh 15 to 20% on the

play16:43

sweet bench light Benchmark meaning they

play16:45

can they can fix about 15 to 20% of

play16:48

reasonably scoped issues that have been

play16:50

found on GitHub um so there's a ton of

play16:52

Headroom here right these agents are

play16:53

still failing at most issues um but it's

play16:57

also pretty impressive that they're able

play16:58

to for you know roughly a fifth of

play17:00

issues a sixth of issues they're able to

play17:02

get to a solution uh and if you think

play17:04

about you know if any of you manage a a

play17:07

codebase uh or an open source project uh

play17:10

being able to just automatically tackle

play17:12

20% of your issues be a huge step

play17:14

forward um so we're we're super excited

play17:17

about the progress here uh and we

play17:19

actually just announced uh two days ago

play17:22

uh our latest score on swe bench FL is

play17:26

21% um and that's a step up of three

play17:29

percentage points or roughly like I

play17:30

think

play17:31

177% uh in absolute terms over uh the

play17:35

next best agent in uh from an

play17:37

Academia um so really you know really

play17:40

quick movement here it's it's only in

play17:41

the last few months that these SCS have

play17:43

started being posted um so we're we're

play17:46

super excited about our progress and

play17:47

about the progress of the technology as

play17:49

a whole but again there's there's a lot

play17:51

of room for us to grow a lot of room for

play17:53

for these agents to get better and

play17:54

they're getting better every day

play17:59

Well, in terms of working with agents: for those of you who have seen the Devin demo, this user interface will look somewhat familiar. This is a snapshot of the OpenDevin interface. You can see it has a main chat window where you can interact with the agent and tell it what to do; it can ask for feedback, stop and ask for further instructions or directions, and check in to say, "Hey, am I going in the right direction, or do you want me to shift gears?" So you can interact with the agent there. It has access to a terminal where it can run commands, and to a code editor where you can see all the files currently in its workspace; it can read and write different files, and you can watch it as it writes them. It has access to a web browser where it can browse the internet. We've also recently added a Jupyter notebook, so if your task is more oriented toward "I'm just trying to analyze some data right now," you can work with the agent inside a notebook. So it's really a pair-programming environment with the agent.

But this isn't the only way we see folks interacting with agents. Personally, I'm looking for an agent that can integrate much more directly into my development loop. I don't really want to leave my Vim environment, go into a web browser, interact with an agent there, and then drop back into Vim once I'm done with the web browser. I want something a little more like Copilot, which can come with me as I edit code: help me make edits, help me run tests in the background, help suggest paths forward, things like that. So we're really trying to build an open platform where we can drive not just a web experience like the one on the previous slide, but VS Code plugins and Vim plugins; where we can interact directly inside of GitHub, so you could just tag an issue and OpenDevin would go fix it, or leave comments on a PR and OpenDevin would address those comments for you; and where you can interact with it on the command line or within a CI/CD environment. Really, we want agents to be able to interact with your codebase anywhere a software engineer would; they should be like a remote teammate who's collaborating with you.

Cool, and now I have a demo to share. We're going to hope the demo goes smoothly; I already know for a fact that I have ruined the file-system permissions on my current machine, because I'm trying to debug an issue with users who are running OpenDevin as root, so we're going to see it fail when it tries to edit an existing file. But first we'll start with a simple task, kind of my favorite hello-world task: a bash script that prints "hello". You can see that as it goes about its work it describes what it's doing, and you can watch it running these commands as they go. You should see hello.sh pop up here. Pretty straightforward, right? It just wrote a brand-new bash script. I can also ask it to make edits, so I can say: add a command-line arg for the user's name. You can do this kind of iterative development, where it achieves some task and then you say, okay, now I have an extra requirement I want to add. You can see it edited the script accordingly. Let's test it with the name Jerry, and we can see it can run it, test it, things like that.
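The finished script isn't shown in full in the transcript, but a plausible version of what the agent ends up with (hypothetical, reconstructed from the description) is a one-argument greeting:

```shell
greet() {
  # Greet the name passed as the first command-line argument,
  # falling back to a default when none is given.
  local name="${1:-world}"
  echo "hello $name"
}

greet Jerry   # prints: hello Jerry
```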

But writing a hello-world script is not the most exciting thing. What I really want to show you all is this: I'm going to attach OpenDevin not to an empty workspace, but to the OpenDevin codebase itself. I'll show you what I've done here: I just changed the name of an argument within the OpenDevin codebase from -n to -x, and that's going to break this argparser test; it's going to say it expected -n. So what I'm going to do is tell OpenDevin to work inside of this codebase. Basically, as you start the app, you just point it at whichever folder you want it to work in.

So I'm going to restart the browser and share my screen again. You can see we're now inside the OpenDevin codebase. Let me see if I can find my way around here: opendevin/core/config.py. This is the file I've edited with a problem; I changed this -n to a -x. And I'm going to say: change the code so that (what was the name of my test?) the argparser test passes.

By default the agent doesn't know anything about this codebase, so it's going to start by saying, okay, you gave me a file name, I'm going to look at that file. It's opened it to see the contents; you can see it digging through it on the command line here. Our core agent, the CodeAct agent, which recently set a new record on the SWE-bench Lite score, does a lot of really creative stuff to navigate through file contents a bit at a time instead of dumping whole files into the context window. That's one of the things that makes it extremely powerful: it can scroll through line by line and see what's there. It looks like it's now trying to set up its environment to make sure it can actually run the code here. This might take a little while; I think it's trying to install all the packages one by one. Okay, here it goes.

Just a question: can the human actually message stuff in the middle?

stuff in the middle

play25:54

right uh correct we're so we're working

play25:58

on um getting that uh working a little

play26:03

bit more strongly um so you can see like

play26:05

right now the agent so the agent has

play26:07

said okay I've identified the issue um

play26:10

the command line arguments uh was

play26:13

changed from DX to dashn uh and it's

play26:15

saying you know

play26:18

uh it's asking me to

play26:21

to actually scroll for it that's

play26:24

interesting uh that's the

play26:29

diagnosis pleas fix it see what it does

play26:32

so it will occasionally just stop and

play26:34

say hey like here's here's where I'm

play26:36

going uh please you know check my

play26:39

thinking let me know if I'm going in the

play26:40

right

play26:41

direction um but there should be this

play26:44

ability to like interrupt it and say

play26:46

like hey maybe you should instead of

play26:47

installing these packages one by one uh

play26:49

just run you know poetry install and

play26:52

it'll it'll figure everything

play26:53

out I think one of the questions from

play26:56

the audience uh was do you have some

play26:58

guard rails to prevent it from doing

play27:00

like rm-

play27:01

RF yeah so that's a that's a really

play27:03

great question and that's one of the big

play27:05

pieces of kind of our core platform here

play27:07

is that everything runs inside of a

play27:08

Sandbox it's uh it's extremely difficult

play27:10

to get things working properly if the

play27:13

agent is working inside of a Sandbox

play27:14

that's outside of your developer

play27:16

environment you need to make sure that

play27:17

as I was mentioning earlier file

play27:19

permissions right now on my file system

play27:21

are all screwed up because it's uh it's

play27:23

very difficult to make sure that the the

play27:26

edits that are being made in the sandbox

play27:27

are run the same user ID that that your

play27:30

user ID is and uh it takes a lot of

play27:32

Plumbing to make sure that everything's

play27:34

correct um so uh yeah it's it's that's a

play27:39

big piece of uh the investment we've

play27:41

made into making sure that the the agent

play27:43

is able to run things

play27:44

safely um and it's actually I think it

play27:47

just I show you my console here it just

play27:50

hit my file permission

play27:53

issue uh yeah you can see it's trying to

play27:56

edit the file but I have permission

play27:58

screwed up so it can't so it's just

play28:00

stuck here um but I'll I'll stop the

play28:02

demo there of course the demo Gods never

play28:04

never fully cooperate

play28:06
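The sandbox and UID "plumbing" described here can be sketched with a small helper (hypothetical; this is not OpenDevin's actual sandbox code, and sandbox-image is a placeholder). The host user's UID and GID are passed to the container so files written into the mounted workspace keep the right ownership:

```python
import os

def build_sandbox_cmd(workspace: str, command: str) -> list:
    """Build a docker invocation that runs `command` as the host user."""
    uid, gid = os.getuid(), os.getgid()
    return [
        "docker", "run", "--rm",
        "--user", f"{uid}:{gid}",          # match host ownership for edits
        "-v", f"{workspace}:/workspace",   # mount the agent's workspace
        "-w", "/workspace",
        "sandbox-image",                   # placeholder image name
        "bash", "-c", command,
    ]

cmd = build_sandbox_cmd("/tmp/ws", "pytest -q")
```

Getting this mapping wrong is exactly how a host checkout ends up with broken permissions, as in the demo.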

But yeah, this is great; it's a super cool demo. It seems like the UX is really coming together as well: you have a lot of the core components, like the chat interface, the browser, and the terminal, and you're able to stream the responses as outputs. That's super exciting. I want to go through some of the questions from the audience, but maybe the first thing I want to ask, because I think some of the questions also touched on this, is: could you give an overview of the current agent architecture, plus some examples of the micro agents? For instance, I know you mentioned CodeAct as one of the agent loops; if you could explain at a high level what it's doing, and maybe why it's good, and then show some examples of the micro agents, I think that'd be really helpful for the audience.

Yeah. So at the very lowest level we have this agent abstraction, which has a very small footprint. It basically says: at each point in time we pass in a state that contains a history of all the interactions that have happened; basically everything the agent might want to know about is in that state. The agent takes one step forward, generally making one call to an LLM in that step, and the step returns an action, which is one of about a dozen actions built into the system: things like run a command, read a file, write a file, browse the web, send a message, that sort of thing. So it's a very tightly constrained action space, with a very broad state passed into the agent. Under the hood the agent can do whatever it wants with that state to construct a prompt for the LLM, but typically it puts in a history of interactions, highlights the most recent ones, and then says something like: okay, the last thing you did was run a command; analyze the output and send a message about what you think happened there and what your next step should be. That's the rough loop that every agent follows.
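That loop (broad state in, one tightly constrained action out per step) can be sketched as a toy illustration; the class and action names here are hypothetical, not OpenDevin's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class State:
    # Everything the agent might want to know: the full history of
    # prior actions and observations.
    history: list = field(default_factory=list)

@dataclass
class Action:
    kind: str          # one of a small fixed set: "run", "read", "write",
    payload: str = ""  # "browse", "message", "delegate", "finish", ...

class EchoAgent:
    """Toy agent: inspects the state, returns exactly one action per step."""
    def step(self, state: State) -> Action:
        if not state.history:
            return Action("run", "echo hello")
        return Action("finish")

# The controller drives the loop: step, execute, record, repeat.
agent, state, taken = EchoAgent(), State(), []
while True:
    action = agent.step(state)
    taken.append(action.kind)
    if action.kind == "finish":
        break
    state.history.append(action)  # a real runtime appends the observation too
```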

The idea behind micro agents is to abstract away even more of that process. Instead of managing state at every turn of the loop, and instead of writing Python code to decide what goes into the prompt and constructing a big prompt string by hand, all you have to do to write a micro agent is write some markdown, with template strings for where the history goes, what kinds of actions the agent should be trying to run, that sort of thing. I can share my screen and show what a few of those micro agents look like. We're really looking for contributions here: what I'm hoping for is a library of hundreds or thousands of micro agents that we can delegate to for simple tasks, like opening up a PR on GitHub, or writing a unit test to address the changes in the git staging area, things like that.
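A micro agent in that spirit might look like the following markdown, a hypothetical sketch of the template-string idea rather than a file from the repo:

```markdown
# TypoFixerAgent

You fix typos in plain-text docs like READMEs. Work on one file at a time.

## Actions
You may use: read, write, finish.

## History
{{ history }}

## Instructions
The latest observation is above. Choose exactly one action and briefly
explain why before taking it.
```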

All of our agents live in this AgentHub directory inside the OpenDevin repo, and the micro agents all go in this folder. For instance, we have this math agent: if there's a bit of math that needs to be done, say we need to figure out the area of a circle, it will basically write some Python code, run it, and stick the answer into the output. We have an agent that manages Postgres migrations, which I haven't actually used; we just wrote it as an example. We have an agent that will look through a given GitHub repository and investigate what the repository does, what it's about, and what the code structure is like, and it can pass that on to the next agent that's actually supposed to write code changes. There's even an agent that fixes typos in plain-text docs like READMEs. And we have an agent that verifies any changes that have been made: it takes the original task, is given a codebase where some changes have been made, and has to run commands to make sure those changes are good. They can be as big or as small as you want, so I'm excited about what the community might be able to do with micro agents, contributing these kinds of bite-sized tasks that agents can take on.

And what's neat is that any agent, as part of its workflow, has delegation available: one of the big actions it can take, along with reading files, writing files, and browsing the web, is delegating to another agent. So it can say: okay, here's a subtask I want to pass off to the math agent, because I need to know the result of this complex math problem in order to move forward. Or: I can delegate out to the Postgres agent to write a Postgres migration for me, because it's going to do a better job than the very vanilla CodeAct agent can.

This is super interesting. Actually, one of the related questions I was going to ask is: how are these micro agents invoked from a higher-level agent reasoning

loop?

Yeah, so it all goes through that delegate action, and I can actually show you: we have this delegator agent, and it's just Python. You can see here's the step function I was talking about before, where it gets passed a state and generates an action. Basically what the delegator agent does is decide: okay, what was the last thing I did? If I was delegating to the agent that studies the repo, the next thing I'll do is delegate over to the agent that writes code. If the last thing I was doing was writing code, then I delegate over to the agent that does verification. If the last thing I was doing was verifying, and the verifier said "yep, you're good," then I finish; otherwise I kick things back to the coder and tell it, hey, here's what's wrong, here's what the verifier found. So the delegator agent only delegates out to other agents, and it's basically managing this in a loop: start by studying the repo, then move back and forth between coder and verifier in order to move toward the end
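Reduced to its routing rule, that delegator logic looks roughly like this (a hypothetical condensation; the real agent also packages up tasks and verifier feedback):

```python
from typing import Optional

def next_delegate(last: Optional[str], verified_ok: bool = False) -> str:
    """Pick the next sub-agent from whichever one just finished."""
    if last is None:
        return "StudyRepoAgent"        # start by studying the repo
    if last == "StudyRepoAgent":
        return "CoderAgent"            # then write code
    if last == "CoderAgent":
        return "VerifierAgent"         # then check the changes
    if last == "VerifierAgent":
        # finish if the verifier approved; otherwise back to the coder
        return "finish" if verified_ok else "CoderAgent"
    raise ValueError(f"unknown agent: {last}")
```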

goal.

And I'm assuming the delegator agent can also delegate out to micro agents, or is that handled by one of the sub-agents?

Well, this delegator agent is kind of a demo of the delegation machinery. Hypothetically it could delegate out to any agent; the way it's coded right now, it just has these three agents hardcoded.

I can also show you: within the micro agents there's this manager agent, which is a little easier to read in raw format. It's given the delegate action as one of the actions it can take, and we say: here's the list of all the agents in the OpenDevin system that you might want to delegate out to, and we describe each one. This isn't something that's working super well today; it's just a lot of random agents strewn together, and we could definitely do a much better job of guiding it toward which agent is good for which task. I think it's basically just guessing based on the agents' names right now. But this is what that looks like: you give it a list of agents and say, here's everybody you have access to, who do you want to delegate to next?

And the last question on this topic, which is an audience question: when you think about agent orchestration, with a delegator agent, sub-agents, and coordination between them, do you think about it more as letting the agent figure out how to orchestrate between the sub-agents, or do you explicitly define flows yourself as a programmer? Do you hardcode "I must go through this, then this," or do the agents figure that out dynamically?

What's cool about the system is that it allows you to work through both workflows. Right now I actually have a PR open to create an agent that will take whatever changes are currently staged and open up a GitHub PR. It's a micro agent with a very hardcoded workflow: add whatever file changes are there right now (that's just a hardcoded command), and then it has sub-agents for coming up with a good commit message, coming up with a good branch name, and figuring out where the upstream repo and the fork repo are, or whether you should be pushing directly to the upstream repo on a branch. So it figures those pieces out dynamically, and then it sends a request to the GitHub API to open up the PR. That's a pretty hardcoded workflow. But as agents get better and better, they should be able to take on a lot more of that without having to hardcode everything. The nice thing about hardcoding is that it gives you a lot of control as a developer over exactly how the agent is going to behave and what it's going to do, and a lot of transparency into what's going to happen. Leaving things more open-ended requires less upfront work and investment, but the agent might do something a little less predictable. So we really see supporting both workflows as important.

Great, the next question

from the audience: given that the performance on SWE-bench Lite is 21%, what are some of the main issues you're seeing, like common sources of hallucination or common failure modes for OpenDevin? This relates to the general topic of challenges with the current agent architecture.

Yeah, great question.

I'd say, first off, we're still analyzing our eval results; we haven't gone through all the test cases yet to figure out where it's falling down and where the low-hanging fruit is for getting the next five or ten percentage points on the benchmark. But I'd say the broad pattern is: edits that require a lot of meta-level knowledge about how the codebase works and fits together. It's very easy to fix a unit test where you're just editing a single function, kind of like the example I showed earlier: a unit test is failing and it just needs a one-line edit to an existing function, so the agent can find its way around the codebase, find the right file to edit, and make the right edits. But something like "hey, we want to add this new feature," where it involves adding a database migration, then changing the API to work with the new database structure, then changing the front end to ingest that new API, is a very common kind of GitHub issue: please add a new feature that involves front-end work, back-end work, and database work. Figuring out how to make those large-scale edits across an existing codebase is really where the current agents have room for improvement.

Great. The next question (sorry, I'm trying to find it) is: what's the minimum compute recommended to run this locally?

So OpenDevin itself does not

need much compute: it basically runs commands in a Docker container, and all the heavy compute is going on in the LLM, which is behind an API. We use LiteLLM, a Python library that lets us interface with any LLM through the same function. So we can run on OpenAI, we can run on Anthropic's Claude, we can run on any of the Azure LLMs, and we can run against Ollama locally; we're basically LLM-agnostic.
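The LiteLLM pattern is a single completion() entry point, with the backend selected by the model string. The sketch below only builds the request dict; actually calling litellm.completion(**req) would require the matching provider's API key:

```python
def make_request(model: str, task: str) -> dict:
    # One request shape for every provider; swapping backends is just a
    # different model string, e.g. "gpt-4", "azure/<deployment>",
    # or "ollama/llama2" for a local model.
    return {
        "model": model,
        "messages": [{"role": "user", "content": task}],
    }

req = make_request("gpt-4", "Fix the failing argparser test")
# litellm.completion(**req) would dispatch this to the matching provider.
```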

What I will say is that GPT-4 and Claude are far and away the best. The smaller local LLMs tend to get stuck in loops and tend not to make the right changes. The power of the underlying LLM is really what drives the power of the agent; we're only as powerful as the LLM you're using. So while folks are interested in running locally with Llama 7B and the like, that tends to only work for really simple stuff.

One question from my end:

you mentioned that one of the challenges of building this is context-window management, figuring out how to put different things into the prompt. What are the specific components you're putting into the prompt? And how are you indexing the codebase; are you doing retrieval, that type of thing, to actually return relevant context?

Yeah, that's a great question.

So I think the hardest part of context-window management is managing the history of interactions, so the agent can see "here's what I've done up to now" and figure out its next step. That works really well up to a point, but eventually the history gets so long, especially if you're reading the entire contents of files, that you have to intervene. One thing we've experimented with is progressively summarizing the history: as it gets longer and longer, summarize the first half, so you cut down on the amount of context you're shoving into the window, on the assumption that those early actions are no longer very relevant to what the agent is working on now. So that's one bit of context-window management we've worked on.
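A minimal version of that progressive summarization, with a placeholder where a real system would ask the LLM to write the summary:

```python
def condense(history: list, limit: int) -> list:
    """Fold the older half of the history into one summary entry
    once the transcript outgrows the context budget."""
    if len(history) <= limit:
        return history
    half = len(history) // 2
    summary = f"[summary of {half} earlier steps]"  # an LLM would write this
    return [summary] + history[half:]
```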

In terms of pulling in code: our initial agent would just dump the entire contents of a file into the context window. If the agent said "I need to look at foo.py," we just dumped foo.py into the context window, and sometimes foo.py is 10,000 lines of code, so you immediately hit the context-window limit. So one thing we've done is that the CodeAct agent exposes some functions for searching through code and for looking at files just a chunk at a time, scrolling up and down inside a file with these dynamic commands. That works really, really well for grabbing the part of the file that's relevant to what the agent is doing, without polluting the context window with a bunch of irrelevant information. So it can search the code.
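The chunk-at-a-time file viewing can be sketched like this (a hypothetical helper, not the CodeAct agent's real command interface):

```python
WINDOW = 100  # lines exposed to the model per view

def view(lines: list, start: int) -> str:
    """Return one window of the file plus a header locating it."""
    end = min(start + WINDOW, len(lines))
    header = f"[lines {start + 1}-{end} of {len(lines)}]"
    return "\n".join([header] + lines[start:end])
```

Scrolling is then just calling view again with a shifted start offset.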

information uh so it can search it can

play43:32

um uh you know it can dig through the

play43:34

context window I think one uh bit of

play43:37

loing fruit that we have not yet

play43:39

explored is basically ingesting the

play43:41

whole uh the whole code base into a

play43:44

vector database uh which we we have been

play43:47

experimenting with with FL index for um

play43:50

in order to do some more Dynamic

play43:52

retrieval of relevant Snippets of code

play43:54

for the agent to work with

play43:57

Yeah, makes sense. So right now you're representing it as code-search functions the agent can call, and it's relatively heuristic-based; I'm assuming when you search for things it's keyword search, not vector search, but if you actually pre-index it with a bunch of vectors, then you can do semantic-similarity lookup.

Yeah, I think we're literally using grep under the hood, so it's exact string match.

Makes sense; grep is a pretty powerful baseline, for what it's worth. I was just curious how you were doing it.

Okay. And so, in terms of the UX, you mentioned some initial set of tools: you have a code editor as well as a terminal. Of course, the full Devin app has web search and potentially other tools as well. Are there plans to expand the capabilities toward solving the more general SWE-bench problems, as opposed to SWE-bench Lite, or toward being able to pull in external information? And if so, how do you plan to do that?

Yeah, so we have a couple of folks at

Yeah, so we have a couple of folks at CMU who are actively working on the browsing functionality. They've done work on browser agents separate from coding agents, and they're very interested in what they can do to push our ability to browse the web, pull in API documentation, and so on. We've integrated BrowserGym at this point, but the CodeAct agent is not yet using the browser, which is probably some of the lowest-hanging fruit for improving it. That's one of our next steps for pushing the CodeAct agent forward to solving more of the SWE-bench Lite problem set.

We definitely want to get to the fuller SWE-bench problem set, and we also want to do a full SWE-bench eval of OpenDevin to see how it performs relative to Devin itself, which has published at least a partial SWE-bench score (I think they evaluated about 25% of the SWE-bench set), and how it compares to other academic agents like SWE-agent. The goal is definitely to take on the full set: any issue you find on GitHub, any issue a software engineer could take on, we want to be able to take on with our automated agents. You're probably going to see years of effort going into that: some very quick pace of development over the next six months or so, and then a long tail of really difficult issues. That last 20% of issues will probably take several years to fully take down.

And it seems like you have a few agents in OpenDevin: you have CodeAct, you have all these micro-agents, you have browser agents that people are working on. You mentioned the overall goal is to have a platform for AI engineers to just pull in components to build autonomous software engineers. Are you able to elaborate on that vision? What is the unit of abstraction that you expect a developer to pull in to build something? Is it that they can import an agent from a suite of different agents that you offer? Is it that they can clone this entire thing and just modify the UX? I'm curious how you're thinking about that.

Yeah, so there are kind of two sides to the platform. We want to appeal to the folks who are building agents, the people at the bleeding edge of what agents can do, and we want to appeal to the end users who are running these agents. On that first side, what we really want to do is take care of all the nitty-gritty work that goes into building and evaluating agents. That means, first, having an opinion on what the agent loop looks like, how actions are structured, things like that; and second, providing all the infrastructure, like the Docker sandbox for running commands, so you know this thing's not going to rm -rf your root directory. So: providing all that base infrastructure for running the agents as you're building and testing them.
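The sandbox idea can be sketched by wrapping each shell command in a throwaway Docker container, so a destructive command only touches the container's own filesystem. This is a minimal sketch, assuming Docker is installed; OpenDevin's real sandbox is more elaborate (for example, it keeps a long-lived container), and all names here are illustrative:

```python
import subprocess

def sandboxed_command(cmd: str, image: str = "ubuntu:22.04",
                      workdir: str = "/workspace") -> list[str]:
    """Build a `docker run` argv that executes cmd inside an isolated,
    throwaway container: no network access, deleted after it exits."""
    return [
        "docker", "run",
        "--rm",                # remove the container when the command finishes
        "--network", "none",   # no outbound network access from the sandbox
        "-w", workdir,         # working directory inside the container
        image,
        "bash", "-lc", cmd,    # the agent's command runs under bash
    ]

def run_sandboxed(cmd: str) -> "subprocess.CompletedProcess[str]":
    """Execute the command in the sandbox and capture stdout/stderr."""
    return subprocess.run(sandboxed_command(cmd), capture_output=True, text=True)
```

Even if the agent emits rm -rf /, the damage is confined to a container that is discarded anyway.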

play47:55

We've also packaged up SWE-bench (SWE-bench Lite so far, with full SWE-bench in progress) into a very easy-to-run pipeline. We want to make it easy for anybody who has built an agent to wrap it in our interface, which should be generic enough to handle any agent out there, and very quickly run an eval to see how they do compared to other folks. What we'd love is for everybody to consolidate on OpenDevin as a standard for how you write agents. Under the hood they could be doing very different things and be structured in very different ways, but as long as they conform to this very simple interface (pass in the state, step forward one step, and produce an action at each step), they'll be able to plug into all this extra architecture that we've built for running the agents, evaluating the agents, and so on.
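That "pass in state, step forward, produce an action" contract can be sketched as a tiny interface plus a controller loop. The class and field names below are illustrative, not OpenDevin's actual types:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field

@dataclass
class Action:
    """Something the runtime executes on the agent's behalf."""
    kind: str          # e.g. "run", "edit", "finish"
    payload: str = ""

@dataclass
class State:
    """The user task plus the accumulated history of actions so far."""
    task: str
    history: list = field(default_factory=list)

class Agent(ABC):
    """Minimal contract: given the current state, produce the next action."""
    @abstractmethod
    def step(self, state: State) -> Action: ...

class EchoAgent(Agent):
    """Toy agent: finishes immediately by echoing the task back."""
    def step(self, state: State) -> Action:
        return Action(kind="finish", payload=state.task)

def run(agent: Agent, task: str, max_steps: int = 10) -> Action:
    """The controller loop: call step() until the agent finishes."""
    state = State(task=task)
    for _ in range(max_steps):
        action = agent.step(state)
        state.history.append(action)
        if action.kind == "finish":
            return action
    return Action(kind="timeout")
```

Because the controller depends only on step(), any agent honoring this contract can be dropped into the same runner and evaluation pipeline.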

play48:57

And then on the user side, right now, like I said, we have a web interface, but what we're working on now is a re-architecture to enable many different types of interfaces, whether it's a CLI or a VS Code plugin, to interact with these agents, pull in their results, run the actions, things like that.

Yeah, super exciting. Take us through, in the last few minutes: what's coming next? What are you excited about? What does the general landscape look like? Is it just adding more agents and better agents, adding better UXes? What are you excited about in the next few months?

play49:40

Yeah, I think one of the things that's most exciting right now is just being able to push the agent itself forward. Like I said, we just announced our results on SWE-bench Lite the other day, which were super exciting and got a bunch of excitement from the community, which is great. And we still know of a lot of low-hanging fruit for pushing that agent forward. Xingyao, a PhD student at Illinois, is the brains behind the CodeAct agent and has published a paper around it; he's going to be working on pushing that core ability further and further, so I'm very excited about that. And then there's the tailwind of LLMs themselves getting better: when GPT-5 comes out, our tech is going to get that much better, because we'll just flip a switch, start using the latest LLM, and the agents will naturally improve as a result. So I'm very excited about where the agents themselves can go.

But in addition to that, the thing I'm most focused on now, while Xingyao is focused on pushing the agent forward, is building this more scalable, more generic architecture where we can start interacting with agents through more than just that web interface: it can be on GitHub issues, it can be on the command line, it can be inside your code editor. That's what I'm really excited about personally: being able to integrate these agents, and especially the CodeAct agent, into my personal day-to-day workflow.

Makes a ton of sense. And the last question: we've been talking about agents and orchestration, so how do you and the other contributors orchestrate among yourselves? Is it a decentralized thing? Do you meet once a week? And for the audience: if they're interested in contributing to OpenDevin, how can they become a contributor?

play51:39

Yeah, I would say it is pretty decentralized, especially now that we have an evaluation pipeline set up. My hope is that it becomes a bit of a competition, kind of like the Hugging Face leaderboard: we can tell who has the best agent. So if you contribute an agent and you beat CodeAct on the SWE-bench Lite score, you'll probably become our new default agent. We really want to get that friendly-competition feeling going. And what's fun is that you could build an agent that delegates 60% of its tasks out to the CodeAct agent and the other 40% out to a different agent, and as long as you're managing that delegation better than anybody else, you'll do better and become the default agent. So it's a decentralized, data-driven way of deciding what the best agent and the best architecture out there are.

And the best way to contribute is to just open up a pull request, or maybe file an issue and let us know what you're thinking about building and what kind of feature you want. We have a ton of open issues labeled as good first issues, a ton of potential feature requests, and a bunch of agents that have been built in academia, interesting agent architectures where we have open issues saying, hey, let's build an agent that implements this and see how it does. So there's tons of room for folks to contribute here.
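The delegation pattern described above (routing a share of tasks to one sub-agent and the rest to another) can be sketched as a simple dispatcher. All names are illustrative; the toy handlers stand in for real sub-agents such as a coding agent and a browsing agent:

```python
def make_delegator(routes):
    """Compose agents by delegation: routes is a list of
    (predicate, handler) pairs, and the first matching handler wins."""
    def delegate(task: str) -> str:
        for matches, handler in routes:
            if matches(task):
                return handler(task)
        raise ValueError(f"no agent can handle: {task}")
    return delegate

# Toy handlers standing in for real sub-agents.
coding_agent = lambda task: f"coding-agent handled: {task}"
browsing_agent = lambda task: f"browsing-agent handled: {task}"

delegate = make_delegator([
    (lambda t: "http" in t, browsing_agent),  # web tasks go to the browser agent
    (lambda t: True, coding_agent),           # everything else goes to the coder
])
```

Improving the routing predicate is exactly the lever the competition rewards: better delegation yields a better leaderboard score.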

Great, well, I think that's a great way to conclude this webinar. Thanks so much, Robert, for joining. For those of you who want to check out OpenDevin, please check out the repo, and check out the website as well, the docs, and the demo. We'll have this on YouTube very shortly. So thank you again, Robert, for your time, and thanks for joining.

Thanks so much for having me.
