New OPEN SOURCE Software ENGINEER Agent Outperforms ALL! (Open Source DEVIN!)

TheAIGRID
2 Apr 2024 · 16:02

Summary

TLDR This video introduces an advanced open-source software engineering agent that autonomously solves issues in GitHub repositories, achieving accuracy similar to Devin's on the software engineering benchmark while averaging 93 seconds per task. The agent is fully open source and uses an interface designed to make it easy for GPT-4 to edit and run code. The video also covers how open-source and closed-source models compare, how the agent works, its newly designed interface, its information limits, its extensibility for future research, its cost-effectiveness, and the potential use of open-source models.

Takeaways

  • 🚀 The release of this open-source software engineering agent marks a new system for autonomously solving issues in GitHub repositories, with accuracy similar to Devin's but fully open source and averaging just 93 seconds per task.
  • 🔍 The open-source agent scores within 1.55 percentage points of Devin on the software engineering benchmark, showing that an open-source project can achieve remarkable results in a short time.
  • 💡 The agent works by interacting with a specialized terminal that supports file browsing, editing, syntax checking, and test execution, underscoring the importance of an interface designed to be friendly to GPT-4.
  • 🌟 Limiting how much of a file the AI system sees at once (e.g., only 100 lines at a time) improves its efficiency and accuracy, likely by helping the model process and understand the task.
  • 🔧 The agent's design allows easy configuration and extension, supporting future research on software engineering agents.
  • 🔗 A demo link is provided that gives a clear view of the agent's workflow and inner workings.
  • 📜 A paper is expected on April 10 detailing the technical specifics, benchmarks, model fine-tuning, and effectiveness experiments.
  • 💰 Although complex agentic tasks can be costly to run, the agent caps each task at $4, with average spend below that.
  • ⏱️ The agent solves issues in 93 seconds on average, demonstrating highly efficient performance.
  • 📈 Despite open-source models' advantages in privacy and local execution, closed-source models remain the first choice for now because of their stronger performance and the massive investment behind them.

Q & A

  • Why is the release of this open-source software engineering agent significant?

    -The release is significant because the agent matches the performance of the previously released Devin while using far less capital and time. It shows that the open-source community can make rapid, substantial technical progress and may eventually overtake commercial closed-source solutions.

  • How does the open-source agent's performance compare with Devin's?

    -On the software engineering benchmark the open-source agent's accuracy is close to Devin's: Devin scores 13.84% and the open-source agent 12.29%. The two perform almost identically, but the open-source agent was developed at lower cost and in less time.

  • How does the open-source agent work?

    -The agent works by interacting with a purpose-built terminal that lets it open, scroll through, and edit files, run syntax checks, and write and execute tests. This interface, optimized for GPT-4, is critical to the agent's performance.
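
    As an illustration of the thought-action-observation loop described above, here is a minimal sketch in Python. The `LLM` and `Terminal` interfaces, the command names, and the step budget are hypothetical stand-ins, not SWE-agent's actual API:

```python
from typing import Optional, Protocol, Tuple

class LLM(Protocol):
    def next_step(self, history: list[str]) -> Tuple[str, str]:
        """Return (thought, command) given the transcript so far."""
        ...

class Terminal(Protocol):
    def execute(self, command: str) -> str: ...
    def diff(self) -> str: ...

def run_agent(llm: LLM, terminal: Terminal, issue: str,
              max_steps: int = 40) -> Optional[str]:
    """Drive the model through a think -> act -> observe loop until it submits."""
    history = [f"ISSUE: {issue}"]
    for _ in range(max_steps):
        thought, command = llm.next_step(history)  # model picks one command, e.g. "open foo.py"
        if command.strip() == "submit":
            return terminal.diff()                 # final patch for the repo
        observation = terminal.execute(command)    # run it in the sandboxed repo
        history.append(
            f"THOUGHT: {thought}\nACTION: {command}\nOBSERVATION: {observation}"
        )
    return None  # step budget exhausted without a submitted fix
```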

  • Why do language models need a well-designed agent-computer interface?

    -For a language model to work effectively, it needs a friendly agent-computer interface. Just as humans benefit from good user interface design, such an interface helps the model understand the task and receive timely feedback, avoiding mistakes and improving efficiency.
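
    The "editor that prevents mistakes" idea can be sketched as an edit command that refuses to save a syntactically broken file and hands the model feedback instead. This is a hypothetical illustration only (it checks Python files with `ast.parse`; SWE-agent's actual mechanism may differ):

```python
import ast

def edit_lines(path: str, start: int, end: int, new_text: str) -> str:
    """Replace lines start..end (1-indexed, inclusive); reject edits that break syntax."""
    with open(path) as f:
        lines = f.readlines()
    replacement = [line + "\n" for line in new_text.splitlines()]
    candidate = lines[:start - 1] + replacement + lines[end:]
    try:
        ast.parse("".join(candidate))   # cheap syntax check before committing
    except SyntaxError as err:          # IndentationError is a subclass of SyntaxError
        # The edit is NOT applied; the model sees actionable feedback instead.
        return f"Edit rejected, file unchanged. Line {err.lineno}: {err.msg}"
    with open(path, "w") as f:
        f.writelines(candidate)
    return f"Edit applied: {path} lines {start}-{end}."
```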

  • What is the effect of limiting how many lines of a file the AI system can view?

    -Restricting the system to 100 lines of code at a time, rather than 200 or 300 lines or the entire file, improves its efficiency. Less information likely reduces the complexity the model has to handle, letting it work in a more focused and effective way.
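
    A windowed file viewer of the kind described might look like the following sketch. The 100-line default comes from the description above; the function name, line numbering, and scroll hints are assumptions for illustration:

```python
WINDOW = 100  # lines shown per view; the reported sweet spot

def view_file(path: str, first_line: int = 1, window: int = WINDOW) -> str:
    """Return a numbered, window-sized slice of the file plus scroll hints."""
    with open(path) as f:
        lines = f.readlines()
    start = max(first_line, 1)
    end = min(start + window - 1, len(lines))
    body = "".join(f"{n}: {line}"
                   for n, line in enumerate(lines[start - 1:end], start))
    header = f"[{path}] showing lines {start}-{end} of {len(lines)}"
    footer = ("(use scroll_down/scroll_up to move the window)"
              if len(lines) > window else "")
    return "\n".join(part for part in (header, body.rstrip("\n"), footer) if part)
```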

  • How does the open-source agent support future research and development?

    -Because the agent is fully open source, anyone can experiment with and improve it, contributing new ideas for how agents interact with computers. This openness may attract more developers and companies and accelerate progress on agent technology.

  • Where can the demo of the open-source agent be found?

    -A demo link is available on the project's web page. Through it, users can watch the agent actually solve software engineering issues, including its steps in the workspace and its terminal operations.

  • When will the agent's technical details be published?

    -The technical details are expected on April 10. The paper will describe how the agent works, the benchmarks used, how the model was fine-tuned, and the team's initial experimental results.

  • What is the average cost of running one task?

    -The cost of running a task is capped at $4, and actual spend is usually lower. This cost control matters for making the technology viable in everyday use.
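
    A per-task budget cap like the $4 limit mentioned above could be enforced with a small cost tracker. The token prices and call interface below are illustrative assumptions, not published figures:

```python
class BudgetExceeded(RuntimeError):
    pass

class CostTracker:
    def __init__(self, limit_usd: float = 4.00,
                 usd_per_1k_in: float = 0.01, usd_per_1k_out: float = 0.03):
        self.limit = limit_usd
        self.rate_in, self.rate_out = usd_per_1k_in, usd_per_1k_out
        self.spent = 0.0

    def charge(self, tokens_in: int, tokens_out: int) -> None:
        """Record one model call; abort the task once its budget is blown."""
        self.spent += (tokens_in / 1000) * self.rate_in \
                    + (tokens_out / 1000) * self.rate_out
        if self.spent > self.limit:
            raise BudgetExceeded(f"spent ${self.spent:.2f} > ${self.limit:.2f} cap")
```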

  • How long does the open-source agent take to solve a task on average?

    -The agent takes 93 seconds on average, much faster than earlier systems such as Devin, which reportedly could take 5 to 10 minutes.

  • Which models does the open-source agent mainly use?

    -Although the agent itself is fully open source, it currently relies mainly on closed-source models such as GPT-4 and Claude Opus because they are stronger. Open-source models have advantages in privacy and local execution, but at this stage closed-source models are preferred thanks to larger investment and higher effectiveness.

Outlines

00:00

🚀 A breakthrough in open-source software engineering agents

This section introduces the release of an advanced open-source software engineering agent that automatically solves issues in GitHub repositories, comparable to Devin, the software engineer that previously took the industry by storm. The video presents 10 key points about the agent, covering what it does and how well it works. Its accuracy on the software engineering benchmark is similar to Devin's, it averages just 93 seconds per task, and it is fully open source. A new agent-computer interface was also designed to make it easier for GPT-4 to edit and run code.

05:00

🌟 Open-source vs. closed-source models and the new design

This section compares the open-source agent with closed-source systems such as Devin, noting that the open-source project achieved similar results in a shorter time, showing the potential of the open-source community. It also introduces a key design insight: language models need a carefully designed agent-computer interface to work effectively. By limiting how many lines of a file the AI system sees, the team found that viewing only 100 lines at a time worked better than viewing more, probably because the reduced complexity lets the model process the information better.

10:01

🛠️ How the software engineering agent works

This section explains in detail how the agent works by interacting with a specialized terminal: it can open, scroll through, and edit files, execute tests, and run automatic syntax checks. The agent solves problems through a loop of thought, action, and observation. This inner mechanism shows that an open-source software agent is capable of long-term, or at least iterative, planning, which matters for the future development and study of software engineering agents.

15:03

🔍 Demo and technical details

A link is provided to a demo of the agent that reveals its inner workings: finding relevant files, editing code, and executing tests. The section also mentions the forthcoming technical paper, expected on April 10, which will give more information on how the agent works, the techniques it uses, the benchmarks, and initial experimental results.

Keywords

💡Open-source software

Open-source software is software whose source code is publicly accessible and can be freely used, modified, and distributed. The video presents an advanced open-source software engineering agent as a milestone: it reaches accuracy on the software engineering benchmark similar to Devin's but in less time, demonstrating the open-source community's potential for efficient innovation.

💡Software engineering agent

A software engineering agent is an AI program that can automatically perform software engineering tasks such as writing, debugging, and testing code. The new open-source agent in the video is designed around an interface that lets GPT-4 edit and run code, showing AI's potential in software engineering.

💡GPT-4

GPT-4 is an advanced language model in OpenAI's generative pre-trained transformer series. It is mentioned because the new agent's interface was designed specifically so that GPT-4 can edit and run code more effectively.

💡Benchmarking

Benchmarking is a way of evaluating and comparing system performance, usually with standardized test cases. In the video, the open-source agent is benchmarked against Devin to show its performance on software engineering tasks.

💡User interface design

User interface design is the process of creating interfaces that are easy to use and understand, focusing on the interaction between user and system. The video stresses the importance of designing a friendly agent-computer interface for language models, analogous to good UI design for humans.

💡Long-term planning

Long-term planning refers to a sequence of deliberate, strategic actions toward a future goal. In the video, the agent demonstrates this through its terminal interactions: it thinks, acts, observes the result, and then thinks and acts again.

💡Configuration and extension

Configuration and extension mean modifying a system to meet specific needs or add new capabilities. Because the agent is open source, anyone can configure and extend it, letting the community contribute new ideas and drive the technology forward.

💡Limiting information

Limiting information means deliberately reducing the volume or complexity of input while performing a task. The video notes that restricting the AI system to viewing 100 lines of code at a time, rather than an entire file, improves its efficiency and accuracy.

💡Cost-effectiveness

Cost-effectiveness means pursuing the greatest benefit while keeping costs in check. The agent is designed to balance cost and performance, capping the average cost per task at $4, which makes it more viable in practice.

💡Technical details

Technical details are the specific, in-depth information about how a technology works, including its design and implementation. The video mentions a forthcoming paper that will provide this depth: how the agent works, the benchmarks used, and the experimental results.

💡Closed-source models

Closed-source models are software or systems whose source code is not public, typically developed and maintained by private companies. The video notes that closed-source models such as GPT-4 and Claude Opus currently outperform open-source ones, which is why they were chosen for the agent.

Highlights

A newly released, advanced open-source software engineering agent autonomously solves issues in GitHub repositories

The agent's accuracy on the software engineering benchmark is similar to Devin's

The open-source agent solves issues in 93 seconds on average

A new agent-computer interface was designed to make it easy for GPT-4 to edit and run code

On the benchmark, the open-source agent scores slightly below Devin

The open-source team achieved remarkable results in less time and with less capital

The agent works by interacting with a specialized terminal

The agent-computer interface is critical to performance

LMs need a carefully designed agent-computer interface to work efficiently

Limiting how much of a file the AI system views improves performance

The open-source agent is easy to configure and extend

Community contributions may accelerate the agent's development

A demo link shows the agent's workflow step by step

Technical details will be published in a paper on April 10

The average cost of running a task is capped at $4

On average, solving a task takes 93 seconds

Closed-source models are used for now because they are stronger

Open-source models still lag behind closed-source ones

Transcripts

00:00

So there has been an announcement of an advanced-level open-source software engineering agent, and you can see here that this is really striking, because it was only recently that we had Devin become the first autonomous software engineer, and it was something that took the industry by storm. So in this video I'm going to be giving you guys 10 of the key takeaways on this open-source agent and what it is able to do effectively. Here we can see the announcement. It says: SWE-agent is our new system for autonomously solving issues in GitHub repos. It gets similar accuracy to Devin on the software engineering benchmarks, takes 93 seconds on average, and it's open source, and we designed a new agent-computer interface to make it easy for GPT-4 to edit and run code. So let's take a look at the 10 things that you should know.

00:54

One of the first things that you should know is that it is open source, and completely open source. You can see right here that this is absolutely incredible: on the comparative benchmarks it achieves 12.29% compared to Devin's 13.84%. Now why is this so crazy? Well, if you remember, Devin, an actual software engineering agent that was not open source, had a $25 million Series A funding round. In comparison, this small team of open-source developers has managed to achieve relatively similar results in, I would argue, a shorter amount of time and with a lot less capital, which goes to show that open source could be achieving remarkable results in shorter time spans than bigger teams. It is quite surprising how effective this team has been to build something as quickly and as effectively as they have. Now, what's crazy as well is the distance between these two: it's not like there is a massive difference, it's only literally about 1%, so you could argue that these models are practically similar, and it will be interesting to see how in future versions the actual abilities increase with scale and with future models like GPT-5 or upgraded versions of GPT-4.5.

02:19

Now, like I said before, point number two is that if we actually look once again at Devin's benchmark, we can see that the other systems it was comparing itself against were far, far lower in comparison. But if we look at the compared benchmarks here, we can see that this agent has really closed the gap, because it means that open source has pretty much caught up to the state-of-the-art closed source in terms of what is capable. That goes to show that open source, unlike in the LLM race, could potentially catch up or even overtake. And I'm guessing the reason open source could catch up to closed source on comparative benchmarks is that both of these systems are using the base level of GPT-4, or potentially Claude Opus, considering those models have advanced planning and coding capabilities natively built into them.

03:14

Now let's move on to point number three: exactly how this works. How does this software engineering agent work? The agent works by interacting with a specialized terminal, which allows it to open, scroll through, and edit files. It also allows it to edit specific lines with automatic syntax checks, and to write and execute tests, and the custom-built interface is critical for good performance. They essentially describe how it works in a specialized terminal which allows it to think through its actions. Looking at the demonstration, we can see that there are thoughts and actions, and then there are observations where it is able to check what it is doing. Right here it states: "Our reproduction script confirms the reported issue: maximum and minimum are not being converted to R. Let's search for files related to R code generation." It searches, then we can see the observation, and then the thoughts and actions again. So right here we can see exactly what the system is thinking and then its action: first the thought of the AI system (the responsible code is likely here, we should do this), then the action at the end. Then it looks at the observation and goes back to thought and action. So it seems like the system internally just thinks, then it acts, then it observes what's being done, and then it thinks once again. What's cool about this is that this is an open-source software agent capable of seemingly long-term planning, or at least iterative planning, as it moves forward.

05:00

Now, point number four is rather fascinating, because I did see something that I didn't think we would see: a new design. It says: "Simply connecting an LM to a vanilla bash terminal does not work well. Our key insight is that LMs require carefully designed agent-computer interfaces, similar to how humans like good user interface design. For example, when the LM messes up indentation, our editor prevents it and gives it feedback." So essentially the language model needs an agent-computer interface that is very friendly in order for it to work effectively, and they said that connecting it to simply a vanilla bash terminal just doesn't work well. So they've designed a new interface that works well natively with these LLMs, to make sure they understand exactly what is going on and are more effective. And we can see right here that there is LM-friendly environment feedback that goes to the agent; the agent-computer interface has very simple commands that it can use, such as navigate repo, search files, use the file viewer, and edit lines; it converts those into actions on the computer, and then the results come back. So we have this entire system: the editor prevents mistakes and allows it to work more effectively. It's clear that the new design made a huge difference in the performance capabilities of this system.

06:28

There was also something else on the design in point number five: they were basically limiting the information given to this AI system. They said: "Another example is that we discovered that for viewing files, letting SWE-agent only view 100 lines at a time was better than letting it view 200 lines or 300 lines, and much better than letting it view the entire file." So essentially they didn't want to give the AI system entire files when letting it complete the task; it is only allowed to view 100 lines at a time, and that was much better than letting it view 200 or 300. I'm guessing that larger views likely increased the complexity of what was being done, maybe confusing the model, and that fewer lines allowed the model to process what was going on better. From this we could judge that the internal agent works better when it has fewer things to deal with, which is not too surprising. You might think that if an agent has access to the entire file it would perform better, but showing it 100 lines at a time seems to let it plan better, be more effective, and dedicate all of its compute to ensuring what it does is correct. They also say good agent-computer design is even more important when using GPT-4. So if you are building an advanced AI software engineering agent, it is possible that limiting the agent to viewing 100 lines at a time might be better than 200 or 300 lines at a time. That is something interesting, and I wonder whether in future there will be an optimal number of lines for a software engineering agent to view, or whether there are going to be multiple software engineering agents collaborating — maybe three or four working on different parts of the entire codebase, fixing it at one time.

08:26

Now, point number six was also rather fascinating. The agent can additionally be easily configured and extended to improve future research on software engineering agents. Since the agent is open source, anyone can experiment and contribute new ways for agents to interact with computers. This is something I do find quite fascinating, because now we have a system that is completely open source, which means development is likely going to accelerate even more. Remember, if we look back at point number two, in the compared benchmarks against Devin we could see that the gap wasn't that big: the open-source agent, in the announcement benchmarks, was at 12.29%, which is only roughly 1% lower than the closed-source Devin. Since this is easily configured and now completely open source, further development by other individuals, and maybe even companies, could take it to a whole new level, which is thus going to increase competition. I do wonder what kind of software engineering agents people are going to be building with this, because it seems very effective and very promising for the future. This is something that I would argue has been built remarkably quickly compared to open-source LLMs. If you remember, compared to the release of GPT-4 and GPT-3.5, open-source chatbots took quite some time, considering the rigorous amount of training, pre-training, and alignment they needed to do to the model; but with this, it seems people are easily able to build on top of existing closed-source models and get these advanced software engineering agents.

10:09

Point number seven is also rather fascinating: they actually leave a link to a pretty cool demo in which you can see how the entire thing works, and I'm going to show you that right now before we get to some of the other points. This is where we have the advanced software engineering agent and you can see how it works internally. This is the web page where they have a lot of material, but we can see the demo right here. Essentially, if you just click Next Step, you're going to be able to see exactly how it works. We've got the issue right here — this is the issue we are trying to solve — and you can see all of the code that's been put in, and the bug is described right here. Then we can see the next step: it says to start addressing the issue we should do this; then you can see in the terminal what the system is doing — it's trying to reproduce the bug, it says we're going to paste it in, and then you can see it's done that. Now, I'm not a software developer, but if you are, this is really good, because you can see exactly how it works and what steps it takes on the left-hand side in its workspace, and it has its terminal and its editor. This entire run took around 38 steps to complete. You can see: "The error has been successfully reproduced, which confirms the described issue... Before proceeding with the fix, let's navigate to here," and then we can see exactly how it works — it's opening the tools library, and it's really effective at showing you exactly how things work. You can also make this full screen and see the terminal, and I think this is really cool because it actually shows you how the AI system is working. With Devin we did get to see a few demos, but I really do think that this website is very effective.

12:03

In addition, at point number eight, they talked about a paper release. One of the things many people want is, of course, technical details, and on the Discord they said they are aiming to release by April 10th, because that is when they think they're going to be able to get the paper out. If you don't know what the paper is, it's essentially where the technical details should be released: how it works, all the benchmarks, what open-source or closed-source systems they used, how they fine-tuned it, and some of their initial experiments on what was effective and what wasn't. So next Wednesday should be the release of the entire paper, where you can dive into more details.

12:44

Point number nine was rather interesting, because it's about how expensive this is to run. One thing you probably know about AI systems already is that a lot of agentic tasks, where you have to do multiple different reasoning steps, require the model to output a lot more tokens than a simple zero-shot prompt on a simple task. What's crazy is that they said: "We limit this at $4 per task, and on average we spend much less for each solved task; we'll have a number in the paper next week, and we'll have a number on average tokens in/out." They clearly don't want this to be an extremely expensive system, and that is completely understandable, considering that for this to be usable on a day-to-day basis it shouldn't be very expensive. If you can get your software engineering issue solved for 50 cents or so, that is very effective; but if every task took $10 to solve, it would get very expensive very quickly, because there are a bunch of different tasks, and if you're trying to use this at scale you'd rack up the bill pretty quickly. Of course, other models are coming out and getting cheaper and more effective, so I do think the cost per token is going to go down quite a lot over time; but for now they set a limit at $4 per task, and on average they spend much less per solved task. In terms of how long it takes to solve an issue, 93 seconds on average is pretty incredible, because I think Devin, if I remember correctly, took around 5 to 10 minutes — but I can't verify that, so don't take that as me hating on Devin — and 93 seconds on average is very, very impressive.

14:33

Now, the last point, point number 10, is of course: will they use open-source models? They said that could be great, but right now they mainly use closed-source models because they are quite strong, and in the original software engineering benchmark paper they found that a lot of existing open-source models were fairly far behind. Basically, they could use models like Llama 2 or Mistral, but the point is that closed-source models like GPT-4 and Claude Opus are quite a lot better than these open-source models, and due to that fact they're going to continue using them, which does make sense. There are benefits to using open-source models — they can run locally, which is really good in terms of privacy — but closed-source models have billions of dollars in investment behind them and are just far more effective than open-source models at this time. So it seems they won't be using open-source models for now, though maybe they'll allow it; I wonder how effective that would be, considering open-source models aren't as strong as closed-source models. So let me know what you think about this. Do you think this is very effective? Do you think this is something that's really cool? Do you think this is going to be something that takes down Devin? Because it is right hot on its heels, and I wonder if open source could actually take down closed source in the very near future. With that being said, it's been TheAIGRID, and I'll see you all in the next video.


Related Tags
Open source, AI software engineering, Devin comparison, GitHub applications, AI coding, technological innovation, efficiency gains, GPT-4, industry news, future outlook