New OPEN SOURCE Software ENGINEER Agent Outperforms ALL! (Open Source DEVIN!)
Summary
TLDR: This video script introduces an advanced open-source software engineering agent that autonomously resolves issues in GitHub repositories, achieving accuracy similar to Devin's on the software engineering benchmark with an average solve time of 93 seconds. The agent is fully open source and includes an interface designed to make it easy for GPT-4 to edit and run code. The video also covers the comparison between open-source and closed-source models, how the agent works, the newly designed interface, information limiting, extensibility for future research, cost-effectiveness, and the potential use of open-source models.
Takeaways
- 🚀 The release of this open-source software engineering agent marks a new system for autonomously solving issues in GitHub repositories, with accuracy similar to Devin's, fully open source, and averaging only 93 seconds per task.
- 🔍 The open-source agent scores only 1.55% below Devin on the software engineering benchmark, showing that open-source projects can achieve remarkable results in a short time.
- 💡 The agent works by interacting with a specialized terminal that supports file browsing, editing, syntax checking, and test execution, underscoring the importance of an interface designed to be friendly to GPT-4.
- 🌟 Limiting how many lines of a file the AI system can view at once (e.g., only 100 lines at a time) improves its efficiency and accuracy, likely by helping the model process and understand the task better.
- 🔧 The agent is designed to be easily configured and extended, enabling future research on software engineering agents.
- 🔗 A demo link is provided that gives a clear view of the agent's workflow and inner workings.
- 📜 A paper is expected on April 10 detailing the technical details, benchmarks, model fine-tuning, and effectiveness experiments.
- 💰 Although complex agentic tasks can be costly to run, the agent caps each task at an average of under $4.
- ⏱️ The agent solves issues in 93 seconds on average, demonstrating efficient performance.
- 📈 Despite the privacy and local-execution advantages of open-source models, closed-source models remain the first choice for now due to their stronger performance and the massive investment behind them.
Q & A
Why is the release of this open-source software engineering agent important?
-It matters because it matches the performance of the previously released Devin while using far less capital and time. This shows the open-source community can make rapid, significant technical progress and may eventually overtake commercial closed-source solutions.
How does the open-source agent compare with Devin on performance?
-On the software engineering benchmark, the open-source agent's accuracy is close to Devin's: Devin scores 13.84% versus 12.29% for the open-source agent. Performance is nearly identical, but the open-source agent was developed at lower cost and in less time.
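For context, the size of that gap can be checked with a line of arithmetic; a quick sketch using the two percentages quoted above:

```python
# Benchmark accuracies quoted in the video (percent of issues resolved).
devin_acc = 13.84
swe_agent_acc = 12.29

# Absolute gap, in percentage points.
absolute_gap = devin_acc - swe_agent_acc

# Relative gap: how much lower the open-source agent scores than Devin.
relative_gap = absolute_gap / devin_acc * 100

print(f"absolute gap: {absolute_gap:.2f} points")   # 1.55 points
print(f"relative gap: {relative_gap:.1f}%")
```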
How does the open-source agent work?
-It works by interacting with a specially designed terminal that lets it open, scroll through, and edit files, run syntax checks, and write and execute tests. This interface, optimized for GPT-4, is critical to the agent's performance.
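The think-act-observe cycle described in the answer above can be sketched in a few lines. This is a hypothetical illustration, not SWE-agent's actual code: the `model` and `terminal` objects stand in for the language-model call and the specialized terminal.

```python
def solve_issue(model, terminal, issue, max_turns=40):
    """Minimal think-act-observe loop.

    `model(history)` returns a (thought, command) pair, and
    `terminal.run(command)` returns the observation text.
    Both are placeholders, not SWE-agent's real API.
    """
    history = [("issue", issue)]
    for _ in range(max_turns):
        thought, command = model(history)             # think
        history.append(("thought", thought))
        if command == "submit":                       # agent decides it is done
            return history
        observation = terminal.run(command)           # act
        history.append(("observation", observation))  # observe
    return history
```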
Why do language models need a well-designed agent-computer interface?
-Just as humans benefit from good user interface design, an agent-computer interface helps the model understand the task and receive timely feedback, avoiding mistakes and improving efficiency.
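The video's concrete example of such feedback is the editor catching bad indentation. Here is a hypothetical sketch of that kind of guard-rail, where an edit that would break the file's syntax is rejected with feedback the model can act on, instead of being written to disk. This is an illustration only, not the project's actual editor:

```python
def apply_edit(lines, start, end, replacement):
    """Replace lines[start:end] with `replacement`, but refuse edits that
    would introduce a Python syntax error, returning feedback instead."""
    candidate = lines[:start] + replacement.splitlines() + lines[end:]
    try:
        compile("\n".join(candidate), "<edit>", "exec")
    except SyntaxError as err:
        # Instead of silently writing a broken file, surface feedback.
        return lines, f"Edit rejected: line {err.lineno}: {err.msg}"
    return candidate, "Edit applied."
```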
What is the effect of limiting how many lines of a file the AI system can view?
-Letting the AI system view only 100 lines of code at a time, rather than 200 or 300 lines or the entire file, makes it more effective. Less information likely reduces the complexity the model must handle, letting it stay focused and execute the task more effectively.
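A minimal sketch of such a windowed file viewer, assuming a 100-line window; the function names here are illustrative, not the project's API:

```python
WINDOW = 100  # lines shown per view, per the finding described above

def view_window(lines, first=0, window=WINDOW):
    """Return a fixed-size slice of the file plus a header telling the
    model where it is, so it can scroll rather than read everything."""
    last = min(first + window, len(lines))
    header = f"[showing lines {first + 1}-{last} of {len(lines)}]"
    return "\n".join([header] + lines[first:last])

def scroll_down(first, window=WINDOW):
    # The next call to view_window continues from here.
    return first + window
```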
How does the open-source agent advance future research and development?
-Because the agent is fully open source, anyone can experiment with and improve it, contributing new ideas for how agents interact with computers. This openness may attract more developers and companies and accelerate progress on agent technology.
Where can the demo of the open-source agent be found?
-A demo link is available on the project's web page. Through it, users can watch the agent solve software engineering issues, including its steps in the workspace and its terminal operations.
When will the technical details of the open-source agent be published?
-The technical details are expected on April 10. The paper will cover how the agent works, the benchmarks used, how the model was fine-tuned, and their initial experimental results.
What is the average cost of running a task?
-The cost is capped at $4 per task, and the actual spend is usually much lower. Keeping costs down is important for making the technology viable for everyday use.
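A cap like that can be enforced with simple token accounting. The sketch below is hypothetical: the per-token prices are placeholders (real prices vary by model and change over time), and nothing here is the project's actual implementation.

```python
# Illustrative prices in dollars per 1,000 tokens -- placeholders only.
PRICE_IN_PER_1K = 0.01
PRICE_OUT_PER_1K = 0.03
BUDGET = 4.00  # per-task cap mentioned in the video

def charge(spent, tokens_in, tokens_out):
    """Add one model call's cost to the running total for a task;
    raise once the task budget is exceeded so the agent stops."""
    cost = tokens_in / 1000 * PRICE_IN_PER_1K + tokens_out / 1000 * PRICE_OUT_PER_1K
    spent += cost
    if spent > BUDGET:
        raise RuntimeError(f"budget exceeded: ${spent:.2f} > ${BUDGET:.2f}")
    return spent
```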
How long does the open-source agent take to solve a task on average?
-93 seconds on average, which is much faster than earlier systems such as Devin, which reportedly took 5 to 10 minutes.
Which models does the open-source agent mainly use at present?
-Although the agent itself is fully open source, it currently relies mainly on closed-source models such as GPT-4 and Claude Opus because they are stronger. Open-source models offer advantages in privacy and local execution, but for now closed-source models are preferred for their greater investment and effectiveness.
Outlines
🚀 A breakthrough in open-source software engineering agents
This section introduces the release of an advanced open-source software engineering agent that automatically solves issues in GitHub repositories, comparable to Devin, the software engineer that previously took the industry by storm. The video presents 10 key points about the agent, including its capabilities and effectiveness. Its accuracy on the software engineering benchmark is similar to Devin's, it averages just 93 seconds per task, and it is fully open source. A new agent-computer interface was also designed to make it easier for GPT-4 to edit and run code.
🌟 Open-source vs. closed-source models, and the new design
This section compares the open-source agent with closed-source systems like Devin, noting that the open-source project achieved similar results in far less time, demonstrating the potential of the open-source community. It also introduces the new design insight: language models work effectively only with a carefully designed agent-computer interface. By limiting how many lines of a file the AI system can view, the team found that viewing only 100 lines at a time works better than viewing more, likely because the reduced complexity lets the model process the information better.
🛠️ How the software engineering agent works
This section explains in detail how the agent works by interacting with a specialized terminal: it can open, scroll through, and edit files, execute tests, and run automatic syntax checks. The agent solves problems through a loop of thought, action, and observation. This internal mechanism shows the open-source agent is capable of long-term, or at least iterative, planning, which matters for the future development and study of software engineering agents.
🔍 Demo and technical details
A link is provided to a demo of the agent that reveals its inner workings, showing how it solves an issue: finding the relevant files, editing code, and executing tests. A technical paper is expected on April 10 with more on how the agent works, the techniques used, the benchmarks, and initial experimental results.
Keywords
💡Open-source software
💡Software engineering agent
💡GPT-4
💡Benchmarks
💡User interface design
💡Long-term planning
💡Configuration and extension
💡Information limiting
💡Cost-effectiveness
💡Technical details
💡Closed-source models
Highlights
A newly released advanced open-source software engineering agent autonomously solves issues in GitHub repositories
The agent's accuracy on the software engineering benchmark is similar to Devin's
The open-source agent solves issues in 93 seconds on average
A new agent-computer interface was designed to make it easy for GPT-4 to edit and run code
The open-source agent scores slightly below Devin on the open benchmark
The open-source team achieved remarkable results in less time and with less capital
The agent works by interacting with a specialized terminal
The agent-computer interface is critical to performance
LMs need a carefully designed agent-computer interface to work effectively
Limiting how much of a file the AI system views at once improves performance
The open-source agent is easy to configure and extend
Community contributions may accelerate the agent's development
A demo link shows the agent's working process
Technical details will appear in a paper released on April 10
The average cost per task is capped at $4
On average, a task takes 93 seconds to solve
Closed-source models are mainly used for now because they are stronger
Open-source models still lag behind closed-source models
Transcripts
So there has been an announcement of an advanced-level open-source software engineering agent, and you can see here that this is really striking, because it was only recently that we had Devin become the first autonomous software engineer, something that took the industry by storm. So in this video I'm going to be giving you 10 of the key takeaways on this open-source agent and what it is able to do effectively.

Here we can see the announcement. It says: SWE-agent is our new system for autonomously solving issues in GitHub repos. It gets similar accuracy to Devin on the software engineering benchmarks, takes 93 seconds on average, and it's open source. We designed a new agent-computer interface to make it easy for GPT-4 to edit and run code.

So let's take a look at the 10 things you should know. The first thing is that it is open source, completely open source. You can see right here that this is absolutely incredible: on the comparative benchmarks it achieves 12.29% compared to Devin's 13.84%. Now why is this so crazy? Well, if you remember, Devin, an actual software engineering agent that was not open source, had a $25 million Series A funding round. In comparison, this small team of open-source developers has managed to achieve relatively similar results in, I would argue, a shorter amount of time and with a lot less capital, which goes to show that open source can achieve remarkable results in shorter time spans than bigger teams. It's quite surprising how effective this team has been at building something as quickly and as effectively as they have.

What's crazy as well is the distance between the two: it's not like there is a massive difference, it's only about 1.5 percentage points, so you could argue these systems are practically similar, and it will be interesting to see how, in future versions, the actual abilities increase with scale and with future models like GPT-5 or upgraded versions of GPT-4.5. Now, like I said before, if you
compare this — and point number two is that if we actually look once again at Devin's benchmark, we can see that the other systems it was comparing itself against were far, far lower in comparison. But if we look at the compared benchmarks in point number two, we can see that it really has closed the gap, if one exists at all, because that means open source has pretty much caught up to the state-of-the-art closed source in terms of what is capable. Which goes to show that open-source agents, unlike open-source LLMs, could potentially catch up to or even overtake closed source. And I'm guessing the reason open source could catch up on comparative benchmarks is that both of these systems use GPT-4, or potentially Claude Opus, as the base model, considering those models have advanced planning and coding capabilities natively built into them. Now let's move
on to point number three. Point number three is exactly how this works. So how does this software engineering agent work? It works by interacting with a specialized terminal, which allows it to open, scroll through, and edit files; it also allows it to edit specific lines with automatic syntax checks, and to write and execute tests. The custom-built interface is critical for good performance. So this is where they essentially describe how it works in a specialized terminal which allows it to think through its actions. Looking at the demonstration, we can see that there are thoughts and actions, and then there are observations where it is able to check what it is doing. So we can see right here it states that the reproduction script confirms the reported issue — maximum and minimum are not being converted to R — so let's search for files related to R code generation. It searches, then we can see the observation, and then the thoughts and actions again. So right here we can see exactly what the system is thinking and then its action. First, of course, is the thought of the AI system — the file responsible for the code is likely here, we should do this — and then we can see the action at the end. So the thought is here, then the action, and then it looks at the observation and goes back to thought and action. It seems like the system we have here, internally, in point number three of how this works, is that it just thinks, then it acts, then it observes what's been done, and then it thinks once again. What's cool about this is that we can see this is an open-source software agent capable of seemingly long-term planning, or at least iterative planning as it moves forward. Now, point
number four is rather fascinating, because I saw something that I didn't think we would see. Essentially, what we also see in point number four is that there is a new design. It says: simply connecting an LM to a vanilla bash terminal does not work well; our key insight is that LMs require carefully designed agent-computer interfaces, similar to how humans like good user interface design — for example, when the LM messes up indentation, our editor prevents it and gives it feedback. So essentially what we can see here is that the language model needs an agent-computer interface that is very friendly in order for it to work effectively, and they said that connecting it to simply a vanilla bash terminal just doesn't work well. So they've essentially built a new design that works well natively with these LMs, to make sure they understand exactly what is going on and are more effective. And we can see right here there is LM-friendly environment feedback: it goes to the agent, and the agent-computer interface gives the agent very simple commands it can use, such as navigating the repo, searching files, using the file viewer, and editing lines; it then converts those into commands for the computer, and the results come back. So we have this entire system of how this works — the editor prevents mistakes and allows it to work more effectively. It's clear that the new design made a huge difference in terms of the performance capabilities of this. Now, there was also
something else on the design. In point number five, we also saw that they were basically limiting the information given to this AI system. They said that another example is that they discovered that, for viewing files, letting SWE-agent view only 100 lines at a time was better than letting it view 200 or 300 lines, and much better than letting it view the entire file. So essentially what they're stating here is that they didn't want to give the AI system entire files when letting it complete the task; they basically said it's only allowed to view 100 lines at a time, and that was much better than letting it view 200 or 300. I'm guessing this is because more lines likely increased the complexity of what was being done and maybe confused the model, while fewer lines allowed the model to process what was going on better. So from this we can judge that the internal agent works better when it has less to deal with at once, which is not too surprising — but you would think that if an agent had access to the entire file, maybe it would perform better. I'm guessing that showing it just 100 lines at a time allows it to plan better, to be more effective, and to dedicate all of its compute to ensuring that what it does is correct. They also say good agent-computer design is even more important when using GPT-4. So if you are building an advanced AI software engineering agent, it is possible that limiting the agent to viewing 100 lines at a time might be better than 200 or 300 lines at a time. That is something interesting, and I wonder if, in future, there will be an optimal number of lines for a software engineering agent to view, or whether there will be multiple software engineering agents collaborating — maybe three or four of them working on different parts of the entire codebase, fixing it at one time.
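That speculation about several agents splitting up a codebase can be sketched too. This is a purely hypothetical illustration of round-robin partitioning of files among n agents — nothing like this is described in the announced system:

```python
def partition_files(files, n_agents):
    """Deal files out round-robin so n agents could work in parallel,
    each on its own subset of the codebase."""
    buckets = [[] for _ in range(n_agents)]
    for i, path in enumerate(sorted(files)):
        buckets[i % n_agents].append(path)
    return buckets
```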
Now, point number six was also rather fascinating. The software engineering agent can additionally be easily configured and extended to improve future research on software engineering agents, and since the agent is open source, anyone can experiment and contribute new ways for agents to interact with computers. This is something I find quite fascinating, because now we have a system that is completely open source, which means development is likely going to accelerate even more. Remember, if we look back at point number two and the benchmarks compared to Devin, we could see that Devin wasn't that far ahead: the open-source software engineering agent, in the announcement benchmarks, scored 12.29%, only around 1.5 percentage points lower than the closed-source Devin. That essentially means that, since this is easily configured and now completely open source, further development by other individuals, and maybe even companies, could take this to a whole new level, which is going to increase competition. I do wonder what kind of software engineering agents people are going to build with this, because it seems to be very effective, and so far it looks very promising for the future. This is something I would argue has been built remarkably quickly compared to open-source LLMs: if you remember, after the release of GPT-4 and GPT-3.5, open-source chatbots took quite some time, considering the rigorous amount of pre-training and all of the aligning that needed to be done to the model. But with this, it seems like people are easily able to build on top of existing closed-source models and get these advanced software engineering agents. Point number seven is
also rather fascinating: they actually leave a link to a pretty cool demo where you can see how the entire thing works, and I'm going to show you that right now before we get to some of the other points. So this is point number seven, and this is where we have the advanced software engineering agent and you can see how it works internally. This is of course the web page, where they have a lot of material, but we can see the demo right here. Essentially, if you just click Next Step, you'll see exactly how it works. We've got the issue right here — this is the issue we're trying to solve — and you can see all of the code that's been put in, and the bug is described right here, so you can see exactly what you want. Then we can see the next step: it says, to start addressing the issue, we should do this. Then you can see in the terminal what the system is doing — it's trying to reproduce the bug, it says we're going to paste it in, and you can see it's done that — and then you can see exactly how this works. Now, I'm not a software developer, but if you are, this is really good, because you can see exactly how it works and what steps it takes on the left-hand side in its workspace; it has its terminal and its editor, and you can see that this entire thing took around 38 steps to complete. You can see the error has been successfully reproduced, which confirms the described issue and the error message; before proceeding with the fix, it says, let's navigate to here — and then we can see exactly how it works. You can see it's opening the tools library, and it's really effective at showing you exactly how things work. You can also make this full screen, and you can see the terminal. I think this is really cool because it actually shows you how the AI system is working; with Devin we did get to see a few demos, but I really do think this website is very effective. Now,
in addition, at point number eight, they did talk about a paper release. One of the things many people want is of course technical details, and on the Discord they said they are aiming to release by April 10th. So for the paper release they're aiming for April 10th, because that is when they think they'll be able to get the paper out. If you don't know what the paper is, it's essentially where the technical details of exactly what's going on should be released: how it works, all the benchmarks, what open-source or closed-source systems they used, how they fine-tuned it, and some of their initial experiments on what was and wasn't effective. So next Wednesday should be the release of the entire paper, where you can dive into more details. Point number nine was
rather interesting, because this is how expensive this is to run. One thing you probably know about AI systems already is that a lot of agentic tasks, where you have to do multiple different reasoning steps, require the model to output a lot more tokens than a simple zero-shot prompt on a simple task. Now, what's crazy is that they said: we limit this at $4 per task, and on average we spend much less for each solved task; we'll have a number in the paper next week, along with the average number of tokens in and out. So I think right here they're making the point that they don't want this to be an extremely expensive system, and that is completely understandable, considering that for this to be usable — for it to be viable as something people can use on a day-to-day basis — it shouldn't be very expensive. I mean, if you can get your software engineering issue solved for 50 cents or so, that is going to be very effective; but if every task took $10 to solve, it would get very expensive very quickly, because there are a bunch of different tasks, and if you're trying to use this at scale it wouldn't be cost-effective — you'd rack up the bill pretty quickly. Of course, other models are coming out and models are getting cheaper and more effective, so I do think that over time the cost per token is going to go down quite a lot; but for now they set a limit at $4 per task, and for solved tasks that is how it works. In terms of how long it takes to solve: 93 seconds on average is pretty incredible, because I think Devin, if I remember correctly, took around 5 to 10 minutes to solve a task — but I can't verify that, so don't take that as me hating on Devin. Still, 93 seconds on average is very, very impressive. Now, the last point,
point number 10, is of course whether they will use open-source models. They said that could be great, but right now they mainly use closed-source models because they are quite strong, and in the original software engineering benchmark paper they found that a lot of existing open-source models were fairly far behind. So basically they're saying they could use models like Llama 2 or Mistral, but the point is that closed-source models like GPT-4 and Claude Opus are quite a lot better than some of these open-source models, and because of that they're just going to continue using them, which does make sense. Now, there are benefits to using open-source models, because they can run locally, and that's really good in terms of privacy; but once again, closed-source models have billions of dollars in investment behind them and are just far more effective than open-source models at this time. So it seems they won't be using any open-source models for now, though maybe they'll allow you to do that — but I wonder how effective that would be, considering open-source models aren't as strong as closed-source models. So let me know what you think about this. Do you think this is very effective? Do you think this is something that's really cool? Do you think this is going to be something that takes down Devin, given that it's right on its heels? And I wonder if open source could actually take down closed source in the very near future. With that being said, it's been TheAIGRID, and I'll see you all in the next video.