Cosine's New AI Software Developer GENIE Surprises Everyone! (AI Software Engineer)
Summary
TLDR: Cosine's Genie, a fine-tuned version of GPT-4, has achieved a 43.8% score on SWE-bench Verified, setting a new benchmark in software engineering. Unlike traditional AI models, Genie is designed to mimic human software engineers, using unique data sets to understand and solve coding problems. It can fetch issues from GitHub, write and debug code iteratively, and even open pull requests. Cosine's approach to AI development focuses on human-like reasoning, with plans to expand Genie's capabilities across programming languages and frameworks.
Takeaways
- 🚀 Cosine's Genie is a new, state-of-the-art fine-tuned version of GPT-4 designed for software engineering tasks.
- 🏆 Genie achieved the highest score on SWE-bench Verified, a software engineering benchmark, with a 43.8% performance rate.
- 🧠 The development approach for Genie was unique, focusing on emulating human reasoning by training on real examples of software engineers' work.
- 🔍 Genie can be prompted with natural language, such as GitHub issues, and it iteratively solves problems by fetching relevant code examples and writing new code.
- 💻 Genie's process includes planning, retrieval, code writing, and code running, all performed in a manner that mimics human software engineers.
- 🔧 Genie has the ability to edit code in place, a task that foundational models often struggle with.
- 🔄 The model is trained using a self-improvement loop, where it learns from its mistakes and corrects them in subsequent training iterations.
- 📈 There's significant potential for improvement in AI models, as shown by the rapid increase in scores on SWE-bench.
- 🌐 Cosine plans to refine Genie's capabilities, expand its proficiency to more programming languages and frameworks, and create different sizes of AI models for various tasks.
- 📖 The future of Genie includes open-sourcing, pre-training, and the ability to specialize in specific code bases or programming languages.
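The planning, retrieval, code writing, and code running loop described in the takeaways can be sketched roughly as follows. This is a minimal illustration only; every function name here is a hypothetical stand-in, not Cosine's actual API:

```python
# Minimal sketch of an agentic solve loop: retrieve context, write a patch,
# run the tests, and react to the result, repeating until success.
# All function names are hypothetical illustrations, not Cosine's API.

def solve(issue, fetch_files, write_patch, run_tests, max_iters=5):
    """Iteratively attempt to solve an issue, reacting to test feedback."""
    context = fetch_files(issue)              # retrieval: pull relevant files
    for _ in range(max_iters):
        patch = write_patch(issue, context)   # code writing
        passed, feedback = run_tests(patch)   # code running
        if passed:
            return patch                      # all tests green: done
        context = context + [feedback]        # plan again with new evidence
    return None                               # give up after max_iters
```

In this shape, each failed attempt enriches the context for the next one, which is the "react as a function of what it's seen" behavior the demo describes.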
Q & A
What is Cosine's Genie and how does it relate to software development?
-Cosine's Genie is a state-of-the-art, fine-tuned version of GPT-4 designed to perform software engineering tasks. It is capable of autonomously solving coding problems by emulating human reasoning and decision-making processes.
What is the significance of Genie's 43.8% performance on SWE-bench Verified?
-Genie's 43.8% performance on SWE-bench Verified signifies its high capability in software engineering tasks, outperforming other models and showcasing its advanced problem-solving abilities in real-world coding scenarios.
How does Genie's approach differ from other AI models in software engineering?
-Genie is trained on real examples of software engineers' work, focusing on human reasoning and step-by-step decision making. This differs from other approaches that simply prompt base models, allowing Genie to tackle problems more like a human.
What are the unique data techniques Cosine used to train Genie?
-Cosine used techniques that represent perfect information lineage, incremental knowledge discovery, and step-by-step decision making, all designed to mimic how a human engineer logically approaches problem-solving.
How does Genie interact with a real coding problem from a repository?
-Genie can be prompted with a natural language description, such as a GitHub issue. It then iteratively fetches relevant files, writes and tests code, and uses debugging tools until it successfully solves the problem.
What advantages does Genie's data-first approach provide over foundational models?
-The data-first approach gives Genie a deep understanding of how software engineers break down and triage issues. It can edit code in place efficiently and has a long context window, allowing it to try multiple solutions without losing information.
How quickly was Genie able to solve a real problem from an unknown repo?
-Genie solved a real problem from an unknown repository in just 84 seconds, which is significantly faster than what a human could typically achieve.
What does CoSign Genie do after solving a problem?
-After solving a problem, Genie writes a pull request (PR) title and body, and opens the PR on the linked GitHub repository through the Cosine web platform, where it can respond to comments and reviews as if it were a human colleague.
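Opening a PR programmatically, as described above, is done against GitHub's standard REST endpoint (`POST /repos/{owner}/{repo}/pulls`). A minimal sketch, assuming a personal access token; how Cosine's platform actually performs this step is not public:

```python
# Hedged sketch: opening a pull request via GitHub's REST API.
# The endpoint and payload fields are GitHub's documented ones;
# the owner, repo, branch, and token values are placeholders.
import json
import urllib.request

def build_pr_request(owner, repo, title, body, head, base="main"):
    """Build the URL and JSON payload for GitHub's create-PR endpoint."""
    url = f"https://api.github.com/repos/{owner}/{repo}/pulls"
    payload = {"title": title, "body": body, "head": head, "base": base}
    return url, payload

def open_pull_request(owner, repo, token, title, body, head, base="main"):
    """POST the payload; GitHub answers 201 with the new PR's JSON."""
    url, payload = build_pr_request(owner, repo, title, body, head, base)
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Reacting to review comments, as the answer describes, would use the related issue-comment and review-comment endpoints of the same API.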
What is the future outlook for Genie according to the video?
-The future outlook includes refining the data set to enhance Genie's capabilities, broadening its proficiency in more programming languages and frameworks, and creating different sizes of AI models for various tasks. There's also a plan for an open-source model and pre-training to improve generalization.
How does Genie's training process involve self-improvement?
-Genie's training process involves using the model's initial attempts to solve problems, correcting its mistakes, and incorporating these corrections into the training data for subsequent versions, leading to iterative improvement.
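In outline, one round of that correction loop looks like the sketch below. All function names are hypothetical stand-ins, since Cosine has not published its pipeline:

```python
# Sketch of one round of the self-improvement loop described above:
# the current model attempts each problem, failures are paired with
# corrections, and the corrected examples feed the next version.
# All function names are hypothetical stand-ins for Cosine's pipeline.

def self_improvement_round(model, problems, attempt, correct, finetune):
    """Run one round: collect (mistake -> fix) pairs, train the next model."""
    new_examples = []
    for problem in problems:
        solution, ok = attempt(model, problem)     # model tries the problem
        if not ok:
            fixed = correct(problem, solution)     # show it the correction
            new_examples.append((problem, fixed))  # keep the mistake/fix pair
    return finetune(model, new_examples)           # train the next version
```

Repeating the round with each new model is what drives the reported effect: later versions need fewer and smaller corrections.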
What are the implications of Genie's ability to understand specific code bases?
-Genie's ability to understand specific code bases allows it to be tailored to a company's unique programming languages and practices, making it an expert in a company's 'dialect' of code and enhancing its practical utility in real-world software development.
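One plausible shape for such company-specific training data is chat-style JSONL records of issue-to-fix examples (the format OpenAI's fine-tuning API accepts). Whether Cosine uses exactly this format is not public; this is an illustrative sketch:

```python
# Hedged sketch: preparing codebase-specific fine-tuning examples as
# chat-formatted JSONL (the record shape OpenAI's fine-tuning API accepts).
# The field contents are illustrative, not Cosine's actual data.
import json

def make_training_record(issue_text, relevant_code, patch):
    """One issue -> fix example as a chat-formatted training record."""
    return {"messages": [
        {"role": "system", "content": "You are a software engineer."},
        {"role": "user",
         "content": f"Issue: {issue_text}\n\nCode:\n{relevant_code}"},
        {"role": "assistant", "content": patch},
    ]}

def write_jsonl(records, path):
    """Write records one JSON object per line, as fine-tuning APIs expect."""
    with open(path, "w") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
```

Building such records from a company's own repositories, including ones in uncommon languages, is what would let a fine-tuned model learn the company's 'dialect' of code.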
Outlines
🚀 Introduction to Cosine's Genie and its Revolutionary Approach
Cosine's Genie is a cutting-edge AI model designed to revolutionize software development. It has achieved the highest score on SWE-bench Verified, showcasing its ability to perform tasks like a human software engineer. The model's development took a unique approach by training on real examples of software engineers at work, focusing on human reasoning and step-by-step decision-making. Unlike other models, Genie is not just generating random code; it tackles problems methodically, much like a human developer. It can be prompted with natural language, such as a GitHub issue, and it iteratively solves problems, writing and debugging code in a process that mirrors human software engineering practices. Genie's success rate and speed, solving a real problem in just 84 seconds, highlight its potential to outperform human capabilities in certain tasks.
📈 Unleashing AI's Full Potential: The Evolution of GPT Models
The script delves into the concept of 'unhobbling the gains' in AI, where models are initially limited in their practical applications but can be significantly improved with algorithmic enhancements like reinforcement learning and chain-of-thought prompting. It discusses how these improvements have led to a remarkable increase in performance, as seen in the rapid advancement of GPT models. The video also highlights Cosine's approach to AI development, which from its inception aimed to create an autonomous agent capable of independent decision-making, akin to a human programmer. The training process involved teaching Genie background knowledge and unwritten strategies that experienced programmers possess, ensuring it could generate code that fits with existing project structures. The iterative self-improvement of Genie, where it learns from its mistakes and corrects them, is a key aspect of its development, leading to increasingly accurate and efficient problem-solving capabilities.
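The 'scratchpad' idea discussed above is a change in prompting, not in the model itself. A minimal sketch of the two styles being contrasted; these are generic templates, not any vendor's official prompts:

```python
# Contrast between 'answer instantly' prompting and chain-of-thought
# prompting, as described in the unhobbling discussion above.
# Both functions just build a prompt string for any LLM call.

def direct_prompt(question):
    """The old style: force an immediate final answer."""
    return f"{question}\nAnswer immediately with only the final answer."

def chain_of_thought_prompt(question):
    """The CoT style: give the model a step-by-step scratchpad."""
    return (
        f"{question}\n"
        "Think step by step, writing out your reasoning, "
        "then give the final answer on the last line."
    )
```

The same model given the second prompt tends to solve harder problems, which is the 'unlocking latent capability' effect the paragraph describes.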
🔮 The Future of AI in Software Development: Cosine's Vision for Genie
The final paragraph outlines Cosine's roadmap for Genie, which includes refining the data set to enhance Genie's capabilities, broadening its proficiency to include more programming languages and frameworks, and creating AI models of varying sizes for different tasks. The company plans to offer an open-source model and pre-training, aiming for improved generalization and specialized data reconciliation. A particularly exciting feature for businesses is the ability to train Genie to understand and work within specific, complex code bases, even those using uncommon or company-specific programming languages. This development signifies a significant evolution in AI's role in software development, with continuous improvements expected in the capabilities and applications of AI models like Genie.
Keywords
💡Cosine Genie
💡SWE-bench
💡Human Reasoning
💡Data-first Approach
💡GitHub Issue
💡Iterative Process
💡Long Context Window
💡Codebase
💡Debugging Tools
💡Self-Improvement in Training
💡Agentic Loop
Highlights
Cosine's Genie is a fine-tuned version of GPT-4 with a 43.8% score on the new SWE-bench Verified software engineering benchmark.
Genie is designed to emulate human software engineers through unique training on real examples of software engineers at work.
Genie achieves the highest score on SWE-bench, showcasing its ability to tackle problems like a human.
Genie can be prompted with natural language, such as a GitHub issue, and begins problem-solving iteratively.
The model fetches relevant files from a codebase, demonstrating an understanding of the issue at hand.
Genie writes and iteratively tests code, emulating the debugging process a developer would use.
Genie's training includes watching humans solve problems, giving it a deep understanding of software engineering breakdown and triage.
Genie can edit code in place, a task that foundational models often struggle with.
Genie's long context window allows it to try multiple approaches without losing information.
Genie solved a real problem from an unknown repo in just 84 seconds, a speed unmatched by human developers.
Genie can write a PR title and body and open a pull request on GitHub, integrating seamlessly into the development workflow.
Genie's performance on the SWE-bench Verified leaderboard has surpassed previous high scores, indicating rapid improvement in AI models.
The transcript discusses 'unhobbling the gains' in AI, where simple improvements can unlock significant latent capabilities.
Genie was designed to be agentic from the start, aiming for autonomous decision-making similar to a human programmer.
Genie's training includes teaching it the background knowledge and unwritten strategies that experienced programmers possess.
The agentic loop of Genie consists of planning, retrieval, code writing, and code running, performed in a human-like manner.
Genie's training process includes self-improvement, where it learns from its mistakes and corrects them in subsequent versions.
Cosine plans to refine Genie's capabilities, introduce new programming languages and frameworks, and create different sizes of AI models for various tasks.
Cosine aims to offer an open-source model and pre-training, extending foundational models on their extensive data set for improved generalization.
Genie can be trained to understand specific, large code bases, even for uncommon or company-specific programming languages.
Transcripts
So, software development has taken another massive stride, with Cosine's Genie coming in and showing us a new state-of-the-art fine-tuned version of GPT-4 that scores 43.8% on the new SWE-bench Verified software engineering benchmark announced last Tuesday. Take a look at their announcement video; it's rather fascinating.

Hi, I'm Aly, co-founder and CEO of Cosine, a human reasoning lab, and I'd like to show you Genie, our state-of-the-art, fully autonomous software engineering colleague. Genie has the highest score on SWE-bench in the world, and the way we achieved this was by taking a completely different approach. We believe that if you want a model to behave like a software engineer, it has to be shown how a human software engineer works. We've designed new techniques to derive human reasoning from real examples of software engineers doing their jobs. Our data represents perfect information lineage, incremental knowledge discovery, and step-by-step decision making, representing everything a human engineer does logically. By actually training Genie on this unique data set, rather than simply prompting base models, which is what everyone else is doing, we've seen that we're no longer simply generating random code until some works; it's tackling problems like a human. So let's take a look at Genie
solving a real problem from a real repo. You'll notice you can prompt Genie with a natural language prompt, a ticket, or, in our case, a GitHub issue, so I'll go ahead and start. Now Genie has fetched the GitHub issue; when I click solve, it'll start looking into the problem. As you can see, it started thinking about what it'll need to find in order to solve this problem. This process is iterative, and it will keep going until the model is satisfied that it's found everything it needs. There we go: we can see that it's pulled a couple of examples of files from the codebase that intuitively look like they're relevant to the issue we're looking at. Now it's going to start writing code to try to solve the problem. Much like the retrieval step, this process is also iterative: Genie will write code, run it, and then react as a function of what it's seen.

One of the great advantages of our data-first approach is that, because our model has watched more humans solve problems than any human could in a lifetime, it has a great grasp of how software engineers really break down and triage issues. It's also easily able to edit code in place, without rewriting entire sections, which is something that foundational models struggle with. Genie is now running the code it's writing and is using the debugging tools we've given it to look at application state and execution flow, just like a developer would. Again, it's seen humans do this millions of times and is emulating that process. So, back to
the task: we've just watched Genie try a couple of different approaches to solving this problem, and at first it wasn't successful, so it planned again and has just written an alternative approach. This process can continue indefinitely, and because of the long context window that Genie has available to it, many different approaches can be tried without losing any information along the way. There we go: all the tests are now passing. Genie has successfully solved this problem, and it solved it in just 84 seconds, which I'd guess is much faster than any human could come to an unknown repo with an unknown issue and solve the problem. So now it'll write a PR title and body and actually open the PR on our linked GitHub repo through the Cosine web platform. Any comments or reviews left on that PR will be heard by Genie and will be acted upon as if it were a real human colleague.

We'd like to thank OpenAI for allowing us to fine-tune such a long-context-window model, and I'm extremely excited to see where and how you guys use Genie. If you'd like to give Genie a try, just head over to our website at cosine.sh. We truly believe that software engineering is just the starting point, and that we can codify human reasoning for any job or industry. We can't wait to show you what we've been working on. Now,
with this, what we can see here is the other models on this benchmark. The SWE-bench Verified leaderboard is the leaderboard that brings together all of the previous agents, models, and agentic workflows that work to solve these issues. Previously, the high score was Amazon Q's developer agent at 38.8%. Now, what's crazy about all of this is the rate at which models are improving: we can see that we've gone from 7% earlier this year all the way up to 43.8%. This is a remarkable level of improvement.

The reason this is truly remarkable is not mainly that we got better models. The craziest thing about all of this is something that Leopold Aschenbrenner, someone who worked at OpenAI on the superalignment team, spoke about in his paper 'The Decade Ahead': this thing called unhobbling the gains. This is where, by default, the model learns a lot of amazing raw capabilities, but they are all hobbled in all sorts of dumb ways, limiting their practical value; with simple algorithmic improvements like reinforcement learning, chain-of-thought prompting, tools, and scaffolding, we can unlock significant latent capability. He's basically stating that the way we use LLMs today is rudimentary, and over time we're going to figure out ways to get better and better with these models. So over time it's going to be interesting to see how these models will perform in terms of the abilities we manage to extract from them once we understand what they're capable of. For example, the paper describes unhobbling like this: imagine you had to
solve a hard math problem, but you had to instantly answer with the very first thing that came to mind. It seems obvious that you would have a pretty hard time, except for the simplest problems, but until recently that's how we asked LLMs to solve math problems. Remember, in the first days of GPT-4, people would just ask it a question; after that, what we decided to do was chain of thought. We decided to give it a step-by-step scratchpad, and it was able to solve much more difficult problems, so chain-of-thought prompting unlocked that for LLMs. The reason I'm going over this is because now, with new methods and the way new AI systems are performing, we're managing to unlock more and more capabilities from these systems. You can see here how the base GPT-4 has gained around 40 percentage points: it says the GPT-4 base model went from 5% with just the base model, to 20% with GPT-4 post-trained on release, to nearly 40% today with better post-training, tools, and agent scaffolding. So now, the reason that I
actually spoke about this is because it relates exactly to what Cosine's Genie is doing. In their paper, where they talk about this model, they state that Genie was always designed to be agentic, although when they first dreamt up the idea back in 2022, that term hadn't really cemented itself; 2022 was really, really early on. So basically, what they're stating here is that from the start of developing this model, they designed it to be autonomous: they wanted this model to act independently and make decisions, rather than being a smart assistant that would just be a passive tool. They wanted Genie to actually understand what it was looking at and respond in the most logical way, quite like a human programmer would.

Essentially, you can see here it says this is the tip of the iceberg when it comes to the work that was done to make as much of the implied information in a developer's mind explicit. For every task they trained Genie on, they had to teach it how to first gather essential background information about the project, and this was to prevent Genie from making up code that doesn't fit with the existing project structure. That's where they talk about how it wouldn't hallucinate code and would generate solutions in line with how the codebase was already organized and already operated. So they put a lot of effort into teaching Genie the kind of background knowledge that experienced programmers already have in their heads but don't always write down; basically, how you teach someone not just the rules of the game, but all of the unwritten strategies too.

Now, here's where they actually talk about how Genie's agentic loop works. They say that the agentic loop is comprised of four main processes: planning, retrieval, code writing, and code running. These alone are not new; most tools in this space will use a mix of all of these. But they say that because Genie is trained to perform each of these tasks like a human would, rather than how a base LLM would, they're able to get so much more performance from the model. So once again, as I've spoken about before with the unhobbling, it seems that the Genie team have managed to just extract more performance out of this model. Now, another crazy
thing that I saw was that they actually talk about the use of self-improvement in training the model. They say that much of the data they were training on was in a perfect state, because the vast majority of the time, code published by a human is in a working state when it's published. So basically, what they did here, which was rather genius, was that they used the first version of Genie to try to solve coding problems, and then, when it made mistakes, they showed it how to correct those mistakes. They then added these examples of mistakes and corrections to the training data for the next version of Genie, and repeated this process multiple times. So they basically used self-improvement to train the model, and I'm wondering: could they somehow repeat this loop in the future to get these models even better? You can see it says that every time they repeated this process, the initial candidate solution from Genie was stronger, and in many cases correct, and in the cases where it wasn't, the amount of correction they had to show the model in the data set was much reduced. So there was this iterative improvement of the model improving the model, which was just completely crazy. They also talk
about the future, and they state that despite Genie's impressive state-of-the-art performance, they know there's untapped potential, and they're committed to refining the data set to enhance Genie's capability. They're going to be broadening the data and introducing new capabilities, and Genie will become proficient in more programming languages and the latest frameworks. Overall, they're going to be creating different sizes of AI models: smaller ones for simple tasks, bigger ones for more complex jobs, and they can turn any advanced model into a Genie with their method of fine-tuning. What's interesting is that they're stating they're going to do an open-source model and pre-training, extending a foundational model on their extensive data set, aiming for improved generalization and specialized data reconciliation.

One of the things they talk about is that a really exciting feature for businesses is that they can fine-tune Genie to perfectly understand specific, large codebases. This works even for uncommon or company-specific programming languages; it's like teaching Genie to become an expert in a company's unique dialect of code. So this is going to be rather fascinating, because the software development space for AI has evolved so rapidly, and it seems like nearly every month we get a large update that shows how much these companies are improving.