Building a self-corrective coding assistant from scratch
Summary
TLDR: The video discusses using LangGraph to implement iterative code generation with error checking and handling, similar to AlphaCodium. It shows loading documentation context, structuring LLM outputs, defining graph nodes to generate code, check imports and execution, and retry on failure. An example is shown where a coding mistake is fixed via the graph by passing the error back into the context prompt to induce reflection. Experiments found that using the graph boosts code execution success from 55% to 80% (25 percentage points) on a question set. The video encourages users to try this simple yet effective technique of code generation with tests and reflection themselves.
Takeaways
- Introduced LangGraph as a way to build arbitrary logic flows and graphs with LLMs
- Showed how to implement iterative code generation and testing using LangGraph, inspired by the AlphaCodium paper
- Structured LLM outputs using Pydantic for easy testing and iteration on components
- Evaluated code generation success rates with vs. without LangGraph: execution success rose from 55% to 80%
- LangGraph enables feedback loops and reflection by re-prompting with prior errors
- Built an end-to-end example flow for answering coding questions using LangGraph
- Ingested ~60K tokens of documentation for code generation context
- Checked both imports and execution success of generated code before final output
- Emphasized the simplicity of the idea and approach for reproducing key concepts from sophisticated systems like AlphaCodium
- Encouraged viewers to try LangGraph flows in their own applications
Q & A
What is the key innovation introduced in the AlphaCodium paper for code generation?
-The AlphaCodium paper introduces the idea of flow engineering for code generation, where solutions are tested on public and AI-generated tests, and then iteratively improved based on the test results.
How does LangGraph allow building arbitrary graphs to represent logical flows?
-LangGraph allows defining nodes as functions in a workflow, specifying conditional edges to determine the next node based on output, and mapping the nodes and edges to logical flows like in the code generation example.
What is the benefit of using a structured output format from the generation node?
-Using a structured output format with distinct components allows easily implementing tests and checks for aspects like imports and code execution, as well as feeding errors back into the regeneration process.
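As an illustration, here is a minimal Pydantic sketch of such an output model. The field names mirror the prefix/imports/code split described in the video, but the class itself is hypothetical:

```python
from pydantic import BaseModel, Field

# Structured output: three components we can test and feed back independently.
class CodeSolution(BaseModel):
    prefix: str = Field(description="Plain-language setup for the problem")
    imports: str = Field(description="Import statements for the solution")
    code: str = Field(description="Code block, excluding imports")
```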
How does the error handling and regeneration process work?
-When an error occurs in checking imports or executing code, it is appended to the prompt to provide context. The regeneration node then produces a new solution attempt, using the prior error information.
What were the results of evaluating the LangGraph method on a 20-question dataset?
-While import checks were similar with and without LangGraph, code execution success improved from 55% to 80%, showing a significant benefit from the retry and reflection mechanism.
How many iterations does the graph allow before deciding to finish?
-The example graph allows up to 3 iteration attempts before deciding to finish, to prevent arbitrarily long execution.
What size context is used for the generation node?
-The generation node ingests around 60,000 tokens of documentation related to LangChain Expression Language (LCEL) to use as context for answering questions.
What model is used for the generation node?
-The example implements the generation node with GPT-4 (the 0125 preview, which has a 128k-token context window), prompted with the LCEL documentation as context.
What is the purpose of tracking the question iteration count?
-Tracking the number of generation attempts for each question allows implementing logic to finish execution after a certain number of tries.
How could this approach be extended to more complex use cases?
-Possibilities include testing against larger public benchmarks, integrating more sophisticated testing frameworks, and using additional regeneration strategies.
Outlines
Introducing code generation with LangGraph
The first paragraph introduces using LangGraph for code generation, inspired by AlphaCodium. It talks about representing flows with graphs and iterating on solutions based on test results.
Structured output using Pydantic models
The second paragraph demonstrates how to use Pydantic models to structure code generation output into prefix, imports, and code for later processing. It also shows inspecting long-context model calls in LangSmith.
Building a code generation graph
The third paragraph walks through building a code generation graph with nodes for generation, import checking, code execution, and conditionals for retries. It shows how errors can be passed back to the generator.
Completing the code generation workflow
The fourth paragraph completes the workflow, connecting the nodes into a full graph. It then tests an example question, showing how errors trigger retries and reflection.
Evaluation shows improved performance
The fifth paragraph summarizes an evaluation on 20 questions. Using the graph boosts code execution success rates from 55% to 80%, showing the value of retry and reflection.
Keywords
code generation
flow engineering
LangGraph
tests
retries
feedback loop
reflection
context
performance
language model
Highlights
Introduced idea of flow engineering for code generation using testing and iteration to improve solutions
Paper showed ranking and testing solutions on public and AI-generated tests, then iterating to improve based on results
Tweet by Karpathy highlighted moving from prompt-answer to flow where you build up an answer iteratively over time using testing
Introduced LangGraph weeks ago as a way to build arbitrary graphs representing different kinds of flows
Showed using tooling to always format code generation output as a Pydantic object for easy testing and iteration
Implemented simple version of AlphaCodium ideas in LangGraph with question node, generation node, import and execution testing nodes, and retry logic on errors
On errors, retry generation appends the error trace to the prompt to induce reflection and retry answering based on prior output
Import checks performed fine without retry logic, but code execution success rate increased from 55% to 80% using graph and reflection
Showed example of catching error on first try, passing to prompt, and second try getting correct functional code
Simple checks and reflections with graphs can significantly improve code generation performance
AlphaCodium shows sophistication; this shows the simplicity and ease of implementing powerful ideas yourself
LangGraph is great for building reflective, self-improving loops with logical flows and feedback
All code available to run this on any codebase and see improvements
Showed real evaluation results over multiple runs to demonstrate statistical validity of performance gains
Encouraged experimenting with these ideas using the provided building blocks
Transcripts
Hi, this is Lance from LangChain. I want to talk about using LangGraph for code generation. Code generation is one of the really interesting applications of LLMs; we've seen projects like GitHub Copilot become extremely popular. A few weeks ago a paper came out from the folks at Codium AI called AlphaCodium, and it was a really cool paper in particular because it introduced the idea of doing code generation using what you can think of as flow engineering. So instead of just an LLM, a coding prompt like "solve this problem," and a solution, what it does is generate a set of solutions and rank them. That part is a fairly standard prompt-response-style flow, but what I want to draw your attention to is that it then actually tests that code in a few different ways, on public tests and on AI-generated tests, and the key point is this: it iterates and tries to improve the solution based upon those test results. That was really interesting. And a tweet came out from Karpathy on this theme, which mentions that this idea of flow engineering is a really nice paradigm, moving away from naive prompt-answer to a flow where you build up an answer iteratively over time using testing. So it's a really nice idea.
What's kind of cool is that a few weeks ago we introduced LangGraph as a way to build arbitrary graphs that can represent different kinds of flows. I've done some videos on this previously, talking about LangGraph for things like RAG, where you can do retrieval and then a retrieval quality check: grade the documents, and if they're not good, try to retrieve again or do a web search. It's a way to represent arbitrary logical flows with LLMs, in much the same way we do with agents. But the benefit of graphs is that you can outline a flow that's a bit more constrained; it's kind of like an agent with guardrails. You define the steps in a very particular order, and every time you run the graph it executes in that order.
So what I want to do is try to implement some of the ideas from AlphaCodium using LangGraph, and we're going to do that right now. In particular, let's say we want to answer coding questions about some part of the LangChain documentation. For this I'm going to choose the LangChain Expression Language (LCEL) docs: a subset of our docs, around 60,000 tokens, that focuses only on LCEL, which is basically the way you represent chains in LangChain; we'll talk about that in a little bit. I want to do a few simple things. I want one node in our graph that takes a question and outputs an answer, using the LCEL docs as a reference. Then, from that answer, I want to parse out components: the preamble (what is this answering?), the imports specifically, and then the code. To do this I want to use a Pydantic object, so it's very nicely formatted.
If I have that, I can really easily implement tests for things like checking that the imports work and checking that the code executes, and if either of those fail, I can loop back to my generation node and say: hey, try again, here's the error trace. Again, what they're doing in AlphaCodium is way more sophisticated; I don't mean to suggest we're implementing it as-is. It actually works on a bunch of public coding challenges, and it has tests for each question that are both AI-generated and publicly available. So we're doing something much simpler, but I want to show how you can implement these kinds of ideas, and you can make it arbitrarily complex if you want. So I'm going to copy over some code into a notebook that I have running.
want so I'm going to copy over some code
into a notebook that I have running and
all I've done is I've just done some pip
installs and I've BAS to find a few
environment variables for Lang Smith
which we'll see later is pretty
useful and I'm going to call this
docs so this is where I'm going to
ingest the docs related to Lang
expression language and I'm going to
kick off uh this right now so that's
running so again this is using a URL
loader grab all the docs sort them and
clean them a little bit and here we go
so here we go these are all the docs
related to Lang and expression language
it's around 60,000 token tokens I've
measured it in the past so there's our
Now I want to show you something that's very useful; I'll call it tool use. This works with OpenAI models, and other LLMs have similar functionality. What I'm going to do here is show how to build a chain whose output is structured. Remember, in our diagram we want three things for every solution: a preamble, imports, and code, as a structured object we can work with individually. I'll show you right here how to do that. We import BaseModel and Field from Pydantic and define a data model for our output: I want a prefix, which is the plain-language setup to the problem, the import statements, and the code, as three distinct things I can work with later. I'll use GPT-4 (the 0125 preview, a 128k-context-window model). I take this data model, turn it into a tool, and bind it to my model, so basically what's happening is that the model will always perform a function call to attempt to output in the format I specify. I define a prompt that says: here are all the LCEL docs (LCEL being the abbreviation for LangChain Expression Language); answer the question and structure your output in a few ways. What's cool is that we're always forcing that function call, so the model basically tries to output a Pydantic object. So there we go.
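As a rough sketch of what this looks like in code, reusing the CodeSolution model sketched earlier. The prompt wording is illustrative, and `with_structured_output` is one way to do the binding; the video binds the schema as a tool directly, which has the same effect:

```python
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

prompt = ChatPromptTemplate.from_messages([
    ("system", "Here is the full LCEL documentation:\n{context}\n"
               "Answer the user question based on this documentation. "
               "Structure your answer as a prefix, imports, and code."),
    ("user", "{question}"),
])

llm = ChatOpenAI(model="gpt-4-0125-preview", temperature=0)
# Binding the schema forces a function call that emits a CodeSolution object.
code_gen_chain = prompt | llm.with_structured_output(CodeSolution)
```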
Now, what's nice is I can just invoke this with a question, so let's try that. I'm going to set a question: how do I create a RAG chain in LCEL? And we'll run it. Okay, this needs to be a dict... there we go, boom, that's running. We can see right here that we passed in all those docs we previously loaded, so it's about 60,000 tokens of context. And if you think about newer long-context LLMs like Gemini, it becomes more and more feasible to do things like this: take a whole code base or a whole set of documentation, load it, stuff it into a model, and have it answer questions about it. That's still running; the latency is definitely higher because it's a very large context, but that's fine, we have a little bit of time, and we can go over to LangSmith while this is running and have a look. We can see here was our prompt. Look at this: 63,000 tokens. You can see it's a lot of context, and we can actually see it all here in LangSmith. We don't want to scroll through all that, but you can see we asked a question, we're grounding the response in all these LCEL docs, and we're hopefully going to get the response back as a Pydantic object we can play with. Let's see... okay, nice, it's done. You can see our object here has a prefix, and it also has our imports; we can see that in LangSmith. The answer is going to be here, and there you go: your imports, your code, and your prefix, and these can all be extracted from that object really easily. It's a Pydantic object, and you can extract each field: answer.prefix, answer.imports, answer.code, or whatever your keys are. So that's great; that just shows you how tool use works and how we can get structured output out of our generation node.
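Continuing the sketch, invoking the chain and unpacking the structured answer might look like this (`concatenated_docs` is a hypothetical string holding the ~60k tokens of LCEL docs):

```python
answer = code_gen_chain.invoke({
    "context": concatenated_docs,  # the ingested LCEL docs
    "question": "How do I create a RAG chain in LCEL?",
})
print(answer.prefix)   # plain-language setup
print(answer.imports)  # import statements to check
print(answer.code)     # code block to execute
```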
Now that we've established we can do that, I'm going to start setting up our graph. First, I'm going to define our graph state. This is just going to be a dictionary that contains the things relevant to our problem: it'll contain our code solution, it'll contain any errors, and that's all we're going to need. And here is all the code related to my graph; we're going to walk through it, so don't worry too much, I just want to get it all in here.
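A minimal sketch of that state, assuming the single-dict pattern described here (the exact field layout in the notebook may differ):

```python
from typing import Any, Dict, TypedDict

class GraphState(TypedDict):
    # One dict carrying the question, the current CodeSolution,
    # any accumulated error trace, and the iteration count.
    keys: Dict[str, Any]
```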
So here's our code. The way to think about it is simply this (I want to go back to my diagram here): every node in our graph has a corresponding function, and that function modifies the state in some way. Our generation node works with the question and the iteration count; those are the parts of state we want as inputs, and you can see it maps to the diagram. You have the question, and the iteration just counts how many times you've tried; we'll see why that's interesting later. This part is exactly what we saw before: data model, LLM, tool use, all the same stuff, plus the template. Now here's where it gets interesting. If our state contains an error, under this "error" key, that means we've fed back from one of our tests with an error that's already been generated, so we're retrying. And here's why that's interesting: if we're retrying, we append to our prompt, just like we saw above. We add something that says: hey, you tried this before, here was your solution (we saved that under the "generation" key; you can see it right here in our state, the code solution), here is your error, please retry to answer this. So it's kind of like inducing a reflection based on your prior generation and error, and then retrying. That's a very important point, because it gives us feedback: if there's a mistake in either the imports or the execution, we feed that back to generation, and generation retries with that information present. That's all that's happening there: we add it to the prompt, invoke the chain with that error, and get a new code solution. Again, that's if "error" is in our state dict; if it isn't, we go ahead and generate our solution just like we did above. Same thing, easy. One little thing: every time we return, we write that output back to the state and increment our iterations, to record how many times we've tried to answer this question. That's really it; you can see that's all we do: return the generation, return the question, return the number of iterations. Easy.
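A sketch of that generation node under the assumptions above (the state shape, `code_gen_chain`, and `concatenated_docs` come from the earlier sketches; the retry prompt wording is illustrative):

```python
def generate(state: GraphState) -> GraphState:
    """Generate a code solution; on a retry, feed prior errors back into the prompt."""
    state_dict = state["keys"]
    question = state_dict["question"]
    iterations = state_dict["iterations"]

    if state_dict.get("error"):
        # Retrying: append the prior attempt and its error to induce reflection.
        question = (
            f"{question}\n\nYou previously tried to solve this problem:\n"
            f"{state_dict['generation']}\n\nIt failed with this error:\n"
            f"{state_dict['error']}\nPlease reflect on the error and try again."
        )

    solution = code_gen_chain.invoke(
        {"context": concatenated_docs, "question": question}
    )
    new_keys = dict(state_dict)
    new_keys.update({"generation": solution, "iterations": iterations + 1})
    return {"keys": new_keys}
```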
Now here's what's kind of nice. We talked about having these two checks: the check for imports and the check for execution. Our import check node is going to be really simple. We have our solution, and from the solution we can get the imports out, just like we showed above; code_solution.imports comes from our Pydantic object. A Pydantic object has imports, so we get the imports, and all we do is attempt to execute them. If that fails, we alert: hey, the import check failed. And here's the key point: we just create a new key, "error", in our dict, identifying that an error is present, that something failed here; you'll see we use that later. One other little trick: if there was a prior error in our state, we append to it. We do want to maintain that: if there's an accumulation of errors as we run multiple iterations, we want to keep accumulating them so we don't revert and make the same mistake we already made on a future iteration. So we maintain our set of errors. If there's no error here, then we write None: we're good, keep going. And it's basically the same thing with code execution. In that case we extract our code and our imports, create a code block of imports plus code, and try to execute it. Again, if it fails, we write our error and append all prior errors; if it doesn't, we return None. That's it; that's all you really need to know.
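Under the same assumptions, the two check nodes might look roughly like this (using Python's built-in exec for simplicity; in practice you would want sandboxing):

```python
def check_code_imports(state: GraphState) -> GraphState:
    """Try to execute just the imports; record an error on failure."""
    state_dict = state["keys"]
    solution = state_dict["generation"]
    try:
        exec(solution.imports)
        error = None
    except Exception as e:
        print("---IMPORT CHECK FAILED---")
        error = f"Import error: {e}"
        if state_dict.get("error"):
            error = f"{state_dict['error']}\n{error}"  # accumulate prior errors
    return {"keys": {**state_dict, "error": error}}

def check_code_execution(state: GraphState) -> GraphState:
    """Try to execute imports plus code together; record an error on failure."""
    state_dict = state["keys"]
    solution = state_dict["generation"]
    code_block = solution.imports + "\n" + solution.code
    try:
        exec(code_block)
        error = None
    except Exception as e:
        print("---CODE BLOCK CHECK FAILED---")
        error = f"Execution error: {e}"
        if state_dict.get("error"):
            error = f"{state_dict['error']}\n{error}"
    return {"keys": {**state_dict, "error": error}}
```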
Now, note that we're going to have two kinds of gates: we want to know whether either of those tests failed. Again, all we need to do is grab our error, and remember, if the error is None, keep going. Here we're at the code-execution decision point: do we go on to code execution, or do we revert and retry? You can see that if there's no error when we get to this point (because we've done our import check, and there's no error there), we keep going to code execution; we return the node we want to go to. And if there is an error, we return to the generate node. So really, what these functions do (these are conditional edges) is a conditional check based on our output state: if there's no error, go to this node; if there is an error, go back to the generate node. That's it. Same deal with deciding to finish: again, if there's no error... and now here's the iteration thing. For the sake of simplicity, I give it three tries; I don't want it to run arbitrarily long. If there's no error, or if you've tried three times, just end; otherwise, go back to generate. Same kind of thing: decide to finish based on whether or not there's an error in our code execution. That's really all we're doing.
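The two conditional edges, sketched under the same assumed state shape; the returned strings are route names that the graph wiring below maps to nodes:

```python
def decide_to_check_code_execution(state: GraphState) -> str:
    """After the import check: run the code if imports were clean, else retry."""
    if state["keys"].get("error") is None:
        return "check_code_execution"
    return "generate"

def decide_to_finish(state: GraphState) -> str:
    """Finish on success or after three attempts; otherwise retry generation."""
    keys = state["keys"]
    if keys.get("error") is None or keys["iterations"] >= 3:
        return "end"
    return "generate"
```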
So we can go down; we've already run all this. Here is where we actually define what we call our workflow. This is where we register all those functions as nodes and edges and stitch them together. It's actually pretty straightforward; it just follows exactly the diagram we showed above. We're basically adding all of our nodes and building our graph following the diagram, so you can follow along: set your entry point, which is generate; add an edge from generate to the import check; and now our conditional edge. If we're deciding whether to check code execution, that was our function right here, so depending on the output of that function, we decide the next node to go to: if the output says check code execution, we go to that node; if the output says generate, we go back to generate. These are where you specify the logic of which node to go to next, and it's the same for deciding to finish. That's all we do: compile it, done, and it maps to the diagram roughly one to one. So that's actually pretty straightforward.
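Putting it together, the wiring might look like this (node names mirror the sketches above; the StateGraph interface shown is LangGraph's):

```python
from langgraph.graph import END, StateGraph

workflow = StateGraph(GraphState)

# Nodes: generate a solution, then the two checks.
workflow.add_node("generate", generate)
workflow.add_node("check_code_imports", check_code_imports)
workflow.add_node("check_code_execution", check_code_execution)

# Edges: follow the diagram, looping back to generate on failure.
workflow.set_entry_point("generate")
workflow.add_edge("generate", "check_code_imports")
workflow.add_conditional_edges(
    "check_code_imports",
    decide_to_check_code_execution,
    {"check_code_execution": "check_code_execution", "generate": "generate"},
)
workflow.add_conditional_edges(
    "check_code_execution",
    decide_to_finish,
    {"end": END, "generate": "generate"},
)

app = workflow.compile()
```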
There's just one little thing we now need to do: go ahead and try a question. Here's a question; I've run a bunch of these tests already. It seems kind of random, but we actually built an eval set, and this is a question we've found causes some problems, so I want to show you why it's pretty cool: I'm passing a text key in my prompt and I want to process it with some function, process_text; how do I do this using LangChain Expression Language? It's a weird question, but you'll see why it's kind of fun in a little bit. What I'm going to do is just run my graph, and because we print out what happens at every step, we can follow along and see what's happening here.
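Kicking it off, hypothetically, amounts to:

```python
question = ("I'm passing a text key in my prompt and want to process it with a "
            "function process_text. How do I do this with LCEL?")
# Seed the state with the question and a zeroed iteration counter.
final_state = app.invoke({"keys": {"question": question, "iterations": 0}})
solution = final_state["keys"]["generation"]  # the final CodeSolution
```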
So it's going to generate a solution. This may take a little bit, because it's the same kind of long-context generation we saw previously. While this is running, we can go to LangSmith and check this LangGraph run: it's loading up, and we're at generate, so it's actually doing this generation; it's still pending. Here are all our input docs, so you can see we passed this very large context to our LLM, which is cool. Okay, this is interesting: it's going through some checks. The code import check worked; then a decision to check code execution; then testing code execution; and here's an interesting one: the code block check failed, and the decision is to retry. So it's actually doing a regeneration. Okay, it looks like it came to an answer. Let's go look at what happened in our LangGraph run to understand it. If we look at when we attempted... yeah, exactly. Let me pull up the error here.
Here was our response, and what I want to show you is the error that we appended to our prompt. We can make this a bit faster by scrolling; this is the crux of what I want to show you. Okay, here it is. What's cool is that our initial attempt to solve this problem introduced an error; there was an execution error: unsupported operand type(s) for dict and str. Basically, it did something wrong, and we passed that in the prompt to the LLM when it performs the retry. Our initial solution was here, and it had a coding error, as noted, but here you can see we provide that error and say: please try to re-answer this, structured the same way, with the same instructions as before, and here was the question. And we can see this is the test of code execution, which now works: previously, when we tried this, it failed, and that error was passed along in the prompt, like we just saw; the new test indeed works, and our final solution is functional code. That's it. So you can get some intuition for the fact that when you have this retry loop, you can recover from errors using a little bit of reflection. That's really the big idea.
And again, you get your answer out here. There are a bunch of keys; I'll show you quickly: the keys, and then we can look at the generation key. Cool. It's going to be a list, so let's break it out. There it is: there's our code object. We can see the prefix (okay, there's the prefix), the imports, and let's try the code. Hey, let's convince ourselves this actually works: we can exec the imports (that works), then exec the code, and this should work. It's doing something... there, it tells a joke. Great. So this is pretty cool: initially, when it tried to answer this question, it produced an error, and it then retried by passing that error back into the context, just like we outlined in our graph, and on the second try it got it correct. So that's nice; it's a good example of how you can do this feedback and reflection stuff.
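The sanity check at the end of the run amounts to something like this, continuing the earlier sketch:

```python
# Convince ourselves the returned solution actually runs.
exec(solution.imports)
exec(solution.imports + "\n" + solution.code)
```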
Now, we've actually done quite a bit more work on this. I built an eval set of 20 questions related to LangChain Expression Language and evaluated them all using this approach, relative to not using LangGraph, and here are the results. I want to draw your attention to this, because it's a pretty interesting result. For the import check, without LangGraph versus with LangGraph, it's about the same; imports weren't really a problem before this retry-and-reflection stuff. Imports were okay on our eval set of 20 questions. I should note that we actually ran this four times (the chart is showing standard errors), so I accumulated the results and computed standard errors, and you can see there's some degree of statistical reasonableness to these results. In any case, import checks were fine without it, but here's the big difference: there's a big difference in our code execution performance with and without LangGraph. Before LangGraph, if you just try single-shot answer generation, this was about a 55% success rate; in many of the cases we saw code execution fail. But with LangGraph, with this retry-and-reflection mechanism, the success rate goes up to around, I believe, 80%, so it's almost a 50% relative improvement in performance with versus without LangGraph. That was actually really impressive, and it shows the power of a very simple idea: attempting code generation with these very simple checks and reflection can significantly bump up your performance. Again, the AlphaCodium paper shows this in a very sophisticated context, but what's cool is that this is a very simple idea you can implement yourself in not much time.
And we have this all available as a notebook; you can run it on any piece of code you want. Just take whatever documents you want, plumb them in, and you can test this out for yourself. I've been really impressed; I think it's pretty cool. In general, I think LangGraph is a really nice way to build these kinds of reflective or self-reflective applications, where you build feedback loops: you do a check, and if the check fails, you try again with that feedback present in the retry. I'll just show you: we have a blog coming out, though I'm not sure there's anything in it that I haven't already shown you. Yeah, nothing really to highlight; these were our results again, maybe a little clearer to see, but again, a pretty significant improvement in performance from a simple idea. I definitely encourage you to experiment with this, and of course all this code will be available for you, so feel free to experiment and let us know how it goes. Thank you.