5 Problems Getting LLM Agents into Production

Sam Witteveen
4 Jun 2024 · 13:11

Summary

TL;DR: This video discusses the top five challenges in deploying AI agents into production, focusing on reliability as the primary issue. The speaker emphasizes the need for agents to be consistently reliable, noting that most only reach around 60-70% reliability, well short of even 99%. Other issues include agents getting stuck in loops, the importance of custom tools, self-checking mechanisms, and the need for agents to be explainable. The video suggests strategies for mitigating these problems and hints at future content on building and debugging agents.

Takeaways

  • πŸ›‘οΈ Reliability is the top concern for deploying AI agents into production, with most agents struggling to achieve even 60-70% reliability, far from the desired 'five nines' or even 'two nines' (99%).
  • πŸ” Agents often fall into excessively long loops, which can be due to failing tools or the LLMs deciding to repeat parts of the process unnecessarily, leading to inefficiency and potential costs.
  • πŸ› οΈ Customizing tools for specific use cases is crucial as generic tools may not meet the needs of an agent and can lead to failures in the agent's operation.
  • πŸ”„ The importance of creating intelligent tools that can manipulate and prepare data for LLMs effectively, and handle failures in a way that prevents endless loops.
  • πŸ” Agents require self-checking mechanisms to ensure the outputs are useful, such as running unit tests for code generation or verifying the existence of URLs.
  • πŸ“‘ Explainability is key for user trust; agents should provide explanations or citations for their outputs to show the reasoning behind decisions or results.
  • 🐞 Debugging is an essential part of agent development; logs and outputs should be intelligently designed to help trace and understand where and why an agent fails.
  • πŸ“Š Minimizing decision points in an agent's operation can lead to more straightforward and reliable outcomes, reducing the complexity and potential for errors.
  • πŸ’‘ The script suggests that not all tasks require the complexity of an LLM; sometimes, a simpler, more direct approach might be more effective.
  • πŸš€ The speaker plans to create more videos discussing building agents with frameworks like LangGraph and even without frameworks, using plain Python for certain tasks.
  • ❓ The video encourages viewers to think critically about their own agent designs, assessing decision points and reliability to improve their agents' performance.

Q & A

  • What are the five common problems discussed in the video script that people face when trying to get their AI agents into production?

    -The script discusses five key issues: 1) Reliability, with agents often not meeting the desired level of consistency; 2) Excessive looping, where agents get stuck in repetitive processes; 3) Tool issues, including the need for custom tools tailored to specific use cases; 4) Self-checking, where agents should be able to verify the usefulness of their outputs; 5) Lack of explainability, which is important for users to understand and trust the agent's decisions.

  • Why is reliability considered the number one problem for AI agents according to the script?

    -Reliability is the top issue because companies typically require a high level of consistency for production use, expecting 'five nines' (99.999%) of reliability and likely settling even for 'two nines' (99%). Most agents, however, only achieve around 60-70% reliability, which is insufficient for production needs.

  • What is the concern with agents going into excessively long loops?

    -Long loops can occur for various reasons, such as a failing tool or the agent deciding to repeat a process unnecessarily. This can lead to inefficiency, increased costs if using an expensive model, and a lack of progress, ultimately hindering the agent's performance.

  • Why is it important to have custom tools for AI agents?

    -Custom tools are crucial because they can be tailored to specific use cases, ensuring that the agent can filter inputs, manipulate data, and prepare it in a way that is beneficial for the LLMs. This customization helps in avoiding common pitfalls and enhances the overall functionality and efficiency of the agent.

  • What is the purpose of self-checking in AI agents?

    -Self-checking allows the agent to verify the usefulness of its outputs, ensuring that the results are accurate and relevant to the task. This is particularly important in tasks like code generation, where running unit tests can confirm the correctness of the code produced by the agent.

  • How does the lack of explainability in AI agents affect their usability in production?

    -Without explainability, it's difficult for users to trust the agent's outputs, as they cannot understand the reasoning behind the decisions made. This is crucial for gaining user confidence and ensuring that the agent's decisions are transparent and justifiable.

  • What is the role of citations in improving the explainability of AI agents?

    -Citations provide a way to attribute the information used by the agent to make decisions or perform tasks. By showing where the information came from, citations offer transparency and help users understand the basis for the agent's actions or conclusions.

Outlines

00:00

πŸ”’ Key Challenges in Agent Reliability

The speaker addresses the common issues encountered when trying to put AI agents into production, focusing on their reliability. Companies often seek high reliability levels, but most agents struggle to achieve even 60-70% effectiveness. The speaker emphasizes the need for agents to be consistently reliable to be useful in production environments, as unreliable agents necessitate constant human oversight, which defeats the purpose of automation. The speaker also touches on the problem of agents getting stuck in loops, a common issue with certain frameworks like CrewAI, and the importance of architecting agents to avoid or quickly exit such loops.

05:02

πŸ› οΈ The Importance of Custom Tools for Agents

This paragraph delves into the critical role of tools in the functionality of AI agents. The speaker points out that while tools like those in LangChain are good for starting out, they often need to be heavily customized for specific use cases. The speaker suggests that understanding and improving these tools is essential, as they are the 'secret sauce' for agents, affecting how data is obtained, manipulated, and prepared for the LLMs. Examples are given, such as a webpage diffing tool, to illustrate how custom tools can be developed to meet specific needs and prevent endless loops, thereby enhancing the agent's efficiency and effectiveness.

10:06

πŸ” Ensuring Agent Outputs are Useful and Explainable

The speaker discusses the necessity for agents to have self-checking mechanisms to ensure the usefulness of their outputs. This is particularly important in scenarios like code generation, where running unit tests can verify the correctness of the code produced by the agent. The speaker also highlights the importance of explainability in agents, where the agent should be able to provide explanations or citations for its outputs, thereby increasing user confidence. Additionally, the speaker touches on the need for intelligent debugging tools and logs that can help trace the agent's decision-making process and identify points of failure.

πŸ›‘ Minimizing Decision Points and Debugging for Agent Efficiency

In the final paragraph, the speaker emphasizes the importance of minimizing decision points within an agent to streamline its operation and increase reliability. The speaker advises assessing existing agents to identify unnecessary decision points and to ensure conformity at each point to achieve desired outcomes. The speaker also mentions the intention to create more videos on building agents with frameworks like LangGraph and CrewAI, despite reservations about using CrewAI for production, and encourages viewers to consider moving from high-level frameworks to more direct coding approaches for greater control and efficiency.


Keywords

πŸ’‘Reliability

Reliability in the context of the video refers to the consistency and dependability of AI agents to perform tasks accurately and effectively. The speaker emphasizes that most AI agents currently fall short of the desired 'five nines' (99.999%) reliability, often achieving only around 60-70% effectiveness. This lack of reliability is a significant barrier to putting AI agents into production, as companies require agents to be consistently useful and beneficial to end users.

πŸ’‘Frameworks

Frameworks in this video are the underlying structures or systems that support the development and operation of AI agents. The speaker mentions being 'framework agnostic,' meaning they aim to provide advice applicable to various systems, but acknowledges that some issues are more relevant to specific frameworks. The discussion of CrewAI and LangGraph as examples illustrates the varying levels of reliability and features that different frameworks offer.

πŸ’‘Production

Production in the video signifies the stage where AI agents are fully implemented and operational in a real-world environment, performing tasks without the need for constant human intervention. The speaker discusses the challenges of reaching this stage due to issues like low reliability and the need for agents to be autonomous and consistently effective.

πŸ’‘Autonomy

Autonomy in the context of the video is the ability of AI agents to operate independently without the need for human oversight or intervention. The speaker desires agents to reach a state where they can produce consistent results on their own, which is crucial for their successful deployment in production environments.

πŸ’‘Loops

Loops in the video refer to repetitive cycles that AI agents may get stuck in, often due to not being satisfied with the output of a tool or a sub-agent. The speaker mentions this as a common issue with certain frameworks, like CrewAI, where agents may repeatedly perform the same actions without achieving the desired outcome, leading to inefficiency and potential cost increases.

πŸ’‘Tools

Tools in the video are the functionalities or programs that AI agents use to perform tasks, such as data retrieval or manipulation. The speaker points out the importance of having reliable and customized tools that can effectively support the agent's operations. They also highlight the need for these tools to be able to communicate failures in a way that prevents the agent from entering endless loops.

πŸ’‘Customization

Customization in the video is the process of tailoring tools and systems to fit specific use cases or requirements. The speaker suggests that while starting with existing tools can be helpful, it's often necessary to create custom tools that are better suited to the unique needs of an AI agent, ensuring they can operate more effectively and efficiently.

πŸ’‘Self-checking

Self-checking is the ability of an AI agent to evaluate its own outputs to determine their usefulness or accuracy. The speaker uses the example of code generation, where the agent should be able to run tests to ensure the code is functional. This concept is crucial for agents to ensure they provide valuable and correct results to end users.

πŸ’‘Explainability

Explainability in the video refers to the agent's capacity to provide rationale or evidence for its decisions or outputs. The speaker argues that this is important for building trust in the agent's results, such as through citations or logs that show the decision-making process. This is particularly important for users to understand and have confidence in the agent's actions.

πŸ’‘Debugging

Debugging in the context of the video is the process of identifying and resolving issues within an AI agent's operation. The speaker stresses the importance of having logs and outputs that can help in understanding where an agent fails or behaves unexpectedly, which is essential for improving its performance and reliability.

πŸ’‘Decision Points

Decision points in the video are the specific instances within an AI agent's operation where it must make a choice or judgment. The speaker advises minimizing these points to streamline the agent's actions and reduce the potential for errors or inefficiencies. This concept is related to simplifying the agent's processes to achieve desired outcomes more directly.

Highlights

The video discusses five common problems faced when trying to put AI agents into production.

Reliability is the top issue for AI agents, with most only achieving around 60-70% effectiveness.

The desire for agents to be fully autonomous without the need for human oversight.

Agents often get stuck in excessively long loops, which can be due to failing tools or repeated attempts at a task.

The importance of hardcoding limits on the number of steps an agent can take to prevent endless loops.

Customizing tools for specific use cases is crucial for the success of AI agents.

Tools should be able to handle data effectively and communicate failures to the LLM in a beneficial way.

Creating custom tools for specific tasks can greatly enhance an agent's functionality.

The necessity for agents to have self-checking mechanisms to ensure the usefulness of their outputs.

The example of using unit tests for code generation by agents to verify their outputs.

The challenge of ensuring agents generate accurate URLs and the importance of validating them.

The lack of explainability in LLM agents and the need for providing explanations for their decisions.

The use of citations as a method to increase confidence in an agent's output by showing information sources.

The importance of debugging and having intelligent logs to trace an agent's decision-making process.

Minimizing decision points in an agent's architecture to streamline outcomes and reliability.

The suggestion that sometimes simple tasks do not require an LLM and can be sequenced without decision points.

The speaker's intention to make more videos on building with LangGraph and evaluating the effectiveness of CrewAI for prototyping.

The emphasis on the importance of considering these problems when developing AI agents for production environments.

Transcripts

play00:00

All right.

play00:00

So in this video, I want to talk about the five problems that I keep seeing

play00:05

again and again when people try to get their agents good enough

play00:09

to basically put into production.

play00:12

I get a lot of questions about frameworks in regard to this.

play00:16

And while I'm trying to be sort of reasonably framework agnostic

play00:20

here, certainly some of these things apply a lot more to some

play00:23

frameworks than to other frameworks.

play00:26

So one of the things that came up recently was someone asked me about

play00:28

putting CrewAI into production.

play00:31

And my comment was that I actually would never currently put CrewAI into production

play00:36

based on the fact that there were so many issues with it that I wouldn't trust it.

play00:42

Putting things like LangGraph into production, that's

play00:44

certainly much more reliable.

play00:47

But I think you've got some of these problems with all of the different

play00:51

agent frameworks if you're not aware of them and if you're not

play00:54

thinking about how to basically fix these problems as we go through.

play00:58

So let's dive into this.

play01:00

By far the number one problem for all of the agents out there

play01:04

at the moment is reliability.

play01:07

So talking to a lot of startups, talking to a lot of companies that

play01:10

want to do agents the thing I'm seeing consistently is that companies are very

play01:15

reluctant to do agents, for anything really complicated just because the

play01:20

reliability of the agents is so low.

play01:23

While your typical company wants five nines of reliability, they'd probably

play01:27

even settle for two nines of reliability, meaning 99%, but most agents are

play01:33

probably at best getting around 60 to 70 percent of being able to do things.

play01:39

Now, there are some places where maybe that's okay, but for the majority

play01:43

of things, getting something into production, you have to make it reliable.

play01:47

You have to be able to make it consistently be able to produce

play01:51

an output that the end user would be able to benefit from.

play01:56

That the end result would be like they expect it to

play02:00

be, something that they can benefit from and actually use.

play02:04

There's no use in creating agents that only work some of the time, and then end

play02:10

up failing a large percentage of the time.

play02:13

The issue that creates is that humans then have to basically

play02:18

check every single thing in the agent.

play02:21

Now that's fine if you're starting out and you're trying to make training

play02:25

data or something like that, and you've got a human in the loop and

play02:28

you're doing that kind of thing.

play02:29

But really what we want for agents eventually is to be able

play02:33

to be fully autonomous, to be fully operating by themselves, producing a

play02:37

consistent level of result, without a human having to be in the loop there.

play02:43

So this brings us to some of the things that actually go wrong.

play02:46

So the second thing that I see happening a lot is agents going into

play02:50

excessively long loops and this can be for a variety of different reasons.

play02:54

But it's quite common to see this in CrewAI and some of the other frameworks.

play02:59

You'll have it set up and the agents will basically not like the

play03:04

output of a tool. One of the ways this happens quite

play03:09

often is a failing tool, or a tool that just isn't working in some way.

play03:14

The other way, though, is where the LLM basically gets a response out

play03:19

from one sub agent to the next part.

play03:23

And it just decides that no, it needs to do that part again.

play03:26

And it just gets into this loop of going through it again and again and again.

play03:30

Now this is one of the frustrations I've felt a lot with CrewAI

play03:34

and with some of the others.

play03:35

with LangGraph, what I actually do is I sort of hard code it so that we kind

play03:40

of know how many steps it's taking.

play03:43

Now CrewAI has actually set up a thing that does something like that

play03:47

nowadays too where you can actually limit the number of steps that it

play03:50

goes through or repeats and retries that it does for this kind of thing.
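
As a rough illustration of that kind of hard cap, here is a minimal plain-Python sketch; agent_step and MAX_STEPS are hypothetical stand-ins for your own loop, not any framework's API (at the time of writing, LangGraph takes a recursion_limit in its run config and CrewAI agents accept a max_iter setting):

    # A minimal sketch of a hard step cap, assuming a hypothetical agent_step()
    # that performs one decide-then-act cycle and marks the state done when finished.
    MAX_STEPS = 10  # hard ceiling on decide/act cycles; tune per use case

    def run_agent(task, agent_step):
        state = {"task": task, "done": False}
        for _ in range(MAX_STEPS):
            state = agent_step(state)  # one LLM decision plus any tool calls
            if state["done"]:
                return state
        # Cap reached: fail loudly instead of burning LLM calls in a loop.
        raise RuntimeError(f"Agent exceeded {MAX_STEPS} steps on task: {task!r}")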

play03:55

But this is a very common pattern that you see with LLM agents, that

play03:58

they get into these kinds of loops.

play04:01

And a lot of what you have to think about when you're architecting an agent is

play04:05

actually how to handle any of these loops.

play04:08

Ideally you want to reduce them to none.

play04:11

But if they do happen, you want to make sure that your overall sort of agent or

play04:17

system is aware that they're happening.

play04:19

And then puts a stop to them pretty quickly.

play04:22

Otherwise, you find that you end up just getting an agent, just going

play04:25

on, making LLM call after LLM call.

play04:28

And if it is, fully autonomous where you're not watching that, they can get

play04:32

very expensive very quickly if you're using an expensive model or something.

play04:36

The third problem that can go wrong is around tools.

play04:39

Now, tools is something that I've been meaning to make a

play04:41

lot more videos about, in here.

play04:44

In the previous section, I talked about failing tools.

play04:47

And this is something that happens a lot, that I feel like

play04:49

people are often not aware of.

play04:51

While the tools in things like LangChain are pretty nice for starting out, you're

play04:57

gonna find that you want to customize them a lot to your specific use case.

play05:02

You need to understand that a lot of those tools were made over a year ago.

play05:06

They were very simple at the time.

play05:08

They're not really made for agents, for the most part.

play05:11

They're often made more for use in sort of RAG than agentic stuff.

play05:16

And you really find that what you want to do is basically make

play05:19

your own set of custom tools.

play05:22

Now I will follow up with a video talking a bit about custom tools,

play05:25

but I will say that, tools are really your agents sort of secret sauce.

play05:30

If you've got a really good set of tools that basically can filter inputs,

play05:36

can use inputs in the right way,

play05:38

and can generate outputs that are going to be beneficial to the actual LLMs.

play05:43

So really the whole tools thing is all about how do you get data?

play05:48

how do you manipulate data?

play05:50

And how do you prepare it for an LLM?

play05:52

And then when it fails, how does the tool basically tell the LLM that it's

play05:57

failed in a way that is actually going to be beneficial, rather than

play06:01

going into an endless loop in here.
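
As a hedged sketch of that idea, not tied to any framework: wrap each tool so that exceptions come back to the LLM as an actionable message instead of raising and triggering blind retries (reports_failure is a made-up helper name).

    # Sketch: report tool failures to the LLM as text it can reason about.
    import functools

    def reports_failure(tool_fn):
        @functools.wraps(tool_fn)
        def wrapped(*args, **kwargs):
            try:
                return tool_fn(*args, **kwargs)
            except Exception as exc:
                # A structured message the LLM can act on, instead of a crash.
                return (f"TOOL_ERROR in {tool_fn.__name__}: {type(exc).__name__}: {exc}. "
                        "Do not retry with the same arguments; try a different approach.")
        return wrapped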

play06:04

So you can see that, for often really simple things, I will make quite complex tools.

play06:09

This is an example of a webpage diffing tool, just to check, basically the

play06:14

outputs of a web page so that an agent can tell when a web page has been updated.

play06:19

So for example, this was a simple use of the tool for basically checking

play06:24

if OpenAI's webpage had been updated.

play06:27

It could then basically assess what new links were there, and then

play06:31

be able to go to those new links.

play06:33

and find out what had been announced, returning news,

play06:37

returning different kinds of things.

play06:38

Now the same kind of thing, worked nicely on sites like CNN and other

play06:41

news sites and stuff like that.
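
The actual tool isn't shown, but a bare-bones version of such a diffing tool might look like this sketch (standard library only; check_page_updated is a hypothetical name, and a real version would want proper HTML parsing and persistent storage):

    # Sketch: hash a page, remember its links, and report what changed.
    import hashlib
    import re
    import urllib.request

    _snapshots = {}  # url -> (content_hash, links); use a database in practice

    def check_page_updated(url):
        html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "ignore")
        digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
        links = set(re.findall(r'href="(https?://[^"]+)"', html))
        previous = _snapshots.get(url)
        _snapshots[url] = (digest, links)
        if previous is None:
            return {"updated": None, "new_links": []}  # first visit, nothing to diff
        old_digest, old_links = previous
        return {"updated": digest != old_digest, "new_links": sorted(links - old_links)}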

play06:43

The idea here, though, is that this is a very custom tool

play06:47

for a very specific use case.

play06:49

And that's how you want to think about most of the things that you're doing.

play06:52

When I look at some of the best, agents that I see companies doing, they've

play06:56

generally got very specific tools that are able to sort of handle

play07:01

different kinds of input, work out what they need to do to generate data,

play07:06

et cetera, provide that back to the agent in a way that's useful so that

play07:11

the agent can know what's going on.

play07:14

One of the sort of classic examples is if you look at a lot of the simple

play07:18

search tools: while they'll return information about what's on the page,

play07:23

they don't actually provide the URL.

play07:26

So you want to sort of go through and customize some of those things so that

play07:29

you're actually getting the URL back.

play07:31

You're storing those URLs.

play07:33

You'll then basically be caching any response to that URL.

play07:38

So, if you're scraping that URL, then you're caching it so that your

play07:41

agent can basically use that cache again and again, without having to

play07:45

repeat itself by calling these different things.
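
A minimal sketch of that caching pattern, here just an in-memory memoized fetch; a production agent would more likely use a disk or Redis cache with expiry:

    # Sketch: cache scraped pages so the agent reuses them instead of re-fetching.
    from functools import lru_cache
    import urllib.request

    @lru_cache(maxsize=256)
    def fetch_cached(url: str) -> str:
        # Repeat calls with the same URL are served from the in-memory cache.
        return urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "ignore")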

play07:49

This is a whole class of what I would call sort of intelligent tools

play07:53

that you want to build in here.

play07:54

All right.

play07:55

This brings us to the fourth problem that I see a lot, which is the

play07:58

whole idea of self-checking.

play08:01

You need your agent to have something, or some way, of being able to check

play08:08

its outputs and see, is it generating outputs that are useful or not useful?

play08:13

The classic example of this would be with code examples.

play08:17

So if you've got an agent that's actually generating code,

play08:22

you want to make sure that at some point, that code is checked and that

play08:26

might be as simple as running a unit test on it to see, do all the imports

play08:31

work, do the functions actually run and return what I expect from them.

play08:36

You want to set up some tests for things like that so that you can

play08:39

actually check the output of the code that the agent is actually generating.
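
As one hedged illustration of such a check, this sketch writes agent-generated code to a temporary file and runs it in a subprocess; check_generated_code is a made-up helper, and a real setup would sandbox the execution and run a proper test suite:

    # Sketch: smoke-test generated code by executing it in a subprocess.
    # Never run untrusted generated code outside a sandbox in production.
    import subprocess
    import sys
    import tempfile

    def check_generated_code(code: str, timeout: int = 30):
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code)
            path = f.name
        result = subprocess.run([sys.executable, path], capture_output=True,
                                text=True, timeout=timeout)
        return result.returncode == 0, result.stdout + result.stderr  # (passed, output)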

play08:44

Now in lots of other use cases, you're not going to be generating code.

play08:48

So you need to think about, in those sorts of situations: how will your agent

play08:53

have the ability to know if something is right versus if something is wrong,

play08:57

how can it check to see that this is something that's going to be useful versus

play09:01

something that's just going to be totally off base of what the end user wants?

play09:06

And that can be things like checking URLs; LLMs love to hallucinate URLs.

play09:12

So check: do those URLs actually exist?

play09:14

Do they not exist?

play09:15

That's the kind of thing you want to think about as you're going through,

play09:18

but this idea of self-checking is a really key thing.
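
A small sketch of that URL check using only the standard library; it sends a HEAD request, and since some sites block those, a failure is a signal to verify further rather than absolute proof:

    # Sketch: verify that an LLM-emitted URL actually resolves.
    import urllib.error
    import urllib.request

    def url_exists(url: str) -> bool:
        try:
            request = urllib.request.Request(url, method="HEAD")
            with urllib.request.urlopen(request, timeout=5) as response:
                return response.status < 400
        except (urllib.error.URLError, ValueError):
            return False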

play09:22

The last thing that I think you need to think about a lot, and that

play09:26

I see as a big problem with LLM agents is the lack of explainability.

play09:31

So you really want to think about when the user actually gets a result

play09:34

back at the end from an agent.

play09:37

Can the agent sort of point to some explanation?

play09:40

Now, citations are a great way of doing this.

play09:43

Citations show exactly where the information used to basically make

play09:48

a decision or to do something came from.

play09:51

That gives people a lot more confidence in the output of the agent when

play09:54

they can see why the agent said something, or why the agent gave a

play09:59

certain result, that kind of thing.
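
One way to bake that in, sketched with a hypothetical AgentAnswer container: carry the sources alongside the answer text so the final output can always render its citations.

    # Sketch: return answers together with the sources they were drawn from.
    from dataclasses import dataclass, field

    @dataclass
    class AgentAnswer:
        text: str
        citations: list = field(default_factory=list)  # source URLs or doc IDs

        def render(self) -> str:
            refs = "".join(f"\n[{i + 1}] {src}" for i, src in enumerate(self.citations))
            return self.text + (f"\n\nSources:{refs}" if refs else "")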

play10:01

It can also be things like being able to look at a set of log files

play10:05

or look at a set of outputs that the agent made along the way.

play10:09

So this brings us to the sixth, a sort of bonus kind

play10:12

of thing that you need to think of, which is debugging an agent.

play10:16

You need to have some kind of outputs or some kind of logs that are kind

play10:21

of intelligent, and not just purely the raw LLM and agent calls.

play10:24

That's one way of doing it, but it can be a very tedious way of going through.

play10:28

You need to be able to assess at which point the agent starts to fall apart.

play10:33

Now, remember a lot of this stuff:

play10:35

if you're using an LLM agent, you should be using it to basically make decisions.

play10:40

And perhaps generate tokens out, as either text or as code or something

play10:45

like that. But mostly, what you're using the reasoning part of an LLM

play10:50

agent for is to be able to make decisions, to be able to see these things.

play10:54

Now you want to make sure that's something that gets logged

play10:58

independently, so that it's quite easy for you to see: ah, okay, this looks a

play11:03

bit suspicious; what's going on here?

play11:05

Can we debug this?

play11:06

We can look at the reasoning points in the agent as we go along.
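
A sketch of logging decision points as structured records, separate from raw LLM traffic, so you can replay just the reasoning when something looks off; the field names are illustrative:

    # Sketch: log each decision point as structured JSON, apart from raw LLM calls.
    import json
    import logging

    logging.basicConfig(level=logging.INFO)
    decision_log = logging.getLogger("agent.decisions")

    def log_decision(step: int, choice: str, rationale: str, **context):
        decision_log.info(json.dumps(
            {"step": step, "choice": choice, "rationale": rationale, **context}))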

play11:11

So these things I think are things that you need to be thinking about constantly

play11:15

when you're doing anything with LLM agents, autonomous agents in here.

play11:19

Far too often, I see people doing stuff where actually you don't even need an

play11:24

LLM to do some of these things; you can just basically sequence them up.

play11:28

There's no need for any sort of decision point or something like that in there.

play11:32

Make sure that, when you're building your agent, you want it to have as few decision

play11:37

points as possible to get the outcome that you want to be able to achieve with this.
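
For the cases where no decision is actually needed, a fixed pipeline in plain Python is enough; fetch_page, extract_text, and summarize below are hypothetical step functions:

    # Sketch: a fixed sequence of steps needs no LLM decision points at all.
    def run_pipeline(initial_input, steps):
        data = initial_input
        for step in steps:
            data = step(data)  # deterministic hand-off, nothing to decide
        return data

    # e.g. run_pipeline("https://example.com", [fetch_page, extract_text, summarize])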

play11:42

So go back and assess some of your own agents and look at them and think

play11:46

about: okay, where are the points of decision going on in here?

play11:51

And how am I checking to make sure that each of these things is being conformed

play11:56

to, so that you do get the actual sort of reliability out of these things.

play12:02

We're going to be making a bunch more videos looking at building things with

play12:06

LangGraph, even with things like CrewAI.

play12:09

Even though I don't think CrewAI is ideal for production,

play12:13

I think it's great for trying ideas out really quickly.

play12:17

I'll show you some of the things that I've been doing with that, to

play12:19

be able to build some of these crews really quickly and try out ideas and

play12:24

get a sense of what is probably going to work, what is not going to work.

play12:28

And then look more at how to convert them across to much more

play12:32

sort of low-level code: things like LangGraph, things like just coding

play12:37

some of these things in plain Python.

play12:39

Often you don't need a framework to do some of these things.

play12:42

And that's something that I want to go into more in the

play12:45

future as we go through this.

play12:46

Anyway, hopefully this video was useful to get you thinking about

play12:49

the key things that go wrong in getting LLM agents into production.

play12:54

And how you can start to think about mitigating some of these

play12:57

problems that you come across.

play12:59

As always, if you've got comments or questions, please

play13:02

put them in the comments below.

play13:03

If you found the video useful, please click like and subscribe.

play13:07

And I will talk to you in the next video.

play13:09

Bye for now.


Related Tags
AI Agents · Production Issues · Reliability · Autonomy · Frameworks · CrewAI · LangGraph · Custom Tools · Self-Checking · Explainability · Debugging