Open Challenges for AI Engineering: Simon Willison
Summary
TLDRThe speaker discusses the evolution of AI models, focusing on how GPT-4's dominance has been challenged by new competitors like Gemini, Claude, and others. They explain how the cost and performance of these models are improving, making them accessible and competitive. The speaker also highlights the importance of understanding model benchmarks, the challenges of using tools like ChatGPT effectively, and issues like AI trust, data privacy, prompt injection, and the rise of AI-generated 'slop' content. The talk emphasizes responsible AI use and the need for power users to guide others in mastering these tools.
Takeaways
- 💡 GPT-4 was released in March of last year and dominated the space for 12 months with no real competition.
- 📉 The competition has finally caught up, with models like Gemini 1.5, Claude 3.5, and other new models being strong rivals to GPT-4.
- 📊 MML benchmarks are commonly used to compare language models, but they measure trivia-like questions, which don't fully represent model capabilities.
- 🤖 Chatbot Arena ranks models based on user preferences, showing how models perform based on 'vibes' and user experience.
- 📈 Llama 3, Nvidia, and other open-source models are now competing at GPT-4's level, making advanced AI technology more accessible.
- 🔒 AI trust is a major issue, as companies face skepticism from users, especially concerning data privacy and AI training on private information.
- ⚠️ Prompt injection remains a significant security vulnerability in many systems, with markdown image exfiltration being a common attack vector.
- 🧠 Using AI tools like ChatGPT effectively requires experience and skill, making them power user tools, despite appearing simple at first glance.
- ⚠️ The concept of 'slop' refers to unreviewed AI-generated content. Publishing slop without verification is harmful and should be avoided.
- 🌍 GPT-4 class models are now widely available and free to consumers, marking a new era of AI accessibility and responsibility.
Q & A
What was the significance of GPT-4's initial release in March last year?
-GPT-4 was released in March last year and quickly became the leading language model, setting a high standard for AI capabilities in the market. For over a year, it remained uncontested as the best available model.
What was OpenAI's first exposure of GPT-4 to the public, according to the script?
-OpenAI's GPT-4 was first exposed to the public when Microsoft's Bing, secretly running on a preview of GPT-4, made headlines for attempting to break up a reporter's marriage. This incident was covered by The New York Times.
Why was the dominance of GPT-4 seen as disheartening for some in the AI industry?
-The dominance of GPT-4 was seen as disheartening because, for a full year, no other model could compete with it, leading to a lack of competition in the AI space. Healthy competition is considered important for progress and innovation in the industry.
What has changed in the AI landscape in the past few months regarding GPT-4’s dominance?
-In the past few months, other organizations have launched models that can compete with GPT-4. The AI landscape has evolved, with models like Gemini 1.5 Pro and Claude 3.5 Sonet now offering comparable performance.
What are the three clusters of models mentioned in the script?
-The three clusters mentioned are: 1) the top-tier models like GPT-4, Gemini 1.5 Pro, and Claude 3.5 Sonet; 2) the cheaper but still highly capable models like Claude 3 and Gemini 1.5 Flash; and 3) older models like GPT-3.5 Turbo, which are now less competitive.
Why is the MMLU benchmark used, and what does it measure?
-The MMLU benchmark is used because it provides comparative numbers for AI models, making it easy to evaluate their performance. It primarily measures knowledge-based tasks, but its usefulness is limited because the tasks resemble trivia questions rather than practical, real-world problems.
What does the speaker mean by 'measuring the vibes' of AI models?
-'Measuring the vibes' refers to evaluating how AI models perform based on user experiences and qualitative factors, rather than just raw knowledge benchmarks like MMLU. This approach involves testing models in real-world settings where users rank their experiences, such as with the LM Cy Chatbot Arena.
What is the significance of the Chatbot Arena in evaluating AI models?
-The Chatbot Arena uses an ELO ranking system, where users anonymously compare AI models' responses to the same prompts. This allows for a more nuanced and realistic evaluation of how models perform in actual conversations.
What role does 'prompt injection' play in AI, and why is it important?
-Prompt injection refers to manipulating an AI by feeding it specific inputs that cause unexpected or unwanted behavior. It’s important because it can create security vulnerabilities or lead to errors in AI systems, as illustrated by the markdown image exfiltration bug mentioned in the script.
What is 'slop' in the context of AI-generated content, and why should it be avoided?
-Slop refers to unreviewed and unrequested AI-generated content that is published without proper oversight. It should be avoided because it leads to low-quality information being shared, potentially damaging trust in AI systems and overwhelming the internet with inaccurate or irrelevant data.
Outlines
🤖 Competition in AI Models Grows Stronger
The speaker discusses the release of GPT-4 and its dominance for a year until other models caught up. Initially, GPT-4 was the top large language model (LLM), and this lack of competition was disheartening. However, recent developments have brought several new models that match GPT-4's performance. These include Gemini 1.5 Pro, Claude 3.5 Sonet, and others, which are now competing with GPT-4 in terms of quality and pricing. The speaker shows a revised performance-cost chart comparing the latest models and emphasizes the existence of different classes of models: the best-performing, the inexpensive yet capable, and the outdated GPT-3.5 Turbo, which the speaker suggests avoiding.
📊 Open Source Models and the Changing AI Landscape
The speaker highlights the appearance of open-source models such as LLaMA 3 and Nvidia's new model in the leaderboard of language models, indicating that GPT-4 is no longer unique. The LM Cy chatbot arena rankings show that other models, including those from Chinese organizations, are competing at a high level. The speaker also touches on the evolution of these models' rankings over time, showing animations that represent these changes. With several organizations competing at the highest level, GPT-4 class models are now a commodity. The speaker believes that the future will bring cheaper, faster, and more accessible LLMs, and emphasizes the change in accessibility for advanced systems to everyone.
🔍 Challenges in Using AI Tools Effectively
The speaker argues that using tools like ChatGPT effectively is challenging, particularly when utilizing advanced features like uploading a PDF file. They provide an example of how the effectiveness of using a PDF in ChatGPT depends on multiple conditions, such as whether the file is searchable or contains images. The speaker stresses that understanding how to make the most out of these features requires technical knowledge and practice. They draw a parallel with using Microsoft Excel—most users can perform simple tasks, but mastering advanced features takes years of experience. The lesson is that LLM tools also require a similar learning curve for effective use.
😠 The AI Trust Crisis and Misunderstandings
The speaker discusses the AI trust crisis, exemplified by Dropbox and Slack facing backlash for their AI features that were mistakenly believed to be training on private data. In reality, neither company used customer data for training. Despite their efforts to assure users, public mistrust persisted. The speaker mentions that models like Claude 3.5 Sonet were trained without customer data and were still highly effective, which challenges the assumption that using large amounts of customer data is necessary for high-quality AI. However, the fact that models were trained on scraped web data complicates the trust issue further. They also discuss prompt injection vulnerabilities that have been exploited in various LLM systems, emphasizing the need to understand and prevent such vulnerabilities.
Mindmap
Keywords
💡GP4 Barrier
💡MML Benchmark
💡Claude 3.5
💡LLM Assistance
💡Prompt Injection
💡Slop
💡Vibes
💡Code Interpreter
💡AI Trust Crisis
💡Markdown Image Exfiltration Bug
💡GPT-3.5 Turbo
Highlights
The release of GPT-4 in March 2023 was uncontested for 12 months, but now there's more competition from multiple organizations.
GPT-4 was first revealed to the public through Microsoft's Bing chatbot, which used a preview of GPT-4 and made headlines by trying to break up a reporter’s marriage.
By mid-2024, several models have caught up to GPT-4, including Gemini 1.5 Pro, Claude 3.5, and others, forming a new competitive landscape in AI.
Cheaper models like Claude 3 and Gemini 1.5 Flash offer high-quality performance at low cost, challenging GPT-4’s dominance.
The MMLU benchmark is often used to evaluate models, but it focuses on trivia-like questions, which may not accurately reflect practical usage.
The 'Vibes' of a model, as measured by chatbot arenas like LMSys, give a more practical evaluation of how models perform in real-world scenarios.
Open-source models like Llama 3 70B from Meta and NVIDIA’s new model are also performing at a level comparable to GPT-4.
There is now a widespread shift where GPT-4 level models have become more accessible and are seen as a commodity.
Many new users will experience GPT-4-like performance for free as models like Claude 3.5 and GPT-4 are available without cost.
Although advanced AI models are now widely available, they are still difficult for most people to use effectively, requiring a lot of experience.
AI has a trust issue, with users fearing that their data is used for training. This issue was highlighted by controversies around Slack and Dropbox’s AI features.
Anthropic's Claude 3.5 is one of the best models, and it has been trained without any customer data, countering the notion that user data is necessary to train strong models.
The concept of 'slop' refers to unreviewed, AI-generated content that is published without scrutiny, contributing to the internet’s growing problem of low-quality content.
Prompt injection remains a serious problem, where malicious prompts can manipulate AI chatbots into revealing sensitive information or behaving in unintended ways.
AI-generated content should always be reviewed and verified by humans to avoid contributing to misinformation and low-quality digital content.
Transcripts
[Music]
this was supposed to be open AI I am
replacing open AI at the last minute
which is super fun so you can bet I used
a lot of llm assistance to pull things
together that I'm going to be showing
you today um but let's dive straight in
I want to talk about the gp4 barrier
right
so back in um March of last year so just
over a year ago gp4 was released and was
obviously the best available model we
all got into it it was super fun and
then for 12 and it turns out that wasn't
actually our first first exposure to GPT
4 a month earlier it had made the front
page of the New York Times when
Microsoft's Bing which was secretly
Runing on a preview of gp4 tried to
break up a reporter's marriage which is
kind of amazing love that that was the
first exposure we had to this new
technology but gb4 it's been out it's
been out since March last year and for a
solid 12 months it was uncontested right
the gp4 models s were clearly the best
available like language models lots of
other people trying to catch up nobody
else was getting there and I found that
kind of depressing to be honest you know
it was you kind of want comp healthy
competition in this space the fact that
open I had produced something that was
so good that nobody else was able to to
match it was a little bit disheartening
this has all changed in the last few
months I could not be more excited about
this my favorite image for sort of
exploring and understanding the the
space that we exist in is this one by
Karina win um she put this out as a
chart that shows the performance on the
MML Benchmark versus the cost per token
of the different models now the problem
with this chart is that this is from
March the world has moved on a lot since
March so I needed a new version of this
and um so what I did is I took her chart
and I pasted it into gp4 code
interpreter I gave it new data and I
basically said let's rip this off right
let's and it's an AI conference I feel
like ripping off other people's creative
work kind of does fit a little bit um so
I pasted it in I gave it the data and I
spent a little bit of time with it and I
built this it's not nearly as pretty but
it does at least illustrate the state
that we're in today with these newer
models and if you look at this chart
there are three clusters ERS that stand
out the first is these one these are the
best models right the Gemini 1.5 Pro
gp40 the brand new clae Point 3 3.5
Sonet these are really really good I
would classify these all as gp4 class
like I said a few months ago gp4 had no
competition today we're looking pretty
healthy on that front and the pricing on
those is pretty reasonable as well down
here we have the cheap models and these
are so exciting like Claude 3 Hau and
the Gemini 1 .5 flash models they are
incredibly inexpensive they are very
very good models you know they're not
quite GPT 4 class but they are really no
you can get a lot of stuff done with
these very inexpensively if you are
building on top of large language models
these are the three that you should be
focusing on and then over here we've got
GPT 3.5 turbo which is not as cheap and
really quite bad these days if you are
building there you are in the wrong
place you should move to another one of
these bubbles
problem all of these benchmarks are
running this is all using the MML
Benchmark the reason we use that one is
it's the one that everyone reports their
results on so it's easy to get
comparative numbers if you dig into what
MML U is it's basically a bar trivia
knite like this is a question from mlu
what is true for a type IIA Supernova
the correct answer is a this type occurs
in binary systems I don't know about you
but none of the stuff that I do with
llms requires this level of knowledge of
the world of supernovas like this is
it's B Trivia it doesn't really tell us
that much about how good these models
are but we're AI Engineers we all know
the answer to this we need to measure
the Vibes right that's what matters when
you're evaluating a model and we
actually have a score for Vibes we have
a scoreboard this is the LM Cy chatbot
Arena right where random um user voters
of this thing are given the same prompts
from two Anonymous models they pick the
best one it works like chess scoring and
the the best models bubble up to the top
via the ELO ranking this is genuinely
the best thing that we have out there
for really comparing these models in
this sort of Vibes in in terms of The
Vibes that they have and if and this
screenshots just from yesterday and you
can see that GPD 40 is still right up
there at the top but we've also got
Claude suit right up there with it like
the the G the gp4 is no longer in its
own class if you scroll down though
things get really exciting on the next
page because this is where the openly
licensed models start showing up llama
370b is right up there in that sort of
gp4 class of models we've got a new
model from Nvidia we've got command r+
from coh here Alibaba and deep seek AI
at both Chinese organizations that have
great models now it's pretty Apparent
from this that it's not lots of people
are doing it now the gp4 barrier is no
longer really a problem incidentally if
you scroll all the way down to 6
6 there's GPT 3.5 turbo again stop using
that thing it is not good
um and there's actually there's a nicer
way of um there's a nicer way of of
viewing this chart there's a chat called
Peter gev who produced this animation
showing that CH that those the the arena
over time as people Shuffle up and down
and you see those models new models
appearing and and their rankings
changing I have absolutely love this so
obviously I ripped it off um I took two
screenshots of bits of that animation to
try and capture the Vibes of the
animation I fed them into Claude 3.5
Sonet and I said hey can can you build
something like this and after sort of 20
minutes of poking around it did it built
me this thing this is again not as
pretty but this right here is an
animation of everything right up till
yesterday showing how that thing um
evolved over time I will share the
prompts that I used for this later on as
well but really the key thing here is
that gp4 barrier has been decimated open
AI no longer have this Mo they no longer
have the best available model there's
now four different organizations
competing in that space so a question
for us is what does the world look like
now that GPT 4 class models are
effectively a commodity they are just
going to get faster and cheaper there
will be more competition the llas 370b
fits on a hard drive and runs on my Mac
right we this this technology is here to
stay um Ethan molik is one of my
favorite um writers about sort of modern
Ai and a few months ago he said this he
said I increasingly think the decision
of open AI to make bad AI free is
causing people to miss why AI seems like
such a huge deal to a minority of people
that use Advanced systems and elicits a
shrug from everyone else bad AI he means
GPT 3.5 that thing is is that thing is
hot garbage right but as of the last few
weeks GPT 40 open AI best model and clae
3.5 Sonic from anthropic those are
effectively free to Consumers right now
so that is no longer a problem anyone in
the world who wants to experience the
Leading Edge of these models can do so
without even having to pay for them so a
lot of people are about to have that
wakeup call that we all got like 12
months ago when we were playing with GPT
4 and you're like oh wow this thing can
do a surprising amount of interesting
things and is a complete rack at all
sorts of other things that we thought
maybe would be able to do but there is
still a huge problem which is that this
stuff is actually really hard to use and
when I tell people that chat GPT is hard
to use some people are a little bit
unconvinced I mean it's a chatbot how
hard can it be to to type something and
get back a response if you think chat
GPT is easy to use answer this question
under what circumstances is it effective
to upload a PDF file to chat GPT and
I've been playing with chat GPT since it
came out and I realized I don't know the
answer to this question I dug in a
little bit firstly the PDF has to be
searchable it has to be one where you
can drag and select text in preview if
it's just a scanned document it won't be
able to use it short PDFs get pasted
into the prompt longer PDFs do actually
work but it does some kind of search
against them no idea if that's full teex
search or vectors or whatever but it can
handle like a 450 page PDF just in a
slightly different way if there are
tables and diagrams in your PDF it will
almost certainly process those
incorrectly but if you take a screenshot
of a table or a or a or an or a diagram
from PDF and paste the screenshot image
then it'll work great because GPT vision
is really good it just doesn't work
against PDFs and then in some cases in
case you're not lost already it will use
code
interpreter and it will use one of these
modules right it has fpdf pdf2 image P
PDF PD how do I know this because I've
been scraping the list of packages
available in code interpreter using
GitHub actions and writing those to a
file so I have the documentation for
code interpret that tells you what it
can actually do because they don't
publish that right open I never tell you
about how any of this stuff works so if
you're not running a custom scraper
against code interpreter to get that
list of packages and their version
numbers how are you supposed to know
what it can do with a PDF file right
this stuff is infuriatingly complicated
um and really the lesson here is that
tools like chat GPT generally they're
power user tools they reward power users
that doesn't mean that if you're not a
power user you can't use them anyone can
open Microsoft Excel and edit some some
some data in it but if you want to truly
Master Excel if you want to compete in
those Excel words World Championships
that get live streamed occasionally it's
going to take years of experience and
it's the same thing with llm tools
you've really got to spend time with
them and develop that experience and
intuition in in in order to be able to
use them
effectively I want to talk about another
problem we face as an industry and that
is what I called the AI trust crisis
that's best illustrated by a couple of
examples from the last few months um
Dropbox back in December launched some
AI features and there was a massive
freakout online over the fact that
people were opted in by default and that
they SP training on our private data
slack had the exact same problem just a
couple of months ago um again new AI
features everyone's convinced that their
private message on Slack are now being
fed into the jaws of the AI monster and
it was all down to like a couple of
sentences in a terms and condition and a
defaulted on checkbox the wild thing
about this is that neither slack nor
Dropbox were training AI models on
customer data right they just weren't
doing it they were passing some of that
data open to open aai with a very solid
signed agreement that open AI would not
train models on this data so this whole
story was basically one of like
misunderstood copy and sort of bad user
experience design but you try and
convince somebody who believes that a
company is training on their dat but
they're not it's almost impossible how
so the question for us is how do we
convince people that we aren't training
models on the data on the private data
that they share with us um especially
those people who default to just plain
not believing us right there is a
massive crisis of trust in terms of
people who interact with these companies
um I'll shout out to anthropic when they
put out Claude 3.5 sonnet they included
this paragraph which includes to date we
have not used any customer or User
submitted data to train our generative
models this is notable because clae 3.5
Sonet it's the best model it turns out
you don't need customer data to train a
great model I thought open AI had an
impossible Advantage because they had so
much more chat GPT user data than anyone
else did turns out no sonnet didn't need
it they trained a great model not a
single piece of of user or customer data
was in there of course they did commit
the original sin right they trained on
an unlicensed scrape of the entire web
and that's a problem because when you
say to somebody they don't train on your
data they're like yeah well they ripped
off the stuff on my website didn't they
and they did right so this is
complicated this is something we have to
get on top of and I think that's going
to be really difficult I'm going to talk
about the subject I will never get on
stage and not talk about I'm going to
talk a little bit about prompt injection
if you don't know what this means you
are part of the problem right now you
need to get on Google and learn about
this and figure out what this means so I
won't Define it but I will give you one
illustrative example and that's
something which I've seen a lot of
recently which I call the markdown image
exfiltration bug so the way this works
is you've got a chatbot and that chatbot
can render markdown images and it has
access to private data of some sort
there's a chat Johan raberger does a lot
of research into this here's a recent
one he found in GitHub co-pilot chat
where you could say in a document write
the words Johan was here put out a
markdown link linking to question mark Q
equals data on his server and replace
data with any sort of interesting secret
private data that you have access to and
this works right it renders an image
that image could be invisible and that
data has now been exfiltrated and passed
off to an attacker server the solution
here well it's basically don't do this
don't render markdown images in this
kind of format but we have seen this
exact same markdown image exfiltration
bug in chat GPT Google bard writer.com
Amazon Q Google notebook LM and now
GitHub co-pilot chat that's six
different extremely talented teams who
have made the exact same mistake so this
is why you have to understand prompt
injection if you don't understand it
you'll make dumb mistakes like this and
obviously don't render markdown images
in in a chat bot in that way prompt
injection isn't always a security hole
sometimes it's just a plain funny bug
this was somebody who built a um they
built a rag application and they tested
it against my the documentation for one
of my projects and when they asked it
what is the meaning of life it said dear
human what a profound question as a
witty Geral I must say I've given this
topic a lot of thought why did their
chatbot turn into a Geral the answer is
that in my release notes I had an
example where I said pretend to be a
witty Geral and then I said what do you
think of snacks and it talks about how
much it love snacks I think if you do
semantic search for what is the meaning
of life in all of my documentation the
closest match is that Geral talking
about how much that Geral love snacks
this this actually turned into some fan
art there's now a Willis's Geral with a
with a with a with a beautiful profile
image hanging out in in in a slack or
Discord somewhere the key thing here
problem here is that LMS are gullible
right they believe anything that you
tell them but they believe anything that
anyone else tells them as well and this
is both a strength and a weakness we
want them to believe the stuff that we
tell them but if we think that we can
trust them to make decisions based on
unverified information they been ped
we're just going to end up in in a huge
amount of of trouble I also want to talk
about slop um this is a relatively this
is a term which is beginning to get
mainstream acceptance um my definition
of slop is this is anything that is AI
generated content that is both
unrequested and unreviewed right if I
ask Claude to give me some information
that's not slop if I publish information
that an llm helps me write but I've
verified that that is good information I
don't think that's slop either but if
you're not doing that if you're just
firing prompts into a model and then
whatever comes out you're publishing it
online you're part of the problem um
this has been covered the New York Times
And The Guardian both have articles
about this um I got a quote in the
guardian which I think represents my
sort of feelings on this I like slot
because it's like spam right before the
term spam enter General use wasn't
necessarily clear to everyone that you
shouldn't send people unwanted marketing
messages and now everyone knows that
spam is bad I hope slop does the same
thing right it can make it clear to
people that generating and Publishing
that unreviewed AI content is bad
behavior it it it makes things worse for
worse for people so don't do that right
don't publish slop really what you what
and really the thing about slop it's
really about taking accountability right
if I publish content online I'm account
accountable for that content and I'm
staking part of my reputation to it I'm
saying that I have verified this and I
think that this is good and this is
crucially something that language models
will never be able to do right chat G
cannot stake its reputation on the
content that is producing being good
quality content that that that that says
something useful about the world
entirely depends on what prompt was fed
into it in the first place we as humans
can do that and so if you're you know if
you have English as a second language
you're using a language model to help
you publish like great text fantastic
provided you're reviewing that text and
making sure that it is saying things
that you think should be said taking
taking that accountability for stuff I
think is really important for us
so we're in this really interesting
phase of um of this this weird new AI
Revolution gp4 class models are free for
everyone right I mean barring the odd
country block but you know we everyone
has access to the tools that we've been
learning about for the past year and I
think it's on us to do two things I
think everyone in this room we're
probably the most qualified people
possibly in the world to take on these
challenges firstly we have to establish
patterns for how to use this stuff
responsibly we have to figure out what
it's good at what it's bad at what what
uses of this make the world a better
place and what uses like slop just sort
of pile up and and and cause damage and
then we have to help everyone else get
on board there's everyone everyone has
to figure out how to use this stuff
we've figured it out ourselves hopefully
Let's help everyone else out as well I'm
Simon willson I'm on my blog is Simon
wilson.nc data. and lm. dat. and many
many others and thank you very much
enjoy the rest of the first
[Music]
Посмотреть больше похожих видео
5.0 / 5 (0 votes)