OpenAI's NEW Embedding Models

James Briggs
25 Jan 2024 · 16:31

Summary

TL;DR: OpenAI released two new text embedding models, text-embedding-3-small and text-embedding-3-large, showing decent improvements in English-language embedding quality and massive gains in multilingual embedding quality. The models keep the same maximum context window as previous versions and are not trained on more recent data (the knowledge cutoff remains September 2021). Most impressively, the large model's 3072-dimensional embeddings can supposedly be shortened to 256 dimensions while still outperforming the previous 1536-dimensional ada-002 model. The new models were tested by indexing sample text and querying the vectors, with the large model returning the most relevant results.
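
For reference, a minimal sketch of how the new models are called (assumes the openai>=1.0 Python SDK and an OPENAI_API_KEY environment variable; the query string is illustrative):

```python
# Minimal sketch, assuming openai>=1.0 and an OPENAI_API_KEY env var.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

for model in ["text-embedding-3-small", "text-embedding-3-large"]:
    res = client.embeddings.create(model=model, input="why might I want to use llama 2?")
    print(model, len(res.data[0].embedding))  # 1536 for small, 3072 for large by default
```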

Takeaways

  • 😲 OpenAI released two new text embedding models - text-embedding-3-small and text-embedding-3-large
  • 📈 The new models show decent improvements in English language embedding quality
  • 🚀 Massive improvements in multilingual embedding quality: from 31.4 to 54.9 on the MIRACL benchmark
  • ⏪ The models keep a September 2021 knowledge cutoff, so may not perform as well on recent events
  • 🔢 text-embedding-3-large has higher dimensionality (3072) for packing more meaning into each vector
  • 🤯 Can shorten text-embedding-3-large to 256 dims and still outperform the 002 model at 1536 dims
  • 🐢 Text-embedding-3-large is slower for embedding than previous models
  • 🔎 Compared retrieval results across models - text-embedding-3-large performed best overall
  • 🤔 Hard to see big performance differences between models in this test
  • 👍 Results correlate with the claimed performance gains; the 256-dim version will be exciting to test

Q & A

  • What were the two new embedding models released by OpenAI?

    -OpenAI released text-embedding-3-small and text-embedding-3-large.

  • What benchmark showed massive improvements with the new models?

    -The models showed massive improvements on the multilingual embeddings benchmark MIRACL.

  • What was the knowledge cut-off date for the new models?

    -The knowledge cut-off date is still September 2021.

  • What is the benefit of reducing the number of dimensions in embedding vectors?

    -Reducing the number of dimensions trades some embedding quality for smaller vectors, which lowers storage costs and speeds up similarity search.

  • How many dimensions does the Text Embedding 3 Large model use?

    -The text-embedding-3-large model uses 3072 dimensions by default.

  • What did OpenAI claim about reducing dimensions in the large model?

    -OpenAI claimed the large model could be reduced to 256 dimensions and still outperform the previous 1536-dimension ada-002 model (see the sketch after this list).

  • Which model showed the slowest embedding speed?

    -The text-embedding-3-large model was the slowest, taking about 24 minutes to embed the entire dataset.

  • What was the hardest sample question that none of the models answered well?

    -The question "Keep talking about red teaming for Llama 2" was too difficult, and none of the models retrieved relevant results.

  • Which model provided the most accurate results in the comparison?

    -The text-embedding-3-large model provided the most relevant results in the GPT-4 vs Llama comparison.
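
As referenced in the dimension-reduction answer above, the v3 embedding endpoints accept a `dimensions` parameter. A minimal sketch of requesting a shortened vector (assumes openai>=1.0; whether 256 dimensions really beats ada-002 is OpenAI's claim, not something verified here):

```python
from openai import OpenAI

client = OpenAI()

# Ask the API directly for a 256-dimension embedding from the large model.
res = client.embeddings.create(
    model="text-embedding-3-large",
    input="red teaming for llama 2",
    dimensions=256,
)
print(len(res.data[0].embedding))  # 256
```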

Outlines

00:00

😲 New OpenAI text embedding models released

Paragraph 1 introduces the new text embedding models released by OpenAI in January 2024: text-embedding-3-small and text-embedding-3-large. It discusses the impressive performance improvements shown on English and multilingual benchmarks, especially for the large model on the multilingual MIRACL benchmark. However, the models keep the September 2021 knowledge cutoff and do not increase the maximum context window size.

05:01

👨‍💻 Using the new OpenAI text embedding models

Paragraph 2 walks through sample code to set API keys, initialize connections, and create Pinecone indexes to test the new OpenAI text embedding models (as well as the previous text-embedding-ada-002 model). It compares the time taken to embed the full dataset with each model.
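
A sketch of that index setup with the Pinecone serverless client (v3-style SDK; the index name, cloud, and region here are illustrative assumptions, not the notebook's exact values):

```python
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="YOUR_PINECONE_API_KEY")

# One serverless index per model, sized to that model's output dimensionality;
# the same call is repeated with dimension=1536 for ada-002 and 3-small.
pc.create_index(
    name="text-embedding-3-large",
    dimension=3072,
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-west-2"),
)
index = pc.Index("text-embedding-3-large")
```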

10:01

📈 Testing different embedding models on sample questions

Paragraph 3 shows sample tests using the different embedding models to retrieve relevant documents for some sample input questions. It iterates through the models, from 002 to the new small and then large models, assessing differences in performance.
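
A rough sketch of that retrieval loop (the `embed` and `get_docs` helpers are assumptions standing in for the notebook's own functions):

```python
from openai import OpenAI

client = OpenAI()

def embed(model: str, text: str) -> list[float]:
    # One embedding per query; the same call works for all three models.
    return client.embeddings.create(model=model, input=text).data[0].embedding

def get_docs(index, model: str, query: str, top_k: int = 5) -> list[str]:
    # Embed the query, then pull back the top_k most similar chunks.
    res = index.query(vector=embed(model, query), top_k=top_k, include_metadata=True)
    return [m.metadata["text"] for m in res.matches]
```

Called as `get_docs(index, "text-embedding-3-small", "why might I want to use llama 2?")`, with one Pinecone index per model.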

15:05

😃 The new large embedding model performs best

Paragraph 4 concludes after testing that the new large embedding model from OpenAI (text-embedding-3-large) performs best of those tested, getting 4/5 relevant results for the sample question. The author is interested in testing the claimed compression down to 256 dimensions.
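
One way to make that eyeballed "4/5" slightly more mechanical is a crude keyword check over the retrieved chunks (a hypothetical stand-in for the video's manual judgment, not something the video actually runs):

```python
def rough_relevance(chunks: list[str], terms=("llama", "gpt-4")) -> int:
    # Count chunks that mention every term a relevant comparison should contain.
    return sum(1 for text in chunks if all(t in text.lower() for t in terms))

# e.g. rough_relevance(get_docs(index, "text-embedding-3-large", query)) -> 4
```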

Keywords

💡ChatGPT

ChatGPT is an AI chatbot launched by OpenAI in November 2022. It quickly gained popularity for its ability to have natural conversations and perform tasks like answering questions and summarizing text. The video mentions how interest in AI exploded after ChatGPT's release, calling it a major shift in how we approach AI.

💡text embeddings

Text embeddings are vector representations of text that capture semantic meaning. They allow texts with similar meanings to have similar vector representations. The video introduces new text embedding models released by OpenAI which outperform previous models on benchmarks.
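
The "similar meaning, similar vector" idea is usually quantified with cosine similarity; a small sketch (assumes numpy and two embeddings already fetched from the API):

```python
import numpy as np

def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity: 1.0 for identical directions, near 0 for unrelated text.
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```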

💡dimensions

Dimensions refer to the number of numeric values used to represent each text embedding vector. The video discusses how OpenAI's new models can maintain performance even with far fewer dimensions per vector compared to old models. This allows more compact representations.
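
OpenAI has described the shortened vectors as roughly what you get by truncating the full embedding and re-normalizing; a sketch of doing that client-side (treat the equivalence to the `dimensions` parameter as their claim, not verified here):

```python
import numpy as np

def shorten(vec: list[float], dims: int = 256) -> np.ndarray:
    # Keep the first `dims` values, then re-normalize to unit length.
    v = np.asarray(vec)[:dims]
    return v / np.linalg.norm(v)
```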

💡multilingual embeddings

Multilingual embeddings are text embeddings designed to encode and compare semantic meaning across multiple languages in the same vector space. The video highlights major improvements in OpenAI's new models for multilingual embeddings.

💡maximum context window

The maximum context window determines the maximum length of text that can be encoded into a single embedding vector. As discussed in the video, OpenAI has not increased this for the new models since longer texts tend to have multiple meanings that cannot be effectively compressed into one vector.
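
The practical consequence is chunking: split long documents into smaller pieces before embedding. A sketch using tiktoken's cl100k_base encoding (the chunk size is an arbitrary illustrative choice):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def chunk(text: str, max_tokens: int = 400) -> list[str]:
    # Split on token boundaries so each piece fits well inside the context window.
    tokens = enc.encode(text)
    return [enc.decode(tokens[i:i + max_tokens])
            for i in range(0, len(tokens), max_tokens)]
```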

💡knowledge cutoff date

This refers to the latest date of the data used to train the models. The video notes that OpenAI's new embedding models still use data only up to September 2021, meaning they may perform worse on more recent topics (the video gives COVID as an example) than models trained on fresher data.

💡vector indexes

Vector indexes refer to search indexes that allow efficient similarity search over text embedding vectors. The video shows initializing connections to vector search indexes to benchmark different embedding models.
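
What an index buys you is speed over the naive alternative; for intuition, the brute-force version of similarity search looks like this (numpy sketch, fine at small scale but slow over large collections):

```python
import numpy as np

def top_k(query_vec: np.ndarray, doc_matrix: np.ndarray, k: int = 5) -> np.ndarray:
    # Cosine similarity of the query against every row, then take the k best.
    sims = doc_matrix @ query_vec / (
        np.linalg.norm(doc_matrix, axis=1) * np.linalg.norm(query_vec))
    return np.argsort(-sims)[:k]  # row indices of the k most similar documents
```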

💡embedding latency

Embedding latency measures the time taken to generate an embedding for a single text input. As noted in the video, OpenAI's larger new model has higher embedding latency which could slow down applications dependent on generating embeddings in real-time.
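
A crude way to measure it, in the spirit of the video's rough timings (assumes openai>=1.0; the number includes network time, so it is an upper bound on model latency):

```python
import time
from openai import OpenAI

client = OpenAI()

start = time.perf_counter()
client.embeddings.create(model="text-embedding-3-large", input="hello world")
print(f"embed latency: {time.perf_counter() - start:.2f}s")  # network time included
```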

💡information retrieval

Information retrieval refers to automatically finding relevant documents or passages from a collection given a search query or question. The video suggests text embeddings impact this by determining semantic text similarity.

💡natural language processing

Natural language processing (NLP) involves developing algorithms and models to analyze and generate human language. As discussed in the video, text embeddings are crucial for many NLP use cases dealing with natural language understanding.

Highlights

OpenAI released two new embedding models, text-embedding-3-small and text-embedding-3-large, with improved performance

The new models have much better multilingual performance as measured by the MIRACL benchmark

The models have not been trained on more recent data, with a September 2021 knowledge cutoff

The large model can be shortened to 256 dimensions while still outperforming the older 1536-dimension 002 model

The new models embed somewhat more slowly than 002, with the large model notably slower

Tested relevance search on 3 models; the large model performed best at retrieving relevant documents

The small model runs at comparable speed to 002 for embedding

The large model is slower for embedding compared to small and 002

Embedding dimensions can be customized, but fewer dimensions mean lower quality

API latency seems slow currently, likely due to high demand after release

The hardest question, on Llama 2 red teaming, could not be answered well by any model

The small and large models retrieved some relevant documents comparing Llama 2 and GPT-4

The large model retrieved the most relevant documents for the Llama 2 vs GPT-4 comparison

Performance gains align with the benchmarks but are harder to see in practice

Will test very small dimensionality embeddings from large model against 002

Transcripts

00:00

Way back in December 2022 we had the biggest shift in how we approach AI ever, and that was thanks to OpenAI releasing ChatGPT at the very end of November. ChatGPT quickly caught a lot of people's attention, and it was in the month of December that the interest in ChatGPT and AI really exploded. But right in the middle of December, OpenAI released another model that also changed the entire landscape of AI, though it didn't get noticed as much as ChatGPT, and that model was text-embedding-ada-002. Very creative naming, but behind that name is a model that just completely changed the way we do information retrieval for natural language, which covers RAG, FAQs, and basically any use case where you're retrieving text information.

01:06

Now since then, despite a huge explosion in the number of people using RAG and the really cool things that you can do with RAG, OpenAI remained pretty quiet on their embedding models. Embedding models are what you need for RAG, and there had been no new models since December 2022, until now. OpenAI has just released two new embedding models, and a ton of other things as well. Those two embedding models are called text-embedding-3-small and text-embedding-3-large, and when we look at the results that OpenAI is sharing right now, we can see a fairly decent improvement on English-language embeddings with the MTEB benchmark, but perhaps more impressively we see a massive improvement in the quality of multilingual embeddings, which are measured using the MIRACL benchmark. Now, 002 was state-of-the-art when it was released and for a very long time afterwards, and it's still a top-performing embedding model; it had an average score of 31.4 on MIRACL. The new text-embedding-3-large has an average score of 54.9 on MIRACL. That's a massive difference.

02:28

Now one of the other things you notice looking at these new models is that they have not increased the max context window, so the maximum number of tokens that you can feed into the model. That makes a lot of sense with embedding models, because what you're trying to do with embeddings is compress the meaning of some text into a single point, and if you have a larger chunk of text there are usually many meanings within that text. So going large and trying to compress into a single point, those two things don't really go together, because that large text can have many meanings. It always makes sense to use smaller chunks, and clearly OpenAI are aware of that: they're not increasing the maximum number of tokens that you can embed with these models.

03:14

Now the other thing, which is maybe not as clear to me, is that they have not trained on more recent data. The knowledge cutoff date is still September 2021, which is a fair while ago now. Okay, for embedding models maybe that isn't quite as important as it is for LLMs, but it's still important; it's good to have some context of recent events when you're trying to embed meaning. So for things like COVID, you ask a COVID question and these models, I imagine, are probably not going to perform as well as, say, Cohere embedding models, which have been trained on more recent data. Nonetheless this is still very impressive, and the one thing which I think is probably the most impressive thing that I've seen so far is that we're now able to decide how many dimensions we'd like in our vectors. Now there is a trade-off: reduce the number of dimensions and you're going to get reduced-quality embeddings. But what is incredibly interesting, and I almost don't quite believe it yet, I still need to test this, is that they're saying you can cut the large model, text-embedding-3-large, down from 3072 dimensions, which is larger than the previous models, all the way down to 256 dimensions and still outperform ada-002, which is a 1536-dimension embedding model. Compressing all of that performance into 256 floating-point numbers is insane. So I'm going to test that, not right now, but I'm going to test that and just prove to myself that it is possible. I'm a little bit skeptical, but if so, incredible.

05:05

Okay, so with that out of the way, let's jump into how we might use this new model. Jumping right into it, we have this notebook; I'm going to share a link with you, either in the description, and I will try to get a link added to the video as well. First I'm going to pip install, then download the dataset. I'm using this AI arXiv dataset; I've used it a million times before, but it is a good dataset for testing. I'm going to remove all of the columns I don't care about and keep just ID, text, and metadata, the typical format. Then I'm going to take my OpenAI API key (that's platform.openai.com if you need one) and put it in here, and then this is how you create your new embeddings. It's exactly the same as what you did before; you just change the model ID now, and we'll see those in a moment as well. So that is our embedding function.

06:04

Then we jump down and initialize the connection to Pinecone serverless. You get $100 of free credit and you can create multiple indexes, which is what we need, because I want to test multiple models here with different dimensionalities; that's why I'm using serverless, alongside all the other benefits that you get from it as well. Now, taking a look at this, these are the models we're going to look at, using the default dimensions for now; we will try the others pretty soon. We have the original model (well, kind of original, the v2 of embedding from OpenAI), the one they released in December 2022; the dimensionality there is 1536, and most of us will be very familiar with that number by now. The small model uses the same dimensionality, and you can also decrease this down to 512, a nice little thing you can do there. The other embedding model, the large one, the one with the insane performance gains, is 3-large: higher dimensionality, which means you can pack more meaning into that single vector, so it makes sense that this is more performant. But what is very cool is that you can compress this down to 256 dimensions and apparently still outperform this model here, and that is 100% unheard of within vector embeddings; 256 dimensions and getting this level of performance is insane. Let's see; I don't know, maybe; I mean, they say it's true.

07:56

So then I'm going to create three different indexes, one for each of the models, and then what I'm going to do is just index everything. Now it takes a little bit of time to index everything, but while I'm waiting for that we can have a quick look at how long this is taking, because this is also something to consider when you're choosing embedding models. So straight away: one, the APIs right now are, I think, pretty slow, because everything has just been released, so I expect during normal times this number will probably be smaller. For 002 I'm getting 15 and a half minutes to embed everything; that's to embed and then throw everything into Pinecone. It's going slightly slower for the small model, which okay, maybe hasn't been as optimized as 002, and maybe more people are using it right now, but generally it's a pretty comparable speed there, as we might expect. Embedding with 3-large is definitely slower; right now we're on track for about 24 minutes for that whole thing to embed. So yeah, definitely slower, and that also means your embedding latency is going to be slower. You kind of look at this: okay, this is 2 seconds, and this is including your network latency and everything, and also going to Pinecone as well, so you have multiple things there; it's not a 100% fair comparison. But then this one is almost two seconds slower, maybe like 1.5 seconds slower, for a single iteration. So this one is definitely slower, and it will clearly slow things down if you're using RAG or something like that; it is going to slow down that process a little bit, probably not that much compared to the LLM generation component, but still something to consider. So I'm going to wait for this to finish and skip ahead to when it has.

10:07

Okay, so we are done, and it was just about 24 minutes for that final model. I've created this function; it's just going to go through and basically return documents for a query. So let's try it with 002 and see what we get. "Keep talking about red teaming for Llama 2": what do we get? We got, okay, red teaming ChatGPT; no, not quite there. Let's try with the new small model. Okay, cool, let's see: do we mention Llama 2 in here? No, no Llama 2, so also not quite there. This was a pretty hard one; I haven't seen a model get this one yet, so let's see; we're starting with a hard question. Okay, let's see what we have here with the large model. It's talking about red-teaming exercises, this and this, but I don't see Llama 2; no, nothing in there. So okay, maybe that question is too hard for any model, apparently.

11:28

So let's try another; let's just go with "can you tell me why I might want to use Llama 2?" Why would I want to use Llama 2? Now the models usually can get relevant results here, so yeah, straight away with this one you can see Llama 2 scales up to this, its helpfulness and safety are pretty good, better than existing open-source models. Okay, cool, good; I would hope they can get this one, as 002 can. Okay, same result for the small model; I think it's probably the most relevant, or one of the most relevant. Let's see, and then here we get the large model in use; is it the same? Oh, same result. Okay, cool, that's fine.

12:31

Let's try another question. So let's try one where we're comparing Llama to GPT-4, and just see how many of these manage to get either GPT-4 or Llama in there. So, okay, that's like four of five results that seem relevant. Are they actually? Are they talking about, see, they're talking about GPT-4 as well, and yeah, you can see GPT-4 in here. I don't actually see GPT-4 in this one; I see GPT-J. Oh, okay, no, so in effect no: instruction tuning using GPT-4, but not necessarily comparing to GPT-4. This one I don't see them talking about Llama, so these two here are not relevant. This one compares chatbots, instruction tuning of Llama: Llama fine-tuned with GPT-4 outperforms this one and this one, but there's still a gap, so there's a comparison there, fine. Here, okay, that's a Llama fine-tuned on GPT-4 instructions or outputs, but there is a comparison. And again, okay, there's a comparison, right? So there are like three results there that are comparisons; accurate. For the small model, let's see; we compare these. Okay, relevant, I would say, this one; interesting. Second one, not relevant. Third one: all chatbots against GPT-4, comparisons run by a reward model, indicates that all chatbots are compared against it; okay, yeah, that's relevant, two out of three. Here I don't see anything where it's comparing to GPT-4, so I think that's a no; so it's two out of four now. And then here it's talking kind of like about the comparisons, so three out of five. But then the other model was slightly... oh, it was the same, okay.

14:38

Now let's go with the best model. We expect to see more Llama, and I think I do; this one has Llama in four of those answers. We compare, okay, we're comparing; this one, no. So look, this one, okay, they're comparing, so that's accurate. This one, okay, here comparing again, and then this final one here: do we have GPT-4 here? I think so; they have like a bar chart, ChatGPT, GPT-4, and then they have some... I mean, this is a table, it's kind of hard to understand, but it seems like that is actually a comparison as well. So that one too; this one got four out of five, and that's the best-performing one. Okay, that's good; that kind of correlates with what we would expect.

15:32

Cool, okay, those are the new embedding models from OpenAI. I think it's kind of hard to see the performance difference there; I mean, you can see a little bit, maybe, with the large model, but given the performance differences we saw at the start in that table, at least on multilingual, there's a massive leap up, which is insane. I'm looking forward to trying the very small dimensionality and just comparing that to 002; I think that is very impressive, and I'll definitely try that soon. But for now, it looks pretty cool, and I definitely want to try the other models as well that OpenAI have released; there are a few. So for now I'm going to leave it there. I hope all this has been interesting and useful, so thank you very much for watching, and I will see you again in the next one. Bye.