Book Recommendation System in Python with LLMs

NeuralNine
31 Jul 2024 · 24:33

Summary

TLDR: In this video, the host walks through coding a book recommendation system in Python with large language models. Each book's attributes are turned into a textual representation, embedded as a 4,096-dimensional vector by Llama 2 (run locally through Ollama), and stored in a FAISS vector store. The video uses the 7K books dataset from Kaggle and performs a similarity search to recommend the most relevant books. The host also stresses keeping the structure of the textual representation consistent between indexing and querying so that recommendations stay accurate.

Takeaways

  • 📚 The video is about creating a book recommendation system using large language models in Python.
  • 🔍 The goal is to build a vector store (abbreviated "VSS" in the presenter's sketch) that contains vector representations of various books.
  • 📈 The books' attributes like title, description, author, and publishing date will be transformed into a textual representation and then into a high-dimensional vector.
  • 📈 These vectors are 4,096-dimensional; their values are computed from the text by the embedding model rather than chosen at random.
  • 🔎 The system performs a similarity search to find the closest vector in the vector space to a newly input book's vector, suggesting the most similar books.
  • 🤖 Large language models (LLMs) are used for the embedding process, which is crucial for intelligently converting text into meaningful vectors.
  • 🛠️ The video uses Ollama for convenience, a tool that runs models locally and serves text embeddings through a local API.
  • 🗃️ FAISS, the vector similarity search library from Facebook, is used to store and search through the vectors.
  • 📊 The data set used is the 7K books data set from Kaggle, chosen for its inclusion of book descriptions, which are essential for accurate representation.
  • 📝 A textual representation function is created to structure the book data in a way that is useful for the LLM.
  • 🔑 The video emphasizes the importance of maintaining consistent data structure when building and querying the vector store to ensure accurate recommendations.

Q & A

  • What is the main topic of the video?

    -The main topic of the video is how to code a book recommendation system using large language models in Python.

  • What is the purpose of building a vector store for the book recommendation system?

    -The purpose of building a vector store is to contain vector representations of different books, which will be used for similarity search to recommend books.

  • What attributes of books are mentioned in the script as being used for the recommendation system?

    -The attributes mentioned include title, description, author, publishing date, categories, average rating, and number of pages.

  • Why are vector representations used instead of raw text for the book recommendation system?

    -Vector representations are used because they intelligently encode the text into a high-dimensional space where similarity can be measured numerically, which is not possible with raw text.

  • What is the dimensionality of the vectors that represent the books in the system?

    -The dimensionality of the vectors is 4096, meaning each vector has 4096 numerical values.

  • Which model is used for text embedding in the video?

    -The video uses Llama 2, run locally through Ollama, for text embedding. It is a large language model that can convert text into a meaningful vector representation.

  • What is the role of the 'requests' package in the video script?

    -The 'requests' package is used to send an HTTP request to the local Ollama API, which returns the Llama 2 embedding for a given textual representation of a book.

  • What is the data set used in the video for building the book recommendation system?

    -The data set used is the '7K books data set' from Kaggle, which includes descriptions along with other attributes of the books.

  • How does the script handle the process of finding similar books once the vector store is built?

    -The script performs a similarity search in the vector store by finding the vector that is closest to the vector representation of a new book, and then recommends the books associated with the closest vectors.

  • What is the importance of keeping the textual representation structure consistent when building the vector store?

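    -Keeping the textual representation structure consistent is important because the query text is embedded the same way as the indexed books. If the query uses a different structure, its vector is no longer directly comparable to the stored vectors and the similarity search results become less reliable.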

  • How does the video demonstrate the effectiveness of the book recommendation system?

    -The video demonstrates the effectiveness by showing the process of finding similar books to a given book and displaying the recommended books that are indeed similar in genre or theme.
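The answers above note that vectors make similarity measurable numerically. As a minimal sketch of that idea, the toy example below uses made-up 3-dimensional vectors in place of the 4,096-dimensional embeddings. The titles and numbers are purely illustrative assumptions, not data from the video.

```python
import math

# Hypothetical toy "embeddings"; real ones would have 4,096 values each.
books = {
    "Self-help A": [0.9, 0.1, 0.0],
    "Self-help B": [0.7, 0.3, 0.1],
    "Fantasy C": [0.0, 0.9, 0.7],
}
query = [0.85, 0.15, 0.05]  # vector of the book we want recommendations for

# Euclidean (L2) distance from the query to every stored vector.
distances = {title: math.dist(query, vec) for title, vec in books.items()}

# A smaller distance means a more similar book, so sort ascending.
ranking = sorted(distances, key=distances.get)
print(ranking)
```

The two self-help vectors sit close to the query, while the fantasy vector is far away, which is the kind of comparison raw strings do not offer.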

Outlines

00:00

📚 Introduction to Building a Book Recommendation System

The video begins with an introduction to building a book recommendation system using large language models in Python. The presenter outlines the process of creating a vector store to hold vector representations of various books, including attributes like title, description, author, and publishing date. The main goal is to convert these textual attributes into 4,096-dimensional vectors intelligently using large language models, allowing for a similarity search to recommend books.

05:00

🔍 Crafting Textual Representations for Books

The second paragraph delves into the specifics of creating textual representations for each book from the dataset. The presenter discusses the importance of selecting relevant information and structuring it in a way that is useful for the language model. A function is introduced to convert each row of the dataset into a string containing the book's title, authors, categories, description, publishing year, average rating, and number of pages, which will be used for generating embeddings.
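A function of the kind described here could look roughly like the sketch below. The column names (`title`, `authors`, `description`, `categories`, `published_year`, `average_rating`, `num_pages`) are assumptions based on the narration and should be checked against the actual CSV; the example record is made up.

```python
def textual_representation(row):
    """Turn one book record into the text that is later embedded."""
    # Column names are assumed from the narration; verify them in the dataset.
    return f"""Title: {row['title']}
Authors: {row['authors']}
Description: {row['description']}
Categories: {row['categories']}
Publishing Year: {row['published_year']}
Average Rating: {row['average_rating']}
Number of Pages: {row['num_pages']}"""


# A plain dict stands in for a DataFrame row here; the values are made up.
example_row = {
    "title": "Example Book",
    "authors": "Jane Doe",
    "description": "A short made-up description.",
    "categories": "Fiction",
    "published_year": 2001,
    "average_rating": 4.1,
    "num_pages": 250,
}
text = textual_representation(example_row)
print(text)
```

With pandas, the same function can be applied to every row, as the presenter does, with `df["textual_representation"] = df.apply(textual_representation, axis=1)`.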

10:03

🚀 Embedding Text into Vectors and Storing Them

In this segment, the focus shifts to embedding the textual representations into vectors and storing them in a vector store. The presenter uses the FAISS library for the vector store and the Llama 2 model, served locally by Ollama, for generating embeddings from text. The process involves sending requests to the local Ollama API with the textual representations and receiving the corresponding 4,096-dimensional vectors, which are then added to the FAISS index.

15:04

🔎 Searching for Similar Books Using the Vector Store

The fourth paragraph explains how to use the created vector store to find similar books. The presenter demonstrates how to take a book's textual representation, embed it using the same model, and perform a similarity search to find the closest vectors in the vector space. This process involves using the Faiss index to search for the top matches based on the embedded vector of the book in question.
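To make the search step concrete, here is a small NumPy stand-in for what an exact (flat) L2 index does: compare the query against every stored vector and keep the closest ones. This is not the video's code. The video uses a FAISS `IndexFlatL2`, whose `search` method likewise returns distances and indices. The titles and 2-dimensional vectors below are made up.

```python
import numpy as np


def search(X, query, k=5):
    """Exact nearest-neighbour search: (distances, indices) of the k closest rows."""
    # Squared L2 distance from the query to every row of X.
    d2 = ((X - query) ** 2).sum(axis=1)
    # Indices of the k smallest distances, closest first.
    idx = np.argsort(d2)[:k]
    return d2[idx], idx


# Made-up 2-dimensional vectors standing in for 4,096-dimensional embeddings.
titles = ["Book A", "Book B", "Book C", "Book D"]
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.9, 1.1], [5.0, 5.0]], dtype="float32")

# Query with the vector of "Book B" itself, as when the book is already indexed.
distances, indices = search(X, X[1], k=3)
recommended = [titles[i] for i in indices]
print(recommended)
```

When the query book is already in the store, it comes back as its own closest match at distance zero, so the first result is usually skipped when presenting recommendations.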

20:04

🛠️ Troubleshooting and Finalizing the Recommendation System

The final paragraph addresses potential issues that may arise when using different structures for the textual representation during the embedding process. The presenter emphasizes the importance of maintaining consistency in the structure used for training the index and when performing searches. After resolving any issues, the video concludes with a demonstration of how to use the system to recommend similar books based on a given title, showcasing the effectiveness of the recommendation system.
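One way to keep the structure consistent is to route both the indexed books and any hand-written query through a single shared template. The sketch below illustrates that design choice under assumed field names; the `TEMPLATE`, `FIELDS`, and `to_text` names and the sample book are hypothetical.

```python
# One shared template for indexing and querying (field names are assumptions).
TEMPLATE = (
    "Title: {title}\n"
    "Authors: {authors}\n"
    "Description: {description}\n"
    "Categories: {categories}\n"
    "Publishing Year: {published_year}\n"
    "Average Rating: {average_rating}\n"
    "Number of Pages: {num_pages}"
)

FIELDS = ["title", "authors", "description", "categories",
          "published_year", "average_rating", "num_pages"]


def to_text(book):
    """Render a book (dict or DataFrame row) with the shared template."""
    return TEMPLATE.format(**{field: book[field] for field in FIELDS})


# A made-up book that is not in the dataset, written by hand for a query.
new_book = {
    "title": "A Hypothetical Guide to Habits",
    "authors": "Jane Doe",
    "description": "Practical advice on building routines.",
    "categories": "Self-Help",
    "published_year": 2015,
    "average_rating": 4.0,
    "num_pages": 180,
}
query_text = to_text(new_book)

# The query carries the same labels, in the same order, as every indexed book.
labels = [line.split(":")[0] for line in query_text.splitlines()]
print(labels)
```

Because a missing field raises a `KeyError`, a query that does not match the indexed structure fails loudly instead of silently producing a differently shaped text.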

Keywords

💡Book Recommendation System

A book recommendation system is an algorithmic tool designed to suggest books to users based on their interests or past reading habits. In the context of the video, the system utilizes large language models to analyze textual data about books and provide suggestions. The script describes the process of building such a system in Python, which includes creating a vector store to hold vector representations of books for similarity searches.

💡Large Language Models (LLMs)

Large Language Models, or LLMs, refer to artificial intelligence models that are trained on vast amounts of text data and can generate human-like text. In the video, LLMs are used for the embedding process, which involves converting textual information about books into vector representations that can be understood and compared by a computer system.

💡Vector Representation

Vector representation in this context is the conversion of text data into a numerical form that can be processed by a machine learning model. The script mentions that each book is represented as a 4,096-dimensional vector, that is, a point in a high-dimensional space. The individual dimensions are not hand-picked features; they are learned by the model, and what matters for recommendations is how close two points are.
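As a rough worked example of what this means in practice, the sketch below computes the memory needed to hold one 4,096-value `float32` vector for each of the 6,810 rows mentioned in the video (about 106 MiB), and preallocates a small matrix the way the presenter describes. The embedding values are placeholders, not model output.

```python
import numpy as np

DIM = 4096       # embedding size reported in the video
N_BOOKS = 6810   # number of rows mentioned in the video

# Memory needed for all embeddings as float32 (4 bytes per value).
bytes_needed = N_BOOKS * DIM * np.dtype("float32").itemsize
print(f"{bytes_needed / 2**20:.1f} MiB")

# Preallocate a matrix of zeros, here for 3 books only to keep the demo small.
X = np.zeros((3, DIM), dtype="float32")

# Placeholder embedding (a list of floats, as an API would return it).
embedding = [0.5] * DIM
X[0] = np.array(embedding)  # store it in row 0

print(X.shape, X.dtype)
```

Using `float32` matters here because FAISS indexes expect 32-bit floats, which is why the presenter passes the data type explicitly.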

💡Vector Store

A vector store is a database designed to store and manage vector representations of data. In the script, the vector store is used to hold the vector representations of books, allowing for efficient similarity searches to find books similar to a given query.

💡Embedding

Embedding, in the context of natural language processing, is the process of transforming words, phrases, or texts into vectors of real numbers. The video script describes using LLMs to create embeddings for books, which capture the semantic meaning of the book's attributes in a 4,096-dimensional space.

💡Faiss

FAISS (Facebook AI Similarity Search) is a library developed by Facebook AI Research for efficient similarity search and clustering of dense vectors. In the video, FAISS is used as the vector store to manage the embeddings of the books, allowing for fast retrieval of the most similar books based on their vector representations.

💡Kaggle Dataset

The Kaggle Dataset mentioned in the script is a collection of data used for machine learning projects, in this case, a dataset of 7K books that includes attributes like title, author, description, and publishing date. This dataset supplies the books that are embedded and stored in the vector store; no model is trained on it.

💡Textual Representation

Textual representation in the script refers to the structured string that contains the information about a book, such as title, author, description, and other attributes. This representation is crafted to be fed into the LLM to generate a meaningful vector representation.

💡Similarity Search

Similarity search is the process of finding data points that are similar to a given query point in a multidimensional space. In the video, similarity search is performed in the vector space to identify books whose vector representations are closest to the vector of a user-provided book.

💡API

API, or Application Programming Interface, is a set of rules and protocols for building software applications. The script mentions using an API to communicate with the LLM, specifically to send requests for generating embeddings from textual representations of books.

💡Dimensionality

Dimensionality in the context of the video refers to the number of values in a vector representation. The embeddings for the books are 4,096-dimensional, meaning each vector has 4,096 numerical values that together encode the book's text.

Highlights

Introduction to building a book recommendation system using large language models in Python.

Creating a vector store (VSS) to contain vector representations of books.

Using attributes like title, description, author, and publishing date to build textual representations of books.

Transformation of textual data into 4,096-dimensional vectors for intelligent representation.

Similarity search to find the closest vector in the vector space for book recommendations.

Utilization of large language models (LLMs) for the intelligent embedding process.

Choice of Ollama, with the Llama 2 model, for conveniently running models locally.

Explanation of using the Facebook AI Similarity Search (FAISS) vector store.

Selection of the 7K books dataset from Kaggle for its descriptive content.

Installation of necessary packages like pandas, numpy, and FAISS for data processing.

Crafting a textual representation function to structure book information.

Application of the textual representation function to the entire dataset.

Importance of maintaining consistent structure for embedding and similarity search.

Process of embedding book data with the Llama 2 model and adding the vectors to the vector store.

Performance of similarity search to find the top five most similar books.

Demonstration of finding similar books to 'How to Win Friends and Influence People'.

Discussion on the practicality and potential improvements of the book recommendation system.

Encouragement for viewers to experiment with different textual representations for better results.

Conclusion and invitation for feedback on the video's content and approach.

Transcripts

play00:00

what is going on guys welcome back in

play00:01

this video today we're going to learn

play00:02

how to code a book recommendation system

play00:05

by utilizing large language models in

play00:07

Python so let us get right into

play00:09

[Music]

play00:17

it all right so we're going to build a

play00:19

book recommendation system in Python

play00:21

today by utilizing large language models

play00:24

and I want to briefly sketch the process

play00:25

nothing too complicated and nothing too

play00:27

detailed I just want to show you

play00:28

basically here uh visually what we're

play00:31

going to do very basic I'm going to use

play00:32

my mouse so this is not going to be the

play00:34

most beautiful drawing but our goal is

play00:36

to build up a vector store so a vector

play00:39

database I'm going to say VSS here

play00:40

Vector store uh and this is going to

play00:42

contain Vector representations of a

play00:45

bunch of different books so the idea is

play00:47

we have a data set full of different

play00:48

books and these books have certain um

play00:52

attributes like a title a description an

play00:55

author a publishing date and so on so we

play00:57

have a bunch of attributes here and what

play00:59

we want to do is we want to get those

play01:01

uh build a textual representation so

play01:04

basically just raw text containing this

play01:06

information in some way and then we want

play01:08

to somehow intelligently take that uh

play01:11

and turn it into a vector so all

play01:15

representations are going to be turned

play01:16

into vectors and these vectors have some

play01:19

values I don't know four 0.5 something

play01:22

and this is an uh a very high

play01:24

dimensional Vector so I think let me

play01:25

just double check here in my prepared

play01:27

code this is going to be a 4,096

play01:29

dimensional Vector so it's going to have

play01:31

4,096 values numerical values uh which are

play01:35

not random which are intelligently

play01:37

chosen and then these vectors are going

play01:39

to be stored in the vector store and

play01:41

when I get a new book with new

play01:43

information what I do is I take that

play01:46

turn it into the same kind of textual

play01:48

representation turn it into the same

play01:50

kind of vector using the same model so

play01:52

into the same kind of uh Vector

play01:55

representation here and then I ask

play01:57

what's the closest Vector to this one

play01:59

this is a similarity search so um

play02:02

mathematically if I have two points in a

play02:04

4,096 dimensional Vector space there is a

play02:06

distance between all the different

play02:08

points and what I'm asking is using this

play02:10

new Vector what's the closest Vector

play02:12

what's the closest point in this

play02:13

high-dimensional Vector space to this

play02:16

one and I assume this is going to be the

play02:17

most similar book so I'm going to use

play02:19

the five uh closest vectors to determine

play02:22

the five most similar books and

play02:24

recommend them uh as a result now the

play02:27

interesting part of this whole system is

play02:29

uh basically this Arrow here because

play02:31

that is the embedding process taking

play02:34

text and turning it into a vector

play02:36

intelligently and for this we're going

play02:38

to use

play02:39

llms so large language models this is

play02:42

where the intelligence is needed because

play02:43

we need to take a text and we need to

play02:45

somehow take the content of this text

play02:47

and turn it into something that is

play02:49

meaningfully represented in a 4,096

play02:51

dimensional Vector space and then we

play02:53

need to be able to do a similarity

play02:55

search there so that's what we're going

play02:56

to do now for the embedding model um if

play03:00

you want to change the code to use

play03:01

something else you can do whatever you

play03:02

want you can use ChatGPT so the OpenAI

play03:05

API you can use GPT you can use uh any

play03:08

kind of self-hosted model I'm going to

play03:10

use Ollama just for convenience Ollama I

play03:12

have a video on this channel showing how

play03:14

to install and use Ollama basically it

play03:16

allows you to easily run models locally

play03:18

I'm going to use uh I think let me just

play03:20

double check here I'm going to use llama

play03:22

2 um just because it fits into my

play03:24

hardware and I'm going to use llama 2 to

play03:26

get text feed it into it and get an

play03:29

embedding out of it a 4,096 dimensional

play03:31

embedding uh and as a vector store we're

play03:33

going to use FAISS so the Facebook uh

play03:35

Vector store um all right and as a data

play03:38

set we're going to use a Kaggle data set

play03:40

which is the 7K books data set the

play03:42

reason I chose this one is because it

play03:43

also has descriptions so we have

play03:45

actually text describing the content of

play03:48

the book not just the title because the

play03:49

title can be very misleading or very uh

play03:51

simple so we want to have these

play03:54

descriptions here as well so this is the

play03:56

data set I'm going to use you will find

play03:57

a link to it in the description down

play03:59

below um and actually I think I have to

play04:02

download it because I don't have it in

play04:04

my directory so I'm going to just

play04:06

download um going to go to python

play04:09

current here I'm going to download the

play04:10

archive zip and in here we have the book

play04:13

CSV file I'm going to extract

play04:15

it and uh I'm going to just close this

play04:19

all right so now I should have the book

play04:23

CSV file here I'm going to open up a new

play04:25

Jupyter notebook instance and we should

play04:27

install a couple of packages first the

play04:29

basic data science stuff as always so

play04:31

pandas numpy should always be part of

play04:34

the equation here uh and I think for

play04:36

this we're going to use um FAISS as well

play04:39

as I said so we're going to use also the

play04:40

requests package because we need to send

play04:41

a request to the API of Ollama and we're

play04:44

going to use FAISS and then you can use

play04:46

uh I think faiss-gpu and faiss-cpu I

play04:50

think I'm using faiss-gpu here so this

play04:53

utilizes of course the graphical

play04:55

Processing Unit uh and not the CPU but

play04:58

if you don't have a GPU a strong GPU you

play05:00

can also go with faiss-cpu but this is

play05:02

what we need for this video today all

play05:04

right so we're going to start by loading

play05:06

the data set and taking a look at it

play05:08

this should not be too difficult so the

play05:10

data frame is equal to pandas read uh

play05:14

the CSV file which is called book

play05:17

CSV and then I can look at it basically

play05:20

what kind of textual representation you

play05:22

want to use is up to you you can be

play05:23

creative with this one because at the

play05:25

end of the day you want to pick a

play05:26

representation that contains only the

play05:28

necessary information only the useful

play05:30

information uh and you also want to

play05:32

structure it in a way that is the most

play05:34

useful for the llm now what this way is

play05:37

I don't really know you have to play

play05:38

around with it you have to see if

play05:40

different representations give you

play05:41

better results my Approach here is to

play05:44

just craft a string saying um listing

play05:46

the specific information that we need so

play05:48

for example saying the title is this the

play05:51

author is this the publishing year is

play05:53

this the categories are these and so on

play05:55

and so forth um so yeah this is the data

play05:59

that we have I'm going to use the title

play06:01

I'm going to use the authors I'm going

play06:02

to use the category I'm going to use the

play06:04

description uh the publishing year and

play06:08

maybe maybe let's go also with the

play06:10

rating so the rating could also be

play06:12

interesting because yeah I mean if a

play06:14

book has an average rating of one it's

play06:16

probably not that

play06:18

good um all right so what we're going to

play06:22

do is we're going to create a function

play06:24

which is going to take a row and turn it

play06:26

into a textual representation so we're

play06:27

going to say textual

play06:30

representation by the way I have a very

play06:32

very similar video where I do the same

play06:33

thing with movies if you're more

play06:34

interested in movie recommendation you

play06:36

can check this out uh today we're going

play06:38

to do books so textual

play06:41

representation uh we get a row as input

play06:44

here uh and then basically we just take

play06:47

the content of this row so title the

play06:50

authors and so on and turn it into a

play06:51

string so I'm going to use an F string

play06:53

here A multi-line F string so three

play06:55

quotation marks uh and I'm going to say

play06:58

that this is the textual

play07:01

[Applause]

play07:03

representation uh and we're going to

play07:04

start by saying first of all the title

play07:06

of the book is

play07:10

row title like

play07:13

this um actually shouldn't this oh I'm

play07:16

using double why okay we're not in flask

play07:19

here we use just single uh curly

play07:21

brackets so this is the title now and

play07:23

then I can do a line break and I can say

play07:25

the next thing is uh the authors the

play07:28

authors is going to be equal equal to

play07:30

row and then

play07:34

authors and then I can do the same thing

play07:37

for all the other rows uh for all the

play07:39

other columns so again we have title we

play07:41

have authors categories description so

play07:43

let's go with uh description first

play07:47

description is going to be equal to row

play07:51

description

play07:54

obviously then the categories or maybe

play07:57

we should say I mean categories is fine

play08:00

maybe we should say genre but I'm

play08:02

not sure about

play08:03

that categories um and then what did I

play08:08

say we want to go with the publishing

play08:10

year and the average rating that should

play08:12

be it so we're going to go with

play08:14

publishing year is going to be

play08:19

row published year is the column

play08:23

name and then finally we have uh

play08:27

average rating

play08:32

row uh

play08:34

average rating and I think one more

play08:37

feature that is useful is uh the number

play08:40

of pages because maybe I'm only

play08:41

interested in very small books so I

play08:43

usually only read books like 70 pages

play08:45

long uh that might also be a factor even

play08:48

though maybe not the most important one

play08:49

so let's say number of pages going to be

play08:53

equal to

play08:55

row num

play08:57

Pages all right so that function takes a

play09:00

row and turns it into such a string and

play09:03

then all I can do is or all I have to do

play09:05

is I have to just return this textual

play09:10

representation like this all right so

play09:14

let's see what happens when I apply it

play09:16

so let's go and say DF and then iloc up

play09:20

until five and then

play09:22

apply the function textual

play09:25

representation

play09:30

um do I have to provide an axis I

play09:34

think

play09:35

so there you go so this function now

play09:39

applied gives us that maybe we can go

play09:41

values zero and actually I can go and

play09:45

print that to get the result here and

play09:48

you can see we get the title the

play09:49

authors description categories

play09:51

publishing year average rating and

play09:53

number of pages so that is what our

play09:56

function does now now we need to apply

play09:58

that function to all the

play10:00

individual um rows so we're going to say

play10:02

DF

play10:05

textual

play10:08

representation is going to be equal to

play10:10

DF apply textual representation axis

play10:14

equals

play10:16

1 that is uh turning our data set into

play10:20

one where we have these textual

play10:22

representations here all

play10:25

right

play10:27

so we have the this now and the next

play10:30

thing we want to do is we want to take

play10:32

all of this and make an embedding and

play10:35

put everything into a vector store so

play10:36

we're going to say here import faiss

play10:39

import requests now we don't need to

play10:42

really interact with Ollama uh the only

play10:44

thing that you need with Ollama is of

play10:46

course you need to say uh ollama serve I

play10:49

think or ollama run um and then you need to

play10:52

say ollama pull llama2 that's important

play10:56

because you need to have the model on

play10:57

your system um and again I have a video

play10:59

on Ollama if you have struggles with Ollama

play11:02

check out that video so import faiss

play11:04

import requests and then import numpy as

play11:08

NP then we say the dimensionality is

play11:11

4,096 as I said and then the index so

play11:15

the vector store is going to be

play11:17

faiss IndexFlatL2 with the

play11:21

dimensionality here we choose this

play11:23

dimensionality because that is the

play11:24

dimensionality of the response we get

play11:26

from llama 2 when it comes to the

play11:28

embedding

play11:29

uh and then we want to say x is equal to

play11:33

np.zeros and here we pass length data frame

play11:37

textual

play11:40

representation and dimensionality so we

play11:43

just initialize uh these uh actually we

play11:47

need to also pass the data type dtype

play11:50

is float32 uh we just initialize uh

play11:54

input full of zeros here now um and what

play11:57

we want to do next is we want to

play11:58

actually get the embeddings from llama 2

play12:01

so we say for

play12:03

I

play12:05

representation in enumerate so we have

play12:08

an index enumerate DF textual

play12:13

representation uh for that we say take

play12:16

the representation and make a request so

play12:19

say response equal to requests.post

play12:24

and now we just need to use the Local

play12:26

Host URL of llama 2 uh Ollama sorry

play12:29

sorry uh which is by default

play12:32

http localhost and then port 11434 if

play12:38

you didn't change that that should be

play12:39

the default Port of Ollama again here if

play12:42

you don't want to use Ollama you can

play12:44

also use the OpenAI API there is an API

play12:46

for embeddings if you want to replace

play12:48

this code with code that gets you the

play12:50

embeddings from OpenAI where you have

play12:51

to pay money you can do that uh you

play12:54

don't have to do it with Ollama if you

play12:56

don't want to you just have to get

play12:57

somehow the embeddings so / API

play13:01

embeddings and here now we need to pass

play13:03

some data the data is going to be

play13:05

a JSON object and the JSON object will

play13:08

say I want to use the

play13:10

model llama 2 and I want to use the

play13:15

prompt for the embedding which is the

play13:19

representation um all right so that is

play13:22

our request and then the result is going

play13:24

to

play13:25

be or the embedding is going to be equal

play13:28

to the response

play13:30

do get the Json and then get a specific

play13:33

field called

play13:35

embedding and then in order to store

play13:37

that go to index I this is why we do the

play13:39

enumeration here uh go to index I and

play13:42

say that that is now our

play13:44

new uh input here so np.array

play13:49

embedding actually I think we can also

play13:50

use

play13:52

npm uh that would save us some time I

play13:55

guess but it doesn't really matter

play13:57

so yeah uh and in the end when we're

play14:00

done with that we want to do index. add

play14:03

X

play14:05

so we can do it like that and it will

play14:08

take quite some time so I can run this

play14:10

and you will see it will start working

play14:11

and it takes some time so I can actually

play14:13

go ahead and add a line here saying if I

play14:16

modulo 100 is equal to zero print I and

play14:20

remember we have uh how many we have

play14:24

6,810

play14:26

rows so I can run this you can see I get

play14:29

zero then at some point I'm going to get

play14:31

100 200 and so on but you see the

play14:33

progress is quite slow so when you see

play14:36

100 and when you see 200 you're going to

play14:38

see how slow this actually is so I'm not

play14:40

going to do all of this here now on

play14:42

camera and I think actually okay it

play14:45

seems like I cannot run this while

play14:46

recording because it crashes my

play14:47

recording or at least it makes it very

play14:49

laggy but you can see it doesn't uh work

play14:52

very quickly at least on my Hardware

play14:54

maybe you have some powerful GPU and it

play14:55

works instantly so you have to run this

play14:58

for a while I'm not going to do this on

play14:59

camera I already did this this is why I

play15:02

have this index file here I'm going to

play15:03

show you how to create that here in a

play15:04

second but you can run this for example

play15:06

on the first couple of instances if you

play15:08

want to you can run it on the whole data

play15:09

set and just wait but the idea is that

play15:12

once this process is done so once this

play15:14

Loop here is finished and you add

play15:15

everything to the index what you can do

play15:17

easily is let me just close this

play15:20

here um what you can do easily is you

play15:23

can export the index by saying faiss

play15:26

write_index and then

play15:29

uh you take the index and you save it to

play15:31

index now in my case I already did that

play15:34

so uh I'm not going to do this but this

play15:37

is the line of code you would run to do

play15:39

that I'm just adding a two here in case

play15:40

I accidentally run this and delete my

play15:42

index uh and what you can do then is you

play15:45

can load the index from the file Again

play15:47

by saying faiss.read_index and then just a

play15:50

file name so index here and then you can

play15:53

store that in an index so in this case

play15:55

what I'm doing now is I'm loading the

play15:56

index from a file instead of creating it

play15:58

here by training because I already did

play16:00

this exact code here I ran it I waited

play16:02

it I produced the index I wrote it to

play16:04

disk and now I can just say index faiss

play16:06

read_index and I have the full index so

play16:09

everything that is the result of running

play16:11

this just that I now uh terminated that

play16:14

but I now have the index and this is now

play16:17

the same thing that you get when you

play16:18

just run this and let it run until it's

play16:20

finished or you can also go ahead and

play16:21

just truncate it you can say okay give

play16:23

me a random sample of the data I don't

play16:25

need all of it uh it's up to you so it

play16:28

takes some time
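
The save and load steps described here can be sketched roughly as follows. `faiss.write_index` and `faiss.read_index` are the real FAISS calls; the helper names (`non_clobbering_path`, `save_index`, `load_index`) and the default file name `"index"` are assumptions made for this illustration. The path helper is just one way to automate the "add a 2 to the name" precaution so an existing index file is not overwritten.

```python
from pathlib import Path


def non_clobbering_path(path: str) -> Path:
    """Return `path` if it is free, else the first free `path2`, `path3`, ...

    Hypothetical helper: it mirrors the habit of appending a 2 to the file
    name so an existing index file is never overwritten by accident.
    """
    candidate = Path(path)
    suffix = 2
    while candidate.exists():
        candidate = Path(f"{path}{suffix}")
        suffix += 1
    return candidate


def save_index(index, path: str = "index") -> Path:
    """Write a FAISS index to disk without overwriting an existing file."""
    import faiss  # imported lazily so the path helper works without FAISS

    target = non_clobbering_path(path)
    faiss.write_index(index, str(target))
    return target


def load_index(path: str = "index"):
    """Load a previously saved FAISS index from disk."""
    import faiss

    return faiss.read_index(path)
```

Loading the saved file gives back the same index that the embedding loop produced, so the slow step only has to run once.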

play16:29

um now what do we do with this index

play16:32

what we can do with this index now is we

play16:34

can provide a new instance and find the

play16:37

most similar instance of the database so

play16:39

for example uh let's use from our uh

play16:43

data frame here so let's go and say data

play16:45

frame DF where the title

play16:48

contains uh let's look for a book uh

play16:51

classic self-improvement book would be

play16:53

something like uh How to Win Friends and

play16:55

Influence People so let's look for

play16:57

friends uh what's the problem here df

play17:00

title

play17:02

contains oh title.str

play17:05

contains so we have Little House Friends

play17:08

Friends Friends How to Win Friends and

play17:10

Influence People there you go or

play17:11

actually this is the book so it's 4533
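
A minimal sketch of this lookup, assuming pandas and a data frame with a `title` column. The tiny frame below is a stand-in with placeholder rows, not the real data set, so the row position is 2 here rather than 4533.

```python
import pandas as pd

# Tiny stand-in for the books data frame; authors are placeholders.
df = pd.DataFrame(
    {
        "title": [
            "Little House Friends",
            "Example Cookbook",
            "How to Win Friends and Influence People",
        ],
        "authors": ["Author A", "Author B", "Author C"],
    }
)

# Case-sensitive substring match on the title column; na=False treats
# missing titles as non-matches instead of putting NaN into the mask.
matches = df[df["title"].str.contains("Friends", na=False)]
print(matches)

# Positional lookup: row 2 in this toy frame, row 4533 in the video.
favorite_book = df.iloc[2]
print(favorite_book["title"])
```

Positional selection with `iloc` lines up with the vector store only if the rows were added to the index in data frame order, which is the case when the index is built by looping over the frame from top to bottom.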

play17:15

let's say I want to find the most

play17:16

similar book to this one now this is not

play17:18

a new book you can also do that with a

play17:19

book that is not part of the data frame

play17:21

uh but then you would have to craft your

play17:22

string yourself but let's go and say

play17:25

that my favorite book now is equal to

play17:30

df.iloc and it's

play17:33

4533 so if I look at my favorite book

play17:36

you can see it's this one I can also

play17:38

look at the textual representation

play17:43

here and I will

play17:47

get the data for the book now let's say

play17:50

I like this book and I want to find

play17:52

similar books because I want to learn

play17:54

more about this or similar topics here

play17:57

what I can do is I can use the vector

play17:59

store to embed this again and this again

play18:01

this could be something completely

play18:03

different I could go ahead now and craft

play18:04

this myself I don't need to use a string

play18:07

that is already part of the um of the

play18:10

data frame I can go and say title uh

play18:13

The Python Bible 7 in 1 which is a book

play18:17

of mine I can say

play18:18

authors my name and I can put my

play18:21

book here if I want to and I can feed

play18:24

that in as well so it doesn't have to be

play18:26

a book that's already part of the data

play18:27

frame you just craft your string and you

play18:29

feed it into it it doesn't even have to

play18:30

have the structure so you can also go

play18:32

ahead and feed in hello world and embed

play18:33

it it also works uh but it's not very

play18:36

useful so we have this book here and
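
Crafting the string for a book that is not in the data frame could look like the sketch below. The field order (categories, title, authors, average rating, number of pages, publishing year, a blank line, description) is the one the video settles on later. The exact labels, the dictionary keys, and the function name are assumptions for this illustration, and every value in the example entry is a placeholder.

```python
def textual_representation(book: dict) -> str:
    """Render one book as text, always in the same field order."""
    return (
        f"Categories: {book['categories']}\n"
        f"Title: {book['title']}\n"
        f"Authors: {book['authors']}\n"
        f"Average rating: {book['average_rating']}\n"
        f"Number of pages: {book['num_pages']}\n"
        f"Publishing year: {book['published_year']}\n"
        "\n"
        f"Description: {book['description']}"
    )


# Hypothetical entry for a book outside the data frame; replace each
# placeholder with real data before embedding it.
my_book = {
    "categories": "<categories>",
    "title": "The Python Bible 7 in 1",
    "authors": "<author name>",
    "average_rating": "<rating>",
    "num_pages": "<pages>",
    "published_year": "<year>",
    "description": "<description>",
}
text = textual_representation(my_book)
print(text)
```

Using one function for both the indexed books and the query book keeps the two representations in the same shape.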

play18:40

what I want to do now is I want to embed

play18:42

this again assuming that this is not

play18:44

part of the data frame or you can again

play18:46

embed your own string and then take that

play18:48

embedding and perform a similarity

play18:50

search so we do that again by saying

play18:52

basically the exact same thing that we

play18:54

did here so we're going to copy that

play18:56

code
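
The request being copied here might look roughly like this. The URL, the model name `llama2`, and the `embed` helper are assumptions: they presume a local Ollama server exposing its `/api/embeddings` endpoint, which takes a `model` and a `prompt` and returns a JSON object with an `embedding` list. The optional `post` parameter is only there so the function can be tried without a running server.

```python
import numpy as np

# Assumed local Ollama endpoint and model name; adjust to your setup.
OLLAMA_URL = "http://localhost:11434/api/embeddings"
MODEL = "llama2"


def embed(text: str, post=None) -> np.ndarray:
    """Return the embedding of `text` as a float32 array of shape (1, d)."""
    if post is None:
        import requests  # imported lazily; a stand-in `post` can be passed

        post = requests.post
    response = post(OLLAMA_URL, json={"model": MODEL, "prompt": text})
    response.raise_for_status()
    vector = response.json()["embedding"]
    # FAISS expects a 2-D float32 array: one row per query vector.
    return np.array([vector], dtype="float32")
```

Wrapping the vector in an extra pair of brackets gives the array its `(1, d)` shape, and `float32` is the data type FAISS works with.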

play18:58

the response is equal to requests.post

play19:00

and then that but here now instead of

play19:02

representation we pass favorite

play19:04

book uh

play19:07

textual

play19:09

representation or you could also as I

play19:11

said pass your own string um yeah so

play19:16

that is that we get a response from this

play19:19

now we need to get an embedding so

play19:21

what's the embedding of this particular

play19:22

book it's equal to

play19:24

np.array of uh response

play19:29

response.json

play19:32

uh yeah we need to actually use

play19:36

this

play19:37

uh thing here for the shape so

play19:40

response.json

play19:41

embedding and then the data type is

play19:45

equal to

play19:48

float32 this is now the closing bracket

play19:53

actually we need no we need it like this

play19:56

there you go so that's the embedding and

play19:59

now we have to feed this into our index

play20:02

and search for similarities so we say D, I

play20:04

is equal to index.search so we perform

play20:08

the search based on the embedding and

play20:10

we're interested in the top five results

play20:12

so I pass five here and then I can get

play20:15

the matches by saying best

play20:17

matches is equal to

play20:22

np.array um df

play20:27

textual representation

play20:29

so I get only the column with the

play20:32

representations from the data frame and

play20:34

I say that I'm interested in particular

play20:36

in a couple of indices and these indices

play20:40

are what I get as a result here from I

play20:43

so I flatten that because what you need

play20:45

to understand is that I'm doing this and

play20:47

as a result I don't get a textual

play20:48

representation I get positions I get

play20:50

indices of the individual entries and I

play20:53

then need to translate them back to

play20:54

actual representations from the data

play20:56

frame so I can say for match in best

play21:02

matches print the

play21:05

match print an empty line and basically

play21:10

run this and you can see now not

play21:13

surprising actually this is surprising

play21:17

because this is not what I was expecting
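
To make the search step above concrete, here is a small NumPy stand-in for what a flat L2 index returns. This is not FAISS itself, and the index type used in the video is an assumption here. The hypothetical `search` function returns `D` (squared L2 distances) and `I` (row positions), both shaped `(number_of_queries, k)`, and the positions are then mapped back to texts the same way as described above.

```python
import numpy as np


def search(vectors: np.ndarray, query: np.ndarray, k: int):
    """Brute-force L2 search returning (D, I) like a flat L2 index.

    vectors: (n, d) stored embeddings; query: (q, d) query embeddings.
    D holds squared L2 distances, I the row positions, both shaped (q, k).
    """
    diff = query[:, None, :] - vectors[None, :, :]
    dist = (diff ** 2).sum(axis=2)
    I = np.argsort(dist, axis=1, kind="stable")[:, :k]
    D = np.take_along_axis(dist, I, axis=1)
    return D, I


# Toy 2-D "embeddings" and their texts, stored in the same order.
representations = np.array(["book A", "book B", "book C", "book D"])
vectors = np.array([[0, 0], [1, 0], [5, 5], [0, 2]], dtype="float32")

query = np.array([[0.9, 0.1]], dtype="float32")
D, I = search(vectors, query, k=2)

# I contains positions, not text, so map them back to the representations.
best_matches = representations[I.flatten()]
for match in best_matches:
    print(match)
    print()
```

With a real index the call would be `index.search(embedding, 5)`, and the lookup array would be the textual representation column of the data frame.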

play21:21

let me see this is I think the issue is

play21:25

that okay so I actually figured out that

play21:27

the problem was a different one and it

play21:29

was that I was not using the exact same

play21:31

structure that I was using uh when

play21:33

training the previous index because of

play21:35

course I trained the index with my

play21:36

prepared code and there I had a slightly

play21:39

different structure now I changed this

play21:40

you can see now title is no longer the

play21:42

first thing we have categories title

play21:44

authors average rating number of pages

play21:46

publishing year then a blank line and

play21:48

description not because that's

play21:49

necessarily the best way to do it just

play21:51

because that's the way I did it when I

play21:53

trained my index which I loaded so this

play21:56

is just the reason you want to keep this

play21:57

the same you can't train or you cannot

play22:00

build your vector store with examples

play22:02

like these and then use a completely

play22:04

different structure so you have to keep

play22:06

it the same in your case it shouldn't

play22:08

make a difference you should uh get good

play22:10

results immediately because you have

play22:12

only been using one structure in my case

play22:14

it made a difference so just as a side

play22:15

note here it's good that we can learn

play22:17

from mistakes uh you need to keep this

play22:19

the same so you cannot just swap things

play22:21

around here because it's going to mess

play22:22

up the database so I changed the

play22:24

structure to be the exact same as the

play22:25

one I used so now I can run uh these

play22:28

these things here again I'm not going to

play22:31

run this one uh I can read index I can

play22:36

find this book again I can post I can

play22:40

get the best matches and then I can get

play22:42

my results which are in this case now

play22:45

way better now of course this one here

play22:47

is going to be number one because it's

play22:49

the exact same thing but besides that we

play22:52

have here Conduct of Life from Stephen

play22:54

Covey also a self-improvement book we have

play22:57

Psychology The Dance of Intimacy from this

play23:00

author here we have uh Marketing in the

play23:03

bottom line oh actually this was not the

play23:05

title Conduct of Life is not the title

play23:07

First Things First is the title of the

play23:08

book uh and The Dance of Intimacy is the

play23:11

title of this book so this is

play23:12

just a category here um but yeah so you

play23:16

can see that what we get here How to

play23:18

Talk So Teens Will Listen and Listen So

play23:21

Teens Will Talk yeah whatever but these

play23:23

are all like self-improvement/

play23:25

productivity/communication books
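
Since the top hit is the query book itself whenever that book is already in the index, one option is to ask for one extra result and drop the self-match. The helper below is a hypothetical sketch of that idea, and the positions in the example are made up for illustration.

```python
def drop_self_match(ids, query_id, k):
    """Keep the first k result positions that are not the query book itself."""
    return [i for i in ids if i != query_id][:k]


# Hypothetical positions as they might come back from a top-6 search for
# the book stored at position 4533.
ids = [4533, 101, 202, 303, 404, 505]
recommendations = drop_self_match(ids, query_id=4533, k=5)
print(recommendations)
```

For a book that is not part of the index there is nothing to drop, and the first `k` positions are returned unchanged.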

play23:27

maybe we can go and look at a couple of

play23:29

more here and we see that for the most

play23:32

part Art of Happiness these are all

play23:34

self-improvement books so it seems to

play23:37

work to some degree you can play around

play23:38

with that you can play around with

play23:39

different representations you can also

play23:41

try first of all smaller samples and

play23:44

then do it on the whole Vector store or

play23:45

on the whole data uh data set but this

play23:48

is how you can build a uh recommendation

play23:50

system because all you have to do now is

play23:52

you have to come up with new books like

play23:55

uh in this structure here and then you

play23:56

can just feed them in and get

play23:57

recommendations for uh similar books so

play24:01

that's it for today's video I hope you

play24:02

enjoyed it and hope you learned

play24:03

something if so let me know by hitting the

play24:05

like button and leaving a comment in the

play24:06

comment section down below and of course

play24:08

don't forget to subscribe to this

play24:09

Channel and hit the notification Bell to

play24:11

not miss a single future video for free

play24:13

other than that thank you very much for

play24:14

watching see you on the next video and

play24:16

bye

Related Tags
Python, Recommendation System, Large Language Models, Book Database, Vector Store, Embedding, Similarity Search, Data Analysis, Machine Learning, AI Applications