Book Recommendation System in Python with LLMs
Summary
TL;DR: In this video, the host guides viewers through coding a book recommendation system using Python and large language models. The system involves creating a vector store to hold vector representations of books, transforming textual data into 4,096-dimensional vectors with the help of LLMs like Llama 2. The video demonstrates how to use a dataset from Kaggle, craft textual representations, and perform similarity searches to recommend the most relevant books. The host also highlights the importance of maintaining consistent data structures for accurate recommendations.
Takeaways
- The video is about creating a book recommendation system using large language models in Python.
- The goal is to build a vector store that contains vector representations of various books.
- Book attributes like title, description, author, and publishing date are transformed into a textual representation and then into a high-dimensional vector.
- These vectors are 4,096-dimensional, intelligently derived from the text to represent each book uniquely.
- The system performs a similarity search to find the closest vectors in the vector space to a newly input book's vector, suggesting the most similar books.
- Large language models (LLMs) are used for the embedding process, which is crucial for intelligently converting text into meaningful vectors.
- The video uses ollama for convenience, a tool that allows running models locally to get text embeddings.
- A FAISS vector store from Facebook is used to store and search through the vectors.
- The dataset used is the 7K books dataset from Kaggle, chosen because it includes book descriptions, which are essential for accurate representation.
- A textual representation function is created to structure the book data in a way that is useful for the LLM.
- The video emphasizes maintaining a consistent data structure when building and querying the vector store to ensure accurate recommendations.
Q & A
What is the main topic of the video?
-The main topic of the video is how to code a book recommendation system using large language models in Python.
What is the purpose of building a vector store for the book recommendation system?
-The purpose of building a vector store is to contain vector representations of different books, which will be used for similarity search to recommend books.
What attributes of books are mentioned in the script as being used for the recommendation system?
-The attributes mentioned include title, description, author, publishing date, categories, average rating, and number of pages.
Why are vector representations used instead of raw text for the book recommendation system?
-Vector representations are used because they intelligently encode the text into a high-dimensional space where similarity can be measured numerically, which is not possible with raw text.
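The intuition can be illustrated with a toy sketch: once texts are mapped to vectors, "similarity" becomes a plain distance computation. The 3-dimensional vectors below are made up for illustration; real embeddings from Llama 2 have 4,096 dimensions.

```python
import numpy as np

# Made-up toy "embeddings" -- real ones come from an LLM and are 4,096-dimensional.
book_a = np.array([0.9, 0.1, 0.0])  # e.g. a self-help book
book_b = np.array([0.8, 0.2, 0.1])  # a similar self-help book
book_c = np.array([0.0, 0.1, 0.9])  # an unrelated novel

# Euclidean (L2) distance: smaller means more similar.
print(np.linalg.norm(book_a - book_b))  # small distance -> similar books
print(np.linalg.norm(book_a - book_c))  # large distance -> dissimilar books
```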
What is the dimensionality of the vectors that represent the books in the system?
-The dimensionality of the vectors is 4096, meaning each vector has 4096 numerical values.
Which model is used for text embedding in the video?
-The video uses 'llama 2' for text embedding, which is a large language model that can convert text into a meaningful vector representation.
What is the role of the 'requests' package in the video script?
-The 'requests' package is used to send a request to the ollama API (serving the Llama 2 model) to get the embedding for a given textual representation of a book.
What is the data set used in the video for building the book recommendation system?
-The data set used is the '7K books data set' from Kaggle, which includes descriptions along with other attributes of the books.
How does the script handle the process of finding similar books once the vector store is built?
-The script performs a similarity search in the vector store by finding the vector that is closest to the vector representation of a new book, and then recommends the books associated with the closest vectors.
What is the importance of keeping the textual representation structure consistent when building the vector store?
-Keeping the textual representation structure consistent is important because it ensures that the vector store accurately reflects the text and maintains the integrity of the similarity search results.
How does the video demonstrate the effectiveness of the book recommendation system?
-The video demonstrates the effectiveness by showing the process of finding similar books to a given book and displaying the recommended books that are indeed similar in genre or theme.
Outlines
Introduction to Building a Book Recommendation System
The video begins with an introduction to building a book recommendation system using large language models in Python. The presenter outlines the process of creating a vector store to hold vector representations of various books, including attributes like title, description, author, and publishing date. The main goal is to convert these textual attributes into 4,096-dimensional vectors intelligently using large language models, allowing for a similarity search to recommend books.
Crafting Textual Representations for Books
The second paragraph delves into the specifics of creating textual representations for each book from the dataset. The presenter discusses the importance of selecting relevant information and structuring it in a way that is useful for the language model. A function is introduced to convert each row of the dataset into a string containing the book's title, authors, categories, description, publishing year, average rating, and number of pages, which will be used for generating embeddings.
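The function described above can be sketched as follows. This is a minimal version, assuming the column names of the 7K books dataset (`title`, `authors`, `categories`, `description`, `published_year`, `average_rating`, `num_pages`); the exact field order the presenter ends up using differs (see the troubleshooting section).

```python
def textual_representation(row):
    # Build one plain-text string per book; 'row' can be a pandas row
    # or any dict-like object with the 7K books dataset's columns.
    return f"""Title: {row['title']}
Authors: {row['authors']}
Categories: {row['categories']}
Description: {row['description']}
Publishing year: {row['published_year']}
Average rating: {row['average_rating']}
Number of pages: {row['num_pages']}"""

# With pandas, it is applied row-wise:
# df['textual_representation'] = df.apply(textual_representation, axis=1)
```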
Embedding Text into Vectors and Storing Them
In this segment, the focus shifts to embedding the textual representations into vectors and storing them in a vector store. The presenter discusses using the Faiss library for the vector store and the Llama 2 model for generating embeddings from text. The process involves sending requests to the Llama 2 API with the textual representations and receiving the corresponding 4,096-dimensional vectors, which are then added to the Faiss index.
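The request the video sends can be sketched like this. The helper function below is hypothetical (introduced here for clarity, not from the video); the endpoint and payload shape follow ollama's embeddings API as described in the video, and the actual network call is commented out since it only works with a local ollama server running.

```python
OLLAMA_URL = "http://localhost:11434/api/embeddings"  # ollama's default port

def build_embedding_request(text, model="llama2"):
    # JSON body for ollama's /api/embeddings endpoint; the response
    # carries the vector under the "embedding" key.
    return {"model": model, "prompt": text}

# With a local server running (`ollama pull llama2`, then `ollama serve`):
# import requests, numpy as np
# resp = requests.post(OLLAMA_URL, json=build_embedding_request(text))
# vec = np.array(resp.json()["embedding"], dtype="float32")  # 4,096 values
```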
Searching for Similar Books Using the Vector Store
The fourth paragraph explains how to use the created vector store to find similar books. The presenter demonstrates how to take a book's textual representation, embed it using the same model, and perform a similarity search to find the closest vectors in the vector space. This process involves using the Faiss index to search for the top matches based on the embedded vector of the book in question.
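What `faiss.IndexFlatL2.search` does can be reproduced in a few lines of NumPy, which makes the mechanics clear: it is an exhaustive nearest-neighbor lookup by L2 distance. This brute-force stand-in is only an illustration of the idea, not a replacement for FAISS at scale.

```python
import numpy as np

def search_l2(stored, query, k=5):
    # Brute-force equivalent of a flat L2 index search for one query:
    # return (distances, indices) of the k nearest stored vectors.
    dists = np.linalg.norm(stored - query, axis=1)
    order = np.argsort(dists)[:k]
    return dists[order], order

stored = np.array([[0.0, 0.0], [1.0, 1.0], [0.1, 0.0]], dtype="float32")
_, idx = search_l2(stored, np.array([0.0, 0.1], dtype="float32"), k=2)
print(idx)  # indices of the two nearest stored vectors, nearest first
```

The returned indices are then mapped back to rows of the dataframe to display the recommended titles, exactly as the video does with the FAISS result.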
Troubleshooting and Finalizing the Recommendation System
The final paragraph addresses potential issues that may arise when using different structures for the textual representation during the embedding process. The presenter emphasizes the importance of maintaining consistency in the structure used for training the index and when performing searches. After resolving any issues, the video concludes with a demonstration of how to use the system to recommend similar books based on a given title, showcasing the effectiveness of the recommendation system.
Mindmap
Keywords
Book Recommendation System
Large Language Models (LLMs)
Vector Representation
Vector Store
Embedding
Faiss
Kaggle Dataset
Textual Representation
Similarity Search
API
Dimensionality
Highlights
Introduction to building a book recommendation system using large language models in Python.
Creating a vector store (VSS) to contain vector representations of books.
Using attributes like title, description, author, and publishing date to build textual representations of books.
Transformation of textual data into 4,096-dimensional vectors for intelligent representation.
Similarity search to find the closest vector in the vector space for book recommendations.
Utilization of large language models (LLMs) for the intelligent embedding process.
Choice of using ollama for the convenience of running models locally.
Explanation of using the Facebook AI Similarity Search (FAISS) vector store.
Selection of the 7K books dataset from Kaggle for its descriptive content.
Installation of necessary packages like pandas, numpy, and FAISS for data processing.
Crafting a textual representation function to structure book information.
Application of the textual representation function to the entire dataset.
Importance of maintaining consistent structure for embedding and similarity search.
Process of embedding book data into the vector store using the Llama 2 model.
Performance of similarity search to find the top five most similar books.
Demonstration of finding similar books to 'How to Win Friends and Influence People'.
Discussion on the practicality and potential improvements of the book recommendation system.
Encouragement for viewers to experiment with different textual representations for better results.
Conclusion and invitation for feedback on the video's content and approach.
Transcripts
what is going on guys welcome back in
this video today we're going to learn
how to code a book recommendation system
by utilizing large language models in
Python so let us get right into
it all right so we're going to build a
book recommendation system in Python
today by utilizing large language models
and I want to briefly sketch the process
nothing too complicated and nothing too
detailed I just want to show you
basically here uh visually what we're
going to do very basic I'm going to use
my mouse so this is not going to be the
most beautiful drawing but our goal is
to build up a vector store so a vector
database I'm going to say VSS here
Vector store uh and this is going to
contain Vector representations of a
bunch of different books so the idea is
we have a data set full of different
books and these books have certain um
attributes like a title a description an
author a publishing date and so on so we
have a bunch of attributes here and what
we want to do is we want to take those
and build a textual representation so
basically just raw text containing this
information in some way and then we want
to somehow intelligently take that uh
and turn it into a vector so all
representation are going to be turned
into vectors and these vectors have some
values I don't know four 0.5 something
and this is an uh a very high
dimensional Vector so I think let me
just double check here in my prepared
code this is going to be a 4,096
dimensional vector so it's going to have
4,096 numerical values uh which are
not random which are intelligently
Chosen and then these vectors are going
to be stored in the vector store and
when I get a new book with new
information what I do is I take that
turn it into the same kind of textual
representation turn it into the same
kind of vector using the same model so
into the same kind of uh Vector
representation here and then I ask
what's the closest Vector to this one
this is a similarity search so um
mathematically if I have two points in a
4,096 dimensional vector space there is a
distance between all the different
points and what I'm asking is using this
new Vector what's the closest Vector
what's the closest point in this
high-dimensional Vector space to this
one and I assume this is going to be the
most similar book so I'm going to use
the five uh closest vectors to determine
the five most similar books and
recommend them uh as a result now the
interesting part of this whole system is
uh basically this Arrow here because
that is the embedding process taking
text and turning it into a vector
intelligently and for this we're going
to use
llms so large language models this is
where the intelligence is needed because
we need to take a text and we need to
somehow take the content of this text
and turn it into something that is
meaningfully represented in a 4,096
dimensional Vector space and then we
need to be able to do a similarity
search there so that's what we're going
to do now for the embedding model um if
you want to change the code to use
something else you can do whatever you
want you can use chat GPT so the open AI
API you can use GPT you can use uh any
kind of self-hosted model I'm going to
use ollama just for convenience I
have a video on this channel showing how
to install and use ollama basically it
allows you to easily run models locally
I'm going to use uh I think let me just
double check here I'm going to use llama
2 um just because it fits into my
hardware and I'm going to use llama 2 to
get text feed it into it and get an
embedding out a 4,096 dimensional
embedding uh and as a vector store we're
going to use faiss so the Facebook
vector store um all right and as a data
set we're going to use a kaggle data set
which is the 7K books data set the
reason I chose this one is because it
also has descriptions so we have
actually text describing the content of
the book not just the title because the
title can be very misleading or very uh
simple so we want to have these
descriptions here as well so this is the
data set I'm going to use you will find
a link to it in the description down
below um and actually I think I have to
download it because I don't have it in
my directory so I'm going to just
download um going to go to python
current here I'm going to download the
archive zip and in here we have the book
CSV file I'm going to extract
it and uh I'm going to just close this
all right so now I should have the book
CSV file here I'm going to open up a new
Jupiter notebook instance and we should
install a couple of packages first the
basic data assign stuff as always so
pandas numpy should always be part of
the equation here uh and I think for
this we're going to use um faiss as well
as I said so we're going to use also the
requests package because we need to send
a request to the API of ollama and we're
going to use faiss and then you can use
uh I think faiss-gpu and faiss-cpu I
think I'm using faiss-gpu here so this
utilizes of course the graphical
processing unit uh and not the CPU but
if you don't have a GPU a strong GPU you
can also go with faiss-cpu but this is
what we need for this video today all
right so we're going to start by loading
the data set and taking a look at it
this should not be too difficult so the
data frame is equal to pandas read uh
the CSV file which is called book
CSV and then I can look at it basically
what kind of textual representation you
want to use is up to you you can be
creative with this one because at the
end of the day you want to pick a
representation that contains only the
necessary information only the useful
information uh and you also want to
structure it in a way that is the most
useful for the llm now what this way is
I don't really know you have to play
around with it you have to see if
different representations give you
better results my Approach here is to
just craft a string saying um listing
the specific information that we need so
for example saying the title is this the
author is this the publishing year is
this the categories are these and so on
and so forth um so yeah this is the data
that we have I'm going to use the title
I'm going to use the authors I'm going
to use the category I'm going to use the
description uh the publishing year and
maybe maybe let's go also with the
rating so the rating could also be
interesting because yeah I mean if a
book has an average rating of one it's
probably not that
good um all right so what we're going to
do is we're going to create a function
which is going to take a row and turn it
into a textual representation so we're
going to say textual
representation by the way I have a very
very similar video where I do the same
thing with movies if you're more
interested in movie recommendation you
can check this out uh today we're going
to do books so textual
representation uh we get a row ENT input
here uh and then basically we just take
the content of this row so title the
authors and so on and turn it into a
string so I'm going to use an F string
here A multi-line F string so three
quotation marks uh and I'm going to say
that this is the text ual
representation uh and we're going to
start by saying first of all the title
of the book is
row title like
this um actually shouldn't this oh I'm
using double curly brackets why okay we're not in Flask
here we use just single uh curly
brackets so this is the title now and
then I can do a line break and I can say
the next thing is uh the authors the
authors is going to be equal equal to
row and then
authors and then I can do the same thing
for all the other rows uh for all the
other columns so again we have title we
have authors categories description so
let's go with uh description first
description is going to be equal to row
description
obviously then the categories or maybe
we should say I mean categories is fine
maybe we should say genre but I'm
not sure about
that categories um and then what did I
say we want to go with the publishing
year and the average rating that should
be it so we're going to go with
publishing year is going to be
row published year is the column
name and then finally we have uh
average rating
row uh
average rating and I think one more
feature that is useful is uh the number
of pages because maybe I'm only
interested in very small books so I
usually only read books like 70 pages
long uh that might also be a factor even
though maybe not the most important one
so let's say number of pages going to be
equal to
row num
Pages all right so that function takes a
row and turns it into such a string and
then all I can do is or all I have to do
is I have to just return this textual
representation like this all right so
let's see what happens when I apply it
so let's go and say DF and then iloc up
until five and then
apply the function textual
representation
um do I have to provide an access I
think
so there you go so this function now
applied gives us that maybe we can go
values zero and actually I can go and
print that to get the result here and
you can see we get the title the
author's description categories
publishing year average rating and
number of pages so that is what our
function does now now we need to apply
that function to all the
individual um rows so we're going to say
DF
textual
representation is going to be equal to
DF apply textual representation AIS
equals
1 that is uh turning our data set into
one where we have these textual
representations here all
right
so we have the this now and the next
thing we want to do is we want to take
all of this and make an embedding and
put everything into a vector store so
we're going to say here import faiss
import requests now we don't need to
really interact with ollama uh the only
thing that you need with ollama is of
course you need to say uh ollama serve I
think or ollama run um and then you need to
say ollama pull llama2 that's important
because you need to have the model on
your system um and again I have a video
on ollama if you have struggles with ollama
check out that video so import faiss
import requests and then import numpy as
NP then we say the dimensionality is
4,096 as I said and then the index so
the vector store is going to be
faiss.IndexFlatL2 with the
dimensionality here we choose this
dimensionality because that is the
dimensionality of the response we get
from llama 2 when it comes to the
embedding
uh and then we want to say x is equal to
np0 and here we pass length data frame
textual
representation and dimensionality so we
just initialize uh these uh actually we
need to also pass the data type D type
is float 32 uh we just initialize uh
input full of zeros here now um and what
we want to do next is we want to
actually get the embeddings from llama 2
so we say 4
I
representation in enumerate so we have
an index enumerate DF textual
representation uh for that we say take
the representation and make a request so
say response equal to requests.post
and now we just need to use the Local
host URL of llama 2 uh ollama sorry
sorry uh which is by default
HTTP localhost and then port 11434 if
you didn't change that that should be
the default port of ollama again here if
you don't want to use ollama you can
also use the open AI API there is an API
for embeddings if you want to replace
this code with code that gets you the
embeddings from open AI where you have
to pay money you can do that uh you
don't have to do it with ollama if you
don't want to you just have to get
somehow the embeddings so / API
embeddings and here now we need to pass
some data the data is going to be
adjacent object and the Json object will
say I want to use the
model llama 2 and I want to use the
prompt for the embedding which is the
representation um all right so that is
our request and then the result is going
to
be or the embedding is going to be equal
to the response
do get the Json and then get a specific
field called
embedding and then in order to store
that go to index I this is why we do the
enumeration here uh go to index I and
say that that is now our
new uh input here so NP array
embedding actually I think we can also
use
np.empty uh that would save us some time I
guess but it doesn't really matter
so yeah uh and in the end when we're
done with that we want to do index. add
X
so we can do it like that and it will
take quite some time so I can run this
and you will see it will start working
and it takes some time so I can actually
go ahead and add a line here saying if I
modulo 100 is equal to zero print i and
remember we have uh how many we have
6,810
rows so I can run this you can see I get
zero then at some point I'm going to get
100 200 and so on but you see the
progress is quite slow so when you see
100 and when you see 200 you're going to
see how slow this actually is so I'm not
going to do all of this here now on
camera and I think actually okay it
seems like I cannot run this while
recording because it crashes my
recording or at least it makes it very
laggy but you can see it doesn't uh work
very quickly at least on my Hardware
maybe you have some powerful GPU and it
works instantly so you have to run this
for a while I'm not going to do this on
camera I already did this this is why I
have this index file here I'm going to
show you how to create that here in a
second but you can run this for example
on the first couple of instances if you
want to you can run it on the whole data
set and just wait but the idea is that
once this process is done so once this
Loop here is finished and you add
everything to the index what you can do
easily is let me just close this
here um what you can do easily is you
can export the index by saying faiss
write_index and then
uh you take the index and you save it to
index now in my case I already did that
so uh I'm not going to do this but this
is the line of code you would run to do
that I'm just adding a two here in case
I accidentally run this and delete my
index uh and what you can do then is you
can load the index from the file Again
by saying faiss.read_index and then just a
file name so index here and then you can
store that in an index so in this case
what I'm doing now is I'm loading the
index from a file instead of creating it
here by training because I already did
this exact code here I ran it I waited
I produced the index I wrote it to
disk and now I can just say index faiss
read index and I have the full index so
everything that is the result of running
this just that I now uh terminated that
but I now have the index and this is now
the same thing that you get when you
just run this and let it run until it's
finished or you can also go ahead and
just truncate it you can say okay give
me a random sample of the data I don't
need all of it uh it's up to you so it
takes some time
um now what do we do with this index
what we can do with this index now is we
can provide a new instance and find the
most similar instance of the database so
for example uh let's use from our uh
data frame here so let's go and say data
frame DF where the title
contains uh let's look for a book uh
classic self-improvement book would be
something like uh How to Win Friends and
influence people so let's look for
friends uh uh what's the problem here DF
title
contains oh title.str.contains
so we have little house friends
friends friends how do I friends and
influence people there you go or
actually this is the book so it's 4533
let's say I want to find the most
similar book to this one now this is not
a new book you can also do that with a
book that is not part of the data frame
uh but then you would have to craft your
string yourself but let's go and say
that my favorite book now is equal to DF
iloc and it's
4533 so if I look at my favorite book
you can see it's this one I can also
look at the textual representation
here and I will
get the data for the book now let's say
I like this book and I want to find
similar books because I want to learn
more about this or similar topics here
what I can do is I can use the vector
store to embed this again and this again
this could be something completely
different I could go ahead now and craft
this myself I don't need to use a string
that is already part of the um of the
data frame I can go and say title uh
python Bible 7 and one which is a book
from me I can say
authors my name and I can I can put my
book here if I want to and I can feed
that in as well so it doesn't have to be
a book that's already part of the data
frame you just craft your string and you
feed it into it it doesn't even have to
have the structure so you can also go
ahead and feed in hello world and embed
it it also works uh but it's not very
useful so we have this book here and
what I want to do now is I want to embed
this again assuming that this is not
part of the data frame or you can again
embed your own string and then take that
embedding and perform a similarity
search so we do that again by saying
basically the exact same thing that we
did here so we're going to copy that
code
the response is equal to requests post
and then that but here now instead of
representation we pass favorite
book uh
textual
representation or you could also as I
said pass your own string um yeah so
that is that we get a response from this
now we need to get an embedding so
what's the embedding of this particular
book it's equal to NP
array of uh response
response.
Json uh yeah we need to actually use
this
uh thing here for the shape so response
Json
embedding and then the data type is
equal to float
32 this is now the closing bracket
actually we need no we need it like this
there you go so that's the embedding and
now we have to feed this into our index
and search for similarities so we say D I
is equal to index.search so we performed
the search based on the embedding and
we're interested in a top five results
so I pass Five here and then I can get
the matches by saying best
matches is equal to NP
array um DF
textual represent
ation so I get only the column with the
representations from the data frame and
I say that I'm interested in particular
in a couple of indices and these indices
are what I get as a result here from I
so I flatten that because what you need
to understand is that I'm doing this and
as a result I don't get a textual
representation I get positions I get
indices of the individual entries and I
then need to translate them back to
actual representations from the data
frame so I can say for match in best
matches print the
match print an empty line and basically
run this and you can see now not
surprising actually this is surprising
because this is not what I was expecting
let me see this is I think the issue is
that okay so I actually figured out that
the problem was a different one and it
was that I was not using the exact same
structure that I was using uh when
training the previous index because of
course I trained the index with my
prepared code and there I had a slightly
different structure now I changed this
you can see now title is no longer the
first thing we have categories title
authors average rating number of pages
publishing year then a blank line in
description not because that's
necessarily the best way to do it just
because that's the way I did it when I
trained my index which I loaded so this
is just the reason you want to keep this
the same you can train or you cannot
build your vector store with examples
like these and then use a completely
different structure so you have to keep
it the same in your case it shouldn't
make a difference you should uh get good
results immediately because you have
only been using one structure in my case
it made a difference so just as a side
note here it's good that we can learn
from mistakes uh you need to keep this
the same so you cannot just swap things
around here because it's going to mess
up the database so I changed the
structure to be the exact same as the
one I used so now I can run uh these
these things here again I'm not going to
run this one uh I can read index I can
find this book again I can post I can
get the best matches and then I can get
my results which are in this case now
way better now of course this one here
is going to be number one because it's
the exact same thing but besides that we
have here conduct of life from Stephen
Covey also a self-improvement book we have
psychology the of intimacy from this
author here we have uh Marketing in the
bottom line oh actually this was not the
type conduct of life is not the type
first things first is the type of the
book uh and the dance of intimacy is the
type the title of this book so this is
just a category here um but yeah so you
can see that what we get here how to
talk so teens will listen and listen so
team will talk yeah whatever but these
are all like self-improvement /
productivity / communication books
maybe we can go and look at a couple of
more here and we see that for the most
part art of Happiness these are all
self-improvement books so it seems to
work to some degree you can play around
with that you can play around with
different representations you can also
try first of all smaller samples and
then do it on the whole Vector store or
on the whole data uh data set but this
is how you can build a uh recommendation
system because all you have to do now is
you have to come up with new books like
uh in this structure here and then you
can just feed them in and get
recommendations for uh similar books so
that's it for today's video I hope you
enjoyed it and hope you learned
something if so let me know by hitting a
like button and leaving a comment in the
comment section down below and of course
don't forget to subscribe to this
Channel and hit the notification Bell to
not miss a single future video for free
other than that thank you very much for
watching see you on the next video and
bye