Building A Python-Based Search Engine
Summary
TL;DR: Daniel Lindsley's talk offers a comprehensive overview of building a Python-based search engine, emphasizing the importance of understanding search technologies as distinct from relational databases and external web scrapers. He introduces core concepts such as document indexing, tokenization, stemming, and the use of inverted indexes and n-grams for efficient querying. The presentation also touches on scoring algorithms like BM25 and advanced topics including faceted search, boosting, and 'more like this' functionality, providing both theoretical insights and practical code examples to guide the audience through the complexities of search engine development.
Takeaways
- Introduction: Daniel Lindsley, a Kansas native and open-source consultant, is the speaker for the session on building a Python-based search engine.
- Purpose: The goal of the talk is to provide an overview of search technologies, emphasizing their differences from traditional databases and the importance of understanding these technologies.
- Core Concepts: Search engines are document-based and rely on concepts like inverted indexes, stemming, n-grams, and relevance for effective searching.
- Importance of Search: In-house search engines are valuable because they understand the data model better than external scrapers and can provide more accurate and relevant results.
- Avoid Reinventing: The talk discourages creating new search libraries, as they are often unnecessary and existing solutions can be more efficient.
- Indexing Process: Indexing involves receiving and storing documents, tokenizing text, generating terms, and creating an inverted index for efficient querying.
- Inverted Index: The inverted index is central to search engines, acting as a dictionary that maps terms to their occurrences in documents.
- Querying: The process of querying a search engine includes parsing the user's query, reading the index, and scoring the results to determine relevance.
- Scoring Algorithms: Algorithms like BM25 are used to score and rank search results based on their relevance to the user's query.
- Advanced Topics: The talk touches on advanced search engine features such as faceting, boosting, and 'more like this' functionality, which enhance the search experience.
- Resources: The speaker provides resources for further learning, including books on information retrieval and links to the slides and code examples.
Q & A
What is the main topic of Daniel Lindsley's talk?
-The main topic of Daniel Lindsley's talk is building a Python-based search engine and providing an overview of how search technologies work.
Why might someone want to build an in-house search engine instead of relying on Google or other external search services?
-An in-house search engine can be beneficial because it allows for better understanding and handling of the specific data model, avoiding issues with HTML soup and irrelevant content that external search services might encounter.
What is an inverted index in the context of search engines?
-An inverted index is a data structure that stores a mapping from content, such as words or numbers, to its locations in a database file, which is essential for efficient searching in search engines.
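The mapping described above can be sketched in a few lines of Python. This is an illustrative toy (term -> {doc_id: [positions]}), not how production engines lay out indexes on disk:

```python
def build_inverted_index(docs):
    """docs: mapping of doc_id -> list of tokens (already processed).
    Returns term -> {doc_id: [positions]}."""
    index = {}
    for doc_id, tokens in docs.items():
        for position, token in enumerate(tokens):
            # Record where each term occurs, per document.
            index.setdefault(token, {}).setdefault(doc_id, []).append(position)
    return index

docs = {
    1: ["hello", "world"],
    2: ["hello", "search", "engine"],
}
index = build_inverted_index(docs)
# index["hello"] -> {1: [0], 2: [0]}
```

Storing positions alongside document IDs is what later enables phrase matching and "how far into the document" heuristics.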
What is the purpose of tokenization in search engine indexing?
-Tokenization is the process of breaking down a large text blob into smaller individual words or tokens, which are then processed for indexing and searching.
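A rough sketch of the tokenization steps the talk describes (lowercasing, splitting on whitespace, stripping punctuation, dropping stop words); the stop-word list here is a tiny illustrative subset:

```python
import re

# Tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"and", "the", "but", "a", "an", "of"}

def tokenize(text):
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    tokens = []
    for raw in text.lower().split():
        token = re.sub(r"[^\w]", "", raw)  # strip punctuation
        if token and token not in STOP_WORDS:
            tokens.append(token)
    return tokens

tokenize("The quick brown fox, and the lazy dog!")
# -> ['quick', 'brown', 'fox', 'lazy', 'dog']
```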
What is stemming and why is it used in search engines?
-Stemming is the process of reducing a word to its base or root form. It is used in search engines to match different forms of a word to the same root word, improving search results by considering variations like plurals or different endings.
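A deliberately naive suffix-stripping stemmer, for illustration only; real engines use proper algorithms such as the Porter stemmer rather than this ad-hoc suffix list:

```python
def naive_stem(word):
    # Check longer suffixes first; keep at least a 3-character stem.
    # Illustrative only -- nothing like a real stemming algorithm.
    for suffix in ("ings", "ing", "ers", "er", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

naive_stem("testing")    # -> 'test'
naive_stem("searchers")  # -> 'search'
```

Both forms reduce to one stored variant, which is exactly the space and matching win the talk describes.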
What are n-grams and how do they differ from stemming?
-N-grams are contiguous sequences of n items from a given sample of text. Unlike stemming, which focuses on reducing words to their root form, n-grams capture the context of words by considering sequences of characters or tokens, which can be beneficial for matching phrases or autocomplete features.
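Both variants can be sketched in Python, with gram sizes following the talk's example (minimum of 3). Plain n-grams slide a window across each token, while edge n-grams anchor at the start of the token:

```python
def ngrams(token, n=3):
    """Plain n-grams: slide an n-character window across the token."""
    return [token[i:i + n] for i in range(len(token) - n + 1)]

def edge_ngrams(token, min_gram=3, max_gram=6):
    """Edge n-grams: anchor to the start and grow, good for autocomplete."""
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]

ngrams("hello")       # -> ['hel', 'ell', 'llo']
edge_ngrams("hello")  # -> ['hel', 'hell', 'hello']
```

As the talk notes, a user is far more likely to type 'hel' than 'llo', which is why edge n-grams tend to produce better prefix matches.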
What is the significance of relevance in search engine results?
-Relevance refers to the algorithms applied to the search results to determine how well a document or result matches an individual query. High relevance scoring helps ensure that the most pertinent results are returned for a given search query.
What is the purpose of a query parser in a search engine?
-A query parser is responsible for interpreting and processing a user's search query, converting it into a format that the search engine can use to retrieve relevant documents from the index.
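Mirroring the trivial parser shown in the talk: the query gets the same treatment as documents did at index time (lowercase, split, drop stop words, then edge n-grams). The stop-word list and gram sizes here are illustrative:

```python
STOP_WORDS = {"and", "the", "but"}  # illustrative subset

def parse_query(query, min_gram=3, max_gram=6):
    """Process a query exactly like a document at index time."""
    terms = []
    for token in query.lower().split():
        if token in STOP_WORDS:
            continue
        # Same edge n-gram generation used during indexing.
        for n in range(min_gram, min(max_gram, len(token)) + 1):
            terms.append(token[:n])
    return terms

parse_query("Hello world")
# -> ['hel', 'hell', 'hello', 'wor', 'worl', 'world']
```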
Can you explain the concept of faceting in the context of search engines?
-Faceting in search engines refers to the process of categorizing and counting documents based on specific attributes or terms, allowing users to filter search results by various criteria, such as brand or price range.
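A minimal sketch, assuming an inverted index shaped as term -> {doc_id: positions}: a facet count is just the number of unique documents per term, ordered descending, as in the camera-brand example:

```python
def facet_counts(index, terms):
    """index: term -> {doc_id: positions}. Count unique documents per term
    and return (term, count) pairs ordered by descending count."""
    counts = {term: len(index.get(term, {})) for term in terms}
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)

index = {
    "canon": {1: [0], 2: [3], 5: [1]},
    "nikon": {3: [0]},
}
facet_counts(index, ["canon", "nikon"])  # -> [('canon', 3), ('nikon', 1)]
```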
What is the role of an index reader in the search process?
-An index reader is responsible for accessing the search index and retrieving the relevant document and position information based on the terms generated from a user's query.
How does the scoring process in search engines work?
-Scoring in search engines involves using algorithms to evaluate the relevance of documents retrieved from the index based on a user's query. Algorithms like BM25 can be used to assign scores that reflect how well a document matches the query, with higher scores indicating more relevant results.
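A compact, unoptimized sketch of one common BM25 variant with the usual k1 and b defaults; real engines precompute document frequencies and lengths in the index rather than rescanning the corpus:

```python
import math

def bm25_score(query_terms, doc_tokens, corpus, k1=1.5, b=0.75):
    """Score one document's tokens against query_terms.
    corpus: list of all documents' token lists (used here for document
    frequencies and average length -- real engines precompute these)."""
    avg_len = sum(len(d) for d in corpus) / len(corpus)
    n_docs = len(corpus)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)  # document frequency
        if df == 0:
            continue
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
        tf = doc_tokens.count(term)               # term frequency in this doc
        score += idf * tf * (k1 + 1) / (
            tf + k1 * (1 - b + b * len(doc_tokens) / avg_len))
    return score

corpus = [
    ["fraud", "report"],
    ["quarterly", "report"],
    ["fraud", "fraud", "memo"],
]
# Documents mentioning the query term score higher; repeated occurrences
# help further, with diminishing returns.
scores = [bm25_score(["fraud"], doc, corpus) for doc in corpus]
```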
What are some advanced topics related to search engines that were briefly mentioned in the talk?
-Some advanced topics mentioned include faceting for drill-down search capabilities, boosting to prioritize certain results, 'more like this' functionality for finding contextually similar documents, and the use of n-grams for better matching across different languages.
Outlines
Introduction to Python-Based Search Engine Building
Daniel Lindsley introduces himself as the speaker for a session on building a Python-based search engine. He mentions his background in open-source consulting and his work on Haystack, a Django app for pluggable search. The session aims to provide an overview of search technologies, emphasizing their differences from other technologies like RDBMSes. Lindsley encourages understanding how search engines work to gain insights into other engines and discourages the creation of redundant search libraries. The talk includes code and slides, touching on the importance of search, the inadequacy of web scraping for content understanding, and the advantages of in-house search capabilities.
Core Concepts of Search Engines and Document Indexing
This section delves into the foundational concepts of search engines, defining terms such as 'engine,' 'documents,' 'corpus,' 'stop words,' 'stemming,' 'segments,' 'relevance,' 'faceting,' and 'boost.' It explains that search engines are document-based rather than relational databases, highlighting the importance of denormalizing data for indexing. The section discusses the process of indexing, which involves receiving, storing, tokenizing, and generating terms from documents. It also touches on the limitations of stemming and introduces n-grams as a solution for handling different languages and improving search accuracy.
Deep Dive into N-grams and Inverted Indexing
The speaker discusses n-grams as a method for generating search terms, explaining how a moving window over tokens creates terms for indexing. N-grams help match user queries with indexed documents by considering different word fragments. The section also explains the construction of an inverted index, which is central to search engine functionality. The inverted index is likened to a Python dictionary, mapping terms to document IDs and their positions, facilitating efficient querying. The discussion covers challenges like the increased number of terms generated by n-grams and their impact on storage and search quality.
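The flat-file segment scheme described in the talk maps each term to a segment file by hashing it with MD5 and keeping the first six hex digits, which bounds the number of files on disk and makes looking up a term's segment a direct computation:

```python
import hashlib

def segment_name(term):
    """Map a term to a segment file name: md5 hash, first six hex digits,
    as described in the talk."""
    return hashlib.md5(term.encode("utf-8")).hexdigest()[:6]

segment_name("hel")  # always the same six hex characters for a given term
```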
Query Processing and Scoring Mechanisms
This section outlines the process of querying a search engine, which involves a query parser, index reader, and scoring system. The query parser's role is to interpret user queries, similar to SQL, and prepare them for the search engine. The index reader utilizes the inverted index to fetch relevant document information based on query terms. Scoring algorithms, such as BM25, reorder search results to reflect relevance to the user's query. The speaker also attempts a live demo using the Enron email dataset, searching for the term 'fraud,' but encounters a technical issue.
Advanced Search Engine Topics and Practical Considerations
The speaker touches on advanced topics like faceting, which provides counts of documents containing specific terms, allowing users to make informed decisions during their search. Boosting is mentioned as a method to artificially increase the score of certain results during the scoring process. The 'more like this' feature is introduced as a way to find contextually similar documents. The speaker also points out that the provided code is for educational purposes and not suitable for production use, emphasizing the complexity of building a robust search engine. Additional resources for further learning in information retrieval are suggested.
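The 'more like this' approach the speaker outlines (store a document-to-terms mapping, then rank other documents by how many terms they share) can be sketched as:

```python
def more_like_this(doc_id, doc_terms, top_n=5):
    """doc_terms: doc_id -> set of terms. Rank other documents by how many
    terms they share with the given document."""
    target = doc_terms[doc_id]
    overlaps = [
        (other_id, len(target & terms))
        for other_id, terms in doc_terms.items()
        if other_id != doc_id
    ]
    overlaps.sort(key=lambda pair: pair[1], reverse=True)
    return overlaps[:top_n]

doc_terms = {
    1: {"fraud", "energy", "memo"},
    2: {"fraud", "energy", "report"},
    3: {"lunch", "menu"},
}
more_like_this(1, doc_terms)  # -> [(2, 2), (3, 0)]
```

As the talk notes, this is simplistic; engines like Solr layer natural-language processing and other contextual analysis on top of the basic term-overlap idea.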
Audience Q&A and Final Thoughts
The final paragraph consists of a question-and-answer session with the audience. Questions address the importance of stopwords in scoring algorithms like BM25 or TF-IDF, techniques for full phrase matching, and the handling of prefix-heavy languages like Greek in search engines. The speaker provides insights into these topics, explaining how algorithms filter out frequent words and the approaches to phrase matching. The discussion wraps up with acknowledgments and thanks to the audience, with the speaker's contact information and resources provided for further exploration.
Keywords
Search Engine
Inverted Index
Tokenization
Stemming
N-grams
Relevance
Drill Down (Faceting)
Boost
Document
Stop Words
BM25
Highlights
Daniel Lindsley introduces the concept of building a Python-based search engine and its significance.
The importance of understanding how search engines work to gain insights into working with other engines is emphasized.
Avoiding the creation of redundant search libraries is suggested as a community goal.
The necessity of in-house search capabilities due to the limitations of web scraping by external search engines like Google is discussed.
The advantage of having a better understanding of one's own data model for effective search is highlighted.
Core concepts of search engines, such as document-based systems and the avoidance of large text blobs, are introduced.
The concept of 'inverted indexes' as a key component of search engine technology is explained.
Terminology such as 'stemming', 'segments', 'relevance', 'faceting', and 'boost' is defined for clarity in search engine discussions.
The process of indexing, involving document receipt, storage, tokenization, and term indexing, is outlined.
The importance of document quality for effective search results is stressed.
Tokenization is described as the breakdown of text into individual words or tokens for search processing.
Stemming is presented as a method to find root words for more efficient search indexing.
The introduction of 'n-grams' as an alternative to stemming to overcome language limitations is discussed.
The construction of an inverted index using terms and document metadata for efficient querying is detailed.
The process of querying a search engine involves a query parser, index reader, and scoring mechanism.
A demo of searching the Enron email dataset using the implemented search engine is attempted, showcasing practical application.
Advanced topics such as 'faceting', 'boosting', and 'more like this' features in search engines are briefly introduced.
The speaker provides resources and encourages further learning in the field of search engine technology.
A question-and-answer session explores practical considerations, such as the handling of stop words in scoring algorithms and full phrase matching techniques.
The challenges of implementing search engines for languages with heavy use of prefixes, like Greek, are discussed.
Transcripts
good afternoon
everyone our speaker this afternoon uh
for the first half of this track is
going to be Daniel Lindsley he's going to
be speaking on building a python based
search
engine how are you all do you have a
good lunch yes all right good so my name
is Daniel Lindsley I am from Kansas not
this Kansas but this
Kansas but really I'm from the internets
so I run a little web shop called Toast
Driven um you might have seen this and I do
Consulting in open source um and I'm the
primary author of a package called Haystack
which is a Django app for pluggable search
but we're not here to talk about
Haystack today so what I'd like the goal of
what we're going to talk about today is
is to give you a broad overview of how
search Technologies work because if
you're familiar with other Technologies
like RDBMSes they're not the same beast
they work very differently and
understanding how a search engine works
can give you insights as to how to work
with other engines what we're going to
try not to do is have all of you write
yet another search library because we
really probably don't need that so
um there'll be code and a link to the
slides at the end of the talk so bear
with me so why should you care about
search doesn't the Googles handle that
Well turns out there's a lot of good
reasons to have in-house search one of
those is that things like Google or ask
or Microsoft and Yahoo's Technologies
they have to scrape web content which
means they get HTML soup and they have
to parse through it and try and pull out
any kind of meaning getting
reasonable proper content out of that
can be a big nightmare
um
another good reason is you know your
data model better than any scraper does
they just have HTML they might have ad
content in there they might have RSS
feeds all kinds of things that aren't
pertinent to the content that's really
being displayed or maybe you're weird
and you don't do web apps like I
do so good reasons to have let's talk
about some Core Concepts uh search
engines are all document based these are
not rows in a
DB in you know your RDBMS you're never
ever just grepping through a string that
is like the worst thing you can possibly
be doing we don't want to be looking at
text Big text blobs every time we go to
run a
query uh you'll hear me talk a lot about
inverted indexes we're going to get into
this in a few minutes as well as
stemming n-grams and relevance so as
with all Fields there's a ton of
terminology so we're going to go through
some stuff really quick so that what I
say throughout the rest of the talk has
some meaning so we're going to talk
about these terms and we'll start with
engine the engine is the Black Box you
hand a query to and you get the results
from the good examples of Open Source
engines are things like Solr
Elasticsearch Xapian Whoosh and many others
Sphinx is an
example uh
documents documents are the text blob
the text that you send to a search
engine is one of the most important
things that you can pay attention to and
additionally you probably want some
additional metadata to describe what
that text blob consists of where it came
from titles other attributes and stuff
Corpus a corpus is just the collection
of all the documents in the
index stop words stop words are words
that are basically filler these are
things like and uh the but and whatnot
words that don't contribute a lot of
meaning but appear very frequently in
documents
stemming stemming is finding root words
and we're going to look at this a little
bit
later
segments segments are the data that make
up the overall index because you may be
dealing with gigabytes or terabytes of
data shoving everything in one big file
really doesn't scale well so a common
technique is to Shard things down into
what are called segment files which as a
whole make up the
index
relevance relevance is the algorithm or
algorithms you may apply to the results you
get back out of the search engine to
determine how well a certain document or
result may fit an individual
query
faceting faceting is drill down if you've
ever been on amazon.com that left-hand
bar that lets you say hey I want a
camera that's a canon in $200 to $400
range that's and lets you drill down
that's
faceting boost boost is a way to push up
certain kinds of results in the results
set so that they bubble up to the
top so let's start at the beginning
let's talk about
indexing indexing is the is taking a
document that the blob that we were
talking about with metadata and pulling
it into the engine and Performing
Transformations and storage so that we
can query on it later on there's four
major breakdowns receiving and storing
documents tokenization generating terms
and indexing those terms so let's start
with documents
documents are the simplest thing we will
talk about all day they are not ever
ever ever not not not not they're not a
row in the
DB really no really really um you really
want to be thinking like I've said a
blob of text and some metadata that goes
with it uh like I've said quality is the
most important thing what your users
search on is what I'm sorry what you
index is what your users will be able to
search on if you put put a bunch of junk
in they're going to get garbage results
out search or documents are a very flat
concept A lot of people try to work
relations into search engines and it
never ends well you want to be thinking
things like um I I like to think of
search engines as kind of the original
no sequels after Berkeley DB so um don't
try and put relations in it doesn't
really work well and the the the one
thing I really want you to take away
from any of this because it applies
everywhere is you want to denormalize
the living crap out of anything you're
indexing because the more you can shove
on that has meaning again text quality
is important the more you can shove on
that has meaning onto the individual
documents the better they will Bubble Up
in results and the better results your
users will get when they search so a
document looks really familiar if we
represent it as say Json or python it's
just a dict there's really not a whole
lot to it it's a text blob with optional
metadata
um from there and you can imagine all
kinds of ways to store that data
tokenization is the process of taking
that big text blob and breaking it down
into little individual word-sized chunks
um typically what you'll find most
engines doing is splitting on things
like whitespace they'll lowercase
every single token that they get out of
that filter out all those meaningless
words like and and the but and um which
you're going to hear a lot of um um and
stripping punctuation etc etc many times
is configurable so that you can tweak it
to match your document
set the point of that is to get some
nice neat normalized little tokens out
that you can work with when
querying so we mentioned stemming and so
that's where we'll go to next because
now we've got documents we've got a list
of tokens in those documents the next
thing we want to do
is be able to take those mostly
normalized tokens and get something even
better out of it so by doing
stemming I'm sorry
um so if you do something like grep you're
processing through a file stream and
it's literally just checking hey is it
here is it here is it here is it here
that's horrible when you've got
gigabytes and gigabytes and gigabytes of
stored document content so when we do
this tokenization and the postprocessing
we're really close to what we want but
not quite so stemming is the process of
finding the root word examples of this
would be like testing becoming test
because the ing ending is not
particularly meaningful or searcher
becoming search
because we want to find root words so
that if someone misspells something or if
uh if there's a pluralized form we can
get back down to that root word and
store one variant rather than
20 these then become the terms in the
inverted index that we'll be querying on
and if you you apply these same the same
tokenization and stemming to the query
what you get out of a user's query is
the exact same thing as the terms you're
shoving in the
index the downsides of stemming are that
it really only works for the language it
was built for most stemmers that are out
there are pretty much English only there
are other languages but they don't work
particularly well and they're poorly
supported it's really really painful
even in high quality search engines
and they're really hard to make work
cross language because the structure of
German or French is not the same as
English so this is a really bad
shortcoming so there are many ways to
solve this the approach we're going to
look at today is
n-grams as I said it takes care of some
of the shortcomings of stemming we're
going to introduce some new problems but
overall it's typically worth it what
n-grams consist of is taking a window and
pass passing it over your token over
your
tokens and that these new little terms
that come out of this moving window are
what become your terms in the index as
opposed to a stemmed version of the word
and we'll see an example of this in a
few moments so uh for instance if we do
a typical n-gram and we assume a gram
size of three we've got three characters
h e l then e l l and then we jump to the next
term or the next token
and
generate a complete list of terms based
on the tokens that we have now that's
great except that typically you're
probably not typing in llo as a query
so what we want to do is find another
way to approach this this shortcoming edge
n-grams typically solve this so what
an edge n-gram will do is it will pin to a
side of a token in this case uh if with
a gram size of three we still have that
H but instead of moving the window along
we're going to generate different uh
gram lengths so that we have a gradually
more and more complete word that becomes
our terms and then again once we switch
to the next token we start with a gram
size of three again move to four and
five and as you can see these terms are
much more likely to be in a user's query
than
llo so this is great because it gives us a
uh first of all it works great for
autocomplete because user can start
typing something very short and you can
start immediately feeding results back
to
them it's also good because it works
across other languages because it no
longer relies on the grammatical
structure you're just simply looking at
tokens and if the word starts the same
way in a in a different language it'll
match well to however whatever terms you
generated cons are that when you use n-grams
or edge grams you generate a lot more
terms um and as a result storing all
those terms can be kind of a problem or
at least it bloats up on space and
initial quality of search can suffer a
little bit because now you're dealing
with fragments and something that you
may not have meant to have matched now
matches so an example in Python might
look something like this we set a
minimum and Max gram size like we did on
the previous slide 3 to six we have a
dictionary of terms that we're going to
be storing out of this we pass in a list
of tokens run through it make sure we
grab position information as well as the
individual token and then we just for
the range of our gram size pass a window
over it and shove the newly generated
term into our
terms so now we've got documents and
we've got n-gram terms the next step is
storing this in a way that we can
actually query on that comes in in the
inverted index the inverted index is the
heart and soul of the search engine it
is how things work you can think of it
exactly like a python dictionary because
it's largely key value those keys are
the terms that we just got finished
generating with n-grams and it's all the
terms from all the documents so you can
imagine taking a large body of text
generating tons and tons and tons of
edge n-grams each one of those that you
generated becomes a new key in that big
inverted
index and of course this is across all
documents not just one document at a
time the position information that we
were grabbing out of the previous
example of edge n-grams uh is important
because we can say how far into the
document did this word appear we can
also say hey how many times in this
document did this word appear and of
course we want to store document ID so
that when we search we can map things
back to the data that we've stored
so an index as I said might just look
just like a plain old python dictionary
and you'll be storing the individual
terms like blob and text uh if if you
had an n-gram size of four for instance
and document IDs such as document
1524 and blob appeared at position three
of that example we saw on a previous
slide once you have these this
index generated the problem then becomes
querying over it efficiently
like I said if you have this huge
massive data structure hanging out you
can't put it all in one file and search
over it effectively so there's lots and
lots of ways to do segments to break
that index down for efficient
querying many many many projects follow
um there's this this famous Java project
called Lucene which is really really good
at search um but we're going to cheat
cuz it's easier so uh in the example
code that I've got we're going to just
use flat files we're going to take those
n-gram terms that we generated when we
ripped through the Big Blob of data and
we're going to Hash them and we're going
to take part of that hash the leading
part and use it to map it onto a file in
the file system this G this limits the
number of files that we've got down to
something that like a Unix like system
can handle as well as ensures that if
you got a given n-gram you can easily
map right down into the proper segment
file that it's associated with we're
going to keep terms in always sorted
order and we're going to use Json to
store the document and position
information not because it's
particularly performant but because
it'll work well across other um other
languages or other
tools so an example implementation might
look something like this I mentioned
we're going to Hash so we just take the
term pass it in we're going to take a
length of six on the hash just so that
it's short and keeps a reasonable number
of files in the uh file system and we
just simply do a simple md5 on it get a
hex Digest and return the first six uh
as far as saving a
segment this isn't in always sorted
order the code I'm going to present to
you at the end is always in sorted order
but it's simply just appending onto the
file the information and we're going to
tab-delimit because all of our tokens
have already been whitespace separated so
there's no chance of tabs showing
up so by getting to this point we've
already covered creating the terms
storing them storing the documents we've got
enough information that now we can start
querying on it querying breaks down into a
query parser index reader and scoring at
a high level so let's dive into the
query parser um T typically the purpose
of a query parser is taking a user's uh
handwritten query maybe they have
advanced knowledge of it maybe not you
can think of this as analogous to SQL
and parsing it out into something that
the engine can start tackling you're
going to process all the elements of of
that user's query the same way you did
when preparing the document for
indexing so an example trivial python
implementation might look like something
like this we've got a list of common
English stop words and we're going to
we're going to really really cheat in
this case and we're just going to split
it up check that the token's not in stop
words if it's not there we're going to
toss in our list of tokens and then
we're going to make n-grams it's the
same kind of stuff we just did when we
were preparing the
index index reading uh becomes easy now
because we have this list of edge n-gram
terms we can simply go through hash them
and we can find whatever file we just
stowed them in in a very quick efficient
lookup and we just rip through all the
to the tokens that we I'm sorry all the
terms that we generate from a user's
query grab all the hashes go into the
files pull out the results and collect
the position and document
information this is a little big I
apologize if it's small um but it's
basically the same kind of stuff we make
a segment name so that we know what the
what the file name of the term is check
if it's there if it's not we just return
nothing if it's there we go through and
we start ripping through the file and
there are more efficient ways to do this
but we're splitting on the tab for now
so that we've got the term and the info
if we found that term say we entered
hello as our user query and we found
hel in the segment we're going to pull out
those results because we we found a
match and collecting the results is
simply a matter of just going through
all the terms that were generated from
the user query and applying that load
segment to those
terms finally so where we're at now is
we've got a user's query we've tokenized
it we've made terms we've gotten all the
results together but they're out of
order they have no meaningful order for
the query that we generated so what
we're going to do with scoring is
reorder that collection so that it means
something based on the user's query
there are tons and tons of choices of
scoring algorithms we're not going to go
through all of them uh a major one is
bm25 it's very famous you can look it up
on Wikipedia and it's it's pretty
interesting there's also what's called
the phased uh scoring query and Google's
page rank is a great example of this
this can be anything you may decide that
words that include Bob are really really
important and you're just like Bob
you're always up at the top buddy so uh
and an implementation of bm25 looks
something like this
uh I can't explain it I'm
sorry so as a quick demo of
this I have indexed the first 100 uh 50
or I'm sorry 1,500 documents of the
Enron email data set so
uh if anyone has any uh correct for
plight company terms they'd like to
search on fraud okay let's do fraud
a demo
fail I'm mad
okay I'm sad because this literally just
worked before I uh before I came down
why you don't do demos and talks thank
you uh so let's talk about some Advanced
topic shall we um I'm not going to cover
any code on this but it's diving into
this would be probably a great exercise
for you um God I sound like a college
professor horrible um so faceting faceting
has made out to be a lot of things but
what drill down really consists of is
for all the terms that are provided
giving accurate counts on the number of
documents that contain those things so
when you're on Amazon and you're digging
through digital cameras and you say
Canon what you're really and they give
you a count that's really all faceting is
it's just that count but that count is
important because it lets the user know
both what things are matching within the
document set as well as how many are
there say there's only two Canon uh
cameras on Amazon which will never ever
happen you might be like well wow they
don't really have a really great
selection maybe I need to go elsewhere
so an implementation would probably look
like collecting all the terms counting
the length of the unique document IDs
for each of them and then just order by
descending because once you have those
terms and you know you know how many
things matched and how many documents
were there you just order by the count
and you can provide back something
that's useful to the people who are
using your software boost uh happens
Boosting happens during the scoring process. Literally, it's just saying, "Hey, we've got a score that we generated, but we want to artificially bump things up," so Bob gets a bump up if he's matched. You just tweak the score; it would literally be nothing more than modifying the BM25 relevance score.
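As a sketch, that tweak is just a multiplier applied after the relevance score comes back; the boosts mapping here is hypothetical:

```python
def boosted_score(base_score, doc_terms, boosts):
    """Multiply the relevance score when a boosted term matched the doc."""
    score = base_score
    for term, factor in boosts.items():
        if term in doc_terms:
            score *= factor
    return score

# Documents mentioning "bob" get artificially bumped up.
print(boosted_score(1.5, {"bob", "report"}, {"bob": 2.0}))  # 3.0
```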
Another common one is something like "more like this". What "more like this" provides is basically contextually similar documents. You've got this huge list of terms that you generated when you indexed; you know all the terms that were in that document, and you can also find other documents that have a very similar set of terms. An implementation would look like collecting all the terms by document (so you probably need an alternative data structure to store the document-to-term mapping) and then sorting simply based on how many times a given document was seen in the set. This is really, really simplistic, as is most of this presentation, just to keep things at a high level. Engines like Solr and whatnot use much more complex solutions, like natural language processing, as well as other types of contextual analysis, to get a much better quality result.
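The "more like this" idea above can be sketched with that alternative data structure, a forward index mapping document to terms (names here are hypothetical); similarity is just raw term overlap:

```python
def more_like_this(doc_id, doc_terms, limit=5):
    """Rank other documents by how many terms they share with doc_id.

    doc_terms: dict mapping document ID -> set of terms in that document.
    """
    source = doc_terms[doc_id]
    overlaps = []
    for other_id, terms in doc_terms.items():
        if other_id == doc_id:
            continue
        shared = len(source & terms)
        if shared:
            overlaps.append((other_id, shared))
    # Most-overlapping documents first.
    overlaps.sort(key=lambda pair: pair[1], reverse=True)
    return overlaps[:limit]

docs = {
    "a": {"enron", "fraud", "energy"},
    "b": {"enron", "fraud", "memo"},
    "c": {"memo", "lunch"},
}
print(more_like_this("a", docs))  # [('b', 2)]
```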
So, I told you not to go invent a search engine, and I went and invented a search engine. The code is up on GitHub, BSD licensed; feel free to poke through. I've annotated the whole source, so it should be easy to dive through. Here's a list of some additional resources; lots of good stuff here. The IR book is fantastic and largely free on the internet. It's worth it for any kind of textual processing, and for learning how information retrieval topics work, as well as a bunch of other things. Thank you very much.

As I mentioned, the slides are up on Speaker Deck at this URL, once I make it public. Huh. They'll be up just after this talk.
There we... oops. We have about six minutes for questions. If anyone would like to ask a question, please line up behind that microphone. A quick note on process: ask your question, I will repeat your question, and then he will answer it.

So, just to kind of review: where do you think the cutoff is between rolling your own search engine and just basically trying to figure out how Solr works and implementing it yourself?
To me, the code I put up, I would say no one should ever deploy it in production. It is horrific on I/O, because you're literally writing with every single term. You probably want something that does something more like indexing a bunch of documents in memory, then doing a commit and flushing them out to disk. So to me it's a learning exercise; it was great to go through and solidify things for myself at a high level. I'd encourage you to go ahead and do that if it's of interest to you, but it rapidly becomes one of those things where you'll probably start hating life if you have to support it.

Hey, can you speak to the relative importance of things like stop words when you're using something like BM25 or TF-IDF that takes into account the frequency of words, or refer to any resources that show those kinds of differences?

Sure. So,
when I showed this, I kind of glossed over some things. BM25, like he mentioned, does a great job of filtering out words that are extremely frequent. Again, I can't explain the math on this, but essentially if something has, I believe, something like a 75% chance of appearing, it'll just be ignored. The way the relevance score is calculated is kind of weird; it's a trip to read on Wikipedia. Huh? You can tune it? The 75% is hard-coded, right. The b that's up there, that's passed in as a kwarg, that's actually document length; if you set that, it'll adjust the score slightly. The k modifier is just used to kind of normalize the range. So when you do scoring, you don't get a zero-to-100% match like you might have seen in older search engines; you get a weird floating-point number, and it doesn't really mean a whole lot, because it's just based on the document set as well as the query. The moment someone changes the query even slightly, you can get very different scores. That's why you rarely see people exposing scores on the front end of things: they don't mean a lot.
Now, BM25, and I believe the phased approach also handles this, they kind of already deal with excluding things like stop words and really frequently repeated words. The problem is that if you lean on just that, you don't get, in my opinion, as good quality results, because other things might be repeated very frequently. Say you're in the Python documentation: the word "Python" is everywhere, and it can be ranked lower by a score out of BM25 just because of how frequently it appears, even though it's not really a stop word. Also, if you filter out stop words ahead of time, you save on lots and lots of disk space; it's really more of an I/O saving. Cool.
How you doing. What are some techniques and algorithms for doing full phrase matching?

So the question was: what are some techniques for doing full phrase matching? It really depends on how the query is parsed and how the index is read. What'll typically happen is you'll find that a query gets broken down into something that's more like a tree, with operators in between. So if you double-quote something in a query, chances are that, if you imagine a query tree, it'll get pulled out into an exact-match node, and it'll generate terms. In the case of n-grams (not everything uses n-grams; in fact, most things default to stemming, but n-grams work better in a lot of circumstances), in that exact-match node of the query tree it will still generate those edge n-gram terms and look them up, but probably generate them for the max length of the word. So in the case of "hello world", both terms are five characters, so generate an edge n-gram of five for both of them, look them up, find documents that have both of those words, and then look at the position information; if the positions aren't right next to each other, you discard the result. So you're just manually filtering the result set to find the documents that actually have that phrase, essentially, yes. Now some engines take a different approach, and they will store full exact matches. You have to realize most search engines are nearly infinitely configurable, so you can make them do whatever you want, and this is a really, really simple view of things. The thing you don't want to do is try to go through an entire blob, because as soon as you do that, performance just drops.

Thanks a lot.
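The position-filtering step described in that answer might look like the following sketch, assuming a hypothetical positional index that maps each term to per-document position lists:

```python
def phrase_match(index, phrase_terms):
    """Find documents where phrase_terms appear adjacent and in order.

    index: dict mapping term -> {doc_id: sorted list of term positions}.
    """
    postings = [index.get(term, {}) for term in phrase_terms]
    if not postings:
        return []
    # Only documents containing every term can possibly match the phrase.
    candidates = set(postings[0]).intersection(*postings[1:])
    matches = []
    for doc_id in candidates:
        # The phrase matches if each later term sits i slots after a
        # starting position of the first term.
        for start in postings[0][doc_id]:
            if all(start + i in postings[i][doc_id]
                   for i in range(1, len(phrase_terms))):
                matches.append(doc_id)
                break
    return matches

index = {
    "hello": {1: [0, 7], 2: [3]},
    "world": {1: [1], 2: [9]},
}
print(phrase_match(index, ["hello", "world"]))  # [1]
```

Document 2 contains both words, but not next to each other, so it gets discarded exactly as described.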
Mhm. Any questions? I actually have one. How does n-gram searching work with languages that make heavy use of prefixes, like Greek, for instance?

So, I can't speak to Greek, but it depends on how it's prefixed. If it's prefixed with punctuation, like it's hyphenated or something, typically you'll be splitting on punctuation anyway. If you're talking about... I'm trying to think of examples in English of common pre... "Epicenter?" Epicenter. So, another thing that I've kind of hidden away in this talk is that, say, for instance, Solr or Lucene: their default implementation generates edge n-gram sizes from 2 to 15. So even though it doesn't help in the case of prefixing, it does in postfixing, and the way you get around it for prefixed things is typically to generate n-grams instead of edge n-grams.
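To make that distinction concrete, here is a quick sketch of edge n-grams versus full n-grams (sizes trimmed for brevity rather than the 2-to-15 range mentioned above):

```python
def edge_ngrams(word, min_size=2, max_size=5):
    """Prefix-anchored grams: only match variation at the end of a word."""
    return [word[:n] for n in range(min_size, min(max_size, len(word)) + 1)]

def ngrams(word, size=3):
    """Sliding-window grams: match anywhere, including past a prefix."""
    return [word[i:i + size] for i in range(len(word) - size + 1)]

print(edge_ngrams("epicenter"))  # ['ep', 'epi', 'epic', 'epice']
print(ngrams("epicenter"))       # ['epi', 'pic', 'ice', 'cen', 'ent', 'nte', 'ter']
```

Note that the full n-grams of "epicenter" include "cen", "ent", and "ter", which would also be generated for "center"; that shared overlap is what lets full n-grams handle prefixed words where edge n-grams cannot.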
Got it. Ladies and gentlemen, Daniel Lindley. Thank you very much.