Building A Python-Based Search Engine

Next Day Video
13 Mar 2012 · 31:05

Summary

TL;DR: Daniel Lindsley's talk offers a comprehensive overview of building a Python-based search engine, emphasizing the importance of understanding search technologies as distinct from web scraping tools. He introduces core concepts such as document indexing, tokenization, stemming, and the use of inverted indexes and n-grams for efficient querying. The presentation also touches on scoring algorithms like BM25 and advanced topics including faceted search, boosting, and 'more like this' functionality, providing both theoretical insights and practical code examples to guide the audience through the complexities of search engine development.

Takeaways

  • πŸ‘‹ Introduction: Daniel Lindsley, a Kansas native who jokes that he's really 'from the internets', is the speaker for the session on building a Python-based search engine.
  • πŸ” Purpose: The goal of the talk is to provide an overview of search technologies, emphasizing their differences from traditional databases and the importance of understanding these technologies.
  • πŸ“š Core Concepts: Search engines are document-based and rely on concepts like inverted indexes, stemming, n-grams, and relevance for effective searching.
  • πŸ“ˆ Importance of Search: In-house search engines are valuable because they understand the data model better than external scrapers and can provide more accurate and relevant results.
  • πŸ€– Avoid Reinventing: The talk discourages creating new search libraries, as they are often unnecessary and existing solutions can be more efficient.
  • πŸ“ Indexing Process: Indexing involves receiving and storing documents, tokenizing text, generating terms, and creating an inverted index for efficient querying.
  • πŸ”‘ Inverted Index: The inverted index is central to search engines, acting as a dictionary that maps terms to their occurrences in documents.
  • πŸ” Querying: The process of querying a search engine includes parsing the user's query, reading the index, and scoring the results to determine relevance.
  • πŸ“ˆ Scoring Algorithms: Algorithms like BM25 are used to score and rank search results based on their relevance to the user's query.
  • πŸ› οΈ Advanced Topics: The talk touches on advanced search engine features such as faceting, boosting, and 'more like this' functionality, which enhance the search experience.
  • πŸ“š Resources: The speaker provides resources for further learning, including books on information retrieval and links to the slides and code examples.

Q & A

  • What is the main topic of Daniel Lindsley's talk?

    -The main topic of Daniel Lindsley's talk is building a Python-based search engine and providing an overview of how search technologies work.

  • Why might someone want to build an in-house search engine instead of relying on Google or other external search services?

    -An in-house search engine can be beneficial because it allows for better understanding and handling of the specific data model, avoiding issues with HTML soup and irrelevant content that external search services might encounter.

  • What is an inverted index in the context of search engines?

    -An inverted index is a data structure that stores a mapping from content, such as words or numbers, to its locations in a database file, which is essential for efficient searching in search engines.
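
    A minimal sketch of the idea in Python, using made-up document IDs and positions:

```python
# A toy inverted index: each term maps to the documents (and the
# positions within them) where it occurs. IDs and positions are made up.
inverted_index = {
    "hello": {"doc_1": [0], "doc_42": [3, 17]},
    "world": {"doc_1": [1]},
}

# Finding every occurrence of a term is a single dict lookup,
# rather than a scan over the raw text of every document.
print(inverted_index.get("hello", {}))  # {'doc_1': [0], 'doc_42': [3, 17]}
```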

  • What is the purpose of tokenization in search engine indexing?

    -Tokenization is the process of breaking down a large text blob into smaller individual words or tokens, which are then processed for indexing and searching.
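
    A simplified sketch of the steps the talk describes (split on whitespace, lowercase, strip punctuation); real analyzers are configurable per document set:

```python
import string

def tokenize(text):
    """Break a text blob into lowercase tokens, stripping punctuation."""
    tokens = []
    for chunk in text.lower().split():
        token = chunk.strip(string.punctuation)
        if token:
            tokens.append(token)
    return tokens

print(tokenize("Hello, World! Searching is fun."))
# ['hello', 'world', 'searching', 'is', 'fun']
```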

  • What is stemming and why is it used in search engines?

    -Stemming is the process of reducing a word to its base or root form. It is used in search engines to match different forms of a word to the same root word, improving search results by considering variations like plurals or different endings.
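
    A deliberately naive suffix-stripping sketch to illustrate the idea; production engines use real stemming algorithms such as Porter or Snowball, which handle far more cases:

```python
def naive_stem(token):
    """Strip a few common English suffixes to approximate a root word.

    Toy illustration only: real stemmers handle exceptions, irregular
    forms, and many more suffixes than this.
    """
    for suffix in ("ing", "ers", "er", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

print(naive_stem("testing"))   # 'test'
print(naive_stem("searcher"))  # 'search'
```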

  • What are n-grams and how do they differ from stemming?

    -N-grams are contiguous sequences of n items from a given sample of text. Unlike stemming, which focuses on reducing words to their root form, n-grams capture the context of words by considering sequences of characters or tokens, which can be beneficial for matching phrases or autocomplete features.
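
    For example, plain character n-grams with a gram size of three slide a fixed window across each token (edge n-grams, discussed later in the talk, instead stay anchored to the start of the word):

```python
def char_ngrams(token, n=3):
    """Return every contiguous character n-gram of a token."""
    return [token[i:i + n] for i in range(len(token) - n + 1)]

print(char_ngrams("hello"))  # ['hel', 'ell', 'llo']
```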

  • What is the significance of relevance in search engine results?

    -Relevance refers to the algorithms applied to the search results to determine how well a document or result matches an individual query. High relevance scoring helps ensure that the most pertinent results are returned for a given search query.

  • What is the purpose of a query parser in a search engine?

    -A query parser is responsible for interpreting and processing a user's search query, converting it into a format that the search engine can use to retrieve relevant documents from the index.

  • Can you explain the concept of faceting in the context of search engines?

    -Faceting in search engines refers to the process of categorizing and counting documents based on specific attributes or terms, allowing users to filter search results by various criteria, such as brand or price range.

  • What is the role of an index reader in the search process?

    -An index reader is responsible for accessing the search index and retrieving the relevant document and position information based on the terms generated from a user's query.

  • How does the scoring process in search engines work?

    -Scoring in search engines involves using algorithms to evaluate the relevance of documents retrieved from the index based on a user's query. Algorithms like BM25 can be used to assign scores that reflect how well a document matches the query, with higher scores indicating more relevant results.

  • What are some advanced topics related to search engines that were briefly mentioned in the talk?

    -Some advanced topics mentioned include faceting for drill-down search capabilities, boosting to prioritize certain results, 'more like this' functionality for finding contextually similar documents, and the use of n-grams for better matching across different languages.

Outlines

00:00

πŸ‘‹ Introduction to Python-Based Search Engine Building

Daniel Lindsley introduces himself as a speaker focusing on building a Python-based search engine. He mentions his background in open-source consulting and his work on Haystack, a Django app for pluggable search. The session aims to provide an overview of search technologies, emphasizing their differences from other technologies like RDBMSes. Lindsley encourages understanding search engines' workings to gain insights into other engines and discourages the creation of redundant search libraries. The talk will include code and slides, touching on the importance of search, the inadequacy of web scraping for content understanding, and the advantages of in-house search capabilities.

05:04

πŸ“š Core Concepts of Search Engines and Document Indexing

This paragraph delves into the foundational concepts of search engines, defining terms such as 'engine,' 'documents,' 'corpus,' 'stop words,' 'stemming,' 'segments,' 'relevance,' 'faceting,' and 'boost.' It explains that search engines are document-based and not relational databases, highlighting the importance of denormalizing data for indexing. The paragraph discusses the process of indexing, which involves receiving, storing, tokenizing, and generating terms from documents. It also touches on the limitations of stemming and introduces the concept of n-grams as a solution for handling different languages and improving search accuracy.

10:06

πŸ” Deep Dive into Engrams and Inverted Indexing

The speaker discusses engrams as a method for generating search terms, explaining how they work with a moving window over tokens to create terms for indexing. Engrams help in matching user queries with indexed documents by considering different word fragments. The paragraph also explains the construction of an inverted index, which is central to search engine functionality. The inverted index is likened to a Python dictionary, mapping terms to document IDs and their positions, facilitating efficient querying. The discussion covers challenges like the increased number of terms generated by engrams and their impact on storage and search quality.

15:07

πŸ“ Query Processing and Scoring Mechanisms

This section outlines the process of querying a search engine, which involves a query parser, index reader, and scoring system. The query parser's role is to interpret user queries, similar to SQL, and prepare them for the search engine. The index reader utilizes the inverted index to fetch relevant document information based on query terms. Scoring algorithms, such as BM25, reorder search results to reflect relevance to the user's query. The speaker also attempts a live demo using the Enron email dataset, searching for the term 'fraud,' but encounters a technical issue.

20:09

πŸ”‘ Advanced Search Engine Topics and Practical Considerations

The speaker touches on advanced topics like faceting, which provides counts of documents containing specific terms, allowing users to make informed decisions during their search. Boosting is mentioned as a method to artificially increase the score of certain results during the scoring process. The 'more like this' feature is introduced as a way to find contextually similar documents. The speaker also points out that the provided code is for educational purposes and not suitable for production use, emphasizing the complexity of building a robust search engine. Additional resources for further learning in information retrieval are suggested.

25:10

❓ Audience Q&A and Final Thoughts

The final paragraph consists of a question-and-answer session with the audience. Questions address the importance of stopwords in scoring algorithms like BM25 or TF-IDF, techniques for full phrase matching, and the handling of prefix-heavy languages like Greek in search engines. The speaker provides insights into these topics, explaining how algorithms filter out frequent words and the approaches to phrase matching. The discussion wraps up with acknowledgments and thanks to the audience, with the speaker's contact information and resources provided for further exploration.

Keywords

πŸ’‘Search Engine

A search engine is a software system designed to search a body of information. In the context of the video, Daniel Lindsley discusses building a Python-based search engine, emphasizing the differences between web search engines like Google and in-house search technologies. The talk aims to provide insights into the workings of search engines, which is central to the video's theme of understanding and potentially developing custom search solutions.

πŸ’‘Inverted Index

An inverted index is a data structure used by search engines to store a mapping from content, such as words or numbers, to its locations in a database file, or in a document or a set of documents. In the video, the concept of inverted indexes is introduced as a fundamental part of how search engines work, allowing for efficient querying of documents based on terms.

πŸ’‘Tokenization

Tokenization is the process of breaking text up into tokens, which are the smallest units of meaning. In the script, tokenization is described as a step in preparing documents for indexing, where the text is split into individual words or terms that can be indexed and searched.

πŸ’‘Stemming

Stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root formβ€”generally a written word form. In the video, stemming is discussed as a technique to find root words, which helps in matching different forms of a word during search queries.

πŸ’‘N-grams

N-grams are contiguous sequences of n items (tokens or characters) from a given text. In the context of the video, n-grams are used as a technique for generating terms for the inverted index, which helps in capturing the context of the words and improving search results.

πŸ’‘Relevance

Relevance in the context of search engines refers to the pertinence of the search results to the user's query. The script discusses how relevance algorithms are applied to determine how well a document matches an individual query, which is a key aspect of the search engine's effectiveness.

πŸ’‘Drill Down (Faceting)

Drill down, also known as faceting, is a method of exploring data in a more granular way, often used in search interfaces to filter results based on specific attributes. The video script mentions drill down as a feature that allows users to refine their search queries by applying additional criteria.

πŸ’‘Boost

Boosting in search engines is a technique used to increase the ranking of certain results in the search output. In the script, boost is explained as a method to prioritize certain kinds of results, ensuring they appear higher in the search results set.

πŸ’‘Document

In the context of search engines, a document refers to the text or content that is being indexed and searched. The script emphasizes the importance of considering documents as text blobs with metadata, which are crucial for the indexing process and the quality of search results.

πŸ’‘Stop Words

Stop words are commonly used words (such as 'the', 'is', 'at', 'which', and 'on') that are filtered out before or after processing of text. In the video, stop words are mentioned as words that do not contribute much meaning to the content and are typically excluded from the indexing process to improve search efficiency.

πŸ’‘BM25

BM25 is a ranking function used by search engines to score the relevance of documents based on a search query. The script briefly touches on BM25 as an example of a scoring algorithm, highlighting its significance in determining the order of search results.

Highlights

Daniel Lindsley introduces the concept of building a Python-based search engine and its significance.

The importance of understanding how search engines work to gain insights into working with other engines is emphasized.

Avoiding the creation of redundant search libraries is suggested as a community goal.

The necessity of in-house search capabilities due to the limitations of web scraping by external search engines like Google is discussed.

The advantage of having a better understanding of one's own data model for effective search is highlighted.

Core concepts of search engines, such as document-based systems and the avoidance of large text blobs, are introduced.

The concept of 'inverted indexes' as a key component of search engine technology is explained.

Terminology such as 'stemming', 'segments', 'relevance', 'faceting', and 'boost' is defined for clarity in search engine discussions.

The process of indexing, involving document receipt, storage, tokenization, and term indexing, is outlined.

The importance of document quality for effective search results is stressed.

Tokenization is described as the breakdown of text into individual words or tokens for search processing.

Stemming is presented as a method to find root words for more efficient search indexing.

The introduction of n-grams as an alternative to stemming to overcome language limitations is discussed.

The construction of an inverted index using terms and document metadata for efficient querying is detailed.

The process of querying a search engine involves a query parser, index reader, and scoring mechanism.

A demo of searching the Enron email dataset using the implemented search engine is attempted, showcasing practical application.

Advanced topics such as 'faceting', 'boosting', and 'more like this' features in search engines are briefly introduced.

The speaker provides resources and encourages further learning in the field of search engine technology.

A question-and-answer session explores practical considerations, such as the handling of stop words in scoring algorithms and full phrase matching techniques.

The challenges of implementing search engines for languages with heavy use of prefixes, like Greek, are discussed.

Transcripts

00:03

Good afternoon, everyone. Our speaker this afternoon, for the first half of this track, is going to be Daniel Lindsley. He's going to be speaking on building a Python-based search engine.

00:21

How are you all? Did you have a good lunch? Yes? All right, good. So my name is Daniel Lindsley. I am from Kansas (not this Kansas, but this Kansas), but really I'm from the internets. I run a little web shop called Toast Driven (you may have seen this), and I do consulting in open source. I'm the primary author of a package called Haystack, which is a Django app for pluggable search, but we're not here to talk about Haystack today.

00:55

The goal of what we're going to talk about today is to give you a broad overview of how search technologies work, because if you're familiar with other technologies like RDBMSes, they're not the same beast. They work very differently, and understanding how a search engine works can give you insights as to how to work with other engines. What we're going to try not to do is have all of you write yet another search library, because we really probably don't need that. There'll be code and a link to the slides at the end of the talk, so bear with me.

01:31

So why should you care about search? Doesn't the Googles handle that? Well, it turns out there are a lot of good reasons to have in-house search. One of those is that things like Google, or Ask, or Microsoft and Yahoo's technologies, have to scrape web content, which means they get HTML soup, and they have to parse through it and try to pull out any kind of meaning. Getting reasonable, proper content out of that can be a big nightmare. Another good reason is that you know your data model better than any scraper does. They just have HTML; they might have ad content in there, they might have RSS feeds, all kinds of things that aren't pertinent to the content that's really being displayed. Or maybe you're weird and you don't do web apps, like I do.

02:22

So, good reasons to have it. Let's talk about some core concepts. Search engines are all document-based; these are not row DBs. In your RDBMS you're never, ever just grepping through a string; that is like the worst thing you can possibly be doing. We don't want to be looking at big text blobs every time we go to run a query. You'll hear me talk a lot about inverted indexes (we're going to get into this in a few minutes) as well as stemming, n-grams, and relevance.

02:54

As with all fields, there's a ton of terminology, so we're going to go through some stuff really quick so that what I say throughout the rest of the talk has some meaning. We'll start with "engine". The engine is the black box you hand a query to and get the results from. Good examples of open source engines are things like Solr, Elasticsearch, Xapian, Whoosh, and many others; Sphinx is an example.

Documents: documents are the text blob. The text that you send to a search engine is one of the most important things that you can pay attention to, and additionally you probably want some additional metadata to describe what that text blob consists of: where it came from, titles, other attributes and stuff.

Corpus: a corpus is just the collection of all the documents in the index.

Stop words: stop words are words that are basically filler. These are things like "and", "the", "but", and whatnot; words that don't contribute a lot of meaning but appear very frequently in documents.

Stemming: stemming is finding root words, and we're going to look at this a little bit later.

Segments: segments are the data that make up the overall index. Because you may be dealing with gigabytes or terabytes of data, shoving everything in one big file really doesn't scale well, so a common technique is to shard things down into what are called segment files, which as a whole make up the index.

Relevance: relevance is the algorithm, or algorithms, you may apply to the results you get back out of the search engine to determine how well a certain document or result may fit an individual query.

Faceting: faceting is drill-down. If you've ever been on amazon.com, that left-hand bar that lets you say "hey, I want a camera that's a Canon in the $200 to $400 range" and lets you drill down, that's faceting.

Boost: boost is a way to push up certain kinds of results in the result set so that they bubble up to the top.

05:04

So let's start at the beginning. Let's talk about indexing. Indexing is taking a document (that blob that we were talking about, with metadata), pulling it into the engine, and performing transformations and storage so that we can query on it later on. There are four major breakdowns: receiving and storing documents, tokenization, generating terms, and indexing those terms.

So let's start with documents. Documents are the simplest thing we will talk about all day. They are not ever, ever, ever (not, not, not) a row in the DB. Really. No, really. You really want to be thinking of, like I've said, a blob of text and some metadata that goes with it. Like I've said, quality is the most important thing: what you index is what your users will be able to search on. If you put a bunch of junk in, they're going to get garbage results out.

Search documents are a very flat concept. A lot of people try to work relations into search engines, and it never ends well. I like to think of search engines as kind of the original NoSQL, after Berkeley DB, so don't try to put relations in; it doesn't really work well. And the one thing I really want you to take away from any of this, because it applies everywhere, is that you want to denormalize the living crap out of anything you're indexing. The more you can shove onto the individual documents that has meaning (again, text quality is important), the better they will bubble up in results, and the better results your users will get when they search.

So a document looks really familiar if we represent it as, say, JSON or Python: it's just a dict. There's really not a whole lot to it. It's a text blob with optional metadata, and from there you can imagine all kinds of ways to store that data.

07:10

Tokenization is the process of taking that big text blob and breaking it down into little individual word-sized chunks. Typically what you'll find most engines doing is splitting on things like whitespace, lowercasing every single token that they get out of that, filtering out all those meaningless words like "and" and "the" and "but", and stripping punctuation, etc. Many times this is configurable, so that you can tweak it to match your document set. The point of that is to get some nice, neat, normalized little tokens out that you can work with when querying.

07:50

So we mentioned stemming, and that's where we'll go next, because now we've got documents and we've got a list of tokens in those documents. The next thing we want to do is take those mostly normalized tokens and get something even better out of them, by doing stemming. If you do something like grep, you're processing through a file stream and it's literally just checking "hey, is it here, is it here, is it here?" That's horrible when you've got gigabytes and gigabytes and gigabytes of stored document content. So when we do this tokenization and the post-processing, we're really close to what we want, but not quite. Stemming is the process of finding the root word. Examples of this would be "testing" becoming "test", because the "-ing" ending is not particularly meaningful, or "searcher" becoming "search", because we want to find root words so that if someone misspells something, or if there's a pluralized form, we can get back down to that root word and store one variant rather than 20.

These then become the terms in the inverted index that we'll be querying on, and if you apply the same tokenization and stemming to the query, what you get out of a user's query is the exact same thing as the terms you're shoving in the index.

The downsides of stemming are that it really only works for the language it was built for. Most stemmers that are out there are pretty much English-only. There are other languages, but they don't work particularly well and they're poorly supported; it's really, really painful even in high-quality search engines. And they're really hard to make work cross-language, because the structure of German or French is not the same as English. So this is a really bad shortcoming. There are many ways to solve this; the approach we're going to look at today is n-grams.

09:46

As I said, this takes care of some of the shortcomings of stemming. We're going to introduce some new problems, but overall it's typically worth it. What n-grams consist of is taking a window and passing it over your tokens, and these new little terms that come out of this moving window are what become your terms in the index, as opposed to a stemmed version of the word. We'll see an example of this in a few moments.

For instance, if we do a typical n-gram and we assume a gram size of three, we've got three characters, "hel", then "ell", and then we jump to the next token, and generate a complete list of terms based on the tokens that we have. Now that's great, except that typically you're probably not typing in "llo" as a query. So what we want to do is find another way to approach this shortcoming. Edge n-grams typically solve this. What an edge n-gram will do is pin to a side of a token. In this case, with a gram size of three we still have that "hel", but instead of moving the window along, we're going to generate different gram lengths, so that we have a gradually more and more complete word that becomes our terms. And then again, once we switch to the next token, we start with a gram size of three again, move to four and five, and as you can see, these terms are much more likely to be in a user's query than "llo".

So this is great because, first of all, it works great for autocomplete, because a user can start typing something very short and you can start immediately feeding results back to them. It's also good because it works across other languages, because it no longer relies on the grammatical structure. You're just simply looking at tokens, and if a word starts the same way in a different language, it'll match well to whatever terms you generated. The cons are that when you use n-grams or edge n-grams you generate a lot more terms, and as a result storing all those terms can be kind of a problem, or at least it bloats up on space, and initial quality of search can suffer a little bit, because now you're dealing with fragments, and something that you may not have meant to match now matches.

12:20

So an example in Python might look something like this. We set a minimum and max gram size, like we did on the previous slide, three to six. We have a dictionary of terms that we're going to be storing out of this. We pass in a list of tokens, run through it, make sure we grab position information as well as the individual token, and then, for the range of our gram size, pass a window over it and shove the newly generated term into our terms.
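
A sketch of what that slide describes, with gram sizes of three to six and per-token position information (the function and variable names here are illustrative, not the slide's own):

```python
MIN_GRAM, MAX_GRAM = 3, 6

def make_edge_ngrams(tokens):
    """Generate edge n-grams, anchored to the start of each token.

    Returns a dict mapping each generated term to a list of
    {'position': ...} entries, one per occurrence.
    """
    terms = {}
    for position, token in enumerate(tokens):
        # Grow the window from MIN_GRAM up to the token's length
        # (capped at MAX_GRAM), always pinned to the left edge.
        for size in range(MIN_GRAM, min(MAX_GRAM, len(token)) + 1):
            gram = token[:size]
            terms.setdefault(gram, []).append({"position": position})
    return terms

print(sorted(make_edge_ngrams(["hello", "world"])))
# ['hel', 'hell', 'hello', 'wor', 'worl', 'world']
```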

12:47

So now we've got documents and we've got n-gram terms. The next step is storing this in a way that we can actually query on. That comes in with the inverted index. The inverted index is the heart and soul of the search engine; it is how things work. You can think of it exactly like a Python dictionary, because it's largely key-value. Those keys are the terms that we just got finished generating with n-grams, and it's all the terms from all the documents. So you can imagine taking a large body of text and generating tons and tons and tons of edge n-grams; each one of those that you generated becomes a new key in that big inverted index. And of course this is across all documents, not just one document at a time. The position information that we were grabbing in the previous example of edge n-grams is important, because we can say how far into the document this word appeared. We can also say how many times this word appeared in this document. And of course we want to store document IDs, so that when we search we can map things back to the data that we've stored.

13:59

So an index, as I said, might look just like a plain old Python dictionary, and you'll be storing the individual terms, like "blob" and "text" (if you had an n-gram size of four, for instance), and document IDs, such as document 1524, where "blob" appeared at position three in that example we saw on a previous slide.

14:21

Once you have this index generated, the problem then becomes querying over it efficiently. Like I said, if you have this huge, massive data structure hanging out, you can't put it all in one file and search over it effectively, so there are lots and lots of ways to do segments, to break that index down for efficient querying. Many, many projects follow this famous Java project called Lucene, which is really, really good at search. But we're going to cheat, because it's easier.

14:58

So in the example code that I've got, we're going to just use flat files. We're going to take those n-gram terms that we generated when we ripped through the big blob of data, and we're going to hash them, and we're going to take part of that hash (the leading part) and use it to map onto a file in the file system. This limits the number of files that we've got down to something that a Unix-like system can handle, and it ensures that if you've got a given n-gram, you can easily map right down into the proper segment file that it's associated with. We're going to keep terms in always-sorted order, and we're going to use JSON to store the document and position information, not because it's particularly performant, but because it'll work well across other languages or other tools.

15:42

So an example implementation might look something like this. I mentioned we're going to hash, so we just take the term and pass it in. We're going to take a length of six on the hash, just so that it's short and keeps a reasonable number of files in the file system, and we just simply do a simple MD5 on it, get a hex digest, and return the first six characters. As far as saving a segment: this isn't in always-sorted order (the code I'm going to present to you at the end is always in sorted order), but it's simply just appending the information onto the file, and we're going to tab-delimit, because all of our tokens have already been whitespace-separated, so there's no chance of tabs showing up.
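
A sketch of that flat-file segment scheme; the file layout and names are guesses at the spirit of the talk's code, not a copy of it:

```python
import hashlib
import json
import os

SEGMENT_DIR = "index_segments"  # hypothetical location for segment files
HASH_LENGTH = 6

def hash_term(term):
    """Map a term to the short MD5 prefix that names its segment file."""
    return hashlib.md5(term.encode("utf-8")).hexdigest()[:HASH_LENGTH]

def save_segment(term, info):
    """Append a term plus its JSON-encoded document/position info.

    The talk's real code keeps segments in sorted order; appending, as
    here, is the simplified variant described above. Fields are
    tab-delimited, which is safe because tokens were whitespace-split.
    """
    os.makedirs(SEGMENT_DIR, exist_ok=True)
    path = os.path.join(SEGMENT_DIR, hash_term(term) + ".index")
    with open(path, "a") as segment:
        segment.write("{}\t{}\n".format(term, json.dumps(info)))

def load_segment(term):
    """Scan a term's segment file and return its stored info, if any."""
    path = os.path.join(SEGMENT_DIR, hash_term(term) + ".index")
    if not os.path.exists(path):
        return None
    with open(path) as segment:
        for line in segment:
            stored_term, info = line.rstrip("\n").split("\t", 1)
            if stored_term == term:
                return json.loads(info)
    return None
```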

16:20

So by getting to this point, we've already covered creating the terms, storing them, and storing the documents. We've got enough information that now we can start querying on it. Querying breaks down into a query parser, an index reader, and scoring, at a high level. So let's dive into the query parser. Typically the purpose of a query parser is taking a user's handwritten query (maybe they have advanced knowledge of it, maybe not; you can think of this as analogous to SQL) and parsing it out into something that the engine can start tackling. You're going to process all the elements of that user's query the same way you did when preparing the document for indexing.

17:06

So an example trivial Python implementation might look something like this. We've got a list of common English stop words, and we're going to really, really cheat in this case: we're just going to split it up, check that each token's not in the stop words, toss it in our list of tokens if it's not there, and then we're going to make n-grams. It's the same kind of stuff we just did when we were preparing the index.
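
A sketch of that trivial parser, reusing the edge n-gram helper from the earlier sketch so query terms line up with indexed terms (the stop-word list here is abbreviated):

```python
STOP_WORDS = {"a", "an", "and", "but", "is", "it", "of", "the", "to"}

def parse_query(query):
    """Split a raw query, drop stop words, and n-gram what remains."""
    tokens = [
        token for token in query.lower().split()
        if token not in STOP_WORDS
    ]
    # Same term generation as at index time, so lookups match.
    return make_edge_ngrams(tokens)

print(sorted(parse_query("the hello world")))
# ['hel', 'hell', 'hello', 'wor', 'worl', 'world']
```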

17:26

Index reading becomes easy now, because we have this list of edge n-gram terms. We can simply go through, hash them, and find whatever file we just stowed them in, in a very quick, efficient lookup. We just rip through all the terms that we generate from a user's query, grab all the hashes, go into the files, pull out the results, and collect the position and document information. This is a little big (I apologize if it's small), but it's basically the same kind of stuff: we make a segment name so that we know what the file name for the term is, and check if it's there. If it's not, we just return nothing. If it's there, we go through and start ripping through the file (there are more efficient ways to do this, but we're splitting on the tab for now), so that we've got the term and the info. If we found that term (say we entered "hello" as our user query and we found "hel" in the segment), we're going to pull out those results, because we found a match. And collecting the results is simply a matter of going through all the terms that were generated from the user query and applying that load-segment function to those terms.

18:35

Finally, where we're at now is: we've got a user's query, we've tokenized it, we've made terms, and we've gotten all the results together, but they're out of order. They have no meaningful order for the query that we generated. So what we're going to do with scoring is reorder that collection so that it means something based on the user's query. There are tons and tons of choices of scoring algorithms; we're not going to go through all of them. A major one is BM25. It's very famous; you can look it up on Wikipedia, and it's pretty interesting. There's also what's called the phased scoring approach; Google's PageRank is a great example of this. This can be anything: you may decide that words that include "Bob" are really, really important, and you're just like, "Bob, you're always up at the top, buddy." An implementation of BM25 looks something like this. I can't explain it, I'm sorry.
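
Since the slide itself isn't reproduced here, the following is a textbook-style simplification of BM25, not the talk's exact code; k1 and b are the standard tuning knobs the Q&A touches on later:

```python
import math

def bm25_score(term_freq, doc_count, docs_with_term,
               doc_length, avg_doc_length, k1=1.2, b=0.75):
    """Score one term of a query against one document.

    - term_freq: occurrences of the term in this document
    - doc_count / docs_with_term: corpus statistics driving the IDF,
      which is what pushes extremely frequent terms toward zero
    - doc_length / avg_doc_length: length normalization (the b knob)
    - k1: how quickly repeated occurrences stop adding to the score
    """
    idf = math.log(
        (doc_count - docs_with_term + 0.5) / (docs_with_term + 0.5) + 1.0
    )
    tf_part = (term_freq * (k1 + 1)) / (
        term_freq + k1 * (1 - b + b * doc_length / avg_doc_length)
    )
    return idf * tf_part

# A document's score for a whole query is the sum over its terms.
print(bm25_score(term_freq=3, doc_count=1500, docs_with_term=12,
                 doc_length=250, avg_doc_length=300))
```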

19:33

So as a quick demo of this, I have indexed the first 1,500 documents of the Enron email dataset. So, if anyone has any terms appropriate for that company that they'd like to search on? "Fraud"? Okay, let's do "fraud"... A demo fail. I'm mad. Okay, I'm sad, because this literally just worked before I came down. This is why you don't do demos in talks. Thank you.

20:19

So let's talk about some advanced topics, shall we? I'm not going to cover any code on this, but diving into it would probably be a great exercise for you. God, I sound like a college professor. Horrible. So, faceting. Faceting is made out to be a lot of things, but what drill-down really consists of is, for all the terms that are provided, giving accurate counts of the number of documents that contain those things. So when you're on Amazon and you're digging through digital cameras and you say "Canon", they give you a count; that's really all faceting is, just that count. But that count is important, because it lets the user know both what things are matching within the document set and how many there are. Say there are only two Canon cameras on Amazon (which will never, ever happen); you might be like, "Well, wow, they don't really have a great selection; maybe I need to go elsewhere." So an implementation would probably look like collecting all the terms, counting the length of the unique document IDs for each of them, and then just ordering by descending count. Once you have those terms, and you know how many things matched and how many documents there were, you just order by the count, and you can provide back something that's useful to the people who are using your software.
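
A sketch of exactly that: count unique documents per term in a toy inverted index (shape assumed as term -> {doc_id: [positions]}), then sort descending:

```python
from collections import Counter

def facet_counts(index):
    """Count the unique documents containing each term."""
    counts = Counter({term: len(docs) for term, docs in index.items()})
    return counts.most_common()  # (term, count) pairs, biggest first

print(facet_counts({
    "canon": {"doc_1": [0], "doc_2": [4]},
    "nikon": {"doc_3": [1]},
}))
# [('canon', 2), ('nikon', 1)]
```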

21:39

Boost happens during the scoring process. It's literally just saying, "hey, we've got a score that we generated, but we want to artificially bump things up." So Bob gets a bump up if it's matched; you just tweak the score. It would literally be nothing more than modifying the BM25 relevance score.

21:58

Another common one is something like "more like this". What more-like-this provides is basically contextually similar documents. You've got this huge list of terms that you generated when you indexed, so you know all the terms that were in a given document, and you can also find other documents that have a very similar set of terms. An implementation would look like collecting all the terms based on documents (so you probably need an alternative data structure to store a document-to-term mapping) and then just sorting, simply based on how many times a given document was seen in the set.
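
A minimal sketch of that sorting step, assuming the document-to-term mapping mentioned above is available as plain sets:

```python
def more_like_this(doc_id, doc_terms, limit=5):
    """Rank other documents by how many terms they share with doc_id.

    `doc_terms` maps doc_id -> set of terms: the document-to-term
    mapping you'd keep alongside the inverted index.
    """
    source_terms = doc_terms[doc_id]
    overlaps = {
        other: len(source_terms & terms)
        for other, terms in doc_terms.items()
        if other != doc_id
    }
    return sorted(overlaps.items(), key=lambda pair: pair[1],
                  reverse=True)[:limit]

docs = {
    "doc_1": {"enron", "energy", "fraud"},
    "doc_2": {"energy", "fraud", "audit"},
    "doc_3": {"lunch", "menu"},
}
print(more_like_this("doc_1", docs))  # doc_2 shares two terms, doc_3 none
```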

22:36

This is really, really simplistic, as is most of this presentation, just because it's at a high level. Engines like Solr and whatnot use much more complex solutions, like natural language processing, as well as other types of contextual analysis, to get a much better quality result.

22:59

So I told you not to go invent a search engine, and I went and invented a search engine. The code is up on GitHub, BSD-licensed; feel free to poke through it. I've annotated the whole source, so it should be easy to dive through. Here's a list of some additional resources; lots of good stuff here. The IR book is fantastic and largely free on the internet. It's worth it for any kind of textual processing, and just for learning about how information-retrieval kinds of topics work, as well as a bunch of other things. And thank you very much.

23:42

As I mentioned, the slides are up on Speaker Deck at this URL, once I make it public. They'll be up just after this talk.

24:02

(Host) We have about six minutes for questions. If anyone would like to ask a question, please line up behind that microphone. A quick note on process: ask your question, I will repeat your question, and then he will answer it.

(Audience) Just to kind of review: where do you think the cutoff is between rolling your own search engine and just basically trying to figure out how Solr works and implementing it yourself?

(Daniel) To me, the code I put up, I would say no one should ever deploy in production. It is horrific on I/O, because you're literally writing with every single term. You probably want something that does something more like indexing a bunch of documents in memory, then doing a commit and flushing them out to disk. So to me, it's a learning exercise. It was great to go through and solidify things for myself at a high level. I'd encourage you to go ahead and do that if it's of interest to you, but it rapidly becomes one of those things where you'll probably start hating life if you have to support it.

(Audience) Hey, can you speak to the relative importance of things like stop words when you're using something like BM25 or TF-IDF that takes into account the frequency of words, or refer to any resources that show those kinds of differences?

(Daniel) Sure. So when I showed this, I kind of glossed over some things. BM25, like he mentioned, does a great job of filtering out words that are extremely frequent. Again, I can't explain the math on this, but essentially if something has, I believe, something like a 75% chance of appearing, it'll just be ignored. The way the relevance score is calculated is kind of weird; it's a trip to read on Wikipedia.

(Audience) You can tune it; the 75% is hard-coded, right?

(Daniel) Right. The b that's up there, that's passed in as a parameter, that actually is document length; if you set that, it'll adjust the score slightly. As well, the k modifier is just used to kind of normalize the range. When you do scoring, you don't get a zero-to-100% match like you might have seen in older search engines; you get a weird floating-point number, and it doesn't really mean a whole lot, because it's just based on the document set as well as the query. The moment someone changes the query even slightly, you can get very different scores. That's why you rarely see people exposing scores on the front end of things: they don't mean a lot. Now, BM25 (and I believe the phased approach also handles this) already kind of deals with excluding things like stop words and really frequently repeated words. The problem is that if you lean on just that, in my opinion, you don't get as good quality results, because other things might be repeated very frequently. Say you're in the Python documentation: the word "python" is everywhere, and it can be ranked lower by a score out of BM25 just because of how frequently it appears, even though it's not really a stop word. Also, if you filter out stop words ahead of time, you save on lots and lots of disk space; it's really more of an I/O-saving crutch.

(Audience) How are you doing? What are some techniques and algorithms for doing full phrase matching?

(Daniel) So the question was: what are some techniques for doing full phrase matching? It really depends on how the query is parsed and how the index is read. What will typically happen is you'll find that a query gets broken down into something that's more like a tree, with operators in between. So if you double-quote something in a query, chances are that'll get pulled out (if you imagine a query tree) into an exact-match node, and it'll generate terms. In the case of n-grams (not everything uses n-grams; in fact, most things default to stemming, but n-grams work better in a lot of circumstances), in that exact-match node of the query tree, it will still generate those edge n-gram terms and look them up, but probably generate them for the max length of the word. So in the case of "hello world", both terms are five characters, so you generate an edge n-gram of five for both of them, look them up, find documents that have both of those words, and then look at the position information. If the positions aren't right next to each other, you discard the result.

(Audience) So you're just manually filtering the result set, then, to find the documents that actually have that phrase?

(Daniel) Essentially, yes. Now, some engines take a different approach, and they will store full matches. You have to realize most search engines are nearly infinitely configurable, so you can make them do whatever you want, and this is a really, really simple view of things. But some engines will store full exact matches. The thing you don't want to do is try to go through an entire blob, because as soon as you do that, performance just drops.

(Audience) Thanks a lot.

(Host) Any questions? I actually have one: how does n-gram searching work with languages that make heavy use of prefixes, like Greek, for instance?

(Daniel) So, I can't speak to Greek, but it depends how it's prefixed. If it's prefixed with punctuation (like, say, it's hyphenated or something), typically you'll be splitting on punctuation anyway. If you're talking about... I'm trying to think of examples in English of common prefixes.

(Audience) Epicenter.

(Daniel) Epicenter. So another thing that I've kind of hidden away in this talk is that, for instance, Solr's and Lucene's default implementation generates edge n-gram sizes from 2 to 15. So even though that doesn't help in the case of prefixing, it does in postfixing, and the way you get around it for prefixed things is typically to generate n-grams instead of edge n-grams.

(Host) Got it. Ladies and gentlemen, Daniel Lindsley. Thank you very much.


Related Tags
Python Search, Search Engine, Daniel Lindsley, Inverted Index, Tokenization, N-grams, BM25 Scoring, Query Parsing, Open Source, Document Indexing