Web Search: Crash Course AI #17
Summary
TLDR: Crash Course AI explores how search engines like Google operate using AI to find answers. It explains the process from crawling the web with web crawlers to organizing data with inverted indexes. The video also touches on how user behavior influences search rankings and the use of knowledge bases for direct answers. It highlights the challenges AI faces with nuanced questions and biases in data.
Takeaways
- **Search Engines as AI Systems**: Modern search engines like Google use AI to help users find information by gathering and organizing data from the World Wide Web.
- **From Libraries to Web Crawlers**: The concept of search engines dates back centuries, evolving from physical libraries to digital web crawlers that systematically download web pages.
- **The Web and the Internet**: The script clarifies the difference between the Internet (a network of computers) and the Web (part of the Internet that uses browsers to display documents).
- **Web Crawlers**: Web crawlers are programs that start from a 'seed' page and recursively download linked pages, forming the basis of search engine databases.
- **Inverted Index**: Search engines use an inverted index to organize web pages by words, allowing for quick searches when users enter queries.
- **Query Processing**: When a user submits a query, the AI uses the inverted index to find relevant web pages that contain the search terms.
- **Ranking Results**: Search engines rank web pages to ensure the most relevant results appear first, using user behavior data like bounces and click-throughs to refine rankings.
- **Knowledge Bases**: For direct answers, AI systems use knowledge bases that encode information as relationships between objects, unlike inverted indexes used for web page links.
- **NELL - Never Ending Language Learner**: An example of a knowledge base is NELL, which autonomously extracts facts from web pages and uses repetition and multiple sources to validate information.
- **Bias in AI**: The script highlights that AI systems can inherit biases present in the data they learn from, affecting the neutrality of search results.
- **Limitations of AI**: Certain questions that are not commonly asked or have limited data available can stump AI systems, illustrating the ongoing challenges in training comprehensive AI models.
Q & A
What is the primary function of search engines?
-Search engines primarily gather data, create organization systems to sort that data, and find results to a question.
How do search engines compare to traditional libraries in terms of data organization?
-Search engines and traditional libraries both gather and organize data. Libraries use physical organization systems like shelving and cataloging, while search engines use digital systems like inverted indexes.
What is the role of a Web crawler in search engines?
-A Web crawler is a computer program that systematically finds and downloads Web pages to gather data for the search engine AI to process.
Can you explain what an inverted index is in the context of search engines?
-An inverted index is a lookup system used to organize Web pages. For each word, it lists all the Web pages that contain that word, usually represented by ID numbers instead of URLs.
How does the AI in search engines determine the relevance of search results?
-The AI uses an inverted index to find relevant pages and then ranks them based on various factors to ensure the top results are more likely to be relevant.
What is the significance of user behavior in training search engine AI?
-User behavior, such as bounces and click-throughs, provides training data for AI systems to learn how to rank search results and better answer user queries.
What is a knowledge base and how does it differ from an inverted index?
-A knowledge base encodes information as relationships between objects. Unlike an inverted index, which is used for searching, a knowledge base is used to directly answer questions by matching incomplete facts.
What is the Never Ending Language Learner (NELL) and how does it work?
-NELL is a huge knowledge base created by Carnegie Mellon University that extracts facts from Web pages. It starts with human-provided facts, identifies patterns, and learns new facts and relationships by searching the Web.
How does an AI system like Siri or John Green Bot answer direct questions?
-AI systems like Siri reformulate questions into incomplete facts and then search a knowledge base for matches to provide direct answers.
Why do some questions stump AI systems?
-Questions that stump AI systems are often those that not enough people ask, or for which the AI hasn't learned how to answer well yet due to lack of data or training.
What is the potential issue with biases in search engine AI systems?
-Search engine AI systems can be influenced by biases in the data online, leading to skewed or incomplete results, such as predominantly showing images of female nurses when searching for 'nurses'.
Outlines
Introduction to Search Engines and AI
Jabril introduces the topic of search engines in AI, comparing the modern digital search to the traditional library system. He explains that search engines like Google, Bing, and others are AI systems that gather data, organize it, and provide answers to users' queries. The analogy of a library is used to illustrate how librarians organize books, and how search engines use AI to organize web data. The video also distinguishes between the Internet and the Web, explaining that the Web is a part of the Internet that uses browsers to display content. The process of web crawlers is introduced as the initial step in data gathering for search engines.
How Search Engines Organize and Rank Information
This section delves into how search engines organize the vast amount of data they collect. It explains the concept of an inverted index, which is a system that lists all web pages containing a specific word, allowing search engines to quickly find relevant pages for a query. The paragraph also discusses the importance of ranking search results, ensuring that the most relevant results appear at the top. It highlights how user behavior, such as bounces and click-throughs, provides training data for AI systems to learn and improve search result rankings. The Never Ending Language Learner (NELL) is introduced as an example of an AI system that can extract facts from web pages to build a knowledge base.
AI's Use of Knowledge Bases and Challenges in Search
The final paragraph discusses the use of knowledge bases by AI systems to provide direct answers to users' questions, as opposed to just providing links to web pages. It explains how AI systems can reformulate questions into facts and search through knowledge bases for answers. The paragraph also touches on the limitations of AI in answering certain types of questions due to lack of data or the complexity of the question. It warns about the potential biases in search engine results due to the biases present in the data online, setting the stage for a future discussion on algorithmic bias.
Keywords
Search Engines
Web Crawler
Inverted Index
AI Training
Knowledge Base
Never Ending Language Learner (NELL)
Click Through
Bounce
Bias in AI
Part of Speech Tagging
Highlights
Settling everyday questions has shifted from dinner-table shouting matches to search engines backed by vast amounts of human knowledge.
Google, Siri, and Alexa are examples of AI systems improving search capabilities.
IBM's Watson demonstrated AI prowess by beating top Jeopardy players.
Search engines gather and organize data to find answers to questions.
Libraries serve as an early form of search engine with physical organization systems.
Web search engines use AI to look through data on the World Wide Web.
The Internet and the Web are not the same; the former is a network of computers, while the latter is the part of it that sends documents and other content in a form a browser can display.
Web crawlers are used to systematically find and download web pages for search engines.
An inverted index is a lookup system used to organize web pages for search engines.
Search engines use inverted indexes to find relevant web pages based on search queries.
Search engines rank web pages to provide the most relevant results first.
User behavior, such as bounces and click-throughs, trains AI systems to improve search results.
AI systems use knowledge bases to provide direct answers rather than web page links.
The Never Ending Language Learner (NELL) is a knowledge base that extracts facts from web pages.
NELL uses repetition and multiple sources to confirm the accuracy of extracted facts.
AI can reformulate questions into facts and search knowledge bases for answers.
Search engines struggle with questions that are rarely asked or have incomplete data.
AI systems can have difficulty with nuance and are influenced by biases in online data.
Crash Course AI is produced in association with PBS Digital Studios, offering a community on Patreon.
Transcripts
Hi, I'm Jabril and welcome to Crash Course AI! There used to be a time when a group of
friends at dinner could ask a question like "is a hot dog a sandwich?" and it would
turn into a basic shouting match with lots of gesturing and hypothetical examples.
But now, we have access to a LOT of human knowledge in the palm of our hands... so our
friends can look up memes and dictionary definitions and pictures of sandwiches to prove that none
of them have a connected bun like hot dogs (disappointed).
Search engines are a huge part of modern life. They help us access information, find directions
to places, shop, and participate in sandwich arguments.
But how does Google find answers to questions? How are Siri and Alexa so smart but also easily
stumped? How did IBM's Watson beat the best Jeopardy players in the world?
Well, search engines are just AI systems that are getting better and better at helping us
find what we're looking for.
INTRO
When we talk about search engines, we typically think about the AI systems online, like Google,
Bing, Duck Duck Go and Ask Jeeves.
But the basic ideas behind non-AI search engines have existed for centuries. Essentially, search
engines gather data, create organization systems to sort that data, and find results to a question.
For example, when you needed an answer to a question and couldn't search online, you
could go to the library! Libraries gather data in the form of books and newspapers that
are stacked neatly on the shelves.
Librarians have organization systems to help you find what you're looking for. Knowing
that magazines are on shelves by the water fountain, while kids books are on the second
floor is a kind of organization. Plus, fiction books are sorted by the author's last name,
while nonfiction has the Dewey Decimal System, and so on.
Once you (or the librarian) have the resources you need, you'll be able to find results
to your question!
Now, rather than looking through books, web search engines look through all the data on
the World Wide Web, aka "the Web". And instead of asking a human librarian where
to find information, we ask an AI like John-Green-bot instead.
Jabril: Oh John Green Bot?
[JGB dialup beeps]
Alright John Green Bot you're all set.
We're going to need that later.
And just so we're clear, we're using "Web" throughout this video even though it might
sound a little old-fashioned. That's because the Internet and the Web are not the same thing.
The Internet is a collection of computers that send messages to each other. Video services
like Netflix that play on your TV, for example, use the Internet, not the Web.
The Web, on the other hand, is part of the Internet and uses the Internet's connections
to send documents and other content in a format that can be displayed by a browser like Chrome
or Safari.
As with most AI systems, the first step is to gather lots of data. To gather data on
the Web, we can use a computer program called a Web crawler, which systematically finds
and downloads Web pages. This is a HUGE task and happens before the search engine AI can
take any questions.
It starts on some Web page that we pick, called a seed, and downloads that page and finds
all its links. Then, the crawler downloads each of the linked Web pages and finds their
links, and so on... until we've crawled the whole Web.
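The crawl described above can be sketched as a breadth-first traversal. This is only a minimal illustration: the `TOY_WEB` link graph, its page names, and the `crawl` function are hypothetical, and the "download" step is simulated in memory rather than fetched over a network.

```python
from collections import deque

# Hypothetical toy "Web": each page name maps to the pages it links to.
TOY_WEB = {
    "genghis-khan": ["marco-polo", "mongols"],
    "marco-polo": ["genghis-khan", "marco-polo-2"],
    "mongols": ["genghis-khan"],
    "marco-polo-2": ["water-polo"],
    "water-polo": [],
}


def crawl(seed, fetch_links):
    """Breadth-first crawl; returns pages in the order they were downloaded."""
    seen = {seed}
    queue = deque([seed])
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)  # stands in for downloading the page
        for link in fetch_links(page):
            if link not in seen:  # skip pages already queued or downloaded
                seen.add(link)
                queue.append(link)
    return order


pages = crawl("genghis-khan", lambda page: TOY_WEB.get(page, []))
print(pages)
```

The `seen` set is what keeps the crawler from looping forever when pages link back to each other, as the seed and the Marco Polo page do here.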
After we have collected all the data, the AI's next step is to organize it by building
an index, which is a kind of lookup system. The kind that's used for organizing Web
pages is called an inverted index, which is like the index in the back of a textbook.
For each word, it lists all of the Web pages that contain that word. Usually, the Web pages
are represented by I.D. numbers so we don't have a long, messy list of URLs.
Let's say 0 is the seed - which happens to be a page about Genghis Khan. It has a
lot of words on it like "the, mongol, Khan, Genghis, who, and is". In this inverted
index, page 1 is about Marco Polo, but it mentions the word "Genghis" along with
words like "the, Marco, Polo, who, are, and is." Page 2 is about the Mongols, page
3 is a different webpage about Marco Polo, and page 4 is about Water Polo.
So, let's say we type "Who is Genghis Khan?" into a search engine.
Our AI can use this inverted index to find results, which in this case, are links to
Web pages. The AI will look at the words "who", "is", "Genghis", and "Khan" and
use the inverted index to find relevant pages.
Our AI might find that Web pages zero, one, two and five have at least one of the words
from the question "who is Genghis Khan?" When Siri says "I found this for you,"
the AI is just returning a list of Web pages that contain the same terms as the question.
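To make the inverted index and the lookup step concrete, here is a small sketch. The page texts in `PAGES` are invented for illustration and do not reproduce the pages shown in the video; `tokenize`, `build_inverted_index`, and `search` are hypothetical helper names.

```python
import re
from collections import defaultdict

# Hypothetical page texts keyed by page ID (made up for this sketch).
PAGES = {
    0: "Genghis Khan is the Mongol ruler who founded an empire",
    1: "Marco Polo is the traveler who wrote about the heirs of Genghis",
    2: "The Mongols",
    3: "Marco Polo travels",
    4: "Water polo rules",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(pages):
    """Map each word to the set of page IDs that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in tokenize(text):
            index[word].add(page_id)
    return index


def search(index, query):
    """Return sorted IDs of pages containing at least one query word."""
    hits = set()
    for word in tokenize(query):
        hits |= index.get(word, set())
    return sorted(hits)


index = build_inverted_index(PAGES)
print(search(index, "Who is Genghis Khan?"))
```

Because the index is keyed by word, answering a query costs one lookup per query word instead of a scan over every page.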
Except... most search engines include one more step. There are millions of pages online
that contain the same terms. So it's important for search engines to rank Web pages, so
that the top result is more likely to be relevant than the tenth result or the hundredth.
Of course, Google and Bing don't hire "supervisors" to grade each possible question and answer
to help their AI systems learn from training data. That would take forever, and they wouldn't
be able to keep up with all the new content that gets created every day.
Really, regular users like us do this training for free all the time. Every time we use
a search engine, our behavior tells the AI whether or not the results answered our question.
For example, if we type in "who is Genghis Khan" into a search engine, and click on
a Web page about Star Trek II: The Wrath of Khan, we might be disappointed to find Genghis
Khan isn't ANYWHERE in that movie. So we'll bounce back to the search results, and try
again until we find a page that answers our question.
A bounce indicates a bad result. But if we click on a Wikipedia article about Genghis
Khan and stay for a while reading, that's a click through, which probably means that
we found what we were looking for... so that indicates a good result.
Human behavior like bounces and click throughs give AI systems the training data they need
to learn how to rank search results and better answer our questions. Data from the Web
and data from how we use the Web helps make better and better search engines.
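One very simple way to turn bounces and click-throughs into a ranking is to sort pages by a smoothed click-through rate. The sketch below is an assumption-laden illustration, not how any real search engine ranks: the page names, the counts, the `rank_by_feedback` function, and the choice of smoothing are all hypothetical.

```python
def rank_by_feedback(candidates, feedback, prior=1.0):
    """Order candidate pages by a smoothed click-through rate.

    feedback maps page -> (click_throughs, bounces). The prior adds
    pseudo-counts (an assumption of this sketch) so that a page with
    no feedback yet gets a neutral score of 0.5.
    """
    def score(page):
        clicks, bounces = feedback.get(page, (0, 0))
        return (clicks + prior) / (clicks + bounces + 2 * prior)

    return sorted(candidates, key=score, reverse=True)


# Made-up feedback counts for the query "who is Genghis Khan".
feedback = {
    "wikipedia-genghis-khan": (90, 10),
    "wrath-of-khan-review": (5, 95),
}
ranked = rank_by_feedback(
    ["wrath-of-khan-review", "brand-new-page", "wikipedia-genghis-khan"],
    feedback,
)
print(ranked)
```

A page that users mostly bounce from sinks below even a page with no history, which matches the idea that a bounce signals a bad result.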
Now, sometimes we ask our smart devices questions and we want actual answers... not links to
Web pages. When I say "OK Google, what's the weather like in Indianapolis?" I don't
want to scroll through results.
For this kind of problem, instead of using an inverted index, AIs rely on knowledge bases.
Which you might remember from our video about Symbolic AI. A knowledge base encodes information
about the universe as relationships between objects like "chocolate donut" and "John Green Bot wears polo".
One of the main problems with knowledge bases is that it's really hard to write down all
of the facts in the universe, especially common sense things that humans take for granted
but computers need to be told.
Enter AI researcher Tom Mitchell and his team of scientists from Carnegie Mellon University.
In 2010, they created a huge knowledge base called the Never Ending Language Learner or
NELL, which was able to extract hundreds of thousands of facts from random Web pages.
The way it works is really clever, so let's go to the Thought Bubble to see how.
NELL starts with some facts provided by a human, for example, the genre of music that
Mozart plays is classical. Which was represented like this:
Mozart. musicGenre. Classical.
Similarly,
Jimi Hendrix. plays. Guitar.
And Darth Vader. hasChild. Luke Skywalker.
Then, NELL gets to work and reads through each Web page one-by-one for words mentioned
in those facts. Maybe it finds the text "Mozart plays the piano."
NELL doesn't know much about these symbols, but this text matches the same pattern as
one of the facts provided by a human, specifically, the "plays" relationship. So NELL learns
a new object: Piano. And a new fact: Mozart. plays. Piano.
By searching over the entire Web, NELL can learn lots of facts based on just the three
original ones that humans gave it!
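The step of using a known relationship to find new objects can be sketched with a text pattern. This is a heavily simplified illustration and not NELL's actual method: the hand-written regular expression `PLAYS_PATTERN` stands in for a pattern the system would learn, and `extract_plays_facts` is a hypothetical helper.

```python
import re

# The three seed facts from the example, as (subject, relation, object).
seed_facts = {
    ("Mozart", "musicGenre", "Classical"),
    ("Jimi Hendrix", "plays", "Guitar"),
    ("Darth Vader", "hasChild", "Luke Skywalker"),
}

# Assumed surface pattern for the "plays" relation:
# "<Capitalized subject> plays the <object>".
PLAYS_PATTERN = re.compile(r"([A-Z][\w ]*?) plays the (\w+)")


def extract_plays_facts(sentences):
    """Return (subject, "plays", object) triples found in the sentences."""
    facts = set()
    for sentence in sentences:
        for subject, obj in PLAYS_PATTERN.findall(sentence):
            facts.add((subject, "plays", obj.capitalize()))
    return facts


web_text = ["Mozart plays the piano.", "Water polo is a sport."]
knowledge = seed_facts | extract_plays_facts(web_text)
print(sorted(knowledge))
```

Only the first sentence matches the pattern, so the knowledge base grows by one fact: Mozart. plays. Piano.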
Some facts might appear hundreds or thousands of times online, like Lenny Kravitz. hasChild.
Zoë Kravitz. But NELL might also find facts that are mentioned SOMEWHERE online and extract
them as potentially true. Like, for example, Darth Vader. plays. Kloo Horn. We just don't
know!
Just like how we look for multiple sources when writing a paper, NELL uses repetition
and multiple sources to build confidence that the facts it's finding are actually true.
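Building confidence from repetition can be approximated by counting how many distinct sources state each fact. The sketch below assumes a fixed threshold and invented source names; the `confident_facts` function and `min_sources` parameter are hypothetical, and NELL's real confidence scoring is more involved.

```python
from collections import defaultdict


def confident_facts(observations, min_sources=2):
    """Keep facts seen on at least min_sources distinct pages.

    observations is an iterable of (fact, source) pairs. The threshold
    is an arbitrary assumption for this sketch.
    """
    sources = defaultdict(set)
    for fact, source in observations:
        sources[fact].add(source)
    return {
        fact: len(pages)
        for fact, pages in sources.items()
        if len(pages) >= min_sources
    }


kravitz = ("Lenny Kravitz", "hasChild", "Zoë Kravitz")
vader = ("Darth Vader", "plays", "Kloo Horn")
observations = [
    (kravitz, "site-a"),
    (kravitz, "site-b"),
    (kravitz, "site-c"),
    (kravitz, "site-a"),  # repeated on the same page, so not a new source
    (vader, "site-d"),
]
print(confident_facts(observations))
```

Counting distinct sources rather than raw mentions keeps a single page that repeats itself from inflating a fact's support.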
To consider other relationships, NELL uses the highly confident facts it learned and
searches through the Web again. Only this time, NELL is looking for new relationships.
Maybe it finds the text "Darth Vader cuts off Luke Skywalker's hand," and NELL learns
a new (very specific) relationship: cutsOffHand.
Over and over again, NELL will use known relationships to find new objects, and known objects to
find new relationships -- creating a huge knowledge base.
Thanks, Thought Bubble! AI systems can use huge knowledge bases, like this one extracted
by NELL, to answer our questions directly.
Instead of using the words from our questions to search through an inverted index, an AI
like Siri can reformulate our questions into incomplete facts and then look for matches
in a knowledge base.
Hey John Green Bot...
John Green Bot: Yes, Jabril?
Jabril: "Who wrote The Bluest Eye?"
His AI could then reformulate that question into an incomplete fact, replacing "who"
with a question mark. If John-Green-bot extracted that information earlier, he can find matches
in his knowledge base and return the most confident result.
John-Green-bot: Toni Morrison wrote The Bluest Eye!
Jabril: Hey. Thanks, John-Green-bot!
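Matching an incomplete fact against a knowledge base can be sketched as filling in one blank slot of a triple. Everything here is illustrative: the `KNOWLEDGE_BASE` entries and their confidence values are made up (including one deliberately wrong, low-confidence extraction), and `answer` is a hypothetical helper that uses `None` in place of the question mark.

```python
# (subject, relation, object, confidence); values are invented for this sketch.
KNOWLEDGE_BASE = [
    ("Toni Morrison", "wrote", "The Bluest Eye", 0.97),
    ("Toni Morrison", "wrote", "Beloved", 0.99),
    ("John Green Bot", "wrote", "The Bluest Eye", 0.02),  # deliberately wrong
]


def answer(incomplete_fact, kb):
    """Fill the single None slot of (subject, relation, object) from the KB.

    Returns the value from the most confident matching fact, or None
    when nothing in the knowledge base matches.
    """
    matches = [
        fact for fact in kb
        if all(want is None or want == have
               for want, have in zip(incomplete_fact, fact[:3]))
    ]
    if not matches:
        return None
    best = max(matches, key=lambda fact: fact[3])
    slot = incomplete_fact.index(None)  # position of the blank to fill
    return best[slot]


# "Who wrote The Bluest Eye?" becomes (?, wrote, The Bluest Eye).
print(answer((None, "wrote", "The Bluest Eye"), KNOWLEDGE_BASE))
```

Returning the most confident match is what lets the correct author win over the low-confidence extraction.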
Different words are categorized differently, so an AI like John-Green-bot can tell the
difference between questions asking "who" and "when" and "where." But that gets
more complicated, so we're not going to dive into the details here. If you want to
learn more, you can read about part of speech tagging systems.
Using all these strategies, search engines have become really good at answering common
questions. But questions like "How many trees are in Ohio?" or "How many hotdogs
are eaten in the South Sandwich Islands annually?" still stump most AI systems, because not enough
people ask them and AI hasn't learned how to answer them well yet.
It's also important to watch out for search engine answers to questions like "Who invented
the time machine?" because AI systems have a tough time with nuance and incomplete data.
Sorry Doc Brown.
And a big, sort of hidden, problem is that search engine AI systems are influenced by
any biases in data online. For example, if I ask Google for images of "nurses," it
will mostly show pictures of female nurses. So next time, we'll talk about how an algorithm
can be biased, where bias comes from, and what we can do to address bias in AI. I'll
see ya then.
Crash Course AI is produced in association with PBS Digital Studios! If you want to help
keep all Crash Course free for everybody, forever, you can join our community on Patreon.
And if you want to learn more about the history of the World Wide Web, check out this episode
of Crash Course Computer Science.