Web Search: Crash Course AI #17

CrashCourse
6 Dec 2019 · 11:15

Summary

TLDR: Crash Course AI explores how search engines like Google operate using AI to find answers. It explains the process from crawling the web with web crawlers to organizing data with inverted indexes. The video also touches on how user behavior influences search rankings and the use of knowledge bases for direct answers. It highlights the challenges AI faces with nuanced questions and biases in data.

Takeaways

  • 🔍 **Search Engines as AI Systems**: Modern search engines like Google use AI to help users find information by gathering and organizing data from the World Wide Web.
  • 📚 **From Libraries to Web Crawlers**: The concept of search engines dates back centuries, evolving from physical libraries to digital web crawlers that systematically download web pages.
  • 🌐 **The Web and the Internet**: The script clarifies the difference between the Internet (a network of computers) and the Web (part of the Internet that uses browsers to display documents).
  • 🕷️ **Web Crawlers**: Web crawlers are programs that start from a 'seed' page and recursively download linked pages, forming the basis of search engine databases.
  • 📈 **Inverted Index**: Search engines use an inverted index to organize web pages by words, allowing for quick searches when users enter queries.
  • 🔑 **Query Processing**: When a user submits a query, the AI uses the inverted index to find relevant web pages that contain the search terms.
  • 🏆 **Ranking Results**: Search engines rank web pages to ensure the most relevant results appear first, using user behavior data like bounces and click-throughs to refine rankings.
  • 🧠 **Knowledge Bases**: For direct answers, AI systems use knowledge bases that encode information as relationships between objects, unlike the inverted indexes used to return links to web pages.
  • 🤖 **NELL - Never Ending Language Learner**: An example of a knowledge base is NELL, which autonomously extracts facts from web pages and uses repetition and multiple sources to validate information.
  • 🌳 **Bias in AI**: The script highlights that AI systems can inherit biases present in the data they learn from, affecting the neutrality of search results.
  • ❓ **Limitations of AI**: Certain questions that are not commonly asked or have limited data available can stump AI systems, illustrating the ongoing challenges in training comprehensive AI models.

Q & A

  • What is the primary function of search engines?

    -Search engines primarily gather data, create organization systems to sort that data, and find results to a question.

  • How do search engines compare to traditional libraries in terms of data organization?

    -Search engines and traditional libraries both gather and organize data. Libraries use physical organization systems like shelving and cataloging, while search engines use digital systems like inverted indexes.

  • What is the role of a Web crawler in search engines?

    -A Web crawler is a computer program that systematically finds and downloads Web pages to gather data for the search engine AI to process.

  • Can you explain what an inverted index is in the context of search engines?

    -An inverted index is a lookup system used to organize Web pages. For each word, it lists all the Web pages that contain that word, usually represented by ID numbers instead of URLs.

  • How does the AI in search engines determine the relevance of search results?

    -The AI uses an inverted index to find relevant pages and then ranks them based on various factors to ensure the top results are more likely to be relevant.

  • What is the significance of user behavior in training search engine AI?

    -User behavior, such as bounces and click-throughs, provides training data for AI systems to learn how to rank search results and better answer user queries.

  • What is a knowledge base and how does it differ from an inverted index?

    -A knowledge base encodes information as relationships between objects. Unlike an inverted index, which is used for searching, a knowledge base is used to directly answer questions by matching incomplete facts.

  • What is the Never Ending Language Learner (NELL) and how does it work?

    -NELL is a huge knowledge base created by Carnegie Mellon University that extracts facts from Web pages. It starts with human-provided facts, identifies patterns, and learns new facts and relationships by searching the Web.

  • How does an AI system like Siri or John Green Bot answer direct questions?

    -AI systems like Siri reformulate questions into incomplete facts and then search a knowledge base for matches to provide direct answers.

  • Why do some questions stump AI systems?

    -Questions that stump AI systems are often those that not enough people ask, or for which the AI hasn't learned how to answer well yet due to lack of data or training.

  • What is the potential issue with biases in search engine AI systems?

    -Search engine AI systems can be influenced by biases in the data online, leading to skewed or incomplete results, such as predominantly showing images of female nurses when searching for 'nurses'.

Outlines

00:00

🔍 Introduction to Search Engines and AI

Jabril introduces the topic of search engines in AI, comparing the modern digital search to the traditional library system. He explains that search engines like Google, Bing, and others are AI systems that gather data, organize it, and provide answers to users' queries. The analogy of a library is used to illustrate how librarians organize books, and how search engines use AI to organize web data. The video also distinguishes between the Internet and the Web, explaining that the Web is a part of the Internet that uses browsers to display content. The process of web crawlers is introduced as the initial step in data gathering for search engines.

05:01

📚 How Search Engines Organize and Rank Information

This section delves into how search engines organize the vast amount of data they collect. It explains the concept of an inverted index, which is a system that lists all web pages containing a specific word, allowing search engines to quickly find relevant pages for a query. The paragraph also discusses the importance of ranking search results, ensuring that the most relevant results appear at the top. It highlights how user behavior, such as bounces and click-throughs, provides training data for AI systems to learn and improve search result rankings. The Never Ending Language Learner (NELL) is introduced as an example of an AI system that can extract facts from web pages to build a knowledge base.

10:04

🤖 AI's Use of Knowledge Bases and Challenges in Search

The final paragraph discusses the use of knowledge bases by AI systems to provide direct answers to users' questions, as opposed to just providing links to web pages. It explains how AI systems can reformulate questions into facts and search through knowledge bases for answers. The paragraph also touches on the limitations of AI in answering certain types of questions due to lack of data or the complexity of the question. It warns about the potential biases in search engine results due to the biases present in the data online, setting the stage for a future discussion on algorithmic bias.

Keywords

💡Search Engines

Search engines are AI systems that help users find information on the internet. They gather data, organize it, and return results for user queries. The video mentions Google, Bing, DuckDuckGo, and Ask Jeeves as examples. The script explains how search engines work by comparing them to libraries: just as librarians organize books and help users find information, search engines organize web pages and provide relevant results.

💡Web Crawler

A web crawler is a computer program that systematically browses the World Wide Web, downloading web pages and following links to other pages. The script uses the analogy of a web crawler starting from a 'seed' page and then downloading linked pages to explain how search engines collect data before processing queries.
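
The seed-and-follow-links process described above can be sketched in a few lines. This is a minimal illustration, not how any production crawler is built: the `TOY_WEB` dictionary and the `crawl` and `fetch_links` names are hypothetical, the "Web" is an in-memory dictionary rather than real pages, and breadth-first order is an assumed choice (the video does not say which order a crawler uses).

```python
from collections import deque

# Hypothetical toy "Web": each page ID maps to the page IDs it links to.
TOY_WEB = {
    "seed": ["a", "b"],
    "a": ["b", "c"],
    "b": ["seed"],
    "c": [],
}

def crawl(seed, fetch_links):
    """Visit pages breadth-first from the seed, following each link once."""
    visited = []            # pages in the order they were "downloaded"
    seen = {seed}           # pages already queued or visited
    queue = deque([seed])
    while queue:
        page = queue.popleft()
        visited.append(page)              # stand-in for downloading the page
        for link in fetch_links(page):
            if link not in seen:          # avoid re-crawling the same page
                seen.add(link)
                queue.append(link)
    return visited

order = crawl("seed", lambda page: TOY_WEB.get(page, []))
print(order)
```

The `seen` set is what keeps the crawler from looping forever when pages link back to each other, as `b` does to `seed` here.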

💡Inverted Index

An inverted index is a type of database index that stores a mapping from content, such as words or numbers, to its locations in a database file, or in a document or a set of documents. In the context of the video, inverted indexes are used by search engines to organize web pages based on the words they contain, allowing for efficient searching when users enter queries.
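
As a rough sketch of the idea, the snippet below builds an inverted index over three tiny pages and looks up a query. The page texts, the `PAGES` dictionary, and the function names are made up for illustration; only the structure (word mapped to a set of page ID numbers) follows the video's description.

```python
from collections import defaultdict

# Hypothetical page texts, keyed by ID number instead of URL.
PAGES = {
    0: "who is genghis khan the mongol",
    1: "the travels of marco polo mention genghis",
    2: "water polo rules",
}

def build_inverted_index(pages):
    """Map each word to the set of page IDs that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in text.lower().split():
            index[word].add(page_id)
    return index

def search(index, query):
    """Return sorted IDs of pages containing at least one query word."""
    hits = set()
    for word in query.lower().replace("?", "").split():
        hits |= index.get(word, set())
    return sorted(hits)

index = build_inverted_index(PAGES)
results = search(index, "Who is Genghis Khan?")
print(results)
```

Each lookup is a dictionary access per query word, so the cost depends on the length of the query and the size of the matching sets rather than on scanning every page.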

💡AI Training

AI training refers to the process of teaching AI systems to perform tasks by providing them with data and feedback. In the video, it is mentioned that user behavior, such as clicks and bounces, provides training data for search engines to learn how to rank search results better. This is a form of implicit training where the AI learns from user interactions rather than explicit programming.

💡Knowledge Base

A knowledge base is a collection of structured data that can be used to answer specific queries. The video contrasts knowledge bases with inverted indexes: AIs like Siri use a knowledge base to give direct answers to factual questions instead of links to web pages. The script explains how knowledge bases encode information as relationships between objects.
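
One simple way to picture this is to store each fact as a (subject, relation, object) triple and to treat a question as a triple with a blank in it. The sketch below assumes that representation; the `FACTS` set reuses examples from the video, while the `match` function and the use of `None` as the blank are illustrative choices, not the design of Siri or any real assistant.

```python
# Facts as (subject, relation, object) triples, using examples from the video.
FACTS = {
    ("Mozart", "musicGenre", "Classical"),
    ("Jimi Hendrix", "plays", "Guitar"),
    ("Darth Vader", "hasChild", "Luke Skywalker"),
    ("Toni Morrison", "wrote", "The Bluest Eye"),
}

def match(facts, pattern):
    """Return facts that fit the pattern; None marks the unknown slot."""
    return sorted(
        fact for fact in facts
        if all(p is None or p == f for p, f in zip(pattern, fact))
    )

# "Who wrote The Bluest Eye?" becomes the incomplete fact (?, wrote, The Bluest Eye).
answers = match(FACTS, (None, "wrote", "The Bluest Eye"))
print(answers[0][0])
```

The same function answers other question shapes by moving the blank, for example `("Darth Vader", "hasChild", None)`.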

💡Never Ending Language Learner (NELL)

NELL, or the Never Ending Language Learner, is a project mentioned in the video that aims to extract knowledge from the web and build a large-scale knowledge base. NELL starts with a small set of facts and iteratively learns new facts and relationships from web pages, using repetition and multiple sources to build confidence in the accuracy of the extracted information.
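
The sketch below is a heavily simplified illustration of the two ideas in this description: matching text against a known relationship pattern, and trusting a fact more when several sources repeat it. It is not NELL's actual algorithm. The snippets, the regular expression, the `min_sources` threshold, and the function name are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical snippets standing in for text found on different Web pages.
SNIPPETS = [
    "Mozart plays the piano.",
    "Everyone knows Mozart plays the piano.",
    "Darth Vader plays the kloo horn.",
]

# Pattern for the known "plays" relationship: "<Name> plays the <thing>".
PLAYS = re.compile(r"([A-Z][a-z]+(?: [A-Z][a-z]+)*) plays the ([a-z]+(?: [a-z]+)*)")

def extract_plays_facts(snippets, min_sources=2):
    """Split extracted facts into confident ones and tentative ones."""
    counts = Counter()
    for text in snippets:
        for subject, obj in PLAYS.findall(text):
            counts[(subject, "plays", obj)] += 1
    # A fact repeated in at least min_sources snippets is treated as confident.
    confident = {fact for fact, n in counts.items() if n >= min_sources}
    tentative = set(counts) - confident
    return confident, tentative

confident, tentative = extract_plays_facts(SNIPPETS)
print(confident, tentative)
```

Here the Mozart fact appears twice and is accepted, while the Darth Vader fact appears once and stays tentative, mirroring the "we just don't know" case in the transcript.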

💡Click Through

A click through is a user action where they select a search result and visit the linked page. The video explains how click throughs provide positive feedback to search engines, indicating that the result was relevant to the user's query. This helps AI systems learn which results are more likely to be useful.

💡Bounce

A bounce in the context of search engines occurs when a user clicks on a search result but quickly returns to the search results page. The video describes bounces as negative feedback, signaling to the AI that the result was not relevant or useful, thus helping the system to improve its ranking algorithms.
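
To make the feedback idea concrete, here is a toy ranking that orders pages by the fraction of visits that were click-throughs rather than bounces. This is only an assumed heuristic for illustration: the video does not describe the actual scoring used by any search engine, and the page names, counts, and the neutral 0.5 score for unseen pages are all invented.

```python
def rank_by_feedback(page_ids, feedback):
    """Order pages by the share of visits that were click-throughs."""
    def score(page_id):
        clicks, bounces = feedback.get(page_id, (0, 0))
        total = clicks + bounces
        # Pages with no feedback yet get a neutral score (an assumption).
        return clicks / total if total else 0.5
    return sorted(page_ids, key=score, reverse=True)

# Hypothetical (click_throughs, bounces) counts for the query "who is Genghis Khan".
feedback = {
    "wikipedia_genghis_khan": (90, 10),
    "wrath_of_khan": (5, 95),
}

ranked = rank_by_feedback(
    ["wrath_of_khan", "new_page", "wikipedia_genghis_khan"], feedback
)
print(ranked)
```

A real system would learn from many more signals, but the sketch shows how ordinary user behavior can serve as labeled training data without hired graders.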

💡Bias in AI

Bias in AI refers to the tendency of AI systems to reflect and perpetuate the biases present in the data they are trained on. The video touches on this issue by noting that search engines can be influenced by biases in online data, such as the overrepresentation of certain groups or ideas.

💡Part of Speech Tagging

Part of speech tagging is a computational linguistics task that assigns a part of speech (like noun, verb, adjective) to each word in a sentence. The video briefly mentions this concept in relation to AI's ability to understand the difference between questions that ask 'who,' 'when,' or 'where,' which is crucial for accurately processing and answering queries.

Highlights

Questions that once led to dinner-table shouting matches can now be settled with search engines that draw on vast amounts of human knowledge.

Google, Siri, and Alexa are examples of AI systems improving search capabilities.

IBM’s Watson demonstrated AI prowess by beating top Jeopardy players.

Search engines gather and organize data to find answers to questions.

Libraries serve as an early form of search engine with physical organization systems.

Web search engines use AI to look through data on the World Wide Web.

The Internet and the Web are not the same; the former is a network of computers, the latter is a part of it used for content delivery.

Web crawlers are used to systematically find and download web pages for search engines.

An inverted index is a lookup system used to organize web pages for search engines.

Search engines use inverted indexes to find relevant web pages based on search queries.

Search engines rank web pages to provide the most relevant results first.

User behavior, such as bounces and click-throughs, trains AI systems to improve search results.

AI systems use knowledge bases to provide direct answers rather than web page links.

The Never Ending Language Learner (NELL) is a knowledge base that extracts facts from web pages.

NELL uses repetition and multiple sources to confirm the accuracy of extracted facts.

AI can reformulate questions into facts and search knowledge bases for answers.

Search engines struggle with questions that are rarely asked or have incomplete data.

AI systems can have difficulty with nuance and are influenced by biases in online data.

Crash Course AI is produced in association with PBS Digital Studios; viewers can support it by joining its community on Patreon.

Transcripts

play00:00

Hi, I’m Jabril and welcome to Crash Course AI! There used to be a time when a group of

play00:04

friends at dinner could ask a question like “is a hot dog a sandwich?” and it would

play00:08

turn into a basic shouting match with lots of gesturing and hypothetical examples.

play00:13

But now, we have access to a LOT of human knowledge in the palm of our hands… so our

play00:17

friends can look up memes and dictionary definitions and pictures of sandwiches to prove that none

play00:23

of them have a connected bun like hot dogs (disappointed).

play00:29

Search engines are a huge part of modern life. They help us access information, find directions

play00:34

to places, shop, and participate in sandwich arguments.

play00:37

But how does Google find answers to questions? How are Siri and Alexa so smart but also easily

play00:44

stumped? How did IBM’s Watson beat the best Jeopardy players in the world?

play00:48

Well, search engines are just AI systems that are getting better and better at helping us

play00:52

find what we’re looking for.

play00:54

INTRO

play01:03

When we talk about search engines, we typically think about the AI systems online, like Google,

play01:08

Bing, DuckDuckGo and Ask Jeeves.

play01:11

But the basic ideas behind non-AI search engines have existed for centuries. Essentially, search

play01:17

engines gather data, create organization systems to sort that data, and find results to a question.

play01:23

For example, when you needed an answer to a question and couldn’t search online, you

play01:27

could go to the library! Libraries gather data in the form of books and newspapers that

play01:31

are stacked neatly on the shelves.

play01:33

Librarians have organization systems to help you find what you’re looking for. Knowing

play01:38

that magazines are on shelves by the water fountain, while kids books are on the second

play01:43

floor is a kind of organization. Plus, fiction books are sorted by the author’s last name,

play01:48

while nonfiction has the Dewey Decimal System, and so on.

play01:52

Once you (or the librarian) have the resources you need, you’ll be able to find results

play01:57

to your question!

play01:58

Now, rather than looking through books, web search engines look through all the data on

play02:03

the World Wide Web, aka “the Web”. And instead of asking a human librarian where

play02:07

to find information, we ask an AI like John-Green-bot instead.

play02:11

Jabril: Oh John Green Bot?

play02:17

[JGB dialup beeps]

play02:25

Alright John Green Bot you're all set.

play02:31

We're going to need that later.

play02:33

And just so we’re clear, we’re using “Web” throughout this video even though it might

play02:37

sound a little old-fashioned. That’s because the Internet and the Web are not the same thing.

play02:41

The Internet is a collection of computers that send messages to each other. Video services

play02:46

like Netflix that play on your TV, for example, use the Internet, not the Web.

play02:50

The Web, on the other hand, is part of the Internet and uses the Internet’s connections

play02:55

to send documents and other content in a format that can be displayed by a browser like Chrome

play03:01

or Safari.

play03:02

As with most AI systems, the first step is to gather lots of data. To gather data on

play03:07

the Web, we can use a computer program called a Web crawler, which systematically finds

play03:11

and downloads Web pages. This is a HUGE task and happens before the search engine AI can

play03:17

take any questions.

play03:18

It starts on some Web page that we pick, called a seed, and downloads that page and finds

play03:23

all its links. Then, the crawler downloads each of the linked Web pages and finds their

play03:28

links, and so on... until we’ve crawled the whole Web.

play03:31

After we have collected all the data, the AI’s next step is to organize it by building

play03:36

an index, which is a kind of lookup system. The kind that’s used for organizing Web

play03:41

pages is called an inverted index, which is like the index in the back of a textbook.

play03:47

For each word, it lists all of the Web pages that contain that word. Usually, the Web pages

play03:52

are represented by I.D. numbers so we don’t have a long, messy list of URLs.

play03:56

Let’s say 0 is the seed - which happens to be a page about Genghis Khan. It has a

play04:01

lot of words on it like “the, mongol, Khan, Genghis, who, and is”. In this inverted

play04:07

index, page 1 is about Marco Polo, but it mentions the word “Genghis” along with

play04:11

words like “the, Marco, Polo, who, are, and is.” Page 2 is about the Mongols, page

play04:18

3 is a different webpage about Marco Polo, and page 4 is about Water Polo.

play04:22

So, let’s say we type “Who is Genghis Khan?” into a search engine.

play04:26

Our AI can use this inverted index to find results, which in this case, are links to

play04:31

Web pages. The AI will look at the words “who”, “is”, “Genghis”, and “Khan” and

play04:36

use the inverted index to find relevant pages.

play04:39

Our AI might find that Web pages zero, one, two and five have at least one of the words

play04:44

from the question “who is Genghis Khan?” When Siri says “I found this for you,”

play04:50

the AI is just returning a list of Web pages that contain the same terms as the question.

play04:55

Except… most search engines include one more step. There are millions of pages online

play05:00

that contain the same terms. So it’s important for search engines to rank Web pages, so

play05:06

that the top result is more likely to be relevant than the tenth result or the hundredth.

play05:10

Of course, Google and Bing don’t hire “supervisors” to grade each possible question and answer

play05:15

to help their AI systems learn from training data. That would take forever, and they wouldn’t

play05:20

be able to keep up with all the new content that gets created every day.

play05:24

Really, regular users like us do this training for free all the time. Every time we use

play05:29

a search engine, our behavior tells the AI whether or not the results answered our question.

play05:34

For example, if we type in “who is Genghis Khan” into a search engine, and click on

play05:39

a Web page about Star Trek II: The Wrath of Khan, we might be disappointed to find Genghis

play05:44

Khan isn’t ANYWHERE in that movie. So we’ll bounce back to the search results, and try

play05:49

again until we find a page that answers our question.

play05:52

A bounce indicates a bad result. But if we click on a Wikipedia article about Genghis

play05:57

Khan and stay for a while reading, that’s a click through, which probably means that

play06:02

we found what we were looking for… so that indicates a good result.

play06:06

Human behavior like bounces and click throughs give AI systems the training data they need

play06:10

to learn how to rank search results and better answer our questions. Data from the Web

play06:16

and data from how we use the Web helps make better and better search engines.

play06:20

Now, sometimes we ask our smart devices questions and we want actual answers… not links to

play06:25

Web pages. When I say “OK Google, what’s the weather like in Indianapolis?” I don’t

play06:30

want to scroll through results.

play06:32

For this kind of problem, instead of using an inverted index, AIs rely on knowledge bases.

play06:38

Which you might remember from our video about Symbolic AI. A knowledge base encodes information

play06:43

about the universe as relationships between objects like "chocolate donut" and "John Green Bot wears polo".

play06:49

One of the main problems with knowledge bases is that it’s really hard to write down all

play06:53

of the facts in the universe, especially common sense things that humans take for granted

play06:58

but computers need to be told.

play07:00

Enter AI researcher Tom Mitchell and his team of scientists from Carnegie Mellon University.

play07:04

In 2010, they created a huge knowledge base called the Never Ending Language Learner or

play07:09

NELL, which was able to extract hundreds of thousands of facts from random Web pages.

play07:15

The way it works is really clever, so let’s go to the Thought Bubble to see how.

play07:20

NELL starts with some facts provided by a human, for example, the genre of music that

play07:25

Mozart plays is classical. Which was represented like this:

play07:29

Mozart. musicGenre. Classical.

play07:31

Similarly,

play07:32

Jimi Hendrix. plays. Guitar.

play07:34

And Darth Vader. hasChild. Luke Skywalker.

play07:37

Then, NELL gets to work and reads through each Web page one-by-one for words mentioned

play07:41

in those facts. Maybe it finds the text “Mozart plays the piano.”

play07:46

NELL doesn’t know much about these symbols, but this text matches the same pattern as

play07:50

one of the facts provided by a human, specifically, the “plays” relationship. So NELL learns

play07:56

a new object: Piano. And a new fact: Mozart. plays. Piano.

play08:01

By searching over the entire Web, NELL can learn lots of facts based on just the three

play08:05

original ones that humans gave it!

play08:08

Some facts might appear hundreds or thousands of times online, like Lenny Kravitz. hasChild.

play08:13

Zoë Kravitz. But NELL might also find facts that are mentioned SOMEWHERE online and extract

play08:18

them as potentially true. Like, for example, Darth Vader. plays. Kloo Horn. We just don’t

play08:24

know!

play08:25

Just like how we look for multiple sources when writing a paper, NELL uses repetition

play08:29

and multiple sources to build confidence that the facts it’s finding are actually true.

play08:34

To consider other relationships, NELL uses the highly confident facts it learned and

play08:38

searches through the Web again. Only this time, NELL is looking for new relationships.

play08:43

Maybe it finds the text “Darth Vader cuts off Luke Skywalker’s hand,” and NELL learns

play08:47

a new (very specific) relationship: cutsOffHand.

play08:51

Over and over again, NELL will use known relationships to find new objects, and known objects to

play08:56

find new relationships -- creating a huge knowledge base.

play08:59

Thanks, Thought Bubble! AI systems can use huge knowledge bases, like this one extracted

play09:04

by NELL, to answer our questions directly.

play09:07

Instead of using the words from our questions to search through an inverted index, an AI

play09:11

like Siri can reformulate our questions into incomplete facts and then look for matches

play09:16

in a knowledge base.

play09:17

Hey John Green Bot….

play09:22

John Green Bot: Yes, Jabril?

play09:24

Jabril: “Who wrote The Bluest Eye?”

play09:25

His AI could then reformulate that question into an incomplete fact, replacing “who”

play09:30

with a question mark. If John-Green-bot extracted that information earlier, he can find matches

play09:35

in his knowledge base and return the most confident result.

play09:38

John-Green-bot: Toni Morrison wrote The Bluest Eye!

play09:42

Jabril: Hey. Thanks, John-Green-bot!

play09:47

Different words are categorized differently, so an AI like John-Green-bot can tell the

play09:52

difference between questions asking “who” and “when” and “where.” But that gets

play09:56

more complicated, so we’re not going to dive into the details here. If you want to

play10:00

learn more, you can read about part of speech tagging systems.

play10:03

Using all these strategies, search engines have become really good at answering common

play10:07

questions. But questions like “How many trees are in Ohio?” or “How many hotdogs

play10:12

are eaten in the South Sandwich Islands annually?” still stump most AI systems, because not enough

play10:17

people ask them and AI hasn’t learned how to answer them well yet.

play10:21

It’s also important to watch out for search engine answers to questions like “Who invented

play10:26

the time machine?” because AI systems have a tough time with nuance and incomplete data.

play10:31

Sorry Doc Brown.

play10:32

And a big, sort of hidden, problem is that search engine AI systems are influenced by

play10:37

any biases in data online. For example, if I ask Google for images of “nurses,” it

play10:42

will mostly show pictures of female nurses. So next time, we’ll talk about how an algorithm

play10:47

can be biased, where bias comes from, and what we can do to address bias in AI. I’ll

play10:53

see ya then.

play10:54

Crash Course AI is produced in association with PBS Digital Studios! If you want to help

play10:59

keep all Crash Course free for everybody, forever, you can join our community on Patreon.

play11:03

And if you want to learn more about the history of the World Wide Web, check out this episode

play11:07

of Crash Course Computer Science.

Related Tags
AI Systems, Search Engines, Web Crawler, Inverted Index, Knowledge Base, Machine Learning, Human Behavior, Data Bias, NELL AI, Crash Course