Natural Language Processing: Crash Course Computer Science #36
Summary
TLDR: This video introduces Natural Language Processing (NLP), an interdisciplinary field at the intersection of computer science and linguistics that aims to let computers understand and generate human language. It first explains how natural languages differ from programming languages, emphasizing the complexity and diversity of natural language. It then shows how sentence structure can be analyzed with part-of-speech tagging, phrase structure rules, and parse trees, and how these techniques let computers process and respond to information. The video also traces the evolution of chatbots, from rule-based systems to modern machine-learning approaches, and their use in customer service. It further covers speech recognition, including early systems, algorithmic advances, and the role of deep neural networks in improving accuracy. Finally, it introduces speech synthesis, showing the progression from early mechanical devices to today's more natural, fluent voices, and predicts that voice technology may soon become as common a form of interaction as screens, keyboards, and other traditional input-output devices.
Takeaways
- 📚 Natural Language Processing (NLP) is an interdisciplinary field combining computer science and linguistics, aimed at enabling computers to understand language.
- 🤖 Programming languages differ from human natural languages, which have far larger vocabularies and more complex grammar.
- 🌐 An early, fundamental NLP problem was deconstructing sentences into smaller pieces that are easier for computers to process.
- 📝 Sentence structure can be understood by building parse trees, which help computers tag parts of speech and see how a sentence is constructed.
- 🔍 Voice search and command processing rely on parsing and generating language; by treating language almost like Lego, computers can handle natural language tasks.
- 🚫 Computers can fail on overly complex or ambiguous sentences, unable to parse them correctly or capture the speaker's intent.
- 📈 When generating natural language text, computers can use entity relationships stored in semantic networks to build informational sentences.
- 🤖 Early chatbots were rule-based; modern approaches use machine learning, trained on large volumes of real conversation data.
- 👥 Chatbots and dialog systems have come a long way in the past fifty years and can be quite convincing today.
- 🎤 Speech recognition, the field of extracting words from sound, has been researched for decades.
- 📊 Deep neural networks are currently the most accurate technique in speech recognition systems; they work by analyzing the frequency spectrum of sound waves.
- 🔊 Speech synthesis, the ability of computers to output speech, is the reverse of speech recognition: text is broken into phonemes and played back-to-back.
Q & A
How do computer vision and natural language processing (NLP) differ?
-Computer vision gives computers the ability to recognize and understand visual information, while NLP lets computers understand language. Computer vision deals with visual data such as images and video; NLP deals with text and speech.
What is a foundational problem in NLP?
-A foundational NLP problem is deconstructing sentences into smaller, more easily processed pieces. This typically involves part-of-speech tagging and building parse trees to help computers understand a sentence's structure and meaning.
What is part-of-speech tagging, and what role does it play in NLP?
-Part-of-speech tagging identifies the grammatical category of each word in a text (noun, verb, etc.). It helps computers understand each word's grammatical function in a sentence, enabling better parsing and comprehension.
How are phrase structure rules used to build parse trees?
-By applying phrase structure rules, the grammatical role of each sentence constituent can be identified and the constituents organized hierarchically into a parse tree. The parse tree not only tags each word with a likely part of speech, it also reveals how the sentence is constructed.
Why is natural language processing a challenge for computers?
-Human languages contain large, diverse vocabularies, words with multiple meanings, speakers with different accents, and all sorts of word play. People also make mistakes when writing and speaking, such as slurring words together, leaving out key details, and mispronouncing things, and computers must be able to cope with all of this complexity.
How did early chatbots work?
-Early chatbots were primarily rule-based: experts encoded hundreds of rules mapping what a user might say to how the program should reply. For example, ELIZA, created at MIT in the mid-1960s, used basic syntactic rules to identify content in written exchanges, which it would then turn around and ask the user about.
How have modern chatbots improved over early versions?
-Modern chatbots use machine-learning approaches trained on large volumes of real human conversations. This makes them more effective and convincing in areas such as customer service, and better at understanding and responding to user input.
Why do computers use semantic networks when generating natural language text?
-A semantic network stores data as a web of meaningful relationships between entities. These relationships provide all the ingredients needed to craft informational sentences, making text generation more effective and accurate.
How does speech recognition extract words from sound?
-Speech recognition analyzes the waveform of an audio signal and converts it into a spectrogram, revealing the frequency components that correspond to resonances of the vocal tract. By recognizing these resonance patterns, a computer can identify the phonemes that make up words and convert speech to text.
What is a language model, and what role does it play in speech recognition?
-A language model contains statistics about sequences of words, and it improves transcription accuracy. For example, if a speech recognizer is unsure between "happy" and "harpy", the language model picks "happy" as the more likely option, since "she was" is far more often followed by an adjective than a noun.
How does speech synthesis work?
-Speech synthesis breaks a text sentence down into its phonetic components and plays those sounds back-to-back through a computer speaker. Modern synthesis has become very advanced, producing voices that sound quite natural, though they still differ from human voices.
Why might voice technology soon become as common an interaction method as screens, keyboards, and other physical input-output devices?
-The spread of voice technology is creating a positive feedback loop: people use voice interaction more often, which gives companies like Google, Amazon, and Microsoft more data to train their systems. As accuracy improves, people use voice even more, which drives accuracy further. Many predict that voice technology will soon be as common as the other physical input-output devices we use today.
Outlines
📚 Introduction to Natural Language Processing (NLP)
This section introduces the basic concepts of Natural Language Processing (NLP), including how computers come to understand human language. It highlights the differences between programming languages and natural languages, and the complexity of the latter: large vocabularies, words with multiple meanings, diverse accents, and more. An early NLP problem was deconstructing sentences into smaller, easier-to-process pieces, which involves part-of-speech tagging and grammar rules. The section also covers building parse trees to understand sentence structure, NLP's application in voice search and command processing, and its limits when faced with complex or ambiguous language.
🤖 The Evolution of Chatbots and Speech Recognition
The second section traces the evolution of chatbots, from rule-based systems like ELIZA to modern machine-learning chatbots that learn from large volumes of real conversations. It mentions Google's Knowledge Graph, a database storing vast numbers of facts and relationships between entities. It also covers the history of speech recognition, from early digit recognizers to modern real-time systems using deep neural networks, explaining how sound waves are converted into spectrograms and how resonance patterns called formants distinguish vowels and words. Finally, it discusses speech synthesis, the process of converting text into spoken output, and how language models improve recognition accuracy.
🔊 The Future and Impact of Voice Technology
The final section looks ahead to the future of voice technology, predicting it will become as common as screens, keyboards, and other physical input-output devices. It highlights the spread of voice user interfaces in phones, cars, and homes, and how that ubiquity forms a positive feedback loop that keeps improving speech recognition. It also notes the improvement of synthesized computer voices and how close they are getting to human speech, points out why speech matters for robots and other devices that communicate with humans without physical keyboards, and previews next week's episode on robots.
Keywords
💡 Natural Language Processing (NLP)
💡 Parts of Speech
💡 Parse Tree
💡 Speech Recognition
💡 Speech Synthesis
💡 Machine Learning
💡 Chatbots
💡 Deep Neural Networks
💡 Spectrogram
💡 Knowledge Graph
💡 Language Model
Highlights
Computer vision gives computers the ability to see and understand visual information.
This episode discusses how to give computers the ability to understand language, a desire that has existed since computers were first conceived.
Natural Language Processing (NLP) is an interdisciplinary field combining computer science and linguistics.
Deconstructing sentences into smaller pieces for easier processing was one of NLP's early, fundamental problems.
Phrase structure rules were developed to help computers understand a language's grammar.
Using phrase structure rules, parse trees can be constructed that reveal sentence structure.
By treating language almost like Lego, computers can answer questions and execute commands.
Computers can fail on overly complex language, unable to parse sentences correctly or capture intent.
Computers use semantic networks to generate natural language text, especially when data is linked by meaningful relationships.
Early chatbots were rule-based, later evolving into modern machine-learning approaches.
Chatbots and more advanced dialog systems have come a long way in the last fifty years.
Speech recognition, the field of getting words from sound, has been researched for decades.
The most accurate speech recognition systems today use deep neural networks.
Spectrograms let computers identify the different frequency components in sound waves.
Speech recognition software uses pattern matching to identify phonemes, the sound pieces that make up words.
Language models, which contain statistics about word sequences, improve transcription accuracy.
Speech synthesis, the process of having computers output speech, is the reverse of speech recognition.
The spread of voice user interfaces is creating a positive feedback loop that improves voice interaction accuracy.
Many predict that speech technologies will become as common a form of interaction as screens, keyboards, trackpads, and other physical input-output devices.
Transcripts
Hi, I’m Carrie Anne, and welcome to Crash Course Computer Science!
Last episode we talked about computer vision – giving computers the ability to see and
understand visual information.
Today we’re going to talk about how to give computers the ability to understand language.
You might argue they’ve always had this capability.
Back in Episodes 9 and 12, we talked about machine language instructions, as well as
higher-level programming languages.
While these certainly meet the definition of a language, they also tend to have small
vocabularies and follow highly structured conventions.
Code will only compile and run if it’s 100 percent free of spelling and syntactic errors.
Of course, this is quite different from human languages – what are called natural languages
– containing large, diverse vocabularies, words with several different meanings, speakers
with different accents, and all sorts of interesting word play.
People also make linguistic faux pas when writing and speaking, like slurring words
together, leaving out key details so things are ambiguous, and mispronouncing things.
But, for the most part, humans can roll right through these challenges.
The skillful use of language is a major part of what makes us human.
And for this reason, the desire for computers to understand and speak our language has been
around since they were first conceived.
This led to the creation of Natural Language Processing, or NLP, an interdisciplinary field
combining computer science and linguistics.
INTRO
There’s an essentially infinite number of ways to arrange words in a sentence.
We can’t give computers a dictionary of all possible sentences to help them understand
what humans are blabbing on about.
So an early and fundamental NLP problem was deconstructing sentences into bite-sized pieces,
which could be more easily processed.
In school, you learned about nine fundamental types of English words: nouns, pronouns, articles,
verbs, adjectives, adverbs, prepositions, conjunctions, and interjections.
These are called parts of speech.
There are all sorts of subcategories too, like singular vs. plural nouns and superlative
vs. comparative adverbs, but we’re not going to get into that.
Knowing a word’s type is definitely useful, but unfortunately, there are a lot of words that
have multiple meanings – like “rose” and “leaves”, which can be used as nouns
or verbs.
A digital dictionary alone isn’t enough to resolve this ambiguity, so computers also
need to know some grammar.
For this, phrase structure rules were developed, which encapsulate the grammar of a language.
For example, in English there’s a rule that says a sentence can be composed of a noun
phrase followed by a verb phrase.
Noun phrases can be an article, like “the”, followed by a noun or they can be an adjective
followed by a noun.
And you can make rules like this for an entire language.
Then, using these rules, it’s fairly easy to construct what’s called a parse tree,
which not only tags every word with a likely part of speech, but also reveals how the sentence
is constructed.
We now know, for example, that the noun focus of this sentence is “the mongols”, and
we know it’s about them doing the action of “rising” from something, in this case,
“leaves”.
These smaller chunks of data allow computers to more easily access, process and respond
to information.
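The parse-tree idea above can be sketched in a few lines of code. The grammar and lexicon below are a tiny hand-built illustration for the episode's "the mongols rose from the leaves" example, not the implementation any real parser uses:

```python
# A toy phrase-structure parser: S -> NP VP, NP -> Det N | N,
# VP -> V PP | V, PP -> P NP. Lexicon and grammar are illustrative only.

LEXICON = {
    "the": "Det", "mongols": "N", "leaves": "N",
    "rose": "V", "from": "P",
}

def parse_np(words, i):
    """NP -> Det N | N. Returns (tree, next_index) or None."""
    if i < len(words) and LEXICON.get(words[i]) == "Det":
        if i + 1 < len(words) and LEXICON.get(words[i + 1]) == "N":
            return ("NP", ("Det", words[i]), ("N", words[i + 1])), i + 2
    if i < len(words) and LEXICON.get(words[i]) == "N":
        return ("NP", ("N", words[i])), i + 1
    return None

def parse_pp(words, i):
    """PP -> P NP."""
    if i < len(words) and LEXICON.get(words[i]) == "P":
        np = parse_np(words, i + 1)
        if np:
            tree, j = np
            return ("PP", ("P", words[i]), tree), j
    return None

def parse_vp(words, i):
    """VP -> V PP | V."""
    if i < len(words) and LEXICON.get(words[i]) == "V":
        pp = parse_pp(words, i + 1)
        if pp:
            tree, j = pp
            return ("VP", ("V", words[i]), tree), j
        return ("VP", ("V", words[i])), i + 1
    return None

def parse_sentence(text):
    """S -> NP VP; succeeds only if every word is consumed."""
    words = text.lower().split()
    np = parse_np(words, 0)
    if np:
        np_tree, i = np
        vp = parse_vp(words, i)
        if vp and vp[1] == len(words):
            return ("S", np_tree, vp[0])
    return None  # parse failed -- the computer "didn't get it"

print(parse_sentence("the mongols rose from the leaves"))
```

Note that a sentence outside the grammar simply returns `None` — exactly the "I'm not sure I got that" failure mode the episode demonstrates.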
Equivalent processes are happening every time you do a voice search, like: “where’s
the nearest pizza”.
The computer can recognize that this is a “where” question, knows you want the noun
“pizza”, and the dimension you care about is “nearest”.
The same process applies to “what is the biggest giraffe?” or “who sang thriller?”
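The question decomposition just described can be sketched with a few hand-written rules. The word lists and slot names below are invented for illustration; real assistants use statistical parsers rather than keyword sets like this:

```python
# A toy sketch of decomposing a voice query into question type,
# target noun, and dimension. Rules are hypothetical, for illustration.
import re

QUESTION_WORDS = {"where", "what", "who"}
DIMENSIONS = {"nearest", "biggest", "smallest"}
STOP_WORDS = {"is", "the", "s", "a"}

def parse_query(query):
    tokens = re.findall(r"[a-z]+", query.lower())
    q_type = next((t for t in tokens if t in QUESTION_WORDS), None)
    dimension = next((t for t in tokens if t in DIMENSIONS), None)
    # assume the last remaining content word is the noun we care about
    nouns = [t for t in tokens
             if t not in QUESTION_WORDS | DIMENSIONS | STOP_WORDS]
    return {"type": q_type, "dimension": dimension,
            "noun": nouns[-1] if nouns else None}

print(parse_query("where's the nearest pizza"))
```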
By treating language almost like lego, computers can be quite adept at natural language tasks.
They can answer questions and also process commands, like “set an alarm for 2:20”
or “play T-Swizzle on spotify”.
But, as you’ve probably experienced, they fail when you start getting too fancy, and
they can no longer parse the sentence correctly, or capture your intent.
Hey Siri... methinks the mongols doth roam too much, what think ye on this most gentle
mid-summer’s day?
Siri: I’m not sure I got that.
I should also note that phrase structure rules, and similar methods that codify language,
can be used by computers to generate natural language text.
This works particularly well when data is stored in a web of semantic information, where
entities are linked to one another in meaningful relationships, providing all the ingredients
you need to craft informational sentences.
Siri: Thriller was released in 1983 and sung by Michael Jackson
Google’s version of this is called Knowledge Graph.
At the end of 2016, it contained roughly seventy billion facts about, and relationships between,
different entities.
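A semantic web of facts can be sketched as subject-relation-object triples, from which the informational sentence Siri speaks above can be assembled. This miniature is hand-built for illustration and is in no way how Knowledge Graph itself is stored:

```python
# A toy semantic network: facts as (subject, relation, object) triples,
# used to generate an informational sentence.

TRIPLES = [
    ("Thriller", "released_in", "1983"),
    ("Thriller", "sung_by", "Michael Jackson"),
]

def lookup(subject, relation):
    """Find the object linked to a subject by a given relation."""
    for s, r, o in TRIPLES:
        if s == subject and r == relation:
            return o
    return None

def describe_song(title):
    """Craft a sentence from the entity's relationships."""
    year = lookup(title, "released_in")
    singer = lookup(title, "sung_by")
    return f"{title} was released in {year} and sung by {singer}"

print(describe_song("Thriller"))
```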
These two processes, parsing and generating text, are fundamental components of natural
language chatbots - computer programs that chat with you.
Early chatbots were primarily rule-based, where experts would encode hundreds of rules
mapping what a user might say, to how a program should reply.
Obviously this was unwieldy to maintain and limited the possible sophistication.
A famous early example was ELIZA, created in the mid-1960s at MIT.
This was a chatbot that took on the role of a therapist, and used basic syntactic rules
to identify content in written exchanges, which it would turn around and ask the user
about.
Sometimes, it felt very much like human-human communication, but other times it would make
simple and even comical mistakes.
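The rule-based, reflect-it-back approach can be sketched with a few regular expressions. These particular rules are invented for illustration; the original ELIZA used a much larger hand-written script:

```python
# An ELIZA-flavored sketch: regex rules that turn a user's statement
# back around as a question. Rules are illustrative, not ELIZA's actual script.
import re

RULES = [
    (r"i am (.*)", "Why do you say you are {}?"),
    (r"i feel (.*)", "Why do you feel {}?"),
    (r"my (.*)", "Tell me more about your {}."),
]

def respond(utterance):
    text = utterance.lower().strip(".!? ")
    for pattern, template in RULES:
        match = re.fullmatch(pattern, text)
        if match:
            # reflect the captured content back at the user
            return template.format(match.group(1))
    return "Please, go on."  # generic fallback when no rule matches

print(respond("I am worried about my exams."))
```

The comical mistakes mentioned above fall out naturally: the rules match surface patterns, with no understanding of what was said.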
Chatbots, and more advanced dialog systems, have come a long way in the last fifty years,
and can be quite convincing today!
Modern approaches are based on machine learning, where gigabytes of real human-to-human chats
are used to train chatbots.
Today, the technology is finding use in customer service applications, where there’s already
heaps of example conversations to learn from.
People have also been getting chatbots to talk with one another, and in a Facebook experiment,
chatbots even started to evolve their own language.
This experiment got a bunch of scary-sounding press, but it was just the computers crafting
a simplified protocol to negotiate with one another.
It wasn’t evil, it was efficient.
But what about if something is spoken – how does a computer get words from the sound?
That’s the domain of speech recognition, which has been the focus of research for many
decades.
Bell Labs debuted the first speech recognition system in 1952, nicknamed Audrey – the automatic
digit recognizer.
It could recognize all ten numerical digits, if you said them slowly enough.
5…
9…
7?
The project didn’t go anywhere because it was much faster to enter telephone numbers
with a finger.
Ten years later, at the 1962 World's Fair, IBM demonstrated a shoebox-sized machine capable
of recognizing sixteen words.
To boost research in the area, DARPA kicked off an ambitious five-year funding initiative
in 1971, which led to the development of Harpy at Carnegie Mellon University.
Harpy was the first system to recognize over a thousand words.
But, on computers of the era, transcription was often ten or more times slower than the
rate of natural speech.
Fortunately, thanks to huge advances in computing performance in the 1980s and 90s, continuous,
real-time speech recognition became practical.
There was simultaneous innovation in the algorithms for processing natural language, moving from
hand-crafted rules, to machine learning techniques that could learn automatically from existing
datasets of human language.
Today, the speech recognition systems with the best accuracy are using deep neural networks,
which we touched on in Episode 34.
To get a sense of how these techniques work, let’s look at some speech, specifically,
the acoustic signal.
Let’s start by looking at vowel sounds, like aaaaa…and Eeeeeee.
These are the waveforms of those two sounds, as captured by a computer’s microphone.
As we discussed in Episode 21 – on Files and File Formats – this signal is the magnitude
of displacement, of a diaphragm inside of a microphone, as sound waves cause it to oscillate.
In this view of sound data, the horizontal axis is time, and the vertical axis is the
magnitude of displacement, or amplitude.
Although we can see there are differences between the waveforms, it’s not super obvious
what you would point at to say, “ah ha! this is definitely an eeee sound”.
To really make this pop out, we need to view the data in a totally different way: a spectrogram.
In this view of the data, we still have time along the horizontal axis, but now instead
of amplitude on the vertical axis, we plot the magnitude of the different frequencies
that make up each sound.
The brighter the color, the louder that frequency component.
This conversion from waveform to frequencies is done with a very cool algorithm called
a Fast Fourier Transform.
If you’ve ever stared at a stereo system’s EQ visualizer, it’s pretty much the same
thing.
A spectrogram is plotting that information over time.
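The waveform-to-spectrogram step can be sketched with NumPy's FFT: slice the signal into short frames and take the magnitude of each frame's frequency content. The frame size and sample rate below are illustrative choices, not values from any production recognizer:

```python
# A minimal spectrogram: one row of frequency magnitudes per time frame.
import numpy as np

def spectrogram(signal, frame_size=256):
    frames = [signal[i:i + frame_size]
              for i in range(0, len(signal) - frame_size + 1, frame_size)]
    # rfft gives the magnitude of each frequency component in the frame
    return np.array([np.abs(np.fft.rfft(f)) for f in frames])

# A synthetic stand-in for a vowel: a 440 Hz tone sampled at 8 kHz.
rate = 8000
t = np.arange(rate) / rate
tone = np.sin(2 * np.pi * 440 * t)

spec = spectrogram(tone)
# the brightest bin in the first frame should sit near 440 Hz
peak_bin = int(np.argmax(spec[0]))
peak_hz = peak_bin * rate / 256
print(peak_hz)
```

With real speech instead of a pure tone, the bright bands in each row would be the vocal-tract resonances described next.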
You might have noticed that the signals have a sort of ribbed pattern to them – that’s
all the resonances of my vocal tract.
To make different sounds, I squeeze my vocal cords, mouth and tongue into different shapes,
which amplifies or dampens different resonances.
We can see this in the signal, with areas that are brighter, and areas that are darker.
If we work our way up from the bottom, labeling where we see peaks in the spectrum – what
are called formants – we can see the two sounds have quite different arrangements.
And this is true for all vowel sounds.
It’s exactly this type of information that lets computers recognize spoken vowels, and
indeed, whole words.
Let’s see a more complicated example, like when I say: “she.. was.. happy”
We can see our “eee” sound here, and “aaa” sound here.
We can also see a bunch of other distinctive sounds, like the “shh” sound in “she”,
the “wah” and “sss” in “was”, and so on.
These sound pieces, that make up words, are called phonemes.
Speech recognition software knows what all these phonemes look like.
In English, there are roughly forty-four, so it mostly boils down to fancy pattern matching.
Then you have to separate words from one another, figure out when sentences begin and end...
and ultimately, you end up with speech converted into text, allowing for techniques like we
discussed at the beginning of the episode.
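"Fancy pattern matching" can be sketched at its very simplest: compare a sound's feature vector against stored phoneme templates and pick the nearest one. The two-number features below loosely mimic the first two formants, but the specific values and the tiny template set are assumptions made for the example:

```python
# Nearest-template phoneme matching. Feature values are illustrative
# formant-like numbers (f1, f2) in Hz, not a real acoustic model.
import math

TEMPLATES = {
    "aa": (730, 1090),   # as in "father"
    "ee": (270, 2290),   # as in "she"
    "sh": (2500, 4500),  # fricative energy sits much higher
}

def classify(features):
    """Return the phoneme whose template is closest to the features."""
    return min(TEMPLATES, key=lambda p: math.dist(TEMPLATES[p], features))

print(classify((300, 2200)))  # closest to the "ee" template
```

A real recognizer matches sequences of such frames against models of all ~44 English phonemes, then stitches phonemes into words.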
Because people say words in slightly different ways, due to things like accents and mispronunciations,
transcription accuracy is greatly improved when combined with a language model, which
contains statistics about sequences of words.
For example “she was” is most likely to be followed by an adjective, like “happy”.
It’s uncommon for “she was” to be followed immediately by a noun.
So if the speech recognizer was unsure between, “happy” and “harpy”, it’d pick “happy”,
since the language model would report that as a more likely choice.
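The "happy" vs. "harpy" tie-break can be sketched with a bigram count over a corpus. The tiny corpus here is invented purely so the counts come out the right way:

```python
# A sketch of a bigram language model: count word pairs in a corpus,
# then score candidates by how often they follow the preceding word.
from collections import Counter

corpus = ("she was happy . he was happy . she was tired . "
          "the harpy flew away .").split()

# count every adjacent word pair
bigrams = Counter(zip(corpus, corpus[1:]))

def pick(previous_word, candidates):
    """Choose the candidate the language model reports as more likely."""
    return max(candidates, key=lambda w: bigrams[(previous_word, w)])

print(pick("was", ["happy", "harpy"]))
```

In the toy corpus, "was happy" occurs twice and "was harpy" never, so the recognizer's uncertainty resolves to "happy".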
Finally, we need to talk about Speech Synthesis, that is, giving computers the ability to output
speech.
This is very much like speech recognition, but in reverse.
We can take a sentence of text, and break it down into its phonetic components, and
then play those sounds back to back, out of a computer speaker.
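The phoneme-chaining step can be sketched by concatenating short waveforms, one per phoneme. The tones below merely stand in for recorded phoneme snippets; a real synthesizer uses recorded or modeled speech units:

```python
# Concatenative synthesis in miniature: look up a waveform per phoneme
# and chain them end to end, like early synthesizers did.
import numpy as np

RATE = 8000  # samples per second

def tone(freq, seconds=0.1):
    """A synthetic stand-in for a recorded phoneme snippet."""
    t = np.arange(int(RATE * seconds)) / RATE
    return np.sin(2 * np.pi * freq * t)

# hypothetical phoneme-to-sound table
PHONEME_SOUNDS = {"sh": tone(2500), "ee": tone(270)}

def synthesize(phonemes):
    """Play phoneme waveforms back-to-back as one audio buffer."""
    return np.concatenate([PHONEME_SOUNDS[p] for p in phonemes])

audio = synthesize(["sh", "ee"])  # "she"
print(len(audio) / RATE)  # total duration in seconds
```

The abrupt joins between snippets are precisely what produces the discontinuous, robotic sound of the older systems heard next.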
You can hear this chaining of phonemes very clearly with older speech synthesis technologies,
like this 1937, hand-operated machine from Bell Labs.
Say, "she saw me" with no expression.
She saw me.
Now say it in answer to these questions.
Who saw you?
She saw me.
Who did she see?
She saw me.
Did she see you or hear you?
She saw me.
By the 1980s, this had improved a lot, but that discontinuous and awkward blending of
phonemes still created that signature, robotic sound.
Thriller was released in 1983 and sung by Michael Jackson.
Today, synthesized computer voices, like Siri, Cortana and Alexa, have gotten much better,
but they’re still not quite human.
But we’re soo soo close, and it’s likely to be a solved problem pretty soon.
Especially because we’re now seeing an explosion of voice user interfaces on our phones, in
our cars and homes, and maybe soon, plugged right into our ears.
This ubiquity is creating a positive feedback loop, where people are using voice interaction
more often, which in turn, is giving companies like Google, Amazon and Microsoft more data
to train their systems on...
Which is enabling better accuracy, which is leading to people using voice more, which
is enabling even better accuracy… and the loop continues!
Many predict that speech technologies will become as common a form of interaction as
screens, keyboards, trackpads and other physical input-output devices that we use today.
That’s particularly good news for robots, who don’t want to have to walk around with
keyboards in order to communicate with humans.
But, we’ll talk more about them next week.
See you then.