Voice Forensics: Rita Singh | 2019 Wharton People Analytics Conference

Wharton School
15 May 2019, 16:22

Summary

TLDR: This talk delves into voice profiling, where the speaker demonstrates the potential to deduce a person's physical and psychological traits from their voice. Using artificial intelligence, the technology analyzes minute voice features to predict age, health, and environment, and even to reconstruct a person's physical appearance. The presentation showcases the technology's current capabilities and its future potential to enhance machine understanding of individuals.

Takeaways

  • 🕵️‍♂️ Profiling humans from their voice involves analyzing various aspects of a person's voice to deduce information about their identity, background, and even environment.
  • 📞 The script starts with a dramatic example of a bomb threat call to illustrate how voice profiling can be used in law enforcement to identify criminals.
  • 🔍 Voice profiling can reveal details such as a person's ethnicity, height, weight, age, and even their state of health or intoxication.
  • 🏠 The environment in which a person is speaking can leave traces in their voice, which can be analyzed to infer the room's materials and dimensions.
  • 🗣️ The human voice is a complex signal that carries a wealth of information, including physiological parameters, personality traits, and emotional states.
  • 🎶 Voice profiling uses artificial intelligence to extract micro-features from the voice that are not easily discernible by the human ear.
  • 📉 The script explains the voice production process, highlighting how the vocal tract acts as a resonance chamber and how its shape affects the sound produced.
  • 🎛️ Different aspects of the voice, such as pitch, harmonics, and resonance, can be visualized using spectrograms, which show the frequency content over time.
  • 👵 Aging and certain medical conditions like Parkinson's can affect the voice and leave detectable patterns that can be identified through voice analysis.
  • 🎨 The script demonstrates the potential of voice profiling to reconstruct physical features like the face and even the skeletal structure from a person's voice.
  • 🚀 The technology behind voice profiling is advancing, with live demonstrations and applications like recreating historical voices, indicating a promising future for this field.

Q & A

  • What is the main subject of the talk?

    -The main subject of the talk is profiling humans from their voice, which includes identifying various characteristics and information about a person based on their voice patterns.

  • Why would someone record a threatening voice and take it to the police?

    -Someone would record a threatening voice and take it to the police to help in identifying the perpetrator of a crime, such as bomb threats, harassment, extortion, or hoax calls.

  • What does the speaker do to demonstrate the concept of voice profiling?

    -The speaker plays out a bomb threat recording and asks the audience to form an opinion about the speaker based on the voice. Afterward, the speaker reveals detailed information about the speaker's characteristics, which were deduced from the voice.

  • What are some of the personal characteristics that can be inferred from a person's voice?

    -Some personal characteristics that can be inferred from a person's voice include age, gender, ethnicity, height, weight, health status, background, personality, and even the environment they are in.

  • How does the human vocal tract function as a resonance chamber?

    -The human vocal tract functions as a resonance chamber by producing echoes and resonances when air passes through it. The shape and dimensions of the vocal tract affect the nature of these resonances, which change the sound produced.

  • What is a spectrogram, and how does it represent voice information?

    -A spectrogram is a visual representation of the spectrum of frequencies of a signal as it varies with time. It shows the frequency content of a voice signal, with time on the x-axis and frequency on the y-axis, color representing the energy in each frequency at a given time.

  • How does the voicing onset time differ between individuals and sounds?

    -Voicing onset time is the time it takes for the vocal folds to go from a state of rest to a state of motion. It differs between individuals due to the unique physical properties of their vocal tracts and also varies for the same person when producing different combinations of sounds.

  • What is an example of how voice analysis can reveal a person's environment?

    -Voice analysis can reveal a person's environment by identifying reverberation and smear in the voice signal caused by surrounding materials, such as glass, which can indicate the presence of a glass enclosure and its dimensions.

  • How can the structure of a person's skull and face be deduced from their voice?

    -The structure of a person's skull and face can be deduced from their voice because the skull and face are closely related to the vocal chambers. The contours of the face and the structure of the skull can affect the resonance and frequencies produced by the voice.

  • What was the outcome of the live demonstration of voice profiling technology in Tianjin in 2018?

    -In the live demonstration in Tianjin, about a thousand people tried out the technology, which included a virtual reality demo where participants' faces were recreated in 3D based on their voice, allowing them to pick up and examine their virtual faces.

  • What was the significance of recreating Rembrandt's voice from his facial portraits?

    -Recreating Rembrandt's voice from his facial portraits demonstrated the advanced capabilities of voice profiling technology, showing that it is possible to reverse-engineer a person's voice based on their physical features, even from historical figures.

  • What does the future hold for voice profiling technology according to the speaker?

    -The future of voice profiling technology holds the potential for machines to know individuals better, possibly even better than they know themselves, indicating a deeper integration of voice analysis in various aspects of life and technology.

Outlines

00:00

🕵️‍♂️ Profiling Humans from Voice Analysis

The speaker introduces the concept of profiling humans based on their voice, using a hypothetical scenario where an individual receives a threatening call and brings it to the police. The police, in turn, use voice profiling to deduce characteristics about the caller, such as age, physical attributes, and environment. The speaker demonstrates this by playing a recording of a bomb threat and then revealing specific details about the speaker, including his physical description, mental state, and the environment he is in. The talk emphasizes that humans naturally make judgments about others from their voices, and this process can be enhanced with artificial intelligence to extract more detailed information.

05:01

🎙️ Understanding Voice Production

This paragraph delves into the science behind voice production, comparing the human vocal tract to a resonance chamber that creates echoes and resonances. The speaker explains how the vocal cords vibrate to produce sound, which is then shaped by the movement of articulators such as the tongue, lips, and jaw. The resulting sound wave is a complex pattern of frequencies over time, visualized through a spectrogram. The speaker uses examples like a Ferrari's acceleration to illustrate the uniqueness of each person's voice, highlighting the differences in voicing onset time, which is a distinctive feature for each individual.

10:03

🎼 The Complexity of Voice in Time and Frequency

The speaker continues to explore the intricacies of voice, focusing on the micro features present in both time and frequency. Using a song and a spectrogram, the speaker shows the complexity of the human voice and attempts to recreate a snippet of it, emphasizing the difficulty of replicating the unique details. The paragraph also includes examples of voices from different individuals, including a singer with a distinctively sweet voice, to illustrate the vast differences in vocal characteristics. The speaker mentions the use of artificial intelligence to discover and extract these micro features, which are key to understanding the identity and uniqueness of a person's voice.

15:04

🏛️ Applications and Future of Voice Profiling

In this final paragraph, the speaker discusses the practical applications of voice profiling, such as recreating faces in 3D from voice recordings and even reviving the voice of historical figures like Rembrandt from his facial portraits. The speaker also touches on the potential future of this technology, suggesting that it could enable machines to understand individuals better than they understand themselves. The talk concludes with a mention of a forthcoming book by the speaker, which will elaborate on the technology of voice profiling in detail.

Keywords

💡Profiling

Profiling, in the context of the video, refers to the process of analyzing and making inferences about an individual based on specific characteristics or behaviors. It is central to the video's theme as it discusses how human voices can be analyzed to deduce personal information. The script uses the example of a threatening voice call to illustrate how profiling can be applied in real-world scenarios, such as identifying a speaker's demographic and environmental details.

💡Voice Analysis

Voice analysis is the examination of the human voice to extract information about the speaker. It is a key concept in the video, demonstrating how voice characteristics can reveal insights into a person's identity, emotional state, and even physical environment. The script mentions the use of artificial intelligence in voice analysis to predict and deduce information from the voice, such as the speaker's age, health, and background.

💡Artificial Intelligence (AI)

AI, as discussed in the video, is the simulation of human intelligence in machines that are programmed to think like humans and mimic their actions. It is integral to the process of voice profiling, where AI is used to analyze the voice's micro-features and make predictions about the speaker. The script highlights the potential of AI to perform this task more accurately than human hearing, which is not as precise.

💡Vocal Tract

The vocal tract is the part of the vocal system that includes the throat, mouth, and nasal passages, which are used to produce speech sounds. In the video, the vocal tract's role in voice production is explained, showing how changes in its shape affect the sound produced. The script uses the vocal tract's function to explain how different sounds are created and how these can be analyzed for profiling.

💡Resonance

Resonance, in the context of the video, refers to the phenomenon where sound waves are amplified within a particular space or structure. The script describes the vocal chamber as a resonance chamber, where the sound produced by the vocal cords is amplified and modified, creating unique echoes or resonances that can be analyzed for profiling the speaker.
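
As a rough illustration of how the dimensions of a chamber set its resonances, the sketch below treats the vocal tract as a uniform tube closed at one end, the standard textbook simplification. It is not the speaker's method. The tube lengths and the 350 m/s speed of sound are assumed values chosen for illustration.

```python
def tube_resonances(length_m, count=3, speed_of_sound=350.0):
    """Resonances (Hz) of a uniform tube closed at one end, open at the other.

    Quarter-wavelength model: F_n = (2n - 1) * c / (4 * L).
    The default speed_of_sound is an assumed value for warm, moist air.
    """
    return [(2 * n - 1) * speed_of_sound / (4.0 * length_m)
            for n in range(1, count + 1)]


# Assumed lengths: about 17.5 cm for an adult tract, 12 cm for a shorter one.
adult = tube_resonances(0.175)
shorter = tube_resonances(0.12)

for name, values in (("17.5 cm tube", adult), ("12 cm tube", shorter)):
    print(name, [round(v) for v in values])
```

The 17.5 cm tube resonates near 500, 1500 and 2500 Hz, and shortening the tube raises every resonance. This is the same effect as changing the shape or dimensions of the building in the talk's analogy.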

💡Spectrogram

A spectrogram is a visual representation of the spectrum of frequencies of a sound signal as it varies with time. The video uses spectrograms to illustrate the frequency content of speech and how it can be used to analyze and identify unique characteristics of a voice. The script provides examples of spectrograms from different speakers, showing the distinct patterns that can be used in voice profiling.
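
To make the definition concrete, here is a minimal sketch of how a spectrogram can be computed: cut the signal into short overlapping frames, window each frame, and measure the energy in each frequency bin. The test signal, sample rate, frame length and function names are all assumptions made for this example. A real system would use an optimized FFT library rather than this plain-Python transform.

```python
import cmath
import math

SAMPLE_RATE = 8000  # Hz, assumed for this toy signal
FRAME_LEN = 64      # samples per frame, so each bin is 125 Hz wide
HOP = 32            # samples between the starts of successive frames


def spectrogram(signal, frame_len=FRAME_LEN, hop=HOP):
    """Return one list of per-bin energies for each frame (time x frequency)."""
    # A Hann window reduces leakage between neighbouring frequency bins.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        chunk = [signal[start + n] * window[n] for n in range(frame_len)]
        energies = []
        for k in range(frame_len // 2 + 1):
            # Direct discrete Fourier transform of the windowed frame at bin k.
            coeff = sum(chunk[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                        for n in range(frame_len))
            energies.append(abs(coeff) ** 2)
        frames.append(energies)
    return frames


def peak_frequency(energies, sample_rate=SAMPLE_RATE, frame_len=FRAME_LEN):
    """Frequency (Hz) of the bin that holds the most energy in one frame."""
    k = max(range(len(energies)), key=energies.__getitem__)
    return k * sample_rate / frame_len


# Toy signal: 0.1 s of a 500 Hz tone followed by 0.1 s of a 1500 Hz tone.
tone = ([math.sin(2 * math.pi * 500 * n / SAMPLE_RATE) for n in range(800)]
        + [math.sin(2 * math.pi * 1500 * n / SAMPLE_RATE) for n in range(800)])
spec = spectrogram(tone)
print(peak_frequency(spec[0]), peak_frequency(spec[-1]))
```

Each inner list is one column of the picture: time runs across frames, frequency runs across bins, and the stored value is the energy that a plot would show as color. In this toy signal the strongest bin moves from 500 Hz in the first frame to 1500 Hz in the last.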

💡Voicing Onset Time

Voicing onset time is the time it takes for the vocal folds to transition from a state of rest to a state of vibration when producing a sound. The video script explains this concept as a unique characteristic that varies between individuals and even within the same individual for different sounds, making it a valuable feature for voice profiling.
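
In phonetics this interval is usually measured from the release of a stop consonant to the start of vocal fold vibration. The sketch below is a deliberately crude illustration on a synthetic signal, not the speaker's technique. It marks the burst as the first sample above a small level, then looks for the first frame that is strongly periodic. The signal model, thresholds, pitch range and function names are all assumptions.

```python
import math
import random

SAMPLE_RATE = 8000  # Hz, assumed for this toy signal


def synth_stop_vowel(closure_s=0.05, vot_s=0.06, vowel_s=0.19, f0=120.0, seed=0):
    """Silence, then noise lasting vot_s seconds, then a periodic 'vowel'."""
    rng = random.Random(seed)
    silence = [0.0] * round(closure_s * SAMPLE_RATE)
    noise = [rng.uniform(-0.3, 0.3) for _ in range(round(vot_s * SAMPLE_RATE))]
    vowel = [math.sin(2 * math.pi * f0 * n / SAMPLE_RATE)
             + 0.5 * math.sin(2 * math.pi * 2 * f0 * n / SAMPLE_RATE)
             for n in range(round(vowel_s * SAMPLE_RATE))]
    return silence + noise + vowel


def periodicity(frame, min_lag, max_lag):
    """Largest normalised autocorrelation over the candidate pitch lags."""
    best = 0.0
    for lag in range(min_lag, max_lag + 1):
        a, b = frame[:-lag], frame[lag:]
        energy = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
        if energy > 0.0:
            best = max(best, sum(x * y for x, y in zip(a, b)) / energy)
    return best


def estimate_vot(signal, frame_len=240, hop=40, burst_level=0.01, voiced_level=0.6):
    """Seconds from the first audible sample to the first clearly periodic frame."""
    burst = next(i for i, x in enumerate(signal) if abs(x) > burst_level)
    min_lag = SAMPLE_RATE // 250  # shortest pitch period searched (250 Hz)
    max_lag = SAMPLE_RATE // 80   # longest pitch period searched (80 Hz)
    for start in range(burst, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        if periodicity(frame, min_lag, max_lag) >= voiced_level:
            # Report the frame centre as the moment voicing begins.
            return (start + frame_len // 2 - burst) / SAMPLE_RATE
    return None


short_vot = estimate_vot(synth_stop_vowel(vot_s=0.06))
long_vot = estimate_vot(synth_stop_vowel(vot_s=0.10))
print(short_vot, long_vot)
```

Because the detector works in 5 ms hops, its estimate is only accurate to within a frame or so of the value built into the synthetic signal. Measuring real speech would need a far more careful burst and voicing detector.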

💡Micro-features

Micro-features are the subtle and detailed aspects of a voice that can be analyzed to extract information about the speaker. The video emphasizes the use of AI to discover and extract these micro-features, which are crucial for accurate voice profiling. The script mentions that these features can reveal a person's identity, even when other aspects of their appearance change.

💡Reverberation

Reverberation is the persistence of sound after the original sound source has ceased, caused by the sound waves reflecting off surfaces in an environment. The video script discusses how the environment can leave signatures on a voice, such as reverberation, which can be analyzed to determine the materials and dimensions of the space the speaker is in.
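
The link between a room's materials and dimensions and its reverberation can be illustrated with Sabine's classical formula, RT60 = 0.161 V / A, where V is the room volume in cubic metres and A is the total absorption (surface area times absorption coefficient, summed over surfaces). The sketch below only computes this forward relation. The talk describes the inverse problem of inferring the room from the voice, which this does not attempt. The room size and absorption coefficients are illustrative assumptions, not measured values.

```python
def sabine_rt60(volume_m3, surfaces):
    """Sabine estimate of the time (s) for sound to decay by 60 dB.

    surfaces: iterable of (area_m2, absorption_coefficient) pairs.
    """
    absorption = sum(area * alpha for area, alpha in surfaces)
    return 0.161 * volume_m3 / absorption


def box_room(length, width, height, floor, ceiling, walls):
    """Volume and surface list for a rectangular room; the last three
    arguments are absorption coefficients between 0 and 1."""
    volume = length * width * height
    wall_area = 2 * height * (length + width)
    surfaces = [(length * width, floor),
                (length * width, ceiling),
                (wall_area, walls)]
    return volume, surfaces


# Hypothetical coefficients: 0.30 for absorbent walls, 0.03 for glass-like walls.
soft_volume, soft_surfaces = box_room(4, 3, 2.5, floor=0.10, ceiling=0.10, walls=0.30)
hard_volume, hard_surfaces = box_room(4, 3, 2.5, floor=0.10, ceiling=0.10, walls=0.03)

rt_soft = sabine_rt60(soft_volume, soft_surfaces)
rt_hard = sabine_rt60(hard_volume, hard_surfaces)
print(round(rt_soft, 2), round(rt_hard, 2))
```

With the same dimensions, the room with the more reflective walls rings for longer, which is the kind of smear the talk describes for a speaker surrounded by glass.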

💡Physical Form Reconstruction

Physical form reconstruction, as mentioned in the video, is the process of deducing and reconstructing a person's physical appearance or stature from their voice. The script explains that the structure of the skull and face can be deduced from the voice, which can then be used to infer height, weight, and BMI, ultimately reconstructing the person's physical form.
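
Once height and weight have been estimated, BMI follows from simple arithmetic: weight in kilograms divided by the square of height in metres. The short sketch below applies that standard formula to the figures quoted for the caller in the talk's opening example (about 170 cm and 72 kg). It says nothing about how the height and weight themselves are inferred from voice, and the function name is chosen here for illustration.

```python
def body_mass_index(weight_kg, height_cm):
    """Standard BMI: weight in kilograms over height in metres squared."""
    height_m = height_cm / 100.0
    return weight_kg / (height_m ** 2)


# Figures quoted in the talk's example profile.
bmi = body_mass_index(72, 170)
print(round(bmi, 1))
```

For these figures the result is about 24.9.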

💡Virtual Reality Demo

The virtual reality demo, as described in the script, is a technology demonstration where participants' faces were recreated in 3D based on their voice. This showcases the potential of voice profiling technology to not only analyze but also visualize a person's physical characteristics, emphasizing the advanced capabilities of voice analysis in profiling.

Highlights

Profiling humans from their voice is possible and can reveal personal characteristics and environment.

Voice is used in numerous crimes such as threats, harassment, extortion, and hoax calls.

Listeners can form opinions about a speaker's characteristics based on their voice.

Profiling involves detailed analysis of a person's voice to predict their physical and psychological attributes.

Voice carries information about age, health, background, personality, and environment.

Human vocal tract acts as a resonance chamber, affecting the nature of sound produced.

The process of voice production is complex and involves the coordination of various muscles and structures.

Voice analysis can reveal the size and materials of a room from the resonances in the voice signal.

Artificial intelligence is used to extract micro-features from the voice for profiling.

Different voice representations, such as spectrograms, can reveal unique characteristics of a speaker.

The Queen of England's voice has shown changes over 50 years, indicating the effects of aging.

Voice analysis can potentially trace the structure of a person's skull and skeleton from the sound produced.

Technology demonstrated in Tianjin allowed for 3D facial reconstruction from a person's voice.

Rembrandt's voice was recreated from his facial portraits using advanced voice profiling techniques.

The future of voice profiling technology may enable machines to know individuals better than they know themselves.

A forthcoming book by the speaker, to be published by Springer, will detail the technology of voice profiling in approximately 400 pages.

Transcripts

play00:09

everyone thank you for being here for my

play00:12

talk the subject of my talk today is

play00:16

profiling humans from their voice and I

play00:19

would like to introduce the subject to

play00:21

you through an example what if someone

play00:26

called you out of the blue and

play00:29

threatened to kill you and they did that

play00:31

repeatedly you've never heard the voice

play00:34

what would you do you would probably

play00:37

record the voice and take it to the

play00:39

police what would the police do that was

play00:48

a bomb threat there are hundreds of

play00:50

crimes that happen every day around the

play00:52

world through the medium of voice

play00:55

threats harassment extortion ransom

play00:58

calls hoax calls and a lot more what I'm

play01:02

going to do now is to play out a bomb

play01:04

threat to you I want you to listen to

play01:06

the voice very carefully and try to form

play01:11

an opinion about this speaker and then I

play01:15

will tell you a little bit about the

play01:16

speaker and let's match our notes here

play01:21

you kind of share fosters better off to

play01:22

knock off you hello sir can you hear me

play01:24

fine yes how can I help you

play01:26

all right listen to me very carefully

play01:27

okay all right listen very carefully and

play01:30

don't interrupt me I've got seven pipe

play01:33

bombs surrounding these beep are seven

play01:35

pipe bombs the blast radius will go off

play01:38

within a 500-meter radius telling

play01:41

everybody within it seven pipe bombs are

play01:44

located in on this close location and I

play01:46

see the point I'm not afraid of blow

play01:49

them up I have an ak-47

play01:51

if the bump don't detonate in one hour I

play01:54

want to run in with me ak-47 killing

play01:56

everybody in the arc

play01:57

do you understand

play02:00

I'm sure you've formed some opinion

play02:02

about the person what if I told you that

play02:06

this person is white he's Caucasian he's

play02:09

brought up in America probably in the

play02:11

northwest

play02:12

he's about 170 centimeters tall he

play02:16

weighs about 72 kilograms

play02:18

he's about 38 years old he's high on

play02:23

cocaine he's a heavy smoker he's in a

play02:27

small room the room has a wooden floor

play02:29

the room has gypsum walls there is a

play02:32

large glass window behind him he's using

play02:35

a laptop to make the call probably an

play02:37

IBM ThinkPad there's a ceiling fan in

play02:40

the room and so on what if I told you

play02:44

that this is what he looks like probably

play02:48

what if I went one step ahead and told

play02:52

you that this is what he looks like this

play02:55

is what I call profiling humans from

play02:58

their voice how can we do this it turns

play03:02

out we do this all the time we make

play03:05

judgments about other people from their

play03:07

voices all the time

play03:09

how many times have you met a friend or

play03:12

heard their voice over the phone and

play03:13

told that friend who sounds sad you

play03:16

sound depressed you sound happy and so

play03:19

on we make these judgments all the time

play03:21

here he sounded like a man who had slept

play03:24

well and then Oh too much money and we

play03:26

make these judgments so easily that we

play03:28

don't even realize it a remarkable ride

play03:32

only here in America I was born in

play03:34

there's a lineup of people up there did

play03:38

you get who did you did you understand

play03:41

or get who spoke might have spoken that

play03:44

yes you did

play03:46

right in that two seconds you judged the

play03:50

person's gender their age their

play03:53

ethnicity their state of mental health

play03:55

their state of physical health and we

play03:58

can do this because voice carries

play04:01

information we just don't realize how

play04:03

much information voice carries it carries

play04:06

information about your age your your

play04:09

physiological parameters your physical

play04:11

stature your height

play04:12

weight your health your background your

play04:15

personality even about your environment

play04:16

if you're in a room there are signatures

play04:19

of the room in your voice and we'll see

play04:22

examples of that

play04:23

how big is a room what's the ceiling

play04:25

made of what's the floor made of and so

play04:27

on and so forth it's possible to extract

play04:29

all that and the science of profiling

play04:32

uses artificial intelligence to find

play04:35

this information in the voice and to

play04:38

make predictions from it except that we

play04:41

hope the machines will do it much better

play04:43

because human hearing is not all that

play04:46

good so we come to information now the

play04:52

voice carries information but where is

play04:54

that information in order to understand

play04:56

where that information is we need to

play04:59

understand a little bit about the voice

play05:01

production process the human vocal tract

play05:04

or the vocal chamber is a

play05:06

resonance chamber and if you imagine a

play05:10

building that looked like that and a

play05:12

little guy shouting into the building

play05:14

what do you think you would hear you

play05:18

would hear echoes you would hear

play05:19

resonances right if I change the shape

play05:22

of the that building or the dimensions

play05:25

what would happen the nature of those

play05:28

echoes and resonances would change so

play05:31

when we speak

play05:33

air comes out of the lungs and it goes

play05:35

through two vocal cords in our larynx

play05:38

here they're not really cords they're

play05:39

folds and they vibrate in response to

play05:42

the air creating a sound and this sound

play05:45

resonates in your vocal chambers we we

play05:49

produce different sounds as we speak by

play05:52

changing the dimensions or the shape of

play05:55

the vocal chambers by moving our

play05:57

articulators our tongue lips

play06:01

jaw and so forth and as we speak we

play06:04

produce thousands of frequencies we

play06:07

produce frequencies in the range of 50

play06:09

to 6800 Hertz what comes out as a result

play06:14

of this process is a pressure wave that

play06:17

looks like that the signal on the top in

play06:19

time

play06:20

and the picture below is its frequency

play06:23

content on the y-axis you have frequency

play06:26

the thousands of frequencies that we

play06:28

produce on the x-axis you have time and

play06:31

the color at any pixel is the energy in

play06:35

that frequency at that time and these

play06:37

high-energy patterns that you see

play06:39

throughout the picture are the

play06:41

resonances of your vocal chambers so

play06:44

this is one representation of the speech

play06:47

signal where is the information in this

play06:50

one representation and there are many

play06:52

representations possible in this one

play06:55

representation the information is in

play06:57

time in time frequency and in frequency

play07:00

and we'll see some examples of that so

play07:03

I'll give you an example I'd start with

play07:05

an example of information in time and

play07:07

I'd like I like to start with this

play07:09

example so this is the Ferrari and it

play07:13

goes from 0 to 60 miles per hour in 5

play07:18

seconds flat now at one time it was

play07:21

touted as a car that was too fast to

play07:23

race but then other cars came along and

play07:26

these these these high-end cars go from

play07:30

from zero to 60 miles per hour in other

play07:33

times 2.1 seconds 2.2 seconds and so

play07:36

on and so forth why are these times

play07:39

different for these different cars well

play07:43

the answer is simple these are complex

play07:44

machines each one is designed

play07:45

differently it turns out our vocal

play07:49

production process is even more complex

play07:53

here is an example of let me play this 3

play08:00

3 3

play08:01

it's a nine-year-old boy saying 3 3 3

play08:04

when we say a word like 3 the first

play08:08

sound is we produce the sound by

play08:11

creating an obstruction in our vocal

play08:13

tract building up air pressure behind it

play08:16

and releasing it suddenly the vocal

play08:18

folds are not vibrating the very next

play08:21

sound is "r" and that sound requires

play08:24

your vocal folds to vibrate at full

play08:26

potential your vocal tract has muscles

play08:29

they have a certain inertia everyone's

play08:32

vocal tract has

play08:33

inertia and for the vocal folds

play08:36

to go from a state of complete rest to a

play08:39

state of complete motion takes a small a

play08:42

very small amount of time and that time

play08:45

is called the voicing onset time it is

play08:48

different for different people it is not

play08:50

only different for different people it

play08:52

is different for the same person for

play08:54

different combinations of sounds that

play08:57

makes this characteristic very unique

play08:59

this is a characteristic in time so

play09:03

let's go on ahead and see some of the

play09:06

examples of information in frequency and

play09:08

time frequency I'm going to play out

play09:10

this very beautiful song to you

play09:13

[Music]

play09:29

Oh

play09:35

beautiful let's look at the spectrogram

play09:39

how complex is this how complex is this

play09:43

if I wanted to artificially reproduce it

play09:47

I would not be able to do it and I'll

play09:50

show you some of my best efforts I take

play09:53

a small snippet of this and by the way

play09:55

you all these fine details that you see

play09:59

are in time and frequency so this is

play10:02

information in time and frequency in

play10:04

this one representation so here are my

play10:07

attempts to recreate a small sliver of

play10:09

this sound I start with this

play10:12

mark knows getting close this is a real

play10:22

one this is the closest I could get

play10:27

mathematically and this is just a small

play10:31

snippet of the sound produced by this

play10:34

person every person's voice is unique

play10:38

every person's voice is unique let's

play10:41

look at another example

play10:42

this is a singer from India at one time

play10:45

she

play10:46

I think held the record for having the

play10:49

sweetest voice in the world

play10:50

and this is her voice

play10:55

[Music]

play11:10

Media

play11:15

[Music]

play11:25

if you compare the two voices

play11:28

side-by-side nothing would match

play11:31

nothing would match there there there's

play11:33

such detail in these voices and

play11:36

incidentally what she is singing and

play11:38

this with this song was recorded in 1977

play11:41

she's saying my name will be lost this

play11:42

face will change my voice is my only

play11:45

identity the information in voice is in

play11:51

the micro features and we use artificial

play11:53

intelligence to discover and extract

play11:55

these micro features I was talking about

play11:58

representations so I'll show you

play12:00

information and a couple of other

play12:02

representations very interesting these

play12:05

this is the Queen of England and these

play12:09

are examples of her voice 50 years apart

play12:12

when she was very young at age 35 years

play12:15

ago my grandfather broadcast the first

play12:18

of these Christmas messages one of the

play12:21

features of growing old is a heightened

play12:23

awareness of change can you hear the

play12:26

difference so this representation is

play12:28

called the constant-Q spectrogram and

play12:31

the arrows point to the same word spoken

play12:34

50 years apart and you can see the

play12:37

differences the pitch is shifted the

play12:38

harmonics are smeared and there are

play12:40

other differences here's yet another

play12:43

representation these are pitch pulses of

play12:46

three different people saying the word

play12:49

and do you see the difference in pattern

play12:53

between the ones on the sides the people

play12:57

on the sides and the one in the center

play12:58

the people in the person in the center

play13:01

has Parkinson's so and the the markers

play13:05

show up on this representation that's

play13:16

Hitler in 1935 addressing the Nazi Party

play13:19

and what do you see in his voice in the

play13:22

same representation people suspected

play13:24

from video evidence that he had

play13:25

Parkinson's this shows that he might

play13:29

have had Parkinson's indeed

play13:31

your environment leaves signatures on

play13:34

your voice this is an example if you're

play13:36

surrounded by glass

play13:38

it causes reverberation and causes a

play13:40

smear in the voice signal you can see

play13:43

this smear on the spectrogram and by

play13:45

backtracking from that smear you can

play13:48

actually trace the materials of the

play13:51

enclosure around the person and also the

play13:55

dimensions of the person the room so and

play13:58

a lot more is possible using voice

play14:01

analysis your skull is closely related

play14:07

to your face your skull is also very

play14:13

closely related to your vocal chambers

play14:16

so is your face it is therefore possible

play14:20

to trace the contours of your face from

play14:25

the voice it is possible to deduce the

play14:31

structure of your skull from voice your

play14:34

skull is connected to your skeleton it

play14:36

is therefore possible to deduce the

play14:40

structure of your skeleton from the

play14:41

voice it's possible to get your height

play14:43

and weight I can get your BMI I can fill

play14:47

in your skeletal structure it is then

play14:49

possible to reconstruct your stature or

play14:52

your physical form from voice and one

play14:57

day we hope we're going to be perfect at

play14:59

doing that or near-perfect at doing that

play15:01

where are we today

play15:03

last year in 2018 in September in

play15:07

Tianjin this technology was demonstrated

play15:09

live about a thousand people tried it

play15:11

out and one part of the technology was a

play15:15

virtual reality demo where people were

play15:19

wearing a headset and saying something

play15:21

and their face was recreated in 3d you

play15:23

could pick it up and examine it this

play15:27

year in February we reversed the

play15:29

technology and we recreated Rembrandt's

play15:32

voice from his facial portraits this was

play15:37

done in collaboration with JWT Walter

play15:41

Thompson the Rijksmuseum and ING in

play15:44

the Netherlands what's the future of this

play15:46

technology your voice will help machines

play15:50

know you better perhaps better than even

play15:54

you can know yourself this is a book

play15:59

that I have just finished writing it's

play16:01

going to be published very soon by

play16:04

Springer it spells out the technology in

play16:06

about 400 pages and hopefully it'll be

play16:11

out very soon in a couple of months

play16:13

thank you very much
