This Advanced Kind Of AI Could Be The Secret To AI Assistants

Forbes
28 May 2024, 29:31

Summary

TLDR: This panel discussion was a lively conversation about the future of multimodal AI. Multimodal AI refers to machine learning models that can process images, video, and text, and is seen as having the potential to augment human lives. The panelists discussed the use of AI across a range of industries, strengthening the understanding of video data, and the ethical and technical challenges involved. They paid particular attention to advances in technology for understanding the meaning of video and reading people's emotions. They emphasized the important role AI plays in building trust, and spoke about verticalization, the richness of data, and the trustworthiness of the dialogue between technology and humans.

Takeaways

  • 🧠 Multimodal AI is a machine learning model that can process images, video, and text, allowing it to understand many different forms of information.
  • 👥 Each panelist has founded a company that applies multimodal AI in their own field, using the technology to deliver innovative services.
  • 🔍 Avoka AI uses voice AI to build the most advanced receptionist, focusing in particular on the home services industry.
  • 🎥 Twelve Labs develops multimodal AI for video understanding, interpreting the visual, audio, and text elements within a video to provide a comprehensive representation of it.
  • 📈 Lexi Mills introduced a foundation focused on digital ethics and digital forensics that uses AI tools to mine and verify data from abuse cases and pursue prosecutions.
  • 🛠️ The multimodal AI field faces ethical as well as technical challenges, and how to overcome them is an important part of the discussion.
  • 📱 Multimodal AI that uses video and audio can understand customers' emotions and nonverbal communication, enabling more effective communication.
  • 🌐 The companies that succeed in multimodal AI will be those that root themselves deeply in a specific industry, collect data there, and build trusted relationships.
  • 🔑 Trust is the key to AI's success: companies must not only make the technology more reliable but also make the dialogue between humans and AI trustworthy.
  • 🚀 Multimodal AI is already applied in many areas, and its value is recognized across verticals such as video editing, sports, healthcare, and learning.
  • 🌟 Going forward, larger context models will emerge that can hold an entire long conversation and enable more natural dialogue.

Q & A

  • What kind of technology is multimodal AI?

    -Multimodal AI is a machine learning model that can process many forms of information, such as images, video, and text. This enables richer integration and understanding of information.

  • What kind of company is Avoka AI?

    -Avoka AI was co-founded by MIT alumni and uses voice AI to build the most advanced receptionist, focusing on antiquated industries such as electricians, plumbers, and HVAC.

  • What is the most compelling capability of multimodal AI?

    -The ability to incorporate the emotional dimension is the most compelling. Understanding the nonverbal elements needed for sales and persuasion is also important.

  • What kind of company is Twelve Labs?

    -Twelve Labs develops multimodal AI for understanding video. It interprets the visual, audio, and text elements within a video and creates a comprehensive representation of it.

  • What are the major challenges in video understanding?

    -Video data is very hard to handle because of its temporal dimension, the movement of objects, and keeping the visual, audio, and text elements consistent with one another.

  • How does Lexi Mills think about digital ethics?

    -Lexi Mills runs a foundation focused on digital ethics and uses AI tools to mine data from abuse cases and gather evidence, substantiating cases that were previously considered too ambiguous to pursue.

  • How is Avoka AI improving sales?

    -Avoka AI uses AI to understand the emotional side of selling and human nature, improving communication with customers.

  • How is Twelve Labs' video understanding technology used?

    -The technology is used in many settings, for example searching for specific moments in YouTube videos, video editing, creating trailers for new TV shows and films, and security surveillance.

  • How has sentiment analysis evolved over the past few years?

    -The accuracy of text-based sentiment analysis has improved, but more important is the development of models based on the sound of the voice itself. It is now possible to infer emotional state from cues such as high or low energy.

  • Why is it important for AI to understand human emotion?

    -It is especially important in sales and customer service. Reading the customer's interest and needs helps determine how the conversation should proceed.

  • As multimodal AI becomes widely adopted, what do companies need in order to succeed?

    -To succeed, companies need deep verticalization into a specific industry, accumulation of high-quality data, stronger trust, and positioning the technology as something that assists human capabilities.

Outlines

00:00

🤖 Defining multimodal AI and its outlook

This segment covers the basic definition of multimodal AI and the areas where it applies. Multimodal AI is a machine learning model that can process images, video, and text, handling many forms of modality. GPT Vision is given as an example: provide a photo of your ingredients and it produces a recipe. The panelists also introduce themselves and talk about the multimodal AI capability they find most exciting.

05:02

🛠️ Applications of multimodal AI and ethical challenges

The second segment goes into more detail on applications of multimodal AI: developing AI that understands video data, and interpreting the visual elements, audio, and text within a video to obtain a comprehensive representation of it. Technical challenges in handling video data and ethical issues are also raised, along with the many uses of the data obtained through video analysis.

10:02

🗣️ The role of multimodal AI in communication

The third segment discusses the role of multimodal AI in communication. Beyond progress in text analysis, it stresses the importance of models that focus on non-text signals such as the energy of the human voice. It also touches on the importance of AI understanding the emotional side of conversation and its usefulness for sales and persuasion.

15:02

🔍 Ethical and legal issues around multimodal AI

The fourth segment explores the ethical and legal issues around multimodal AI: the dispute over the New York Times' articles being used as AI training data, problems of bias, and copyright. Ethical concerns and differing views on them are debated, digging into the progress of the technology and the challenges that come with it.

20:04

🚀 Technical progress and challenges of multimodal AI

The fifth segment covers the technical progress of multimodal AI and its current challenges. In the first 20 to 30 seconds of a call, AI can outperform a human, but as the conversation goes on it becomes hard to maintain the naturalness of human dialogue. It also touches on distrust of AI and the gap between technical progress and users' expectations and mistrust.

25:05

🌟 The future of multimodal AI and the keys to success

The final segment discusses which companies will succeed as multimodal AI is adopted more widely. Trust, reliability of the technology, building customer relationships, deep entry into specific industries, and high-quality data are cited as keys to success. The panelists also describe how their own companies are developing along those lines.

Keywords

💡Multimodal AI

Multimodal AI refers to artificial intelligence models that process multiple sources of information, such as images, video, and text, and combine them to form an understanding. The video discusses how multimodal AI can complement human lives and how it is being used across industries. As an example, GPT Vision is mentioned, which can create a recipe from a photo of the ingredients you provide.

💡Sentiment analysis

Sentiment analysis is the use of artificial intelligence to interpret human emotions from text, audio, and other signals. The video notes that sentiment analysis plays an important role in sales and customer service, and stresses how important it is for AI to understand nonverbal information and the speaker's emotions.

💡Video understanding

Video understanding is the process of analyzing the visual elements, audio, and text within a video to understand its content as a whole. The video introduces Twelve Labs, a company developing multimodal AI for video understanding; extracting meaning from video data makes the technology applicable across many industries.

💡Digital ethics

Digital ethics means upholding ethical principles in the use of digital technology. The video describes a foundation specializing in digital ethics and digital forensics that uses AI tools to analyze data from abuse cases and gather evidence, contributing to the investigation of cases that were previously difficult to prosecute.

💡Video transformers

Video transformers are artificial intelligence models for understanding video data and extracting meaningful information from it. The video explains that Twelve Labs is developing video transformers that capture motion and temporal change within a video to provide a more comprehensive representation of it.

💡Nonverbal communication

Nonverbal communication is communication made up of gestures, facial expressions, tone of voice, and other cues beyond words. The video stresses that it is important for AI to understand nonverbal communication in order to grasp customers' emotions and intent.

💡Telephony AI

Telephony AI refers to services that apply artificial intelligence to telephone communication. The video introduces Avoka, a company using voice AI to develop an advanced receptionist for the home services industry.

💡Context models

A context model is an AI capability that understands the surrounding context of a conversation or text and responds appropriately. The video notes that context plays an important role in conversation and is essential for keeping a dialogue continuous and natural.

💡Digital marketing

Digital marketing is the promotion of products and services over the internet. The video notes that Lexi Mills combines technical search knowledge with psychology to create digital marketing strategies that influence human behavior.

💡Ethical challenges

Ethical challenges are the points of ethical concern raised by technology development and business activity. The video raises issues such as AI analyzing people's emotions and intent and serving ads accordingly, and a range of views on these concerns are exchanged.

Highlights

Multimodal AI is defined as a machine learning model capable of processing images, videos, and text.

An example of multimodal AI is GPT Vision, which creates recipes from pictures of ingredients.

Avoka AI leverages voice AI to build advanced receptionists for antiquated industries like home services.

The emotional aspect of communication is crucial in sales and customer service, and AI is working to understand nonverbal cues.

Twelve Labs is building multimodal AI for video understanding, interpreting visual, speech, and text elements in videos.

Video understanding is challenging due to the temporal dimension and consistency between visual and speech elements.

Use cases for multimodal AI in video understanding include sports, media, entertainment, e-learning, security, and healthcare.

Lexi Mills discusses using AI tools for digital forensics to prosecute cases that were previously hard to evidence.

Aorv, co-founder of Avoka, emphasizes the potential of AI in improving sales by understanding human nature and emotions.

Twelve Labs helps identify specific moments in videos through semantic search on video embeddings.

Transformer models are changing video understanding in a way similar to how they transformed natural language processing.

AI is being trained to understand the energy and emotional tone of voice calls for better customer service.

Sentiment analysis has improved significantly, with models now focusing on the sound of human voice rather than just text.

Hume AI is working on giving AI emotional intelligence through a set of modalities.

Lexi discusses the ethical implications of analyzing individuals at a detailed emotional level and the importance of trust in AI.

James from Twelve Labs talks about the importance of having access to high-quality labeled video data for training AI models.

The panel agrees that trust, verticalization, and rich data are key to differentiating successful multimodal AI companies.

Transcripts

00:01 Welcome, welcome. Let's get started. This is the Multimodal AI Revolution panel, and we have a very exciting conversation for you today. We'll be discussing what's coming next for multimodal AI: what it is, whether it can augment human lives, how wearables can leverage it, and some of the ethical and technical challenges that surround the space. A quick definition: multimodal AI is a machine learning model capable of processing images, videos, and text, and it can handle other forms of modality as well. An example of this today is GPT Vision, where you can give it a picture of the ingredients you have access to and it can create a recipe for you.
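As a minimal sketch of that ingredients-to-recipe idea (not shown on the panel), the call below assumes the OpenAI Python SDK, a vision-capable model such as gpt-4o, and a placeholder image URL.

```python
# Hypothetical sketch: ask a vision-capable model for a recipe from an ingredients photo.
# Assumes the OpenAI Python SDK and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",  # assumption: any vision-capable chat model would do
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Here is a photo of the ingredients I have. Suggest one simple recipe using only these."},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/my-fridge.jpg"}},  # placeholder image
        ],
    }],
)

print(response.choices[0].message.content)
```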

01:08 So let's get started with a quick round of introductions. Introduce yourself, tell us what you're working on, and what is one ability of multimodal AI that you find most exciting? Let's start with you.

01:22 Absolutely. My name is Tyson, and I'm one of the co-founders of Avoka AI, a company my co-founder Aorv and I started a bit over two years ago. We were students at MIT around eight years ago, class of 2017, and did a lot of research at the Media Lab, so it's great to be back today. Avoka leverages voice AI to build the world's most advanced receptionist for a lot of antiquated industries, including home services: electricians, plumbers, HVAC, the people you would probably think would be the last to be utilizing AI. That's what we're working on in a nutshell. In terms of multimodal, the area I'm most excited about is the ability to incorporate not just text but the emotional aspect, because in the world of HVAC and plumbing it's not just customer service you're dealing with; you actually need to make a sale. In order to make a sale and be convincing, we need AI that can understand not just what people are saying but the nonverbal side and what they actually mean, because that's where a lot of the interesting parts are.

02:53 James?

02:55 Hi everyone, I'm James. I currently run developer relations at Twelve Labs, and our company is building multimodal AI for video understanding. Going back to the moderator's definition of multimodal: just like a baby acquiring knowledge reads text, hears sounds, feels emotion, and smells odors, with all these different senses and modalities coming in, we're trying to build models that start doing the same thing: interpreting the visual element, the speech element, and the text element inside a video, and then coming up with a comprehensive representation of that video. If you think about the data out in the world, I'd say more than 80% is unstructured, and more than 80% of that is actually video. Unlike text and images, video is a very challenging thing to tackle because of the temporal dimension, how things move over time, and the consistency between visual, speech, and text, so we're trying to build the kind of AI that can tackle this challenging technical problem. In terms of use cases, we serve a lot of different verticals, ranging from sports to media and entertainment to e-learning and even security, surveillance, and healthcare. People build video search to find interesting moments in a football or baseball game, use it to quickly edit video into new TV shows or movie trailers, and even use it to find weapons and violence in police body cam footage. Any industry that handles a lot of video data can benefit from this kind of understanding.

04:48 Hi, can you hear me? Is it working? My name is Lexi Mills. I'm a digital communication specialist; we focus on emerging technology, so anywhere there isn't a word yet, or people aren't searching for a word, it's our job to help people use it and understand it. On the other side of what we do, we have a foundation that looks a lot at digital ethics and, more recently, digital forensics. Over the last three or four years we've been using our skills in an inverse way, to mine data and information for different types of abuse cases, which are typically quite hard to prosecute. Now we get huge amounts of data, using free off-the-shelf AI tools, to be able to prosecute cases that previously would have slipped under the radar due to lack of evidence.

05:42 Hey everyone, my name is Aorv. I'm the other co-founder of Avoka that Tyson mentioned. Just to recap, we're like a receptionist for these home service businesses; I think Tyson covered most of it. On what's exciting and what we're actually working on, the big thing to emphasize about where AI is headed: we've always seen a lot of customer support startups working in AI infiltrate so many different companies while we've been working on Avoka, and what we're starting to see more is how that can infiltrate sales. Sales requires a lot more emotion and a lot more understanding of human nature. I don't think AI can do all of sales, but it can significantly improve it, and that's essentially what we're working on. It's quite exciting.

06:24 Very exciting. James, you work for a company called Twelve Labs, which basically helps understand video. Right now if I go to YouTube and do a search, I'm trying to find the exact transcript or the keyword, but what your company does is find snippets from the video, so I can just ask, where was that robot, which part of the movie did that robot come in, and Twelve Labs will help me identify it. Is that correct?

06:51 Yeah, that's correct. Our hypothesis is that video understanding has not evolved a lot over the past decade. The way research has tackled it is to build specific computer vision models optimized for a very narrow task, like keypoint estimation, object detection, semantic segmentation, and so on. They generate metadata or text from the video, and when they perform search they actually do keyword or metadata search based on that text or transcript. But that cannot capture the visual element of the video, and it can be totally disconnected from what's happening. With the rise of Transformers and the versatility of multimodal data, we can create embeddings from the video, which are a vector representation of the video content, and when you perform search you actually do semantic search on that video embedding space. The result is much more holistic and native to the way models learn. Just as Transformers transformed NLP, we're seeing the same thing happening with video Transformers transforming video understanding.

08:03 Awesome.
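To make the embedding idea concrete, the sketch below shows semantic search over pre-computed clip embeddings. It is a generic illustration, not the Twelve Labs API: the 512-dimensional vectors, timestamps, and query embedding are all placeholders for what a real multimodal model would produce.

```python
# Hypothetical sketch of semantic search over video-clip embeddings.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Pretend each clip of a video has already been embedded by a multimodal model.
clip_index = {
    "00:00-00:10": np.random.rand(512),
    "00:10-00:20": np.random.rand(512),
    "00:20-00:30": np.random.rand(512),
}

def search(query_embedding: np.ndarray, top_k: int = 2):
    """Rank clips by similarity to the query embedding (semantic, not keyword, search)."""
    scored = [(ts, cosine_similarity(query_embedding, vec)) for ts, vec in clip_index.items()]
    return sorted(scored, key=lambda x: x[1], reverse=True)[:top_k]

# A real system would embed the text query ("where was that robot?") with the same model.
query = np.random.rand(512)
print(search(query))
```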

08:06 Tyson, this ties to the work you're doing, where you're trying to identify nonverbal communication. I had a stat somewhere that said 80% of communication is nonverbal: the way I'm moving my hands, the way you're looking, your facial features changing, and so on. Are you capturing video footage as well? I know right now you're doing just voice calls, but at some point do you plan to capture video too, and use something like Twelve Labs to get that visual context for emotional intelligence?

08:36 Right now we're primarily, or almost exclusively, working purely in the voice realm, because most of our customers are antiquated-industry folks like home services, and they unfortunately don't have the luxury of getting their customers to call in on Zoom, so it's all purely phone communication. But even within phone communication there is so much that is not captured simply by transcribing the call and analyzing the words. There's the tonality, whether the customer is angry or upset. One thing we're really keen on is measuring the customer's sentiment at the beginning of the call, then seeing what the sentiment is at the end of the call and what that delta is. That's a good metric for us to determine whether or not we did a good job of improving the customer's day by talking to them.

09:34 This is for you, Aorv, and for you, Tyson, as well. We've had sentiment analysis for a while. Do you think the models now have just made it ten times better? What difference are you seeing between what's happening now and the sentiment analysis and natural language work we had in 2015?

09:53 Maybe I can add to that. I think there are two things. One, the analysis of text has definitely 10x'd in terms of our ability to do it over the past several years. But the more important and bigger thing that's going to emerge is models that do not even look at text; they're focused on the sound the human is making on the other side. There's another company exclusively focused on this, Hume, which is worth checking out. Essentially you can now train models, starting from a base layer, that actually hear what the person is saying and judge whether it is more likely to be high energy or low energy, and I think understanding that gives us the next wave of unlocking voice applications.
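As a toy illustration of the high-energy versus low-energy idea (not the panelists' actual models, which are learned rather than rule-based), one could threshold the short-term RMS energy of a voice recording; the frame length, threshold, and synthetic waveform below are placeholders.

```python
# Hypothetical sketch: bucket a voice clip as "high energy" or "low energy"
# from its short-term RMS, as a stand-in for learned audio-emotion models.
import numpy as np

def rms_energy(samples: np.ndarray, frame_len: int = 2048) -> np.ndarray:
    """Root-mean-square energy per frame of a mono waveform in [-1, 1]."""
    n_frames = len(samples) // frame_len
    frames = samples[: n_frames * frame_len].reshape(n_frames, frame_len)
    return np.sqrt((frames ** 2).mean(axis=1))

def energy_label(samples: np.ndarray, threshold: float = 0.05) -> str:
    """Very crude proxy: a real system would use a trained model, not a fixed threshold."""
    return "high energy" if rms_energy(samples).mean() > threshold else "low energy"

# Placeholder waveform; in practice you would load audio with a library such as soundfile.
fake_call = 0.1 * np.sin(np.linspace(0, 3000, 16000 * 10))
print(energy_label(fake_call))
```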

10:33 This is very interesting, because Hume and I had a conversation after a hackathon. I was trying to build something that would give ChatGPT emotional intelligence. I use ChatGPT to teach me things, but when I teach someone myself, I can tell when the student is losing interest, or when the material is too hard or too easy for them, and I can change the content I'm sharing. Similarly, I was trying to use my facial features as a modality I could pass to a ChatGPT plugin, and that's when I came across Hume; the work they're doing is basically about giving AI emotional intelligence, which is a whole set of modalities. Would you be able to integrate that into your startup, and what would the implications be for you?

11:21 It would be tremendous. With a voice agent, one of the most common problems is that the agent is not able to understand how the end customer is feeling. When it comes to elevating the level to actual sales, when someone has ten options for who they want to install a new HVAC unit, the cues around whether they are actually interested in buying, whether they want to hear about all the upgrades, or whether they are just someone who wants the cheapest option, being able to decipher that and then navigate the conversation from there is extremely important.

12:10 That is very interesting. Lexi, in your previous jobs you've worked as a head of communications, and on your LinkedIn I saw this in your bio and thought it was incredible. It said Lexi combines technical search knowledge with psychology to create data-driven, measurable communication strategies that maximize influence on human behavior. Do you think tactics like this, where we're using more than what we naturally know and augmenting our lives through AI, are going to have a significant impact on human communication?

12:38 Yes, definitely. There's almost no AI that isn't somewhat trained on internet data, and the thing is, Google's objective is primarily to give us what we want, as fast as possible, but what we want and what we need can often be very different. I do a lot of work in debt management, and the way we communicate around debt is bonkers. When someone types in "get rid of debt" they'll get different search results compared to someone who uses good language or good grammar, but that is telling you what level of fear they are in at that point in time, and we could be regulating advertising based on someone's emotional state so that they're making emotionally intelligent decisions rather than emotionally deficient ones. Take it a little further and think about something like lung cancer survival statistics: you're either researching that because you're a researcher, or most likely because you know someone with it. Getting the statistics isn't super helpful unless you have the context; there are several tests you need to interpret that data. Getting the information fast isn't even giving you accurate information, because it's not giving you the context to digest it. Knowing who's online that you could speak to, having that come up first, giving you a warning that this information won't be helpful unless you understand x and y, means that people's entire search journey becomes more intelligent, and then we're going to be looking at how we optimize for that thereafter, because the structure of the internet feeds directly into what gets optimized in certain LLMs.

14:19 It's interesting. From an ethics standpoint, do you think it is right to be analyzing someone in this much detail, to where I'm picking up micro changes in their facial features and I'm able to decipher what they might be thinking deep down? Do you think that's ethical?

14:37 I think there are ethical challenges to it, but I think it's also unethical not to be doing it. Right now a lot of ad tech is coded to take advantage of emotional states that we understand through language, and time of search as well, so by not doing it we have ethical concerns too; it's just that we're already in that flow, so we're not questioning it. We tend to question new problems, new challenges, new technology, but actually a lot of the challenges we see with new technology have existed previously.

15:08 That's interesting. I'm guessing you've been following the New York Times suing OpenAI for using the text of their articles as training data. Do you anticipate issues with, say, Marvel Studios or Universal going to OpenAI about Sora and saying, listen, you used our content as training data to generate these videos? Do you think that could be an issue? Starting with you.

15:36 I think there are going to be issues. The New York Times has had quite an issue with bias over the years, so there is what we're seeing people convey as fears and there are the underlying fears. There's a great book called The Gray Lady Winked, which is about the historical bias across the New York Times, and we've seen it in all news. Search is biased, so if the content it's drawing from is also biased, then you've got double bias, then you've got double bias scaled, and then you've got copyright issues thereafter. From a business standpoint, if you want to be really prudent, and vicious, you should probably stop people learning from your content, just to protect anything else that could be revealed within it, not only to protect your financial interests, but that has ethical implications too.

16:24 I do think there is going to be significant change in how we process conversations and how we make decisions. James, are you seeing any interesting work coming out in this space with the new models that are being trained or released right now?

16:43 Yeah, for sure. Big companies are obviously doing it right now; I think we have a couple of folks from Gemini at the summit, and we talk about GPT-4V, Sora, Anthropic, even Claude got vision capabilities. Then within the startup atmosphere there are companies competing with us, such as Adept and Reka, and I think even Hugging Face has started releasing open source vision language models. In the academic open source community I think the most popular one is LLaVA, which has multiple versions, and there's new research coming out of academia all the time; people interested in learning more can check out conferences like CVPR or ICML. Those models are very powerful, and internally at Twelve Labs we're also in the process of building more and more video foundation models and video language models that can enable these interesting use cases. The best feeling is when developers and users are actually using our models for real use cases. Last year we hosted a hackathon together with ElevenLabs, a funny pairing of names, Eleven Labs and Twelve Labs, focused on multimodal AI, and a lot of people were shipping interesting applications, from e-learning to social impact use cases. That's actually how I got connected with our moderator, because he was on the winning team of that hackathon, and we stayed in touch. It's great to see how much the field has changed over just the past six months.

18:25 It is. I'll go to Tyson and Aorv next to talk about the challenges, but just reflecting real quick: James and I met at that hackathon. I realized I was watching lecture videos and tended to zone out a fair bit, and different people have different interests, areas where they focus completely and areas where they zone out. So we used an EEG headset to measure brain waves, built a knowledge graph, and then used Transformers to literally make any part of the lecture that's not exciting exciting for the things you care about. That was enabled by Twelve Labs and by ElevenLabs, which made it easier for us to generate voices, so we had Steve Jobs coming out and asking us a question, like, hey, are you losing interest, come back in, trying to bring us back in. So, Aorv and Tyson, you're using this in production at your company; what are some challenges you're seeing right now that are preventing you from making it highly scalable so that everyone else could use it?

19:29 Maybe I can start there. With how voice AI is right now, and where we're at with the product in terms of understanding human emotions, being able to be emotive back and sound humanlike, I think the first 20 to 30 seconds of a conversation can be done very well by an AI. The AI can essentially understand the human's problem and understand what to do next: should you be closing this person on a sale, should you be answering some kind of question, should you be routing it to someone else. For that part I think we're at a place where AI is actually better than a human, because the AI will always pick up every call within one second. Where we're seeing challenges, and I think the industry generally is too, is the part after that. For our use case, pretend your sink just broke, it's flooding with water, and you need to get it repaired. If you call and something robotic is answering you for a minute or two, you're going to start getting very agitated; you're like, okay, please transfer me to a human, this is a serious issue, I don't want to waste time talking to a robot. That shift starts happening after the 20 to 30 second mark. What we need now is for the AI to be much smarter in terms of understanding the business's services, understanding when technicians can come out, and how to actually solve the end customer's problem. For that you still need a bit more understanding of human emotions, being able to empathize with the customer and circle back with them, and so we're not quite there yet.

20:57 Is there anything you'd like to add to this?

21:00 I think that's the primary one. The other one is that, before I started at Avoka, I was working at a self-driving car company, Nuro, and one of the inherent biases, or challenges, is that people have a fundamental distrust of AI. Even at Nuro, miles per critical disengagement is kind of the golden metric for self-driving, and there were times when, in certain areas, our MPCD was better than the human average, but people were still afraid, because as soon as an AI makes a mistake, they're upset. We're running into the same thing at Avoka: sometimes the AI may be better at solving their problem, but because they've had so many bad experiences with phone AI and IVRs in the past, they're starting at a baseline of fundamental distrust, and so you have to be not just better but much better in order to get people to change their behavior.

22:17 That is interesting. One of the things you touched on was the current lack of large-context models that can hold everything, say a conversation that's been going on for 10 minutes, in memory. I think GPT-4 is around 32k tokens, Claude is 200k, and now we have Gemini 1.5, which is a million tokens. Do you think that, as these larger models come out, the space for AI wearables gets big, because we'll be able to hold all the conversations we're having throughout the day in one context, maybe do multiple conversations back and forth? Do you think that is going to be the key solution to these problems?

23:00 From my point of view that's going to be huge, but I don't think it's going to be everything. For example, even with Gemini and how big its context length is, information at the beginning of the prompt or the beginning of the context usually leads to higher accuracy, and there are a lot of things like that, so the model generally needs to be able to consume that information well. Context length is one thing, but the depth to which the model can actually consume that context is a second. Still, it will be a total game changer. I do think there are other aspects too. Human conversation isn't just something you can codify; a lot of it is your brain's understanding. If you think about how a brain processes a human conversation, there are a lot of similarities drawn from however many years of conversations you've had with humans, things you pick up on; that's emotional intelligence. We need to figure out how to codify that better, and that's where multimodality can be huge, with video, hearing sounds, things like that. That would be the next step: how to actually codify that properly into a context.

codify that properly into a context

play23:55

awesome so we are coming up on time I'll

play23:57

do one final question uh we're seeing a

play23:59

lot of multimodal AI companies come up

play24:02

uh and this is for everyone who go down

play24:04

the down the the row um what is one as

play24:09

it becomes widely adopted what do you

play24:11

think will differentiate companies that

play24:13

really succeed in the space and stay

play24:14

around versus all the fluff that we

play24:17

seeing uh you want to start off sure I

play24:20

mean I think um you know many things I

play24:22

think for us one of the bets that that

play24:24

we have at aoka is um kind of a deep

play24:27

verticalization and so you know by being

play24:29

the company that is so ingrained in Home

play24:32

Services we eventually develop a um you

play24:34

know a mode on the types of data but

play24:37

then also around the Integrations and

play24:39

how we are able to um you know fit kind

play24:41

of every single one of these uh needs

play24:44

and then also you know the types of um

play24:48

you know uh use cases and objections and

play24:50

and paths that we are able to find we're

play24:53

able to really fine-tune our models and

play24:55

just uh you know Ser service this one

play24:58

extremely Niche industry um you know

play25:00

super well yeah so my answer is probably

play25:04

somewhat follow what what tach just said

play25:07

um we see a lot of flash Shey demos on

play25:09

on social but I think the application

play25:12

that that actually you know G revenue

play25:14

and transforming Enterprise is going to

play25:16

be embedded deeply in the workflow of

play25:18

those organization um you know so I

play25:21

think we got a lot of comparison with

play25:23

company like like Runway and and and P

play25:25

and other VTO generation you know

play25:27

companies but uh we're actually doing

play25:29

video understanding not video generation

play25:31

and From perspective of like you know

play25:34

video editors fil makers our tools

play25:37

actually augment you know their workflow

play25:39

and help them you know do their job

play25:41

better not actually replacing their job

play25:44

right and so come up with that

play25:45

positioning and you know make sure that

play25:47

we we augment you know human cilities

play25:49

not replacing them is very important um

play25:51

and the second part of you know

play25:54

uh the the the am man also around like

play25:57

being propri data set um I think um like

play26:01

for video uh it's not a lot of Open

play26:04

Source or you know openly available V

play26:07

compared to like text images so I think

play26:10

uh getting access to them and and more

play26:11

importantly getting high quality label v

play26:14

data is even more important how do we

play26:16

like generate description label this

play26:18

this video data given the you know the

play26:20

challenges of dealing with you know the

play26:23

temporal Dimension Etc so we invest a

play26:25

lot of effort on you know video labeling

play26:28

as well as the infrastructure to process

play26:31

Theo efficiently uh and you know we have

play26:34

already seen some of a very promising

play26:35

result in the Ty of performance that

play26:37

amodo was able to uh produce uh given

play26:40

that you know higher quality of v data I

play26:42

would

play26:44

collect I think going back to your point

play26:47

about uh building up the trust and we

play26:50

expect AI to perform perfectly and it

play26:52

never will in its early stages I think

play26:54

the firms that I see making good Headway

play26:57

are the ones that are able to to

play26:58

communicate that this is a process not

play27:00

an event because they will Garner trust

play27:03

based on truth and the beauty of

play27:05

multimodal is that we have so many ways

play27:07

to have that dialogue and the people

play27:09

that choose to invest not just in

play27:11

getting the technology to be more

play27:12

reliable but getting their

play27:14

Communications and their dialogue with

play27:16

humans to be more reliable and allowing

play27:18

them the context for where the techn

play27:21

technology sits and goes will give them

play27:23

Runway because we need that relationship

play27:26

with human beings and Technology to

play27:28

continue and for that to happen we need

play27:30

to have

play27:32

trust yeah I I think I definitely agree

play27:34

with trust that's huge also

play27:36

verticalization I think maybe one more

play27:38

thing to add that does probably tie into

play27:40

verticalization a bit is around data

play27:42

like having very rich data that's

play27:44

important for your customer is I think

play27:45

essential so for example for us for aoka

play27:48

the way we viewed is we got a lot of

play27:49

data around sales how sales

play27:51

conversations are happening and there's

play27:52

so many nuances that are actually just

play27:54

different in sales than it is in

play27:56

customer support which is often what AI

play27:58

models trained in the past that actually

play27:59

makes it such that maybe the best

play28:01

companies will be the ones that can

play28:03

capture not only the Nuance between

play28:04

sales versus customer support

play28:05

conversations but also between your

play28:07

company versus other companies like how

play28:09

exactly do you handle this objection

play28:11

like what is like the right steps to do

play28:12

after that and things like that and so

play28:14

that can only come from verticalization

play28:16

and having customers in

play28:18

trust awesome yeah uh I think one of the

play28:21

key points everyone's touched on here

play28:22

has been trust infrastructure all these

play28:25

things have to be upgraded uh and as we

play28:27

see this is this is just a start like we

play28:29

are seeing a lot of variables come out

play28:31

we just saw uh Humane uh release and

play28:34

we've had other variables uh companies

play28:37

announced that their own products are

play28:38

coming out this is just going to be more

play28:40

and more important and if something's

play28:41

recording me 24/7 the trust factor and

play28:44

that ability to like really augment my

play28:46

life has to be present um you've just

play28:50

had the chance to learn about multimodal

play28:51

AI from some of the experts in the field

play28:54

uh and these people are on the ground

play28:55

they're working and they're building

play28:57

stuff so they're very well up to date on

play28:59

what's Happening uh so for that I'd love

play29:01

you to just give them a huge round of

play29:03

applause and thank you all for listening

play29:04

to us thank you

play29:08

[Applause]


Related Tags
Multimodal AI, expert panel, AI technology, video understanding, voice AI, human emotion, ethical issues, technical challenges, verticalization, trust building