This Advanced Kind Of AI Could Be The Secret To AI Assistants
Summary
TLDR: This panel discussion was a lively conversation about the future of multimodal AI. Multimodal AI refers to machine learning models that can process images, video, and text, and the panelists see it as having the potential to augment human lives. They discussed how AI can be applied across many industries, deeper understanding of video data, and the ethical and technical challenges involved. They paid particular attention to advances in understanding the meaning of video and reading people's emotions, stressed the role AI must play in building trust, and talked about verticalization, rich data, and making the dialogue between technology and humans trustworthy.
Takeaways
- 🧠 Multimodal AI refers to machine learning models that can process images, video, and text, making it possible to understand many different forms of information.
- 👥 The panelists have each founded companies that apply multimodal AI in their own fields to deliver innovative services.
- 🔍 Avoca AI uses voice AI to build the most advanced receptionist, focusing especially on the home services industry.
- 🎥 Twelve Labs is developing multimodal AI for video understanding, interpreting the visual, audio, and text elements inside a video to produce a comprehensive representation of it.
- 📈 Lexi Mills introduced a foundation focused on digital ethics and digital forensics that uses AI tools to mine and verify data in abuse cases and move prosecutions forward.
- 🛠️ The multimodal AI space faces ethical as well as technical challenges, and how to overcome them was a central part of the discussion.
- 📱 Multimodal AI that works with video and voice can understand customers' emotions and nonverbal communication, enabling more effective conversations.
- 🌐 The companies that succeed in multimodal AI will be those that root themselves deeply in specific industries, collect data there, and build relationships of trust.
- 🔑 Trust is the key to AI's success; companies must not only make the technology more reliable but also make the dialogue between humans and AI trustworthy.
- 🚀 Multimodal AI is already being applied in many areas, with its value recognized across verticals such as video editing, sports, healthcare, and learning.
- 🌟 Going forward, larger-context models will emerge that can hold an entire long conversation in memory and enable more natural dialogue.
Q & A
What kind of technology is multimodal AI?
-Multimodal AI refers to machine learning models that can process many forms of information, such as images, video, and text. This makes richer integration and understanding of information possible.
What kind of company is Avoca AI?
-Avoca AI was founded by MIT alumni and uses voice AI to build the most advanced receptionist, focusing on antiquated industries such as electricians, plumbers, and HVAC.
What is the most compelling capability of multimodal AI?
-The ability to incorporate the emotional side of communication. Understanding the nonverbal elements needed for sales and persuasion is also important.
What kind of company is Twelve Labs?
-Twelve Labs develops multimodal AI for video understanding. It interprets the visual, audio, and text elements inside a video and builds a comprehensive representation of the video.
What are the major challenges in video understanding?
-Handling video data is very hard because of its temporal dimension, how objects move over time, and keeping the visual, audio, and text elements consistent with one another.
How does Lexi Mills think about digital ethics?
-Lexi Mills runs a foundation focused on digital ethics and uses AI tools to mine data and gather evidence in abuse cases, making it possible to prosecute cases that previously would have been too hard to evidence.
How does Avoca AI improve sales?
-Avoca AI uses AI to understand the emotional side of sales and human nature, improving communication with customers.
How is Twelve Labs' video understanding technology used?
-Twelve Labs' technology is used, for example, to search for specific moments in YouTube videos, for video editing, for creating trailers for new TV shows and films, and for security surveillance.
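The transcript describes this as semantic search over video embeddings rather than keyword search on transcripts. As a rough illustration of that idea only (not Twelve Labs' actual API), here is a minimal Python sketch that assumes hypothetical per-clip embedding vectors are already available:

```python
import numpy as np

# Hypothetical setup: each video clip has already been embedded into a vector
# by some multimodal video model; the real model and dimensions will differ.
clip_ids = ["intro_0-10s", "robot_scene_42-55s", "outro_120-130s"]
clip_embeddings = np.random.rand(3, 512)   # placeholder clip vectors
query_embedding = np.random.rand(512)      # embedding of "where does the robot appear?"

def semantic_search(query, clips, top_k=1):
    """Rank clips by cosine similarity to the query embedding."""
    clips_norm = clips / np.linalg.norm(clips, axis=1, keepdims=True)
    query_norm = query / np.linalg.norm(query)
    scores = clips_norm @ query_norm
    best = np.argsort(scores)[::-1][:top_k]
    return [(clip_ids[i], float(scores[i])) for i in best]

print(semantic_search(query_embedding, clip_embeddings))
```

In a real system the query and clip vectors would come from the same multimodal embedding model, and the ranking would usually run in a vector database rather than in NumPy.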
How has sentiment analysis evolved over the past few years?
-Text-based sentiment analysis has become much more accurate, but the more important development is models that work on the sound of the voice itself. They can now infer a speaker's emotional state from cues such as high or low vocal energy.
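As a very rough illustration of the vocal-energy cue mentioned here (a sketch with assumed frame sizes and an arbitrary threshold, not the models the panelists use), short-time RMS energy of an audio signal can be computed like this:

```python
import numpy as np

def short_time_energy(signal, frame_len=1024, hop=512):
    """Compute RMS energy per frame of a mono audio signal (values in [-1, 1])."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    return np.array([np.sqrt(np.mean(f ** 2)) for f in frames])

# Hypothetical 1-second, 16 kHz tone standing in for a phone-call snippet.
sr = 16000
t = np.linspace(0, 1, sr, endpoint=False)
signal = 0.5 * np.sin(2 * np.pi * 220 * t)

energy = short_time_energy(signal)
label = "high energy" if energy.mean() > 0.1 else "low energy"  # arbitrary threshold
print(f"mean RMS energy: {energy.mean():.3f} -> {label}")
```

Real voice-emotion models learn far richer acoustic features than a single energy threshold; this only shows the kind of signal they start from.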
Why is it important for AI to understand human emotions?
-Understanding human emotions is especially important in sales and customer service. It helps read a customer's interest and needs and decide how to steer the conversation.
What will companies need to succeed as multimodal AI becomes widely adopted?
-To succeed, companies will need deep verticalization into a specific industry, accumulation of high-quality data, trust, and positioning the technology as something that augments human capabilities.
Outlines
🤖 Defining multimodal AI and its outlook
This section covers the basic definition of multimodal AI and the areas where it can be applied. Multimodal AI refers to machine learning models that can process images, video, and text, and can handle other modalities as well. GPT-4 Vision is given as an example: feed it a photo of the ingredients you have on hand and it can create a recipe. The panelists also introduce themselves and describe the multimodal AI capability they each find most exciting.
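The recipe-from-a-photo example can be sketched roughly as follows, assuming the OpenAI Python SDK and a vision-capable chat model; the model name and image URL below are placeholders rather than details from the talk:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Ask a vision-capable model to propose a recipe from a photo of ingredients.
response = client.chat.completions.create(
    model="gpt-4o",  # placeholder; any vision-capable model would do
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Here are the ingredients I have. Suggest a recipe."},
            {"type": "image_url", "image_url": {"url": "https://example.com/my-fridge.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)
```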
🛠️ Applications of multimodal AI and ethical challenges
The second section goes deeper into applications of multimodal AI: developing AI that understands video data by interpreting the visual, audio, and text elements inside a video to obtain a comprehensive representation of it. It also raises the technical and ethical challenges of handling video data and touches on the many uses of the data that video analysis can produce.
🗣️ The role of multimodal AI in communication
The third section discusses the role of multimodal AI in communication. Beyond advances in text analysis, it stresses the importance of models that focus on non-text signals such as the energy in a human voice. It also notes how important it is for AI to understand the emotional side of a conversation and to be useful for sales and persuasion.
🔍 Ethical and legal issues around multimodal AI
The fourth section explores the ethical and legal issues around multimodal AI: the dispute over The New York Times' articles being used as AI training data, questions of bias, and copyright. Ethical concerns and the panelists' views on them are discussed, digging into the progress of the technology and the challenges that come with it.
🚀 Technical progress and challenges of multimodal AI
The fifth section covers the technical progress of multimodal AI and its current challenges. In the first 20 to 30 seconds of a call the AI can outperform a human, but as the conversation goes on it becomes harder to keep the dialogue natural. The section also touches on distrust of AI and the gap between technical progress and users' expectations and skepticism.
🌟 The future of multimodal AI and the keys to success
The final section discusses which companies will succeed as multimodal AI becomes more widely adopted. Trust, reliable technology, building customer relationships, going deep into specific industries, and high-quality data are cited as keys to success. The panelists also describe how their own companies are developing along those lines.
Keywords
💡Multimodal AI
💡Sentiment analysis
💡Video understanding
💡Digital ethics
💡Video transformers
💡Nonverbal communication
💡Telephony AI
💡Context models
💡Digital marketing
💡Ethical challenges
Highlights
Multimodal AI is defined as a machine learning model capable of processing images, videos, and text.
An example of multimodal AI is GPT-4 Vision, which can create a recipe from a picture of ingredients.
Avoca AI leverages voice AI to build advanced receptionists for antiquated industries like home services.
The emotional aspect of communication is crucial in sales and customer service, and AI is working to understand nonverbal cues.
Twelve Labs is building multimodal AI for video understanding, interpreting visual, speech, and text elements in videos.
Video understanding is challenging due to the temporal dimension and consistency between visual and speech elements.
Use cases for multimodal AI in video understanding include sports, media, entertainment, e-learning, security, and healthcare.
Lexi Mills discusses using AI tools for digital forensics to prosecute cases that were previously hard to evidence.
Aorv, co-founder of Avoca, emphasizes the potential of AI to improve sales by understanding human nature and emotions.
Twelve Labs helps identify specific moments in videos through semantic search on video embeddings.
Transformer models are changing video understanding in a way similar to how they transformed natural language processing.
AI is being trained to understand the energy and emotional tone of voice calls for better customer service.
Sentiment analysis has improved significantly, with models now focusing on the sound of human voice rather than just text.
Hume AI is working on giving AI emotional intelligence through a set of modalities.
Lexi discusses the ethical implications of analyzing individuals at a detailed emotional level and the importance of trust in AI.
James from Twelve Labs talks about the importance of having access to high-quality labeled video data for training AI models.
The panel agrees that trust, verticalization, and rich data are key to differentiating successful multimodal AI companies.
Transcripts
welcome welcome
welcome let's let's get started uh this
is the multimodal AI Revolution panel uh
we have a very exciting conversation for
you here today uh we'll be discussing
what's coming up next for multimodal AI
uh a quick definition for multimodal AI
uh is it's
a I think that worked actually all right
um so in this conversation we'll be
covering what multimodal AI is
whether it can augment human lives how
wearables can leverage it um and we'll
also discuss some ethical and Technical
challenges that surround the space uh a
quick definition for multimodal AI is
it's a machine learning model that's
capable of processing uh images videos
and text and it can do other forms of
modality as well uh an example of this
currently is using GPT-4 Vision where
you can give it a picture of the
ingredients you have access to and it
can create a recipe for you um so let's
get started let's do a quick round of
introductions uh talk about you know
introduce yourself what are you working
on and what is one ability of multimodal
AI that you find most exciting uh let's
start with you yeah absolutely uh well
uh yeah my name is Tyson uh I'm one of
the co-founders of Avoca AI uh it was
actually um a company that uh you know
my co-founder porv and I started a bit
over two years ago um we were students
uh at MIT uh over uh s eight years ago
um class of 2017 and actually did a lot
of research at the media lab so uh great
great to be back today uh avoka
leverages voice AI to build the world's
most advanced receptionist for um uh a
lot of uh antiquate Industries including
home services so these are electricians
plumbers HVAC the people that you
probably think would be the last people
to be using utilizing AI uh but uh yeah
that that that's kind of what we're
working on in a nutshell and uh in terms
of multimodal I think um the area that
I'm most excited about is the ability to
uh actually incorporate um uh the not
not just uh text but the emotional
aspect and understand because in the
world of uh HVAC and plumbing it's not
just customer service you're dealing
with you actually need to make a sale
and in order to make a sale and be
convincing um we need AI that cannot
just understand what people are saying
but the nonverbal stuff uh and the stuff
around what they actually mean uh
because that's actually u a lot of where
where the interesting parts
are James everyone uh I'm James uh
currently running uh DevEx and developer
relations at Twelve Labs and our company
building multimodal AI for video
understanding um so you know back to
Aush's definition of multimodal right just
like how you know a baby
trying to acquire knowledge they they
read the text they hear sounds feel the
emotion smell the odor right all these
different you know senses and modalities
coming in um you know we're trying to
build the type of AI models that start doing
the same interpreting you know the
visual element uh the speech element as
well as the text element inside the
video and then you know uh come up with
a comprehensive you know representation
of that video um and you know if you
think about the total data in the world
I'd say like more than 80% uh is
unstructured data and more than 80% of that is
actually video data and unlike text and
image video is very challenging um thing
to tackle because of the you know the
temporal Dimension how things move our
time um the consistency between you know
Visual and and and and speech and text
so we're trying to build the type of AI
that can tackle this challenging
technical problem um in terms of use
cases I think um we have a lot of you
know different verticals ranging from
sports to media and entertainment to
e-learning and even like security
surveillance Healthcare um you know
people building video search to to find
interesting moment in like a football
game or baseball game uh they they use
to to like quickly edit video to make
new TV shows or or movie trailers they
even use ours to to like you know find
uh weapons um violence on bodycam
footage of of uh you know the police so
I think you know any industry that
requires a lot of you know video data can
can benefit from you know video
understanding
hi uh can you hear me is it working
my name is Lexi Mills I'm a digital
communication specialist we focus on
emerging technology so anything where
there isn't a word or people aren't
searching for a word it's our job to
help people use the word understand it
um on the other side of what we do we
have a foundation that looks a lot at
digital ethics and more recently in
digital forensics so after the last 3
four years we've been using our skills
to in an inverse way to mine data and
information for different types of abuse
cases which are typically quite hard to
prosecute whereas now we get huge
amounts of data using free off-the-shelf
AI tools to be able to prosecute cases
that previously would have just slipped
under the radar due to lack of
evidence hey everyone my name is aorv um
I'm the other co-founder of Avoca that
Tyson previously mentioned um so yeah
just to recap we're like a receptionist
for these Home Service businesses I
think Tyson covered most of it on I
think what's exciting and what we're
actually working on I would say the big
thing to maybe emphasize around
where AI is headed now it's kind of like
we've always seen a lot of these
customer support startups working in AI
infiltrating so many different companies
while we are working on avoka and I
think what we're starting to see happen
more is how that can infiltrate sales
sales requires a lot more emotions a lot
it requires a lot more understanding of
the human nature and I I don't think AI
can do all of sales but it can
significantly improve it and so yeah
that's essentially what we're working on
it's quite exciting very exciting uh
James so you work for a company called
Twelve Labs uh it basically helps understand
video so right now if I go to YouTube
and I do a search it's semantic and I
try to find the exact transcript or find
the keyword but what your company does
is it finds Snippets from the video so I
can just ask and be like where was that
robot uh which which part of the movie
did that robot come in and Twelve Labs will
help me help me identify it is that
correct yeah that that's correct um I
think our hypothesis is like video
understanding has not evolved a lot over
the past decade like the way um
researchers tackled it is they built specific
computer vision models optimized for a very
specific task like you know keypoint
estimation object detection semantic
segmentation etc they generate like
metadata or text from the video and
then when they say perform search
they actually do keyword search or
metadata search based on that text or
transcript but like you know um that
cannot capture the visual element of the
video and also may be totally
disconnected from what's happening and
so with the rise of like Transformers and
the versatility of like multimodal data
um you know we can create basically
embeddings from this video which is like
a vector representation of the video
content and when you perform search you
actually do semantic search on that
video embedding space and the
result is much more holistic and um and
native to the way like uh you know
models learn so I think uh that's I
think the the future like just like how
you know uh Transformers transformed NLP
we're seeing the same thing happening uh
with video Transformers transforming
video understanding awesome and Tyson so
this ties to the work you guys are doing
where you're trying to identify
non-verbal communication I had a stat
somewhere that said 80% of communication
is non-verbal so the way I'm moving my
hands the way you're looking your your
facial features changing Etc is this the
video are you capturing the video footage
as well cuz I know right now you're
doing just voice calls but at at some
point you plan to capture video footage
as well use something like Twelve Labs uh
to get that visual context for emotional
intelligence yeah so right now we're
primarily um or almost exclusively uh
working uh purely in the voice realm
because um remember like most of our
customers are um you know antiquated
industry folks like home services and
they they unfortunately don't have the
luxury of getting their uh customers to
call in on zoom and so that it's all
purely uh phone
communication um but but even within
phone communication there's so much that
is not captured just simply from
transcribing that and analyzing the
words you know there's there's the
tonality um whether the the customer is
uh angry upset one thing that we're um
really keen on is um kind of
understanding you know at the beginning
of the call measuring the customer
sentiment and then seeing what the
customer sentiment is at the end of the
call and seeing what that Delta is and
that's a good metric for us to determine
whether or not we did a good job with
improving the customers day and uh you
know talking to them so this is for you
aor and for you Tyson as well um we've
had sentiment analysis for a while
right uh do you think the models now
have just made it 10 times better what
what is the difference you're seeing
with What's Happening Now versus you
know the sentiment analysis and natural
language stuff we had in 2015 yeah I
think maybe I can add to that I think
there's two things I think one I think
the analysis of text has definitely 10x
um in terms of our ability to do that in
several years but I think the more
important and bigger thing that's going
to be emerging is actually models that
do not even look at text they're focused
more on the sound that the human is
making on the other side and so I think
there's another company that's actually
exclusively focused on this like Hume um
you might want to check out but essentially what
you can do is you can actually train
models now that go off of a base layer
where they can go and actually hear what
the person is saying and be like okay is
this more likely to be high energy or is
this more likely to be low energy and I
think understanding that gives us like
the next wave of unlocking voice
applications yeah this is uh it's very
interesting cuz Hume and I we had a we
had a conversation after a hackathon
cuz I was trying to build this thing
which would give ChatGPT emotional
intelligence and I'm I'm using ChatGPT
to to teach me stuff but and I I teach
myself so I can when I'm talking to a
student I can tell this person's losing
interest or this is too hard for them
this is too easy uh and I can change the
content I'm I'm sharing with them
similarly I was trying to use my my
facial features as like a modality
which I can give to ChatGPT as a GPT
plugin uh and that's when I came across
Hume and the work they're doing is
around basically giving emotional
intelligence which is a like a whole set
of modalities to uh to AI do you think
that has would you be able to integrate
that into into work in your into your uh
startup and what are the implications of
that going to be like for you yeah I
think I mean tremendous I think you know
one of the things that you know even um
you know obviously the with a with a
voice agent um I think one of the most
common problems is that the voice agent
is is not able to actually understand
how the End customer is feeling and so
when it when it comes to you know
elevating the the level to to actually
sales when someone has you know 10
options for who they want to install a
new HVAC unit the the the the the cues
around you know are are are they
actually interested in buying do they do
they want to hear about all the upgrades
or do are they just someone that just
wants to get the cheapest option be able
to decipher that and then navigate the
conversation from there it is extremely
um important that is very interesting
and Lexi in your previous jobs you've
worked as a as a head of communications
and on your LinkedIn I saw this in your
bio and I thought this was incredible it
said Lexi combines technical search
knowledge with psychology to create
datadriven measurable communication
strategies that maximize influence on
human behavior you think tactics like
this where we're using more than what we
more than what we naturally know and
we're augmenting our life through AI is
going to have significant impact on on
human communication yes definitely you
know there's almost no AI that isn't
somewhat trained on internet data and
the thing is Google's objective is
primarily to give us what we want and as
fast as possible but what we want and
what we need can often be very different
and so I do a lot of work in debt
management I'm bonkers about how we
communicate around debt and when someone
types in get rid of debt they'll get
different search results compared to
someone who uses good language or good
grammar but that is admitting what level
of fear they are in at that point in
time and we could be regulating
advertising based on someone's emotional
state so they're making emotionally
intelligent decisions and not emotionally
deficient ones and if we take it a little bit
further if you think about something
like um lung cancer survival
statistics you're either researching
that because you're a researcher or
you're most likely researching that
because you know someone with it now
getting the statistics isn't super
helpful unless you have the context you
know there are several tests you need to
interpret that data getting the
information fast actually isn't even
giving you accurate information because
it's not giving you the context to
digest it and knowing who's online that
you could speak to that coming up first
giving you a warning actually this
information won't be helpful unless you
understand x and y means that people's
entire search Journey becomes more
intelligent and then we're going to be
looking at how we optimize for that
thereafter because the structure of the
internet feeds directly in to what we
see optimizing in certain llms it's
interesting do you from a from an ethic
standpoint do you think it is right to
be analyzing someone in this much level
of detail to where I'm getting you know
micro changes in their facial features
and I'm able to decipher what they might
be thinking deep down you think that's
ethical I think there are ethical
challenges to it but I think it's also
unethical to not be doing so right right
now a lot of the adtech is coded to take
advantage of emotional states that we
understand through language time of
search as well and so by not doing it we
have ethical concerns it's just that
we're already in that flow so we're not
questioning it we tend to question new
problems new challenges new technology
but actually a lot of the challenges we
see with new technology have existed
previously that's interesting um one of
the things I've been I I I'm guessing
you've been following the New York Times
suing uh OpenAI uh for using the text
from their articles to
train their models um you
anticipate issues with you know Marvel
Studios coming or Universal coming and
going to Sora open AI and being like
listen you used our content
to generate these videos you think that
could be be an issue starting with you I
think there are going to be issues I
mean the New York Times has quite an
issue with bias over the years and so I
think there are what we're seeing people
conveying as fears and what are the
underlying fears um if you there's a
great book called The Gray Lady Winked
which is about the historical bias
across the New York Times and we've seen
it in all news search is bias so
the content it's drawing for from is
also biased then you've got double bias
then you've got double bias scaled and
then you've got copyright issues
thereafter um from a business standpoint
if you want to be really prude and
vicious yeah you should probably stop
people learning from your content just
to protect anything else that could be
revealed within it not just to protect
your financial
interests but that has ethical
implications too yeah I I do think there
is going to be significant change in
how we uh process conversations and how
we make decisions uh James are you
seeing any interesting work coming out
in this space with with the new models
yeah uh that are that are being trained
right now or that are being released
yeah yeah for sure um and big company
obviously doing it right now right with
uh I think we have a couple of folks
from Gemini at the summit um and you
know we talk about GPT 4V Sora uh
Anthropic you know even Claude got vision
capabilities and then uh within the startup
atmosphere like like competing with us
such as Adept uh Reka you know I
think even Hugging Face started releasing like
you know open source vision language
models um in the academic open source
community I think the most popular one
is LLaVA um and they have a couple you
know multiple versions of that um and you
know I think there's there new research
coming out from Academia all the time
and you know uh people interested in
learning more just
uh trying to like checking out
conferences like you know CVPR or you
know ICML um yeah those are very
powerful and I think I think internally
at Twelve Labs we're also in the process of
building more and more video foundation
models video language models that can
enable like you know this interesting
use cases and I think the the best
feeling is like when you know developers
and user actually using our models for
real use cases and uh last year we hosted
a hackathon actually with ElevenLabs uh you
know a funny pairing ElevenLabs and Twelve Labs
but um we focused on multimodal AI and a
lot of people were building
interesting applications from you know
e-learning to you know social impact use
cases and uh that's actually how I got
connected with Aush because he's he was
on the winning team of that hackathon and
you know we got to stay in touch and uh
glad to to to see more of those and how
much the field has changed just over the past
like six months yeah it is uh so I I'll
go to Tyson and Aorv next and we
talk about the challenges but uh just
reflecting real quick uh James and I met
at that hackathon and uh we were trying
to I realized I was watching lecture
videos and I would tend to zone out a
fair bit and I realized different people
have different interests where uh you
know where there's where they're Focus
completely and certain areas where they
zone out uh so we used an EEG headset to
measure brain waves and build this
knowledge graph and then use
Transformers to literally make any part
of the lecture that's not exciting
exciting for the things you care about
uh and then that was enabled by 12 labs
and 11 Labs which made it easier for us
to uh generate voices so you know we had
Steve Jobs coming out and asking us a
question and being like hey are you
losing interest come back in and like
try to trying to bring us back in uh so
Aorv and uh Tyson you guys are using
this in production for your company what
are some some challenges you're seeing
right now which is you know that are
preventing you from making it uh highly
scalable where everyone else could use
it I think um maybe I can start there I
think I think right now with how voice
AI is um and with where we're at with
the product in terms of understanding
human emotions being able to be emotive
back and sound humanlike I think the
first 20 to 30 seconds of a conversation
can be very well done by an AI like the
AI can essentially understand the
human's problem understand what to do
next like should you be closing this
person on a sale should you be answering
some kind of question should you be
routing it to someone else that part I
think we're at a place where AI is
actually better than a human because the
AI will always pick up every call within
one second where I think we are seeing
challenges with us and I think generally
in the industry is the part after that
so for our use case for example pretend that
your sink just broke and it's just
flooding with water and you need to get
that repaired if you go call and you see
something that's robotic answering you
for a minute or two minutes you're going
to start getting very agitated you're
like okay please transfer me to a human
this is a serious issue I don't need I I
don't want to waste time talking to a
robot and I think that shift starts
happening after that 20 to 30 second
Mark and so what we need to see right
now is for the AI to be much smarter in
terms of understanding their services
understanding you know when technicians
can come out or when like how to
actually solve the End customer problem
and I think that change is still I think
like you still need to be a little bit
more understanding of human emotions
being able to empathize with the
customer um circling back with them and
so we're not quite there
yet is there anything you'd like to add to
this yeah yeah I think that's that's the
primary one I mean the the other one is
just um a lot of uh you know before I
started Avoca I was working at a
self-driving uh car company Nuro and
one of the innate biases or challenges
is that uh you know people have a
fundamental distrust of AI uh and so
even you know at Nuro we saw um you
there there were times where our uh like
miles per critical disengagement which is
like a self-driving kind of gold uh
golden metric you know there were times
where we were getting you know in certain um
you know uh areas uh that our MPCD
was better than the human average but
but people are still afraid because as
long as an AI makes a mistake you know
they're they're upset so we're running
into the same thing at Avoca where
sometimes the AI may be better at
solving their problem but because
they've had so many bad experiences with
um you know uh AI you know phone AI and
IVRs and stuff in the past they're
just starting at a baseline where
they're they have a fundamental distrust
and so it's you know you have to you
have to almost be much better in order
to get people to to change their
behavior that is interesting uh one of
the things you touched on was the the
lack of large context models right now
that can hold everything like let's say
a conversation's been going on for 10
minutes holding that in memory and you
know most I think uh GPT-4 is like 32k
Claude is 200k token size and now we
have Gemini 1.5 which is a million
tokens you think as these larger models
come out the space for AI wearables
becomes big because we have we'll be
able to hold all the conversations we're
having throughout the day in in in one
context maybe do like multiple
conversations back and forth uh do you
think that that is going to be the the
key solution for the problems yeah from
my point of view that's going to be huge
I don't think it's going to be
everything like so for example even with
Gemini with how big the context length
is generally being in the beginning of
the prompt or beginning of the context
is usually leads to higher accuracy and
there's a lot of things like that so I
think generally it's going to need to be
able to consume that information quite
well so context length is one but then
the depth to which you can actually
consume that context is a second but
that will be totally a game changer I I
do think that there's other aspects too
though like I think with human
conversation it's it's not just
something you can codify it's often like
a lot of the things are your brain
understanding like what if you think
about it as how a brain process a human
conversation there's a lot of
similarities from like you know whatever
many years you've had conversations with
humans that you pick up on that's like
emotional intelligence that part needs
to we need to figure out how to codify
that better and that's where multimodal
modality can be huge with like video
hearing sounds things like that but that
would be the next step how to actually
codify that properly into a context
awesome so we are coming up on time I'll
do one final question uh we're seeing a
lot of multimodal AI companies come up
uh and this is for everyone who go down
the down the the row um what is one as
it becomes widely adopted what do you
think will differentiate companies that
really succeed in the space and stay
around versus all the fluff that we
seeing uh you want to start off sure I
mean I think um you know many things I
think for us one of the bets that that
we have at aoka is um kind of a deep
verticalization and so you know by being
the company that is so ingrained in Home
Services we eventually develop a um you
know a moat on the types of data but
then also around the Integrations and
how we are able to um you know fit kind
of every single one of these uh needs
and then also you know the types of um
you know uh use cases and objections and
and paths that we are able to find we're
able to really fine-tune our models and
just uh you know serve this one
extremely niche industry um you know
super well yeah so my answer is probably
somewhat follows what what Tyson just said
um we see a lot of flashy demos on
on social but I think the applications
that actually you know generate revenue
and transforming Enterprise is going to
be embedded deeply in the workflow of
those organization um you know so I
think we got a lot of comparison with
companies like like Runway and and Pika
and other video generation you know
companies but uh we're actually doing
video understanding not video generation
and From perspective of like you know
video editors filmmakers our tools
actually augment you know their workflow
and help them you know do their job
better not actually replacing their job
right and so come up with that
positioning and you know make sure that
we we augment you know human capabilities
not replacing them is very important um
and the second part of you know
uh the the the moat is also around like
being a proprietary data set um I think um like
for video uh there's not a lot of open
source or you know openly available video
compared to like text images so I think
uh getting access to them and and more
importantly getting high quality labeled video
data is even more important how do we
like generate description label this
this video data given the you know the
challenges of dealing with you know the
temporal Dimension Etc so we invest a
lot of effort on you know video labeling
as well as the infrastructure to process
video efficiently uh and you know we have
already seen some very promising
results in the type of performance that
our models were able to uh produce uh given
that you know higher quality of video data we
collect I think going back to your point
about uh building up the trust and we
expect AI to perform perfectly and it
never will in its early stages I think
the firms that I see making good Headway
are the ones that are able to to
communicate that this is a process not
an event because they will Garner trust
based on truth and the beauty of
multimodal is that we have so many ways
to have that dialogue and the people
that choose to invest not just in
getting the technology to be more
reliable but getting their
Communications and their dialogue with
humans to be more reliable and allowing
them the context for where the techn
technology sits and goes will give them
Runway because we need that relationship
with human beings and Technology to
continue and for that to happen we need
to have
trust yeah I I think I definitely agree
with trust that's huge also
verticalization I think maybe one more
thing to add that does probably tie into
verticalization a bit is around data
like having very rich data that's
important for your customer is I think
essential so for example for us for aoka
the way we viewed is we got a lot of
data around sales how sales
conversations are happening and there's
so many nuances that are actually just
different in sales than it is in
customer support which is often what AI
models trained in the past that actually
makes it such that maybe the best
companies will be the ones that can
capture not only the Nuance between
sales versus customer support
conversations but also between your
company versus other companies like how
exactly do you handle this objection
like what is like the right steps to do
after that and things like that and so
that can only come from verticalization
and having customers in
trust awesome yeah uh I think one of the
key points everyone's touched on here
has been trust infrastructure all these
things have to be upgraded uh and as we
see this is this is just a start like we
are seeing a lot of variables come out
we just saw uh Humane uh release and
we've had other variables uh companies
announced that their own products are
coming out this is just going to be more
and more important and if something's
recording me 24/7 the trust factor and
that ability to like really augment my
life has to be present um you've just
had the chance to learn about multimodal
AI from some of the experts in the field
uh and these people are on the ground
they're working and they're building
stuff so they're very well up to date on
what's Happening uh so for that I'd love
you to just give them a huge round of
applause and thank you all for listening
to us thank you
[Applause]