"VoT" Gives LLMs Spacial Reasoning AND Open-Source "Large Action Model"

Matthew Berman
8 May 2024 · 16:25

Summary

TLDR: Microsoft has released an open-source large action model called PyWinAssistant, which uses natural language to control applications in the Windows environment. The work builds on a white paper titled "Visualization of Thought Elicits Spatial Reasoning in Large Language Models," which describes how Visualization-of-Thought (VoT) prompting strengthens the spatial reasoning abilities of large language models. Spatial reasoning is an area where these models have historically performed poorly, but the research shows that VoT prompting can significantly improve their performance on spatial reasoning tasks. PyWinAssistant demonstrates how complex tasks, such as posting on Twitter or navigating a web page, can be carried out from natural language instructions without any visual context. The technique showcases new possibilities for large language models and reveals their latent capacity for spatial reasoning.

Takeaways

  • 📄 Microsoft released a research paper along with an open-source project showing how applications in the Windows environment can be controlled with natural language.
  • 🤖 The technique is called "Visualization of Thought" (VoT) and is designed to improve the spatial reasoning abilities of large language models.
  • 🧠 Spatial reasoning means imagining the relationships between objects in a 3D or 2D environment, an area where large language models have historically performed poorly.
  • 🚀 With Visualization-of-Thought prompting, large language models can form internal mental images that strengthen their spatial reasoning.
  • 🔍 The technique is evaluated with zero-shot prompting rather than few-shot demonstrations or text-to-image visualization with CLIP.
  • 📈 With VoT prompting, large language models achieved significant performance gains on natural language navigation, visual navigation, and visual tiling tasks.
  • 📊 In the experiments, models using VoT prompting outperformed models that did not visualize their reasoning across multiple tasks.
  • 📝 The technique visualizes the state after each reasoning step, producing interleaved reasoning traces and visualizations that strengthen the model's spatial awareness.
  • 📚 The paper notes that although VoT prompting is effective, it relies on the capabilities of advanced large language models and may degrade performance in less advanced models or on harder tasks.
  • 🌐 Microsoft's open-source project, PyWinAssistant, is the first open-source large action model able to control completely human user interfaces using only natural language.
  • 💡 PyWinAssistant shows how VoT-style prompting can be applied to controlling a real Windows environment: opening a browser, navigating to a specific website, typing text, and more.
  • 🔗 To dig deeper, read Microsoft's research paper or try the PyWinAssistant open-source project.

Q & A

  • What is spatial reasoning?

    - Spatial reasoning is the ability to visualize the relationships between different objects in a three-dimensional or two-dimensional environment. It is an important aspect of human cognition, involving understanding and reasoning about the spatial relationships among objects, their movements, and their interactions.

  • Why is spatial reasoning a challenge for large language models?

    - Spatial reasoning is a challenge for large language models because they have traditionally performed poorly on tasks that require visualization or spatial manipulation. They work only with text, whereas spatial reasoning requires creating and manipulating visual images in the "mind's eye," the way humans do.

  • What is the open-source project Microsoft released, and how does it relate to spatial reasoning?

    - The open-source project is called PyWinAssistant, the first open-source large action model, built to control completely human user interfaces using only natural language. It applies the paper's "Visualization of Thought" technique, visualizing the reasoning process at every step to strengthen the model's spatial reasoning.

  • What is "Visualization of Thought" (VoT)?

    - Visualization of Thought is a prompting technique that asks a large language model to produce an internal visuospatial sketch at each step of a task, visualizing its reasoning and using that visualization to inform the next step. This helps the model reason about space much more effectively.
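
    To make this concrete, here is a minimal, hypothetical sketch of what a zero-shot VoT-style prompt could look like when sent through the OpenAI Python client. The grid encoding, instruction wording, and model name are illustrative assumptions rather than the paper's exact prompt; only the closing instruction ("visualize the state after each reasoning step") is taken from the video.

      from openai import OpenAI

      client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

      # Hypothetical 3x3 grid world encoded as text: S = start, D = destination,
      # '#' = obstacle, '.' = empty square. The paper uses special characters as
      # "enriched input formats"; this exact layout is only an illustration.
      grid = "S . .\n# # .\nD . ."

      # The core of VoT is a single zero-shot instruction asking the model to
      # redraw (visualize) the state after every reasoning step.
      prompt = (
          "You are navigating the 2D grid below. Moves: up, down, left, right.\n"
          f"{grid}\n"
          "Reach D from S while avoiding '#'.\n"
          "Visualize the state after each reasoning step."
      )

      response = client.chat.completions.create(
          model="gpt-4",  # illustrative model choice
          messages=[{"role": "user", "content": prompt}],
      )
      print(response.choices[0].message.content)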

  • Which tasks were used to evaluate spatial reasoning?

    - Three tasks requiring spatial awareness were used to evaluate the models' spatial reasoning: natural language navigation, visual navigation, and visual tiling. These tasks challenge the model's spatial reasoning through 2D grid worlds designed with special characters as the input format.
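
    Because the model receives no images, each grid world has to be serialized into plain text before it goes into the prompt. Below is a small illustrative sketch, not the paper's actual data generator, of how a visual-navigation grid might be rendered with special characters; the symbols chosen here are assumptions.

      def render_grid(rows: int, cols: int, start, goal, obstacles) -> str:
          """Render a 2D grid world as text, one character per cell.

          start and goal are (row, col) tuples; obstacles is an iterable of
          (row, col). Symbols (S, D, '#', '.') are illustrative assumptions.
          """
          cells = [["." for _ in range(cols)] for _ in range(rows)]
          for r, c in obstacles:
              cells[r][c] = "#"
          cells[start[0]][start[1]] = "S"
          cells[goal[0]][goal[1]] = "D"
          return "\n".join(" ".join(row) for row in cells)

      # Example: a 4x4 world with two obstacles, ready to paste into an LLM prompt.
      print(render_grid(4, 4, start=(0, 0), goal=(3, 3), obstacles=[(1, 1), (2, 2)]))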

  • How does PyWinAssistant control applications in the Windows environment?

    - PyWinAssistant takes the user's natural language instructions and then automatically carries out a sequence of actions to control applications in the Windows environment. For example, it can open the Firefox browser, navigate to YouTube, search for specific content, or publish a new post on Twitter.
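
    PyWinAssistant's own source code is not shown in the video, but the general pattern it describes (have the LLM decide which element to act on and at which screen coordinates, then click and type there) can be sketched with the real pyautogui library. Everything below, including the step schema, coordinates, and plan, is a hypothetical illustration, not PyWinAssistant's actual implementation.

      import time

      import pyautogui  # real library for programmatic mouse/keyboard control

      def perform(step: dict) -> None:
          """Execute one planned UI action. The step format is a made-up example
          of what an LLM planner might emit, not PyWinAssistant's schema."""
          if step["action"] == "click":
              pyautogui.click(step["x"], step["y"])             # click at the decided coordinates
          elif step["action"] == "type":
              pyautogui.typewrite(step["text"], interval=0.05)  # type text into the focused field
          elif step["action"] == "press":
              pyautogui.press(step["key"])                      # e.g. "enter"
          time.sleep(step.get("wait", 1.0))                     # give the UI time to settle

      # Hypothetical plan a language model might produce for "search YouTube":
      plan = [
          {"action": "click", "x": 640, "y": 60},               # address bar (example coordinates)
          {"action": "type", "text": "youtube.com"},
          {"action": "press", "key": "enter", "wait": 3.0},
      ]
      for step in plan:
          perform(step)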

  • Why does PyWinAssistant need to visualize at every step?

    - Visualizing at every step makes the reasoning process traceable, similar to the idea behind Chain of Thought. It not only helps the model execute the task more accurately, it also lets developers and users see how the model arrived at the final result step by step.

  • How does PyWinAssistant perform, and where does it excel?

    - According to the test results, GPT-4 with VoT prompting (i.e., visualizing at every step) performed best across all tasks, especially route planning, next-step prediction, visual tiling, and natural language navigation, where its success and completion rates were significantly higher than those of models that did not visualize.

  • What are PyWinAssistant's limitations?

    - Its main limitation is that it depends on the capabilities of advanced large language models, so performance may degrade with weaker models or on more complex tasks. It also has to understand and work with 2D space described in natural language rather than processing graphics or images directly.

  • How can you get and use PyWinAssistant?

    - PyWinAssistant is an open-source project; the source code is available from the relevant open-source community or from Microsoft's official site. Following the provided documentation and guides, you can download, install, and run PyWinAssistant locally.

  • Beyond the Windows environment, can PyWinAssistant be extended to other operating systems?

    - Although PyWinAssistant currently targets Windows, its core idea of controlling interfaces through natural language could in principle extend to other operating systems. Doing so would require adapting it to each operating system's user interface and controls.

  • What might PyWinAssistant's future development look like?

    - Future directions could include improving the model's spatial reasoning, supporting more operating systems and applications, and refining the user interaction experience. It could also be integrated into larger automation systems or used to help people with visual impairments operate a computer.

Outlines

00:00

📄 Spatial reasoning in large language models

This segment introduces Microsoft's open-source release, which aims to bring the kind of natural-language application control the Rabbit R1 offers in the Android environment to Windows. Microsoft not only published a research paper but also provides an open-source project that can be downloaded and used right away. The focus is a white paper titled 'Visualization of Thought Elicits Spatial Reasoning in Large Language Models', which describes how to give large language models spatial reasoning, something these models have historically done very poorly. Spatial reasoning means visualizing the relationships between objects in a 3D or 2D environment, a core capability large language models have lacked. Microsoft's research shows how 'Visualization of Thought' prompting can elicit an LLM's 'Mind's Eye' and enable spatial reasoning.

05:01

🤖 How large language models perform on spatial tasks

This segment digs into how large language models perform on spatial tasks, covering natural language navigation, visual navigation, and visual tiling. The researchers designed 2D grid worlds, using special characters as the input format, to test the models' spatial reasoning. Performance on these tasks improved markedly, showing that Visualization-of-Thought (VoT) prompting can effectively strengthen a model's spatial reasoning. The results show that VoT prompting achieved the best performance on every task, confirming that different prompting techniques really do affect the outcome.

10:03

📈 GPT-4 performance under different prompting techniques

This segment compares GPT-4's performance under different prompting techniques, including Chain of Thought and Visualization of Thought. The experiments show that VoT prompting achieved the best completion and success rates on route planning, next-step prediction, visual tiling, and natural language navigation. It also notes a limitation of VoT prompting: it relies on the capabilities of advanced large language models and may cause performance degradation in less advanced models or on more challenging tasks.

15:05

🚀 PyWinAssistant: an open-source model for controlling the Windows environment

The final segment introduces PyWinAssistant, an open-source project: a generalist AI model that controls the Windows environment using natural language. A series of examples shows how PyWinAssistant carries out tasks from natural language instructions, such as opening the Firefox browser, clicking on YouTube, and searching for content. It also shows the creation of a Twitter post, illustrating how PyWinAssistant visualizes each step while executing a task. These examples demonstrate its effectiveness, and viewers are encouraged to read the research paper and try the model themselves.

Keywords

💡 Large Language Models

Large language models are AI models with huge numbers of parameters that can process and generate natural language text. In the video, better spatial reasoning lets large language models understand and manipulate the relationships between objects in 3D or 2D environments. For example, the video notes that, even with no ability to interpret graphics, these models can solve problems in a 2D grid world from a natural language description alone.

💡 Spatial Reasoning

Spatial reasoning is the ability to visualize the relationships between objects in a three-dimensional or two-dimensional environment. The video explains that humans do this through the so-called "mind's eye," and that with Visualization of Thought, large language models can achieve a similar kind of spatial reasoning. One example from the video: starting at the North Pole, walk 50 yards, turn left, and keep walking; would you ever cross your starting point?

💡 Visualization of Thought (VoT)

Visualization of Thought (VoT) is a prompting technique that asks a large language model to generate an internal visual image at every reasoning step. This helps the model understand and manipulate spatial relationships, improving its performance on spatial reasoning tasks. The video shows VoT applied to visual navigation, visual tiling, and natural language navigation.

💡 PyWinAssistant

PyWinAssistant is described as the first open-source large action model released by Microsoft: a generalist artificial narrow intelligence that controls completely human user interfaces using only natural language. The video notes that PyWinAssistant uses the same techniques as the research paper to control the Windows environment, showing how a series of complex tasks can be carried out from natural language instructions alone.

💡 Natural Language Navigation

Natural language navigation uses natural language instructions to guide the model through a virtual space. In the video, a 3x3 grid is described and the model is given step-by-step instructions; the large language model understands and carries out the navigation task, moving from one square to another according to the instructions.
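
As a rough illustration of the task (the map, object names, and wording are assumptions, not the paper's dataset), the sketch below builds a step-by-step natural-language walk over a 3x3 grid and computes the square the walker should end on, which is the answer the model is asked to produce.

    # Hypothetical 3x3 map: each square is labeled with an object name.
    GRID = [
        ["house", "tree", "office"],
        ["lake",  "park", "shop"],
        ["gym",   "cafe", "bank"],
    ]
    MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

    def walk(start, steps):
        """Follow the step-by-step instructions and return the final square."""
        r, c = start
        for step in steps:
            dr, dc = MOVES[step]
            r, c = r + dr, c + dc
        return GRID[r][c]

    steps = ["right", "down", "down", "left"]
    prompt = (
        "You start at the house in the top-left corner of a 3x3 grid. "
        + " ".join(f"Step {i + 1}: move {s}." for i, s in enumerate(steps))
        + " Which object are you standing on now?"
    )
    print(prompt)
    print("expected answer:", walk((0, 0), steps))  # -> gym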

💡 Visual Navigation

Visual navigation asks the model to navigate using visual cues. In the video, the model is challenged to navigate a synthetic 2D grid world from visual cues while avoiding obstacles. This requires multi-hop spatial reasoning: the model must generate navigation instructions that reach the destination.

💡 Visual Tiling

Visual tiling is a classic spatial reasoning challenge that asks the model to understand and manipulate shapes within a confined area. In the video, the model is given a grid containing pieces of different colors and shapes and must find where a new object can be placed, for example fitting a red 4x1 or 1x4 piece into a given grid.
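
To ground the idea (the board contents and the piece here are made up for illustration), the kind of fit check the model is asked to perform in its head can be written in a few lines:

    # 0 = empty, 1 = already occupied. A made-up 4x5 board for illustration.
    board = [
        [1, 1, 0, 0, 0],
        [0, 1, 0, 1, 0],
        [0, 0, 0, 1, 0],
        [1, 0, 0, 0, 0],
    ]

    def fits_horizontal_bar(board, length=4):
        """Return every (row, col) where a 1 x `length` bar fits on empty cells."""
        spots = []
        for r, row in enumerate(board):
            for c in range(len(row) - length + 1):
                if all(row[c + i] == 0 for i in range(length)):
                    spots.append((r, c))
        return spots

    print(fits_horizontal_bar(board))  # -> [(3, 1)]: the bar fits in the bottom row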

💡 Chain of Thought

Chain of Thought is an advanced prompting technique that asks the model to lay out its reasoning step by step before producing the output. In the video it is used to improve large language model performance on spatial reasoning tasks. Compared with VoT, Chain of Thought does not require a visualization at every step; it only walks through the reasoning at the key steps.
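
In practice, the difference between the two techniques comes down to the instruction appended to the task. A minimal sketch of the two prompt variants follows; the task text is invented, and the two trailing instructions are the phrasings quoted in the video ("let's think step by step" for Chain of Thought, "visualize the state after each reasoning step" for VoT).

    task = "Navigate the grid below from S to D, avoiding '#'.\nS . .\n# # .\nD . ."

    # Chain of Thought: ask only for step-by-step reasoning.
    cot_prompt = task + "\nLet's think step by step."

    # Visualization of Thought: additionally ask the model to redraw the grid
    # (i.e., visualize the state) after every reasoning step.
    vot_prompt = task + "\nVisualize the state after each reasoning step."

    print(cot_prompt)
    print("---")
    print(vot_prompt)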

💡 Zero-Shot Prompting

Zero-shot prompting does not rely on a handful of examples or demonstrations; it lets the model complete a task without prior examples. In the video, VoT uses zero-shot prompting, meaning the model understands and manipulates spatial relationships purely from natural language descriptions, with no images or visual cues provided.

💡 Performance Gains

Performance gains refer to improving a model's effectiveness and accuracy by applying a particular technique or method. In the video, VoT prompting significantly improved large language model performance on spatial reasoning tasks, achieving better success rates on route planning, next-step prediction, visual tiling, and natural language navigation.

💡 Artificial Intelligence

Artificial intelligence refers to technology that lets machines emulate intelligent human behavior, including learning, reasoning, self-correction, and perception. In the video, this is demonstrated through PyWinAssistant, which interacts with people in natural language and carries out complex tasks such as posting on Twitter or navigating web pages.

Highlights

Microsoft released an open-source large action model: much as the Rabbit R1 controls applications in the Android environment through natural language, this one does the same for Windows.

The model is called PyWinAssistant, the first open-source generalist AI assistant able to control completely human user interfaces using only natural language.

Microsoft's research paper, "Visualization of Thought Elicits Spatial Reasoning in Large Language Models," shows how to give large language models spatial reasoning.

Spatial reasoning is the ability to visualize the relationships between objects in a 3D or 2D environment, an area where large language models have historically performed poorly.

The paper proposes that Visualization-of-Thought (VoT) prompting can elicit spatial reasoning in large language models.

The researchers used three tasks to test spatial awareness: natural language navigation, visual navigation, and visual tiling.

VoT prompting visualizes the state after every reasoning step, which significantly improved the models' performance on the corresponding tasks.

The results show VoT prompting performed best on route planning, next-step prediction, visual tiling, and natural language navigation.

PyWinAssistant can carry out a range of tasks, such as opening the Firefox browser, clicking on YouTube, and searching for a Rick Roll video.

PyWinAssistant shows how complex sequences of tasks can be executed in the Windows environment from natural language instructions.

The technique lets large language models understand and manipulate 2D space described in natural language, without any graphical understanding.

PyWinAssistant's implementation demonstrates the effectiveness of VoT-style prompting in practice; it can iterate on prompts automatically and generate test cases.

Although VoT prompting excels at spatial reasoning tasks, it relies on advanced large language models and may hurt performance in less capable ones.

The work comes with an open-source project that can be downloaded and used right away, showcasing the spatial reasoning potential of large language models.

PyWinAssistant visualizes the path at every step, showing how it navigates to the destination step by step, much like human spatial reasoning.

In the task of publishing a new post on Twitter, the technique shows how a sequence of actions is generated and executed from a natural language instruction.

PyWinAssistant is a practical application of the theory laid out in Microsoft's research paper, demonstrating large language models controlling user interfaces.

The demonstrations show how a large language model can understand and execute on-screen actions in real time, without prior training on what is on the screen.

Transcripts

[00:00] Today we have an open-source large action model. Very similar to how the Rabbit R1 can control applications within the Android environment just by speaking natural language, we now have a completely open-source version of that for the Windows environment, released by Microsoft. Not only did Microsoft release a research paper outlining how they were able to achieve it, they also have an open-source project which you can download and use right away, and I'm going to show you that today. So first let's go over the white paper. It's called "Visualization of Thought Elicits Spatial Reasoning in Large Language Models," and essentially what this paper describes is a way to give large language models spatial reasoning. If you're not familiar with what spatial reasoning means, it's basically just the ability to visualize the relationships in a 3D environment, or even a 2D environment, between different objects, and this is something that large language models have historically done really poorly. The lead of Meta AI, Yann LeCun, has talked about this as being a core missing feature of large language models that will prevent us from reaching AGI, but in this paper they show that it's actually possible to get spatial reasoning out of large language models.

[01:11] So let me give you an example of what spatial reasoning is. In your mind, think about this: you're standing at a point on the North Pole and you start walking. You walk 50 yards in one direction, then you turn left, and then you continue to walk indefinitely. Now think about this: if you continued walking, would you ever cross over that initial point? You're doing all of this spatial reasoning in your head through what's called your mind's eye; language isn't really involved when you're thinking through this problem. That is what spatial reasoning is, and that is why Yann LeCun thinks spatial reasoning is not possible with language models alone. But according to this paper, it definitely is. So let me get into it, and remember, stick around to after this, because I'm actually going to show it to you in action in an open-source project.

[02:00] So this is out of Microsoft Research. In the beginning it talks about how large language models are really great; however, their abilities in spatial reasoning, a crucial aspect of human cognition, remain relatively unexplored. Humans possess a remarkable ability to create mental images of unseen objects and actions through a process known as the mind's eye, enabling the imagination of the unseen world. Inspired by this cognitive capacity, we propose visualization-of-thought prompting. And I'm going to show you why this will translate into a large action model, because right now it's called visualization of thought, but if we take this technique and we apply it to a user interface, we can actually control that user interface, and that's essentially what a large action model is.

[02:42] So let's look at this diagram. This is what is happening in the human mind: we have visuals, we have verbal language, we put it all together in what is called the mind's eye, and then we put together a mental image of whatever we're thinking about. On the right side is the mind's eye of large language models: really we only have text language, we put it all together in what is the large language model's mind's eye, and then we come up with what is a mental image. So can we actually achieve that with a large language model? Well, let's find out.

[03:15] So here is conventional prompting: you have an input and then you get an output. Then we have more advanced prompting techniques like Chain of Thought, so it's an input and then "walk me through, thought by thought, how you get to the output," and what we found is when you use Chain-of-Thought prompting and other prompting techniques like reflection, you actually improve the performance of the large language model, pretty greatly actually. Then we have visualization of thought: we have the input, and then we ask it to have a thought and to represent the visualization at each step along the way before we get to the output. And this is all theoretical; I'm going to show you actual examples of it in a second.

[03:55] Humans can enhance their spatial awareness and inform decisions by creating mental images during the spatial reasoning process. Similarly, large language models can create internal mental images; we propose the visualization-of-thought prompting to elicit the mind's eye of LLMs for spatial reasoning. So spatial reasoning is super important in basically every aspect of life: whether you're driving, playing video games, playing chess, or just walking, everything you're doing is using spatial awareness as long as you're interacting with your 3D world.

[04:27] So let's talk about visualization-of-thought (VoT) prompting to elicit this ability, this being spatial awareness. This method augments LLMs with a visuospatial sketchpad to visualize their reasoning steps and inform subsequent steps. VoT adopts zero-shot prompting instead of relying on few-shot demonstrations or text-to-image visualization with CLIP. To evaluate the effectiveness of VoT in spatial reasoning, we selected three tasks that require spatial awareness in LLMs, including natural language navigation, visual navigation, and visual tiling, and I'll explain what all three of those things are. We designed 2D grid worlds using special characters as enriched input formats for the LLMs in visual navigation and visual tiling tasks. Now remember, large language models can't interpret graphs; if we were to put together a 2D tile and just pass it to the large language model, it wouldn't really understand it. We have to represent that 2D space with natural language, and you'll see how they do it. So VoT prompting, proposed in this paper, consistently induces LLMs to visualize the reasoning steps and inform subsequent steps, and consequently this approach achieved significant performance improvements on the corresponding tasks.

[05:39] So let's look at this: we have a bunch of 2D grids right here, they're of different sizes, and they have different objects within them. Let's look at this k equals 2: the house is the starting point and the office is the ending point, and what we're going to do is ask the large language model to navigate step by step from the house to the office. It's easy for humans to do this, right? Go right, go right, go up, go up, and that's it. Obviously we can get more complicated, but it's still super easy; in fact, we don't really even need to go step by step, we can kind of just look at it and go all the way through just by thinking about it. But if we had to, we could describe it: up, up, left, left, up, up, etc. This is spatial awareness, this is spatial reasoning, and this has been very difficult for large language models to date, but not anymore.

[06:27] So spatial reasoning refers to the ability to comprehend and reason about the spatial relationships among objects, their movements and interactions, and these can be applied, in the context of technology, to navigation, robotics, and autonomous driving. Here they say that in this context a square map is defined by a sequence of random walk instructions along corresponding objects, denoted as... and then they actually just give the algorithm to denote the graph and the walking path. Then we have visual navigation: the visual navigation task presents a synthetic 2D grid world to the LLM, challenging it to navigate using visual cues. The model must generate navigation instructions to move in four directions, left, right, up, down, what we just talked about, to reach the destination from the starting point while avoiding obstacles. This involves two subtests, route planning and next-step prediction, requiring multi-hop spatial reasoning, with the former being more complex. And here is the formulation of it, so it's represented by a formula rather than just passing in, like, an image of that 2D grid.

[07:29] Then we have visual tiling, and that is what we're seeing right here in these examples, so let me just talk about that for a second. Polyomino tiling is a classic spatial reasoning challenge; we extend this concept to test the LLM's ability to comprehend, organize, and reason with shapes in a confined area. So essentially you have a grid with different colors, different shapes really, and you are tasked with finding a place for a new object. Now if we just look at this, we can tell that within this grid right here we can place this red 4x1 or 1x4 object right here. Okay, so that is essentially what this test is accomplishing.

[08:11] Now the really important part of VoT prompting is visualizing at each step. So it's kind of like Chain of Thought: we're not just saying, okay, do it all at once; it's "I want to see a trace of the path, step by step, as you go along the way." So we introduce VoT prompting, and it just starts really simply: visualize the state after each reasoning step. This new paradigm for spatial reasoning aims to generate reasoning traces and visualizations in an interleaved manner.

[08:40] So let's look at the one on the left first. This is visual navigation; we've already seen this. We have the house right here, and the LLM is supposed to navigate through all of these empty squares (the ones with gates in them cannot be navigated through) all the way down to the office, and what we're seeing down here is the LLM doing that, and doing it step by step: step one, move right; step two, move down; step three, move left; move down, move left, move down, and they reached it. Same with visual tiling: we provide it with this grid and three different objects, so 1x4, these are essentially Tetris objects, and we say, can you fit all of them into this grid? And so it says, okay, well, let's look at I, where does that go; then let's look at L, where does that go; and then let's look at T, where does that go; and then it is able to accomplish that and get them all in there. And then here we have natural language navigation: we describe a 3x3 grid and we tell it step by step what it needs to do, we're actually giving it the steps, and then at the end we say, okay, where are you, what did you find? So we're visualizing each step, and the one with stars on it is where the large language model thinks it is in the current state: so step two it's W, step three it's C, all the way up to step seven, S, and so on, and then finally we're at C.

[10:02] So they tested four different versions, and they're using GPT-4. First, GPT-4 with Chain of Thought, so "let's think step by step." GPT-4 without visualization, so "don't use visualization" (the techniques that we're talking about today), "let's think step by step." Then GPT-4 with vision, so the ability to interpret what's in an image, "let's think step by step." And then GPT-4 with VoT, so "visualize the state after each reasoning step." Now let's look at the performance. As you can see, the bold across the board is where it performed best. First, for route planning we have the completing rate, and we have GPT-4 with VoT as the best. Then we have the success rate: far superior, nearly 50% greater than the second place, GPT-4 without visualization. Next-step prediction, visual tiling, and natural language navigation: across the board, the VoT prompting technique just wins. It's really impressive.

[11:06] So does that mean that different prompting techniques actually affect the outcome? Well, yeah, I mean, that's obvious, right? What it says here is: in the setting GPT-4 CoT, Chain of Thought without explicit visualization prompts, it demonstrated a noticeable tracking rate across almost all tasks except route planning. This fact implies that LLMs innately exhibit this capability of visual state tracking when spatiotemporal simulation is necessary for reasoning. And in this figure we're also seeing the difference between asking it to visualize and output the visualization at each step along the way versus just at least one step. So here is the complete tracking rate, which means it's visualizing at every single step: route planning completely dominates, next-step prediction does a lot better, visual tiling and so on, natural language. So this purple is GPT-4 with VoT. On the right side is the partial tracking rate, which means at least one step had the visualization, and what we're seeing here is similar results, except for next-step prediction, in which GPT-4 with CoT, Chain of Thought, actually performs pretty darn well.

[12:14] So one last thing before I actually show you the examples: what are the limitations? Both mental images and visual state tracking rely on the emerging ability of advanced LLMs; therefore it might cause performance deterioration in less advanced language models or more challenging tasks.

[12:31] So here is the project. It's called PyWinAssistant, and it's described as the first open-source large action model, a generalist artificial narrow intelligence that controls completely human user interfaces only by using natural language. They reference this paper, which is actually how I found the paper, and it uses the same techniques to control a Windows environment. They give you this cute little character on the right, and you can essentially task it with anything you want. So let's look at a few examples.

[12:58] All right, so what we're going to be seeing is an example in the Windows environment. We have this little assistant right there, and you can tell it to do different things. The first thing we're going to tell it, or the first thing that the video tells it, is to open Firefox: "open Firefox," "click on YouTube," "click on YouTube." So it's giving it a series of things to do, clicking onto the element without visioning context. Okay, so it clicked on YouTube. Let's take a look at what's actually happening: "you clicked on the assistant," "you dragged me," so that's just the person dragging the little assistant around. Then we say "open Firefox," so it responds with clicking on... "click on YouTube," selected application Mozilla Firefox, then AI decision coordinates, it actually finds the coordinates, then it says clicking on the search input, and so on. So let's keep watching. There we go: "type Rick Roll," "type Rick Roll," "click on search," "click on search," clicking onto the element without visioning context, "click on the second video." Okay, so we're just telling it what to do and it's able to do that. This is essentially Open Interpreter, but it works really, really well. Clicking onto the element without visioning context, and there we go, so it was able to do that. I'm going to mute it because I don't want to get copyright striked, and it's playing the video now. So it's just, step by step, telling it exactly what it needs to do. There, it said to mute it, so it clicked on the mute button. Again, it has no training as to what is on the screen or how to click; it's figuring it out as it goes, and it's asked to visualize it at each step. So, very impressive.

[14:39] All right, so let's look at this next example (by the way, this is an awesome background). The user has given it the instruction: make a new post on Twitter saying "hello world" and a brief greeting explaining you're an artificial intelligence. And then here's the prompt, here's another prompt: it is analyzing what to do, generating the test case, and then, interestingly, it actually iterates on the prompt automatically, and then it says "current status," so that is where it's representing what it currently understands. It's basically the visualization at each step. So let's keep watching: so, "add SPAC map, click on" — what is happening — okay, then it generates the actions right here. Step: click on the browser address bar; enter twitter.com; wait for the Twitter homepage to load. So it's giving the entire set of actions it needs to accomplish, and it's going to go through it step by step; it's actually asking it to do the planning up front. Well, let's watch it: selected element, locate the address, it shows the coordinates of the address bar, clicks on it, enters twitter.com, there we go, okay, found the address bar right there, entered the tweet, and then hopefully they're going to push post. But here we go, we can see every single step along the way. Very cool.

[15:54] So let's look at some of the cases. These are proven cases, working cases: "open a new tab with the song," "click on the button," "send a list of steps to make a joke about engineers whilst making it essay," and so on and so forth. So there are actually a lot of really cool implementations of this, so I encourage you to check this out and read the research paper if you're interested. If you want to see me do a full tutorial of PyWinAssistant, let me know in the comments; I'm happy to do that. If you enjoyed this video, please give a like and subscribe, and I'll see you in the next one.

Related Tags
Spatial reasoning, Large language models, Visualization of Thought, Microsoft Research, Open-source project, Navigation, Tiling puzzles, Artificial intelligence, Natural language, User interfaces, Tech demo