"VoT" Gives LLMs Spacial Reasoning AND Open-Source "Large Action Model"
Summary
TLDR: Microsoft released an open-source large action model, PyWinAssistant, which controls applications in a Windows environment through natural language. The work builds on a white paper, "Visualization-of-Thought Elicits Spatial Reasoning in Large Language Models," which describes how Visualization-of-Thought (VoT) prompting can strengthen the spatial reasoning of large language models. Spatial reasoning is an area where LLMs have historically performed poorly, but this research shows that VoT prompting can significantly improve their performance on spatial reasoning tasks. PyWinAssistant demonstrates how complex tasks, such as posting on Twitter or navigating a web page, can be carried out from natural language instructions alone, without any visual context. The technique not only opens up new possibilities for LLMs but also reveals their potential for spatial reasoning.
Takeaways
- 📄 Microsoft released a research paper and an open-source project showing how applications in a Windows environment can be controlled through natural language.
- 🤖 The technique, called Visualization of Thought (VoT), is designed to improve the spatial reasoning of large language models.
- 🧠 Spatial reasoning is the ability to picture the relationships between objects in a 3D or 2D environment, an area where LLMs have historically performed poorly.
- 🚀 With Visualization-of-Thought prompting, LLMs can form internal mental images that strengthen their spatial reasoning.
- 🔍 VoT uses zero-shot prompting, rather than few-shot demonstrations or text-to-image visualization, to evaluate its effect on spatial reasoning.
- 📈 With VoT prompting, LLMs achieve significant performance gains on natural language navigation, visual navigation, and visual tiling tasks.
- 📊 Experiments show that LLMs prompted with VoT outperform models without visualization across multiple tasks.
- 📝 The technique visualizes the state after each reasoning step, generating interleaved reasoning traces and visualizations that enhance the model's spatial awareness.
- 📚 The paper notes that although VoT prompting is effective, it relies on the capabilities of advanced LLMs and may cause performance degradation in less advanced models or on more challenging tasks.
- 🌐 Microsoft's open-source project, PyWinAssistant, is the first open-source large action model able to control full human user interfaces using only natural language.
- 💡 PyWinAssistant shows VoT-style prompting applied to controlling a real Windows environment, such as opening a browser, navigating to a specific website, and entering text.
- 🔗 To learn more, read Microsoft's research paper or try out the PyWinAssistant open-source project.
Q & A
What is spatial reasoning?
-Spatial reasoning is the ability to visualize the relationships between objects in a 3D or 2D environment. It is a key aspect of human cognition, covering the understanding of, and reasoning about, the spatial relationships among objects, their movements, and their interactions.
Why is spatial reasoning a challenge for large language models?
-Spatial reasoning is a challenge for large language models because they have traditionally performed poorly on tasks that require visualization or spatial manipulation. They work with text alone, while spatial reasoning requires creating and manipulating visual images in a "mind's eye," the way humans do.
What open-source project did Microsoft release, and how does it relate to spatial reasoning?
-The open-source project is PyWinAssistant, the first open-source large action model built to control full human user interfaces using only natural language. It applies the paper's Visualization of Thought technique, visualizing the reasoning process at each step to strengthen the model's spatial reasoning.
What is Visualization of Thought (VoT)?
-Visualization of Thought is a prompting technique that asks a large language model to produce an internal visuospatial sketch at each step of a task, visualizing its reasoning and using it to guide subsequent steps. This helps the model reason about space more effectively. A minimal prompt sketch is shown below.
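For illustration, here is a minimal sketch of what zero-shot VoT prompting could look like in code. The instruction string follows the wording described in the video ("visualize the state after each reasoning step"); the OpenAI client usage and model name are assumptions for the sketch, not the paper's actual evaluation harness.

```python
# Minimal sketch of zero-shot Visualization-of-Thought (VoT) prompting.
# The VoT instruction follows the paper's idea; the client call and model
# name are illustrative assumptions, not the authors' evaluation setup.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

VOT_INSTRUCTION = "Visualize the state after each reasoning step."

def vot_prompt(task_description: str) -> str:
    # Zero-shot: no few-shot demonstrations, just the task plus the VoT instruction.
    return f"{task_description}\n\n{VOT_INSTRUCTION}"

response = client.chat.completions.create(
    model="gpt-4",  # placeholder model name
    messages=[{"role": "user",
               "content": vot_prompt("Navigate from the house to the office in the grid below...")}],
)
print(response.choices[0].message.content)
```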
Which tasks were used to evaluate spatial reasoning?
-Three tasks requiring spatial awareness were used to evaluate the spatial reasoning of LLMs: natural language navigation, visual navigation, and visual tiling. These tasks challenge the model's spatial reasoning with 2D grid worlds that use special characters as an enriched text input format, as sketched below.
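The exact special-character syntax used in the paper is not shown in this summary, so the following is only an illustrative sketch of the idea: a 2D grid world rendered as text so a text-only LLM can take it as input. The emoji vocabulary and helper function are assumptions for the example.

```python
# Illustrative sketch: render a 2D grid world as text for a text-only LLM.
# The emoji vocabulary here is an assumption; the paper uses special
# characters as an enriched input format for visual navigation and tiling.
GRID = [
    ["🏠", "⬜", "⬜", "⬜"],  # 🏠 start (house), ⬜ empty square
    ["🚧", "🚧", "⬜", "🚧"],  # 🚧 obstacle
    ["⬜", "⬜", "⬜", "⬜"],
    ["🚧", "⬜", "🚧", "🏢"],  # 🏢 destination (office)
]

def grid_to_text(grid: list[list[str]]) -> str:
    """Render the grid row by row as a plain-text block for the prompt."""
    return "\n".join(" ".join(row) for row in grid)

prompt = (
    "Navigate from the house 🏠 to the office 🏢, moving only up, down, left,\n"
    "or right, and avoiding obstacles 🚧.\n"
    "Visualize the state after each reasoning step.\n\n"
    + grid_to_text(GRID)
)
print(prompt)
```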
How does PyWinAssistant control applications in a Windows environment?
-PyWinAssistant interprets the user's natural language instruction and then automatically executes a sequence of operations to control applications in the Windows environment. For example, it can open the Firefox browser, navigate to YouTube, search for specific content, or publish a new post on Twitter. A simplified sketch of the plan-then-execute idea is given below.
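PyWinAssistant's internal implementation is not shown here, so the following is only a rough, hypothetical sketch of the general loop the video describes: a language model plans the next UI action (coordinates to click or text to type), and a UI-automation call executes it. The `plan_next_action` helper is invented for the sketch; `pyautogui` is a real automation library used purely to illustrate the execution side.

```python
# Hypothetical sketch of the plan-then-execute loop behind a "large action
# model" style assistant. This is NOT PyWinAssistant's actual code.
import pyautogui  # real UI-automation library, used here only for illustration

def plan_next_action(instruction: str, screen_description: str) -> dict:
    """Invented placeholder: ask an LLM for the next UI action.

    Expected to return e.g. {"type": "click", "x": 412, "y": 88}
    or {"type": "type", "text": "Rick Roll"}.
    """
    raise NotImplementedError("illustrative placeholder for an LLM call")

def execute(action: dict) -> None:
    """Carry out a single planned action on the live desktop."""
    if action["type"] == "click":
        pyautogui.click(action["x"], action["y"])       # click at AI-decided coordinates
    elif action["type"] == "type":
        pyautogui.write(action["text"], interval=0.05)  # type into the focused field
    elif action["type"] == "press":
        pyautogui.press(action["key"])                  # e.g. press "enter"

# In the video, an instruction like "open Firefox, click on YouTube, type
# Rick Roll" is broken down into a sequence of such actions, with the
# current state visualized at each step.
```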
Why does PyWinAssistant visualize at every step?
-Visualizing at every step provides traceability for the reasoning process, similar to the idea behind Chain of Thought. It not only helps the model execute tasks more accurately, but also lets developers or users see how the model arrived at its final result step by step.
How does PyWinAssistant perform, and where does it excel?
-According to the reported results, GPT-4 with VoT prompting (visualizing at every step) performed best across all tasks, with completion and success rates significantly higher than models without visualization, particularly on route planning, next-step prediction, visual tiling, and natural language navigation.
What are PyWinAssistant's limitations?
-Its main limitation is that it depends on the capabilities of advanced large language models, so performance may degrade with weaker language models or on more complex tasks. It also has to understand and work with 2D space described in natural language rather than processing graphics or images directly.
How can you get and use PyWinAssistant?
-PyWinAssistant is an open-source project; users can obtain the source code from the relevant open-source community or Microsoft's official website. Following the provided documentation and guides, users can download, install, and run PyWinAssistant locally.
Beyond controlling the Windows environment, could PyWinAssistant be extended to other operating systems?
-Although PyWinAssistant is currently designed for Windows, its core concept of a natural-language control interface could in principle be extended to other operating systems. That would require adapting to, and building support for, each operating system's user interfaces and controls.
What might PyWinAssistant's future development look like?
-Likely directions include improving the model's spatial reasoning, extending support to more operating systems and applications, and improving the interaction experience. It could also be integrated into more complex automation systems or used to help visually impaired users operate a computer.
Outlines
📄 Spatial reasoning in large language models
This section introduces Microsoft's open-source release, which aims to bring to the Windows environment the kind of application control the Rabbit R1 offers on Android. Microsoft published not only a research paper but also an open-source project that users can download and use right away. The focus is a white paper, 'Visualization-of-Thought Elicits Spatial Reasoning in Large Language Models', which describes how to give LLMs spatial reasoning, an ability these models have historically been very poor at. Spatial reasoning means visualizing the relationships between objects in a 3D or 2D environment, a core capability LLMs have so far lacked. Microsoft's research shows how 'visualization of thought' prompting can elicit an LLM's 'Mind's Eye' and enable spatial reasoning.
🤖 LLM performance on spatial tasks
This section looks at how LLMs perform on spatial tasks, covering natural language navigation, visual navigation, and visual tiling. The researchers designed 2D grid worlds and used special characters as the input format to test the models' spatial reasoning. With Visualization-of-Thought (VoT) prompting, performance on these tasks improved markedly, showing that VoT is an effective way to strengthen spatial reasoning. The results show VoT prompting achieved the best performance across all tasks, confirming that the choice of prompting technique really does affect the outcome.
📈 GPT-4 performance under different prompting techniques
This section compares GPT-4 under different prompting techniques, including Chain of Thought and Visualization of Thought. In the experiments, VoT prompting achieved the best completion and success rates on route planning, next-step prediction, visual tiling, and natural language navigation. It also notes VoT's limitation: the technique relies on the capabilities of advanced LLMs and may degrade performance with less advanced language models or on more challenging tasks. A rough summary of the four prompt settings compared is sketched below.
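As a rough aid, the four GPT-4 prompt settings described later in the transcript can be summarized like this. The wording is paraphrased from the video, not copied from the paper's prompts.

```python
# Approximate summary of the four GPT-4 settings compared in the paper, as
# described in the video. Instruction wording is paraphrased, not verbatim.
PROMPT_SETTINGS = {
    "GPT-4 CoT":               "Let's think step by step.",
    "GPT-4 w/o visualization": "Let's think step by step (without visualizing).",
    "GPT-4V (image input)":    "Let's think step by step.",
    "GPT-4 VoT":               "Visualize the state after each reasoning step.",
}

for setting, instruction in PROMPT_SETTINGS.items():
    print(f"{setting:25s} -> {instruction}")
```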
🚀 PyWinAssistant: an open-source model for controlling the Windows environment
The final section introduces PyWinAssistant, an open-source project that uses natural language to control a Windows environment with a generalist AI model. A series of examples shows PyWinAssistant carrying out tasks from natural language instructions, such as opening the Firefox browser, clicking on YouTube, and searching for content. It also demonstrates creating a Twitter post, illustrating how the assistant visualizes each step while executing a task. These examples demonstrate PyWinAssistant's effectiveness, and viewers are encouraged to read the research paper and try the model.
Mindmap
Keywords
💡Large language model (LLM)
💡Spatial reasoning
💡Visualization of Thought (VoT)
💡PyWinAssistant
💡Natural language navigation
💡Visual navigation
💡Visual tiling
💡Chain of Thought (CoT)
💡Zero-shot prompting
💡Performance improvement
💡Artificial intelligence
Highlights
Microsoft released an open-source large action model, similar to how the Rabbit R1 can control applications in an Android environment using natural language.
The model, PyWinAssistant, is described as the first open-source generalist AI assistant able to control full human user interfaces using only natural language.
Microsoft's research paper, 'Visualization-of-Thought Elicits Spatial Reasoning in Large Language Models', shows how to give LLMs spatial reasoning.
Spatial reasoning is the ability to visualize the relationships between objects in a 3D or 2D environment, an area where LLMs have historically done poorly.
The paper proposes Visualization-of-Thought (VoT) prompting to elicit spatial reasoning in LLMs.
The researchers used three tasks to test spatial awareness: natural language navigation, visual navigation, and visual tiling.
VoT prompting visualizes the state after each reasoning step, significantly improving LLM performance on these tasks (see the trace sketch after this list).
The results show VoT prompting performed best on route planning, next-step prediction, visual tiling, and natural language navigation.
PyWinAssistant can carry out tasks such as opening the Firefox browser, clicking on YouTube, and searching for a Rick Roll video.
PyWinAssistant shows how complex task sequences can be executed in a Windows environment from natural language instructions.
The technique lets an LLM understand and operate on 2D space described in natural language, without any graphical understanding.
PyWinAssistant demonstrates VoT prompting's effectiveness in a practical application: it automatically iterates on prompts and generates test cases.
Although VoT prompting excels at spatial reasoning tasks, it relies on advanced LLMs and may hurt performance with less advanced models.
The research is accompanied by an open-source project that users can download and use immediately, showcasing the spatial reasoning potential of LLMs.
PyWinAssistant visualizes the path at every step, showing how it navigates toward a destination much like human spatial reasoning.
In the task of posting a new tweet on Twitter, the technique shows how a sequence of actions can be generated and executed from natural language instructions.
PyWinAssistant is a practical application of the theory proposed in Microsoft's research paper, demonstrating LLMs applied to user interface control.
The demo shows how an LLM, with no prior training on what is on the screen, can understand and execute on-screen actions in real time.
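To make the step-by-step visualization concrete, here is an invented example of the kind of interleaved trace VoT prompting asks the model to produce for a small visual navigation grid; the layout and wording are illustrative, not output from the paper.

```python
# Invented example of a VoT-style interleaved trace for visual navigation:
# after every move the model re-draws the grid with its current position (★).
EXAMPLE_VOT_TRACE = """
Step 1: move right
🏠 ★ ⬜
🚧 🚧 ⬜
⬜ ⬜ 🏢

Step 2: move right
🏠 ⬜ ★
🚧 🚧 ⬜
⬜ ⬜ 🏢

Step 3: move down
🏠 ⬜ ⬜
🚧 🚧 ★
⬜ ⬜ 🏢

Step 4: move down -> reached the office 🏢
"""
print(EXAMPLE_VOT_TRACE)
```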
Transcripts
today we have an open-source large action
model so very similar to how the rabbit
R1 can control applications within the
Android environment just by speaking
natural language we now have a
completely open-source version of that
for the windows environment released by
Microsoft so not only did Microsoft
release a research paper outlining how
they were able to achieve it they also
have an open-source project which you
can download and use right away and I'm
going to show you that today so first
let's go over the white paper this is
called visualization of thought elicits
spatial reasoning and large language
models and essentially what this paper
describes is a way to give large
language models spatial reasoning and if
you're not familiar with what spatial
reasoning means it's basically just the
ability to visualize the relationships
in a 3D environment or even a 2d
environment between different objects
and this is something that large
language models have historically done
really poorly and the lead of Meta AI
Yann LeCun has talked about this as
being a core missing feature of large
language models that will prevent us
from reaching AGI but in this paper they
show that it's actually possible to get
spatial reasoning out of large language
models so let me give you an example of
what spatial reasoning is in your mind
think about this you're standing at a
point on the North Pole and you start
walking and you walk 50 yards in one
direction then you turn left and then
you continue to walk indefinitely now
think about this if you continued
walking would you ever cross over that
initial point now you're doing all of
this spatial reasoning in your head
through what's called your mind's eye
language isn't really involved when
you're thinking through this problem and
that is what spatial reasoning is and
that is why Yann LeCun thinks spatial
reasoning is not possible with language
models alone but according to this paper
it definitely is so let me get into it
and remember stick around to after this
because I'm actually going to show it to
you in action in an open source project
so this is out of Microsoft research so
in the beginning it talks about how
large language models are really great
however their abilities in spatial
reasoning a crucial aspect of human
cognition remain relatively unexplored
humans possess a remarkable ability to
create mental images of unseen objects
and actions through a process known as
The Mind's Eye enabling the imagination
of the Unseen World inspired by this
cognitive capacity we propose
visualization of thought prompting and
I'm going to show you why this will
translate into a large action model
because right now it's called
visualization of thought but if we take
this technique and we apply it to a user
interface we can actually control that
user interface and that's essentially
what a large action model is so let's
look at this diagram this is what is
happening in the human mind we have
visuals we have verbal language we put
it all together in what is called The
Mind's Eye and then we put together a
mental image of whatever we're thinking
about now on the right side is what is
the Mind's Eye of large language
models so really we only have text
language we put it all together in what
is the large language models Mind's Eye
and then we come up with what is a
mental image so can we actually achieve
that with a large language model well
let's find out so here is conventional
prompting you have an input and then you
get an output and then we have more
advanced prompting techniques like Chain
of Thought So it's an input and then
walk me through thought by thought how
you get to the output and what we found
is when you use Chain of Thought
prompting and other prompting techniques
like reflection you actually improve the
performance of the large language model
pretty greatly actually then we have
visualization of thought we have the
input and then we ask it to have a
thought and to represent the
visualization at each step along the way
before we get to the output and this is
all theoretical I'm going to show you
actual examples of it in a second so
humans can enhance their spatial
awareness and inform Decisions by
creating mental images during the
spatial reasoning process similarly
large language models can create
internal mental images we propose the
visualization of thought prompting to
elicit The Mind's Eye of llms for
spatial reasoning so spatial reasoning
is super important in basically every
aspect of life whether you're driving
playing video games playing chess just
walking everything you're doing is using
spatial awareness as long as you're
interacting with your 3D World so let's
talk about visualization of thought vot
prompting to elicit this ability this
being spatial awareness this method
augments llms with a visual spatial
sketch pad to visualize their reasoning
steps and inform subsequent steps vot
adopts zero-shot prompting instead of
relying on few-shot demonstrations or
text-to-image visualization with CLIP to
evaluate the effectiveness of vot and
spatial reasoning we selected three
tasks that require spatial awareness in
llms including natural language
navigation visual navigation and visual
tiling and I'll explain what all three
of those things are we designed 2D grid
worlds using special characters as
enriched input formats for the llms in
visual navigation and visual tiling
tasks now remember large language models
can't interpret graphs like if we were
to put together a 2d tile and just pass
it to the large language model it
wouldn't really understand it we have to
represent that 2D space with natural
language and you'll see how they do it
so vot prompting proposed in this paper
consistently induces llms to visualize
the reasoning steps and inform
subsequent steps and consequently this
approach achieved significant
performance improvements on the
corresponding tasks so let's look at
this we have a bunch of 2D grids right
here and they're of different sizes and
they have different objects within them
so let's look at this k equals 2 so the
house is the starting point and the
office is the ending point and what
we're going to do is we're going to ask
the large language model to navigate
step by step from the house to the
office it's easy for humans to do this
right go right go right go up go up and
that's it and obviously we can get more
complicated but it's still super easy in
fact we don't really even need to go
step by step we can kind of just look at
it and go all the way through just by
thinking about it but if we had to we
could describe it up up left left up up
Etc but this is spatial awareness this
is spatial reasoning and this is very
difficult for large language models to
date but not anymore so spatial
reasoning refers to the ability to
comprehend and reason about the spatial
relationships among objects their
movements and interactions and these can
be applied in the context of technology
to navigation Robotics and autonomous
driving so here they say in this context
a square map is defined by a sequence of
random walk instructions along
corresponding objects denoted as and
then they actually just give the
algorithm to denote the graph and the
walking path then we have visual
navigation so visual navigation task
presents a synthetic 2D grid world to
llm challenging it to navigate using
visual cues the model must generate
navigation instructions to move in four
directions left right up down what we
just talked about to reach the
destination from the starting point
while avoiding obstacles this involves
two subtests route planning and Next
Step prediction requiring multihop
spatial reasoning while the former is
more complex and here is the formulation
of it so it's represented by a formula
rather than just passing in like an
image of that 2D grid then we have
visual tiling and that is what we're
seeing right here in these examples and
let me just talk about that for a second
polyomino tiling is a classic spatial
reasoning challenge we extend this
concept to test the llm's ability to
comprehend organize and reason with
shapes in a confined area so essentially
you have a grid with different colors
different shapes really and you are
tasked with finding a place for a new
object now if we just look at this we
can tell that within this grid right
here we can place
this red 4x1 or
1x4 object right here okay so that is
essentially what this test is
accomplishing now the really important
part of vot prompting is visualizing at
each step so it's kind of like Chain of
Thought we're not just saying okay do it
all at once it's I want to see a trace
of the path step by step as you go along
the way so we introduce vot prompting
and it just starts really simply
visualize the state after each reasoning
step this new paradigm for spatial
reasoning aims to generate reasoning
traces and visualizations in an
interleaved manner so let's look at the
one on the left first so this is visual
navigation we've already seen this so we
have the house right here and the llm is
supposed to navigate through all of
these empty squares so the ones with
gates in them cannot be navigated
through all the way down to the office
and what we're seeing down here is the LLM
doing that and doing it step by step
so step one move right step two move
down step three move left move down move
left move down and they reached it same
with visual tiling and what we're doing
is we provide it with this grid and
three different objects so 1x4 this is
essentially Tetris objects and we say
can you fit all of them into this grid
and so it says okay well let's look at I
where does that go then let's look at l
where does that go and then let's look
at T where does that go and then it is
able to accomplish that and get them all
in there and then here we have natural
language navigation so we describe a 3X3
grid and we tell it step by step what it
needs to do and we're actually giving it
the steps and then at the end we say
okay where are you what did you find and
so we're visualizing each step and the
one with stars on it is where the large
language model thinks it is in the
current state so step two it's w step
three it's c all the way up to step
seven s and so on and then finally we're
at C and so they tested four different
versions and they're using GPT-4 so
first GPT-4 with Chain of Thought so let's
think step by step GPT-4 without
visualization so don't use visualization
the techniques that we're talking about
today let's think step by step then GPT-4
with vision so the ability to interpret
what's in an image let's think step by
step and then GPT-4 with VoT so visualize
the state after each reasoning step now
let's look at the performance so as you
can see all the Bold across the board is
where it performed best so first for
route planning we have the completing
rate and we have GPT-4 with VoT as the
best then we have the success rate far
superior nearly 50% greater than the
second place GPT-4 without visualization
Next Step prediction visual tiling and
natural language navigation across the
board vot prompting technique just wins
it's really impressive so does that mean
that different prompting techniques
actually affect the outcome well yeah I
mean that's obvious right so what it
says here is in the setting GPT-4 CoT
Chain of Thought without explicit
visualization prompts it demonstrated
noticeable tracking rate across almost
all tasks except route planning the fact
implies that llm innately exhibit this
capability of visual State tracking when
spatial temporal simulation is necessary
for reasoning and in this figure we're
also seeing the difference between
asking it to visualize and output the
visualization at each step along the way
versus just at least one step so here is
the complete tracking rate which means
it's visualizing at every single step
route planning completely dominates for
Next Step prediction does a lot better
visual tiling and so on natural language
so this purple is GPT-4 with VoT on the
right side is partial tracking rate
which means at least one step had the
visualization and what we're seeing here
is similar results except for Next Step
prediction in which GPT-4 with CoT Chain
of Thought actually performs pretty darn
well so one last thing before I actually
show you the examples what are the
limitations so both mental images and
visual State tracking rely on the
emerging ability of advanced llms
therefore it might cause performance
deterioration in less Advanced language
models or more challenging tasks so here
is the project it's called PyWinAssistant
and it's described as the
first open source large action model
generalist artificial narrow
intelligence that controls completely
human user interfaces only by using
natural language so they reference this
paper this is actually how I found the
paper and it uses the same techniques to
control a Windows environment so they
give you this cute little character in
the right and you can essentially task
it with anything you want so let's look
at a few examples all right so what
we're going to be seeing is an example
in the windows environment we have this
little assistant right there and you can
tell it to do different things so the
first thing we're going to tell it or
the first thing that the video tells it
is to open Firefox open Firefox click on
YouTube click on YouTube so it's giving
it a series of things to do clicking onto the
element without visioning
context okay so it clicked on YouTube
Okay so let's take a look at actually
what's happening so you clicked on
the assistant you dragged me so that's
just the person dragging the little
assistant around then we say open
Firefox so it responds with clicking on
click on YouTube selected application
Mozilla Firefox then AI decision
coordinates it actually finds the
coordinates then it says clicking on the
search input and so on so let's keep
watching so there we go type Rick Roll
type Rick
Roll click on search click on search
clicking onto the element without
visioning context
click on the second video okay so we're
just telling it what to do and it's able
to do that this is essentially open
interpreter but it works really really
well clicking onto the element without
visioning
context and there we go so it was able
to do that I'm going to mute it because
I don't want to get copyright striked and
it's playing the video now so it's just
step by step telling it exactly what it
needs to do there it said to mute it so
it clicked on the mute button again it
has no training as to what is on the
screen or how to click it's figuring it
out as it goes and it's asking to
visualize it at each step so very
impressive all right so let's look at
this next example by the way this is an
awesome background so the user has given
it the instruction make a new post on
Twitter saying hello world and a brief
greeting explaining you're an artificial
intelligence and then here's the prompt
here's another prompt it is analyzing
what to do generating the test case and
then it actually interestingly iterates
on the prompt automatically and then it
says current status so that is where
it's representing what it currently
understands it's basically the
visualization at each step so let's keep
watching so add SPAC map click on what
is happening okay then it generates the
actions right here so step click on the
browser address bar enter twitter.com
wait for the Twitter homepage to load so
it's giving the entire set of actions it
needs to accomplish and it's going to go
through it step by step so it's actually
asking it to do the planning up front
well let's watch it so selected element
locate the address it shows the
coordinates of the address bar clicks on
it enters twitter.com there we go okay
found the address bar right there
entered the tweet and then hopefully
they're going to push post but here we
go we can see every single step along
the
way very cool so let's look at some of
the cases these are proven cases working
cases so open a new tab with the song
click on the button send a list of steps
to make a joke about engineers whilst
making it essay and so on and so forth
so it's actually a lot of really cool
implementations of this so I encourage
you to check this out read the research
paper if you're interested if you want
to see me do a full tutorial of PyWinAssistant
let me know in the comments
I'm happy to do that if you enjoyed this
video please give a like And subscribe
and I'll see you in the next one