MASSIVE Step Allowing AI Agents To Control Computers (MacOS, Windows, Linux)
Summary
TLDR: The video presents OS World, a project built for testing and evaluating the performance of AI agents in real computer environments. The project offers a robust environment spanning multiple operating systems, a way for agents to interact with that environment, and methods for measuring their effectiveness. OS World works with a wide range of applications and interfaces, giving agents data about where interface elements are and how to control them. The project also builds on XLang, a framework that translates natural-language instructions into code that can be executed in a given environment. Tests were run with several agents, and the results show that using an accessibility tree, or a screenshot combined with an accessibility tree, gives the best results. OS World is open source and could significantly improve how AI agents are tested and developed.
Takeaways
- 📈 OS World is designed to solve the problem of testing and evaluating the performance of AI agents in real computer environments.
- 🌐 OS World offers a diverse environment with multiple operating systems for AI agents to interact with.
- 🔍 The project provides a methodology and tooling for measuring and analyzing AI agent performance.
- 🏗️ An analogy with assembling IKEA furniture is used to explain the importance of understanding instructions and executing them in the real world.
- 🤖 AI agents must be able to perceive their environment through sensors, plan actions, and interact with the world around them.
- 📚 The project includes a research paper, a presentation, open-source code, and data, ensuring transparency and accessibility.
- 📋 369 real-world computer tasks were created, covering interaction with websites, desktop applications, and the file system.
- 🛠️ To evaluate task success, agents are given information about the state of the environment, the instructions, and their observations.
- 📷 Higher-resolution screenshots typically lead to better performance when agents rely on images alone.
- 📈 GPT-4 was the best agent in every mode except screenshot-only, where Gemini Pro V performed best.
- 🔧 OS World can serve as a foundation for improving how AI agents interact with operating systems and for making them more effective at real-world tasks.
Q & A
What is one of the biggest obstacles to testing AI agents?
-One of the main obstacles is the lack of a way to test AI agents and verify that they are working correctly, which is the only way to improve them.
Which project aims to solve the problem of testing AI agents?
-A project called OS World, developed jointly by the University of Hong Kong, Carnegie Mellon University, Salesforce Research, and the University of Waterloo, aims to solve the problem of testing AI agents.
What does the OS World project include?
-OS World includes a research paper, open-source code, data, and everything needed to test AI agents across multiple operating systems.
What are the benefits of open access to the project's code and data?
-Open access to the code and data lets anyone use and modify the project for their own research and improvements, accelerating progress in AI agent testing.
What problems does the OS World environment solve?
-OS World gives agents a robust environment with multiple operating systems, a way to interact with that environment, and a way to measure how well they perform.
What is assembling IKEA furniture from instructions used as an analogy for?
-Assembling IKEA furniture is used as an analogy for how humans take instructions and carry them out, which involves understanding the steps, performing the actions, and getting feedback.
What existing systems can be used for testing within the project?
-Existing systems such as LLMs (Large Language Models) and VLMs (Vision-Language Models) can be used for testing within the project.
What is XLang and how is it used in the project?
-XLang is a framework that translates natural-language instructions into code that can be executed in a given environment. In OS World it is used to turn instructions into actions that agents can carry out.
What tasks can agents perform in the OS World environment?
-Agents can perform multi-step computer tasks, including working with various web applications and interfaces, reading and writing files, and executing commands through both graphical and command-line interfaces.
What results came out of testing the different input modes?
-Testing showed that using an accessibility tree, or a screenshot combined with an accessibility tree, gives the best results. Higher-resolution screenshots also improve performance.
Which agent performed best during testing?
-GPT-4 performed best in every mode except screenshot-only, where Gemini Pro V achieved the best results.
Outlines
🔍 Introduction to OS World: a solution for testing AI agents
Paragraph 1 introduces OS World, a project created to solve the problem of testing and evaluating the performance of AI agents. The authors, from the University of Hong Kong, CMU, Salesforce Research, and the University of Waterloo, released not only a research paper but also open-source code and data. The core idea is to build an environment that lets agents interact with operating systems and applications, and to provide ways of measuring their effectiveness. The paragraph also covers the IKEA furniture-assembly analogy used to explain the importance of 'grounding' the steps of a task.
🤖 Fundamentals of intelligent agents and how they interact with environments
Paragraph 2 discusses the definition of intelligent agents: their ability to perceive the environment through sensors and act on it with effectors. It covers the different kinds of environments agents can operate in, including computers, mobile devices, and the physical world, and mentions the tools and programming languages used to interact with those environments. An important part of the project is XLang, a framework that translates natural-language instructions into code that can be executed in a given environment.
🛠️ Example tasks and how they are executed with OS World
Paragraph 3 describes how agents can perform complex computer tasks, including working with multiple applications and interfaces. OS World lets agents interact with operating systems and applications at several levels, including both graphical and command-line interfaces. It also explains how agents receive observations and generate instructions for interacting with the computer environment, and covers the kinds of actions agents can take, such as moving the mouse, pressing keys, and so on.
📊 Evaluating task execution and test results
Paragraph 4 focuses on methods for evaluating how well AI agents execute tasks. It describes the evaluation process: creating real computer tasks, annotating them with instructions, setting up the initial state, and writing scripts that check whether a task was completed. It also presents the results of testing various agents with different input modes, such as screenshots and accessibility trees, noting that the accessibility tree alone, or a screenshot combined with the accessibility tree, gives the best results.
Keywords
💡AI agents
💡AI agent testing
💡OS World
💡benchmarking
💡open source
💡intelligent agents
💡Markov decision process
💡XLang
💡autonomous agent tasks
💡OpenAI
💡accessibility tree
Highlights
OS World is a new project designed to address the benchmarking problem for AI agents.
The project includes a research paper, open-source code, and data, promoting transparency and collaboration.
OS World provides a robust environment for AI agents to interact with multiple operating systems and measure performance.
The project uses an analogy of Ikea furniture assembly to explain the importance of grounding instructions for successful task execution.
AI agents require grounding to translate instructions into actions, which is currently a challenge in digital environments like Mac OS or Windows.
Large Language Models (LLMs) and Vision-Language Models (VLMs) can be used for these tests, but have limitations without effective grounding.
An intelligent agent is defined as one that perceives its environment and acts rationally upon it, with autonomy and reactivity.
OS World introduces XLang, a tool that translates natural language instructions into executable code.
The project has created 369 real-world computer tasks for benchmarking, involving real web and desktop apps in open domains.
Tasks are evaluated using custom execution-based scripts that check if the task was completed as per the instructions.
The study found that using an accessibility tree or a screenshot plus the accessibility tree provides the best results for observation.
Higher screenshot resolution typically leads to improved performance in tasks that rely on visual input.
The OS World environment is significant for its ability to serve as a unified multimodal agent environment across operating systems.
The project's findings suggest that integrating OS World or similar environments into real-world use will be crucial for deploying effective AI agents.
The use of tools like Deepchecks is highlighted for evaluating, monitoring, and debugging LLM-based applications.
The OS World project is open for further exploration and potential tutorials, indicating its practical application and educational value.
The project's comprehensive approach to AI agent benchmarking is expected to contribute significantly to the field of AI development and testing.
Transcripts
today we have a really interesting
project one of the biggest hurdles for
AI agents is actually how to test them
how to know if they're doing things
correctly and that's really the only way
for them to improve but today there's
not really a way to do it consistently
and thoroughly until now a new project
called OS World aims to fix the
benchmarking problem for AI agents and
it's not only a research paper they also
release the code they release the data
as well basically everything is open source and I really appreciate that so
we're going to talk about it today I'm
going to tell you all about it and it's
super interesting so let's get into it
so first here's the research paper OSWorld benchmarking multimodal agents for open-ended tasks in real computer environments and it's out of the University of Hong Kong CMU Salesforce research and University of Waterloo and the project actually comes with a presentation which I think did a fantastic job of explaining what is
going on so I'll show you that in a
minute but the gist is to date we
haven't had a great way to Benchmark AI
agents to allow them to perform actions
in an environment and actually test and
get the results of how well they're
performing and that's what OS World aims to
fix it gives agents a robust environment
multiple operating systems a way to
interact with the environment and a way
to actually measure the performance so
first let's go over the slides because
this is such a great presentation I
think it sums it up really well so this
is by Tao Yu out of the University of
Hong Kong just came out a few weeks ago
so the first page shows Ikea furniture
assembly and it's trying to set up an
analogy for how humans take instructions
and actually execute those instructions
so on the left we have Ikea assembly
instructions and then on the right we
have the assembled chair but what
happens in between those two things
first on the left we have those
step-by-step plans but that's not enough
that's not actually enough just having
the step-by-step plans is not enough to
go assemble a chair we need grounding we
need to actually know how to take those
step-by-step instructions and execute
them and that execution step which also
includes getting feedback and perceiving
the world and in this example perceiving
the different steps of building the
chair are incredibly important for
actually executing the task successfully
and here they call that the grounding
so now let's look at a digital task we
have computer tasks in a digital world
task instruction how do I change my Mac
desktop background so we have the Mac OS
environment we have our control
instructions which is basically just
from the help on the Apple website and
then we have the final outcome of Mac OS
with new wallpaper but again how do we
get from just the instructions to the
final executed task we need grounding
and grounding in this case comes in the
form of a mouse and keyboard now
it's already difficult just based on
that to date I believe open interpreter
is probably the best at taking
instructions and being able to actually
control the computer and it's really
difficult to do so because you know Mac
is a closed system Windows is a closed
system and so basically what they do is
they typically take a screenshot of the
entire desktop then they put a grid over
it and then the large language model
tells the mouse and keyboard where to
move on that grid but it's all done
through accessibility features and it's
imprecise to say the least so it's
really a very inefficient way of
controlling a computer and let's see
what's next so can llms and VMS be used
for these tests so the answer is yes and
no according to this presentation so
let's ask chat GPT on the left side how
do I change my Mac desktop background
and chat GPT gives us perfect
instructions so step-by-step
instructions and then for real world Tas
it can't really help with an Ikea chair
right cuz you ask it how do I assemble
an Ikea chair and it gives you only the
most high level information about how to
do that so let's look at the yes let's
look at how to actually execute digital
instructions so again on the left we
have the step-by-step instructions on
how to change the Mac desktop background
and what we need is control instructions
how do I take the actual step-by-step
instructions and control the computer
and that grounding is the missing piece
because there is no really cut-and-dried
way to control a Mac desktop for example
again it's usually take a screenshot
place a grid over it and try to guess
what the coordinates are very imprecise
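(An editor's aside: here is a minimal sketch of the screenshot-plus-grid approach described above, assuming a hypothetical 10x10 grid; the cell an LLM names is mapped back to pixel coordinates with the real PyAutoGUI library. This illustrates the general technique, not any specific tool's actual code.)

```python
import pyautogui  # real library for screenshots and mouse control

GRID_ROWS, GRID_COLS = 10, 10  # hypothetical grid granularity

def cell_to_coords(row: int, col: int) -> tuple[int, int]:
    """Map a grid cell (as an LLM might name it) to the pixel at its center."""
    width, height = pyautogui.size()
    x = int((col + 0.5) * width / GRID_COLS)
    y = int((row + 0.5) * height / GRID_ROWS)
    return x, y

# take a full-desktop screenshot to send to the model ...
screenshot = pyautogui.screenshot()

# ... and suppose the model answered "click cell (3, 7)":
x, y = cell_to_coords(3, 7)
pyautogui.click(x, y)  # imprecise: the real target may not sit at the cell center
```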
and it says right here ChatGPT cannot execute tasks on your Mac by grounding plans into actions and as the second example the real-world example ChatGPT
also cannot generate step-by-step plans
without interacting in the environment
so basically without getting that
feedback and how's it going to get
feedback from The Real World environment
without a lot of sensors which we don't
really have right now then it's not
actually able to go give you really
solid instructions for how to do it now
before we get to the solution this
presentation talks about what are actual
agents so we have a user over here the
user gives an instruction the llm as an
agent is able to code actions it has a
bunch of actions it can do and it's
basically able to code it so here we
have SQL we have API calls we have web
and app control so actually being able
to control the desktop and even an
embodied AI in the form of a robot so we
can actually use large language models
to generate code that controls a robot
then we have the environment and that is
like the Mac OS or Windows but we have
more than that we have the data we have
websites we have apps we have mobile
desktop and we have the physical world
all the different environments that we
should be able to operate within then we
need to be able to gather observations
and place those back into the large
language model because this is
going to be an iterative Loop it needs
to plan it needs to perform and then it
needs to observe and use that
information to iterate once again and
then of course we have whatever tools we
want to use hugging face SQL python Etc
so here they say what is an intelligent
agent and I've never actually heard that
term before but let's take a look at
what it says so the definition is an
intelligent agent perceives its
environment via sensors and acts
rationally upon that environment with
its effectors a discrete agent receives
percepts one at a time and maps this
percept sequence to a sequence of
discrete actions so let's take a look at
this little funny looking chart that
they have here we have the agent the
agent can gather input through sensors
in the form of percepts then it can plan
and actually perform actions via its
effectors things that can actually
affect the environment so the properties
of an intelligent agent are it's
autonomous it's reactive to the
environment it's proactive goal-directed
and it interacts with other agents via the environment
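(To make the percept-to-action mapping concrete, here is a minimal, hypothetical sketch of the agent interface the slide describes: sensors produce percepts, the agent maps the percept sequence to discrete actions, and effectors apply them. The class and method names are illustrative, not from the OS World codebase.)

```python
from abc import ABC, abstractmethod

class IntelligentAgent(ABC):
    """An agent that perceives via sensors and acts via effectors."""

    def __init__(self):
        self.percept_history = []  # the percept sequence seen so far

    @abstractmethod
    def act(self, percept) -> str:
        """Map the percept sequence to the next discrete action."""

    def step(self, percept) -> str:
        self.percept_history.append(percept)
        return self.act(percept)

class ReflexAgent(IntelligentAgent):
    """A trivial reactive agent: responds to the latest percept only."""

    def act(self, percept) -> str:
        return "noop" if percept is None else f"handle:{percept}"

agent = ReflexAgent()
print(agent.step("dialog_opened"))  # -> handle:dialog_opened
```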
so this is all really cool and I keep thinking back to CrewAI and AutoGen and other AI agent frameworks because this feels very akin to that so let's look at some examples
of what this can be the environment can
be a computer a mobile device or the
physical world if you're using an
embodied AI agent then for the sensors
we can use camera screenshot ultrasonic
radar and now I'm kind of thinking of
Tesla autonomous vehicles acting as
agents then we have the agent itself
where the LLM or VLM is the brain and the effectors so that's the robot or the interpreter so here's an example of a
robotic physical world agent so we give
the instructions stack the blocks on the
empty bowl we have the code right here
so block name detect blocks detect
objects and basically stack them up this
is all the code necessary to do that
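(A hedged reconstruction of what such a block-stacking snippet could look like; detect_objects, pick, and place are hypothetical robot-API helpers stubbed out so the sketch runs standalone, since the slide's exact code is not reproduced here.)

```python
# hypothetical robot-control helpers, stubbed so the sketch runs standalone
def detect_objects(name: str) -> list[str]:
    """Stub for the perception call; a real robot API would return detections."""
    return [f"{name}_{i}" for i in range(3)] if name == "block" else [f"{name}_0"]

def pick(obj: str) -> None:
    print(f"pick {obj}")

def place(obj: str, target: str) -> None:
    print(f"place {obj} on {target}")

# "stack the blocks on the empty bowl" rendered as code an LLM might emit
blocks = detect_objects("block")
target = detect_objects("empty bowl")[0]
for block in blocks:
    pick(block)
    place(block, target)
    target = block  # the next block stacks on the one just placed
```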
that is in the actions so here are all
the different options for the actions
independent of the environment that
we're working within and here's again
the basic workflow so we've already
talked about this but it also says we
have to be able to interpret abstract
user instructions utilize tools and
expand capacities explore complex unseen
environments multi-step planning and
reasoning and follow feedback and self
debug these are all part of being an
agent all stuff that we've already seen
and so it seems one of their big innovations is this XLang which basically takes natural language instructions and translates that into code that can be executed in an environment and so here's the XLang website you can see very similar to what we've been seeing already and here's the XLang GitHub page where you can get the Open Agents project as well as OS World and these are all open source which is awesome
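(A minimal sketch of the natural-language-to-code idea XLang is built around, written against the real OpenAI Python client; the prompt wording and model name are assumptions, and this is not XLang's actual API.)

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def instruction_to_code(instruction: str) -> str:
    """Translate a natural-language instruction into executable Python."""
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed model name, for illustration
        messages=[
            {"role": "system",
             "content": "Return only Python code (pyautogui) that performs "
                        "the user's desktop task. No explanations."},
            {"role": "user", "content": instruction},
        ],
    )
    return response.choices[0].message.content

code = instruction_to_code("Change my desktop background to mountains.jpg")
print(code)  # review before executing, e.g. with exec(code)
```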
thanks to the sponsor of this video Deepchecks Deepchecks helps teams building LLM applications evaluate monitor and debug their LLM-based applications with Deepchecks you can release high-quality LLM apps quickly without compromising on testing imagine building a RAG-based chatbot application you do not want it to hallucinate or have inaccuracies hallucinations incorrect answers bias deviations from policy harmful content and more need to be detected explored and mitigated before and after your app goes live easily compare versions of your prompts and models to pentest your LLM-based app for undesired behavior to enhance your text annotation efforts with automated scoring or to monitor the actual quality of your LLM-based app in production it allows you to create custom properties and rules to evaluate your LLM applications based on your requirements Deepchecks supports RAG chatbots Q&A summarization text-to-SQL text-to-code and other content generation Deepchecks' LLM evaluation solution is currently available for free trials if you're building any kind of LLM-based application you should definitely check out Deepchecks I'll drop the link in the description below so you can get your free trial thanks again to Deepchecks and now back to the video so they've actually published a
bunch of work and projects recently they
have Instructor which is adapting to various agent environments by simply providing instructions Binder which is one of the first LLM-plus-tool-use studies Lemur open state-of-the-art LLMs for language agents Open Agents an open platform for language agents in the wild this is an agents project that I actually haven't tested yet I hadn't even heard of it before reading about it here we have Text2Reward which connects large language model agents to the
physical world and then OS World which
is what we're talking about today okay
so that's enough Theory let's actually
talk about how it's working so here's an
example so I zoomed in and we have
computer tasks often involve
multiple apps and interfaces so the
instruction example that is given here
update the bookkeeping sheet with my
recent transactions over the past few
days in the provided folder I am so
blown away that if not today eventually
and pretty soon I believe agents are
going to be able to take complex task
instructions like this and actually go
execute them on our behalf that's why
I'm excited about agent Frameworks
that's why I'm excited about the rabbit
device hopefully you're seeing the
potential of Agents so in this example
we have the operating system right here
they need to open up office they
actually need to open up different
images which contain receipts and then
they need to read the image look for the
different line items the different
prices and input those into the
spreadsheet this is very complex but how
do agents actually do that it is
incredibly difficult to do that in the
Mac OS environment in Windows
environments because there's no
grounding layer there's no ability to
take those instructions and actually
generate the instructions to interact
with the environment and so that's where
OS World comes in which is the first scalable real computer environment OS World
can serve as a unified multimodal agent
environment for evaluating open-ended
computer tasks that involve arbitrary
apps and interfaces across operating
systems So within this environment they
can operate any of the operating systems
they can operate any amount of
applications within that and they can
even operate the interfaces themselves
both the UI and the CLI and it's able to
provide observations to the agents the
agents are able to use grounding to
actually generate instructions for how
to interact with the computer
environment so let's look at what an
agent task includes an autonomous agent
task can be formalized as a partially observable Markov decision process so we have the state space so the current desktop environment we have the observation space which is the instruction the screenshot the a11y tree which I'll show you in a minute and then we have the action space what they can actually do so being able to click then we have the transition function and the reward function so when each task is generated we have this initial state that's located in this task config
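(As a math aside: a compact way to write down the formalization just described, hedged as one reading of the setup rather than the paper's exact notation.)

```latex
% Task as a partially observable Markov decision process (POMDP):
% S : state space        -- the full desktop/VM state
% O : observation space  -- instruction + screenshot and/or a11y tree
% A : action space       -- mouse/keyboard operations (click, type, ...)
% T : transition function -- executing an action changes the state
% R : sparse reward from the task's execution-based evaluation script
\[
  \mathcal{M} = (\mathcal{S}, \mathcal{O}, \mathcal{A}, T, R),\qquad
  T : \mathcal{S}\times\mathcal{A}\to\mathcal{S},\qquad
  R : \mathcal{S}\to\{0,1\}
\]
```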
so here's the instructions right here we
have the config which is the current
state we have the evaluator so basically
how do we tell if the task is completed
or not we have the result compared to
the expected and then we have the
function with different options so how
does it actually get the observations
well there's really a few ways that they
describe here we have the set of marks
and the accessibility tree the set of
marks is kind of like a grid format it
basically just tells it how to click on
different objects within the screen and
this is very akin to how open interpreter
works today it basically has a grid and
it decides where to click now rather
than figuring out what the grid is and
figuring out where each button is within
the grid that is the point of os world
it is actually telling the language
model where everything is and then we
have the accessibility tree which is
basically a code version of that so the AI agent generates an action which results in a new state and then a new observation
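(To make the accessibility-tree idea concrete: a minimal sketch that walks a toy tree, represented here as nested dicts, to find a named element and its on-screen position. The structure is an assumption for illustration; real a11y trees come from the OS accessibility APIs.)

```python
# toy accessibility tree: each node has a role, name, screen coords, children
a11y_tree = {
    "role": "window", "name": "Settings", "coords": (0, 0),
    "children": [
        {"role": "button", "name": "Wallpaper", "coords": (412, 305), "children": []},
    ],
}

def find_element(node: dict, name: str) -> dict | None:
    """Depth-first search for an element by its accessible name."""
    if node["name"] == name:
        return node
    for child in node["children"]:
        hit = find_element(child, name)
        if hit is not None:
            return hit
    return None

target = find_element(a11y_tree, "Wallpaper")
if target:
    print("click at", target["coords"])  # exact coords, no grid guessing needed
```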
so here is an example of the actual interaction with the environment we have the mouse moving we have clicking we have writing text we have pressing the keyboard we have using hotkeys scrolling dragging key up key down waiting failing done so that is how it actually interacts with the environment
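(The action space described above maps naturally onto PyAutoGUI calls; a short sketch using the real library, with the concrete targets made up for illustration.)

```python
import pyautogui

pyautogui.moveTo(500, 300)                # move the mouse
pyautogui.click()                         # click
pyautogui.write("hello world")            # type text (hypothetical payload)
pyautogui.press("enter")                  # press a single key
pyautogui.hotkey("ctrl", "s")             # hotkey combination
pyautogui.scroll(-200)                    # scroll down
pyautogui.dragTo(700, 400, duration=0.5)  # drag
pyautogui.keyDown("shift")                # key down ...
pyautogui.keyUp("shift")                  # ... key up
# WAIT / FAIL / DONE are benchmark control signals, not mouse/keyboard calls
```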
so how are the task executions actually evaluated and that's what we're seeing here so we have the task instruction as an example we have this initial state can you help me clean my computer by getting rid of all the tracking things that Amazon might have saved so the evaluation script which I guess is just a simplified version an example version of it is it actually checks it grabs the cookies and sees does amazon.com have any cookies left and if not it passed and if so it failed then over here we have rename sheet one to L's resources then make a copy of it place the copy before sheet two rename it by appending etc etc and again it's simply checking whether it was done or not so this is a great way to benchmark in a really accurate way
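(A hedged sketch of what such an execution-based check could look like for the Amazon-cookies task, with a made-up task-config dict and a stubbed cookie lookup; the real evaluators query the actual VM state.)

```python
# hypothetical task config in the spirit of the example above
task_config = {
    "instruction": "Clean my computer by removing anything Amazon is tracking",
    "evaluator": "no_amazon_cookies",
}

def get_cookies_for_domain(domain: str) -> list[str]:
    """Stub: a real evaluator would read the browser's cookie store in the VM."""
    return []  # pretend the agent already cleared them

def no_amazon_cookies() -> bool:
    """Execution-based check: pass iff amazon.com left no cookies behind."""
    return len(get_cookies_for_domain("amazon.com")) == 0

result = "PASS" if no_amazon_cookies() else "FAIL"
print(task_config["instruction"], "->", result)
```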
so they created 369 real-world computer tasks that involve real web and desktop apps in open domains they use OS file reading and writing they do multi-app workflows through both the GUI and the command line and each example task is carefully annotated with real-world task instructions from real users an initial state setup config to simulate human work in progress and a custom execution-based evaluation script
so let's actually look at the prompt this is what the actual prompt looks like so they tested it against CogAgent which I had not heard of GPT-4 Gemini Pro Claude 3 as agents then we have the prompt details which we're seeing over here so you're an agent which follows my instructions and performs desktop computer tasks as instructed you have good knowledge etc etc you are required to use PyAutoGUI to perform the action grounded to the observation return one line or multiple lines of Python code to perform each of the actions you need to specify the coordinates yourself based on the current observation and here's a password you can use so really just a pretty thorough prompt to give to the large language model the temperature of 1 which I thought was interesting because that means that it's going to be the most creative basically and a top-p of 0.9 now I would think they would want to keep the temperature really low I'm not actually sure why they decided to keep the temperature at 1 and then they also provide the most recent three observations and actions as history context for each step basically helping the large language model understand what has come before and that will inform what it needs to do going forward
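(A minimal sketch of the sampling setup and three-step history window described above, written against the real OpenAI Python client; the model name, prompt text, and observation/execution stubs are assumptions, not the paper's exact agent code.)

```python
from openai import OpenAI

client = OpenAI()
SYSTEM = ("You are an agent which follows my instructions and performs desktop "
          "computer tasks. Return only pyautogui Python code for each action.")

history: list[dict] = []  # alternating observation/action messages

def step(observation: str) -> str:
    """One agent step: the last three observation/action pairs as context."""
    messages = [{"role": "system", "content": SYSTEM}]
    messages += history[-6:]  # three (observation, action) pairs = six messages
    messages.append({"role": "user", "content": observation})
    resp = client.chat.completions.create(
        model="gpt-4o",       # assumed; the paper's agents include GPT-4
        messages=messages,
        temperature=1.0,      # the setting the video questions
        top_p=0.9,
    )
    action = resp.choices[0].message.content
    history.extend([{"role": "user", "content": observation},
                    {"role": "assistant", "content": action}])
    return action  # e.g. executed via exec() inside the sandboxed VM
```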
and as input settings as we've already talked about they have set of marks and accessibility tree and they actually have four different versions of it they have accessibility tree only screenshot only screenshot plus accessibility tree and set of marks now
let's look at the results so on the left
we have the different input modes so the
a11y tree which is the accessibility
tree we have the screenshot we have the
accessibility tree plus screenshot and
then the set of marks so what we have
found is first of all GPT-4 across the
board has been the winner The Only
Exception is with screenshot only mode
which in Gemini Pro V did the best and
it seems that either the accessibility
tree or using a screenshot plus the
accessibility tree are giving the best
result with really the accessibility
tree alone being really the winner of
the best way to give observation to the
large language model the set of marks
actually also works pretty well the
screenshot alone does not and that's
interesting because I still believe open
interpreter the way that they're
actually interacting with computers is
by screenshot now if you're trying to
deploy agents to a real world
environment consumers are not going to
have OS World installed in their
machines or expose OS world as an
environment so what we're probably going
to have to do in the long run is build into operating systems a way for
agents to interact with them more
effectively and one interesting Insight
that they had is higher screenshot
resolution typically leads to improved
performance so if they're just doing
screenshots you can see right here the
success rate and percentage increases as
the resolution of that screenshot
improves so that's it that is the OS
World project I really like it I
appreciate that it's going to allow us
to Benchmark agents testing and actually
having results is the only way to
improve anything I'm thinking about
actually setting up OS world on my own
machine testing it out maybe I'll create
a tutorial from it if you want to see
that let me know in the comments below
if you liked this video please consider
giving a like and subscribe and I'll see
you in the next one