MASSIVE Step Allowing AI Agents To Control Computers (MacOS, Windows, Linux)

Matthew Berman
28 Apr 2024 · 19:09

Summary

TLDR: The video presents OS World, a project built for testing and benchmarking the performance of AI agents in real computer environments. The project offers a robust environment spanning several operating systems, a way for agents to interact with that environment, and methods for measuring their effectiveness. OS World works with a wide range of applications and interfaces, giving agents information about where interface elements are and how to control them. The project also relies on xLang, which translates natural-language instructions into code that can be executed in a given environment. Tests with several agents show that using the accessibility tree, or a screenshot combined with the accessibility tree, gives the best results. OS World is fully open source and could significantly improve how AI agents are tested and developed.

Takeaways

  • 📈 The OS World project addresses the problem of testing and benchmarking AI agents in real computer environments.
  • 🌐 OS World offers a rich environment spanning several operating systems for agents to interact with.
  • 🔍 The project provides a methodology and tooling for measuring and analyzing agent performance.
  • 🏗️ An IKEA furniture-assembly analogy explains why grounding instructions, that is, actually executing them in the real world, matters.
  • 🤖 AI agents must be able to perceive their environment through sensors, plan actions, and act on the world around them.
  • 📚 The project ships a research paper, a presentation, open-source code, and data, keeping everything transparent and accessible.
  • 📋 369 real-world computer tasks were created, covering interaction with websites, desktop applications, and the file system.
  • 🛠️ To judge whether a task succeeded, agents are given the environment state, instructions, and observations, and custom scripts check the outcome.
  • 📷 Higher-resolution screenshots generally lead to better performance when screenshots are the only input.
  • 📈 GPT-4 was the best agent in every mode except screenshot-only, where Gemini Pro Vision performed best.
  • 🔧 OS World can serve as a foundation for improving how AI agents interact with operating systems and how effective they are on real-world tasks.

Q & A

  • What is one of the biggest obstacles to testing AI agents?

    -The lack of a consistent way to test AI agents and verify they are working correctly, and testing is the only way to improve them.

  • Which project aims to solve the AI-agent testing problem?

    -A project called OS World, developed jointly by the University of Hong Kong, Carnegie Mellon University, Salesforce Research, and the University of Waterloo.

  • What does the OS World project include?

    -A research paper, open-source code, data, and everything needed to test AI agents across several operating systems.

  • What are the benefits of the open code and data?

    -Anyone can use and modify the project for their own research and improvements, which speeds up progress in AI-agent testing.

  • What problems does the OS World environment solve?

    -It gives agents a robust environment with several operating systems, a way to interact with that environment, and a way to measure how well they perform.

  • What is the IKEA furniture assembly compared to?

    -Assembling IKEA furniture is used as an analogy for how humans take instructions and execute them: understanding the steps, performing the actions, and getting feedback.

  • Which existing models can be used for these tests?

    -Existing models such as LLMs (large language models) and VLMs (vision-language models) can be used, but they have limitations without effective grounding.

  • What is xLang and how is it used in the project?

    -xLang is a tool that translates natural-language instructions into code that can be executed in a given environment. In OS World it is used to turn instructions into actions the agents can carry out (a minimal sketch of that loop follows after this Q&A section).

  • What kinds of tasks can agents perform in the OS World environment?

    -Agents can carry out multi-step computer tasks, including working with various web and desktop applications and interfaces, reading and writing files, and issuing commands through both the GUI and the command line.

  • What did testing of the different input modes show?

    -Using the accessibility tree, or a screenshot combined with the accessibility tree, gives the best results. Higher-resolution screenshots also improve performance.

  • Which agent performed best during testing?

    -GPT-4 performed best in every mode except screenshot-only, where Gemini Pro Vision achieved the best results.
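To make the instruction-to-code idea above concrete, here is a minimal sketch of the loop such an agent might run. It is not the actual xLang or OS World API: `call_llm` is a hypothetical stand-in for any chat-completion client and the prompt wording is invented for illustration; the only real dependency is PyAutoGUI, the library the benchmark prompt asks agents to use.

```python
import pyautogui  # real library; the benchmark asks agents to emit PyAutoGUI code


def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a chat model; should return Python code as text."""
    raise NotImplementedError("plug in your own LLM client here")


def instruction_to_action(instruction: str, observation: str) -> None:
    """Translate a natural-language instruction into executed desktop actions."""
    prompt = (
        "You control a desktop via pyautogui.\n"
        f"Instruction: {instruction}\n"
        f"Observation (screenshot description / accessibility tree):\n{observation}\n"
        "Return only executable Python code."
    )
    code = call_llm(prompt)
    # Grounding step: the generated code is run against the real desktop.
    exec(code, {"pyautogui": pyautogui})
```

In the real benchmark the model is also given the most recent observations and actions as history context; that is omitted here for brevity.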

Outlines

00:00

🔍 Introduction to OS World: a solution for testing AI agents

Paragraph 1 introduces the OS World project, created to solve the problem of testing and benchmarking AI agents. The project's authors, from the University of Hong Kong, CMU, Salesforce Research, and the University of Waterloo, released not only a research paper but also open-source code and data. The core idea is an environment that lets agents interact with operating systems and applications and that provides ways to measure their effectiveness. The paragraph also uses the IKEA furniture-assembly analogy to explain why "grounding" the steps of a task matters.

05:01

🤖 Fundamentals of intelligent agents and how they interact with their environment

Paragraph 2 discusses the definition of an intelligent agent: something that perceives its environment through sensors and acts on it through effectors. It covers the kinds of environments agents can operate in, including computers, mobile devices, and the physical world, as well as the tools and programming languages used to interact with those environments. An important part of the project is xLang, which translates natural-language instructions into code that can be executed in a given environment.

10:02

🛠️ Example tasks and how they are executed with OS World

Paragraph 3 describes how agents can carry out complex computer tasks that span multiple applications and interfaces. OS World lets agents interact with operating systems and applications at different levels, including both the GUI and the command line. It also explains how agents receive observations and generate instructions for interacting with the computer environment, and covers the kinds of actions agents can take, such as moving the mouse, clicking, and pressing keys.
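As a hedged illustration of those low-level actions, the snippet below shows what mouse, keyboard, and hotkey commands look like with PyAutoGUI, the library the benchmark prompt asks agents to use; the coordinates and text are placeholders, not data from a real task.

```python
import pyautogui

# Move the mouse to an element whose position came from the observation
# (the coordinates here are placeholders, not real task data).
pyautogui.moveTo(512, 384, duration=0.2)
pyautogui.click()

# Type into the focused field, then confirm with Enter.
pyautogui.typewrite("quarterly_report.xlsx", interval=0.05)
pyautogui.press("enter")

# Use a hotkey (save) and scroll the view down.
pyautogui.hotkey("ctrl", "s")
pyautogui.scroll(-300)  # negative values scroll down
```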

15:03

📊 Evaluating task execution and the test results

Paragraph 4 focuses on how the agents' task execution is evaluated: creating real computer tasks, annotating them with instructions, setting up the initial state, and writing scripts that check whether a task was completed. It also presents the results of testing different agents with different input modes, such as screenshot only, accessibility tree only, and so on, and notes that the accessibility tree, or a screenshot combined with the accessibility tree, gives the best results.
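To illustrate what an execution-based check can look like, here is a minimal sketch in the spirit of the "remove Amazon tracking cookies" example from the video. The `get_browser_cookies` helper is a hypothetical stand-in for however the real scripts read the browser's cookie store; it is not the actual OS World evaluator code.

```python
def get_browser_cookies(domain: str) -> list[dict]:
    """Hypothetical helper: return the cookies currently stored for a domain."""
    raise NotImplementedError("replace with a real cookie-store reader")


def evaluate_clear_amazon_cookies() -> bool:
    """The task passes only if no amazon.com cookies remain after the agent runs."""
    return len(get_browser_cookies("amazon.com")) == 0


# A benchmark runner would record the boolean as a score, e.g.:
#   score = 1.0 if evaluate_clear_amazon_cookies() else 0.0
```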

Keywords

💡AI agents

AI agents are intelligent systems that can perform tasks, process information, and make decisions without direct human involvement. They are the central topic of the video, which discusses how they can be tested and improved. For example, the script notes that one of the main challenges for AI agents is knowing whether they are working correctly and how to test them.

💡testing AI agents

Testing AI agents means checking that they work and measuring how well they solve tasks. In the context of the video, testing is the key to understanding how agents can be improved; the video notes that before OS World there was no reliable, consistent way to test AI agents.

💡OS World

OS World is a new project designed to solve the benchmarking problem for AI agents. It gives agents a robust environment, several operating systems, ways to interact with those environments, and methods for measuring their performance. The video presents it as an innovative approach to testing and developing AI agents.

💡benchmarking

Benchmarking is comparative testing of performance or effectiveness. In the context of the video, benchmarking is used to determine how well AI agents carry out tasks and how they can be improved. OS World offers a new way of benchmarking that measures agent performance more precisely.

💡open source

Open source describes software whose source code can be viewed, modified, and redistributed by anyone. The video notes that the entire OS World project, including the research paper, code, and data, is open and available to everyone, which helps the field progress.

💡intelligent agents

Intelligent agents are systems that perceive their environment through sensors and act rationally on that environment through their effectors. The video defines such an agent as autonomous, reactive, proactive (goal-directed), and able to interact with other agents through the environment.
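That sensor/effector definition maps onto a perceive-plan-act loop. The skeleton below is only an illustrative sketch, not OS World's agent code; every function in it is a hypothetical placeholder.

```python
from typing import Any


def perceive() -> Any:
    """Hypothetical sensor: e.g. capture a screenshot or dump the accessibility tree."""


def plan(observation: Any, goal: str) -> str:
    """Hypothetical brain: ask an LLM/VLM for the next action toward the goal."""


def act(action: str) -> None:
    """Hypothetical effector: execute the action (mouse, keyboard, API call)."""


def run_agent(goal: str, max_steps: int = 15) -> None:
    for _ in range(max_steps):
        obs = perceive()          # percepts arrive through sensors
        action = plan(obs, goal)  # the model decides what to do next
        if action == "DONE":      # the action space includes explicit done/fail signals
            break
        act(action)               # effectors change the environment
```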

💡Markov decision process

A Markov decision process is a mathematical model for choosing optimal actions under uncertainty. In the context of the video, it is used to formalize autonomous agent tasks, where the state, observation, and action spaces together with the transition and reward functions define the agent's behavior.
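Written out, the formalization referenced in the talk is the standard partially observable MDP tuple; the notation below is the usual textbook form rather than a quotation from the paper.

```latex
\text{A task is modeled as a POMDP } (S, O, A, T, R), \text{ where}
\begin{aligned}
S &= \text{state space (the full desktop / VM state)} \\
O &= \text{observation space (instruction, screenshot, accessibility tree)} \\
A &= \text{action space (mouse, keyboard, hotkeys, wait, fail, done)} \\
T &: S \times A \to S \quad \text{(transition function)} \\
R &: S \times A \to \mathbb{R} \quad \text{(reward, here the task-success score)}
\end{aligned}
```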

💡XLang

XLang is a tool presented in the video that translates natural-language instructions into code that can be executed in a given environment. It is a key component of the OS World project, letting agents understand instructions and turn them into the actions needed to interact with the computer environment.

💡autonomous agent tasks

Autonomous agent tasks are tasks an agent can complete without user intervention; they involve planning, execution, and observation. The video notes that such tasks require the agent to interpret abstract instructions, use tools, explore complex unseen environments, and follow feedback.

💡OpenAI

OpenAI is a research lab focused on developing and training powerful AI systems. In the context of the video, OpenAI is mentioned in connection with the agents tested in the OS World project, including GPT-4.

💡accessibility tree

The accessibility tree is a structured, code-level description of the on-screen UI elements that agents can use to locate and control them. In the video it is the observation mode that produced the best results, and the speaker notes that in the long run operating systems may need to expose similarly agent-friendly interfaces so agents can interact with them more effectively.
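To show what "accessibility tree as an observation" can mean in practice, here is a hedged sketch that flattens a nested UI-element structure into text lines a language model can read. The dictionary layout is invented for illustration; a real tree would come from the platform's accessibility APIs.

```python
def linearize(node: dict, depth: int = 0) -> list[str]:
    """Turn a nested UI-element dict into indented text lines for the prompt."""
    x, y = node.get("position", (0, 0))
    lines = [f"{'  ' * depth}{node.get('role', '?')} '{node.get('name', '')}' at ({x}, {y})"]
    for child in node.get("children", []):
        lines.extend(linearize(child, depth + 1))
    return lines


# Illustrative (made-up) fragment of a desktop's accessibility tree:
tree = {
    "role": "window", "name": "Settings", "position": (0, 0),
    "children": [
        {"role": "button", "name": "Change Wallpaper", "position": (640, 420)},
    ],
}
print("\n".join(linearize(tree)))
```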

Highlights

OS World is a new project designed to address the benchmarking problem for AI agents.

The project includes a research paper, open-source code, and data, promoting transparency and collaboration.

OS World provides a robust environment for AI agents to interact with multiple operating systems and measure performance.

The project uses an analogy of Ikea furniture assembly to explain the importance of grounding instructions for successful task execution.

AI agents require grounding to translate instructions into actions, which is currently a challenge in digital environments like Mac OS or Windows.

Large language models (LLMs) and vision-language models (VLMs) can be used for these tests, but they have limitations without effective grounding.

An intelligent agent is defined as one that perceives its environment and acts rationally upon it, with autonomy and reactivity.

OS World introduces xLang, a tool that translates natural language instructions into executable code.

The project has created 369 real-world computer tasks for benchmarking, involving real web and desktop apps in open domains.

Tasks are evaluated using custom execution-based scripts that check if the task was completed as per the instructions.

The study found that using an accessibility tree or a screenshot plus the accessibility tree provides the best results for observation.

Higher screenshot resolution typically leads to improved performance in tasks that rely on visual input.

The OS World environment is significant for its ability to serve as a unified multimodal agent environment across operating systems.

The project's findings suggest that integrating OS World or similar environments into real-world use will be crucial for deploying effective AI agents.

The use of tools like Deep Checks is highlighted for evaluating, monitoring, and debugging LLM-based applications.

The OS World project is open for further exploration and potential tutorials, indicating its practical application and educational value.

The project's comprehensive approach to AI agent benchmarking is expected to contribute significantly to the field of AI development and testing.

Transcripts

play00:00

today we have a really interesting

play00:02

project one of the biggest hurdles for

play00:04

AI agents is actually how to test them

play00:07

how to know if they're doing things

play00:09

correctly and that's really the only way

play00:11

for them to improve but today there's

play00:14

not really a way to do it consistently

play00:17

and thoroughly until now a new project

play00:20

called OS World aims to fix the

play00:23

benchmarking problem for AI agents and

play00:27

it's not only a research paper they also

play00:29

release the code they release the data

play00:31

as well basically everything is open-

play00:34

source and I really appreciate that so

play00:36

we're going to talk about it today I'm

play00:37

going to tell you all about it and it's

play00:39

super interesting so let's get into it

play00:41

so first here's the research paper OSWorld

play00:45

benchmarking multimodal agents for

play00:47

open-ended tasks in real computer

play00:48

environments and it's out of the

play00:50

University of Hong Kong CMU Salesforce

play00:53

research and University of Waterloo and

play00:55

the project actually comes with a

play00:57

presentation which I think did a

play00:59

fantastic job of explaining what is

play01:02

going on so I'll show you that in a

play01:04

minute but the gist is to date we

play01:06

haven't had a great way to Benchmark AI

play01:08

agents to allow them to perform actions

play01:11

in an environment and actually test and

play01:14

get the results of how well they're

play01:16

performing and that's what OSWorld aims to

play01:19

fix it gives agents a robust environment

play01:22

multiple operating systems a way to

play01:25

interact with the environment and a way

play01:27

to actually measure the performance so

play01:30

first let's go over the slides because

play01:32

this is such a great presentation I

play01:34

think it sums it up really well so this

play01:36

is by Tao Yu out of the University of

play01:38

Hong Kong just came out a few weeks ago

play01:40

so the first page shows Ikea furniture

play01:43

assembly and it's trying to set up an

play01:46

analogy for how humans take instructions

play01:49

and actually execute those instructions

play01:51

so on the left we have Ikea assembly

play01:53

instructions and then on the right we

play01:55

have the assembled chair but what

play01:57

happens in between those two things

play01:59

first on the left we have those

play02:01

step-by-step plans but that's not enough

play02:04

that's not actually enough just having

play02:06

the step-by-step plans is not enough to

play02:08

go assemble a chair we need grounding we

play02:10

need to actually know how to take those

play02:12

step-by-step instructions and execute

play02:15

them and that execution step which also

play02:17

includes getting feedback and perceiving

play02:20

the world and in this example perceiving

play02:22

the different steps of building the

play02:24

chair are incredibly important for

play02:26

actually executing the task successfully

play02:28

and here they're call that the grounding

play02:31

so now let's look at a digital task we

play02:34

have computer tasks in a digital world

play02:36

task instruction how do I change my Mac

play02:38

desktop background so we have the Mac OS

play02:40

environment we have our control

play02:43

instructions which is basically just

play02:45

from the help on the Apple website and

play02:47

then we have the final outcome of Mac OS

play02:49

with new wallpaper but again how do we

play02:51

get from just the instructions to the

play02:54

final executed task we need grounding

play02:57

and grounding in this case comes in the

play02:59

form form of a mouse and keyboard now

play03:02

it's already difficult just based on

play03:04

that to date I believe open interpreter

play03:06

is probably the best at taking

play03:08

instructions and being able to actually

play03:10

control the computer and it's really

play03:13

difficult to do so because you know Mac

play03:14

is a closed system Windows is a closed

play03:17

system and so basically what they do is

play03:19

they typically take a screenshot of the

play03:21

entire desktop then they put a grid over

play03:24

it and then the large language model

play03:26

tells the mouse and keyboard where to

play03:28

move on that grid but it's all done

play03:30

through accessibility features and it's

play03:33

imprecise to say the least so it's

play03:36

really a very inefficient way of

play03:38

controlling a computer and let's see

play03:40

what's next so can llms and VLMs be used

play03:44

for these tests so the answer is yes and

play03:47

no according to this presentation so

play03:49

let's ask chat GPT on the left side how

play03:51

do I change my Mac desktop background

play03:54

and chat GPT gives us perfect

play03:56

instructions so step-by-step

play03:57

instructions and then for real world tasks

play04:01

it can't really help with an Ikea chair

play04:03

right cuz you ask it how do I assemble

play04:05

an Ikea chair and it gives you only the

play04:07

most high level information about how to

play04:09

do that so let's look at the yes let's

play04:11

look at how to actually execute digital

play04:13

instructions so again on the left we

play04:16

have the step-by-step instructions on

play04:17

how to change the Mac desktop background

play04:20

and what we need is control instructions

play04:24

how do I take the actual step-by-step

play04:26

instructions and control the computer

play04:28

and that grounding is the missing piece

play04:31

because there is no really cut and dry

play04:34

way to control a Mac desktop for example

play04:37

again it's usually take a screenshot

play04:39

place a grid over it and try to guess

play04:41

what the coordinates are very imprecise

play04:44

and it says right here ChatGPT cannot

play04:46

execute tasks on your Mac by grounding

play04:48

plans into actions and as the second

play04:51

example the real world example ChatGPT

play04:53

also cannot generate step-by-step plans

play04:56

without interacting in the environment

play04:57

so basically without getting that

play04:59

feedback and how's it going to get

play05:00

feedback from The Real World environment

play05:02

without a lot of sensors which we don't

play05:04

really have right now then it's not

play05:06

actually able to go give you really

play05:08

solid instructions for how to do it now

play05:10

before we get to the solution this

play05:12

presentation talks about what are actual

play05:14

agents so we have a user over here the

play05:17

user gives an instruction the llm as an

play05:20

agent is able to code actions it has a

play05:24

bunch of actions it can do and it's

play05:25

basically able to code it so here we

play05:27

have SQL we have API calls we have web

play05:31

and app control so actually being able

play05:33

to control the desktop and even an

play05:35

embodied AI in the form of a robot so we

play05:37

can actually use large language models

play05:38

to generate code that controls a robot

play05:40

then we have the environment and that is

play05:42

like the Mac OS or Windows but we have

play05:44

more than that we have the data we have

play05:46

websites we have apps we have mobile

play05:48

desktop and we have the physical world

play05:50

all the different environments that we

play05:52

should be able to operate within then we

play05:54

need to be able to gather observations

play05:55

and place those back into the large

play05:58

language model because this this is

play05:59

going to be an iterative Loop it needs

play06:01

to plan it needs to perform and then it

play06:03

needs to observe and use that

play06:05

information to iterate once again and

play06:07

then of course we have whatever tools we

play06:09

want to use hugging face SQL python Etc

play06:12

so here they say what is an intelligent

play06:14

agent and I've never actually heard that

play06:17

term before but let's take a look at

play06:19

what it says so the definition is an

play06:20

intelligent agent perceives its

play06:22

environment via sensors and acts

play06:24

rationally upon that environment with

play06:26

its effectors a discrete agent receives

play06:30

percepts one at a time and Maps this

play06:33

percept sequence to a sequence of

play06:35

discrete actions so let's take a look at

play06:38

this little funny looking chart that

play06:40

they have here we have the agent the

play06:42

agent can gather input through sensors

play06:45

in the form of percepts then it can plan

play06:48

and actually perform actions via its

play06:51

effectors things that can actually

play06:52

affect the environment so the properties

play06:55

of an intelligent agent are it's

play06:57

autonomous it's reactive to the

play06:58

environment it's proactive goal directed

play07:01

and it interacts with other agents via

play07:03

the environment so this is all really

play07:06

cool and I keep thinking back to crew Ai

play07:09

and autogen and other AI agent

play07:12

Frameworks because this feels very akin

play07:15

to that so let's look at some examples

play07:17

of what this can be the environment can

play07:19

be a computer mobile data or the

play07:21

physical world if you're using an

play07:22

embodied AI agent then for the sensors

play07:25

we can use camera screenshot ultrasonic

play07:28

radar and now I'm kind of thinking of

play07:30

Tesla autonomous vehicles acting as

play07:32

agents then we have the agent itself

play07:34

where the llm vlm is the brain and the

play07:38

effectors so that's the robot or The

play07:41

Interpreter so here's an example of a

play07:43

robotic physical world agent so we give

play07:45

the instructions stack the blocks on the

play07:47

empty bowl we have the code right here

play07:50

so block name detect blocks detect

play07:52

objects and basically stack them up this

play07:55

is all the code necessary to do that

play07:56

that is in the actions so here are all

play07:58

the different options for the actions

play08:01

independent of the environment that

play08:02

we're working within and here's again

play08:05

the basic workflow so we've already

play08:07

talked about this but it also says we

play08:09

have to be able to interpret abstract

play08:12

user instructions utilize tools and

play08:14

expand capacities explore complex unseen

play08:17

environments multi-step planning and

play08:19

reasoning and follow feedback and self

play08:21

debug these are all part of being an

play08:24

agent all stuff that we've already seen

play08:26

and so it seems one of their big

play08:29

innovations is this xLang which basically

play08:32

takes natural language instructions and

play08:35

translates that into code that can be

play08:37

executed in an environment and so here's

play08:40

the xLang website you can see very similar

play08:43

to what we've been seeing already and

play08:45

here's the xlang GitHub page where you

play08:48

can get the open agents project as well

play08:51

as OS world and these are all open

play08:54

source which is awesome thanks to the

play08:56

sponsor of this video deep checks deep

play08:58

checks helps teams building llm

play09:00

applications evaluate Monitor and debug

play09:03

their llm based applications with deep

play09:05

checks you can release highquality llm

play09:08

apps quickly without compromising on

play09:10

testing imagine building a rag based

play09:12

chatbot application you do not want it

play09:14

to hallucinate or have inaccuracies

play09:17

hallucinations incorrect answers bias

play09:19

deviations from policy harmful content

play09:22

and more need to be detected explored

play09:25

and mitigated before and after your app

play09:27

goes live easily compared versions of

play09:30

your prompts and models to pentest your

play09:32

llm based app for undesired Behavior to

play09:35

enhance their text annotation efforts

play09:37

with automated scoring or to monitor the

play09:39

actual quality of your llm based app in

play09:42

production it allows you to create

play09:44

custom properties and rules to evaluate

play09:46

your llm applications based on your

play09:49

requirements deep check supports rag

play09:52

chatbots Q&A summarization text to SQL

play09:55

text to code and other content

play09:57

generation deep checks llm evaluation

play10:00

solution is currently available for free

play10:02

trials if you're building any kind of

play10:04

llm based application you should

play10:06

definitely check out deep checks I'll

play10:08

drop the link in the description below

play10:10

so you can get your free trial thanks

play10:12

again to deep checks and now back to the

play10:14

video so they've actually published a

play10:17

bunch of work and projects recently they

play10:19

have instructor which is uh adapting to

play10:22

various agent environments by simply

play10:24

providing instructions binder which is

play10:27

one of the first llm Plus tool used

play10:29

studies lemur open state-of-the-art llms

play10:32

for language agents open agents an open

play10:35

platform for language agents in the wild

play10:38

this is an agents project that I

play10:39

actually haven't tested yet I not even

play10:41

heard of it before reading about it here

play10:43

we have text to reward which connects a

play10:44

large language model agents to the

play10:47

physical world and then OS World which

play10:49

is what we're talking about today okay

play10:50

so that's enough Theory let's actually

play10:52

talk about how it's working so here's an

play10:55

example so I zoomed in and we have

play10:57

computer tasks often involve

play10:59

multiple apps and interfaces so the

play11:01

instruction example that is given here

play11:03

update the bookkeeping sheet with my

play11:05

recent transactions over the past few

play11:07

days in the provided folder I am so

play11:10

blown away that if not today eventually

play11:13

and pretty soon I believe agents are

play11:15

going to be able to take complex task

play11:18

instructions like this and actually go

play11:21

execute them on our behalf that's why

play11:22

I'm excited about agent Frameworks

play11:25

that's why I'm excited about the rabbit

play11:26

device hopefully you're seeing the

play11:28

potential of Agents so in this example

play11:31

we have the operating system right here

play11:33

they need to open up office they

play11:36

actually need to open up different

play11:37

images which contain receipts and then

play11:39

they need to read the image look for the

play11:42

different line items the different

play11:44

prices and input those into the

play11:46

spreadsheet this is very complex but how

play11:49

do agents actually do that it is

play11:52

incredibly difficult to do that in the

play11:55

Mac OS environment in Windows

play11:58

environments because there's no

play12:00

grounding layer there's no ability to

play12:02

take those instructions and actually

play12:04

generate the instructions to interact

play12:06

with the environment and so that's where

play12:08

OSWorld comes in which is the first

play12:10

scalable real computer environment OSWorld

play12:12

can serve as a unified multimodal agent

play12:15

environment for evaluating open-ended

play12:17

computer tasks that involve arbitrary

play12:19

apps and interfaces across operating

play12:21

systems So within this environment they

play12:24

can operate any of the operating systems

play12:27

they can operate any amount of

play12:29

applications within that and they can

play12:31

even operate the interfaces themselves

play12:33

both the UI and the CLI and it's able to

play12:37

provide observations to the agents the

play12:40

agents are able to use grounding to

play12:42

actually generate instructions for how

play12:44

to interact with the computer

play12:46

environment so let's look at what an

play12:49

agent task includes an autonomous agent

play12:52

task can be formalized as a partially

play12:54

observable Markov decision process so we

play12:57

have the state space so the current

play12:59

desktop environment we have the

play13:00

observation space which is the

play13:02

instruction the screenshot the a11y tree

play13:04

which I'll show you in a minute and then

play13:06

we have the action space what they can

play13:08

actually do so being able to click then

play13:10

we have the transition function and the

play13:13

reward function so when each task is

play13:16

generated we have this initial State

play13:18

that's located in this task config so

play13:20

here's the instructions right here we

play13:22

have the config which is the current

play13:24

state we have the evaluator so basically

play13:27

how do we tell if the task is completed

play13:28

or or not we have the result compared to

play13:31

the expected and then we have the

play13:33

function with different options so how

play13:35

does it actually get the observations

play13:37

well there's really a few ways that they

play13:39

describe here we have the set of marks

play13:42

and the accessibility tree the set of

play13:44

marks is kind of like a grid format it

play13:46

basically just tells it how to click on

play13:48

different objects within the screen and

play13:51

this is very akin to how open interpreter

play13:55

works today it basically has a grid and

play13:56

it decides where to click now rather

play13:59

than figuring out what the grid is and

play14:01

figuring out where each button is within

play14:03

the grid that is the point of OS World

play14:05

it is actually telling the language

play14:07

model where everything is and then we

play14:09

have the accessibility tree which is

play14:11

basically a code version of that so the

play14:13

AI agent generates an action which

play14:16

results in a new state and then a new

play14:19

observation so here is an example of the

play14:22

actual interaction with the environment

play14:24

we have the mouse moving we have

play14:26

clicking we have writing text we have

play14:28

pressing the keyboard we have using

play14:30

hot Keys scrolling dragging key up key

play14:32

down waiting failing done so that is how

play14:36

it actually interacts with the

play14:37

environment so how are the task

play14:39

executions actually evaluated and that's

play14:42

what we're seeing here so we have the

play14:44

task instruction as an example we have

play14:46

this initial State can you help me clean

play14:48

my computer by getting rid of all the

play14:49

tracking things that Amazon might have

play14:51

saved so the evaluation script which I

play14:54

guess is just a simplified version an

play14:56

example version of it is it actually

play14:58

checks it grabs the cookies and

play15:00

sees does amazon.com have any cookies

play15:03

left and if not it passed and if so it

play15:05

failed then over here we have rename

play15:07

sheet one to L's resources then make a

play15:10

copy of it place the copy before sheet

play15:12

two rename it by a pending etc etc and

play15:15

again it's simply checking whether it

play15:18

was done or not so this is a great way

play15:20

to Benchmark in a really accurate way so

play15:23

they created 369 real world computer

play15:26

tasks that involve real web and desktop

play15:28

apps in open domains they use OS file

play15:31

reading and writing they do multi-app

play15:33

workflows through both the GUI and the

play15:36

command line and each example task are

play15:39

carefully annotated with real world task

play15:41

instructions from real users an initial

play15:44

State setup config to simulate human

play15:46

work in progress and a custom execution

play15:48

based evaluation script so let's

play15:50

actually look at the prompt this is what

play15:52

the actual prompt looks like so they

play15:54

tested it against CogAgent which I had

play15:57

not heard of GPT-4 Gemini Pro Claude 3 as

play16:00

agents then we have the prompt details

play16:02

which we're seeing over here so you're

play16:05

an agent which follow my instructions

play16:07

and perform desktop computer tasks as

play16:09

instructed you have good knowledge etc

play16:11

etc you are required to use PyAutoGUI

play16:14

to perform the action grounded to the

play16:17

observation return one line or multiple

play16:19

lines of python code to perform each of

play16:21

the actions you need to specify the

play16:23

coordinates of by yourself based on the

play16:25

observation of the current observation

play16:27

and here's a password you can use so uh

play16:31

really just a pretty thorough prompt to

play16:33

give to the large language model the

play16:35

temperature of one which I thought was

play16:36

interesting because that means that it's

play16:38

going to be the most creative basically

play16:40

and a top-p of 0.9 now I would think it

play16:44

would want to keep the temperature

play16:45

really low uh I'm not actually sure why

play16:48

they decided to keep the temperature at

play16:49

one and then they also provide the most

play16:51

recent three observations and actions as

play16:54

history context for each step basically

play16:55

helping the large language model

play16:57

understand what has come before and that

play16:59

will inform what it needs to do going

play17:01

forward and as input settings as we've

play17:03

already talked about they have set of

play17:05

marks and accessibility tree and they

play17:06

actually have four different versions of

play17:08

it they have accessibility tree only

play17:10

screenshot only screenshot plus

play17:12

accessibility tree and set of marks now

play17:15

let's look at the results so on the left

play17:18

we have the different input modes so the

play17:20

a11y tree which is the accessibility

play17:23

tree we have the screenshot we have the

play17:25

accessibility tree plus screenshot and

play17:27

then the set of marks so what we have

play17:30

found is first of all GPT-4 across the

play17:34

board has been the winner The Only

play17:36

Exception is with screenshot only mode

play17:38

which in Gemini Pro V did the best and

play17:42

it seems that either the accessibility

play17:44

tree or using a screenshot plus the

play17:47

accessibility tree are giving the best

play17:48

result with really the accessibility

play17:51

tree alone being really the winner of

play17:54

the best way to give observation to the

play17:57

large language model the set of marks

play17:59

actually also works pretty well the

play18:01

screenshot alone does not and that's

play18:03

interesting because I still believe open

play18:05

interpreter the way that they're

play18:07

actually interacting with computers is

play18:08

by screenshot now if you're trying to

play18:11

deploy agents to a real world

play18:13

environment consumers are not going to

play18:15

have OS World installed in their

play18:16

machines or expose OS world as an

play18:19

environment so what we're probably going

play18:21

to have to do in the long run is build

play18:23

into operating systems a way for

play18:26

agents to interact with them more

play18:27

effectively and one interesting Insight

play18:29

that they had is higher screenshot

play18:31

resolution typically leads to improved

play18:33

performance so if they're just doing

play18:35

screenshots you can see right here the

play18:37

success rate and percentage increases as

play18:40

the resolution of that screenshot

play18:42

improves so that's it that is the OS

play18:45

World project I really like it I

play18:48

appreciate that it's going to allow us

play18:50

to Benchmark agents testing and actually

play18:53

having results is the only way to

play18:55

improve anything I'm thinking about

play18:57

actually setting up OS world on my own

play18:59

machine testing it out maybe I'll create

play19:01

a tutorial from it if you want to see

play19:03

that let me know in the comments below

play19:04

if you liked this video please consider

play19:06

giving a like And subscribe and I'll see

play19:08

you in the next one
