Stream of Search (SoS): Learning to Search in Language

Arxiv Papers
7 Apr 2024 · 12:43

Summary

TLDR: This video script explores the impact of training language models on the search process, aiming to teach language models how to search and backtrack so that they can self-improve. Transformer-based models struggle with planning tasks, facing issues such as error compounding and look-ahead challenges, which stem from their limited ability to search and backtrack effectively. The study shows that by representing the search process as a stream of search, language models can be trained to search and backtrack. A search problem inspired by the game of Countdown, which requires combining input numbers with arithmetic operations to reach a target number, provides a challenging testbed. Using a training dataset created with symbolic planners and heuristic functions, a language model trained on this diverse dataset outperforms a model trained only on optimal solutions. Moreover, after fine-tuning with advantage-induced policy alignment and expert iteration, the stream of search model exhibits enhanced search and planning abilities. The results indicate that Transformer-based language models can learn problem solving through search and autonomously improve their search strategies.

Takeaways

  • 🤖 Training language models to learn to search and backtrack improves their ability to self-improve.
  • 🚀 Transformer-based models face error compounding and look-ahead challenges in planning tasks.
  • 🔍 By representing the search process as a stream of search, language models can be trained to search and backtrack.
  • 🎯 A training dataset created with heuristic functions helps the model learn diverse search strategies.
  • 📈 Compared with a model trained only on optimal solutions, the model trained on streams of search is better at predicting correct solution paths.
  • 🧠 Fine-tuning with advantage-induced policy alignment and expert iteration enhances the model's search and planning abilities.
  • 🔬 The study shows that Transformer-based language models can learn problem solving through search and improve their search strategies autonomously.
  • 📚 The problem space is modeled as a Markov decision process (MDP), defining states, actions, a transition function, and a reward function.
  • 📊 Comparing different search strategies shows that the model aligns with multiple symbolic strategies rather than relying on a single strategy from its training data.
  • 🛠️ Fine-tuning with reinforcement learning strategies lets the model solve previously unsolved problems and discover new search strategies.
  • ✅ Training the language model for accuracy also leads to the discovery of new search strategies, indicating that it can use different search strategies flexibly.
  • 🏆 The work demonstrates that internal search mechanisms can enable language models to tackle complex problems, highlighting the importance of exposing models to the problem-solving process rather than just optimal solutions.

Q & A

  • Why do language models need to learn to search and backtrack during training?

    -Teaching language models to search and backtrack improves their performance on planning tasks and addresses issues such as error compounding and look-ahead challenges, which stem from the models' limited ability to search and backtrack effectively.

  • What challenges do Transformer-based models face in planning tasks?

    -Transformer-based models face error compounding and look-ahead problems in planning tasks, which make it hard for them to search and backtrack effectively.

  • How can training improve a language model's search ability?

    -By representing the search process as a stream of search, defining components such as exploration, backtracking, and pruning, and applying them to a search problem inspired by the game of Countdown, language models can be trained to search and backtrack.

  • What role does the Countdown game play in training language models?

    -Countdown provides a challenging search problem in which input numbers must be combined with arithmetic operations to reach a target number. Solving it requires planning, search, and backtracking, which strengthens the model's problem-solving ability.

  • Why does the training dataset include search trajectories of which only 57% lead to a solution?

    -Including suboptimal and sometimes unsuccessful search trajectories trains the model more comprehensively, letting it learn a wider range of search strategies and improving its adaptability and flexibility on difficult problems.

  • How is the model's accuracy at generating correct solution trajectories evaluated?

    -Accuracy is measured by checking whether the generated trajectory contains a correct path from the initial state to the goal state.

  • How does the model align with different search strategies during training?

    -Alignment is assessed by analyzing which problems the model solves correctly and which states it visits. For example, the model correlates most strongly with the depth-first search (DFS) strategy that uses the sum heuristic.

  • Why is training on search trajectories more effective than training only on optimal solutions?

    -Training on search trajectories improves accuracy on unseen inputs, because it lets the model learn multiple search strategies and discover new ones during training.

  • How are reinforcement learning strategies used to improve the model's problem-solving ability?

    -The model is fine-tuned with reinforcement learning strategies such as expert iteration (STaR) and advantage-induced policy alignment (APA), improving its performance on the validation set and enabling it to solve previously unsolved problems.

  • How is the reward function designed to guide learning in the reinforcement learning setup?

    -The reward function is based on correctness and trajectory length, encouraging the model to generate solutions that are both correct and efficient.

  • What benefits does training language models to search bring, according to the results?

    -The results show that training language models to search lets them autonomously employ various search strategies, solve previously unsolved problems, and discover new search methods, improving their problem-solving ability.

  • What external factors supported the training and improvement of the model?

    -The research benefited from valuable discussions and support from Gabriel Poesia, Jacob Andreas, Joy He-Yueya, Dangu Jang, Eric Zelikman, Jan Philip Franken, and C. Jong, and was funded by the Stanford Human-Centered Artificial Intelligence (HAI) Google Grant and an NSF Expeditions Grant.

Outlines

00:00

🚀 Training Language Models to Search and Self-Improve

This segment explores the impact of training language models on the search process, with the goal of teaching language models how to search and backtrack so that they can self-improve. Transformer-based models face error compounding and look-ahead challenges in planning tasks, which stem from their limited ability to search and backtrack. Existing methods combine language models with symbolic search algorithms to address these issues, but they only assist during inference, leaving open whether models can learn to search independently during training. The study shows that by representing the search process as a stream of search, language models can be trained to search and backtrack. A search problem inspired by the game of Countdown, in which input numbers must be combined with arithmetic operations to reach a target number, provides a challenging testbed. Using a training dataset created with symbolic planners and heuristic functions, the language model shows improved search and planning abilities. Furthermore, after fine-tuning with advantage-induced policy alignment and expert iteration, the model demonstrates stronger search and planning abilities. The results indicate that Transformer-based language models can learn to solve problems through search and improve their search strategies through self-training.
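
To make the "stream of search" idea concrete, here is a minimal sketch of how search events for a Countdown problem might be serialized into a single text stream that a language model can be trained on. The event names, state format, and wording are illustrative assumptions, not the paper's exact serialization.

```python
# Minimal sketch: serialize search events into a "stream of search" string.
# The event vocabulary and textual format below are illustrative assumptions.

def format_state(numbers, target):
    """Render a Countdown state: the numbers still available and the target."""
    return f"numbers={sorted(numbers)} target={target}"

def stream_of_search(events):
    """Turn a list of (operation, payload) search events into one text stream."""
    lines = []
    for op, payload in events:
        if op == "explore":
            lines.append(f"Exploring state: {payload}")
        elif op == "step":
            lines.append(f"Trying operation: {payload}")
        elif op == "backtrack":
            lines.append(f"Dead end, backtracking to: {payload}")
        elif op == "goal":
            lines.append(f"Goal check: {payload}")
    return "\n".join(lines)

if __name__ == "__main__":
    events = [
        ("explore", format_state([1, 2, 3, 4], 10)),
        ("step", "1 + 2 = 3"),
        ("explore", format_state([3, 3, 4], 10)),
        ("step", "3 + 3 = 6"),
        ("explore", format_state([6, 4], 10)),
        ("step", "6 + 4 = 10"),
        ("goal", "10 == 10, solution found"),
    ]
    print(stream_of_search(events))
```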

05:02

🧠 Modeling the Search Problem and Training Data

This segment models the problem space as a Markov decision process (MDP) with states, actions, a transition function, and a reward function to represent the search process. A search tree is defined from an initial state to a goal state, where a correct solution is a sequence of states and actions. Primitive operations such as state expansion, exploration choice, pruning, backtracking, goal checking, and heuristics are introduced to guide the search efficiently. The Countdown game, which requires combining input numbers with arithmetic operations to reach a target number, illustrates the use of search streams. For training data, the model is trained on a synthetic dataset containing diverse and suboptimal symbolic search strategies. The dataset consists of 12 search strategies based on breadth-first search (BFS) and depth-first search (DFS), guided by simple heuristic functions. Each search trajectory is represented as a string of state nodes, and only 57% of the trajectories lead to a solution. The model's accuracy in generating correct solution trajectories is evaluated, and a comparison of alignment across search strategies shows that the model trained on search trajectories outperforms the model trained only on optimal solutions.
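
The evaluation described here checks whether a generated trajectory contains a correct path to the target. A hedged sketch of such a check is below; it assumes solution steps appear in the trajectory as expressions like "a + b = c", and for simplicity it validates every arithmetic step it finds rather than isolating only the final reported path, as a full checker would.

```python
import re

# Hedged sketch of a correctness check for a generated Countdown trajectory.
# The "a op b = c" step format is an assumption about how steps are written.
STEP = re.compile(r"(\d+)\s*([+\-*/])\s*(\d+)\s*=\s*(\d+)")

def apply_op(a, op, b):
    if op == "+":
        return a + b
    if op == "-":
        return a - b
    if op == "*":
        return a * b
    return a / b if b != 0 else None

def trajectory_is_correct(trajectory, numbers, target):
    """Check that the arithmetic steps use available numbers, are computed
    correctly, and end exactly at the target."""
    available = list(numbers)
    reached = None
    for a, op, b, c in STEP.findall(trajectory):
        a, b, c = int(a), int(b), int(c)
        if a not in available or b not in available:
            return False
        if a == b and available.count(a) < 2:
            return False
        if apply_op(a, op, b) != c:
            return False
        available.remove(a)
        available.remove(b)
        available.append(c)
        reached = c
    return reached == target

if __name__ == "__main__":
    text = "Exploring... 1 + 2 = 3 ... 3 + 3 = 6 ... 6 + 4 = 10 Goal reached"
    print(trajectory_is_correct(text, [1, 2, 3, 4], 10))  # True
```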

10:03

🔧 Improving the Model's Policy with Streams of Search

This segment investigates whether the model (the stream of search language model, SoS LM) can improve its problem-solving ability from feedback based on correctness and efficiency. The model is tested on problems that were previously unsolvable with the symbolic search strategies in its training data, as well as on hard problems from the training set that none of those symbolic strategies can handle. Two reinforcement learning (RL) strategies are used to improve the model: expert iteration (STaR) and advantage-induced policy alignment (APA). STaR fine-tunes the model by generating correct trajectories from the training dataset and using them to update the model iteratively until performance improves on the validation set. APA, in contrast, creates a copy of the language model to serve as a value network, which is then used to improve the original model's policy. A reward function based on correctness and trajectory length guides the learning process. Experiments show that after three rounds of STaR fine-tuning, the SoS model improves and solves additional problems beyond the base model. Likewise, the model fine-tuned with APA shows higher accuracy. Analysis of state visitation patterns shows that the STaR and APA models explore more states associated with specific heuristics, indicating that they can employ diverse search strategies. The SoS models also handle errors better and find solutions faster than the base model. These improvements suggest that the model can use various search strategies effectively and may discover new heuristics and search methods. The results underscore the effectiveness of training language models to search for solutions and the importance of exposing models to the problem-solving process rather than just optimal solutions.
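
As a rough illustration of the expert-iteration (STaR) loop described above, the sketch below samples trajectories from the current model, keeps correct ones (preferring shorter ones, echoing the correctness-and-length reward), and fine-tunes on them. The functions `sample_trajectory`, `is_correct`, and `finetune`, and the specific reward weights, are hypothetical placeholders, not the paper's implementation.

```python
# Hedged sketch of a STaR-style expert-iteration round for streams of search.

def reward(trajectory, correct, alpha=0.01):
    """Toy reward: 1 for a correct trajectory, minus a small length penalty."""
    return (1.0 if correct else 0.0) - alpha * len(trajectory.split("\n"))

def star_iteration(model, problems, sample_trajectory, is_correct, finetune,
                   samples_per_problem=4):
    """One round of expert iteration: collect correct trajectories, fine-tune."""
    kept = []
    for problem in problems:
        candidates = [sample_trajectory(model, problem)
                      for _ in range(samples_per_problem)]
        correct = [t for t in candidates if is_correct(t, problem)]
        if correct:
            # Keep the highest-reward (i.e. shortest correct) trajectory.
            kept.append(max(correct, key=lambda t: reward(t, True)))
    return finetune(model, kept)

if __name__ == "__main__":
    # Dummy placeholders just to exercise the loop shape.
    result = star_iteration(
        model="base-model",
        problems=[([1, 2, 3, 4], 10)],
        sample_trajectory=lambda m, p: "1 + 2 = 3\n3 + 3 = 6\n6 + 4 = 10",
        is_correct=lambda t, p: True,
        finetune=lambda m, data: (m, len(data)),
    )
    print(result)  # ('base-model', 1)
```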

📝 Acknowledgements

This segment expresses gratitude to Gabriel Poesia, Jacob Andreas, Joy He-Yueya, Dangu Jang, Eric Zelikman, Jan Philip Franken, and C. Jong for their valuable discussions and support. The research was made possible by funding from the Stanford Human-Centered Artificial Intelligence (HAI) Google Grant and the NSF Expeditions Grant.


Keywords

💡Language Model

A language model is an AI system for generating and understanding natural language. In the video, language models are trained to learn to search and backtrack in order to improve their problem-solving ability. This is central to the video's theme, since the entire discussion focuses on how training can enhance a language model's search and planning capabilities.

💡Search Process

The search process is the sequence of steps taken to explore possible solutions to a problem. The video explains that language models learn to search and backtrack by being trained on representations of the search process. This is at the core of the video, since the research focuses on enabling language models to search independently and thereby solve problems autonomously.

💡Markov Decision Process (MDP)

An MDP is a mathematical framework for modeling decision making under uncertainty. In the video, the problem space is modeled as an MDP with states, actions, a transition function, and a reward function to represent the search process. The MDP is the key abstraction for understanding how the language model finds solutions by traversing a search tree.
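
For concreteness, here is a minimal sketch of how Countdown could be cast as the MDP described here, under assumed representations: a state is the multiset of remaining numbers plus the target, an action combines two numbers with an operation, the transition replaces the operands with the result, and the reward is 1 when only the target remains. These representation choices are assumptions for illustration.

```python
from dataclasses import dataclass
from itertools import combinations

@dataclass(frozen=True)
class State:
    numbers: tuple  # numbers still available
    target: int

def actions(state):
    """All (a, op, b) choices: pick two available numbers and an operation."""
    acts = []
    for i, j in combinations(range(len(state.numbers)), 2):
        a, b = state.numbers[i], state.numbers[j]
        for op in ("+", "-", "*", "/"):
            acts.append((a, op, b))
    return acts

def transition(state, action):
    """Apply an action: remove the two operands, add the result."""
    a, op, b = action
    result = {"+": a + b, "-": a - b, "*": a * b,
              "/": a / b if b else None}[op]
    if result is None:
        return None
    rest = list(state.numbers)
    rest.remove(a)
    rest.remove(b)
    return State(tuple(sorted(rest + [result])), state.target)

def reward(state):
    """Reward 1 when exactly the target remains, else 0."""
    return 1.0 if state.numbers == (state.target,) else 0.0

if __name__ == "__main__":
    s0 = State((1, 2, 3, 4), 10)
    s1 = transition(s0, (1, "+", 2))   # numbers become (3, 3, 4)
    s2 = transition(s1, (3, "+", 3))   # numbers become (4, 6)
    s3 = transition(s2, (4, "+", 6))   # numbers become (10,)
    print(s3, reward(s3))              # reward 1.0
```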

💡Search Tree

A search tree is a data structure representing all possible paths from an initial state to a goal state. The video notes that, via the search tree, the language model can explore all possible actions from the initial state to its child states until a solution is found. The search tree is the core structure over which the model searches.
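
A minimal sketch of a search-tree node is shown below: each node stores a state, the action that produced it, and a parent pointer, which is what makes explicit backtracking and path reconstruction possible. The field names and string action format are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    state: tuple                      # remaining numbers
    action: Optional[str] = None      # e.g. "1+2=3"; None for the root
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)

    def expand(self, action, new_state):
        """Create and attach a child node reached by taking `action`."""
        child = Node(state=new_state, action=action, parent=self)
        self.children.append(child)
        return child

    def path(self):
        """Walk parent pointers back to the root to recover the solution path."""
        node, steps = self, []
        while node.parent is not None:
            steps.append(node.action)
            node = node.parent
        return list(reversed(steps))

if __name__ == "__main__":
    root = Node(state=(1, 2, 3, 4))
    n1 = root.expand("1+2=3", (3, 3, 4))
    n2 = n1.expand("3+3=6", (4, 6))
    n3 = n2.expand("6+4=10", (10,))
    print(n3.path())  # ['1+2=3', '3+3=6', '6+4=10']
```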

💡Heuristic Function

A heuristic function estimates the value of a node during search and is typically used to steer the search toward more promising directions. In the video, heuristics estimate the distance to the target, helping the language model search for solutions more efficiently. Heuristics are an important component of the search strategies.
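
The transcript mentions heuristics based on the absolute difference between the remaining numbers and the target, and on the distance to the factors of the target. The sketch below shows plausible versions of both; the exact formulas used in the paper are assumptions here.

```python
# Hedged sketches of two Countdown heuristics (lower scores are better).

def sum_heuristic(numbers, target):
    """Absolute gap between the sum of the remaining numbers and the target."""
    return abs(sum(numbers) - target)

def factor_heuristic(numbers, target):
    """Average distance of each remaining number to the nearest factor
    of the target."""
    factors = [f for f in range(1, target + 1) if target % f == 0]
    return sum(min(abs(n - f) for f in factors) for n in numbers) / len(numbers)

if __name__ == "__main__":
    print(sum_heuristic([3, 3, 4], 10))     # |10 - 10| = 0
    print(factor_heuristic([3, 3, 4], 10))  # factors of 10 are 1, 2, 5, 10
```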

💡Backtracking

Backtracking is the step in a search algorithm where, when the current path cannot lead to a solution, the algorithm returns to a previous decision point and tries another path. The video emphasizes the importance of language models learning to backtrack, since this is key to improving their search efficiency and problem-solving ability.

💡Depth-First Search (DFS)

DFS is a search algorithm that follows a branch of the tree as deep as possible until it finds a solution or determines that the path is infeasible. In the video, DFS is one of the building blocks for the search strategies and is combined with simple heuristic functions. It is one of the key search strategies used to generate the model's training data.
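
Below is a self-contained, hedged sketch of heuristic-guided DFS with explicit backtracking for Countdown, roughly in the spirit of the strategies described here: children are ordered by a sum-based heuristic and dead ends are recorded in a trace. Division is omitted and negative intermediates are allowed to keep the sketch short; these simplifications are assumptions.

```python
from itertools import combinations

def children(numbers):
    """All states reachable by combining two numbers with one operation
    (division omitted for brevity)."""
    out = []
    for i, j in combinations(range(len(numbers)), 2):
        a, b = numbers[i], numbers[j]
        rest = [numbers[k] for k in range(len(numbers)) if k not in (i, j)]
        for r in {a + b, a - b, b - a, a * b}:
            out.append(tuple(sorted(rest + [r])))
    return out

def dfs(numbers, target, trace):
    """Depth-first search that logs exploration and backtracking events."""
    trace.append(f"explore {numbers}")
    if numbers == (target,):
        trace.append("goal reached")
        return True
    if len(numbers) == 1:
        trace.append(f"dead end at {numbers}, backtrack")
        return False
    # Order children by how close their sum is to the target (smaller first).
    for child in sorted(children(numbers), key=lambda s: abs(sum(s) - target)):
        if dfs(child, target, trace):
            return True
    trace.append(f"exhausted {numbers}, backtrack")
    return False

if __name__ == "__main__":
    trace = []
    solved = dfs((1, 2, 3, 4), 10, trace)
    print("\n".join(trace[:10]))
    print("solved:", solved)
```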

💡Breadth-First Search (BFS)

BFS is a search algorithm that first explores all nodes at the same depth before going deeper. The video describes BFS-based search strategies that use heuristic functions to guide the search, helping the language model explore the problem space more broadly.
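
One plausible reading of "BFS guided by heuristic functions" is a level-by-level search that keeps only the best-scoring states at each level (a small beam). The sketch below follows that reading; the beam size, scoring, and omission of division are assumptions.

```python
from itertools import combinations

def children(numbers):
    """States reachable by combining two numbers (division omitted)."""
    out = set()
    for i, j in combinations(range(len(numbers)), 2):
        a, b = numbers[i], numbers[j]
        rest = [numbers[k] for k in range(len(numbers)) if k not in (i, j)]
        for r in (a + b, a - b, b - a, a * b):
            out.add(tuple(sorted(rest + [r])))
    return out

def heuristic_bfs(start, target, beam=5):
    """Level-by-level search keeping the `beam` states closest to the target."""
    frontier = [tuple(sorted(start))]
    while frontier:
        if any(s == (target,) for s in frontier):
            return True
        next_level = {c for s in frontier if len(s) > 1 for c in children(s)}
        frontier = sorted(next_level, key=lambda s: abs(sum(s) - target))[:beam]
    return False

if __name__ == "__main__":
    print(heuristic_bfs((1, 2, 3, 4), 10))  # True, e.g. 1+2=3, 3+3=6, 6+4=10
```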

💡Reinforcement Learning (RL)

RL is a machine learning approach that trains a model to make decisions through rewards and penalties. In the video, RL strategies such as expert iteration (STaR) and advantage-induced policy alignment (APA) are used to improve the language model's problem-solving ability. With these strategies the model can improve itself from feedback, which is a key part of the discussion.

💡Countdown

Countdown is a numbers puzzle that requires reaching a target number from given input numbers using arithmetic operators. The video uses Countdown as the search problem for training the language model because it demands planning, search, and backtracking, which matches the training goal of improving problem-solving ability.
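
For reference, here is a hedged brute-force Countdown solver, assuming (as in the 24 game) that every input number is used exactly once, intermediate results can be reused, and division is only taken when it is exact. These rules are assumptions about the variant used; the sketch is meant only to show what a solution path looks like.

```python
from itertools import combinations

def solve(numbers, target):
    """Return one list of steps reaching the target, or None if impossible."""
    if len(numbers) == 1:
        return [] if numbers[0] == target else None
    for i, j in combinations(range(len(numbers)), 2):
        a, b = numbers[i], numbers[j]
        rest = [numbers[k] for k in range(len(numbers)) if k not in (i, j)]
        candidates = [(a + b, f"{a}+{b}"), (a * b, f"{a}*{b}"),
                      (a - b, f"{a}-{b}"), (b - a, f"{b}-{a}")]
        if b != 0 and a % b == 0:
            candidates.append((a // b, f"{a}/{b}"))
        if a != 0 and b % a == 0:
            candidates.append((b // a, f"{b}/{a}"))
        for value, step in candidates:
            rest_steps = solve(rest + [value], target)
            if rest_steps is not None:
                return [f"{step}={value}"] + rest_steps
    return None

if __name__ == "__main__":
    # Worked example: reach 10 from 1, 2, 3, 4.
    print(solve([1, 2, 3, 4], 10))  # ['1+2=3', '3+4=7', '3+7=10']
```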

💡Advantage-Induced Policy Alignment (APA)

APA is a reinforcement learning technique that creates a copy of the model to serve as a value network and uses it to improve the original model's policy. In the video, APA is used to fine-tune the language model and strengthen its search and planning abilities. Together with expert iteration, it shows how correctness and efficiency feedback can be used to improve the model's performance.
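
The toy sketch below conveys the flavor of an advantage-based update: a separate value estimate provides a baseline, and the policy's log-ratio against a reference model is pushed toward the scaled advantage. This is a deliberately simplified stand-in, not the actual APA objective or the paper's implementation; the function names, the update rule, and the numbers are assumptions.

```python
# Hedged, simplified sketch of an advantage-based policy alignment loss.

def advantage(reward_value, baseline_value):
    """Advantage = observed reward minus the value network's baseline estimate."""
    return reward_value - baseline_value

def apa_style_loss(policy_logprob, ref_logprob, adv, beta=1.0):
    """Squared error pushing log(pi / pi_ref) toward adv / beta.
    A simplified reading of advantage-induced policy alignment."""
    return (policy_logprob - ref_logprob - adv / beta) ** 2

if __name__ == "__main__":
    # Toy numbers: a correct, short trajectory gets a positive advantage,
    # so the loss is minimized by raising its probability relative to the
    # reference model.
    adv = advantage(reward_value=0.9, baseline_value=0.4)
    print(apa_style_loss(policy_logprob=-12.0, ref_logprob=-12.5, adv=adv))
```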

Highlights

The study explores the impact of training language models on the search process, aiming to teach them how to search and backtrack so that they can self-improve.

Transformer-based models struggle with planning tasks, facing error compounding and look-ahead challenges.

Representing the search process as a stream of search defines components such as exploration, backtracking, and pruning.

A search problem inspired by the game of Countdown requires combining input numbers with arithmetic operations to reach a target number.

A training dataset of diverse search trajectories is created with symbolic planners and heuristic functions.

The stream of search model outperforms models trained only to predict optimal steps.

Fine-tuning the model with advantage-induced policy alignment and expert iteration yields enhanced search and planning abilities.

The study shows that Transformer-based language models can learn to solve problems through search and improve their search strategies through self-training.

The problem space is modeled as a Markov decision process (MDP) with states, actions, a transition function, and a reward function.

A search tree is constructed from the initial state to the goal state through a sequence of states and actions.

A set of primitive operations is defined, including state expansion, exploration choice, pruning, backtracking, goal checking, and heuristics.

The model is trained on a synthetic dataset containing diverse and suboptimal symbolic search strategies.

The trained model generates correct solution trajectories more accurately than a model trained only on optimal solutions.

The model aligns well with multiple symbolic strategies and generates valid trajectories with low error rates.

Using reinforcement learning strategies such as STaR and advantage-induced policy alignment (APA), the model can improve itself.

Models fine-tuned with STaR and APA show improved performance on previously unsolvable problems.

After fine-tuning, the model can flexibly use various search strategies and may discover new heuristics and search methods.

Feedback mechanisms such as STaR and APA can guide the model toward self-improvement and stronger problem-solving ability.

The authors thank Gabriel Poesia, Jacob Andreas, Joy He-Yueya, and others for their discussions and support.

The research was supported by the Stanford Human-Centered Artificial Intelligence (HAI) Google Grant and an NSF Expeditions Grant.

Transcripts

00:02 Introduction

In this section we explore the impact of training language models on the search process. We aim to teach language models how to search and backtrack, allowing them to self-improve. Transformer-based models have struggled with planning tasks, facing issues like error compounding and look-ahead challenges. These problems stem from the models' limited ability to search and backtrack effectively. While some approaches have combined language models with symbolic search algorithms to address these issues, they only assist during inference, leaving unanswered whether language models can conduct search independently. Learning to search during training could greatly benefit language models: by mastering search techniques early on, models may develop more adaptable strategies to handle error compounding and look-ahead tasks. We demonstrate that language models can be trained to search and backtrack by representing the process as a stream of search (SoS). We define components like exploration, backtracking, and pruning in a unified language and apply them to a search problem inspired by the game of Countdown. Countdown presents a challenging search problem where input numbers must be combined with arithmetic operations to reach a target number. We create a training dataset of search trajectories using symbolic planners and heuristic functions, training a language model on this diverse dataset. Comparing this approach to training solely on optimal solutions reveals that the stream of search LM outperforms models focused on predicting optimal steps. Furthermore, when fine-tuned for correctness using advantage-induced policy alignment and expert iteration, the stream of search model shows enhanced search and planning abilities. Our results suggest that transformer-based language models can learn problem solving through search and improve their search strategies autonomously. Training for accuracy also leads to the discovery of new search strategies.

02:04 Related work

Previous methods integrated language models into search systems to generate actions and evaluate states. However, these methods primarily focus on inference, lacking improvement in reasoning abilities. In contrast, our approach trains language models to explore, backtrack, and reason effectively, learning an intrinsic policy for autonomous search. This eliminates the high inference costs associated with fixed search strategies. Other approaches, like in-context demonstrations of search and process supervision, have limitations in terms of demonstrated search procedures and scalability. Our method directly enhances the model's planning and search capabilities without the need for a verifier or reward model. While similar works train transformer models on search trajectories, our emphasis lies on autonomous search procedure usage and the discovery of new strategies, distinguishing our approach from existing methods.

03:07 Section summary

In this section we explore the impact of training language models (LMs) to learn from mistakes and search processes, allowing them to self-improve. By teaching LMs to search and backtrack during training, they can develop more flexible search strategies, enhancing their problem-solving abilities. Our study demonstrates that transformer-based LMs, when trained to recover from errors and explore different options, can autonomously employ various search strategies, leading to solving previously unsolved problems and discovering new search approaches.

03:43 A language for search

In this section we model the problem space as a Markov decision process (MDP). The MDP consists of states representing problem-solving steps, actions that can be taken, a transition function defining state changes based on actions, and a reward function for reaching the goal state. The search process starts with an initial state and aims to reach a goal state by exploring a search tree. This tree includes all possible actions from the initial state to its child states until a solution is found. A correct path to the solution is a sequence of states and actions leading from the initial state to the goal state. We focus on the search process within the search tree. To represent this process, we introduce primitive operations like the current state being explored, the goal state, a queue of states to be explored, an expansion function to explore adjacent states, and choices for exploration order. Other operations include pruning, backtracking, goal checking, and using heuristics to estimate distances to the goal. These operations can be implicit or explicit in the search trajectory. We choose to make the current state, goal state, backtracking, goal checks, and exploration choices explicit in the trajectory; however, we keep heuristic functions, state values, and pruning strategies implicit. For our task example, Countdown, a variation of the 24 game, we demonstrate the use of search streams. Countdown involves combining input numbers with arithmetic operations to reach a target number. We select this task due to its complexity, requiring planning, search, and backtracking for solutions. We focus on problems with four input numbers and target numbers ranging from 10 to 100, with 10% of targets held out for evaluation.

05:41 Section summary

In this section we model the problem space as a Markov decision process (MDP) with states, actions, transition functions, and reward functions to represent the search process. We define a search tree starting from an initial state to a goal state, where a correct path to the solution is a sequence of states and actions. We introduce a vocabulary of primitive operations like state expansion, exploration choice, pruning, backtracking, goal checks, and heuristics to guide the search process efficiently.

06:17 Training data

In this section we trained a model on streams of search for Countdown by creating a synthetic dataset with diverse and suboptimal symbolic search strategies. We define 12 search strategies based on breadth-first search (BFS) and depth-first search (DFS), using simple heuristic functions. These heuristics guide the search based on the absolute difference between the sum of the remaining options and the target, and the distance to the factors of the target. Our dataset consisted of 500,000 search trajectories, with only 57% leading to the solution. Each search trajectory is represented as a string of tree nodes (states) in traversal order. We evaluated the model's accuracy in generating correct solution trajectories for Countdown problems. We measured correctness by checking if the correct path to the solution was present in the generated trajectory. We also analyzed the alignment between different search strategies based on the problems they solved correctly and the states they visited. We then compared training models on clean, optimal solutions versus messy and sometimes unsuccessful search trajectories. Training on search trajectories outperformed training on optimal solutions, achieving 51.27% accuracy on held-out inputs compared to 25.73%. The stream of search trained model showed alignment with various symbolic strategies, with the highest correlation observed with DFS using the sum heuristic. The model did not rely heavily on any single strategy from its training data.

08:12 Section summary

In this section we construct a synthetic dataset for training a model on streams of search for Countdown by defining 12 search strategies based on BFS and DFS with interpretable heuristic functions. We compare models trained on optimal solutions versus suboptimal search trajectories and find that the model trained on streams of search outperforms the one trained on optimal solutions, achieving higher accuracy on held-out inputs and targets. Despite facing challenges, the model trained on suboptimal search trajectories demonstrates effective learning and alignment with various search strategies, showcasing its ability to generate valid trajectories with low error rates.

08:55 Policy improvement with stream of search

In this section we aim to investigate if our model, the stream of search language model (SoS LM), can enhance its problem-solving capabilities beyond what it has learned from its training data. We want to see if the model can improve itself based on feedback regarding correctness and efficiency. To assess this, we test the model's ability to solve problems that were previously unsolvable using the symbolic search strategies it was trained on. We also challenge it with difficult problems from the training set that none of the symbolic search strategies can handle. To enhance the model, we employ two reinforcement learning (RL) strategies: expert iteration using STaR, and advantage-induced policy alignment (APA). With STaR, we fine-tune the model by generating correct trajectories from the training dataset and using them to update the model iteratively until we observe improved performance on the validation set. APA, on the other hand, involves creating a copy of the language model to serve as a value network, which is then used to enhance the original model's policy. We designed a reward function based on correctness and trajectory length to guide the model's learning process. Our experiments show that after three iterations of STaR fine-tuning, the SoS models exhibit improved performance, solving additional problems beyond the base model. Similarly, when fine-tuned with APA, the models show enhanced accuracy, although the validation accuracy stabilizes after a certain number of training steps. By analyzing the state visitation patterns, we observe that both the STaR and APA models explore more states associated with specific heuristics, indicating their ability to employ diverse search strategies. Furthermore, the SoS models enhanced with STaR and APA demonstrate better error handling and faster solution finding compared to the base model. These improvements suggest that the models can effectively utilize various search strategies, potentially discovering new heuristics and search methods. Our results highlight the effectiveness of training language models to search for solutions, emphasizing the importance of exposing models to the problem-solving process rather than just optimal solutions. Overall, the SoS framework shows promise in enabling language models to tackle complex problems through internal search mechanisms. By incorporating feedback mechanisms like STaR and APA, we can guide the models towards self-improvement and enhance their problem-solving capabilities.

11:36 Section summary

In this section we explore if the SoS LM can enhance its symbolic strategies through self-improvement based on correctness and efficiency feedback. By utilizing RL strategies like STaR and APA, we observe significant improvements in solving previously unsolved and difficult problems beyond the symbolic search strategies. The SoS models show enhanced performance after iterations of fine-tuning with STaR and APA, demonstrating the ability to flexibly utilize various search strategies and discover novel heuristics.

12:13 Acknowledgements

In this section we express our gratitude to Gabriel Poesia, Jacob Andreas, Joy He-Yueya, Dangu Jang, Eric Zelikman, Jan Philip Franken, and C. Jong for their valuable discussions and support. Our research was made possible thanks to funding from the Stanford Human-Centered Artificial Intelligence (HAI) Google Grant and the NSF Expeditions Grant, award number 1918771.
