Stream of Search (SoS): Learning to Search in Language
Summary
TLDR: This video script examines the effect of training language models on the search process, with the goal of teaching language models how to search and backtrack so that they can self-improve. Transformer-based models struggle with planning tasks, facing problems such as error compounding and lookahead challenges, which stem from their limited ability to search and backtrack effectively. The work shows that by representing the search process as a stream of search, language models can be trained to search and backtrack. A search problem inspired by the game of Countdown, in which input numbers must be combined with arithmetic operations to reach a target number, provides a challenging testbed. Trained on a dataset created with symbolic planners and heuristic functions, a language model trained on this diverse data outperforms models trained only on optimal solutions. Moreover, after fine-tuning with advantage-induced policy alignment and expert iteration, the stream-of-search model exhibits enhanced search and planning abilities. The results suggest that transformer-based language models can learn problem solving through search and autonomously improve their search strategies.
Takeaways
- 🤖 Training language models to search and backtrack improves their ability to self-improve.
- 🚀 Transformer-based models struggle with planning tasks due to error compounding and lookahead challenges.
- 🔍 Representing the search process as a stream of search lets language models be trained to search and backtrack.
- 🎯 A training dataset built with heuristic functions helps the model learn diverse search strategies.
- 📈 Models trained on streams of search predict correct solution paths better than models trained only on optimal solutions.
- 🧠 Fine-tuning with advantage-induced policy alignment and expert iteration strengthens the model's search and planning abilities.
- 🔬 The study shows that transformer-based language models can learn problem solving through search and autonomously improve their search strategies.
- 📚 The problem space is modeled as a Markov decision process (MDP), defining states, actions, a transition function, and a reward function.
- 📊 Comparing different search strategies shows the model aligns with multiple symbolic strategies rather than relying on a single strategy from its training data.
- 🛠️ Fine-tuning with reinforcement learning lets the model solve previously unsolvable problems and discover new search strategies.
- ✅ Training for accuracy also leads to the discovery of new search strategies, showing the model can flexibly use different strategies.
- 🏆 The work demonstrates that internal search mechanisms can enable language models to tackle complex problems, and underscores the importance of exposing models to the problem-solving process rather than only optimal solutions.
Q & A
Why do language models need to learn to search and backtrack during training?
-Training language models to search and backtrack improves their performance on planning tasks, addressing problems such as error compounding and lookahead challenges, which stem from the models' limited ability to search and backtrack effectively.
What challenges do Transformer-based models face in planning tasks?
-Transformer-based models face error compounding and lookahead problems in planning tasks, which make it difficult for them to search and backtrack effectively.
How can training improve a language model's search ability?
-By representing the search process as a stream of search, defining components such as exploration, backtracking, and pruning, and applying them to a search problem inspired by the game of Countdown, language models can be trained to search and backtrack.
What role does the Countdown game play in training language models?
-Countdown provides a challenging search problem that requires combining input numbers with arithmetic operations to reach a target number. Solving it demands planning, search, and backtracking, which strengthens the model's problem-solving ability.
Why does the training dataset include search trajectories of which only 57% lead to a solution?
-Including suboptimal and sometimes unsuccessful search trajectories trains the model more comprehensively, allowing it to learn more diverse search strategies and making it more adaptable and flexible on complex problems.
How is the model's accuracy at generating correct solution trajectories evaluated?
-Accuracy is measured by checking whether the generated trajectory contains a correct path from the initial state to the goal state.
How does the model align with different search strategies during training?
-Alignment with different search strategies is assessed by analyzing which problems the model solves correctly and which states it visits. For example, the model correlates most strongly with depth-first search (DFS) guided by the sum heuristic.
Why is training on search trajectories more effective than training only on optimal solutions?
-Training on search trajectories yields higher accuracy on unseen inputs, because it lets the model learn multiple search strategies and discover new ones during training.
How are reinforcement learning strategies used to improve the model's problem-solving ability?
-The model is fine-tuned with reinforcement learning strategies such as expert iteration (STaR) and advantage-induced policy alignment (APA), improving its performance on the validation set and enabling it to solve previously unsolvable problems.
How is the reward function designed to guide the model's learning in the reinforcement learning setup?
-The reward function is based on correctness and trajectory length, encouraging the model to generate solutions that are both correct and efficient.
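As a rough illustration of that design, here is a minimal sketch of such a reward (an assumption for illustration, not the paper's actual reward; the weight `alpha` and the token-level length measure are hypothetical):

```python
def reward(trajectory_tokens, is_correct, alpha=0.001):
    """Score a generated search stream: reward correctness, and prefer
    shorter (more efficient) trajectories via a length penalty.
    `alpha` is an illustrative weight, not a value from the paper."""
    if not is_correct:
        return 0.0
    return 1.0 - alpha * len(trajectory_tokens)
```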
What benefits does training language models to search bring, according to the results?
-The results show that training language models to search enables them to autonomously employ various search strategies, solve previously unsolved problems, and discover new search methods, improving their problem-solving ability.
What external support contributed to the training and improvement of the model?
-The research benefited from valuable discussions and support from Gabriel Poesia, Jacob Andreas, Joy He-Yueya, Dangu Jang, Eric Zelikman, Jan-Philipp Fränken, and C. Jong, and was funded by a Stanford Human-Centered Artificial Intelligence (HAI) Google grant and an NSF Expeditions grant.
Outlines
🚀 Training Language Models to Search and Self-Improve
This section examines the effect of training language models on the search process, with the goal of teaching them how to search and backtrack so that they can self-improve. Transformer-based models face error compounding and lookahead challenges in planning tasks, problems that stem from their limited ability to search and backtrack. Although existing methods combine language models with symbolic search algorithms to address these issues, they only help during inference, and whether models can learn to search independently during training remains an open question. The work shows that by representing the search process as a stream of search, language models can be trained to search and backtrack. A search problem inspired by the game of Countdown, in which input numbers must be combined with arithmetic operations to reach a target number, provides a challenging testbed. Training on a dataset created with symbolic planners and heuristic functions demonstrates gains in the model's search and planning abilities. Furthermore, after fine-tuning with advantage-induced policy alignment and expert iteration, the model shows stronger search and planning capabilities. The results indicate that transformer-based language models can learn to solve problems through search and improve their search strategies autonomously.
🧠 Modeling the Search Problem and Training Data
This section models the problem space as a Markov decision process (MDP), including states, actions, a transition function, and a reward function, to represent the search process. A search tree is defined from the initial state to the goal state, where a correct solution is a sequence of states and actions. Primitive operations such as state expansion, exploration choice, pruning, backtracking, goal checking, and heuristics are introduced to guide the search efficiently. The Countdown game, which requires combining input numbers with arithmetic operations to reach a target number, serves as the running example for streams of search. For training data, a synthetic dataset containing diverse and suboptimal symbolic search strategies was created. The dataset comprises 12 search strategies based on breadth-first search (BFS) and depth-first search (DFS), guided by simple heuristic functions. Each search trajectory is represented as a string of state nodes, and only 57% of the trajectories lead to a solution. The model's accuracy in generating correct solution trajectories was evaluated, and by comparing alignment across search strategies, models trained on search trajectories were found to outperform models trained only on optimal solutions.
🔧 Improving the Model's Policy with Stream of Search
This section investigates whether the model (the stream-of-search language model, SoS LM) can enhance its problem-solving ability from feedback based on correctness and efficiency. The model was tested on problems that the symbolic search strategies in its training data could not solve, including difficult problems that none of those strategies can handle. Two reinforcement learning (RL) strategies were used to improve the model: expert iteration (STaR) and advantage-induced policy alignment (APA). STaR fine-tunes the model by generating correct trajectories from the training dataset and using them to update the model iteratively until performance improves on the validation set. APA instead creates a copy of the language model to serve as a value network, which is then used to improve the original model's policy. A reward function based on correctness and trajectory length guides the learning process. Experiments show that after three iterations of STaR fine-tuning, the SoS models solve additional problems beyond the base model; models fine-tuned with APA likewise show improved accuracy. Analyzing state-visitation patterns reveals that both STaR and APA models explore more states associated with specific heuristics, indicating that they can employ diverse search strategies. The SoS models also handle errors better and find solutions faster than the base model. These improvements suggest that the models can flexibly use various search strategies and potentially discover new heuristics and search methods. The results underscore the effectiveness of training language models to search for solutions and the importance of exposing models to the problem-solving process rather than only optimal solutions.
📝 Acknowledgements
This section expresses gratitude to Gabriel Poesia, Jacob Andreas, Joy He-Yueya, Dangu Jang, Eric Zelikman, Jan-Philipp Fränken, and C. Jong for their valuable discussions and support. The research was made possible by funding from the Stanford Human-Centered Artificial Intelligence (HAI) Google grant and the NSF Expeditions grant, award number 1918771.
Keywords
💡Language model
💡Search process
💡Markov decision process (MDP)
💡Search tree
💡Heuristic function
💡Backtracking
💡Depth-first search (DFS)
💡Breadth-first search (BFS)
💡Reinforcement learning (RL)
💡Countdown game
💡Advantage-induced policy alignment (APA)
Highlights
The study examines the effect of training language models on the search process, aiming to teach them how to search and backtrack so that they can self-improve.
Transformer-based models struggle with planning tasks, facing issues such as error compounding and lookahead challenges.
The search process is represented as a stream of search, defining components such as exploration, backtracking, and pruning.
A search problem inspired by the Countdown game requires combining input numbers with arithmetic operations to reach a target number.
A training dataset of diverse search trajectories was created using symbolic planners and heuristic functions.
Compared with models trained only on optimal solutions, stream-of-search models are better at predicting optimal steps.
Fine-tuning with advantage-induced policy alignment and expert iteration yields enhanced search and planning abilities.
The study shows that transformer-based language models can learn to solve problems through search and improve their search strategies autonomously.
The problem space is modeled as a Markov decision process (MDP), including states, actions, a transition function, and a reward function.
A search tree is constructed from the initial state to the goal state through a sequence of states and actions.
A set of primitive operations is defined, including state expansion, exploration choice, pruning, backtracking, goal checking, and heuristics.
The model is trained on a synthetic dataset containing diverse and suboptimal symbolic search strategies.
The trained model generates correct solution trajectories more accurately than models trained only on optimal solutions.
The model aligns effectively with multiple symbolic strategies and generates valid trajectories with low error rates.
With reinforcement learning strategies such as STaR and advantage-induced policy alignment (APA), the model can self-improve.
Models fine-tuned with STaR and APA show improved performance on previously unsolvable problems.
After fine-tuning, the model can flexibly use various search strategies and may discover new heuristics and search methods.
Feedback mechanisms such as STaR and APA can guide the model toward self-improvement and stronger problem-solving ability.
The authors thank Gabriel Poesia, Jacob Andreas, Joy He-Yueya, and others.
The research was supported by a Stanford Human-Centered Artificial Intelligence (HAI) Google grant and the NSF Expeditions Grant.
Transcripts
Introduction

In this section we explore the impact of training language models on the search process. We aim to teach language models how to search and backtrack, allowing them to self-improve. Transformer-based models have struggled with planning tasks, facing issues like error compounding and lookahead challenges. These problems stem from the models' limited ability to search and backtrack effectively. While some approaches have combined language models with symbolic search algorithms to address these issues, they only assist during inference, leaving unanswered whether language models can conduct search independently. Learning to search during training could greatly benefit language models: by mastering search techniques early on, models may develop more adaptable strategies to handle error compounding and lookahead tasks.

We demonstrate that language models can be trained to search and backtrack by representing the process as a stream of search (SoS). We define components like exploration, backtracking, and pruning in a unified language and apply them to a search problem inspired by the game of Countdown. Countdown presents a challenging search problem where input numbers must be combined with arithmetic operations to reach a target number; for example, given inputs 2, 3, 9, 5 and target 24, one solution is 9 - 5 = 4, then 4 * 3 = 12, then 12 * 2 = 24. We create a training dataset of search trajectories using symbolic planners and heuristic functions, training a language model on this diverse dataset. Comparing this approach to training solely on optimal solutions reveals that the stream-of-search LM outperforms models focused on predicting optimal steps. Furthermore, when fine-tuned for correctness using advantage-induced policy alignment and expert iteration, the stream-of-search model shows enhanced search and planning abilities. Our results suggest that transformer-based language models can learn problem solving through search and improve their search strategies autonomously; training for accuracy also leads to the discovery of new search strategies.

In related work, previous methods integrated language models into search systems to generate actions and evaluate states. However, these methods primarily focus on inference, lacking improvement in reasoning abilities. In contrast, our approach trains language models to explore, backtrack, and reason effectively, learning an intrinsic policy for autonomous search. This eliminates the high inference costs associated with fixed search strategies. Other approaches, like in-context demonstrations of search and process supervision, have limitations in terms of demonstrated search procedures and scalability. Our method directly enhances the model's planning and search capabilities without the need for a verifier or reward model. While similar works train transformer models on search trajectories, our emphasis lies on autonomous search procedure usage and the discovery of new strategies, distinguishing our approach from existing methods.

Section summary: In this section we explore the impact of training language models (LMs) to learn from mistakes and search processes, allowing them to self-improve. By teaching LMs to search and backtrack during training, they can develop more flexible search strategies, enhancing their problem-solving abilities. Our study demonstrates that transformer-based LMs, when trained to recover from errors and explore different options, can autonomously employ various search strategies, leading to solving previously unsolved problems and discovering new search approaches.

A Language for Search

In this section we model the problem space as a Markov decision process (MDP). The MDP consists of states representing problem-solving steps, actions that can be taken, a transition function defining state changes based on actions, and a reward function for reaching the goal state. The search process starts with an initial state and aims to reach a goal state by exploring a search tree. This tree includes all possible actions from the initial state to its child states, until a solution is found. A correct path to the solution is a sequence of states and actions leading from the initial state to the goal state.

We focus on the search process within the search tree. To represent this process, we introduce primitive operations like the current state being explored, the goal state, a queue of states to be explored, an expansion function to explore adjacent states, and choices for exploration order. Other operations include pruning, backtracking, goal checking, and using heuristics to estimate distances to the goal. These operations can be implicit or explicit in the search trajectory. We choose to make the current state, goal state, backtracking, goal checks, and exploration choices explicit in the trajectory; however, we keep heuristic functions, state values, and pruning strategies implicit.

For our task example, Countdown, a variation of the 24 game, we demonstrate the use of search streams. Countdown involves combining input numbers with arithmetic operations to reach a target number. We select this task due to its complexity, requiring planning, search, and backtracking for solutions. We focus on problems with four input numbers and target numbers ranging from 10 to 100, with 10% of targets held out for evaluation.

Section summary: In this section we model the problem space as a Markov decision process (MDP), with states, actions, transition functions, and reward functions to represent the search process. We define a search tree starting from an initial state to a goal state, where a correct path to the solution is a sequence of states and actions. We introduce a vocabulary of primitive operations, like state expansion, exploration choice, pruning, backtracking, goal checks, and heuristics, to guide the search process efficiently.
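To make the search MDP concrete, here is a minimal sketch of Countdown state expansion (an illustration under assumptions, not the paper's actual code: the function name `expand` and the list-of-numbers state layout are hypothetical). A state is the multiset of numbers still available; an action combines two of them with an arithmetic operation:

```python
from itertools import combinations

def expand(state):
    """Return (action, child_state) pairs for a Countdown state.
    Each action combines two available numbers with +, *, -, or /
    (subtraction and division are ordered to keep results integral)."""
    children = []
    for a, b in combinations(state, 2):
        rest = list(state)
        rest.remove(a)
        rest.remove(b)
        hi, lo = max(a, b), min(a, b)
        results = [(f"{a}+{b}={a + b}", a + b),
                   (f"{a}*{b}={a * b}", a * b),
                   (f"{hi}-{lo}={hi - lo}", hi - lo)]
        if lo != 0 and hi % lo == 0:
            results.append((f"{hi}/{lo}={hi // lo}", hi // lo))
        for action, value in results:
            children.append((action, sorted(rest + [value])))
    return children

# One expansion step from the example problem above:
# expand([2, 3, 9, 5]) includes ("9-5=4", [2, 3, 4]), the first move
# of the solution 9-5=4, 4*3=12, 12*2=24.
```

A stream of search then serializes a traversal of this tree as plain text, with the current state, goal checks, exploration choices, and backtracking written out explicitly (the exact token format used in the paper differs from this sketch).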
Training Data

In this section we train a model on streams of search for Countdown by creating a synthetic dataset with diverse and suboptimal symbolic search strategies. We define 12 search strategies based on breadth-first search (BFS) and depth-first search (DFS), using simple heuristic functions. These heuristics guide the search based on the absolute difference between the remaining options and the target sum, and the distance to the factors of the target. Our dataset consists of 500,000 search trajectories, with only 57% leading to the solution. Each search trajectory is represented as a string of tree nodes (states) in traversal order.

We evaluated the model's accuracy in generating correct solution trajectories for Countdown problems. We measured correctness by checking whether the correct path to the solution was present in the generated trajectory. We also analyzed the alignment between different search strategies based on the problems they solved correctly and the states they visited. We then compared training models on clean optimal solutions versus messy and sometimes unsuccessful search trajectories. Training on search trajectories outperformed training on optimal solutions, achieving 51.27% accuracy on held-out inputs compared to 25.73%. The trained model showed alignment with various symbolic strategies, with the highest correlation observed with DFS using the sum heuristic; the model did not rely heavily on any single strategy from its training data.

Section summary: In this section we construct a synthetic dataset for training a model on streams of search for Countdown by defining 12 search strategies based on BFS and DFS with interpretable heuristic functions. We compare models trained on optimal solutions versus suboptimal search trajectories and find that the model trained on streams of search outperforms the one trained on optimal solutions, achieving higher accuracy on held-out inputs and targets. Despite facing challenges, the model trained on suboptimal search trajectories demonstrates effective learning and alignment with various search strategies, showcasing its ability to generate valid trajectories with low error rates.
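As a hedged sketch of the two heuristic families just described (the exact definitions in the paper may differ; `expand` is the hypothetical expansion function from the earlier sketch), the sum heuristic scores a state by how far the remaining numbers are from the target, and the factor heuristic by how close they are to the target's factors; either can order DFS exploration:

```python
def sum_heuristic(state, target):
    """Absolute difference between the sum of the remaining numbers
    and the target (lower is more promising)."""
    return abs(sum(state) - target)

def factor_heuristic(state, target):
    """Total distance from each remaining number to the nearest
    factor of the target (lower is more promising)."""
    factors = [f for f in range(1, target + 1) if target % f == 0]
    return sum(min(abs(n - f) for f in factors) for n in state)

def ordered_children(state, target, heuristic):
    """Exploration order for DFS: most promising children first."""
    return sorted(expand(state), key=lambda ac: heuristic(ac[1], target))
```

Combining heuristics like these with different traversal schemes (e.g., BFS at several beam widths and DFS with pruning) is one plausible way to obtain a family of twelve distinct symbolic strategies.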
Policy Improvement with Stream of Search

In this section we aim to investigate whether our model, the stream-of-search language model (SoS LM), can enhance its problem-solving capabilities beyond what it has learned from its training data. We want to see if the model can improve itself based on feedback regarding correctness and efficiency. To assess this, we test the model's ability to solve problems that were previously unsolvable using the symbolic search strategies it was trained on. We also challenge it with difficult problems from the training set that none of the symbolic search strategies can handle.

To enhance the model, we employ two reinforcement learning (RL) strategies: expert iteration using STaR, and advantage-induced policy alignment (APA). With STaR, we fine-tune the model by generating correct trajectories from the training dataset and using them to update the model iteratively, until we observe improved performance on the validation set. APA, on the other hand, involves creating a copy of the language model to serve as a value network, which is then used to enhance the original model's policy. We designed a reward function based on correctness and trajectory length to guide the model's learning process.

Our experiments show that after three iterations of STaR fine-tuning, the SoS models exhibit improved performance, solving additional problems beyond the base model. Similarly, when fine-tuned with APA, the models show enhanced accuracy, although the validation accuracy stabilizes after a certain number of training steps. By analyzing the state-visitation patterns, we observe that both STaR and APA models explore more states associated with specific heuristics, indicating their ability to employ diverse search strategies. Furthermore, the SoS models enhanced with STaR and APA demonstrate better error handling and faster solution finding compared to the base model. These improvements suggest that the models can effectively utilize various search strategies, potentially discovering new heuristics and search methods. Our results highlight the effectiveness of training language models to search for solutions, emphasizing the importance of exposing models to the problem-solving process rather than just optimal solutions. Overall, the SoS framework shows promise in enabling language models to tackle complex problems through internal search mechanisms. By incorporating feedback mechanisms like STaR and APA, we can guide the models toward self-improvement and enhance their problem-solving capabilities.

Section summary: In this section we explore whether the SoS LM can enhance its symbolic strategies through self-improvement based on correctness and efficiency feedback. By utilizing RL strategies like STaR and APA, we observe significant improvements in solving previously unsolved and difficult problems beyond the symbolic search strategies. The SoS models show enhanced performance after iterations of fine-tuning with STaR and APA, demonstrating the ability to flexibly utilize various search strategies and discover novel heuristics.
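A minimal sketch of the expert-iteration (STaR) loop just described, under stated assumptions: `model.generate`, `model.finetune`, and `contains_solution_path` are hypothetical stand-ins for sampling a stream of search, fine-tuning on text, and the correctness check, not the paper's actual API:

```python
def star(model, problems, n_iterations=3):
    """Expert iteration: sample search streams, keep only the correct
    ones, fine-tune on them, and repeat."""
    for _ in range(n_iterations):
        correct_streams = []
        for problem in problems:
            stream = model.generate(problem)             # sample a stream of search
            if contains_solution_path(stream, problem):  # keep correct trajectories
                correct_streams.append(stream)
        model = model.finetune(correct_streams)          # update on the filtered set
    return model
```

APA differs in that, rather than filtering trajectories, it keeps a copy of the model as a value network and optimizes the policy directly against the correctness-and-length reward.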
Acknowledgements

In this section we express our gratitude to Gabriel Poesia, Jacob Andreas, Joy He-Yueya, Dangu Jang, Eric Zelikman, Jan-Philipp Fränken, and C. Jong for their valuable discussions and support. Our research was made possible thanks to funding from the Stanford Human-Centered Artificial Intelligence (HAI) Google grant and the NSF Expeditions grant, award number 1918771.