[AI] The technology behind OpenAI's o1 model | Post-training scaling laws | Test-time compute | Slow thinking | Implicit chain of thought (CoT) | STaR | Critic models | Where is the ceiling for large language models?
Summary
TLDR: In this episode, 'Best Partners' discusses OpenAI's o1 series model, highlighting its significant advancements in mathematics, coding, and long-term planning. The model's performance surge is attributed to post-training scaling laws and test-time compute scaling. The script explores the model's technical aspects, emphasizing the shift from pre-training parameter scaling to reinforcement learning in post-training, which is crucial for enhancing reasoning and problem-solving abilities. It also touches on techniques like MCTS, Chain of Thought, and STaR for optimizing model output, suggesting that the future of AI may lie in the intelligent allocation of computational resources during the post-training phase.
Takeaways
- 🚀 OpenAI's o1 model represents a significant leap in AI capabilities, particularly in mathematical reasoning, coding, and long-range planning.
- 📈 The model's advancements are attributed to post-training scaling laws and reinforcement learning during the post-training phase, which have allowed it to surpass PhD-level human accuracy in certain domains.
- 🧠 The diminishing returns of pre-training scaling suggest that future AI improvements may rely more on post-training enhancements like reinforcement learning.
- 🤖 o1's performance in competitive programming and mathematical problem-solving places it within the top percentiles of human performers.
- 🔍 The model's approach includes techniques like Monte Carlo Tree Search (MCTS) and Chain of Thought (CoT) to enhance its reasoning and error-correction abilities.
- 💡 STaR and Quiet-STaR methodologies are highlighted as being instrumental in teaching the model to think before responding, thereby improving its reasoning accuracy.
- 🔄 The concept of 'internal thinking' introduced by Quiet-STaR allows the model to perform implicit reasoning without external examples, expanding its applicability.
- 📚 o1's training process is dynamic, incorporating self-critique and iterative learning to refine its reasoning chains and strategies.
- 🌐 The model's ability to generate high-quality training data through its reasoning processes could lead to a self-reinforcing cycle of performance improvement.
- 🔮 While o1 excels in complex reasoning tasks, it may not yet be optimized for general agent or assistant roles, indicating a potential trade-off between reasoning and directive-following capabilities.
Q & A
What significant advancements did OpenAI's o1 series model achieve according to the transcript?
-The o1 series model achieved significant advancements in mathematics, coding, and long-range planning. It ranked in the 89th percentile in competitive programming on Codeforces, made it into the top 500 students in the American Invitational Mathematics Examination (AIME), and surpassed human doctoral level accuracy on the GPQA benchmark for physics, biology, and chemistry problems.
What is the Post-Training Scaling Law and how does it relate to the o1 model's performance?
-The Post-Training Scaling Law refers to the principle that the performance of AI models can be significantly enhanced through reinforcement learning during the post-training phase, rather than just scaling up the model's parameters during pre-training. This law was instrumental in the o1 model's performance leap, suggesting a shift in focus towards post-training optimization for improving reasoning and long-range problem-solving abilities.
How does the o1 model utilize reinforcement learning in its training?
-The o1 model employs reinforcement learning during the post-training phase to enhance its reasoning capabilities. It does this by iteratively guiding the model to produce logical reasoning paths and incorporating these into the training process, allowing the model to learn and improve its reasoning accuracy over time.
What is the role of Test-Time Compute in the performance of large language models as discussed in the transcript?
-Test-Time Compute refers to the computational resources a model uses during the testing phase for reasoning and reflection. The transcript suggests that increasing Test-Time Compute can be more effective than simply scaling up model parameters, as it allows the model to engage in deeper and more complex reasoning processes, which directly impacts the model's performance.
What is the concept of 'Chain of Thought' (CoT) mentioned in the transcript, and how does it improve model output?
-The 'Chain of Thought' (CoT) is a method where the model is prompted to generate a series of intermediate reasoning steps before providing a final answer. This approach helps to enhance the model's reasoning capabilities, especially in tasks requiring mathematical and coding solutions, by making the reasoning process more explicit and structured.
How does the STaR method contribute to the o1 model's reasoning abilities?
-STaR, short for 'Self-Taught Reasoner' and introduced in the paper 'STaR: Bootstrapping Reasoning With Reasoning,' is a method that leverages the model's existing reasoning capabilities to iteratively guide it in producing logical reasoning paths. It incorporates these paths into the training process, allowing the model to learn and improve its reasoning accuracy, in a way similar to policy gradient algorithms in reinforcement learning.
What is the difference between STaR and Quiet-STaR as per the transcript?
-While STaR focuses on explicit reasoning by generating reasoning paths, Quiet-STaR introduces the concept of 'internal thinking,' transforming the explicit reasoning process into an implicit one within the model. This allows Quiet-STaR to operate without reliance on external examples and to apply reasoning across a broader range of tasks and non-structured data.
How does the o1 model optimize its internal reasoning process according to the transcript?
-The o1 model optimizes its internal reasoning process through a combination of reinforcement learning and dynamic introduction of reasoning tokens. It learns to identify and correct errors, break down complex steps into simpler ones, and try different solutions when necessary, which significantly enhances its reasoning capabilities.
What is the concept of a 'data flywheel' mentioned in the transcript, and how does it relate to the o1 model?
-A 'data flywheel' refers to a self-reinforcing cycle where the model's reasoning process generates high-quality training data, which can then be used to further improve the model's performance. In the context of the o1 model, this concept suggests that as the model's bootstrapping capabilities expand, it can accelerate performance improvements and potentially move closer to achieving superintelligence.
What challenges does the o1 model face in balancing reasoning capabilities with following instructions, as discussed in the transcript?
-While the o1 model excels in reasoning abilities, especially for complex tasks like mathematics and physics, it may not necessarily perform as well as an agent or assistant in language generation tasks. The transcript suggests that as models become more powerful, there could be a separation between reasoning capabilities and the ability to follow instructions, which could become a core issue in developing general intelligent agents.
Outlines
🚀 Introduction to the o1 Model's Breakthroughs
The script introduces the advancements of the o1 series model by OpenAI, highlighting its significant improvements in mathematics, coding, and long-range planning. The model's performance is benchmarked against competitive programming, mathematical contests, and scientific problem-solving, where it has exceeded human expert levels. The script emphasizes the role of the Post-Training Scaling Law and reinforcement learning during the post-training phase in achieving these capabilities. It also discusses the diminishing returns of merely scaling up parameters during pre-training and suggests that post-training reinforcement learning is the pivotal next step. The script references Ilya Sutskever's 2018 MIT talk on the potential of reinforcement learning and self-play for AGI, and OpenAI's exploration of scaling laws beyond parameter size.
🤖 Deep Dive into Post-Training Scaling and the STaR Method
This section delves into the technical aspects of the o1 model's training, particularly the Post-Training Scaling Law. It explains how the training phase's computational demand is now linked to both the model's parameter size and the computational load of reinforcement-learning exploration. The script introduces the STaR method, which iteratively improves the model's reasoning by evaluating predictions and updating the model only on correct samples. It contrasts this approach with other optimization techniques such as Monte Carlo Tree Search and Chain of Thought, highlighting STaR's ability to enhance explicit reasoning. Limitations of STaR, such as dependency on few-shot examples and restricted generalization, are also discussed, leading to the introduction of Quiet-STaR, which internalizes the reasoning process and broadens the model's applicability.
🔍 Quiet-STaR Innovations and Future AI Prospects
The final paragraph discusses the innovations of Quiet-STaR, which lets language models 'think' before speaking by marking the beginning and end of thought processes with special tokens. It explores how Quiet-STaR uses distribution differences between reasoned and actual outcomes to introduce reward signals for reinforcement learning, improving the model's accuracy in predicting future tokens. The script also speculates on how OpenAI's o1 model might have optimized its internal reasoning process, potentially using critic models for fine-grained feedback. It reflects on the progress of large AI models since ChatGPT's release, noting industry and academia's efforts to push their capabilities. The discussion concludes with the observation that while o1 excels at reasoning, it may not be as adept as an agent or assistant, suggesting a potential dichotomy between reasoning and instruction-following abilities that future models need to address.
Keywords
💡Post-Training Scaling Law
💡Reinforcement Learning
💡Codeforces
💡AIME
💡GPQA
💡Test-Time Compute
💡STaR
💡Quiet-STaR
💡Critic Model
💡Data Flywheel
💡System 1 and System 2
Highlights
OpenAI's o1 series models have achieved significant improvements in mathematics, coding, and long-term planning.
o1 ranks in the 89th percentile in competitive programming on Codeforces and among the top 500 students in the American Invitational Mathematics Examination (AIME).
o1 surpasses PhD-level human accuracy on the GPQA benchmark for physics, biology, and chemistry problems.
Post-Training Scaling Law is pivotal for o1's performance leap, suggesting a reevaluation of computational resource allocation.
As model size increases, the marginal benefits of pre-training parameter scaling are diminishing.
Reinforcement learning during the post-training phase is identified as the next breakthrough for enhancing model reasoning and long-term problem-solving capabilities.
Ilya Sutskever expressed confidence in reaching AGI through reinforcement learning and self-play at MIT in 2018.
OpenAI's exploration of scaling laws beyond parameters is evident in their 2021 paper on training verifiers for math word problems.
The inability of autoregressive models to self-correct answers is a challenge for progress in mathematical reasoning.
Reinforcement learning brings a paradigm shift in large language model training and introduces new scaling laws for post-training.
Training compute in the post-training phase includes not only model parameter scaling but also the computational load of reinforcement learning exploration.
Test-Time Compute, or the computational load during model reasoning and reflection, also affects model performance.
The necessity of sufficient computational power for post-training to enhance reasoning performance is becoming a critical factor.
o1's performance continues to improve with more reinforcement learning and extended thinking time.
STaR and Quiet-STaR methods are highlighted as being closest to o1's technical route and model performance.
STaR uses the model's reasoning capabilities to iteratively guide it to produce logical reasoning processes.
Quiet-STaR introduces 'internal thinking', transforming explicit reasoning into implicit processes within the model.
o1 likely optimizes the internal reasoning process, or 'implicit CoT', focusing training compute on this optimization.
Critic Model is introduced to provide fine-grained feedback for complex tasks that are difficult for the model to reason about internally.
o1 learns to optimize its reasoning chain and improve strategies, identifying and correcting errors, and breaking down complex steps.
o1 evolves from a fast, intuitive thinking model to a slower, more deliberate and reliable reasoning process, enhancing its ability to solve complex problems.
The potential for a data flywheel effect, where o1's reasoning process generates high-quality training data for self-improvement, is discussed.
The future of large models is considered, with a focus on balancing reasoning capabilities with the ability to follow instructions for building general intelligence.
Transcripts
Hello everyone, this is Best Partners, and I'm Dafei.
At midnight Beijing time on September 13th, OpenAI released the o1 series of models,
which achieved significant improvements on problems in mathematics, coding, and long-horizon planning.
For example, o1 ranks in the 89th percentile on the competitive programming site Codeforces,
places among the top 500 students in the US in the AIME (American Invitational Mathematics Examination) qualifier,
and exceeds human PhD-level accuracy on GPQA, a benchmark of physics, biology, and chemistry problems.
What powered o1's performance leap is the scaling of reinforcement learning in the post-training phase,
along with the scaling of thinking time at test time.
Today we'll mainly talk about some of the technology behind o1,
especially the former,
that is, the Post-Training Scaling Law.
Its emergence
may prompt us to rethink how we allocate compute and how we build post-training capability.
In fact, people have already noticed
that as large models grow in size,
the marginal gains from purely scaling up parameters in the pre-training phase
have begun to diminish.
If we want to deeply improve a model's reasoning ability and its handling of long-horizon problems,
then post-training based on reinforcement learning
will be the next breakthrough.
As early as 2018, in a guest lecture at MIT,
Ilya shared his confidence in reaching AGI
through reinforcement learning and self-play.
Clearly,
OpenAI has also kept exploring scaling laws beyond parameters.
Back in 2021,
in the paper 'Training Verifiers to Solve Math Word Problems',
they noted that one reason autoregressive models struggle to progress on mathematical reasoning
is that they have no way to autonomously correct their own answers.
Relying only on generative methods and larger parameter counts
will not bring large gains
on mathematical reasoning tasks.
So additional scaling laws are needed.
And as things now stand,
reinforcement learning has not only brought a paradigm shift in how large language models are trained,
it has also brought new scaling laws,
namely the Post-Training Scaling Laws.
Under these new scaling laws,
training-phase compute is no longer tied only to growth in model parameters;
it also includes the compute the large language model spends on inference
during reinforcement-learning exploration.
At the same time,
the compute the model spends on reasoning and reflection at test time,
that is, Test-Time Compute,
also affects the model's final performance.
This paradigm shift is described in DeepMind's recent paper
'Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters'.
In the post-training phase, even though the model's parameters do not change,
training compute can still grow severalfold,
and per-query inference compute also grows
as the model's thinking ability improves.
So whether you have enough compute for post-training
may well become the ticket of admission for improving reasoning performance from now on.
Of course, OpenAI's findings confirm this:
with more reinforcement learning and more thinking time,
o1's performance keeps improving,
and the space opened up by post-training scaling laws has not yet been fully explored.
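As a minimal illustration of trading extra test-time compute for reliability, the sketch below uses self-consistency-style majority voting over repeated samples. This is not o1's actual mechanism (o1 spends its test-time compute on a hidden reasoning chain), and the stochastic `sample_answer` stub merely stands in for a real LLM call:

```python
from collections import Counter
import random

def sample_answer(rng):
    # Stand-in for one stochastic model sample; a real LLM call goes here.
    # This toy "model" answers "42" 60% of the time, else a wrong answer.
    return "42" if rng.random() < 0.6 else rng.choice(["41", "43"])

def majority_vote(n_samples, seed=0):
    """Spend more test-time compute by drawing n samples and voting."""
    rng = random.Random(seed)
    votes = Counter(sample_answer(rng) for _ in range(n_samples))
    return votes.most_common(1)[0][0]

# A single sample is right only 60% of the time; with many samples the
# majority answer is almost always the correct one.
print(majority_vote(1))
print(majority_vote(501))
```

The point of the toy is the scaling knob: accuracy improves monotonically with `n_samples` even though the underlying model is unchanged, which is the essence of the test-time compute argument.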
Rich Sutton pointed out in 'The Bitter Lesson'
that only two kinds of techniques scale with compute:
learning and search.
NVIDIA scientist Jim Fan has also said
that most of a model's parameters are actually used to store and memorize knowledge.
So,
as the marginal benefit of the parameter scaling law diminishes,
it is time
to shift more compute toward the post-training and inference phases.
So how exactly does the o1 model OpenAI just released
do reinforcement learning in the post-training phase?
Let's start with an everyday example.
When we write or speak,
we often pause to think.
A large language model predicting the next token, however,
behaves more like a fast-thinking process.
Lacking detailed intermediate reasoning steps,
the model may make a mistake early on,
and that mistake can propagate,
eventually making the generated answer wrong as well.
To optimize this process,
the field has developed a series of methods.
One uses Monte Carlo Tree Search (MCTS),
modeling the model's output
as a series of token-level or sentence-level nodes
and then providing reward signals
to help the model adjust the answers it generates.
Another approach
optimizes the model's output with Chain of Thought (CoT).
CoT uses step-by-step reasoning,
asking the model to generate a series of intermediate reasoning steps
before it produces the final answer.
Generating this 'chain of thought'
helps strengthen the model's reasoning ability,
and it works especially well on tasks such as math and code generation.
However, although CoT can generate intermediate steps,
it does not teach the model
how to think deeply, from the inside, about how the parts of a problem connect.
For very complex tasks that require multi-step reasoning and planning in particular,
the intermediate CoT reasoning process matters even more.
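As a concrete sketch of CoT prompting, the helpers below wrap a question with a step-by-step trigger and parse the final answer out of a completion. The prompt wording and the `Answer:` marker are illustrative conventions only, not anything o1-specific, and the hand-written completion stands in for a real model response:

```python
def cot_prompt(question: str) -> str:
    """Wrap a question so the model reasons step by step before answering."""
    return (
        f"Question: {question}\n"
        "Let's think step by step, then give the final answer "
        "on a line starting with 'Answer:'.\n"
    )

def extract_answer(completion: str) -> str:
    """Pull the final answer out of a CoT completion."""
    for line in completion.splitlines():
        if line.startswith("Answer:"):
            return line[len("Answer:"):].strip()
    return completion.strip()  # fall back to the raw text

# A hand-written completion standing in for a real model call:
completion = (
    "A bat and ball cost $1.10 and the bat costs $1.00 more than the ball.\n"
    "Let the ball cost x; then x + (x + 1.00) = 1.10, so 2x = 0.10, x = 0.05.\n"
    "Answer: $0.05"
)
print(extract_answer(completion))
```

The intermediate lines are what CoT buys you: the final answer is grounded in visible steps, so a mistake can at least be located, which pure next-token "fast thinking" does not allow.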
This brings us to two methods we have to mention: STaR and Quiet-STaR.
STaR comes from the paper
'STaR:
Bootstrapping Reasoning With Reasoning'.
Its core idea
is to use the reasoning ability the large language model already has
to iteratively guide it toward producing sound reasoning processes,
and to fold those reasoning processes back into training,
so the model teaches itself to reason.
STaR's approach
resembles policy gradient algorithms in reinforcement learning;
even its overall optimization objective
can be approximated as a policy-gradient objective.
Concretely,
the model first samples potential reasoning paths,
much as a reinforcement-learning policy selects actions:
based on the state of the environment,
it picks a possible trajectory.
In STaR, by computing the objective function,
the model evaluates its predictions over the whole dataset
and updates itself only on the samples it predicted correctly.
At the same time,
STaR performs multiple gradient updates on the same batch of data,
similar to the tactic in some policy-gradient algorithms
of stabilizing learning
by making several passes over the same batch.
In reinforcement learning,
policy-gradient algorithms learn this way while exploring the action space,
whereas STaR explores the space of reasoning and answers,
gradually improving the accuracy of the reasoning it generates.
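The STaR outer loop described above can be sketched as follows. Here `generate`, `rationalize`, and `finetune` are stubs standing in for real model sampling and training, and the toy model at the bottom is purely hypothetical; the structure of the loop (keep only reasoning that reaches the correct answer, with a gold-answer hint as fallback) follows the paper:

```python
def generate(model, question):
    # Stub: the model proposes a (rationale, answer) pair for a question.
    return model["guess"](question)

def rationalize(model, question, gold_answer):
    # STaR's fallback: show the gold answer as a hint and ask the model to
    # work backwards to a rationale that reaches it.
    return (f"hint-guided rationale for {question}", gold_answer)

def finetune(model, examples):
    # Stand-in for fine-tuning on (question, rationale, answer) triples.
    model["train_set"] = examples
    return model

def star_iteration(model, dataset):
    kept = []
    for question, gold in dataset:
        rationale, answer = generate(model, question)
        if answer != gold:
            # Wrong answer: retry with the gold answer as a hint.
            rationale, answer = rationalize(model, question, gold)
        if answer == gold:
            # Only correct reasoning paths are folded back into training.
            kept.append((question, rationale, answer))
    return finetune(model, kept)

# Toy model that answers correctly on even-numbered questions only.
toy = {"guess": lambda q: (f"rationale for {q}", "yes" if q % 2 == 0 else "no")}
trained = star_iteration(toy, [(1, "yes"), (2, "yes"), (3, "no")])
print(len(trained["train_set"]))
```

Repeating `star_iteration` with the newly fine-tuned model is what makes this a bootstrap: each round's model generates the next round's training data, the policy-gradient-like dynamic the transcript describes.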
This differs from optimizing outputs with fine-grained rewards or Monte Carlo Tree Search, mentioned earlier,
because from the correct and incorrect examples
the model mostly learns how to carry out explicit, sound reasoning.
At the same time,
this sound reasoning does more than decompose a problem
into steps;
it also applies to general common-sense QA tasks.
For example, take this question:
what can be used to carry a small dog,
with the options (a) swimming pool, (b) basket, (c) backyard, and (d) own home.
Reasoning it through,
the answer must be something that can carry a small dog,
and among the options only a basket is used to hold things.
Therefore, the answer is (b) basket.
Although STaR improves reasoning accuracy,
it has several limitations.
The first is dependence on few-shot examples:
in reasoning tasks, STaR relies heavily on a small number of few-shot reasoning demonstrations,
which leaves the model's reasoning ability fairly limited
and hard-pressed to handle complex, broad tasks.
The second is limited generalization:
although STaR can improve reasoning
through iteration,
its application is mostly confined to specific structured tasks,
such as question answering,
and it is hard to achieve the same effect
in open-domain or arbitrary text-generation tasks.
To address STaR's limitations,
the paper 'Quiet-STaR:
Language Models Can Teach Themselves to Think Before Speaking'
proposed the concept of 'internal thinking',
turning the explicit reasoning process
into an implicit reasoning process inside the model,
thereby shedding the dependence on external examples.
The paper also introduces learnable start-of-thought and end-of-thought tokens
to mark where a thought begins and ends.
Quiet-STaR also achieves reasoning learning on more general text,
which means large amounts of unstructured corpora from complex tasks,
in fields such as medicine and finance,
can be brought into the learning process.
Quiet-STaR uses the distribution difference between outcomes predicted with the reasoning process and the true outcomes
to introduce a reward signal,
then optimizes the generated reasoning with reinforcement learning,
making models that condition on this reasoning
more accurate at predicting future tokens.
As things stand,
STaR and Quiet-STaR are the closest matches to o1's technical route and performance,
but to go further and reach the level of OpenAI's o1,
many problems remain to be solved.
For example,
when Quiet-STaR generates internal thoughts,
every token spawns its own next-step thinking process,
which produces a large number of extra tokens
and drives a sharp increase in compute requirements.
In practice,
the model needs to learn to dynamically adjust the tokens it spends on thinking.
Second,
for more complex tasks and long-horizon problems,
how do we provide finer-grained reward signals
for the internal thinking process?
Merely checking whether the reasoned answer matches the correct one,
or comparing the similarity of predicted distributions,
is clearly not enough.
From this angle,
OpenAI's o1 has probably followed a route similar to STaR and Quiet-STaR,
optimizing the process by which the model internally generates sound reasoning, the so-called 'implicit CoT',
and the bulk of the reinforcement-learning compute in the post-training phase
was likely spent on optimizing this internal reasoning process.
So,
how do we construct the rewards for optimizing the implicit CoT?
In general,
we can build a preference ordering from reasoning paths sampled at different temperatures,
or form one from Monte Carlo tree search results that contain both correct and incorrect reasoning processes.
This differs from how Monte Carlo trees were used before:
now a node in the tree
is no longer a token or a step
in the final answer,
but a step in the implicit reasoning process.
At the same time,
to provide finer-grained feedback and guidance,
we need to introduce process rewards.
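One way such a preference ordering might be constructed (a hedged sketch, not a confirmed detail of o1): sample reasoning paths at several temperatures, score each with a scorer standing in for a process-reward or critic model, and turn every score gap into a (preferred, rejected) pair for preference-based RL training. Both `sample_paths` and `score` below are stubs:

```python
import itertools

def sample_paths(question, temperatures):
    # Stub: a real system would sample the model once per temperature.
    return [f"T={t} reasoning for {question}" for t in temperatures]

def score(path):
    # Stub scorer: pretend the critic prefers paths sampled near T=0.7.
    t = float(path.split("T=")[1].split()[0])
    return -abs(t - 0.7)

def preference_pairs(question, temperatures):
    scored = [(score(p), p) for p in sample_paths(question, temperatures)]
    pairs = []
    for (sa, a), (sb, b) in itertools.combinations(scored, 2):
        if sa != sb:  # equal scores carry no preference signal
            pairs.append((a, b) if sa > sb else (b, a))
    return pairs

pairs = preference_pairs("2+3=?", [0.2, 0.7, 1.0])
print(len(pairs))
```

The resulting (preferred, rejected) pairs are exactly the kind of partial order the transcript mentions; a real pipeline would feed them to a preference-optimization objective rather than print them.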
But for complex problems where the model itself already struggles to produce a sound reasoning process,
we also need to bring in an additional, sufficiently strong evaluation model,
a Critic Model, to solve this.
A while back we made an episode
about CriticGPT, released by OpenAI.
Trained with RLHF,
it can write natural-language feedback
for real-world coding tasks
and generalizes successfully to other distributions.
This feedback helps humans evaluate more accurately,
enabling effective reward feedback on complex outputs.
Earlier, in the paper 'Self-critiquing models for assisting human evaluators',
OpenAI had also explored self-critique methods in depth,
along with the feasibility of evaluation models assisting humans in judging text-summarization tasks.
So,
based on the principle that evaluation is easier than generation,
o1's training of the implicit chain of thought
presumably also incorporates Critic methods
to provide more precise feedback.
In the end, through reinforcement learning,
o1 learned to optimize its chain of thought
and to keep improving the strategies it uses.
It learned not only to recognize and correct its own mistakes,
but also to break complex steps
down into simpler ones,
and to try different approaches
when the current one isn't working.
This process greatly improved the model's reasoning ability.
Also, among the details OpenAI disclosed,
the reasoning tokens in the generation process are introduced dynamically,
minimizing the extra compute wasted on unnecessary thinking.
You could say
that OpenAI's o1 is no longer a model that answers instantly,
but one that thinks deeply first and then answers.
In terms of the theory Daniel Kahneman laid out in 'Thinking, Fast and Slow',
o1 is evolving from relying on System 1,
the fast, automatic, intuitive, error-prone mode of thinking,
toward System 2,
the slow, deliberate, conscious, and more reliable reasoning process.
This shift gives o1 the ability to solve complex problems it previously could not handle,
and all of it
comes from applying and optimizing scaling laws in the post-training phase.
More interestingly,
we can also build a data flywheel,
using the o1 model's reasoning process
to automatically generate large amounts of high-quality training data.
That data can then be fed back repeatedly to further improve the model's performance,
forming a self-reinforcing virtuous cycle.
In this process,
the model's bootstrapping ability can be extended further,
which not only accelerates the pace of performance gains
but may also take another step toward superintelligence.
All right,
let's summarize some of the technology behind o1.
First, the o1 model is trained with reinforcement learning;
by introducing dynamic reasoning tokens,
it heuristically uses an 'implicit chain of thought' to 'think' through problems,
and the longer it thinks, the stronger its reasoning.
Second, the release of o1
means that improving AI capability
is no longer confined to the pre-training phase;
performance can also be raised in the post-training phase
by extending the exploration time of reinforcement-learning training
and by increasing the model's thinking time at inference,
which is what the Post-Training Scaling Laws
refer to.
Third, built on self-reflection,
the o1 model not only strengthens its bootstrapping ability,
it will also greatly improve its ability to solve complex problems it has never seen.
At the same time,
its reasoning process
may also yield a large, high-quality data flywheel,
taking another step toward eventual superintelligence.
Finally, I'd like to add:
since ChatGPT appeared in 2022,
large models have gone through nearly two years of iteration.
Today, both industry and academia
are working hard to explore the upper limits of large models.
The general view is
that to push model capability further,
you either use synthetic data
to scale up data and parameters,
or use modality mixing and modality transfer
to strengthen the model with other modalities.
Yet from o1's performance we can see
that although its reasoning on complex tasks like math and physics
has improved dramatically,
it shows no comparable progress
on some language-generation tasks.
OpenAI researchers have also mentioned in interviews
that o1 excels at reasoning
but does not make a very good agent or assistant.
In other words,
once a model becomes powerful enough,
a separation appears between its reasoning ability and its instruction-following ability.
For our goal of building general-purpose agents,
how to balance the two
may become a core question in the development of large models.
How high the ceiling of large models really is,
we'll have to wait and see.
Thanks for watching this episode,
and see you next time.