[Artificial Intelligence] The Technology Behind OpenAI's o1 Model | Post-Training Scaling Laws | Test-Time Compute | Slow Thinking | Implicit Chain of Thought (CoT) | STaR | Critic Models | Where Is the Ceiling of Large Language Models?

大飞说科技
19 Sept 2024 · 13:04

Summary

TLDR: In this episode, 'Best Partners' discusses OpenAI's o1 series model, highlighting its significant advancements in mathematics, coding, and long-term planning. The model's performance surge is attributed to post-training scaling laws and test-time compute scaling. The script explores the model's technical aspects, emphasizing the shift from pre-training parameter scaling to reinforcement learning in post-training, which is crucial for enhancing reasoning and problem-solving abilities. It also touches on techniques like MCTS, Chain of Thought, and STaR for optimizing model output, suggesting that the future of AI may lie in the intelligent allocation of computational resources during the post-training phase.

Takeaways

  • 🚀 OpenAI's o1 model represents a significant leap in AI capabilities, particularly in mathematical reasoning, coding, and long-range planning.
  • 📈 The model's advancements are attributed to post-training scaling laws and reinforcement learning during the post-training phase, which have allowed it to surpass human PhD-level accuracy in certain scientific domains.
  • 🧠 The diminishing returns of pre-training scaling suggest that future AI improvements may rely more on post-training enhancements like reinforcement learning.
  • 🤖 o1's performance in competitive programming and mathematical problem-solving places it within the top percentiles of human performers.
  • 🔍 The model's approach includes techniques like Monte Carlo Tree Search (MCTS) and Chain of Thought (CoT) to enhance its reasoning and error-correction abilities.
  • 💡 STaR and Quiet-STaR methodologies are highlighted as being instrumental in teaching the model to think before responding, thereby improving its reasoning accuracy.
  • 🔄 The concept of 'internal thinking' introduced by Quiet-STaR allows the model to perform implicit reasoning without external examples, expanding its applicability.
  • 📚 o1's training process is dynamic, incorporating self-critique and iterative learning to refine its reasoning chains and strategies.
  • 🌐 The model's ability to generate high-quality training data through its reasoning processes could lead to a self-reinforcing cycle of performance improvement.
  • 🔮 While o1 excels in complex reasoning tasks, it may not yet be optimized for general agent or assistant roles, indicating a potential trade-off between reasoning and instruction-following capabilities.

Q & A

  • What significant advancements did OpenAI's o1 series model achieve according to the transcript?

    -The o1 series model achieved significant advancements in mathematics, coding, and long-range planning. It ranked in the 89th percentile in competitive programming on Codeforces, placed among the top 500 students in the United States on the American Invitational Mathematics Examination (AIME), and surpassed human PhD-level accuracy on the GPQA benchmark of physics, biology, and chemistry problems.

  • What is the Post-Training Scaling Law and how does it relate to the o1 model's performance?

    -The Post-Training Scaling Law refers to the principle that the performance of AI models can be significantly enhanced through reinforcement learning during the post-training phase, rather than just scaling up the model's parameters during pre-training. This law was instrumental in the o1 model's performance leap, suggesting a shift in focus towards post-training optimization for improving reasoning and long-range problem-solving abilities.

  • How does the o1 model utilize reinforcement learning in its training?

    -The o1 model employs reinforcement learning during the post-training phase to enhance its reasoning capabilities. It does this by iteratively guiding the model to produce logical reasoning paths and incorporating these into the training process, allowing the model to learn and improve its reasoning accuracy over time.

  • What is the role of Test-Time Compute in the performance of large language models as discussed in the transcript?

    -Test-Time Compute refers to the computational resources a model uses during the testing phase for reasoning and reflection. The transcript suggests that increasing Test-Time Compute can be more effective than simply scaling up model parameters, as it allows the model to engage in deeper and more complex reasoning processes, which directly impacts the model's performance.

  • What is the concept of 'Chain of Thought' (CoT) mentioned in the transcript, and how does it improve model output?

    -The 'Chain of Thought' (CoT) is a method where the model is prompted to generate a series of intermediate reasoning steps before providing a final answer. This approach helps to enhance the model's reasoning capabilities, especially in tasks requiring mathematical and coding solutions, by making the reasoning process more explicit and structured. (A minimal prompt sketch appears after this Q&A list.)

  • How does the STaR method contribute to the o1 model's reasoning abilities?

    -STaR, short for 'Self-Taught Reasoner' and introduced in the paper 'STaR: Bootstrapping Reasoning With Reasoning,' is a method that leverages the model's existing reasoning capabilities to iteratively guide it in producing logical reasoning paths. It incorporates these paths into the training process, allowing the model to learn and improve its reasoning accuracy, in a way that resembles a policy gradient algorithm in reinforcement learning.

  • What is the difference between STaR and Quiet-STaR as per the transcript?

    -While STaR focuses on explicit reasoning by generating reasoning paths, Quiet-STaR introduces the concept of 'internal thinking,' transforming the explicit reasoning process into an implicit one within the model. This allows Quiet-STaR to operate without reliance on external examples and to apply reasoning across a broader range of tasks and non-structured data.

  • How does the o1 model optimize its internal reasoning process according to the transcript?

    -The o1 model optimizes its internal reasoning process through a combination of reinforcement learning and dynamic introduction of reasoning tokens. It learns to identify and correct errors, break down complex steps into simpler ones, and try different solutions when necessary, which significantly enhances its reasoning capabilities.

  • What is the concept of a 'data flywheel' mentioned in the transcript, and how does it relate to the o1 model?

    -A 'data flywheel' refers to a self-reinforcing cycle where the model's reasoning process generates high-quality training data, which can then be used to further improve the model's performance. In the context of the o1 model, this concept suggests that as the model's bootstrapping capabilities expand, it can accelerate performance improvements and potentially move closer to achieving superintelligence.

  • What challenges does the o1 model face in balancing reasoning capabilities with following instructions, as discussed in the transcript?

    -While the o1 model excels in reasoning abilities, especially for complex tasks like mathematics and physics, it may not necessarily perform as well as an agent or assistant in language generation tasks. The transcript suggests that as models become more powerful, there could be a separation between reasoning capabilities and the ability to follow instructions, which could become a core issue in developing general intelligent agents.
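
For the Chain of Thought item above, here is a minimal prompt-level sketch. No model is called; the example question is made up, and the only difference between the two prompts is the request for intermediate steps, which is exactly what CoT adds.

```python
# Minimal Chain-of-Thought prompting sketch. No model call is made here;
# the question is an invented example, and the point is only the shape of the prompt.
question = "A train travels 60 km in 45 minutes. What is its average speed in km/h?"

# Direct prompting: the model is asked to jump straight to the answer.
direct_prompt = f"Q: {question}\nA:"

# CoT prompting: the model is asked for intermediate steps before the final answer.
cot_prompt = (
    f"Q: {question}\n"
    "A: Let's think step by step. Show the intermediate reasoning, "
    "then give the final answer on a new line starting with 'Answer:'."
)

print(direct_prompt)
print("---")
print(cot_prompt)  # the extra requested steps are the 'chain of thought'
```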

Outlines

00:00

🚀 Introduction to the o1 Model's Breakthroughs

The script introduces the advancements of the o1 series model by OpenAI, highlighting its significant improvements in mathematics, coding, and long-range planning. The model's performance is benchmarked against competitive programming, mathematical contests, and scientific problem-solving, where it has exceeded human expert levels. The script emphasizes the role of the Post-Training Scaling Law and reinforcement learning during the post-training phase in achieving these capabilities. It also discusses the diminishing returns of merely scaling up parameters during pre-training and suggests that post-training reinforcement learning is the pivotal next step. The script references Ilya Sutskever's 2018 MIT talk on the potential of reinforcement learning and self-play for AGI and OpenAI's exploration of scaling laws beyond parameter size.

05:00

🤖 Deep Dive into Post-Training Scaling and STaR Method

This section delves into the technical aspects of the o1 model's training, particularly focusing on the Post-Training Scaling Law. It explains how the training phase's computational demand is now linked to both the model parameter size and the computational load during reinforcement learning exploration. The script introduces the STaR method, which iteratively improves the model's reasoning capabilities by evaluating predictions and updating the model based on correct samples. It contrasts this approach with other optimization techniques like Monte Carlo Tree Search and Chain of Thought, highlighting STaR's ability to enhance explicit reasoning. Limitations of STaR, such as dependency on few-shot examples and restricted generalization, are also discussed, leading to the introduction of Quiet-STaR, which internalizes the reasoning process and broadens the model's applicability.

10:02

🔍 Quiet-STaR Innovations and Future AI Prospects

The final paragraph discusses the innovations of Quiet-STaR, which allows language models to 'think' before speaking by marking the beginning and end of thought processes with special tokens. It explores how Quiet-STaR uses distribution differences between reasoned and actual outcomes to introduce reward signals for reinforcement learning, improving the model's accuracy in predicting future tokens. The script also speculates on how OpenAI's o1 model might have optimized internal reasoning processes, potentially using critic models for fine-grained feedback. It reflects on the progress of large AI models since ChatGPT's release, noting the industry and academia's efforts to push their capabilities. The discussion concludes with the observation that while o1 excels in reasoning, it may not be as adept as an agent or assistant, suggesting a potential dichotomy between reasoning and instruction-following abilities that future models need to address.

Keywords

💡Post-Training Scaling Law

The Post-Training Scaling Law refers to the principle that the performance of AI models can be significantly enhanced not just by increasing the number of parameters during the pre-training phase, but also by scaling up the computational resources during the post-training phase. In the context of the video, this law is crucial as it underpins the performance improvements seen in the o1 model by OpenAI, suggesting that the era of relying solely on parameter scaling is shifting towards a more nuanced approach that includes post-training reinforcement.
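
As a rough way to picture the shift (a schematic only, not a formula from OpenAI or the video), the total compute budget can be split into three buckets, with the argument being that the last two now matter as much as the first:

```latex
% Schematic compute decomposition -- illustrative, not an official formula.
C_{\text{total}} \;\approx\;
  \underbrace{C_{\text{pre-train}}}_{\text{parameter scaling}}
  \;+\;
  \underbrace{C_{\text{post-train RL}}}_{\text{exploration + policy updates}}
  \;+\;
  \underbrace{C_{\text{test-time}}}_{\text{``thinking'' tokens at inference}}
```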

💡Reinforcement Learning

Reinforcement Learning is a type of machine learning where an agent learns to make decisions by taking actions in an environment to maximize some notion of cumulative reward. In the video, reinforcement learning plays a pivotal role in the o1 model's training process, enabling it to improve its reasoning and problem-solving abilities over time by learning from its successes and failures.
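
To make the "maximize cumulative reward" idea concrete, here is a minimal REINFORCE-style loop on a toy two-armed bandit. It is purely illustrative and unrelated to o1's actual training setup; the payoff probabilities and learning rate are made up.

```python
# Toy REINFORCE on a 2-armed bandit: act, observe reward, and nudge the policy
# toward actions that were rewarded. Illustrative only, not how o1 is trained.
import math
import random

logits = [0.0, 0.0]   # policy parameters, one logit per action
lr = 0.1

def action_probs():
    exps = [math.exp(l) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def reward(action):
    # arm 1 pays off more often, so the policy should learn to prefer it
    return 1.0 if random.random() < (0.3 if action == 0 else 0.8) else 0.0

for step in range(2000):
    probs = action_probs()
    a = 0 if random.random() < probs[0] else 1
    r = reward(a)
    # REINFORCE gradient for a softmax policy: (1[i == a] - p_i) * reward
    for i in range(2):
        grad = ((1.0 if i == a else 0.0) - probs[i]) * r
        logits[i] += lr * grad

print("final action probabilities:", action_probs())
```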

💡Codeforces

Codeforces is a competitive programming platform where participants solve algorithmic problems within a limited time. The video mentions that the o1 model's performance in Codeforces is competitive, ranking in the 89th percentile, which underscores the model's advanced capabilities in complex problem-solving and algorithmic thinking.

💡AIME

The American Invitational Mathematics Examination (AIME) is an advanced mathematics competition for high school students in the United States. The video highlights that the o1 model's performance in AIME-qualifying exams places it among the top 500 students in the U.S., indicating a high level of mathematical reasoning ability.

💡GPQA

GPQA is a graduate-level, "Google-proof" question-answering benchmark used to measure the accuracy of AI models on difficult physics, biology, and chemistry questions. The video states that the o1 model surpasses human PhD-level accuracy on GPQA, showcasing its proficiency in scientific reasoning.

💡Test-Time Compute

Test-Time Compute refers to the computational resources and time an AI model uses during the testing phase to make inferences or predictions. The video discusses how optimizing Test-Time Compute can be more effective than merely scaling model parameters, emphasizing the importance of computational efficiency in AI performance.
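
A common way to spend extra test-time compute is to sample several answers and aggregate them (self-consistency / majority voting). The sketch below assumes a hypothetical `sample_answer` callable wrapping a temperature-sampled model call; it illustrates the budget-versus-accuracy trade-off in general, not OpenAI's actual inference procedure.

```python
# Sketch: more test-time compute via self-consistency (majority voting).
from collections import Counter
from typing import Callable

def answer_with_budget(question: str,
                       sample_answer: Callable[[str], str],
                       n_samples: int) -> str:
    """Sample n_samples answers and return the most common one.

    `sample_answer` is a hypothetical stand-in for one stochastic model call;
    raising n_samples spends more test-time compute on the same question.
    """
    votes = Counter(sample_answer(question) for _ in range(n_samples))
    return votes.most_common(1)[0][0]
```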

💡STaR

STaR, short for Self-Taught Reasoner (from the paper 'STaR: Bootstrapping Reasoning With Reasoning'), is a method described in the video as a way to bootstrap an AI model's reasoning capabilities by iteratively using the model's own reasoning to improve its performance. It is likened to a policy gradient algorithm in reinforcement learning, where the model learns to select the best reasoning paths to enhance its accuracy.
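
A minimal sketch of one STaR iteration, assuming two hypothetical helpers: `generate_rationale`, which samples a rationale plus a final answer, and `finetune`, which trains on the collected text. The published method also adds a "rationalization" pass in which the correct answer is given as a hint; that is omitted here for brevity.

```python
# One STaR-style bootstrapping iteration (simplified sketch).
def star_iteration(model, dataset, generate_rationale, finetune):
    keep = []
    for question, gold_answer in dataset:
        rationale, answer = generate_rationale(model, question)
        if answer == gold_answer:
            # Only rationales that led to the correct answer are kept, which
            # acts like a reward-weighted (policy-gradient-style) filter.
            keep.append((question, rationale, gold_answer))
    # Fine-tune on the correct rationales, then repeat with the updated model.
    return finetune(model, keep)
```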

💡Quiet-STaR

Quiet-STaR is an extension of the STaR method that introduces the concept of 'internal thinking' within the AI model. It allows the model to perform implicit reasoning without relying on external examples, thus enhancing its ability to handle complex tasks. The video suggests that Quiet-STaR is a significant step towards achieving human-like reasoning in AI.
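
A sketch of the reward idea attributed to Quiet-STaR in the video: generate a hidden thought, then reward it by how much it improves the likelihood of the true next tokens. `sample_thought` and `logprob_future_tokens` are hypothetical helpers, and the marker token names are illustrative rather than the paper's exact strings.

```python
# Quiet-STaR-style reward sketch: does thinking help predict the real future tokens?
START, END = "<start_of_thought>", "<end_of_thought>"  # illustrative marker names

def thought_reward(model, context, future,
                   sample_thought, logprob_future_tokens):
    thought = sample_thought(model, context)
    with_thought = context + START + thought + END
    base = logprob_future_tokens(model, context, future)
    improved = logprob_future_tokens(model, with_thought, future)
    # Positive when the hidden thought makes the true continuation more likely;
    # a REINFORCE-style update then raises the probability of such thoughts.
    return improved - base
```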

💡Critic Model

A Critic Model, as mentioned in the video, is an evaluative model that provides feedback on the AI's performance. It is used to guide the training process by offering more precise rewards and corrections, especially for complex tasks where direct evaluation is challenging. The video implies that such models are essential for fine-tuning the AI's reasoning and decision-making processes.
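
One plausible way (an assumption, not a disclosed o1 detail) to use a critic for fine-grained feedback is to score sampled reasoning chains and turn clear score gaps into preference pairs for later preference-based training. `critic_score` is a hypothetical evaluation model in the spirit of CriticGPT.

```python
# Sketch: turn critic scores over candidate reasoning chains into preference pairs.
from itertools import combinations

def preference_pairs(question, candidate_chains, critic_score, margin=0.1):
    scored = [(chain, critic_score(question, chain)) for chain in candidate_chains]
    pairs = []
    for (a, sa), (b, sb) in combinations(scored, 2):
        if abs(sa - sb) >= margin:                   # keep only clear preferences
            better, worse = (a, b) if sa > sb else (b, a)
            pairs.append((question, better, worse))  # "chosen" vs "rejected"
    return pairs
```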

💡Data Flywheel

The concept of a Data Flywheel in the video refers to a self-reinforcing cycle where the AI model's improved reasoning capabilities generate high-quality data, which in turn is used to further train and improve the model. This cycle is seen as a potential pathway towards achieving superintelligence by continuously bootstrapping the model's learning.
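
A sketch of the flywheel as a loop, with all helpers (`generate_trace`, `verify`, `finetune`) as hypothetical placeholders: generate reasoning traces, keep only those a verifier accepts, retrain, and repeat.

```python
# Data-flywheel sketch: the model's own accepted reasoning traces become training data.
def data_flywheel(model, prompts, generate_trace, verify, finetune, rounds=3):
    for _ in range(rounds):
        new_data = []
        for prompt in prompts:
            trace = generate_trace(model, prompt)
            if verify(prompt, trace):          # e.g. unit tests, a checker, or a critic
                new_data.append((prompt, trace))
        model = finetune(model, new_data)      # the improved model generates better data
    return model
```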

💡System 1 and System 2

System 1 and System 2 are cognitive concepts from Daniel Kahneman's 'Thinking, Fast and Slow'. System 1 represents fast, intuitive thinking, while System 2 is slow, deliberate, and logical. The video suggests that the o1 model is evolving from a reliance on System 1 to employing more of System 2, indicating a shift towards more reliable and complex reasoning abilities.

Highlights

OpenAI's o1 series models have achieved significant improvements in mathematics, coding, and long-term planning.

o1 ranks in the 89th percentile in competitive programming on Codeforces and among the top 500 students in the American Invitational Mathematics Examination (AIME).

o1 surpasses human PhD-level accuracy on the GPQA benchmark of physics, biology, and chemistry problems.

Post-Training Scaling Law is pivotal for o1's performance leap, suggesting a reevaluation of computational resource allocation.

As model size increases, the marginal benefits of pre-training parameter scaling are diminishing.

Reinforcement learning during the post-training phase is identified as the next breakthrough for enhancing model reasoning and long-term problem-solving capabilities.

Ilya Sutskever expressed confidence in AGI through reinforcement learning and self-play at MIT in 2018.

OpenAI's exploration of scaling laws beyond parameters is evident in their 2021 paper on training verifiers for math word problems.

The inability of autoregressive models to self-correct answers is a challenge for progress in mathematical reasoning.

Reinforcement learning brings a paradigm shift in large language model training and introduces new scaling laws for post-training.

Training compute in the post-training phase includes not only model parameter scaling but also the computational load of reinforcement learning exploration.

Test-Time Compute, or the computational load during model reasoning and reflection, also affects model performance.

The necessity of sufficient computational power for post-training to enhance reasoning performance is becoming a critical factor.

o1's performance continues to improve with more reinforcement learning and extended thinking time.

STaR and Quiet-STaR methods are highlighted as being closest to o1's technical route and model performance.

STaR uses the model's reasoning capabilities to iteratively guide it to produce logical reasoning processes.

Quiet-STaR introduces 'internal thinking', transforming explicit reasoning into implicit processes within the model.

o1 likely optimizes the internal reasoning process, or 'implicit CoT', focusing training compute on this optimization.

Critic Model is introduced to provide fine-grained feedback for complex tasks that are difficult for the model to reason about internally.

o1 learns to optimize its reasoning chain and improve strategies, identifying and correcting errors, and breaking down complex steps.

o1 evolves from a fast, intuitive thinking model to a slower, more deliberate and reliable reasoning process, enhancing its ability to solve complex problems.

The potential for a data flywheel effect, where o1's reasoning process generates high-quality training data for self-improvement, is discussed.

The future of large models is considered, with a focus on balancing reasoning capabilities with the ability to follow instructions for building general intelligence.

Transcripts

play00:00

Hello everyone, this is Best Partners, and I'm Dafei

play00:02

At midnight on September 13th, Beijing time,

play00:05

OpenAI released the o1 series of models

play00:07

which, on problems such as mathematics, coding, and long-range planning,

play00:10

achieved significant improvements

play00:11

For example,

play00:12

it ranked in the 89th percentile on the competitive programming platform Codeforces

play00:17

in the AIME qualifier for the U.S. Mathematical Olympiad

play00:20

it placed among the top 500 students in the United States

play00:23

and on GPQA, a benchmark of physics, biology, and chemistry problems,

play00:26

it also exceeded human PhD-level accuracy

play00:29

What helped o1 achieve such a leap in performance

play00:32

is precisely the scaling of reinforcement learning in the post-training phase

play00:34

and the scaling of thinking time at test and inference time

play00:37

Today we'll mainly talk about some of the technology behind o1

play00:40

especially the former

play00:41

that is, the scaling law of the post-training phase

play00:44

the Post-Training Scaling Law

play00:46

Its emergence

play00:47

may prompt us to rethink how compute is allocated and what post-training can do

play00:53

In fact, people have already noticed

play00:54

that as large models keep growing in size,

play00:57

the marginal gains from purely scaling up parameters in the pre-training phase

play01:01

have actually begun to diminish

play01:03

If we want to deeply improve a model's reasoning and its handling of long-range problems,

play01:06

then post-training based on reinforcement learning

play01:08

will become the next breakthrough

play01:11

As early as 2018, in a guest lecture at MIT,

play01:14

Ilya shared his confidence that reinforcement learning and self-play

play01:17

are a path toward AGI

play01:19

Clearly,

play01:19

OpenAI has also kept exploring scaling laws beyond parameters

play01:23

As early as 2021,

play01:24

they noted in the paper "Training Verifiers to Solve Math Word Problems"

play01:28

that one reason autoregressive models struggle to improve on mathematical reasoning is

play01:32

that they have no way to autonomously correct their own answers

play01:35

If we rely only on generative methods and scaling up parameters,

play01:38

then the gains on mathematical reasoning tasks

play01:40

will not be very large

play01:41

So additional Scaling Laws need to be found

play01:44

And as things now stand,

play01:45

reinforcement learning has not only brought a paradigm shift in training large language models,

play01:48

it has also brought new Scaling Laws,

play01:50

namely the Post-Training Scaling Laws

play01:54

Under the new scaling laws,

play01:55

compute in the training phase is no longer tied only to the growth in model parameters

play01:59

it also includes, during reinforcement learning exploration,

play02:01

the compute spent on the large language model's inference

play02:04

At the same time,

play02:05

the compute the model spends on reasoning and reflection during the test phase

play02:08

that is, Test-Time Compute

play02:09

also affects the model's final performance

play02:12

This shift in paradigm was noted in DeepMind's recently published paper

play02:14

"Scaling LLM Test-Time Compute Optimally

play02:16

can be More Effective than Scaling Model Parameters"

play02:20

which describes exactly this change

play02:21

In the post-training phase, although the model's parameters stay the same,

play02:23

training compute still grows by multiples,

play02:27

and on the inference side, as the model's thinking ability improves,

play02:29

the compute per query grows as well

play02:31

Therefore, whether there is enough compute for post-training

play02:34

will likely become the ticket of entry for improving reasoning performance from now on

play02:38

Of course, OpenAI's findings also confirm this:

play02:41

with more reinforcement learning and more thinking time,

play02:43

o1's performance keeps improving

play02:46

and the space of post-training scaling laws has not yet been fully explored

play02:50

Rich Sutton already pointed out in "The Bitter Lesson"

play02:53

that only two kinds of techniques scale with compute:

play02:55

learning and search

play02:57

NVIDIA scientist Jim Fan has also said

play03:00

that a model's parameters

play03:00

are mostly used to store and memorize knowledge

play03:03

Therefore,

play03:04

as the marginal benefit of the parameter Scaling Law gradually diminishes,

play03:07

now is the time

play03:08

to shift more compute toward the post-training and inference phases

play03:11

So how does the o1 model that OpenAI has just released

play03:13

actually do reinforcement learning in the post-training phase?

play03:18

Let's first look at an example from everyday life

play03:20

When we write or speak,

play03:23

we often stop to think for a moment

play03:25

However,

play03:25

when a large language model predicts the next token,

play03:27

it is more like a fast-thinking process

play03:29

Lacking detailed intermediate reasoning steps,

play03:32

the model may make mistakes early on,

play03:34

and those mistakes may propagate,

play03:36

ultimately making the generated answer wrong as well

play03:38

To optimize this process,

play03:40

the field has produced a series of methods

play03:42

One is to use Monte Carlo Tree Search (MCTS)

play03:45

modeling the model's output

play03:46

as a series of token-level or sentence-level nodes

play03:49

and then providing reward signals

play03:50

to help the model adjust the answers it generates

play03:53

Another approach

play03:54

is to optimize the model's output through Chain of Thought (CoT)

play03:57

CoT uses step-by-step reasoning

play04:00

requiring the model, before it produces the final answer,

play04:02

to first generate a series of intermediate reasoning steps

play04:04

This process of generating a "chain of thought"

play04:06

helps strengthen the model's reasoning ability

play04:08

and works especially well on tasks such as mathematics and code generation

play04:12

However, although CoT can generate intermediate steps,

play04:14

it does not teach the model

play04:16

how to think deeply, internally, about the connections within a problem

play04:19

Especially for tasks that are very complex and require multi-step reasoning and planning,

play04:23

the intermediate CoT reasoning process matters all the more

play04:26

Speaking of which,

play04:26

we have to mention two methods: STaR and Quiet-STaR

play04:30

STaR comes from the paper "STaR:

play04:32

Bootstrapping Reasoning

play04:33

With Reasoning",

play04:34

Its core idea

play04:35

is to use the reasoning ability a large language model already has

play04:37

to iteratively guide the model toward producing sound reasoning processes

play04:41

and to fold those reasoning processes back into training

play04:44

so that the model learns to reason on its own

play04:45

The idea behind STaR

play04:46

is similar to policy gradient algorithms in reinforcement learning

play04:49

in fact, the overall optimization objective

play04:50

can be approximated as a policy gradient optimization objective

play04:53

Specifically,

play04:54

the model first samples potential reasoning paths,

play04:57

much like a policy selecting actions in reinforcement learning:

play05:00

based on the state of the environment,

play05:01

it chooses one possible path under the policy

play05:03

In STaR, by computing the objective function,

play05:05

the model evaluates its predictions over the whole dataset

play05:08

and updates itself only on the samples it predicted correctly

play05:11

At the same time,

play05:12

STaR performs multiple gradient updates on the same batch of data,

play05:15

which also resembles the strategy used in some policy gradient algorithms,

play05:18

namely making multiple passes over the same batch

play05:20

to stabilize the learning process

play05:22

In reinforcement learning,

play05:23

policy gradient algorithms learn this way

play05:25

while exploring the action space,

play05:28

whereas STaR explores the space of rationales and answers

play05:31

to gradually improve the accuracy of its generated reasoning

play05:33

This approach differs from the methods we mentioned earlier that use fine-grained rewards

play05:36

or Monte Carlo Tree Search to optimize the output,

play05:40

because from correct and incorrect examples

play05:42

what the model mostly learns is how to carry out explicit, sound reasoning

play05:46

At the same time,

play05:46

this kind of sound reasoning does not just break a problem down

play05:49

and reason step by step,

play05:50

it also applies to general commonsense question-answering tasks

play05:53

For example, consider this question:

play05:55

what can be used to carry a puppy?

play05:57

The options are (a) a swimming pool, (b) a basket, (c) a backyard, (d) one's own home

play06:03

By sound reasoning,

play06:05

the answer must be something that can be used to carry a puppy

play06:08

Among the options, only a basket is used to hold things

play06:11

Therefore, the answer is (b), a basket

play06:13

Although STaR can improve reasoning accuracy,

play06:15

it also has several limitations

play06:17

The first is its dependence on few-shot examples

play06:20

In reasoning tasks, STaR relies heavily on a small number of few-shot reasoning examples,

play06:24

which leaves the model's reasoning ability rather limited

play06:26

and makes it hard to handle complex and broad tasks

play06:28

The second is limited generalization

play06:30

Although STaR can iteratively

play06:32

improve the model's reasoning ability,

play06:34

its application is mainly confined to specific structured tasks

play06:37

such as question answering,

play06:39

and it is hard, in open-domain or arbitrary text generation tasks,

play06:42

to achieve the same effect

play06:43

To address STaR's limitations,

play06:45

the paper "Quiet-STaR:

play06:46

Language Models Can Teach Themselves

play06:48

to Think Before Speaking"

play06:49

proposed the concept of "internal thinking",

play06:51

turning the explicit reasoning process

play06:52

into an implicit reasoning process inside the model

play06:55

thereby removing the dependence on external examples

play06:57

At the same time,

play06:58

the paper also introduces learnable start-of-thought and end-of-thought tokens

play07:03

to mark the beginning and end of a thought

play07:05

Quiet-STaR also makes it possible to learn reasoning on more general text,

play07:09

which means that large amounts of unstructured corpora from complex tasks,

play07:11

in domains such as healthcare and finance,

play07:13

can all be brought into the learning process

play07:15

Quiet-STaR uses the difference between the output distribution produced with reasoning and the true outcome

play07:19

to introduce a reward signal,

play07:20

and then optimizes the generated reasoning with reinforcement methods,

play07:23

so that a model conditioned on these rationales

play07:25

becomes more accurate at predicting future tokens

play07:28

As things stand,

play07:28

STaR and Quiet-STaR are the closest matches to o1's technical route and observed performance

play07:33

But to go further and reach the performance of OpenAI's o1,

play07:37

many problems still have to be overcome

play07:39

For example,

play07:39

when Quiet-STaR generates internal thoughts,

play07:42

every token spawns a corresponding thought process for the next step,

play07:45

which produces a large number of extra tokens

play07:47

and greatly increases the demand for compute

play07:50

In practice,

play07:51

the model needs to learn to dynamically adjust the tokens it spends on thinking

play07:54

Second,

play07:55

for more complex tasks and long-range problems,

play07:57

how, for the internal thinking process,

play07:59

do we provide more fine-grained reward signals?

play08:02

Merely checking whether the reasoned answer matches the correct one,

play08:06

or measuring the similarity of the predicted distributions,

play08:08

is clearly not enough

play08:09

From this perspective,

play08:10

OpenAI's o1 probably also follows a route similar to STaR and Quiet-STaR,

play08:14

optimizing how the model internally generates sound reasoning, the so-called "implicit CoT"

play08:19

and the bulk of the reinforcement learning compute in the post-training phase

play08:21

is likely spent on optimizing this internal reasoning process

play08:25

So,

play08:25

how do we construct rewards for optimizing the implicit CoT?

play08:29

Generally speaking,

play08:30

we can use reasoning paths sampled at different temperatures

play08:32

to build a partial order,

play08:34

or use Monte Carlo Tree Search results that contain both correct and incorrect reasoning processes

play08:38

to form a partial order

play08:40

This differs from how the Monte Carlo tree was used before:

play08:43

the nodes of the Monte Carlo tree are now

play08:45

no longer a token in the final generated answer

play08:47

or a step in it,

play08:48

but each step of the implicit reasoning process

play08:51

At the same time,

play08:51

to provide finer-grained feedback and guidance,

play08:53

we need to introduce process rewards

play08:56

But for complex problems where the model itself can no longer produce a sound reasoning process,

play09:00

we also need to bring in an additional, sufficiently strong evaluation model,

play09:03

a Critic Model, to solve this problem

play09:05

A while back we did an episode

play09:07

introducing CriticGPT, released by OpenAI

play09:10

It is trained with RLHF

play09:12

so that, for real-world coding tasks,

play09:14

it can write natural-language feedback

play09:16

and it generalizes successfully to other distributions

play09:18

This feedback helps humans make more accurate evaluations,

play09:21

enabling effective reward feedback on complex outputs

play09:25

Earlier, OpenAI had also, in the paper "Self-critiquing models for assisting human evaluators",

play09:30

explored self-critique methods in depth,

play09:32

as well as the feasibility of evaluation models helping humans judge text summarization tasks

play09:36

So,

play09:37

based on the principle that evaluation is easier than generation,

play09:39

o1, in training its implicit chain of thought,

play09:42

most likely also brought in a Critic approach

play09:44

to provide more precise feedback

play09:46

In the end, through reinforcement learning,

play09:47

o1 learned to optimize its chain of thought

play09:49

and to keep refining the strategies it uses

play09:52

It learned not only to recognize and correct its errors,

play09:54

but also to break complex steps

play09:56

down into simpler ones,

play09:57

and, when the current approach isn't working,

play09:59

to try a different solution

play10:01

This process greatly improved the model's reasoning ability

play10:05

Also, in the details OpenAI has disclosed,

play10:07

the reasoning tokens used during generation are introduced dynamically,

play10:10

minimizing the extra compute wasted on unnecessary thinking

play10:13

You could say

play10:14

that OpenAI's o1 is no longer a model that answers instantly,

play10:17

but one that thinks deeply first and then gives its answer

play10:21

In terms of the theory Daniel Kahneman laid out in "Thinking, Fast and Slow",

play10:24

o1 is moving away from relying on System 1,

play10:26

the fast, automatic, intuitive, error-prone mode of thinking,

play10:30

and gradually evolving toward System 2,

play10:32

the slow, deliberate, conscious, and more reliable reasoning process

play10:36

This shift gives o1 the ability to solve complex problems it previously could not handle

play10:40

And all of this

play10:42

comes from applying and optimizing Scaling Laws in the post-training phase

play10:46

Even more interesting,

play10:46

we can also build a data flywheel:

play10:49

using the o1 model's reasoning process

play10:51

to automatically generate large amounts of high-quality training data

play10:54

That data can be used over and over to further improve the model's performance,

play10:57

forming a self-reinforcing virtuous cycle

play11:00

In this process,

play11:01

the model's bootstrap ability can be extended further,

play11:04

which not only speeds up performance improvements

play11:07

but may also take another step toward superintelligence

play11:11

Alright,

play11:12

let's summarize some of the technology behind o1

play11:14

First, the o1 model is trained with reinforcement learning,

play11:18

and by introducing dynamic reasoning tokens,

play11:19

it heuristically adopts an "implicit chain of thought" to "think" about problems,

play11:22

and the longer it thinks, the stronger its reasoning becomes

play11:26

Second, the release of the o1 model

play11:27

means that gains in AI capability

play11:29

are no longer confined to the pre-training phase;

play11:31

in the post-training phase as well,

play11:33

increasing the exploration time of reinforcement learning training

play11:35

and increasing the model's thinking time at inference

play11:37

can improve the model's performance

play11:39

This is the so-called post-training scaling law,

play11:41

the Post-Training Scaling Laws

play11:43

Third, built on self-reflection, the o1 model

play11:45

not only gains stronger bootstrap ability,

play11:47

it will also become much better at solving complex problems it has never seen

play11:51

At the same time,

play11:52

during its reasoning process

play11:54

the model may also generate a data flywheel of large amounts of high-quality data,

play11:56

taking another step toward eventual superintelligence

play11:59

Finally, I, Dafei, want to say

play12:00

that since ChatGPT came out in 2022,

play12:03

large models have gone through almost two years of iteration

play12:05

Right now, both industry and academia

play12:08

are working hard to probe the upper limits of large models

play12:11

The general view is

play12:12

that to push large models further,

play12:14

you either use synthetic data

play12:16

to scale up data and parameters,

play12:18

or you use modality mixing and cross-modal transfer

play12:21

to strengthen the model with other modalities

play12:24

However, from o1's performance we can see

play12:27

that although its reasoning on complex tasks like mathematics and physics

play12:30

has improved dramatically,

play12:31

on some language generation tasks

play12:34

it does not show comparably large progress

play12:36

OpenAI researchers have also mentioned in interviews

play12:38

that OpenAI's o1 excels at reasoning

play12:40

but does not make a particularly good agent or assistant

play12:43

In other words,

play12:43

once a model becomes powerful enough,

play12:45

a separation appears between its reasoning ability and its instruction-following ability

play12:49

For our goal of building general-purpose agents,

play12:52

how to balance the two

play12:54

may become a core question in the future development of large models

play12:58

Just how high the ceiling of large models really is,

play13:00

we will have to wait and see

play13:01

Thanks for watching this episode

play13:03

See you next time

Related Tags
AI Advancements · OpenAI o1 · Post-Training Scaling · Reasoning Skills · Problem Solving · Machine Learning · Deep Learning · Reinforcement Learning · AI Research · Tech Innovation