Abstract:
Training large language models with Reinforcement Learning from Verifiable Rewards (RLVR) exhibits a set of distinctive and puzzling behaviors that remain poorly understood, including a two-stage learning curve, V-shaped response-length trajectories, and a pronounced vulnerability to catastrophic forgetting. In this work, we propose that these seemingly disparate phenomena can be explained by a single unifying theory: the model's reasoning process maps to the self-organization of a semantic complex network whose topology remains persistently sparse, with the average degree pinned close to two. This topology imposes a fundamental mechanism for learning and forgetting: it first drives the system into a maximally frustrated state in which "skill islands" form, learning slows, and forgetting is induced; it then enters a sharp growth phase in which new skills are "bolted on", driven by phase-transition-like learning at the network's frontier. Equipped with this theory, we propose Annealed-RLVR, a principled algorithm that introduces an SFT-based "heating" step at the point of maximal frustration to resolve the competitive bottleneck and enhance the reasoning capability of the model. Experiments on a 1.5B-parameter model demonstrate that the approach outperforms standard RLVR on both in-distribution and out-of-distribution benchmarks. By recasting RLVR from black-box optimization into a predictable process of structural self-organization, our work provides a new physical intuition for engineering the emergent reasoning capabilities of future AI systems.
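To make the proposed training schedule concrete, the following is a minimal, hypothetical Python sketch of an Annealed-RLVR loop as described in the abstract: run standard RLVR until a "maximal frustration" point is reached, insert an SFT-based "heating" phase, then resume RLVR. The frustration signal used here (a reward plateau combined with shrinking response lengths, echoing the bottom of the V-shaped length trajectory) and all function names are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of an Annealed-RLVR schedule, based only on the abstract.
# rlvr_step, sft_step, and eval_metrics are user-supplied callables (assumptions).

def annealed_rlvr(model, rlvr_step, sft_step, eval_metrics, total_steps,
                  plateau_window=200, plateau_tol=1e-3, heat_steps=500):
    reward_history = []
    for step in range(total_steps):
        reward = rlvr_step(model)            # one RLVR update with verifiable rewards
        reward_history.append(reward)

        if is_maximally_frustrated(reward_history, eval_metrics(model),
                                   plateau_window, plateau_tol):
            for _ in range(heat_steps):      # "heating": supervised fine-tuning
                sft_step(model)              # e.g. on curated reasoning traces (assumed)
            reward_history.clear()           # restart plateau detection after heating
    return model


def is_maximally_frustrated(rewards, metrics, window, tol):
    """Proxy for the 'maximal frustration' point (an assumption): reward has
    plateaued over the last `window` steps while mean response length is
    still decreasing (the bottom of the V-shaped length trajectory)."""
    if len(rewards) < 2 * window:
        return False
    recent, previous = rewards[-window:], rewards[-2 * window:-window]
    reward_plateaued = abs(sum(recent) / window - sum(previous) / window) < tol
    return reward_plateaued and metrics.get("length_trend", 0.0) < 0.0
```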
Speaker Bio:
Kun Chen is an Associate Research Fellow at the Institute of Theoretical Physics, Chinese Academy of Sciences. His research focuses on emergent phenomena and their mechanisms at the intersection of artificial intelligence and quantum many-body physics. He received his bachelor's degree from the University of Science and Technology of China, and earned his Ph.D. in quantum information science and condensed matter physics at the Hefei National Laboratory for Physical Sciences at the Microscale and the University of Massachusetts (USA). During his postdoctoral work, supported by the Simons Foundation's international collaboration on the many-electron problem, he conducted research at Rutgers University and the Flatiron Institute. He is supported by a national-level young talent program and serves as a principal investigator of a National Key R&D Program project.