Tencent Hunyuan RLVR · Trust Region arXiv 2606.10968

CPPO

Beyond Uniform Token-Level Trust Region
in LLM Reinforcement Learning

§01

TL;DR

In one screen

PPO/GRPO-style trust regions treat every token the same. But in autoregressive generation, early tokens matter more and drift accumulates along the prefix. CPPO reallocates the divergence budget accordingly — a drop-in token mask, no new loss term. PPO/GRPO 一类信任域对所有 token 施加同一阈值。然而在自回归生成中,靠前的 token 影响更大,且偏移会沿 prefix 逐步累积。CPPO 依据这两点重新分配散度预算,其形式是一个即插即用的 token mask,不引入任何新的 loss 项。

  • Problem. A uniform threshold $D_t \le \delta$ is too loose on high-impact early tokens and too tight on late ones, and it has no memory of how far the prefix has already drifted.
  • Fix. A position weight $w_t$ tightens early positions; a cumulative prefix budget $\delta_b$ caps the weighted average divergence over the prefix. Together they give a provably tighter policy-improvement bound.
  • Result. Best validation AIME24/25/26 Avg@16 on all four Qwen3 settings; +5.56 over the best baseline on 30B-A3B-Base, with stable training where other methods collapse.
§02

Results

4× Qwen3
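| Method | Qwen3-1.7B | 1.7B-Base | 8B-Base | 30B-A3B-Base |
|---|---|---|---|---|
| GRPO | 27.91 | 8.89 | 23.96 | 38.19 |
| MinPRO | 27.71 | 11.04 | 29.72 | 48.12 |
| CISPO | 28.82 | 11.87 | 29.58 | collapse |
| DPPO | 28.19 | 10.90 | 28.89 | 49.23 |
| TRM-Max | 25.21 | 9.72 | 26.73 | 20.27 |
| TRM-Avg | 26.87 | 11.70 | 27.98 | 48.96 |
| CPPO (ours) | 31.88 | 12.78 | 31.11 | 54.79 |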

Best validation AIME24/25/26 Avg@16 (%) within the matched training window $[0,T^{\text{stop}}]$. All divergence masks share the same Top-K reduced-TV score and per-model threshold scale, so the comparison isolates the trust-region rule rather than differences in the divergence measure. Bold = best, underline = second.

+5.56 over the best baseline on 30B-A3B-Base, the longest-horizon (16k) setting, where remaining-horizon amplification is strongest. Over the matched DPPO baseline, the gains are +3.69 / +1.88 / +2.22 / +5.56 across the four models, largest at the longest horizon.
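The per-model margins over DPPO can be re-derived from the results table. The short check below copies the DPPO and CPPO rows and subtracts them; the variable and function names are illustrative only.

```python
# Best validation AIME24/25/26 Avg@16 (%) copied from the results table.
MODELS = ["Qwen3-1.7B", "1.7B-Base", "8B-Base", "30B-A3B-Base"]
DPPO = [28.19, 10.90, 28.89, 49.23]
CPPO = [31.88, 12.78, 31.11, 54.79]


def margins(ours, baseline):
    """Per-model absolute gain in percentage points, rounded to two decimals."""
    return [round(a - b, 2) for a, b in zip(ours, baseline)]


gains = margins(CPPO, DPPO)
for name, gain in zip(MODELS, gains):
    print(f"{name}: +{gain:.2f}")
```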
Validation AIME24/25/26 Avg@16 curves for the three Base-model runs.
Figure 2: Validation curves on Qwen3-1.7B-Base, 8B-Base and 30B-A3B-Base. CPPO holds a consistent lead, and its separation from DPPO widens over training as the prefix constraint engages. On 30B-A3B-Base, CISPO collapses and TRM-Max degrades to 20.27, while CPPO stays stable.
Ablations on Qwen3-1.7B: single-mechanism, position-weight ordering, hard mask vs soft gate.
Figure 3: Qwen3-1.7B. Each mechanism alone (position weights or prefix budget) already beats DPPO, and the full mask combining both is best (left); the ordered, decreasing schedule clearly beats shuffled weights, indicating that the position ordering matters rather than the threshold values themselves (middle); the soft-gate variant roughly matches the hard mask (right).
Qwen3-1.7B-Base ablations: hyperparameter sensitivity, KL vs TV, Binary vs Top-K.
Figure 4: Qwen3-1.7B-Base. Results are robust to $(\delta_b, w_{\min})$ (left) and to the divergence choice: KL matches TV, and Binary matches Top-K (middle, right). The gain comes from enforcing thresholds as a prefix budget, not from the choice of divergence estimator.
§03

Why Uniform Fails

$$|\Delta(\mu,\pi)| \le \sum_{t \lt T}\lambda_t\, u_t, \qquad \lambda_t = 4\xi\,\bar{\ell}\,(T-t)$$

The asymmetry

The multiplier $\lambda_t \propto (T-t)$ grows linearly with the remaining horizon: an early-token deviation changes the conditional distribution of every later token, so its error compounds over the whole suffix. A flat $D_t \le \delta$ ignores this structure, under-penalizing high-impact early deviations and over-restricting late exploration.
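To make the asymmetry concrete, here is a minimal sketch that evaluates the right-hand side of the bound for the same deviation placed on the first versus the last token. It rests on assumptions not fixed by the text: positions are indexed $t = 0, \dots, T-1$, $u_t$ is treated as a per-token deviation term, and $\xi$ and $\bar{\ell}$ are set to 1 as placeholders.

```python
def horizon_multipliers(T, xi=1.0, ell_bar=1.0):
    """lambda_t = 4 * xi * ell_bar * (T - t), with t = 0..T-1 (assumed index range)."""
    return [4.0 * xi * ell_bar * (T - t) for t in range(T)]


def deviation_bound(lams, u):
    """Right-hand side of the bound: sum over t of lambda_t * u_t."""
    return sum(lam * u_t for lam, u_t in zip(lams, u))


T = 8
delta = 0.1
lams = horizon_multipliers(T)

# Spend the same per-token deviation delta on the first or on the last token.
early = [delta] + [0.0] * (T - 1)
late = [0.0] * (T - 1) + [delta]

early_bound = deviation_bound(lams, early)
late_bound = deviation_bound(lams, late)
print(early_bound, late_bound)
```

Under these placeholder values the early deviation contributes $T$ times as much to the bound as the late one, which is the gap a flat threshold cannot see.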

Uniform Threshold

  • Same $\delta$ at every position
  • Loose on early, high-impact tokens
  • No memory of accumulated drift

CPPO (Ours)

  • $w_t$ tightens early tokens
  • Relaxes late tokens, which have a shorter remaining horizon
  • $\delta_b$ caps cumulative prefix drift
§04

Method

CPPO
01

Position-Weighted Threshold

A linearly decreasing weight schedule yields a position-dependent threshold $D_t \le \delta/w_t$ that is strict early and relaxed late:

$$w_t = 1 - \frac{1-w_{\min}}{T-1}(t-1),\qquad w_t \in [w_{\min}, 1]$$
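The schedule translates directly into code. The sketch below uses 1-based positions as in the formula; the function names and the handling of the degenerate case $T = 1$ are illustrative assumptions.

```python
def position_weights(T, w_min):
    """w_t = 1 - (1 - w_min) / (T - 1) * (t - 1) for t = 1..T."""
    if T < 1 or not (0.0 < w_min <= 1.0):
        raise ValueError("need T >= 1 and 0 < w_min <= 1")
    if T == 1:
        return [1.0]  # assumed convention: a single token gets full weight
    step = (1.0 - w_min) / (T - 1)
    return [1.0 - step * (t - 1) for t in range(1, T + 1)]


def token_thresholds(delta, weights):
    """Position-dependent per-token threshold delta / w_t."""
    return [delta / w_t for w_t in weights]


weights = position_weights(T=5, w_min=0.5)
thresholds = token_thresholds(0.1, weights)
print(weights)
print(thresholds)
```

With $w_{\min} = 0.5$ the last position is allowed twice the divergence of the first.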
02

Cumulative Prefix Budget

An effective threshold caps the weighted prefix average and tightens as drift accumulates:

$$w_t D_t \le c_t^{\text{CPPO}} := \min\{\delta,\; \delta + \delta_b W_{t-1} - S_{t-1}\}$$
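A minimal sketch of the effective threshold follows. The text here does not define $S_t$ and $W_t$, so the code assumes they are running sums, $S_t = \sum_{s \le t} w_s D_s$ and $W_t = \sum_{s \le t} w_s$ with $S_0 = W_0 = 0$; the example numbers are made up for illustration.

```python
def effective_thresholds(D, w, delta, delta_b):
    """c_t = min(delta, delta + delta_b * W_{t-1} - S_{t-1}) for each position.

    Assumes S and W are running sums of w_s * D_s and w_s (not stated in the text).
    """
    S_prev, W_prev = 0.0, 0.0
    c = []
    for D_t, w_t in zip(D, w):
        c.append(min(delta, delta + delta_b * W_prev - S_prev))
        S_prev += w_t * D_t  # weighted drift accumulated so far
        W_prev += w_t        # total weight accumulated so far
    return c


D = [0.125, 0.5, 0.5, 0.125]   # illustrative per-token divergences
w = [1.0, 0.75, 0.5, 0.25]     # decreasing position weights
c = effective_thresholds(D, w, delta=0.25, delta_b=0.125)
print(c)
```

Once the two large divergences at positions 2 and 3 have been absorbed into the running sum, the effective threshold falls below $\delta$ and even becomes negative, meaning no further weighted divergence is admitted until the prefix average recovers.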
03

Token Mask & Surrogate

The two constraints fold into a single feasibility test $I_t$; corrective updates are never blocked, and the mask plugs straight into the PPO/GRPO ratio-advantage objective:

$$M_t^{\text{CPPO}} = \mathbb{1}\!\left(\hat{A}_t(\rho_t-1)\le 0 \;\vee\; I_t\right),\quad I_t : w_t D_t \le \delta \;\wedge\; S_t \le \delta + \delta_b W_{t-1}$$
$$\mathcal{L}^{\text{CPPO}}_{\mu}(\pi) = \mathbb{E}_{\mu}\!\left[\sum_{t=1}^{T} M_t^{\text{CPPO}}\,\rho_t\,\hat{A}_t\right]$$

Reuses the same per-token divergence as DPPO, with no extra loss terms and no new estimator.
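The mask and surrogate can be sketched for a single sequence as below. This is an illustration under stated assumptions, not the authors' implementation: $S_t$ and $W_t$ are taken to be running sums over all tokens (including masked ones), the expectation over $\mu$ is replaced by one sampled sequence, and the inputs are invented numbers. In actual training $\rho_t$ would be differentiable and the mask treated as a constant.

```python
def cppo_mask(D, w, rho, adv, delta, delta_b):
    """M_t = 1 if the update is corrective or the feasibility test I_t holds."""
    S_prev, W_prev = 0.0, 0.0
    mask = []
    for D_t, w_t, rho_t, A_t in zip(D, w, rho, adv):
        S_t = S_prev + w_t * D_t  # assumed: S_t is a running sum of w_s * D_s
        feasible = (w_t * D_t <= delta) and (S_t <= delta + delta_b * W_prev)
        corrective = A_t * (rho_t - 1.0) <= 0.0  # pushes the ratio back toward 1
        mask.append(1.0 if (corrective or feasible) else 0.0)
        S_prev, W_prev = S_t, W_prev + w_t
    return mask


def cppo_surrogate(mask, rho, adv):
    """Single-sequence estimate of sum_t M_t * rho_t * A_t."""
    return sum(m * r * a for m, r, a in zip(mask, rho, adv))


D = [0.125, 0.5, 0.5, 0.125]
w = [1.0, 0.75, 0.5, 0.25]
rho = [1.25, 1.5, 0.5, 1.5]
adv = [1.0, 1.0, 1.0, 1.0]

mask = cppo_mask(D, w, rho, adv, delta=0.25, delta_b=0.125)
loss = cppo_surrogate(mask, rho, adv)
print(mask, loss)
```

In this example token 2 fails the token-level test, token 3 is infeasible but kept because its update is corrective, and token 4 passes the token-level test yet is masked by the prefix budget.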

Cumulative prefix constraint on a token window.
Figure 5: The prefix constraint on a token window. Blue: the token-level threshold $\delta$ is active. Orange: accumulated drift pushes the prefix average up, so the effective threshold drops below $\delta$; the orange bar passes the token-level test yet is masked by the prefix budget.
Position-conditioned policy deviation and the induced threshold.
Figure 6: Policy deviation $|\pi_\theta-\mu|$ from Qwen3-30B-A3B rollouts grows with token position (left, middle); the induced threshold $\delta/w_t$ is tighter early and more relaxed late (right).
§05

Takeaways

4⁄4

Best Everywhere

Top score on all four Qwen3 models, covering dense and MoE, Base and post-trained.

+5.56

Long Horizon

Biggest margin on 30B-A3B-Base (16k), where early-token propagation dominates.

Stable

No Collapse

Trains stably in the setting where CISPO collapses and TRM-Max falls to 20.27.

Drop-in

No New Loss

A token mask on top of PPO/GRPO that reuses DPPO's divergence and gives a provably tighter policy-improvement bound.