CPPO — Beyond Uniform Token-Level Trust Region in LLM Reinforcement Learning

§01

TL;DR

In one screen

PPO/GRPO-style trust regions apply the same threshold to every token. In autoregressive generation, however, early tokens matter more and drift accumulates along the prefix. CPPO reallocates the divergence budget along these two axes through a drop-in token mask, with no new loss term.

Problem. A uniform threshold $D_t \le \delta$ takes the same value at every position: it is too loose on high-impact early tokens and too tight on late ones, and it keeps no memory of how far the prefix has already drifted.
Fix. A position weight $w_t$ tightens early positions, and a cumulative prefix budget $\delta_b$ caps the weighted average divergence over the prefix. Together they give a provably tighter policy-improvement bound.
Result. Best validation AIME24/25/26 Avg@16 on all four Qwen3 settings; +5.56 over the best baseline on 30B-A3B-Base, with stable training where other methods collapse.

§02

Results

4× Qwen3

Method	Qwen3-1.7B	1.7B-Base	8B-Base	30B-A3B-Base
GRPO	27.91	8.89	23.96	38.19
MinPRO	27.71	11.04	29.72	48.12
CISPO	28.82	11.87	29.58	collapse
DPPO	28.19	10.90	28.89	49.23
TRM-Max	25.21	9.72	26.73	20.27
TRM-Avg	26.87	11.70	27.98	48.96
CPPO (ours)	31.88	12.78	31.11	54.79

Best validation AIME24/25/26 Avg@16 (%) within the matched training window $[0,T^{\text{stop}}]$. All divergence masks share the same Top-K reduced-TV score and per-model threshold scale, so the comparison isolates the trust-region rule itself rather than differences in the divergence measure. CPPO has the best score in every column.

+5.56 over the best baseline on 30B-A3B-Base, the longest-horizon (16k) setting, where remaining-horizon error amplification is strongest. Over the matched DPPO baseline, the margins are +3.69 / +1.88 / +2.22 / +5.56 across the four models, with the largest gain in the longest-horizon setting.
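The margins quoted above can be re-derived from the results table. The following is a small arithmetic check, not part of the original evaluation code; the dictionary layout and variable names are illustrative, and the collapsed CISPO run is represented as `None`.

```python
# Scores copied from the results table (validation AIME24/25/26 Avg@16, %).
# Column order: Qwen3-1.7B, 1.7B-Base, 8B-Base, 30B-A3B-Base.
scores = {
    "GRPO":    [27.91, 8.89, 23.96, 38.19],
    "MinPRO":  [27.71, 11.04, 29.72, 48.12],
    "CISPO":   [28.82, 11.87, 29.58, None],   # None marks the collapsed run
    "DPPO":    [28.19, 10.90, 28.89, 49.23],
    "TRM-Max": [25.21, 9.72, 26.73, 20.27],
    "TRM-Avg": [26.87, 11.70, 27.98, 48.96],
}
cppo = [31.88, 12.78, 31.11, 54.79]

# Margin of CPPO over the matched DPPO baseline, per model.
margin_over_dppo = [round(c - d, 2) for c, d in zip(cppo, scores["DPPO"])]

# Margin of CPPO over the strongest baseline in each column.
best_baseline = [
    max(row[i] for row in scores.values() if row[i] is not None)
    for i in range(4)
]
margin_over_best = [round(c - b, 2) for c, b in zip(cppo, best_baseline)]

print(margin_over_dppo)   # [3.69, 1.88, 2.22, 5.56]
print(margin_over_best)   # [3.06, 0.91, 1.39, 5.56]
```

On 30B-A3B-Base the strongest baseline is DPPO itself, which is why both margins coincide at +5.56 in that column.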

Figure 2. Validation AIME24/25/26 Avg@16 curves on Qwen3-1.7B-Base, 8B-Base and 30B-A3B-Base. CPPO holds a consistent lead; its separation from DPPO **widens over training** as the prefix constraint engages. On 30B-A3B-Base, CISPO collapses and TRM-Max degrades to 20.27, while CPPO stays stable.

Figure 3. Ablations on Qwen3-1.7B: single mechanism, position-weight ordering, and hard mask vs. soft gate. The position weight and the prefix budget each beat DPPO on their own, and the full mask combining them is best (left). The ordered, front-to-back decreasing schedule clearly beats shuffled weights, which indicates that the positional ordering matters rather than the threshold values themselves (middle). The soft-gate variant tracks the hard mask (right).

Figure 4. Ablations on Qwen3-1.7B-Base: hyperparameter sensitivity, KL vs. TV, and Binary vs. Top-K. Results are robust to $(\delta_b, w_{\min})$ (left) and to the divergence choice: KL matches TV, and Binary matches Top-K (middle, right). The gain comes from enforcing thresholds as a **prefix budget**, not from the choice of divergence estimator.

§03

Why Uniform Fails

$$|\Delta(\mu,\pi)| \le \sum_{t \lt T}\lambda_t\, u_t, \qquad \lambda_t = 4\xi\,\bar{\ell}\,(T-t)$$


The asymmetry

The multiplier $\lambda_t \propto (T-t)$ grows linearly with the remaining horizon: an early-token deviation changes the conditional distribution of every later token, so its error compounds over the whole suffix. A flat $D_t \le \delta$ ignores this structure, under-penalizing high-impact early deviations and over-restricting late exploration.
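To make the asymmetry concrete, the sketch below evaluates the multiplier $\lambda_t = 4\xi\,\bar{\ell}\,(T-t)$ on a toy sequence. The constants $\xi$ and $\bar{\ell}$ are not given numerically here, so both are set to 1 as placeholders; the function name and the toy length $T=8$ are likewise illustrative assumptions.

```python
import numpy as np

def horizon_multipliers(T, xi=1.0, ell_bar=1.0):
    """Return lambda_t = 4 * xi * ell_bar * (T - t) for t = 1, ..., T-1.

    xi and ell_bar are placeholder constants (assumed, set to 1 by default).
    """
    t = np.arange(1, T)
    return 4.0 * xi * ell_bar * (T - t)

T = 8
lam = horizon_multipliers(T)

# The first position carries (T - 1) times the weight of the last one.
ratio_first_to_last = lam[0] / lam[-1]

# Fraction of the total multiplier mass held by the first three positions.
early_share = lam[:3].sum() / lam.sum()

print(lam)                  # multipliers decrease linearly with t
print(ratio_first_to_last)  # 7.0 for T = 8
print(round(early_share, 3))
```

With equal per-token deviations, the earliest positions therefore dominate the right-hand side of the bound, which is the motivation for tightening them first.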

Uniform Threshold

Same $\delta$ at every position
Loose on early, high-impact tokens
No memory of accumulated drift

CPPO (Ours)

$w_t$ tightens early tokens
Relaxes late tokens, whose remaining horizon is shorter
$\delta_b$ caps cumulative prefix drift

§04

Method

CPPO

01

Position-Weighted Threshold

A linearly decreasing weight schedule yields a position-dependent threshold $D_t \le \delta/w_t$ that is strict early and relaxed late:

$$w_t = 1 - \frac{1-w_{\min}}{T-1}(t-1),\qquad w_t \in [w_{\min}, 1]$$
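The schedule above is simple to compute. The following is a minimal NumPy sketch, assuming 1-indexed positions $t = 1, \dots, T$; the function name and the toy values of $T$, $w_{\min}$ and $\delta$ are illustrative, not taken from the experiments.

```python
import numpy as np

def position_weights(T, w_min):
    """Linear schedule w_t = 1 - (1 - w_min) / (T - 1) * (t - 1), t = 1..T."""
    if T < 1:
        raise ValueError("T must be at least 1")
    if T == 1:
        return np.ones(1)  # single token: the schedule degenerates to w_1 = 1
    t = np.arange(1, T + 1)
    return 1.0 - (1.0 - w_min) / (T - 1) * (t - 1)

delta = 0.1                      # toy token-level threshold (assumed value)
w = position_weights(T=5, w_min=0.5)
thresholds = delta / w           # position-dependent threshold on D_t

print(w)           # [1.    0.875 0.75  0.625 0.5  ]
print(thresholds)  # grows from delta at t = 1 to delta / w_min at t = T
```

The first token is held to exactly $\delta$, and the last token is allowed $\delta / w_{\min}$, so $w_{\min}$ controls how much late positions are relaxed.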

02

Cumulative Prefix Budget

An effective threshold caps the weighted prefix average and tightens as drift accumulates:

$$w_t D_t \le c_t^{\text{CPPO}} := \min\{\delta,\; \delta + \delta_b W_{t-1} - S_{t-1}\}$$
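A small sketch of the effective threshold follows. The running sums are not defined explicitly in this summary, so the code assumes $S_t = \sum_{s \le t} w_s D_s$ and $W_t = \sum_{s \le t} w_s$ with $S_0 = W_0 = 0$, and assumes every token contributes to the sums whether or not it is later masked. The function name and all numeric values are illustrative.

```python
import numpy as np

def effective_thresholds(D, w, delta, delta_b):
    """c_t = min(delta, delta + delta_b * W_{t-1} - S_{t-1}) for each t.

    Assumes S_t = cumsum(w * D) and W_t = cumsum(w), with S_0 = W_0 = 0.
    """
    wD = w * D
    S_prev = np.concatenate(([0.0], np.cumsum(wD)[:-1]))  # S_{t-1}
    W_prev = np.concatenate(([0.0], np.cumsum(w)[:-1]))   # W_{t-1}
    return np.minimum(delta, delta + delta_b * W_prev - S_prev)

# Toy example with flat weights so the arithmetic is easy to follow.
w = np.ones(4)
D = np.array([0.05, 0.09, 0.09, 0.09])
c = effective_thresholds(D, w, delta=0.1, delta_b=0.05)

passes_token_test = w * D <= 0.1   # plain token-level check against delta
passes_effective = w * D <= c      # check against the effective threshold

print(c)                  # approximately [0.1, 0.1, 0.06, 0.02]
print(passes_token_test)  # [ True  True  True  True]
print(passes_effective)   # [ True  True False False]
```

In this toy run the third and fourth tokens satisfy the token-level test yet fail once the accumulated drift has used up the prefix budget, which is the behavior the cumulative constraint is meant to produce.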

03

Token Mask & Surrogate

The two constraints fold into a single feasibility test $I_t$. Corrective updates are never blocked, and the mask plugs directly into the PPO/GRPO ratio-advantage objective:

$$M_t^{\text{CPPO}} = \mathbb{1}\!\left(\hat{A}_t(\rho_t-1)\le 0 \;\vee\; I_t\right),\quad I_t : w_t D_t \le \delta \;\wedge\; S_t \le \delta + \delta_b W_{t-1}$$

$$\mathcal{L}^{\text{CPPO}}_{\mu}(\pi) = \mathbb{E}_{\mu}\!\left[\sum_{t=1}^{T} M_t^{\text{CPPO}}\,\rho_t\,\hat{A}_t\right]$$

CPPO reuses the same per-token divergence as DPPO, adding no extra loss terms and no new estimator.
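The mask and surrogate can be sketched for a single sampled sequence as follows. This is an illustration under stated assumptions, not the reference implementation: it takes $S_t = \sum_{s \le t} w_s D_s$ and $W_t = \sum_{s \le t} w_s$ with $W_0 = 0$, treats the mask as a constant (no gradient flows through it), and replaces the expectation over $\mu$ with one sequence. Function names and toy inputs are hypothetical.

```python
import numpy as np

def cppo_mask(ratio, adv, D, w, delta, delta_b):
    """M_t = 1[ adv_t * (ratio_t - 1) <= 0  OR  I_t ].

    I_t holds when w_t * D_t <= delta and S_t <= delta + delta_b * W_{t-1}.
    """
    wD = w * D
    S = np.cumsum(wD)                                     # S_t (assumed form)
    W_prev = np.concatenate(([0.0], np.cumsum(w)[:-1]))   # W_{t-1}
    feasible = (wD <= delta) & (S <= delta + delta_b * W_prev)
    corrective = adv * (ratio - 1.0) <= 0.0   # update moves back toward mu
    return (corrective | feasible).astype(float)

def cppo_surrogate(ratio, adv, mask):
    """Masked ratio-advantage sum for one sequence."""
    return float(np.sum(mask * ratio * adv))

# Toy inputs (illustrative values only).
w = np.ones(4)
D = np.array([0.05, 0.09, 0.09, 0.09])
ratio = np.array([1.1, 1.2, 1.2, 0.9])
adv = np.ones(4)

mask = cppo_mask(ratio, adv, D, w, delta=0.1, delta_b=0.05)
loss = cppo_surrogate(ratio, adv, mask)

print(mask)  # [1. 1. 0. 1.]
print(loss)  # approximately 3.2
```

The third token is masked because the prefix budget is exhausted, while the fourth is kept even though it also violates the budget: its ratio is below 1 with a positive advantage, so the update is corrective and passes through the first clause of the mask.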

Figure 5. The cumulative prefix constraint on a token window. **Blue:** the token-level threshold $\delta$ is active. **Orange:** accumulated drift pushes the prefix average up, so the effective threshold drops below $\delta$; the orange bar passes the token test yet is masked by the prefix budget.

Figure 6. Position-conditioned policy deviation and the induced threshold. Policy deviation $|\pi_\theta-\mu|$ from Qwen3-30B-A3B rollouts grows with token position (left, middle); the induced threshold $\delta/w_t$ is tighter early and relaxed late (right).

§05

Takeaways

4⁄4

Best Everywhere

Top score on all four Qwen3 models, covering dense and MoE, Base and post-trained.

+5.56

Long Horizon

Biggest margin on 30B-A3B-Base (16k), where early-token propagation dominates.

Stable

No Collapse

Trains stably on 30B-A3B-Base, where CISPO collapses and TRM-Max falls to 20.27.

Drop-in

No New Loss

Only a token mask on top of PPO/GRPO, reusing DPPO's divergence, yet it gives a provably tighter policy-improvement bound.