RLVR · Trust Region
arXiv 2606.10968
Beyond Uniform Token-Level Trust Region in LLM Reinforcement Learning
PPO/GRPO-style trust regions apply the same threshold to every token. In autoregressive generation, however, early tokens matter more and drift accumulates along the prefix. CPPO reallocates the divergence budget to reflect both effects, using a drop-in token mask that adds no new loss term.
| Method | Qwen3-1.7B | Qwen3-1.7B-Base | Qwen3-8B-Base | Qwen3-30B-A3B-Base |
|---|---|---|---|---|
| GRPO | 27.91 | 8.89 | 23.96 | 38.19 |
| MinPRO | 27.71 | 11.04 | <u>29.72</u> | 48.12 |
| CISPO | <u>28.82</u> | <u>11.87</u> | 29.58 | collapse |
| DPPO | 28.19 | 10.90 | 28.89 | <u>49.23</u> |
| TRM-Max | 25.21 | 9.72 | 26.73 | 20.27 |
| TRM-Avg | 26.87 | 11.70 | 27.98 | 48.96 |
| CPPO (ours) | **31.88** | **12.78** | **31.11** | **54.79** |
Best validation AIME24/25/26 Avg@16 (%) within the matched training window $[0,T^{\text{stop}}]$. All divergence masks share the same Top-K reduced-TV score and per-model threshold scale, so the comparison isolates the trust-region rule itself rather than differences in the divergence measure. Bold = best, underline = second.
The multiplier $\lambda_t \propto (T-t)$ grows linearly with the remaining horizon: a deviation at an early token changes the conditioning of every later token, so its error compounds over the whole suffix. A flat threshold $D_t \le \delta$ ignores this structure: it under-penalizes high-impact early deviations and over-restricts late exploration.
A linearly decreasing weight schedule $w_t$ yields a position-dependent threshold $D_t \le \delta/w_t$ that is strict at early positions and relaxed at late ones.
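As a rough illustration of the position-dependent rule, the sketch below builds a linearly decreasing weight schedule and the resulting per-position threshold $\delta/w_t$. The endpoint weights `w_start` and `w_end`, the function names, and the exact interpolation are assumptions made for illustration; the text only states that the schedule is linear and decreasing.

```python
def linear_weights(T, w_start=2.0, w_end=0.5):
    """Linearly decreasing weights w_0 > ... > w_{T-1}.

    The endpoint values are illustrative assumptions, not taken from the paper.
    """
    if T == 1:
        return [w_start]
    step = (w_start - w_end) / (T - 1)
    return [w_start - step * t for t in range(T)]


def position_thresholds(delta, weights):
    """Per-position divergence threshold delta / w_t."""
    return [delta / w for w in weights]


def position_ok(divergences, delta, weights):
    """True at positions where D_t <= delta / w_t."""
    return [d <= delta / w for d, w in zip(divergences, weights)]
```

With these assumed endpoints, the same divergence value can be rejected at the start of a sequence and accepted near the end, which is the "strict early, relaxed late" behavior described above.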
An effective threshold caps the weighted prefix average of the divergence and tightens as drift accumulates.
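One way to read this rule, offered as an assumption rather than the paper's definition: if the weighted prefix average $\sum_{s\le t} w_s D_s / \sum_{s\le t} w_s$ must stay at or below $\delta$, then the largest admissible $D_t$ is whatever budget the earlier tokens left unspent. The sketch below computes that quantity; the function name and the clamp at zero are hypothetical choices.

```python
def effective_thresholds(divergences, weights, delta):
    """Largest D_t that keeps the weighted prefix average at or below delta.

    Assumed form: (delta * sum_{s<=t} w_s - sum_{s<t} w_s * D_s) / w_t,
    clamped at zero. It shrinks when earlier tokens spent more divergence.
    """
    thresholds = []
    spent = 0.0   # sum of w_s * D_s over positions before t
    w_sum = 0.0   # sum of w_s over positions up to and including t
    for d, w in zip(divergences, weights):
        w_sum += w
        thresholds.append(max(0.0, (delta * w_sum - spent) / w))
        spent += w * d
    return thresholds
```

Under this reading, a sequence whose early tokens drift heavily leaves little or no room for later tokens, while a sequence with no early drift carries its unused budget forward.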
The two constraints fold into a single feasibility test $I_t$. Corrective updates are never blocked, and the mask plugs directly into the PPO/GRPO ratio–advantage objective.
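The combination can be sketched as follows, under stated assumptions. A token passes when its divergence satisfies both thresholds, which are supplied here as precomputed lists. "Corrective" is interpreted, as in PPO-style clipping, as an update whose advantage sign pushes the probability ratio back toward 1; that interpretation and all function names are assumptions, not the paper's definitions. In a real training loop the mask would be treated as a constant with respect to gradients.

```python
def feasibility_mask(divergences, pos_thr, eff_thr, ratios, advantages):
    """Return I_t in {0.0, 1.0} for each token.

    A token is kept if it satisfies both thresholds, or if the update is
    corrective (assumed meaning: it moves the ratio back toward 1).
    """
    mask = []
    for d, p, e, r, a in zip(divergences, pos_thr, eff_thr, ratios, advantages):
        feasible = d <= p and d <= e
        corrective = (r > 1.0 and a < 0.0) or (r < 1.0 and a > 0.0)
        mask.append(1.0 if (feasible or corrective) else 0.0)
    return mask


def masked_surrogate(ratios, advantages, mask):
    """Mean over tokens of I_t * r_t * A_t, the quantity to maximize."""
    terms = [m * r * a for m, r, a in zip(mask, ratios, advantages)]
    return sum(terms) / len(terms)
```

Because the mask only multiplies the existing ratio-advantage term, no additional loss term or estimator is needed, which matches the drop-in description above.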
Reuses the same per-token divergence as DPPO: no extra loss terms and no new estimator.
Top score on all four Qwen3 models, covering dense and MoE, Base and post-trained.
Largest margin on 30B-A3B-Base (16k), where early-token propagation dominates.
Trains stably where CISPO collapses and TRM-Max falls to 20.27.
A token mask on top of PPO/GRPO that reuses DPPO's divergence, with a provably tighter policy-improvement bound.