V0.5 is an adaptive baseline-estimation framework for Reinforcement Learning with Verifiable Rewards (RLVR). By fusing the predictions of a generalist value model (V0) with sparse empirical rollouts, V0.5 achieves low-variance policy gradients without synchronous value-model training.
In RLVR, GRPO estimates its baseline from empirical sampling alone and therefore suffers high variance under sparse rollouts. PPO instead relies on a coupled value model, which requires expensive synchronous training and is susceptible to Out-of-Distribution (OOD) hallucinations.
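To make the variance issue concrete, here is a minimal sketch of a GRPO-style group baseline: each rollout's advantage is its reward minus the group mean. The function name is ours, and some GRPO variants additionally divide by the group standard deviation; with only a handful of rollouts per prompt, the group mean is itself a noisy estimate, which is the high-variance failure mode described above.

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages: reward minus the empirical group mean.

    With few rollouts per prompt, the baseline (group mean) is a noisy
    estimate, so the resulting advantages are high-variance.
    """
    rewards = np.asarray(rewards, dtype=float)
    baseline = rewards.mean()
    return rewards - baseline

# Eight verifiable {0, 1} rewards for one prompt.
adv = grpo_advantages([1, 0, 0, 1, 1, 0, 1, 1])
```

By construction the advantages sum to zero within each group, so the baseline removes the group-level reward offset but not the sampling noise.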
V0.5 dynamically balances the two: it treats V0 as a statistical prior and adaptively fuses it with sparse online rollouts, minimizing the Mean Squared Error (MSE) of the baseline on the fly for robust advantage estimation.
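One way such an MSE-minimizing fusion can be sketched is precision weighting: combine the V0 prediction and the empirical rollout mean in proportion to their inverse variances, so the empirical mean dominates as the rollout count grows. This is an illustrative assumption, not necessarily V0.5's exact estimator; `prior_var`, the assumed error variance of the V0 prediction, is a hypothetical parameter.

```python
import numpy as np

def fused_baseline(rollout_rewards, v0_pred, prior_var):
    """Precision-weighted fusion of a V0 prior with sparse rollouts.

    Illustrative sketch (an assumption, not the paper's exact rule): the
    MSE-optimal convex combination of two independent unbiased estimates
    weights each by its inverse variance.
    """
    r = np.asarray(rollout_rewards, dtype=float)
    n = r.size
    emp_mean = r.mean()
    # Variance of the empirical mean under i.i.d. rollouts; with a single
    # rollout it is unknown, which we encode as infinite.
    emp_var = r.var(ddof=1) / n if n > 1 else float("inf")
    w = prior_var / (prior_var + emp_var)  # weight on the empirical mean
    return w * emp_mean + (1.0 - w) * v0_pred
```

At the extremes the behavior is intuitive: with one rollout the estimate falls back entirely on the V0 prior, and as rollouts accumulate the weight shifts to the empirical mean.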
V0.5 recasts baseline estimation as a dynamic process of statistical testing and budget allocation: instead of rigid uniform sampling, it operates in four simple steps.
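The "statistical testing" part of that process can be illustrated with a simple gate for deciding where to spend extra rollout budget. Everything here is hypothetical (function name, z-test form, and `prior_var` are our assumptions, not the paper's procedure): the idea is to sample more only where the empirical evidence significantly disagrees with the V0 prior.

```python
import numpy as np

def needs_more_rollouts(rollout_rewards, v0_pred, prior_var, z_crit=1.96):
    """Hypothetical budget-allocation test (illustrative, not the paper's rule).

    Spend extra rollouts only where the empirical mean disagrees with the
    V0 prior by more than z_crit combined standard errors, i.e. where
    trusting the prior alone would likely bias the baseline.
    """
    r = np.asarray(rollout_rewards, dtype=float)
    n = r.size
    # Combined standard error of the disagreement (assumed independence).
    se = np.sqrt(r.var(ddof=1) / n + prior_var)
    z = abs(r.mean() - v0_pred) / se if se > 0 else np.inf
    return bool(z > z_crit)
```

Prompts that pass the test keep their cheap fused baseline; prompts that fail it receive additional rollouts, which is one way a fixed sampling budget could be reallocated adaptively.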
If you find V0.5 useful in your research, please cite our work:
@article{generalist_value_model_v05,
  author  = {Yi-Kai Zhang and Yueqing Sun and Hongyan Hao and Qi Gu and Xunliang Cai and De-Chuan Zhan and Han-Jia Ye},
  title   = {V0.5: Generalist Value Model as a Prior for Sparse RL Rollouts},
  journal = {CoRR},
  volume  = {abs/2603.10848},
  year    = {2026}
}