V0.5: Generalist Value Model as a Prior for Sparse RL Rollouts

Yi-Kai Zhang, Yueqing Sun, Hongyan Hao, Qi Gu, Xunliang Cai, De-Chuan Zhan, Han-Jia Ye

V0.5 is an adaptive baseline estimation framework for Reinforcement Learning with Verifiable Rewards (RLVR). By fusing predictions from a frozen generalist value model (V0) with sparse empirical rollouts, V0.5 achieves low-variance policy gradients without the need for synchronous value-model training.

⚡ Powered by verl 0.6.1 and sglang 0.4.10

💡 Overview

🔴 The Dilemma

In RLVR, GRPO estimates its baseline purely from empirical group sampling, so it suffers from high variance when rollouts are sparse. PPO instead trains a coupled value model, which requires expensive synchronous training and is susceptible to Out-Of-Distribution (OOD) hallucinations.
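To see why sparse rollouts hurt a group-mean baseline, the sketch below simulates the variance of a GRPO-style baseline (the empirical mean of binary verifiable rewards in a group). All numbers here are hypothetical and not taken from the paper; the point is only that the baseline's variance shrinks as `p(1 - p) / n`, so small groups give noisy advantage estimates.

```python
import numpy as np

# Illustrative only: variance of a GRPO-style group-mean baseline
# under sparse rollouts. The pass rate and group sizes are made up.
rng = np.random.default_rng(0)
true_mean = 0.3          # assumed pass rate of the policy on one prompt
n_trials = 10_000        # Monte Carlo repetitions per group size

baseline_var = {}
for n_rollouts in (4, 16, 64):
    # 0/1 verifiable rewards for each rollout in the group
    rewards = rng.binomial(1, true_mean, size=(n_trials, n_rollouts))
    # GRPO baseline: empirical mean reward of the group
    baseline_var[n_rollouts] = rewards.mean(axis=1).var()

# Variance decays only as p(1 - p) / n, so sparse groups (n = 4)
# yield noisy baselines and high-variance policy gradients.
for n, v in baseline_var.items():
    print(f"n={n:3d}  Var[baseline] ~ {v:.4f}")
```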

🟢 The Solution

V0.5 dynamically balances the two: it treats V0 as a statistical prior and adaptively fuses it with sparse online rollouts, minimizing the Mean Squared Error (MSE) of the baseline on the fly for robust advantage estimation.

Performance of V0.5
Performance of V0.5 across diverse mathematical reasoning benchmarks. It significantly outperforms GRPO and DAPO, converging faster and improving performance by over 10%.

⚡ How It Works

V0.5 transforms baseline estimation into a dynamic process of statistical testing and budget allocation. Instead of rigid sampling, it operates in four simple steps, illustrated in the pipeline below.

Method Pipeline of V0.5
Comparison of PPO, GRPO, and V0.5. V0.5 uniquely constructs an adaptive baseline by fusing a frozen generalist prior with sparse empirical rollouts.

📖 Citation

If you find V0.5 useful in your research, please cite our work:

@article{generalist_value_model_v05,
  author       = {Yi-Kai Zhang and Yueqing Sun and Hongyan Hao and Qi Gu and Xunliang Cai and De-Chuan Zhan and Han-Jia Ye},
  title        = {V0.5: Generalist Value Model as a Prior for Sparse RL Rollouts},
  journal      = {CoRR},
  volume       = {abs/2603.10848},
  year         = {2026}
}