V0.5 is an adaptive baseline-estimation framework for Reinforcement Learning with Verifiable Rewards (RLVR). By fusing the predictions of a generalist value model (V0) with sparse empirical rollouts, V0.5 achieves low-variance policy gradients without synchronous value-model training.
In RLVR, GRPO estimates its baseline from empirical sampling alone and therefore suffers high variance under sparse rollouts. PPO instead relies on a coupled value model, which requires expensive synchronous training and is susceptible to Out-of-Distribution (OOD) hallucinations.
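To make the variance issue concrete, here is a minimal sketch of a GRPO-style group baseline: each rollout's advantage is its reward minus the group mean. The function name is ours, and some GRPO variants additionally divide by the group standard deviation; with only a handful of rollouts per prompt, the group mean is itself a noisy estimate, which is the high-variance failure mode described above.

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages: reward minus the empirical group mean.

    With few rollouts per prompt, the baseline (group mean) is a noisy
    estimate, so the resulting advantages are high-variance.
    """
    rewards = np.asarray(rewards, dtype=float)
    baseline = rewards.mean()
    return rewards - baseline

# Eight verifiable {0, 1} rewards for one prompt.
adv = grpo_advantages([1, 0, 0, 1, 1, 0, 1, 1])
```

By construction the advantages sum to zero within each group, so the baseline removes the group-level reward offset but not the sampling noise.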
V0.5 dynamically balances the two: it treats V0 as a statistical prior and adaptively fuses it with sparse online rollouts, minimizing the Mean Squared Error (MSE) of the baseline on the fly for robust advantage estimation.
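One way such an MSE-minimizing fusion can be sketched is precision weighting: combine the V0 prediction and the empirical rollout mean in proportion to their inverse variances, so the empirical mean dominates as the rollout count grows. This is an illustrative assumption, not necessarily V0.5's exact estimator; `prior_var`, the assumed error variance of the V0 prediction, is a hypothetical parameter.

```python
import numpy as np

def fused_baseline(rollout_rewards, v0_pred, prior_var):
    """Precision-weighted fusion of a V0 prior with sparse rollouts.

    Illustrative sketch (an assumption, not the paper's exact rule): the
    MSE-optimal convex combination of two independent unbiased estimates
    weights each by its inverse variance.
    """
    r = np.asarray(rollout_rewards, dtype=float)
    n = r.size
    emp_mean = r.mean()
    # Variance of the empirical mean under i.i.d. rollouts; with a single
    # rollout it is unknown, which we encode as infinite.
    emp_var = r.var(ddof=1) / n if n > 1 else float("inf")
    w = prior_var / (prior_var + emp_var)  # weight on the empirical mean
    return w * emp_mean + (1.0 - w) * v0_pred
```

At the extremes the behavior is intuitive: with one rollout the estimate falls back entirely on the V0 prior, and as rollouts accumulate the weight shifts to the empirical mean.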
V0.5 recasts baseline estimation as a dynamic process of statistical testing and budget allocation: instead of rigid uniform sampling, it operates in four simple steps.
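The "statistical testing" part of that process can be illustrated with a simple gate for deciding where to spend extra rollout budget. Everything here is hypothetical (function name, z-test form, and `prior_var` are our assumptions, not the paper's procedure): the idea is to sample more only where the empirical evidence significantly disagrees with the V0 prior.

```python
import numpy as np

def needs_more_rollouts(rollout_rewards, v0_pred, prior_var, z_crit=1.96):
    """Hypothetical budget-allocation test (illustrative, not the paper's rule).

    Spend extra rollouts only where the empirical mean disagrees with the
    V0 prior by more than z_crit combined standard errors, i.e. where
    trusting the prior alone would likely bias the baseline.
    """
    r = np.asarray(rollout_rewards, dtype=float)
    n = r.size
    # Combined standard error of the disagreement (assumed independence).
    se = np.sqrt(r.var(ddof=1) / n + prior_var)
    z = abs(r.mean() - v0_pred) / se if se > 0 else np.inf
    return bool(z > z_crit)
```

Prompts that pass the test keep their cheap fused baseline; prompts that fail it receive additional rollouts, which is one way a fixed sampling budget could be reallocated adaptively.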
If you find V0.5 useful in your research, please cite our work:
@article{generalist_value_model_v05,
  author  = {Yi-Kai Zhang and Yueqing Sun and Hongyan Hao and Qi Gu and Xunliang Cai and De-Chuan Zhan and Han-Jia Ye},
  title   = {V0.5: Generalist Value Model as a Prior for Sparse RL Rollouts},
  journal = {CoRR},
  volume  = {abs/2603.10848},
  year    = {2026}
}