Beyond PPO - The New Wave of Policy Optimization Techniques for LLM Post-Training

PPO used to be the default workhorse for RLHF because it’s reasonably stable and easy to reason about, but at LLM post-training scale its tradeoffs start to bite. The critic/value model is expensive to train and maintain; long text rollouts amplify variance and make advantage estimation brittle; clipping becomes a blunt instrument that can under-update (wasting samples) or over-update (destabilizing training); and the whole loop turns into a systems-heavy exercise once “environment interaction” means generating thousands of tokens across distributed inference. As post-training shifted from short preference tuning toward reasoning-heavy objectives (RLVR, long-CoT, verifier-driven rewards, pass@k-style targets) and larger, more heterogeneous data mixtures, these weaknesses became harder to paper over with hyperparameter folklore: the optimization problem is noisier, the feedback signals are sparser, and the failure modes (reward hacking, length bias, mode collapse, over-regularization) are more punishing. That’s why the field has been moving beyond “just PPO” toward more robust, more LLM-native policy optimization: methods that reduce dependence on a critic, stabilize updates under long-horizon generation, better control the distribution shift between sampled rollouts and the current policy, and align the training objective with how we actually evaluate modern models. The goal is to make post-training not just possible, but reliable under the messy realities of large-scale reasoning optimization. ...
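To make the two pain points above concrete, here is a minimal sketch (not any single paper’s exact objective; the function names and toy numbers are illustrative) of the PPO clipped surrogate next to a critic-free, group-relative advantage in the spirit of GRPO, where several completions per prompt are normalized against their group statistics instead of a learned value model:

```python
import torch

def ppo_clipped_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Standard PPO clipped surrogate.
    The per-sample importance ratio is clipped to [1 - eps, 1 + eps]:
    when the ratio saturates the sample contributes no gradient (under-update),
    and noisy advantages can still push large steps before the clip engages."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.mean(torch.min(unclipped, clipped))

def group_relative_advantages(rewards_per_prompt, eps=1e-6):
    """Critic-free baseline (GRPO-style sketch): sample several completions per
    prompt, then standardize each completion's scalar reward against the group
    mean/std, so no value model is needed for advantage estimation."""
    advantages = []
    for rewards in rewards_per_prompt:
        r = torch.as_tensor(rewards, dtype=torch.float32)
        advantages.append((r - r.mean()) / (r.std(unbiased=False) + eps))
    return advantages

if __name__ == "__main__":
    # Toy example: one prompt, four sampled completions scored by a 0/1 verifier.
    advs = group_relative_advantages([[1.0, 0.0, 0.0, 1.0]])[0]
    logp_old = torch.zeros(4)            # sequence log-probs under the sampling policy
    logp_new = torch.tensor([0.05, -0.02, 0.01, 0.03])  # under the updated policy
    print(ppo_clipped_loss(logp_new, logp_old, advs))
```

The point of the contrast: the clipped loss still needs advantages from somewhere, and the group-relative trick shows how recent methods get them without carrying a critic, which is one of the recurring moves in the techniques covered below.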

December 28, 2025 · 58 min