Beyond PPO - The New Wave of Policy Optimization Techniques for LLM Post-Training
PPO used to be the default workhorse for RLHF because it’s reasonably stable and easy to reason about, but at LLM post-training scale its tradeoffs start to bite. The critic/value model is expensive to train and maintain, long text rollouts amplify variance and make advantage estimation brittle, clipping becomes a blunt instrument that can under-update (wasting samples) or over-update (destabilizing), and the whole loop turns into a systems-heavy exercise once “environment interaction” means generating thousands of tokens across distributed inference. As post-training shifted from short preference tuning toward reasoning-heavy objectives (RLVR, long CoT, verifier-driven rewards, pass@k-style targets) and larger, more heterogeneous data mixtures, these weaknesses became harder to paper over with hyperparameter folklore: the optimization problem is noisier, the feedback signals are sparser, and the failure modes (reward hacking, length bias, mode collapse, over-regularization) are more punishing. That’s why the field has been moving beyond “just PPO” toward more robust, more LLM-native policy optimization: methods that reduce dependence on a critic, stabilize updates under long-horizon generation, better control the distribution shift between sampled rollouts and the current policy, and align the training objective with how we actually evaluate modern models, ultimately making post-training not just possible, but reliable under the messy realities of large-scale reasoning optimization. ...
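To make the contrast concrete, here is a minimal sketch (in PyTorch, with hypothetical tensor shapes and helper names, not a drop-in implementation of any particular paper) of the PPO clipped surrogate next to a critic-free, group-relative way of computing advantages in the spirit of GRPO:

```python
import torch

def ppo_clipped_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Standard PPO clipped surrogate over per-sample log-probabilities."""
    ratio = torch.exp(logp_new - logp_old)                 # importance ratio pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.minimum(unclipped, clipped).mean()       # maximize the clipped surrogate

def group_relative_advantages(rewards):
    """Critic-free advantages: normalize rewards within a group of rollouts
    sampled for the same prompt (GRPO-style), instead of learning a value model.

    rewards: tensor of shape [num_prompts, group_size].
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True) + 1e-8
    return (rewards - mean) / std
```

The point of the second function is that advantages come from comparing rollouts for the same prompt against each other, so no separate value model has to be trained, stored, or served during post-training.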
Dissecting FlashInfer - A Systems Perspective on High-Performance LLM Inference
The next frontier of large language model optimization isn’t architectural - it’s infrastructural. We’ve squeezed what we can from model design; now, inference efficiency is dictated by how we map computation to hardware. The challenge is executing that computation with minimal memory movement, maximal kernel fusion, and predictable latency across heterogeneous batches. Every inefficiency (redundant projection, scattered memory access, unaligned kernels) compounds at scale. The gap between theoretical FLOPs and delivered throughput is now a systems problem. ...
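As a rough illustration of that gap (with placeholder hardware numbers, not measurements of any specific GPU), a back-of-the-envelope roofline estimate shows why single-token decode attention is bandwidth-bound rather than compute-bound:

```python
# Rough arithmetic-intensity estimate for single-token decode attention.
# The hardware figures are illustrative placeholders, not vendor specs.
peak_flops = 1000e12   # e.g. ~1 PFLOP/s of tensor-core throughput
mem_bw = 3.3e12        # e.g. ~3.3 TB/s of HBM bandwidth

def decode_attention_intensity(kv_len, num_heads=32, head_dim=128, bytes_per_elem=2):
    # One query token attends over the whole KV cache:
    # roughly 4 * heads * head_dim FLOPs per cached token (QK^T and PV matmuls),
    # while every cached K and V element must be read from memory once.
    flops = 4 * num_heads * head_dim * kv_len
    bytes_moved = 2 * num_heads * head_dim * kv_len * bytes_per_elem
    return flops / bytes_moved          # FLOPs per byte of memory traffic

ai = decode_attention_intensity(kv_len=8192)
ridge = peak_flops / mem_bw             # intensity needed to become compute-bound
print(f"arithmetic intensity ~ {ai:.1f} FLOP/B vs ridge point ~ {ridge:.0f} FLOP/B")
# Decode attention sits far below the ridge point: bandwidth, not FLOPs, is the limit.
```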
A Geometric Analysis of Transformer Representations via Optimal Transport
Transformer models have become the backbone of modern AI, yet their remarkable performance still comes with a critical limitation: we lack a clear understanding of how information is processed inside them. Traditional evaluation focuses on outputs, but this leaves open the deeper question of what actually happens between layers as a model learns to reason. In our work, we approach this problem through a geometric lens, using Optimal Transport to measure how entire distributions of representations shift across layers. This perspective allows us to contrast trained and untrained models, revealing that training does not simply tune parameters, but organizes computation into a structured three-phase strategy: encoding, refinement, and decoding, underpinned by an information bottleneck. By making this internal structure visible, we aim to move closer to principled interpretability, where understanding a model means understanding the pathways of information it discovers through learning. ...
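As a sketch of what that measurement can look like in practice, the snippet below computes an exact optimal-transport cost between the token representations at two consecutive layers using the POT library; the ground cost and normalization used in the paper may differ, and names like hidden_states and num_layers are hypothetical.

```python
import numpy as np
import ot  # POT: Python Optimal Transport (pip install pot)

def layer_transport_cost(acts_a, acts_b):
    """Exact optimal-transport cost between two clouds of token representations.

    acts_a, acts_b: arrays of shape [n_tokens, hidden_dim], e.g. the hidden
    states that the same batch of tokens receives at two consecutive layers.
    """
    n, m = len(acts_a), len(acts_b)
    a, b = ot.unif(n), ot.unif(m)                        # uniform weights over tokens
    M = ot.dist(acts_a, acts_b, metric="sqeuclidean")    # pairwise ground-cost matrix
    return ot.emd2(a, b, M)                              # optimal transport cost

# Hypothetical usage: hidden_states[l] holds layer-l activations for a batch.
# costs = [layer_transport_cost(hidden_states[l], hidden_states[l + 1])
#          for l in range(num_layers - 1)]
# Plotting these per-layer costs is one way to expose encode/refine/decode phases.
```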
Go With The Flow
Flow-based generative models are starting to turn heads as a cool alternative to traditional diffusion methods for things like image and audio generation. What makes them stand out is how they learn smooth, efficient paths to transform one distribution into another: basically a neat and mathematically solid way to generate data. They’ve been getting a lot more buzz lately, especially after Black Forest Labs dropped their FLUX models and Stability AI released SD3.5. That success has brought fresh attention to the earlier ideas behind Rectified Flows, which first popped up at ICLR 2023. ...
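For intuition, here is a minimal sketch of the rectified-flow training objective: sample a straight-line interpolation between noise and data, and regress the model onto the constant velocity along that line (the model signature here is a hypothetical stand-in, not the API of any particular library).

```python
import torch

def rectified_flow_loss(model, x1, noise=None):
    """One training step of a rectified-flow style objective (a sketch).

    x1: a batch of data samples; the model predicts the velocity field
    v_theta(x_t, t) along the straight path from noise x0 to data x1.
    """
    x0 = noise if noise is not None else torch.randn_like(x1)
    # One time value per sample, broadcastable over the remaining dimensions.
    t = torch.rand(x1.shape[0], *[1] * (x1.dim() - 1), device=x1.device)
    x_t = (1 - t) * x0 + t * x1        # linear interpolation between noise and data
    target = x1 - x0                   # constant velocity along the straight line
    pred = model(x_t, t.flatten())     # hypothetical signature: (x_t, t) -> velocity
    return torch.mean((pred - target) ** 2)
```

Because the target paths are straight lines, sampling can take far fewer integration steps than a typical diffusion sampler, which is a big part of the appeal.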