Dissecting FlashInfer - A Systems Perspective on High-Performance LLM Inference

The next frontier of large language model optimization isn’t architectural - it’s infrastructural. We’ve squeezed what we can from model design; now, inference efficiency is dictated by how we map computation to hardware. The challenge is executing these models with minimal memory movement, maximal kernel fusion, and predictable latency across heterogeneous batches. Every inefficiency (redundant projection, scattered memory access, unaligned kernels) compounds at scale. The gap between theoretical FLOPs and delivered throughput is now a systems problem. ...
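To get a feel for that gap, here is a back-of-the-envelope sketch of the arithmetic intensity of one decode-step of attention; all model, batch, and hardware numbers are illustrative assumptions, not figures from the post:

```python
# Back-of-the-envelope arithmetic intensity for one decode-step of attention,
# illustrating why delivered throughput is a memory problem, not a FLOP problem.
# All sizes below are illustrative assumptions.

n_heads, head_dim, seq_len = 32, 128, 4096   # assumed model/context shape
dtype_bytes = 2                              # fp16/bf16

# One new query attends over the cached keys/values.
flops = 2 * (2 * n_heads * head_dim * seq_len)                # QK^T and PV matmuls
bytes_moved = 2 * n_heads * head_dim * seq_len * dtype_bytes  # read K and V once

intensity = flops / bytes_moved  # FLOPs per byte of HBM traffic
print(f"arithmetic intensity ~ {intensity:.1f} FLOP/byte")
# ~1 FLOP/byte, ignoring softmax and the (tiny) query read. An H100 needs
# roughly 300 FLOP/byte to be compute-bound (~1000 TFLOP/s fp16 over
# ~3.35 TB/s HBM), so decode attention lives or dies on memory movement.
```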

October 9, 2025 · 21 min

A Geometric Analysis of Transformer Representations via Optimal Transport

Transformer models have become the backbone of modern AI, yet their remarkable performance still comes with a critical limitation: we lack a clear understanding of how information is processed inside them. Traditional evaluation focuses on outputs, but this leaves open the deeper question of what actually happens between layers as a model learns to reason. In our work, we approach this problem through a geometric lens, using Optimal Transport to measure how entire distributions of representations shift across layers. This perspective allows us to contrast trained and untrained models, revealing that training does not simply tune parameters, but organizes computation into a structured three-phase strategy: encoding, refinement, and decoding, underpinned by an information bottleneck. By making this internal structure visible, we aim to move closer to principled interpretability, where understanding a model means understanding the pathways of information it discovers through learning. ...
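For a flavor of the measurement, here is a minimal sketch that treats each layer's token representations as an empirical distribution and computes the optimal transport cost between consecutive layers, using the POT library; the shapes and random data are placeholders, not the paper's actual setup:

```python
# Sketch: OT cost between the representation distributions of two layers.
import numpy as np
import ot  # pip install POT

rng = np.random.default_rng(0)
n_tokens, d_model = 256, 64                       # assumed sizes
layer_l = rng.normal(size=(n_tokens, d_model))    # stand-in for layer l activations
layer_l1 = rng.normal(size=(n_tokens, d_model))   # stand-in for layer l+1

# Uniform weights over tokens; squared Euclidean ground cost.
a = np.full(n_tokens, 1.0 / n_tokens)
b = np.full(n_tokens, 1.0 / n_tokens)
M = ot.dist(layer_l, layer_l1)   # pairwise squared distances
cost = ot.emd2(a, b, M)          # exact OT cost (squared Wasserstein-2)
print(f"transport cost between layers: {cost:.3f}")
# Tracing this cost across all consecutive layer pairs is one way to look
# for the encoding / refinement / decoding phases described above.
```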

September 8, 2025 · 13 min

Go With The Flow

Flow-based generative models are starting to turn heads as a cool alternative to traditional diffusion methods for things like image and audio generation. What makes them stand out is how they learn smooth, efficient paths to transform one distribution into another: basically a neat and mathematically solid way to generate data. They’ve been getting a lot more buzz lately, especially after Black Forest Labs dropped their FLUX models and Stability AI released SD3.5. That success has brought fresh attention to the earlier ideas behind Rectified Flows, which first popped up at ICLR 2023. ...
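To get a taste of the core trick, here is a toy rectified-flow sketch: a network learns a velocity field whose straight-line paths carry noise to data. The 2-D Gaussian "data" and tiny MLP are illustrative stand-ins, nothing like the FLUX or SD3.5 setups:

```python
# Toy rectified flow: regress the constant velocity of straight paths
# from noise x0 to data x1, then integrate the learned field to sample.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(3, 64), nn.SiLU(), nn.Linear(64, 2))  # input: (x_t, t)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(1000):
    x1 = torch.randn(128, 2) * 0.5 + 2.0   # "data": a shifted Gaussian blob
    x0 = torch.randn(128, 2)               # noise
    t = torch.rand(128, 1)
    x_t = (1 - t) * x0 + t * x1            # linear interpolation path
    v_target = x1 - x0                     # constant velocity along that path
    v_pred = model(torch.cat([x_t, t], dim=1))
    loss = ((v_pred - v_target) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# Sampling: integrate dx/dt = v(x, t) from t=0 (noise) to t=1 (data).
x = torch.randn(64, 2)
for i in range(50):
    t = torch.full((64, 1), i / 50)
    x = x + model(torch.cat([x, t], dim=1)).detach() / 50
print(x.mean(dim=0))  # should drift toward the data mean (~2, ~2)
```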

April 27, 2025 · 24 min