PPO, explained to myself 🌱 seedling
the problem
Policy-gradient methods learn by nudging a policy in the direction that increases reward. The catch: take too big a step and you can wreck a policy that was working, because the data you collected is only valid near the policy that collected it. Earlier methods like TRPO solved this with a hard trust-region constraint — effective, but heavy to implement.
the key idea
PPO keeps the "don't move too far" spirit of TRPO but makes it cheap. Instead of a constrained optimization, it clips the objective so the update can't benefit from moving the policy too far from the old one. Simple to code, works with first-order optimizers, and you can run multiple epochs over the same batch.
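To convince myself that "multiple epochs over the same batch" is safe, here is a toy sketch, not a real PPO implementation. Everything in it is an assumption for illustration: a two-action policy with a single logit theta, a made-up batch of (action, advantage) pairs, a hand-derived gradient, and a step size picked only so the effect shows up quickly. The clip range of 0.2 is the usual choice.

```python
import math

EPS = 0.2          # clipping range (a commonly used value)
LR = 0.5           # step size, chosen only for this toy
THETA_OLD = 0.0    # old policy: both actions equally likely

# Made-up batch of (action, advantage) pairs, treated as collected by the old policy.
BATCH = [(1, 1.0), (0, -1.0), (1, 0.5), (0, -0.5)]

def prob(theta, action):
    """pi_theta(action) for a two-action policy with one logit."""
    p1 = 1.0 / (1.0 + math.exp(-theta))
    return p1 if action == 1 else 1.0 - p1

def clipped_grad(theta):
    """Gradient of the mean clipped objective with respect to theta."""
    p1 = prob(theta, 1)
    total = 0.0
    for action, adv in BATCH:
        ratio = prob(theta, action) / prob(THETA_OLD, action)
        # The clip is active only when the ratio has already moved
        # past the range in the direction the advantage favors.
        if (adv > 0 and ratio > 1 + EPS) or (adv < 0 and ratio < 1 - EPS):
            continue  # flat region: this sample contributes zero gradient
        dlogp = (1.0 - p1) if action == 1 else -p1  # d log pi / d theta
        total += adv * ratio * dlogp
    return total / len(BATCH)

def train(epochs):
    """Run `epochs` passes of plain gradient ascent over the same fixed batch."""
    theta = THETA_OLD
    for _ in range(epochs):
        theta += LR * clipped_grad(theta)
    return theta

theta_10 = train(10)
theta_20 = train(20)
print(prob(theta_10, 1) / prob(THETA_OLD, 1))  # ratio for action 1 after 10 epochs
print(theta_10 == theta_20)                    # do extra epochs change anything?
```

In this toy, the first few epochs move theta, then every sample lands in the flat region and the remaining epochs do nothing. The final ratio ends up somewhat above 1 + ε rather than exactly at it: the clip removes the incentive to keep going, it does not enforce a hard bound, so a single step can still overshoot.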
how it works (the clipped objective)
Let r(θ) = π_θ(a|s) / π_old(a|s) be the probability ratio between the new and old policy, and Â the advantage estimate. PPO maximizes:
L(θ) = E[ min( r(θ)·Â , clip(r(θ), 1−ε, 1+ε)·Â ) ]
- When the update is small, r ≈ 1 and it behaves like vanilla policy gradient.
- When r drifts outside [1−ε, 1+ε] in the direction the advantage favors (r > 1+ε with Â > 0, or r < 1−ε with Â < 0), the clip removes the incentive to push further: the gradient flattens.
ε is typically ~0.2. The min makes the bound pessimistic, so it never rewards over-large steps.
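A minimal sketch of the clipped objective in plain Python, mostly to check the four sign/ratio cases by hand. The function name, the use of log-probabilities as inputs, and the sample values are my own assumptions; a real implementation would work on tensors and return the negated value as a loss.

```python
import math

def clip(x, lo, hi):
    """Clamp x into the interval [lo, hi]."""
    return max(lo, min(x, hi))

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A) over a batch.

    This is the quantity to maximize; negate it to get a loss.
    """
    terms = []
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)  # r = pi_new(a|s) / pi_old(a|s)
        unclipped = ratio * adv
        clipped = clip(ratio, 1.0 - eps, 1.0 + eps) * adv
        terms.append(min(unclipped, clipped))  # pessimistic choice
    return sum(terms) / len(terms)

# Four made-up samples. Old log-probs are 0, so exp(logp_new) is the ratio.
ratios = [1.5, 0.5, 1.5, 0.5]
advs = [1.0, 1.0, -1.0, -1.0]
per_sample = [
    ppo_clip_objective([math.log(r)], [0.0], [a])
    for r, a in zip(ratios, advs)
]
print(per_sample)  # approximately [1.2, 0.5, -1.5, -0.8]
```

Reading the four cases: with Â > 0 and r = 1.5 the value is capped at 1.2, and with Â < 0 and r = 0.5 it is capped at −0.8, so pushing r further in those directions gains nothing. In the other two cases the min picks the unclipped term, so moving the ratio the wrong way is still penalized in full.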
why it matters
PPO hit a sweet spot of simplicity, stability, and sample efficiency (good for an on-policy method, since each batch is reused for several epochs) that made it the de facto default for a lot of RL, from control to, famously, the RLHF stage of training modern language models. When people say "we used RL to fine-tune," PPO is very often the algorithm underneath.
my take
What I find interesting is the trade-off: PPO swaps a mathematically clean constraint for a simpler heuristic that's easier to implement and tune in practice. Next I want to write up the policy-gradient theorem it builds on, and re-implement PPO from scratch to understand where it breaks.