PPO, explained to myself 🌱 seedling

Paper: Proximal Policy Optimization Algorithms · Schulman et al., 2017 · arXiv:1707.06347

the problem

Policy-gradient methods learn by nudging a policy in the direction that increases expected reward. The catch: take too big a step and you can wreck a policy that was working, because the data you collected is only valid near the policy that collected it. Earlier methods like TRPO addressed this with a hard trust-region constraint (a bound on the KL divergence between the old and new policy), which is effective but heavy to implement.

the key idea

PPO keeps the "don't move too far" spirit of TRPO but makes it cheap. Instead of a constrained optimization, it clips the objective so the update can't benefit from moving the policy too far from the old one. Simple to code, works with first-order optimizers, and you can run multiple epochs over the same batch.
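To convince myself the "multiple epochs over the same batch" part is safe, here is a minimal sketch on a toy problem of my own choosing (a 3-armed bandit with a softmax policy, not anything from the paper). The names `run_toy_ppo` and `arm_rewards`, the reward values, the learning rate, and the "reward minus batch mean" advantage are all assumptions for illustration. The ratio in the code is the new policy's probability of the sampled action divided by the old policy's.

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def run_toy_ppo(epochs=10, lr=0.5, eps=0.2, batch_size=256, seed=0):
    """Toy 3-armed bandit: one batch from the old policy, several clipped epochs."""
    rng = np.random.default_rng(seed)
    arm_rewards = np.array([0.0, 0.5, 1.0])  # hypothetical deterministic rewards

    theta = np.zeros(3)                      # policy logits; old policy is uniform
    old_probs = softmax(theta)

    # Collect ONE batch with the old policy, then keep it fixed.
    actions = rng.choice(3, size=batch_size, p=old_probs)
    rewards = arm_rewards[actions]
    adv = rewards - rewards.mean()           # crude advantage: reward minus batch mean

    one_hot = np.eye(3)[actions]
    for _ in range(epochs):                  # reuse the same batch every epoch
        probs = softmax(theta)
        ratio = probs[actions] / old_probs[actions]
        # A sample stops contributing once the clip has flattened its term.
        flat = ((adv > 0) & (ratio > 1 + eps)) | ((adv < 0) & (ratio < 1 - eps))
        weight = np.where(flat, 0.0, adv * ratio)
        # For a softmax policy, d(ratio * adv)/d(theta) = adv * ratio * (one_hot - probs).
        grad = (weight[:, None] * (one_hot - probs)).mean(axis=0)
        theta = theta + lr * grad            # plain first-order ascent step
    return old_probs, softmax(theta)

old_probs, new_probs = run_toy_ppo()
print("old policy:", old_probs.round(3))
print("new policy:", new_probs.round(3))
print("ratio new/old:", (new_probs / old_probs).round(3))
```

Passing a very large `eps` switches the clip off, which makes it easy to compare how far the same ten epochs move the policy with and without it.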

how it works (the clipped objective)

Let r(θ) = π_θ(a|s) / π_old(a|s) be the probability ratio between new and old policy, and Â the advantage estimate. PPO maximizes:

L(θ) = E[ min( r(θ)·Â ,  clip(r(θ), 1−ε, 1+ε)·Â ) ]

When the update is small, r ≈ 1 and it behaves like vanilla policy gradient.
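When r drifts outside [1−ε, 1+ε] in the direction that would raise the objective (r > 1+ε with Â > 0, or r < 1−ε with Â < 0), the clip removes the incentive to push further: the gradient for that sample goes to zero. If r has drifted the wrong way instead, the unclipped term is the smaller one, so the min keeps it and the gradient still pulls the policy back.
ε is typically ~0.2. The min makes the objective a pessimistic lower bound on the unclipped one, so it never rewards over-large steps.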
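A small numeric check of the cases above, as a sketch assuming NumPy. The function names `clipped_surrogate` and `grad_wrt_ratio` are mine, not the paper's, and the gradient is taken with respect to the ratio r only (not θ), which is enough to see where the objective goes flat.

```python
import numpy as np

def clipped_surrogate(ratio, advantage, eps=0.2):
    """Per-sample value of min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    ratio = np.asarray(ratio, dtype=float)
    advantage = np.asarray(advantage, dtype=float)
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.minimum(unclipped, clipped)

def grad_wrt_ratio(ratio, advantage, eps=0.2):
    """Derivative of the per-sample term w.r.t. r: A where active, 0 where flattened."""
    ratio = np.asarray(ratio, dtype=float)
    advantage = np.asarray(advantage, dtype=float)
    flat = ((advantage > 0) & (ratio > 1.0 + eps)) | ((advantage < 0) & (ratio < 1.0 - eps))
    return np.where(flat, 0.0, advantage)

ratios     = np.array([1.0, 1.5, 0.5,  1.5,  0.5])
advantages = np.array([1.0, 1.0, 1.0, -1.0, -1.0])

print(clipped_surrogate(ratios, advantages))  # values: 1.0, 1.2, 0.5, -1.5, -0.8
print(grad_wrt_ratio(ratios, advantages))     # values: 1.0, 0.0, 1.0, -1.0, 0.0
```

The second and fifth samples are the ones where the policy has already moved past the clip range in the "good" direction, so they contribute nothing. The fourth sample (probability of a bad action went up by 50%) keeps its full penalty and gradient.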

why it matters

PPO hit a sweet spot of simplicity, stability, and reasonable sample efficiency for an on-policy method, which made it the de facto default for a lot of RL, from control to, famously, the RLHF stage of training modern language models. When people say "we used RL to fine-tune," PPO is very often the algorithm underneath.

my take

What I find interesting is the trade-off: PPO swaps a mathematically clean constraint for a simpler heuristic that's easier to implement and tune in practice. Next I want to write up the policy-gradient theorem it builds on, and re-implement PPO from scratch to understand where it breaks.