Proximal Policy Optimization (PPO)

Proximal Policy Optimization (PPO) is a reinforcement learning (RL) algorithm that balances sample efficiency, theoretical stability, and ease of implementation. Introduced by researchers at OpenAI in 2017 [1], PPO belongs to the family of policy gradient methods and was designed to address the training instability often found in standard REINFORCE algorithms while bypassing the computational complexity of its predecessor, Trust Region Policy Optimization (TRPO). PPO achieves this by introducing a “clipped surrogate objective.” This objective function limits how much the policy network can update in a single optimization step, preventing large performance drops that can occur when a policy deviates too much from its previous state. By enabling these constrained improvements through a first-order optimization process, PPO provides a training framework that is widely used in applications ranging from robotic locomotion to the Reinforcement Learning from Human Feedback (RLHF) used to align Large Language Models.

REINFORCE

To understand PPO, it is helpful to first look at REINFORCE [2], a foundational policy gradient algorithm. REINFORCE optimizes a parameterized policy directly by performing gradient ascent on the expected cumulative reward. The core of this approach is the policy gradient theorem, which expresses the gradient of the expected return $J(\theta)$ as:

$$ \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t | s_t) G_t \right] $$

Here, $\theta$ represents the parameters of the policy network, and $J(\theta)$ is the expected cumulative reward. $\pi_\theta(a_t | s_t)$ is the probability of taking action $a_t$ in state $s_t$ under the policy defined by $\pi_\theta$. $G_t$ is the return, defined as the cumulative discounted reward from time step $t$ onwards. $\tau$ represents a trajectory of states and actions sampled by following the policy, and $T$ is the total length of the trajectory. While conceptually straightforward, REINFORCE suffers from high variance in its gradient estimates and lacks a mechanism to prevent undesirably large updates to the policy parameters, which frequently leads to unstable training.

Unlike value-based algorithms such as Q-learning, which indirectly determine a policy by learning the value of taking a specific action in a given state, REINFORCE directly parameterizes and optimizes the policy itself. This direct approach allows it to naturally handle continuous action spaces and stochastic policies. Furthermore, compared to more advanced policy gradient methods like Actor-Critic algorithms, standard REINFORCE relies purely on full Monte Carlo returns ($G_t$) rather than bootstrapping with a learned value function (a “critic”). While relying on full trajectory returns ensures the gradient estimates are unbiased, it is exactly this property that contributes to the high variance and sample inefficiency that later methods, including PPO, sought to resolve.

Trust Region Policy Optimization (TRPO)

Like REINFORCE, Trust Region Policy Optimization (TRPO) [3] is a policy gradient algorithm that directly optimizes a parameterized policy network. However, while standard REINFORCE simply follows the steepest gradient direction—often taking excessively large steps that cause catastrophic drops in performance—TRPO belongs to a class of “trust region” methods. It was developed to address this instability by ensuring that each policy update mathematically guarantees performance improvement without deviating too far from the previous policy.

TRPO achieves this by framing the update as a constrained optimization problem consisting of two parts: an objective function to maximize, and a strict constraint on the parameters $\theta$. It is formulated as:

$$ \max_{\theta} \mathbb{E}_{s \sim \rho_{\theta_{old}}, a \sim \pi_{\theta_{old}}} \left[ \frac{\pi_\theta(a|s)}{\pi_{\theta_{old}}(a|s)} A_{\theta_{old}}(s, a) \right] $$$$ \text{subject to } \mathbb{E}_{s \sim \rho_{\theta_{old}}} \left[ D_{KL} \left( \pi_{\theta_{old}}(\cdot | s) \parallel \pi_\theta(\cdot | s) \right) \right] \le \delta $$

The first equation is the “surrogate objective” that actively pushes the new policy parameters ($\theta$) to increase the probability of actions that yielded a positive advantage ($A_{\theta_{old}} > 0$) compared to the old parameters ($\theta_{old}$). The term $\frac{\pi_\theta(a|s)}{\pi_{\theta_{old}}(a|s)}$ is the probability ratio between the new and old policies, and the advantage function $A_{\theta_{old}}(s, a)$ measures how much better an action $a$ is compared to the average action in state $s$.

The second equation is the strict constraint on $\theta$. It dictates that while maximizing the objective, the new policy $\pi_\theta$ cannot diverge from the old policy $\pi_{\theta_{old}}$ by more than a predefined hyperparameter $\delta$, measured by the average Kullback-Leibler ($D_{KL}$) divergence across the state distribution $\rho_{\theta_{old}}$. This bound defines the safe “trust region.”

While this formulation solves the instability of REINFORCE by providing strong theoretical guarantees for monotonic improvement, enforcing the strict KL divergence constraint requires computing second-order derivatives (specifically, the Fisher Information Matrix). Standard first-order optimizers like Adam cannot be used to solve this. Adam only looks at the first derivative (the steepness of the slope for individual parameters) and ignores the complex cross-parameter interactions (the curvature of the policy space) required to rigorously satisfy the hard KL boundary. Calculating this second-order geometry makes TRPO computationally expensive and difficult to scale to architectures involving large neural networks. It is exactly this tension—the need for the computational simplicity of first-order optimizers (like REINFORCE) and the reliable stability of TRPO—that perfectly sets the stage for Proximal Policy Optimization (PPO).

The PPO Solution: The Clipped Surrogate Objective

To resolve the tension between the computational complexity of TRPO and the instability of REINFORCE, PPO introduces a simple yet effective innovation: the clipped surrogate objective. Instead of relying on a complex, second-order KL divergence constraint to enforce the trust region, PPO incorporates the constraint directly into the objective function itself using a programmatic clipping operation.

The core objective function that PPO seeks to maximize is formulated as:

$$ L^{CLIP}(\theta) = \mathbb{E}_{t} \left[ \min \left( r_t(\theta) A_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) A_t \right) \right] $$

In this equation, $r_t(\theta)$ denotes the probability ratio

$$\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$$

which tracks how much the new policy deviates from the old one. The objective function is defined as an expectation over time steps ($\mathbb{E}_{t}$), the ratio is evaluated empirically only on the specific states ($s_t$) and actions ($a_t$) that the agent actually experienced and sampled during its recent interaction with the environment using the old policy ($\pi_{\theta_{old}}$). This sampling-based approach is what keeps the algorithm computationally tractable even in environments with massive or continuous state-action spaces. $A_t$ represents the estimated advantage at time step $t$, which is often computed using Generalized Advantage Estimation (GAE) to further reduce the variance seen in standard REINFORCE. The hyperparameter $\epsilon$ defines the clipping range and is typically set to a small value like 0.1 or 0.2.

The elegance of this formulation lies in how the $\min$ operator and the $\text{clip}$ function interact to penalize excessively large policy updates. We can break its behavior down into two scenarios based on the advantage:

Positive Advantage ($A_t > 0$): The action performed better than average, so we want to increase its probability. However, if the new policy increases the probability so much that the ratio $r_t(\theta)$ exceeds $1+\epsilon$, the clipping function activates. The $\min$ operator then bounds the objective at $(1+\epsilon)A_t$, meaning the policy network receives no further reward for pushing the probability any higher.
Negative Advantage ($A_t < 0$): The action performed worse than average, so we want to decrease its probability. If the ratio $r_t(\theta)$ falls below $1-\epsilon$, the clipping function kicks in again. Because $A_t$ is negative, the $\min$ operator selects the clipped bound of $(1-\epsilon)A_t$, stopping the network from overly penalizing the action.

By mathematically embedding this pessimistic bound (the “trust region”) directly into the objective function, PPO completely eliminates the need to compute the second-order Fisher Information Matrix required by TRPO. This allows the algorithm to be optimized using standard, computationally efficient first-order optimizers like Adam or Stochastic Gradient Descent (SGD). Ultimately, PPO achieves the best of both worlds: it mirrors the computational speed and architectural simplicity of REINFORCE while matching—and frequently exceeding—the reliable, monotonic improvement guarantees of TRPO.

Data Efficiency: Multi-Epoch Minibatch Updates

A major practical advantage of PPO over its predecessors is its data efficiency, specifically how it handles data batches. In standard on-policy algorithms like REINFORCE, the agent collects trajectory data, performs exactly one gradient update, and then must discard the data. Because any update shifts the policy, performing multiple updates on the same data causes catastrophic performance collapse. TRPO solved this instability with its mathematical trust region, but computing the required second-order derivatives is too computationally expensive to run repeatedly on the same data. Because PPO enforces its trust region via a simple clipped objective function, it completely circumvents this issue. PPO allows developers to collect a batch of transitions (typically a fixed number of steps across parallel environments, without waiting for episode ends), shuffle that data into minibatches, and safely perform multiple optimization epochs using standard optimizers like Adam. This ability to safely reuse on-policy data is central to PPO’s high sample efficiency.

The Value Function (Critic) Loss

While the clipped surrogate objective ($L^{CLIP}$) dictates how the policy (the “actor”) is updated, PPO is almost exclusively implemented as an Actor-Critic architecture. This means the algorithm must also learn a state-value function (the “critic”) to accurately estimate expected returns and compute the advantage $A_t$.

To train this critic, PPO calculates a value function loss ($L^{VF}$), which is typically the Mean Squared Error (MSE) between the network’s predicted value for a state $V_\theta(s_t)$ and the actual target return $V_t^{target}$ (often derived from GAE).

In practice, the actor and critic frequently share the same base neural network layers—an architecture often described as having “one body with two heads.” In this setup, the raw state is processed through a shared trunk of layers, which then splits into a lightweight actor head (outputting action probabilities) and a lightweight critic head (outputting the value estimate). This shared architecture provides two major benefits:

Computational Efficiency: The heavy lifting of processing the state is only done once per forward pass, and fewer total parameters need to be updated.
Robust Feature Extraction: Forcing the base layers to minimize both the policy loss and the value loss acts as a form of multi-task learning. Predicting the value of a state requires the network to deeply understand the environment. This auxiliary task helps the shared trunk learn richer, more generalized representations of the environment than the policy loss could learn alone.

Because of this shared architecture, the value function loss is evaluated simultaneously with the policy loss during each minibatch update, ensuring both the actor and critic improve together.

Entropy Bonus for Exploration

A common challenge in reinforcement learning is premature convergence, where the policy quickly discovers a suboptimal action and repeatedly exploits it, failing to explore other potentially better actions. To combat this, standard PPO implementations incorporate an entropy bonus into the objective function.

Entropy ($S$) is a mathematical measure of randomness or uncertainty in a probability distribution. In the context of PPO, a higher entropy means the policy’s action distribution is more uniform (highly exploratory), while a lower entropy means the policy is highly confident and deterministic.

By adding an entropy bonus term to the objective, the algorithm actively rewards the policy for remaining unpredictable. This bonus prevents the network from becoming too “sure” of itself too early, ensuring adequate exploration of the environment before settling on an optimal strategy.

The Total PPO Objective

With the policy objective, the value function loss, and the entropy bonus defined, we can construct the final objective function that a standard Actor-Critic PPO implementation actually maximizes during a training step:

$$ L^{TOTAL}(\theta) = L^{CLIP}(\theta) - c_1 L^{VF}(\theta) + c_2 S[\pi_\theta](s_t) $$

Where:

$L^{CLIP}(\theta)$ is the clipped surrogate objective that safely improves the policy.
$L^{VF}(\theta)$ is the value function (critic) loss, which penalizes the network for inaccurate return predictions.
$S[\pi_\theta](s_t)$ is the entropy bonus, which rewards the network for maintaining an exploratory action distribution.
$c_1$ and $c_2$ are hyperparameters (coefficients) that balance the importance of accurate value estimation and exploration against the primary policy optimization.

PPO Architecture

The Adaptive KL Penalty Variant (PPO-Penalty)

While the clipped surrogate objective ($L^{CLIP}$) is the most famous and widely used version of PPO, the original paper actually proposed two different methods for enforcing the trust region. The second method is known as PPO-Penalty.

Instead of applying a hard mathematical clip to the probability ratio, PPO-Penalty uses a soft penalty term based on the Kullback-Leibler (KL) divergence between the new policy and the old policy. It adds this penalty directly to the objective function:

$$ L^{KLPEN}(\theta) = \mathbb{E}_t \left[ r_t(\theta)A_t - \beta D_{KL} \left( \pi_{\theta_{old}}(\cdot | s_t) \parallel \pi_\theta(\cdot | s_t) \right) \right] $$

To ensure the penalty remains effective throughout training, PPO-Penalty adaptively scales the penalty coefficient $\beta$. If the KL divergence exceeds a target threshold (meaning the policy is changing too much), $\beta$ is increased to punish future large updates. If the KL divergence falls too far below the target (meaning the policy is changing too little), $\beta$ is decreased.

Although PPO-Penalty conceptually bridges the gap between TRPO and PPO more directly, the clipped version (PPO-Clip) became the standard due to its simpler implementation and empirically superior performance across a wider range of tasks.

Python Implementation Example

To bridge the gap between theory and practice, it is helpful to see how this mathematical objective translates into code. Because neural networks typically output log-probabilities for numerical stability, the probability ratio is often computed using exponentiation. Below is a minimal example using PyTorch that illustrates how the full PPO objective—incorporating the clipped policy loss, value loss, and entropy bonus—is computed during a training step:

import torch
import torch.nn.functional as F

def compute_ppo_total_loss(new_log_probs, old_log_probs, advantages, 
                           state_values, return_targets, entropy, 
                           epsilon=0.2, c1=0.5, c2=0.01):
    """
    Computes the Total PPO Objective Loss for an Actor-Critic architecture.
    """
    # ==========================================
    # 1. Policy Loss (Clipped Surrogate Objective)
    # ==========================================
    ratio = torch.exp(new_log_probs - old_log_probs)
    surrogate_1 = ratio * advantages
    clipped_ratio = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon)
    surrogate_2 = clipped_ratio * advantages
    
    # PyTorch minimizes loss, so we negate the objective we want to maximize
    policy_loss = -torch.min(surrogate_1, surrogate_2).mean()
    
    # ==========================================
    # 2. Value Function (Critic) Loss
    # ==========================================
    # Mean Squared Error between predicted values and target returns
    value_loss = F.mse_loss(state_values, return_targets)
    
    # ==========================================
    # 3. Entropy Bonus
    # ==========================================
    # We want to maximize entropy, so we negate it for minimization
    entropy_bonus = -entropy.mean()
    
    # ==========================================
    # 4. Total Loss
    # ==========================================
    total_loss = policy_loss + (c1 * value_loss) + (c2 * entropy_bonus)
    
    return total_loss

This snippet directly mirrors the $L^{TOTAL}(\theta)$ equation. By keeping the logic grounded in standard tensor operations, PPO avoids complex constraints and allows developers to optimize both the actor and critic simultaneously using standard frameworks.

Code-Level Optimizations & “Tricks”

While the mathematical formulation of PPO provides a strong theoretical foundation, much of its success in practice comes from specific code-level optimizations that were bundled with the original OpenAI Baselines implementation. A famous paper (Implementation Matters in Deep Policy Gradients: A Case Study on PPO and TRPO [4]) demonstrated that these implementation details are sometimes just as important to the algorithm’s performance as the clipped objective itself.

Some of the most critical “tricks” include:

Orthogonal Weight Initialization: Initializing neural network layers with orthogonal matrices rather than standard Gaussian noise to preserve gradient magnitudes.
Observation and Reward Normalization: Scaling environment states and rewards dynamically during training to maintain stable optimization dynamics.
Learning Rate Annealing: Gradually decaying the optimizer’s learning rate to zero as training progresses.
Value Function Clipping: Applying a similar clipping mechanism to the value function loss ($L^{VF}$) to prevent the critic from changing too rapidly.

These seemingly minor details work together to drastically stabilize training, highlighting that modern Reinforcement Learning relies on a synergy between robust algorithms and meticulous software engineering.

Applications of PPO

PPO is widely used across different domains due to its balance of sample efficiency and stability. Two prominent applications are robotic locomotion and the alignment of language models.

Robotic Locomotion

In robotic locomotion, the goal is to train legged robots (such as quadrupeds or bipeds) to navigate environments, maintain balance, and recover from perturbations. The problem is typically modeled as follows:

State ($s_t$): Sensor readings that may include joint angles, velocities, IMU data (orientation), and foot contact states.
Action ($a_t$): Continuous target joint positions or torques applied to the robot’s motors.
Reward ($r_t$): A function that encourages forward progress and energy efficiency while penalizing instability, large joint velocities, or falling.

Because motor control requires continuous actions, the PPO policy network outputs the parameters of a probability distribution (e.g., the mean and standard deviation of a Gaussian) for each joint. During training, actions are sampled from these distributions.

Training is typically conducted in physics simulators to collect large amounts of data safely. Techniques like domain randomization—where parameters such as friction, mass, and sensor noise are varied—are used to prepare the policy for deployment on physical hardware (sim-to-real transfer). PPO’s stability prevents catastrophic forgetting during training, while its compatibility with first-order optimizers allows it to efficiently process the large batches of data generated by parallel simulations.

Reinforcement Learning from Human Feedback (RLHF)

PPO is also a standard optimization algorithm in Reinforcement Learning from Human Feedback (RLHF), a process used to align Large Language Models (LLMs) with human preferences.

In this context, text generation is mapped to the RL framework:

State ($s_t$): The input prompt combined with the sequence of tokens generated so far.
Action ($a_t$): The next token generated by the LLM from its vocabulary.
Reward ($r_t$): A scalar score provided by a separately trained Reward Model that evaluates the quality of the response.

During RLHF, the LLM acts as the policy network ($\pi_\theta$). A common challenge in this phase is “reward hacking,” where the model learns to output repetitive or unnatural text that exploits the Reward Model to achieve high scores.

PPO mitigates this through its clipped surrogate objective, which restricts large policy updates. Additionally, a KL divergence penalty is typically added to the reward function. This penalty penalizes the model for diverging too far from a reference model (usually the supervised fine-tuned model).

While TRPO enforces a hard constraint on KL divergence that requires computing second-order derivatives, RLHF typically implements the KL divergence as a soft penalty subtracted from the reward signal ($r_{total} = r_{model} - \beta \text{KL}$). Because this modifies the scalar reward rather than imposing an optimization constraint, the LLM can still be trained using standard first-order optimizers like Adam, making the process computationally tractable for models with billions of parameters.

References

[1] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347. URL: https://arxiv.org/abs/1707.06347

[2] Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4), 229-256. URL: https://link.springer.com/article/10.1007/BF00992696

[3] Schulman, J., Levine, S., Abbeel, P., Jordan, M., & Moritz, P. (2015). Trust region policy optimization. In International conference on machine learning (pp. 1889-1897). PMLR. URL: https://arxiv.org/abs/1502.05477

[4] Engstrom, L., Ilyas, A., Santurkar, S., Tsipras, D., Janoos, F., Rudolph, L., & Madry, A. (2020). Implementation Matters in Deep Policy Gradients: A Case Study on PPO and TRPO. In International Conference on Learning Representations. URL: https://arxiv.org/abs/2005.12729

Mochan Shrestha [mochan.org]