
Mochan Shrestha [mochan.org]

Tutorials, Education, Learning and Knowledge

Actor Critic Methods in Reinforcement Learning

2025/03/21

Actor-critic methods combine value-based learning and policy-based learning, using two separate functions: the critic, which learns the value function, and the actor, which learns the policy. They work together. The critic evaluates the actions taken by the actor and provides feedback to improve the policy.

Value-Based Learning

In value-based learning, the agent learns to estimate the value of being in a given state. That is, the expected cumulative reward the agent can obtain from that state onward. The most common value functions are:

- The state-value function $V(s)$: the expected return starting from state $s$ and following the policy thereafter.
- The action-value function $Q(s, a)$: the expected return after taking action $a$ in state $s$ and following the policy thereafter.

The agent uses these estimates to derive a policy, typically by acting greedily, choosing the action with the highest estimated Q-value. Well-known value-based algorithms include Q-learning and DQN. However, value-based methods struggle with continuous action spaces and do not directly learn a policy, making them less suitable for tasks that require fine-grained action control.
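As a concrete illustration, here is a minimal tabular Q-learning sketch on a made-up two-state MDP. The dynamics, hyperparameters, and episode length are purely illustrative:

```python
import numpy as np

# Minimal tabular Q-learning on a hypothetical two-state, two-action MDP.
n_states, n_actions = 2, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.5, 0.9, 0.1
rng = np.random.default_rng(0)

def step(s, a):
    # Made-up dynamics: action 1 in state 0 pays reward 1 and moves to
    # state 1; everything else pays 0 and returns to state 0.
    if s == 0 and a == 1:
        return 1, 1.0  # (next state, reward)
    return 0, 0.0

s = 0
for _ in range(500):
    # Epsilon-greedy action selection
    if rng.random() < epsilon:
        a = int(rng.integers(n_actions))
    else:
        a = int(np.argmax(Q[s]))
    s_next, r = step(s, a)
    # Q-learning update: bootstrap from the greedy value of the next state
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
    s = s_next
```

Acting greedily with respect to the learned table recovers the optimal behaviour: in state 0 the agent prefers action 1, the only action that yields reward.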

Policy-Based Learning

In policy-based learning, the agent directly learns a policy $π(a|s)$: a mapping from states to actions (or a probability distribution over actions) — without explicitly estimating value functions. The policy is typically parameterised by a neural network with weights θ, and training proceeds by optimizing these weights to maximize expected cumulative reward.

The core idea is to compute the policy gradient: the gradient of expected return with respect to θ. This gradient indicates how to adjust the policy parameters to make higher-reward actions more probable. The REINFORCE algorithm is the canonical policy gradient method, updating the policy using:

$$\nabla_\theta J(\theta) = \mathbb{E}_\pi \left[ \nabla_\theta \log \pi_\theta(a|s) \cdot G_t \right]$$

where $G_t$ is the return from timestep $t$. Policy-based methods handle continuous action spaces naturally and can learn stochastic policies, but they tend to suffer from high variance in gradient estimates and slow convergence.
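To make the REINFORCE update concrete, here is a minimal sketch on a two-armed bandit, where each episode is a single step so $G_t$ is just the immediate reward. The arm reward probabilities and learning rate are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)              # policy parameters: one logit per action
alpha = 0.1
p_reward = np.array([0.2, 0.8])  # hypothetical win probability per arm

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for _ in range(2000):
    pi = softmax(theta)
    a = int(rng.choice(2, p=pi))
    G = float(rng.random() < p_reward[a])  # return of the one-step episode
    # Gradient of log softmax: one-hot(a) - pi
    grad_log_pi = -pi
    grad_log_pi[a] += 1.0
    theta += alpha * grad_log_pi * G       # REINFORCE update
```

After training, the policy concentrates on the better arm. Note that even in this tiny example the gradient estimate is noisy, which is exactly the variance problem the critic is introduced to address.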

Combining Both: Actor-Critic

Actor-critic methods address the weaknesses of each approach by combining them. The actor is a policy network $π(a|s; θ)$ that selects actions, inheriting the ability to handle continuous actions and learn stochastic policies from policy-based learning. The critic is a value network $V(s; w)$ or $Q(s, a; w)$ that evaluates the actor’s choices — providing a low-variance learning signal in place of the noisy Monte Carlo returns used in REINFORCE.

Instead of using the full return $G_t$ to scale the policy gradient, the actor uses the critic’s estimate as feedback. A common formulation replaces $G_t$ with the advantage $A(s, a) = Q(s, a) - V(s)$, which measures how much better an action is compared to the average. This reduces variance while keeping the gradient estimate unbiased:

$$\nabla_\theta J(\theta) = \mathbb{E}_\pi \left[ \nabla_\theta \log \pi_\theta(a|s) \cdot A(s, a) \right]$$

The two networks are trained simultaneously: the critic minimizes the error in its value estimates (e.g. via TD learning), while the actor updates its policy using the advantage signal provided by the critic. This interplay allows actor-critic methods to be more stable and sample-efficient than pure policy gradient methods.

The Actor-Critic Algorithm

At each timestep, the agent observes a state, the actor selects an action, the environment returns a reward and next state, and both networks are updated. The critic is updated first using the TD error δ, which serves as the advantage estimate. Here $s'$ denotes the next state observed after taking action $a$ in state $s$:

$$\delta = r + \gamma V(s'; w) - V(s; w)$$

This measures how much better the outcome was compared to the critic’s prediction. A positive δ means the action led to a better-than-expected result; a negative δ means it was worse. The actor then uses this signal to reinforce or discourage the action taken.
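As a small numeric illustration with made-up values:

```python
# Hypothetical values for one transition
gamma = 0.9
r = 1.0          # reward received
V_s = 0.5        # critic's estimate of the current state
V_s_next = 0.6   # critic's estimate of the next state

delta = r + gamma * V_s_next - V_s  # about 1.04: better than expected,
                                    # so the action taken is reinforced
```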

We use the TD error $\delta$ to update both the critic and the actor. The learning rates α_θ and α_w control the step sizes for the two updates, respectively.

The critic parameters w are updated by minimizing the squared TD error:

$$w \leftarrow w + \alpha_w \cdot \delta \cdot \nabla_w V(s; w)$$

The actor parameters θ are updated along the policy gradient, scaled by δ:

$$\theta \leftarrow \theta + \alpha_\theta \cdot \delta \cdot \nabla_\theta \log \pi_\theta(a|s)$$

Pseudocode:

Initialise actor parameters θ and critic parameters w arbitrarily
Set discount factor γ, learning rates α_θ and α_w

For each episode:
    Observe initial state s

    For each timestep t:
        Sample action a ~ π(a|s; θ)
        Take action a, observe reward r and next state s'

        # Compute TD error (advantage estimate)
        δ = r + γ * V(s'; w) - V(s; w)

        # Update critic
        w ← w + α_w * δ * ∇_w V(s; w)

        # Update actor
        θ ← θ + α_θ * δ * ∇_θ log π_θ(a|s)

        s ← s'

        If s' is terminal: break
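The pseudocode above can be sketched as runnable Python, using a tabular actor and critic on a small hypothetical chain MDP (states 0 to 3, with state 3 terminal and a reward of 1 for reaching it; all hyperparameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 2
theta = np.zeros((n_states, n_actions))  # actor: one logit per (state, action)
w = np.zeros(n_states)                   # critic: V(s) as a table
gamma, alpha_theta, alpha_w = 0.95, 0.1, 0.2

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def step(s, a):
    # Made-up chain dynamics: action 1 moves right, action 0 moves left.
    s_next = min(s + 1, 3) if a == 1 else max(s - 1, 0)
    r = 1.0 if s_next == 3 else 0.0
    return s_next, r, s_next == 3

for _ in range(1000):
    s, done = 0, False
    while not done:
        pi = softmax(theta[s])
        a = int(rng.choice(n_actions, p=pi))
        s_next, r, done = step(s, a)
        # TD error (advantage estimate); terminal states have value 0
        v_next = 0.0 if done else w[s_next]
        delta = r + gamma * v_next - w[s]
        # Critic update: w <- w + alpha_w * delta * grad_w V(s; w)
        # (the gradient is one-hot for a tabular critic)
        w[s] += alpha_w * delta
        # Actor update: theta <- theta + alpha_theta * delta * grad log pi
        grad_log = -pi
        grad_log[a] += 1.0
        theta[s] += alpha_theta * delta * grad_log
        s = s_next
```

After training, the actor prefers moving right in every non-terminal state, and the critic's values increase toward the goal, reflecting the discounted reward.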

Variations of Actor-Critic

The vanilla actor-critic described above, while foundational, has practical limitations: training can be unstable, updates are correlated when experience comes from a single environment, and the critic can overestimate values. A number of algorithms have been developed to address these issues:

| Algorithm | Parallel Workers | Replay Buffer | Policy Constraint / Clipping | Twin Critics | Entropy Regularisation | Deterministic Policy | Key Problem Addressed |
|---|---|---|---|---|---|---|---|
| Vanilla AC | No | No | No | No | No | No | Baseline |
| A3C | Yes (async) | No | No | No | No | No | Correlated updates, slow training |
| A2C | Yes (sync) | No | No | No | No | No | Correlated updates, simpler than A3C |
| TRPO | No | No | Hard KL constraint | No | No | No | Unstable, destructively large updates |
| PPO | No | No | Clipped objective | No | No | No | Unstable updates (simpler than TRPO) |
| DDPG | No | Yes | No | No | No | Yes | Continuous action spaces |
| TD3 | No | Yes | No | Yes | No | Yes | Critic overestimation bias |
| SAC | No | Yes | No | Yes | Yes | No | Poor exploration, sample inefficiency |
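As an example of one of these refinements, PPO's clipped surrogate objective can be sketched per sample (this is only the objective, not a full training loop; ε is the clipping parameter):

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, epsilon=0.2):
    # ratio = pi_new(a|s) / pi_old(a|s) for the sampled action.
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1 - epsilon, 1 + epsilon) * advantage
    # Taking the minimum removes the incentive to push the ratio far from 1,
    # which is what keeps updates from being destructively large.
    return np.minimum(unclipped, clipped)

# A ratio of 1.5 with positive advantage is clipped down to 1 + epsilon
out = ppo_clip_objective(np.array([1.5]), np.array([1.0]))
```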

Appendix: Symbol Reference

| Symbol | Meaning |
|---|---|
| $s$ | Current state |
| $s'$ | Next state (observed after taking action $a$ in state $s$) |
| $a$ | Action taken by the actor |
| $r$ | Reward received from the environment |
| $t$ | Current timestep |
| $\gamma$ | Discount factor (controls how much future rewards are valued) |
| $\pi(a \| s)$ | Policy: probability of taking action $a$ in state $s$ |
| $\pi_\theta(a \| s)$ | Policy parameterised by $\theta$ |
| $\theta$ | Actor network parameters |
| $V(s)$ | State-value function: expected return from state $s$ |
| $V(s; w)$ | State-value function parameterised by $w$ |
| $Q(s, a)$ | Action-value function: expected return after taking action $a$ in state $s$ |
| $Q(s, a; w)$ | Action-value function parameterised by $w$ |
| $w$ | Critic network parameters |
| $A(s, a)$ | Advantage function: $Q(s, a) - V(s)$ |
| $\delta$ | TD error: $r + \gamma V(s'; w) - V(s; w)$ |
| $G_t$ | Return from timestep $t$: cumulative discounted reward |
| $J(\theta)$ | Expected return under policy $\pi_\theta$ |
| $\alpha_\theta$ | Learning rate for the actor |
| $\alpha_w$ | Learning rate for the critic |