Reinforcement Learning -- the distributional RL algorithm C51

Foreword

This article first introduces some basic concepts of distributional (value-distribution) reinforcement learning, and then discusses the pioneering work in this area: the C51 algorithm.

Introduction to distributional reinforcement learning

First of all, it should be stated that distributional reinforcement learning is a kind of value-based reinforcement learning.

Classical value-based reinforcement learning methods model the cumulative return by its expected value, expressed as the value function V(s) or the action-value function Q(s,a).

In this modeling process, most of the information about the full return distribution is lost.

Distributional reinforcement learning aims to fix this loss of information by modeling the whole distribution Z(s,a) of the cumulative-return random variable, rather than just its expectation.

Expressed as a formula:

$$Q(s_t,a_t)=\mathbb{E}[Z(s_t,a_t)]=\mathbb{E}\left[\sum^\infty_{i=0}\gamma^{i}R(s_{t+i},a_{t+i})\right]$$
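To make the relation concrete, here is a small NumPy sketch (the atoms and probabilities are illustrative values, not from the paper) showing that the expectation of a categorical value distribution recovers the scalar Q value:

```python
import numpy as np

# Hypothetical 5-atom support and probabilities (illustrative values only)
z = np.linspace(0.0, 4.0, 5)              # atoms z_i: [0, 1, 2, 3, 4]
p = np.array([0.1, 0.2, 0.4, 0.2, 0.1])   # p_i = P(Z(s, a) = z_i)

# Q(s, a) = E[Z(s, a)] = sum_i z_i * p_i
q = float(np.dot(z, p))
print(q)  # 2.0
```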

Distributional RL -- the C51 algorithm

The C51 algorithm comes from DeepMind's paper A Distributional Perspective on Reinforcement Learning.

In the paper, the authors first point out that the Q learned by the traditional DQN algorithm is a single number: the expectation of the discounted sum of future rewards.

In distributional reinforcement learning algorithms, the learning target changes from a number to a distribution: the value Q is replaced by the random variable Z. This change lets the agent learn more than a single value, namely the whole distribution.

Correspondingly, the model's loss becomes a measure of similarity between two distributions.

Key points of the algorithm

Parametric distribution

In short, suppose the distribution's value range runs from $V_{min}$ to $V_{max}$ and is divided into N discrete points; the support atoms are then

$$\{z_i=V_{min}+i\Delta z : 0\leq i<N,\ \Delta z=\frac{V_{max}-V_{min}}{N-1}\}$$

Each value output by the model corresponds to the probability assigned to one of these atoms (support points).
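The support construction above can be sketched in NumPy as follows, with assumed toy values for $V_{min}$, $V_{max}$ and N (not the paper's 51-atom setting):

```python
import numpy as np

v_min, v_max, n_atoms = 0.0, 10.0, 6   # assumed toy values
delta_z = (v_max - v_min) / (n_atoms - 1)

# z_i = V_min + i * delta_z, for 0 <= i < N
support = np.array([v_min + i * delta_z for i in range(n_atoms)])
print(support)  # [ 0.  2.  4.  6.  8. 10.]
```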

Projected Bellman update

After applying the distributional Bellman operator described above, the values of the updated random variable generally no longer coincide with the discretized support atoms, and may even fall outside their range, as shown in the figure below.

Therefore, the random variable updated by the Bellman operator must be projected back onto the discrete support; this is the projected Bellman update described in the paper.

Simply put, the projection assigns each updated value's probability mass to its two adjacent support points, in proportion to its distance from each.
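The projection step can be sketched in NumPy (all numbers are toy values chosen so that no atom is clipped; this is not the author's code). Each shifted atom's mass is split between its lower and upper neighbour:

```python
import numpy as np

v_min, v_max, n_atoms = 0.0, 10.0, 6      # assumed toy support
delta_z = (v_max - v_min) / (n_atoms - 1)
z = v_min + delta_z * np.arange(n_atoms)  # atoms: 0, 2, 4, 6, 8, 10

p = np.array([0.1, 0.1, 0.3, 0.3, 0.1, 0.1])  # toy distribution on z
reward, gamma = 0.5, 0.9                      # chosen so nothing is clipped

# Bellman update: shift every atom to r + gamma * z_i, clipped to the support
t_z = np.clip(reward + gamma * z, v_min, v_max)

# Projection: split each shifted atom's mass between its two neighbouring
# support points, proportionally to the distance from each neighbour
b = (t_z - v_min) / delta_z               # fractional index of shifted atoms
l, u = np.floor(b).astype(int), np.ceil(b).astype(int)
proj = np.zeros(n_atoms)
np.add.at(proj, l, p * (u - b))
np.add.at(proj, u, p * (b - l))

# Mass is preserved, and (with no clipping) so is the mean:
# E[projected Z] = r + gamma * E[Z]
assert abs(proj.sum() - 1.0) < 1e-9
assert abs(np.dot(z, proj) - (reward + gamma * np.dot(z, p))) < 1e-9
```

A full implementation must also handle the case where a shifted atom lands exactly on a support point (l == u), which this toy sketch sidesteps by construction.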


What C51 has in common with DQN

  1. The framework of the C51 algorithm is still that of DQN.
  2. Sampling still uses an $\epsilon$-greedy strategy, where the greedy action is the one whose distribution has the largest expectation.
  3. A separate target network is still used.
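The $\epsilon$-greedy rule from point 2, where "greedy" means maximizing the expectation of each action's distribution, can be sketched as follows (the atoms, probabilities, and seed are illustrative values, not from the post):

```python
import numpy as np

rng = np.random.default_rng(0)      # fixed seed for reproducibility
epsilon, n_actions = 0.1, 2         # illustrative values

# Per-action categorical distributions over shared atoms (toy numbers)
z = np.array([0.0, 5.0, 10.0])                # atoms
dist = np.array([[0.2, 0.6, 0.2],             # action 0
                 [0.1, 0.2, 0.7]])            # action 1

q = dist @ z                        # expected return per action: [5.0, 8.0]
if rng.random() < epsilon:
    action = int(rng.integers(n_actions))     # explore
else:
    action = int(np.argmax(q))                # greedy on the expectation
```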

Differences between C51 and DQN

  1. The output of C51's convolutional neural network is no longer the action-value function, but the probabilities at the support atoms.

  2. The loss function of C51 is no longer a mean-squared error, but the KL divergence (in practice, a cross-entropy) between the two distributions described above.
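A small NumPy sketch of this loss (toy distributions, not the author's loss code): the cross-entropy between the projected target m and the prediction p differs from KL(m || p) only by the entropy of m, which does not depend on the network, so minimizing either is equivalent.

```python
import numpy as np

m = np.array([0.1, 0.6, 0.3])   # projected target distribution (toy values)
p = np.array([0.2, 0.5, 0.3])   # distribution predicted by the network (toy)
p = np.clip(p, 1e-3, 1.0)       # avoid log(0), as the network code also does

cross_entropy = -(m * np.log(p)).sum()
kl = (m * np.log(m / p)).sum()
entropy_m = -(m * np.log(m)).sum()

# cross_entropy == KL(m || p) + H(m)
assert abs(cross_entropy - (kl + entropy_m)) < 1e-9
```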

Finally, why is the algorithm called C51?

This is because, in the paper, the authors discretize the value distribution into a categorical distribution with 51 atoms.

Code part

Import dependencies

from typing import Dict, List, Tuple

import gym
from visualdl import LogWriter

from tqdm import tqdm, trange
import numpy as np
import paddle
import paddle.nn as nn
import paddle.nn.functional as F
import paddle.optimizer as optimizer

A helper operator we need

def index_add_(parent, axis, idx, child):
    # Emulates an in-place index-add (scatter-add): accumulate the values in
    # `child` into `parent` at the (possibly repeated) positions in `idx`.
    expand_dim = parent.shape[0]
    idx_one_hot = F.one_hot(idx.cast("int64"), expand_dim)
    child = paddle.expand_as(child.cast("float32").unsqueeze(-1), idx_one_hot)
    output = parent + (idx_one_hot.cast("float32").multiply(child)).sum(axis).squeeze()
    return output
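To check the semantics of this helper: it behaves like NumPy's `np.add.at`, which accumulates duplicate indices instead of overwriting them. A small NumPy demonstration with toy values:

```python
import numpy as np

parent = np.zeros(5)
idx = np.array([0, 2, 2, 4])            # note index 2 appears twice
child = np.array([1.0, 2.0, 3.0, 4.0])

# np.add.at accumulates duplicates: parent[2] receives 2.0 + 3.0
np.add.at(parent, idx, child)
print(parent)  # [1. 0. 5. 0. 4.]
```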

Experience replay

class ReplayBuffer:
    def __init__(self, obs_dim: int, size: int, batch_size: int = 32):
        self.obs_buf = np.zeros([size, obs_dim], dtype=np.float32)
        self.next_obs_buf = np.zeros([size, obs_dim], dtype=np.float32)
        self.acts_buf = np.zeros([size], dtype=np.float32)
        self.rews_buf = np.zeros([size], dtype=np.float32)
        self.done_buf = np.zeros([size], dtype=np.float32)
        self.max_size, self.batch_size = size, batch_size
        self.ptr, self.size, = 0, 0

    def store(
        self,
        obs: np.ndarray, 
        act: np.ndarray, 
        rew: float, 
        next_obs: np.ndarray, 
        done: bool,
    ):
        self.obs_buf[self.ptr] = obs
        self.next_obs_buf[self.ptr] = next_obs
        self.acts_buf[self.ptr] = act
        self.rews_buf[self.ptr] = rew
        self.done_buf[self.ptr] = done
        self.ptr = (self.ptr + 1) % self.max_size
        self.size = min(self.size + 1, self.max_size)

    def sample_batch(self) -> Dict[str, np.ndarray]:
        idxs = np.random.choice(self.size, size=self.batch_size, replace=False)
        return dict(obs=self.obs_buf[idxs],
                    next_obs=self.next_obs_buf[idxs],
                    acts=self.acts_buf[idxs],
                    rews=self.rews_buf[idxs],
                    done=self.done_buf[idxs])

    def __len__(self):
        return self.size

Define network structure

Although our network learns the full return distribution, in the CartPole-v0 environment we still use the expectation as the basis for decision-making.

Compared with a traditional DQN, our C51 network is updated differently.

class C51DQN(nn.Layer):
    def __init__(
        self,
        in_dim: int, 
        out_dim: int, 
        atom_size: int, 
        support: paddle.Tensor,
    ):
        # initialization
        super(C51DQN, self).__init__()
        self.support = support
        self.out_dim = out_dim
        self.atom_size = atom_size
        self.layers = nn.Sequential(
            nn.Linear(in_dim, 128), 
            nn.ReLU(),
            nn.Linear(128, 128), 
            nn.ReLU(),
            nn.Linear(128, out_dim * atom_size),
        )

    def forward(self, x):
        dist = self.dist(x)
        # Q(s, a) = sum_i z_i * p_i(s, a)
        q = paddle.sum(dist * self.support, axis=2)
        return q

    def dist(self, x):
        q_atoms = self.layers(x).reshape([-1, self.out_dim, self.atom_size])
        dist = F.softmax(q_atoms, axis=-1)
        dist = dist.clip(min=float(1e-3))  # Avoid nan
        return dist

Agent parameters

env: gym environment
memory_size: capacity of the experience replay buffer
batch_size: training batch size
epsilon: random-exploration parameter epsilon
epsilon_decay: decay step of the exploration parameter epsilon
max_epsilon: upper bound of epsilon
min_epsilon: lower bound of epsilon
target_update: target-network update frequency
gamma: discount factor
dqn: training model
dqn_target: target model
optimizer: optimizer
transition: transition tuple, containing the state, action, reward, next state next_state, and episode-termination flag done
v_min: lower bound of the discrete support
v_max: upper bound of the discrete support
atom_size: number of atoms
support: support vector (the atoms used to compute the distribution)
class C51Agent:
    def __init__(
        self,
        env: gym.Env,
        memory_size: int,
        batch_size: int,
        target_update: int,
        epsilon_decay: float,
        max_epsilon: float = 1.0,
        min_epsilon: float = 0.1,
        gamma: float = 0.99,
        # Parameters of the C51 algorithm
        v_min: float = 0.0,
        v_max: float = 200.0,
        atom_size: int = 51,
        log_dir: str = "./log",
    ):
        obs_dim = env.observation_space.shape[0]
        action_dim = env.action_space.n
        self.env = env
        self.memory = ReplayBuffer(obs_dim, memory_size, batch_size)
        self.batch_size = batch_size
        self.epsilon = max_epsilon
        self.epsilon_decay = epsilon_decay
        self.max_epsilon = max_epsilon
        self.min_epsilon = min_epsilon
        self.target_update = target_update
        self.gamma = gamma
        self.v_min = v_min
        self.v_max = v_max
        self.atom_size = atom_size
        self.support = paddle.linspace(
            self.v_min, self.v_max, self.atom_size
        )

        # Define networks
        self.dqn = C51DQN(
            obs_dim, action_dim, atom_size, self.support
        )
        self.dqn_target = C51DQN(
            obs_dim, action_dim, atom_size, self.support
        )
        self.dqn_target.set_state_dict(self.dqn.state_dict())

        self.optimizer = optimizer.Adam(parameters=self.dqn.parameters())

        self.transition = []
        self.is_test = False
        self.log_dir = log_dir

        self.log_writer = LogWriter(logdir=self.log_dir, comment="Categorical DQN")

    def select_action(self, state: np.ndarray):
        # epsilon-greedy: explore with probability epsilon, otherwise be greedy
        if self.epsilon > np.random.random():
            selected_action = self.env.action_space.sample()
        else:
            selected_action = self.dqn(
                paddle.to_tensor(state, dtype="float32")
            ).argmax()
            selected_action = selected_action.detach().numpy()
        if not self.is_test:
            self.transition = [state, selected_action]
        return selected_action

    def step(self, action: np.ndarray):
        next_state, reward, done, _ = self.env.step(int(action))

        if not self.is_test:
            self.transition += [reward, next_state, done]
        return next_state, reward, done

    def update_model(self):
        samples = self.memory.sample_batch()

        loss = self._compute_dqn_loss(samples)

        # Gradient step
        self.optimizer.clear_grad()
        loss.backward()
        self.optimizer.step()
        return loss.numpy().item()
    def train(self, num_frames: int, plotting_interval: int = 200):
        self.is_test = False
        state = self.env.reset()
        update_cnt = 0
        episode_cnt = 0
        score = 0

        for frame_idx in trange(1, num_frames + 1):
            action = self.select_action(state)
            next_state, reward, done = self.step(action)

            state = next_state
            score += reward

            # End of episode
            if done:
                episode_cnt += 1
                state = self.env.reset()
                self.log_writer.add_scalar("Reward", value=score, step=episode_cnt)
                score = 0

            if len(self.memory) >= self.batch_size:
                loss = self.update_model()
                self.log_writer.add_scalar("Loss", value=loss, step=frame_idx)
                update_cnt += 1
                # Linearly decay epsilon towards its lower bound
                self.epsilon = max(
                    self.min_epsilon, self.epsilon - (
                        self.max_epsilon - self.min_epsilon
                    ) * self.epsilon_decay
                )
                # Periodically synchronize the target network
                if update_cnt % self.target_update == 0:
                    self._target_hard_update()

    def test(self):
        self.is_test = True
        state = self.env.reset()
        done = False
        score = 0
        frames = []
        while not done:
            action = self.select_action(state)
            next_state, reward, done = self.step(int(action))

            state = next_state
            score += reward
        print("score: ", score)
        return frames

    def _compute_dqn_loss(self, samples: Dict[str, np.ndarray]):
        # Calculate loss
        state = paddle.to_tensor(samples["obs"],dtype="float32")
        next_state = paddle.to_tensor(samples["next_obs"],dtype="float32")
        action = paddle.to_tensor(samples["acts"],dtype="int64")
        reward = paddle.to_tensor(samples["rews"].reshape([-1, 1]),dtype="float32")
        done = paddle.to_tensor(samples["done"].reshape([-1, 1]),dtype="float32")

        delta_z = float(self.v_max - self.v_min) / (self.atom_size - 1)

        with paddle.no_grad():
            # Greedy next action according to the target network
            next_action = self.dqn_target(next_state).argmax(1)
            next_dist = self.dqn_target.dist(next_state)
            next_dist = next_dist[:self.batch_size,]
            _next_dist = paddle.gather(next_dist, next_action, axis=1)           
            eyes = np.eye(_next_dist.shape[0], _next_dist.shape[1]).astype("float32")
            eyes = np.repeat(eyes, _next_dist.shape[-1]).reshape(-1,_next_dist.shape[1],_next_dist.shape[-1])
            eyes = paddle.to_tensor(eyes)

            # Keep each sample's own row: the distribution of its chosen action
            next_dist = _next_dist.multiply(eyes).sum(1)

            # Distributional Bellman update: shift the support atoms
            t_z = reward + (1 - done) * self.gamma * self.support
            t_z = t_z.clip(min=self.v_min, max=self.v_max)
            b = (t_z - self.v_min) / delta_z
            l = b.floor().cast("int64")
            u = b.ceil().cast("int64")

            # Flattened row offsets so every sample indexes its own atoms
            offset = (
                paddle.linspace(
                    0, (self.batch_size - 1) * self.atom_size, self.batch_size
                )
                .cast("int64")
                .unsqueeze(1)
                .expand([self.batch_size, self.atom_size])
            )

            # Projection: distribute each shifted atom's mass to its neighbours
            proj_dist = paddle.zeros(next_dist.shape).reshape([-1])
            proj_dist = index_add_(
                proj_dist, 0,
                (l + offset).reshape([-1]), 
                (next_dist * (u.cast("float32") - b)).reshape([-1])
            )
            proj_dist = index_add_(
                proj_dist, 0,
                (u + offset).reshape([-1]), 
                (next_dist * (b - l.cast("float32"))).reshape([-1])
            )
            proj_dist = proj_dist.reshape(next_dist.shape)

        dist = self.dqn.dist(state)
        _dist = paddle.gather(dist[:self.batch_size,], action, axis=1)
        eyes = np.eye(_dist.shape[0], _dist.shape[1]).astype("float32")
        eyes = np.repeat(eyes, _dist.shape[-1]).reshape(-1,_dist.shape[1],_dist.shape[-1])
        eyes = paddle.to_tensor(eyes)
        dist_batch = _dist.multiply(eyes).sum(1)
        log_p = paddle.log(dist_batch)

        loss = -(proj_dist * log_p).sum(1).mean()

        return loss

    def _target_hard_update(self):
        # Update target model parameters
        self.dqn_target.set_state_dict(self.dqn.state_dict())

Define environment

env_id = "CartPole-v0"
env = gym.make(env_id)

Set random seed

seed = 777  # the original value was not preserved; any fixed seed works
np.random.seed(seed)
paddle.seed(seed)
env.seed(seed)

Super parameter setting and training

num_frames = 20000
memory_size = 1000
batch_size = 32
target_update = 200
epsilon_decay = 1 / 2000

agent = C51Agent(env, memory_size, batch_size, target_update, epsilon_decay)
agent.train(num_frames)

Visualization of training results

Posted by little_webspinner on Thu, 14 Apr 2022 02:10:07 +0930