# Foreword

This article first introduces the basic concepts of distributional reinforcement learning, and then walks through the pioneering work in this direction: the C51 algorithm.

# Introduction to distributional reinforcement learning

First of all, it should be noted that distributional reinforcement learning is a family of value-based reinforcement learning algorithms.

Classical value-based reinforcement learning methods model the cumulative return by its expected value, expressed as the state value function V(s) or the action value function Q(s,a).

In this modeling process, most of the information about the full return distribution is lost.

Distributional reinforcement learning addresses this loss of information by modeling the whole distribution Z(s,a) of the random return, rather than just its expectation.

Expressed as a formula:

$$Q(s_t,a_t)=\mathbb{E}\left[Z(s_t,a_t)\right]=\mathbb{E}\left[\sum_{i=0}^{\infty}\gamma^{i}\,R(s_{t+i},a_{t+i})\right]$$

# Distributional RL: the C51 algorithm

The C51 algorithm comes from the DeepMind paper *A Distributional Perspective on Reinforcement Learning*.

In that paper, the authors first point out that the Q learned by the traditional DQN algorithm is a single scalar: the expectation of the sum of future rewards.

In distributional reinforcement learning algorithms, the learning target changes from the scalar value Q to the random variable Z. This change lets the agent learn more than a single number; it learns the whole return distribution.

Accordingly, the model's loss becomes a measure of the similarity between two distributions.
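For two discrete distributions, a minimal NumPy sketch of such a similarity measure (the probability values here are toy numbers, not from the paper):

```python
import numpy as np

p = np.array([0.1, 0.6, 0.3])   # target distribution (toy values)
q = np.array([0.2, 0.5, 0.3])   # predicted distribution (toy values)

kl = np.sum(p * np.log(p / q))          # KL divergence D_KL(p || q)
cross_entropy = -np.sum(p * np.log(q))  # the cross-entropy form C51 minimizes
print(kl, cross_entropy)
```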

## Key algorithmic ideas

### Parameterized distribution

In short, the value range of the distribution, from $V_{min}$ to $V_{max}$, is divided into $N$ discrete, equally spaced support points (atoms):

$$\{z_i=V_{min}+i\Delta z : 0\leq i<N\},\quad \Delta z=\frac{V_{max}-V_{min}}{N-1}$$

Each value output by the model is the probability that the return takes the value of the corresponding atom.
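To make this concrete, here is a minimal NumPy sketch of building the support and recovering Q as the expectation of the distribution (the values 0, 200 and 51 match the hyperparameters used later in this post; the probability vector is made up):

```python
import numpy as np

# Build the discrete support: N atoms evenly spaced over [V_min, V_max]
v_min, v_max, n_atoms = 0.0, 200.0, 51
delta_z = (v_max - v_min) / (n_atoms - 1)
support = v_min + np.arange(n_atoms) * delta_z  # z_i = V_min + i * delta_z

# A made-up probability vector over the atoms, standing in for the network output
probs = np.random.dirichlet(np.ones(n_atoms))

# Q(s, a) is recovered as the expectation of the distribution
q_value = (support * probs).sum()
print(delta_z, q_value)  # delta_z = 4.0 here
```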

### Projected Bellman update

After applying the distributional Bellman operator, the values of the updated random variable may drift off the atoms of the discretized support, and may even fall outside the $[V_{min}, V_{max}]$ range.

Therefore, the random variable produced by the Bellman update must be projected back onto the discrete support; this is the projected Bellman update described in the paper.

Simply put, the projection assigns each updated atom's probability mass to its two adjacent atoms, in proportion to its distance from each.
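A minimal NumPy sketch of this projection for a single transition (toy reward and distribution; the batched Paddle version appears in `_compute_dqn_loss` below):

```python
import numpy as np

v_min, v_max, n_atoms, gamma = 0.0, 200.0, 51, 0.99
delta_z = (v_max - v_min) / (n_atoms - 1)
support = v_min + np.arange(n_atoms) * delta_z

probs = np.full(n_atoms, 1.0 / n_atoms)  # toy distribution of Z(s', a*)
reward, done = 1.0, False

# Bellman update of every atom, clipped back into [v_min, v_max]
t_z = np.clip(reward + (1.0 - done) * gamma * support, v_min, v_max)
b = (t_z - v_min) / delta_z                    # fractional index of each new atom
l, u = np.floor(b).astype(int), np.ceil(b).astype(int)

# Split each atom's mass between its two neighbours, proportional to distance
proj = np.zeros(n_atoms)
np.add.at(proj, l, probs * (u - b))
np.add.at(proj, u, probs * (b - l))
print(proj.sum())  # ~1.0 (mass at exactly-integer b needs special care)
```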

## Summary

### What C51 and DQN have in common

1. The overall framework of C51 is still that of the DQN algorithm.
2. Sampling still uses an $\epsilon$-greedy strategy, where the greedy choice maximizes the expectation of the distribution.
3. A separate target network is still used.

### Differences between C51 and DQN

1. The network of C51 no longer outputs the action value function directly, but the probabilities at the atoms.

2. The loss function of C51 is no longer a mean squared error, but the distribution-similarity measure (cross-entropy/KL divergence) described above.

Finally, why is the algorithm called C51?

Because in the paper the authors discretize the range of the return random variable into 51 atoms: a Categorical distribution with 51 categories.

## Code

### Import dependencies

```python
from typing import Dict, List, Tuple

import gym
import numpy as np
import paddle
import paddle.nn as nn
import paddle.nn.functional as F
from tqdm import tqdm, trange
from visualdl import LogWriter
```


### A special operator we need

The helper below emulates a scatter-add (PyTorch's `index_add_`) using `one_hot`: each element of `child` is added into `parent` at the position given by `idx`. The projection step of the loss will rely on it.

```python
def index_add_(parent, axis, idx, child):
    """Add each child[k] into parent at position idx[k] (scatter-add).

    parent: 1-D tensor of length P; idx, child: 1-D tensors of length K.
    """
    expand_dim = parent.shape[0]
    # One-hot rows select the target position of each child element: [K, P]
    idx_one_hot = F.one_hot(idx.cast("int64"), expand_dim)
    # Weight each one-hot row by its child value, then sum over K to scatter-add
    output = parent + (idx_one_hot.cast("float32") * child.unsqueeze(-1)).sum(axis)
    return output
```
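A quick sanity check of the helper with toy values (assuming the imports above):

```python
parent = paddle.zeros([5])
idx = paddle.to_tensor([1, 3, 1], dtype="int64")
child = paddle.to_tensor([10.0, 20.0, 5.0])
print(index_add_(parent, 0, idx, child))  # [0., 15., 0., 20., 0.]
```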


### Experience replay

```python
class ReplayBuffer:
    """A simple FIFO experience replay buffer."""

    def __init__(self, obs_dim: int, size: int, batch_size: int = 32):
        self.obs_buf = np.zeros([size, obs_dim], dtype=np.float32)
        self.next_obs_buf = np.zeros([size, obs_dim], dtype=np.float32)
        self.acts_buf = np.zeros([size], dtype=np.float32)
        self.rews_buf = np.zeros([size], dtype=np.float32)
        self.done_buf = np.zeros([size], dtype=np.float32)
        self.max_size, self.batch_size = size, batch_size
        self.ptr, self.size = 0, 0

    def store(
        self,
        obs: np.ndarray,
        act: np.ndarray,
        rew: float,
        next_obs: np.ndarray,
        done: bool,
    ):
        # Write at the current pointer, overwriting the oldest entry when full
        self.obs_buf[self.ptr] = obs
        self.next_obs_buf[self.ptr] = next_obs
        self.acts_buf[self.ptr] = act
        self.rews_buf[self.ptr] = rew
        self.done_buf[self.ptr] = done
        self.ptr = (self.ptr + 1) % self.max_size
        self.size = min(self.size + 1, self.max_size)

    def sample_batch(self):
        # Uniformly sample a batch of transitions without replacement
        idxs = np.random.choice(self.size, size=self.batch_size, replace=False)
        return dict(obs=self.obs_buf[idxs],
                    next_obs=self.next_obs_buf[idxs],
                    acts=self.acts_buf[idxs],
                    rews=self.rews_buf[idxs],
                    done=self.done_buf[idxs])

    def __len__(self):
        return self.size
```
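A quick check of the buffer with toy transitions:

```python
buf = ReplayBuffer(obs_dim=4, size=100, batch_size=2)
for _ in range(3):
    buf.store(np.zeros(4), 0, 1.0, np.zeros(4), False)
print(len(buf))                         # 3
print(buf.sample_batch()["obs"].shape)  # (2, 4)
```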


### Define the network structure

Although our network can now learn the whole return distribution, in the CartPole-v0 environment we still use the expectation of that distribution as the basis for decision-making.

Compared with the traditional DQN, our C51 network is updated in a different way.

```python
class C51DQN(nn.Layer):
    def __init__(
        self,
        in_dim: int,
        out_dim: int,
        atom_size: int,
        support
    ):
        super(C51DQN, self).__init__()

        self.support = support      # the atoms z_i, shape [atom_size]
        self.out_dim = out_dim      # number of actions
        self.atom_size = atom_size  # number of atoms N

        self.layers = nn.Sequential(
            nn.Linear(in_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.Linear(128, out_dim * atom_size)
        )

    def forward(self, x):
        # Q(s, a) is the expectation of the distribution: sum_i z_i * p_i(s, a)
        dist = self.dist(x)
        q = paddle.sum(dist * self.support, axis=2)
        return q

    def dist(self, x):
        # One softmax-normalized probability vector over the atoms per action
        q_atoms = self.layers(x).reshape([-1, self.out_dim, self.atom_size])
        dist = F.softmax(q_atoms, axis=-1)
        dist = dist.clip(min=float(1e-3))  # avoid NaN in log()
        return dist
```
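A quick shape check (4 observation dimensions and 2 actions, matching CartPole-v0):

```python
support = paddle.linspace(0.0, 200.0, 51)
net = C51DQN(in_dim=4, out_dim=2, atom_size=51, support=support)
x = paddle.randn([8, 4])
print(net.dist(x).shape)  # [8, 2, 51]: one probability vector per action
print(net(x).shape)       # [8, 2]:     expected return per action
```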


### Agent parameters

Attributes:

- env: gym environment
- memory: experience replay buffer
- batch_size: training batch size
- epsilon: exploration rate $\epsilon$
- epsilon_decay: decay step of $\epsilon$
- max_epsilon: upper bound of $\epsilon$
- min_epsilon: lower bound of $\epsilon$
- target_update: target network update frequency
- gamma: discount factor
- dqn: online (training) model
- dqn_target: target model
- optimizer: optimizer
- transition: transition tuple containing the state, action, reward, next_state, and end-of-episode flag done
- v_min: lower bound of the discrete support
- v_max: upper bound of the discrete support
- atom_size: number of atoms
- support: the support vector (the atoms on which the distribution is defined)

```python
class C51Agent:
    def __init__(
        self,
        env: gym.Env,
        memory_size: int,
        batch_size: int,
        target_update: int,
        epsilon_decay: float,
        max_epsilon: float = 1.0,
        min_epsilon: float = 0.1,
        gamma: float = 0.99,
        # C51-specific parameters
        v_min: float = 0.0,
        v_max: float = 200.0,
        atom_size: int = 51,
        log_dir: str = "./log"
    ):
        obs_dim = env.observation_space.shape[0]
        action_dim = env.action_space.n

        self.env = env
        self.memory = ReplayBuffer(obs_dim, memory_size, batch_size)
        self.batch_size = batch_size
        self.epsilon = max_epsilon
        self.epsilon_decay = epsilon_decay
        self.max_epsilon = max_epsilon
        self.min_epsilon = min_epsilon
        self.target_update = target_update
        self.gamma = gamma

        # Discrete support: atom_size atoms evenly spaced over [v_min, v_max]
        self.v_min = v_min
        self.v_max = v_max
        self.atom_size = atom_size
        self.support = paddle.linspace(
            self.v_min, self.v_max, self.atom_size
        )

        # Online network and target network
        self.dqn = C51DQN(
            obs_dim, action_dim, atom_size, self.support
        )
        self.dqn_target = C51DQN(
            obs_dim, action_dim, atom_size, self.support
        )
        self.dqn_target.set_state_dict(self.dqn.state_dict())
        self.dqn_target.eval()

        # Optimizer (Paddle's default Adam learning rate)
        self.optimizer = paddle.optimizer.Adam(parameters=self.dqn.parameters())

        self.transition = []

        self.is_test = False

        self.log_dir = log_dir
        self.log_writer = LogWriter(logdir=self.log_dir, comment="Categorical DQN")

    def select_action(self, state: np.ndarray):
        # Epsilon-greedy exploration
        if self.epsilon > np.random.random():
            selected_action = self.env.action_space.sample()
        else:
            # Greedy action: argmax over the expected returns
            selected_action = self.dqn(
                paddle.to_tensor(state, dtype="float32").unsqueeze(0)
            ).argmax()
            selected_action = selected_action.detach().numpy()

        if not self.is_test:
            self.transition = [state, selected_action]

        return selected_action

    def step(self, action: np.ndarray):
        # Apply the action in the environment; store the transition when training
        next_state, reward, done, _ = self.env.step(int(action))

        if not self.is_test:
            self.transition += [reward, next_state, done]
            self.memory.store(*self.transition)

        return next_state, reward, done

    def update_model(self):
        samples = self.memory.sample_batch()

        loss = self._compute_dqn_loss(samples)

        self.optimizer.clear_grad()
        loss.backward()
        self.optimizer.step()

        return loss.numpy().item()

    def train(self, num_frames: int, plotting_interval: int = 200):
        self.is_test = False

        state = self.env.reset()
        update_cnt = 0
        epsilons = []
        losses = []
        scores = []
        score = 0
        episode_cnt = 0

        for frame_idx in trange(1, num_frames + 1):
            action = self.select_action(state)
            next_state, reward, done = self.step(action)

            state = next_state
            score += reward

            # End of episode
            if done:
                episode_cnt += 1
                state = self.env.reset()
                scores.append(score)
                score = 0

            # Start training once the buffer holds at least one full batch
            if len(self.memory) >= self.batch_size:
                loss = self.update_model()
                losses.append(loss)
                update_cnt += 1

                # Linearly decay epsilon towards its lower bound
                self.epsilon = max(
                    self.min_epsilon, self.epsilon - (
                        self.max_epsilon - self.min_epsilon
                    ) * self.epsilon_decay
                )
                epsilons.append(self.epsilon)

                # Periodically sync the target network
                if update_cnt % self.target_update == 0:
                    self._target_hard_update()

        self.env.close()

    def test(self):
        self.is_test = True

        state = self.env.reset()
        done = False
        score = 0

        frames = []
        while not done:
            frames.append(self.env.render(mode="rgb_array"))
            action = self.select_action(state)
            next_state, reward, done = self.step(int(action))

            state = next_state
            score += reward

        print("score: ", score)
        self.env.close()

        return frames

    def _compute_dqn_loss(self, samples: Dict[str, np.ndarray]):
        # Convert the sampled batch to tensors
        state = paddle.to_tensor(samples["obs"], dtype="float32")
        next_state = paddle.to_tensor(samples["next_obs"], dtype="float32")
        action = paddle.to_tensor(samples["acts"], dtype="int64")
        reward = paddle.to_tensor(samples["rews"], dtype="float32").reshape([-1, 1])
        done = paddle.to_tensor(samples["done"], dtype="float32").reshape([-1, 1])

        delta_z = float(self.v_max - self.v_min) / (self.atom_size - 1)

        with paddle.no_grad():
            # Greedy next action under the target network
            next_action = self.dqn_target(next_state).argmax(1)
            next_dist = self.dqn_target.dist(next_state)  # [B, A, N]
            # Keep only the distribution of the chosen next action: [B, N]
            next_action_mask = F.one_hot(next_action, next_dist.shape[1])
            next_dist = (next_dist * next_action_mask.unsqueeze(-1)).sum(1)

            # Distributional Bellman update of every atom, clipped to the support
            t_z = reward + (1 - done) * self.gamma * self.support
            t_z = t_z.clip(min=self.v_min, max=self.v_max)
            b = (t_z - self.v_min) / delta_z  # fractional atom index
            l = b.floor().cast("int64")
            u = b.ceil().cast("int64")

            # Row offsets so every sample scatters into its own flattened slice
            offset = (
                paddle.linspace(
                    0, (self.batch_size - 1) * self.atom_size, self.batch_size
                ).cast("int64")
                .unsqueeze(1)
                .expand([self.batch_size, self.atom_size])
            )

            # Project the updated atoms back onto the fixed support
            proj_dist = paddle.zeros_like(next_dist).reshape([-1])
            proj_dist = index_add_(
                proj_dist,
                0,
                (l + offset).reshape([-1]),
                (next_dist * (u.cast("float32") - b)).reshape([-1])
            )
            proj_dist = index_add_(
                proj_dist,
                0,
                (u + offset).reshape([-1]),
                (next_dist * (b - l.cast("float32"))).reshape([-1])
            )
            proj_dist = proj_dist.reshape(next_dist.shape)

        # Log-probabilities of the taken actions under the online network
        dist = self.dqn.dist(state)  # [B, A, N]
        action_mask = F.one_hot(action, dist.shape[1])
        dist = (dist * action_mask.unsqueeze(-1)).sum(1)  # [B, N]
        log_p = paddle.log(dist)

        # Cross-entropy between the projected target and the prediction
        loss = -(proj_dist * log_p).sum(1).mean()

        return loss

    def _target_hard_update(self):
        # Copy the online network's parameters into the target network
        self.dqn_target.set_state_dict(self.dqn.state_dict())
```

### Define environment

env_id = "CartPole-v0"
env = gym.make(env_id)


### Set random seed

```python
seed = 777  # any fixed integer works; set once for reproducibility
np.random.seed(seed)
paddle.seed(seed)
env.seed(seed)
```


### Hyperparameter settings and training

```python
num_frames = 20000
memory_size = 1000
batch_size = 32
target_update = 200
epsilon_decay = 1 / 2000

agent = C51Agent(env, memory_size, batch_size, target_update, epsilon_decay)
agent.train(num_frames)
```


### Visualization of training results
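The agent already creates a VisualDL `LogWriter`, although the training loop above does not yet write to it. Below is a minimal standalone sketch of the logging pattern; the tags `train/loss` and `train/score` and the toy values are my own choices, not from the original post:

```python
from visualdl import LogWriter

writer = LogWriter(logdir="./log")
for step, (loss, score) in enumerate([(0.9, 12.0), (0.5, 48.0)]):  # toy values
    writer.add_scalar(tag="train/loss", value=loss, step=step)
    writer.add_scalar(tag="train/score", value=score, step=step)
writer.close()

# Inspect the curves in a browser with:
#   visualdl --logdir ./log
```

The same `add_scalar` calls could be placed inside `C51Agent.train()`, using `frame_idx` as the step.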
