# Foreword

This article first introduces the basic concepts of distributional reinforcement learning, and then walks through the pioneering work in this direction: the C51 algorithm.

# Introduction to distributional reinforcement learning

First of all, it should be noted that distributional reinforcement learning is a family of value-based reinforcement learning algorithms.

Classical value-based reinforcement learning methods model the cumulative return by its expected value, expressed as the state value function V(s) or the action value function Q(s,a).

In this modeling process, most of the information about the full return distribution is lost.

Distributional reinforcement learning addresses this loss of information by modeling the whole distribution Z(s,a) of the random return, rather than just its expectation.

Expressed as a formula:

$$Q(s_t,a_t)=\mathbb{E}\left[Z(s_t,a_t)\right]=\mathbb{E}\left[\sum_{i=0}^{\infty}\gamma^{i}\,R(s_{t+i},a_{t+i})\right]$$

# Distributional RL: the C51 algorithm

The C51 algorithm comes from the DeepMind paper *A Distributional Perspective on Reinforcement Learning*.

In that paper, the authors first point out that the Q learned by the traditional DQN algorithm is a single scalar: the expectation of the sum of future rewards.

In distributional reinforcement learning algorithms, the learning target changes from the scalar value Q to the random variable Z. This change lets the agent learn more than a single number; it learns the whole return distribution.

Accordingly, the model's loss becomes a measure of the similarity between two distributions.
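For two discrete distributions, a minimal NumPy sketch of such a similarity measure (the probability values here are toy numbers, not from the paper):

```python
import numpy as np

p = np.array([0.1, 0.6, 0.3])   # target distribution (toy values)
q = np.array([0.2, 0.5, 0.3])   # predicted distribution (toy values)

kl = np.sum(p * np.log(p / q))          # KL divergence D_KL(p || q)
cross_entropy = -np.sum(p * np.log(q))  # the cross-entropy form C51 minimizes
print(kl, cross_entropy)
```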

## Key algorithmic ideas

### Parameterized distribution

In short, the value range of the distribution, from $V_{min}$ to $V_{max}$, is divided into $N$ discrete, equally spaced support points (atoms):

$$\{z_i=V_{min}+i\Delta z : 0\leq i<N\},\quad \Delta z=\frac{V_{max}-V_{min}}{N-1}$$

Each value output by the model is the probability that the return takes the value of the corresponding atom.
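To make this concrete, here is a minimal NumPy sketch of building the support and recovering Q as the expectation of the distribution (the values 0, 200 and 51 match the hyperparameters used later in this post; the probability vector is made up):

```python
import numpy as np

# Build the discrete support: N atoms evenly spaced over [V_min, V_max]
v_min, v_max, n_atoms = 0.0, 200.0, 51
delta_z = (v_max - v_min) / (n_atoms - 1)
support = v_min + np.arange(n_atoms) * delta_z  # z_i = V_min + i * delta_z

# A made-up probability vector over the atoms, standing in for the network output
probs = np.random.dirichlet(np.ones(n_atoms))

# Q(s, a) is recovered as the expectation of the distribution
q_value = (support * probs).sum()
print(delta_z, q_value)  # delta_z = 4.0 here
```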

### Projected Bellman update

After applying the distributional Bellman operator, the values of the updated random variable may drift off the atoms of the discretized support, and may even fall outside the $[V_{min}, V_{max}]$ range.

Therefore, the random variable produced by the Bellman update must be projected back onto the discrete support; this is the projected Bellman update described in the paper.

Simply put, the projection assigns each updated atom's probability mass to its two adjacent atoms, in proportion to its distance from each.
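A minimal NumPy sketch of this projection for a single transition (toy reward and distribution; the batched Paddle version appears in `_compute_dqn_loss` below):

```python
import numpy as np

v_min, v_max, n_atoms, gamma = 0.0, 200.0, 51, 0.99
delta_z = (v_max - v_min) / (n_atoms - 1)
support = v_min + np.arange(n_atoms) * delta_z

probs = np.full(n_atoms, 1.0 / n_atoms)  # toy distribution of Z(s', a*)
reward, done = 1.0, False

# Bellman update of every atom, clipped back into [v_min, v_max]
t_z = np.clip(reward + (1.0 - done) * gamma * support, v_min, v_max)
b = (t_z - v_min) / delta_z                    # fractional index of each new atom
l, u = np.floor(b).astype(int), np.ceil(b).astype(int)

# Split each atom's mass between its two neighbours, proportional to distance
proj = np.zeros(n_atoms)
np.add.at(proj, l, probs * (u - b))
np.add.at(proj, u, probs * (b - l))
print(proj.sum())  # ~1.0 (mass at exactly-integer b needs special care)
```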

## Summary

### What C51 and DQN have in common

1. The overall framework of C51 is still that of the DQN algorithm.
2. Sampling still uses an $\epsilon$-greedy strategy, where the greedy choice maximizes the expectation of the distribution.
3. A separate target network is still used.

### Differences between C51 and DQN

1. The network of C51 no longer outputs the action value function directly, but the probabilities at the atoms.

2. The loss function of C51 is no longer a mean squared error, but the distribution-similarity measure (cross-entropy/KL divergence) described above.

Finally, why is the algorithm called C51?

Because in the paper the authors discretize the range of the return random variable into 51 atoms: a Categorical distribution with 51 categories.

## Code

### Import dependencies

```python
from typing import Dict, List, Tuple

import gym
import numpy as np
import paddle
import paddle.nn as nn
import paddle.nn.functional as F
from tqdm import tqdm, trange
from visualdl import LogWriter
```


### A special operator we need

The helper below emulates a scatter-add (PyTorch's `index_add_`) using `one_hot`: each element of `child` is added into `parent` at the position given by `idx`. The projection step of the loss will rely on it.

```python
def index_add_(parent, axis, idx, child):
    """Add each child[k] into parent at position idx[k] (scatter-add).

    parent: 1-D tensor of length P; idx, child: 1-D tensors of length K.
    """
    expand_dim = parent.shape[0]
    # One-hot rows select the target position of each child element: [K, P]
    idx_one_hot = F.one_hot(idx.cast("int64"), expand_dim)
    # Weight each one-hot row by its child value, then sum over K to scatter-add
    output = parent + (idx_one_hot.cast("float32") * child.unsqueeze(-1)).sum(axis)
    return output
```
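A quick sanity check of the helper with toy values (assuming the imports above):

```python
parent = paddle.zeros([5])
idx = paddle.to_tensor([1, 3, 1], dtype="int64")
child = paddle.to_tensor([10.0, 20.0, 5.0])
print(index_add_(parent, 0, idx, child))  # [0., 15., 0., 20., 0.]
```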


### Experience replay

```python
class ReplayBuffer:
    """A simple FIFO experience replay buffer."""

    def __init__(self, obs_dim: int, size: int, batch_size: int = 32):
        self.obs_buf = np.zeros([size, obs_dim], dtype=np.float32)
        self.next_obs_buf = np.zeros([size, obs_dim], dtype=np.float32)
        self.acts_buf = np.zeros([size], dtype=np.float32)
        self.rews_buf = np.zeros([size], dtype=np.float32)
        self.done_buf = np.zeros([size], dtype=np.float32)
        self.max_size, self.batch_size = size, batch_size
        self.ptr, self.size = 0, 0

    def store(
        self,
        obs: np.ndarray,
        act: np.ndarray,
        rew: float,
        next_obs: np.ndarray,
        done: bool,
    ):
        # Write at the current pointer, overwriting the oldest entry when full
        self.obs_buf[self.ptr] = obs
        self.next_obs_buf[self.ptr] = next_obs
        self.acts_buf[self.ptr] = act
        self.rews_buf[self.ptr] = rew
        self.done_buf[self.ptr] = done
        self.ptr = (self.ptr + 1) % self.max_size
        self.size = min(self.size + 1, self.max_size)

    def sample_batch(self):
        # Uniformly sample a batch of transitions without replacement
        idxs = np.random.choice(self.size, size=self.batch_size, replace=False)
        return dict(obs=self.obs_buf[idxs],
                    next_obs=self.next_obs_buf[idxs],
                    acts=self.acts_buf[idxs],
                    rews=self.rews_buf[idxs],
                    done=self.done_buf[idxs])

    def __len__(self):
        return self.size
```
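A quick check of the buffer with toy transitions:

```python
buf = ReplayBuffer(obs_dim=4, size=100, batch_size=2)
for _ in range(3):
    buf.store(np.zeros(4), 0, 1.0, np.zeros(4), False)
print(len(buf))                         # 3
print(buf.sample_batch()["obs"].shape)  # (2, 4)
```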


### Define the network structure

Although our network can now learn the whole return distribution, in the CartPole-v0 environment we still use the expectation of that distribution as the basis for decision-making.

Compared with the traditional DQN, our C51 network is updated in a different way.

```python
class C51DQN(nn.Layer):
    def __init__(
        self,
        in_dim: int,
        out_dim: int,
        atom_size: int,
        support
    ):
        super(C51DQN, self).__init__()

        self.support = support      # the atoms z_i, shape [atom_size]
        self.out_dim = out_dim      # number of actions
        self.atom_size = atom_size  # number of atoms N

        self.layers = nn.Sequential(
            nn.Linear(in_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.Linear(128, out_dim * atom_size)
        )

    def forward(self, x):
        # Q(s, a) is the expectation of the distribution: sum_i z_i * p_i(s, a)
        dist = self.dist(x)
        q = paddle.sum(dist * self.support, axis=2)
        return q

    def dist(self, x):
        # One softmax-normalized probability vector over the atoms per action
        q_atoms = self.layers(x).reshape([-1, self.out_dim, self.atom_size])
        dist = F.softmax(q_atoms, axis=-1)
        dist = dist.clip(min=float(1e-3))  # avoid NaN in log()
        return dist
```
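A quick shape check (4 observation dimensions and 2 actions, matching CartPole-v0):

```python
support = paddle.linspace(0.0, 200.0, 51)
net = C51DQN(in_dim=4, out_dim=2, atom_size=51, support=support)
x = paddle.randn([8, 4])
print(net.dist(x).shape)  # [8, 2, 51]: one probability vector per action
print(net(x).shape)       # [8, 2]:     expected return per action
```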


### Agent parameters

Attributes:

- env: gym environment
- memory: experience replay buffer
- batch_size: training batch size
- epsilon: exploration rate $\epsilon$
- epsilon_decay: decay step of $\epsilon$
- max_epsilon: upper bound of $\epsilon$
- min_epsilon: lower bound of $\epsilon$
- target_update: target network update frequency
- gamma: discount factor
- dqn: online (training) model
- dqn_target: target model
- optimizer: optimizer
- transition: transition tuple containing the state, action, reward, next_state, and end-of-episode flag done
- v_min: lower bound of the discrete support
- v_max: upper bound of the discrete support
- atom_size: number of atoms
- support: the support vector (the atoms on which the distribution is defined)

```python
class C51Agent:
    def __init__(
        self,
        env: gym.Env,
        memory_size: int,
        batch_size: int,
        target_update: int,
        epsilon_decay: float,
        max_epsilon: float = 1.0,
        min_epsilon: float = 0.1,
        gamma: float = 0.99,
        # C51-specific parameters
        v_min: float = 0.0,
        v_max: float = 200.0,
        atom_size: int = 51,
        log_dir: str = "./log"
    ):
        obs_dim = env.observation_space.shape[0]
        action_dim = env.action_space.n

        self.env = env
        self.memory = ReplayBuffer(obs_dim, memory_size, batch_size)
        self.batch_size = batch_size
        self.epsilon = max_epsilon
        self.epsilon_decay = epsilon_decay
        self.max_epsilon = max_epsilon
        self.min_epsilon = min_epsilon
        self.target_update = target_update
        self.gamma = gamma

        # Discrete support: atom_size atoms evenly spaced over [v_min, v_max]
        self.v_min = v_min
        self.v_max = v_max
        self.atom_size = atom_size
        self.support = paddle.linspace(
            self.v_min, self.v_max, self.atom_size
        )

        # Online network and target network
        self.dqn = C51DQN(
            obs_dim, action_dim, atom_size, self.support
        )
        self.dqn_target = C51DQN(
            obs_dim, action_dim, atom_size, self.support
        )
        self.dqn_target.set_state_dict(self.dqn.state_dict())
        self.dqn_target.eval()

        # Optimizer (Paddle's default Adam learning rate)
        self.optimizer = paddle.optimizer.Adam(parameters=self.dqn.parameters())

        self.transition = []

        self.is_test = False

        self.log_dir = log_dir
        self.log_writer = LogWriter(logdir=self.log_dir, comment="Categorical DQN")

    def select_action(self, state: np.ndarray):
        # Epsilon-greedy exploration
        if self.epsilon > np.random.random():
            selected_action = self.env.action_space.sample()
        else:
            # Greedy action: argmax over the expected returns
            selected_action = self.dqn(
                paddle.to_tensor(state, dtype="float32").unsqueeze(0)
            ).argmax()
            selected_action = selected_action.detach().numpy()

        if not self.is_test:
            self.transition = [state, selected_action]

        return selected_action

    def step(self, action: np.ndarray):
        # Apply the action in the environment; store the transition when training
        next_state, reward, done, _ = self.env.step(int(action))

        if not self.is_test:
            self.transition += [reward, next_state, done]
            self.memory.store(*self.transition)

        return next_state, reward, done

    def update_model(self):
        samples = self.memory.sample_batch()

        loss = self._compute_dqn_loss(samples)

        self.optimizer.clear_grad()
        loss.backward()
        self.optimizer.step()

        return loss.numpy().item()

    def train(self, num_frames: int, plotting_interval: int = 200):
        self.is_test = False

        state = self.env.reset()
        update_cnt = 0
        epsilons = []
        losses = []
        scores = []
        score = 0
        episode_cnt = 0

        for frame_idx in trange(1, num_frames + 1):
            action = self.select_action(state)
            next_state, reward, done = self.step(action)

            state = next_state
            score += reward

            # End of episode
            if done:
                episode_cnt += 1
                state = self.env.reset()
                scores.append(score)
                score = 0

            # Start training once the buffer holds at least one full batch
            if len(self.memory) >= self.batch_size:
                loss = self.update_model()
                losses.append(loss)
                update_cnt += 1

                # Linearly decay epsilon towards its lower bound
                self.epsilon = max(
                    self.min_epsilon, self.epsilon - (
                        self.max_epsilon - self.min_epsilon
                    ) * self.epsilon_decay
                )
                epsilons.append(self.epsilon)

                # Periodically sync the target network
                if update_cnt % self.target_update == 0:
                    self._target_hard_update()

        self.env.close()

    def test(self):
        self.is_test = True

        state = self.env.reset()
        done = False
        score = 0

        frames = []
        while not done:
            frames.append(self.env.render(mode="rgb_array"))
            action = self.select_action(state)
            next_state, reward, done = self.step(int(action))

            state = next_state
            score += reward

        print("score: ", score)
        self.env.close()

        return frames

    def _compute_dqn_loss(self, samples: Dict[str, np.ndarray]):
        # Convert the sampled batch to tensors
        state = paddle.to_tensor(samples["obs"], dtype="float32")
        next_state = paddle.to_tensor(samples["next_obs"], dtype="float32")
        action = paddle.to_tensor(samples["acts"], dtype="int64")
        reward = paddle.to_tensor(samples["rews"], dtype="float32").reshape([-1, 1])
        done = paddle.to_tensor(samples["done"], dtype="float32").reshape([-1, 1])

        delta_z = float(self.v_max - self.v_min) / (self.atom_size - 1)

        with paddle.no_grad():
            # Greedy next action under the target network
            next_action = self.dqn_target(next_state).argmax(1)
            next_dist = self.dqn_target.dist(next_state)  # [B, A, N]
            # Keep only the distribution of the chosen next action: [B, N]
            next_action_mask = F.one_hot(next_action, next_dist.shape[1])
            next_dist = (next_dist * next_action_mask.unsqueeze(-1)).sum(1)

            # Distributional Bellman update of every atom, clipped to the support
            t_z = reward + (1 - done) * self.gamma * self.support
            t_z = t_z.clip(min=self.v_min, max=self.v_max)
            b = (t_z - self.v_min) / delta_z  # fractional atom index
            l = b.floor().cast("int64")
            u = b.ceil().cast("int64")

            # Row offsets so every sample scatters into its own flattened slice
            offset = (
                paddle.linspace(
                    0, (self.batch_size - 1) * self.atom_size, self.batch_size
                ).cast("int64")
                .unsqueeze(1)
                .expand([self.batch_size, self.atom_size])
            )

            # Project the updated atoms back onto the fixed support
            proj_dist = paddle.zeros_like(next_dist).reshape([-1])
            proj_dist = index_add_(
                proj_dist,
                0,
                (l + offset).reshape([-1]),
                (next_dist * (u.cast("float32") - b)).reshape([-1])
            )
            proj_dist = index_add_(
                proj_dist,
                0,
                (u + offset).reshape([-1]),
                (next_dist * (b - l.cast("float32"))).reshape([-1])
            )
            proj_dist = proj_dist.reshape(next_dist.shape)

        # Log-probabilities of the taken actions under the online network
        dist = self.dqn.dist(state)  # [B, A, N]
        action_mask = F.one_hot(action, dist.shape[1])
        dist = (dist * action_mask.unsqueeze(-1)).sum(1)  # [B, N]
        log_p = paddle.log(dist)

        # Cross-entropy between the projected target and the prediction
        loss = -(proj_dist * log_p).sum(1).mean()

        return loss

    def _target_hard_update(self):
        # Copy the online network's parameters into the target network
        self.dqn_target.set_state_dict(self.dqn.state_dict())
```

### Define environment

env_id = "CartPole-v0"
env = gym.make(env_id)


### Set random seed

```python
seed = 777  # any fixed integer works; set once for reproducibility
np.random.seed(seed)
paddle.seed(seed)
env.seed(seed)
```


### Hyperparameter settings and training

```python
num_frames = 20000
memory_size = 1000
batch_size = 32
target_update = 200
epsilon_decay = 1 / 2000

agent = C51Agent(env, memory_size, batch_size, target_update, epsilon_decay)
agent.train(num_frames)
```


### Visualization of training results
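The agent already creates a VisualDL `LogWriter`, although the training loop above does not yet write to it. Below is a minimal standalone sketch of the logging pattern; the tags `train/loss` and `train/score` and the toy values are my own choices, not from the original post:

```python
from visualdl import LogWriter

writer = LogWriter(logdir="./log")
for step, (loss, score) in enumerate([(0.9, 12.0), (0.5, 48.0)]):  # toy values
    writer.add_scalar(tag="train/loss", value=loss, step=step)
    writer.add_scalar(tag="train/score", value=score, step=step)
writer.close()

# Inspect the curves in a browser with:
#   visualdl --logdir ./log
```

The same `add_scalar` calls could be placed inside `C51Agent.train()`, using `frame_idx` as the step.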
