#### Preface

A BP (backpropagation) neural network is trained on a set of input–output pairs, repeatedly adjusting its weights until the output stabilizes. But a BP network is not suited to every scenario, and in some it fails to capture the essential structure of the problem. Consider a classical probability question: coin tossing. Suppose you have tossed a coin 100 times and observed 90 heads and 10 tails. What is the probability of heads on the next toss? With no prior experience, it is natural to say 50%; but since we have already run the experiment, that is, we have prior evidence, Bayes' rule tells us the probability of heads should be well above 50%. A plain BP network likewise lacks any feedback from previous results.

A familiar, easy-to-understand example of a computation where each position is affected by the one before it is long addition. The result in the tens digit depends on the result in the ones digit, because there may be a carry; likewise, the hundreds digit depends on the tens digit, as shown in the figure.
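The carry chain can be made concrete with a few lines of Python (an illustrative sketch, not code from the article):

```python
# Long addition: each digit's result depends on the carry produced by the
# digit before it, so the computation is inherently sequential.
def add_digits(a_digits, b_digits):
    """Add two numbers given as lists of decimal digits, least significant first."""
    result = []
    carry = 0
    for da, db in zip(a_digits, b_digits):
        s = da + db + carry      # the carry from the previous position feeds in here
        result.append(s % 10)    # current digit of the sum
        carry = s // 10          # new carry, passed on to the next position
    if carry:
        result.append(carry)
    return result

# 57 + 68 = 125; digits are given least-significant-first
print(add_digits([7, 5], [8, 6]))  # → [5, 2, 1]
```

The ones digit (7 + 8 = 15) produces a carry that changes the tens digit, which in turn produces a carry into the hundreds: exactly the front-to-back dependence described above.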

This pattern, where each position is influenced by the one before it, is very common, and the classical BP network cannot capture it well. The classical BP network therefore needs to be adapted: introduce the influence of the previous position, i.e. of historical data, into the network, making it sequential, so that later events can be inferred from correlations in the historical data.

#### The recurrent neural network (RNN)

Starting from the animated diagram of the addition algorithm above, the existing BP network is modified so that the result of adding the previous position influences the network at the next position.

Here the BP network is laid out as in the figure above, which vividly illustrates the defining characteristic of a recurrent neural network: the result of the current step is fed in as part of the next step's input and influences the next step's output. Recurrent neural networks have achieved good results in many areas. One special variant, the Long Short-Term Memory (LSTM) network, has produced the most striking results and is the star of this field.
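The feedback described above can be sketched as a single recurrent step. This is a plain Elman-style recurrence, simpler than a full LSTM; all names (`W_x`, `W_h`, `W_y`) and sizes here are illustrative, not taken from the article:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def rnn_step(x_t, h_prev, W_x, W_h, W_y):
    # the previous hidden state h_prev is fed back in alongside the new input
    h_t = sigmoid(x_t @ W_x + h_prev @ W_h)  # current input mixed with history
    y_t = sigmoid(h_t @ W_y)                 # output computed from the new state
    return h_t, y_t

rng = np.random.default_rng(0)
W_x = rng.normal(size=(2, 16))
W_h = rng.normal(size=(16, 16))
W_y = rng.normal(size=(16, 1))

h = np.zeros((1, 16))                        # hidden state starts empty
for x_t in (np.array([[1.0, 0.0]]), np.array([[0.0, 1.0]])):
    h, y = rnn_step(x_t, h, W_x, W_h, W_y)   # h carries information forward
```

Unrolled over time, this is just a BP network with one extra input per step: the previous hidden state.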

LSTM structure:

Let's look at an implementation of LSTM. For more detail, http://nicodjimenez.github.io/2014/08/08/lstm.html and https://github.com/nicodjimenez/lstm/blob/master/lstm.py give the LSTM derivation in full.

The iterative derivation of the algorithm is given at:

http://nicodjimenez.github.io/2014/08/08/lstm.html

https://github.com/nicodjimenez/lstm

The algorithm is not very different from a BP neural network, but pay attention to how the delta for each variable is computed and iterated.

#### Using an RNN to implement addition

The simplified network is as follows:

If you remove the layer_1 layer, what remains is the simplest BP neural network. Introducing the layer_1 layer gives the classical BP network one extra input; in the addition algorithm, layer_1 represents the previous position's contribution (the carry), which is what lets the network capture the structure of addition. Structurally, this recurrent variant is not very complex; the important part is how to compute the delta at each layer and then iterate.

Build a binary-addition network: the input layer has two nodes, and the hidden layer has 16 nodes whose state is carried from one position to the next, storing the carry.

```python
for position in range(binary_dim):
    # generate input and output
    X = np.array([[a[binary_dim - position - 1], b[binary_dim - position - 1]]])
    y = np.array([[c[binary_dim - position - 1]]]).T

    # hidden layer (input ~+ prev_hidden)
    layer_1 = sigmoid(np.dot(X, synapse_0) + np.dot(layer_1_values[-1], synapse_h))

    # output layer (new binary representation)
    layer_2 = sigmoid(np.dot(layer_1, synapse_1))

    # did we miss?... if so, by how much?
    layer_2_error = y - layer_2
    layer_2_deltas.append((layer_2_error) * sigmoid_output_to_derivative(layer_2))
    overallError += np.abs(layer_2_error[0])

    # decode estimate so we can print it out
    d[binary_dim - position - 1] = np.round(layer_2[0][0])

    # store hidden layer so we can use it in the next timestep
    layer_1_values.append(copy.deepcopy(layer_1))
```

The two addends are fed in as inputs, one bit of each per step, and the errors of the output layer and the hidden layer are computed. This is a recurrent neural network for addition, whose hidden layer can be regarded as holding the carry. Note that the input layer of this adder has only two nodes, not eight: it is a minimal one-bit adder, applied position by position.

Training again uses error backpropagation; the main thing to work out is the derivative. Python implementation:

```python
# compute sigmoid nonlinearity
def sigmoid(x):
    output = 1 / (1 + np.exp(-x))
    return output

# convert output of sigmoid function to its derivative
def sigmoid_output_to_derivative(output):
    return output * (1 - output)
```
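The identity used here, σ'(x) = σ(x)(1 − σ(x)), lets backpropagation reuse the cached forward activations instead of recomputing the derivative from x. It can be checked directly against σ'(x) = e^(−x) / (1 + e^(−x))²:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_output_to_derivative(output):
    return output * (1 - output)

x = np.array([-2.0, 0.0, 2.0])
out = sigmoid(x)
direct = np.exp(-x) / (1 + np.exp(-x)) ** 2   # the derivative computed from x
print(np.allclose(sigmoid_output_to_derivative(out), direct))  # → True
```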

For the computation of the intermediate deltas, compare the LSTM derivation referenced above:

```python
for position in range(binary_dim):
    X = np.array([[a[position], b[position]]])
    layer_1 = layer_1_values[-position - 1]
    prev_layer_1 = layer_1_values[-position - 2]

    # error at output layer
    layer_2_delta = layer_2_deltas[-position - 1]

    # error at hidden layer
    layer_1_delta = (future_layer_1_delta.dot(synapse_h.T) +
                     layer_2_delta.dot(synapse_1.T)) * sigmoid_output_to_derivative(layer_1)

    # let's update all our weights so we can try again
    synapse_1_update += np.atleast_2d(layer_1).T.dot(layer_2_delta)
    synapse_h_update += np.atleast_2d(prev_layer_1).T.dot(layer_1_delta)
    synapse_0_update += X.T.dot(layer_1_delta)

    future_layer_1_delta = layer_1_delta
```

Accumulating the weight updates:

```python
synapse_1_update += np.atleast_2d(layer_1).T.dot(layer_2_delta)
synapse_h_update += np.atleast_2d(prev_layer_1).T.dot(layer_1_delta)
synapse_0_update += X.T.dot(layer_1_delta)
```

where the layer_1_delta variable is the sum of two terms: the delta propagated back from the next timestep's hidden layer, and the delta from the current output layer:

```python
layer_1_delta = (future_layer_1_delta.dot(synapse_h.T) +
                 layer_2_delta.dot(synapse_1.T)) * sigmoid_output_to_derivative(layer_1)
```
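Note that the `*_update` variables are accumulators: the updates are only applied to the weights after backpropagating through all bit positions, and are then cleared. A sketch of that final step (the learning rate `alpha` and the stand-in arrays here are assumptions for illustration; the referenced tutorial uses alpha = 0.1):

```python
import numpy as np

# hypothetical stand-ins so the update step runs on its own;
# in the real loop these are the accumulators filled in during backprop
alpha = 0.1                                     # assumed learning rate
synapse_0 = 2 * np.random.random((2, 16)) - 1
synapse_1 = 2 * np.random.random((16, 1)) - 1
synapse_h = 2 * np.random.random((16, 16)) - 1
synapse_0_update = np.ones_like(synapse_0)
synapse_1_update = np.ones_like(synapse_1)
synapse_h_update = np.ones_like(synapse_h)

# apply the accumulated updates once per training sample...
synapse_0 += synapse_0_update * alpha
synapse_1 += synapse_1_update * alpha
synapse_h += synapse_h_update * alpha

# ...then clear the accumulators for the next sample
synapse_0_update *= 0
synapse_1_update *= 0
synapse_h_update *= 0
```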

The complete iterative process can be found at:

https://iamtrask.github.io/2015/11/15/anyone-can-code-lstm/
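The loop at the link above can be condensed into a short runnable script following the snippets in this article. The learning rate, iteration count, and random seed below are assumptions, not values given in the text:

```python
import copy
import numpy as np

np.random.seed(0)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_output_to_derivative(output):
    return output * (1 - output)

binary_dim = 8
largest = 2 ** binary_dim
# lookup table: integer -> its 8-bit binary representation, most significant first
int2binary = {i: np.array([int(bit) for bit in np.binary_repr(i, width=binary_dim)])
              for i in range(largest)}

alpha, input_dim, hidden_dim, output_dim = 0.1, 2, 16, 1
synapse_0 = 2 * np.random.random((input_dim, hidden_dim)) - 1
synapse_1 = 2 * np.random.random((hidden_dim, output_dim)) - 1
synapse_h = 2 * np.random.random((hidden_dim, hidden_dim)) - 1

errors = []
for _ in range(10000):
    a_int = np.random.randint(largest // 2)
    b_int = np.random.randint(largest // 2)
    a, b, c = int2binary[a_int], int2binary[b_int], int2binary[a_int + b_int]

    overall_error = 0.0
    layer_2_deltas, layer_1_values = [], [np.zeros((1, hidden_dim))]

    # forward pass, least significant bit first
    for position in range(binary_dim):
        X = np.array([[a[binary_dim - position - 1], b[binary_dim - position - 1]]])
        y = np.array([[c[binary_dim - position - 1]]]).T
        layer_1 = sigmoid(X.dot(synapse_0) + layer_1_values[-1].dot(synapse_h))
        layer_2 = sigmoid(layer_1.dot(synapse_1))
        layer_2_error = y - layer_2
        layer_2_deltas.append(layer_2_error * sigmoid_output_to_derivative(layer_2))
        overall_error += float(np.abs(layer_2_error[0, 0]))
        layer_1_values.append(copy.deepcopy(layer_1))

    # backward pass through time, most significant bit first
    s0u, s1u, shu = (np.zeros_like(synapse_0), np.zeros_like(synapse_1),
                     np.zeros_like(synapse_h))
    future_layer_1_delta = np.zeros((1, hidden_dim))
    for position in range(binary_dim):
        X = np.array([[a[position], b[position]]])
        layer_1 = layer_1_values[-position - 1]
        prev_layer_1 = layer_1_values[-position - 2]
        layer_2_delta = layer_2_deltas[-position - 1]
        layer_1_delta = (future_layer_1_delta.dot(synapse_h.T) +
                         layer_2_delta.dot(synapse_1.T)) * sigmoid_output_to_derivative(layer_1)
        s1u += np.atleast_2d(layer_1).T.dot(layer_2_delta)
        shu += np.atleast_2d(prev_layer_1).T.dot(layer_1_delta)
        s0u += X.T.dot(layer_1_delta)
        future_layer_1_delta = layer_1_delta

    synapse_0 += s0u * alpha
    synapse_1 += s1u * alpha
    synapse_h += shu * alpha
    errors.append(overall_error)

print(errors[0], errors[-1])   # the per-sample error drops sharply over training
```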

Structurally, the recurrent neural network is not very different from the classical BP neural network; the key is computing the deltas and iterating.

#### Applications of recurrent neural networks

The biggest difference between a recurrent neural network and a BP network is the introduction of the time dimension, which allows future events to be inferred from past data. This is a popular direction at present, with most applications in speech and text processing. There are plenty of RNN applications online, such as generating lyrics in the style of Wang Feng, composing Tang poetry, writing corny jokes, and so on. Producing decent lyrics or poems, however, still requires a great deal of additional processing. Applying recurrent networks to recommender systems should also yield good results.

#### References

http://blog.csdn.net/zzukun/article/details/49968129

http://www.jianshu.com/p/9dc9f41f0b29

http://nicodjimenez.github.io/2014/08/08/lstm.html

https://github.com/nicodjimenez/lstm

http://blog.csdn.net/longxinchen_ml/article/details/51253526

https://github.com/karpathy/char-rnn

http://blog.csdn.net/v_july_v/article/details/52796239