An RNN (recurrent neural network) layer is composed of a set of recurrent neurons. The output of each recurrent neuron depends not only on the input vector but also on the outputs of all recurrent neurons in the layer at the previous time step.

\(Y_t=\phi\left(\left[\begin{matrix}X_t &Y_{t-1}\end{matrix}\right]\left[\begin{matrix}W_x\\W_y\end{matrix}\right]+B\right)\)

Here \(\phi\) is the activation function. \(Y_t\) is the output, of shape \((m,n_{neurons})\), where m is the number of input instances and \(n_{neurons}\) is the number of neurons in the layer. \(X_t\) is the input, of shape \((m,n_{inputs})\). \(Y_{t-1}\) is the output at time t-1, of shape \((m,n_{neurons})\). \(W_x\) is the weight matrix for the inputs; its shape is \((n_{inputs},n_{neurons})\). \(W_y\) is the weight matrix for the outputs at time t-1; its shape is \((n_{neurons},n_{neurons})\). B is the bias matrix (NOT a vector), composed of m rows and \(n_{neurons}\) columns. Every row has the same content, because we use the same set of biases for every instance in the mini-batch. \(\left[\begin{matrix}X_t &Y_{t-1}\end{matrix}\right]\) has the shape \((m,n_{inputs}+n_{neurons})\), and \(\left[\begin{matrix}W_x\\W_y\end{matrix}\right]\) has the shape \((n_{inputs}+n_{neurons},n_{neurons})\).
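
The shapes in this formula can be checked with a small NumPy sketch. The sizes, the random values and the choice of tanh as the activation are assumptions made only for illustration; the sketch also shows that the concatenated form gives the same result as computing the two products separately.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n_inputs, n_neurons = 4, 3, 5              # illustrative sizes (assumed)

X_t = rng.normal(size=(m, n_inputs))          # inputs at time t
Y_prev = rng.normal(size=(m, n_neurons))      # outputs at time t-1
W_x = rng.normal(size=(n_inputs, n_neurons))  # weights for the inputs
W_y = rng.normal(size=(n_neurons, n_neurons)) # weights for the previous outputs
b = rng.normal(size=(n_neurons,))             # one bias per neuron
B = np.tile(b, (m, 1))                        # bias matrix: m identical rows

# Concatenated form: [X_t Y_{t-1}] [W_x; W_y] + B
XY = np.concatenate([X_t, Y_prev], axis=1)    # (m, n_inputs + n_neurons)
W = np.concatenate([W_x, W_y], axis=0)        # (n_inputs + n_neurons, n_neurons)
Y_t = np.tanh(XY @ W + B)                     # (m, n_neurons)

# Equivalent form with two separate products and a broadcast bias vector
Y_t_alt = np.tanh(X_t @ W_x + Y_prev @ W_y + b)
```

Both forms produce the same (m, n_neurons) array; the bias matrix B is just the bias vector repeated for every instance in the batch.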

Do not confuse different input features with different values of one feature at different time steps. Likewise, do not confuse different outputs with different values of one output at different time steps, or different instances with different values of one instance at different time steps.

An instance that is fed to the model includes a whole series of data along time steps. The output of the model includes a whole series of data along time. The cost is a function of the whole series of data of the output.

To create a recurrent neuron in tensorflow, you may think of using the code below:

    x = tf.placeholder(tf.float32)
    w = tf.Variable(1.0)
    b = tf.Variable(2.0)
    y = x*w + y + b

But this code does not work:

`ValueError: Tensor("add_1:0", dtype=float32) must be from the same graph as Tensor("mul:0", dtype=float32).`

You may try to give y an initialization value:

    x = tf.placeholder(tf.float32)
    w = tf.Variable(1.0)
    y = tf.zeros([])
    b = tf.Variable(2.0)
    y = x*w + y + b

But it will create a graph totally different from what you want:

In short, you cannot connect the output of an operation back to the operation itself as an input. Taking the output of an operation as one of its inputs would introduce an infinite dependency, and tensorflow could not compute the output in a finite number of steps.

Since the cost is a function of the whole output series, evaluating the cost in the computation graph requires the graph to accept the whole input series and produce the whole output series. If the graph only accepted the input at one time step, we could not compute the cost by evaluating the cost node in the graph. We can construct such a computation graph with a cost node by unrolling the RNN along the time steps. The RNN graph is then composed of multiple similar sub-graphs called RNN cells, which all share a common set of weights and biases. Each cell computes the outputs at one time step, taking as inputs the input at that time step and the output of the cell for the previous time step. The tensorflow function that constructs such a graph is static_rnn.
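
The idea of unrolling can be sketched in plain NumPy, independently of tensorflow. The function names, the random weights and the tanh activation below are assumptions for illustration; the point is only that one cell function with one set of weights is applied once per time step, each time receiving the previous step's output.

```python
import numpy as np

def rnn_cell(x_t, y_prev, W_x, W_y, b):
    """One RNN cell: output at one time step from the input and the previous output."""
    return np.tanh(x_t @ W_x + y_prev @ W_y + b)

def unrolled_rnn(x_steps, W_x, W_y, b):
    """Apply the same cell (same weights and biases) once per time step."""
    m = x_steps[0].shape[0]
    y = np.zeros((m, W_y.shape[0]))      # the first cell sees an all-zero previous output
    outputs = []
    for x_t in x_steps:
        y = rnn_cell(x_t, y, W_x, W_y, b)
        outputs.append(y)
    return outputs

rng = np.random.default_rng(42)
n_inputs, n_neurons = 3, 5               # illustrative sizes (assumed)
W_x = rng.normal(size=(n_inputs, n_neurons))
W_y = rng.normal(size=(n_neurons, n_neurons))
b = np.zeros(n_neurons)

# Four instances, two time steps each
X0 = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 0, 1]], dtype=float)
X1 = np.array([[9, 8, 7], [0, 0, 0], [6, 5, 4], [3, 2, 1]], dtype=float)
Y0, Y1 = unrolled_rnn([X0, X1], W_x, W_y, b)
```

Because the whole output series [Y0, Y1] is available after one pass, a cost defined over the whole series can be computed from it.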

    import numpy as np
    import tensorflow as tf

    n_inputs = 3    # each instance has 3 features per time step
    n_neurons = 5   # same layer size as in the later examples

    X0 = tf.placeholder(tf.float32, [None, n_inputs])
    X1 = tf.placeholder(tf.float32, [None, n_inputs])

    basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
    output_seqs, states = tf.contrib.rnn.static_rnn(basic_cell, [X0, X1], dtype=tf.float32)
    Y0, Y1 = output_seqs

    init = tf.global_variables_initializer()

    # Four instances; one array per time step
    X0_batch = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 0, 1]])
    X1_batch = np.array([[9, 8, 7], [0, 0, 0], [6, 5, 4], [3, 2, 1]])

    with tf.Session() as sess:
        init.run()
        Y0_val, Y1_val = sess.run([Y0, Y1], feed_dict={X0: X0_batch, X1: X1_batch})
        print(Y0_val)
        print(Y1_val)

You can see that static_rnn creates a scope “rnn” which takes X0 and X1 as its inputs. You specify the inputs of an RNN in the second parameter of static_rnn, as a list of tensors with the same shape. Here the list of tensors is [X0,X1], which holds the data at 2 time steps. The shapes of X0 and X1 must be equal because the basic RNN cells accept the same input features, and for each instance you must provide its values at both time steps. Let’s see what is inside the rnn scope:

The basic_rnn_cell scope does not mean the same thing as the basic RNN cell we talked about above. In fact, the basic_rnn_cell scope contains all (2) of the RNN cells:

You can see the two cascaded rnn cells use the same kernel(weights) and bias. The concat node of the second cell combines the output of the first cell and X1 to form the input of the second cell. The first cell also has a concat node which combines X0 and a zero vector to form the input of the cell.

You can use one placeholder to represent X0,X1,…, unstack the placeholder into a list of rank-2 tensors (one per time step, each of shape (instances, features)), feed the list to the RNN, and then stack the outputs into one output tensor:

    X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
    X_seqs = tf.unstack(tf.transpose(X, perm=[1, 0, 2]))

    basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
    #output_seqs, states = tf.contrib.rnn.static_rnn(basic_cell, [X0, X1], dtype=tf.float32)
    output_seqs, states = tf.contrib.rnn.static_rnn(basic_cell, X_seqs, dtype=tf.float32)
    outputs = tf.transpose(tf.stack(output_seqs), perm=[1, 0, 2])

    init = tf.global_variables_initializer()
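
The transpose/unstack/stack round trip may be easier to follow on a concrete array. The sketch below uses NumPy as a stand-in for the tensorflow operations (an analogy, not the tensorflow code itself), with small sizes assumed for illustration.

```python
import numpy as np

batch, n_steps, n_inputs = 4, 2, 3     # illustrative sizes (assumed)
X = np.arange(batch * n_steps * n_inputs, dtype=float).reshape(batch, n_steps, n_inputs)

# Analog of tf.unstack(tf.transpose(X, perm=[1, 0, 2])):
# move the time axis to the front, then split along it.
X_seqs = list(np.transpose(X, (1, 0, 2)))   # n_steps arrays of shape (batch, n_inputs)

# Analog of tf.transpose(tf.stack(output_seqs), perm=[1, 0, 2]):
# stack the per-step arrays and move the batch axis back to the front.
restored = np.transpose(np.stack(X_seqs), (1, 0, 2))
```

Here X_seqs[t] holds the values of all instances at time step t, which is exactly the form static_rnn expects for each element of its input list.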

You can imagine that as the number of time steps increases, the compute graph constructed by static_rnn becomes more and more complex, because the basic_rnn_cell scope contains more and more cascaded cells. It is called a static RNN because every node in the graph is executed just once whenever you predict or train the model. For a large number of time steps, we can use dynamic_rnn to build a more concise graph:

    n_steps = 20
    n_neurons = 5
    n_inputs = 3

    X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
    # The cell class lives in tf.nn.rnn_cell (not tf.nn.rnn)
    basic_cell = tf.nn.rnn_cell.BasicRNNCell(num_units=n_neurons)
    outputs, states = tf.nn.dynamic_rnn(basic_cell, X, dtype=tf.float32)

X contains a batch of instances. Each instance is a series of feature vectors (3 features each) at different time steps (20). We feed the RNN one batch at a time. The output contains one entry per input instance in the batch, and each entry is itself a series of output vectors (5 values each) at the 20 time steps. This produces the following compute graph:

There is only one input to the rnn scope. The shape of the input is (unknown number of instances, 2 time steps, 3 features). However, the rnn scope is now rather complex compared to the static RNN:

To understand what is under the hood, you need to be familiar with various tensorflow operations.

Tensorflow range operation outputs a 1-D tensor of evenly spaced values, like the python range function.

Tensorflow concat operation can have multiple inputs. One of the inputs is the “axis” parameter of tf.concat() which specifies the axis along which the other inputs are joined. The other inputs form a list that is the “values” parameter of tf.concat().

In the above example, the “axis” input (0) is the “axis” parameter. The output of the ExpandDims operation, [50], is the first element of the input list, and the “Const” input, [5], is the second element. The concat operation joins the two lists into a new list, [50,5].

Tensorflow strided_slice operation gets a slice from a list. tf.strided_slice(input, begin, end, stride) takes a slice of input from begin up to (but not including) end, with a step of stride. Note that begin, end and stride are all 1-D tensors (not scalars) with only one element here. For example, in the following graph, the “stack” input ([0]) is the “begin” parameter, the “stack_1” input ([1]) is the “end” parameter, the “stack_2” input ([1]) is the “stride” parameter, and the “Shape” input ([50,3]; the number 2 along the edge is the shape of that tensor) is the “input” parameter. The operation takes the elements from index 0 up to (but not including) index 1 of the input, which is input[0], i.e., the scalar 50.

Tensorflow ExpandDims operation adds a new dimension for the input tensor. It is created by the tf.expand_dims(input, axis) function.

In the above example, the “dim” input (0) is the axis parameter, the output of the strided_slice(50) is the “input” parameter. The ExpandDims operation expands a scalar to a 1d tensor with only one element 50.
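
The chain Shape → strided_slice → ExpandDims → concat described above can be imitated with NumPy. This is only an analogy under assumptions: NumPy slicing keeps the sliced axis, so the extra indexing step below stands in for the fact that the tensorflow operation yields a scalar in this graph.

```python
import numpy as np

shape = np.array([50, 3])             # output of the "Shape" node

# strided_slice with begin=[0], end=[1], stride=[1]
sliced = shape[0:1:1]                 # elements from index 0 up to (not including) 1 -> [50]
batch_size = sliced[0]                # assumed analog of the op returning the scalar 50

# ExpandDims with dim=0: scalar -> 1-D array with one element
expanded = np.expand_dims(batch_size, axis=0)       # [50]

# concat along axis 0 of [expanded, Const], where Const is [5]
const = np.array([5])
result = np.concatenate([expanded, const], axis=0)  # [50, 5]
```

The result, [50, 5], is the (instances, neurons) shape that the graph needs in order to create the all-zero initial state.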

Tensorflow TensorArray operation creates an array of Tensors of the input size. You can refer to my post about the details of TensorArray.

Now back to the compute graph for the dynamic RNN:

Basically, the graph is composed of two parts: the left part is the input/output, and the right part is the while_loop.

Look at the left part. The bottom transpose operation transposes the input from (?,2,3) to (2,?,3). Now the first dimension is the time step, the second dimension is the instance, the third dimension is the feature.


The outputs will be stacked by rnn

Now look at the right part, i.e., the while loop scope:

The rnn

The read tensor (the instances at a time) is fed to rnn

Here are the details of the basic_rnn_cell:

As you can see, the basic_rnn_cell combines the input and the output using a concat operation, multiplies the weights, adds the biases, passes through the Tanh function, and outputs the result.

The output is read by rnn

The bottom part of the while scope describes the dynamics of the iteration counter:

The iteration counter counts the output(in this case y0,y1). When the iteration counter >1, the loop exits.

The right part of the while scope describes the dynamics of the time:

The time is used to count and read the inputs, in this case x0 and x1. When time>1, the loop ends. The time and the iteration counter seem redundant; one needs to look into the source code to investigate their actual usage.

Now look at the top-right of the while scope. Here is the only used exit of the loop.

After the loop ends, rnn

Compared to static RNN, the basic rnn cell only occurs once in the dynamic RNN graph. Although extra nodes are added to let the rnn cell execute multiple times in one run, the complexity won’t increase with the number of time steps.

If the input instances have different numbers of time steps, we can set the sequence_length parameter of tf.nn.dynamic_rnn to specify the number of time steps for every instance. The outputs are zero past each instance’s sequence length.
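
The effect of sequence_length can be imitated with a NumPy loop. This is a sketch of the behaviour under assumptions (the function name, the tanh activation and the random data are illustrative), not tensorflow's implementation: past an instance's length the output is zeroed and its state is simply carried forward.

```python
import numpy as np

def rnn_with_lengths(X, seq_len, W_x, W_y, b):
    """Run a basic RNN over X (batch, n_steps, n_inputs), respecting per-instance lengths."""
    batch, n_steps, _ = X.shape
    n_neurons = W_y.shape[0]
    state = np.zeros((batch, n_neurons))
    outputs = np.zeros((batch, n_steps, n_neurons))
    for t in range(n_steps):
        new_state = np.tanh(X[:, t, :] @ W_x + state @ W_y + b)
        active = (t < seq_len)[:, None]                 # (batch, 1): is step t inside the sequence?
        outputs[:, t, :] = np.where(active, new_state, 0.0)
        state = np.where(active, new_state, state)      # finished instances keep their last state
    return outputs, state

rng = np.random.default_rng(1)
n_inputs, n_neurons = 3, 5
W_x = rng.normal(size=(n_inputs, n_neurons))
W_y = rng.normal(size=(n_neurons, n_neurons))
b = np.zeros(n_neurons)

X = rng.normal(size=(2, 4, n_inputs))   # 2 instances, padded to 4 time steps
seq_len = np.array([4, 2])              # the second instance only has 2 real steps
outputs, final_state = rnn_with_lengths(X, seq_len, W_x, W_y, b)
```

For the second instance, the outputs at steps 2 and 3 are all zeros, and its final state equals its output at step 1, the last real step.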

Now look back at the functions to construct the dynamic RNN:

    basic_cell = tf.nn.rnn_cell.BasicRNNCell(num_units=n_neurons)
    outputs, states = tf.nn.dynamic_rnn(basic_cell, X, dtype=tf.float32)

The tf.nn.rnn_cell.BasicRNNCell constructor has only one required parameter, num_units. The tf.nn.dynamic_rnn function requires the input X to have rank 3, but only the third dimension (the input features) needs a specific size. Basically, once you specify the number of neurons and the size of the input features, you can construct the graph of a dynamic RNN. You can then provide varying numbers of steps and instances when running the graph.

### Apply RNN to MNIST

    n_steps = 28
    n_inputs = 28
    n_neurons = 150
    n_outputs = 10

    # reset_graph() and shuffle_batch() are helper functions that are
    # not shown in this post; they are assumed to be defined elsewhere.
    reset_graph()

    learning_rate = 0.001

    X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
    y = tf.placeholder(tf.int32, [None])

    basic_cell = tf.nn.rnn_cell.BasicRNNCell(num_units=n_neurons)
    outputs, states = tf.nn.dynamic_rnn(basic_cell, X, dtype=tf.float32)

    logits = tf.layers.dense(states, n_outputs)
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    loss = tf.reduce_mean(xentropy)
    optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
    training_op = optimizer.minimize(loss)
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

    # init was used below without being defined
    init = tf.global_variables_initializer()

    (X_train, y_train), (X_test, y_test) = tf.keras.datasets.mnist.load_data()
    X_train = X_train.astype(np.float32).reshape(-1, 28*28) / 255.0
    X_test = X_test.astype(np.float32).reshape(-1, 28*28) / 255.0
    y_train = y_train.astype(np.int32)
    y_test = y_test.astype(np.int32)
    X_valid, X_train = X_train[:5000], X_train[5000:]
    y_valid, y_train = y_train[:5000], y_train[5000:]
    X_test = X_test.reshape((-1, n_steps, n_inputs))

    n_epochs = 100
    batch_size = 150

    with tf.Session() as sess:
        init.run()
        for epoch in range(n_epochs):
            for X_batch, y_batch in shuffle_batch(X_train, y_train, batch_size):
                X_batch = X_batch.reshape((-1, n_steps, n_inputs))
                sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
            acc_batch = accuracy.eval(feed_dict={X: X_batch, y: y_batch})
            acc_test = accuracy.eval(feed_dict={X: X_test, y: y_test})
            print(epoch, "Last batch accuracy:", acc_batch, "Test accuracy:", acc_test)

In this model, only the final output (the output of the last time step, i.e., the state) is connected to the following fully connected layer. Note that dynamic_rnn, like static_rnn, also returns the outputs of every time step (the outputs tensor), so these could be connected to a following network as well; the time series example later in this post uses them. The final output (state) is of shape (?,150), and the output of the dense layer is of shape (?,10).

The output of the dense layer is fed into sparse_softmax_cross_entropy_with_logits together with the labels y of the input instances. The output of sparse_softmax_cross_entropy_with_logits is a vector, each component of which is the cross-entropy of one input instance. This requires the dense layer to have 10 outputs, where each output (a logit) is an unnormalized score for the class it represents; the softmax turns these scores into probabilities.

Note that the whole network before sparse_softmax_cross_entropy_with_logits is the complete model that can be used to predict. sparse_softmax_cross_entropy_with_logits, reduce_mean, etc. are only used to compute the loss function that is used to train the model.

The model approximates probability-like functions (the logits) of the input space. According to the width-bounded version of the universal approximation theorem, a fully connected ReLU network needs a width of at most 28*28+4=788 neurons (given enough depth) to approximate such functions. However, since our RNN is not a fully connected network, we use more neurons (150*28=4200, counting the 150 recurrent neurons once per time step) to approximate them.
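
To make the loss and accuracy computations concrete, here is a NumPy sketch of what the cross-entropy and top-1 accuracy nodes compute. The function name and the toy logits are assumptions for illustration, and only 3 classes are used instead of 10 to keep the numbers readable.

```python
import numpy as np

def sparse_softmax_cross_entropy(labels, logits):
    """Per-instance cross-entropy between integer labels and softmax(logits)."""
    shifted = logits - logits.max(axis=1, keepdims=True)    # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels]       # pick the true class of each row

logits = np.array([[2.0, 0.5, -1.0],     # instance 0: highest score for class 0
                   [0.1, 0.2, 3.0]])     # instance 1: highest score for class 2
labels = np.array([0, 1])                # true classes

xentropy = sparse_softmax_cross_entropy(labels, logits)     # one value per instance
loss = xentropy.mean()                                       # analog of reduce_mean
correct = logits.argmax(axis=1) == labels                    # analog of in_top_k(logits, y, 1)
accuracy = correct.mean()
```

The first instance is classified correctly and the second is not, so the accuracy of this toy batch is 0.5, while the loss is the mean of the two per-instance cross-entropies.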

Note that X_train is of shape (55000,784). When we extract a batch from it, we need to reshape it to (150,28,28) to cater for the format of the input of the RNN. After an epoch, we evaluate the accuracy of the model using the last batch of the training data and the whole test data.

### Use RNN to predict time series

    import matplotlib
    import matplotlib.pyplot as plt

    t_min, t_max = 0, 30
    resolution = 0.1

    def time_series(t):
        return t * np.sin(t) / 3 + 2 * np.sin(t*5)

    def next_batch(batch_size, n_steps):
        t0 = np.random.rand(batch_size, 1) * (t_max - t_min - n_steps * resolution)
        Ts = t0 + np.arange(0., n_steps + 1) * resolution
        ys = time_series(Ts)
        return ys[:, :-1].reshape(-1, n_steps, 1), ys[:, 1:].reshape(-1, n_steps, 1)

    n_steps = 20
    n_inputs = 1
    n_neurons = 100
    n_outputs = 1

    X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
    y = tf.placeholder(tf.float32, [None, n_steps, n_outputs])

    cell = tf.contrib.rnn.OutputProjectionWrapper(
        tf.nn.rnn_cell.BasicRNNCell(num_units=n_neurons, activation=tf.nn.relu),
        output_size=n_outputs)
    outputs, states = tf.nn.dynamic_rnn(cell, X, dtype=tf.float32)

    learning_rate = 0.001

    loss = tf.reduce_mean(tf.square(outputs - y))  # MSE
    optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
    training_op = optimizer.minimize(loss)

    init = tf.global_variables_initializer()

    n_iterations = 1500
    batch_size = 50

    with tf.Session() as sess:
        init.run()
        for iteration in range(n_iterations):
            X_batch, y_batch = next_batch(batch_size, n_steps)
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
            if iteration % 100 == 0:
                mse = loss.eval(feed_dict={X: X_batch, y: y_batch})
                print(iteration, "\tMSE:", mse)

The hard part of this code is the next_batch(batch_size, n_steps) function. np.random.rand(batch_size, 1) generates a (50,1) array whose elements are random numbers in [0,1). Multiplying it by (t_max - t_min - n_steps * resolution) = 28 scales them to [0,28), so t0 is a (50,1) array, each element of which is the start time of a time series. np.arange(0., n_steps + 1) * resolution is an array of 21 evenly spaced numbers from 0 to 2.0.

What is array (50,1) + array (21,)? By broadcasting, it is an array (50,21). Each row is formed by the number in that row of the first array plus the numbers in the second array. So the result is 50 series of time points starting at different (random) times. ys is also (50,21).

ys[:,:-1] extracts the first 20 columns of the array, so ys[:,:-1].reshape(-1, n_steps, 1) is (50,20,1): the 50 instances in a batch (X_batch), each with 20 time steps of scalars. ys[:,1:].reshape(-1, n_steps, 1) extracts the last 20 columns of ys and reshapes them into an array of (50,20,1): the 50 targets (each with 20 time steps of scalars) for the 50 instances, i.e., y_batch.

With X_batch and y_batch, we can run a training step. We run 1500 training steps in total. After every 100 training steps, we print the loss of the current batch to see the progress.
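
The shapes described above can be verified by running the body of next_batch step by step in NumPy. The intermediate name offsets is introduced here only for readability; everything else follows the code of next_batch.

```python
import numpy as np

t_min, t_max, resolution = 0, 30, 0.1
batch_size, n_steps = 50, 20

def time_series(t):
    return t * np.sin(t) / 3 + 2 * np.sin(t * 5)

t0 = np.random.rand(batch_size, 1) * (t_max - t_min - n_steps * resolution)  # (50, 1) start times
offsets = np.arange(0., n_steps + 1) * resolution                             # (21,) time offsets
Ts = t0 + offsets                        # broadcasting: (50, 1) + (21,) -> (50, 21)
ys = time_series(Ts)                     # (50, 21) series values

X_batch = ys[:, :-1].reshape(-1, n_steps, 1)   # values at steps 0..19
y_batch = ys[:, 1:].reshape(-1, n_steps, 1)    # values at steps 1..20 (shifted by one step)
```

Each target series is the input series shifted one step into the future: y_batch[:, i] equals X_batch[:, i+1] for every step they share.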

Note that in this model, we feed in a whole time series to get a whole time series in one shot, e.g., we feed x0,x1,…,x19 to get y0,y1,…,y19, which are the predictions of x1,x2,…,x20. That does not mean we use future data to predict current data: yi is predicted from x0,x1,…,xi only. Note also that although we use OutputProjectionWrapper to combine the 100 outputs of the neurons (after the 100 ReLUs) into one output through a linear layer (no ReLU thereafter), the tensor fed back to the neurons at the next time step is the 100 outputs (after the 100 ReLUs), not the single final output.

The Creative RNN gives us some inspiration:

    # Assumes sess is an active session holding the trained model
    sequence = [0.] * n_steps
    for iteration in range(300):
        X_batch = np.array(sequence[-n_steps:]).reshape(1, n_steps, 1)
        y_pred = sess.run(outputs, feed_dict={X: X_batch})
        sequence.append(y_pred[0, -1, 0])

Note that [0.]*n_steps produces a list of 20 elements (all zeros). This is different from multiplying a numpy array by a scalar, which does not change the size of the array. Although we feed the model an all-zero series, the generated sequence resembles the time series used to train the model. In other words, the model uses its parameters (weights and biases) to memorize information about the training data.
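
The difference between list repetition and array multiplication, and the sliding-window pattern of the generation loop, can be tried without a trained model. In the sketch below, predict_next is a hypothetical stand-in for the RNN (it is not part of the original code); it only serves to show how the window moves.

```python
import numpy as np

n_steps = 20
sequence = [0.] * n_steps            # list repetition: a list of 20 zeros
scaled = np.array([0.]) * n_steps    # element-wise product: still a 1-element array

def predict_next(window):
    """Hypothetical stand-in for the trained model: mean of the window plus 1."""
    return float(np.mean(window)) + 1.0

for _ in range(3):
    window = sequence[-n_steps:]             # always the latest n_steps values
    sequence.append(predict_next(window))    # the prediction becomes part of the next window
```

After three iterations the list has grown from 20 to 23 elements, and each new element was computed from a window that already contains the previous predictions.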

Until now, the RNN we consider is composed of one layer of recurrent neurons. We can construct a multi-layer RNN (deep RNN) using the following code:

    n_neurons = 100
    n_layers = 3

    layers = [tf.contrib.rnn.BasicRNNCell(num_units=n_neurons, activation=tf.nn.relu)
              for layer in range(n_layers)]
    multi_layer_cell = tf.contrib.rnn.MultiRNNCell(layers)
    outputs, states = tf.nn.dynamic_rnn(multi_layer_cell, X, dtype=tf.float32)
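
What a stacked cell does at a single time step can be sketched in NumPy. The function name, the random weights and the batch size are assumptions for illustration: each layer keeps its own state, and the output of one layer is the input of the next.

```python
import numpy as np

def deep_rnn_step(x_t, states, params):
    """One time step through stacked layers; returns the top output and the new states."""
    new_states = []
    layer_input = x_t
    for (W_x, W_y, b), state in zip(params, states):
        state = np.maximum(0.0, layer_input @ W_x + state @ W_y + b)   # ReLU activation
        new_states.append(state)
        layer_input = state          # the output of this layer feeds the next layer
    return layer_input, new_states

rng = np.random.default_rng(0)
batch, n_inputs, n_neurons, n_layers = 2, 1, 100, 3
params = []
for layer in range(n_layers):
    in_size = n_inputs if layer == 0 else n_neurons   # only the first layer sees the raw input
    params.append((rng.normal(scale=0.1, size=(in_size, n_neurons)),
                   rng.normal(scale=0.1, size=(n_neurons, n_neurons)),
                   np.zeros(n_neurons)))

states = [np.zeros((batch, n_neurons)) for _ in range(n_layers)]
x_t = rng.normal(size=(batch, n_inputs))
output, states = deep_rnn_step(x_t, states, params)
```

The returned states list has one entry per layer, which mirrors the fact that the states returned for a multi-layer cell contain one state per layer.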

You can distribute layers to different devices by reimplementing the basic cell:

    class DeviceCellWrapper(tf.contrib.rnn.RNNCell):
        def __init__(self, device, cell):
            self._cell = cell
            self._device = device

        @property
        def state_size(self):
            return self._cell.state_size

        @property
        def output_size(self):
            return self._cell.output_size

        def __call__(self, inputs, state, scope=None):
            # Create the wrapped cell's operations on this wrapper's device
            with tf.device(self._device):
                return self._cell(inputs, state, scope)

    devices = ["/gpu:0", "/gpu:1", "/gpu:2"]
    cells = [DeviceCellWrapper(dev, tf.contrib.rnn.BasicRNNCell(num_units=n_neurons))
             for dev in devices]
    multi_layer_cell = tf.contrib.rnn.MultiRNNCell(cells)
    outputs, states = tf.nn.dynamic_rnn(multi_layer_cell, X, dtype=tf.float32)

Note that although the new class inherits from tf.contrib.rnn.RNNCell, it does not rely on the parent class’s implementation (it never calls the parent’s methods). It is rather a proxy, or wrapper, around another cell. When tf.nn.dynamic_rnn calls the wrapper, the wrapped cell’s operations are created on the wrapper’s device. This way, the different layers are pinned to different devices.

You can apply dropout to the inputs or outputs of an RNN layer by wrapping the basic RNN cell with tf.contrib.rnn.DropoutWrapper:

    keep_prob = 0.5

    cells = [tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
             for layer in range(n_layers)]
    cells_drop = [tf.contrib.rnn.DropoutWrapper(cell, input_keep_prob=keep_prob)
                  for cell in cells]
    multi_layer_cell = tf.contrib.rnn.MultiRNNCell(cells_drop)
    rnn_outputs, states = tf.nn.dynamic_rnn(multi_layer_cell, X, dtype=tf.float32)
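
The effect of input dropout on one layer's input can be sketched in NumPy. The dropout function below is an illustrative stand-in written for this post, not the wrapper's code: each input value is kept with probability keep_prob and the kept values are scaled by 1/keep_prob, so that the expected value of the input is unchanged.

```python
import numpy as np

def dropout(x, keep_prob, rng):
    """Keep each element with probability keep_prob and rescale the kept ones."""
    mask = rng.random(x.shape) < keep_prob
    return np.where(mask, x / keep_prob, 0.0)

rng = np.random.default_rng(0)
x = np.ones((4, 3))                       # toy layer input: 4 instances, 3 features
dropped = dropout(x, keep_prob=0.5, rng=rng)
```

With keep_prob=0.5, every element of dropped is either 0 (dropped) or 2 (kept and rescaled). Dropout should only be active during training, so at prediction time the inputs are passed through unchanged.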

Even when an RNN deals with only moderately long sequences, the unrolled network is very deep and faces problems such as vanishing gradients. One remedy is to use BasicLSTMCell to construct the RNN:

    X0 = tf.placeholder(tf.float32, [None, n_inputs])
    X1 = tf.placeholder(tf.float32, [None, n_inputs])
    X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
    X_seqs = tf.unstack(tf.transpose(X, perm=[1, 0, 2]))

    basic_cell = tf.contrib.rnn.BasicLSTMCell(num_units=n_neurons)
    output_seqs, states = tf.contrib.rnn.static_rnn(basic_cell, X_seqs, dtype=tf.float32)
    outputs = tf.transpose(tf.stack(output_seqs), perm=[1, 0, 2])

Here we use static_rnn instead of dynamic_rnn to make the graph simpler.

The basic_lstm_cell sub-graph:

The left part is the unrolled LSTM for t0, and the right part is the unrolled LSTM for t1. They share the same kernel and bias. Note that the weights for computing the forget gate controller, the input gate controller, the output gate controller, and the output itself are all put in the same matrix. The same holds for the biases. So you will find that the weight matrix is (8,20) and the bias vector is (20,): the 8 rows correspond to the 3 inputs plus the 5 neurons, and the 20 columns to 4 blocks of 5 neurons. The outputs of these controllers and the output itself are calculated in one shot of matrix operations on the combination of the input and the last state, and then split into 4 tensors of (?,5) corresponding to the respective controllers and the output.
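
The fused kernel and the split into four blocks can be sketched in NumPy. This is a sketch under assumptions, not the library code: the block order (input gate, candidate output, forget gate, output gate) and the forget bias of 1 are assumed here, and the weights are random. It also includes the gating steps that the following paragraphs walk through one by one.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, c_prev, h_prev, kernel, bias, forget_bias=1.0):
    """One LSTM step with all gate weights fused into one kernel."""
    gates = np.concatenate([x_t, h_prev], axis=1) @ kernel + bias   # (m, 4 * n_neurons)
    i, j, f, o = np.split(gates, 4, axis=1)       # four (m, n_neurons) blocks (order assumed)
    c = c_prev * sigmoid(f + forget_bias) + sigmoid(i) * np.tanh(j) # new long-term state
    h = np.tanh(c) * sigmoid(o)                                     # new output
    return c, h

rng = np.random.default_rng(0)
m, n_inputs, n_neurons = 4, 3, 5
kernel = rng.normal(size=(n_inputs + n_neurons, 4 * n_neurons))     # (8, 20)
bias = np.zeros(4 * n_neurons)                                      # (20,)

c0 = np.zeros((m, n_neurons))          # long-term state starts at zero
h0 = np.zeros((m, n_neurons))          # previous output starts at zero
x0 = rng.normal(size=(m, n_inputs))
c1, h1 = lstm_step(x0, c0, h0, kernel, bias)
```

Computing all four blocks with one matrix product is what makes the single (8,20) kernel possible; the split afterwards recovers the individual controllers.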

The output then goes through a Tanh activation function as done in basic RNN cell:

Then the output is multiplied element-wise by the output of the input gate controller:

The filtered output is then added to the long-term state:

The result is the long-term state for the next time step. To get the final output of the current step, this new long-term state goes through another tanh activation function:

and is then multiplied element-wise by the output of the output gate controller:

After being filtered by the output gate controller, the output becomes the final output and one of the inputs (concat_1) of the LSTM cell at the next time step. Here is how the output of the output gate controller is generated:

Here is the data flow to produce the output of the forget gate controller:

Note that after computing the weighted sum of the concatenated inputs, an all-1 constant tensor (Const_2) is added to the result before it goes to the sigmoid operation. This biases the forget gate controller toward keeping the long-term state at the beginning of training, rather than forgetting it. The output of the forget gate controller is multiplied element-wise by the long-term state to selectively forget some neurons’ last state:

Because the long-term state is an all-0 tensor at the first time step, we’d better see the “forget” operation at the second time step:

You can see one of the input of rnn

Here is the tensor flow to produce the output of the input gate controller: