In one line, an RNN is a neural network with a data input, an input from the previous unit, and an output.
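As a minimal sketch (plain numpy; the dimensions and weight names are made up for illustration), one RNN step takes the data input and the previous state and produces a new state and an output:
import numpy as np

x_dim, s_dim, y_dim = 4, 3, 2            # made-up sizes for illustration
W_x = np.random.randn(s_dim, x_dim)      # data input -> hidden state
W_s = np.random.randn(s_dim, s_dim)      # previous hidden state -> hidden state
W_y = np.random.randn(y_dim, s_dim)      # hidden state -> output

def rnn_step(x_t, s_prev):
    """One RNN step: new state from the data input and the previous state, plus an output."""
    s_t = np.tanh(W_x @ x_t + W_s @ s_prev)   # the recurrence
    y_t = W_y @ s_t                           # the output
    return s_t, y_t

s = np.zeros(s_dim)
for x_t in np.random.randn(5, x_dim):         # a toy sequence of 5 inputs
    s, y = rnn_step(x_t, s)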
The following image is from [4]:
from IPython.display import Image
Image(filename='figures/RNN-rolled.png', width=200)
If we unroll this recurrent connection over time $t$, we get the following figure (from [4]):
from IPython.display import Image
Image(filename='figures/RNN-unrolled.png', width=800)
from IPython.display import Image
Image(filename='figures/RNN-longtermdependencies.png', width=600)
TensorFlow controls the truncation length through the "num_steps" variable [7]. The following figure shows the TensorFlow implementation of BPTT with $num\_steps = 3$.
# Placeholder for the inputs in a given iteration.
import tensorflow as tf
batch_size = 1 # process one word at a time
num_steps = 3
words = tf.placeholder(tf.int32, [batch_size, num_steps])
from IPython.display import Image
Image(filename='figures/RNN_tf_truncated_backprop.png', width=600)
This style of BPTT has a problem capturing dependencies beyond "num_steps".
Ilya Sutskever’s Ph.D. thesis [6] and [5] discuss other variants of truncated backpropagation.
from IPython.display import Image
Image(filename='figures/ilya_phd_thesis_bptt_truncate.png', width=600)
As mentioned in [5], the TensorFlow implementation corresponds to the case $k_1 = k_2 = num\_steps$.
If you want to capture longer-range dependencies than $num\_steps$, one way is to set $k_2 \ll k_1$, for example $k_2 = 1$. The following figure shows the case $k_1 = 3$ and $k_2 = 1$ (which I borrowed from [5]):
from IPython.display import Image
Image(filename='figures/RNN_true_truncated_backprop.png', width=600)
I will borrow Ilya Sutskever's words from his PhD thesis [5]: "Consequently, its hidden states have been exposed to many timesteps and so may contain useful information about the far past, which would be opportunistically exploited."
Specifically, in this case, the 3rd RNN cell is exposed two more times than in the TensorFlow implementation.
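To make the scheduling concrete, here is a tiny sketch (plain Python; the helper function is mine) that lists which timesteps each backward pass covers, using the convention above ($k_1$ = how many steps to backpropagate, run every $k_2$ steps):
def truncation_windows(T, k1, k2):
    """Return the (start, end) timestep ranges covered by each BPTT run."""
    windows = []
    for t in range(1, T + 1):           # process one timestep at a time
        if t % k2 == 0:                 # every k2 steps...
            start = max(1, t - k1 + 1)  # ...backpropagate over the last k1 steps
            windows.append((start, t))
    return windows

# TensorFlow-style truncation: k1 = k2 = num_steps
print(truncation_windows(T=6, k1=3, k2=3))  # [(1, 3), (4, 6)]
# "True" truncated BPTT with k1 = 3, k2 = 1: overlapping windows, so each cell is exposed more often
print(truncation_windows(T=6, k1=3, k2=1))  # [(1, 1), (1, 2), (1, 3), (2, 4), (3, 5), (4, 6)]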
The other way to capture longer dependencies is simply to increase $num\_steps$.
The chain rule for derivatives is applied when computing derivatives of a function composed of two or more functions (according to Wikipedia: https://en.wikipedia.org/wiki/Chain_rule).
The chain rule is important when training a neural network because one neuron depends on another. The same holds for RNNs. Let's be more precise and use variables to define RNNs. Check out the following figure (from [1]) to get an intuitive idea of it.
from IPython.display import Image
Image(filename='figures/rnn-bptt1.png', width=600)
where $E_t$ is the error at time $t$. It is typically defined as the categorical cross-entropy, i.e. $$ E_t = -y_t \log(\hat{y_t}) $$ and we sum up the errors so the total loss is $$ E = \frac{1}{N}\sum_{t=1}^N -y_t \log(\hat{y_t}) $$
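For concreteness, a tiny numpy example of the per-timestep error and the total loss (the targets and predictions are made-up numbers, interpreting $y_t$ as a one-hot vector):
import numpy as np

y = np.array([[0, 1, 0], [1, 0, 0]])                   # one-hot targets for t = 1, 2
y_hat = np.array([[0.2, 0.7, 0.1], [0.6, 0.3, 0.1]])   # predicted probabilities

E_t = -np.sum(y * np.log(y_hat), axis=1)   # E_t = -y_t log(yhat_t) at each timestep
E = E_t.mean()                             # average over the N timesteps
print(E_t, E)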
$s_2$ requires $s_1$, which in turn depends on $s_0$.
Similarly, to compute the gradient for one example (one sentence), we sum up the gradients over all time steps $t$:
$$ \frac{\partial E}{\partial{W}} = \sum_t \frac{\partial E_t}{\partial{W}} $$
We will see in a few minutes that, using the chain rule (or its computation graph), each term is a product of partial derivatives over the dependent variables.
The visualization of the chain rule is useful for an intuitive understanding of the vanishing gradient problem. In my opinion, there are two difficulties in understanding the theory behind the vanishing gradient problem:
To get ready to dive deep into these two points, let's do a simple backpropagation example on a simple function (following the step-by-step presentation in [1]).
Assume we are focusing on the following function (which corresponds to the cost function in the neural network training setting):
\begin{equation} f(c, d) = c d \\ c = a + b \\ d = b + 1 \end{equation} The computation graph of the chain rule is the following (image from [1]):
from IPython.display import Image
Image(filename='figures/computation_graph_1.png', width=300)
What is good about this graph? The answer is that you can now compute the derivative of each node by summing, over all paths, the products of the edge derivatives along each path!
from IPython.display import Image
Image(filename='figures/computation_grpah_2_with_derivatives_edges.png', width=300)
The chain rule applied to $f(c, d)$ is \begin{align} \frac{\partial f(c, d)}{\partial a} &= \frac{\partial f}{\partial c} \frac{\partial c}{\partial a} \\ &= d \cdot 1 = d \end{align} and
\begin{align} \frac{\partial f(c, d)}{\partial b} &= \frac{\partial f}{\partial d} \frac{\partial d}{\partial b} + \frac{\partial f}{\partial c} \frac{\partial c}{\partial b} \\ &= c \cdot 1 + d \cdot 1 = c + d \end{align} Notice that this simply follows all possible paths in the graph (bottom to top) from the input variables $a, b$.
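As a quick sanity check of these two results, here is a minimal snippet comparing finite differences with the analytic answers (the example values $a = 2$, $b = 1$ and the step size are my own choices):
def f(a, b):
    c = a + b
    d = b + 1
    return c * d

a, b, eps = 2.0, 1.0, 1e-6
c, d = a + b, b + 1
print((f(a + eps, b) - f(a, b)) / eps, "vs analytic", d)       # df/da = d
print((f(a, b + eps) - f(a, b)) / eps, "vs analytic", c + d)   # df/db = c + d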
Notice that, looking at the equations, some derivatives are shared between the two, namely $\frac{\partial f}{\partial c}$. So if we naively compute the derivative (or gradient, in higher dimensions) for each variable, we end up computing $\frac{\partial f}{\partial c}$ multiple times (well, twice in this case).
Solution? Dynamic programming, or, in this particular application, the fancier name: backpropagation.
Now let's do the opposite, i.e. go from top to bottom in this graph:
from IPython.display import Image
Image(filename='figures/computation_graph_3_back_prop.png', width=300)
CORRECTION: $\frac{\partial f}{\partial b} = c \cdot 1 + d \cdot 1 = c + d$.
So going from top to bottom and from bottom to top is almost the same, except that in the top-to-bottom traversal you traverse each edge only once. This is because the computational graph created by the chain rule can have multiple children per node.
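Here is the same example traversed top to bottom (a minimal sketch in plain Python; the variable names are mine): $\frac{\partial f}{\partial c}$ is computed once and reused, and each edge is used exactly once.
a, b = 2.0, 1.0
c = a + b
d = b + 1
f = c * d

df_df = 1.0
df_dc = df_df * d                    # edge f -> c: df/dc = d
df_dd = df_df * c                    # edge f -> d: df/dd = c
df_da = df_dc * 1.0                  # edge c -> a: dc/da = 1
df_db = df_dc * 1.0 + df_dd * 1.0    # edges c -> b and d -> b: dc/db = dd/db = 1
print(df_da, df_db)                  # d and c + d, i.e. 2.0 and 5.0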
In the neural network case, the bottom nodes (the leaf nodes) are the weight parameters $W$ and the biases $b$. When the number of these parameters is large, this saves a LOT of time when computing the gradient with respect to each parameter.
How is backpropagation related to the vanishing gradient in standard RNNs?
I am not going into the mathematical details since I expect they will be covered in Jordan's class, but I can give an intuitive explanation.
The example I borrowed from [1] is the scalar (one-dimensional) case with a pretty shallow graph. What is the computational graph created by the chain rule for the simple RNN?
from IPython.display import Image
Image(filename='figures/rnn_cell.png', width=300)
When constructing the computational graph for an RNN, the key part is $s_{t-1} \rightarrow s_t$.
Image(filename='figures/computational_graph_4_rnn.png', width=600)
$J$ is a Jacobian matrix, i.e. the matrix of first-order partial derivatives of a vector-valued function. (FYI, the second-order analogue is called the Hessian matrix.)
So remember from our first example that when we backpropagate from the top node to a bottom node, we multiply the partial derivatives corresponding to each edge. Also, notice that there is only one edge from the $t-1$ nodes to the $t$ nodes, namely $s_{t-1} \rightarrow s_{t}$.
In [3], there is a proof that, under certain conditions on the weights, the norm of the Jacobian matrix between $s_{t-1}$ and $s_{t}$ is less than 1. In other words, every time we backpropagate through $s_{t-1} \leftarrow s_{t}$, the gradient value we are currently holding shrinks!
$\therefore$ the standard RNN will not capture long-range dependencies.
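To see this numerically, here is a minimal numpy sketch (my own illustration, not the proof in [3]): with a recurrent weight matrix rescaled so that the Jacobian norm stays below 1, the backpropagated gradient shrinks geometrically with the number of timesteps.
import numpy as np

rng = np.random.default_rng(0)
dim = 50
W = rng.standard_normal((dim, dim))
W *= 0.9 / np.linalg.norm(W, 2)           # rescale so the spectral norm of W is 0.9 (< 1)

grad = rng.standard_normal(dim)           # dE/ds_T at the last timestep
for t in range(20):                       # backpropagate 20 steps
    s = np.tanh(rng.standard_normal(dim))       # some hidden state s_t
    grad = W.T @ ((1 - s**2) * grad)            # grad_{s_{t-1}} = J^T grad_{s_t}, J = diag(1 - s_t^2) W
    if (t + 1) % 5 == 0:
        print(f"{t + 1:2d} steps back: ||grad|| = {np.linalg.norm(grad):.2e}")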
If you want to know more about the ICML 2013 paper and cannot wait for Jordan's lecture in mid-March, I suggest you look at Richard Socher's slides [8] as supplementary material for understanding what is going on.
Image(filename='figures/LSTM3-chain.png', width=600)
Image(filename='figures/lstm_cell.png', width=600)
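The gates and cell update in the figure follow the standard LSTM form. For reference (my notation for the gates $i_t, f_t, o_t$ and candidate cell $\tilde{c}_t$; I reuse this notebook's matrix names $W^{x\_input}, W^{h\_input}$ and guess analogous names for the other gates; $\odot$ is element-wise multiplication; biases omitted):
\begin{align}
i_t &= \sigma(x_t W^{x\_input} + h_{t-1} W^{h\_input}) \\
f_t &= \sigma(x_t W^{x\_forget} + h_{t-1} W^{h\_forget}) \\
o_t &= \sigma(x_t W^{x\_output} + h_{t-1} W^{h\_output}) \\
\tilde{c}_t &= \tanh(x_t W^{x\_cell} + h_{t-1} W^{h\_cell}) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{align}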
where $\sigma$ is a sigmoid function.
The blue arrows represent the input filtered by one of the following weight matrices:
So notice, for example, that if the input word vector $x_i$ is $300$-dimensional and $h_i$ is $100$-dimensional, the dimension of $W^{x\_input}$ is $300 \times 100 = 30,000$, so this single matrix has $30,000$ parameters to train. Furthermore, there are a total of $4$ similar matrices, so we have $30,000 \times 4 = 120,000$ parameters to train for the input word vector $x_i$.
Similarly, $W^{h\_input}$ has $100 \times 100 = 10,000$ parameters to train, and there are a total of $4$ such matrices, so we have $10,000 \times 4 = 40,000$ parameters to train for the hidden unit $h_i$.
To sum up, we have $120,000 + 40,000 = 160,000$ parameters (plus the bias vectors) to train when the input word vector is $300$-dimensional and the hidden unit is $100$-dimensional.
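As a quick sanity check of this arithmetic (plain Python; the dimensions $300$ and $100$ are the example values from the text):
x_dim, h_dim, num_matrices = 300, 100, 4     # input, forget, output, cell-candidate

input_params  = x_dim * h_dim * num_matrices   # 300 * 100 * 4 = 120,000
hidden_params = h_dim * h_dim * num_matrices   # 100 * 100 * 4 =  40,000
print(input_params, hidden_params, input_params + hidden_params)  # 120000 40000 160000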
This number of parameters is huge compared to the simple RNN, which needs only one input matrix and one recurrent matrix.
In the case where the forget gate vector is $(1, 1, ..., 1)$, the cell simply copies the cell vector from the previous time step. This is different from the simple RNN, where the derivative along the edge between $s_t$ and $s_{t-1}$ is $< 1$.
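A minimal numpy sketch of that point (my own illustration; the gate values are made up): the Jacobian of the cell update $c_t = f \odot c_{t-1} + i \odot \tilde{c}_t$ with respect to $c_{t-1}$ is $\mathrm{diag}(f)$, so with $f = (1, \ldots, 1)$ the gradient passes through the cell unchanged.
import numpy as np

dim = 5
c_prev = np.random.randn(dim)
f = np.ones(dim)                     # forget gate fully open
i = np.random.rand(dim)              # input gate
c_tilde = np.tanh(np.random.randn(dim))

c_t = f * c_prev + i * c_tilde       # LSTM cell update
grad_c_t = np.random.randn(dim)      # some upstream gradient dE/dc_t
grad_c_prev = f * grad_c_t           # dE/dc_{t-1} = diag(f) * dE/dc_t
print(np.allclose(grad_c_prev, grad_c_t))   # True: no shrinkage when f = 1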
"Most useful when you have lots and lots of data" [9]
So having $4$ times the number of parameters of the simple RNN seems to be a bit expensive.
A simpler version of the LSTM, though a bit more complex than the simple RNN, is the GRU (the following image is from [8]).
Image(filename='figures/gru_socher_slide.png', width=600)
Now, the weight matrices (or parameter matrices) are
Since there is one fewer matrix than in the LSTM (for the input and for the hidden state each), there are fewer parameters:
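A rough count (my own sketch; I assume the GRU's three weight-matrix pairs, for the update gate, the reset gate, and the candidate state, with the same example dimensions as before):
x_dim, h_dim, num_matrices = 300, 100, 3     # update gate, reset gate, candidate state

gru_params = (x_dim * h_dim + h_dim * h_dim) * num_matrices
print(gru_params)   # (30,000 + 10,000) * 3 = 120,000 parameters (plus biases)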