Goals of this meeting

  1. Intuitive understanding of back propagation
  2. Intuitive understanding of truncated back propagation through time
  3. Intuitive understanding of the vanishing gradient problem
  4. Use LSTM and GRU with Keras

Topics of this meeting

  1. High-level overview of Recurrent Neural Networks (RNNs)
  2. Intuitive Explanation of Back Propagation using the computation graph of the chain rule
  3. Intuitive Explanation of the math behind the Vanishing Gradient Problem
  4. LSTM and GRU

References

Summary

  • Tensorflow-style truncated backpropagation through time does not propagate the error from future time steps $t$ with $t > num\_steps$, as long as the implementation follows the Tensorflow tutorial documentation [7].
  • Back propagation is a "technique of calculating a gradient descent quickly" [1]. In my opinion, it is essentially dynamic programming.
  • Visualizing the chain rule with a computation graph is helpful for an intuitive understanding of the vanishing gradient problem that occurs in a vanilla RNN.
  • To mitigate the vanishing gradient problem, one approach is to use LSTMs or GRUs.

1. High-level overview of Recurrent Neural Networks (RNNs)

In one line, an RNN is a neural network whose unit takes a data input plus the input from the previous unit, and produces an output.

  • $x_t$: Input. Typically it is a word vector, but it can also be a scalar.
  • $h_t$: Output. Typically it is a vector, but it can also be a scalar.
  • $A$: Recurrent unit or recurrent function. It combines $x_t$, the previous output $h_{t-1}$, and a bias $b$ (a linear combination followed by an activation; a minimal sketch follows below).
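
To make the recurrence concrete, here is a minimal scalar sketch (my own toy numbers, not from the references): each output $h_t$ depends on the current input $x_t$ and on the previous output $h_{t-1}$.

In [ ]:
import math

xs = [1.0, 0.5, -0.3, 0.8]       # made-up scalar inputs x_1..x_4
w_x, w_h, b = 0.6, 0.9, 0.0      # made-up scalar "weights" and bias

h = 0.0                          # initial output h_0
for t, x_t in enumerate(xs, 1):
    h = math.tanh(w_x * x_t + w_h * h + b)   # A combines x_t and the previous h
    print('h_%d = %.3f' % (t, h))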

The following image is from [4]:

In [3]:
from IPython.display import Image
Image(filename='figures/RNN-rolled.png', width=200)
Out[3]:

If we unroll this recurrence connection according to time $t$, then we get the following figure (from [4]):

In [8]:
from IPython.display import Image
Image(filename='figures/RNN-unrolled.png', width=800)
Out[8]:

Training an RNN model

Capturing long-range dependencies through Back Propagation Through Time (BPTT)

In [17]:
from IPython.display import Image
Image(filename='figures/RNN-longtermdependencies.png', width=600)
Out[17]:

Truncation of Back Propagation Through Time

Tensorflow controls the truncation length through the "num_steps" variable [7]. The following figure shows the Tensorflow implementation of BPTT with $num\_steps = 3$.

In [16]:
# Placeholder for the inputs in a given iteration.
import tensorflow as tf
batch_size = 1 # process one sequence at a time
num_steps = 3
words = tf.placeholder(tf.int32, [batch_size, num_steps])
In [9]:
from IPython.display import Image
Image(filename='figures/RNN_tf_truncated_backprop.png', width=600)
Out[9]:

This style of BPTT has a problem capturing dependencies beyond "num_steps".

Ilya Sutskever's Ph.D. thesis [6] and [5] discuss other variants of truncated back propagation.

In [10]:
from IPython.display import Image
Image(filename='figures/ilya_phd_thesis_bptt_truncate.png', width=600)
Out[10]:

As mentioned in [5], the Tensorflow implementation is when $k_1 = k_2 = num\_steps$.

If you want to make better use of dependencies beyond a single $num\_steps$ window, one alternative is to set $k_1 \ll k_2$, for example $k_1 = 1$, i.e. run backpropagation for $k_2$ steps at every time step. The following figure shows the case $k_1 = 1$ and $k_2 = 3$ (which I borrowed from [5]):

In [12]:
from IPython.display import Image
Image(filename='figures/RNN_true_truncated_backprop.png', width=600)
Out[12]:

To borrow Ilya Sutskever's words from his Ph.D. thesis [6]: "Consequently, its hidden states have been exposed to many timesteps and so may contain useful information about the far past, which would be opportunistically exploited."

Specifically, in this case, the 3rd RNN cell receives error signals from 2 more time steps than in the Tensorflow implementation.

The other way to capture longer dependencies is simply to increase $num\_steps$.

2. Intuitive Explanation of Back Propagation using a Computation Graph

Why is the chain rule of derivatives important for training neural networks?

The chain rule for derivatives is applied when computing the derivative of a function composed of two or more functions (according to Wikipedia https://en.wikipedia.org/wiki/Chain_rule)

The chain rule is important when training a neural network because one neuron depends on another. The same is true for RNNs. Let's be more precise and define RNNs with variables. Check out the following figure (from [1]) to get an intuitive idea of it.

In [14]:
from IPython.display import Image
Image(filename='figures/rnn-bptt1.png', width=600)
Out[14]:

where $E_t$ is the error at time $t$. It is typically defined as the categorical cross-entropy, i.e. $$ E_t = -y_t \log(\hat{y_t}) $$ and we sum up (and average) the errors over time, so the total loss is $$ E = \frac{1}{N}\sum_t^N -y_t \log(\hat{y_t}) $$

$s_2$ requires $s_1$, which in turn depends on $s_0$.

Similarly, to compute the gradient for one example or one sentence, we sum up the gradients over all time steps $t$:

$$ \frac{\partial E}{\partial{W}} = \sum_t \frac{\partial E_t}{\partial{W}} $$

We will see in a few minutes that, using the chain rule (or the computation graph of the chain rule), each term is a product of partial derivatives over the dependent variables.

Visualization of the Chain Rule using a Computation Graph [2]

The visualization of the chain rule is useful for an intuitive understanding of the vanishing gradient problem. In my opinion, there are two difficulties in understanding the theory behind the vanishing gradient problem:

  1. Computing the gradient of the cost function involves many parameters that depend on each other, so we have to apply the chain rule.
  2. Computing the gradient involves vector calculus, which is harder to follow than scalar calculus.

To get ready to dive into these two points, let's do a simple back propagation example on a simple function (following the steps presented in [1]).

Assume we are focusing on the following function (which corresponds to the cost function in the neural network training settings):

\begin{equation} f(c, d) = c d \\ c = a + b \\ d = b + 1 \\ \end{equation}

The computation graph of the chain rule is the following (image from [1]):

In [16]:
from IPython.display import Image
Image(filename='figures/computation_graph_1.png', width=300)
Out[16]:

What is good about this graph? The answer is that you can now compute the derivative with respect to each node by summing up the products of derivatives along each path!

In [17]:
from IPython.display import Image
Image(filename='figures/computation_grpah_2_with_derivatives_edges.png', width=300)
Out[17]:

The chain rule applied to $f(c, d)$ is \begin{align} \frac{\partial f(c, d)}{\partial a} &= \frac{\partial f}{\partial c} \frac{\partial c}{\partial a} \\ &= d \cdot 1 = d \end{align} and

\begin{align} \frac{\partial f(c, d)}{\partial b} &= \frac{\partial f}{\partial d} \frac{\partial d}{\partial b} + \frac{\partial f}{\partial c} \frac{\partial c}{\partial b} \\ &= c \cdot 1 + d \cdot 1 = c + d \end{align}

Notice that this simply follows all possible paths in the graph (bottom to top) from the input variables $a, b$.
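
As a quick sanity check (my own addition, using sympy), symbolic differentiation gives the same result as summing the products of derivatives over all paths:

In [ ]:
import sympy as sp

a, b = sp.symbols('a b')
c = a + b
d = b + 1
f = c * d

print(sp.diff(f, a))   # b + 1, i.e. d        (single path a -> c -> f)
print(sp.diff(f, b))   # a + 2*b + 1, i.e. c + d  (two paths: b -> c -> f and b -> d -> f)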

Back propagation on the Computation Graph

Notice that, looking at the equations, there is a derivative shared between the two of them, namely $\frac{\partial f}{\partial c}$. So if we naively compute the derivatives (or gradients, in higher dimensions) for each variable, we end up computing $\frac{\partial f}{\partial c}$ multiple times (well, twice in this case).

Solution? Dynamic programming, which in this particular application goes by the fancier name of back propagation.

Now let's do the opposite, i.e. go from top to bottom in this graph:

In [20]:
from IPython.display import Image
Image(filename='figures/computation_graph_3_back_prop.png', width=300)
Out[20]:

CORRECTION: $\frac{\partial f}{\partial b} = c \cdot 1 + d \cdot 1 = c + d$.

Going from top to bottom is almost the same as going from bottom to top, except that in the top-to-bottom traversal you traverse each edge only once. This matters because the computational graph created by the chain rule can have multiple children per node, so shared derivatives are reused instead of recomputed.
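
Here is a tiny numerical sketch of the top-to-bottom traversal for this example; the input values $a = 2$, $b = 1$ are made up for illustration:

In [ ]:
# Forward pass (bottom to top)
a_val, b_val = 2.0, 1.0
c_val = a_val + b_val          # c = a + b = 3
d_val = b_val + 1.0            # d = b + 1 = 2
f_val = c_val * d_val          # f = c * d = 6

# Backward pass (top to bottom): each edge is visited exactly once,
# and shared derivatives such as df/dc are computed only once.
df_dc = d_val                  # df/dc = d
df_dd = c_val                  # df/dd = c
df_da = df_dc * 1.0            # dc/da = 1  ->  df/da = d = 2
df_db = df_dc * 1.0 + df_dd * 1.0   # two incoming paths -> df/db = d + c = 5
print(df_da, df_db)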

In the neural network case, the bottom nodes (the leaf nodes) are the weight parameters $W$ and the biases $b$. When the number of these parameters is large, this saves a LOT of time when computing the gradient with respect to each parameter.

3. Intuitive Explanation of the math behind the Vanishing Gradient Problem

How is back propagation related to the vanishing gradient in standard RNNs?

I am not going into the mathematical details since I expect they will be covered in Jordan's class, but I can give an intuitive explanation.

The example I borrowed from [1] is the scalar (one-dimensional) case with a pretty shallow graph. What does the computational graph created by the chain rule look like for a simple RNN?

Simple Recurrent Neural Network (RNN)

In [7]:
from IPython.display import Image
Image(filename='figures/rnn_cell.png', width=300)
Out[7]:

List of variables and what they mean

  • $s_i$: State vector with dimension $d_s$
  • $x_i$: Input vector with dimension $d_x$
  • $y_i$: Output vector
  • $W^x$: Weight matrix for the input vector $x_i$ with dimension $d_x \times d_s$
  • $W^s$: Weight matrix for the state transition between state $s_{i-1}$ and $s_i$ with dimension $d_s \times d_s$
  • $R$: Recurrent function.
  • $b$: bias vector
  • $\sigma$: activation function (typically $tanh$ or ReLU)

Definition of $y_i$ and $s_i$

  • $s_i = R(x_i, s_{i-1}, b) = \sigma(W^x x_i + W^s s_{i-1} + b)$
  • $y_i = s_i$, or a softmax applied to $s_i$ for classification problems (a minimal sketch of one step follows below)
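
Here is a minimal NumPy sketch of a single step of this recurrence with $\sigma = \tanh$ and made-up dimensions; I treat $x_i$ and $s_i$ as row vectors so the weight shapes match the $d_x \times d_s$ and $d_s \times d_s$ dimensions listed above.

In [ ]:
import numpy as np

np.random.seed(0)
d_x, d_s = 4, 3                        # made-up dimensions
W_x = np.random.randn(d_x, d_s) * 0.1  # W^x, shape d_x x d_s
W_s = np.random.randn(d_s, d_s) * 0.1  # W^s, shape d_s x d_s
b = np.zeros(d_s)

x_i = np.random.randn(d_x)             # current input
s_prev = np.zeros(d_s)                 # s_{i-1}

s_i = np.tanh(x_i @ W_x + s_prev @ W_s + b)   # s_i = sigma(W^x x_i + W^s s_{i-1} + b)
y_i = s_i                                     # y_i = s_i (no softmax here)
print(s_i)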

When constructing the computational graph for the RNN, the key part is the edge $s_{t-1} \rightarrow s_t$.

In [21]:
Image(filename='figures/computational_graph_4_rnn.png', width=600)
Out[21]:

$J$ is the Jacobian matrix, the vector extension of the first-order derivative, i.e. the matrix of all first-order partial derivatives. (FYI, the second-order analogue is called the Hessian matrix.)

Remember from our first example that when we backpropagate from the top node to the bottom nodes, we multiply the partial derivatives that correspond to each edge. Also, notice that there is only one edge from the $t-1$ nodes to the $t$ nodes, namely $s_{t-1} \rightarrow s_{t}$.

In [3], there is a proof that, under conditions on the recurrent weight matrix and the activation, the norm of the Jacobian matrix between $s_{t-1}$ and $s_{t}$ is bounded by a value less than 1. In other words, every time we backpropagate through the edge $s_{t-1} \leftarrow s_{t}$, the gradient we are currently holding shrinks!

$\therefore$ the standard RNN struggles to capture long-range dependencies.
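
To see this numerically, here is a small sketch of my own (not the proof in [3]): I scale a random $W^s$ so that its largest singular value is $0.9$, which makes the norm of every Jacobian along $s_{t-1} \rightarrow s_t$ smaller than 1, so the backpropagated gradient norm keeps shrinking as we go further back in time.

In [ ]:
import numpy as np

np.random.seed(0)
d_s = 100
W_s = np.random.randn(d_s, d_s)
W_s *= 0.9 / np.linalg.norm(W_s, 2)   # scale so the largest singular value is 0.9

# Forward pass over 50 time steps (inputs omitted; a small constant keeps s nonzero)
states = []
s = np.zeros(d_s)
for t in range(50):
    s = np.tanh(W_s @ s + 0.1)
    states.append(s)

# Backward pass: multiply by the transpose of J = diag(1 - s_t^2) @ W_s at each step
grad = np.ones(d_s)
for t, s_t in reversed(list(enumerate(states))):
    grad = W_s.T @ ((1 - s_t ** 2) * grad)
    if t % 10 == 0:
        print('||grad|| after backpropagating to step %d: %.2e' % (t, np.linalg.norm(grad)))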

If you want to know more about the ICML 2013 paper and cannot wait for Jordan's lecture in mid March, then I suggest you look into Richard Socher's slides [8] as a supplementary material for understanding what is going on.

4. Long Short Term Memory (LSTM)

Overview of LSTM

Cleaner diagram from [4] but less intuitive for me

In [13]:
Image(filename='figures/LSTM3-chain.png', width=600)
Out[13]:

Messier, but more useful figure (for me)

In [12]:
Image(filename='figures/lstm_cell.png', width=600)
Out[12]:

where

  • $c_i$: memory cell vector
  • $h_i$: state vector
  • $g$: candidate output vector
  • $input\_gate$: input gate vector to do componentwise multiplication with the candidate output vector $g$.
  • $output\_gate$: output gate vector to do componentwise multiplication with the memory cell vector $c_i$.
  • $forget\_gate$: forget gate vector to do componentwise multiplication with the previous memory cell vector $c_{i-1}$.
Definitions of each gate
  • $input\_gate = \sigma(W^{x\_input} x_i + h_{i-1}W^{h\_input})$
  • $output\_gate = \sigma(W^{x\_output} x_i + h_{i-1}W^{h\_output})$
  • $forget\_gate = \sigma(W^{x\_forget} x_i + h_{i-1}W^{h\_forget})$

where $\sigma$ is a sigmoid function.

Definitions of the candidate output vector $g$
  • $g = \tanh(W^{x\_candidate} x_i + h_{i-1}W^{h\_candidate})$ (a sketch of a full LSTM step follows below)
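
Putting the gates together, here is a minimal NumPy sketch of one LSTM step. The cell update $c_i = forget\_gate \odot c_{i-1} + input\_gate \odot g$ and $h_i = output\_gate \odot \tanh(c_i)$ is my assumption of the standard formulation (it matches the componentwise multiplications described above); biases are omitted, as in the gate definitions.

In [ ]:
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

np.random.seed(0)
d_x, d_h = 4, 3                                   # made-up dimensions
Wx = {k: np.random.randn(d_x, d_h) * 0.1 for k in ['input', 'output', 'forget', 'candidate']}
Wh = {k: np.random.randn(d_h, d_h) * 0.1 for k in ['input', 'output', 'forget', 'candidate']}

x_i = np.random.randn(d_x)
h_prev = np.zeros(d_h)                            # h_{i-1}
c_prev = np.zeros(d_h)                            # c_{i-1}

input_gate  = sigmoid(x_i @ Wx['input']  + h_prev @ Wh['input'])
output_gate = sigmoid(x_i @ Wx['output'] + h_prev @ Wh['output'])
forget_gate = sigmoid(x_i @ Wx['forget'] + h_prev @ Wh['forget'])
g = np.tanh(x_i @ Wx['candidate'] + h_prev @ Wh['candidate'])

c_i = forget_gate * c_prev + input_gate * g       # componentwise multiplications
h_i = output_gate * np.tanh(c_i)
print(c_i, h_i)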

The blue arrows represent the input filtered by one of the following weight matrices:

  1. $W^{x\_input}$ with dimension $d_x \times d_h$
  2. $W^{x\_forget}$ with dimension $d_x \times d_h$
  3. $W^{x\_output}$ with dimension $d_x \times d_h$
  4. $W^{x\_candidate}$ with dimension $d_x \times d_h$
  5. $W^{h\_input}$ with dimension $d_h \times d_h$
  6. $W^{h\_forget}$ with dimension $d_h \times d_h$
  7. $W^{h\_output}$ with dimension $d_h \times d_h$
  8. $W^{h\_candidate}$ with dimension $d_h \times d_h$

Notice, for example, that if the input word vector $x_i$ is $300$-dimensional and $h_i$ is $100$-dimensional, then $W^{x\_input}$ has dimension $300 \times 100$, so this single matrix has $30,000$ parameters to train. Furthermore, there are a total of $4$ such matrices, so we have $30,000 \times 4 = 120,000$ parameters to train for the input word vector $x_i$.

Similarly, $W^{h\_input}$ has $100 \times 100 = 10,000$ parameters to train, and there are a total of $4$ such matrices, so we have $10,000 \times 4 = 40,000$ parameters to train for the hidden unit $h_i$.

To sum up, we have $120,000 + 40,000 = 160,000$ parameters (plus the bias vectors) to train when the input word vector is $300$-dimensional and the hidden unit is $100$-dimensional (we can check this with Keras below).
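
We can sanity-check this count with Keras (assuming a standard Keras install); since the layer also has $4$ bias vectors of size $100$, Keras should report $160,400$ parameters.

In [ ]:
from keras.models import Sequential
from keras.layers import LSTM

model = Sequential()
model.add(LSTM(100, input_shape=(None, 300)))   # 100-dim hidden state, 300-dim input
model.summary()                                 # expect 4 * (300*100 + 100*100 + 100) = 160,400 params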

This number of parameters is huge compared to the simple RNN:

  • $W^x$ with dimension $d_x \times d_s = 300 \times 100 = 30,000$
  • $W^s$ with dimension $d_s \times d_s = 100 \times 100 = 10,000$
  • $30,000 + 10,000 = 40,000$ parameters ($4$ times fewer than the LSTM's $160,000$).

Why does this mitigate the vanishing gradient problem?

When the forget gate vector is $(1, 1, ..., 1)$, the cell vector from the previous time step is simply copied forward, so the gradient can flow through the cell without shrinking. This is different from the simple RNN, where the derivative along the edge between $s_t$ and $s_{t-1}$ has norm less than $1$.

"Most useful when you have lots and lots of data" [9]

5. Gated Recurrent Unit (GRU)

So having $4$ times as many parameters as the simple RNN seems a bit expensive.

A simpler version of the LSTM, though still a bit more complex than the simple RNN, is the GRU (the following image is from [8]; a minimal sketch of the update equations follows the weight matrix list below).

In [14]:
Image(filename='figures/gru_socher_slide.png', width=600)
Out[14]:

Now, the weight matrices (or parameter matrices) are

  1. $W^{x\_reset}$ with dimension $d_x \times d_h$
  2. $W^{x\_update}$ with dimension $d_x \times d_h$
  3. $W^{x\_proposed\_update}$ with dimension $d_x \times d_h$
  4. $W^{h\_reset}$ with dimension $d_h \times d_h$
  5. $W^{h\_update}$ with dimension $d_h \times d_h$
  6. $W^{h\_proposed\_update}$ with dimension $d_h \times d_h$
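
The GRU equations are not written out above, so here is a rough NumPy sketch assuming the standard GRU formulation (as in [8]): a reset gate, an update gate, and a proposed update that are combined componentwise. Biases are omitted, as before.

In [ ]:
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

np.random.seed(0)
d_x, d_h = 4, 3
Wx = {k: np.random.randn(d_x, d_h) * 0.1 for k in ['reset', 'update', 'proposed']}
Wh = {k: np.random.randn(d_h, d_h) * 0.1 for k in ['reset', 'update', 'proposed']}

x_i = np.random.randn(d_x)
h_prev = np.zeros(d_h)

r = sigmoid(x_i @ Wx['reset']  + h_prev @ Wh['reset'])                    # reset gate
z = sigmoid(x_i @ Wx['update'] + h_prev @ Wh['update'])                   # update gate
h_tilde = np.tanh(x_i @ Wx['proposed'] + (r * h_prev) @ Wh['proposed'])   # proposed update

h_i = z * h_prev + (1.0 - z) * h_tilde   # conventions differ on which term z gates
print(h_i)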

Since the GRU has one gate fewer than the LSTM ($6$ weight matrices instead of $8$), it has fewer parameters:

  • each $W^{x\_*}$ has dimension $d_x \times d_h = 300 \times 100$, i.e. $30,000$ parameters
  • each $W^{h\_*}$ has dimension $d_h \times d_h = 100 \times 100$, i.e. $10,000$ parameters
  • a total of $30,000 \times 3 + 10,000 \times 3 = 120,000$ parameters, compared to $160,000$ for the LSTM (checked with Keras below)
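
Again, a quick Keras sanity check (the exact bias count reported can differ slightly between Keras versions and GRU implementations):

In [ ]:
from keras.models import Sequential
from keras.layers import GRU

model = Sequential()
model.add(GRU(100, input_shape=(None, 300)))    # 100-dim hidden state, 300-dim input
model.summary()                                 # roughly 3 * (300*100 + 100*100 + 100) = 120,300 params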

Toy example: Tweet2vec
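
Here is a minimal, hypothetical Keras sketch of a GRU-based tweet classifier in this spirit; the vocabulary size, embedding dimension, and number of hashtag classes below are all made up, and this is not the actual Tweet2vec architecture, just a small GRU encoder plus a softmax.

In [ ]:
from keras.models import Sequential
from keras.layers import Embedding, GRU, Dense

vocab_size = 1000       # hypothetical character/word vocabulary size
embedding_dim = 50      # hypothetical embedding dimension
num_classes = 10        # hypothetical number of hashtag classes

model = Sequential()
model.add(Embedding(vocab_size, embedding_dim))
model.add(GRU(100))                                  # 100-dim hidden state, as above
model.add(Dense(num_classes, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
model.summary()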

In [ ]: