Recurrent Neural Networks (RNN)

Vanilla Recurrent Neural Networks (RNN)

We start with a two-layer feedforward network, where the input x is an n-dimensional vector representing a sequence element, the output y covers C classes, and the hidden layer s has h hidden nodes.

Then we have $s = g_1(w_{xs}x + b_s)$ and $y = g_2(w_{sy}s + b_y) = g_2(w_{sy}\,g_1(w_{xs}x + b_s) + b_y)$, where $g_1, g_2$ are activation functions, usually tanh and softmax respectively.

[Figure: two-layer feedforward network]
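
To make the notation concrete, here is a minimal NumPy sketch of this two-layer feedforward network. The sizes n, h, C and the random initialization are placeholders chosen only for illustration.

```python
import numpy as np

n, h, C = 10, 16, 4            # input dim, hidden nodes, number of classes (placeholders)

# Weights and biases, randomly initialized just for illustration
w_xs = np.random.randn(h, n) * 0.1
b_s  = np.zeros(h)
w_sy = np.random.randn(C, h) * 0.1
b_y  = np.zeros(C)

def softmax(z):
    z = z - z.max()            # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def feedforward(x):
    """s = g1(w_xs x + b_s), y = g2(w_sy s + b_y), with g1 = tanh, g2 = softmax."""
    s = np.tanh(w_xs @ x + b_s)
    y = softmax(w_sy @ s + b_y)
    return s, y

x = np.random.randn(n)         # one input vector
s, y = feedforward(x)          # y sums to 1 over the C classes
```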

Now we add a recurrent loop; this network is called an Elman network. Note that x, y, and s are now indexed by sequence position i. We are brushing under the rug that the recurrent state is delayed by one step, so there must be memory to store the previous state.

Then we have $s_i = g_1(w_{xs}x_i + w_{ss}s_{i-1} + b_s)$ and $y_i = g_2(w_{sy}s_i + b_y)$. Note that the initial state $s_0$ must be initialized, usually with a vector of zeros.

[Figure: Elman network (RNN with a recurrent loop)]
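
Adding the recurrence only changes the state update. A minimal sketch, continuing the NumPy example above and adding an assumed recurrent weight w_ss:

```python
# Recurrent weight, also randomly initialized for illustration
w_ss = np.random.randn(h, h) * 0.1

def rnn_step(x_i, s_prev):
    """s_i = g1(w_xs x_i + w_ss s_{i-1} + b_s), y_i = g2(w_sy s_i + b_y)."""
    s_i = np.tanh(w_xs @ x_i + w_ss @ s_prev + b_s)
    y_i = softmax(w_sy @ s_i + b_y)
    return s_i, y_i

s = np.zeros(h)                                 # s_0 initialized with zeros
xs = [np.random.randn(n) for _ in range(5)]     # a toy sequence of length 5
for x_i in xs:
    s, y = rnn_step(x_i, s)                     # the same weights are reused at every step
```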

Training and inference

How do we train an RNN? We use backprop with a trick. We initialize the state $s_0 = 0$. With the first input $x_1$ we have $s_1 = g_1(w_{xs}x_1 + w_{ss}s_0 + b_s)$, $y_1 = g_2(w_{sy}s_1 + b_y)$; similarly, with the second input $x_2$ we have $s_2 = g_1(w_{xs}x_2 + w_{ss}s_1 + b_s)$, $y_2 = g_2(w_{sy}s_2 + b_y)$; and with the $i$-th input $x_i$ we have $s_i = g_1(w_{xs}x_i + w_{ss}s_{i-1} + b_s)$, $y_i = g_2(w_{sy}s_i + b_y)$. We unroll the network over time for a sequence of length T; remember that all the weights $w_{ss}, w_{xs}, w_{sy}$ are shared across time steps.

[Figure: RNN unrolled over time]

Considering the task of sequence classification, the training pair is

  • Input: sequence $x = x_1, x_2, \ldots, x_{T_x}$
  • Training label: y is a class represented as a 1-hot vector

We use cross entropy as the loss function. The unrolled network looks like a feedforward network, so we train it with stochastic gradient descent methods such as Adam or RMSprop. The training procedure is called Backpropagation Through Time (BPTT); the unrolled network is deep, with the number of layers equal to $T_x + 1$. And if the loss function is defined over the full output sequence, it is still trainable by backprop (BPTT).

This unrolling looks complicated to implement, yet it is already implemented as part of training in the deep learning frameworks (TensorFlow, PyTorch, etc.).
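
For instance, a sequence classifier along these lines might be sketched in PyTorch as follows; the layer sizes, optimizer settings, and the use of torch.nn.RNN are placeholder choices, not prescribed by these notes. Calling loss.backward() runs BPTT through the unrolled graph automatically.

```python
import torch
import torch.nn as nn

n, h, C, T = 10, 16, 4, 20             # input dim, hidden size, classes, sequence length

class SeqClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.rnn = nn.RNN(input_size=n, hidden_size=h, nonlinearity='tanh', batch_first=True)
        self.out = nn.Linear(h, C)

    def forward(self, x):               # x: (batch, T, n)
        _, s_T = self.rnn(x)            # s_T: final hidden state, shape (1, batch, h)
        return self.out(s_T.squeeze(0)) # logits over the C classes

model = SeqClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()         # cross entropy over the class logits

x = torch.randn(8, T, n)                # a toy batch of 8 sequences
y = torch.randint(0, C, (8,))           # toy one-hot labels, stored as class indices

logits = model(x)
loss = loss_fn(logits, y)
loss.backward()                         # backpropagation through time over the unrolled graph
optimizer.step()
```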

Statistical language models

A statistical language model is a probability distribution over sequences of words, computed from some corpus. Consider a one-word sequence: $P(x_1)$ is just the probability of each word, which can be represented as a histogram. Now consider a two-word sequence: $P(x_1, x_2)$ is the probability of an ordered word pair.

  • Is $P(x_1, x_2) = P(x_2, x_1)$?

    No, it is not symmetric: P('Two', 'Sigma') ≠ P('Sigma', 'Two')

  • Is $P(x_1, x_2) = P(x_1)P(x_2)$?

    No, the two words are not independent either

Going bigger, $P(x_1, x_2, x_3, x_4, x_5)$ is the probability of a sequence of five words being used in the language. For example, P('the', 'cat', 'in', 'the', 'hat') is the probability of "the cat in the hat" appearing in a corpus of English text, and P('a', 'q', 't', 'i', 'u') is the probability of the letter sequence "aqtiu" appearing in a corpus of English text. To fully represent the English language, we would need a histogram with an exponential number of cells, e.g. $10{,}000^5$ cells for 5-word sequences over a 10,000-word vocabulary.
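
As a toy illustration of estimating such probabilities by counting (here only single words and word pairs, on a tiny made-up corpus):

```python
from collections import Counter

# A tiny made-up corpus, just to illustrate counting
corpus = "the cat in the hat sat on the mat the cat sat".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

total_words = len(corpus)
total_pairs = len(corpus) - 1

p_the = unigrams["the"] / total_words                  # estimate of P(x1 = 'the')
p_the_cat = bigrams[("the", "cat")] / total_pairs      # estimate of P(x1 = 'the', x2 = 'cat')
p_cat_the = bigrams[("cat", "the")] / total_pairs      # estimate of P(x1 = 'cat', x2 = 'the')
p_cat = unigrams["cat"] / total_words

# The joint distribution is not symmetric and is not the product of the unigram probabilities
print(p_the_cat, p_cat_the, p_the * p_cat)
```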

Let's visit an example of using a character RNN to implement a language model. We assume x is an n-dimensional one-hot encoding of characters, including punctuation and space (end of word). $\hat y$ is also n-dimensional, but its values are not limited to 0/1. In this RNN the final activation function is softmax, so each component of $\hat y$ can be viewed as the conditional probability of the next character. When a sequence $x_1, x_2, \ldots, x_n$ has been input to a trained network, the output $\hat y_n$ is the probability distribution $P(x_{n+1} \mid x_1, x_2, \ldots, x_n)$.

[Figure: character RNN producing a distribution over the next character]

The network is trained using a cross entropy loss that compares the next character from the training set, $x_{i+1}$, to the output $\hat y_i$ of the network. The training process takes sequences of characters and feeds them in one at a time, comparing the predicted character to the next one.

How do we sample the output $\hat y_i$? Recall that $\hat y_i$ is a vector of dimension equal to the dictionary size. Common options are (a code sketch follows below):

  1. Take largest value (arg max)
  2. Randomly sample based on distribution.
  3. Blend using the “temperature” of the softmax function.
[Figure: sampling from the output distribution]
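
Here is a minimal sketch of the three sampling options, assuming logits is the vector the network produces just before the softmax:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1, -1.0])      # toy pre-softmax outputs over a 4-character dictionary

# 1. Take the largest value (arg max)
greedy_idx = int(np.argmax(logits))

# 2. Randomly sample based on the softmax distribution
probs = softmax(logits)
sampled_idx = int(np.random.choice(len(probs), p=probs))

# 3. Blend using a temperature: divide logits by T before the softmax.
#    T < 1 sharpens the distribution (closer to arg max), T > 1 flattens it (more random).
temperature = 0.5
temp_probs = softmax(logits / temperature)
temp_idx = int(np.random.choice(len(temp_probs), p=temp_probs))
```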

And how complicated is it to code this up?

[Figure: character RNN code example]
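
As a rough sketch (not the exact code from the figure), a character-level RNN language model might look like this in PyTorch; the vocabulary size and hyperparameters are placeholders:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, hidden_size = 65, 128          # e.g. 65 distinct characters; placeholder sizes

class CharRNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.rnn = nn.RNN(vocab_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, x, s=None):           # x: (batch, T) integer character indices
        x_onehot = F.one_hot(x, vocab_size).float()
        h, s = self.rnn(x_onehot, s)         # h: (batch, T, hidden_size)
        return self.out(h), s                # logits over the next character at every position

model = CharRNN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Toy training step: predict character i+1 from the characters up to i
seq = torch.randint(0, vocab_size, (1, 21))      # a stand-in for an encoded text snippet
inputs, targets = seq[:, :-1], seq[:, 1:]

logits, _ = model(inputs)
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()
optimizer.step()
```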

Teacher Forcing

The question raised by the chart above is: what should $x_2$ be? We have several choices:

  1. The raw output $\hat y_1$ of the RNN
  2. A sampling of $\hat y_1$, e.g. a 1-hot vector with a one at the arg max position of $\hat y_1$.
  3. Teacher forcing: use the actual $y_1$ rather than a sampling of $\hat y_1$. In this example, $x_2$ is ‘e’.
[Figure: teacher forcing example]

The advantages of teacher forcing are that it makes training more robust, since errors do not accumulate, and that training converges faster. The disadvantage is that inference is then different from training, which leads to exposure bias. Some alternatives to teacher forcing include Curriculum Learning and Professor Forcing.
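
The difference between the options boils down to what we feed in as the next input. A minimal sketch, reusing the CharRNN from the earlier block and stepping one character at a time (run_sequence is a hypothetical helper, not from these notes):

```python
def run_sequence(model, seq, teacher_forcing=True):
    """Accumulate next-character loss over seq, choosing how to form the next input."""
    s = None
    x_i = seq[:, 0:1]                          # first ground-truth character
    total_loss = 0.0
    for i in range(seq.size(1) - 1):
        logits, s = model(x_i, s)              # logits: (batch, 1, vocab_size)
        target = seq[:, i + 1]
        total_loss = total_loss + F.cross_entropy(logits[:, -1, :], target)
        if teacher_forcing:
            x_i = seq[:, i + 1:i + 2]          # feed the actual next character y_i
        else:
            x_i = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # feed the model's own prediction
    return total_loss / (seq.size(1) - 1)

loss_tf   = run_sequence(model, seq, teacher_forcing=True)   # training-time behaviour
loss_free = run_sequence(model, seq, teacher_forcing=False)  # closer to inference-time behaviour
```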

Vanishing and exploding gradients

The problems with the vanilla RNN are mainly vanishing and exploding gradients:

  • Exploding gradients

    Solution: Gradient clipping (see the sketch after this list)

  • Vanishing gradients

    Solution: Regularization

  • Information doesn't propagate far back in time, so earlier elements of the input sequence don't affect the output.

    Solution: A fancier network architecture such as LSTM or GRU; these networks are also less prone to vanishing/exploding gradient problems.
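
In PyTorch, for example, gradient clipping is a one-line addition between the backward pass and the optimizer step; a minimal sketch with a toy model and an arbitrary threshold of 1.0:

```python
import torch
import torch.nn as nn

# Toy setup just to show where clipping goes in a training step
model = nn.RNN(input_size=10, hidden_size=16, batch_first=True)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.randn(4, 20, 10)

output, _ = model(x)
loss = output.pow(2).mean()            # placeholder loss, not a real objective

loss.backward()
# Rescale gradients so their global norm is at most 1.0 (the threshold is an arbitrary choice)
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```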