GRU, LSTM, and RNN application architectures

Gated RNNs

GRU

Both the GRU and the LSTM use gates, but these are not digital gates. A gate produces a signal $\Gamma$ with values between 0 and 1, usually via the sigmoid function $\sigma(x)$, whose output ranges from 0 to 1. The gate signal is used either to mix two signals $a$ and $b$ as a convex combination, $\Gamma a + (1-\Gamma)b$, or to scale a single signal (taking $b$ to be 0), $\Gamma a$.
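As a quick illustration (not from the original notes), here is a minimal NumPy sketch of sigmoid gating; the values are arbitrary.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Gate signal between 0 and 1, produced by a sigmoid of some pre-activation.
gamma = sigmoid(np.array([-2.0, 0.0, 2.0]))   # roughly [0.12, 0.50, 0.88]

a = np.array([1.0, 1.0, 1.0])
b = np.array([-1.0, -1.0, -1.0])

mixed  = gamma * a + (1 - gamma) * b   # convex combination of a and b
scaled = gamma * a                     # scaling a single signal (b = 0)
print(mixed, scaled)
```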

In order to introduce the GRU, we first redraw the Jordan RNN, where $\tilde s_i=\tanh(w_c[x_i, s_{i-1}])$ and $y_i=s_i=g_2(w_{sy}\tilde s_i)$, like below:

[Figure: Jordan RNN redrawn]

If we add an update gate $\Gamma_u$ in parallel with the hidden state, we have a simple GRU, where $\Gamma_u=\sigma(w_u[s_{i-1},x_i]+b_u)$, $\tilde s_i=\tanh(w_c[x_i, s_{i-1}]+b_c)$, and $y_i=s_i=\Gamma_u\tilde s_i+(1-\Gamma_u)s_{i-1}$.

[Figure: simple GRU with update gate]

Further adding a reset gate $\Gamma_r$ gives the full GRU, where $\Gamma_u=\sigma(w_u[s_{i-1},x_i]+b_u)$, $\Gamma_r=\sigma(w_r[s_{i-1},x_i]+b_r)$, $\tilde s_i=\tanh(w_c[x_i, \Gamma_r s_{i-1}]+b_c)$, and $y_i=s_i=\Gamma_u\tilde s_i+(1-\Gamma_u)s_{i-1}$.

[Figure: full GRU with update and reset gates]
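To make the equations concrete, here is a minimal NumPy sketch of one full-GRU step; the function name `gru_step` and the weight shapes are illustrative assumptions, not a reference implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_i, s_prev, w_u, b_u, w_r, b_r, w_c, b_c):
    """One step of the full GRU above.

    x_i: input, shape (n_x,); s_prev: previous state s_{i-1}, shape (n_s,)
    w_u, w_r: shape (n_s, n_s + n_x); w_c: shape (n_s, n_x + n_s); b_*: shape (n_s,)
    """
    sx = np.concatenate([s_prev, x_i])                 # [s_{i-1}, x_i]
    gamma_u = sigmoid(w_u @ sx + b_u)                  # update gate
    gamma_r = sigmoid(w_r @ sx + b_r)                  # reset gate
    s_tilde = np.tanh(w_c @ np.concatenate([x_i, gamma_r * s_prev]) + b_c)
    s_i = gamma_u * s_tilde + (1 - gamma_u) * s_prev   # convex combination
    return s_i                                         # y_i = s_i
```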
LSTM networks

LSTM stands for long short-term memory; the network has feedback for both its state and its output. The LSTM differs from the GRU in that it has three gates rather than two. More subtly, the LSTM has 2 “state channels” and 4 layers with four weight matrices: $W_f, W_c, W_i, W_o$.

[Figure: LSTM unit]

Now let’s look at the different components that make up an LSTM. There are two state channels: $h_t$, the output, which is also fed back into the LSTM unit, and $C_t$, which represents additional state information. The cell state $C_t$ is carried forward from one time step to the next and is modified along the way. Note that the state is a vector, not a scalar.

[Figure: LSTM state channels $h_t$ and $C_t$]

An LSTM “gate” makes an element-wise decision about what to allow through. It is a fully connected layer followed by a sigmoid, with outputs between 0 and 1. It’s like one of the gates in a GRU.

[Figure: LSTM gate]

The forget gate $f_t$ decides which elements of the previous cell state should be forgotten. When an element of $f_t$ is zero, the corresponding output of the gate is zero (forgotten).

[Figure: forget gate]

The input gate $i_t$ determines what new information from $\tilde C_t$ should be added to create the new cell state, where $\tilde C_t$ is the output of a fully connected layer that proposes the next state change. This new information is linearly transformed and then squashed by the tanh function, which forces the output to lie between -1 and 1.

[Figure: input gate and candidate state]

Outputs from the forget gate and the input gate combine to update the current cell state: $C_t = f_t\, C_{t-1} + i_t\, \tilde C_t$. Note that the GRU instead took a convex combination of $C$ and $\tilde C$, so its two gate weights summed to 1, whereas $f_t$ and $i_t$ are independent.

[Figure: cell state update]

The output gate decides what information from the cell state should be output by the LSTM.

[Figure: output gate]
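Putting the pieces together, here is a minimal NumPy sketch of one LSTM step using the four weight matrices $W_f, W_c, W_i, W_o$; the shapes and the name `lstm_step` are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W_f, b_f, W_i, b_i, W_c, b_c, W_o, b_o):
    """One LSTM step. x_t: shape (n_x,); h_prev, c_prev: shape (n_h,);
    W_*: shape (n_h, n_h + n_x); b_*: shape (n_h,)."""
    z = np.concatenate([h_prev, x_t])      # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)           # forget gate
    i_t = sigmoid(W_i @ z + b_i)           # input gate
    c_tilde = np.tanh(W_c @ z + b_c)       # proposed state change, in (-1, 1)
    c_t = f_t * c_prev + i_t * c_tilde     # new cell state
    o_t = sigmoid(W_o @ z + b_o)           # output gate
    h_t = o_t * np.tanh(c_t)               # new output, fed back next step
    return h_t, c_t
```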

In general, the LSTM has 3 gates and more parameters and has a longer track record, while the GRU has 2 gates and fewer parameters and tends to train better on small datasets. Both are implemented in the major frameworks and are easy to use.

[Figure: LSTM vs. GRU comparison]

Bidirectional RNN, Multilayer RNN

A bidirectional RNN is formed from two independent RNNs (of any type): the forward RNN runs as usual, while the reverse RNN starts at the end of the sequence and works backward to the start. Once the forward and reverse RNNs are complete, the outputs for each time step are concatenated and may be passed through a fully connected layer.

[Figure: bidirectional RNN]

A multilayer RNN can use any RNN cell type, such as LSTM or GRU; it can be bidirectional, and it can have any number of layers (usually 2 or 3). It is sometimes called a “deep RNN”, although its backprop depth is really governed by the unrolled sequence length, while the number of parameters is governed by the number of layers.

[Figure: multilayer RNN]
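As a usage sketch of both ideas in a major framework (an illustration, not part of the original notes), here is a 2-layer bidirectional LSTM in PyTorch; the sizes are arbitrary assumptions.

```python
import torch
import torch.nn as nn

# 2-layer bidirectional LSTM; all sizes are arbitrary for illustration.
rnn = nn.LSTM(input_size=16, hidden_size=32, num_layers=2,
              bidirectional=True, batch_first=True)

x = torch.randn(8, 50, 16)            # (batch, sequence length, features)
outputs, (h_n, c_n) = rnn(x)

# Forward and reverse outputs are concatenated per time step: 2 * hidden_size.
print(outputs.shape)                  # torch.Size([8, 50, 64])

# Optionally pass the concatenated outputs through a fully connected layer.
fc = nn.Linear(2 * 32, 10)
logits = fc(outputs)                  # (8, 50, 10)
```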

RNN Application Architectures

[Figure: RNN application architectures: one to many, many to one, many to many]
  • One to Many

    The one-to-many architecture applies to sequence generation, such as audio or speech generation and image caption generation. Its loss function is defined over all outputs.

  • Many to One

    The many-to-one architecture applies to sequence classification, such as sentiment analysis, video classification, or a stock market “Buy/Sell” decision. Its loss function is defined over a single output and is usually cross-entropy.

  • Many to Many (same length)

    It is usually used for classifying the elements of a sequence, such as named entity classification, video segmentation/activity labeling, statistical language modeling (char-RNN), and some approaches to speech recognition. Its loss function is defined over all outputs and is usually the sum of the losses for each output.

  • Many to Many (different length)

    This architecture is often called the encoder-decoder architecture and uses two RNNs: the input sequence is encoded into a vector, and the output sequence is generated from that vector. It is widely used in language translation, video captioning, and question answering; a minimal sketch follows this list.

    [Figure: encoder-decoder architecture]
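Here is a minimal PyTorch sketch of an encoder-decoder with a GRU encoder, a GRU decoder, and greedy decoding; the class name Seq2Seq, the vocabulary sizes, and the dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Encode the input sequence into a vector (the encoder's final state),
    then generate the output sequence from that vector (greedy decoding)."""

    def __init__(self, src_vocab=1000, tgt_vocab=1000, emb=64, hidden=128):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.GRU(emb, hidden, batch_first=True)
        self.decoder = nn.GRU(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src, sos_id=1, max_len=20):
        _, state = self.encoder(self.src_emb(src))       # state: (1, batch, hidden)
        token = torch.full((src.size(0), 1), sos_id, dtype=torch.long)
        outputs = []
        for _ in range(max_len):
            dec_out, state = self.decoder(self.tgt_emb(token), state)
            logits = self.out(dec_out[:, -1])             # scores for the next token
            token = logits.argmax(dim=-1, keepdim=True)   # greedy choice
            outputs.append(token)
        return torch.cat(outputs, dim=1)                  # (batch, max_len)

# Example: decode a batch of 4 source sequences of length 12 (integer token ids).
model = Seq2Seq()
src = torch.randint(0, 1000, (4, 12))
print(model(src).shape)                                   # torch.Size([4, 20])
```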

In sequence generation or sequence-to-sequence problems, we want to find $\hat Y = \arg\max_{y_i \in \text{Dict}} P([y_1, y_2, \ldots, y_T] \mid x)$, where $x$ is the input (an image, a sequence, etc.) and $[y_1, y_2, \ldots, y_T]$ ranges over possible sequences whose elements come from the dictionary. In sequence-to-sequence architectures, $x$ is often the state $s$ output by the encoder. Finding the maximum of $P([y_1, y_2, \ldots, y_T] \mid x)$ requires enumerating all sequences, which is exponential in the sequence length.

We proposed earlier an inference method: $\hat y_1 = \arg\max P(y_1 \mid x)$, $\hat y_2 = \arg\max P(y_2 \mid \hat y_1, x)\,P(\hat y_1 \mid x)$, …, $\hat y_i = \arg\max P(y_i \mid \hat y_1, \ldots, \hat y_{i-1}, x)\,P(\hat y_{i-1} \mid \hat y_1, \ldots, \hat y_{i-2}, x)\cdots$

This is NOT optimal; it is a greedy “best-first” search.

Inference can be viewed as searching a tree in which the number of children of each node is the size of the dictionary; an inference result is a path through this tree. Exhaustive search is prohibitive due to the exponential complexity. Beam search with beam size $\beta$ goes layer by layer and keeps the $\beta$ highest-probability paths through the tree. The search ends when <EOS> is selected or the maximum length is reached. The term was introduced for speech recognition by Raj Reddy in 1977.

[Figure: beam search over the inference tree]

Let’s take beam search with $\beta = 2$ as an example. We evaluate $p(d_i \mid x)$ for all $d_i$ and retain the two $d_i$ with the largest $p(d_i \mid x)$. Then we evaluate $p(d_i \mid x, \hat y_1)\,p(\hat y_1 \mid x)$ for each $\hat y_1$ selected above and for all $d_i$, retaining the 2 paths $[\hat y_1, \hat y_2]$ with the largest value. Similarly, we evaluate $p(d_i \mid x, \hat y_1, \hat y_2)\,p(\hat y_2 \mid \hat y_1, x)\,p(\hat y_1 \mid x)$ for each $[\hat y_1, \hat y_2]$ selected above and for all $d_i$, retaining the 2 paths $[\hat y_1, \hat y_2, \hat y_3]$ with the largest value. We continue to deeper levels until <EOS> is selected or the maximum length is reached.

[Figure: beam search example with $\beta = 2$]
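A minimal sketch of beam search over a next-token scorer; `next_log_probs`, the token ids, and the toy dictionary are assumptions standing in for a real decoder (summing log-probabilities corresponds to multiplying the probabilities above).

```python
import numpy as np

def beam_search(next_log_probs, beam=2, max_len=10, sos=1, eos=2):
    """next_log_probs(prefix) -> 1-D array of log p(d_i | x, prefix) over the dictionary.
    Keeps the `beam` highest-probability partial sequences at every level."""
    hyps = [([sos], 0.0)]                     # (tokens, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in hyps:
            log_p = next_log_probs(tokens)    # scores for every dictionary entry d_i
            for d_i, lp in enumerate(log_p):
                candidates.append((tokens + [d_i], score + lp))
        candidates.sort(key=lambda h: h[1], reverse=True)
        hyps = []
        for tokens, score in candidates[:beam]:        # retain the beam best paths
            (finished if tokens[-1] == eos else hyps).append((tokens, score))
        if not hyps:                                   # every retained path ended in <EOS>
            break
    return max(finished + hyps, key=lambda h: h[1])

# Toy example: a 5-word dictionary and a dummy model that returns random scores.
rng = np.random.default_rng(0)
def next_log_probs(prefix):
    return np.log(rng.dirichlet(np.ones(5)))

print(beam_search(next_log_probs, beam=2))
```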