Recurrent Neural Networks (RNNs) are popular models that have shown great promise on many sequential-data tasks and are used, among others, by Apple's Siri and Google's Voice Search. Their great advantage is that the algorithm remembers its input thanks to an internal memory. Despite their recent popularity, only a limited number of resources explain them. The goal of this article is to thoroughly explain how RNNs work, what their biggest issues are and how to solve them, and to introduce GRUs by describing their basic components in detail.

Table of Contents:

  • Introduction
  • What is an RNN?
  • What are their issues?
  • What is a Gated Recurrent Unit (GRU)?
  • Basic components of a GRU
  • Conclusion
  • References

Introduction

Recurrent Neural Networks (RNNs) are a class of artificial neural networks and are among the most promising algorithms available at the moment because of their internal memory.

Recurrent neural networks build on David Rumelhart's work from 1986, but they have only shown their real potential in the last few years, thanks to the increase in available computational power, the massive amounts of data we have nowadays and the invention of the LSTM in the 1990s.

Their internal memory allows them to exhibit temporal dynamic behaviour over a time sequence. Unlike feedforward neural networks, RNNs can use their internal state (memory) to process sequences of inputs. This makes them well suited to applications such as speech recognition, time-series forecasting, image/music/dance generation and conversation modeling/question answering.

They are therefore suitable whenever there is a sequence of data and the temporal dynamics that connect the data are more important than the spatial content of each individual frame, as Lex Fridman (MIT) puts it.

What is an RNN?

In a feed-forward network you have an input, a first layer fully connected to the input, a second layer fully connected to the first, and so on for a certain number of layers, until an output comes out of the last layer.

RNNs are similar but different. Each layer, represented in the figure by a block with four nodes, has two sources of input: the input for that time period (X1, X2, X3, ...) and a context or state vector (A0, A1, A2, ...). The output is produced by combining the vector A with the input X. At the same time an updated vector A is produced, which is fed to the next layer. So, in a way, the same model runs again on the output that it produced previously.

In a nutshell, it is like a feed-forward NN with many layers, where each layer corresponds to a different time period. It is the same model at each layer, but it is fed new data as well as a summary of everything that has gone before. Thus, the next state depends on the internal state and the new data.

Since nothing comes before it, where does the first state A0 come from?

It is usually set to zero. The vector A can also be described as the hidden state of a dynamic system, describing all of its state variables. For example, in time-series forecasting it could describe the season, the day of the week, the year, etc.
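
The recurrence described above can be sketched in a few lines of NumPy. This is only an illustration under assumptions the article does not state: the tanh activation, the weight names (W_ax, W_aa, W_ya), the random weights and the sizes are all hypothetical choices made for the example.

```python
import numpy as np

def rnn_forward(xs, W_ax, W_aa, W_ya, a0=None):
    """Run a simple RNN over a sequence; return all outputs and states."""
    hidden_size = W_aa.shape[0]
    # The first state has nothing before it, so it defaults to zeros.
    a = np.zeros(hidden_size) if a0 is None else a0
    outputs, states = [], []
    for x in xs:
        # Updated state: combines the current input with the previous state.
        a = np.tanh(W_ax @ x + W_aa @ a)
        # Output for this time period, read from the updated state.
        y = W_ya @ a
        states.append(a)
        outputs.append(y)
    return np.array(outputs), np.array(states)

# Hypothetical sizes: 3 input features, a state with 4 nodes, 1 output.
rng = np.random.default_rng(0)
input_size, hidden_size, output_size, steps = 3, 4, 1, 5
W_ax = rng.normal(scale=0.5, size=(hidden_size, input_size))
W_aa = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
W_ya = rng.normal(scale=0.5, size=(output_size, hidden_size))
xs = rng.normal(size=(steps, input_size))

outputs, states = rnn_forward(xs, W_ax, W_aa, W_ya)
print(outputs.shape, states.shape)
```

Note that the same three weight matrices are reused at every time period; only the input and the state change from step to step.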

What are their issues?

For a long time series, for example 200 or more days, the RNN will look like a 200-layer-deep neural network. This creates some disadvantages, such as:

  • Hard to train
  • Suffers from vanishing and exploding gradients
  • Not easy to work with
  • Hard to remember values from a long way in the past

Let me elaborate on the last disadvantage using a time-series example. If X1 in the figure above is a useful piece of information, it will be reflected in the state A1. But as this state is passed along to the following time steps, the effect of X1 on the output a hundred time periods later will be negligible. Thus, it is hard to make information propagate a long way, and effects such as yearly or quarterly patterns might be diluted by the model.
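
To make this fading effect concrete, here is a small sketch that tracks how strongly the first input still influences the state at later steps. It assumes a scalar RNN with a tanh activation and hypothetical weights (w_x = 1.0, w_a = 0.9); these values are chosen for illustration and do not come from the article.

```python
import numpy as np

def influence_of_first_input(xs, w_x=1.0, w_a=0.9):
    """Return |d a_t / d x_1| for every step t of a scalar tanh RNN."""
    a = 0.0
    grad = 0.0
    influences = []
    for t, x in enumerate(xs):
        a = np.tanh(w_a * a + w_x * x)
        local = 1.0 - a ** 2  # derivative of tanh at this step
        if t == 0:
            grad = local * w_x         # direct effect of x_1 on a_1
        else:
            grad = local * w_a * grad  # chain rule through the previous state
        influences.append(abs(grad))
    return np.array(influences)

rng = np.random.default_rng(0)
xs = rng.normal(size=200)  # 200 time periods of made-up data
influence = influence_of_first_input(xs)
for t in (1, 10, 50, 100, 200):
    print(f"step {t:3d}: {influence[t - 1]:.3e}")
```

Each step multiplies the influence by a factor smaller than one, so with these weights it can never exceed 0.9 raised to the number of steps taken since X1, which is already tiny after a hundred periods.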

What is a Gated Recurrent Unit (GRU)?

Introduced by Cho et al. in 2014, the GRU (Gated Recurrent Unit) aims to solve the vanishing gradient problem that comes with a standard recurrent neural network. The GRU can also be considered a variation on the LSTM, because both are designed similarly and, in some cases, produce equally excellent results.

Instead of the simple neural network with four nodes that the RNN had previously, a GRU has a cell containing multiple operations (green box in the figure). The model that is repeated at every step of the sequence is now the green box, which contains three models (yellow boxes), each of which could be a neural network.

The GRU uses a so-called update gate and reset gate. The sigma notation in the figure represents those gates, which allow a GRU to carry information forward over many time periods so that it can influence a future time period. In other words, a value is stored in memory for a certain amount of time and, at a critical point, pulled out and used together with the current state to make an update at a future date.

Basic components of a GRU

1. Update gate

This gate calculates Zt at time step t by executing the following steps:

  • The input Xt is multiplied by a weight Wzx.
  • The previous output Ht-1, which holds information from previous units, is multiplied by a weight Wzh.
  • Both are added together and a sigmoid function is applied to squeeze the output between 0 and 1.

With this gate the vanishing gradient issue is mitigated, since the model learns on its own how much of the past information to pass on to the future.
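
The three steps of the update gate can be sketched as follows. The function name, the sizes and the random weights are assumptions made for illustration, and bias terms are left out because the steps above do not mention them.

```python
import numpy as np

def sigmoid(v):
    """Squeeze values into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-v))

def update_gate(x_t, h_prev, W_zx, W_zh):
    """Z_t = sigmoid(W_zx x_t + W_zh h_{t-1})."""
    return sigmoid(W_zx @ x_t + W_zh @ h_prev)

# Hypothetical sizes: 3 input features and a hidden vector with 4 entries.
rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4
W_zx = rng.normal(size=(hidden_size, input_size))
W_zh = rng.normal(size=(hidden_size, hidden_size))
x_t = rng.normal(size=input_size)
h_prev = rng.normal(size=hidden_size)

z_t = update_gate(x_t, h_prev, W_zx, W_zh)
print(z_t)  # one value between 0 and 1 per hidden entry
```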

2. Reset gate

This gate calculates Rt at time step t by executing the following steps:

  • The input Xt is multiplied by a weight Wrx.
  • The previous output Ht-1, which holds information from previous units, is multiplied by a weight Wrh.
  • Both are added together and a sigmoid function is applied to squeeze the output between 0 and 1.

The formula is very similar to that of the update gate and differs only in the weights and in how the gate is used.

This gate has the opposite function to the update gate, since the model uses it to decide how much of the past information to forget.
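
Written side by side in LaTeX notation, the two gates look as follows. Here \sigma stands for the sigmoid function, and bias terms are omitted because the steps above do not mention them.

```latex
\begin{aligned}
z_t &= \sigma\left(W_{zx}\, x_t + W_{zh}\, h_{t-1}\right) \\
r_t &= \sigma\left(W_{rx}\, x_t + W_{rh}\, h_{t-1}\right)
\end{aligned}
```

The two expressions have the same form; only the weights, and what the model later does with the result, are different.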

3. Current memory content

The calculation involves the following steps:

  • The input Xt is multiplied by a weight Wx.
  • Element-wise multiplication is applied to the reset gate Rt and the previous output Ht-1; this lets only the relevant past information pass through.
  • Both are added together and a tanh function is applied, giving the current memory content H't.
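
A minimal sketch of this step, following the three bullets literally, is shown below. The function name, sizes and random values are assumptions for illustration. Note also that the formulation in Cho et al. applies an additional weight matrix to the gated previous output, which the simplified description here leaves out.

```python
import numpy as np

def candidate_memory(x_t, h_prev, r_t, W_x):
    """H'_t = tanh(W_x x_t + r_t * h_{t-1}), where * is element-wise."""
    return np.tanh(W_x @ x_t + r_t * h_prev)

# Hypothetical sizes: 3 input features and a hidden vector with 4 entries.
rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4
W_x = rng.normal(size=(hidden_size, input_size))
x_t = rng.normal(size=input_size)
h_prev = rng.normal(size=hidden_size)

# Reset gate fully closed: the past is dropped and only the input matters.
closed = candidate_memory(x_t, h_prev, np.zeros(hidden_size), W_x)
# Reset gate fully open: the whole previous output is mixed in.
opened = candidate_memory(x_t, h_prev, np.ones(hidden_size), W_x)

print(closed)
print(opened)
```

In practice the reset gate takes values between these two extremes, so each entry of the previous output is forgotten to a different degree.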

4. Final memory at current time step

Last but not least, the unit has to calculate the vector Ht, which holds the information for the current unit and passes it further down the network. The update gate Zt plays a key role in this process.

The calculation involves the following steps:

  • Apply element-wise multiplication to the update gate Zt and the previous output Ht-1.
  • Apply element-wise multiplication to one minus the update gate, 1-Zt, and the current memory content H't.
  • Both are added together.

If the vector Zt is close to 1, a big portion of the current content will be ignored, since it is irrelevant for our prediction: 1-Zt will be close to 0 at this time step, so H't contributes little, while Zt close to 1 allows the majority of the past information to be kept. For example, when predicting the Saturday sales of a shop, we would probably give more weight to last Saturday's sales than to yesterday's (Friday's) sales.
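
Putting the four components together, one GRU step can be sketched as below. This is an illustration under stated assumptions rather than a reference implementation: it uses the convention Ht = Zt * Ht-1 + (1 - Zt) * H't, in which the update gate multiplies the previous output, it omits bias terms, and the parameter dictionary, sizes and random weights are hypothetical.

```python
import numpy as np

def sigmoid(v):
    """Squeeze values into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-v))

def gru_step(x_t, h_prev, p):
    """One GRU step built from the four components described in this section."""
    z_t = sigmoid(p["W_zx"] @ x_t + p["W_zh"] @ h_prev)  # 1. update gate
    r_t = sigmoid(p["W_rx"] @ x_t + p["W_rh"] @ h_prev)  # 2. reset gate
    h_cand = np.tanh(p["W_x"] @ x_t + r_t * h_prev)      # 3. current memory content
    h_t = z_t * h_prev + (1.0 - z_t) * h_cand            # 4. final memory
    return h_t, z_t

# Hypothetical sizes: 3 input features and a hidden vector with 4 entries.
rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4
params = {
    name: rng.normal(scale=0.5, size=(hidden_size, cols))
    for name, cols in [("W_zx", input_size), ("W_zh", hidden_size),
                       ("W_rx", input_size), ("W_rh", hidden_size),
                       ("W_x", input_size)]
}

h = np.zeros(hidden_size)  # initial memory set to zero
for x_t in rng.normal(size=(6, input_size)):
    h, z = gru_step(x_t, h, params)
print(h)
```

Because the final memory is a weighted mix of the previous output and the current memory content, an entry of Zt near 1 carries the old value forward almost unchanged, which is how information can survive over many time periods.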


Conclusion

Thanks to their internal memory and their update and reset gates, GRUs are valuable for storing and filtering information. As a result, the issues faced by RNNs (the vanishing gradient problem) are largely mitigated, giving us a powerful tool for handling sequence data.

Thanks for reading and I am looking forward to hearing your questions :)
Stay tuned and Happy Machine Learning.


Originally published at : on Feb 5, 2019.