Convolutional networks can easily handle dependencies within a fixed window, but they cannot handle dependencies that extend arbitrarily far in time because they have no mechanism to store information. This deficiency prevents them from effectively tackling many applications, including processing text, detecting combinations of events, and learning finite state machines.
Recurrent neural networks (RNNs) promised to lift this limitation by allowing the system to store state in recurrent loops. However, RNNs suffer from a basic limitation when they are trained with gradient descent. To store a state robustly in a recurrent loop, the state must be stable under small deviations, or noise. Stability means that a perturbation of the state shrinks at every step, so a gradient propagated back through the same loop shrinks as well. Consequently, if an RNN stores its state robustly, the gradients vanish over long time spans and training becomes difficult.
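The trade-off between robust storage and vanishing gradients can be illustrated with a minimal sketch. It assumes a hypothetical one-unit recurrent loop h_{t+1} = tanh(w * h_t) with w = 2, chosen only for illustration and not taken from any particular system. The loop has two stable states (roughly +0.96 and -0.96), so it can hold one bit, yet the derivative of the final state with respect to the initial state collapses toward zero as the number of steps grows.

```python
import math

def state_gradient(w, h0, steps):
    """Iterate h <- tanh(w * h) and track d h_T / d h_0 by the chain rule.

    Illustrative one-unit loop only; w, h0 and steps are assumed values.
    """
    h, grad = h0, 1.0
    for _ in range(steps):
        h = math.tanh(w * h)
        # d tanh(w * h_prev) / d h_prev = w * (1 - tanh(w * h_prev)^2)
        grad *= w * (1.0 - h * h)
    return h, grad

if __name__ == "__main__":
    for h0 in (0.8, 1.0, -0.8):
        h_final, grad = state_gradient(2.0, h0, 50)
        print(f"h0={h0:+.1f}  h_50={h_final:+.4f}  dh_50/dh_0={grad:.3e}")
```

Different starting points of the same sign end in the same stored value, which is the robustness to noise described above, while the printed gradient becomes vanishingly small, which is the training difficulty.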
In the past, two ways of circumventing this training issue were developed. One way is to build an architecture that makes it easy to keep the eigenvalues of the recurrent loop very close to 1 (for example, by using gating functions computed by a sigmoid, which are almost 1 when saturated). Another way is to cheat on gradient descent using well-known tricks such as gradient capping, truncated gradients, and gradient normalization through regularization. Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) systems are examples of previous schemes that took advantage of both of these methods in an attempt to overcome the training issue.
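Both workarounds can be sketched in a few lines. The first function assumes a simplified gated state update c_{t+1} = f * c_t + (1 - f) * x with a constant gate f = sigmoid(z); this is a stand-in for illustration, not the full LSTM or GRU equations. Under that assumption the per-step derivative with respect to the state is f, so the gradient retained after T steps is f^T. The second function is a generic form of gradient capping (often called clipping by norm). The names `gate_retention` and `clip_by_norm` are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gate_retention(z, steps):
    """Gradient factor kept after `steps` updates of c <- f*c + (1-f)*x.

    Assumes the gate f = sigmoid(z) is constant and independent of c.
    """
    return sigmoid(z) ** steps

def clip_by_norm(grad, max_norm):
    """Rescale a gradient vector so its Euclidean norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]

if __name__ == "__main__":
    print(gate_retention(6.0, 50))   # saturated gate: most of the gradient survives
    print(gate_retention(0.0, 50))   # gate at 0.5: gradient vanishes
    print(clip_by_norm([3.0, 4.0], 1.0))
```

A saturated gate keeps the per-step factor close to 1, so the gradient survives many steps, whereas an unsaturated gate behaves like the vanishing case. Capping does not address vanishing gradients; it only bounds the size of an update when gradients grow large.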