Self-attentive feed-forward sequence models such as the Transformer have been shown to achieve impressive results on sequence modeling tasks including machine translation, image generation and constituency parsing, presenting a compelling alternative to recurrent neural networks, the de facto standard architecture for many sequence modeling problems. Despite these successes, however, the Transformer fails to generalize on some tasks that recurrent models handle with ease. These tasks include copying strings and performing simple logical inference when the string or formula lengths exceed those observed at training time.
The Transformer model is described in Vaswani et al., "Attention Is All You Need," 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, Calif., USA, available at https://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf. This paper is incorporated here by reference.