Generating output from an autoregressive model is inherently sequential, because at each step the model has to be supplied with its own previous predictions. This makes large autoregressive models potentially difficult to apply in production environments, particularly in low-latency environments.
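The sequential dependency can be illustrated with a minimal sketch of greedy decoding. The names greedy_decode and toy_next_token are hypothetical and chosen only for illustration; the toy rule stands in for a trained model and is not taken from any of the systems discussed here.

```python
from typing import Callable, List


def greedy_decode(
    next_token: Callable[[List[int]], int],
    bos: int,
    eos: int,
    max_len: int,
) -> List[int]:
    """Generate tokens one at a time; every step consumes all earlier outputs."""
    output = [bos]
    for _ in range(max_len):
        # This call cannot start until the previous iteration has finished,
        # because its input includes the token produced in that iteration.
        token = next_token(output)
        output.append(token)
        if token == eos:
            break
    return output[1:]  # drop the begin-of-sequence marker


def toy_next_token(prefix: List[int]) -> int:
    # Hypothetical stand-in for a trained model: previous token plus one,
    # capped at 5, which is used as the end-of-sequence token below.
    return min(prefix[-1] + 1, 5)


tokens = greedy_decode(toy_next_token, bos=0, eos=5, max_len=10)
```

In this sketch the number of model invocations grows linearly with the output length, and none of the invocations can be run concurrently with another.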
Three related current approaches to overcoming this difficulty may be mentioned. All of them share the problem that, while they are faster, they also degrade output quality significantly.
The first approach is predicting fertilities and noisy parallel decoding. This approach is described in Gu et al., Non-Autoregressive Neural Machine Translation, published as a conference paper at the Sixth International Conference on Learning Representations 2018, available at https://arxiv.org/pdf/1711.02281.pdf.
The second approach is iterative refinement of independent predictions. This approach is described in Lee et al., Deterministic Non-Autoregressive Neural Sequence Modeling by Iterative Refinement, Apr. 17, 2018, available at https://arxiv.org/pdf/1802.06901.pdf.
The third approach is predicting a sequence of discrete latents sequentially, and then predicting the final sequence in parallel. This approach is described in Kaiser et al., Fast Decoding in Sequence Models Using Discrete Latent Variables, Apr. 29, 2018, available at https://arxiv.org/pdf/1803.03382.pdf.
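The two-stage structure of the third approach can be sketched as follows. This is a toy illustration under stated assumptions, not the model of Kaiser et al.: the functions decode_latents and decode_outputs, the arithmetic rules inside them, and the parameter tokens_per_latent are all hypothetical and serve only to show which part is sequential and which part is position-independent.

```python
from typing import List


def decode_latents(num_latents: int) -> List[int]:
    """Stage 1: a short sequence of discrete latents, generated sequentially."""
    latents: List[int] = []
    for _ in range(num_latents):
        prev = latents[-1] if latents else 0
        # Toy rule (assumption): each latent depends on the previous latent.
        latents.append((prev * 3 + 1) % 7)
    return latents


def decode_outputs(latents: List[int], tokens_per_latent: int) -> List[int]:
    """Stage 2: each output position depends only on the latents and its own
    index, so all positions could be computed in parallel."""
    length = len(latents) * tokens_per_latent
    return [
        latents[i // tokens_per_latent] * 10 + i % tokens_per_latent
        for i in range(length)
    ]


latents = decode_latents(3)
outputs = decode_outputs(latents, tokens_per_latent=2)
```

Here only three sequential steps are needed to produce six output tokens; the comprehension in the second stage has no dependency between positions.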
While several common architecture classes, including recurrent, convolutional, and self-attention networks, make different trade-offs between the amount of computation needed per layer and the length of the critical path at training time, inference for novel inputs remains an inherently sequential process.
Neural autoregressive sequence-to-sequence models have become the de facto standard for a wide variety of tasks including machine translation, summarization, and speech synthesis. Different novel network architectures now allow for increased parallelization during training. A much better fit for today's massively parallel hardware accelerators, these architectures require significantly less time to train. Performance at generation time, however, still poses a significant challenge when deploying such models for many practical applications.
As a result, a growing body of work is concerned with different approaches to accelerating generation from autoregressive models. These include probability density distillation, subscaling, and decomposing the problem into the autoregressive generation of a short sequence of discrete latent variables followed by a parallel generation step conditioned on the discrete latents. Some of these techniques are at least somewhat application-specific, such as the non-autoregressive Transformer for machine translation. While some techniques have achieved speed-ups of multiple orders of magnitude for speech synthesis, to the best of our knowledge, the largest published wall-clock time improvement for non-batched decoding in machine translation is approximately 4×, at a significant loss in quality.