Drawing inspiration from biology and neurophysiology, recent progress for many natural language processing (NLP) tasks has been remarkable. Deep neural network-based approaches have made progress in areas such as: machine translation, sentence summarization, dialogue agents, speech recognition, and conversational bots. Such approaches often employ a neural language model (NLM) as a decoder at inference time to generate a sequence of tokens, e.g., words, given an input, typically via beam search.
One long-recognized issue of decoding using NLMs is the computational complexity, which easily becomes a bottleneck when the vocabulary size is large. Consider a beam search decoder using a NLM. At each decoding step, a recurrent neural network first generates a context vector based on each partial hypothesis in the beam. The beam search decoder may then use a softmax layer to compute a normalized word probability distribution over the vocabulary. The softmax layer may include an inner product operator that projects the context vector into a vocabulary-sized vector of scores, followed by a softmax function that transforms a vocabulary-sized logits into a vector of probabilities. Finally, the beam search decoder selects the top-K words with the highest probabilities given the context, e.g., the top-K maximum subset of inner product, and stores the expended hypotheses and their probabilities in the beam. The most computationally expensive part in this process is the softmax layer, where the complexity is linear with respect to the vocabulary size.
Many techniques have been proposed to speed up the softmax layer in training, such as hierarchical softmax and sampling-based approaches. These approaches, however, cannot be directly applied to decoding because the approaches rely on knowing the words to be predicted and need to calculate the probability of all words to find the most likely prediction during decoding. Other works speed up softmax inference in training and decoding by reducing the cost of computing each word's probability using some approximation. However, the complexity of softmax is still linear with respect to the size of the vocabulary.