Traditionally, neural machine translation (NMT) models (and natural language generation models in general) use a one-hot representation for each word in the output vocabulary 𝒱. More formally, each word w is represented as a unique vector o(w) ∈ {0,1}^V, where V = |𝒱| is the size of the output vocabulary; exactly one entry of o(w), at index id(w) (the word ID of w in the vocabulary), is 1 and all others are 0. At every decoding step t, the model produces a distribution p_t over the output vocabulary using the softmax function. In these conventional NMT models, every decoding step therefore involves a large matrix multiplication followed by a softmax layer, which is both computationally expensive and parameter-heavy, since the cost and the number of output-layer parameters both grow linearly with V.
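As a rough illustration of this setup, the following NumPy sketch builds a one-hot vector o(w) and runs a single softmax decoding step. The hidden size d, the decoder state h_t, the toy sizes, and the names one_hot, decode_step, W_out, and b_out are assumptions made for this example; they are not taken from any particular NMT system.

```python
import numpy as np


def one_hot(word_id: int, vocab_size: int) -> np.ndarray:
    """Return o(w): a length-V vector with a single 1 at index id(w)."""
    o = np.zeros(vocab_size, dtype=np.int64)
    o[word_id] = 1
    return o


def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a 1-D vector of logits."""
    shifted = logits - logits.max()  # subtracting the max avoids overflow in exp
    exp = np.exp(shifted)
    return exp / exp.sum()


def decode_step(hidden: np.ndarray, W_out: np.ndarray, b_out: np.ndarray) -> np.ndarray:
    """One decoding step: project the hidden state to V logits, then normalize."""
    logits = W_out @ hidden + b_out  # (V, d) @ (d,) -> (V,): the large matrix product
    return softmax(logits)


# Toy sizes chosen for illustration (assumed); real vocabularies are often much larger.
V, d = 10_000, 64
rng = np.random.default_rng(0)

W_out = rng.standard_normal((V, d)) * 0.01  # hypothetical output projection
b_out = np.zeros(V)                         # hypothetical output bias
h_t = rng.standard_normal(d)                # stand-in for the decoder state at step t

p_t = decode_step(h_t, W_out, b_out)        # distribution over the vocabulary

# The output layer alone holds V * d + V parameters, so it grows linearly with V.
num_output_params = W_out.size + b_out.size
print(p_t.shape, num_output_params)
```

In this sketch, both the matrix product and the normalization in softmax touch all V vocabulary entries at every step, which is the source of the cost described above.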