Speech can be automatically recognized through automatic speech recognition ("ASR") systems. Words and non-linguistic sounds can form utterances, which are separate pieces of audio delimited by pauses. As used herein, an utterance can refer to a full sentence separated by pauses or to a segment of a sentence separated by pauses. In some models used in ASR, a part of a word's sound, called a phone, can be used as the modeling unit for matching sound to words. Recognition accuracy may improve by considering a phone together with its context (its neighboring phones), thus forming a diphone or triphone. Triphones are usually modeled using a three-state Hidden Markov Model (HMM). Acoustically similar states can be grouped together and shared across different triphones. These shared states, often called senones, are the main modeling units in many ASR systems.
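The expansion of phones into context-dependent triphones and the tying of their HMM states into senones can be sketched as follows. This is a minimal illustrative sketch, not a production ASR component: the phone symbols, the `sil` boundary marker, and the `SENONE_TABLE` mapping are all hypothetical; a real system learns the state tying from data (commonly with phonetic decision trees).

```python
def to_triphones(phones):
    """Expand a phone sequence into (left, center, right) triphones.
    Utterance boundaries are padded with a silence marker 'sil'."""
    padded = ["sil"] + list(phones) + ["sil"]
    return [(padded[i - 1], padded[i], padded[i + 1])
            for i in range(1, len(padded) - 1)]

# Illustrative tying table: acoustically similar HMM states of different
# triphones share a single senone ID. These entries are invented for the
# example; a real system clusters states automatically.
SENONE_TABLE = {
    (("k", "ae", "t"), 0): 101,   # state 0 of /ae/ in "cat"
    (("b", "ae", "t"), 0): 101,   # state 0 of /ae/ in "bat" is tied to it
    (("k", "ae", "t"), 1): 102,
    (("b", "ae", "t"), 1): 103,
}

def senones_for(triphone, n_states=3, default=-1):
    """Map each of a triphone's HMM states to its shared senone ID.
    Unmapped states return `default` (a real system covers all states)."""
    return [SENONE_TABLE.get((triphone, s), default) for s in range(n_states)]
```

For example, `to_triphones(["k", "ae", "t"])` yields `[("sil", "k", "ae"), ("k", "ae", "t"), ("ae", "t", "sil")]`, and the first HMM state of the /ae/ triphones in "cat" and "bat" maps to the same senone, reflecting the shared-state idea described above.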