Computing devices can be configured to process a user's spoken commands, requests, and other utterances into written transcriptions. Models representing data relationships and patterns, such as functions, algorithms, systems, and the like, may accept audio data input (sometimes referred to as an input vector), and produce output (sometimes referred to as an output vector) that corresponds to the input in some way. In some implementations, a model is used to generate a probability or set of probabilities that the input corresponds to a particular language unit (e.g., phoneme, phoneme portion, triphone, word, n-gram, part of speech, etc.). For example, an automatic speech recognition (“ASR”) system may utilize various models to recognize speech, such as an acoustic model and a language model. The acoustic model is used to generate hypotheses regarding which words or subword units (e.g., phonemes) correspond to an utterance based on the acoustic features of the utterance. The language model is used to determine which of the hypotheses generated using the acoustic model is the most likely transcription of the utterance.
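As a rough illustration of how the two models described above may interact, the following minimal sketch combines acoustic-model and language-model scores to select a transcription. The hypotheses, the probability values, the `lm_weight` parameter, and the function name `best_transcription` are all hypothetical and chosen only for illustration; an actual ASR system would obtain these scores from trained models rather than from hard-coded tables.

```python
import math

# Hypothetical acoustic-model output: candidate transcriptions of one
# utterance, each with a log-probability of matching the audio features.
acoustic_hypotheses = {
    "recognize speech": math.log(0.40),
    "wreck a nice beach": math.log(0.45),
}

# Hypothetical language-model scores: log-probability of each word sequence.
language_model = {
    "recognize speech": math.log(0.30),
    "wreck a nice beach": math.log(0.01),
}

# Assumed floor score for hypotheses the language model has never seen.
UNSEEN_LOG_PROB = math.log(1e-9)


def best_transcription(acoustic, lm, lm_weight=1.0):
    """Return the hypothesis with the highest combined log score."""
    def combined_score(hypothesis):
        lm_score = lm.get(hypothesis, UNSEEN_LOG_PROB)
        return acoustic[hypothesis] + lm_weight * lm_score

    return max(acoustic, key=combined_score)


result = best_transcription(acoustic_hypotheses, language_model)
acoustic_only = best_transcription(acoustic_hypotheses, language_model, lm_weight=0.0)
```

In this sketch the acoustic scores alone slightly favor one hypothesis, while the combined score favors the other, which is the kind of disambiguation the language model is described as providing.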
Such models are typically based on a lexicon. A lexicon generally refers to a compendium of words and their associated pronunciations. Words in the lexicon may be manually annotated with pronunciation information by, for example, a professional linguist. Because this process can be resource-intensive (e.g., in time, labor, and expense), some words may instead be annotated automatically using pronunciation prediction. In some implementations, the prediction may be based on a grapheme-to-phoneme (“G2P”) model. Given the volume of new words that may be included in a given lexicon and the variable accuracy of the G2P model, a need exists to efficiently and accurately identify those words whose manual annotation would improve overall system performance.
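By way of illustration only, the sketch below shows one way a lexicon might fall back to predicted pronunciations and flag words for manual annotation. Everything in it is an assumption made for the example: the `MANUAL_LEXICON` entries, the letter-to-phoneme table standing in for a trained G2P model, the coverage-based confidence value, and the 0.8 review threshold. The text above does not specify a selection criterion, so the confidence threshold is a placeholder rather than a described method.

```python
from dataclasses import dataclass


@dataclass
class Pronunciation:
    phonemes: list
    source: str        # "manual" or "g2p"
    confidence: float  # 1.0 for manual entries in this sketch


# Hypothetical manually annotated entries.
MANUAL_LEXICON = {"cat": ["K", "AE", "T"]}

# Toy letter-to-phoneme table standing in for a trained G2P model.
TOY_G2P = {"a": "AE", "b": "B", "c": "K", "d": "D", "g": "G", "o": "AA", "t": "T"}


def toy_g2p(word):
    """Predict phonemes letter by letter; confidence is the share of letters covered."""
    phonemes = [TOY_G2P[ch] for ch in word if ch in TOY_G2P]
    covered = sum(ch in TOY_G2P for ch in word)
    confidence = covered / len(word) if word else 0.0
    return phonemes, confidence


def lookup(word):
    """Prefer the manual annotation; otherwise fall back to the G2P prediction."""
    if word in MANUAL_LEXICON:
        return Pronunciation(MANUAL_LEXICON[word], "manual", 1.0)
    phonemes, confidence = toy_g2p(word)
    return Pronunciation(phonemes, "g2p", confidence)


def needs_manual_review(words, threshold=0.8):
    """Return the words whose predicted pronunciation falls below the assumed threshold."""
    flagged = []
    for word in words:
        entry = lookup(word)
        if entry.source == "g2p" and entry.confidence < threshold:
            flagged.append(word)
    return flagged


review_queue = needs_manual_review(["cat", "dog", "xylophone"])
```

Here a word with a manual entry is never queued, a word the toy model covers fully is accepted as predicted, and a word it covers poorly is queued for a linguist.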