Acoustic models have been used to transcribe audio data (e.g., digital voice recordings), for example to generate text transcripts of voicemail messages. Acoustic models can map portions of speech, such as phonemes (the smallest units of sound that distinguish one utterance from another in a spoken language), to audio data within a particular audio feature space. A feature space is defined by ranges of audio attributes (e.g., a range of pitch) that bound the audio data within it. Audio data can be transformed into different feature spaces so that the same portions of speech (e.g., the same phonemes) uttered by different speakers appear more similar. For example, various transformations (e.g., Linear Discriminant Analysis (LDA), Vocal Tract Length Normalization (VTLN), Constrained Maximum Likelihood Linear Regression (CMLLR)) can be applied to audio data so that phonemes (e.g., the /a/ phoneme) uttered by a first speaker with a high-pitched voice appear similar to the same phonemes uttered by a second speaker with a low-pitched voice.
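The idea of transforming audio data so that two speakers appear more similar can be illustrated with a minimal sketch. It is loosely modeled on VTLN-style frequency warping and is not a definitive implementation: the spectral peak values are made up for illustration, the function names (`warp`, `squared_distance`, `estimate_warp_factor`) are hypothetical, and a squared-distance criterion stands in for the likelihood-based selection that a practical system would more likely use.

```python
def warp(peaks_hz, alpha):
    """Linearly rescale the frequency axis by the warp factor alpha."""
    return [alpha * f for f in peaks_hz]


def squared_distance(a, b):
    """Sum of squared differences between two equal-length feature lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def estimate_warp_factor(peaks_hz, reference_hz):
    """Grid-search a warp factor in [0.80, 1.20] in steps of 0.005.

    Assumption: the best factor is the one that brings the warped peaks
    closest to the reference; this is a simplified stand-in criterion.
    """
    candidates = [i / 1000 for i in range(800, 1201, 5)]
    return min(
        candidates,
        key=lambda a: squared_distance(warp(peaks_hz, a), reference_hz),
    )


# Illustrative, made-up spectral peaks (Hz) for the same phoneme
# as uttered by two different speakers.
speaker_one_peaks_hz = [800.0, 1200.0, 2800.0]
speaker_two_peaks_hz = [700.0, 1050.0, 2450.0]

alpha = estimate_warp_factor(speaker_one_peaks_hz, speaker_two_peaks_hz)
normalized = warp(speaker_one_peaks_hz, alpha)

before = squared_distance(speaker_one_peaks_hz, speaker_two_peaks_hz)
after = squared_distance(normalized, speaker_two_peaks_hz)

print(f"warp factor: {alpha}")
print(f"distance before warping: {before}")
print(f"distance after warping:  {after}")
```

In this toy case the two speakers differ by a constant frequency ratio, so a single warp factor aligns them exactly; real speech would only be brought closer, not made identical.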
Different acoustic models can be used for different languages (e.g., English and French) and/or for different dialects of the same language (e.g., U.S. English and U.K. English).