Acoustic models have been used to transcribe speech data (e.g., digital voice recordings), for example to generate textual transcripts of voicemail messages. Acoustic models can map linguistic features, such as phonemes (the smallest units of sound that distinguish one utterance from another in a spoken language), to utterances in speech data. To generate (or train) an acoustic model for transcribing audio data in a particular language, training data in that language can be used. Training data can include speech data (e.g., speech samples) and textual transcripts that map particular portions of the speech data to text (e.g., words, portions of words). Speech data collection prompts, such as scripts and/or scenarios, have been manually generated and provided to users, who read them aloud to generate training data.
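
By way of illustration only, training data of the kind described above (a speech sample together with a transcript that maps portions of the speech data to text) might be represented as in the following minimal Python sketch. The class names `AlignedSegment` and `TrainingExample`, their fields, the time values, and the file name are hypothetical and are not drawn from any particular system.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AlignedSegment:
    """A portion of the speech data paired with its text (hypothetical layout)."""
    start_s: float  # start of the portion, in seconds from the beginning of the sample
    end_s: float    # end of the portion, in seconds
    text: str       # word or portion of a word spoken in this span


@dataclass
class TrainingExample:
    """One speech sample plus the transcript segments mapped to it."""
    audio_path: str                 # placeholder location of the recording
    segments: List[AlignedSegment]

    def transcript(self) -> str:
        # Join the segment texts in time order to recover the full transcript.
        ordered = sorted(self.segments, key=lambda s: s.start_s)
        return " ".join(s.text for s in ordered)


# Illustrative values only; segments are deliberately listed out of order.
example = TrainingExample(
    audio_path="sample_0001.wav",
    segments=[
        AlignedSegment(0.45, 0.90, "me"),
        AlignedSegment(0.00, 0.45, "call"),
        AlignedSegment(0.90, 1.40, "back"),
    ],
)
print(example.transcript())
```

In this sketch each segment carries its own time span, so the mapping from a portion of the speech data to text is kept even when segments are stored out of order.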