Text-to-speech (TTS) systems convert raw text into sound using a process sometimes known as speech synthesis. In a typical implementation, a TTS system first preprocesses raw text input by disambiguating homographs, expanding abbreviations and symbols (e.g., numerals) into words, and the like. The preprocessed text input can be converted into a sequence of words or subword units, such as phonemes. The resulting phoneme sequence is then associated with acoustic features of a number of small speech recordings, sometimes known as speech units. The phoneme sequence and corresponding acoustic features are used to select and concatenate speech units into an audio representation of the input text.
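The stages described above (preprocessing, conversion to phonemes, and selection and concatenation of speech units) can be illustrated with a minimal Python sketch. It is not taken from any particular TTS system: the lookup tables ABBREVIATIONS, DIGITS, and LEXICON, the ARPAbet-style phoneme symbols, and the functions normalize, to_phonemes, and synthesize are all hypothetical names chosen for illustration. Homograph disambiguation is omitted, numerals are read digit by digit, and each speech unit is assumed to be a plain list of audio samples keyed by a single phoneme.

```python
# Hypothetical lookup tables; a real system would use far larger resources.
ABBREVIATIONS = {"dr.": "doctor", "st.": "street"}
DIGITS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
          "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}
# Illustrative pronunciations using ARPAbet-style symbols (assumed, not exhaustive).
LEXICON = {
    "doctor": ["D", "AA", "K", "T", "ER"],
    "two": ["T", "UW"],
    "four": ["F", "AO", "R"],
}


def normalize(text):
    """Expand known abbreviations and digits into lowercase words."""
    words = []
    for token in text.lower().split():
        if token in ABBREVIATIONS:
            words.append(ABBREVIATIONS[token])
            continue
        token = token.strip(".,!?;:")
        if token and all(ch in DIGITS for ch in token):
            # Simplification: read numerals one digit at a time.
            words.extend(DIGITS[ch] for ch in token)
        elif token:
            words.append(token)
    return words


def to_phonemes(words):
    """Map each word to its phonemes; unknown words raise KeyError."""
    phonemes = []
    for word in words:
        phonemes.extend(LEXICON[word])
    return phonemes


def synthesize(text, units):
    """Concatenate one speech unit per phoneme into a single sample list.

    `units` maps a phoneme symbol to a list of audio samples (an assumption
    made here to keep the sketch small).
    """
    phonemes = to_phonemes(normalize(text))
    audio = []
    for phoneme in phonemes:
        audio.extend(units[phoneme])
    return phonemes, audio
```

Calling synthesize("2", units) with a unit table that covers the phonemes "T" and "UW" would return the phoneme sequence together with the concatenated samples. A production system would also weigh acoustic features when choosing among several candidate units for the same phoneme, which this sketch does not attempt.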
Different voices may be implemented as sets of speech units and data regarding the association of the speech units with a sequence of words or subword units. Speech units can be created by recording a human reading a script. The recording can then be segmented into speech units, which can be portions of the recording sized to encompass all or part of words or subword units. In some cases, each speech unit is a diphone encompassing parts of two consecutive phonemes. Different languages may be implemented as sets of linguistic and acoustic rules regarding the association of the language's phonemes and their phonetic features with raw text input. During speech synthesis, a TTS system utilizes linguistic rules and other data to select and arrange the speech units in a sequence that, when heard, approximates a human reading of the input text. The linguistic rules, as well as their application to actual text input, are typically determined and tested by linguists and other knowledgeable people during development of a language or voice used by the TTS system.
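The idea of a voice as a set of diphone speech units can be sketched as follows. This is a simplified illustration rather than a description of any specific system: the Voice class, the SILENCE boundary marker, and the functions to_diphones and select_units are hypothetical, and each unit is assumed to be a list of audio samples keyed by a pair of consecutive phonemes.

```python
from dataclasses import dataclass, field

SILENCE = "_"  # hypothetical marker for the start and end of an utterance


@dataclass
class Voice:
    """A voice modeled as a named inventory of diphone units (assumed layout)."""
    name: str
    # Maps (left_phoneme, right_phoneme) to a list of audio samples.
    units: dict = field(default_factory=dict)


def to_diphones(phonemes):
    """Pair each phoneme with its successor, padding both ends with silence."""
    padded = [SILENCE, *phonemes, SILENCE]
    return list(zip(padded, padded[1:]))


def select_units(voice, phonemes):
    """Return the voice's units for the phoneme sequence, in order."""
    diphones = to_diphones(phonemes)
    missing = [d for d in diphones if d not in voice.units]
    if missing:
        raise KeyError(f"voice {voice.name!r} lacks diphones: {missing}")
    return [voice.units[d] for d in diphones]
```

Keying units by phoneme pairs means each unit spans the transition between two sounds, so that joins fall within phonemes rather than at their boundaries. Swapping in a different Voice object changes the recordings used without changing the phoneme sequence derived from the text.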