Text-to-speech (TTS) systems convert raw text into sound using a process sometimes known as speech synthesis. In a typical implementation, a TTS system first preprocesses raw text input by disambiguating homographs, expanding abbreviations and symbols (e.g., numerals) into words, and the like. The preprocessed text input can be converted into a sequence of words or subword units, such as phonemes or diphones. The resulting sequence of words or subword units is then associated with acoustic features of speech segments, which may be small recorded or synthesized speech files. The phoneme sequence and corresponding acoustic features are used to select and concatenate speech segments into an audio presentation of the input text.
Different voices for a TTS system may be implemented as sets of speech segments and data regarding the association of the speech segments with a sequence of words or subword units. Speech segments can be created by recording a human while the human is reading a script. The recording can then be separated into segments sized to encompass all or part of words or subword units.
TTS systems may be deployed onto a variety of devices, ranging from servers and desktop computers to electronic book readers and mobile phones. In a typical deployment, a TTS engine and voice data for one or more voices may be distributed to the device via a disk or via network download. In some cases, the TTS engine and voice data may be preinstalled on the device.