Text-to-speech (TTS) systems convert raw text into sound using a process sometimes known as speech synthesis. In a common implementation, a TTS system may be installed on a client device, such as a desktop computer, electronic book reader, or mobile phone. Software applications on the client device, such as a web browser, may employ the TTS system to generate an audio file or stream of synthesized speech from a text input.
In a typical implementation, a TTS system first preprocesses raw text input by disambiguating homographs, expanding abbreviations and symbols (e.g., numerals) into words, and the like. The preprocessed text input can be converted into a sequence of words or subword units, such as phonemes. The resulting phoneme sequence is then associated with acoustic features of a number small speech recordings, sometimes known as speech units. The phoneme sequence and corresponding acoustic features are used to select and concatenate speech units into an audio presentation of the input text.
Different voices (e.g., male American English, female French, etc.) may be implemented as sets of recorded speech units and data regarding the association of the speech units with a sequence of words or subword units. The amount of storage space required to store the data required to implement the voice (e.g., the recorded speech units) may be substantial, particularly in comparison with the limited storage capabilities of some client devices, such as electronic book readers and mobile phones.