Text-to-speech (TTS) systems convert raw text into sound using a process sometimes known as speech synthesis. In a common implementation, a TTS system may comprise a computing device configured to receive text input and provide an audio presentation of the text input. Some TTS systems provide a number of different language modules and voice modules. Language modules enable a TTS system to receive and process text in a written language, such as American English, German, or Italian. Voice modules enable a TTS system to output an audio presentation in a specific voice, such as French female, Spanish male, or Portuguese child.
TTS systems first preprocess raw text input by disambiguating homographs, expanding abbreviations and symbols (e.g., numerals) into words, and performing other such operations. The preprocessed text input can be converted into a sequence of words or subword units, such as phonemes or diphones. The resulting sequence is then associated with acoustic and/or linguistic features of a number of small speech recordings, also known as speech segments. The phoneme sequence and corresponding acoustic and/or linguistic features are used to select and concatenate recorded and synthetic speech segments into an audio presentation of the input text.
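The preprocessing step described above can be illustrated with a minimal sketch in Python. It covers only abbreviation and numeral expansion, not homograph disambiguation or unit selection. The abbreviation table and the names normalize and number_to_words are hypothetical and chosen for illustration; they are not taken from any particular TTS system, which would typically use much larger, context-aware rules.

```python
import re

# Hypothetical abbreviation table (an assumption for illustration only).
ABBREVIATIONS = {"Dr.": "Doctor", "approx.": "approximately"}

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer in the range 0..999."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words + (" " + number_to_words(rest) if rest else "")


def normalize(text: str) -> str:
    """Expand known abbreviations, then spell out numerals below 1000."""
    for abbreviation, expansion in ABBREVIATIONS.items():
        text = text.replace(abbreviation, expansion)

    def spell(match: re.Match) -> str:
        value = int(match.group())
        # Larger numbers are left untouched in this sketch.
        return number_to_words(value) if value < 1000 else match.group()

    return re.sub(r"\d+", spell, text)
```

For example, normalize("Dr. Lee has 3 cats and 120 books") yields "Doctor Lee has three cats and one hundred twenty books". The normalized text would then be passed to the later stages that map words to phonemes or diphones and select speech segments.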
TTS systems may be configured to generate audio presentations from message text, such as electronic mail (email) and text messages, and play back the audio presentations to a user. Some applications that include TTS functionality facilitate entry of network addresses of content, such as uniform resource locators (URLs). Such applications may be configured to retrieve text content from the location corresponding to the entered URL, generate an audio presentation of the content, and transmit or play back the audio presentation to a user.
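The URL-based flow described above might look roughly like the following Python sketch, which uses only the standard library. The function names extract_text and read_url_aloud are hypothetical, and the synthesize argument stands in for whatever TTS engine an application actually uses; it is assumed here to be any callable that takes text and returns audio data.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping script and style contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def extract_text(html: str) -> str:
    """Return the readable text of an HTML document as one string."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def read_url_aloud(url, synthesize, fetch=None):
    """Retrieve content at url, extract its text, and synthesize audio.

    synthesize is a placeholder for a TTS engine (an assumption).
    fetch may be overridden, for example in tests; by default the
    content is downloaded with urllib.
    """
    if fetch is None:
        def fetch(target):
            with urlopen(target) as response:
                return response.read().decode("utf-8", errors="replace")
    return synthesize(extract_text(fetch(url)))
```

Passing the fetch and synthesize steps in as callables keeps the sketch independent of any specific network stack or speech engine; the returned audio could then be played back locally or transmitted to the user's device.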