Conventional text-to-speech synthesizers can be used to convert text into corresponding audio. For example, a text-to-speech synthesizer can receive a set of text to be converted into audio. Depending on its configuration, the text-to-speech synthesizer can implement any number of different conventional algorithms to convert the received set of text into equivalent audio.
One conventional algorithm for converting text into an audio output symbol representation is a so-called lexicon lookup. The lexicon can include a complete listing of words and/or morphemes (e.g., subparts of words) for a particular language. Each word and/or morpheme in the lexicon maps to a corresponding audio output symbol representation. By performing a conventional lexicon lookup for each word in a received set of text, a text-to-speech synthesizer produces a proper audio output symbol representation for that text.
Typically, a conventional text-to-speech synthesizer is able to find most words in a received set of text via a lexicon lookup. Words that are not found during the lexicon lookup are called out-of-vocabulary words. Out-of-vocabulary words are words for which the text-to-speech synthesizer is less certain how to generate a proper audio output symbol representation.
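As an illustration only, a lexicon lookup with out-of-vocabulary detection might be sketched as follows. The toy lexicon, its phoneme symbols, and the function name lookup_words are assumptions made for this example and are not taken from any particular synthesizer.

```python
# Hypothetical toy lexicon: word -> audio output symbols (illustrative values only).
LEXICON = {
    "the": ["DH", "AH"],
    "cat": ["K", "AE", "T"],
    "sat": ["S", "AE", "T"],
}


def lookup_words(text):
    """Split text on whitespace and look each word up in the lexicon.

    Returns a dict of found pronunciations and a list of
    out-of-vocabulary words (words missing from the lexicon).
    """
    pronunciations = {}
    out_of_vocabulary = []
    for word in text.lower().split():
        if word in LEXICON:
            pronunciations[word] = LEXICON[word]
        else:
            out_of_vocabulary.append(word)
    return pronunciations, out_of_vocabulary


found, oov = lookup_words("The cat sat blorf")
```

In this sketch, "blorf" has no lexicon entry and is therefore reported as an out-of-vocabulary word, while the other three words resolve directly.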
Another conventional algorithm that a text-to-speech synthesizer can use to convert text is a so-called grapheme-to-phoneme (G2P) algorithm. In general, grapheme-to-phoneme conversion is the process of using grapheme-to-phoneme rules to generate a pronunciation for received text. Grapheme-to-phoneme rules can be created by automated statistical analysis of a pronunciation dictionary.
Conventional grapheme-to-phoneme algorithms can be used to generate a most probable sound for words (e.g., so-called out-of-vocabulary words) that are not found by a lexicon lookup algorithm. Lexicon lookup and the corresponding generation of an audio output symbol representation for a word is preferred because it is typically quite accurate. Generation of an audio output symbol representation for an out-of-vocabulary word using a grapheme-to-phoneme algorithm is typically much less accurate and may be incorrect, because the grapheme-to-phoneme algorithm operates on a best-effort basis. In other words, the grapheme-to-phoneme algorithm does its best to produce a proper pronunciation of a given word, although the resulting output may be inaccurate.
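The preference for lexicon lookup, with grapheme-to-phoneme conversion as a fallback, might be sketched as follows. This is a minimal sketch under stated assumptions: the rule table is hand-written and hypothetical (real rules would typically be derived statistically from a pronunciation dictionary), and the greedy longest-match strategy is only one simple way to apply such rules.

```python
# Hypothetical lexicon and grapheme-to-phoneme rules (illustrative values only).
LEXICON = {"cat": ["K", "AE", "T"]}

G2P_RULES = {
    "sh": ["SH"], "ch": ["CH"], "th": ["TH"],
    "a": ["AE"], "b": ["B"], "c": ["K"], "e": ["EH"], "i": ["IH"],
    "o": ["AA"], "p": ["P"], "s": ["S"], "t": ["T"], "u": ["AH"],
}
MAX_GRAPHEME_LEN = max(len(grapheme) for grapheme in G2P_RULES)


def g2p(word):
    """Best-effort pronunciation using greedy longest-match rules."""
    phonemes = []
    i = 0
    while i < len(word):
        # Try the longest grapheme first, then shorter ones.
        for size in range(min(MAX_GRAPHEME_LEN, len(word) - i), 0, -1):
            chunk = word[i:i + size]
            if chunk in G2P_RULES:
                phonemes.extend(G2P_RULES[chunk])
                i += size
                break
        else:
            i += 1  # No rule for this letter: skip it (assumption of this sketch).
    return phonemes


def pronounce(word):
    """Prefer the lexicon; fall back to G2P for out-of-vocabulary words."""
    word = word.lower()
    if word in LEXICON:
        return LEXICON[word], "lexicon"
    return g2p(word), "g2p"
```

Here pronounce("cat") is answered from the lexicon, whereas pronounce("ship") falls through to the rules. The second return value records which path produced the result, so that a caller could treat rule-generated pronunciations as less reliable.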
Text-to-speech synthesis can also include so-called text normalization. Conventional text normalization includes transforming text into a single canonical form. Normalizing text before storing or processing it allows for separation of concerns, since the input is guaranteed to be consistent before operations are performed on it. Typically, text normalization in text-to-speech applications requires knowing what type of text is to be normalized and how the text is to be expanded during text-to-speech conversion. As a more specific example, the word “vi” may have different meanings in different contexts. Text normalization involves tuning a text-to-speech synthesizer to produce a different audio output for this non-standard expression depending on the context in which it is used. A text-to-speech synthesizer may pronounce the textual word “vi” as “vie”, “vee”, or “sixth” depending on the textual context in which the expression is used.
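A context-dependent expansion of “vi” might be sketched as follows. The three spoken forms come from the example above, but which context selects which form is not specified there. The cues used here (an uppercase token after a capitalized word, a following token containing a period) and the function names expand_vi and normalize are assumptions made purely for illustration.

```python
def expand_vi(tokens, i):
    """Pick a spoken form for tokens[i], assumed to be some casing of 'vi'.

    The context cues below are hypothetical and chosen only to show
    that the same text can expand differently.
    """
    prev_tok = tokens[i - 1] if i > 0 else ""
    next_tok = tokens[i + 1] if i + 1 < len(tokens) else ""
    if tokens[i].isupper() and prev_tok[:1].isupper():
        return "sixth"  # Assumed cue: uppercase "VI" after a capitalized word.
    if "." in next_tok:
        return "vee"    # Assumed cue: followed by a file-like token.
    return "vie"        # Assumed default when no cue matches.


def normalize(text):
    """Replace each 'vi' token with a context-dependent spoken form."""
    tokens = text.split()
    return " ".join(
        expand_vi(tokens, i) if tok.lower() == "vi" else tok
        for i, tok in enumerate(tokens)
    )
```

With these assumed cues, normalize("Henry VI ruled") yields "Henry sixth ruled" and normalize("open vi notes.txt") yields "open vee notes.txt". A practical normalizer would need far richer context than the neighboring tokens used in this sketch.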