Text to speech (TTS) systems create computer-generated or synthesized speech directly from text input. Concatenative text to speech systems rely on linguistic building blocks called phonemes or phonetic elements and arrange sequences of recorded phonemes (also called speech units at times in the following description) in order to create a voiced representation of a given text. The word ‘school’, for example, contains four phonemes that are referred to as S, K, OO and L. Languages differ in the number of phonemes they contain. English makes use of about forty distinct phonemes, whereas Japanese has about twenty-five and German forty-four. Just as typesetters once sequenced letters of metal type in trays to create printed words, current text to speech systems sequence recorded speech units to create spoken words.
A concatenative text to speech system is described in Scientific American, June 2005, pages 64 to 69. The article describes a TTS system including a database that contains an average of 10,000 recorded samples, the speech units, of each of the approximately 40 phonemes in the English language. This database was created by recording more than 10,000 sentences voiced by dozens of candidate speakers. The sentences were picked in part for their relevance to real world applications and in part for their diverse phoneme content, which ensured that many examples of all English phonemes were captured in different contexts. When words are combined into sentences, the relative loudness and pitch of each sound changes, based on the speaker's mood, what he or she wants to emphasize, and the type of sentence, e.g. a question or an exclamation. Hence the phoneme samples derived from the sentences can vary significantly, which is reflected in the database.
In order to convert a text into synthesized speech, the above-mentioned TTS system translates the text into the corresponding series of words, whereby ambiguities such as multiple ways of interpreting abbreviations, e.g., ‘St.’ can be an abbreviation for ‘Saint’ and for ‘Street’, are resolved. With the sequence of words established, the TTS system determines how the words are to be voiced. For some words, pronunciation depends on the part of speech. For instance, the word ‘permits’ is spoken with emphasis on the first syllable when it is used as a noun and on the second syllable when it is used as a verb. Synthesizers are able to handle all the idiosyncratic pronunciations of English, such as silent letters, proper names and words like ‘permits’ that can be pronounced in multiple ways.
In order to determine the part of speech of each word, the above-described TTS system uses a grammar parser. For example, the sentence ‘permits cost $80/yr.’ is parsed to: permits (noun) cost (verb) 80 (adjective) dollars (noun) per (preposition) year (noun). This sequence of words is then converted into phonemes to be used in proper selection of the corresponding speech units. The phonemes are referred to in the following as target phonetic elements.
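The step from tagged words to target phonetic elements can be illustrated with a minimal sketch. It assumes a hypothetical mini-lexicon keyed by word and part of speech; the phoneme symbols and the stress digits (1 for a stressed vowel, 0 for an unstressed one) are illustrative assumptions and are not taken from the described system.

```python
# Hypothetical mini-lexicon: (word, part of speech) -> target phonemes.
# Symbols and stress digits are assumptions made for this illustration.
LEXICON = {
    ("permits", "noun"): ["P", "ER1", "M", "IH0", "T", "S"],  # stress on first syllable
    ("permits", "verb"): ["P", "ER0", "M", "IH1", "T", "S"],  # stress on second syllable
    ("cost", "verb"): ["K", "AO1", "S", "T"],
}


def to_target_phonemes(tagged_words):
    """Convert (word, part-of-speech) pairs into one flat list of target phonemes."""
    phonemes = []
    for word, part_of_speech in tagged_words:
        phonemes.extend(LEXICON[(word.lower(), part_of_speech)])
    return phonemes


# The parser output for the start of 'permits cost $80/yr.' selects the noun reading.
targets = to_target_phonemes([("permits", "noun"), ("cost", "verb")])
```

Keying the lexicon on the part of speech is what allows the same spelling to yield two different phoneme sequences.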
Determining which recorded speech unit to select from the approximately 10,000 speech units stored for a target phonetic element in order to synthesize the corresponding part of the text is challenging. Each sound in a sentence varies slightly, depending on the sounds that precede and follow it, a phenomenon called co-articulation. The ‘permits’ example contains six individual phonemes. Because each of these six phonemes has about 10,000 original samples to choose from, there are about 10,000^6 (that is, 10^24) possible combinations. This enormous number makes it impossible to take all combinations into account and to determine the best matched combination of speech units by exhaustive comparison, even in modern and fast computer systems.
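The size of this search space can be checked with a few lines of arithmetic. The function name below is a hypothetical helper introduced only for this illustration; the figures of 10,000 samples and six phonemes are those given above.

```python
def exhaustive_combinations(samples_per_phoneme, phoneme_count):
    """Number of distinct unit sequences when every phoneme has the same number of samples."""
    return samples_per_phoneme ** phoneme_count


# Six phonemes in 'permits', about 10,000 recorded samples for each of them.
combinations = exhaustive_combinations(10_000, 6)
print(f"{combinations:.0e}")  # 1e+24
```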
The above-described TTS system therefore exploits a technique called dynamic programming to search the database efficiently and to determine the best fit. In order to correct any mismatch that occurs between adjacent phonetic elements or phonemes, the TTS system makes small pitch adjustments, bending the pitch up or down at the edge of each sample in order to fit the phonetic element to that of its neighbor.
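The pitch adjustment at the sample edges can be pictured with a minimal sketch. It assumes that each speech unit carries a pitch contour given as a non-empty list of values in Hz per frame, and that the correction is a linear ramp meeting at the midpoint of the two edge values; both the representation and the ramp shape are assumptions for illustration and are not details of the described system.

```python
def bend_to_meet(left, right, ramp=3):
    """Return copies of two pitch contours (Hz per frame) bent so they meet at the join.

    Assumes both contours are non-empty. The last `ramp` frames of `left` and the
    first `ramp` frames of `right` are shifted; the shift is full at the edge and
    fades linearly towards the interior of each unit.
    """
    ramp = min(ramp, len(left), len(right))
    meet = (left[-1] + right[0]) / 2.0   # common pitch at the join
    left_shift = meet - left[-1]
    right_shift = meet - right[0]
    new_left, new_right = list(left), list(right)
    for i in range(ramp):
        weight = (ramp - i) / ramp       # 1.0 at the edge, smaller further inside
        new_left[-1 - i] += weight * left_shift
        new_right[i] += weight * right_shift
    return new_left, new_right


# A 10 Hz mismatch at the join is split evenly between the two neighbours.
bent_left, bent_right = bend_to_meet([100.0] * 4, [110.0] * 4, ramp=2)
```

Limiting the change to a few frames at each edge keeps the interior of every recorded sample untouched, which is consistent with the adjustments being described as small.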
The TTS system determines and selects the speech units from the database by use of a cost function. That is, costs are calculated that define how closely a speech unit in question matches the target phonetic element, i.e. the phoneme that the TTS system has predicted for a particular segment of the text or of the word. One part of these costs is based on segmental criteria such as phones and phone context. This part is referred to as (segmental) unit costs. Another part, the so-called concatenation cost, is used to measure how closely a speech unit in question matches its adjacent speech units. The speech unit that provides the lowest cost is then selected for a target phonetic element.
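The interplay of unit costs, concatenation costs and the dynamic programming search mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the described system: the `Target` and `Unit` records, the unit cost (a count of phone-context mismatches), the concatenation cost (pitch difference at the join) and the weight `CONCAT_WEIGHT` are all hypothetical choices.

```python
from dataclasses import dataclass

CONCAT_WEIGHT = 0.5  # assumed weight of the concatenation cost


@dataclass(frozen=True)
class Target:
    phoneme: str
    left_context: str
    right_context: str


@dataclass(frozen=True)
class Unit:
    phoneme: str
    left_context: str   # phoneme that preceded this sample in the recording
    right_context: str  # phoneme that followed this sample in the recording
    pitch_start: float  # Hz at the left edge
    pitch_end: float    # Hz at the right edge


def unit_cost(target, unit):
    """Assumed segmental cost: number of phone-context mismatches (0, 1 or 2)."""
    return float((target.left_context != unit.left_context)
                 + (target.right_context != unit.right_context))


def concatenation_cost(left, right):
    """Assumed join cost: weighted pitch jump between adjacent units."""
    return CONCAT_WEIGHT * abs(left.pitch_end - right.pitch_start)


def select_units(targets, candidates):
    """Find the cheapest unit sequence; candidates[i] lists the units for targets[i]."""
    # best[j]: lowest total cost of any sequence ending in candidate j of the current target
    best = [unit_cost(targets[0], u) for u in candidates[0]]
    back = []  # back[i - 1][j]: best predecessor index for candidate j of target i
    for i in range(1, len(targets)):
        previous = candidates[i - 1]
        new_best, pointers = [], []
        for u in candidates[i]:
            k = min(range(len(previous)),
                    key=lambda p: best[p] + concatenation_cost(previous[p], u))
            new_best.append(best[k] + concatenation_cost(previous[k], u)
                            + unit_cost(targets[i], u))
            pointers.append(k)
        best = new_best
        back.append(pointers)
    # Trace the cheapest sequence back from the last target to the first.
    j = min(range(len(best)), key=best.__getitem__)
    total, path = best[j], [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return total, [candidates[i][j] for i, j in enumerate(path)]
```

In this sketch each target keeps only the cheapest way of reaching each of its candidates, so the work grows with the number of targets times the square of the candidates per target rather than with the number of complete sequences. A unit with a slightly worse unit cost can still be selected when it joins its neighbor more smoothly.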
The above-described TTS system provides good segment quality as the above-mentioned cost function ensures that the selected speech unit is the best match to the corresponding target phonetic element. However, prosody (patterns of alternating stressed and unstressed elements) and intonation in human speech are normally supra-segmental (that is, they extend over more than one sound segment) with respect to phonemes and thus with respect to target phonetic elements and corresponding selected speech units. The prosody of the concatenated speech units therefore is still not optimal in comparison with human speech.