Speech is used to communicate information from a speaker to a listener. In a computer-user interface, the computer generates synthesized speech to convey an audible message to the user rather than just displaying the message as text with an accompanying “beep.” There are several advantages to conveying audible messages to the computer user in the form of synthesized speech. In addition to liberating the user from having to look at the computer's display screen, the spoken message conveys more information than the simple “beep” and, for certain types of information, speech is a more natural communication medium. Speech synthesis may also be useful in bulk output applications (e.g., reading aloud a document).
Generating natural-sounding synthesized speech has long been the ultimate challenge for text-to-speech (TTS) systems. Not only is naturalness more aesthetically pleasing, but it affects intelligibility as well. The more closely synthetic speech models natural speech, the more richly and redundantly the content and structure of the information will be represented in the acoustic signal. This in turn means that it will be easier for the listener to recover the intended meaning from the signal; that is, the cognitive load associated with this task will be lower. Consequently, the task of understanding the speech will interfere less with other tasks the user is performing when using the computer system. More natural TTS will thereby support a wider range of applications.
One important component of naturalness in synthesized speech is generating the correct prominence contour for each spoken sentence. As used herein, the phrase “prominence contour” refers to the relative perceptual salience or emphasis of each of the words in each spoken sentence. This is sometimes described as some words being intentionally spoken in such a way as to stand out to the listener more than other words in the same sentence. In natural speech, more or less prominence is assigned to the different words of a sentence depending on a variety of factors, including word type (e.g., function word or content word), syntactic category (e.g., noun or verb), and semantic role (e.g., the difference between “French teachers” meaning people who teach the French language, regardless of where they come from, and “French teachers” meaning teachers of any subject who happen to come from France). These factors are lexical properties of the words or noun compounds, and can usually be found in a dictionary. However, a more important function of the relative prominence of words in a sentence is to convey how the overall information is structured, and how the concepts that are conveyed by the individual words relate to each other and to the overall contextual meaning of the message as a whole. One particularly important role of relative prominence is to convey whether a word is introducing a new concept to the current discourse, or whether it is merely referring to a concept that has already been introduced earlier in the discourse. This role is often referred to as “given versus new” information.
In synthesized speech (or, for that matter, natural speech), if any word is assigned the wrong prominence, the spoken sentence becomes distorted, resulting in anything from a mildly misleading change in emphasis, to the distraction of a complete shift in meaning, to the perception of a foreign accent, to an unnatural delivery affecting understandability, and thereby interfering with usability of the technology. For this reason the perceived quality of text-to-speech (TTS) systems is heavily dependent on word prominence assignment.
Most existing TTS systems use simple rules to carry out word prominence assignment. For example, function words (such as “the,” “for,” or “in”) are not, ordinarily, emphasized; all other things being equal, nouns are assigned more prominence than verbs; and, in some recent and more sophisticated systems, new information is accentuated more than information that was previously given. In the vast majority of cases, the first two rules are easily implemented, as it is straightforward to devise a list of function words, and only slightly more challenging to maintain a list of possible parts of speech for each word. It is, however, considerably more difficult in practice to determine what constitutes “new” versus “given” information.
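The first two rules described above can be illustrated with a minimal sketch. The function-word list, the toy part-of-speech lexicon, the numeric prominence scale, and the name `lexical_prominence` are all assumptions made for illustration only; they are not taken from any particular TTS system.

```python
# Hypothetical function-word list; a real system would use a fuller list.
FUNCTION_WORDS = {"the", "for", "in", "a", "an", "of", "to", "my", "your"}

# Hypothetical toy lexicon mapping words to a single part of speech.
PART_OF_SPEECH = {"mama": "noun", "memphis": "noun", "lives": "verb"}

# Assumed prominence scale: larger numbers mean more prominence.
PROMINENCE = {"function": 0, "verb": 1, "other": 1, "noun": 2}


def lexical_prominence(sentence):
    """Assign a prominence level to each word using only lexical rules."""
    result = []
    for token in sentence.split():
        word = token.strip(".,?!").lower()
        if word in FUNCTION_WORDS:
            # Rule 1: function words are ordinarily not emphasized.
            level = PROMINENCE["function"]
        else:
            # Rule 2: nouns receive more prominence than verbs.
            category = PART_OF_SPEECH.get(word, "other")
            level = PROMINENCE[category]
        result.append((word, level))
    return result


print(lexical_prominence("My mama lives in Memphis."))
```

Such rules depend only on the words themselves, so the sketch has no way to take the preceding discourse into account.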
Some of the most recent state-of-the-art TTS systems use a simple rule for prominence assignment: give less prominence to those words that have already been seen in previous sentences (within some well-defined domain such as a paragraph, discourse segment, or document), because they refer to “given” information. However, even words that have not already been seen in previous sentences may refer to given information. What constitutes given information is more accurately measured in terms of the underlying concepts to which the words refer, rather than merely whether the words have already been seen. Since many different words can be used to express the same concept, once a concept has been introduced, all words referring to the concept should be assigned less prominence, and not just the previously used word. Determining which words express the same concept involves not only words that are synonyms, but more generally, words that are semantically related to one another. To better understand the distinction between synonyms and semantically related words, consider the question “Has John read Lord of the Rings?” and the accompanying answer “John doesn't read books.” The word “books” has little or no prominence in this context because it is semantically related to (although not a synonym for) “Lord of the Rings.” If this answer were not preceded by the above question, then “books” would have greater prominence. Determining which words are semantically related is, however, very complex due to the multi-faceted nature of semantic relationships.
For example, recited below are two versions of a simple dialog with the same answer:
Why did you decide to spend your vacation in Tennessee?        (1)
My mama lives in Memphis.        (2)
and
You're gonna visit your mother when you're in Nashville?        (3)
My mama lives in Memphis.        (4)
Using the simple rules of word prominence, a prior art TTS system would generate the words mama and Memphis in both sentences (2) and (4) with about the same prominence, since neither mama nor Memphis is present in the previous sentences (1) and (3). In natural speech, however, mama and Memphis are spoken with about the same prominence only in sentence (2), while in sentence (4) mama is spoken with markedly less prominence than Memphis. This phenomenon is explained in terms of which words represent “new” information and which do not. In both sentences (2) and (4), Memphis is not only semantically related to a word in the preceding question (Tennessee or Nashville, respectively), but also adds new information (the exact location in the first answer, and the correct location in the second answer). In contrast, mama in sentence (4) is semantically related to the word mother in (3), but adds no new information since mama is a strict synonym for mother. Thus, in natural speech, the word mama is treated as a representative of a previously given concept and, accordingly, is spoken with comparatively less prominence.
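The limitation described above can be made concrete with a minimal sketch of the repetition-based rule applied to the two dialogs. The function names, the function-word list, and the labels "function", "given", and "new" are illustrative assumptions rather than the behavior of any specific prior art system.

```python
# Hypothetical function-word list used only for this illustration.
FUNCTION_WORDS = {"the", "for", "in", "a", "an", "of", "to", "my", "your"}


def tokenize(text):
    """Lowercase the text and strip simple sentence punctuation."""
    return [token.strip(".,?!").lower() for token in text.split()]


def repetition_rule(previous_sentences, sentence):
    """Label each word of `sentence` by exact repetition only.

    A content word is "given" only if the same word form already
    appeared in `previous_sentences`; otherwise it is "new".
    """
    seen = set()
    for previous in previous_sentences:
        seen.update(tokenize(previous))
    labels = {}
    for word in tokenize(sentence):
        if word in FUNCTION_WORDS:
            labels[word] = "function"
        elif word in seen:
            labels[word] = "given"  # would receive reduced prominence
        else:
            labels[word] = "new"    # would receive full prominence
    return labels


question_1 = "Why did you decide to spend your vacation in Tennessee?"
question_3 = "You're gonna visit your mother when you're in Nashville?"
answer = "My mama lives in Memphis."

labels_2 = repetition_rule([question_1], answer)
labels_4 = repetition_rule([question_3], answer)
print(labels_2)
print(labels_4)
```

Because the rule compares word forms only, it labels mama as "new" in both answers; it cannot detect that mama in sentence (4) refers to the concept already introduced by mother in sentence (3).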
The challenge, therefore, is to provide a principled way to obtain a semantically-driven prominence assignment that is consistent with the way humans assign word prominence in natural speech, in order to more redundantly convey meanings and, therefore, to generate synthesized speech that is more easily understood. Doing so should result in more natural-sounding synthetic speech with perceptibly better quality than that provided by prior art TTS systems.