In using a typical TTS system, a person inputs text, for example, via a computer system. The text is transmitted to the TTS system. Next, the TTS system analyzes the text and generates a synthesized speech signal that is transmitted to an acoustic output device. The acoustic output device outputs the synthesized speech signal.
The creation of the generated speech of TTS systems has focused on two characteristics, namely intelligibility and naturalness. Intelligibility relates to whether a listener can understand the speech produced (e.g., does "dog" really sound like "dog" when it is generated or does it sound like "dock"). However, just as important as intelligibility is the human-like quality, or naturalness, of the generated speech. In fact, it has been demonstrated that unnaturalness can affect intelligibility.
Many previous attempts to generate natural sounding speech with TTS systems have addressed a variety of issues.
One of these issues is the need to assign appropriate intonation to the speech. Intonation includes such intonational features, or "variations," as intonational prominence, pitch range, intonational contour, and intonational phrasing. Intonational phrasing, in particular, is "chunking" of words in a sentence into meaningful units separated by pauses, the latter being referred to as intonational phrase boundaries. Assigning intonational phrase boundaries to the text involves determining, for each pair of adjacent words, whether one should insert an intonational phrase boundary between them. Depending upon where intonational phrase boundaries are inserted into the candidate areas, the speech generated by a TTS system may sound very natural or very unnatural.
Known methods of assigning intonational phrase boundaries are disadvantageous for several reasons. Developing a model is very time consuming. Further, after investing much time to generate a model, the methods that use the model simply are not accurate enough (i.e., they insert a pause where one should not be present and/or they do not insert a pause where one should be present) to generate natural sounding synthesized speech.
The pauses and other intonational variations in human speech often have great bearing on the meaning of the speech and are, thus, quite important. For example, with respect to intonational phrasing, the sentence "The child isn't screaming because he is sick" spoken as a single intonational phrase may lead the listener to infer that the child is, in fact, screaming, but not because he is sick. However, if the same sentence is spoken as two intonational phrases with an intonational phrase boundary between "screaming" and "because," (i.e., "The child isn't screaming, because he is sick") the listener is likely to infer that the child is not screaming, and the reason is that he is sick.
Assigning intonational phrasing has previously been carried out using one of at least five methods. The first four methods have an accuracy of about 65 to 75 percent when tested against human performance (e.g., where a speaker would have paused/not paused). The fifth method has a higher degree of accuracy than the first four methods (about 90 percent) but takes a long time to carry out the analysis.
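The accuracy figures above can be read as agreement between a method's decisions and a human speaker's decisions at each position between adjacent words. The following Python sketch illustrates that reading only. The function name `boundary_accuracy` and the representation of decisions as lists of booleans are assumptions made for illustration and do not come from the methods described.

```python
def boundary_accuracy(predicted, human):
    """Fraction of word junctures where the method agrees with the speaker.

    Each list holds one boolean per pair of adjacent words: True means an
    intonational phrase boundary (pause), False means no boundary.
    This scoring rule is an assumed reading of "tested against human
    performance", not a definition taken from the source.
    """
    if len(predicted) != len(human):
        raise ValueError("both lists must cover the same word junctures")
    if not predicted:
        raise ValueError("at least one word juncture is required")
    matches = sum(p == h for p, h in zip(predicted, human))
    return matches / len(predicted)


if __name__ == "__main__":
    # Hypothetical decisions for a five-word sentence (four junctures).
    method = [False, True, False, False]
    speaker = [False, True, True, False]
    print(boundary_accuracy(method, speaker))
```

Under this reading, both kinds of error count against a method: a pause inserted where the speaker did not pause, and a pause omitted where the speaker did pause.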
A first method is to assign intonational phrase boundaries in all places where the input text contains punctuation internal to a sentence (i.e., a comma, colon, or semi-colon, but not a period). This method has many shortcomings. For example, not every punctuation mark internal to the sentence should be assigned an intonational phrase boundary. Thus, there should not be an intonational phrase boundary between "Rock" and "Arkansas" in the phrase "Little Rock, Arkansas." Another shortcoming is that when text is read aloud by a person, the person typically assigns intonational phrase boundaries to places other than internal punctuation marks in the text.
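The punctuation rule of the first method can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function name `punctuation_boundaries`, the whitespace tokenization, and the output format are hypothetical and are not taken from any particular TTS system.

```python
# Sentence-internal punctuation as listed in the text (comma, colon, semi-colon).
INTERNAL_PUNCTUATION = (",", ":", ";")


def punctuation_boundaries(sentence):
    """Return (left_word, right_word, boundary) for each adjacent word pair.

    A boundary is assigned exactly when the left word is followed by
    sentence-internal punctuation. Tokenizing on whitespace is a
    simplifying assumption of this sketch.
    """
    tokens = sentence.split()
    decisions = []
    for left, right in zip(tokens, tokens[1:]):
        boundary = left.endswith(INTERNAL_PUNCTUATION)
        # Strip punctuation so the output shows bare words.
        decisions.append((left.rstrip(",:;"), right.rstrip(",:;."), boundary))
    return decisions


if __name__ == "__main__":
    for decision in punctuation_boundaries("Little Rock, Arkansas."):
        print(decision)
```

For "Little Rock, Arkansas." the sketch assigns a boundary between "Rock" and "Arkansas", which reproduces the first shortcoming noted above. It also can never assign a boundary where the text has no punctuation, which reproduces the second.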
A second method is to assign intonational phrase boundaries before or after certain key words such as "and," "today," "now," "when," "that," or "but." For example, if the word "and" is used to join two independent clauses (e.g., "I like apples and I like oranges"), assignment of an intonational phrase boundary (e.g., between "apples" and "and") is often appropriate. However, if the word "and" is used to join two nouns (e.g., "I like apples and oranges"), assignment of an intonational phrase boundary (e.g., between "apples" and "and") is often inappropriate. Further, in a sentence like "I take the 'nuts and bolts' approach," the assignment of an intonational phrase boundary between "nuts" and "and" would clearly be inappropriate.
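The key-word rule of the second method can be sketched as follows. The sketch assumes, for simplicity, that a boundary is always placed before a key word (the method as described may place it before or after). The names `KEY_WORDS` and `keyword_boundaries` are hypothetical.

```python
# Key words named in the text; a real list would likely be longer.
KEY_WORDS = {"and", "today", "now", "when", "that", "but"}


def keyword_boundaries(sentence):
    """Return (left_word, right_word, boundary) for each adjacent word pair.

    Assumption of this sketch: a boundary is placed before every key word,
    regardless of the role the key word plays in the sentence.
    """
    words = [w.strip(".,;:'\"") for w in sentence.split()]
    return [
        (left, right, right.lower() in KEY_WORDS)
        for left, right in zip(words, words[1:])
    ]


def boundary_pairs(sentence):
    """Return only the word pairs that receive a boundary."""
    return [(l, r) for l, r, b in keyword_boundaries(sentence) if b]


if __name__ == "__main__":
    print(boundary_pairs("I like apples and I like oranges."))
    print(boundary_pairs("I like apples and oranges."))
    print(boundary_pairs("I take the 'nuts and bolts' approach."))
```

Because the rule looks only at the word itself, it treats all three example sentences alike: it places a boundary before "and" whether the word joins two clauses, joins two nouns, or sits inside a fixed expression.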
A third method combines the first two methods. The shortcomings of these types of methods are apparent from the examples cited above.
A fourth method has been used primarily for the assignment of intonational phrase boundaries for TTS systems whose input is restricted by its application or domain (e.g., names and addresses, stock market quotes, etc.). This method has generally involved using a sentence or syntactic parser, the goal of which is to break up a sentence into subjects, verbs, objects, complements, etc. Syntactic parsers have shortcomings for use in the assignment of intonational phrase boundaries in that the relationship between intonational phrase boundaries and syntactic structure has yet to be clearly established. Therefore, this method often assigns phrase boundaries incorrectly. Another shortcoming of syntactic parsers is their speed (or lack thereof), or inability to run in real time. A further shortcoming is the amount of memory needed for their use. Syntactic parsers have yet to be successfully used in unrestricted TTS systems because of the above shortcomings. Further, in restricted-domain TTS systems, syntactic parsers fail particularly on unfamiliar input and are difficult to extend to new input and new domains.
A fifth method that could be used to assign intonational phrase boundaries would increase the accuracy of appropriately assigning intonational phrase boundaries to about 90 percent. This is described in Wang and Hirschberg, "Automatic classification of intonational phrase boundaries," Computer Speech and Language, vol. 6, pages 175-196 (1992). The method involves having a speaker read a body of text into a microphone and recording it. The recorded speech is then prosodically labeled. Prosodically labeling speech entails identifying the intonational features of speech that one desires to model in the generated speech produced by the TTS system.
This method also has significant drawbacks. It is expensive because it usually entails the hiring of a professional speaker. A great amount of time is necessary to prosodically label recorded speech, usually about one minute for each second of recorded speech and even then only if the labelers are very experienced. Moreover, since the process is time-consuming and expensive, it is difficult to adapt this process to different languages, different applications, and different speaking styles.
More specifically, a particular implementation of the last-mentioned method used about 45 to 60 minutes of natural speech that was then prosodically labeled. Sixty minutes of speech takes about 60 hours (i.e., 3600 minutes) just for prosodically labeling the speech. Additionally, much time is required to record the speech and process the data for analysis (e.g., dividing the recorded data into sentences, filtering the sentences, etc.). This usually takes about 40 to 50 hours. Also, the above assumes that the prosodic labeler has been trained; training often takes weeks, or even months.
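The time estimate above follows from the labeling rate of about one minute per second of recorded speech. The short Python sketch below reproduces the arithmetic. Adding the 40 to 50 hours of recording and processing to the labeling time to obtain a total is an assumption of this sketch, as is applying that same range to a 45 minute recording. The function and constant names are hypothetical.

```python
# About one minute of labeling per second of recorded speech (from the text).
LABELING_MINUTES_PER_SPEECH_SECOND = 1.0
# Recording and data processing, in hours (from the text).
PROCESSING_HOURS_RANGE = (40, 50)


def labeling_hours(speech_minutes):
    """Hours needed to prosodically label the given amount of speech."""
    speech_seconds = speech_minutes * 60
    labeling_minutes = speech_seconds * LABELING_MINUTES_PER_SPEECH_SECOND
    return labeling_minutes / 60


def total_hours_range(speech_minutes):
    """(low, high) total hours: labeling plus recording and processing.

    Summing the two figures is an assumption made for illustration.
    """
    base = labeling_hours(speech_minutes)
    low, high = PROCESSING_HOURS_RANGE
    return (base + low, base + high)


if __name__ == "__main__":
    for minutes in (45, 60):
        print(minutes, labeling_hours(minutes), total_hours_range(minutes))
```

For 60 minutes of speech this gives 3600 seconds, hence 3600 minutes or 60 hours of labeling, and on the stated assumption roughly 100 to 110 hours in total, before any time spent training the labeler.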