Speech is used to communicate information from a speaker to a listener. In a computer-user interface, the computer generates synthesized speech to convey an audible message to the user rather than just displaying the message as text with an accompanying “beep.” There are several advantages to conveying audible messages to the computer user in the form of synthesized speech. In addition to liberating the user from having to look at the computer's display screen, the spoken message conveys more information than the simple “beep” and, for certain types of information, speech is a more natural communication medium.
Due to the nature of computer systems, the same message may occur many times. For example, the message “Attention! The printer is out of paper” may be programmed to repeat several times over a short period of time until the user replenishes the printer's paper tray. Or the message “Are you sure you want to quit without saving?” may be repeated several times over the course of using a particular program. In human speech, when a person says the same words over and over again, he or she does not produce exactly the same acoustic signal each time the words are spoken. In synthesized speech, however, the opposite is true; a computer generates exactly the same acoustic signal each time the message is spoken. Users inevitably become annoyed at hearing the same predictable message spoken each time in exactly the same way. The more often a particular message is spoken in exactly the same way, the more unnaturally mechanical it sounds. In fact, studies have shown that listeners tune out repetitive sounds and, eventually, a repetitive spoken message will not be noticed.
One way to overcome the problems of sound repetition is to alter the way the computer produces the acoustic signal each time the message is spoken. Altering a computer-generated sound each time it is produced is known in the art. For example, alteration of the sound can be achieved by changing the sample playback rate, which shifts the overall spectrum and duration of the acoustic signal. While this approach works well for non-speech sounds, it does not work well when applied to speech sounds. In human speech, the overall spectrum of sound stays the same because a human speaker's vocal tract length does not vary. Thus, in order to sound like human speech, the overall spectrum of the sound of synthesized speech needs to stay the same as well. Another prior art example of altering a computer-generated sound each time it is produced is found in computer-generated music. In computer music a small random variation in the timing of each note is sometimes made to achieve a less mechanical sound. However, as with changing the sample playback rate, changing the timing of the components of speech does not work well for speech sounds because, unlike music, speech does not consist of easily identifiable note-onset and note-duration events. Rather, speech consists of tonal patterns of pitch, syllable stresses, overlapped gestures of the articulators (tongue, lips, jaw, etc.), and timing to form the rhythmic speech patterns that comprise the spoken message. Thus, it is not so clear exactly what parameters in speech synthesis should be varied to achieve a more natural sound. A more detailed analysis of the components of speech is required.
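The effect of changing the sample playback rate can be illustrated with a minimal sketch. It is not drawn from any cited system. The sample rate, the helper names (`sine`, `change_playback_rate`, `estimate_freq`), and the use of a pure tone in place of a speech signal are all assumptions made for illustration. Reading through a signal 1.25 times faster both shortens it and moves its frequency content upward by the same factor, which is the coupled change in spectrum and duration described above.

```python
import math

SAMPLE_RATE = 8000  # assumed output rate in Hz, chosen only for illustration


def sine(freq_hz, duration_s):
    """Generate a pure tone as a list of float samples."""
    n = int(round(duration_s * SAMPLE_RATE))
    return [math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE) for i in range(n)]


def change_playback_rate(samples, rate_factor):
    """Step through the samples rate_factor times faster (linear interpolation)."""
    out_len = int((len(samples) - 1) / rate_factor) + 1
    out = []
    for k in range(out_len):
        pos = k * rate_factor          # fractional read position in the input
        i = int(pos)
        frac = pos - i
        nxt = samples[min(i + 1, len(samples) - 1)]
        out.append((1 - frac) * samples[i] + frac * nxt)
    return out


def estimate_freq(samples):
    """Rough frequency estimate from the number of upward zero crossings."""
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if a < 0 <= b)
    return crossings * SAMPLE_RATE / len(samples)


tone = sine(200.0, 1.0)                    # stand-in for a recorded sound
faster = change_playback_rate(tone, 1.25)  # same sound at 1.25x playback rate

# Duration in seconds and estimated frequency in Hz, before and after.
print(len(tone) / SAMPLE_RATE, estimate_freq(tone))
print(len(faster) / SAMPLE_RATE, estimate_freq(faster))
```

In this sketch the duration falls from 1.0 s to 0.8 s while the estimated frequency rises from roughly 200 Hz to roughly 250 Hz. For a speech signal the same operation would move every spectral feature together, as if the speaker's vocal tract length had changed.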
Speech is the acoustic output of a complex system whose underlying state consists of a known set of discrete phonemes that every human speaker produces. A phoneme is the basic theoretical unit for describing how speech conveys linguistic meaning. As such, the phonemes of a language comprise a minimal theoretical set of units that are sufficient to convey all meaning in the language. For American English, there are approximately 40 phonemes, which are made up of vowels and consonants. Each phoneme can be considered to be a code that consists of a unique set of articulatory gestures.
If speakers could exactly and consistently produce these phoneme sounds, speech would amount to a stream of underlying discrete codes. However, because of many different factors including, for example, accents, gender, and coarticulatory effects, every phoneme has a variety of acoustic manifestations in the course of flowing speech. Thus, from an acoustical point of view, the phoneme actually represents a class of sounds that convey the same meaning.
The variations in the way the phonemes are produced between people and even between utterances of the same person are referred to as prosody. Examples of prosody include tonal and rhythmic variations in speech, which provide a significant contribution to the formal linguistic structure of speech communication and are referred to as the prosodic features. The acoustic patterns of prosodic features are heard in changes in the duration, intensity, fundamental frequency, and spectral patterns of the individual phonemes that comprise the spoken message.
There are two distinct components of prosody: linguistic and paralinguistic. The linguistic components of prosody are those that can change the meaning of a spoken phrase. In contrast, the paralinguistic components of prosody are those that do not change the meaning of a series of spoken words. For example, when speaking the phrase “it's raining,” a rising intonation asks for a confirmation and, perhaps, conveys surprise or disbelief. On the other hand, a falling intonation may express confidence that the rain is indeed falling. The distinction between the rising and falling intonations is an example of varying a linguistic prosodic feature. By contrast, one could speak the phrase “it's raining” with a somewhat higher (or lower) overall pitch range, depending upon whether the listener is far away (or nearby), and this change in overall pitch range does not change the meaning of the spoken words. Such a change in pitch without altering meaning is an example of a paralinguistic prosodic feature.
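The distinction can be made concrete with a small sketch. The fundamental frequency values, the helper names (`scale_pitch_range`, `interval_semitones`), and the choice to model a change in overall pitch range as multiplication by a constant ratio are assumptions made for illustration only. Under those assumptions, raising the whole contour leaves the rising or falling shape intact, which corresponds to a paralinguistic change. Swapping a rising contour for a falling one corresponds to a linguistic change.

```python
import math


def scale_pitch_range(f0_hz, factor):
    """Shift a whole F0 contour up or down by a constant ratio."""
    return [f * factor for f in f0_hz]


def interval_semitones(f0_hz):
    """Signed interval from the first to the last F0 value, in semitones."""
    return 12 * math.log2(f0_hz[-1] / f0_hz[0])


# Hypothetical contours for "it's raining" (the values in Hz are made up).
question = [180.0, 185.0, 200.0, 240.0]   # rising: asks for confirmation
statement = [200.0, 195.0, 170.0, 150.0]  # falling: expresses confidence

# Paralinguistic variation: the same contours spoken in a higher pitch range.
raised_question = scale_pitch_range(question, 1.2)
raised_statement = scale_pitch_range(statement, 1.2)

for name, contour in [
    ("question", question),
    ("raised question", raised_question),
    ("statement", statement),
    ("raised statement", raised_statement),
]:
    # A positive interval means a rising contour, a negative one a falling contour.
    print(name, round(interval_semitones(contour), 2))
```

Each raised contour has the same start-to-end interval as its original, so the question still rises and the statement still falls.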
The fundamental frequency contours of speech have been classified according to their communicative function. In English, a rising contour generally conveys to the listener that a question has been posed, that some response from the listener is required, or that more information is implied to follow within the current topic. Conversely, a falling contour generally conveys the opposite. Numerous subtle and not-so-subtle variations in the fundamental frequency contours signal other information to the listener as well, such as sarcasm, disbelief, excitement or anger. Unlike the phonemes, the prosodic features reflected in the acoustic patterns may not be discrete. In fact, it is often difficult or impossible to determine which features of prosody are discrete and which are not.
The human ear is extremely sensitive to minor changes in certain components of speech, and remarkably tolerant of other changes. For example, the tonal and rhythmic variations of speech are finely controlled by humans and, as noted above, convey considerable linguistic information. Thus, random variations in the pitch or duration of each phoneme, syllable or word of a spoken message can destructively interfere with the overall tonal and rhythmic pattern of the speech, i.e., the prosody. Even a 9-millisecond difference in the closure duration of an intervocalic stop can shift the perception from voiceless to voiced, changing for example the word “rapid” into “rabid.” Therefore, simply changing the parameters for the timing of sound components may result in undesirable alterations in the prosodic features of the phonemes that comprise the speech and cannot be successfully applied to speech synthesis.
Another example of altering computer-generated sounds is disclosed in U.S. Pat. No. 5,007,095 to Nara et al., which describes a system for synthesizing speech having improved naturalness.