In the art of speech synthesis, a great deal of data on the speech style to be emulated is required in order to approximate human-like synthesis. The problem can be illustrated by reference to a rudimentary and generally familiar means for producing a voiced response to a textual or keyboard input--specifically, those systems which provide a voiced response (generally comprised of concatenated prerecorded digits corresponding to an electronically stored number or confirming a number entered via a keyboard or keypad) to various telephone inquiries, such as a request to a directory assistance operator or an interface with an automated banking function. As is well known, such systems are characterized by a very limited vocabulary (often only the digits from 0 to 9), a staccato delivery style, generally very brief speech responses, and the necessity that each "word" in the system's vocabulary be prerecorded and stored. In this respect, it is readily seen that such rudimentary voice response systems do not provide true speech synthesis, inasmuch as the only synthesis involved is the stringing together of a series of prerecorded numerals, words or phrases.
For speech synthesis systems operating on open input, such as a system for translating a computer text file for a sight-impaired user, the limitations described above will generally be intolerable. For example, the working vocabulary of such a system must be at least in the tens of thousands of words, and many of those words will require different inflection, accentuation and/or syllabic stress, depending on context. It will readily be appreciated that the task of recording, storing and recalling the necessary vocabulary of words (as well as the task of recognizing which stored version of a particular word is required by the immediate context) would require immense human and computational resources, and as a practical matter could not be implemented. Similarly, in order to make synthesized speech of more than a few words acceptable to users, it must be as human-like as possible. Thus, the synthesized speech must include appropriate pauses, inflections, accentuation and syllabic stress. The staccato delivery style of the rudimentary system would plainly be unacceptable.
Put somewhat differently, speech synthesis systems which can provide a human-like delivery quality for non-trivial input text must not only handle the necessary vocabulary size but must also be able to correctly pronounce the "words" read, to appropriately emphasize some words and de-emphasize others, to "chunk" a sentence into meaningful phrases, to pick an appropriate pitch contour and to establish the duration of each phonetic segment, or phoneme--recognizing that a given phoneme should be longer if it appears in some positions in a sentence than in others. Broadly speaking, such a system will operate to convert input text into some form of linguistic representation that includes information on the phonemes to be produced, their duration, the location of any phrase boundaries and the pitch contour to be used. This linguistic representation of the underlying text can then be converted into a speech waveform.
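By way of illustration only, the linguistic representation described above might be held in a data structure along the following lines. This is a minimal sketch and not the representation used by any particular synthesizer: the class names, field names, phoneme labels and numeric values are all hypothetical and are chosen solely to show the four kinds of information mentioned (phonemes, durations, phrase boundaries and pitch contour).

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PhonemeSegment:
    """One phonetic segment (hypothetical structure)."""
    symbol: str         # phoneme label, e.g. an ARPAbet-like code (assumed)
    duration_ms: float  # duration assigned to this segment
    stressed: bool = False


@dataclass
class LinguisticRepresentation:
    """Illustrative container for the output of text analysis."""
    segments: List[PhonemeSegment]
    # Indices into `segments` after which a phrase boundary falls.
    phrase_boundaries: List[int]
    # Pitch contour as (time in ms, fundamental frequency in Hz) points.
    pitch_contour: List[Tuple[float, float]]

    def total_duration_ms(self) -> float:
        """Sum of the per-segment durations."""
        return sum(seg.duration_ms for seg in self.segments)


# Made-up example values for the single word "hello".
hello = LinguisticRepresentation(
    segments=[
        PhonemeSegment("HH", 60.0),
        PhonemeSegment("AH", 80.0),
        PhonemeSegment("L", 70.0),
        PhonemeSegment("OW", 150.0, stressed=True),
    ],
    phrase_boundaries=[3],
    pitch_contour=[(0.0, 120.0), (360.0, 100.0)],
)
```

Such a structure carries everything a later stage would need in order to produce a waveform whose segments have the stated durations.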
We believe that the state of the art in speech synthesis is represented by a text-to-speech (TTS) synthesis system developed by AT&T Bell Laboratories and described in Olive, J. P. and Sproat, R. W., "Text-To-Speech Synthesis", AT&T Technical Journal, 74: 35-44, 1995. We will refer to that AT&T TTS System from time to time herein as a typical speech synthesis embodiment for the application of our invention.
It is not necessary to describe in detail the operation of such speech synthesis systems, which, in general, are known in the art, but a functional description of such systems will aid in the understanding of our invention. In FIG. 1 such a system is depicted in broad functional form. As shown in the figure, input text is first operated on by a Text Analysis function, 1. That function essentially comprises the conversion of the input text into a linguistic representation of that text. Included in this text analysis function are the subfunctions of identifying the phonemes corresponding to the underlying text, determining the stress to be placed on the various syllables and words comprising the text, applying word pronunciation rules to the input text, and determining the location of phrase boundaries for the text and the pitch to be associated with the synthesized speech. Other, generally less important functions may also be included in the overall text analysis function, but they need not be further discussed herein.
Following application of the text analysis function, the system of FIG. 1 performs the function depicted as Acoustic Analysis 5. This function will be concerned with various acoustic parameters, but of particular importance to the present invention, the Acoustic Analysis function determines the duration of each phoneme in the synthesized speech in order to closely approximate the natural speech being emulated. This phoneme duration aspect of the Acoustic Analysis function represents the portion of a speech synthesis system to which our invention is directed and will be described in more detail below.
The final functional element in FIG. 1, Speech Generation, 10, operates on data and/or parameters developed by preceding functions in order to construct a speech waveform corresponding to the text being synthesized into speech. For purposes of our discussion, it is important to note that the Speech Generation function operates to assure that the speech waveform for each phoneme corresponds to the duration for that phoneme determined by the Acoustic Analysis function.
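The three functional elements of FIG. 1 can be pictured, in greatly simplified form, as three functions applied in sequence. The sketch below is illustrative only and makes several assumptions that are not part of the system described: each letter is treated as a pseudo-phoneme, the durations (120 ms for vowels, 70 ms otherwise) are made-up values, the sample rate is assumed to be 8000 Hz, and the "waveform" is merely silence of the correct length. Its sole purpose is to show that the Speech Generation stage honors the duration assigned to each phoneme by the Acoustic Analysis stage.

```python
from typing import List, Tuple

SAMPLE_RATE_HZ = 8000  # assumed sample rate, for illustration only


def text_analysis(text: str) -> List[str]:
    """Stand-in for Text Analysis: one pseudo-phoneme per letter."""
    return [ch for ch in text.lower() if ch.isalpha()]


def acoustic_analysis(phonemes: List[str]) -> List[Tuple[str, float]]:
    """Stand-in for Acoustic Analysis: assign a duration in ms to each
    phoneme. The values 120 and 70 are invented placeholders."""
    return [(p, 120.0 if p in "aeiou" else 70.0) for p in phonemes]


def speech_generation(timed: List[Tuple[str, float]]) -> List[float]:
    """Stand-in for Speech Generation: emit a silent waveform in which
    each phoneme occupies exactly the number of samples implied by its
    assigned duration."""
    waveform: List[float] = []
    for _phoneme, duration_ms in timed:
        n_samples = round(duration_ms * SAMPLE_RATE_HZ / 1000)
        waveform.extend([0.0] * n_samples)
    return waveform


def synthesize(text: str) -> List[float]:
    """Run the three stages in the order shown in FIG. 1."""
    return speech_generation(acoustic_analysis(text_analysis(text)))
```

For the input "cat", the hypothetical durations total 260 ms, so the generated waveform contains 2080 samples at the assumed sample rate.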
It is well known that, in natural speech, the duration of a phonetic segment varies as a function of contextual factors. These factors include the identities of the surrounding segments, within-word position, word prominence, presence of phrase boundaries, as well as other factors. It is generally believed that for synthetic speech to sound natural, these durational patterns must be mimicked. To realize these durational patterns in a synthesizer, the Acoustic Analysis function operates on parameters derived from test speech read by a selected speaker. From an analysis of such test speech, and particularly phoneme duration data obtained therefrom, speech synthesis systems can be constructed to essentially emulate the durational patterns of the selected speaker.
The test speech will contain a number of preselected sentences read by the selected speaker and recorded. This recorded test speech is then analyzed in terms of the durations of the individual phonemes contained in the spoken test sentences. From this data, rules are developed for predicting the durations of such phonemes in text which is to be synthesized into speech, given the context in which the words containing such phonemes appear. The general character of such rules is known for at least the major languages, based on a large body of prior research into speech characteristics; that research has been widely reported and will be well known to those skilled in the art of speech synthesis. It is nevertheless necessary to adapt those general rules to the durational patterns of the selected speaker in order to cause the synthesizer to mimic that speaker. Such adaptation is accomplished through the valuation of parameters contained in the rules, and this parameter valuation is based on the phoneme duration data derived from the test speech.
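The valuation of rule parameters from measured durations can be illustrated with a deliberately small example. The sketch below assumes, purely for illustration, a rule of multiplicative form in which the predicted duration equals an intrinsic duration for the phoneme multiplied by a single stress factor; the form of the rule, the function names and the data are our assumptions and are not taken from any particular synthesizer. The estimator works in the logarithmic domain and is a simple averaging procedure, not a full least-squares fit.

```python
import math
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple

# (phoneme label, stressed?, measured duration in ms) -- hypothetical data
Observation = Tuple[str, bool, float]


def estimate_parameters(
    observations: List[Observation],
) -> Tuple[Dict[str, float], float]:
    """Estimate per-phoneme intrinsic durations and one stress factor
    for the assumed rule: duration = intrinsic[phoneme] * factor**stressed.
    """
    logs = defaultdict(lambda: {True: [], False: []})
    for phoneme, stressed, duration_ms in observations:
        logs[phoneme][stressed].append(math.log(duration_ms))

    # Stress effect: average, over phonemes seen in both conditions,
    # of the difference between mean log durations.
    diffs = [
        mean(by_stress[True]) - mean(by_stress[False])
        for by_stress in logs.values()
        if by_stress[True] and by_stress[False]
    ]
    if not diffs:
        raise ValueError("no phoneme observed both stressed and unstressed")
    log_factor = mean(diffs)

    # Intrinsic duration: mean log duration after removing the stress effect.
    intrinsic = {}
    for phoneme, by_stress in logs.items():
        adjusted = by_stress[False] + [v - log_factor for v in by_stress[True]]
        intrinsic[phoneme] = math.exp(mean(adjusted))
    return intrinsic, math.exp(log_factor)


def predict_duration(intrinsic: Dict[str, float], factor: float,
                     phoneme: str, stressed: bool) -> float:
    """Apply the assumed multiplicative rule."""
    return intrinsic[phoneme] * (factor if stressed else 1.0)


# Made-up measurements standing in for durations taken from test speech.
test_speech = [
    ("a", False, 100.0), ("a", True, 150.0),
    ("t", False, 60.0), ("t", True, 90.0),
]
intrinsic, stress_factor = estimate_parameters(test_speech)
```

Even this toy rule requires every phoneme to be observed in the relevant contexts before its parameters can be valued, which is why the content of the test speech matters.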
Now we reach the crux of the problem addressed by our invention. Because the phoneme durations determined from the test speech are themselves a function of context, the text selection methods available in the art for determining the content and scope of the test sentences require, at best, several thousand observed durations to cover enough contexts for parameter estimation. This large number of observations, and the corresponding large number of sentences which would comprise the test speech, significantly handicaps the estimation of duration parameters for a text-to-speech synthesizer, due to the substantial amount of time required for the recording of the test speech and the huge amount of phoneme data which must be analyzed in such test speech. Additionally, such a large body of test speech renders impossible any reprogramming of such a synthesizer by a user desiring to create a synthesized speech style more in keeping with a speech style familiar to and/or preferred by such a user.
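The scale of the problem can be appreciated from a rough count of the distinct contexts in which a phoneme may occur. The figures below are hypothetical and are offered only to show how the number of contexts grows as the product of the number of levels of each contextual factor; they are not measurements and are not drawn from any particular language or synthesizer.

```python
import math

# Hypothetical number of levels for each contextual factor.
factor_levels = {
    "phoneme identity": 40,
    "class of preceding segment": 10,
    "class of following segment": 10,
    "within-word position": 3,
    "word prominence": 2,
    "phrase boundary present": 2,
}

# Each combination of levels is a distinct context.
context_count = math.prod(factor_levels.values())
print(context_count)
```

Under these assumed figures there are 48,000 combinations, so a test corpus chosen without regard to the contexts it covers must be very large before enough of them have been observed.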
We will show hereafter a system and method for determining test speech sentences which provides an order of magnitude reduction from the prior art in the number of sentences required for reliably estimating the duration parameters. We will also show that, within the constraints of presently known analytic processes, the method of our invention produces the practical minimum number of sentences needed for such estimation of those duration parameters.