Speech can be synthesized using a number of very different approaches. For example, digitized recordings of words can be reassembled into sentences to produce a synthetic utterance of a telephone number. Alternatively, a phonetic representation of the telephone number can be produced using phonemes for each sound comprising the utterance. Perhaps the dominant technique used in speech synthesis is linear predictive coding (LPC), which describes short segments of speech using parameters that can be transformed into positions (frequencies) and shapes (bandwidths) of peaks in the spectral envelope of the speech segments. In a typical 10th-order LPC model, ten such parameters are determined for each segment, and the frequency peaks they define correspond to resonant frequencies of the speaker's vocal tract. The parameters defining each segment of speech (typically 10 to 20 milliseconds per segment) represent data that can be applied to conventional synthesizer hardware to replicate the sound of the speaker producing the utterance.
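By way of illustration only, the relationship between the LPC parameters of a segment and the frequencies and bandwidths of its spectral peaks can be sketched as follows. The sketch assumes the autocorrelation method with a Hamming window and the Levinson-Durbin recursion, none of which is specified above; the function names `lpc_coefficients` and `spectral_peaks`, the 8000 Hz sampling rate, and the synthetic test segment are likewise hypothetical choices made for the example.

```python
import numpy as np


def lpc_coefficients(frame, order=10):
    """Estimate LPC coefficients for one segment (autocorrelation method).

    Returns (a, err), where a[0] == 1 and the prediction-error filter is
    A(z) = 1 + a[1] z^-1 + ... + a[order] z^-order.
    """
    frame = np.asarray(frame, dtype=float)
    windowed = frame * np.hamming(len(frame))  # assumed analysis window
    n = len(windowed)
    # Autocorrelation at lags 0..order.
    r = np.array([np.dot(windowed[:n - k], windowed[k:])
                  for k in range(order + 1)])

    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    # Levinson-Durbin recursion: grow the predictor one order at a time.
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err  # reflection coefficient for order i
        prev = a.copy()
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a, err


def spectral_peaks(a, sample_rate):
    """Convert LPC coefficients to peak frequencies and bandwidths in Hz."""
    poles = np.roots(a)
    poles = poles[np.imag(poles) > 0]  # keep one pole of each conjugate pair
    freqs = np.angle(poles) * sample_rate / (2.0 * np.pi)
    bandwidths = -np.log(np.abs(poles)) * sample_rate / np.pi
    idx = np.argsort(freqs)
    return freqs[idx], bandwidths[idx]


if __name__ == "__main__":
    sample_rate = 8000  # Hz (assumed for the example)
    t = np.arange(160) / sample_rate  # one 20 ms segment
    rng = np.random.default_rng(0)
    # Synthetic stand-in for a speech segment: two tones plus a little noise.
    frame = (np.sin(2 * np.pi * 700 * t)
             + 0.5 * np.sin(2 * np.pi * 1800 * t)
             + 0.01 * rng.standard_normal(t.size))
    a, err = lpc_coefficients(frame, order=10)
    freqs, bws = spectral_peaks(a, sample_rate)
    for f, b in zip(freqs, bws):
        print(f"peak near {f:7.1f} Hz, bandwidth about {b:6.1f} Hz")
```

In this sketch a 10th-order model yields at most five complex-conjugate pole pairs, so at most five peaks are reported per segment; real poles, which contribute only spectral tilt, are discarded.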
It can be shown that, for a given speaker, the shape of the front cavity of the vocal tract is the primary source of linguistic information. The LPC model includes substantial information that remains approximately constant from segment to segment of an utterance by a given speaker (e.g., information reflecting the length of the speaker's vocal cords). As a consequence, the data representing each segment of speech in the LPC model include considerable redundancy, which creates an undesirable overhead for both storage and transmission of that data.
It is desirable to use the smallest number of parameters required to represent a speech segment for synthesis, so that both the storage requirements for such data and the bit rate needed to transmit it can be reduced. Accordingly, it is desirable to separate the speaker-independent linguistic information from the superfluous speaker-dependent information. Since the speaker-independent information that varies with each segment of speech conveys the data necessary to synthesize the words embodied in an utterance, considerable storage space can potentially be saved by storing and transmitting the speaker-dependent information for a given speaker separately from the speaker-independent information. Many utterances could then be stored or transmitted in terms of their speaker-independent information alone and synthesized into speech by combination with the speaker-dependent information, thereby greatly reducing storage media requirements and making more channels available within an assigned bandwidth for the transmission of voice communications. Furthermore, different speaker-dependent information could be combined with the speaker-independent information to synthesize the words in the voice of another speaker, for example, by substituting the voice of a female for that of a male or the voice of a specific person for that of the original speaker. Reducing the amount of data required to synthesize speech thus reduces both the data storage space and the quantity of data that must be transmitted to a remote site in order to synthesize a given vocalization. These and other advantages of the present invention will be apparent from the drawings and from the Detailed Description of the Preferred Embodiment that follows.