In the field of telecommunications, the synthesis of speech is of particular interest. It permits people unskilled in computer technology to receive so-called canned messages, e.g. by telephone, without the necessity of employing full-time human operators or of using costly subscriber terminals. Such messages may inform a calling subscriber of congestion at an exchange, of the cost and duration of a call, and of a changed directory number.
A digital system for synthesizing speech stores words or portions of words in coded form, a decoder being necessary to convert the digitally encoded signals into voice signals suitable for conventional transduction into sound waves. One particular system for the synthesis of speech elements stores PCM-coded waveform samples of diphones, i.e. phoneme pairs. Such a system produces staccato-sounding speech and has the further disadvantage of requiring a large memory.
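The memory burden of such a waveform-storage approach can be gauged with a back-of-the-envelope calculation. The figures below (diphone count, average duration, sampling rate, sample size) are illustrative assumptions, not values from this description:

```python
# Rough memory cost of storing PCM-coded diphone waveforms.
# All figures are assumed for illustration: ~1000 diphones,
# ~0.2 s average duration, 8 kHz sampling, 8-bit samples.
N_DIPHONES = 1000
AVG_DURATION_S = 0.2
SAMPLE_RATE_HZ = 8000
BYTES_PER_SAMPLE = 1

total_bytes = int(N_DIPHONES * AVG_DURATION_S * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)
print(total_bytes)  # 1600000, i.e. on the order of megabytes of waveform storage
```

Even under these modest assumptions the store runs to megabytes, whereas a parametric coding of the kind described next requires only a handful of values per short time interval.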
In an attempt to achieve natural-sounding synthesis, coding techniques have been developed on the basis of mathematical models simulating the production of speech by a human vocal tract. According to one model, the vocal tract is replaced by the combination of an excitation generator and a time-variable filtering system consisting of the resonant cavities of an acoustic tube having a variable cross-section. The excitation may be a sequence of periodic or pseudorandom pressure variations, depending on whether the output is to correspond to a voiced or an unvoiced sound. The filter has coefficients which represent the effects of reflection between different cavities of the tube and are continuous functions of time; the coefficient values, however, may be considered to be constant during sufficiently short time intervals, e.g. on the order of 10 msec. Furthermore, the filter can be controlled to have a variable gain corresponding to a varying sound intensity.
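The model just described can be sketched in code: a gain-scaled excitation, either a periodic pulse train (voiced) or pseudorandom noise (unvoiced), drives an all-pole lattice filter whose reflection coefficients are held constant over each short frame. This is an illustrative sketch only; the frame structure, function names and parameter values are assumptions, not the actual apparatus:

```python
import random

def excitation(n_samples, voiced, pitch_period, rng):
    """Periodic pressure pulses for voiced frames, pseudorandom noise for unvoiced."""
    if voiced:
        return [1.0 if n % pitch_period == 0 else 0.0 for n in range(n_samples)]
    return [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]

def lattice_synthesize(frames, sample_rate=8000, frame_ms=10):
    """All-pole lattice synthesis: the reflection coefficients k are treated
    as constant within each ~10 ms frame, mirroring the piecewise-constant
    assumption of the model. Each frame is a dict with keys
    "k" (reflection coefficients), "gain", "voiced" and, if voiced, "pitch"."""
    rng = random.Random(0)
    n = sample_rate * frame_ms // 1000
    g = None  # backward-path delay states g_0 .. g_{p-1}
    out = []
    for frame in frames:
        k = frame["k"]
        p = len(k)
        if g is None or len(g) != p:
            g = [0.0] * p
        for x in excitation(n, frame["voiced"], frame.get("pitch", 1), rng):
            f = frame["gain"] * x             # f_p[n]: scaled excitation enters the top stage
            for i in range(p - 1, -1, -1):    # stages p .. 1 (k[i] plays the role of k_{i+1})
                f = f - k[i] * g[i]           # forward path: f_{i-1}[n] = f_i[n] - k_i * g_{i-1}[n-1]
                if i + 1 < p:
                    g[i + 1] = k[i] * f + g[i]  # backward path, delayed by one sample
            g[0] = f                          # g_0[n] = f_0[n]
            out.append(f)                     # synthesized speech sample y[n]
    return out
```

With all reflection coefficients zero the filter is transparent, so a voiced frame simply reproduces the gain-scaled pulse train; with nonzero coefficients of magnitude below one, the stages model the reflections between cavities while keeping the filter stable.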
Thus, an element of synthesized speech may be represented by a set of parameters coding the duration of the element, the kind of excitation (whether voiced or unvoiced), filter gain, weighting coefficients and, in the case of voiced sound, the recurrence period of the excitation pulses. These parameters are obtained by analyzing human speech in accordance with the selected model. Such an analysis is described by P. M. Bertinetto, C. Miotti, S. Sandri and E. Vivalda in a paper titled "An Interactive Synthesis System for the Detection of Italian Prosodic Rules", CSELT Technical Reports, vol. V, No. 5, December 1977. Prior synthesizers operating according to this model, however, vary the coefficients at constant intervals, thereby producing a degree of unnaturalness in the synthesized speech.
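The parameter set enumerated above can be pictured as a simple record, one per speech element. The field names, types and units below are illustrative assumptions, not the actual coding format of any prior synthesizer:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class SpeechElementParams:
    """One element of synthesized speech under the vocal-tract model:
    duration, kind of excitation, filter gain, weighting (reflection)
    coefficients and, for voiced sound, the excitation recurrence period."""
    duration_ms: int                          # duration of the element
    voiced: bool                              # voiced vs. unvoiced excitation
    gain: float                               # filter gain (sound intensity)
    reflection_coeffs: List[float]            # held constant over ~10 ms intervals
    pitch_period_ms: Optional[float] = None   # recurrence period; voiced sound only

    def __post_init__(self):
        # reflection coefficients must lie in (-1, 1) for a stable filter
        assert all(-1.0 < k < 1.0 for k in self.reflection_coeffs)
        if self.voiced:
            assert self.pitch_period_ms is not None
```

A few such records per element, obtained by analyzing human speech as in the cited paper, replace the thousands of raw waveform samples a PCM store would need.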