This invention relates to a speech analyzer, which is useful, among others, in speech communication.
Band-compressed encoding of voice or speech sound signals has been increasingly demanded as a result of recent progress in multiplex communication of speech sound signals and in composite multiplex communication of speech sound and facsimile and/or telex signals through a telephone network. For this purpose, speech analyzers and synthesizers are useful.
As described in an article contributed by B. S. Atal and Suzanne L. Hanauer to "The Journal of the Acoustical Society of America," Vol. 50, No. 2 (Part 2), 1971, pages 637-655, under the title of "Speech Analysis and Synthesis by Linear Prediction of the Speech Wave," it is possible to regard speech sound as a radiation output of a vocal tract that is excited by a sound source, such as the vocal cords set into vibration. The speech sound is represented in terms of two groups of characteristic parameters, one for information related to the exciting sound source and the other for the transfer function of the vocal tract. The transfer function, in turn, is expressed as spectral distribution information of the speech sound.
By the use of a speech analyzer, the sound source information and the spectral distribution information are extracted from an input speech sound signal and then converted into an encoded or quantized signal for transmission. A speech synthesizer comprises a digital filter having adjustable coefficients. After the encoded or quantized signal is received and decoded, the resulting spectral distribution information is used to adjust the digital filter coefficients. The resulting sound source information is used to excite the coefficient-adjusted digital filter, which then produces an output signal representative of the speech sound.
As the spectral distribution information, it is usually possible to use spectral envelope information that represents a macroscopic distribution of the spectrum of the speech sound waveform and thus reflects the resonance characteristics of the vocal tract. As the sound source information, it is possible to use parameters that indicate classification into or distinction between a voiced sound produced by the vibration of the vocal cords and a voiceless or unvoiced sound resulting from a stream of air flowing through the vocal tract (a fricative or an explosive), an average power or intensity of the speech sound during a short interval of time, such as an interval of the order of 20 to 30 milliseconds, and a pitch period for the voiced sound. The sound source information is band-compressed by replacing a voiced sound with impulses of a waveform and a pitch period analogous to those of the voiced sound, and an unvoiced sound with white noise.
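Purely by way of illustration, and not as a part of the invention, the short-interval average power mentioned above may be computed frame by frame as in the following sketch; the 25-millisecond frame length, the sampling rate used in the example, and the function name frame_average_power are assumptions of the example:

```python
import numpy as np

def frame_average_power(signal, fs, frame_ms=25.0):
    """Average power of each short frame (of the order of 20 to 30 ms).

    The frame length and sampling rate are illustrative values only.
    """
    n = int(fs * frame_ms / 1000)          # samples per frame
    num_frames = len(signal) // n          # discard any trailing partial frame
    frames = signal[:num_frames * n].reshape(num_frames, n)
    return np.mean(frames ** 2, axis=1)    # mean squared amplitude per frame
```

In a complete analyzer the same framing would also delimit the intervals over which the other parameters, such as the pitch period, are extracted.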
On analyzing speech sound, it is possible to deem the parameters to be stationary during the short interval mentioned above. This is because variations in the spectral distribution or envelope information and the sound source information are the results of motion of the articulating organs, such as the tongue and the lips, and are generally slow. It is therefore sufficient in general that the parameters be extracted from the speech sound signal in each frame period of the above-exemplified short interval. Such parameters serve well for the synthesis or production of the speech sound.
It is to be pointed out in connection with the above that the parameters indicative, among others, of the pitch period and of the distinction between voiced and unvoiced sounds are very important for speech sound analysis and synthesis. This is because the results of analysis for deriving such information have a material effect on the quality of the synthesized speech sound. For example, an error in the measurement of the pitch period seriously affects the tone of the synthesized sound. An error in the distinction between voiced and unvoiced sounds renders the synthesized sound husky and crunching or thundering. Any such error thus harms not only the naturalness but also the clarity of the synthesized sound.
On measuring the pitch period, it is usual first to derive a series or sequence of autocorrelation coefficients from the speech sound to be analyzed. As will be described in detail later with reference to one of several figures of the accompanying drawing, the series consists of autocorrelation coefficients of a plurality of orders, namely, for various delays or joining intervals. By comparing the autocorrelation coefficients with one another, the pitch period is decided to be the delay that gives a maximum or greatest one of the autocorrelation coefficients.
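The delay-comparison procedure described above may be sketched, purely for illustration, as follows; the 50-to-400 Hz search range for the pitch and the function name estimate_pitch_autocorr are assumptions of the example, not particulars of the invention:

```python
import numpy as np

def estimate_pitch_autocorr(frame, fs, fmin=50.0, fmax=400.0):
    """Pick, as the pitch period, the delay (lag) whose normalized
    autocorrelation coefficient is the greatest within a plausible range."""
    frame = frame - np.mean(frame)
    # Autocorrelation coefficients for all non-negative delays.
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    ac = ac / ac[0]                       # normalize by the zero-delay power
    lag_min = int(fs / fmax)              # shortest delay searched
    lag_max = int(fs / fmin)              # longest delay searched
    lag = lag_min + np.argmax(ac[lag_min:lag_max + 1])
    # Return the pitch period in seconds and the maximum coefficient.
    return lag / fs, ac[lag]
```

The maximum coefficient returned alongside the period is the same quantity that serves as a decision parameter for the voiced-unvoiced distinction discussed below.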
As described in an article that Bishnu S. Atal and Lawrence R. Rabiner contributed to "IEEE Transactions on Acoustics, Speech, and Signal Processing," Vol. ASSP-24, No. 3 (June 1976), pages 201-212, under the title of "A Pattern Recognition Approach to Voiced-Unvoiced-Silence Classification with Applications to Speech Recognition," it is possible to use, for the classification or distinction, various criterion or decision parameters that take different values depending on whether the speech sound is voiced or unvoiced. Typical decision parameters are the average power, the rate of zero crossings, and the maximum autocorrelation coefficient, namely, the autocorrelation coefficient at the delay corresponding to the pitch period. Amongst such parameters, the maximum autocorrelation coefficient is useful and important.
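A rudimentary decision rule combining these parameters might look as follows; the thresholds, the assumed pitch-delay range, and the function name classify_voiced are illustrative assumptions of the example, not values taught by the cited article:

```python
import numpy as np

def classify_voiced(frame, fs, zcr_thresh=0.25, r_thresh=0.5):
    """Classify one frame as voiced or unvoiced from three decision
    parameters: average power, zero-crossing rate, and the maximum
    normalized autocorrelation coefficient in the pitch-delay range."""
    frame = frame - np.mean(frame)
    power = np.mean(frame ** 2)                         # average power
    zcr = np.mean(np.abs(np.diff(np.sign(frame))) > 0)  # zero-crossing rate
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    ac = ac / ac[0]
    lag_min, lag_max = int(fs / 400), int(fs / 50)      # assumed pitch range
    r_max = np.max(ac[lag_min:lag_max + 1])
    # Voiced frames show a high autocorrelation peak and few zero crossings.
    voiced = bool(r_max > r_thresh and zcr < zcr_thresh)
    return voiced, power, zcr, r_max
```

The average power is returned as well, since a practical classifier would also use it, for example to separate speech from silence.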
The pitch period extracted from the autocorrelation coefficients is stable and precise at a stationary part of the speech sound, at which the speech sound waveform is periodic during a considerably long interval of time, as in a stationarily voiced part of the speech sound. The waveform, however, has only a poor periodicity at a transient part of the speech sound at which a voiced and an unvoiced sound merge into each other, as when a voiced sound transits into an unvoiced one or when a voiced sound builds up from an unvoiced one. It is difficult to extract a correct pitch period from such a transient part because the waveform is subject to the effects of ambient noise and the formants. Classification into voiced and unvoiced sounds is also difficult at the transient part.
More particularly, the maximum autocorrelation coefficient has as great a value as about 0.75 to 0.99 at a stationary part of the speech sound. On the other hand, the maximum value of the autocorrelation coefficients resulting from the ambient noise and/or the formants is only about 0.5. It is readily possible to distinguish between these two maximum autocorrelation coefficients. The maximum autocorrelation coefficient for the speech sound, however, decreases to about 0.5 at a transient part. It is next to impossible to distinguish the latter maximum autocorrelation coefficient from the maximum autocorrelation coefficient resulting either from the ambient noise or from the formants. Distinction between a voiced and an unvoiced sound becomes ambiguous if it is based on such a maximum value.