The present invention relates generally to the field of speech coding, and more particularly to encoding methods for estimating pitch and voicing parameters.
Various methods have been developed for digital encoding of speech signals. The encoding enables the speech signal to be stored or transmitted and subsequently decoded, thereby reproducing the original speech signal.
Model-based speech encoding permits the speech signal to be compressed, which reduces the number of bits required to represent the speech signal, thereby reducing data transmission rates. The lower data rates are possible because speech is redundant and because the human speech-generating system can be simulated mathematically. The vocal tract is simulated by a number of "pipes" of differing diameter, and the excitation is represented by a pulse stream at the vocal cord rate for voiced sound or a random noise source for the unvoiced parts of speech. Reflection coefficients at junctions of the pipes are represented by coefficients obtained from linear prediction coding (LPC) analysis of the speech waveform.
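As an illustration of how LPC analysis can yield reflection coefficients, the following is a minimal sketch of the textbook Levinson-Durbin recursion in Python. It is not taken from this disclosure; the function names, the predictor order, and the sign convention (A(z) = 1 + a1 z^-1 + ...) are assumptions made for the example, and other texts use the opposite sign for the reflection coefficients.

```python
def autocorrelation(x, max_lag):
    """Return autocorrelation values r[0..max_lag] of the sequence x."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the LPC normal equations from autocorrelation values r.

    Returns (a, refl, err), where a[0] == 1.0 and a[1..order] are the
    predictor polynomial coefficients, refl holds one reflection
    coefficient per stage, and err is the final prediction error energy.
    """
    if r[0] <= 0.0:
        raise ValueError("r[0] must be positive (signal has no energy)")
    a = [1.0] + [0.0] * order
    err = r[0]
    refl = []
    for m in range(1, order + 1):
        if err <= 0.0:
            break  # signal is already perfectly predicted; stop early
        acc = r[m] + sum(a[i] * r[m - i] for i in range(1, m))
        k = -acc / err  # reflection coefficient for stage m
        refl.append(k)
        new_a = a[:]
        for i in range(1, m):
            new_a[i] = a[i] + k * a[m - i]
        new_a[m] = k
        a = new_a
        err *= 1.0 - k * k
    return a, refl, err
```

In the "pipes" picture, each reflection coefficient corresponds to one junction between adjacent tube sections, which is why this recursion is a common companion to the vocal tract model described above.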
The vocal cord rate, which, as stated above, is used to formulate speech models, is related to the periodicity of voiced speech, often referred to as pitch. In an analog time domain plot of a speech signal, the time between the largest magnitude positive or negative peaks during voiced segments is the pitch period. Although speech signals are not perfectly periodic, and are in fact quasi-periodic, non-stationary signals, an estimated pitch frequency and its reciprocal, the pitch period, attempt to represent the speech signal as truly as possible.
For speech encoding, an estimate of pitch is made using any one of a number of pitch estimation algorithms. However, none of the existing estimation algorithms has been entirely successful in providing robust performance over a variety of input speech conditions.
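The relationship between pitch period and pitch frequency can be made concrete with a small Python sketch. It uses normalized autocorrelation, a common textbook estimator and not the method of this disclosure; the function names, the lag range, and the small tie-breaking tolerance that favors shorter lags are assumptions for the example.

```python
import math


def estimate_pitch_period(x, min_lag, max_lag, tolerance=1e-9):
    """Return the lag (in samples) with the highest normalized autocorrelation.

    A later lag replaces the current best only if it scores higher by more
    than `tolerance`, so exact multiples of the true period do not win ties.
    """
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        n = len(x) - lag
        num = sum(x[i] * x[i + lag] for i in range(n))
        e1 = sum(x[i] ** 2 for i in range(n))
        e2 = sum(x[i + lag] ** 2 for i in range(n))
        den = math.sqrt(e1 * e2)
        score = num / den if den > 0.0 else 0.0
        if score > best_score + tolerance:
            best_lag, best_score = lag, score
    return best_lag


def pitch_frequency_hz(period_samples, sample_rate_hz):
    """Pitch frequency is the reciprocal of the pitch period."""
    return sample_rate_hz / period_samples


# Synthetic "voiced" test signal: period of 80 samples plus a second harmonic.
signal = [math.sin(2 * math.pi * n / 80) + 0.5 * math.sin(4 * math.pi * n / 80)
          for n in range(400)]
period = estimate_pitch_period(signal, min_lag=20, max_lag=160)
frequency = pitch_frequency_hz(period, sample_rate_hz=8000)
```

Real speech is only quasi-periodic, so an estimator of this kind can lock onto half or double the true period, which is one reason single-stage estimators are not robust across input conditions.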
Another parameter of the speech model is a voicing parameter, which indicates which portions of the speech signal are voiced and which are unvoiced. Voicing information may be used during encoding to determine other parameters. Voicing information is also used during decoding, to switch between different synthesis processes for voiced or unvoiced speech. Typically, coding systems operate on frames of the speech signal, where each frame is a segment of the signal and all frames have the same length. One approach to representing voicing information is to provide a binary voiced/unvoiced parameter for each entire frame. Another approach is to divide each frame into frequency bands and to provide a binary parameter for each band. However, neither approach provides a satisfactory model.
One aspect of the invention is a multi-stage method of estimating the pitch of a speech signal that is to be encoded. In a first stage of the method, a set of candidate pitch values is selected, such as by applying a cost function to the speech signal. In a second stage of the method, a best candidate is selected. Specifically, in the second stage, pitch values calculated for previous speech segments are used to calculate an average pitch value. Then, depending on whether the average pitch value is short or long, one of two different analysis-by-synthesis (ABS) processes is performed. The ABS process is repeated for each candidate, such that for each iteration, a synthesized speech signal is derived from that pitch candidate and compared to the input speech signal. A time domain ABS process is performed if the average pitch is short, whereas a frequency domain ABS process is performed if the average pitch is long. Both ABS processes provide an error value corresponding to each pitch candidate. The pitch candidate having the smallest error is deemed to be the best candidate.
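The second-stage selection described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the candidate list is taken as given (the first-stage cost function is not shown), the threshold `short_threshold` separating short from long average pitch is hypothetical, and both error functions are toy stand-ins for the time domain and frequency domain ABS processes, whose actual synthesis steps are not specified in this passage.

```python
import cmath
import math


def time_domain_abs_error(frame, period):
    """Toy time-domain ABS: resynthesize the frame by repeating its first
    `period` samples, then return the normalized squared error."""
    synth = [frame[i % period] for i in range(len(frame))]
    err = sum((a - b) ** 2 for a, b in zip(frame, synth))
    energy = sum(a * a for a in frame)
    return err / energy if energy > 0.0 else 0.0


def freq_domain_abs_error(frame, period):
    """Toy frequency-domain ABS: keep only the DFT bins nearest the harmonics
    of the candidate pitch and return the fraction of energy left unexplained."""
    n = len(frame)
    spectrum = [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                        for t in range(n))) ** 2
                for k in range(n // 2 + 1)]
    harmonic_bins = {round(h * n / period) for h in range(1, period // 2 + 1)}
    total = sum(spectrum[1:])  # ignore the DC bin
    explained = sum(spectrum[k] for k in harmonic_bins if 0 < k < len(spectrum))
    return 1.0 - explained / total if total > 0.0 else 0.0


def select_pitch(frame, candidates, previous_pitches, short_threshold):
    """Pick the candidate pitch period (in samples) with the smallest ABS error.

    The average of previously estimated pitch values decides which ABS
    process is run: time domain if short, frequency domain if long.
    """
    average = sum(previous_pitches) / len(previous_pitches)
    if average < short_threshold:
        error_fn = time_domain_abs_error
    else:
        error_fn = freq_domain_abs_error
    errors = {p: error_fn(frame, p) for p in candidates}
    best = min(errors, key=errors.get)
    return best, errors
```

Calling `select_pitch(frame, [60, 80, 100], previous_pitches=[78, 82, 80], short_threshold=90)` would run the time-domain branch once per candidate and return the candidate with the smallest error together with the full error table.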
An advantage of the pitch estimation method is that it is robust, and its ability to perform well is independent of the peculiarities of the input speech signal. In other words, the method overcomes the problem encountered by existing pitch estimation methods, of dealing with a variety of input speech conditions.
Another aspect of the invention is a mixed voicing estimation method for determining the voiced and unvoiced characteristics of an input speech signal that is to be encoded. The method assumes that a pitch for the input speech signal has previously been estimated. The pitch is used to determine the harmonic frequencies of the speech signal. A probability function is used to assign a probability value to each harmonic frequency, with the probability value being the probability that the speech at that frequency is voiced. For transmission efficiency, a cut-off frequency can be calculated. Below the cut-off frequency, the speech signal is assumed to be voiced so that no probability value is required. The voicing estimator provides an improved method of modeling voicing information. It permits a probability function to be efficiently used to differentiate between voiced and unvoiced portions of mixed speech signals.
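The role of the harmonic frequencies and the cut-off frequency can be sketched in Python as follows. The probability function itself is not specified in this passage, so the per-harmonic probabilities are passed in as data; the rule used to pick the cut-off (the highest harmonic up to which every probability meets a hypothetical `voiced_threshold`) and all function names are assumptions for illustration only.

```python
def harmonic_frequencies(pitch_hz, max_hz):
    """Return the harmonics of the pitch frequency up to max_hz."""
    count = int(max_hz // pitch_hz)
    return [pitch_hz * h for h in range(1, count + 1)]


def split_at_cutoff(freqs, probs, voiced_threshold=0.9):
    """Return (cutoff_hz, remaining_probs).

    Harmonics at or below cutoff_hz are assumed voiced, so only the
    probabilities of the harmonics above the cut-off need to be sent.
    """
    n_voiced = 0
    for p in probs:
        if p < voiced_threshold:
            break
        n_voiced += 1
    cutoff_hz = freqs[n_voiced - 1] if n_voiced > 0 else 0.0
    return cutoff_hz, probs[n_voiced:]


# Example with made-up probability values for a 100 Hz pitch.
freqs = harmonic_frequencies(100.0, 500.0)
probs = [0.99, 0.95, 0.60, 0.92, 0.20]
cutoff_hz, to_transmit = split_at_cutoff(freqs, probs)
```

With this assumed rule, the two lowest harmonics fall below the cut-off and are treated as voiced, so three probability values remain to be transmitted instead of five.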