1. Field of the Invention
The present invention relates generally to an apparatus and method for processing a speech signal, and in particular, to an apparatus and method for extracting pitch information from a speech signal.
2. Description of the Related Art
In general, an audio signal, which may include a speech signal and other sound signals, is called quasi-periodic and is classified, according to its statistical characteristics in the time domain and the frequency domain, into a periodic or harmonic component and a non-periodic or random component, i.e., a voiced part and an unvoiced part. The periodic component and the non-periodic component are determined to be the voiced part and the unvoiced part, respectively, according to the existence or non-existence of pitch information, and a periodic voiced sound and a non-periodic unvoiced sound are distinguished based on the pitch information. In particular, the periodic component carries the most information and significantly affects sound quality, and the period of the voiced part is called the pitch. That is, pitch information is typically regarded as highly important in systems which process speech signals, and a pitch error is the factor that most significantly affects the general performance and sound quality of these systems.
Thus, how accurately the pitch information is detected is important for improving sound quality. Conventional pitch information extraction methods are based on linear prediction analysis, by which a later portion of a signal is predicted from an earlier portion. In addition, because of its superior performance, a pitch information extraction method that represents a speech signal using a sinusoidal representation and calculates a maximum likelihood ratio using the harmonics of the speech signal is widely used.
In the Linear Prediction Analysis Method (LPAM) widely used for speech signal analysis, performance depends on the order of the linear prediction. Accordingly, if the order is increased to improve performance, the number of calculations required to perform the LPAM also increases, so the achievable performance is limited by the computational cost. Moreover, the method works only under the assumption that the signal is stationary over a short time. Thus, in a transition region of a speech signal, the linear prediction cannot easily follow the rapidly changing signal, and the linear prediction analysis fails.
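The trade-off between prediction order and computational cost can be illustrated with a minimal sketch. It assumes the autocorrelation method solved by the Levinson-Durbin recursion, which is one common way to carry out linear prediction and is not necessarily the procedure the conventional methods use. The function names, the synthetic frame, and the chosen orders are hypothetical.

```python
import math
import random

def autocorrelation(x, max_lag):
    """Return r[0..max_lag] for frame x (biased sums, no window applied)."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k)) for k in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve for predictor coefficients a[1..order] from autocorrelation r.

    The frame is modeled as x[n] ~ sum(a[k] * x[n - k]). Returns (a, err),
    where err is the residual prediction-error energy. The nested update
    makes the work grow roughly with order squared.
    """
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err                      # reflection coefficient
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)               # error can only shrink or stay equal
    return a[1:], err

# Hypothetical test frame: two sinusoids plus a little seeded noise.
rng = random.Random(0)
frame = [math.sin(2 * math.pi * n / 40)
         + 0.5 * math.sin(2 * math.pi * n / 13)
         + rng.uniform(-0.1, 0.1)
         for n in range(240)]
r = autocorrelation(frame, 8)
errors = [levinson_durbin(r, p)[1] for p in (2, 4, 8)]
```

In this sketch the residual energy in `errors` does not increase as the order goes from 2 to 8, while the number of multiply-adds in the recursion grows with the square of the order. The frame is treated as stationary throughout, which is the same assumption that breaks down in transition regions.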
In addition, the linear prediction analysis method uses data windowing, and if the balance between time-axis resolution and frequency-axis resolution is not maintained, it is difficult to detect the spectral envelope. For example, for a voice having a very high pitch, the prediction follows individual harmonics rather than the spectral envelope because of the wide gaps between the harmonics. Thus, for a speaker with a high-pitched voice, such as a woman or a child, the performance of linear prediction analysis methods tends to decrease. Despite these problems, the linear prediction analysis method is a widely used spectrum prediction method because of its resolution along the frequency axis and its ease of application to voice compression.
However, the conventional pitch information extraction methods may experience pitch doubling or pitch halving. In detail, to extract correct pitch information from a frame, the length of only the periodic component having pitch information in the frame must be found. However, conventional systems may incorrectly determine a period that is one-half or twice the length of the periodic component, errors which are known as pitch doubling and pitch halving, respectively. Since the conventional pitch information extraction methods may experience pitch doubling and/or pitch halving, a pitch error affecting the general performance and sound quality of a system must be considered.
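The following sketch illustrates one reason such errors can arise and how an estimate might be labeled. It assumes a simple autocorrelation score as the periodicity measure, chosen only because it is familiar; the conventional methods discussed here are not stated to use it. The function names, the tolerance `tol`, and the toy pulse-train frame are hypothetical.

```python
def period_score(x, lag):
    """Mean product of samples `lag` apart (autocorrelation at a single lag)."""
    n = len(x) - lag
    return sum(x[i] * x[i + lag] for i in range(n)) / n

def classify_period_error(estimated, true_period, tol=0.1):
    """Label an estimated period relative to the true period.

    An estimate near half the true period is labeled "doubling" and an
    estimate near twice the true period is labeled "halving".
    `tol` is a hypothetical relative tolerance.
    """
    ratio = estimated / true_period
    if abs(ratio - 1.0) <= tol:
        return "correct"
    if abs(ratio - 0.5) <= tol * 0.5:
        return "doubling"
    if abs(ratio - 2.0) <= tol * 2.0:
        return "halving"
    return "other"

# Toy frame: one unit pulse every 50 samples, so the true period is 50.
frame = [1.0 if i % 50 == 0 else 0.0 for i in range(400)]
scores = {lag: period_score(frame, lag) for lag in (25, 50, 100)}
```

For this toy frame the score at a lag of 100 samples matches the score at the true period of 50 samples, so a detector that picks the best-scoring lag has no basis for preferring the true period over twice that period. A small amount of noise could then tip the choice toward the doubled period, which `classify_period_error` would label as pitch halving.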
When a pitch error is generated, the frequency considered the best candidate is selected using an algorithm. The pitch error is characterized by a fine error ratio, which results from the performance limit of the algorithm, and a gross error ratio, which is the ratio of the number of frames containing errors to the total number of frames. For example, when errors are generated in 5 frames out of 100 frames, the gross error ratio is 5/100, while the fine error ratio is the difference between the pitch information of the remaining 95 frames and the pitch information after a checking process; the error range tends to increase as noise increases. The gross errors arise from an unrecoverable error of around one period in pitch doubling and around half a period in pitch halving.
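The two error measures can be sketched as follows. This is only an illustration under stated assumptions: the pitch information obtained after the checking process is treated as a per-frame reference track, a frame is counted as a gross error when its estimate deviates from the reference by more than a hypothetical 20% threshold, and the fine error is taken as the mean absolute deviation over the remaining frames. None of these choices is specified in the text above.

```python
def pitch_error_stats(estimated, reference, gross_threshold=0.2):
    """Return (gross_error_ratio, fine_error) for per-frame pitch values.

    gross_threshold is a hypothetical relative deviation above which a
    frame counts as a gross error. The fine error is the mean absolute
    deviation over the frames that are not gross errors.
    """
    gross_frames = 0
    fine_deviations = []
    for est, ref in zip(estimated, reference):
        deviation = abs(est - ref)
        if deviation / ref > gross_threshold:
            gross_frames += 1
        else:
            fine_deviations.append(deviation)
    gross_ratio = gross_frames / len(reference)
    if fine_deviations:
        fine_error = sum(fine_deviations) / len(fine_deviations)
    else:
        fine_error = 0.0
    return gross_ratio, fine_error

# 100 frames with a reference period of 100 samples: 5 frames are
# estimated at twice the period, and the other 95 are off by 1 sample.
reference = [100.0] * 100
estimated = [200.0] * 5 + [101.0] * 95
gross_ratio, fine_error = pitch_error_stats(estimated, reference)
```

With these made-up values, 5 of the 100 frames are counted as gross errors and the fine error is computed only over the other 95 frames, mirroring the 5-out-of-100 example in the text.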
As described above, because of pitch doubling or pitch halving, the conventional pitch information extraction methods perform poorly with respect to the pitch error, which most significantly affects the general performance and sound quality of a system.