1. Field of the Invention
The present invention relates to a method of estimating a formant frequency, which is important information in speech recognition, by accelerating a spectrum using a pitch frequency, and to an apparatus using the method.
2. Description of the Related Art
Generally, a formant frequency (hereinafter referred to as a “formant frequency” or a “formant”) extracted from a speech signal is mainly utilized in speech coding, such as in a formant vocoder, in text-to-speech synthesis using formant frequencies, and as a feature vector in a speech recognizer. In speech recognition in particular, a formant frequency is very important information, and it is also vital information that linguists use to distinguish speech sounds. A formant frequency may be utilized directly as a feature vector for speech recognition, and a formant component may be used to intensify a speech component.
In a conventional method of searching for a formant frequency, the formant frequency is obtained by identifying a local maximum point in a linear prediction spectrum or a cepstrally smoothed spectrum.
First, as a preprocessing operation, a speech signal to be processed is filtered to enhance its quality or is passed through a pre-emphasis filter. Next, a short-time signal is extracted by multiplying an appropriate section of the speech signal, approximately 20 ms to 40 ms long, by either a Hamming window or a Kaiser window, as required. Next, the linear prediction spectrum is obtained by calculating linear prediction coefficients from the short-time signal, or the cepstrally smoothed spectrum is obtained. Next, a local maximum point is found in the obtained spectrum, and the formant frequency corresponding to the local maximum point is obtained. In this instance, erroneous values which may occur unpredictably are filtered out by a smoothing operation performed as post-processing.
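The peak-picking procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of any particular prior-art system: the pre-emphasis coefficient 0.97, the autocorrelation method with the Levinson-Durbin recursion, the prediction order, and all function names are assumptions introduced here. Only the linear prediction spectrum branch with a Hamming window is shown; the cepstrally smoothed spectrum and the smoothing post-process are omitted.

```python
import cmath
import math

def pre_emphasize(x, alpha=0.97):
    """y[n] = x[n] - alpha * x[n-1]; alpha = 0.97 is an assumed, commonly used value."""
    return [x[0]] + [x[n] - alpha * x[n - 1] for n in range(1, len(x))]

def hamming(n):
    """Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]

def lpc_autocorr(x, order):
    """Prediction error filter [1, a1, ..., ap] via autocorrelation and Levinson-Durbin."""
    r = [sum(x[n] * x[n + k] for n in range(len(x) - k)) for k in range(order + 1)]
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        k = -sum(a[j] * r[i - j] for j in range(i)) / err   # reflection coefficient
        prev = a[:]
        for j in range(1, i + 1):
            a[j] = prev[j] + k * prev[i - j]
        err *= 1.0 - k * k                                   # remaining prediction error
    return a

def lpc_spectrum_peaks(a, fs, n_points=512):
    """Frequencies (Hz) of local maxima of the LPC spectrum 1/|A(e^{jw})| on (0, fs/2)."""
    mag = []
    for m in range(n_points + 1):
        w = math.pi * m / n_points
        A = sum(c * cmath.exp(-1j * w * j) for j, c in enumerate(a))
        mag.append(1.0 / max(abs(A), 1e-12))
    return [0.5 * fs * m / n_points
            for m in range(1, n_points)
            if mag[m] > mag[m - 1] and mag[m] >= mag[m + 1]]

def formants_by_peak_picking(frame, fs, order=10):
    """Pre-emphasis, Hamming window, LPC analysis, then spectral peak picking."""
    x = pre_emphasize(frame)
    w = hamming(len(x))
    a = lpc_autocorr([xi * wi for xi, wi in zip(x, w)], order)
    return lpc_spectrum_peaks(a, fs)
```

Here `frame` would be one 20 ms to 40 ms section of the speech signal, for example 240 samples at an assumed sampling rate of 8 kHz. Every local maximum is returned as a candidate, so a peak caused by a harmonic or by colored noise is reported in the same way as a true formant.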
Second, the formant frequency is obtained by finding a root of a prediction error filter, that is, a ‘zero’. Initially, after the speech signal is passed through a low emphasis filter or a pre-emphasis filter, a short-time signal is obtained by multiplying an appropriate section of the speech signal, approximately 20 ms to 40 ms long, by either a Hamming window or a Kaiser window, as required. Next, the prediction error filter is obtained by calculating linear prediction coefficients from the short-time signal. Next, the ‘zero’ is obtained by solving the prediction error filter using a numerical analysis method, and the formant frequency is obtained by applying the ‘zero’ to a certain equation. In this instance, erroneous values which may occur unpredictably are filtered out by a smoothing operation performed as post-processing.
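The root-finding step of this second method can be sketched as follows. The text does not state which numerical method or which equation is used, so the sketch makes two assumptions: the zeros are found by Durand-Kerner iteration, and a zero z is converted to a frequency by the commonly used relation f = (fs / 2π)·arg(z), with a bandwidth estimate of −(fs / π)·ln|z|. The function names and the input `a`, the prediction error filter coefficients [1, a1, ..., ap], are hypothetical.

```python
import cmath
import math

def polynomial_roots(coeffs, iterations=200):
    """Roots of a monic polynomial by Durand-Kerner iteration (one possible method)."""
    n = len(coeffs) - 1
    roots = [(0.4 + 0.9j) ** k for k in range(n)]   # conventional distinct starting points
    for _ in range(iterations):
        for i in range(n):
            num = 0j
            for c in coeffs:                        # Horner evaluation at roots[i]
                num = num * roots[i] + c
            den = 1.0 + 0j
            for j in range(n):
                if j != i:
                    den *= roots[i] - roots[j]
            roots[i] -= num / den
    return roots

def formants_from_zeros(a, fs):
    """(frequency, bandwidth) in Hz for each zero of A(z) in the upper half-plane."""
    result = []
    for z in polynomial_roots(a):
        if z.imag > 0.0:                            # one zero per complex-conjugate pair
            freq = cmath.phase(z) * fs / (2.0 * math.pi)
            bandwidth = -math.log(abs(z)) * fs / math.pi
            result.append((freq, bandwidth))
    return sorted(result)
```

Each complex-conjugate pair of zeros yields one candidate. Zeros with a wide bandwidth usually shape the overall spectral tilt rather than a resonance, which is one reason a post-processing step is needed to reject spurious values.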
Third, a ‘zero’ is searched for progressively by dividing a region of the z-plane and applying Cauchy's integral formula. Initially, using the prediction error filter, the number of ‘zeros’ in a fan-shaped region of the z-plane is obtained by using Cauchy's integral formula in an equation embodied as below. Next, fan-shaped regions containing no ‘zero’ are excluded, and each fan-shaped region containing a ‘zero’ is repeatedly bisected until sufficient precision is achieved.
The above-described conventional methods may calculate a formant frequency directly, and they are comparatively robust against noise. However, a harmonic component and a formant component may be difficult to distinguish, and when colored noise occurs, a formant component and a noise component may not be distinguishable. FIG. 1 is a diagram illustrating graphs of formant frequency estimation according to a conventional technique. As shown in areas 101 and 102 of FIG. 1, when colored noise occurs, it is difficult to distinguish a formant component from a noise component.
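The fan-shaped region search of the third conventional method can be sketched as follows. The equation used for counting is not reproduced in the text, so the sketch assumes the argument-principle form N = (1/2πj)∮A′(z)/A(z)dz, evaluated numerically as the net phase winding of z^p·A(z) around the boundary of the sector. The function names, the contour sampling density `n`, the unit radius, and the tolerance `tol_hz` are assumptions introduced here, and the sketch further assumes that no zero lies exactly on a sector boundary, which in particular excludes real zeros.

```python
import cmath
import math

def poly_at(a, z):
    """Evaluate z^p * A(z) by Horner's rule; it has the same nonzero zeros as A(z)."""
    acc = 0j
    for c in a:
        acc = acc * z + c
    return acc

def count_zeros_in_sector(a, lo, hi, radius=1.0, n=2000):
    """Zeros of A(z) with |z| < radius and lo < arg z < hi, from the net phase winding."""
    # Closed counterclockwise contour: ray out at angle lo, arc to hi, ray back to 0.
    contour = [radius * k / n * cmath.exp(1j * lo) for k in range(n + 1)]
    contour += [radius * cmath.exp(1j * (lo + (hi - lo) * k / n)) for k in range(1, n + 1)]
    contour += [radius * k / n * cmath.exp(1j * hi) for k in range(n - 1, -1, -1)]
    total = 0.0
    prev = poly_at(a, contour[0])
    for z in contour[1:]:
        cur = poly_at(a, z)
        total += cmath.phase(cur / prev)   # phase increment, wrapped to (-pi, pi]
        prev = cur
    return round(total / (2.0 * math.pi))

def zero_frequencies_by_bisection(a, fs, tol_hz=2.0):
    """Frequencies (Hz) of upper-half-plane zeros, each located to within tol_hz."""
    tol = 2.0 * math.pi * tol_hz / fs
    found, stack = [], [(0.0, math.pi)]
    while stack:
        lo, hi = stack.pop()
        if count_zeros_in_sector(a, lo, hi) == 0:
            continue                        # exclude sectors that contain no zero
        if hi - lo <= tol:
            found.append(0.5 * (lo + hi) * fs / (2.0 * math.pi))
            continue
        mid = 0.5 * (lo + hi)
        stack += [(lo, mid), (mid, hi)]     # bisect sectors that still contain a zero
    return sorted(found)
```

The search only ever needs the zero count of a sector, so no explicit root solver is required. Two zeros closer together in angle than the tolerance would be reported as a single frequency in this sketch.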