In digital wireless communication, in packet communication typified by Internet communication, and in fields such as speech storage, speech signal coding/decoding techniques are indispensable for making effective use of the capacity of a transmission channel (for example, radio waves) or of a storage medium, and many speech coding/decoding systems have been developed to date. Among these, the CELP (Code Excited Linear Prediction) speech coding/decoding system has come into practical use as a mainstream system.
A CELP speech coding apparatus encodes input speech on the basis of a speech model stored in advance. Specifically, the CELP speech coding apparatus divides a digitized speech signal into frames of about 10 to 20 ms, performs linear prediction analysis of the speech signal for each frame to determine linear prediction coefficients and a linear prediction residual vector, and encodes the linear prediction coefficients and the linear prediction residual vector separately.
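The per-frame analysis described above can be illustrated with a minimal sketch. It assumes the autocorrelation method with the Levinson-Durbin recursion, which is one common way to obtain linear prediction coefficients; the text does not state which method a given CELP apparatus uses. The function names, the 20 ms frame length, and the prediction order are illustrative assumptions, and the subsequent quantization of the coefficients and the residual is omitted.

```python
def split_frames(signal, sample_rate, frame_ms=20):
    """Split a digitized signal into non-overlapping frames of frame_ms milliseconds."""
    n = int(sample_rate * frame_ms / 1000)
    return [signal[i:i + n] for i in range(0, len(signal) - n + 1, n)]


def autocorrelation(frame, max_lag):
    """Autocorrelation r[0..max_lag] of one frame."""
    return [
        sum(frame[n] * frame[n - k] for n in range(k, len(frame)))
        for k in range(max_lag + 1)
    ]


def levinson_durbin(r, order):
    """Solve the normal equations for a[1..order], where x[n] ~ sum(a[k] * x[n-k])."""
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:
            break  # silent or degenerate frame: keep remaining coefficients at zero
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err  # reflection coefficient
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= 1.0 - k * k
    return a[1:]


def lp_residual(frame, coeffs):
    """Residual e[n] = x[n] - sum(coeffs[k] * x[n-1-k]); samples before the frame count as zero."""
    residual = []
    for n in range(len(frame)):
        pred = sum(
            c * frame[n - 1 - k] for k, c in enumerate(coeffs) if n - 1 - k >= 0
        )
        residual.append(frame[n] - pred)
    return residual


def lp_analysis(frame, order=10):
    """Return (linear prediction coefficients, residual vector) for one frame."""
    coeffs = levinson_durbin(autocorrelation(frame, order), order)
    return coeffs, lp_residual(frame, coeffs)
```

In this sketch the two return values of `lp_analysis` correspond to the two quantities that the CELP apparatus encodes separately.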
A variable rate coding apparatus that changes the bit rate according to the input signal has also been realized. Such an apparatus can encode the input signal at a high bit rate when the input signal consists mainly of speech information and at a low bit rate when the input signal consists mainly of noise information. That is, when a lot of important information is included, high-quality coding is performed so that the output signal reproduced on the decoding apparatus side is of high quality. When importance is low, on the other hand, power, transmission bandwidth and the like can be saved by low-quality coding. In this way, by detecting features of the input signal (for example, voicedness, unvoicedness, tonality and the like) and changing the coding method according to the detection result, it is possible to perform coding suited to the features of the input signal and improve coding performance.
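The rate-switching idea can be sketched as follows. This is only an illustration under stated assumptions: the two bit rates are hypothetical placeholder values, and the per-frame speech/noise decision is taken as given rather than computed here.

```python
# Hypothetical bit rates chosen for illustration; they are not taken from the text.
HIGH_RATE_BPS = 12000  # used when the frame mainly carries speech information
LOW_RATE_BPS = 2000    # used when the frame mainly carries noise information


def select_bit_rate(is_speech):
    """Pick the coding bit rate for one frame from its speech/noise classification."""
    return HIGH_RATE_BPS if is_speech else LOW_RATE_BPS


def average_bit_rate(frame_flags):
    """Average bit rate over a sequence of per-frame speech/noise flags."""
    if not frame_flags:
        return 0.0
    return sum(select_bit_rate(flag) for flag in frame_flags) / len(frame_flags)
```

The saving in transmission bandwidth comes from the noise-dominated frames: the more frames are classified as noise, the closer the average rate moves toward the low rate.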
As a means of classifying an input signal as speech information or noise information, a VAD (Voice Activity Detector) exists. Specifically, known methods include (1) a method in which the input signal is quantized and assigned to a class, and the speech information/noise information classification is performed on the basis of the class information, (2) a method in which the fundamental period of the input signal is determined, and the speech information/noise information classification is performed according to the level of correlation between the current signal and the signal that precedes it by the length of the fundamental period, and (3) a method in which the temporal variation in the frequency components of the input signal is examined, and the speech information/noise information classification is performed according to the variation information.
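Method (2) above can be sketched as follows. The sketch assumes that the fundamental period is estimated as the lag that maximizes the normalized correlation within a search range, and that a fixed threshold separates speech from noise; the lag range, the threshold of 0.6, and all function names are illustrative assumptions rather than values from the text.

```python
import math


def normalized_correlation(signal, lag):
    """Normalized correlation between the signal and the same signal `lag` samples earlier."""
    current = signal[lag:]
    earlier = signal[:-lag]
    num = sum(c * e for c, e in zip(current, earlier))
    den = math.sqrt(sum(c * c for c in current) * sum(e * e for e in earlier))
    return num / den if den > 0.0 else 0.0


def estimate_fundamental_period(signal, min_lag, max_lag):
    """Take the lag with the highest normalized correlation as the fundamental period."""
    return max(
        range(min_lag, max_lag + 1),
        key=lambda lag: normalized_correlation(signal, lag),
    )


def is_speech_like(signal, min_lag=20, max_lag=120, threshold=0.6):
    """Classify as speech when the correlation at the fundamental period is high.

    The lag range and threshold are hypothetical values for illustration.
    """
    lag = estimate_fundamental_period(signal, min_lag, max_lag)
    return normalized_correlation(signal, lag) >= threshold
```

A strongly periodic input, such as voiced speech, yields a correlation close to 1 at its fundamental period, whereas noise yields a low correlation at every lag in the search range.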
There is also a technique in which the frequency components of an input signal are determined by SDFT (Shifted Discrete Fourier Transform), and the tonality of the input signal is classified according to the level of correlation between the frequency components of the current frame and the frequency components of the previous frame (for example, PTL 1). In the technique disclosed in PTL 1, the frequency band extension method is switched according to the tonality so as to improve coding performance.
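The inter-frame correlation idea can be illustrated with a rough sketch. It is not the method of PTL 1, whose details are not given here. The sketch assumes an SDFT of the form X[k] = sum over n of x[n] * exp(-j*2*pi*(n + u)*(k + v)/N), with placeholder shifts u = v = 0.5, and it assumes that the correlation is a Pearson correlation between magnitude spectra compared against a hypothetical threshold of 0.8.

```python
import cmath
import math


def sdft_magnitudes(frame, u=0.5, v=0.5):
    """Magnitudes of a shifted DFT; u and v are placeholder time/frequency shifts."""
    n_samples = len(frame)
    magnitudes = []
    for k in range(n_samples // 2):  # remaining bins mirror these for real input
        acc = 0j
        for n, x in enumerate(frame):
            acc += x * cmath.exp(-2j * math.pi * (n + u) * (k + v) / n_samples)
        magnitudes.append(abs(acc))
    return magnitudes


def pearson_correlation(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    den = math.sqrt(var_a * var_b)
    return cov / den if den > 0.0 else 0.0


def is_tonal(previous_frame, current_frame, threshold=0.8):
    """Treat the signal as tonal when consecutive frame spectra are highly correlated.

    The threshold is a hypothetical value for illustration.
    """
    corr = pearson_correlation(
        sdft_magnitudes(previous_frame), sdft_magnitudes(current_frame)
    )
    return corr >= threshold
```

A stationary sinusoid keeps its spectral peak in the same bins from one frame to the next, so the correlation is high; for noise, the spectra of consecutive frames are largely unrelated and the correlation is low.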