1. Field of the Invention
The present invention relates generally to a speech signal classification system, and in particular, to a speech signal classification system and method for classifying an input speech signal as a voice sound, a non-voice sound, or background noise based on a characteristic of a speech frame of the speech signal.
2. Description of the Related Art
In general, a speech signal classification system is used during pre-processing, before an input speech signal is recognized as a specific character, to determine whether the input speech signal is a voice sound, a non-voice sound, or background noise. Background noise is noise that has no recognizable meaning in speech recognition; that is, background noise is neither a voice sound nor a non-voice sound.
The classification of a speech signal is important for recognizing subsequent speech signals, since the type of character that can be recognized from those signals depends on whether the current speech signal is a voice sound or a non-voice sound. The classification of a speech signal as a voice sound or a non-voice sound is basic and important in all kinds of speech recognition and audio signal processing systems, e.g., signal processing systems performing coding, synthesis, recognition, and enhancement.
In order to classify an input speech signal as a voice sound, a non-voice sound, or background noise, various characteristics are extracted from the signal obtained by transforming the speech signal into the frequency domain. For example, these characteristics include a periodic characteristic of harmonics, the Root Mean Squared Energy (RMSE) of a low band speech signal, and a Zero-crossing Count (ZC). A conventional speech signal classification system extracts various characteristics from an input speech signal, weights the extracted characteristics using a recognition unit comprising a neural network, and, according to a value computed from the weighted characteristics, recognizes whether the input speech signal is a voice sound, a non-voice sound, or background noise. The input speech signal is then classified and output according to the recognition result.
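Two of these characteristics can be sketched in a few lines. The following Python fragment is illustrative only; the sample rate and low-band edge (`sample_rate`, `low_band_hz`) are assumed values chosen for the example, not parameters specified by the conventional system:

```python
import numpy as np

def extract_features(frame, sample_rate=8000, low_band_hz=1000):
    """Illustrative sketch of two frame characteristics: the
    Zero-crossing Count (ZC) and the RMSE of the low-band signal.
    Parameter values are assumptions, not taken from the system."""
    # Zero-crossing Count: number of sign changes in the time-domain frame.
    zc = int(np.sum(np.signbit(frame[:-1]) != np.signbit(frame[1:])))

    # RMSE of the low band: root-mean-square of the magnitudes of the
    # spectral bins below low_band_hz.
    spectrum = np.fft.rfft(frame)
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    low_band = np.abs(spectrum[freqs < low_band_hz])
    rmse = float(np.sqrt(np.mean(low_band ** 2)))
    return zc, rmse
```

For example, a 160-sample frame of a 100 Hz sine at an 8 kHz sampling rate yields a small zero-crossing count and a nonzero low-band RMSE, whereas a fricative-like noise frame would show a much higher zero-crossing count.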
FIG. 1 is a block diagram of a conventional speech signal classification system.
Referring to FIG. 1, the conventional speech signal classification system includes a speech frame input unit 100 for generating a speech frame by converting an input speech signal, a characteristic extractor 102 for receiving the speech frame and extracting pre-set characteristics, a recognition unit 104 for deriving a recognition result from the extracted characteristics, a determiner 106 for determining from the recognition result whether the speech frame corresponds to a voice sound, a non-voice sound, or background noise, and a classification & output unit 108 for classifying and outputting the speech frame according to the determination result.
The speech frame input unit 100 converts the input speech signal into a speech frame by transforming the speech signal into the frequency domain using a Fast Fourier Transform (FFT). The characteristic extractor 102 receives the speech frame from the speech frame input unit 100, extracts characteristics, such as a periodic characteristic of harmonics, the RMSE of a low band speech signal, and a ZC, from the speech frame, and outputs the extracted characteristics to the recognition unit 104. In general, the recognition unit 104 comprises a neural network. Because a neural network is, by its nature, useful for analyzing complicated nonlinear problems, i.e., problems that cannot be solved mathematically, it is suitable for determining, according to an analysis result, whether an input speech signal is a voice sound, a non-voice sound, or background noise. The recognition unit 104 applies pre-set weights to the characteristics input from the characteristic extractor 102 and derives a recognition result through a neural network calculation process. The recognition result is a value obtained by computing the elements of the speech frame according to the weights applied to its characteristics, i.e., a calculation value.
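The recognition step described above can be sketched as a small feed-forward network that applies pre-set weights to the extracted characteristics and yields a per-class score. The network size, activation, and weights below are hypothetical; the conventional system's actual topology is not specified here:

```python
import numpy as np

def recognize(features, weights_hidden, weights_out):
    """Illustrative sketch of the recognition unit: apply pre-set
    weights to the extracted characteristics and derive one score
    (a calculation value) per class. All weights are hypothetical."""
    hidden = np.tanh(features @ weights_hidden)  # weighted characteristics
    scores = hidden @ weights_out                # one score per class
    return scores

# Usage: 3 input characteristics, 4 hidden units, 3 output classes
# (voice sound, non-voice sound, background noise).
rng = np.random.default_rng(0)
scores = recognize(np.array([0.8, 0.1, 0.3]),
                   rng.standard_normal((3, 4)),
                   rng.standard_normal((4, 3)))
label = ["voice", "non-voice", "background noise"][int(np.argmax(scores))]
```

The determiner's role then reduces to selecting the class whose score is largest, as in the last line above.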
The determiner 106 determines, according to the recognition result, i.e., the value calculated by the recognition unit 104, whether the input speech signal is a voice sound, a non-voice sound, or background noise. The classification & output unit 108 outputs the speech frame as a voice sound, a non-voice sound, or background noise according to a determination result of the determiner 106.
In general, since the characteristics extracted by the characteristic extractor 102 for a voice sound differ clearly from those of a non-voice sound or background noise, it is relatively easy to distinguish a voice sound from a non-voice sound or background noise. A non-voice sound, however, is not clearly distinguishable from background noise.
For example, a voice sound has a periodic characteristic in which harmonics appear repeatedly within a predetermined period; background noise has no such harmonic characteristic; and a non-voice sound has harmonics with only weak periodicity. In other words, a voice sound exhibits repeated harmonics even within a single frame, whereas a non-voice sound has a weak periodic characteristic in which harmonics appear but their periodicity, one characteristic of a voice sound, emerges only over several frames.
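One standard way to quantify the within-frame periodicity described above is the peak of the normalized autocorrelation at a nonzero lag: a strongly periodic (voice) frame scores near 1, while background noise scores near 0. This generic measure is chosen purely for illustration and is not the specific harmonic analysis used by the system; the minimum lag below is likewise an assumed value:

```python
import numpy as np

def periodicity(frame, min_lag=20):
    """Illustrative periodicity score: peak of the normalized
    autocorrelation at lags >= min_lag. A generic measure used
    here for illustration only."""
    frame = frame - np.mean(frame)
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    if ac[0] <= 0:
        return 0.0
    ac = ac / ac[0]                   # normalize so lag-0 equals 1
    return float(np.max(ac[min_lag:]))  # strongest repetition at a nonzero lag
```

Under this measure, a frame containing two full cycles of a sinusoid scores markedly higher than a frame of white noise, mirroring the voice-sound versus background-noise distinction; a non-voice frame would fall in between, its harmonics too weakly periodic to produce a strong single-frame peak.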
Thus, because the conventional speech signal classification system classifies a single input speech frame using only the characteristics extracted from that frame, it maintains high accuracy when the frame is a voice sound. When the input frame is not a voice sound, however, the accuracy of classifying it as a non-voice sound or background noise decreases significantly.