1. Field of the Invention
The present invention relates to an apparatus and method for speech segment detection and a system for speech recognition that combine an image signal and a sound signal to detect a speech segment.
2. Discussion of Related Art
Speech recognition is a sequential process that analyzes features of a sound signal corresponding to speech and converts the sound signal into characters using a computer. The main process of speech recognition can be broken down into a preprocess step, a search step, and a post-process step.
First, a sound signal is input through a speech input device. In the preprocess step, the beginning point and end point of speech (a speech segment) are detected from the input sound signal, a procedure known as end point detection (EPD), and then sound features are extracted.
Subsequently, in the search step, a previously prepared sound model and pronouncing dictionary are searched, phonemes having features similar to those extracted in the preprocess step are found, and the phonemes are combined into a word or a sentence. Then, in order to reduce errors in the search result, the post-process step of applying a language model is performed.
The above speech recognition process will be described in detail below with reference to FIG. 1.
FIG. 1 is a flowchart showing a method for speech recognition in a conventional speech recognition system.
Referring to FIG. 1, when a sound signal is received in step 100, the speech recognition system frames the received sound signal in step 102.
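The framing of step 102 can be sketched as follows. This is a minimal illustration, not the conventional system's actual implementation: the function name `frame_signal`, the parameters `frame_len` and `hop_len`, and the choice to drop a trailing partial frame are all assumptions made for the example.

```python
def frame_signal(samples, frame_len, hop_len):
    """Split a sample sequence into fixed-length frames (hypothetical helper).

    frame_len: number of samples per frame (assumed, not given in the text).
    hop_len:   step between frame starts; hop_len < frame_len gives overlap.
    A trailing partial frame is dropped in this sketch.
    """
    if frame_len <= 0 or hop_len <= 0:
        raise ValueError("frame_len and hop_len must be positive")
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop_len):
        frames.append(list(samples[start:start + frame_len]))
    return frames
```

For example, ten samples with `frame_len=4` and `hop_len=2` yield four half-overlapping frames.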
Then, in step 104, the speech recognition system removes stationary noise from the sound signal frame by frame. More specifically, the speech recognition system eliminates high-frequency components by performing frame-specific low-pass filtering.
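The frame-by-frame low-pass filtering of step 104 might look like the sketch below. The text does not say which filter is used, so a causal moving average (a very simple low-pass filter) is assumed here; the name `low_pass_frame` and the `window` parameter are hypothetical.

```python
def low_pass_frame(frame, window=3):
    """Smooth one frame with a causal moving average (assumed filter type).

    Each output sample is the mean of the current sample and up to
    window - 1 preceding samples in the same frame, which attenuates
    rapidly alternating (high-frequency) components.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    out = []
    for i in range(len(frame)):
        lo = max(0, i - window + 1)
        chunk = frame[lo:i + 1]
        out.append(sum(chunk) / len(chunk))
    return out
```

A constant frame passes through unchanged, while a frame that alternates sign on every sample is flattened toward zero.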
Next, in step 106, the speech recognition system determines, for each frame from which stationary noise has been removed, whether or not the absolute energy is large and the zero-crossing rate is small. More specifically, the speech recognition system determines that the corresponding frame is noise when the absolute energy is small or the zero-crossing rate is large, and that the corresponding frame is a speech frame when the absolute energy is large and the zero-crossing rate is small.
When the absolute energy of the corresponding frame is large and the zero-crossing rate is small, as a result of the determination of step 106, the speech recognition system determines that the corresponding frame is a speech frame in step 108.
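The decision rule of steps 106 and 108 can be illustrated with a short sketch. The text gives no formulas or thresholds, so the following are assumptions: absolute energy is taken as the sum of absolute sample amplitudes, the zero-crossing rate as the fraction of adjacent sample pairs that change sign, and `energy_threshold` and `zcr_threshold` are hypothetical tuning parameters.

```python
def absolute_energy(frame):
    """Sum of absolute amplitudes in the frame (assumed definition)."""
    return sum(abs(x) for x in frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


def is_speech_frame(frame, energy_threshold, zcr_threshold):
    """Speech only if energy is large AND zero-crossing rate is small.

    Any other combination (small energy or large zero-crossing rate)
    is treated as noise, mirroring the rule described in the text.
    """
    return (
        absolute_energy(frame) > energy_threshold
        and zero_crossing_rate(frame) < zcr_threshold
    )
```

With `energy_threshold=10` and `zcr_threshold=0.3`, a loud, slowly varying frame is accepted, whereas a quiet frame or a loud but rapidly alternating frame is rejected.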
Subsequently, the speech recognition system determines whether or not speech frames continue for at least a predetermined number of frames in step 110.
When it is determined that speech frames continue for at least the predetermined number of frames, the speech recognition system determines that a segment corresponding to the frames is a speech segment in step 112.
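The continuity check of steps 110 and 112 can be sketched as a scan for runs of consecutive speech frames. The function name `find_speech_segments`, the `min_frames` parameter, and the half-open `(start, end)` index convention are assumptions for illustration only.

```python
def find_speech_segments(flags, min_frames):
    """Return (start, end) frame-index pairs for runs of speech frames.

    flags:      per-frame booleans, True meaning "speech frame".
    min_frames: the predetermined minimum run length (hypothetical name).
    end is exclusive. Runs shorter than min_frames are discarded,
    which corresponds to treating them as noise.
    """
    segments = []
    run_start = None
    for i, flag in enumerate(flags):
        if flag:
            if run_start is None:
                run_start = i  # a new run of speech frames begins
        else:
            if run_start is not None and i - run_start >= min_frames:
                segments.append((run_start, i))
            run_start = None
    # Close a run that extends to the last frame.
    if run_start is not None and len(flags) - run_start >= min_frames:
        segments.append((run_start, len(flags)))
    return segments
```

Given the flags `[False, True, True, True, False, True, False, True, True]` and `min_frames=2`, the isolated speech frame at index 5 is dropped and two segments remain.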
Then, the speech recognition system extracts a feature vector from the determined speech segment in step 114, and performs speech recognition using the extracted feature vector in step 116.
More specifically, the speech recognition system searches a sound model and a pronouncing dictionary, finds phonemes whose features are similar to the extracted feature vector, and combines the phonemes into a word or a sentence. Then, the speech recognition system performs speech recognition with a language model applied so as to reduce errors in the combined word or sentence.
When it is determined in step 106 that the absolute energy of the corresponding frame is not large or the zero-crossing rate is not small, the speech recognition system determines that the corresponding frame is noise in step 118, and returns to step 104.
When it is determined in step 110 that speech frames do not continue for at least the predetermined number of frames, the speech recognition system determines that the corresponding frame is noise in step 118, and returns to step 104.
The daily environment in which speech recognition can be performed through the above-described process is filled with a variety of noise such as surrounding noise, channel noise in a computer, and noise in a communication network.
Therefore, speech segment detection, a necessary initial part of the entire speech recognition process, directly affects the recognition rate.
However, since the above-described conventional speech segment detection method fundamentally utilizes a level of sound energy, a zero-crossing rate and continuity of an input signal as main parameters, it is hard to distinguish speech from noise.
In addition, speech segment detection starts by checking whether or not an input signal has sound energy, but speech and noise both have sound energy, and thus it is hard to distinguish speech from noise on that basis.
Furthermore, a technique for removing stationary noise, which is characterized by a uniform level of sound energy and a high frequency, is frequently used, but there is no technique capable of distinguishing speech from dynamic noise.
As a result, dynamic noise is not removed but is classified as a speech segment and handed over to the speech recognition process, so that resources are consumed unnecessarily and speech recognition errors occur.
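The limitation described above can be illustrated with a small, self-contained sketch. Everything here is an assumption made for the example: the sampling rate, the frame length, the thresholds, and the use of a loud 100 Hz tone burst as a synthetic stand-in for dynamic noise. The point is only that a non-speech signal with large energy and a small zero-crossing rate satisfies the same rule as speech.

```python
import math


def energy_zcr_says_speech(frame, energy_threshold, zcr_threshold):
    """Energy / zero-crossing rule: large energy AND small ZCR means speech."""
    energy = sum(abs(x) for x in frame)
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    zcr = crossings / (len(frame) - 1)
    return energy > energy_threshold and zcr < zcr_threshold


SAMPLE_RATE = 8000   # Hz, assumed
FRAME_LEN = 160      # samples per frame (20 ms at 8 kHz), assumed

# Synthetic stand-in for dynamic noise: a loud, low-frequency burst.
noise_burst = [
    0.8 * math.sin(2 * math.pi * 100 * n / SAMPLE_RATE)
    for n in range(FRAME_LEN)
]
# Near-silent frame for comparison.
quiet = [
    0.001 * math.sin(2 * math.pi * 3000 * n / SAMPLE_RATE)
    for n in range(FRAME_LEN)
]

# Hypothetical thresholds chosen only for this illustration.
noise_flagged = energy_zcr_says_speech(noise_burst, 10.0, 0.1)
quiet_flagged = energy_zcr_says_speech(quiet, 10.0, 0.1)
```

Under these assumptions the noise burst is flagged as a speech frame although it contains no speech, while the near-silent frame is rejected.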