1. Technical Field
The present invention relates to a method and an apparatus for detecting a speech endpoint, and more specifically to a method and an apparatus for detecting a speech endpoint using a weighted finite state transducer (WFST).
2. Background Art
Speech recognition technology extracts features from human speech delivered to a computer or a speech recognition system through a telephone, a microphone, or the like, analyzes the features, and finds the closest result in a pre-inputted recognition list.
Speech recognition performance depends largely on how accurately the speech section can be separated from the noise in speech inputted with noise. Demand for real-time speech recognition technology has grown recently with the increased popularity of devices implemented with a voice-operated user interface. Accordingly, a variety of studies have addressed speech section detection technology for accurately detecting, in speech inputted with noise, the speech section, that is, the interval between the time when the speech is inputted and the time when the speech ends.
It is generally known that the accuracy of speech section detection depends on the performance of detecting the speech endpoint, which represents the end of the speech section. Moreover, the current level of speech endpoint detection technology is the biggest reason why speech recognition technology has not become widely popular. Improvement of speech endpoint detection technology is therefore urgently needed.
FIG. 1 is a block diagram showing an example of a conventional apparatus for detecting a speech endpoint.
As FIG. 1 shows, a conventional apparatus 1 for detecting a speech endpoint mainly includes a frame-level decision 10 and an utterance-level decision 20. The frame-level decision 10 receives a feature vector fv in frame units, created by converting an input signal, and decides whether each frame-unit feature vector fv is speech or non-speech. The utterance-level decision 20 then decides, from the result of the decision by the frame-level decision 10, whether a speech section has been detected.
The frame-level decision 10 includes a speech decision portion 11 and a hang-over portion 12. The speech decision portion 11 decides whether the inputted frame-unit feature vector fv is speech or non-speech. However, deciding a speech signal in frame units can introduce errors. The frame-level decision 10 therefore additionally implements the hang-over portion 12, which compensates for erroneous frame-unit decisions on the assumption that adjacent frames are highly correlated.
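The interplay of the speech decision portion 11 and the hang-over portion 12 can be illustrated with a minimal sketch. It is not the conventional apparatus itself: the per-frame `scores`, the fixed `threshold`, and the `hang_frames` hold count are hypothetical stand-ins, and the simple threshold replaces whatever statistics-based logic an actual frame-level decision would use. The hang-over shown here is one common variant, which keeps the speech label for a few frames after the last frame decided as speech.

```python
from typing import List, Sequence


def decide_frames(scores: Sequence[float], threshold: float = 0.5) -> List[bool]:
    """Raw per-frame decision (True = speech, False = non-speech).

    Assumption: each frame is summarized by one score; a real system
    would derive the decision from the feature vector fv.
    """
    return [score >= threshold for score in scores]


def apply_hangover(raw: Sequence[bool], hang_frames: int = 2) -> List[bool]:
    """Hold the speech label for `hang_frames` frames after the last raw
    speech frame, on the assumption that adjacent frames are correlated."""
    smoothed: List[bool] = []
    remaining = 0  # frames for which the speech label is still held
    for is_speech in raw:
        if is_speech:
            remaining = hang_frames
            smoothed.append(True)
        elif remaining > 0:
            remaining -= 1
            smoothed.append(True)
        else:
            smoothed.append(False)
    return smoothed


def frame_level_decision(scores: Sequence[float],
                         threshold: float = 0.5,
                         hang_frames: int = 2) -> List[bool]:
    """Speech decision followed by hang-over correction."""
    return apply_hangover(decide_frames(scores, threshold), hang_frames)
```

In this sketch, an isolated non-speech decision between two speech frames is relabeled as speech, at the cost of extending every speech run by up to `hang_frames` frames.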
The utterance-level decision 20 includes a state flow control portion 21 and a heuristic application portion 22. The state flow control portion 21 uses the result decided by the frame-level decision 10 to control, according to a preset rule, an internal flow for detecting an endpoint in utterance units. The heuristic application portion 22 then verifies whether the endpoint detected by the state flow control portion 21 is an actual speech endpoint. This verification is generally made by checking whether the length of the speech detected up to the endpoint satisfies a preset minimum length of speech (generally 20 ms).
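A rule-based utterance-level decision of this kind can be sketched as a small state machine followed by a minimum-length check. The sketch below is only an illustration under stated assumptions: the rule "an endpoint is reached after `end_silence_frames` consecutive non-speech frames" and both parameter names are hypothetical, and `min_speech_frames=2` corresponds to the 20 ms minimum only if each frame covers 10 ms, which the description does not state.

```python
from typing import Optional, Sequence, Tuple


def detect_endpoint(frames: Sequence[bool],
                    end_silence_frames: int = 3,
                    min_speech_frames: int = 2) -> Optional[Tuple[int, int]]:
    """Return (start, end) frame indices of the first verified speech
    section, with `end` exclusive, or None if no section is verified.

    `frames` holds per-frame decisions (True = speech).
    """
    start: Optional[int] = None  # None means "waiting for speech"
    silence_run = 0              # consecutive non-speech frames seen in speech state
    for i, is_speech in enumerate(frames):
        if start is None:
            # State flow: waiting for the first speech frame.
            if is_speech:
                start = i
                silence_run = 0
        elif is_speech:
            silence_run = 0
        else:
            silence_run += 1
            if silence_run >= end_silence_frames:
                # Candidate endpoint: first frame of the trailing silence.
                end = i - silence_run + 1
                # Heuristic check: reject sections shorter than the minimum.
                if end - start >= min_speech_frames:
                    return (start, end)
                start = None
                silence_run = 0
    return None
```

For example, `detect_endpoint([False, True, False, False, False, True, True, True, False, False, False, False])` rejects the one-frame burst at index 1 and returns `(5, 8)` for the later three-frame section.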
In the conventional apparatus 1 for detecting a speech endpoint of FIG. 1, the frame-level decision 10 uses a statistics-based decision logic, while the utterance-level decision 20 mainly uses a rule-based logic. Because the frame-level decision 10 and the utterance-level decision 20 thus use logics that are independent of each other, the two logics must be optimized individually even though both serve the related purpose of analyzing speech, and they often fail to deliver optimal overall performance despite their individual optimization. In other words, global optimization is frequently not achieved. Moreover, because the utterance-level decision 20 mostly uses the rule-based logic, conflicts can occur between rules as various rules are added, which greatly hinders the optimization of endpoint detection.