A general voice recognition system extracts a time series of a feature value from time-series data of an input sound collected by a microphone, and calculates the likelihood of the time series of the feature value using word and phoneme models that are the recognition object and a model of non-voice other than the recognition object. The voice recognition system then searches for a word sequence corresponding to the time series of the input sound based on the calculated likelihood, and outputs a recognition result. Several methods have been proposed to improve the accuracy of voice recognition.
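The general flow described above can be illustrated with a minimal sketch. It is not taken from any of the cited documents: the log-energy feature, the single-Gaussian models given as `(mean, variance)` pairs, and the names `frame_signal`, `log_energy`, `gaussian_log_likelihood` and `label_frames` are all assumptions chosen for brevity. A per-frame comparison of the two likelihoods stands in for the word-sequence search, which an actual system would perform over word and phoneme models.

```python
import math


def frame_signal(samples, frame_len, hop):
    """Clip the time-series data into frames of frame_len samples, hop samples apart."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]


def log_energy(frame):
    """Illustrative feature value: log of the mean squared amplitude of one frame."""
    energy = sum(x * x for x in frame) / len(frame)
    return math.log(energy + 1e-10)  # small constant avoids log(0)


def gaussian_log_likelihood(x, mean, var):
    """Log-likelihood of a scalar feature under a one-dimensional Gaussian model."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def label_frames(samples, voice_model, non_voice_model, frame_len=4, hop=4):
    """Label each frame by whichever model gives the higher likelihood.

    Assumption: each model is a (mean, variance) pair over the log-energy feature.
    """
    labels = []
    for frame in frame_signal(samples, frame_len, hop):
        feat = log_energy(frame)
        ll_voice = gaussian_log_likelihood(feat, *voice_model)
        ll_non_voice = gaussian_log_likelihood(feat, *non_voice_model)
        labels.append("voice" if ll_voice > ll_non_voice else "non-voice")
    return labels


# Hypothetical input: one quiet frame followed by one loud frame.
samples = [0.001] * 4 + [0.5] * 4
voice_model = (math.log(0.25), 1.0)
non_voice_model = (math.log(1e-6), 1.0)
print(label_frames(samples, voice_model, non_voice_model))
```

In this sketch the quiet frame scores higher under the non-voice model and the loud frame scores higher under the voice model.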
Patent document 1 describes a voice recognition apparatus which reduces the deterioration of voice recognition performance caused by a silent part. FIG. 9 is an explanatory drawing showing the voice recognition apparatus disclosed in patent document 1. The voice recognition apparatus disclosed in patent document 1 includes a microphone 201 which collects an input sound, a framing unit 202 which clips the time-series data of the collected sound in predetermined time units, a noise observation section extraction unit 203 which extracts a noise section, an utterance switch 204 with which a user notifies the system of the start of an utterance, a feature value extracting unit 205 which extracts the feature value for each piece of voice data that has been clipped, a voice recognition unit 208 which performs voice recognition on the time series of the feature value, and a silent model correction unit 207 which corrects the silent model among the acoustic models used by the voice recognition unit 208.
In the voice recognition apparatus disclosed in patent document 1, the noise observation section extraction unit 203 estimates background noise from the section just before the utterance switch 204 is pushed, and the silent model correction unit 207 adapts the silent model to the background noise environment based on the estimated background noise. With this structure, sounds other than the target voice are more easily discriminated as silence, and the voice recognition apparatus thereby reduces false recognition of voice.
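The adaptation of the silent model to the background noise can be sketched as follows. This is only an illustration under stated assumptions and is not the correction method actually used in patent document 1: the silent model is assumed to be a single Gaussian given as a `(mean, variance)` pair over a scalar feature value, the adaptation is assumed to be a linear interpolation controlled by a hypothetical `weight` parameter, and the function names are invented for this example.

```python
import math


def gaussian_log_likelihood(x, mean, var):
    """Log-likelihood of a scalar feature under a one-dimensional Gaussian model."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def estimate_noise_model(noise_features, var_floor=1e-3):
    """Estimate mean and variance of the feature values observed in the noise section.

    Assumption: noise_features holds the feature values of the frames taken
    just before the utterance switch was pushed.
    """
    n = len(noise_features)
    mean = sum(noise_features) / n
    var = sum((f - mean) ** 2 for f in noise_features) / n
    return mean, max(var, var_floor)  # floor keeps the variance strictly positive


def adapt_silent_model(silent_model, noise_features, weight=1.0):
    """Move the silent model toward the estimated background noise.

    weight=1.0 replaces the silent model with the noise estimate;
    weight=0.0 leaves the silent model unchanged (hypothetical parameter).
    """
    noise_mean, noise_var = estimate_noise_model(noise_features)
    mean = (1.0 - weight) * silent_model[0] + weight * noise_mean
    var = (1.0 - weight) * silent_model[1] + weight * noise_var
    return mean, var


# Hypothetical values: a silent model trained in a quiet room, used in a noisy one.
original = (-10.0, 1.0)
noise_section = [-4.0, -6.0]
adapted = adapt_silent_model(original, noise_section)

noisy_frame = -5.0
print(gaussian_log_likelihood(noisy_frame, *original))
print(gaussian_log_likelihood(noisy_frame, *adapted))
```

After adaptation, a frame containing only background noise receives a higher likelihood under the silent model than before, which is what makes such a frame easier to discriminate as silence.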
Patent document 2 describes a voice recognition apparatus which reduces the false-recognition rate for a voice section to which background noise other than the data used at the time of garbage model learning has been added. FIG. 10 is an explanatory drawing showing the voice recognition apparatus disclosed in patent document 2. The voice recognition apparatus described in patent document 2 includes analysis means 302 for analyzing a time series of a feature value from time-series data of a collected sound, offset calculating means 303 for calculating a correction amount based on the feature value, collating means 304 for collating a recognition object word sequence against the time series of the feature value, a garbage model 305 made by modeling a sound pattern corresponding to background noise, and a recognition object vocabulary model 306.
In the voice recognition apparatus disclosed in patent document 2, the offset calculating means 303 judges the possibility of the input being voice from feature values such as a pitch frequency, a formant frequency and a bandwidth. The offset calculating means 303 then obtains, based on the judgment result, an offset for correcting the likelihood of the garbage model. The collating means 304 performs pattern matching using the feature value, the garbage model, the recognition object vocabulary model, and the likelihood of the garbage model corrected with the above-mentioned offset. With this structure, the voice recognition apparatus can correctly recognize only the voice that is the recognition object.
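The offset correction of the garbage model likelihood can be sketched as follows. This is an illustration under stated assumptions, not the calculation used in patent document 2: the judgment result is assumed to be summarized as a single `voice_likeness` score between 0 and 1, the offset is assumed to be linear in that score and bounded by a hypothetical `max_offset`, and the direction of the correction (lowering the garbage likelihood for voice-like frames and raising it otherwise) is likewise an assumption.

```python
def garbage_offset(voice_likeness, max_offset=5.0):
    """Map a voice-likeness score in [0, 1] to an offset on the garbage log-likelihood.

    Assumption: a score near 1 (voice-like) yields a negative offset, and a
    score near 0 (not voice-like) yields a positive offset.
    """
    if not 0.0 <= voice_likeness <= 1.0:
        raise ValueError("voice_likeness must be in [0, 1]")
    return max_offset * (1.0 - 2.0 * voice_likeness)


def collate_frame(vocab_ll, garbage_ll, voice_likeness, max_offset=5.0):
    """Compare the vocabulary likelihood with the offset-corrected garbage likelihood.

    Returns the winning label and the corrected garbage log-likelihood.
    """
    corrected = garbage_ll + garbage_offset(voice_likeness, max_offset)
    label = "vocabulary" if vocab_ll > corrected else "garbage"
    return label, corrected


# Hypothetical log-likelihoods for two frames.
# Voice-like frame: without the offset the garbage model would win (-9 > -10).
print(collate_frame(vocab_ll=-10.0, garbage_ll=-9.0, voice_likeness=0.9))
# Noise-like frame: without the offset the vocabulary model would win (-10 > -12).
print(collate_frame(vocab_ll=-10.0, garbage_ll=-12.0, voice_likeness=0.1))
```

In both hypothetical frames the offset reverses the decision that the uncorrected likelihoods would have produced, which illustrates how a correction of this kind can compensate for background noise that was absent from the garbage model learning data.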
Further, non-patent document 1 describes a method for recognizing voice from voice data, and a model used for voice recognition.