As the demand for smaller, more portable electronic devices grows, consumers want additional features that enhance and expand the use of portable electronic devices. These electronic devices include compact disc players, two-way radios, cellular telephones, computers, personal organizers, speech recorders, and similar devices. In particular, consumers want to input information and control the electronic device using voice communication alone. It is understood that voice communication includes speech, acoustic, and other non-contact communication. With voice input and control, a user may operate the electronic device without touching the device and may input information and control commands at a faster rate than a keypad. Moreover, voice-input-and-control devices eliminate the need for a keypad and other direct-contact input, thus permitting even smaller electronic devices.
Voice-input-and-control devices require proper operation of the underlying speech recognition technology. Basically, speech recognition technology analyzes a speech waveform within a speech data acquisition window for matching the waveform to word models stored in memory. If a match is found between the speech waveform and a word model, the speech recognition technology provides a signal to the electronic device identifying the speech waveform as the word associated with the word model.
A word model is created generally by storing parameters derived from the speech waveform of a particular word in memory. In speaker independent speech recognition devices, parameters of speech waveforms of a word spoken by a sample population of expected users are averaged in some manner to create a word model for that word. By averaging speech parameters for the same word spoken by different people, the word model should be usable by most if not all people.
In speaker dependent speech recognition devices, the user trains the device by speaking the particular word when prompted by the device. The speech recognition technology then creates a word model based on the input from the user. The speech recognition technology may prompt the user to repeat the word any number of times and then average the speech waveform parameters in some manner to create the word model.
To properly operate speech recognition technology, it is important to consistently identify the start and end endpoints of the speech utterances. Inconsistently identified endpoints may truncate words and may include extraneous noises within the speech waveform acquired by the speech recognition technology. Truncated words and/or noises may result in poorly trained models and cause the speech recognition technology not to work properly when the acquired speech waveform does not match any word model. In addition, truncated words and noises may cause the speech recognition technology to misidentify the acquired speech waveform as another word. In speaker dependent speech recognition devices, problems due to poor endpointing are aggravated when the speech recognition technology permits only a few training utterances.
The prior art describe techniques using threshold energy comparisons, zero crossings analysis, and cross correlation. These methods sequentially analyze speech features from left to right, right to left, or center outwards of the speech waveform. In these techniques, utterances containing pauses or gaps are problematic. Typically, pauses or gaps in an utterance are caused by the nature of the word, the speaking style of the user, and by utterances containing multiple words. Some techniques truncate the word or phrase at the gap, assuming erroneously that the endpoint has been reached. Other techniques use a maximum gap size criteria to combine detected parts of utterances with pauses into a single utterance. In such techniques, a pause longer than a predetermined threshold can cause parts of the utterance to be excluded.
Accordingly, there is a need to consistently identify the start and end endpoints of a complete speech utterance within a speech acquisition window. There also is a need to ensure words or parts of words separated by pauses or gaps in the utterance are completely included within the utterance boundaries.