Speech recognition systems are increasingly used in applications such as telephone services, where a caller orally commands the telephone to call a particular destination. In these systems, a telephone customer may enroll words corresponding to particular telephone numbers and destinations. Subsequently, the customer may pronounce the enrolled words, and the corresponding telephone numbers are automatically dialed. In a typical enrollment, an input utterance is segmented, word boundaries are identified, and the identified words are enrolled to create word models, which may later be compared against subsequent input utterances. In subsequent speech recognition, the input utterance is compared against the enrolled words. Under a speaker-dependent approach, the input utterance is compared against words enrolled by the same speaker. Under a speaker-independent approach, the input utterance is compared against words enrolled so as to correspond with any speaker.
Many prior art systems falsely incorporate noise as part of a word. Another major problem in speech enrollment and recognition systems is the false classification of a word portion as noise. Typical enrollment and speech recognition approaches rely upon frame energy as the primary means of identifying word boundaries and of segmenting an input utterance into words. However, the frame energy approach frequently excludes low-energy portions of a word. Hence, words are inaccurately delineated, and subsequent recognition suffers. Moreover, in frame-energy-based systems, all words must typically be enunciated in isolation, which is undesirable if several words or phrases must be enrolled or recognized. Even if frame energy is not used to segment words in the subsequent speech recognition process, the accuracy of speech recognition will depend upon the accuracy of the prior speech enrollment, which typically does rely upon frame energy.
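The shortcoming of frame-energy segmentation described above can be illustrated with a minimal sketch. It is not taken from any particular prior art system: the function names `frame_energies` and `segment_by_energy`, the frame length, the threshold, and the synthetic sample values are all assumptions chosen for illustration. The sketch computes the mean squared amplitude of each frame and treats any run of frames above a fixed threshold as a word.

```python
from typing import List, Tuple


def frame_energies(samples: List[float], frame_len: int) -> List[float]:
    """Mean squared amplitude of each non-overlapping frame (hypothetical helper)."""
    return [
        sum(s * s for s in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def segment_by_energy(energies: List[float], threshold: float) -> List[Tuple[int, int]]:
    """Return (start, end) frame ranges, end exclusive, whose energy exceeds the threshold."""
    segments: List[Tuple[int, int]] = []
    start = None
    for i, energy in enumerate(energies):
        if energy > threshold and start is None:
            start = i                      # a word appears to begin here
        elif energy <= threshold and start is not None:
            segments.append((start, i))    # the word appears to end here
            start = None
    if start is not None:                  # utterance ended while still above threshold
        segments.append((start, len(energies)))
    return segments


# Synthetic utterance (assumed values): one silent frame, one frame of a
# low-energy word onset, two frames of a loud vowel, one silent frame.
frame_len = 4
samples = [0.0] * 4 + [0.05] * 4 + [0.8] * 8 + [0.0] * 4

energies = frame_energies(samples, frame_len)
segments = segment_by_energy(energies, threshold=0.01)
```

In this sketch the word actually occupies frames 1 through 3, but the low-energy onset in frame 1 falls below the threshold, so the detected segment begins at frame 2 and the onset is treated as noise. Lowering the threshold would recover the onset but would also make it more likely that background noise is incorporated into the word, which is the trade-off described above.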
Therefore, a need has arisen for an accurate method and apparatus for identifying a speech pattern.