The present invention relates to speech recognition, and more specifically to facilitating the identification of spoken words and individual speakers in continuous speech via deconstruction of the sound envelope representing those spoken words.
Speech recognition enables users to interact with devices using spoken words. Many technologies today enable speech recognition; most current techniques predominantly analyze speech spectrograms.
In one approach, a window (e.g., a Hamming window) of 20 to 50 milliseconds is applied for cepstral feature extraction, and then the spectrum of the captured waveform is measured and compared against the spectrum samples in a library of sounds. The comparison computes a distance for each feature in the set, and the feature with the minimum distance is selected.
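The windowing-and-comparison step described above can be sketched as follows. This is a minimal illustration, not the claimed method: the 16 kHz sample rate, 25 ms frame length, Euclidean distance metric, and the template labels are all assumptions chosen for the example, and a real system would use a trained library of spectral (or cepstral) templates.

```python
import numpy as np

FRAME_MS = 25          # frame length within the 20-50 ms range described above
SAMPLE_RATE = 16000    # assumed sample rate (not specified in the text)

def frame_spectrum(frame):
    """Apply a Hamming window and return the magnitude spectrum of one frame."""
    windowed = frame * np.hamming(len(frame))
    return np.abs(np.fft.rfft(windowed))

def classify_frame(frame, library):
    """Compare the frame's spectrum against each labeled template in the
    library and return the label with the minimum distance (Euclidean here)."""
    spectrum = frame_spectrum(frame)
    distances = {label: np.linalg.norm(spectrum - template)
                 for label, template in library.items()}
    return min(distances, key=distances.get)

# Hypothetical library of spectral templates, one per sound class.
n = int(SAMPLE_RATE * FRAME_MS / 1000)
rng = np.random.default_rng(0)
tone = np.sin(2 * np.pi * 440 * np.arange(n) / SAMPLE_RATE)
library = {
    "tone_440": frame_spectrum(tone),
    "noise": frame_spectrum(rng.standard_normal(n)),
}
print(classify_frame(tone, library))
```

In practice the comparison is performed on cepstral coefficients (e.g., MFCCs) rather than raw magnitude spectra, but the minimum-distance selection is the same.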
Additionally, the currently known solutions require training of the tool by the speakers to supplement the pre-training from the corpus. Several HMMs (Hidden Markov Models) are set up to help identify the words represented by the sounds. Sometimes, statistical language models, semantic interpretation, and acoustic models, such as phoneme-based models, are also used to help identify the spoken word. Alternatively, some models compare the spoken word against a very large corpus of words.
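The HMM-based identification mentioned above can be illustrated with a minimal Viterbi decoder, which recovers the most likely hidden phoneme sequence for a series of observed acoustic symbols. The phoneme labels, observation symbols, and all probabilities below are purely illustrative (not trained), and real systems decode over continuous acoustic features rather than discrete symbols.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state path for an observation sequence."""
    # Initialize with the first observation.
    V = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    path = {s: [s] for s in states}
    # Extend the best path to each state, one observation at a time.
    for o in obs[1:]:
        V.append({})
        new_path = {}
        for s in states:
            prob, prev = max((V[-2][p] * trans_p[p][s] * emit_p[s][o], p)
                             for p in states)
            V[-1][s] = prob
            new_path[s] = path[prev] + [s]
        path = new_path
    best = max(V[-1], key=V[-1].get)
    return path[best]

# Toy phoneme-level HMM for the word "cat"; probabilities are illustrative.
states = ["k", "ae", "t"]
start_p = {"k": 0.8, "ae": 0.1, "t": 0.1}
trans_p = {"k":  {"k": 0.3, "ae": 0.6, "t": 0.1},
           "ae": {"k": 0.1, "ae": 0.3, "t": 0.6},
           "t":  {"k": 0.1, "ae": 0.1, "t": 0.8}}
emit_p = {"k":  {"A": 0.7, "B": 0.2, "C": 0.1},
          "ae": {"A": 0.2, "B": 0.7, "C": 0.1},
          "t":  {"A": 0.1, "B": 0.2, "C": 0.7}}

print(viterbi(["A", "B", "C"], states, start_p, trans_p, emit_p))  # ['k', 'ae', 't']
```

In a full recognizer, such phoneme-level HMMs are concatenated into word models and combined with the statistical language model to score competing word hypotheses.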