A speech recognition system receives an audio stream and filters the audio stream to extract and isolate sound segments that make up speech. Speech recognition technologies allow computers and other electronic devices equipped with a source of sound input, such as a microphone, to interpret human speech, e.g., for transcription or as an alternative method of interacting with a computer. Speech recognition software is being developed for use in consumer electronic devices such as mobile telephones, game platforms, personal computers and personal digital assistants. In a typical speech recognition algorithm, a time domain signal representing human speech is broken into a number of time windows and each window is converted to a frequency domain signal, e.g., by fast Fourier transform (FFT). This frequency or spectral domain signal is then compressed by taking a logarithm of the spectral domain signal and then performing another FFT. From the compressed signal, a statistical model can be used to determine phonemes and context within the speech represented by the signal. The extracted phonemes and context may be compared to stored entries in a database to determine the word or words that have been spoken.
In the field of computer speech recognition a speech recognition system receives an audio stream and filters the audio stream to extract and isolate sound segments that make up speech. The sound segments are sometimes referred to as phonemes. The speech recognition engine then analyzes the phonemes by comparing them to a defined pronunciation dictionary, grammar recognition network and an acoustic model.
Speech recognition systems are usually equipped with a way to compose words and sentences from more fundamental units. For example, in a speech recognition system based on phoneme models, pronunciation dictionaries can be used as look-up tables to build words from their phonetic transcriptions. A grammar recognition network can then interconnect the words.
A data structure that relates words in a given language represented, e.g., in some graphical form (e.g., letters or symbols) to particular combinations of phonemes is generally referred to as a Grammar and Dictionary (GnD). An example of a Grammar and Dictionary is described, e.g., in U.S. Patent Application publication number 20060277032 to Gustavo Hernandez-Abrego and Ruxin Chen entitled Structure for Grammar and Dictionary Representation in Voice Recognition and Method For Simplifying Link and Node-Generated Grammars, the entire contents of which are incorporated herein by reference.
Certain applications utilize computer speech recognition to implement voice activated commands. One example of a category of such applications is computer video games. Speech recognition is sometimes used in video games, e.g., to allow a user to select or issue a command or to select an option from a menu by speaking predetermined words or phrases.
Video game devices and other applications that use speech recognition are often used in noisy environments that may include sources of speech other than the person playing the game or using the application. In such situations, stray speech from persons other than the user may inadvertently trigger a command or menu selection.
Some prior art applications that use speech recognition, e.g., for voice activated commands, also use two microphones. Prior art solutions have either performed voice detection on only one microphone signal. Unfortunately voice volume is very unreliable for source distance estimation because the real voice volume of the source is unknown. Furthermore, determining whether a voice signal in a noisy game environment corresponds to an intended voice or an unwanted voice is particularly challenging for a single source.
Other prior art systems perform signal arrival direction estimation using an array of sound signals from an array of microphones. Unfortunately, prior art systems based on arrays of microphones generally utilize far-field microphones that are not used for close talk. Consequently, signals from such microphones are sub-optimal for speech recognition.
It is within this context that embodiments of the current invention arise.
Common reference numerals are used to refer to common features of the drawings.