1. Field of the Invention
This invention relates generally to electronic speech recognition systems, and relates more particularly to a system and method for speech verification using out-of-vocabulary models.
2. Description of the Background Art
Implementing an effective and efficient method for system users to interface with electronic devices is a significant consideration of system designers and manufacturers. Voice-controlled operation of electronic devices is a desirable interface for many system users. For example, voice-controlled operation allows a user to perform other tasks simultaneously, such as operating a vehicle while operating an electronic organizer by voice control. Hands-free operation of electronic systems may also be desirable for users who have physical limitations or other special requirements.
Hands-free operation of electronic devices may be implemented by various speech-activated electronic systems. Speech-activated electronic systems thus advantageously allow users to interface with electronic devices in situations where it would be inconvenient or potentially hazardous to utilize a traditional input device. Electronic entertainment systems may also utilize speech recognition techniques to allow users to interact with a system by speaking to it.
Speech-activated electronic systems may be used in a variety of noisy environments such as industrial facilities, manufacturing facilities, commercial vehicles, passenger vehicles, homes, and office environments. A significant amount of noise in an environment may interfere with and degrade the performance and effectiveness of speech-activated systems. System designers and manufacturers typically seek to develop speech-activated systems that provide reliable performance in noisy environments.
In a noisy environment, sound energy detected by a speech-activated system may contain speech and a significant amount of noise or other non-typical sounds. In such an environment, the speech may be masked by the noise and go undetected. This result is unacceptable for reliable performance of the speech-activated system.
Alternatively, sound energy detected by the speech-activated system may contain only noise. The noise may be of such a character that the speech-activated system identifies the noise as speech. This result reduces the effectiveness of the speech-activated system, and is also unacceptable for reliable performance. Verifying that a detected signal is actually speech increases the effectiveness and reliability of speech-activated systems.
A speech-activated system may have a limited vocabulary of words that the system is programmed to recognize. The system should respond to words or phrases that are in its vocabulary, and should not respond to words or phrases that are not in its vocabulary. Verifying that a recognized word is in the system's vocabulary increases the accuracy and reliability of speech-activated systems.
Therefore, for all the foregoing reasons, implementing an effective and efficient method for a system user to interface with electronic devices remains a significant consideration of system designers and manufacturers.
In accordance with the present invention, a system and method are disclosed for speech verification using out-of-vocabulary models. In one embodiment of the present invention, out-of-vocabulary models are created for use in a speech verification procedure by a speech recognition system. Initially, noise types to be modeled for use in the speech verification procedure are selected and a noise database is created. The foregoing noise types may be selected according to the intended operating environment of the speech recognition system. The noise types will typically include various human noises and other noise types that are likely to be encountered during use of the speech recognition system.
Next, an initial noise model for each type of noise is trained using the noise database. In certain embodiments, each initial noise model is preferably a Hidden Markov Model that is trained to recognize one of the different types of noise. A set of test noises is then preferably input to all of the initial noise models, and the initial noise models generate recognition scores for each test noise. Then, the recognition scores are preferably normalized by dividing the recognition scores by the duration of the corresponding test noise. The recognition scores may be normalized because a noise of short duration usually produces a higher recognition score than a noise of long duration for an arbitrary noise model.
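By way of illustration only, the duration normalization described above may be sketched as follows. The function name `normalize_scores`, the row-per-test-noise layout, and the sample scores and durations are assumptions made for this example and are not taken from the disclosure.

```python
def normalize_scores(scores, durations):
    """Divide each recognition score by the duration of its test noise.

    scores[t][m] is the raw score of noise model m on test noise t
    (hypothetical layout); durations[t] is the duration of test noise t.
    """
    if len(scores) != len(durations):
        raise ValueError("one duration is required per test noise")
    normalized = []
    for row, duration in zip(scores, durations):
        if duration <= 0:
            raise ValueError("durations must be positive")
        normalized.append([score / duration for score in row])
    return normalized


# Illustrative values only: two test noises scored by two noise models.
raw_scores = [[-120.0, -150.0], [-300.0, -240.0]]
durations = [2.0, 4.0]
normalized = normalize_scores(raw_scores, durations)
print(normalized)  # [[-60.0, -75.0], [-75.0, -60.0]]
```

After normalization, scores for test noises of different lengths may be compared on a per-unit-duration basis.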
Differential scores between each pair of initial noise models may then be calculated for each test noise. Each test noise will produce a separate recognition score for each of the initial noise models. The mutual differences between all of these recognition scores may then be calculated, and an average differential score between each pair of initial noise models may then be determined.
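One possible reading of this averaging step is sketched below. The disclosure does not state whether signed or absolute differences are used; this example assumes the absolute difference between the normalized scores of two models, averaged over all test noises. The function name and the sample values are hypothetical.

```python
def average_differential_scores(normalized):
    """Average, over all test noises, the absolute score difference
    between every pair of noise models.

    normalized[t][m] is the duration-normalized score of model m on
    test noise t. Using the absolute difference is an assumption.
    """
    n_tests = len(normalized)
    n_models = len(normalized[0])
    average = [[0.0] * n_models for _ in range(n_models)]
    for i in range(n_models):
        for j in range(n_models):
            total = sum(abs(row[i] - row[j]) for row in normalized)
            average[i][j] = total / n_tests
    return average


# Illustrative values only: two test noises, three noise models.
normalized = [[-60.0, -75.0, -61.0], [-75.0, -60.0, -73.0]]
average = average_differential_scores(normalized)
print(average[0][1], average[0][2], average[1][2])  # 15.0 1.5 13.5
```

Under this assumption the result is symmetric with a zero diagonal, so it may serve directly as a distance matrix.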
Next, a distance matrix may be created to include the average differential scores between each initial noise model. Then, a minimum non-zero distance for the distance matrix may preferably be determined. The two initial noise models in the distance matrix that have a minimum distance typically are acoustically similar, and therefore may be grouped together as a noise cluster.
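The search for the minimum non-zero distance may be sketched as follows, assuming a symmetric distance matrix stored as a list of lists. The function name `closest_pair` and the sample matrix are illustrative assumptions.

```python
def closest_pair(distance):
    """Return (d, i, j) for the smallest non-zero entry above the diagonal."""
    best = None
    for i in range(len(distance)):
        for j in range(i + 1, len(distance)):
            d = distance[i][j]
            if d > 0 and (best is None or d < best[0]):
                best = (d, i, j)
    if best is None:
        raise ValueError("no non-zero distance found")
    return best


# Illustrative distance matrix for three initial noise models.
distance = [
    [0.0, 15.0, 1.5],
    [15.0, 0.0, 13.5],
    [1.5, 13.5, 0.0],
]
print(closest_pair(distance))  # (1.5, 0, 2)
```

In this example, models 0 and 2 would be grouped together as the first noise cluster.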
A new distance matrix may then be created to incorporate the distances between the new noise cluster and the remaining initial noise models. Distances between the new noise cluster and the remaining initial noise models may then be calculated by averaging the mutual distances between the noise models in the new noise cluster and every remaining initial noise model.
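The averaging of mutual distances described above may be sketched as follows. Clusters are represented here as lists of model indices into the original distance matrix; that representation, the function names, and the sample values are assumptions made for illustration.

```python
def cluster_distance(base, cluster_a, cluster_b):
    """Average the pairwise distances between members of two clusters."""
    total = sum(base[i][j] for i in cluster_a for j in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


def new_distance_matrix(base, clusters):
    """Build the distance matrix over the current clusters."""
    n = len(clusters)
    return [
        [0.0 if r == c else cluster_distance(base, clusters[r], clusters[c])
         for c in range(n)]
        for r in range(n)
    ]


# Illustrative values: models 0 and 2 were grouped; model 1 remains alone.
base = [
    [0.0, 15.0, 1.5],
    [15.0, 0.0, 13.5],
    [1.5, 13.5, 0.0],
]
clusters = [[0, 2], [1]]
updated = new_distance_matrix(base, clusters)
print(updated)  # [[0.0, 14.25], [14.25, 0.0]]
```

The distance of 14.25 between the new cluster and model 1 is the mean of 15.0 and 13.5.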
Then, a determination may be made as to whether the final number of noise clusters has been reached. The final number of noise clusters may preferably be chosen by the designer or manufacturer of the speech recognition system, and is typically a trade-off between accuracy and computational cost. In accordance with the present invention, the initial noise models continue to be grouped into new noise clusters until the final number of noise clusters is reached. When the pre-determined final number of noise clusters has been reached, then a final noise model is trained for each of the final noise clusters for use by the speech recognition system to perform a speech verification procedure. The present invention thus efficiently and effectively performs speech verification using out-of-vocabulary models.
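The overall grouping procedure may be sketched as a loop that repeatedly merges the two closest clusters until the chosen final number remains. This is a minimal sketch under stated assumptions: distances between clusters are the average of member-to-member distances, only distinct clusters are compared (so the zero diagonal is never selected), and the training of the final noise models is not shown. The function name and sample matrix are hypothetical.

```python
def cluster_noise_models(base, final_count):
    """Group noise models until final_count clusters remain.

    base is a symmetric matrix of average differential scores between
    the initial noise models; each cluster is a list of model indices.
    """
    if not 1 <= final_count <= len(base):
        raise ValueError("final_count must be between 1 and the model count")
    clusters = [[i] for i in range(len(base))]
    while len(clusters) > final_count:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                total = sum(base[i][j] for i in clusters[a] for j in clusters[b])
                d = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        merged = clusters[a] + clusters[b]
        # Replace the two closest clusters with the merged cluster.
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return clusters


# Illustrative distances between four initial noise models.
base = [
    [0.0, 1.5, 12.0, 14.0],
    [1.5, 0.0, 13.0, 15.0],
    [12.0, 13.0, 0.0, 2.0],
    [14.0, 15.0, 2.0, 0.0],
]
final_clusters = cluster_noise_models(base, 2)
print(sorted(sorted(c) for c in final_clusters))  # [[0, 1], [2, 3]]
```

A smaller final number of clusters lowers the computational cost of verification at the expense of coarser noise models, which reflects the trade-off noted above.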