1. Field of the Invention
The present invention generally relates to speech and speaker recognition systems and, more particularly, to speech recognition systems supplemented by a speaker recognition system and including signal processing model substitution for use by a potentially large number of speakers.
2. Description of the Prior Art
Many electronic devices require input from a user in order to convey to the device particular information required to determine or perform a desired function or, in a trivially simple case, when a desired function is to be performed as would be indicated by, for example, activation of an on/off switch. When multiple different inputs are possible, a keyboard comprising an array of two or more switches has been the input device of choice in recent years.
However, keyboards of any type have inherent disadvantages. Most evidently, keyboards include a plurality of distributed actuable areas, each generally including moving parts subject to wear and damage and which must be sized to be actuated by a portion of the body unless a stylus or other separate mechanical expedient is employed. Accordingly, in many types of devices, such as input panels for security systems and electronic calculators, the size of the device is often determined by the dimensions of the keypad rather than the electronic contents of the housing. Additionally, numerous keystrokes may be required (e.g. to specify an operation, enter a security code, personal identification number (PIN), etc.) which slows operation and increases the possibility that erroneous actuation may occur. Therefore, use of a keyboard or other manually manipulated input structure requires action which is not optimally natural or expeditious for the user.
In an effort to provide a more naturally usable, convenient and rapid interface and to increase the capabilities thereof, numerous approaches to voice or sound detection and recognition systems have been proposed and implemented with some degree of success. However, variations in acoustic signals, even from a single speaker, which may represent a command, present substantial signal processing difficulties and present the possibility of errors or ambiguity of command understanding by the system which may only be partially avoided by substantial increase of processing complexity and increase of response time.
For example, a simple voice actuated system which relies on template matching of the acoustical content of an utterance theoretically requires a particular word or phrase to be input for each command which can be used by each enrolled (e.g. authorized) user. Therefore, even a moderate number of recognizable commands for each of a moderate number of users can require comparison with a very large number of templates while not guaranteeing successful or accurate voice recognition due to variation of the acoustical signal each time a command may be uttered. Conversely, a speaker independent system would only require enrollment of commands to be recognized and a correspondingly reduced number of template comparisons but accuracy of command recognition or understanding by the system would be severely compromised by additional variation of acoustical signals from speaker to speaker. In continuous speech recognition using Hidden Markov Models (HMM), the speech is usually characterized by a large number of lefemes (portions of phones in a given left and right context). The recognition of an utterance involves aligning the utterance against different hypotheses and computing the likelihood of each hypothesis. In this context, the stochastic distribution of acoustic feature vectors corresponding to each lefeme must be properly modeled. What was said for template matching remains applicable to each lefeme: speaker dependent systems present less variability than speaker independent systems which must account, at the lefeme level, for not only intra-speaker variations but inter-speaker variations as well.
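By way of illustration only, the template-count argument above can be sketched as follows. This is a minimal sketch and not part of any described system: the two-dimensional feature vectors, the Euclidean distance measure, and the names `nearest_template`, `speaker_dependent` and `speaker_independent` are all hypothetical assumptions chosen to show how the number of comparisons scales with the number of enrolled users.

```python
import math

def nearest_template(features, templates):
    """Return (best_key, comparisons) for the stored template closest to features."""
    best_key, best_dist, comparisons = None, math.inf, 0
    for key, template in templates.items():
        comparisons += 1  # one comparison per stored template
        dist = math.dist(features, template)  # Euclidean distance (assumed measure)
        if dist < best_dist:
            best_key, best_dist = key, dist
    return best_key, comparisons

num_users, num_commands = 3, 4  # hypothetical enrollment sizes

# Speaker dependent: one template per (user, command) pair.
# The toy vectors place each command near (10c, 10c), shifted per user.
speaker_dependent = {
    (u, c): [10.0 * c + 0.5 * u, 10.0 * c - 0.5 * u]
    for u in range(num_users)
    for c in range(num_commands)
}

# Speaker independent: one template per command only.
speaker_independent = {c: [10.0 * c, 10.0 * c] for c in range(num_commands)}

utterance = [20.4, 19.6]  # toy utterance: user 1 speaking command 2

sd_match, sd_comparisons = nearest_template(utterance, speaker_dependent)
si_match, si_comparisons = nearest_template(utterance, speaker_independent)
```

In this toy setting the speaker dependent store requires one comparison per user per command, whereas the speaker independent store requires one comparison per command; the sketch does not model the loss of accuracy that inter-speaker variation would cause in the latter case.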
Accordingly, it can be understood that while improved performance can be expected from a speaker dependent system, such improved performance has heretofore only been achieved at the expense of greater processing complexity and memory requirements when the speaker population becomes large. In other words, while it is possible to build a speaker dependent model for a small number of users, such a process cannot, as a practical matter, be repeated indefinitely as the speaker population size increases.
That is, as an incident of modelling a lefeme of a speaker, the signal components which represent corresponding acoustic components of the lefeme will have some statistical distribution of parameters (e.g. spectral content, cepstral vector, etc., each having a mean, variance and a weight) even for a single speaker and a single utterance. In general, the intra-speaker variability of the acoustic vectors associated with the same lefeme is smaller than the inter-speaker variability.
Accordingly, the probability distribution of the likelihood that a particular parameter will, in combination with other parameters, correspond to a particular lefeme must be modelled as a sum of overlapping distributions (e.g. Gaussian distributions), referred to as mixture components, which take into account intra-speaker and inter-speaker signal variations in the system.
As will be readily understood, summing non-coincident distributions tends to widen the resulting distribution (e.g. increase its variance). Thus, as a growing number of speakers increases the number of mixture components of a signal, the definition of states of the parameters corresponding to a speaker and/or an utterance degrades.
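The widening effect described above can be illustrated numerically. The following is a minimal sketch under stated assumptions: a single scalar parameter, equal mixture weights, an identical variance for every speaker, and hypothetical per-speaker means; none of these values is taken from any actual system. It uses the standard identity that the variance of a mixture equals the weighted sum of each component's variance plus squared mean, minus the square of the overall mean.

```python
def mixture_variance(weights, means, variances):
    """Variance of a one-dimensional Gaussian mixture (law of total variance)."""
    total = sum(weights)
    w = [x / total for x in weights]  # normalize weights to sum to 1
    mean = sum(wi * mi for wi, mi in zip(w, means))
    second_moment = sum(
        wi * (vi + mi * mi) for wi, mi, vi in zip(w, means, variances)
    )
    return second_moment - mean * mean

# Hypothetical values: each speaker's distribution for one parameter of one lefeme.
speaker_means = [0.0, 1.5, -2.0, 3.0]  # non-coincident means (assumed)
speaker_variance = 1.0                 # same intra-speaker variance for all (assumed)

# Variance of the pooled model as speakers are added one at a time.
pooled = []
for n in range(1, len(speaker_means) + 1):
    pooled.append(
        mixture_variance([1.0] * n, speaker_means[:n], [speaker_variance] * n)
    )
```

With equal component variances, the pooled variance is the intra-speaker variance plus the variance of the speaker means, so it can never fall below the single-speaker value; for the hypothetical means chosen here it grows with each added speaker.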
In systems which rely on matching of templates or any other type of representation of acoustic signals, the increasing variances thus developed tend to cause ambiguity in the matching process. This problem may be compounded in speech recognition systems which are supplemented by speaker recognition and use of speaker dependent signal processing models, since lack of success in speaker recognition may prevent the application of any speaker dependent model, while an incorrect model may be used if an incorrect speaker is identified. Thus, there is a need to reconcile the respective disadvantages of speaker-dependent and speaker-independent speech recognition systems, which have heretofore been considered mutually exclusive.