The present embodiments relate to speech recognition, and are more particularly directed to a system for permitting access to a common resource in response to speaker identification and verification.
Over the past decade, speech recognition in computers has dramatically improved. This improvement has led to various applications of speech recognition, such as telephony operations and apparatus control. Spoken name speed dialing is an example of speech recognition in telephony operations. In spoken name speed dialing, a computer maintains a directory which includes numbers frequently called by a caller, and for each such number the directory further includes a representation of the caller's voice speaking a name corresponding to the telephone number. A caller may then call a number identified in the directory by merely speaking the corresponding name into the phone, assuming that the spoken name matches an entry in the directory. For example, a call can be placed by saying "call the boss" or some other utterance into the phone microphone, in response to which the phone system will dial the corresponding number for the uttered name. Door access is an example of speech recognition for apparatus control. In such a system, a computer maintains a directory including voice representations for each person having authority to control the door by way of voice recognition. A user may then control the door (e.g., gain access to the room or building via the door) by uttering information into a microphone, assuming that the spoken information matches an entry in the directory. The spoken information may be the user's name, although a name may be too short for robust speaker verification. Thus, as an alternative, the information may be some type of code which is anticipated as sufficiently long in duration to permit proper verification.
One context in which voice recognition may have particular benefits is access to a common resource by multiple persons. The resource may arise in telephony or apparatus control. As an example, assume that a selected group of employees of an office environment are authorized to place long distance calls through a single telephone account number. Assume further that such access is desired via speech recognition. Thus, when the system operates properly, only those authorized employees are permitted to make long distance telephone calls, whereas non-authorized persons (i.e., whether inside or outside of the office) are rejected by the system. This type of group access scenario is addressed by the present inventive embodiments. Thus, before proceeding with a detailed analysis of those embodiments, it is first instructive to examine some potential alternative approaches as may be contemplated by one skilled in the art.
In the above example as well as in comparable group access scenarios, a non-speech recognition approach as may be contemplated by one skilled in the art is to require all members of the group to remember a personal identification number (PIN) to be entered on the touch tone pad of a telephone. Such an approach, however, has at least three drawbacks. First, the user is required to remember particularized information (e.g., the PIN) which may be forgotten by, or at least burdensome to, the user. Second, a PIN approach may not be sufficiently secure since an unauthorized person may obtain the PIN and use it to gain fraudulent access to the system. Third, such an approach is not a speech recognition based system and, thus, is not suitable where speech recognition is either desired or mandated as the control technique to be imposed on the group.
Also in the above example and in other group access scenarios, a speech recognition approach as may be contemplated by one skilled in the art is to use a "speaker verification" approach for each member in the group. Under such an approach, each speaker would be required to provide an utterance sufficient to perform speaker verification on that utterance, with the term speaker verification being understood in the art. For speaker verification, the speaker must enroll a single phrase into a system, typically by repeating the same utterance multiple times; the system then uses a speaker verification model based on the enrolled utterances to form a speaker verification template. This template therefore includes only a vocabulary corresponding to the single utterance, and is tightly constrained to permit only on the order of a one to three percent impostor acceptance rate (i.e., acceptance of an inaccurate utterance from the authorized speaker, or of any utterance, whether accurate or inaccurate, from an unauthorized speaker). Once the template is formed, the speaker thereafter could have access to the resource by again stating an utterance which is then compared, using a speaker verification algorithm, to the speaker verification template. While this approach is one which could be devised by one skilled in the art, note that it presents various drawbacks in the context of a group of persons having access to the same resource. For example, returning to the scenario above where selected employees desire to access a long distance account, note that after all of the selected employees are enrolled, each time access is attempted this approach is required to perform a speaker verification for an utterance against all speaker verification templates in the system.
For example, if there are 100 authorized employees, then when one employee attempts access to the resource by stating an utterance, that utterance would be analyzed against 100 corresponding speaker verification templates. However, with performance typical of today's hardware and algorithms, as known in the art, a single speaker verification analysis typically takes on the order of one-half to one times the duration of the utterance. Thus, for an utterance of one second, a serial analysis of that utterance against 100 corresponding speaker verification templates would require between 50 seconds (i.e., 100 * 1/2 * 1 second utterance) and 100 seconds (i.e., 100 * 1 * 1 second utterance). For various applications, this time period is far too long. Additionally, if the speaker verification analyses are instead performed in parallel, then the complexity of such an approach is vastly increased. As still another drawback to this approach, if numerous (e.g., 100) speaker verification templates are analyzed, then the overall impostor acceptance rate is increased, since the one to three percent impostor rate of each separate analysis compounds across the analyses.
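The arithmetic behind these two drawbacks can be illustrated with a minimal sketch. The function names and the parameter `realtime_factor` are hypothetical and chosen only for illustration. The impostor calculation additionally assumes that each of the separate analyses accepts an impostor independently of the others; that independence is an assumption made here for simplicity and is not stated in the discussion above, so the resulting percentages should be read as rough indications only.

```python
def serial_verification_seconds(num_templates, utterance_seconds, realtime_factor):
    """Time to compare one utterance against every template, one after another.

    realtime_factor is the cost of a single analysis expressed as a multiple
    of the utterance duration (one-half to one in the discussion above).
    """
    return num_templates * realtime_factor * utterance_seconds


def overall_impostor_rate(num_templates, per_template_rate):
    """Chance that an impostor is accepted by at least one template.

    ASSUMPTION: each analysis accepts or rejects independently, so the chance
    of being rejected by all templates is (1 - rate) ** num_templates.
    """
    return 1.0 - (1.0 - per_template_rate) ** num_templates


if __name__ == "__main__":
    # 100 enrolled employees and a one second utterance, as in the example.
    fastest = serial_verification_seconds(100, 1.0, 0.5)
    slowest = serial_verification_seconds(100, 1.0, 1.0)
    print(f"serial analysis time: {fastest:.0f} to {slowest:.0f} seconds")

    # Per-analysis impostor rates of one and three percent.
    for rate in (0.01, 0.03):
        combined = overall_impostor_rate(100, rate)
        print(f"per-analysis rate {rate:.0%}: overall rate about {combined:.0%}")
```

Under the independence assumption, a one percent rate per analysis grows to roughly 63 percent over 100 templates, and a three percent rate grows to roughly 95 percent, which is consistent with the observation that the overall impostor rate increases with the number of templates analyzed.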
In view of the above, there arises a need to address the drawbacks of the above approaches and provide an improved system for permitting access to a common resource in response to speaker utterances.