1. Field of the Invention
The present invention relates generally to a speaker verification system (“SVS”), and more specifically to a text-independent SVS.
2. Description of the Related Art
The field of speaker verification has gained the interest of researchers and industry, since it is a simple biometric measure that can be used with ease and relatively low implementation costs for personnel authorization purposes. Extensive research has been conducted to identify improved pattern recognition methodologies to enhance the accuracy of speaker verification systems.
Biometric measurements, or “biometrics,” allow humans to be recognized by automated electronic means based upon one or more intrinsic and unique physical characteristics. Fingerprints, retina patterns and speech patterns are examples of measurable characteristics that can be used to verify an individual's identity. A very important advantage of biometrics is that the physical characteristics selected cannot be forgotten or transferred to other individuals and cannot easily be mimicked by others. The basic concept of biometrics consists mainly of comparing an individual's characteristic(s) against a recorded reference pattern taken from the claimed identity.
Voice recordings and reproductions cannot easily be used to penetrate sophisticated voice-based security systems. However, there are some drawbacks to the technology, such as fluctuations in voice patterns due to illness, physical exertion, and the like, which can affect the accuracy of the recognition systems. Depending on the security classification and other application requirements, several precautions can be taken to reduce the effect of such fluctuations, such as using high quality microphones and noise cancellers.
There are two types of speaker recognition systems: (a) speaker verification systems, and (b) speaker identification systems.
A speaker verification system determines whether a speaker who provides a voice sample is who she claims to be, by comparing her speech sample to a speech pattern for the claimed individual. For example, a person seeking access to a secured area can swipe an identification card or enter a personal identification number (“PIN”) assigned to a particular employee, leading the system to ask the person for a few seconds of speech. The system will then compare that speech to a pattern that the system had previously established for that employee's speech, and will decide whether the person presenting the card or entering the code is the employee or an imposter.
In contrast, a speaker identification system does not initially focus on one claimed individual, but rather compares a speech sample against a database of speech patterns from a number of speakers to try to attain a match.
A speaker recognition system can be classified as either a text-dependent system (“TDS”) or a text-independent system (“TIS”).
A TDS requires the speaker to speak certain key words in enrollment and threshold generation stages, and then to repeat some of the same complete words in a verification stage. Thus, a TDS requires a cooperative speaker, and such systems are used primarily to control physical or computer access. Some randomization can be introduced to try to prevent an intruder from recording the speech and playing it back. Thus, in the enrollment stage, the speaker can be asked to recite the numbers “one” through “twenty,” and words such as, “Able,” “Baker,” “Charlie,” etc. In the verification stage, the system can prompt the user, e.g., by way of a display monitor, to pronounce “Three Able Five Baker,” and upon the next access attempt, “Twenty Charlie Fifteen.” A TDS typically requires two to three seconds of speech in each of the stages.
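The prompt randomization described above can be illustrated with a minimal sketch. The vocabulary follows the example in the text; the function name `make_prompt` and the alternating number/word layout are assumptions made for illustration, not a description of any particular prior art system.

```python
import random

# Hypothetical enrollment vocabulary, following the example in the text.
NUMBERS = [
    "One", "Two", "Three", "Four", "Five", "Six", "Seven", "Eight", "Nine",
    "Ten", "Eleven", "Twelve", "Thirteen", "Fourteen", "Fifteen", "Sixteen",
    "Seventeen", "Eighteen", "Nineteen", "Twenty",
]
WORDS = ["Able", "Baker", "Charlie"]


def make_prompt(rng, pairs=2):
    """Return a prompt of alternating enrolled numbers and words.

    rng is a random.Random instance; pairs is the number of
    number/word pairs in the prompt (an assumed layout).
    """
    tokens = []
    for _ in range(pairs):
        tokens.append(rng.choice(NUMBERS))
        tokens.append(rng.choice(WORDS))
    return " ".join(tokens)


# A fresh prompt is drawn for every access attempt, so a recording of
# one attempt is unlikely to match the next prompt.
rng = random.Random()
print(make_prompt(rng))
```

Because every token comes from the enrolled vocabulary, the system can still match complete words while varying the sequence between attempts.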
In contrast, a TIS does not rely on matching words being spoken during the verification stage. The enrollment stage requires about 10-30 seconds of speech from the speaker, either with the speaker's voluntary participation, or speech that could even have been recorded surreptitiously. During the verification stage, the system requires 5-10 seconds of speech, which also may have been recorded surreptitiously, and which does not need to be any of the same words recorded during the enrollment stage. Where users are cooperative, a text-independent system can use a fixed text in the enrollment stage. The term “text independent” simply means that there is no requirement that the words proffered during the verification stage match some of those used during the enrollment speech. Text-independent systems are used in forensic and surveillance applications where the system user is not cooperative.
Prior art speaker recognition systems, such as that shown in FIG. 1, typically employ a basic configuration of enrollment, threshold generation and verification stages. Some systems include threshold generation as part of either the enrollment stage or the verification stage, instead of illustrating it as a separate stage. Prior art systems also commonly group the functions to be performed in each stage by functional block.
In the enrollment stage, speech from a particular speaker is entered into the system and the system generates and stores a speaker reference model, which is a codebook containing characteristic speech patterns for given sounds made by a particular speaker.
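One way to picture a codebook is as a small set of representative feature vectors. The sketch below builds one with plain k-means clustering; this is an assumed technique chosen for illustration, since the text does not say how prior art systems construct the codebook. The names `train_codebook` and `nearest` are hypothetical, and the feature vectors are assumed to have been extracted already.

```python
import math


def nearest(vector, codebook):
    """Return the index of the codeword closest to vector (Euclidean)."""
    return min(range(len(codebook)),
               key=lambda i: math.dist(vector, codebook[i]))


def train_codebook(features, size, iterations=10):
    """Build a codebook of `size` codewords with plain k-means.

    features: list of equal-length feature vectors (lists of floats),
    with len(features) >= size. Initial codewords are taken at evenly
    spaced positions in the input so the result is deterministic.
    """
    step = len(features) / size
    codebook = [list(features[int(i * step)]) for i in range(size)]
    for _ in range(iterations):
        # Assign every feature vector to its nearest codeword.
        clusters = [[] for _ in range(size)]
        for vector in features:
            clusters[nearest(vector, codebook)].append(vector)
        # Move each codeword to the mean of its assigned vectors.
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its old codeword
                codebook[i] = [sum(col) / len(members)
                               for col in zip(*members)]
    return codebook
```

The stored codebook is much smaller than the enrollment speech itself, which is the point of keeping a model rather than the raw samples.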
In the threshold generation stage, additional speech from the same speaker will be entered into the system, with the system establishing a threshold, which is defined as the maximum deviation from the codebook that the system considers to be acceptable.
In the verification stage, a speaker will claim to be a particular individual whose codebook resides in the system. The speaker will make that claim, for example, by swiping an identification card, or by entering a PIN assigned to a particular employee. The system will locate the codebook for the claimed individual and load it into memory. The speaker will provide a speech sample, to which the system applies digital signal processing and feature extraction, after which the system compares the result to the codebook previously established for the claimed individual. If the difference is within the threshold, the system accepts the speaker as the person associated with the codebook. If the difference is outside the threshold, the system rejects the speaker as an imposter.
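The threshold generation and verification stages can be sketched as follows, assuming feature vectors have already been extracted and that "difference" is measured as the average Euclidean distance from each feature vector to its nearest codeword. That distance measure, the `margin` parameter, and the function names are illustrative assumptions; the text does not specify them.

```python
import math


def distortion(features, codebook):
    """Average distance from each feature vector to its nearest codeword."""
    total = sum(min(math.dist(vector, codeword) for codeword in codebook)
                for vector in features)
    return total / len(features)


def make_threshold(utterances, codebook, margin=1.2):
    """Threshold generation stage (assumed rule).

    utterances: list of feature-vector lists from the enrolled speaker.
    The threshold is the largest distortion observed for that speaker,
    widened by an assumed safety margin.
    """
    return margin * max(distortion(u, codebook) for u in utterances)


def verify(features, codebook, threshold):
    """Verification stage: accept when distortion is within the threshold."""
    return distortion(features, codebook) <= threshold


# Toy two-dimensional example with made-up values.
codebook = [[0.0, 0.0], [1.0, 1.0]]
threshold = make_threshold([[[0.1, 0.0]], [[1.0, 0.8]]], codebook, margin=1.5)
accepted = verify([[0.1, 0.1]], codebook, threshold)   # close to a codeword
rejected = verify([[3.0, 3.0]], codebook, threshold)   # far from every codeword
```

In this toy case `accepted` is True and `rejected` is False, mirroring the accept/reject decision described above.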
As noted, the operation of a typical prior art SVS can be divided into a number of functional blocks, in which a known computer device processes information. A Digital Signal Processing (“DSP”) block appears in each of the three stages, where it filters and digitizes the analog sound waves entering that stage.
Each stage also has a Feature Extraction (“FE”) block that derives a lower-dimensional feature space representation of speech elements. This representation will allow for reduced data storage and computing requirements, while still being capable of discriminating between speakers.
Pattern Matching (“PM”) blocks are present in the enrollment stage and threshold generation stage, where they create a speaker model. In the enrollment stage, the speaker model is retained as the codebook. In the threshold generation stage, the speaker model is not retained, but is forwarded to a threshold generation block, where it is compared with the codebook and used to establish a threshold.
The feature comparison block, in the verification stage, compares features extracted from the speaker with the codebook established for the claimed identity.
The decision block, in the verification stage, receives the results calculated by the feature comparison block and compares them with the threshold, before deciding whether to accept or reject the claimed identity.
As noted above, a feature extraction technique is utilized in all three stages of an SVS to find a lower-dimensional feature space representation that includes sufficient vectors of information to achieve suitable similarity measurements. The speech signal is a complex function of the speaker's physical characteristics (i.e., vocal tract dimensions and environment) and emotional state (i.e., physical and mental stress). A broad sampling or selection of the acoustic features is critical for the effectiveness of an SVS. Specifically, feature extraction should (a) extract speaker-dependent features from speech that are capable of discriminating between speakers while being tolerant of intra-speaker variabilities; (b) be easily measurable from the speech signal; (c) be stable over time; and (d) not be susceptible to mimicry by impostors.
The feature extraction process can be programmed to identify a number of features including linear predictive coefficients (“LPC”), pitch and pitch contours, formant frequency and bandwidth, nasal coarticulation and gain. The effectiveness of an SVS is highly dependent on the accuracy of discrimination of the speaker models obtained from the speech features. The two most popular methods of feature extraction are modeling human voice production, especially using LPC, and modeling the human system of perception by analyzing pitch and pitch contours, especially using mel frequency cepstral coefficients (“MFCC”). The main advantage of MFCC is that the technique accurately approximates the human auditory system, which does not perceive frequency components in speech as following a linear scale.
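The nonlinear frequency perception mentioned above is usually captured by the mel scale. The sketch below uses one widely used conversion formula, mel = 2595 * log10(1 + f / 700); the text does not state which formula a given system uses, so this choice and the function names are assumptions.

```python
import math


def hz_to_mel(freq_hz):
    """Convert a frequency in Hz to mels (one common formula)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_spaced_frequencies(low_hz, high_hz, count):
    """Return `count` (>= 2) frequencies in Hz equally spaced in mels."""
    low, high = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (high - low) / (count - 1)
    return [mel_to_hz(low + i * step) for i in range(count)]


# Band centers bunch together at low frequencies and spread out at
# high frequencies, which is how MFCC filter banks are typically laid out.
centers = mel_spaced_frequencies(0.0, 4000.0, 5)
```

Under this formula 1000 Hz maps to approximately 1000 mels, and equal steps in mels correspond to progressively wider steps in Hz.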
As noted above, the main objective of pattern matching is to create a speaker model, which in turn is used to formulate a codebook in the enrollment stage and as an input to the threshold generation block in the threshold generation stage. The selection of a proper value for the threshold is vital, as it determines whether an identity claim with a reasonable variance will be accepted or rejected. The most common pattern matching methods are stochastic modeling and neural networks. Template modeling and other methods have also been used, though less commonly.
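Why the threshold value matters can be seen by counting errors at different settings. The sketch below assumes each verification attempt yields a distortion score (lower means closer to the codebook) and reports the fraction of genuine attempts rejected and impostor attempts accepted. The function name and all score values are made up purely for illustration.

```python
def error_rates(genuine_scores, impostor_scores, threshold):
    """Return (false_rejection_rate, false_acceptance_rate).

    Scores are distortions, so an attempt is accepted when its
    score is less than or equal to the threshold.
    """
    false_rejects = sum(1 for s in genuine_scores if s > threshold)
    false_accepts = sum(1 for s in impostor_scores if s <= threshold)
    return (false_rejects / len(genuine_scores),
            false_accepts / len(impostor_scores))


# Made-up distortion scores, not measured data.
genuine = [0.1, 0.2, 0.4, 0.6]
impostor = [0.5, 0.9, 1.2, 1.5]

tight = error_rates(genuine, impostor, 0.3)   # rejects some genuine speakers
loose = error_rates(genuine, impostor, 0.7)   # admits some impostors
```

A tight threshold turns away genuine speakers whose voices vary, while a loose one lets impostors through, so the chosen value is a trade-off between the two kinds of error.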
These prior art systems typically require powerful microprocessors and large amounts of computer memory. In addition, the only known commercial speaker verification system is a text-dependent system. It would be highly desirable to have a text-independent speaker verification system that can be implemented with a less powerful microprocessor and smaller data storage device than used by known comparable systems of the prior art.