The field of user authentication has received increasing attention over the past decade. To enable around-the-clock availability of more and more personal services, many sophisticated transactions have been automated, and remote database access has become pervasive. This, in turn, has heightened the need to automatically and reliably establish a user's identity. In addition to standard password-type information, it is now possible to include, in some advanced authentication systems, a variety of biometric data, such as voice characteristics, retina patterns, and fingerprints.
In the context of voice processing, two areas of focus can be distinguished. Speaker identification is the process of determining which registered speaker provides a given utterance. Speaker verification, on the other hand, is the process of accepting or rejecting the claimed identity of a speaker based upon an utterance. Collectively, they refer to the automatic recognition of a speaker (i.e., speaker authentication) on the basis of individual information present in the speech waveform. Most applications in which a voice sample is used as a key to confirm the identity of a speaker are classified as speaker verification. Many of the underlying algorithms, however, can be applied to both speaker identification and verification.
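The distinction between the two tasks can be illustrated with a minimal sketch. It assumes each registered speaker has a scoring function that maps an utterance to a match score; the toy scorer, the speaker names, and the threshold value below are hypothetical and stand in for whatever acoustic model a real system would use.

```python
from typing import Callable, Dict, Sequence

# Hypothetical type: a speaker model maps an utterance to a match score.
Scorer = Callable[[Sequence[float]], float]


def make_toy_model(mean: float) -> Scorer:
    """Toy stand-in for a speaker model: negative mean squared distance."""
    def score(utterance: Sequence[float]) -> float:
        return -sum((x - mean) ** 2 for x in utterance) / len(utterance)
    return score


def identify(utterance: Sequence[float], models: Dict[str, Scorer]) -> str:
    """Identification: pick the registered speaker with the best score."""
    return max(models, key=lambda name: models[name](utterance))


def verify(utterance: Sequence[float], claimed: str,
           models: Dict[str, Scorer], threshold: float) -> bool:
    """Verification: accept or reject one claimed identity."""
    return models[claimed](utterance) >= threshold


# Illustrative data only; the values are made up for the sketch.
models = {"alice": make_toy_model(1.0), "bob": make_toy_model(5.0)}
utterance = [0.9, 1.2, 1.1]

who = identify(utterance, models)                      # closed-set choice
ok_alice = verify(utterance, "alice", models, -1.0)    # claim tested alone
ok_bob = verify(utterance, "bob", models, -1.0)
```

Identification compares the utterance against every registered model, whereas verification tests a single claimed identity against a threshold, which is why the same scoring machinery can serve both tasks.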
Speaker authentication methods may be divided into text-dependent and text-independent methods. Text-dependent methods require the speaker to say key phrases having the same text for both training and recognition trials, whereas text-independent methods do not rely on a specific text to be spoken. Text-dependent systems offer the possibility of verifying the spoken key phrase (assuming it is kept secret) in addition to the speaker identity, thus resulting in an additional layer of security. This is referred to as the dual verification of speaker and verbal content, which is predicated on the user maintaining the confidentiality of his or her pass-phrase.
On the other hand, text-independent systems offer the possibility of prompting each speaker with a new key phrase every time the system is used. This provides essentially the same level of security as a secret pass-phrase without burdening the user with the responsibility of safeguarding and remembering the pass-phrase. This is because prospective impostors cannot know in advance what random sentence will be requested and therefore cannot (easily) play back illegally pre-recorded voice samples from a legitimate user. However, implicit verbal content verification must still be performed to be able to reject such potential impostors. Thus, in both cases, the additional layer of security may be traced to the use of dual verification.
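As a rough sketch of dual verification in the prompted case, the decision can be viewed as the conjunction of two independent tests. The function names, phrase list, scores, and thresholds below are hypothetical; the scores are assumed to come from separate verbal content and speaker models.

```python
import random
from typing import Sequence


def choose_prompt(phrases: Sequence[str], rng: random.Random) -> str:
    """Pick a fresh phrase so that an impostor cannot prepare a recording."""
    return rng.choice(phrases)


def dual_verify(content_score: float, speaker_score: float,
                content_threshold: float, speaker_threshold: float) -> bool:
    """Accept only if both the spoken text and the voice pass their tests."""
    return (content_score >= content_threshold
            and speaker_score >= speaker_threshold)


# Illustrative values only.
phrases = ["blue river seven", "open the north gate", "quiet morning rain"]
prompt = choose_prompt(phrases, random.Random(0))

genuine = dual_verify(0.9, 0.8, 0.5, 0.5)    # right words, right voice
playback = dual_verify(0.1, 0.9, 0.5, 0.5)   # right voice, wrong words
```

A played-back recording of the legitimate user may pass the speaker test, but it fails the verbal content test because it does not contain the requested sentence.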
In all of the above, the technology of choice to exploit the acoustic information is hidden Markov modeling (HMM) using phonemes as the basic acoustic units. Speaker verification relies on speaker-specific phoneme models while verbal content verification normally employs speaker-independent phoneme models. These models are represented by Gaussian mixture continuous HMMs, or tied-mixture HMMs, depending on the training data. Speaker-specific models are typically constructed by adapting speaker-independent phoneme models to each speaker's voice. During the verification stage, the system concatenates the phoneme models appropriately, according to the expected sentence (or broad phonetic categories, in the non-prompted text-independent case). The likelihood of the input speech matching the reference model is then calculated and used for the authentication decision. If the likelihood is high enough, the speaker/verbal content is accepted as claimed.
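The verification stage described above can be sketched as follows. This is a deliberately simplified illustration, not the modeling used in practice: each phoneme is assumed to have a single state with a one-dimensional Gaussian emission (rather than a Gaussian mixture over spectral vectors), the self-loop probability is fixed, and the phoneme labels, frame values, and threshold are hypothetical.

```python
import math


def log_gauss(x, mean, var):
    """Log density of a 1-D Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def log_add(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def concatenate(phoneme_models, phonemes):
    """Chain phoneme models into one left-to-right state sequence."""
    states = []
    for p in phonemes:
        states.extend(phoneme_models[p])
    return states


def forward_log_likelihood(frames, states, stay=0.5):
    """Forward algorithm: log P(frames | model), ending in the last state."""
    log_stay, log_move = math.log(stay), math.log(1.0 - stay)
    n = len(states)
    alpha = [-math.inf] * n
    alpha[0] = log_gauss(frames[0], *states[0])  # must start in first state
    for x in frames[1:]:
        new = [-math.inf] * n
        for j in range(n):
            from_prev = alpha[j - 1] + log_move if j > 0 else -math.inf
            total = log_add(alpha[j] + log_stay, from_prev)
            if total > -math.inf:
                new[j] = total + log_gauss(x, *states[j])
        alpha = new
    return alpha[-1]


def accept(frames, states, threshold):
    """Accept when the per-frame log-likelihood clears the threshold."""
    return forward_log_likelihood(frames, states) / len(frames) >= threshold


# Hypothetical models: phoneme -> list of (mean, variance) states.
phoneme_models = {"a": [(0.0, 1.0)], "b": [(3.0, 1.0)]}
reference = concatenate(phoneme_models, ["a", "b"])  # expected sentence

matching = [0.1, -0.2, 2.9, 3.1]
mismatched = [3.0, 3.1, 0.0, 0.1]
```

Normalizing by the number of frames keeps the threshold comparable across utterances of different lengths; real systems often also normalize against a background or cohort model, which this sketch omits.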
The crux of speaker authentication is the comparison between features of the input utterance and some stored templates, so it is important to select appropriate features for the authentication. Speaker identity is correlated with the physiological and behavioral characteristics of the speaker. These characteristics exist both in the spectral envelope (vocal tract characteristics) and in the supra-segmental features (voice source characteristics and dynamic features spanning several segments). As a result, the input utterance is typically represented by a sequence of short-term spectral measurements and their regression coefficients (i.e., the derivatives of the time function of these spectral measurements).
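The regression coefficients mentioned above are often computed with a least-squares slope over a short window of neighboring frames. The sketch below assumes that commonly used formula, d_t = sum_n n (c_{t+n} - c_{t-n}) / (2 sum_n n^2), with edge frames repeated as padding; the text does not specify a particular formula or window length, so both are assumptions.

```python
def delta_features(frames, window=2):
    """Regression (delta) coefficients for a sequence of feature vectors.

    frames: list of equal-length lists, one spectral vector per frame.
    window: number of frames on each side used in the regression (assumed).
    """
    num_frames = len(frames)
    denom = 2 * sum(n * n for n in range(1, window + 1))
    deltas = []
    for t in range(num_frames):
        row = []
        for d in range(len(frames[t])):
            acc = 0.0
            for n in range(1, window + 1):
                # Clamp indices so the first and last frames are repeated.
                ahead = frames[min(t + n, num_frames - 1)][d]
                behind = frames[max(t - n, 0)][d]
                acc += n * (ahead - behind)
            row.append(acc / denom)
        deltas.append(row)
    return deltas


# Toy input: the first dimension rises by 1 per frame, the second is constant.
ramp = [[0.0, 5.0], [1.0, 5.0], [2.0, 5.0], [3.0, 5.0], [4.0, 5.0]]
deltas = delta_features(ramp)
```

For the middle frame the rising dimension yields a slope of 1 and the constant dimension a slope of 0, which is the behavior expected of a time derivative estimate.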
Since HMMs can efficiently model statistical variation in such spectral features, they have achieved significantly better performance than less sophisticated template-matching techniques, such as dynamic time-warping. However, HMMs require the a priori selection of a suitable acoustic unit, such as the phoneme. This selection entails the need to adjust the authentication implementation from one language to another, just as speech recognition systems must be re-implemented when moving from one language to another. In addition, depending on the number of context-dependent phonemes and other modeling parameters, the HMM framework can become computationally intensive.
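For comparison, the template-matching baseline mentioned above, dynamic time warping, can be sketched in a few lines. This is the textbook formulation on one-dimensional sequences with an absolute-difference local cost; the sequences are made-up examples, and a real system would compare spectral vectors instead.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best cumulative cost aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # a advances
                                     cost[i][j - 1],      # b advances
                                     cost[i - 1][j - 1])  # both advance
    return cost[n][m]


template = [1.0, 2.0, 3.0]
slower = [1.0, 1.0, 2.0, 2.0, 3.0, 3.0]  # same shape, spoken more slowly
shifted = [2.0, 3.0, 4.0]                # different values

same = dtw_distance(template, slower)
different = dtw_distance(template, shifted)
```

The warping absorbs differences in speaking rate, but the template itself is a single stored example; it carries no model of statistical variation, which is the limitation that HMMs address.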