Object determination (such as fingerprint recognition, face identification, image/audio classification, or speaker verification) is a major problem in many important fields of business and government. For example, it can be used in biometric authentication to enhance information and homeland security. More generally, it can serve as one component of a rich media indexing system, where the ability to recognize speakers in audio streams is quite useful.
Focusing on speaker identification, there are two major implementation approaches: text-dependent and text-independent. In the text-dependent approach, the system aligns the incoming and enrolled utterances and compares them based on the speech context. Text-dependent approaches are better suited to situations where the user is cooperative and where the final goal is verification rather than identification. In the text-independent approach, the system identifies speakers based on the specific acoustic features of the vocal tract instead of the context of the speech, so no prior knowledge of what is being said is necessary. This identification process is considerably harder and more prone to errors.
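The alignment step of a text-dependent system can be illustrated with a small sketch. The text above does not name an alignment method; dynamic time warping (DTW) is assumed here purely as one classic choice, and the function names, the one-dimensional toy features, and the acceptance threshold are all hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def dtw_distance(seq_a, seq_b):
    """Cumulative cost of the best monotonic alignment of two sequences.

    DTW is an assumed alignment method, not one named in the text.
    """
    inf = float("inf")
    n, m = len(seq_a), len(seq_b)
    # cost[i][j] = best cost of aligning the first i frames of seq_a
    # with the first j frames of seq_b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = euclidean(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # stretch seq_b
                                 cost[i][j - 1],      # stretch seq_a
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]


# Hypothetical 1-D "features": the same phrase spoken more slowly, and a
# different phrase of the same length as the enrolled one.
enrolled = [[0.0], [1.0], [2.0]]
same_phrase_slower = [[0.0], [0.0], [1.0], [1.0], [2.0]]
different_phrase = [[2.0], [1.0], [0.0]]

THRESHOLD = 0.5  # hypothetical acceptance threshold
accepted = dtw_distance(enrolled, same_phrase_slower) <= THRESHOLD
rejected = dtw_distance(enrolled, different_phrase) > THRESHOLD
```

Because the alignment absorbs differences in speaking rate, the slower repetition of the enrolled phrase scores as a close match, while the different phrase does not. This also shows why the approach suits verification of a cooperative user who repeats a known phrase.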
Most text-independent speaker identification systems use Gaussian Mixture Models (GMM's) to represent the speech characteristics of the speakers. GMM's are a well-known type of generative model. The idea is that each speaker's enrollment utterances are converted to feature vectors. A common feature extraction technique is the Mel-frequency cepstral coefficient (MFCC) representation. See Slaney, M., “Auditory Toolbox, Version 2,” Technical Report, Interval Research Corporation (1988). The feature vectors from each speaker are used to fit a set of Gaussian components. These components are weighted by their priors and combined linearly to form a mixture, so that each speaker is represented by a unique GMM. An example of this approach is described in Reynolds, D. A., “Speaker Identification and Verification Using Gaussian Mixture Speaker Models,” Speech Communication, pp. 91-108 (August 1995). GMM's are generative models because they model the statistical distribution of the feature vectors p(x|speaker) using data from that speaker only. Similarly, GMM's can be used to model an image (or, more generally, an object) in any object determination task.
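A minimal sketch of how such per-speaker mixtures might be scored is given below. It assumes diagonal covariances, treats frames as independent given the speaker, and uses hand-set two-dimensional parameters in place of models actually fitted to MFCC vectors (fitting is commonly done with expectation-maximization, which is not shown). The speaker names, parameter values, and function names are hypothetical.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_likelihood(x, gmm):
    """log p(x | speaker) for a mixture given as (prior, mean, var) triples."""
    terms = [math.log(w) + log_gaussian_diag(x, m, v) for w, m, v in gmm]
    peak = max(terms)  # log-sum-exp trick for numerical stability
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


def identify_speaker(frames, speaker_models):
    """Pick the speaker whose GMM gives the highest total log-likelihood.

    Assumes frames are independent given the speaker, so per-frame
    log-likelihoods simply add.
    """
    scores = {
        name: sum(gmm_log_likelihood(x, gmm) for x in frames)
        for name, gmm in speaker_models.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical hand-set models; each component is (prior, mean, variance).
speaker_models = {
    "alice": [(0.6, [0.0, 0.0], [1.0, 1.0]), (0.4, [2.0, 2.0], [0.5, 0.5])],
    "bob": [(0.5, [5.0, 5.0], [1.0, 1.0]), (0.5, [7.0, 4.0], [0.8, 0.8])],
}

# Hypothetical feature vectors from an unknown utterance.
test_frames = [[0.2, -0.1], [1.9, 2.1], [0.5, 0.4]]
best, scores = identify_speaker(test_frames, speaker_models)
```

Each model is scored using only its own parameters, which reflects the generative character noted above: a speaker's GMM describes p(x|speaker) without reference to any other speaker's data.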
As opposed to using generative models such as GMM's, object determination can also be based on discriminative models. The discriminative approach learns how each object compares with all other objects and spends its modeling power on learning what makes objects unique or different from one another. Support Vector Machines (SVM's) (Vapnik, V., Statistical Learning Theory, John Wiley and Sons, 1998) are often used in this approach.
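The contrast with the generative approach can be sketched with a small linear SVM trained to separate one speaker's feature vectors from everyone else's. This is an illustrative simplification, not the formulation in the cited reference: it assumes a linear kernel, minimizes the regularized hinge loss by plain subgradient descent, and uses hypothetical two-dimensional toy data, hyperparameters, and function names.

```python
def decision(x, w, b):
    """Signed score: positive means 'target speaker', negative means 'other'."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train_linear_svm(samples, labels, lam=0.01, lr=0.1, epochs=200):
    """Subgradient descent on the L2-regularized hinge loss (labels are +1/-1)."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * decision(x, w, b)
            # The regularizer shrinks w at every step.
            w = [wi - lr * lam * wi for wi in w]
            # Only samples inside the margin (or misclassified) move the boundary.
            if margin < 1.0:
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


# Hypothetical toy features: the target speaker (+1) versus all others (-1).
samples = [[2.0, 2.0], [2.5, 1.5], [3.0, 2.5],
           [-2.0, -1.0], [-1.5, -2.0], [-2.5, -2.5]]
labels = [1, 1, 1, -1, -1, -1]

w, b = train_linear_svm(samples, labels)
is_target = decision([2.2, 1.8], w, b) > 0
is_other = decision([-2.0, -2.0], w, b) < 0
```

Unlike the per-speaker GMM, training here needs examples from the other speakers as well, and only samples near the decision boundary change the model. That is the sense in which the modeling power is spent on what distinguishes one object from another.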