The problem of speaker recognition has traditionally been treated as one of speech classification. The speech from the speaker is parameterized into sequences of feature vectors, which are then classified as belonging to a particular speaker using some classification mechanism. The prior art has primarily focused either on deriving better descriptive features from the speech signal, or on applying better classifiers to those features.
Speaker recognition can be improved by augmenting measurements from the speech signal with input from other sensors, in particular a camera. A variety of techniques are known for integrating information extracted from the video with that obtained from the speech signal. The most obvious is to combine evidence from a face recognition classifier that operates on the video with evidence from the speaker ID system that operates on the speech.
Other techniques explicitly derive speaking-related features, such as characterizations of lip configurations and facial texture around the lips [8].
Other secondary sensors, such as a physiological microphone (PMIC) and a glottal electromagnetic micropower sensor (GEMS), provide measurements that augment speech signals. However, they have largely been used for speech recognition, because they primarily produce relatively noise-free readings of some aspects of the speech signal, such as a filtered version of the speech or the excitation to the vocal tract, and do not provide any additional information about the speaker that is not contained in the speech signal itself. Additionally, many of those devices must be mounted on the speaker, and are therefore not appropriate for use in most speaker recognition or verification applications.