Humans identify speakers based on a variety of attributes of the person, including acoustic cues, visual appearance cues, and behavioral characteristics (e.g., characteristic gestures and lip movements). In the past, machine implementations of person identification have focused on single techniques relating to audio cues alone (e.g., audio-based speaker recognition), visual cues alone (e.g., face identification, iris identification), or other biometrics. More recently, researchers have attempted to combine multiple modalities for person identification; see, e.g., J. Bigun, B. Duc, F. Smeraldi, S. Fischer and A. Makarov, "Multi-modal person authentication," in H. Wechsler, J. Phillips, V. Bruce, F. Fogelman Soulie, T. Huang (eds.), Face Recognition: From Theory to Applications, Berlin: Springer-Verlag, 1999.
Speaker recognition is an important technology for a variety of applications, including security and, more recently, indexing for search and retrieval of digitized multimedia content (for instance, in the MPEG-7 standard). The accuracy of audio-based speaker recognition under acoustically degraded conditions (e.g., background noise) and channel mismatch (e.g., telephone) still needs improvement, and improving recognition under such conditions is a difficult problem. As a result, it would be highly advantageous to provide methods and apparatus for improved speaker recognition that perform successfully in the presence of acoustic degradation, channel mismatch, and other conditions which have hampered existing speaker recognition techniques.