Reliable identification and analysis of facial features is important for a wide range of applications, including security and visual tracking of individuals. Facial analysis can include facial feature extraction, representation, and expression recognition, and currently available systems are capable of discriminating among different facial expressions, including lip and mouth position. Unfortunately, many systems require substantial manual input for best results, especially when low-quality video systems are the primary data source.
In recent years, it has been shown that the use of even low-quality facial visual information together with audio information significantly improves the performance of speech recognition in environments affected by acoustic noise. Conventional audio-only recognition systems are adversely impacted by environmental noise, often requiring acoustically isolated rooms and consistent microphone positioning to reach even minimally acceptable error rates in common speech recognition tasks. The success of currently available speech recognition systems is accordingly restricted to relatively controlled environments and well-defined applications such as dictation or small to medium vocabulary voice-based control commands (hands-free dialing, menu navigation, GUI screen control). These limitations have prevented the widespread acceptance of speech recognition systems in acoustically uncontrolled workplaces or public sites.
The use of visual features in conjunction with audio signals takes advantage of the bimodality of speech (audio is correlated with lip position) and the fact that visual features are unaffected by acoustic noise. Various approaches to recovering and fusing audio and visual data in audiovisual speech recognition (AVSR) systems are known. One popular approach relies on mouth shape as a key visual data input. Unfortunately, accurate detection of lip contours is often very challenging under varying illumination or during facial rotations. Alternatively, computationally intensive approaches based on gray scale lip contours modeled through principal component analysis, linear discriminant analysis, two-dimensional DCT, and maximum likelihood transform have been employed to recover suitable visual data for processing.
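As a rough illustration of the two-dimensional DCT approach mentioned above, the following minimal sketch computes an orthonormal 2D DCT-II of a gray scale mouth region and keeps only the low-frequency coefficients as a compact visual feature vector. It is an assumption-laden example, not the method of any particular AVSR system: the function names (dct_1d, dct_2d, visual_features) and the keep parameter are hypothetical, and the mouth region is assumed to have already been located and cropped into a rectangular grid of intensity values.

```python
import math


def dct_1d(values):
    """Orthonormal DCT-II of a 1D sequence of numbers."""
    n = len(values)
    out = []
    for k in range(n):
        # The k = 0 (DC) term uses a different scale so the transform is orthonormal.
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        total = sum(
            v * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
            for i, v in enumerate(values)
        )
        out.append(scale * total)
    return out


def dct_2d(image):
    """Separable 2D DCT-II: transform every row, then every column."""
    row_pass = [dct_1d(row) for row in image]
    # zip(*row_pass) iterates over columns; each column is transformed in turn.
    col_pass = [dct_1d(list(col)) for col in zip(*row_pass)]
    # col_pass is indexed [column][row], so transpose back to [row][column].
    return [list(r) for r in zip(*col_pass)]


def visual_features(mouth_region, keep=4):
    """Hypothetical helper: flatten the keep x keep low-frequency DCT block.

    mouth_region is assumed to be a list of equal-length rows of gray scale
    intensities, at least keep rows by keep columns in size.
    """
    coeffs = dct_2d(mouth_region)
    return [coeffs[r][c] for r in range(keep) for c in range(keep)]


if __name__ == "__main__":
    # A synthetic 8 x 8 region with a horizontal intensity gradient.
    region = [[float(c) for c in range(8)] for _ in range(8)]
    print(visual_features(region, keep=2))
```

Keeping only the top-left block of coefficients discards fine detail while retaining the coarse shape of the region, which is one reason DCT features are a common alternative to explicit lip contour tracking. The pure-Python loops are written for readability; a practical system would more likely rely on an optimized numerical library.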