Typically, background or babble noise may pose a problem in a video-conferencing scenario when determining whether a person is speaking. Further, the presence of such noise may make it difficult to understand the person's spoken words. According to one conventional approach, visual speech recognition may recognize a spoken word by using only the video signal that is produced during speech. Visual speech recognition deals with the visual domain of speech and typically involves image processing, artificial intelligence, object detection, pattern recognition, statistical modeling, etc. In one example, the visual speech recognition may employ face detection for “lip reading” to convert lip and mouth movements into speech or text based on visemes. A viseme is the mouth shape (or appearance), or sequence of mouth dynamics, that is required to generate a phoneme in the visual domain. However, visemes cover only a small subspace of the mouth motions represented in the visual domain, and visual speech recognition may therefore be limited to a relatively small subset of words.
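The ambiguity described above arises because the phoneme-to-viseme mapping is many-to-one: several phonemes that sound different (e.g., the bilabials /p/, /b/, and /m/) produce the same lip appearance. The following minimal sketch illustrates this; the viseme group labels and the coarse phoneme inventory are illustrative assumptions, not a standard inventory.

```python
# Hypothetical many-to-one phoneme-to-viseme mapping (group labels are
# illustrative assumptions, not a standard viseme inventory).
PHONEME_TO_VISEME = {
    # Bilabials look identical on the lips.
    "p": "V_bilabial", "b": "V_bilabial", "m": "V_bilabial",
    # Labiodentals share one lip appearance.
    "f": "V_labiodental", "v": "V_labiodental",
    # Alveolars are hard to distinguish visually.
    "t": "V_alveolar", "d": "V_alveolar", "n": "V_alveolar",
    # Coarse vowel groupings by mouth shape.
    "a": "V_open", "i": "V_spread", "o": "V_round",
}

def to_visemes(phonemes):
    """Collapse a phoneme sequence into its viseme sequence."""
    return [PHONEME_TO_VISEME[p] for p in phonemes]

# "pat", "bat", and "mat" collapse to the same viseme sequence,
# which is why lip reading alone cannot distinguish them.
print(to_visemes(["p", "a", "t"]))
print(to_visemes(["b", "a", "t"]))
print(to_visemes(["m", "a", "t"]))
```

Because distinct words can collapse to identical viseme sequences, a recognizer that relies on visemes alone is restricted to vocabularies whose words remain distinguishable after this collapse, consistent with the limitation noted above.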