1. Technical Field
The present disclosure relates to speech processing and more specifically to detecting speech activity based on facial features.
2. Introduction
Many mobile devices, such as smartphones, personal digital assistants, and tablets, include microphones. Such devices can use audio received via the microphones for processing speech commands. However, when processing speech and transcribing the speech to text, unintended noises can be processed into ghost words or otherwise confuse the speech processor. Thus, such systems can attempt to determine where the user's speech starts and stops to prevent unintended noises from being accidentally processed. Such determinations are difficult to make, especially in environments with a significant audio floor, like coffee shops, train stations, and so forth, or where multiple people are having a conversation while using the speech application.
To alleviate this problem, many speech applications allow the user to provide manual input, such as pressing a button, to control when the application starts and stops listening. However, this can interfere with natural usage of the speech application and can prevent hands-free operation. Other speech applications allow users to say trigger words to signal the beginning of speech commands, but the trigger word approach can lead to unnatural, stilted dialogs. Further, these trigger words may not be consistent across applications or platforms, leading to user confusion. These and other problems exist in current voice-controlled applications.