Speech recognition and voice processing systems are known for translating dictated speech into text or computer instructions (such as menu operations, and the like). Conventional speech recognition systems use a number of different algorithms and technologies in an ongoing effort to recognize a user's speech and do what the user desires based on that speech recognition. A common application of this technology is in the classic dictation sense, where voice is converted into text in a word processing application. Another application is conversion of voice into common instructions for menu operations, such as opening a file, closing a file, saving a file, copying, pasting, etc.
In most systems, a computing device with memory, storage, and a processor executes a software application enabling the speech recognition functionality. A user speaks into a microphone and the speech recognition software processes the user's voice into text or commands.
Several performance factors are considered when assessing these speech recognition applications, among them speed and accuracy. Users of such applications desire that the applications interpret the user's voice as accurately as possible, so that later editing time is reduced or eliminated and/or commands are correctly understood. Likewise, users benefit from the applications providing feedback in real time, so that the user knows as quickly as possible what the application heard and what it is doing in response to the voice input, and so that commands are acted on quickly.
When operating a speech recognition software application for the first time, it is highly recommended (and in some instances required) to go through a process of acoustic training. The phrase “acoustic training”, as utilized herein, refers to a process that is performed in an effort to improve the quality of speech recognition by speech recognition engines in a single environment. The process attempts to teach the speech recognition engine of the speech recognition software application to recognize and accurately interpret a particular user's voice.
Generally, the training process includes presenting the user with a prepared script. The user is required to read the script aloud into the microphone in communication with the speech recognition software application. The speech recognition engine attempts to recognize the speech and compares the result in some manner with the prepared script that it knows. In some instances, speech recognition software applications will provide immediate feedback during this process, accepting or rejecting spoken phrases as matching or not matching the prepared script. If the spoken phrase matches the prepared script, the software application provides the user with the next phrase or sentence to read. If the spoken phrase does not match the prepared script in a way that the speech recognition engine recognizes, the software application will prompt the user to repeat the phrase or sentence until recognition is confirmed, or until the phrase or sentence is skipped following multiple failed attempts to match. Historically, such a process has taken over 30 minutes to complete. More recently, acoustic training can be completed faster, but it still requires the user to read the prepared script for some period of time, attempting to get the speech recognition engine to recognize the spoken words and match them with the prepared script.
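The prompt, compare, and repeat-or-skip flow described above can be illustrated with a minimal sketch. It is not the implementation of any particular speech recognition product. The names `run_training`, `normalize`, `recognize`, and `max_attempts` are hypothetical, the limit of three attempts is an assumed value, and the exact-match comparison on normalized text merely stands in for whatever matching a real engine performs.

```python
import re
from typing import Callable, List, Tuple


def normalize(text: str) -> str:
    """Lowercase the text and drop punctuation so that only words are compared."""
    return " ".join(re.findall(r"[a-z0-9']+", text.lower()))


def run_training(
    script: List[str],
    recognize: Callable[[str, int], str],
    max_attempts: int = 3,  # assumed limit; the source only says "multiple failed attempts"
) -> List[Tuple[str, str, int]]:
    """Walk through a prepared script one phrase at a time.

    `recognize` is a hypothetical stand-in for prompting the user with a
    phrase, capturing audio, and returning the engine's transcription.
    Returns one (phrase, status, attempts) tuple per script phrase.
    """
    results = []
    for phrase in script:
        status = "skipped"
        attempts = 0
        for attempt in range(1, max_attempts + 1):
            attempts = attempt
            heard = recognize(phrase, attempt)
            if normalize(heard) == normalize(phrase):
                status = "accepted"  # match confirmed: move on to the next phrase
                break
            # No match: loop again, i.e. prompt the user to repeat the phrase.
        results.append((phrase, status, attempts))
    return results


def demo_recognizer(phrase: str, attempt: int) -> str:
    """Simulated engine: mishears the first phrase once and never matches the second."""
    if phrase.startswith("The"):
        return "the quick brown fox" if attempt == 2 else "the quick brown box"
    return "jumps over a lazy dog"


results = run_training(
    ["The quick brown fox.", "Jumps over the lazy dog."], demo_recognizer
)
```

In this simulated run the first phrase is accepted on the second attempt and the second phrase is skipped after three failed attempts, mirroring the accept, repeat, and skip outcomes described above.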