Field
The technology of the present application relates generally to speech recognition systems, and more particularly, to apparatuses and methods for determining training resources in a speech to text center.
Background
Natural language or continuous speech recognition and speech to text engines are becoming ubiquitous for the generation of text from user audio. Exemplary natural language speech to text engines are available from companies such as Microsoft Corporation, International Business Machines Corporation, and Nuance Communications, Inc., to name but three companies with speech recognition engines. The recognized text may be used to generate word processing documents, such as, for example, this patent application, or to populate fields in a user interface, database, or the like, such as, for example, the data fields in a customer relationship management application usable with a call center. The use of speech recognition in applications such as, for example, customer relationship management applications, legal applications, accounting applications, and medical applications is particularly beneficial, as those services generally are document intensive and the service providers are rarely experts in typing or the like.
The focus of natural language systems is to match the utterance to a likely vocabulary and phraseology and to determine how likely it is that the sequence of language symbols would appear in speech. The component that determines the likelihood of a particular sequence of language symbols is generally called a language model. The language model provides a powerful statistical model to direct a word search based on predecessor words for a span of n words. Thus, the language model uses probability and statistics to select the more likely word from among words with similar utterances. For example, the words “see” and “sea” are pronounced substantially the same in the United States of America. Using a language model, the speech recognition engine would populate the phrase “Ships sail on the sea” correctly because the probability indicates the word “sea” is more likely to follow the earlier words in the sentence. The mathematics behind the natural language speech recognition system is conventionally known as the hidden Markov model. The hidden Markov model is a system that predicts the value of the next state based on the previous states in the system and the limited number of choices available. The details of the hidden Markov model are reasonably well known in the speech recognition industry and will not be further described herein.
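By way of illustration only, the homophone example above may be sketched as a simple bigram (n = 2) language model. The sketch is not the language model of any particular commercial engine; the toy corpus, the add-one smoothing, and the function names train_bigrams, bigram_probability, and pick_homophone are all assumptions introduced here for the example.

```python
from collections import Counter

# Hypothetical toy corpus; a real engine would be trained on far more text.
CORPUS = [
    "ships sail on the sea",
    "the sea is calm today",
    "fish live in the sea",
    "i see the ships",
    "we see the harbor",
    "can you see the boat",
]


def train_bigrams(sentences):
    """Count how often each word follows each predecessor word."""
    contexts = Counter()  # times a word appears as a predecessor
    bigrams = Counter()   # times the pair (predecessor, word) appears
    vocab = set()
    for sentence in sentences:
        words = ["<s>"] + sentence.split()  # <s> marks the sentence start
        vocab.update(words)
        contexts.update(words[:-1])
        bigrams.update(zip(words, words[1:]))
    return contexts, bigrams, vocab


def bigram_probability(prev, word, contexts, bigrams, vocab):
    """Estimate P(word | prev) with add-one smoothing (an assumption here)."""
    return (bigrams[(prev, word)] + 1) / (contexts[prev] + len(vocab))


def pick_homophone(prev, candidates, contexts, bigrams, vocab):
    """Return the candidate most likely to follow the predecessor word."""
    return max(
        candidates,
        key=lambda w: bigram_probability(prev, w, contexts, bigrams, vocab),
    )


CONTEXTS, BIGRAMS, VOCAB = train_bigrams(CORPUS)

# "Ships sail on the ___": the predecessor word is "the".
print(pick_homophone("the", ["see", "sea"], CONTEXTS, BIGRAMS, VOCAB))  # sea
```

In this toy corpus, “sea” follows “the” three times and “see” never does, so the model selects “sea”. A production language model would generally use a longer span of predecessor words and a more refined smoothing method.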
Conventionally, speech recognition systems are machine specific. The machine includes the language model, speech recognition engine, and user profile for the user (or users) of the machine. These conventional speech recognition engines may be considered thick or fat clients where the bulk of the processing is accomplished on the local machine. More recently, companies such as nVoq located in Boulder, Colo., have developed technology to provide a distributed speech recognition system using the Cloud. In these cases, the audio file of the user is streamed or batched to a remote processor from a local device. The remote processor performs the conversion (speech to text or text to speech) and returns the converted file to the user. For example, a user at a desktop computer may produce an audio file that is sent to a speech to text device that returns a Word document to the desktop. In another example, a user on a mobile device may transmit a text message to a text to speech device that returns an audio file that is played through the speakers on the mobile device.
While dictation to generate text for documents, a clipboard, or fields in a database is reasonably common, these uses all suffer from the same drawback in that the most robust systems require the speech to text engine to be trained to the individual using the speech to text engine. The initial training of a natural language speech recognition engine generally uses a number of “known” words and phrases that the user dictates. The statistical algorithms are modified to match the user's speech patterns. Subsequently, the speech recognition engine may be further individualized by corrections the user enters in transcripts when the speech is transcribed incorrectly.
While significantly more robust, natural language speech recognition engines generally require training to a particular user's speech patterns, dialect, etc., to function properly. The training is often time consuming and tedious. However, natural language speech recognition engines that are not properly trained frequently make mistakes, causing frustration and inefficiency for the users. In some cases, this may lead the user to discontinue use of the natural language speech recognition engine. Thus, many industries seeking to use speech recognition need to determine training programs that provide sufficient training (of both the speech recognition engine and the individual using the speech recognition engine) such that the system is used properly to avoid frustration and inefficiencies, but not too much training, which is time consuming, tedious, and a waste of scarce resources. Conventionally, speech recognition engine training is conducted pursuant to an accepted protocol. However, little regard is given to whether the training is sufficient to provide an acceptable level of accuracy.
Thus, against this background, it is desirable to develop improved apparatuses and methods for managing resources for a system using voice recognition.