The discussion below is merely provided for general background information and is not intended to be used as an aid in determining the scope of the claimed subject matter.
Small computing devices such as personal digital assistants (PDA), devices and portable phones are used with ever increasing frequency by people in their day-to-day activities. With the increase in processing power now available for microprocessors used to run these devices, the functionality of these devices is increasing, and in some cases, merging. For instance, many portable phones now can be used to access and browse the Internet as well as can be used to store personal information such as addresses, phone numbers and the like.
In view that these computing devices are being used with increasing frequency, it is therefore necessary to provide an easy interface for the user to enter information into the computing device. Unfortunately, due to the desire to keep these devices as small as possible in order that they are easily carried, conventional keyboards having all the letters of the alphabet as isolated buttons are usually not possible due to the limited surface area available on the housings of the computing devices. Even beyond the example of small computing devices, there is interest in providing a more convenient interface for all types of computing devices.
To address this problem, there has been increased interest and adoption of using voice or speech to access information, whether locally on the computing device, over a local network, or over a wide area network such as the Internet. With speech recognition, a dialog interaction is generally conducted between the user and the computing device. The user receives information typically audibly and/or visually, while responding audibly to prompts or issuing commands.
Generally, a speech recognition system uses various modules, such as an acoustic model and a language model as is well known in the art, to process the input utterance. Either general purpose models, or application specific models can be used, if, for instance, the application is well-defined. In many cases though, tuning of the speech recognition system, and more particularly, adjustment of the models is necessary to ensure that the speech recognition system functions effectively for the user group that it is intended. Once the system is deployed, it may be very helpful to capture, transcribe and analyze real spoken utterances in order that the speech recognition system can be tuned for optimal performance. For instance, language model tuning can increase the coverage of the system, while removing unnecessary words so as to improve system response and accuracy. Likewise, acoustic model tuning focuses on conducting experiments to determine improvement in search, confidence and acoustic parameters to increase accuracy and/or speed of the speech recognition system.
As indicated above, transcription of recorded speech data collected from the field provides a means for evaluating system performance and to train data modules. Literally, current practices require a data transcriber/operator to listen to utterances and then type or otherwise associate a transcription of the utterance for each utterance. For instance, in a call transfer system, the utterances can be names of individuals or departments the caller is trying to reach. The transcriber would listen to each utterance and transcribe each request, possibly by accessing a list of known names. Transcription is time consuming and thus, an expensive process. In addition, transcription is also error-prone, particularly for utterances comprising less common words or phrases.