1. Field of the Invention
This invention relates to computer voice recognition enhancements. It explains methodologies for measuring reliability, accuracy, and performance (expressed as system responsiveness) using a standardized method of measurement. The invention introduces a method of machine-independent user mobility between different voice recognition systems. It addresses a method for enabling speaker-independent voice recognition for masses of people without the need for training or enrollment. It describes how to apply the technology to a new style of interactive, real-time, voice-to-text handheld transcriber with visual feedback, replacing previous handheld transcribers that are only recording devices. It also describes using these techniques in a system that translates voice mail audio into readable text messages.
2. Field of the Related Art
Prior to voice recognition dictation, transcription was done entirely by humans. Some inventions improved this situation by making it faster and easier for the transcriptionist to work with the audio file that needed to be transcribed. An example is U.S. Pat. No. 6,175,822 Bryce Alan Jones (Method and System for Providing Network Based Transcription Services), where the audio file is captured at one location, sent over the Internet, and played back to a transcriptionist at a second location, removing the requirement that the transcriptionist be at the location where the dictation was taking place. Over time, features were added to the handling of audio files, including parallel processing using speech recognition. An example of this is U.S. Pat. No. 6,073,103 Dunn et al. (Display Accessory for Record Playback System), which describes how to combine audio voice input and speech recognition applications to identify numbers in the voice audio files. This gives a user the ability to index into the audio where the numbers are located. Another added feature was the ability to capture audio while the speech recognition was turned off, to avoid losing any of the spoken words, as described in U.S. Pat. No. 6,415,258 Reynar et al. (Background Audio Recovery System).
In general terms, however, voice recognition dictation products presently on the market follow the typical clone PC market strategy. The state of the art is to buy a personal computer designed as a general purpose computing device, install voice recognition software (e.g., IBM ViaVoice, L&H Voice Express, Philips Speech Pro, or Dragon Naturally Speaking from Dragon Systems), and use that configuration as a Large Vocabulary Voice Recognition dictation system. When using Large Vocabulary Voice Recognition (LVVR) applications in the clone PC environment, two problems are experienced: machine dependency and speaker dependency. While this approach is typical throughout the computer industry, it often leaves users frustrated with the accuracy and performance of the voice recognition applications.
This is especially true when applying the technology to handheld transcriber-type devices such as a tape recorder or digital voice recorder. The industry standard for handheld dictation is to use handheld tape recorders or memory devices that provide the same functionality as tape recorders, i.e. a handheld digital recorder. Voice recognition software packages support connections from these handheld devices to desktop computers, allowing the voice to be translated into text through a voice recognition package such as IBM's ViaVoice. These approaches have many problems: there is no direct feedback while the dictation is taking place; the recognition is not real-time large vocabulary voice recognition; training for the voice recognition is cumbersome, resulting in poor accuracy and user frustration; and training requires redundant work, since a voice model separate from the desktop speaker voice files is needed. Moreover, updating the voice parameters and training is typically impossible or very difficult, so the accuracy level does not improve over time. Lastly, a separate physical connection to the dictation device is needed to accomplish the translation to text, with little to no control over the text output until the entire recorded voice has been transferred and translated into text.
Voice recognition dictation systems require training sessions to enable the system to identify the words a person is speaking. The process of training a voice recognition system creates speaker voice files, or a “Voice Model”. A “Voice Model” is defined here as a signal, information, or electronic data file containing information and/or parameters that represent a person's voice or a noise. A Voice Model contains attributes that characterize specific speaking items such as formants, phonemes, speaking rate, pause length, acoustic models, unique vocabularies, etc. for a given user. One use for a voice model that contains data and parameters of a specific user is that it allows the user to take advantage of Large Vocabulary Voice Recognition (LVVR) dictation applications. All approaches to LVVR (e.g. acoustic phonetic, pattern recognition, artificial intelligence, neural networks, etc.) require some training. Training is required to create a reference pattern from which decisions are made, using templates or statistical models (e.g. Markov Models and Hidden Markov Models), as to the probability that a spoken word corresponds to a given displayed text word. When using Large Vocabulary Voice Recognition applications, training allows the software to identify words given the unique characteristics of a specific person's speech. Since training can be a time-consuming and ongoing task, and typically results in speaker dependency, other inventions have avoided confronting the training and speaker voice model issues that must be solved to accomplish speaker independence and/or mobility between voice recognition dictation systems. As an example, the problem was described in U.S. Pat. No. 5,822,727 Garberg et al. (Method for Automatic Speech Recognition and Telephony), where voice recognition training is accomplished using sub-words of a current speaker compared with templates for a plurality of speakers.
The Garberg patent recognizes that there is a need for a more convenient and thorough process for building a database of sub-word transcriptions and a database using speaker-independent templates.
U.S. Pat. No. 6,477,491 Chandler et al. describes the need for training in voice recognition applications but does not provide any specific means to accomplish this task; it focuses instead on identifying a specific person by the specific microphone they are speaking into.
Therefore, it is generally accepted that the upfront training needed to gain an acceptable level of accuracy and system responsiveness requires time and effort as the system learns a specific user. This investment of time and effort is a per-machine cost, adding to machine dependency. Training a voice recognition system results in a specific voice-to-text translation accuracy achieved within a given time, with that time indicating system responsiveness (performance). When trying to determine and obtain the highest level of system accuracy and performance, one can spend much effort, time, and money trying to determine the best options (performance and accuracy versus components, effort, and cost). This has led to frustration and wasted funds, with the result that the speech recognition system is left sitting on the shelf or discarded.
Many professional people use more than one computer to accomplish their daily tasks. When more than one computer is used for voice recognition, accuracy and performance may not be consistent, due to the different levels of training accomplished on each system. This was discovered through experimentation with voice recognition packages and was verified in talking with doctors, lawyers, and other professionals who use speech recognition. These users estimated their accuracy at, for example, 94 percent, but all acknowledged that they did not know what the accuracy actually was. Other statements described how accuracy would vary when using an assortment of machines for voice dictation.
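To illustrate why such estimates are hard to make by impression alone, the following is a minimal sketch of one conventional way to put a number on dictation accuracy: counting word-level substitutions, insertions, and deletions between a known reference script and the recognized text. The background above does not specify a measurement procedure; the function names `word_edit_distance` and `word_accuracy` and the sample sentences are hypothetical and are offered only as an assumption about how such a figure could be computed.

```python
def word_edit_distance(reference, hypothesis):
    """Minimum number of word substitutions, insertions, and deletions
    needed to turn the reference word list into the hypothesis word list."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,           # reference word dropped
                            curr[j - 1] + 1,       # extra word recognized
                            prev[j - 1] + cost))   # match or substitution
        prev = curr
    return prev[-1]


def word_accuracy(reference_text, recognized_text):
    """Fraction of reference words recognized correctly (assumed metric:
    1 minus word errors divided by reference length, floored at 0)."""
    reference = reference_text.lower().split()
    hypothesis = recognized_text.lower().split()
    if not reference:
        raise ValueError("reference text must contain at least one word")
    errors = word_edit_distance(reference, hypothesis)
    return max(0.0, 1.0 - errors / len(reference))


# Hypothetical dictation script and the text one machine produced for it.
script = "the patient has a mild fever"
recognized = "the patient has mild fever"
print(f"accuracy: {word_accuracy(script, recognized):.1%}")
```

Running the same script on each machine a user dictates to would give directly comparable figures, which is the kind of consistency check the users quoted above were unable to make.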
This invention aims to resolve the specific problems of measuring standard performance and standard accuracy, machine dependency, speaker dependency, and mobility, and to provide methods of accurately estimating cost for users and manufacturers.