People typically communicate with each other either verbally, e.g., in face-to-face conversations or via some form of telephone or radio, or in written messages. Traditionally, written communications have been in the form of handwritten or typed notes and letters. More recently, the Internet has made chat and email messages a preferred form of communication.
Telephone systems are designed to convey audio signals that facilitate verbal communications. However, since the recipient of a telephone call is often not available to receive it, voice mail systems have been developed to record verbal messages so that they can be heard by the intended recipient at a later time. Periodically, the intended recipient can access their voice mail system via telephone or cell phone to hear the voice mail messages recorded from telephone calls that they missed. However, a person may need to access several different voice mail accounts at different times during a day. For example, it is not unusual to have a voice mail account for a cell phone, another for a home phone, and yet another for an office phone.
For many people, it would be more convenient to receive all communications in text format rather than having to repeatedly access verbal messages stored in different locations. For verbal messages stored in multiple voice mail accounts, it would thus be easier for a person to receive emails or other forms of text messages that convey the content of the verbal messages, since it would then not be necessary for the person to call each voice mail account and enter the appropriate codes and passwords to access the content of those accounts. Accordingly, it would be desirable to provide an efficient and at least semi-automated mechanism for transcribing verbal messages to text, so that the text can be provided to an intended recipient (or to a program or application programming interface (API) that uses the text). This procedure and system need not be limited to transcribing voice mail messages, but could be applied to transcribing almost any form of verbal communication to corresponding text. Ideally, the system should function so efficiently that the text transcription is available for use within only a few minutes of the verbal message being submitted for transcription.
One approach that might be applied to solve this problem would use fully automated speech recognition (ASR) systems to process any voice or verbal message in order to produce corresponding text. However, even though the accuracy of an ASR program such as Nuance's Dragon Dictate™ program, when trained to recognize the characteristics of a specific speaker's speech patterns, has dramatically improved compared to earlier versions, such programs still have a relatively high error rate when attempting to recognize speech produced by a person for whom the system has not been trained. The accuracy is particularly poor when the speech is not clearly pronounced or when the speaker has a pronounced accent. Accordingly, it is currently generally not possible to rely solely on an automated speech recognition program to provide the transcription needed to solve the problem noted above.
Furthermore, if a service is employed to provide the transcription of verbal messages to text, the queuing of the verbal messages to be transcribed should be efficient and scalable so as to handle a varying demand for the service. The number of verbal messages that a service of this type would be required to transcribe is likely to vary considerably at different times of the day and on weekdays compared to weekends. This type of service can be relatively labor-intensive, since the transcription cannot be provided solely by automated computer programs. Accordingly, the system that provides this type of service must be capable of responding to varying demand levels in an effective and labor-efficient manner. If the demand for transcription exceeds what the transcribers then employed can handle, the system must provide some effective manner in which to balance quality and turnaround time to meet the high demand, so that the system does not completely fail or become unacceptably backlogged. Since a service that uses only manual transcription would be too slow and have too high a labor cost, it would be desirable to use both ASR and manual transcription, to ensure that the text produced is of acceptable quality, with minimal errors.
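The balance between quality and turnaround time described above can be illustrated with a minimal sketch. It is not the method of any particular system; the names `QueueState`, `estimated_wait_minutes`, and `choose_mode`, the assumed editing rate of 15 audio minutes per transcriber per hour, and the 5-minute target are all hypothetical values chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class QueueState:
    """Hypothetical snapshot of the manual-transcription workload."""
    pending_minutes: float          # audio minutes waiting for manual work
    transcribers: int               # transcribers currently available
    minutes_per_hour: float = 15.0  # assumed audio minutes one person handles per hour


def estimated_wait_minutes(state: QueueState) -> float:
    """Estimate how long (in minutes) the current backlog takes to clear."""
    if state.transcribers <= 0:
        return float("inf")
    capacity_per_hour = state.transcribers * state.minutes_per_hour
    return 60.0 * state.pending_minutes / capacity_per_hour


def choose_mode(state: QueueState, target_wait: float = 5.0) -> str:
    """Trade quality for turnaround as the estimated wait grows.

    The thresholds (1x and 3x the target) are illustrative assumptions.
    """
    wait = estimated_wait_minutes(state)
    if wait <= target_wait:
        return "full_manual_edit"     # light load: a person reviews everything
    if wait <= 3 * target_wait:
        return "partial_manual_edit"  # moderate load: review selected portions
    return "asr_only"                 # overload: deliver the ASR text as-is
```

Under these assumptions, a backlog of 10 audio minutes with 10 transcribers yields an estimated wait of 4 minutes, so every message would still receive a full manual edit; larger backlogs degrade gracefully toward ASR-only output instead of letting the queue grow without bound.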
It has been recognized that specific portions of verbal messages tend to be easier to understand than other portions. For example, the initial part of a verbal message and the closing of the message are often spoken more rapidly than the main body of the message, since the speaker puts more thought into the composition of the main body of the message. Accordingly, ASR of the rapidly spoken opening and closing portions of a verbal message may result in more errors in those parts of the message, but fewer errors in the main body of the verbal message. It would be desirable to use a system that takes such considerations into account when determining the portion of the message on which to apply manual editing or transcription, and to provide some automated approach for determining which portions of a message should be manually transcribed relative to those portions that might be acceptable if only automatically transcribed by an ASR program.
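One way such an automated determination might work is sketched below, assuming the ASR program reports a confidence score for each segment of the message. The text above does not specify this mechanism; the `Segment` type, the `select_for_manual` function, the 0.8 threshold, and the optional time budget are hypothetical and serve only to show the idea of routing the least reliable portions to a human.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Segment:
    """Hypothetical ASR output for one portion of a verbal message."""
    start: float       # seconds from the beginning of the message
    end: float         # seconds from the beginning of the message
    text: str          # ASR transcript of this portion
    confidence: float  # assumed ASR confidence score in [0, 1]


def select_for_manual(
    segments: List[Segment],
    threshold: float = 0.8,
    max_manual_seconds: Optional[float] = None,
) -> List[int]:
    """Return indices of segments to send for manual transcription.

    Segments below the confidence threshold are considered, least
    confident first, until the optional manual time budget is used up.
    """
    candidates = sorted(
        (i for i, seg in enumerate(segments) if seg.confidence < threshold),
        key=lambda i: segments[i].confidence,
    )
    chosen: List[int] = []
    used = 0.0
    for i in candidates:
        duration = segments[i].end - segments[i].start
        if max_manual_seconds is not None and used + duration > max_manual_seconds:
            continue  # does not fit in the remaining budget; try shorter ones
        chosen.append(i)
        used += duration
    return sorted(chosen)  # restore message order for the transcriber


message = [
    Segment(0.0, 3.0, "hi this is bob", 0.55),      # rapid opening
    Segment(3.0, 20.0, "main body of the message", 0.92),
    Segment(20.0, 23.0, "call me back bye", 0.60),  # rapid closing
]
```

With the sample `message`, the rapidly spoken opening and closing fall below the threshold and are selected, while the main body is left as ASR text. Passing a time budget, for example when the service is heavily loaded, restricts the selection to the least confident portions that fit.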