Most of the linguistic information people communicate is in the form of speech, and most people can speak much faster than they can communicate linguistic information by any other means. Yet most people can read much faster than they can listen to speech, even if the speech is recorded and artificially sped up. And whereas recorded textual information can be visually scanned and searched with great ease and rapidity, searching or scanning recorded speech is painfully tedious, a discrepancy exacerbated by today's networked computer systems, which make it possible to search enormous quantities of textual data in an instant, but cannot begin to penetrate voice data. Compared to speech, text is also far easier to edit, organize, and process in many other ways.
Accurate, affordable, and rapid speech transcription could bridge the advantages of speech with the advantages of text; however, no existing solution meets all three of these criteria. Trained human dictation typists set the standard for accuracy, but they are slow and expensive. Automatic speech recognizers are the most affordable, but their accuracy for normal conversational speech of most speakers in most situations is, in the current state of the art, unacceptably low for most purposes. Trained human voicewriters substitute their clearly enunciated speech as input to automatic speech recognizers, and correct the remaining errors in the output, thereby matching the accuracy of typists while retaining much of the speed of automatic speech recognition; but trained voicewriters are even more expensive than dictation typists.
Highly trained voicewriters are typically employed for dictation transcription and in formal public situations, such as court reporting and public speech transcription, where real-time transcription is essential and the source speech is already well-enunciated and easily understood on first hearing. In these situations, they can generally keep up with the speech in real time, though a second pass through the transcript is generally needed to correct errors introduced by the automatic speech recognizer.
For everyday conversational speech, particularly telephone speech, multiple hearings are commonly required, not only for typists and less-skilled voicewriters, who can rarely keep up with the pace of the original speech, but even for highly skilled voicewriters. This is due to a number of factors, including signal degradation issues such as bandpass filtering, line noise, and codec artifacts; enunciation issues such as mumbling, whispering, slurring, and clipping; pronunciation issues such as stutters, splutters, hems and haws, spoonerisms, and other phonological speech errors; lexical issues such as colloquialisms, localisms, slang, private vocabulary, and euphemisms; syntactic issues such as false starts and repetitions, vacuous filler phrases, and incoherence; and pragmatic issues such as presuppositions, interruptions, and talking over one another.
Whenever a single hearing is insufficient for accurate transcription, whether because of speed or unintelligibility, transcribers spend an inordinate amount of time rewinding the audio recording, searching for an appropriate starting point to provide sufficient context, and replaying the audio.
Some systems can adjust the playback tempo without affecting the pitch or formants of the speech, which can alleviate a speed mismatch between the original speech and the transcriber, but only on average, since any constant playback tempo still tends to alternate between being too slow and too fast for the transcriber's capabilities.
Some systems provide a foot pedal to permit a human transcriber to pace the playback by controlling such parameters as the playback tempo, duration, and repetition, leaving the transcriber's hands free for typing, but this requires additional effort on the part of the transcriber, as well as additional fault-prone electromechanical hardware.
Some specific examples in the prior art include:
U.S. Pat. No. 4,207,440, in which Schiffman describes the basic technique of having a transcriptionist manually control the playback speed of recorded speech in order to control its pace. Schiffman does not anticipate any kind of automated dynamic rate control of the speech to be transcribed, nor does it link the playback rate to the transcriptionist's output.
U.S. Pat. No. 6,490,553, in which Van Thong et al. describe a system which uses voice recognition of the input speech and alters its playback rate to meet a target rate. Van Thong does not perform any examination of the transcriptionist's output and does not attempt to control the playback rate to dynamically adapt to the pace at which the transcriptionist is working.
U.S. Pat. No. 6,952,673, in which Amir et al. describe a system which can change the playback speed of input speech based on the typing speed of a transcriptionist. Amir's “rate” is based on monitoring the transcriptionist's typing pace in terms of characters, keystrokes, and words per unit time and then attempting to relate that typing rate back to the input speech. Amir does not perform any kind of acoustic alignment between a synthesized version of the transcript and the input speech, nor does he anticipate using an acoustic alignment between a voicewriter's speech and the input speech.
U.S. Pat. No. 7,412,378, in which Lewis et al. describe a system to dynamically match a speech output rate to a user's spoken speech input rate. Lewis assumes either an existing alternative text for the recorded speech, or the use of a transcription server to determine the number of words in the recorded speech in order to know its rate. The purpose of this invention is to efficiently create a transcription of recorded speech where no transcription or alternative text for it exists.
This invention does not rely on knowing the number of words in the recorded speech; instead, it performs an acoustic alignment between the recorded speech and either a synthesized version of the ongoing transcription output or the audio of the voicewriter's input speech.
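By way of illustration only, the kind of acoustic alignment referred to above can be sketched with open-ended dynamic time warping (DTW). The text here does not name a particular alignment algorithm, so the choice of DTW, the use of one scalar feature per audio frame, and the function name `open_end_dtw_position` are all assumptions made for this sketch. It estimates how far into the recorded speech the transcriber's audio (synthesized transcript or voicewriter speech) has progressed.

```python
from typing import Sequence


def open_end_dtw_position(
    transcript_feats: Sequence[float],
    recording_feats: Sequence[float],
) -> int:
    """Return the recording frame index at which the transcript-side audio best ends.

    Hypothetical helper: both inputs are assumed to be per-frame scalar
    acoustic features (a real system would likely use feature vectors).
    """
    n, m = len(transcript_feats), len(recording_feats)
    if n == 0 or m == 0:
        raise ValueError("both feature sequences must be non-empty")

    inf = float("inf")
    # cost[i][j]: minimal cumulative distance aligning
    # transcript_feats[:i+1] with recording_feats[:j+1].
    cost = [[inf] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            dist = abs(transcript_feats[i] - recording_feats[j])
            if i == 0 and j == 0:
                best_prev = 0.0
            else:
                best_prev = min(
                    cost[i - 1][j] if i > 0 else inf,
                    cost[i][j - 1] if j > 0 else inf,
                    cost[i - 1][j - 1] if i > 0 and j > 0 else inf,
                )
            cost[i][j] = dist + best_prev

    # Open end: the whole transcript must be consumed, but it may match
    # any prefix of the recording, so take the cheapest cell in the last row.
    last_row = cost[n - 1]
    return min(range(m), key=lambda j: last_row[j])
```

In this sketch the returned index marks the point in the recording that the transcriber has reached, without counting words on either side; the gap between that index and the current playback position is one quantity a playback-rate controller could act on.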