One function performed by existing automatic speech recognizers (ASRs) is to transcribe speech to produce a document representing the content of the speech. This process is typically referred to as “dictation,” and the resulting document as a “transcript.” If human speakers naturally spoke in the exact format required for the desired transcript, dictation systems could be designed to produce transcripts by performing verbatim transcription of the speaker's speech. Natural human speech, however, typically cannot be transcribed verbatim to produce the desired transcript, because (among other reasons) such speech often omits information that is needed for the transcript, such as punctuation marks (e.g., periods and commas), formatting information (e.g., boldface and italics), capitalization, and document structure. This problem poses challenges both for human transcriptionists who transcribe speech manually and for those who design automatic dictation systems.
One way to overcome the problem of speech lacking necessary information is to train human speakers to explicitly speak (verbalize) such information when dictating, such as by saying, “new sentence today is october first two thousand and nine period.” Another solution is to design a dictation system which is capable of inserting the missing information, such as punctuation, into the transcript automatically, even when such information was not explicitly verbalized by the speaker. One benefit of the latter approach is that it does not require speakers to learn how to speak in an artificial manner when dictating. Automatic punctuation insertion systems, however, are challenging to design due to the need to enable them to predict the type and location of punctuation marks accurately, automatically, and (in some cases) quickly.
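The first approach above, in which the speaker verbalizes punctuation, can be illustrated with a minimal sketch. The command table and the function name render_verbalized are hypothetical and chosen only for illustration; they do not describe any particular dictation product. The sketch replaces spoken commands with symbols and capitalizes the first word of each sentence, and it deliberately leaves other formatting (such as rendering the spoken date as a written date) untouched.

```python
# Hypothetical table of spoken commands and the symbols they stand for.
# "new sentence" has no symbol; it only marks a capitalization point.
SPOKEN_COMMANDS = {
    "period": ".",
    "comma": ",",
    "question mark": "?",
    "new sentence": "",
}


def render_verbalized(utterance: str) -> str:
    """Replace verbalized punctuation commands with written symbols."""
    tokens = utterance.lower().split()
    pieces = []
    capitalize_next = True
    i = 0
    while i < len(tokens):
        # Prefer a two-word command ("new sentence") over a one-word one.
        pair = " ".join(tokens[i:i + 2])
        if i + 1 < len(tokens) and pair in SPOKEN_COMMANDS:
            command, i = pair, i + 2
        elif tokens[i] in SPOKEN_COMMANDS:
            command, i = tokens[i], i + 1
        else:
            command, word, i = None, tokens[i], i + 1

        if command is None:
            # Ordinary word: capitalize it if it starts a sentence.
            pieces.append(word.capitalize() if capitalize_next else word)
            capitalize_next = False
        elif command == "new sentence":
            capitalize_next = True
        else:
            # Attach the symbol to the preceding word, if there is one.
            symbol = SPOKEN_COMMANDS[command]
            if pieces:
                pieces[-1] += symbol
            capitalize_next = symbol in ".?"
    return " ".join(pieces)


print(render_verbalized(
    "new sentence today is october first two thousand and nine period"
))
```

A table-driven renderer of this kind treats every occurrence of a command word as a command, so a content word such as "period" in a phrase like "a period of rest" would be misread. That ambiguity is one reason the speaker must adopt an artificial speaking style for this approach to work.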
Consistent and accurate prediction and insertion of all punctuation (both verbalized and non-verbalized) in transcripts of conversational speech is critical to many tasks involving automatic speech recognition. In particular, accurate phrase and sentence segmentation is needed for speech-to-speech translation, parsing, and rendering of transcribed speech into written language.
For example, there is a particularly strong need to accurately predict punctuation when transcribing medical dictation. Physicians are accustomed to documenting their patient encounters and the medical procedures they have performed by dictating a report using conversational speech. They assume that a human medical transcriptionist will listen to the dictation and clean it up, such as by correcting non-grammatical and incomplete phrases, and by inserting non-verbalized punctuation symbols where appropriate. Because doctors need to dictate a high volume of repetitive reports under tight time constraints, they often speak relatively quickly and without including discernible pauses or other prosodic cues in place of non-verbalized punctuation. In short, the lack of explicitly verbalized punctuation in medical dictation creates a strong need for accurate punctuation prediction when transcribing such dictation, and yet the features of physicians' speech make it particularly challenging to predict such punctuation accurately.
Existing approaches to predicting non-verbalized punctuation typically perform such prediction in a post-processing step, after the completion of speech decoding, either using the generated best-scoring hypothesis or the word lattice as input, sometimes including acoustic and/or prosodic features. For example, Stolcke et al. (“Combining Words and Speech Prosody for Automatic Topic Segmentation,” Proceedings of DARPA Broadcast News Transcription and Understanding Workshop, 1999) have tried to make use of prosodic cues extracted from the spoken data by extracting pause durations which may indicate sentence boundaries, thus providing evidence for non-verbalized periods. As another example, Hirschberg and Nakatani (“Acoustic Indicators of Topic Segmentation,” Proceedings of ICSLP, 1998) also made use of various acoustic/prosodic features in order to carry out topic and phrase boundary identification. In contrast, Gotoh and Renals (“Sentence Boundary Detection in Broadcast Speech Transcripts,” Proceedings of the International Speech Communication Association Workshop: Automatic Speech Recognition: Challenges for the New Millenium, Paris, September, 2000) have tried to identify sentence boundaries in broadcast speech using statistical finite state models derived from news transcripts and speech recognizer outputs. They claim that their work is a step towards the production of structured speech transcriptions which may include punctuation or content annotation. Ramabhadran et al. (“The IBM 2006 Speech Transcription System for European Parliamentary Speeches,” Proceedings of the International Conference on Spoken Language Processing, 2006) rely exclusively on prosodic cues for predicting non-verbalized punctuation as part of a transcription system for parliamentary speeches. Common to all of these approaches is the need for separate punctuation prediction models that are applied in a second pass after the initial decoding of speech recordings in a first pass over the data.
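The general shape of such a second-pass, pause-based predictor can be sketched as follows. This is an illustrative simplification and not the method of any of the works cited above: it assumes that the first decoding pass has already produced words with start and end times, and it uses a single pause threshold (the 0.5 second default is an arbitrary assumption) as the only evidence for a non-verbalized period. The name insert_periods_by_pause and the Word tuple layout are hypothetical.

```python
from typing import List, Tuple

# Assumed first-pass output: (text, start_seconds, end_seconds) per word.
Word = Tuple[str, float, float]


def insert_periods_by_pause(words: List[Word], min_pause: float = 0.5) -> str:
    """Second pass: add a period wherever the silence after a word is long."""
    out = []
    for idx, (text, _start, end) in enumerate(words):
        is_last = idx == len(words) - 1
        # Silence between the end of this word and the start of the next.
        pause = 0.0 if is_last else words[idx + 1][1] - end
        if is_last or pause >= min_pause:
            out.append(text + ".")
        else:
            out.append(text)
    return " ".join(out)


decoded = [
    ("patient", 0.0, 0.4),
    ("is", 0.45, 0.55),
    ("stable", 0.6, 1.0),
    ("continue", 1.8, 2.3),
    ("medication", 2.35, 3.0),
]
print(insert_periods_by_pause(decoded))
```

Because the predictor runs only after decoding has finished, it cannot influence which words the decoder chooses. It also finds no boundary at all when the speaker does not pause, which is the situation described above for rapid medical dictation.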
Such techniques have a variety of limitations. First, the use of a two-pass process, in which speech is first decoded and punctuation is then predicted as a post-process, makes punctuation prediction too slow for real-time applications. Second, such techniques typically use two language models, one for speech decoding and one for punctuation prediction. Creating and maintaining such separate language models can be time-consuming and expensive. Furthermore, the use of separate language models for speech decoding and punctuation prediction can lead to inaccuracies in both speech decoding and punctuation prediction.
What is needed, therefore, are improved techniques for predicting non-verbalized punctuation symbols and other tokens in speech for use in producing transcripts of such speech.