1. Field of the Invention
The present invention relates to the field of speech processing and, more particularly, to enhancing a networked speech processing system to add a real-time dictation capability.
2. Description of the Related Art
In speech processing, dictation can be a real-time process of converting live, free-form speech (e.g., a conversation) into text. Users often want the ability to view the text in real time, so that the automatically generated text can be edited before being stored. Desktop speech processing solutions, such as ViaVoice, are able to perform dictation functions as described. These dictation functions require a user to explicitly activate/deactivate a dictation mode. When the dictation mode is active, software dedicated to dictation seizes speech processing resources (i.e., a microphone, a speech recognition engine, and the like) for the duration of the dictation, using a local application program interface (API) specifically designed for dictation. When the user exits the dictation mode, the previously seized resources are released.
Networked speech processing systems, which concurrently handle speech processing tasks for multiple remote clients, currently lack a dictation capability. That is, no current networked speech processing system accepts real-time audio input and produces real-time textual output that is presented prior to speech recognition completion. A few conventional networked systems have a transcription capability, which produces a transcription of speech input during a speech processing session. This transcription is not a real-time text stream, but is generated and provided upon session completion. Other existing networked systems permit client devices to send a sequential set of audio input files, each being processed (e.g., transcribed) to create textual output files, which are sent back to the client device. These systems effectively transcribe in a traditional manner for a "virtual" session that consists of multiple sub-sessions, each sub-session consisting of processing an audio input file and producing a corresponding output file that is sent to the client. The client can automatically parse a transcription session into two or more sub-sessions, each having an input file. The client can also concatenate the output files received from the two or more sub-sessions to create a transcription for the full session consisting of the set of component sub-sessions. Real-time performance is not possible with such a system. That is, conventional implementations of file-based remote transcription technology are unable to accept a real-time input stream of audio and produce real-time text output.
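The file-based "virtual session" scheme described above can be sketched as follows. This is an illustrative simplification, not any vendor's actual API: the function names (transcribe_file, transcribe_session) and the chunking scheme are hypothetical, and the stand-in transcription simply reports the chunk size.

```python
# Hypothetical sketch of the file-based "virtual session" approach described
# above. All names here are illustrative; no real transcription API is assumed.

def transcribe_file(audio_chunk: bytes) -> str:
    """Stand-in for one sub-session: a whole audio file goes in, and a
    text output file comes back only after the entire chunk is processed.
    No partial text is available while the chunk is being recognized."""
    return f"<text for {len(audio_chunk)} bytes>"

def transcribe_session(audio: bytes, chunk_size: int = 4) -> str:
    # The client parses the full session into sub-sessions, one input
    # file per sub-session ...
    sub_sessions = [audio[i:i + chunk_size]
                    for i in range(0, len(audio), chunk_size)]
    # ... then concatenates the per-file outputs into a transcription of
    # the full session. Text appears only after each whole file completes.
    return " ".join(transcribe_file(chunk) for chunk in sub_sessions)
```

Because each sub-session must finish before its output exists, the latency of any given word is bounded below by the length of the file containing it, which is why this scheme cannot deliver a real-time text stream.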
An inability of networked speech processing systems to perform dictation, which local speech processing is able to perform, relates to the underlying infrastructure of networked systems. In a typical networked system, a speech-enabled application, often written in VoiceXML, interacts with a client or a user. The speech-enabled application submits VoiceXML-formatted speech processing requests to a voice server, which uses turn-based speech recognition engines and/or speech synthesis engines to perform the processing tasks. A continuous dictation channel having dedicated resources does not exist. Further, a text output channel, which would be needed to present real-time textual output, is not present in conventional systems.
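The turn-based interaction pattern described above can be sketched as a minimal request/response loop. This is an assumed simplification for illustration only (recognize_turn and dialog are hypothetical names, and real voice servers exchange VoiceXML documents rather than raw bytes); it shows why no incremental text can flow back to the client mid-utterance.

```python
# Hedged sketch of turn-based speech processing: each request is a complete,
# independent turn, mirroring the VoiceXML request/response model described
# above. Names and signatures here are illustrative assumptions.

def recognize_turn(utterance: bytes) -> str:
    """One complete turn: the full utterance must arrive at the voice
    server before any text is produced, so no partial results exist."""
    return f"result({len(utterance)})"

def dialog(utterances: list[bytes]) -> list[str]:
    # Each turn is a separate request to the voice server. There is no
    # continuous dictation channel holding dedicated resources across
    # turns, and no separate text-output channel for incremental results;
    # text is returned only as each turn's complete response.
    return [recognize_turn(u) for u in utterances]
```

The contrast with desktop dictation is that a local engine holds the microphone and recognizer continuously and emits text as it decodes, whereas here every result is gated behind a full round trip.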
Further, a client microphone is continuously being turned on and off as new speech is detected using a speech-activated technology. The speech-activated technology results in some information loss, such as an initial portion of an audio utterance being clipped. Conventional networked speech processing systems have not been able to overcome these problems and remain unable to perform real-time dictation for remote clients.
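The clipping problem noted above can be illustrated with a naive energy-threshold gate, a simplified stand-in (assumed here, not drawn from any particular system) for a speech-activated microphone control. Frames arriving before the energy threshold trips are discarded, so the quiet onset of an utterance is lost.

```python
# Illustrative, simplified model of speech-activated microphone gating.
# The per-frame integer "energy" values and the threshold are assumptions
# for demonstration; real systems use more elaborate voice activity detection.

def gate_frames(frames: list[int], threshold: int = 10) -> list[int]:
    """Forward audio frames only once frame energy exceeds the threshold.
    Frames preceding the trigger are dropped, clipping the utterance onset."""
    active = False
    kept = []
    for energy in frames:
        if energy > threshold:
            active = True  # speech detected; open the gate from here on
        if active:
            kept.append(energy)
    return kept
```

In this model, an utterance whose first frames are low-energy (e.g., a soft consonant leading into a loud vowel) loses those initial frames entirely, which is the information loss the passage describes.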