Computer-based speech processing technology currently is employed in two broad categories. The first of these categories is speech synthesis, which is the ability to respond to a user-activated input to generate or synthesize spoken language from text or from a stored digital speech representation. The second of these categories is that of speech recognition, which allows a computer to accept and process spoken language. It is this latter technology to which this application is directed.
Speech recognition, or speech-to-text, involves the capturing and digitizing of sound waves, converting them to basic language units or phonemes, constructing words from the phonemes, and contextually analyzing the words to ensure correct spelling for words that sound alike (such as “write” and “right”). The determination of spelling from such sound-alike words is based upon the context of the preceding input in most cases.
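The context-based choice between sound-alike words described above can be illustrated with a minimal sketch. It assumes a small table of word-pair counts; the table, its values, and the function name pick_spelling are hypothetical and are not taken from any particular speech recognition engine.

```python
# Hypothetical counts of how often a candidate spelling follows a given
# preceding word. A real engine would use a far larger language model.
PAIR_COUNTS = {
    ("to", "write"): 50,
    ("to", "right"): 3,
    ("turn", "right"): 40,
    ("turn", "write"): 0,
}

# Words that sound alike, keyed by either spelling.
HOMOPHONES = {
    "write": ("write", "right"),
    "right": ("write", "right"),
}


def pick_spelling(previous_word, heard_word):
    """Choose the spelling that most often follows previous_word."""
    candidates = HOMOPHONES.get(heard_word, (heard_word,))
    return max(candidates, key=lambda c: PAIR_COUNTS.get((previous_word, c), 0))


print(pick_spelling("to", "right"))    # chooses between "write" and "right"
print(pick_spelling("turn", "write"))
```

In this sketch only the single preceding word is consulted; the passage above indicates that the preceding input more generally is used.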
The speech recognition process takes place through speech recognition application software, also frequently referred to as “speech recognition engines”. Such software or speech recognition engines convert the acoustical signals resulting from spoken words to digital signals, and then deliver the recognized speech as text to the computer display monitor or printer. To accomplish this, such applications, which are produced by several different software sources, utilize a microphone into which the user speaks. For example, a user may speak the words “what time is it?”. The microphone captures the sound waves and generates electrical impulses or analog signals corresponding to these sound waves. The analog signals then are supplied to a sound card, which converts them to digital signals recognizable by a computer, such as a personal computer (PC).
The speech recognition application or speech recognition engine converts the digital signals to phonemes, and then from there, into words. For example, the phonemes for the present example “what time is it?” would be “w aa td t aym ih s ih it”. The speech recognition software then processes these phonemes as words to produce “what time is it”. It should be noted in this example, that the question mark (?) is not present in the finally produced conversion from speech to text, since the software cannot recognize implied punctuation or, without separate commands, other types of punctuation.
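The step from phonemes to words can be sketched as a lookup in a pronunciation lexicon. The following is only an illustration: the grouping of the example phonemes into words is an assumed reading of the phoneme string given above, and the lexicon and the function name phonemes_to_words are hypothetical.

```python
# Hypothetical pronunciation lexicon: phoneme sequence -> word.
# The grouping below is an assumed reading of "w aa td t aym ih s ih it".
LEXICON = {
    ("w", "aa", "td"): "what",
    ("t", "aym"): "time",
    ("ih", "s"): "is",
    ("ih", "it"): "it",
}
MAX_LEN = max(len(key) for key in LEXICON)


def phonemes_to_words(phoneme_string):
    """Greedily match the longest known phoneme sequence at each position."""
    phonemes = phoneme_string.split()
    words = []
    i = 0
    while i < len(phonemes):
        # Try the longest possible sequence first, then shorter ones.
        for n in range(min(MAX_LEN, len(phonemes) - i), 0, -1):
            word = LEXICON.get(tuple(phonemes[i:i + n]))
            if word is not None:
                words.append(word)
                i += n
                break
        else:
            raise ValueError(f"no word found at phoneme {phonemes[i]!r}")
    return " ".join(words)


print(phonemes_to_words("w aa td t aym ih s ih it"))
```

As in the example above, the result carries no question mark, because nothing in the phoneme sequence represents punctuation.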
Currently, most speech recognition application software supports continuous speech, meaning that the user can speak naturally into a microphone at the speed of most conversation. Two such systems which operate in this manner are the IBM VIA VOICE™ produced by IBM Corporation, and DRAGON NATURALLY SPEAKING™ produced by Lernout & Hauspie Speech Products. Prior to such continuous speech recognizers, isolated or discrete speech recognizers required the user to pause after each word, which is a very cumbersome and unnatural way of speaking. Such systems currently are being replaced by continuous speech engines or continuous speech recognition applications of the type mentioned above for the DRAGON NATURALLY SPEAKING™ and VIA VOICE™ systems.
Continuous speech recognition engines currently support two different modes of speech recognition. The first of these modes is the dictation mode, in which the user enters data by speaking directly through the microphone, into the computer, in the continuous manner mentioned above. The other mode is the command and control mode, in which the user initiates computer operations by speaking commands or asking questions. The dictation mode allows the user to dictate memos, letters, e-mail messages and the like, as well as to enter data using a speech recognition dictation engine or speech recognition application. The possibilities for what can be recognized are limited, however, by the size of the recognition application's dictionary of words, or its “grammar”.
Most recognizers which support a dictation mode are speaker-dependent, meaning that the accuracy varies on the basis of the user's speaking patterns and accent. To ensure accurate recognition, the application must create or access a “speaker profile” that includes a detailed map of the user's speech patterns used in the matching process during recognition. In order to accomplish this, such speech recognition applications or speech engines employ an initial training mode, in which the user reads a specific text into the computer, and the user's speech is stored in the form of the phonemes which that user produces for that text. Since the application “knows” the precise words used in the training text, the correspondence between the speaker's accent and manner of speaking and those words is used to create the basic stored library of sounds, from which the user's sounds are translated into the various words of the dictionary used in the system. This stored library forms the “speaker profile” used in the matching process during recognition of subsequent dictation.
The command and control mode of operation allows the simplest implementation of a speech interface in an existing application. In the command and control mode, the grammar (or list of recognized words) is limited to a relatively short list of available commands. This is of much more finite scope than what is required for the list or dictionary for continuous dictation, which must encompass nearly the entire dictionary in any particular language. As a consequence, the command and control mode allows more accurate performance and reduces the processing overhead (memory, for example) required by the application. The limited grammar or dictionary needed for the command and control mode also enables speaker-independent processing, eliminating the need for speaker profiles or “training” of the speech recognition application or recognizer. The purpose of speech recognition software and systems is to reduce reliance upon the traditional data entry of information to a computer via a keyboard, or to allow the computer to be provided with inputs in environments where a keyboard is impractical, such as in small mobile devices, or in mobile phones, for example.
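The effect of a short command grammar can be sketched as follows. This is a minimal illustration only: text similarity from the Python standard library module difflib stands in for acoustic matching, and the command list and the function name recognize_command are hypothetical.

```python
import difflib

# Hypothetical command grammar: the complete list of phrases the
# recognizer is allowed to return in command and control mode.
COMMAND_GRAMMAR = ["open file", "save file", "print", "delete line"]


def recognize_command(heard, cutoff=0.6):
    """Return the closest command in the grammar, or None if none is close.

    Text similarity is used here as a stand-in for acoustic matching.
    """
    matches = difflib.get_close_matches(heard, COMMAND_GRAMMAR, n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(recognize_command("open fil"))                  # an imperfectly heard command
print(recognize_command("the weather was pleasant"))  # not in the grammar
```

Because only a few candidates exist, an imperfectly captured phrase can still be resolved to the one command it most resembles, while speech outside the grammar is rejected.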
From the foregoing, it is apparent that the more “human” the voice recognition system and method can be made, the more likely users are to utilize such a system for performing various computer tasks, including dictation of text such as letters and documents such as the present one, as well as educational and entertainment applications via the worldwide web and the like. Even though relatively sophisticated voice recognition application software systems, such as the two mentioned above, presently exist, a problem which arises in naturally flowing speech is that of recognizing the command and control functions or instructions, and properly executing them in the flow of dictation which is to be translated into words or text. In order to do this, it is necessary to provide a way for the speech recognition application to differentiate and recognize command instructions or words and separate them from dictation.
To allow a person to speak commands in the middle of dictation, another routine is added to the probability checker of such applications to spot words and phrases which have a low probability score for the current sentence or text material being dictated. This phrase is then compared with a list of known command phrases and checked for a match. If one is found, then a command function is carried out. A problem with operating in this manner, however, is that even though it may be rare, the word or phrase may very well be a part of a flowing dictated text; and to carry out a command, instead of transforming the spoken sounds into a printed or text word, creates an error in the final document.
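The routine described above, together with the error it can produce, can be sketched in a few lines. The probability scores, the threshold, the command list, and the function name split_dictation are all hypothetical; an actual probability checker would compute the scores from the dictated material.

```python
# Hypothetical list of known command phrases and score threshold.
KNOWN_COMMANDS = {"delete that", "new paragraph"}
LOW_SCORE = 0.2


def split_dictation(scored_phrases):
    """Separate phrases into dictated text and commands.

    scored_phrases is a list of (phrase, probability score) pairs, where
    the score says how well the phrase fits the current sentence.
    """
    text, commands = [], []
    for phrase, score in scored_phrases:
        # A low-scoring phrase that matches a known command is executed
        # as a command instead of being written as text.
        if score < LOW_SCORE and phrase in KNOWN_COMMANDS:
            commands.append(phrase)
        else:
            text.append(phrase)
    return " ".join(text), commands


# Intended use: the speaker really means the command.
print(split_dictation([("please send the report", 0.8),
                       ("new paragraph", 0.05),
                       ("thank you", 0.7)]))

# The failure case: the speaker is dictating the words "new paragraph",
# but the phrase scores low and is removed from the text.
print(split_dictation([("start a", 0.6),
                       ("new paragraph", 0.1),
                       ("for each topic", 0.7)]))
```

The second call shows how a rare but legitimately dictated phrase is lost from the final document when it happens to match a command.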
Systems such as the DRAGON NATURALLY SPEAKING™ or the IBM VIA VOICE™ attempt to allow what are called natural language commands, which is an ability to give an instruction in a person's own words rather than in instructions selected by the developer of the application. This increases the usability of such systems, since the user may know what he or she wants to do, but not what the application being used calls it. To achieve such control, it is necessary to recognize the significant words within the command, and then combine them to form an instruction. Many words which are spoken do not add to the overall meaning of a command; and once they have been disregarded, it is easier to match the command words with existing concepts within the application which is being controlled.
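The idea of disregarding words which add no meaning and matching the remaining words to application concepts can be sketched as follows. The filler word list, the concept table, and the function name interpret are hypothetical and are not drawn from either of the named products.

```python
# Hypothetical words which do not add to the meaning of a command.
FILLER = {"please", "could", "you", "the", "a", "this", "that",
          "for", "me", "i", "want", "to", "would", "like"}

# Hypothetical application concepts and the words a user might say for them.
CONCEPTS = {
    "print": {"print", "printout"},
    "save": {"save", "store", "keep"},
    "delete": {"delete", "erase", "remove"},
}


def interpret(utterance):
    """Return (concept, significant words) for a spoken instruction."""
    words = [w.strip(".,?!").lower() for w in utterance.split()]
    significant = [w for w in words if w and w not in FILLER]
    for word in significant:
        for concept, spoken_forms in CONCEPTS.items():
            if word in spoken_forms:
                return concept, significant
    return None, significant


print(interpret("Could you please erase that sentence"))
print(interpret("I would like to keep this document"))
```

The user says “erase” or “keep” in his or her own words, and the sketch maps these to the concepts the application happens to call “delete” and “save”.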
The use of natural language commands, however, requires all of the processing needed for dictation, plus additional processing to pick out the command words. Typically, with systems of the type mentioned above, the system is controlled to recognize spoken command phrases as commands by causing the user to deviate from normal continuous spoken dictation. For example, currently available speech recognition engines or applications operate to interpret a pause in the dictation as a signal to start the probability checker for a command instruction following the pause. Expressed in other words, in order to initiate a command, a pause must be made in the normal flow of dictation. The pause is of a distinct length, to cause the application engine to begin to recognize the spoken phrase or word following the pause as a command instruction. Once the command is carried out, another pause is required to switch the system back into the dictation mode. This interrupts the flow of dictation, because it is an unnatural way of speaking for most persons.
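The pause-driven switching between dictation and command modes can be sketched as a simple two-state routine. The pause length, the event format, and the function name route are assumptions made only for illustration.

```python
# Hypothetical minimum pause, in seconds, which switches modes.
COMMAND_PAUSE_SECONDS = 0.8


def route(events):
    """Route speech to text or commands, toggling mode on long pauses.

    events is a list of ("speech", phrase) or ("pause", seconds) pairs.
    """
    mode = "dictation"
    text, commands = [], []
    for kind, value in events:
        if kind == "pause":
            # Only a pause of the distinct length changes the mode;
            # shorter pauses are ordinary gaps in speech.
            if value >= COMMAND_PAUSE_SECONDS:
                mode = "command" if mode == "dictation" else "dictation"
        elif mode == "command":
            commands.append(value)
        else:
            text.append(value)
    return text, commands


print(route([
    ("speech", "dear john"),
    ("pause", 1.2),            # long pause: enter command mode
    ("speech", "new paragraph"),
    ("pause", 1.0),            # long pause: return to dictation
    ("speech", "thank you for your letter"),
]))
```

The two long pauses in the example are exactly the interruptions to the flow of dictation which the passage above describes.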
In order to clearly identify the spoken words following a pause, systems also may include an additional instruction, such as the spoken word “computer” following the pause, to then couple the pause with the spoken instruction word to cause the system to interpret the sounds following as instructions or command functions for the computer. The utilization of this two-part technique significantly reduces the potential for errors in the completed text, at the expense, however, of an unnatural speech rhythm and pattern in the dictated material.
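The two-part technique, in which a pause must be followed by a spoken instruction word, can be sketched as follows. The pause length, the default values, and the function name classify are hypothetical; only the instruction word “computer” comes from the description above.

```python
def classify(pause_before, utterance, min_pause=0.8, keyword="computer"):
    """Treat an utterance as a command only if it follows a long pause
    AND begins with the spoken instruction word.

    pause_before is the silence, in seconds, preceding the utterance.
    """
    words = utterance.lower().split()
    if pause_before >= min_pause and words and words[0] == keyword:
        # Drop the instruction word; the remainder is the command.
        return "command", " ".join(words[1:])
    return "dictation", utterance


print(classify(1.0, "Computer delete that"))   # pause plus keyword
print(classify(0.1, "computer sales rose"))    # keyword without a pause
print(classify(1.0, "sales rose"))             # pause without the keyword
```

Requiring both conditions is what reduces errors: the word “computer” in ordinary dictation, or a pause alone, does not trigger a command.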
The U.S. Pat. No. 6,125,342 to Selesky is directed to a voice recognition system with command recognition to perform various command functions. This patent includes pronoun semantic analysis for interpreting command statements.
The U.S. Pat. No. 6,100,882 to Sharman is directed to an audio conferencing speech recognition software and system. Speech from users at one work station is displayed as text at other work stations. This is repeated for all of the work stations in the conferencing network; so that all of the text is stored in a text file at all of the work stations. There is no disclosure in this patent, however, for the differentiation between dictation and command inputs.
The U.S. Pat. No. 6,281,883 to Barker is directed to a hand-held data entry device which includes two or more buttons on it. One of the buttons is for recording dictation information; and the other is for switching the system to a voice command. Thus, when the voice command button is operated, words or phrases spoken into the microphone in the device are interpreted by the computer as command mode instructions; so that voice input with this button depressed is interpreted by the computer as voice command instructions.
It is desirable to provide a voice recognition software application and method which overcomes the disadvantages of the prior art discussed above, and which facilitates the operation of the voice recognition application to allow a user to utilize the voice recognition application in a continuous dictation mode, and which clearly allows the computer to recognize command instructions unambiguously, during operation of the voice recognition application in the continuous dictation mode, without requiring unnatural modification of the dictation flow of the user.