1. Field of the Invention
The invention relates to the field of computer systems and, in particular, to a speech-to-text converter with a dialect database and two-way speech recognition capability.
2. Description of the Related Art
Many routine tasks require generating and utilizing written text. This is typically done by typing text into a computer via a keyboard. Typing text into a computer allows the computer to perform a variety of useful tasks such as checking the text for spelling and grammar. The computer-generated text can be incorporated into other documents, sent to other people via e-mail systems, or posted to the Internet. Typing text by keyboard has the disadvantage that it requires the operator to use both hands for optimal typing speed, thereby preventing the operator from using their hands for any other task. Typing is an acquired skill, and attaining a relatively high typing rate can take significant time and practice. In addition, even a skilled typist can only type at ¼ to ½ the rate of normal speech. Thus, it is generally not possible for a typist to transcribe a normal flowing conversation at the same rate it is spoken.
One method developed to allow faster transcription is stenography. Stenography is a shorthand manner of identifying words and representing them with alternative symbols. Stenography involves the use of a stenography machine. A skilled stenographer can easily keep up with transcribing a conversation as it is spoken. However, stenography also has some significant disadvantages. Stenography is a learned skill, and a stenographer requires a significant amount of instruction and practice to become proficient. In addition, the stenography symbols are not the same as the normal alphabet and are illegible to one not skilled as a stenographer. Stenography symbols are also typically not understood by commonly available computer applications or e-mail servers.
Speech recognition and speech-to-text conversion have been developed to generate text more rapidly while keeping the user's hands free for other tasks. Speech recognition involves hardware and software capable of receiving a spoken sound pattern and matching it with a particular word, phrase, or action. Speech-to-text conversion is a more elaborate system that continuously performs speech recognition so as to convert a spoken conversation or discourse to corresponding text, comparable to what a typist at a keyboard would produce, but more rapidly. Current speech-to-text systems are capable of following a natural conversation and generating corresponding text with a relatively low error rate, subject to some limitations.
One difficulty for current speech-to-text systems is correctly interpreting variations in speech when the meaning stays constant. A given person will tend to pronounce words slightly differently at different times. As people become excited, they tend to speak more rapidly. Many people tend to slur words together or to partially drop phonemes from their pronunciation, for example, “Howareya” instead of “How are you” or “bout” instead of “about”. This is a particular problem in English because, in the example of “bout” versus “about”, both are proper English words but with quite different meanings. A human listener is familiar with the vagaries of typical human speech and would readily make the correct interpretation in this case, but a machine has a more difficult time making the distinction.
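By way of illustration only, the ambiguity between “bout” and “about” can be sketched as a choice among candidate words scored by the preceding word. The following Python sketch is not drawn from any particular prior system; the tables `BIGRAM_COUNTS` and `CANDIDATES`, their contents, and the functions `resolve` and `transcribe` are hypothetical and serve only to show why context is needed to make the distinction.

```python
# Hypothetical counts of how often one word follows another.
# A real system would estimate these from a large text corpus.
BIGRAM_COUNTS = {
    ("talk", "about"): 50,
    ("talk", "bout"): 0,
    ("boxing", "bout"): 20,
    ("boxing", "about"): 1,
}

# Hypothetical table of reduced spoken forms and the words each may stand for.
CANDIDATES = {
    "bout": ["bout", "about"],
    "howareya": ["how are you"],
}


def resolve(previous_word, heard):
    """Pick the candidate that most often follows previous_word."""
    options = CANDIDATES.get(heard, [heard])
    # max() keeps the first of tied options, so the literal form wins ties.
    return max(options, key=lambda w: BIGRAM_COUNTS.get((previous_word, w), 0))


def transcribe(tokens):
    """Convert a list of heard tokens to text, one token at a time."""
    out = []
    previous = ""
    for token in tokens:
        word = resolve(previous, token.lower())
        out.append(word)
        parts = word.split()
        previous = parts[-1] if parts else ""
    return " ".join(out)


if __name__ == "__main__":
    print(transcribe(["talk", "bout", "it"]))
    print(transcribe(["boxing", "bout", "tonight"]))
```

With these assumed counts, the same heard token “bout” is rendered as “about” after “talk” and left as “bout” after “boxing”. The sketch uses only the immediately preceding word; it does not address variation in speaking rate or acoustics.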
Some speech-to-text systems address this problem by “learning” a particular person's speech patterns. This is typically done by sampling the person's speech and matching that speech with corresponding text or actions. This type of speech recognition or speech-to-text is called speaker dependent. Many speaker dependent systems provide a correction feature enabling them to iteratively improve the conversion of a person's speech to corresponding text. Speaker dependent systems can require several hours of training before the system is capable of reliably converting the person's speech to text.
Different people tend to pronounce the same words differently and to use different phrasing. Oftentimes the variations in people's speech patterns follow predictable and identifiable patterns by group, such as the place where the speakers grew up, their age or gender, or their profession or type of work. These variations in pronunciation and word use are referred to as dialects. A dialect is typically distinguished by the use or absence of certain words or phrasing. A dialect will also typically have predictable manners of pronouncing certain syllables and/or words. It can be appreciated that the predictable nature of a dialect could be used to facilitate the learning process for a speaker dependent speech-to-text converter.
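As a minimal sketch of how a dialect's predictable pronunciations might shorten training, a speaker profile can start from a shared dialect table and store only the corrections specific to that speaker. The Python example below is illustrative only; the dialect names, the table entries, and the `SpeakerProfile` class are assumptions made for this sketch and are not taken from any described system.

```python
from collections import ChainMap

# Hypothetical dialect tables: heard pronunciation -> intended word.
DIALECT_TABLES = {
    "dialect_a": {"pahk": "park", "cah": "car"},
    "dialect_b": {"warsh": "wash"},
}


class SpeakerProfile:
    """Speaker-specific corrections layered over a shared dialect table."""

    def __init__(self, dialect):
        # Only this dictionary is stored per speaker.
        self.corrections = {}
        # Lookups consult the speaker's corrections first, then the dialect.
        self.lookup = ChainMap(self.corrections, DIALECT_TABLES[dialect])

    def correct(self, heard, intended):
        """Record a correction without altering the shared dialect table."""
        self.corrections[heard] = intended

    def to_text(self, heard_tokens):
        """Map each heard token to a word, passing unknown tokens through."""
        return " ".join(self.lookup.get(t, t) for t in heard_tokens)


if __name__ == "__main__":
    profile = SpeakerProfile("dialect_a")
    print(profile.to_text(["pahk", "the", "cah"]))
    profile.correct("cah", "cart")
    print(profile.to_text(["pahk", "the", "cah"]))
```

In this sketch a new speaker is usable immediately after a dialect is selected, and each correction overrides the dialect entry for that speaker alone. Token-level lookup stands in for acoustic matching, which the sketch does not attempt.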
Another limitation of a speaker dependent system is that it is generally only reliable with the speech patterns of the person who trained it. A speaker dependent system typically has significantly poorer performance with speakers other than the trainer, often to the point that it is no longer useful unless trained with another user. Each new user needs to teach the speech-to-text system their unique speech patterns, which again can take several hours. The speech-to-text system must also store the voice pattern files of the different speakers, which takes up limited memory capacity. It can be appreciated that, in circumstances with multiple speakers, a speech-to-text system that is capable of minimizing the training time required for each speaker would be an advantage.
In several situations, a desirable feature for speech-to-text systems is the ability not only to correctly transcribe the speech of multiple speakers but also to distinguish among the speakers. One example would be courtroom transcription, wherein several attorneys, the judge, and parties to the case would have occasion to speak and wherein an accurate transcription of what is said and by whom needs to be made to record the proceedings. A second example is a telephone customer assistance line where a company would like a written record of customers' calls to assess its employees and to track and evaluate customer concerns and comments. It can be appreciated that the transcription of the conversations in these cases should be unobtrusive to the participants and should not interfere with the main business at hand.
Speech-to-text systems can be provided with more extensive libraries of speech patterns and more sophisticated recognition algorithms to enable them to convert the speech of multiple users to text more reliably. However, these systems become increasingly demanding of computer processor power and memory capacity as their flexibility increases. More powerful processors and larger memories increase the cost of the systems. In addition, more complicated algorithms can slow a system down to the point that it is no longer capable of keeping up with a normal conversation.
It can be appreciated that there is an ongoing need for a method of reducing the time needed to train a speech-to-text conversion system and for providing less expensive speech-to-text conversion systems. There is a further need for speech-to-text conversion that can reliably transcribe the speech of multiple speakers and correctly match the converted text with the speaker. The system and method should be cost-effective to implement and should not require extensive additional hardware.