1. Field of the Invention
The present invention relates to the field of computer systems and in particular to a speech-to-text converter with a dialect database and two-way speech recognition capability.
2. Description of the Related Art
Many routine tasks require generating and utilizing written text. This is typically done by typing text into a computer via a keyboard. Typing text into a computer allows the computer to perform a variety of useful tasks such as checking the text for spelling and grammar. The computer generated text can be incorporated into other documents, sent to other people via e-mail systems, or posted to the Internet. Typing text by keyboard has the disadvantage that it requires the operator to use both hands for optimal typing speed, thereby preventing the operator from using their hands for any other task. Typing is an acquired skill, and attaining a relatively high typing rate can take significant time and practice. In addition, even a skilled typist can type at only ¼ to ½ the rate of normal speech. Thus, it is generally not possible for a typist to transcribe a normal flowing conversation at the same rate it is spoken.
One method developed to allow faster transcription is stenography. Stenography is a shorthand manner of identifying words and representing them with alternative symbols. Stenography involves the use of a stenography machine. A skilled stenographer can easily keep up with transcribing a conversation as it is spoken. However, stenography also has some significant disadvantages. Stenography is a learned skill, and a stenographer requires a significant amount of instruction and practice to become proficient. In addition, the stenography symbols are not the same as the normal alphabet and are illegible to one not skilled as a stenographer. Stenography symbols are also not typically understood by most commonly available computer applications or e-mail servers.
Speech recognition and speech-to-text conversion have been developed to generate text more rapidly while keeping the user's hands free for other tasks. Speech recognition involves hardware and software capable of receiving a spoken sound pattern and matching it with a particular word, phrase, or action. Speech-to-text conversion is a more elaborate system that performs speech recognition continuously, converting a spoken conversation or discourse to corresponding text comparable to what a typist at a keyboard would produce, but more rapidly. Current speech-to-text systems are capable of following a natural conversation and generating corresponding text with a relatively low error rate, subject to some limitations.
One difficulty current speech-to-text systems have is correctly interpreting variations in speech when the meaning stays constant. A given person will tend to pronounce words slightly differently at different times. As people become excited, they tend to speak more rapidly. Many people tend to slur words together or to partially drop phonemes from their pronunciation, for example, “Howareya” instead of “How are you” or “bout” instead of “about”. This is a particular problem with English because, in the example of “bout” versus “about”, both are proper English words but with quite different meanings. A human listener is familiar with the vagaries of typical human speech and would readily make the correct interpretation in this case, but a machine has a more difficult time making the distinction.
Some speech-to-text systems address this problem by “learning” a particular person's speech patterns. This is typically done by sampling the person's speech and matching that speech with corresponding text or actions. This type of speech recognition or speech-to-text is called speaker dependent. Many speaker dependent systems provide a correction feature enabling them to iteratively improve the conversion of a person's speech to corresponding text. Speaker dependent systems can require several hours of training before the system is capable of reliably converting the person's speech to text.
Different people will tend to pronounce the same words differently and use different phrasing. Oftentimes the variations in people's speech follow predictable and identifiable patterns according to factors such as where the speakers grew up, their age or gender, or their profession or type of work. These variations in pronunciation and word use are referred to as dialects. A dialect is typically distinguished by the use or absence of certain words or phrasing. A dialect will also typically have predictable manners of pronouncing certain syllables and/or words. It can be appreciated that the predictable nature of a dialect could be used to facilitate the learning process for a speaker dependent speech-to-text converter.
Another limitation of a speaker dependent system is that it is generally only reliable with the speech patterns of the person who trained it. A speaker dependent system typically has significantly poorer performance with speakers other than the trainer, often to the point that it is no longer useful unless trained with another user. Each new user needs to teach the speech-to-text system their unique speech patterns, which again can take several hours. The speech-to-text system must also store the voice pattern files of the different speakers, which consumes limited memory capacity. It can be appreciated that, in circumstances with multiple speakers, a speech-to-text system capable of minimizing the training time required for each speaker would be an advantage.
In several situations, a desirable feature for speech-to-text systems is the ability to not only correctly transcribe the speech of multiple speakers but also to distinguish the multiple speakers. One example would be courtroom transcription, wherein several attorneys, the judge, and parties to the case would have occasion to speak and wherein an accurate transcription of what is said and by whom needs to be made to record the proceedings. A second example is a telephone customer assistance line where a company would like a written record of customers' calls to assess their employees and track and evaluate customer concerns and comments. It can be appreciated that the transcription of the conversations in these cases should be unobtrusive to the participants and should not interfere with the main business at hand.
Speech-to-text systems can be provided with more extensive libraries of speech patterns and more sophisticated recognition algorithms to enable them to convert the speech of multiple users to text more reliably. However, these systems become increasingly demanding of computer processor power and memory capacity as their flexibility increases. The more powerful processors and larger memory increase the cost of the systems. In addition, more complicated algorithms can slow a system down to the point that it is no longer capable of keeping up with a normal conversation.
It can be appreciated that there is an ongoing need for a method of reducing the time needed to train a speech-to-text conversion system and for providing less expensive speech-to-text conversion systems. There is a further need for speech-to-text conversion that can reliably transcribe the speech of multiple speakers and be able to correctly match the converted text with the speaker. The system and method should be cost effective to implement and not require extensive additional hardware.
The aforementioned needs are satisfied by the two-way speech recognition and dialect system of the present invention which, in one aspect, comprises a system for receiving spoken sounds and converting them into written text. The system includes a dialect database which is used to narrow the expected tonal qualities of the speaker and reduce the time required for the system to reliably transcribe the speaker's speech. The two-way speech recognition and dialect system allows for determining the dialectal characteristics of a user. In one embodiment, the two-way speech recognition and dialect system includes the ability to distinguish between multiple speakers based on their dialectal speech characteristics.
In one embodiment, the two-way speech recognition and dialect system comprises a microphone, memory, a microprocessor, at least one input device, and at least one user interface. The microphone allows the speech input of the user to be transduced into electrical signals. The microprocessor processes the input from the microphone and other devices. The microprocessor also performs the speech recognition and text conversion actions of the system. The memory stores the “learned” vocal patterns of the user as well as a plurality of dialectal speech characteristics. The input device(s) and user interface(s) allow the user to interact with the two-way speech recognition and dialect system.
In this embodiment, the two-way speech recognition and dialect system provides dialect determination by posing a series of questions to the user. The questions can branch depending on the respondent's answers. In one embodiment, the questions attempt to determine the likely dialectal characteristics of the speaker by asking about characteristics of the speaker that relate to speaking style. These questions can include questions determining the speaker's age, gender, level of education, type of work that they do, where they grew up, where they live now and for how long, whether they are a native speaker of the language, and if not what their native language is.
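By way of illustration only, the branching of questions on the respondent's answers might be sketched as follows. This is a minimal sketch in Python under stated assumptions, not the implementation of the invention: the question identifiers, the prompts, the `QUESTIONS` table, and the `run_questionnaire` helper are all hypothetical names introduced here.

```python
# Hypothetical question graph. Each entry holds a prompt and a function that
# picks the next question id from the answer (None ends the questionnaire).
QUESTIONS = {
    "native": {
        "prompt": "Are you a native speaker of this language?",
        # Branch: only non-native speakers are asked for their native language.
        "next": lambda a: "grew_up" if a.strip().lower() == "yes" else "native_language",
    },
    "native_language": {
        "prompt": "What is your native language?",
        "next": lambda a: "grew_up",
    },
    "grew_up": {
        "prompt": "Where did you grow up?",
        "next": lambda a: "occupation",
    },
    "occupation": {
        "prompt": "What type of work do you do?",
        "next": lambda a: None,
    },
}


def run_questionnaire(answer_fn, start="native"):
    """Walk the question graph, collecting answers keyed by question id.

    answer_fn is any callable that takes a prompt string and returns the
    respondent's answer (for example, the built-in input function).
    """
    answers = {}
    question_id = start
    while question_id is not None:
        question = QUESTIONS[question_id]
        answer = answer_fn(question["prompt"])
        answers[question_id] = answer
        question_id = question["next"](answer)
    return answers
```

Passing the built-in `input` as `answer_fn` would pose the questions interactively; a scripted callable can stand in for the user during testing.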
The two-way speech recognition and dialect system uses the responses to these parameter questions to determine the dialect that the user likely has. The two-way speech recognition and dialect system then uses the likely dialect to narrow the speech patterns to expect for the user. For example, the speech patterns and vocabulary of a young, working class female from rural South Carolina are likely to be quite different from those of an older male doctor from Bombay, India. The two-way speech recognition and dialect system uses this information to narrow the expected tonal range of the speaker and anticipate certain pronunciations and word uses. Thus, the learning period for the two-way speech recognition and dialect system is shorter than for a generic speaker dependent speech-to-text conversion system.
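One way to picture how responses could select a likely dialect, and how that dialect could narrow what the system expects, is sketched below. This is a simplified illustration, not the method of the invention: the `DialectProfile` structure, the two database entries, the trait keys, the pitch ranges (placeholder numbers, not measured data), and the simple match-count scoring are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class DialectProfile:
    name: str
    traits: dict            # questionnaire answers associated with this dialect
    pitch_range_hz: tuple   # expected tonal range (placeholder values)
    variants: dict = field(default_factory=dict)  # heard form -> anticipated words


# Hypothetical dialect database with two illustrative entries.
DIALECT_DB = [
    DialectProfile(
        "profile_a",
        {"region": "south carolina", "gender": "female"},
        (165.0, 255.0),
        {"howareya": "how are you"},
    ),
    DialectProfile(
        "profile_b",
        {"region": "bombay", "gender": "male"},
        (85.0, 180.0),
    ),
]


def select_dialect(answers, db=DIALECT_DB):
    """Return the profile whose traits match the most questionnaire answers."""
    def score(profile):
        return sum(
            1 for key, value in profile.traits.items()
            if answers.get(key, "").lower() == value
        )
    return max(db, key=score)


def normalize(tokens, profile):
    """Replace anticipated dialect pronunciations with their intended words."""
    return [profile.variants.get(token.lower(), token) for token in tokens]
```

A table lookup of this kind is deliberately naive. A form such as “bout”, which is also a proper word, could not be resolved by substitution alone and would need surrounding context.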
Another embodiment of the present invention adds the ability to transcribe the speech of multiple users and the ability to identify and distinguish the speakers. The two-way speech recognition and dialect system monitors the pronunciation of the speakers and determines the dialectal differences between the speakers. The two-way speech recognition and dialect system uses these differences to determine who is speaking at any given time. Thus, the two-way speech recognition and dialect system can distinguish between the speakers and identify the origin of each segment of transcribed speech. The two-way speech recognition and dialect system can number the text from each speaker or present the text on a monitor in different colors or fonts for the different speakers so that the transcribed text for each speaker can be readily distinguished.
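As a rough illustration of attributing each transcribed segment to a speaker and numbering the text, the following sketch assigns a segment to the stored speaker profile nearest to it. The numeric feature vectors, the Euclidean distance measure, and the function names are assumptions chosen for the example; the source does not specify how dialectal differences are represented or compared.

```python
import math


def identify_speaker(features, profiles):
    """Return the name of the stored profile closest to the segment's features.

    profiles maps a speaker name to a feature vector summarizing that
    speaker's dialectal characteristics (hypothetical representation).
    """
    return min(profiles, key=lambda name: math.dist(features, profiles[name]))


def label_transcript(segments, profiles):
    """Prefix each transcribed segment with a per-speaker number.

    segments is a list of (features, text) pairs in spoken order. Speakers
    are numbered in the order in which they are first heard.
    """
    numbers = {}
    lines = []
    for features, text in segments:
        name = identify_speaker(features, profiles)
        number = numbers.setdefault(name, len(numbers) + 1)
        lines.append(f"[Speaker {number}] {text}")
    return lines
```

The same speaker number could equally be mapped to a display color or font rather than a text prefix.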
These and other objects and advantages of the present invention will become more fully apparent from the following description taken in conjunction with the accompanying drawings.