The present invention relates to speech recognition and more particularly to inexpensive and user-friendly speech recognition techniques.
Speech recognition has been extensively studied for several decades, both for its intellectual interest and for its military and commercial applications. Some of the commercial applications involve speaker verification and improving the man-machine interface (e.g., U.S. Pat. Nos. 3,742,143; 4,049,913; 4,882,685; 5,281,143; and 5,297,183). As evidence of the extensive research on speech recognition, the U.S. Patent Office has granted more than 600 patents on speech recognition or related topics in the last three decades, and as many as 10,000 articles have appeared in the scientific or engineering literature during that time.
Generally, a speech recognition device analyzes an unknown audio signal to generate a pattern that contains the acoustically significant information in the utterance. This information typically includes the audio signal power in several frequency bands and the important frequencies in the waveform, each as a function of time. The power may be obtained through the use of bandpass filters (e.g., U.S. Pat. No. 5,285,552) or fast Fourier transforms (FFTs) (e.g., U.S. Pat. No. 5,313,531). The frequency information may be obtained from the FFTs or by counting zero crossings in the filtered input waveform (e.g., U.S. Pat. No. 4,388,495).
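By way of illustration only, the kind of pattern described above might be computed as in the following minimal sketch, which is not taken from any of the cited patents. It splits a sampled waveform into short frames and, for each frame, computes the power in several frequency bands and counts zero crossings. The function names, the frame length, and the band edges are all hypothetical choices made for this example, and a direct discrete Fourier transform stands in for the FFT so that the sketch needs only the Python standard library.

```python
import cmath
import math


def frame_signal(samples, frame_len, hop):
    """Split a list of samples into overlapping frames.

    A trailing partial frame is dropped.
    """
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]


def power_spectrum(frame):
    """Power in each non-negative frequency bin of one frame.

    Uses a direct O(n^2) DFT for clarity; an FFT yields the same values
    far more cheaply.
    """
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        spectrum.append(abs(acc) ** 2 / n)
    return spectrum


def band_powers(frame, sample_rate, band_edges_hz):
    """Sum spectral power inside each band [lo, hi) given in Hz.

    band_edges_hz is an ascending list of edges; N edges define N - 1 bands.
    The upper edge of each band is exclusive.
    """
    n = len(frame)
    spec = power_spectrum(frame)
    powers = []
    for lo, hi in zip(band_edges_hz[:-1], band_edges_hz[1:]):
        powers.append(sum(p for k, p in enumerate(spec)
                          if lo <= k * sample_rate / n < hi))
    return powers


def zero_crossings(frame):
    """Count sign changes between consecutive samples of one frame."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))


def extract_pattern(samples, sample_rate, frame_len, hop, band_edges_hz):
    """Return one (band_powers, zero_crossing_count) pair per frame."""
    return [(band_powers(f, sample_rate, band_edges_hz), zero_crossings(f))
            for f in frame_signal(samples, frame_len, hop)]


if __name__ == "__main__":
    # Assumed test input: a 1 kHz tone sampled at 8 kHz (not real speech).
    rate = 8000
    tone = [math.sin(2 * math.pi * 1000 * t / rate + 0.3) for t in range(160)]
    for powers, crossings in extract_pattern(tone, rate, 80, 40,
                                             [0, 500, 1500, 4000]):
        print([round(p, 3) for p in powers], crossings)
```

For a pure tone, the zero-crossing count divided by twice the frame duration gives a rough estimate of the tone's frequency, which is why zero-crossing counting can serve as an inexpensive alternative to a full transform.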
Speech recognition devices can be classified as “speaker dependent” or “speaker independent.” Speaker dependent devices require that the user train the system by speaking all of the utterances in the entire recognition set several times. Speaker independent devices do not require such training. Instead, the recognizer is trained in advance on the acoustic cues obtained from many repetitions of the utterances in the recognition set, as spoken by many different speakers, so that it can recognize an unknown utterance by a speaker whose speech was not part of the training set.
Commercial applications of both speaker independent and speaker dependent recognition are becoming prevalent in areas such as voice activated phone dialing, computer command and control, telephone inquiries, voice recorders, electronic learning aids, data entry, menu selection, and database searching. The growth of the speech recognition marketplace results from the decreasing cost of computing power and recognition technology as well as the need for friendlier user interfaces.
In some applications, speaker dependent recognition is required because the user must input the information that he or she later requests. An example is voice dialing, which is being test-marketed by U.S. West among others, in which the user verbally enters a directory of names and phone numbers. This information is later retrieved by using speaker dependent recognition when the user wishes to make a phone call. Apart from applications such as voice dialing that require it, speaker dependent recognition has not achieved wide market acceptance because the required training makes it unfriendly to users.
Much of the interest in speaker independent recognition stems from its simpler user interface. An example of a speaker independent recognition software package running on personal computers is VOICE Release 2.0 from Kurzweil AI, which is able to recognize as many as 60,000 words without user training. Other examples of similar technologies are the IBM Voice Type 3.0, used in radiology, the Wild Card LawTALK, used in legal applications, and the Cortex Medical Management, used for anatomic pathology. More than two dozen speaker independent recognition computer products are available, and all of them require considerable computing power to perform the sophisticated natural language processing (involving context, semantics, phonetics, prosody, etc.) that is needed to recognize very large sets of utterances without user training.
Small vocabulary, speaker independent recognition also appears in commercial applications where the number of utterances to be recognized is limited. An example is the Sensory, Inc. speaker independent recognition LSI chip (U.S. Pat. No. 5,790,754), which is used in electronic learning aids such as the Fisher-Price Radar product and in time setting applications such as the VoiceIt clock. This technology is accurate and inexpensive but, in the current art, it is limited to use with relatively small vocabularies because the LSI chip contains neither the computing power required for natural language processing nor the memory required to store information about a very large inventory of recognition words.
The above described limitations of current recognition technology narrow the range of its applicability in consumer electronic products. For example, it would be desirable to select a particular song from a compact disk changer that holds many compact disks by telling it which disk and which song on that disk the user wishes to hear. This is not currently feasible because solving this problem with speaker dependent recognition requires that the user repeat the names of all recordings on every compact disk that the user owns, while solving it with speaker independent technology would require that the recognizer be able to understand the name of every song on every compact disk in the world. As another example, consider the use of speech recognition during the interaction of a surfer with an internet website. Most of this interaction is at a simple one-step-at-a-time level where the vocabulary to be recognized at each step is small but the total vocabulary associated with all of the steps may be large. For this application, speaker dependent recognition may not be feasible because of its inconvenience. Speaker independent recognition is feasible but, in the current art, analyzing the speech on the website's main processor creates conflicts between the recognition program and the application and may slow down the application to the point that use of recognition becomes unacceptable to the user. Also, adding processing power to handle the speaker independent recognition may not be feasible due to its cost.