The present invention relates to speech recognition and more particularly to inexpensive and user-friendly speech recognition techniques.
Speech recognition has been extensively studied for several decades because of its interest on intellectual grounds and because of its military and commercial applications. Some of the commercial applications involve speaker verification and improving the man-machine interface (e.g., U.S. Pat. Nos. 3,742,143; 4,049,913; 4,882,685; 5,281,143; and 5,297,183). As evidence of the extensive research on speech recognition, the U.S. Patent Office has granted more than 600 patents on speech recognition or related topics in the last three decades and as many as 10,000 articles have appeared in the scientific or engineering literature during that time.
Generally, a speech recognition device analyzes an unknown audio signal to generate a pattern that contains the acoustically significant information in the utterance. This information typically includes the audio signal power in several frequency bands and the important frequencies in the waveform, each as a function of time. The power may be obtained through the use of bandpass filters (e.g., U.S. Pat. No. 5,285,552) or fast Fourier transforms (i.e., FFTs) (e.g., U.S. Pat. No. 5,313,531). The frequency information may be obtained from the FFTs or by counting zero crossings in the filtered input waveform (U.S. Pat. No. 4,388,495).
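The pattern described above can be illustrated with a minimal sketch. It is not the method of any cited patent: the function names, the frame length, and the band edges are assumptions chosen for illustration, and a naive discrete Fourier transform stands in for an FFT or a bank of bandpass filters so that the example needs only the standard library.

```python
import cmath
import math


def zero_crossings(frame):
    """Count sign changes between consecutive samples of one frame."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))


def band_powers(frame, sample_rate, band_edges_hz):
    """Sum spectral power per band using a naive DFT (O(n^2), illustrative only)."""
    n = len(frame)
    powers = [0.0] * (len(band_edges_hz) - 1)
    for k in range(1, n // 2):
        freq = k * sample_rate / n
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(frame))
        for b in range(len(powers)):
            if band_edges_hz[b] <= freq < band_edges_hz[b + 1]:
                powers[b] += abs(coeff) ** 2 / n
    return powers


def extract_pattern(signal, sample_rate, frame_len, band_edges_hz):
    """Return one (band powers, zero-crossing count) pair per non-overlapping frame."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, frame_len)]
    return [(band_powers(f, sample_rate, band_edges_hz), zero_crossings(f))
            for f in frames]


# Hypothetical input: a 1 kHz tone sampled at 8 kHz, analyzed in 10 ms frames.
RATE = 8000
tone = [math.sin(2 * math.pi * 1000 * i / RATE + 0.3) for i in range(160)]
pattern = extract_pattern(tone, RATE, 80, [0, 500, 1500, 4000])
```

Each entry of `pattern` pairs the per-band power with a zero-crossing count for one frame, so the list as a whole expresses both quantities as a function of time.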
Speech recognition devices can be classified as "speaker dependent" or "speaker independent." Speaker dependent devices require that the user train the system by speaking all of the utterances in the entire recognition set several times. Speaker independent devices do not require such training because the acoustic cues obtained from many repetitions of the utterances in the recognition set, as spoken by many different speakers, are used to train the recognizer to recognize an unknown utterance by a speaker whose phrase was not part of the training set.
Commercial uses of both speaker independent and speaker dependent recognition are becoming prevalent in applications such as voice activated phone dialing, computer command and control, telephone inquiries, voice recorders, electronic learning aids, data entry, menu selection, and database searching. The growth of the speech recognition marketplace results from the decreasing cost of computing power and recognition technology as well as the need for friendlier user interfaces.
In some applications, speaker dependent recognition is required because the user must input information that he/she later requests. An example is voice dialing, which is being test marketed by U.S. West among others, in which the user verbally enters a directory of names and phone numbers. This information is later solicited by using speaker dependent recognition when the user wishes to make a phone call. Except for applications such as voice dialing that require speaker dependent recognition, this technology has not achieved wide market acceptance because it is not user-friendly due to the required training.
Much of the interest in speaker independent recognition stems from its simpler user interface. An example of a speaker independent recognition software package running on personal computers is VOICE Release 2.0 from Kurzweil AI, which is able to recognize as many as 60,000 words without user training. Other examples of similar technologies are the IBM Voice Type 3.0, used in radiology, the Wild Card LawTALK, used in legal applications, and the Cortex Medical Management, used for anatomic pathology. More than two dozen speaker independent recognition computer products are available. All of them perform sophisticated natural language processing involving context, semantics, phonetics, prosody, etc., in order to recognize very large sets of utterances without user training. Hence, large vocabulary, speaker independent recognition products require considerable computing power.
Small vocabulary, speaker independent recognition also appears in commercial applications where the number of utterances to be recognized is limited. Examples are the Sensory, Inc. speaker independent recognition LSI chip (U.S. patent application Ser. No. 08/327,455) used in electronic learning aids such as the Fisher-Price Radar product, or in time setting applications such as the VoiceIt clock. This technology is accurate and inexpensive but, in the current art, it is limited to use with relatively small vocabularies because the LSI chip does not contain the computing power required for natural language processing or the memory required to store information about a very large inventory of recognition words.
The above described limitations of current recognition technology narrow the range of its applicability in consumer electronic products. For example, it would be desirable to select a particular song from a compact disk changer that holds many compact disks by telling it which disk and which song on that disk you wish to hear. This is not currently feasible because solving this problem with speaker dependent recognition requires that the user repeat the names of all recordings on every compact disk that he owns, while solving it with speaker independent technology would require that the recognizer be able to understand the name of every song on every compact disk in the world. Or, consider the use of speech recognition during the interaction of a surfer with an internet website. Most of this interaction is at a simple one-step-at-a-time level where the vocabulary to be recognized at each step is small but the total vocabulary associated with all of the steps may be large. For this application, speaker dependent recognition may not be feasible because of its inconvenience. Speaker independent recognition is feasible, but, in the current art, analyzing the speech by the web site's main processor creates conflicts between the recognition program and the application and may slow down the application to the point that use of recognition becomes unacceptable to the user. Also, adding additional processing power to handle the speaker independent recognition may not be feasible due to its cost.
The present invention provides an inexpensive and user-friendly speaker independent speech recognition system. A speech recognition system according to the present invention may function without the use of natural language processing or internal storage of large amounts of speech recognition data.
In one embodiment, an inexpensive, speaker independent recognition engine is placed in the base unit of an electronic apparatus. Depending on the application, the base unit may be a compact disk player, computer, internet access device, video game player, television set, telephone, etc. The recognition engine may be a software program running in a general purpose microprocessor or an LSI chip such as the Sensory RSC-164 available from the assignee of the present application. Since the recognition engine should be inexpensive, it may be capable of recognizing only a limited set of utterances at any one time, although this recognition set of utterances may change from one application of recognition to the next in the same base unit.
The architecture of the product is such that, in operation, an external medium is connected to the base unit. The external medium may be a compact disk if the base unit is a compact disk changer, a floppy disk if the base unit is a computer, a video game cartridge if the base unit is a video game player, a cable or rf transmission if the base unit is a television set or an internet access device, a phone cable if the base unit is a telephone, etc. Included in the information provided to the base unit by the external medium is the data required for the recognition engine to recognize a spoken utterance from a limited set of candidate utterances. As the interaction between the base unit and the user progresses, different sets of data may be supplied by the external medium to the recognition engine in the base unit in order to allow different recognition sets at different times in the interaction.
Or, in some applications, only one or two data sets might ever be supplied from the external medium to the base unit. Consider the case of a watch that utilizes speech recognition for setting the time. To function, this watch might require two speaker-independent recognition sets, the first of which would be the digits, and the second of which would be the words "set," "hours," "minutes," "seconds," and "done." A problem is that worldwide sales require that this watch perform speech recognition in any of dozens of languages. In the current art, this would require either that the watch manufacturer and retailers carry inventories of a large number of different units or that the watch be loaded with information in many languages, at an unacceptable expense. An alternative approach would be to include a small amount of programmable, non-volatile memory in the watch, and to download, from the Internet, the pertinent information for whatever language a purchaser wishes his watch to recognize. The voice prompts required to guide the user through setting the time would also be downloaded in the language of the user's choice in the same way. Downloading information to devices from the Internet is already a normal operation and watches with infra-red interfaces to computers are available in the market.
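The watch example can be sketched as follows. Everything in the sketch is hypothetical: the `Watch` class, the JSON pack layout, and the capacity figure are assumptions made for illustration, and plain word strings stand in for the acoustic recognition data that an actual recognizer would need.

```python
import json


def build_pack(language, digits, commands, prompts):
    """Serialize one language's recognition sets and prompts (hypothetical format)."""
    return json.dumps({
        "language": language,
        "digits": digits,
        "commands": commands,
        "prompts": prompts,
    }).encode("utf-8")


class Watch:
    """Toy model of a watch with a small programmable non-volatile memory."""

    def __init__(self, nv_capacity_bytes):
        self.nv_capacity_bytes = nv_capacity_bytes
        self.language = None
        self.recognition_sets = {}
        self.prompts = {}

    def install(self, pack):
        """Replace the stored language data with a downloaded pack."""
        if len(pack) > self.nv_capacity_bytes:
            raise MemoryError("language pack exceeds non-volatile memory")
        data = json.loads(pack.decode("utf-8"))
        self.language = data["language"]
        # Two recognition sets, as in the text: the digits and the command words.
        self.recognition_sets = {
            "digits": data["digits"],
            "commands": data["commands"],
        }
        self.prompts = data["prompts"]


# Placeholder content for one language; only this pack is held by the watch.
english = build_pack(
    "en",
    ["zero", "one", "two", "three", "four",
     "five", "six", "seven", "eight", "nine"],
    ["set", "hours", "minutes", "seconds", "done"],
    {"ask_hours": "Say the hour."},
)
watch = Watch(nv_capacity_bytes=4096)
watch.install(english)
```

Because `install` overwrites the previous contents, the watch stores data for one language at a time, which is the point of the download approach: a single hardware unit serves every market.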
In accordance with a first aspect of the present invention, a base unit is provided wherein features of spoken utterances are analyzed by a programmable pattern recognition system to provide recognition results. A method of operating the base unit includes steps of programming the pattern recognition system to recognize a first set of words, operating the pattern recognition system as programmed to generate at least a first recognition result responsive to input speech, retrieving programming information for the pattern recognition system from a source external to the base unit responsive to the first recognition result and reprogramming the pattern recognition system to recognize a second set of words selected responsive to the first recognition result.
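The steps of the first aspect can be sketched as a small loop. This is an illustration under stated assumptions rather than the claimed implementation: the class names, the capacity limit, and the nearest-template matcher (squared Euclidean distance over toy feature vectors) are all hypothetical stand-ins for the pattern recognition system.

```python
class RecognitionEngine:
    """Toy programmable recognizer that holds a limited set of word templates."""

    def __init__(self, max_words):
        self.max_words = max_words
        self.templates = {}

    def program(self, templates):
        """Load a new recognition set, replacing the previous one."""
        if len(templates) > self.max_words:
            raise ValueError("recognition set exceeds engine capacity")
        self.templates = dict(templates)

    def recognize(self, features):
        """Return the programmed word whose template is nearest to the features."""
        def distance(template):
            return sum((a - b) ** 2 for a, b in zip(template, features))
        return min(self.templates, key=lambda w: distance(self.templates[w]))


class ExternalMedium:
    """Stand-in for a disk, cartridge, or network source of recognition sets."""

    def __init__(self, sets):
        self.sets = sets

    def fetch(self, key):
        return self.sets[key]


def run_dialog(engine, medium, utterances, start_key="root"):
    """Program, recognize, then reprogram with the set selected by each result."""
    key = start_key
    results = []
    for features in utterances:
        engine.program(medium.fetch(key))   # (re)program from the external source
        word = engine.recognize(features)   # recognition result for this step
        results.append(word)
        if word not in medium.sets:         # no follow-up set for this result
            break
        key = word                          # next set is chosen by the result
    return results


# Hypothetical two-step interaction: choose a disk, then a song on that disk.
medium = ExternalMedium({
    "root": {"disk one": [0.0, 0.0], "disk two": [1.0, 1.0]},
    "disk two": {"song a": [0.0, 1.0], "song b": [1.0, 0.0]},
})
engine = RecognitionEngine(max_words=4)
results = run_dialog(engine, medium, [[0.9, 1.1], [0.1, 0.8]])
```

The engine never holds more than one small set at a time, while the total vocabulary reachable through the external medium can be arbitrarily large.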
In accordance with a second aspect of the present invention, a method for speaker-independent speech recognition includes steps of performing speaker-independent speech recognition of user utterances in a base unit, receiving, in the base unit, first information pertinent to the speech recognition from an external medium, and receiving, in the base unit, second information independent from the first information and related to the user utterances from the external medium.
In accordance with a third aspect of the present invention, a method for speaker-independent speech recognition includes steps of downloading from an external medium into a base unit the information required for the speech recognition to operate in a selected one or a few of several different languages.
A further understanding of the nature and advantages of the inventions herein may be realized by reference to the remaining portions of the specification and the attached drawings.