This invention relates to speech recognition, and in particular, to techniques for making a recognizer and performing speech recognition. This invention also relates to methods of improving the accuracy of a recognizer through tuning of recognition parameters.
Speech recognizers are systems that are typically designed to recognize a spoken word or phrase. The words or phrases that the system is able to recognize are commonly referred to as the recognition set. Speech recognition systems are typically implemented in hardware, in software, or as a combination of hardware and software. In general, there are two types of speech recognizers: speaker-dependent and speaker-independent. Speaker-dependent recognizers operate by requiring the user to record the words or phrases in the recognition set before first use. These words or phrases are then analyzed to produce templates representing the acoustic features of the words or phrases in the recognition set. In operation, an unknown word or phrase is spoken by the same user who performed the recording. The acoustic features in the unknown word or phrase are analyzed to form a pattern that is compared to the several templates in order to decide which of the words or phrases in the recognition set was spoken. This comparison is generally done using dynamic time warping, which allows the unknown phrase to be spoken at a different cadence than that of the phrases that produced the templates, without degradation of the recognition capability. While speaker-dependent recognition devices perform well, they are limited in their general applicability by the requirement that the user train them and by the fact that they work well only for the user who trained them. For these reasons, speaker-independent speech recognition devices are highly desired for many applications. Their benefit is that any speaker may use them without that speaker having to say the phrases before first use.
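The template comparison described above can be illustrated with a minimal sketch of dynamic time warping. The sketch is not taken from any particular recognizer: the function names (frame_distance, dtw_distance, recognize), the use of Euclidean distance between frames, and the one-dimensional feature values are all assumptions chosen for illustration only. A practical recognizer would typically use multi-dimensional acoustic features and may constrain the warping path.

```python
import math

def frame_distance(x, y):
    """Euclidean distance between two feature vectors (assumed metric)."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def dtw_distance(pattern, template):
    """Cumulative cost of the best monotonic alignment of two frame sequences."""
    n, m = len(pattern), len(template)
    inf = float("inf")
    # cost[i][j]: best cost of aligning the first i pattern frames
    # with the first j template frames.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(pattern[i - 1], template[j - 1])
            # A frame may repeat in either sequence or both may advance,
            # which is what absorbs differences in speaking cadence.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def recognize(pattern, templates):
    """Return the label of the template closest to the pattern under DTW."""
    return min(templates, key=lambda label: dtw_distance(pattern, templates[label]))

# Hypothetical one-dimensional "features" for two recorded templates.
templates = {
    "yes": [[1.0], [2.0], [3.0]],
    "no": [[3.0], [2.0], [1.0]],
}
# The same "yes" contour spoken at half the cadence (each frame doubled).
slow_yes = [[1.0], [1.0], [2.0], [2.0], [3.0], [3.0]]
best_label = recognize(slow_yes, templates)
```

In this toy example the slowed utterance aligns with the "yes" template at zero cumulative cost, because each template frame is simply matched to two consecutive pattern frames, so best_label is "yes" even though the two sequences differ in length.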
Speaker-independent speech recognizers consume various amounts of computing resources. For example, some recognizers are built from a limited number of computing and memory resources (e.g., execution on the order of a few million instructions per second (MIPS) using a few kilobytes of random access memory (RAM), tens of kilobytes of read-only memory (ROM), and a limited power supply), which keeps the cost of the recognizer low. Other recognizers require a large number of arithmetic and addressing units, hundreds or more MIPS, megabytes of RAM and ROM, and an unlimited power supply. Recognizers with constrained computational resources are generally adapted for use in a single product and are included as part of that product. Recognizers with unconstrained computational resources are usually stand-alone and are accessed remotely via telephone or some other device by multiple users. Because of this difference, speech recognizers used in constrained computing environments must be economical in terms of the resources required for their use, while large speech recognizers are less subject to this limitation.
To train a speaker-independent speech recognizer to recognize a specific set of phrases in a constrained computing environment, many recordings of each of the phrases in the recognition set must be obtained. By contrast, the acoustic model in a computationally unconstrained speaker-independent recognizer is trained once for all recognition sets with which it will be used for the given language. This advantage of training an acoustic model, which describes acoustic elements in a language, once for all recognition sets instead of once for each recognition set is offset by the significant resource requirements of such recognizers, which make them incompatible with use in many consumer electronic and similar products. An example of a computationally constrained speaker-independent speech recognizer that requires recordings of each vocabulary for training is given in U.S. Pat. No. 5,790,754. Examples of computationally unconstrained speaker-independent recognizers whose acoustic models are trained once for each language are given by Bourlard and Morgan (1997), Nuance Corporation (www.nuance.com), OGI Campus, Oregon Health & Science University (OGI/OHSU), Center for Spoken Language Understanding (CSLU), and SpeechWorks (www.speechworks.com).
Major drawbacks of current-art speaker-independent speech recognizers are that those inexpensive enough to be used in consumer electronic products require training by collection of recordings of each of the phrases in each recognition set, while those that do not require such recordings demand computational resources that render them cost-ineffective for use in consumer electronic products.