The present invention is directed to speech recognition techniques and, more particularly, to methods and apparatus for generating speech recognition models, distributing speech recognition models and performing speech recognition operations, e.g., voice dialing and word processing operations, using speech recognition models.
Speech recognition, which includes both speaker independent speech recognition and speaker dependent speech recognition, is used for a wide variety of applications.
Speech recognition normally involves the use of speech recognition models or templates that have been trained using speech samples provided by one or more individuals. Commonly used speech recognition models include Hidden Markov Models (HMMs). An example of a common template is a dynamic time warping (DTW) template. In the context of the present application, “speech recognition model” is intended to encompass both speech recognition models and templates which are used for speech recognition purposes.
As part of a speech recognition operation, speech input is normally digitized and then processed. The processing normally involves extracting feature information, e.g., energy and/or timing information, from the digitized signal. The extracted feature information normally takes the form of one or more feature vectors. The extracted feature vectors are then compared to one or more speech recognition models in an attempt to recognize words, phrases or sounds.
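The feature extraction and comparison steps described above can be illustrated with a minimal sketch. It assumes short-time frame energy as the only extracted feature and uses dynamic time warping, mentioned above as a common template technique, for the comparison. The function names, the frame size, and the sample values are hypothetical and are chosen only for illustration; a practical recognizer would use richer feature vectors.

```python
import math


def frame_energies(samples, frame_size=4):
    """Split digitized speech into frames and return one log-energy value per frame.

    frame_size is a hypothetical value chosen for illustration.
    """
    feats = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        energy = sum(s * s for s in frame) / frame_size
        feats.append(math.log(energy + 1e-9))  # small offset avoids log(0)
    return feats


def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D feature sequences."""
    inf = float("inf")
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed warping steps.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[len(a)][len(b)]


def recognize(samples, templates):
    """Return the word whose stored template is closest to the input features."""
    feats = frame_energies(samples)
    return min(templates, key=lambda word: dtw_distance(feats, templates[word]))
```

Here `templates` is assumed to map each word to a feature sequence produced by `frame_energies` during training, so the input and the templates are compared in the same feature space.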
In speech recognition systems, various actions, e.g., dialing a telephone number, entering information into a form, etc., are often performed in response to the results of the speech recognition operation.
Before speech recognition operations can be performed, one or more speech recognition models need to be trained. Speech recognition models can be either speaker dependent or speaker independent. Speaker dependent (SD) speech recognition models are normally trained using speech from a single individual and are designed so that they should accurately recognize the speech of the individual who provided the training speech but not necessarily other individuals. Speaker independent (SI) speech recognition models are normally generated from speech provided from numerous individuals or from text. The generated speaker independent speech recognition models often represent composite models which take into consideration variations between different speakers, e.g., due to differing pronunciations of the same word. Speaker independent speech recognition models are designed to accurately identify speech from a wide range of individuals including individuals who did not provide speech samples for training purposes.
In general, model training involves one or more individuals speaking a word or phrase, converting the speech into digital signal data, and then processing the digital signal data to generate a speech recognition model. Model training frequently involves an iterative process of computing a speech recognition model, scoring the model, and then using the results of the scoring operation to further improve and retrain the speech recognition model.
Speech recognition model training processes can be very computationally complex. This is true particularly in the case of SI models where audio data from numerous speakers is normally processed to generate each model. For this reason, speech recognition models are often generated using relatively powerful computer systems.
Individual speech recognition models can take up a considerable amount of storage space. For this reason, it is often impractical to store speech recognition models corresponding to large numbers of words or phrases, e.g., the names of all the people in a mid-sized company, or a large dictionary, in a portable device or speech recognizer where storage space, e.g., memory, is limited.
In addition to limits in storage capacity, portable devices are often equipped with limited processing power. Speech recognition, like the model training process, can be a relatively computationally complex process and can therefore be time consuming given limited processing resources. Since most users of a speech processing system expect a prompt response from the system, to satisfy user demands speech processing often needs to be performed in real or near real time. As the number of potential words which may be recognized increases, so does the amount of processing required to perform a speech recognition operation. Thus, devices with limited processing power which may be able to perform a speech recognition operation involving recognizing, e.g., 20 possible names in near real time, may not be fast enough to perform a recognition operation in near real time where the number of names is increased to 100 possible names.
In the case of voice dialing and other applications where the recognition results need to be generated in near real time, e.g., with relatively little delay, the limited processing power of portable devices often limits the size of the vocabulary which can be considered as possible recognition outcomes.
In addition to the above implementation problems, implementers of speech recognition systems are often confronted with logistical problems associated with collecting speech samples to be used for model training purposes. This is particularly a problem in the case of speaker independent speech recognition models where the robustness of the models is often a function of the number of speech samples used for training and the differences between the individuals providing the samples. In applications where speech recognition models are to be used over a wide geographical region, it is particularly desirable that speech samples be collected from the various geographic regions where the models will ultimately be used. In this manner, regional speech differences can be taken into account during model training.
Another problem confronting implementers of speech recognition systems is that older speech recognition models may include different feature information than current speech recognition models. When updating a system to use newer speech recognition models, previously used models in addition to speech recognition software may have to be revised or replaced. This frequently requires speech samples to retrain and/or update the older models. Thus the problems of collecting training data and training speech recognition models discussed above are often encountered when updating existing speech recognition systems.
In systems using multiple speech recognition devices, speech model incompatibility may require the extraction of different speech features for different speech recognition devices when the devices are used to perform a speech recognition operation on the same speech segment. Accordingly, in some cases it is desirable to be able to supply the speech to be processed to multiple systems so that each system can perform its own feature extraction operation.
In view of the above discussion, it is apparent that there is a need for new and improved methods and apparatus relating to a wide range of speech recognition issues. For example, there is a need for improvements with regard to the collecting of speech samples for purposes of training speech recognition models. There is also a need for improved methods of providing speech recognition functionality to users of portable devices with limited processing power, e.g., notebook computers and personal data assistants (PDAs). Improved methods of providing speech recognition functionality in systems where different types of speech recognition models are used by different speech recognizers are also desirable. Enhanced methods and apparatus for updating speech recognition models are also desirable.
The present invention is directed to methods and apparatus for generating, distributing, and using speech recognition models. In accordance with the present invention, a shared, e.g., centralized, speech processing facility is used to support speech recognition for a wide variety of devices, e.g., notebook computers, business computer systems, personal data assistants, etc. The centralized speech processing facility of the present invention may be located at a site physically remote from the devices to which it provides speech processing and/or speech recognition services, e.g., in a different room, building, or even country. The shared speech processing facility may be coupled to numerous devices via the Internet and/or one or more other communications channels such as telephone lines, a local area network (LAN), etc.
In various embodiments, the Internet is used as the communications channel via which model training data is collected and/or speech recognition input is received by the shared speech processing facility of the present invention. Speech files may be sent to the speech processing facility as electronic mail (E-mail) message attachments. The Internet is also used to return speech recognition models and/or information identifying recognized words or phrases included in the processed speech. The speech recognition models may be returned as E-mail message attachments while the recognized words may be returned as text in the body of an E-mail message or in a text file attachment to an E-mail message.
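The E-mail based exchange described above can be sketched with Python's standard `email` package. This is only an illustration under stated assumptions: the addresses, the subject-line convention, the WAV attachment type, and the function names are hypothetical and are not specified by the present description.

```python
from email.message import EmailMessage


def build_speech_message(sender, facility_address, speech_bytes, request_type):
    """Build an E-mail carrying a speech file as an attachment.

    request_type ("training" or "recognition") is a hypothetical way of
    telling the facility which operation is requested.
    """
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = facility_address
    msg["Subject"] = f"speech-{request_type}"
    msg.set_content("Speech file attached for processing.")
    # The audio/wav type and file name are assumptions for illustration.
    msg.add_attachment(speech_bytes, maintype="audio", subtype="wav",
                       filename="speech.wav")
    return msg


def build_result_message(facility_address, recipient, recognized_text):
    """Build the reply that returns recognized words as text in the message body."""
    msg = EmailMessage()
    msg["From"] = facility_address
    msg["To"] = recipient
    msg["Subject"] = "recognition-result"
    msg.set_content(recognized_text)
    return msg
```

A trained speech recognition model could be returned in the same way as the speech file, i.e., as a binary attachment added with `add_attachment`.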
Thus, via the Internet, devices with audio capture capability and Internet access can record digitized speech and transmit it, e.g., as speech files, to the centralized speech processing facility of the present invention. The speech processing facility then performs a model training operation or speech recognition operation using the received speech. Depending on whether a model training or recognition operation was performed, a speech recognition model or a data message including the recognized words, phrases or other information is then returned to the device which supplied the speech.
Thus, the speech processing facility of the present invention can be used to provide speech recognition capabilities and/or to augment a device's speech processing capability by performing speech recognition model training operations and/or additional speech recognition operations which can be used to supplement local speech recognition attempts.
For example, in various embodiments of the present invention, the generation of speech recognition models to be used locally is performed by the remote speech processing facility. In one such embodiment, when the local computer system needs a speech recognition model to be trained, the local computer system collects the necessary training data, e.g., speech samples from the system user and text corresponding to the retrieved speech samples, and then transmits the training data, e.g., via the Internet, to the speech processing facility of the present invention. The speech processing facility then generates one or more speech recognition models and returns them to the local computer system for use in local speech recognition operations.
In various embodiments, the shared speech processing facility updates a training database with the speech samples received from local computer systems. In this way, a more robust set of training data is created at the remote speech processing facility as part of the model training and/or updating process without imposing additional burdens on individual devices beyond those needed to support services being provided to a user of an individual device, e.g., notebook computer or PDA. As the training database is augmented, speaker independent speech recognition models may be retrained periodically using the updated training data and then transmitted to those computer systems which use speech recognition models corresponding to those models which are retrained. In this manner, multiple local systems can benefit from one or more different users initiating the retraining of speech recognition models to enhance recognition results.
As discussed above, in various embodiments, the remote speech processing facility of the present invention is used to perform speech recognition operations and then return the recognition results or take other actions based on the recognition results. For example, in one embodiment business computer systems capture speech from, e.g., customers, and then transmit the speech or extracted speech information to the shared speech processing facility via the Internet. The remote speech processing facility performs speech recognition operations on the received speech and/or received extracted speech information. The results of the recognition operation, e.g., recognized words in the form of, e.g., text, are then returned to the business computer system which supplied the processed speech or speech information. The business system can then use the information returned by the speech processing facility, e.g., recognized text, to fill in forms or perform other services such as automatically responding to verbal customer inquiries. Thus, the remote speech processing method of the present invention can be used to supply speech processing capabilities to customers, e.g., businesses, who can't, or do not want to, support local speech processing operations.
In addition to providing speech recognition capabilities to systems which can't perform speech recognition locally, the speech processing facility of the present invention is used in various embodiments to augment the speech recognition capabilities of various devices such as notebook computers and personal data assistants. In such embodiments the remote speech processing facility may be used to perform speech recognition when the local device is unable to obtain a satisfactory recognition result, e.g., because of a limited vocabulary or limited processing capability.
In one particular exemplary embodiment, a notebook computer attempts to perform a voice dialing operation on received speech using locally stored speech recognition models prior to contacting the speech processing facility of the present invention. If the local speech recognition operation fails to result in the recognition of a name, the received speech or extracted feature information is transmitted to the remote speech processing facility. If the local notebook computer can't perform a dialing operation, the notebook computer also transmits to the remote speech processing facility a telephone number where the user of the notebook computer can be contacted by telephone. The remote speech processing facility performs a speech recognition operation using the received speech and/or extracted feature information. If the speech recognition operation results in the recognition of a name with which a telephone number is associated, the telephone number is retrieved from the remote speech processing facility's memory. The telephone number is returned to the device requesting that the voice dialing speech recognition operation be performed unless a contact telephone number was provided with the speech and/or extracted feature information. In such a case, the speech processing facility uses telephone circuitry to initiate one telephone call to the telephone number retrieved from memory and another telephone call to the received contact telephone number. When the two calls are answered, they are bridged, thereby completing the voice dialing operation.
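The control flow of this exemplary voice dialing embodiment can be sketched as follows. The sketch is a simplified illustration, not an implementation: recognition is replaced by a dictionary lookup, call bridging is only recorded in a list, and all names (`RemoteFacility`, `device_voice_dial`, the returned tuples) are hypothetical.

```python
def local_recognize(features, local_directory):
    """Stand-in for local recognition: succeeds only for names in the
    small local vocabulary (a real recognizer would score models)."""
    return features if features in local_directory else None


class RemoteFacility:
    """Hypothetical model of the remote speech processing facility."""

    def __init__(self, directory):
        self.directory = directory  # name -> telephone number held in memory
        self.bridged_calls = []     # record of (called number, contact number)

    def voice_dial(self, features, contact_number=None):
        # Stand-in for remote recognition against the larger vocabulary.
        name = features if features in self.directory else None
        if name is None:
            return ("unrecognized",)
        number = self.directory[name]
        if contact_number is None:
            # No contact number supplied: return the number to the device.
            return ("number", number)
        # Contact number supplied: place both calls and bridge them.
        self.bridged_calls.append((number, contact_number))
        return ("bridged", number, contact_number)


def device_voice_dial(features, local_directory, facility,
                      can_dial, contact_number=None):
    """Try local recognition first, then fall back to the remote facility."""
    name = local_recognize(features, local_directory)
    if name is not None:
        return ("local", local_directory[name])
    # Only a device that cannot dial sends its contact telephone number.
    return facility.voice_dial(features, None if can_dial else contact_number)
```

For instance, a device holding only a few names locally would obtain a `("number", ...)` result for a name known only to the facility, while a device that cannot dial and supplies a contact number would obtain a `("bridged", ...)` result.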
In addition to generating new speech recognition models to be used in speech processing operations and providing speech recognition services, the centralized speech processing facility of the present invention can be used for modernizing existing speech recognition systems by upgrading speech recognition models and the speech recognition engine used therewith. In one particular embodiment, speech recognition models or templates are received via the Internet from a system to be updated along with speech corresponding to the modeled words. The received models or templates and/or speech are used to generate updated models which include different speech characteristic information or have a different model format than the existing speech recognition models. The updated models are returned to the speech recognition systems along with, in some cases, new speech recognition engine software.
In one particular embodiment, speech recognition templates used by voice dialing systems are updated and replaced with HMMs generated by the central processing system of the present invention.
At the time the templates are replaced, the speech recognition engine software is also replaced with a new speech recognition engine which uses HMMs for recognition purposes.
Various additional features and advantages of the present invention will be apparent from the detailed description which follows.