Speech recognition involves the identification of words or phrases in speech. It generally involves using a speech recognition system including, e.g., a computer, that analyzes the speech according to one or more speech recognition methods to identify the words or phrases included in the speech. Speech recognition may be either speaker dependent, speaker independent or a combination of both.
Speaker dependent speech recognition normally uses a computer that has been "trained" to respond to the manner in which a particular person speaks. In general, the training involves the particular person speaking a word or phrase, converting the speech input into digital signal data, and then generating a template or model of the speech which includes information about various characteristics of the speech. Because templates and models include speech characteristic information, e.g., energy level, duration, and other types of information, they are well suited for speech recognition applications where such characteristics can be measured in received speech and compared to the information included in the templates or models. However, because templates and models do not include all speech characteristics, less speech information can be derived from a template or model than can be derived from a voice recording which has not been compressed.
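By way of illustration only, the training step described above can be sketched as follows. The sketch assumes that the digitized speech is available as a list of amplitude samples and that a template holds just two characteristics, mean energy and duration. The function name `make_template` and the dictionary layout are hypothetical and are not taken from any particular system.

```python
def make_template(samples, sample_rate):
    """Reduce digitized speech to a small set of characteristics.

    samples: amplitude values of the utterance (hypothetical input format).
    sample_rate: samples per second.
    The result keeps only mean energy and duration, so the original
    utterance cannot be reconstructed from it.
    """
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    if not samples:
        raise ValueError("samples must not be empty")
    energy = sum(s * s for s in samples) / len(samples)  # mean squared amplitude
    duration = len(samples) / sample_rate                # seconds
    return {"energy": energy, "duration": duration}


# Four samples at 4 samples per second: a toy utterance one second long.
template = make_template([0.5, -0.5, 0.5, -0.5], sample_rate=4)
```

Because the template retains only the measured characteristics, a different set of characteristics cannot later be computed from it, which is the limitation noted above.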
Templates or models generated during a speech recognition training process are normally stored in a database for future use during a speech recognition operation. During real time speech recognition applications, input speech is processed in a manner similar to that used to generate a template or model during training. The signal characteristic information or data generated by processing the speech upon which a recognition operation is to be performed is then normally compared to a user's set of templates or models and/or speaker independent templates or models. The best match between the input speech and the templates or models is determined in an attempt to identify the speech input. Upon recognition of a particular word or phrase, an appropriate response is normally performed.
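The comparison step can be sketched, under simplifying assumptions, as a nearest-template search. The sketch assumes each template is a fixed-length feature vector and uses Euclidean distance as the match score. Both are illustrative choices; practical recognizers may use other scoring methods. The names `distance` and `best_match` and the sample values are hypothetical.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def best_match(input_features, templates):
    """Return (label, distance) for the stored template closest to the input."""
    if not templates:
        raise ValueError("no templates to compare against")
    label = min(templates, key=lambda name: distance(input_features, templates[name]))
    return label, distance(input_features, templates[label])


# Hypothetical stored templates, keyed by the phrase they represent.
templates = {
    "call home": [0.9, 1.2],
    "call office": [0.4, 2.0],
}
label, dist = best_match([0.85, 1.25], templates)
```

In this toy example the input lies closest to the "call home" template. A practical system would likely also reject matches whose distance exceeds some threshold rather than always accepting the nearest template.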
Speaker independent speech recognition normally uses composite templates or models, or clusters thereof, that represent the same sound, word, or phrase spoken by a number of different persons. Speaker independent templates are normally derived from numerous samples of signal data to represent a wide range of pronunciations. Such data can often be collected during the normal course of business by various services or from a company's own employees, thereby eliminating the need for users of the speech recognition system to be consciously involved in the generation of speaker independent speech recognition templates or models.
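One simple way to picture a composite template is as an element-wise average of feature vectors obtained from several speakers. Averaging is an assumption made here purely for illustration; the text above says only that composites or clusters are derived from numerous samples. The function name `composite_template` is hypothetical.

```python
def composite_template(samples):
    """Average equal-length feature vectors from several speakers."""
    if not samples:
        raise ValueError("at least one sample is required")
    length = len(samples[0])
    if any(len(s) != length for s in samples):
        raise ValueError("all samples must have the same length")
    return [sum(s[i] for s in samples) / len(samples) for i in range(length)]


# Feature vectors for the same word as spoken by three different people.
samples = [[1.0, 2.0], [3.0, 4.0], [2.0, 6.0]]
composite = composite_template(samples)
```

A single average tends to blur distinct pronunciations together, which is one reason a system might instead keep clusters of templates for the same word.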
Referring now to FIG. 1, there is illustrated a known voice dialing telephone system 100 which supports a speech recognition capability. The known system 100 includes a plurality of telephones 101, 102 which are coupled to a switch 116. The switch 116, in turn, is coupled to what is sometimes referred to as an intelligent peripheral 124 via a T1 link. The intelligent peripheral 124 is responsible for supporting voice dialing services by performing speech recognition operations and outputting, to the switch, the telephone numbers associated with spoken names, words, or phrases. These telephone numbers are used to complete the call initiated by the voice dialing service subscriber.
Upon receiving a request from a voice dialing service subscriber to initiate a voice dialing call, the switch 116 couples the subscriber to the intelligent peripheral 124. The intelligent peripheral 124 then processes the subscriber's speech to identify names and/or commands in that speech. This is done using the speech recognizer 126, application processor 130 and database 129.
The application processor 130 controls the storage and retrieval of speaker dependent and speaker independent templates or models stored in the database 129. Frequently speaker independent templates or models are used for commands while speaker dependent templates or models are used for names of individuals to be called. Thus, the database 129 normally includes subscriber specific speech recognition information, e.g., a plurality of speaker dependent templates or models, for each subscriber. In addition, the database 129 normally includes a destination telephone number or instruction associated with each stored speaker dependent template or model and, optionally, a compressed voice recording of the name represented by the template or model which can be played back to the subscriber when placing a call to the number associated therewith. Each subscriber may have a plurality of speaker dependent templates, e.g., of 20 or more names. Accordingly, the creation of a subscriber's speaker dependent templates may represent a substantial investment in terms of training time contributed on the part of each individual subscriber. Speaker dependent templates or models for an individual subscriber are normally retrieved from the database 129 and loaded into the speech recognizer 126 each time the individual subscriber attempts to initiate a voice dialing operation.
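The per-subscriber information held in the database 129 can be pictured with the following sketch. It assumes a simple in-memory layout; the class names, field names, and the telephone number are hypothetical and are used only to illustrate the associations described above (template, destination number, optional compressed recording).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class NameEntry:
    """One trained name: its template, where to dial, optional playback audio."""
    template: List[float]
    destination: str
    recording: Optional[bytes] = None  # compressed voice recording, if kept


@dataclass
class SubscriberRecord:
    subscriber_id: str
    entries: Dict[str, NameEntry] = field(default_factory=dict)


class TemplateDatabase:
    """In-memory stand-in for the template database."""

    def __init__(self) -> None:
        self._records: Dict[str, SubscriberRecord] = {}

    def add_entry(self, subscriber_id: str, name: str, entry: NameEntry) -> None:
        record = self._records.setdefault(subscriber_id, SubscriberRecord(subscriber_id))
        record.entries[name] = entry

    def load_for_call(self, subscriber_id: str) -> Dict[str, NameEntry]:
        """Return a copy of the subscriber's entries for loading into a recognizer."""
        record = self._records.get(subscriber_id)
        return dict(record.entries) if record else {}


db = TemplateDatabase()
db.add_entry("sub-001", "mom", NameEntry(template=[0.9, 1.2], destination="555-0100"))
loaded = db.load_for_call("sub-001")
```

Here `load_for_call` corresponds to the retrieval performed each time a subscriber attempts to initiate a voice dialing operation.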
Inasmuch as a system of the type illustrated in FIG. 1 may have many thousands of subscribers, it can be appreciated that the cumulative investment in subscriber time in generating the stored speaker dependent templates can be quite substantial.
Over time the cost of computers and electronics has decreased while, at the same time, the processing power of such devices has increased. In addition, various advances in speech recognition techniques have resulted in increased recognition accuracy. Such improvements in both speech recognition hardware and the methods by which speech recognition is performed have provided a significant incentive to users of speech recognition systems, e.g., telephone companies among others, to upgrade older systems and to increase the use of speech recognition systems in general. Moreover, with the increased use of speech recognition systems in general, it has become desirable to be able to port a set of speech templates or models created for one application to another application.
Unfortunately, switching from one speech recognition system to another rarely involves a simple substitution of hardware and/or software. This is because databases, e.g., databases of speech recognition templates, used with one, e.g., older, speech recognition system will frequently include data, e.g., speech recognition templates or models, which are incompatible with another, e.g., newer, speech recognition system due to differences in the speech characteristic information included in the speech recognition templates or models. Differences in template or model format or data storage techniques may also complicate matters.
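The incompatibility can be illustrated with a small sketch that compares the characteristics stored in an existing template against those a different system expects. The characteristic names are assumptions: "energy" and "duration" echo the examples given earlier, while "pitch" is a hypothetical extra characteristic chosen only to show a mismatch.

```python
def missing_features(template, required):
    """List the required characteristics that the template does not contain."""
    return sorted(set(required) - set(template))


# Template produced by an older system (hypothetical contents).
old_template = {"energy": 0.7, "duration": 0.42}

# Characteristics assumed to be required by a newer system.
new_system_requires = ["energy", "duration", "pitch"]

missing = missing_features(old_template, new_system_requires)
compatible = not missing
```

Since the missing characteristic was never stored in the older template, it cannot be recovered from the template alone.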
At the present time many users of speech recognition systems are being confronted with the problem of transitioning from older speech recognition systems and platforms to newer ones or, alternatively, with the problem of sharing or using speech recognition templates or models developed for one application or system with another application or system. As discussed above, various applications and/or systems often use different models or templates with differences in the speech characteristic information stored therein.
Because of the differences between the templates and models used in various applications and systems, it is often necessary to generate new speaker dependent templates or models to replace already existing ones when transitioning from one application or system to another. Unfortunately, since the original utterances used to generate the stored templates or models are normally not available to serve as the basis for the generation of new templates or models, it is often necessary to have each user of the speech recognition system repeat the training process for the new application or system. Accordingly, new speaker dependent templates or models often need to be generated to replace those previously used with an application or system being replaced.
The need for users of a speaker dependent speech recognition system to actively participate in the training of a new speech recognition system is a substantial deterrent to the replacement or upgrading of older speech recognition systems. Consider, for example, the case of the telephone provider who provides voice dialing services to many thousands of customers and the inconvenience to those customers that would result if they had to participate in generating entirely new sets of speaker dependent templates or models to replace already existing ones. Differences between speech templates and/or models developed for one system or application, and those developed for other applications or systems have greatly reduced the ability to share or re-use existing speech recognition databases.
In view of the above, it becomes apparent that there is a need for methods and apparatus which are capable of facilitating the transition from one speech recognition application or system to another without requiring the generation of new sets of speech recognition templates. Such methods and apparatus may be required, e.g., when upgrading speech recognition applications and systems; when a service provider desires to change vendors or to deploy multiple vendors without impacting an existing customer base; and/or when a customer changes residences and wishes to carry his speaker dependent directory, e.g., with a large number of trained names, with him to a new service provider.
Thus, there is a need for methods and apparatus which will allow the reuse and/or sharing of speech recognition templates or models developed for one application or system with another application or system which uses templates or models having a different characteristic information content and/or format. As with most speech recognition systems, it is also desirable that any new methods and/or apparatus achieve a suitable recognition rate and degree of accuracy when used.