In addition to providing printed telephone directories, telephone companies provide information services to their subscribers. The services may include stock quotes, directory assistance and many others. In most of these applications, when the information requested can be expressed as a number or number sequence, the user is required to enter his request via a touch tone telephone. This is often aggravating for the user since he is usually obliged to make repetitive entries in order to obtain a single answer. This situation becomes even more difficult when the input information is a word or phrase. In these situations, the involvement of a human operator may be required to complete the desired task.
Because telephone companies are likely to handle a very large number of calls per year, the associated labour costs are very significant. Consequently, telephone companies and telephone equipment manufacturers have devoted considerable efforts to the development of systems that reduce the labour costs associated with providing information services on the telephone network. These efforts comprise the development of sophisticated speech processing and recognition systems that can be used in the context of telephone networks.
In typical speech recognition systems, the user enters his request using isolated word, connected word or continuous speech via a microphone or telephone set. The request may be a name, a city or any other type of information for which either a function is to be performed or information is to be supplied. If valid speech is detected, the speech recognition layer of the system is invoked in an attempt to recognize the unknown utterance. Typically, entries in a speech recognition dictionary, usually including transcriptions associated with labels, are scored in order to determine the most likely match to the utterance.
Commonly, the speech recognition dictionary is created by obtaining transcriptions of words, each component of the transcription being associated with an acoustic model. This operation is usually performed when the speech recognition system is built or at the time of installation. However, the initial transcription content of the speech recognition dictionary may inadequately model the pronunciation of certain words. Furthermore, the speech recognition system may be unable to track the time varying aspect of the pronunciation of words in a language leading to performance degradation over time. Typically, many transcriptions are stored for each orthography in the speech recognition vocabulary in order to model the different pronunciations of the orthography. However, to allow real-time performance of the speech recognition system, only a limited number of transcriptions are stored for each word and the correct transcription may not be chosen to be added to the speech recognition dictionary by the training system. Finally, the training of the speech recognition dictionary is limited to information available in a fixed observation period and events not occurring within the observation period are often not well modelled.
One way to improve the performance of a speech recognition system under these conditions is to make use of past utterance usage. Using this existing knowledge, the speech recognition system is modified to better reflect the utterances received by the system. These techniques are commonly referred to as adaptation techniques. Adaptation attempts to correct the shortcomings of the initial configuration of the speech recogniser. Typical adaptation algorithms can adapt to changes in the speech source, such as changes in the speaker population, as well as changes in the application lexicon, such as modifications made by a system administrator. In principle, almost any parameter in a pattern recognition system can be adapted. Two categories of adaptation can be distinguished, namely supervised adaptation and unsupervised adaptation.
Supervised adaptation involves human intervention, typically that of a phonetician who will correct labels. Supervised adaptation also requires storage of a batch of data, transferring data for validation purposes, and offline computing resources. It requires a large corpus of labelled speech and usually requires a high level of labelling accuracy in order to obtain improved performance of the speech recogniser. In a typical interaction, a phonetician transcribes each sample of speech and labels it with its corresponding orthography, transcription (e.g. phonetic spelling) and noise codes. The transcriptions generated in this fashion are then entered into the speech recognition dictionary. This operation is time consuming and the labour costs associated with the expert phoneticians are significant. Furthermore, tracking the evolution of word pronunciation requires frequent training sessions and is impractical. In order to reduce the labour costs associated with the generation of transcriptions, systems providing unsupervised or automatic adaptation have been developed.
Unsupervised adaptation, herein referred to as automatic adaptation, involves little or no human intervention. It requires a large corpus of labelled speech but usually does not require a high level of labelling accuracy in order to obtain improved performance of the speech recogniser. In a specific example, the labelled speech is generated by the speech recognizer or by a human operator.
A common method of finding the most suitable transcription from a set of transcriptions is to find the transcription, T*, which maximises the product p(U|T) p(T|L), where p(U|T) is the acoustic model score from the speech recognizer (e.g. the probability that the transcription corresponds to a given sample of speech) and p(T|L) is the language model probability (the probability of the transcription given the language model). When the word identity is known with certainty, p(T|L) becomes p(T|W), the probability of a transcription given a fixed word. Typically, p(T|W) is difficult to assess and a constant value is generally used. When multiple examples of a word are available, the logarithm of p(U|T) is summed over the examples, and the transcription which maximises the summed log p(U|T) is taken. For a more detailed discussion of this method, the reader is invited to consult L. R. Bahl et al. "Automatic phonetic baseform determination", 1991, IEEE; D. Hunnicutt et al. "Reversible letter-to-sound and sound-to-letter generation based on parsing word morphology", and R. Haeb-Umbach et al. "Automatic transcription of unknown words in a speech recognition system", 1995, IEEE. The contents of these documents are hereby incorporated by reference.
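The selection rule described above can be illustrated with a minimal Python sketch. It assumes that the per-utterance acoustic log scores for each candidate transcription have already been produced by a recognizer. The function name select_transcription, the candidate transcriptions and the score values are all hypothetical and serve only as illustration. The prior on a transcription given the word is treated as optional, which reflects the common practice of using a constant value.

```python
import math


def select_transcription(log_acoustic, log_prior=None):
    """Return the transcription with the highest summed log score.

    log_acoustic: dict mapping each candidate transcription to a list of
        acoustic log scores, one per available utterance of the word.
    log_prior: optional dict mapping each candidate to the log of its prior
        probability given the word; omitted when a constant is assumed.
    """
    best, best_score = None, -math.inf
    for transcription, scores in log_acoustic.items():
        # Summing log scores is equivalent to multiplying probabilities.
        total = sum(scores)
        if log_prior is not None:
            total += log_prior[transcription]
        if total > best_score:
            best, best_score = transcription, total
    return best, best_score


# Hypothetical log scores for two candidate transcriptions of one word,
# each scored against three utterances.
candidates = {
    "t ah m ey t ow": [-12.0, -11.5, -13.0],
    "t ah m aa t ow": [-12.5, -14.0, -13.5],
}
best, score = select_transcription(candidates)
print(best, score)
```

Because the candidates are drawn from a fixed set supplied by the caller, the sketch can only ever return one of the transcriptions it was given.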
A deficiency of the above described method is that the transcription T* is selected from a fixed set of transcriptions. This allows the system neither to adapt to the time varying nature of speech nor to learn from past speech utterances.
Thus, there exists a need in the industry to refine the process of automatically adapting a speech recognition system.