1. Field of the Invention
The invention is directed to a method for adapting hidden Markov models to the operating demands of a speech recognition system, particularly using specifically formed, multilingual hidden Markov sound models that are adapted to an applied language.
2. Description of the Prior Art
A speech recognition system essentially accesses two independent sources of knowledge. First, there is a phoneme lexicon with which the vocabulary to be recognized is defined. For example, the ASCII strings of the individual words to be recognized as well as their phonetic transcriptions are stored there. This lexicon also prescribes what is referred to as a "task". Second, there is a code book that contains the parameters of the hidden Markov sound models (HMMs) and thus, in particular, contains the mid-points of the probability density distributions belonging to recognition segments.
The best performance of a speech recognition system can be observed when the HMM code book is optimally adapted to the lexicon. This is the case when the HMM code book is operated together with that lexicon with which this HMM code book was also initially produced by training. When this cannot be assured, then a deterioration in performance is observed.
The problem often arises in speech recognition systems, as utilized for example in switching systems, that the initially trained vocabulary with which the system is delivered is modified by the customer during operation. This usually has the result that co-articulations between phonemes that could not be trained previously occur in the new words. There is thus a mismatch between lexicon and HMM code book, which leads to a deteriorated recognition performance in practical operation.
A practical example of such a situation would be a telephone exchange of a company that understands the names of the employees, automatically recognizes the connection request of a caller on the basis of his speech input, and forwards the call to the corresponding extension (call-by-name). The names of the employees are thus stored in the lexicon. The names will change over and over again due to staff turnover, and the system will therefore exhibit an unsatisfactory recognition performance for the reasons described above.
In order to assure an optimally high recognition performance of a speech recognition system under the described conditions of use, it is thus necessary to implement an adaption of the underlying HMM code book of this recognition system to the newly established task. Different methods for solving this problem are known from the prior art. H. W. Hon and K. F. Lee, "On Vocabulary-Independent Speech Modeling", Proc. IEEE Intern. Conf. on Acoustics, Speech, and Signal Processing, Albuquerque N. Mex., 1990 discloses a solution wherein it is proposed to implement a retraining for adaption of the code book to the lexicon. This procedure has the disadvantage that the vocabulary of the ultimate application is generally only partly known at the time of training. If the retraining must then be started at a later point in time, then all potentially required acoustic models of a new vocabulary must be kept on hand, which is uneconomical and would be difficult to implement in practice.
What is referred to as a MAP algorithm (maximum a posteriori) for the adaptation of the acoustic models by the user on the basis of a specific set of speech samples is disclosed by C. H. Lee and J. L. Gauvain, "Speaker Adaption Based on MAP Estimation of HMM Parameters", Proc. IEEE Intern. Conf. on Acoustics, Speech and Signal Processing, Minneapolis Minn., 1993. The purchaser of the speech recognition system must thereby make speech samples of a number of speakers available. The re-adaption of the code book thereby ensues by supervised learning, i.e. the system must be informed of the correct transcription of an expression. The complicated work steps that are thereby required cannot be expected of a customer.
Both solutions from the prior art have the common disadvantage that they run only off-line. For an HMM code book adaption, the running system must thus be shut down so that the new parameters, i.e. the corresponding recognition units, can be loaded into the system. Further, the procedures of training and adaption require a long time to be set up and implemented, which means a financial disadvantage for the purchaser. An initial code book for the HMM is therefore often offered when the product is delivered. Two training strategies for this are available from the prior art.
On the one hand, the code book can be generated on the basis of a phonetically balanced training dataset. Such code books offer the advantage that they can handle all conceivable applications of unknown tasks since they do not prioritize any recognition units. On the other hand, a specialist code book can be trained. The speech recognition system is thereby trained to exactly the same vocabulary that plays a part in the ultimate application. A higher recognition rate for the specific application is thereby mainly achieved in that the speech recognition system can make use of co-articulations that it already received in the training phase. However, such specialist code books exhibit poorer performance for applications wherein the lexicon changes.
When the lexicon and, thus, the vocabulary of the ultimate application can be modified, or is even entirely unknown at the training time, then manufacturers must, sometimes with difficulty, incorporate into their speech recognition systems a code book that is prepared as generally as possible.
D. B. Paul et al., "The Lincoln Large-Vocabulary Stack-Decoder HMM CSR", Vol. 2 of 5, Apr. 27, 1993, IEEE also discloses that a speech recognition system be adapted to a new speaker in real time. Since, however, the vocabulary in this known system is limited and fixed, it cannot be derived from the Paul et al. article how a modification of the vocabulary could be implemented with such a method.
A significant problem is also that new acoustic phonetic models must be trained for every language in which the speech recognition technology is to be introduced in order to be able to implement a national match. HMMs for modelling the language-specific sounds are usually employed in speech recognition systems. Acoustic word models that are recognized during a search process in the speech recognition procedure are subsequently compiled from these statistically modelled sound models. Very extensive speech data banks are required for training these sound models, the collection and editing of these representing an extremely cost-intensive and time-consuming process. Disadvantages thereby arise when transferring a speech recognition technology from one language into another language since the production of a new speech data bank means, on the one hand, that the product becomes more expensive and, on the other hand, causes a time delay in the market introduction.
Language-specific models are exclusively employed in commercially available speech recognition systems. Extensive speech data banks are collected and edited for transferring these systems into a new language. Subsequently, the sound models for the new language are retrained from scratch with these collected voice data.
In order to reduce the outlay and the time delay when transferring speech recognition systems into different languages, an examination should thus be made to see whether individual sound models are suitable for employment in different languages. P. Dalsgaard and O. Andersen, "Identification of Mono- and Poly-phonemes using acoustic-phonetic Features derived by a self-organising Neural Network", in Proc. ICSLP '92, pages 547-550, Banff, 1992 already provides approaches for producing multilingual sound models and utilizing these in speech recognition in the respective languages. The terms polyphoneme and monophoneme are also introduced therein, with polyphonemes defined as sounds whose sound formation properties are similar enough over several languages to be equated. Monophonemes indicate sounds that exhibit language-specific properties. So that new speech data banks do not have to be trained every time for such development work and investigations, these are already available as a standard as described in P. Dalsgaard, O. Andersen and W. Barry, "Data-driven Identification of Poly- and Mono-phonemes for four European Languages", in Proc. EUROSPEECH '93, pages 759-762, Berlin, 1993, J. L. Hieronymus, "ASCII Phonetic Symbols for the World's Languages: Worldbet", preprint, 1993, and A. Cole, Y. K. Muthusamy and B. T. Oshika, "The OGI Multi-language Telephone Speech Corpus", in Proc. ICSLP '92, pages 895-898, Banff, 1992. This prior art also discloses that existing multilingual models be employed for the segmentation of the speech data in a target language. The training of the sound models is then implemented in the target language. Further prior art for the multilingual employment of sound models is not known.
An object of the present invention is to provide a method for adaptation of an HMM in a speech recognition system wherein the adaptation ensues during the ongoing operation of the speech recognition system. In particular, the above-described complications that derive from the modification of the lexicon and, thus, of the task should be compensated by the adaptation.
This object is achieved in accordance with the principles of the present invention in a method for real-time adaptation of a hidden Markov sound model, in the code book of a speech recognition system, to a vocabulary modification in the phonetic lexicon which is employed, wherein hidden Markov sound models which are to be recognized are maintained available in the code book according at least to an average value vector representing their respective probability distributions, wherein a speech recognition procedure is conducted in a conventional manner by extracting feature vectors from a speech signal and allocating the feature vectors to the probability distributions of the hidden Markov sound models from the code book, and wherein the position of the average value vector of at least one hidden Markov sound model is scaled to the position of the allocated feature vector by a defined adaptation factor for at least one recognized sound expression of the vocabulary modification, immediately after the recognition thereof, and wherein the adapted average value vector is then stored in the code book as the average value vector for that hidden Markov sound model, in place of the previously-stored vector.
A further object of the invention is to provide a method for the formation and adaptation of specific multi-lingually employable HMMs in a speech recognition system with which the transfer outlay of speech recognition systems into another language is minimized in that the parameters in a multi-lingual speech recognition are reduced.
This object is also achieved in accordance with the principles of the present invention in a version of the above-described inventive method wherein, proceeding from at least one first feature vector for a first sound in a first language, and proceeding from at least one second feature vector for a comparably spoken second sound in at least one second language, and their respective associated first and second hidden Markov sound models, a determination is made as to which of the two hidden Markov sound models better describes both feature vectors, and the hidden Markov model which better describes both feature vectors is then employed for modeling the sound in both languages.
The inventive approach provides that a code book that is kept general and that, for example, contains HMMs that are employed for several languages be employed as seed model and that, given a modified lexicon, it be adapted to this new lexicon during ongoing operation.
An adaption during operation is especially advantageously achieved in an embodiment of the method wherein an already recognized feature vector of a sound expression leads to a shift of the stored center of gravity vector in the HMM code book in that a shift of the mid-point of the probability distribution of the hidden Markov model in the direction of the recognized feature vector ensues with an adaption factor during operation after recognition of the word or of the sound sequence. The learning rate can thereby be arbitrarily set by the adaption factor.
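The shift of a stored mid-point vector toward a recognized feature vector can be illustrated with a minimal sketch. The convex-combination form of the update, the function name `adapt_mean` and the parameter name `alpha` are assumptions made here for illustration; the passage above states only that the mid-point is shifted in the direction of the feature vector with an adaption factor.

```python
from typing import List, Sequence


def adapt_mean(mean: Sequence[float], feature: Sequence[float], alpha: float) -> List[float]:
    """Shift a stored mean vector toward a recognized feature vector.

    Assumed update rule (not taken from the text): new = (1 - alpha) * mean + alpha * feature.
    alpha = 0 leaves the mean unchanged; alpha = 1 replaces it with the feature vector.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(mean) != len(feature):
        raise ValueError("mean and feature must have the same dimension")
    return [(1.0 - alpha) * m + alpha * x for m, x in zip(mean, feature)]


# Hypothetical code book entry and feature vector.
stored_mean = [0.0, 2.0]
recognized_feature = [1.0, 4.0]
updated_mean = adapt_mean(stored_mean, recognized_feature, alpha=0.5)
```

In such a sketch, a small adaption factor gives slow, stable learning, while a large factor follows the most recent expressions more closely.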
In the method, the allocation of the feature vectors to the HMMs can be advantageously implemented with standard methods such as the Viterbi algorithm. By employing the Viterbi algorithm, an unambiguous allocation of the feature vectors to the stored mid-point vectors of the HMM code book exists after the recognition.
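How a Viterbi search yields an unambiguous allocation of feature vectors to states can be sketched as follows. The sketch assumes a strictly left-to-right model with one mean vector per state, unit-variance Gaussian emission scores and equal transition probabilities; these simplifications, and the name `viterbi_align`, are illustrative assumptions and are not taken from the text.

```python
import math
from typing import List, Sequence


def viterbi_align(
    frames: Sequence[Sequence[float]],
    means: Sequence[Sequence[float]],
    log_self: float = math.log(0.5),
    log_next: float = math.log(0.5),
) -> List[int]:
    """Return one state index per frame for a left-to-right model.

    The path is forced to start in the first state and end in the last one.
    """
    n_frames, n_states = len(frames), len(means)
    if n_frames < n_states:
        raise ValueError("need at least one frame per state")

    def emit(x: Sequence[float], mean: Sequence[float]) -> float:
        # Unit-variance Gaussian log-density, constant term dropped (assumption).
        return -0.5 * sum((xi - mi) ** 2 for xi, mi in zip(x, mean))

    neg_inf = float("-inf")
    score = [[neg_inf] * n_states for _ in range(n_frames)]
    back = [[0] * n_states for _ in range(n_frames)]
    score[0][0] = emit(frames[0], means[0])

    for t in range(1, n_frames):
        for s in range(n_states):
            stay = score[t - 1][s] + log_self
            move = score[t - 1][s - 1] + log_next if s > 0 else neg_inf
            if move > stay:
                best, back[t][s] = move, s - 1
            else:
                best, back[t][s] = stay, s
            if best > neg_inf:
                score[t][s] = best + emit(frames[t], means[s])

    # Trace back from the last state at the last frame.
    path = [n_states - 1]
    for t in range(n_frames - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


# Hypothetical one-dimensional frames and three state means.
alignment = viterbi_align([[0.0], [0.1], [5.0], [5.2], [10.0]], [[0.0], [5.0], [10.0]])
```

Each entry of the returned list names the state, and hence the stored mid-point vector, to which the corresponding feature vector is allocated.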
Especially advantageously, the sound models to be adapted and to be recognized are kept available in a standardized HMM code book that can serve as basis for all models of practice to be adapted and thus has to be trained only once upon production for all systems to be adapted or, respectively, only has to be offered in the form of a code book with multi-lingual HMMs.
The adaptation of the center of gravity vector to the recognized feature vectors given Laplacian and Gaussian probability density distributions of the hidden Markov models ensues especially advantageously with the specifically indicated equations since this involves a comparatively low calculating outlay.
Given the disclosed method, an even higher recognition rate is advantageously achieved when, given an uncertainly recognized sound expression, this is completely rejected and no adaptation ensues.
The number of sound hypotheses after the Viterbi search and the hit rates of the respective hypotheses with reference to the expression are especially advantageously taken into consideration in the rejection. In this case, the rejection is made dependent on the differences between the hit rates, since these differences represent a measure of the quality of the found solution. Preferably, no rejection ensues given great differences, and a rejection must ensue given small differences. A threshold for the differences in the hit rates is preferably defined for this case, a rejection ensuing when the difference falls below this threshold, since the monitoring of a threshold requires only slight calculating outlay.
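The threshold test on the hit-rate differences can be sketched as follows. It is assumed here that each hypothesis carries a score for which higher is better, that only the gap between the best and second-best hypotheses is examined, and that a single hypothesis is accepted; the name `should_reject` and these choices are illustrative assumptions rather than details taken from the text.

```python
from typing import Sequence


def should_reject(scores: Sequence[float], threshold: float) -> bool:
    """Return True when the best hypothesis is not clearly ahead of the runner-up."""
    if len(scores) < 2:
        # Assumption: with no competing hypothesis there is nothing to compare against.
        return False
    ranked = sorted(scores, reverse=True)
    return (ranked[0] - ranked[1]) < threshold


# Hypothetical log-scores of the hypotheses after the search.
uncertain = should_reject([-10.0, -10.5, -30.0], threshold=2.0)  # small gap
confident = should_reject([-10.0, -25.0], threshold=2.0)         # large gap
```

When the function returns True, the expression would be rejected and no adaptation of the code book would be carried out for it.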
One advantage of the disclosed method is therein that a statistical similarity criterion is utilized that allows that sound model whose characteristic best describes all feature vectors of the respective sound that are available to be selected from a given plurality of different sound models for similar sounds in different languages.
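Selecting, from several candidate sound models, the one that best describes all available feature vectors can be sketched as follows. Each model is reduced here to a single Laplacian density with independent dimensions, and the candidate with the highest total log-likelihood is chosen; this reduction, the language labels and the names `laplace_log_likelihood` and `select_best_model` are assumptions for illustration and are not the specific criterion of the method.

```python
import math
from typing import Dict, Sequence, Tuple

# Hypothetical stand-in for a sound model: (mid-point vector, scale vector).
Density = Tuple[Sequence[float], Sequence[float]]


def laplace_log_likelihood(x: Sequence[float], means: Sequence[float], scales: Sequence[float]) -> float:
    """Log-density of x under independent Laplacian distributions per dimension."""
    return sum(-math.log(2.0 * b) - abs(xi - m) / b for xi, m, b in zip(x, means, scales))


def select_best_model(models: Dict[str, Density], vectors: Sequence[Sequence[float]]) -> str:
    """Return the name of the model with the highest total log-likelihood over all vectors."""
    def total(name: str) -> float:
        means, scales = models[name]
        return sum(laplace_log_likelihood(x, means, scales) for x in vectors)

    return max(models, key=total)


# Hypothetical models for a similar sound in two languages, and pooled feature vectors.
candidates = {"language_1": ([0.0], [1.0]), "language_2": ([1.0], [1.0])}
pooled_vectors = [[0.9], [1.2], [0.2]]
best = select_best_model(candidates, pooled_vectors)
```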
The logarithmic probability distance between the respective HMMs and each and every feature vector is advantageously determined as criterion for the selection of the best HMM for different sound feature vectors. As a result a criterion is made available that reflects experimental findings with respect to the similarity of individual sound models and their recognition rates.
The arithmetic mean of the logarithmic probability distances between each HMM and the respective feature vectors is advantageously formed as criterion for the description of an optimally representative HMM since a symmetrical distance value is thereby obtained.
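A symmetrical distance formed as an arithmetic mean of two directed logarithmic probability distances can be sketched as follows. Each model is again simplified to a single diagonal Gaussian density, and the directed distance is taken as the average log-likelihood difference of one model's feature vectors under its own model and under the other model; this form and all names in the sketch are assumptions, since the equations of the method are not reproduced in this section.

```python
import math
from typing import Sequence, Tuple

# Hypothetical stand-in for a sound model: (mean vector, variance vector).
Model = Tuple[Sequence[float], Sequence[float]]
Vectors = Sequence[Sequence[float]]


def log_likelihood(x: Sequence[float], model: Model) -> float:
    """Log-density of x under a diagonal Gaussian."""
    means, variances = model
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, means, variances)
    )


def directed_distance(vectors: Vectors, own: Model, other: Model) -> float:
    """Average log-likelihood loss when 'other' replaces 'own' for these vectors."""
    return sum(log_likelihood(x, own) - log_likelihood(x, other) for x in vectors) / len(vectors)


def symmetric_distance(vectors_a: Vectors, model_a: Model, vectors_b: Vectors, model_b: Model) -> float:
    """Arithmetic mean of the two directed distances; unchanged when a and b are swapped."""
    return 0.5 * (
        directed_distance(vectors_a, model_a, model_b)
        + directed_distance(vectors_b, model_b, model_a)
    )


# Hypothetical one-dimensional models and feature vectors for two languages.
model_1: Model = ([0.0], [1.0])
model_2: Model = ([2.0], [1.0])
distance = symmetric_distance([[0.0]], model_1, [[2.0]], model_2)
```

Averaging the two directions removes the dependence on which model is treated as the reference, which is the property the passage above refers to as a symmetrical distance value.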
The description criterion for the representative property of an HMM for describing sounds in different languages is advantageously formed by the use of Equations 5 through 8 set forth below, since little calculating outlay arises as a result.
A barrier condition with which a recognition rate of the representative HMM can be set is advantageously prescribed for the application of the description criterion.
The memory outlay for a speech library is especially advantageously reduced by the method since one model can be employed for several languages. The transfer outlay from one language into the other is likewise minimized, this creating a reduced time expenditure for the transfer, which can also be reduced to zero by the on-line adaption. Just as advantageously, less of a calculating outlay is enabled in the Viterbi search since fewer models have to be checked, for example given multilingual input systems.
Special HMMs for employment in multilingual speech recognition systems are especially advantageously utilized. As a result of this procedure, HMMs for sounds in several languages can be combined into polyphoneme models, wherein overlap areas of the standard probability density distributions employed in the various models are investigated. An arbitrary number of standard probability density distributions identically employed in the different models can be employed for describing the polyphoneme model. Advantageously, a number of standard distributions from different speech models can also be employed without the resulting smearing of the individual speech characteristics leading to a significantly lower recognition rate given the use of this model. The distance threshold value of five between similar standard probability distribution densities has proven to be especially advantageous here.
Upon utilization of the method, HMMs are especially advantageously modelled with three states of initial sound, median sound and final sound, since an adequate precision in the description of the sounds is thereby achieved and the calculating outlay in the recognition and on-line adaptation in a speech recognition means remains low.