The present invention relates to a method for recognizing speech according to the preamble of claim 1, in particular to a method for recognizing speech in which the number of model function mixtures used in an acoustic model is reduced through speaker adaptation, and more particularly to the reduction of the number of Gaussian mixtures in a speaker-adaptive HMM-based speech recognition system.
Methods for automatic speech recognition are becoming increasingly important. A particular problem in conventional methods for recognizing speech is that conflicting goals have to be achieved simultaneously. On the one hand, the methods and devices should be as flexible as possible so as to deal with a large variety of speaking behavior, in particular with a variety of pronunciations, accents, dialects or the like. On the other hand, methods and devices for recognizing speech should be compact, so that they are easy to implement, run fast and achieve high recognition efficiency, in particular at low cost.
Prior art methods for recognizing speech use speaker adaptation to transform an underlying acoustic model so that it better fits the acoustic properties and the speaking behavior of a current or specific speaker. The basis of each acoustic model is essentially a set of model function mixtures. Many model function mixtures are needed to cover the large variety and variability of acoustic behavior, in particular with respect to phones, phonemes, subword units, syllables, words or the like. In conventional methods for recognizing speech, a current acoustic model is adapted during recognition by at least partially changing the contributions of the model function mixture components of the model function mixtures, in particular on the basis of at least one recognition result already obtained.
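To make the notion of a model function mixture and of "changing the contributions" of its components more concrete, the following is a minimal sketch. It assumes that each model function mixture is a diagonal-covariance Gaussian mixture (as in the HMM-based systems referred to above) and that adaptation re-estimates only the mixture weights from frames aligned to the mixture by an earlier recognition pass. The MAP-style weight update, the names `GaussianMixture` and `adapt_weights`, and the relevance factor `tau` are hypothetical illustrations, not taken from this document.

```python
import math
from dataclasses import dataclass


@dataclass
class GaussianMixture:
    """One model function mixture: weighted diagonal-covariance Gaussians."""
    weights: list[float]          # mixture weights (contributions), summing to 1
    means: list[list[float]]      # one mean vector per component
    variances: list[list[float]]  # one diagonal variance vector per component

    def component_log_densities(self, x: list[float]) -> list[float]:
        """Return log(w_k * N(x; mu_k, diag(var_k))) for every component k."""
        out = []
        for w, mu, var in zip(self.weights, self.means, self.variances):
            log_n = -0.5 * sum(
                math.log(2 * math.pi * v) + (xi - m) ** 2 / v
                for xi, m, v in zip(x, mu, var)
            )
            out.append(math.log(w) + log_n)
        return out

    def log_likelihood(self, x: list[float]) -> float:
        """Return log p(x), using log-sum-exp for numerical stability."""
        logs = self.component_log_densities(x)
        top = max(logs)
        return top + math.log(sum(math.exp(l - top) for l in logs))


def adapt_weights(gmm: GaussianMixture, frames: list[list[float]],
                  tau: float = 10.0) -> GaussianMixture:
    """Hypothetical MAP-style weight adaptation (means and variances unchanged).

    `frames` are feature vectors that an earlier recognition result assigned
    to this mixture; `tau` controls how strongly the prior weights are kept.
    """
    counts = [0.0] * len(gmm.weights)
    for x in frames:
        logs = gmm.component_log_densities(x)
        total = gmm.log_likelihood(x)
        for k, l in enumerate(logs):
            counts[k] += math.exp(l - total)  # posterior occupancy of component k
    n = sum(counts)
    # Interpolate prior weights with observed occupancies; result still sums to 1.
    new_weights = [(tau * w + c) / (tau + n) for w, c in zip(gmm.weights, counts)]
    return GaussianMixture(new_weights, gmm.means, gmm.variances)


# Usage: a speaker whose frames lie near the second component shifts weight to it.
gmm = GaussianMixture([0.5, 0.5], [[0.0], [5.0]], [[1.0], [1.0]])
speaker_frames = [[4.8], [5.1], [5.3], [4.9]]
adapted = adapt_weights(gmm, speaker_frames)
print([round(w, 3) for w in adapted.weights])  # → [0.357, 0.643]
```

Adapting only the weights keeps the sketch short; practical systems may also adapt means and variances, which this example does not attempt.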
A major drawback of these conventional speaker adaptation methods is that the underlying models employ a very large number of model function mixtures and model function mixture components. These common methods for recognizing speech therefore have to perform a correspondingly large number of checks, comparisons and determinations in order to fit the current acoustic model to the current speaker. Because of this computational burden, implementations of conventional methods for recognizing speech must be based on high-performance computer systems with high-capacity storage and fast processing units. It is an object of the present invention to provide a method for recognizing speech which allows fast performance with a reduced computational burden and a particularly high recognition yield.
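The computational argument can be made concrete: for every feature frame, each component of every active mixture has to be evaluated, so the work per frame grows linearly with the total number of mixture components. The sketch below counts these evaluations for hypothetical model sizes and shows how discarding components whose weight falls below a hypothetical threshold `min_weight` reduces the count. The helper names, the example weights and the 39-dimensional feature vector are assumptions for illustration only; this pruning rule is not presented as the method claimed in this document.

```python
def evaluations_per_frame(mixture_weights: list[list[float]], dims: int) -> int:
    """Count per-dimension Gaussian terms computed for one feature frame.

    Each component of each mixture costs roughly one term per feature dimension.
    """
    return sum(len(w) for w in mixture_weights) * dims


def prune(weights: list[float], min_weight: float) -> list[float]:
    """Drop components below min_weight and renormalize the survivors (illustrative)."""
    kept = [w for w in weights if w >= min_weight]
    if not kept:  # always keep at least the strongest component
        kept = [max(weights)]
    total = sum(kept)
    return [w / total for w in kept]


# Hypothetical speaker-adapted weights for three mixtures with 8 components each.
adapted = [
    [0.40, 0.30, 0.20, 0.04, 0.03, 0.02, 0.005, 0.005],
    [0.70, 0.10, 0.10, 0.05, 0.02, 0.01, 0.01, 0.01],
    [0.25, 0.25, 0.25, 0.20, 0.02, 0.01, 0.01, 0.01],
]
DIMS = 39  # assumed feature size, e.g. 13 cepstra plus first and second derivatives

before = evaluations_per_frame(adapted, DIMS)
pruned = [prune(w, min_weight=0.05) for w in adapted]
after = evaluations_per_frame(pruned, DIMS)
print(before, after)  # → 936 429
```

Since the count is linear in the number of components, reducing 24 components to 11 cuts the per-frame Gaussian work by more than half in this toy setting; a real system would also have to check that recognition accuracy is preserved.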