The invention relates to a method, a computer program and a data medium for compressing the storage space required by Hidden Markov Model (HMM) prototypes in an electronic memory, and to a system for automatic speech recognition.
Speech processing methods are disclosed, for example, in U.S. Pat. No. 6,029,135, U.S. Pat. No. 5,732,388, DE 19636739 C1 and DE 19719381 C1. For the purpose of automatic speech recognition, naturally spoken speech has in recent times generally been described by what are termed Hidden Markov Models (HMMs). In Hidden Markov Models, the term “emission probability” denotes the probability that the model belonging to class k emits or generates an actually spoken sound or an actually spoken sound sequence. Here, the class k can be, for example, a sound, a sound sequence, or a word or word sequence. An HMM prototype is the mean value of the associated emission probability distribution. The prototypes are obtained from speech recordings.
The prototypes are derived from the recorded sound spectra after decomposition into individual spectral features and further mathematical transformations. They comprise a number of real numbers, the components, and can therefore be regarded as vectors. The individual components differ in importance with regard to the identification of, or assignment to, certain sounds.
High-performance recognizers are dependent on Hidden Markov Models with many prototypes. The storage space required to store the prototypes generally rises in proportion to the number of prototypes. In the best current recognizers, a prototype comprises 10 to 40 components. Each component is represented by 1 to 4 bytes.
Whole word recognizers decompose the words into arbitrary phonetic units for which prototypes are created. They manage with relatively few prototypes, for example, 1000 to 2000 in the case of a vocabulary of 10 to 50 words. Because of the small vocabulary, they are used for special applications such as number recognition or navigation in a menu.
Type-in recognizers assign prototypes exactly to individual sounds. They require 4000 to 10 000 prototypes, and 100 or more prototypes can be assigned to each sound. The use of a type-in recognizer is advantageous in many applications, since the vocabulary to be recognized can be kept variable there.
The storage requirement for a type-in recognizer is on the order of 40 to 1600 kilobytes. The available storage space is very limited, in particular in the case of mobile consumer terminals (for example, mobile phones, Palm Pilots, etc.); at present, it is substantially below 100 kilobytes, since the costs of the memory and the power loss caused by the memory constitute limiting factors. Methods which permit a drastic compression of the storage requirement are needed in order to be able to implement high-performance type-in recognizers for consumer terminals as well.
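The storage figures above follow from multiplying the number of prototypes by the number of components per prototype and the number of bytes per component. The following is a minimal sketch of that arithmetic, not part of the disclosed method. The function names are hypothetical, and a kilobyte is assumed to be 1000 bytes, which is the convention that reproduces the stated range of 40 to 1600 kilobytes.

```python
# Illustrative sketch only: function names are hypothetical, and
# 1 kilobyte is assumed to be 1000 bytes.

def storage_bytes(num_prototypes, components, bytes_per_component):
    """Uncompressed storage for a set of prototype vectors, in bytes."""
    return num_prototypes * components * bytes_per_component


def storage_range_kb(prototypes, components=(10, 40), bytes_per_component=(1, 4)):
    """Return (minimum, maximum) storage in kilobytes.

    Each argument is a (low, high) pair; the defaults are the component
    counts and component sizes quoted in the text.
    """
    low = storage_bytes(prototypes[0], components[0], bytes_per_component[0])
    high = storage_bytes(prototypes[1], components[1], bytes_per_component[1])
    return low / 1000, high / 1000


# Type-in recognizer: 4000 to 10 000 prototypes.
type_in_kb = storage_range_kb((4000, 10000))

# Whole word recognizer: 1000 to 2000 prototypes.
whole_word_kb = storage_range_kb((1000, 2000))

print("type-in recognizer:", type_in_kb, "kilobytes")
print("whole word recognizer:", whole_word_kb, "kilobytes")
```

Under these assumptions, the smallest type-in configuration (4000 prototypes, 10 components, 1 byte each) occupies 40 kilobytes and the largest (10 000 prototypes, 40 components, 4 bytes each) occupies 1600 kilobytes, so only the most economical configurations fit into the roughly 100 kilobytes available on a mobile terminal.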