The present invention relates generally to the field of speech recognition. Two fundamentally different approaches for recognizing spoken language are known in this field. The first approach is based on speaker-independent speech recognition. Here, a vocabulary composed exclusively of individual words defined at the time of manufacture is employed in the speech recognition. A computer unit for speech recognition based on this principle, as well as the corresponding method for speaker-independent speech recognition, is generally known from, for example, G. Ruske, Automatische Spracherkennung, Oldenbourg Verlag, 2nd ed., ISBN 3-48622794-7, pp. 172-195, 1992. This approach is based, for example, on phoneme recognition combined with hidden Markov modeling. First, feature vectors are derived from a digitized voice signal spoken by a user. These feature vectors contain the information of the voice signal that is important for the speech recognition. The identified feature vectors are subsequently compared to prototype feature vectors typical of the phoneme segments. These prototype feature vectors may be stored, for example, in a ROM (read-only memory) provided for this purpose.
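By way of illustration only, the comparison of feature vectors with prototype feature vectors described above may be sketched as a nearest-prototype search. The Euclidean distance measure, the two-dimensional vectors, the phoneme labels and the names `nearest_prototype` and `label_frames` are assumptions chosen for this example and are not taken from the cited reference.

```python
import math

# Hypothetical prototype feature vectors, one per phoneme segment.
# In the approach described above, such a table would reside in ROM.
PROTOTYPES = {
    "a": (1.0, 0.0),
    "i": (0.0, 1.0),
    "s": (1.0, 1.0),
}


def euclidean(a, b):
    """Euclidean distance between two equally long feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_prototype(feature, prototypes=PROTOTYPES):
    """Return (label, distance) of the prototype closest to `feature`."""
    scored = ((label, euclidean(feature, proto))
              for label, proto in prototypes.items())
    return min(scored, key=lambda pair: pair[1])


def label_frames(frames, prototypes=PROTOTYPES):
    """Assign the closest phoneme label to each feature vector in turn."""
    return [nearest_prototype(frame, prototypes)[0] for frame in frames]
```

In a system of the kind described, the per-frame distances would typically be passed on to the subsequent search rather than reduced to a single label at once; the hard labeling here merely keeps the sketch short.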
Since only one memory location for the phonetic representation is provided for each word of the vocabulary to be recognized, the total memory requirement for speaker-independent speech recognition is mainly determined by the memory capacity of the ROM. The results of the aforementioned comparison operations are then combined with one another in a search to determine, from the predetermined vocabulary, the word that was spoken with the highest probability. In this approach, the vocabulary must be stored with the prototype feature vectors in the form of phonemes of the respective language. Because speaker-independent speech recognition is based on phoneme recognition, a user-defined part of the vocabulary can be recognized only if a phonetic notation, input by the user, is available for each word to be incorporated into the vocabulary.
For this reason, this approach has the disadvantage that the user incurs additional effort in providing the phonetic representation of each user-defined part of the vocabulary. This also leads to ergonomic disadvantages.
Further, the considerable cost of the additionally required human-machine interface in the form of a keyboard is considered a substantial disadvantage of this approach. Because the user himself must divide each new word into phonemes, this approach is also very susceptible to error.
A second approach is based on speaker-dependent speech recognition. This approach is based on a whole-word comparison between a spoken, digitized voice signal and speech samples (templates) spoken during a training phase and stored for speaker-dependent speech recognition. One means for implementing speaker-dependent speech recognition, as well as an example of this approach, is known from K. Zünkler, Spracherkennung mit Hidden-Markov Modellen unter Nutzung von unterscheidungsrelevanten Merkmalen, Dissertation, Technical University of Munich, pp. 22-25, 1991.
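By way of illustration only, a whole-word comparison against stored templates may be sketched as follows. The sketch assumes dynamic time warping as the comparison measure and scalar features per frame; neither assumption is stated in the cited dissertation as summarized here, and the names `dtw_distance` and `recognize` are hypothetical.

```python
def dtw_distance(a, b):
    """Dynamic-time-warping distance between two sequences of scalar features.

    cost[i][j] holds the cheapest alignment cost of a[:i] with b[:j].
    """
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # frame of a repeated against b[j-1]
                cost[i][j - 1],      # frame of b repeated against a[i-1]
                cost[i - 1][j - 1],  # both sequences advance
            )
    return cost[n][m]


def recognize(utterance, templates):
    """Return the vocabulary word whose stored template is closest."""
    return min(templates, key=lambda word: dtw_distance(utterance, templates[word]))
```

For example, with the hypothetical templates `{"yes": [1, 3, 4, 3, 1], "no": [5, 5, 2, 1]}`, an utterance `[1, 1, 3, 4, 4, 3, 1]` is a time-stretched copy of the first template and is therefore assigned to "yes".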
A considerable disadvantage of this approach is the need for static storage of the speech samples (templates). Without static storage, a training phase would have to be repeated at the beginning of each "speech recognition session", which cannot be expected of a user. The resulting requirement for static RAM space is proportional to the number of stored templates per vocabulary word, to the number of user-defined vocabulary words, and to the number of users for whom the speaker-dependent speech recognition must be operable at the same time. Beyond a certain combination of values for these parameters, not only does the required capacity of the static memory exceed that of a means with speaker-independent speech recognition, but the static storage also increases the dissipated power, which impedes power-saving operation.
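The proportionality stated above may be made concrete with a small calculation. The function name `template_ram_bytes` and the size of a single template are assumptions introduced for this example; the source gives no figures.

```python
def template_ram_bytes(templates_per_word, user_words, users, bytes_per_template):
    """Static RAM needed for the stored templates.

    The result grows linearly in each of the first three factors named in
    the text; bytes_per_template is a hypothetical constant.
    """
    return templates_per_word * user_words * users * bytes_per_template


# Hypothetical figures: 2 templates per word, 10 user-defined words,
# 3 users, 2000 bytes per template.
example = template_ram_bytes(2, 10, 3, 2000)  # 120000 bytes
```

Doubling any one of the three counts doubles the static RAM requirement, which is why the memory demand can quickly exceed that of the speaker-independent approach.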
A further disadvantage of this approach is the considerable manufacturing cost, which is incurred in particular because of the unfavorable chip-area ratio of a static RAM to a ROM.
It is also known from an article entitled "Product Overview--Advance Information, DVC Advanced Voice Command Processor," in DSP Communications, Inc., Cupertino, Calif., 1995, to implement the algorithms for speaker-independent speech recognition and for speaker-dependent speech recognition on a plurality of chips. This known computer unit comprises a special processor bearing the type designation DVC 306, a microcontroller and a plurality of memory chips having a total of up to 16 megabits of S-RAM capacity.
This known computer unit for speech recognition has a number of considerable disadvantages. Because both the algorithms for speaker-independent speech recognition and the algorithms for speaker-dependent speech recognition are employed, a plurality of algorithms must be implemented in a ROM.
Further, the disadvantages of the speaker-dependent algorithms, for example the high demand for static RAM and the associated substantial manufacturing costs, are still present in this implementation.
What is referred to as the Viterbi algorithm is also known from G. Ruske, Automatische Spracherkennung, Oldenbourg Verlag, 2nd ed., ISBN 3-48622794-7, pp. 172-195, 1992. The method of dynamic programming (DP algorithm) is likewise known from the same reference.
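By way of illustration only, the Viterbi algorithm may be sketched in its common textbook form for a discrete hidden Markov model. This is not necessarily the formulation given in the cited reference. The function signature, the dictionary-based model layout and the requirement of a non-empty observation sequence are assumptions made for this example.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (most probable state path, its probability).

    start_p[s]     probability of starting in state s
    trans_p[p][s]  probability of moving from state p to state s
    emit_p[s][o]   probability of observing o in state s
    Assumes a non-empty observation sequence.
    """
    # best[t][s] = (probability of the best path ending in s at time t,
    #               predecessor state on that path)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            prob, prev = max((best[-1][p][0] * trans_p[p][s], p) for p in states)
            column[s] = (prob * emit_p[s][obs], prev)
        best.append(column)

    # Backtrack from the most probable final state.
    state = max(states, key=lambda s: best[-1][s][0])
    prob = best[-1][state][0]
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    path.reverse()
    return path, prob
```

Practical implementations usually work with log probabilities to avoid numerical underflow on long observation sequences; plain probabilities are used here only for readability.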