The invention relates to a method and a device for recognizing at least one keyword in spoken speech using a computer.
A method and a device for recognizing spoken speech are known from the reference by A. Hauenstein, titled "Optimierung von Algorithmen und Entwurf eines Prozessors für die automatische Spracherkennung" ["Optimization of Algorithms and Design of a Processor for Automatic Speech Recognition"], Chair of Integrated Circuits, Technical University Munich, Dissertation, 07.19.1993, chapter 2, pages 13 to 26. This publication also contains a basic introduction to the components of a device and a method for speech recognition, and to important techniques customary in speech recognition.
A keyword is a specific word that is to be recognized by a device for speech recognition in spoken speech. Such a keyword is mostly linked to a prescribed action, that is to say this action is executed after recognition of the keyword.
A method and a device for recognizing spoken speech are also described in the reference by N. Haberland, et al., titled "Sprachunterricht - Wie funktioniert die computerbasierte Spracherkennung?" ["Language Instruction - How Does Computer-Based Speech Recognition Work?"], c't, May 1998, Heinz Heise Verlag, Hannover 1998, pages 120 to 125. It follows therefrom, inter alia, that modeling by hidden Markov models permits adaptation to variations in the speaking rate, and that during recognition the prescribed speech modules are therefore dynamically adapted to the spoken speech, in particular by compressing or expanding the time axis. This corresponds to a dynamic adaptation (also: dynamic programming), which is ensured, for example, by the Viterbi algorithm.
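The compression and expansion of the time axis mentioned above can be illustrated by a minimal dynamic-programming sketch. It uses classic dynamic time warping on one-dimensional sequences rather than the Viterbi algorithm on a hidden Markov model; the function name `dtw_distance` and the absolute-difference local cost are assumptions chosen for illustration and are not taken from the cited references.

```python
def dtw_distance(a, b):
    """Accumulated cost of the best time-warped alignment of sequences a and b.

    Illustrative dynamic time warping (an assumption, not the Viterbi/HMM
    procedure of the cited reference): each element may be stretched or
    compressed in time to match the other sequence.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best accumulated cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])  # local distance (assumed)
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b element is stretched
                cost[i][j - 1],      # b advances, a element is stretched
                cost[i - 1][j - 1],  # both sequences advance
            )
    return cost[n][m]
```

In this sketch, a sequence spoken more slowly (for example with one repeated element) aligns to the faster version at no extra cost, which is the effect attributed to the dynamic adaptation.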
A distance between sounds or sound sequences is determined, for example, by determining a (multidimensional) distance between feature vectors that describe the sounds of speech in digitized form. This distance is an example of a measure of similarity between sounds or sound sequences.
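By way of example, such a measure between two feature vectors can be sketched as follows. The Euclidean metric and the function name `euclidean_distance` are assumptions made for illustration; the text does not prescribe a particular metric.

```python
import math


def euclidean_distance(u, v):
    """Euclidean distance between two feature vectors of equal length.

    The choice of metric is an assumption; any multidimensional
    distance could serve as the measure of similarity.
    """
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same dimension")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
```

A small value indicates that the two sounds described by the feature vectors are similar; a value of zero indicates identical feature vectors.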
It is accordingly an object of the invention to provide a method and a device for recognizing at least one keyword in spoken speech using a computer which overcome the above-mentioned disadvantages of the prior art methods and devices of this general type, in which the recognition is robust and insensitive to interference.
With the foregoing and other objects in view there is provided, in accordance with the invention, a speech recognition method in which a computer performs the steps of:
a) subdividing a keyword into key segments;
b) assigning each of the key segments a set of reference features;
c) subdividing a test pattern derived for spoken speech into test segments;
d) assigning each of the test segments of the test pattern a reference feature from the set of reference features from a corresponding one of the key segments being most similar to a respective test segment; and
e) recognizing the test pattern as the keyword, if a measure of similarity is determined to be below a prescribed value of an accumulated segment-wise comparison of the reference feature to the respective test segment for each of the test segments of the test pattern.
A method is specified for recognizing at least one keyword in spoken speech using a computer, the keyword being subdivided into segments and each segment being assigned a set of reference features. A test pattern which is included in the spoken speech is subdivided into segments, and each segment of the test pattern is assigned the reference feature most similar to that segment from the set of reference features for the corresponding segment of the keyword. The test pattern is recognized as a keyword when a measure of similarity for the accumulated segment-wise assignment of a reference feature of the keyword relative to the test pattern is below a prescribed bound. The test pattern is not recognized as a keyword if the measure of similarity is not below the prescribed bound. In this case, a low measure of similarity characterizes a good correspondence between the reference feature of the keyword and the test pattern.
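The method described above can be sketched, purely by way of illustration, as follows. The sketch assumes that each segment is represented by one feature vector, that the keyword and the test pattern have the same number of segments, and that the Euclidean distance serves as the measure of similarity; the names `accumulated_distance` and `is_keyword` are hypothetical.

```python
import math


def distance(u, v):
    """Euclidean distance between two feature vectors (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def accumulated_distance(test_segments, keyword_reference_sets):
    """Sum, over all segments, of the distance to the most similar reference.

    test_segments: one feature vector per segment of the test pattern.
    keyword_reference_sets: per keyword segment, a list of reference
    feature vectors (the set of reference features of that segment).
    """
    if len(test_segments) != len(keyword_reference_sets):
        raise ValueError("keyword and test pattern need equal segment counts")
    total = 0.0
    for test_vec, references in zip(test_segments, keyword_reference_sets):
        # Assign the reference feature most similar to this test segment.
        total += min(distance(test_vec, ref) for ref in references)
    return total


def is_keyword(test_segments, keyword_reference_sets, bound):
    """Recognize the test pattern when the accumulated measure is below bound."""
    return accumulated_distance(test_segments, keyword_reference_sets) < bound
```

Because the most similar reference feature is selected independently for each segment, the test pattern is in effect compared with every reference pattern that can be combined from the stored reference features.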
A brief account of the various terms and their meaning follows below. The test pattern is a pattern included in the spoken speech which is to be compared with the keyword and is recognized as the keyword, if appropriate. The measure of similarity characterizes the degree of correspondence between a test pattern and the keyword, or between a part of the test pattern and a part of the keyword. The segment is a section of the test pattern or of the keyword which has a prescribed duration. The reference feature is a sub-feature of the keyword which is referenced to a segment. A reference pattern contains the reference features characterizing a form of expression of the keyword. A word class contains all the reference patterns which can be produced by different combinations of reference features, a plurality of reference features per segment being stored for the keyword, in particular. In a training phase, representatives of reference features of the respective keyword are determined and stored, while in a recognition phase a comparison of the test pattern with possible reference patterns of the keyword is carried out.
In the training phase, a prescribed number M of representatives of the reference features is preferably stored. If more reference features are available than free spaces M, averaging of the reference features, for example in the form of a sliding average, can be performed so that the information of the additional reference features is taken into account in the representatives.
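One possible way of handling more reference features than free spaces is sketched below. The text leaves open how the sliding average is formed; merging each additional reference feature into the nearest stored representative by a running mean is an assumption made for this sketch, as are the class name `RepresentativeStore` and its attributes.

```python
class RepresentativeStore:
    """Holds at most max_slots representatives for one keyword segment."""

    def __init__(self, max_slots):
        if max_slots < 1:
            raise ValueError("at least one slot is required")
        self.max_slots = max_slots
        self.representatives = []  # stored feature vectors
        self.counts = []           # number of features merged into each

    def add(self, feature):
        """Store the feature, or average it into the nearest representative."""
        if len(self.representatives) < self.max_slots:
            # A free space is still available: store the feature as is.
            self.representatives.append([float(x) for x in feature])
            self.counts.append(1)
            return
        # No free space: find the nearest representative (squared distance).
        nearest = min(
            range(len(self.representatives)),
            key=lambda i: sum(
                (r - x) ** 2 for r, x in zip(self.representatives[i], feature)
            ),
        )
        n = self.counts[nearest] + 1
        # Running mean update, so every merged feature carries equal weight.
        self.representatives[nearest] = [
            r + (x - r) / n for r, x in zip(self.representatives[nearest], feature)
        ]
        self.counts[nearest] = n
```

With this choice, the number of stored representatives never exceeds M, while the information of every additional reference feature still enters one of the representatives.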
A development of the invention consists in that the test pattern (and/or the keyword) is an independent sound unit, in particular a word. The test pattern and/or the keyword can also be a phoneme, a diphone, another sound composed of a plurality of phonemes, or a set of words.
Another development consists in that the number of segments for the keyword and for the test pattern is the same in each case.
Within the framework of an additional development, the test pattern is compared with a plurality of keywords, and the keyword most similar to the test pattern is output. This corresponds to a system for recognizing individual words, the plurality of keywords representing the individual words to be recognized in the spoken speech. In each case, the keyword which best fits the test pattern included in the spoken speech is output.
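The comparison with a plurality of keywords can be sketched as follows. The sketch assumes one feature vector per segment, the Euclidean distance as the measure of similarity, and a dictionary mapping each keyword to its per-segment sets of reference features; the names `best_keyword` and `vocabulary` are hypothetical.

```python
import math


def segment_score(test_vec, references):
    """Distance from a test segment to its most similar reference feature."""
    return min(
        math.sqrt(sum((a - b) ** 2 for a, b in zip(test_vec, ref)))
        for ref in references
    )


def best_keyword(test_segments, vocabulary):
    """Return (keyword, accumulated measure) for the best-fitting keyword.

    vocabulary: dict mapping a keyword to a list that holds, for each
    segment, the reference feature vectors stored for that segment.
    """
    scores = {}
    for keyword, reference_sets in vocabulary.items():
        if len(reference_sets) != len(test_segments):
            raise ValueError("segment counts must match for every keyword")
        scores[keyword] = sum(
            segment_score(vec, refs)
            for vec, refs in zip(test_segments, reference_sets)
        )
    # The lowest accumulated measure marks the most similar keyword.
    winner = min(scores, key=scores.get)
    return winner, scores[winner]
```

If some utterances are to be rejected altogether, the returned measure can additionally be compared with the prescribed bound before the keyword is output.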
Another development is that feature vectors are used for storing the keyword and the test pattern, in which case the speech is digitized at prescribed sampling instants and one feature vector each is stored with the data characterizing the speech. This digitization of the speech signal takes place within the framework of preprocessing. A feature vector is preferably determined from the speech signal every 10 ms.
Another development consists in that there is stored for each segment a feature vector which is averaged over all the feature vectors of this segment and is further used as a characteristic of this segment. The digitized speech data, which occur every 10 ms, for example, are preferably preprocessed in overlapping time windows with a temporal extent of 25 ms. An LPC analysis, a spectral analysis or a cepstral analysis can be used for this purpose. A feature vector with n coefficients is available as a result of the respective analysis for each 10 ms section. The feature vectors of a segment are preferably averaged such that one feature vector is available per segment. It is possible within the framework of the training for recognizing the keyword to store a plurality of different reference features per segment from different sources for spoken speech, such that a plurality of averaged reference features (feature vectors for the keyword) are available.
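The windowing and the segment-wise averaging can be sketched as follows. The analysis itself (LPC, spectral or cepstral) is omitted; the function names, the parameter `sample_rate_hz`, and the division of the feature vectors into segments of approximately equal length are assumptions made for illustration.

```python
def frame_signal(samples, sample_rate_hz, window_ms=25, hop_ms=10):
    """Cut a sampled signal into overlapping windows.

    A 25 ms window is taken every 10 ms, so adjacent windows overlap.
    Each window would then be passed to the chosen analysis to obtain
    one feature vector with n coefficients.
    """
    window = sample_rate_hz * window_ms // 1000
    hop = sample_rate_hz * hop_ms // 1000
    return [
        samples[start:start + window]
        for start in range(0, len(samples) - window + 1, hop)
    ]


def average_per_segment(feature_vectors, num_segments):
    """Average the feature vectors of each segment into one vector.

    The vectors are split into num_segments runs of nearly equal
    length (an assumed segmentation rule).
    """
    n = len(feature_vectors)
    if num_segments < 1 or n < num_segments:
        raise ValueError("need at least one feature vector per segment")
    averaged = []
    for k in range(num_segments):
        chunk = feature_vectors[k * n // num_segments:(k + 1) * n // num_segments]
        dim = len(chunk[0])
        averaged.append(
            [sum(vec[d] for vec in chunk) / len(chunk) for d in range(dim)]
        )
    return averaged
```

The averaged vectors obtained in this way for a training utterance can be stored as reference features of the keyword, and those obtained for the spoken speech form the segments of the test pattern.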
Furthermore, a device is specified for recognizing at least one keyword in spoken speech, which has a processor unit which is set up in such a way that the following steps are carried out. The keyword is subdivided into segments, it being possible to assign each segment a set of reference features. A test pattern in the spoken speech is subdivided into segments, it being possible to assign each segment of the test pattern the reference feature most similar to that segment of the test pattern, from the set of reference features for the corresponding segment of the keyword. The test pattern is recognized as a keyword when the measure of similarity for the accumulated segment-wise assignment of a reference feature of the keyword relative to the test pattern is below a prescribed bound. If the measure of similarity is not below the prescribed bound, the keyword is not recognized.
Other features which are considered as characteristic for the invention are set forth in the appended claims.
Although the invention is illustrated and described herein as embodied in a method and a device for recognizing at least one keyword in spoken speech using a computer, it is nevertheless not intended to be limited to the details shown, since various modifications and structural changes may be made therein without departing from the spirit of the invention and within the scope and range of equivalents of the claims.
The construction and method of operation of the invention, however, together with additional objects and advantages thereof will be best understood from the following description of specific embodiments when read in connection with the accompanying drawings.