This invention relates to a continuous speech or voice recognition system.
A speech recognition system has a number of advantages as a device for supplying commands and data as inputs to a machine system or a computer system. A considerable number of speech recognition systems are already in practical use. Above all, a continuous speech recognition system is excellent in that it can continuously supply numerals and other data to machines and computer systems and accordingly has a high input speed.
Continuous speech recognition has been approached in a variety of ways. A system based on what is called in the art two-level DP-matching appears to have the best performance. A system of this type is described, for example, in U.S. Pat. No. 4,049,913 issued to Hiroaki Sakoe, one of the present applicants, and assigned to Nippon Electric Co., Ltd., the instant assignee, and in an article contributed by Hiroaki Sakoe to IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-27, pp. 588-595 (No. 6, December 1979), under the title "Two-Level DP-Matching--A Dynamic Programming-Based Pattern Matching Algorithm for Connected Word Recognition." The algorithm efficiently carries out the principle according to which an input voice or speech pattern representative of a succession or sequence of continuously spoken words is matched to the optimum one of a plurality of reference pattern concatenations, each given by a concatenation of reference word patterns of reference words preliminarily pronounced individually or discretely. The excellent performance results from the fact that the algorithm makes it unnecessary to divide or segment the input voice pattern beforehand into input word patterns in one-to-one correspondence with the continuously spoken words.
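The matching principle described above can be illustrated with a minimal Python sketch. It assumes scalar feature frames, an absolute-difference local distance, and a plain dynamic time warping (DTW) without slope constraints or normalization; these simplifications, and the names `dtw` and `two_level_match`, are hypothetical and are not taken from the cited patent or article. Level 1 computes the distance between every input segment and every reference word pattern; level 2 picks the concatenation of segments with the smallest total distance, so no segmentation of the input is fixed in advance.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple


def dtw(x: Sequence[float], y: Sequence[float]) -> float:
    """Toy cumulative DTW distance between two scalar feature sequences
    (|a - b| local distance, steps (1,0), (0,1), (1,1), no slope constraint)."""
    n, m = len(x), len(y)
    g = [[math.inf] * (m + 1) for _ in range(n + 1)]
    g[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(x[i - 1] - y[j - 1])
            g[i][j] = d + min(g[i - 1][j], g[i][j - 1], g[i - 1][j - 1])
    return g[n][m]


def two_level_match(inp: Sequence[float],
                    refs: Dict[str, Sequence[float]]) -> Tuple[float, List[str]]:
    """Return (total distance, word sequence) of the best concatenation of
    reference word patterns for `inp`, without segmenting `inp` beforehand."""
    I = len(inp)
    # Level 1: distance between every input segment inp[s:e] and every word.
    seg: Dict[Tuple[int, int, str], float] = {}
    for s in range(I):
        for e in range(s + 1, I + 1):
            for word, ref in refs.items():
                seg[(s, e, word)] = dtw(inp[s:e], ref)
    # Level 2: T[e] = least cost of explaining inp[:e] by a word concatenation.
    T = [math.inf] * (I + 1)
    back: List[Optional[Tuple[int, str]]] = [None] * (I + 1)
    T[0] = 0.0
    for e in range(1, I + 1):
        for s in range(e):
            for word in refs:
                c = T[s] + seg[(s, e, word)]
                if c < T[e]:
                    T[e], back[e] = c, (s, word)
    # Trace back the word boundaries and words chosen by level 2.
    words, e = [], I
    while e > 0:
        s, word = back[e]
        words.append(word)
        e = s
    return T[I], words[::-1]
```

For example, `two_level_match([1.0, 2.0, 5.0, 6.0, 7.0], {"go": [1.0, 2.0], "jon": [5.0, 6.0, 7.0]})` should return `(0.0, ['go', 'jon'])`; the boundary between the two words is found by the level-2 recursion rather than supplied beforehand. The brute-force level 1 shown here is deliberately naive; a practical system would organize that computation far more economically.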
As pointed out in the second complete paragraph on page 589 of the above-referenced Sakoe article, the two-level DP-matching technique still has the drawback that no countermeasure is taken against the coarticulation effect, according to which the physical characteristics of a phoneme are influenced by the preceding phoneme and/or the succeeding one. Depending on the circumstances, the coarticulation effect degrades the matching between a reference word pattern and an input word pattern. To take Japanese numerals as an example, let a continuously spoken word succession be the two-digit or two-word numeral /gojon/ (corresponding to "five four" in English), and let the succession be supplied to a continuous speech recognition system in which individual reference word patterns for the one-digit numerals /go/ (five) and /jon/ (four) are preliminarily stored or registered. In the neighborhood of the point where the two continuously spoken words /go/ and /jon/ merge, a strong coarticulation effect may take place and appreciably alter the physical characteristics of the preceding phoneme /o/ and the succeeding phoneme /j/ from those in the individual reference word patterns. The two-level DP-matching technique makes no provision for this. The coarticulation effect may therefore give rise to misrecognition in some cases.
It is known, on the other hand, that a finite-state automaton is effective in reducing misrecognition in a continuous speech recognition system. A system operable as such an automaton is disclosed in U.S. patent application Ser. No. 175,798 (now U.S. Pat. No. 4,326,101), filed Aug. 6, 1980, by Hiroaki Sakoe, one of the instant applicants, and assigned to the present assignee.
According to the system disclosed in U.S. Pat. No. 4,326,101, an input voice pattern representative of a succession of words continuously spoken in compliance with a regular grammar, or the grammar of a regular language, is recognized with reference to a plurality of reference word patterns, which are representative of individually pronounced reference words, respectively, and are stored in the system before the input voice pattern is supplied thereto. The recognition is controlled by the finite-state automaton so that the input voice pattern is recognized as a concatenation of the reference word patterns that does not contradict the regular grammar.
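Automaton-controlled recognition of this kind can be pictured with a minimal Python sketch, offered under stated assumptions rather than as the procedure of U.S. Pat. No. 4,326,101. The grammar is assumed to be given as a hypothetical deterministic transition table `transitions` mapping (state, word) to the next state, together with a start state and a set of accepting states `finals`; features are scalar frames compared by a toy DTW; `fsa_match` is a hypothetical name. The recursion keeps, for every input frame and automaton state, the least cost of a word string that reaches that state, so only concatenations consistent with the grammar can be selected.

```python
import math
from typing import Dict, List, Sequence, Set, Tuple


def dtw(x: Sequence[float], y: Sequence[float]) -> float:
    """Toy DTW distance: |a - b| local cost, steps (1,0), (0,1), (1,1)."""
    n, m = len(x), len(y)
    g = [[math.inf] * (m + 1) for _ in range(n + 1)]
    g[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            g[i][j] = abs(x[i - 1] - y[j - 1]) + min(
                g[i - 1][j], g[i][j - 1], g[i - 1][j - 1])
    return g[n][m]


def fsa_match(inp: Sequence[float],
              refs: Dict[str, Sequence[float]],
              transitions: Dict[Tuple[int, str], int],
              start: int,
              finals: Set[int]) -> Tuple[float, List[str]]:
    """Best word string accepted by the (hypothetical) automaton, with its
    total distance; returns (inf, []) if no grammatical match exists."""
    I = len(inp)
    # T[e][q]: least cost of matching inp[:e] by a word string that drives
    # the automaton from `start` to state q (missing key = unreachable).
    T: List[Dict[int, float]] = [dict() for _ in range(I + 1)]
    back: List[Dict[int, Tuple[int, int, str]]] = [dict() for _ in range(I + 1)]
    T[0][start] = 0.0
    for e in range(1, I + 1):
        for s in range(e):
            if not T[s]:
                continue  # no grammatical word string ends at frame s
            seg = {w: dtw(inp[s:e], r) for w, r in refs.items()}
            for (p, word), q in transitions.items():
                if p in T[s]:
                    c = T[s][p] + seg[word]
                    if c < T[e].get(q, math.inf):
                        T[e][q] = c
                        back[e][q] = (s, p, word)
    # Only word strings that leave the automaton in an accepting state count.
    reached = [f for f in finals if f in T[I]]
    if not reached:
        return math.inf, []
    q = min(reached, key=lambda f: T[I][f])
    best = T[I][q]
    words, e = [], I
    while e > 0:
        s, p, word = back[e][q]
        words.append(word)
        e, q = s, p
    return best, words[::-1]
```

With `refs = {"go": [1.0, 2.0], "jon": [5.0, 6.0, 7.0]}`, input `[1.0, 2.0, 5.0, 6.0, 7.0]`, `transitions={(0, "go"): 1, (0, "jon"): 1}`, `start=0` and `finals={1}`, the grammar admits only one word, so the result should be `(7.0, ['jon'])`, even though the input matches the two-word string "go jon" with zero distance. The grammar, not the acoustic distance alone, decides which concatenations are eligible.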