This invention relates to a connected word recognition system for use in recognizing an input pattern representative of connected words or substantially continuously spoken words by carrying out pattern matching with a concatenation of reference patterns representative of reference words, respectively.
Various connected word recognition systems are already known. Compared with discrete word recognition, in which discretely spoken words are recognized, connected word recognition is carried out at a much higher speed and is therefore very effective in supplying control commands to a controlling system and in supplying instructions and input data to an electronic digital computer system. Among the connected word recognition systems, the most excellent one is believed to be that disclosed in U.S. Pat. No. 4,592,086 issued to Masaru Watari and Hiroaki Sakoe, assignors to Nippon Electric Co., Ltd., the present assignee. According to the Watari et al patent, connected word recognition is carried out in compliance with what is at present called in the art a clockwise dynamic programming (DP) algorithm.
For such a connected word recognition system, first through N-th reference patterns are preliminarily selected, where N represents a first predetermined natural number. The reference patterns include an n-th reference pattern, where n is variable between 1 or unity and the first predetermined natural number N, both inclusive. It is possible to understand that the letter n serves as an n-th word identifier indicative of an n-th reference word represented by the n-th reference pattern and that the first through the N-th reference patterns are arranged at first through N-th consecutive word identifier points along a word identifier axis which may be denoted by the letter n.
The input pattern is represented by an input pattern time sequence which may alternatively be referred to simply as a pattern time sequence and consists of feature vectors positioned along a pattern time axis at first through I-th pattern time instants, respectively, where I represents a second predetermined natural number. The pattern time instants are spaced apart by a pattern time interval or period and include an i-th pattern time instant at which an i-th feature vector is positioned, where i is variable between 1 and the second predetermined natural number, both inclusive. The pattern time axis may be designated by the letter i.
Each reference pattern is represented by a reference pattern time sequence which may be simply called a reference time sequence and consists of reference vectors positioned at reference time instants, respectively. The reference time instants are spaced apart by a reference time interval which may or may not be equal to the pattern time interval.
The concatenation of reference patterns is formed with repetition of one or more of the reference patterns allowed and is selected as a result of recognition of the input pattern in the manner which will presently be described. The concatenation has a concatenation time axis which may alternatively be called a signal time axis for convenience of description of the present invention and is divided into first through J-th signal time instants, where J represents a third predetermined natural number. The signal time instants include a j-th signal time instant which is one of the reference time instants and at which a j-th reference vector of one of the reference patterns is positioned, where j is variable between 1 and the third predetermined natural number J, both inclusive. The signal time axis may be represented by the letter j.
On recognizing the input pattern, a mapping or warping function is used in mapping the pattern and the signal time axes to each other. The mapping function may be:

$$j = j(i),$$

which defines for the i-th pattern time instant one of a predetermined number of consecutive ones of the first through the J-th signal time instants. Furthermore, a similarity measure is used to represent either a similarity or a dissimilarity between the input pattern and the concatenation of reference patterns. For example, an elementary distance is calculated between the feature vector positioned at the i-th pattern time instant and the reference vector positioned at one of the signal time instants that is mapped to the i-th pattern time instant. Such elementary distances are summed up into a pattern distance as regards the first through the I-th pattern time instants to describe a dissimilarity between the input pattern and the concatenation of reference patterns.
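The summation of elementary distances described above may be illustrated by the following minimal sketch. It is an illustration only and is not taken from the Watari et al patent. The Euclidean distance, the function names `elementary_distance` and `pattern_distance`, the zero-based indexing, and the sample vectors are all assumptions introduced for this example.

```python
from math import sqrt


def elementary_distance(a, b):
    """Euclidean distance between two equal-length vectors (an assumed choice)."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pattern_distance(features, references, mapping):
    """Sum elementary distances over the pattern time instants.

    features   -- list of I feature vectors (the input pattern)
    references -- list of J reference vectors (a concatenation of reference patterns)
    mapping    -- list of I indices, where mapping[i] plays the role of j(i), zero-based
    """
    if len(mapping) != len(features):
        raise ValueError("mapping must give one signal time instant per feature vector")
    return sum(
        elementary_distance(features[i], references[mapping[i]])
        for i in range(len(features))
    )


# Hypothetical data: three feature vectors mapped onto two reference vectors.
features = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]]
references = [[0.0, 0.0], [1.0, 1.0]]
mapping = [0, 0, 1]  # j(i) for i = 0, 1, 2
total = pattern_distance(features, references, mapping)
```

In this sketch only the second feature vector differs from the reference vector to which it is mapped, so that the pattern distance consists of that single elementary distance.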
In connection with the pattern distance, it should be noted that the reference patterns are selected so that the reference words may cover the connected words which should be recognized. The n-th reference pattern is therefore either related or unrelated to a portion of the input pattern by a first function n(i) defined by a plurality of consecutive ones of the first through the I-th pattern time instants. The mapping function may be called a second function j(i).
It is possible to represent the elementary distance by a formula d(n, i, j) assuming that the i-th pattern time instant is mapped by a certain second function j(i) to the j-th signal time instant at which the n-th reference pattern has the j-th signal vector of a certain concatenation of reference patterns. According to the clockwise dynamic programming algorithm, pattern distances are calculated along various loci (n(i), j(i)) in a three-dimensional space (n, i, j) which is defined by the word identifier axis n or the first through the N-th word identifiers and the pattern and the signal time axes i and j.
Each locus starts at a starting point (n(s), 1, 1) in the space, namely, at a starting reference pattern n(s) and the first pattern and signal time instants, and ends at an end point (n(e), I, J), namely, at an end reference pattern n(e) and the I-th pattern and the J-th signal time instants. Between these points, the locus passes through pertinent ones of the first through the N-th reference patterns, through the first to the I-th pattern time instants, and through those of the first to the J-th signal time instants to which the first through the I-th pattern time instants are mapped in compliance with the second function j(i). The start and the end reference patterns are included in the first through the N-th reference patterns and may be either different from each other or identical with each other.
An optimum locus (n(i), j(i)) is selected so as to provide a minimum distance among the pattern distances calculated along the respective loci by solving a minimization problem:

$$\min_{n(i),\, j(i)} \left[ \sum_{i=1}^{I} d(n(i), i, j(i)) \right],$$

with the first and the second functions n(i) and j(i) varied. That is, the minimization problem is solved to determine a concatenation of optimum ones n(i) of the first functions. Although simultaneously determined, an optimum second function j(i) is unnecessary in obtaining the result of recognition.
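The minimization over loci described above may be illustrated by the following simplified sketch. It is not the clockwise dynamic programming algorithm of the Watari et al patent; it is a plain frame-synchronous dynamic programming search written under stated assumptions. In particular, a locus is assumed either to stay on the same reference vector or to advance by one reference vector per pattern time instant, and a new word is assumed to begin only after the last reference vector of the preceding word. The names `recognize_connected_words` and `abs_distance` and the sample data are hypothetical.

```python
def recognize_connected_words(features, reference_patterns, dist):
    """Return (minimum pattern distance, list of word identifiers n), zero-based."""
    INF = float("inf")
    # cost[n][j]: smallest accumulated distance of a locus that reaches
    # reference vector j of word n at the current pattern time instant.
    cost = [[INF] * len(ref) for ref in reference_patterns]
    back = []  # back[i][n][j] = (previous state or None, starts_new_word)
    for i, feature in enumerate(features):
        # Cheapest word end at the previous instant, for word-to-word transitions.
        best_end = (INF, None)
        for m, ref in enumerate(reference_patterns):
            if cost[m][-1] < best_end[0]:
                best_end = (cost[m][-1], (m, len(ref) - 1))
        new_cost = [[INF] * len(ref) for ref in reference_patterns]
        new_back = [[None] * len(ref) for ref in reference_patterns]
        for n, ref in enumerate(reference_patterns):
            for j, ref_vec in enumerate(ref):
                if i == 0:
                    # A locus may only start at the first vector of some word.
                    candidates = [(0.0, None, True)] if j == 0 else []
                else:
                    candidates = [(cost[n][j], (n, j), False)]  # stay on vector j
                    if j > 0:
                        # Advance by one reference vector within the same word.
                        candidates.append((cost[n][j - 1], (n, j - 1), False))
                    else:
                        # Begin word n after the best word end.
                        candidates.append((best_end[0], best_end[1], True))
                if not candidates:
                    continue
                prev_cost, prev_state, starts = min(candidates, key=lambda c: c[0])
                if prev_cost < INF:
                    new_cost[n][j] = prev_cost + dist(feature, ref_vec)
                    new_back[n][j] = (prev_state, starts)
        cost = new_cost
        back.append(new_back)

    # The locus must end on the last reference vector of some word.
    total, state = INF, None
    for n, ref in enumerate(reference_patterns):
        if cost[n][-1] < total:
            total, state = cost[n][-1], (n, len(ref) - 1)

    # Trace the optimum locus backwards, keeping only the word identifiers.
    words = []
    for i in range(len(features) - 1, -1, -1):
        if state is None:
            break
        n, j = state
        state, starts = back[i][n][j]
        if starts:
            words.append(n)
    words.reverse()
    return total, words


def abs_distance(a, b):
    """Elementary distance for one-dimensional vectors (an assumed choice)."""
    return abs(a[0] - b[0])


# Hypothetical data: two reference words and an input spelling word 0, word 1, word 0.
reference_patterns = [[[0.0], [1.0]], [[5.0], [6.0]]]
features = [[0.0], [1.0], [5.0], [5.0], [6.0], [0.0], [1.0]]
total, words = recognize_connected_words(features, reference_patterns, abs_distance)
```

Only the sequence of word identifiers is returned as the result of recognition. The warping within each word is used during the search and is then discarded, in keeping with the remark that the optimum second function j(i) is unnecessary.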
Attention should be directed in connection with the clockwise dynamic programming algorithm to the fact that each reference pattern is represented by a mere reference pattern time sequence. As a consequence, the clockwise dynamic programming algorithm has been insufficient to cope with a wide variation in the input pattern representative of predetermined connected words, such as personal differences of articulation or pronunciation and temporary devocalization of voiced vowels. In other words, conventional connected word recognition systems are defective in recognizing connected words with a high precision or reliability.
On the other hand, a multi-layer neural network or net is revealed by Hiroaki Sakoe, the present inventor, in United States patent application Ser. No. 263,208, which was filed on Oct. 27, 1988 (hereinafter referred to as "prior patent application"), which is now U.S. Pat. No. 4,975,961, and with reference to an article contributed by Richard P. Lippmann to the IEEE ASSP Magazine, April 1987, under the title of "An Introduction to Computing with Neural Nets". The multi-layer neural network will hereinafter be referred to simply as a neural network and includes input neuron units of an input layer, intermediate neuron units of an intermediate layer, and at least one output neuron unit of an output layer. The neural network serves to recognize an input pattern of the type described above.
When a single output neuron unit alone is included, the neural network is used in recognizing whether or not the input pattern represents a particular reference pattern assigned to the single output neuron unit. When first through N-th output neuron units are included, they are assigned to the first through the N-th reference patterns of the above-described type. The neural network is used in recognizing the input pattern as one of the reference patterns.
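The use of first through N-th output neuron units may be illustrated by the following minimal sketch of a forward pass through one intermediate layer. It is an illustration only and does not reproduce the neural network of the prior patent application. The sigmoid activation, the function names `layer` and `classify`, and the hand-picked (untrained) weights are all assumptions introduced for this example.

```python
from math import exp


def sigmoid(x):
    """Logistic activation (an assumed choice of neuron nonlinearity)."""
    return 1.0 / (1.0 + exp(-x))


def layer(inputs, weights, biases):
    """One fully connected layer; each row of weights feeds one neuron unit."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def classify(input_vector, hidden_w, hidden_b, output_w, output_b):
    """Return (index of the strongest output neuron unit, all output values)."""
    hidden = layer(input_vector, hidden_w, hidden_b)
    outputs = layer(hidden, output_w, output_b)
    best = max(range(len(outputs)), key=outputs.__getitem__)
    return best, outputs


# Hypothetical hand-picked weights: 2 input, 2 intermediate, 2 output neuron units.
hidden_w = [[4.0, -4.0], [-4.0, 4.0]]
hidden_b = [0.0, 0.0]
output_w = [[4.0, -4.0], [-4.0, 4.0]]
output_b = [0.0, 0.0]

first, _ = classify([1.0, 0.0], hidden_w, hidden_b, output_w, output_b)
second, _ = classify([0.0, 1.0], hidden_w, hidden_b, output_w, output_b)
```

In this sketch the index of the output neuron unit with the largest value is taken as the identifier of the recognized reference pattern. With a single output neuron unit, the output value would instead be compared with a threshold.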
According to the prior patent application, the input and the intermediate neuron units are arranged and interconnected as a time sequential structure having a signal time axis which is divided into first through J-th signal time instants, where J may be called the third predetermined natural number as above. The pattern and the signal time axes are related to each other by a mapping function of the type described above. In the manner described before, the i-th pattern time instant is related to one of a predetermined number of consecutive ones of the first through the J-th signal time instants. Alternatively, the j-th signal time instant is related to one of a predetermined number of consecutive ones of the first through the I-th pattern time instants. In either event, the input pattern time sequence is supplied to the input neuron units according to the mapping function.
The time sequential structure makes it readily possible to put the neural network into operation according to the dynamic programming algorithm or technique known in the art of pattern matching. The neural network is therefore named a dynamic neural network in the prior patent application. It is possible to train the dynamic neural network according to the back-propagation training algorithm described in the above-mentioned Lippmann article. It has, however, not been known to make the dynamic neural network represent various concatenations of reference patterns.