1. Field of the Invention
This invention relates to a method of speech recognition.
2. Description of the Prior Art
Some speech recognition systems require voices of a user to be preregistered. The preregistered voices are used as references in recognizing the contents of speech of the user.
Advanced speech recognition systems dispense with such voice preregistration and are usable by unspecified persons. The advanced systems include a word dictionary which holds standard voices in the form of parameters. During a speech recognition process, the patterns of input voices are compared with the patterns of standard voices.
U.S. Pat. No. 3,816,722 to Sakoe et al. relates to a pattern recognition system usable as a speech recognition system which includes a computer for calculating the similarity measure between at least two patterns. According to the system of U.S. Pat. No. 3,816,722, the feature vectors of a sequence representative of a first pattern are correlated to those in another sequence representative of a second pattern in such a manner that the normalized sum of the quantities representative of the similarity between each feature vector of a sequence and at least one feature vector of the other sequence may assume an extremum. The extremum is used as the similarity measure to be calculated between the two patterns. With the pattern recognition system, the similarity measure is calculated for each reference pattern and a variable-length partial pattern to be recognized. The partial pattern is successively recognized to be a permutation with repetitions of reference patterns, each having the maximum similarity measure. In the system of U.S. Pat. No. 3,816,722, before a comparison between similarity measures to determine a best-match reference pattern, the similarity measures are required to be normalized by a normalizing unit. This normalization increases the number of steps of calculation.
U.S. Pat. No. 4,751,737 to Gerson et al. relates to a method of generating word templates for a speech recognition system. The method of U.S. Pat. No. 4,751,737 includes the steps of generating an interim template, generating a time alignment path between the interim template and a token, mapping frames from the interim template and the token along the time alignment path onto an averaged time axis, and combining data associated with the mapped frames to produce composite frames representative of the final word template. U.S. Pat. No. 4,751,737 merely discloses the generation of standard patterns for a speech recognition system, and fails to show a main part of the speech recognition system.
U.S. Pat. No. 4,712,242 to Rajasekaran et al. relates to a speaker-independent word recognizer in which the zero crossing intervals of the input speech are measured and sorted by duration to provide a rough measure of the frequency distribution within each input frame. The distribution of zero crossing intervals is transformed into a binary feature vector, which is compared with each reference template using a modified Hamming distance measure. A dynamic time warping algorithm is used to permit recognition of various speaker rates. A mask vector with each reference vector on a template is used to ignore insignificant (or speaker-dependent) features of the words detected. In the word recognizer of U.S. Pat. No. 4,712,242, pattern matching paths used in the calculation of a similarity between an input speech frame and a reference speech frame are fixed and therefore lack flexibility.
Niyada et al. published "Simple Speech Recognition Method for Unspecified Speakers", Meeting of the Acoustical Society of Japan, March 1986, pp. 7-8. T. Kimura et al. published "A Telephone Speech Recognition System Using Word Spotting Technique Based on Statistical Measure", Proceedings of ICASSP, April, 1987, pp. 1175-1178. The speech recognition method by Niyada et al. and the speech recognition system by T. Kimura et al. use a common word-spotting speech recognition technique which will be described hereinafter.
The prior-art speech recognition technique is based on a pattern matching method by which a speech can be spotted from noise to enable a recognition process, and an interval of the speech can be detected. The pattern matching uses a distance measure (a statistical distance measure) as follows.
In cases where the speech length of an input word is linearly expanded or compressed to $J$ frames and a parameter vector for one frame is expressed by $x_j$ ($j = 1, 2, \ldots, J$), the input vector $X$ is given as:
$$X = (x_1, x_2, \ldots, x_J)^t$$
where each vector $x_j$ has $p$ dimensions.
When standard patterns of preset words $\omega_k$ ($k = 1, 2, \ldots, K$) are defined by average value vectors $\mu_k$ and covariance matrixes $W_k$, the recognition result is given by the one of the preset words which maximizes the posterior probability $P(\omega_k \mid X)$.
Bayes' theorem yields the following equation:
$$P(\omega_k \mid X) = P(\omega_k) \cdot P(X \mid \omega_k) / P(X) \tag{1}$$
where the value $P(\omega_k)$ is regarded as a constant. When a normal distribution is assumed, the following equation is given:
$$P(X \mid \omega_k) = (2\pi)^{-d/2} \, |W_k|^{-1/2} \cdot \exp\{-\tfrac{1}{2}(X - \mu_k)^t \cdot W_k^{-1} \cdot (X - \mu_k)\} \tag{2}$$
where the superscript "t" denotes a transposed vector or matrix. It is assumed that the value $P(X)$ follows a normal distribution with an average value vector $\mu_x$ and a covariance matrix $W_x$. Thus, the value $P(X)$ is given as:
$$P(X) = (2\pi)^{-d/2} \, |W_x|^{-1/2} \cdot \exp\{-\tfrac{1}{2}(X - \mu_x)^t \cdot W_x^{-1} \cdot (X - \mu_x)\} \tag{3}$$
The logarithm of the equation (1), after substitution of the equations (2) and (3), takes the following form (4), provided that the constant terms are omitted:
$$L_k = (X - \mu_k)^t \cdot W_k^{-1} \cdot (X - \mu_k) - (X - \mu_x)^t \cdot W_x^{-1} \cdot (X - \mu_x) + \log |W_k| - \log |W_x| \tag{4}$$
It is assumed that the matrixes $W_k$ and $W_x$ are common, and are given by the following same matrix $W$, which is averaged over the $K$ preset words and the matrix $W_x$:
$$W = (W_1 + W_2 + \cdots + W_K + W_x)/(K + 1) \tag{5}$$
When the equation (4) is expanded, the following equation is obtained:
$$L_k = B_k - A_k^t \cdot X \tag{6}$$
where:
$$A_k = 2(W^{-1} \cdot \mu_k - W^{-1} \cdot \mu_x) \tag{7}$$
$$B_k = \mu_k^t \cdot W^{-1} \cdot \mu_k - \mu_x^t \cdot W^{-1} \cdot \mu_x \tag{8}$$
When $A_k^t = (a_1^{(k)t}, a_2^{(k)t}, \ldots, a_J^{(k)t})$, the equation (6) is transformed into the following equation:
$$L_k = B_k - \sum_{j=1}^{J} a_j^{(k)t} \cdot x_j^t = B_k - \sum_{j=1}^{J} d_j^{(k)} \tag{9}$$
where the character $B_k$ denotes a bias constant and the character $d_j^{(k)}$ denotes the partial similarity for the word "k".
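The step from the equation (4) to the equation (6) can be checked numerically. The following is a minimal sketch, assuming a common symmetric inverse covariance matrix so that the log-determinant terms of the equation (4) cancel. All numeric values and function names are hypothetical and chosen only for illustration.

```python
# Toy check that B_k - A_k^t X (equation (6)) equals the difference of the
# two quadratic forms in equation (4) when W_k = W_x = W.
# All values below are hypothetical; none come from the prior-art references.

def mat_vec(m, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(m_rc * v_c for m_rc, v_c in zip(row, v)) for row in m]

def dot(u, v):
    """Inner product of two vectors."""
    return sum(a * b for a, b in zip(u, v))

def linear_terms(mu_k, mu_x, w_inv):
    """Return A_k (equation (7)) and B_k (equation (8))."""
    wk = mat_vec(w_inv, mu_k)
    wx = mat_vec(w_inv, mu_x)
    a_k = [2.0 * (p - q) for p, q in zip(wk, wx)]
    b_k = dot(mu_k, wk) - dot(mu_x, wx)
    return a_k, b_k

def distance_direct(x, mu_k, mu_x, w_inv):
    """Equation (4) with a common W, so the log-determinant terms cancel."""
    dk = [a - b for a, b in zip(x, mu_k)]
    dx = [a - b for a, b in zip(x, mu_x)]
    return dot(dk, mat_vec(w_inv, dk)) - dot(dx, mat_vec(w_inv, dx))

def distance_linear(x, a_k, b_k):
    """Equation (6): L_k = B_k - A_k^t X."""
    return b_k - dot(a_k, x)

# Assumed symmetric inverse covariance matrix W^-1 and toy vectors.
w_inv = [[2.0, 0.5, 0.0], [0.5, 1.0, 0.25], [0.0, 0.25, 1.5]]
mu_k = [1.0, -2.0, 0.5]
mu_x = [0.0, 0.5, -1.0]
x = [0.75, -1.0, 2.0]

a_k, b_k = linear_terms(mu_k, mu_x, w_inv)
direct = distance_direct(x, mu_k, mu_x, w_inv)
linear = distance_linear(x, a_k, b_k)
```

The two values agree up to rounding, which is why the per-word work reduces to a bias constant and an inner product once $A_k$ and $B_k$ are precomputed.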
The calculation of the final similarity $L_k$ is simplified as described hereinafter. With reference to FIG. 1, in the case of collation between an input and a word "k", a partial period length $n$ ($n_s^{(k)} \le n \le n_e^{(k)}$) is linearly expanded or compressed to a standard pattern length $J$, and similarities are calculated at fixed ends for respective frames. A similarity $L_k$ is calculated along the route from a point T in a line QR to a point P by referring to the equation (9).
Accordingly, the calculation of the similarities for one frame is performed within a range $\triangle PQR$. Since the values $x_j$ in the equation (9) mean j-th frame components after the expansion or compression of a period length $n$, a corresponding input frame $i'$ is present. Thus, partial similarities $d_j^{(k)}(i')$ are expressed by use of an input vector and are specifically given as:
$$d_j^{(k)}(i') = a_j^{(k)t} \cdot x_{i'}^t \tag{10}$$
where:
$$i' = i - r_n(j) + 1 \tag{11}$$
In the equation (11), the character $r_n(j)$ represents a function relating the length $n$ and the frame $j$. Accordingly, provided that partial similarities between respective frames of an input and standard patterns $a_j^{(k)}$ are predetermined, the equation (9) can be easily calculated by selecting and adding the partial similarities related to the frame $i'$. In view of the fact that the range $\triangle PQR$ moves rightward frame by frame, partial similarities between the vectors $a_j^{(k)}$ and $x_i$ are calculated on the line PS, and their components corresponding to the range $\triangle PQS$ are stored in a memory and are shifted every frame. In this case, since the necessary similarities are all present in the memory, repeated similarity calculations can be avoided.
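The selection of stored partial similarities through the equation (11) can be sketched as follows. The text does not specify the mapping function of the equation (11), so the linear mapping below (it returns n for j = 1 and 1 for j = J) is an assumption, as are all function names and the dictionary layout used for the stored partial similarities.

```python
# Sketch of equations (9) and (11) with an ASSUMED linear mapping function.

def r_n(j, n, num_frames):
    """Assumed linear mapping: r_n(1) = n and r_n(J) = 1 (not from the source)."""
    return n - ((j - 1) * (n - 1)) // (num_frames - 1)

def input_frame(i, j, n, num_frames):
    """Equation (11): i' = i - r_n(j) + 1."""
    return i - r_n(j, n, num_frames) + 1

def similarity(partial, bias, i, n, num_frames):
    """Equation (9): bias minus the sum of the selected partial similarities.

    `partial` maps (j, i') to a precomputed partial similarity d_j(i').
    """
    total = 0.0
    for j in range(1, num_frames + 1):
        total += partial[(j, input_frame(i, j, n, num_frames))]
    return bias - total

# Toy example: J = 4 standard frames, period length n = 7 ending at frame i = 10.
J = 4
frames = [input_frame(10, j, 7, J) for j in range(1, J + 1)]

# Hypothetical partial similarities, all equal to 1.0, and a bias of 10.0.
toy_partial = {(j, f): 1.0 for j in range(1, J + 1) for f in range(1, 11)}
toy_value = similarity(toy_partial, 10.0, 10, 7, J)
```

In this toy case the period of 7 input frames is sampled at input frames 4, 6, 8 and 10, so only J partial similarities are added per candidate length.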
FIG. 2 shows a prior-art speech recognition apparatus using the previously-mentioned word-spotting technique. With reference to FIG. 2, the prior-art speech recognition apparatus includes an analog-to-digital (A/D) converter 1 which changes an input analog speech signal into a corresponding digital speech signal having 12 bits. In the A/D converter 1, the input analog speech signal is sampled at a frequency of 8 kHz. The digital speech signal is outputted from the A/D converter 1 to a speech analyzer 2. In the speech analyzer 2, the digital speech signal is subjected to LPC analysis every 10 msec (one frame) so that 10th-order linear prediction coefficients and residual powers are derived. A feature parameter extractor 3 following the speech analyzer 2 calculates LPC cepstrum coefficients $c_1$ to $c_5$ and a power term $c_0$ from the linear prediction coefficients and the residual powers. The calculated LPC cepstrum coefficients and power term constitute feature parameters. Accordingly, a feature vector $x_i$ for a frame "i" is given as:
$$x_i = (c_0, c_1, c_2, \ldots, c_5) \tag{12}$$
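One common way to obtain LPC cepstrum coefficients from linear prediction coefficients is a short recursion. The sketch below is not taken from the prior-art apparatus; it assumes the sign convention A(z) = 1 + a_1 z^-1 + ... + a_p z^-p for the prediction error filter, and it leaves out the power term, whose exact definition the text does not give. The function name is hypothetical.

```python
# Hedged sketch: LPC-to-cepstrum recursion under an ASSUMED sign convention,
# A(z) = 1 + sum_k a_k z^-k, with the all-pole model H(z) = 1 / A(z).

SAMPLE_RATE_HZ = 8000
FRAME_MSEC = 10
SAMPLES_PER_FRAME = SAMPLE_RATE_HZ * FRAME_MSEC // 1000  # samples per 10 msec frame

def lpc_to_cepstrum(a, order):
    """Return [c_1, ..., c_order] for coefficients a = [a_1, ..., a_p]."""
    p = len(a)
    c = []
    for n in range(1, order + 1):
        acc = -a[n - 1] if n <= p else 0.0
        for k in range(1, n):
            if n - k <= p:
                acc -= (k / n) * c[k - 1] * a[n - k - 1]
        c.append(acc)
    return c

# First-order check: H(z) = 1 / (1 - 0.5 z^-1) has cepstrum c_n = 0.5**n / n.
cepstrum = lpc_to_cepstrum([-0.5], 3)
```

With 10th-order coefficients and `order=5`, the same call would return the five cepstrum coefficients that enter the feature vector.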
A frame sync signal generator 4 outputs timing signals (frame signals) at intervals of 10 msec. A speech recognition process is performed synchronously with the frame signals. The frame signals are applied to the speech analyzer 2 and the feature parameter extractor 3. The sync signal generator 4 also outputs a timing signal to a standard pattern selector 5.
A standard pattern storage 6 holds standard patterns of preset words identified by numbers $k = 1, 2, \ldots, K$ respectively. The standard pattern selector 5 outputs a control signal to the standard pattern storage 6 in synchronism with the timing signal. During a one-frame interval, the output control signal from the standard pattern selector 5 sequentially represents the word numbers $k = 1, 2, \ldots, K$ so that the standard patterns corresponding to the word numbers $k = 1, 2, \ldots, K$ are sequentially selected and transferred from the standard pattern storage 6 to a partial similarity calculator 7. The partial similarity calculator 7 determines a partial similarity $d_j^{(k)}(i)$ between a selected standard pattern $a_j^{(k)t}$ and a feature vector $x_i$ by referring to the following equation:
$$d_j^{(k)}(i) = a_j^{(k)t} \cdot x_i^t \tag{13}$$
where $j = 1, 2, \ldots, J$. The calculated partial similarities are sequentially stored into a similarity buffer 12. In general, each time a new partial similarity is stored into the similarity buffer 12, the oldest partial similarity is erased from the similarity buffer 12.
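The calculation of the equation (13) and the first-in, first-out behavior of the similarity buffer 12 can be sketched as follows. This is a minimal illustration, assuming one buffer per word; the class name, method names, and buffer depth are hypothetical.

```python
from collections import deque

def partial_similarities(pattern, x):
    """Equation (13): one inner product per standard-pattern frame j.

    `pattern` is a list of J weight vectors a_j; `x` is the feature vector
    of the current input frame.
    """
    return [sum(a * c for a, c in zip(a_j, x)) for a_j in pattern]

class SimilarityBuffer:
    """Holds the partial similarities of the most recent `depth` input frames."""

    def __init__(self, depth):
        # A deque with maxlen drops the oldest column when a new one arrives.
        self.columns = deque(maxlen=depth)

    def push(self, column):
        self.columns.append(column)

    def get(self, j, frames_back):
        """Return d_j for the frame `frames_back` frames before the newest."""
        return self.columns[-1 - frames_back][j - 1]

# Toy usage with J = 2 pattern frames and a buffer depth of 2 input frames.
toy_pattern = [[1.0, 0.0], [0.0, 2.0]]
buf = SimilarityBuffer(depth=2)
for frame in ([1.0, 1.0], [2.0, 3.0], [4.0, 5.0]):
    buf.push(partial_similarities(toy_pattern, frame))
```

After the third push, the column for the first input frame has been discarded, which mirrors the erasure of the oldest partial similarity.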
The word number signal outputted from the standard pattern selector 5 is also applied to a proposed period setting section 8. The proposed period setting section 8 sets a minimal length $n_s^{(k)}$ and a maximal length $n_e^{(k)}$ of a word designated by the word number signal. Signals representative of the minimal length and the maximal length of the word are fed from the proposed period setting section 8 to a time expansion and compression table 13. The time expansion and compression table 13 stores data of the input frame $i'$ as a function of a word length $n$ and a frame $j$ according to the relation of the equation (11). When a word length $n$ and a frame $j$ are designated as an address signal fed to the time expansion and compression table 13, data of the input frame $i'$ corresponding to the designated word length $n$ and the frame $j$ is read out from the time expansion and compression table 13. Such a readout process is periodically reiterated while the designated word length $n$ is sequentially updated in the range between the minimal length $n_s^{(k)}$ and the maximal length $n_e^{(k)}$. As a result, data representing different input frames $i'$ are sequentially read out from the time expansion and compression table 13 for the respective word lengths $n$ between the minimal length $n_s^{(k)}$ and the maximal length $n_e^{(k)}$. The readout data of the input frame $i'$ is fed to the similarity buffer 12.
The partial similarity $d_j^{(k)}(i')$ corresponding to the input frame $i'$ is read out from the similarity buffer 12. The readout of the partial similarity is executed for each frame $j$ ($j = 1, 2, \ldots, J$), so that the partial similarities are sequentially read out from the similarity buffer 12. A similarity adder 14 sums the partial similarities $d_j^{(k)}(i')$ read out from the similarity buffer 12, and calculates a final similarity $L_k$ according to the equation (9). A signal representative of the calculated final similarity $L_k$ is outputted to a similarity comparator 10. The similarity comparator 10 selects the greater of the input similarity and a similarity fed from a temporary memory 11. The selected greater similarity is stored into the temporary memory 11. Accordingly, the similarity held by the temporary memory 11 is updated when the input similarity is greater. On the other hand, the similarity held by the temporary memory 11 remains unchanged when the input similarity is smaller. The similarity comparator 10 also serves to store the word number "k" into the temporary memory 11, the word number "k" corresponding to the similarity stored in the temporary memory 11. As a result, the greatest similarity and the corresponding word number remain in the temporary memory 11.
At the start of overall operation, a first frame $i = i_0$ is processed. Specifically, the greatest similarity $L_1^{i_0}(\max)$ is determined for the period-length range of $n_s^{(1)} \le n \le n_e^{(1)}$ with respect to a standard pattern $k = 1$. The greatest similarity $L_1^{i_0}(\max)$ is stored in the temporary memory 11. Then, the greatest similarity $L_2^{i_0}(\max)$ is determined for the period-length range of $n_s^{(2)} \le n \le n_e^{(2)}$ with respect to a standard pattern $k = 2$. The similarity $L_2^{i_0}(\max)$ is compared with the previous similarity $L_1^{i_0}(\max)$ by the similarity comparator 10. The greater of the compared similarities is selected and is stored into the temporary memory 11 by the similarity comparator 10. Similar processes are repeated for respective standard patterns $k = 3, 4, \ldots, K$. As a result, the actually greatest similarity $L_{k'}^{i_0}(\max)$ is determined. The greatest similarity $L_{k'}^{i_0}(\max)$ and the corresponding word number $k'$ are stored into the temporary memory 11.
During a stage following the start, subsequent frames $i = i_0 + \Delta i$ are processed in a way similar to the way of processing the first frame. After a final frame $i = I$ is processed, the word number $k = k_m$ held in the temporary memory 11 represents the result of speech recognition. In cases where the frame number $i = i_m$ and the word length $n = n_m$ corresponding to the greatest similarity are stored into the temporary memory 11 and are allowed to be updated, the speech period corresponding to the result of speech recognition is also determined. The determined speech period is given as: $i_m - n_m \sim i_m$.
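The frame-synchronous search described above, with the similarity comparator 10 and the temporary memory 11 keeping only the running maximum, can be sketched as follows. This is an illustrative outline only: the function names and the toy similarity function are hypothetical, and skipping candidate lengths longer than the input received so far is an assumption.

```python
def spot_word(similarity_fn, num_input_frames, length_ranges):
    """Return (similarity, k, i, n) for the greatest similarity found.

    `similarity_fn(k, i, n)` gives the similarity of word k for a period of
    length n ending at input frame i; `length_ranges[k]` is (n_s, n_e).
    """
    best = None  # plays the role of the temporary memory
    for i in range(1, num_input_frames + 1):
        for k, (n_s, n_e) in length_ranges.items():
            for n in range(n_s, n_e + 1):
                if n > i:
                    break  # assumed: not enough input frames yet
                value = similarity_fn(k, i, n)
                # Comparator: keep the stored value unless the new one is greater.
                if best is None or value > best[0]:
                    best = (value, k, i, n)
    return best

# Toy similarity that peaks at word 2, end frame 7, length 4 (hypothetical).
def toy_similarity(k, i, n):
    return -((k - 2) ** 2 + (i - 7) ** 2 + (n - 4) ** 2)

toy_ranges = {1: (3, 5), 2: (3, 5), 3: (3, 5)}
result = spot_word(toy_similarity, 10, toy_ranges)
speech_period = (result[2] - result[3], result[2])
```

Storing the frame number and length next to the word number is what lets the speech period be reported together with the recognition result.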
The prior-art speech recognition method used by the apparatus of FIG. 2 has the following problem. The calculation of a similarity $L_k$ according to the equation (9) requires a very large number of calculating steps and a very large memory capacity. Specifically, the required memory capacity for processing one word corresponds to $n \times J/2$ words, and the required number of times of calculation corresponds to $n \times J \times P$ per frame, where $P$ denotes a parameter order number. In the typical case where $J = 16$, $n = 48$, and $P = 6$, the required memory capacity corresponds to 384 words and the required number of times of calculation corresponds to 4,608 per frame.
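The figures quoted for the typical case follow directly from the two expressions above, as the short calculation below shows. The function name is hypothetical.

```python
def prior_art_cost(n, j, p):
    """Return (memory words, calculations per frame) for one word."""
    memory_words = n * j // 2   # n x J / 2
    calculations = n * j * p    # n x J x P
    return memory_words, calculations

typical = prior_art_cost(n=48, j=16, p=6)
```

Both quantities grow in proportion to the period length and the standard pattern length, and they apply to each word in the dictionary.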