1. Field of the Invention
The present invention relates to a reference pattern generating apparatus and method for generating a reference pattern having a high representation efficiency for use in speech recognition. The invention also relates to a computer-readable medium having a reference pattern generating program embodied thereon that may be used to implement the method.
2. Description of the Prior Art
In general, speech recognition employs a pattern matching method of comparing an input voice pattern with a set of reference patterns associated with words by computing pattern matching distances between the input voice pattern and the reference patterns, and furnishing the word exhibiting the minimum pattern matching distance as the recognition result. A word's reference pattern can be a time series of frames or feature vectors X(1), X(2), X(3), . . . , X(T) obtained from an input voice pattern of the word, where T is the length of the word (i.e. the number of frames). Since T varies from word to word, so does the size of the reference pattern. Therefore, the size of memory used for storing a set of reference patterns cannot be determined in advance, even if the number of words is predetermined. In addition, as the number of frames T assigned to each word increases, the amount of storage used for storing the set of reference patterns also increases.
In order to solve such problems, apparatuses and methods have been studied that compress the time series of feature vectors X(1), X(2), X(3), . . . , X(T) obtained from an input voice pattern of each word and generate, for each word, a reference pattern having a predetermined number J (J > 1) of states, J being independent of the number of frames of the word.
Referring now to FIG. 17, there is illustrated a block diagram showing the structure of such a prior art reference pattern generating apparatus as disclosed in Japanese Patent Application Publication (TOKKAISHO) No. 64-44997, for example. In the figure, reference numeral 1 denotes an input terminal for receiving a voice signal 2 applied thereto, numeral 3 denotes an analysis unit for performing an acoustic analysis on the input voice signal 2, numeral 4 denotes a time series of feature vectors, which is obtained as an acoustic analysis result from the input voice signal 2 by the analysis unit 3, numeral 5 denotes an initial reference pattern generating unit for generating an initial reference pattern 6 from the time series 4 of feature vectors, and numeral 7 denotes a reference pattern generating unit for generating a reference pattern 8 from the initial reference pattern 6.
In operation, a voice signal generated from a person's voice is applied to the input terminal 1 to create a reference pattern. The analysis unit 3 analog-to-digital converts the voice signal 2 applied to the input terminal and then performs an acoustic analysis on the analog-to-digital converted voice signal on a per-frame basis (a certain short time interval is called a "frame"). Based on the acoustic analysis result, the analysis unit 3 extracts a speech region (region in which speech is present) from the digital voice signal and then calculates a time series 4 of feature vectors X(1), X(2), X(3), . . . , X(T) from the speech region. Each feature vector X(t) (t=1, 2, 3, . . . , T) is calculated on a per-frame basis. T is the number of frames included in the speech region extracted from the digital voice signal, i.e. the number of feature vectors. Since it is difficult to precisely extract the speech region (where voice is present) from the digital voice signal, a few leading and trailing frames can include a pause. Each feature vector X(t) can be an LPC cepstrum obtained with linear prediction or LPC analysis.
The initial reference pattern generating unit 5 receives the time series 4 of feature vectors X(1), X(2), X(3), . . . , X(T) from the analysis unit 3, and then generates an initial value 6 of the reference pattern following a procedure, which will be mentioned later. Referring next to FIG. 18, there is illustrated a flow diagram showing the procedure of generating the initial reference pattern 6.
The initial reference pattern generating unit 5, in step ST101 of FIG. 18, divides the time series 4 of feature vectors X(1), X(2), X(3), . . . , X(T) into J (J > 1) small sections B(1), B(2), B(3), . . . , B(J) in such a manner that any two adjacent small sections do not overlap each other and they are equal in length where possible. Start and end frames sz(j) and ez(j) of each small section B(j) (j=1, 2, 3, . . . , J) are given by the following equations (1) to (3):

$$L = [T/J] \tag{1}$$

$$\mathit{sz}(j) = \begin{cases} 1 & \text{for } j = 1 \\ \mathit{ez}(j-1) + 1 & \text{for } j = 2, \ldots, J \end{cases} \tag{2}$$

$$\mathit{ez}(j) = \begin{cases} \mathit{sz}(j) + L - 1 & \text{for } j = 1, \ldots, J-1 \\ T & \text{for } j = J \end{cases} \tag{3}$$

where [ ] of the equation (1) is an arithmetic operation to round off the number in the square brackets to produce an integer.
FIG. 19 shows an example in which the number T of frames or feature vectors X(1), X(2), X(3), . . . , X(T) is 15, and the number J of small sections B(1), B(2), B(3), . . . , B(J) or states of the reference pattern is 5. In this example, the time series 4 of feature vectors X(1), X(2), X(3), . . . , X(15) is divided into the plurality of small sections B(1), B(2), B(3), . . . , B(J) in such a manner that they are equal in length and B(1) includes feature vectors X(1) to X(3), B(2) includes feature vectors X(4) to X(6), . . . , and B(5) includes feature vectors X(13) to X(15).
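The division of step ST101 can be sketched in a few lines of Python. This is an illustrative sketch only, not the apparatus of the cited publication: the function name `equal_sections` is hypothetical, and the bracket operation of equation (1) is assumed here to be an integer floor, which agrees with the FIG. 19 example (T=15, J=5) but is not confirmed by the text for other values.

```python
def equal_sections(T, J):
    """Return 1-based inclusive (sz, ez) frame bounds for J small sections.

    Follows equations (1) to (3). Assumption: [T/J] is taken as floor(T/J).
    """
    if J < 2:
        raise ValueError("J must be greater than 1")
    if T < J:
        raise ValueError("need at least one frame per section")
    L = T // J                                  # equation (1), floor assumed
    bounds = []
    sz = 1                                      # equation (2), case j = 1
    for j in range(1, J + 1):
        ez = sz + L - 1 if j < J else T         # equation (3)
        bounds.append((sz, ez))
        sz = ez + 1                             # equation (2), case j >= 2
    return bounds


print(equal_sections(15, 5))
# [(1, 3), (4, 6), (7, 9), (10, 12), (13, 15)]
```

Under the floor assumption, any frames left over when T is not a multiple of J all fall into the last section B(J).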
The initial reference pattern generating unit 5 then advances to step ST102, in which it averages the part of the time series 4 of feature vectors included in each small section B(j) obtained in step ST101 according to the following equation (4) to generate an initial value Rz(j) (j=1, 2, 3, . . . , J):

$$\mathit{Rz}(j) = \frac{1}{\mathit{ez}(j) - \mathit{sz}(j) + 1} \sum_{k=\mathit{sz}(j)}^{\mathit{ez}(j)} X(k) \tag{4}$$

The process of generating the initial value Rz(j) (j=1, 2, 3, . . . , J) for each small section in the case where the number J of states of the reference pattern is 5 is shown in FIG. 19. The initial value Rz(1) for the first state is produced by averaging the time series of the leading three feature vectors X(1) to X(3) included in the first small section B(1), the initial value Rz(2) for the second state is produced by averaging the time series of the three feature vectors X(4) to X(6) included in the second small section B(2), . . . , and the initial value Rz(5) for the fifth state is produced by averaging the time series of the last three feature vectors X(13) to X(15) included in the fifth small section B(5), as shown in FIG. 19.
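The averaging of step ST102 might be sketched as follows. The function name `section_means` and the representation of feature vectors as plain Python lists of floats are assumptions made for illustration; the section bounds are passed in as 1-based inclusive (sz, ez) pairs.

```python
def section_means(X, bounds):
    """Average the feature vectors of each small section, as in equation (4).

    X      : list of T feature vectors (each a list of floats); X[0] is X(1)
    bounds : list of 1-based inclusive (sz, ez) pairs, one per section
    """
    means = []
    for sz, ez in bounds:
        frames = X[sz - 1:ez]                   # X(sz) .. X(ez)
        n = len(frames)                         # ez - sz + 1
        dim = len(frames[0])
        means.append([sum(f[d] for f in frames) / n for d in range(dim)])
    return means


# Hypothetical two-dimensional feature vectors for six frames.
X = [[float(t), 2.0 * t] for t in range(1, 7)]
print(section_means(X, [(1, 3), (4, 6)]))
# [[2.0, 4.0], [5.0, 10.0]]
```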
During the above-mentioned averaging process, the initial value Rz(j) for each small section is determined in such a manner that the sum D(j) of squared Euclidean distances between each state of the initial reference pattern Rz(j) and the feature vectors X(sz(j)) to X(ez(j)) included in each small section B(j), which is calculated according to the following equation (5), is minimized.

$$D(j) = \sum_{k=\mathit{sz}(j)}^{\mathit{ez}(j)} \lVert \mathit{Rz}(j) - X(k) \rVert^2 \tag{5}$$

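That the section average minimizes D(j) can be checked directly. As a short sketch, treat Rz(j) as a free vector and write N(j) = ez(j) - sz(j) + 1 for the section length (N(j) is shorthand introduced here for illustration, not a symbol of the original). Setting the gradient of equation (5) to zero gives:

$$\frac{\partial D(j)}{\partial \mathit{Rz}(j)} = 2 \sum_{k=\mathit{sz}(j)}^{\mathit{ez}(j)} \bigl( \mathit{Rz}(j) - X(k) \bigr) = 0 \;\Longrightarrow\; \mathit{Rz}(j) = \frac{1}{N(j)} \sum_{k=\mathit{sz}(j)}^{\mathit{ez}(j)} X(k)$$

which is the average of equation (4). Since D(j) is a convex quadratic function of Rz(j), this stationary point is its minimum.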
In this way, the initial reference pattern generating unit 5 completes the process of generating the initial reference pattern 6.
The reference pattern generating unit 7 receives the initial reference pattern 6 including the plurality of states Rz(1), Rz(2), Rz(3), . . . , Rz(J) generated by the initial reference pattern generating unit 5 and the time series 4 of feature vectors X(1), X(2), X(3), . . . , X(T), which are calculated from the input voice signal by the analysis unit 3. The reference pattern generating unit 7 then generates a reference pattern having a plurality of states, each of which is designated by R(j) (j=1, 2, 3, . . . , J), following a procedure which will be explained later.
Referring next to FIG. 20, there is illustrated a flow diagram showing the procedure of generating each state of the reference pattern R(j). The reference pattern generating unit 7, in step ST201 of FIG. 20, sets the value c of a learning number counter (not shown) thereof at zero. The reference pattern generating unit 7 then advances to step ST202 in which it copies each state of the initial reference pattern Rz(j) (j=1, 2, 3, . . . , J) generated by the initial reference pattern generating unit 5 into a corresponding state of an intermediate reference pattern R(c)(j) (j=1, 2, 3, . . . , J) according to the following equation (6):
$$R^{(c)}(j) = \mathit{Rz}(j), \quad j = 1, 2, 3, \ldots, J \tag{6}$$
where c in the (c) is the value of the learning number counter (not shown).
After that, the reference pattern generating unit 7, in step ST203, brings each state of the intermediate reference pattern R(c)(j) (j=1, 2, 3, . . . , J) into correspondence with part of the time series 4 of feature vectors X(1), X(2), X(3), . . . , X(T). The correspondence process can be carried out using a Viterbi algorithm in such a manner that a pattern matching distance D described below is minimized. The algorithm can be implemented by performing an initial setting according to the following equations (7) and (8), and repeating an arithmetic operation using the recurrence formulas given by the following equations (9) and (10); the pattern matching distance D is given by the following equation (11):

$$G(t, 0) = \infty, \quad t = 0 \sim T \tag{7}$$

$$G(1, 1) = \lVert X(1) - \mathit{Rz}(1) \rVert^2 \tag{8}$$

$$G(t, j) = \lVert X(t) - \mathit{Rz}(j) \rVert^2 + \min \{ G(t-1, j),\; G(t-1, j-1) \} \tag{9}$$

$$\mathit{BTK}(t, j) = \begin{cases} 1 & \text{for } G(t-1, j) \le G(t-1, j-1) \\ 0 & \text{for } G(t-1, j) > G(t-1, j-1) \end{cases} \tag{10}$$

$$D = G(T, J) \tag{11}$$

where G(t,j) is a summed Viterbi distance, BTK(t,j) is back track information, D is the pattern matching distance between the time series of feature vectors X(1), X(2), X(3), . . . , X(T) and the plurality of states of the intermediate reference pattern R(c)(1), R(c)(2), . . . , R(c)(J), and min {.,.} is an operator for selecting the smaller of the two numbers in the braces.
After the arithmetic operation using the recurrence formulas given by the equations (9) and (10) is complete, the reference pattern generating unit 7 can obtain a correspondence between each state of the intermediate reference pattern R(c)(j) (j=1, 2, 3, . . . , J) and part of the time series 4 of feature vectors X(1), X(2), X(3), . . . , X(T) extracted from the input voice signal, which minimizes the pattern matching distance D given by the above equation (11), by tracing the back track information BTK(t,j) all the way from the frame T in the reverse direction with respect to time. The correspondence will be referred to as the Viterbi path hereinafter. The reference pattern generating unit 7 then, in step ST203, determines start and end frames s′(j) and e′(j) for each of J new small sections B′(j) (j=1, 2, 3, . . . , J) from the Viterbi path.
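The alignment and backtracking of step ST203 can be sketched as below. This is a minimal illustration under stated assumptions rather than the apparatus's actual implementation: the names `sq_dist` and `viterbi_sections` are hypothetical, feature vectors are plain lists of floats, the states passed as `R` stand for whichever reference pattern is being aligned, and G(1, j) for j > 1, which the text does not specify, is assumed to be infinite so that the path starts in the first state.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def viterbi_sections(X, R):
    """Align T frames X with J states R per equations (7) to (11).

    Returns (D, bounds), where bounds holds the 1-based inclusive
    (s', e') frame range of each new small section.
    """
    T, J = len(X), len(R)
    if T < J:
        raise ValueError("need at least one frame per state")
    INF = math.inf
    # 1-based tables; row 0 and column 0 stay infinite (equation (7)).
    # Assumption: G(1, j) for j > 1 is also infinite.
    G = [[INF] * (J + 1) for _ in range(T + 1)]
    BTK = [[1] * (J + 1) for _ in range(T + 1)]
    G[1][1] = sq_dist(X[0], R[0])                        # equation (8)
    for t in range(2, T + 1):
        for j in range(1, J + 1):
            stay, move = G[t - 1][j], G[t - 1][j - 1]
            BTK[t][j] = 1 if stay <= move else 0         # equation (10)
            G[t][j] = sq_dist(X[t - 1], R[j - 1]) + min(stay, move)  # (9)
    # Trace the back track information from frame T back to frame 1.
    state = [0] * (T + 1)
    j = J
    for t in range(T, 0, -1):
        state[t] = j
        if t > 1 and BTK[t][j] == 0:
            j -= 1
    bounds = []
    for j in range(1, J + 1):
        frames = [t for t in range(1, T + 1) if state[t] == j]
        bounds.append((frames[0], frames[-1]))
    return G[T][J], bounds                               # D, equation (11)


X = [[0.0], [0.0], [0.0], [0.0], [10.0], [10.0]]
print(viterbi_sections(X, [[0.0], [10.0]]))
# (0.0, [(1, 4), (5, 6)])
```

Ties are resolved in favor of staying in the same state, matching the "less than or equal" case of equation (10).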
The reference pattern generating unit 7 advances to step ST204, in which it averages the part of the time series 4 of feature vectors included in each new small section B′(j) obtained in step ST203 to produce an updated state of the intermediate reference pattern R(c+1)(j) (j=1, 2, 3, . . . , J) according to the following equation (12):

$$R^{(c+1)}(j) = \frac{1}{e'(j) - s'(j) + 1} \sum_{k=s'(j)}^{e'(j)} X(k) \tag{12}$$

The reference pattern generating unit 7 then, in step ST205, increments the value c of the learning number counter (not shown) by 1, and, in step ST206, determines whether the value c has reached a predetermined threshold value CC. If the value c of the learning number counter (not shown) has reached the predetermined threshold value CC, the reference pattern generating unit 7 branches to step ST207, in which it furnishes the time sequence of states of the updated intermediate reference pattern R(c)(1), R(c)(2), R(c)(3), . . . , R(c)(J) as the reference pattern and stops the reference pattern generating procedure. In contrast, unless the value c of the learning number counter (not shown) has reached the predetermined threshold value CC, the reference pattern generating unit 7 returns to step ST203 and repeats the above-mentioned reference pattern generating procedure. By repeating the above-mentioned reference pattern generating procedure, the pattern matching distance D can converge to a local minimum value. A small pattern matching distance D means that the amount of information lost due to compression is reduced and the obtained reference pattern has high representation efficiency.
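One way to see why the repetition cannot increase D is sketched below, assuming that step ST203 aligns the frames against the current states R(c)(j). The notation is introduced here for illustration only: p_c(t) denotes the state assigned to frame t by the Viterbi path at iteration c, and D(R, p) is the distance written as a function of both the states and the path.

$$D(R, p) = \sum_{t=1}^{T} \lVert X(t) - R(p(t)) \rVert^2$$

$$D\bigl(R^{(c+1)}, p_c\bigr) \le D\bigl(R^{(c)}, p_c\bigr), \qquad D\bigl(R^{(c+1)}, p_{c+1}\bigr) \le D\bigl(R^{(c+1)}, p_c\bigr)$$

The first inequality holds because the section averages of step ST204 minimize the sum of squared distances within each section for a fixed path; the second holds because the Viterbi alignment of step ST203 selects the minimizing path for fixed states. The sequence of distances is therefore non-increasing and bounded below by zero, so it converges, although in general only to a local minimum that depends on the initial reference pattern.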
As previously mentioned, the prior art reference pattern generating apparatus determines the feature vectors included in each small section B(j) (j=1, 2, 3, . . . , J) and an initial value Rz(j) in such a manner that the sum D(j) of squared Euclidean distances between each state of the initial reference pattern Rz(j) and the time series of feature vectors X(sz(j)) to X(ez(j)) included in each small section B(j) is minimized. Accordingly, a problem arises: even when the reference pattern generating unit 7 copies each state of the initial reference pattern Rz(j) into a corresponding state of the intermediate reference pattern R(c)(j) and then brings each state of the intermediate reference pattern R(c)(j) into correspondence with part of the time series 4 of feature vectors X(1), X(2), X(3), . . . , X(T) in such a manner that the pattern matching distance D is minimized, the same part of the time series 4 of feature vectors as that assigned to each previous small section B(j) (j=1, 2, 3, . . . , J) can often be assigned again to the corresponding one of the new small sections B′(j) (j=1, 2, 3, . . . , J). As a result, the reference pattern can often remain trapped at the initial reference pattern and converge to an undesired local minimum.