The present invention relates to a method and an apparatus for speaker recognition and, more particularly, to a method of and an apparatus for recognizing or identifying a speaker.
Heretofore, speaker recognition independent of speech content has usually been performed on the basis of the distance between a feature parameter of an input speech and a registered parameter of a speech produced by the speaker to be recognized.
Denoting the input speech parameter series by $\vec{x}_i$ ($i = 1, \ldots, I$), the registered speech parameter series by $\vec{y}_j$ ($j = 1, \ldots, J$; I and J are the numbers of samples) and the distance between these parameter series by $D_{old}$, $D_{old}$ is obtained from the following formulas. The symbol "‖·‖" represents the Euclidean distance.
$$D_{old} = \sum_{i=1}^{I} D(i), \qquad D(i) = \min_{j} \|\vec{x}_i - \vec{y}_j\|^2$$
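The nearest-neighbor distance above may be illustrated by the following minimal sketch. The function names `squared_euclidean` and `d_old` and the two-dimensional sample vectors are assumptions introduced for illustration only; they do not appear in the prior art being described.

```python
from typing import Sequence

Vector = Sequence[float]


def squared_euclidean(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def d_old(inputs: Sequence[Vector], registered: Sequence[Vector]) -> float:
    """Sum, over input frames, of the squared distance to the nearest registered frame."""
    return sum(min(squared_euclidean(x, y) for y in registered) for x in inputs)


# Hypothetical two-dimensional feature vectors, for illustration only.
input_frames = [(0.0, 0.0), (1.0, 1.0)]
registered_frames = [(0.0, 1.0), (2.0, 1.0), (1.0, 1.0)]
print(d_old(input_frames, registered_frames))  # 1.0
```

In this example the first input frame lies at squared distance 1 from its nearest registered frame and the second coincides with a registered frame, so the total is 1.0. The cost grows with the product of the numbers of input and registered frames, which is the computational burden that vector quantization is meant to reduce.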
In order to reduce the computational effort and the memory capacity, it is also common practice that, instead of directly storing the feature vector series of speeches, a feature vector series $\vec{c}_k$ obtained by vector quantization is stored as a reference pattern. The distance $D'_{old}$ is then obtained as follows.
$$D'_{old} = \sum_{i=1}^{I} D'(i), \qquad D'(i) = \min_{k} \|\vec{x}_i - \vec{c}_k\|^2$$
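As a minimal sketch of the vector-quantized variant, the following code builds a small codebook and evaluates the distance against it. The text does not specify how the codebook is obtained; the Lloyd-style (k-means) iteration with a naive initialization used here is an assumption, as are the names `train_codebook` and `d_old_vq` and the sample data.

```python
from typing import List, Sequence, Tuple

Vector = Tuple[float, ...]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def nearest_index(x: Vector, codebook: Sequence[Vector]) -> int:
    """Index of the codebook element closest to x."""
    return min(range(len(codebook)), key=lambda k: sq_dist(x, codebook[k]))


def train_codebook(vectors: Sequence[Vector], k: int, iterations: int = 10) -> List[Vector]:
    """Lloyd-style clustering; initialization from the first k vectors is an assumption."""
    codebook = list(vectors[:k])
    for _ in range(iterations):
        clusters: List[List[Vector]] = [[] for _ in range(k)]
        for v in vectors:
            clusters[nearest_index(v, codebook)].append(v)
        # Replace each element by the mean of its cluster; keep it if the cluster is empty.
        codebook = [
            tuple(sum(col) / len(members) for col in zip(*members)) if members else codebook[idx]
            for idx, members in enumerate(clusters)
        ]
    return codebook


def d_old_vq(inputs: Sequence[Vector], codebook: Sequence[Vector]) -> float:
    """Sum, over input frames, of the squared distance to the nearest codebook element."""
    return sum(min(sq_dist(x, c) for c in codebook) for x in inputs)


# Hypothetical registration data: four frames reduced to a two-element codebook.
registered = [(0.0, 0.0), (10.0, 10.0), (0.0, 2.0), (10.0, 12.0)]
codebook = train_codebook(registered, k=2)
print(codebook)                                        # [(0.0, 1.0), (10.0, 11.0)]
print(d_old_vq([(0.0, 1.0), (10.0, 10.0)], codebook))  # 1.0
```

Only the codebook elements need to be stored, so both the memory footprint and the number of distance evaluations per input frame depend on the codebook size rather than on the length of the registered speech.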
In the above prior art techniques, for the distance to be determined accurately, all speech sounds contained in an input speech must be stored in advance, and relatively long speech is therefore used for registering the speaker to be recognized. From the standpoint of the user's burden, the speech necessary for registration should be as short as possible. Reducing the necessary speech, however, increases the number of non-registered phonemes contained in the input speech, thus reducing the accuracy of collation or matching.
As a means for solving this problem, a method disclosed in Japanese Patent Application No. 2-76296 (hereinafter referred to as Literature 1) is utilized. In this method, sizes of overlap parts of an input speech and a registered speech and also inter-overlap-part distances are utilized to determine the similarity measure.
FIG. 5 shows the system disclosed in Literature 1. As shown, the system comprises an overlap size calculating part, which determines, as the size of the overlap part, the number of input speech samples contained in the overlap part of the distributions of an input speech and a reference speech, and an overlap part inter-element distance calculating part. The distance between the input and reference speech patterns is determined from the results of calculations in these parts according to the following formula.
$$D_{new} = \frac{\sum_{i=1}^{I} d_i + d_{out}\,(u_{max} - u)}{u_{max}}$$
$$d_i = \begin{cases} \min_{k} \|\vec{x}_i - \vec{c}_k\|^2, & \text{for } A_i \neq \emptyset \quad (1) \\ 0, & \text{otherwise} \quad (2) \end{cases}$$
$$A_i = \{\, k \mid 1 \le k \le K \text{ and } \|\vec{x}_i - \vec{c}_k\| \le l_k \,\}$$
$u$: number of samples corresponding to (1)
$u_{max}$: maximum number of samples corresponding to (1) for all reference patterns
$d_{out}$: fixed distance for samples corresponding to (2)
$l_k$: coverage of the k-th element $\vec{c}_k$ of the reference pattern
‖·‖: Euclidean distance
More specifically, a coverage $l_k$ of each reference speech pattern element is previously determined, and when the distance $d_i$ between the nearest element in the reference speech pattern and the input speech pattern exceeds its coverage, a separately determined penalty distance $d_{out}$ is added to all input speech pattern feature vectors, and the result is normalized by the overlap part size $u_{max}$.
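The overlap-based distance of Literature 1 may be sketched as follows, under stated assumptions. The minimum in case (1) is taken here over all codebook elements, which is one reading of the formula; the names `overlap_terms` and `d_new_scores`, the speaker labels and the numerical values are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Tuple[float, ...]
Reference = Tuple[List[Vector], List[float]]  # (codebook elements c_k, coverages l_k)


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def overlap_terms(inputs: Sequence[Vector], codebook: Sequence[Vector],
                  coverage: Sequence[float]) -> Tuple[float, int]:
    """Return (sum of d_i, u) for one reference pattern."""
    total, u = 0.0, 0
    for x in inputs:
        dists = [sq_dist(x, c) for c in codebook]
        # A_i is non-empty when x lies within the coverage of at least one element.
        if any(math.sqrt(d) <= l for d, l in zip(dists, coverage)):
            total += min(dists)  # case (1)
            u += 1
        # case (2): d_i = 0, the sample is penalized through d_out below
    return total, u


def d_new_scores(inputs: Sequence[Vector], references: Dict[str, Reference],
                 d_out: float) -> Dict[str, float]:
    """D_new for every reference pattern, with u_max taken over all of them."""
    terms = {name: overlap_terms(inputs, cb, cov) for name, (cb, cov) in references.items()}
    u_max = max(u for _, u in terms.values())
    if u_max == 0:
        raise ValueError("no input sample falls inside any coverage")
    return {name: (total + d_out * (u_max - u)) / u_max for name, (total, u) in terms.items()}


# Hypothetical data: one codebook element per speaker.
inputs = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
references = {
    "A": ([(0.0, 0.0)], [1.5]),
    "B": ([(5.0, 5.0)], [1.0]),
}
print(d_new_scores(inputs, references, d_out=4.0))  # {'A': 0.5, 'B': 2.0}
```

In this example two input samples overlap reference A and one overlaps reference B, so $u_{max} = 2$ and reference B receives one penalty term $d_{out}$. Because $u_{max}$ is shared by all reference patterns, the score of each speaker depends on how well the input happens to overlap the other speakers' registered speech.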
In this method, however, the overlap part size $u_{max}$ is determined from all reference patterns. Therefore, where registration is performed by using speeches of different contents for different speakers, the input speech content of a speaker may be close to the registered speech of a different speaker. In such a case, $u_{max}$ may be evaluated as unfairly large, giving rise to performance deterioration. For this reason, substantially the same number of different kinds of phonemes should be contained in the contents of the registered speeches.
In addition, according to Literature 1, the coverage of each reference pattern element is determined on the basis of the distance from the center of a cluster (i.e., element $\vec{c}_k$) to the farthest feature parameter contained in that cluster. However, even with the same phoneme, the feature parameter varies with different speakers, and this means that it is difficult to obtain stable distribution overlap estimation.
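The farthest-member rule for the coverage may be sketched as follows. The sketch assumes that each training vector belongs to the cluster of its nearest codebook element; the function name `coverages` and the sample data are hypothetical. It only illustrates that a coverage defined by the farthest member is governed by a single outlying vector, and is not a reproduction of the method of Literature 1.

```python
import math
from typing import List, Sequence, Tuple

Vector = Tuple[float, ...]


def coverages(vectors: Sequence[Vector], codebook: Sequence[Vector]) -> List[float]:
    """Coverage l_k: distance from element c_k to the farthest vector assigned to it."""
    cov = [0.0] * len(codebook)
    for v in vectors:
        # Assign v to its nearest codebook element (an assumption about cluster membership).
        k = min(range(len(codebook)), key=lambda idx: math.dist(v, codebook[idx]))
        cov[k] = max(cov[k], math.dist(v, codebook[k]))
    return cov


codebook = [(0.0, 1.0), (10.0, 11.0)]
training = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
print(coverages(training, codebook))                  # [1.0, 1.0]
# One additional outlying vector triples the coverage of the first element.
print(coverages(training + [(0.0, 4.0)], codebook))   # [3.0, 1.0]
```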
The present invention, accordingly, has an object of providing a speaker recognition system capable of stable recognition irrespective of the speakers and of the variety of speeches used for registration, through an identity/non-identity check of the contents of an input speech and a registered speech by speech recognition.
(1) According to a first aspect of the present invention, there is provided a method of recognizing a speaker of an input speech according to the distance between an input speech pattern, obtained by converting the input speech to a feature parameter series, and a reference pattern preliminarily registered as feature parameter series for each speaker, comprising steps of:
obtaining contents of the input and reference speech patterns by recognition;
determining an identical section, in which the contents of the input and reference speech patterns are identical;
determining the distance between the input and reference speech patterns in the calculated identical content section;
normalizing the input speech pattern by one of copying the input speech pattern and weighting the distance determined between the input and reference speech patterns if the distance between the input and reference speech patterns is greater than a predetermined value, in which the distance between the input and reference speech patterns is decreased by normalization to reduce the adverse effects of noise; and
recognizing the speaker of the input speech on the basis of the determined distance.
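The steps of the first aspect may be sketched as follows, under several assumptions. Each frame is assumed to already carry a content label (for example a phoneme label) produced by a speech recognizer, which is not implemented here; an identical section is taken to be the set of frames whose label also occurs in the reference; the distance is averaged over the matched frames; and the normalizing step is omitted. All names and data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Sequence, Tuple

Vector = Tuple[float, ...]
Frame = Tuple[str, Vector]  # (content label from a recognizer, feature vector)


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def identical_section_distance(input_frames: Sequence[Frame],
                               reference_frames: Sequence[Frame]) -> Optional[float]:
    """Mean nearest-neighbor distance restricted to frames with identical content."""
    ref_by_label: Dict[str, List[Vector]] = defaultdict(list)
    for label, vec in reference_frames:
        ref_by_label[label].append(vec)
    total, matched = 0.0, 0
    for label, x in input_frames:
        candidates = ref_by_label.get(label)
        if not candidates:
            continue  # content not present in the reference: outside the identical section
        total += min(sq_dist(x, y) for y in candidates)
        matched += 1
    return total / matched if matched else None


def recognize(input_frames: Sequence[Frame],
              references: Dict[str, Sequence[Frame]]) -> Optional[str]:
    """Return the registered speaker with the smallest identical-section distance."""
    scores = {}
    for speaker, ref_frames in references.items():
        d = identical_section_distance(input_frames, ref_frames)
        if d is not None:
            scores[speaker] = d
    return min(scores, key=scores.get) if scores else None


input_frames = [("a", (0.0, 0.0)), ("k", (1.0, 1.0)), ("s", (9.0, 9.0))]
references = {
    "spk1": [("a", (0.0, 1.0)), ("k", (1.0, 1.0))],
    "spk2": [("a", (3.0, 0.0)), ("k", (1.0, 4.0))],
}
print(recognize(input_frames, references))  # spk1
```

The frame labeled "s" has no counterpart in either reference and therefore contributes nothing to the distance, instead of being compared against acoustically unrelated registered frames.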
(2) According to a second aspect of the present invention, there is provided a method of recognizing a speaker of an input speech independently of the content thereof by converting the input speech to an input speech pattern as a feature parameter series and determining the difference of the input speech pattern from a reference speech pattern registered for each speaker, the method comprising the steps of:
obtaining the contents of the input and reference patterns by speech recognition, and determining the distance by determining identical content sections of the input and reference speech patterns from the obtained pattern content data.
(3) According to a third aspect of the present invention, there is provided a method of recognizing a speaker of an input speech comprising steps of:
determining an identical section of the input speech and a reference speech;
copying the input speech and reference speech in an unspecified speaker's acoustical model;
determining a distance between the copied input speech and a reference speech at least for the identical section;
normalizing the input speech pattern by one of copying the input speech pattern and weighting the distance determined between the input and reference speech patterns if the distance between the input and reference speech patterns is greater than a predetermined value, in which the distance between the input and reference speech patterns is decreased by normalization to reduce the adverse effects of noise; and
recognizing the speaker of the input speech.
(4) According to a fourth aspect of the present invention, there is provided a method of recognizing a speaker of an input speech comprising steps of:
copying the input speech and reference speech in an unspecified speaker's acoustical model;
determining an identical section of the copied input speech and the reference speech;
determining a distance between the copied input speech and reference speech at least for the identical section;
normalizing the input speech pattern by one of copying the input speech pattern and weighting the distance determined between the input and reference speech patterns if the distance between the input and reference speech patterns is greater than a predetermined value, in which the distance between the input and reference speech patterns is decreased by normalization to reduce the adverse effects of noise; and
recognizing the speaker of the input speech.
(5) According to a fifth aspect of the present invention, there is provided an apparatus for recognizing a speaker of an input speech according to the distance between an input speech pattern, obtained by converting the input speech to a feature parameter series, and a reference pattern preliminarily registered as feature parameter series for each speaker, comprising:
a first means for obtaining contents of the input and reference speech patterns by recognition;
a second means for determining an identical section, in which the contents of the input and reference speech patterns are identical;
a third means for determining the distance between the input and reference speech patterns in the calculated identical content section;
a fourth means for normalizing the input speech pattern by one of copying the input speech pattern and weighting the distance determined between the input and reference speech patterns if the distance between the input and reference speech patterns is greater than a predetermined value, in which the distance between the input and reference speech patterns is decreased by normalization to reduce the adverse effects of noise; and
a fifth means for recognizing the speaker of the input speech on the basis of the determined distance.
(6) According to a sixth aspect of the present invention, there is provided an apparatus for recognizing a speaker of an input speech comprising:
a first means for determining an identical section of the input speech and a reference speech;
a second means for copying the input speech and reference speech in an unspecified speaker's acoustical model;
a third means for determining a distance between the copied input speech and reference speech at least for the identical section;
a fourth means for normalizing the input speech pattern by one of copying the input speech pattern and weighting the distance determined between the input and reference speech patterns if the distance between the input and reference speech patterns is greater than a predetermined value, in which the distance between the input and reference speech patterns is decreased by normalization to reduce the adverse effects of noise; and
a fifth means for recognizing the speaker of the input speech.
(7) According to a seventh aspect of the present invention, there is provided an apparatus for recognizing a speaker of an input speech comprising:
a first means for copying the input speech and reference speech in an unspecified speaker's acoustical model;
a second means for determining an identical section of the copied input speech and reference speech;
a third means for determining a distance between the copied input speech and reference speech at least for the identical section;
a fourth means for normalizing the input speech pattern by one of copying the input speech pattern and weighting the distance determined between the input and reference speech patterns if the distance between the input and reference speech patterns is greater than a predetermined value, in which the distance between the input and reference speech patterns is decreased by normalization to reduce the adverse effects of noise; and
a fifth means for recognizing the speaker of the input speech.
Other objects and features will be clarified from the following description with reference to the attached drawings.