Automatic verification or identification of a person by speech is attracting greater interest as an increasing number of business transactions are performed over the phone, where automatic speaker identification is desired or required in many applications. In the past several decades, three classes of techniques have been developed for speaker recognition, namely (1) Gaussian mixture model (GMM) methods, (2) vector quantization (VQ) methods, and (3) various distance measure methods. The invention is directed to the last class of techniques.
The performance of current automatic speech and speaker recognition technology is quite sensitive to certain adverse environmental conditions, such as background noise, channel distortions, speaker variations, and the like. Handset distortion is one of the main factors that degrade the performance of speech and speaker recognizers. In current speech technology, the common way to remove handset distortion is cepstral mean normalization, which is based on the assumption that handset distortion is linear.
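The linearity assumption behind cepstral mean normalization can be illustrated with a short sketch. A linear, time-invariant channel multiplies the speech spectrum by a fixed transfer function, which becomes a constant additive offset in the cepstral domain, so subtracting the per-utterance mean removes it. The function name `cepstral_mean_normalize`, the array layout (one row per frame), and the synthetic random data are assumptions made for illustration only.

```python
import numpy as np

def cepstral_mean_normalize(cepstra):
    """Subtract the per-utterance mean from each cepstral coefficient.

    cepstra: array of shape (num_frames, num_coeffs), one row per frame.
    """
    cepstra = np.asarray(cepstra, dtype=float)
    return cepstra - cepstra.mean(axis=0, keepdims=True)

# Synthetic illustration (random numbers, not real speech).
rng = np.random.default_rng(0)
clean = rng.normal(size=(200, 12))         # stand-in cepstral vectors
channel_offset = rng.normal(size=(1, 12))  # a linear channel adds a constant offset
distorted = clean + channel_offset

# After normalization the constant offset has been removed.
same = np.allclose(cepstral_mean_normalize(clean),
                   cepstral_mean_normalize(distorted))
print(same)
```

A distortion that is not a constant cepstral offset (that is, a nonlinear one) would not cancel in this way.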
In the art of distance-metric speaker identification, it is well known that covariance matrices of speech feature vectors, or cepstral vectors, carry a wealth of information on speaker characteristics. Cepstral vectors are generally obtained by inputting a speech signal and dividing the signal into segments, typically 10 milliseconds each. A fast Fourier transform is performed on each segment, and the energy is calculated for each of N frequency bands. The logarithm of the energy for each band is subjected to a cosine transformation, thereby yielding a cepstral vector having N elements. The frequency bands are usually not equally spaced but are instead scaled, for example mel-scaled according to the equation mf=1125 log(0.0016f+1), where f is the frequency in Hertz and mf is the mel-scaled frequency.
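The steps just described (segmenting, Fourier transform, band energies, logarithm, cosine transform) can be sketched as follows. This is a simplified illustration, not a reference front end: it assumes the logarithm in the mel equation is the natural logarithm, uses rectangular rather than overlapping bands, applies no windowing, and all function and parameter names are hypothetical.

```python
import numpy as np

def hz_to_mel(f):
    # Mel scaling as given in the text; the natural logarithm is an assumption.
    return 1125.0 * np.log(0.0016 * f + 1.0)

def mel_to_hz(mf):
    # Inverse of hz_to_mel.
    return (np.exp(mf / 1125.0) - 1.0) / 0.0016

def cepstral_vectors(signal, sample_rate, num_bands=12, frame_ms=10):
    """Return an array of shape (num_frames, num_bands) of cepstral vectors."""
    signal = np.asarray(signal, dtype=float)
    frame_len = int(sample_rate * frame_ms / 1000)
    num_frames = len(signal) // frame_len
    frames = signal[:num_frames * frame_len].reshape(num_frames, frame_len)

    # Power spectrum of each frame via the FFT.
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sample_rate)

    # Band edges equally spaced on the mel scale (rectangular bands for brevity).
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(sample_rate / 2.0), num_bands + 1))
    energies = np.empty((num_frames, num_bands))
    for b in range(num_bands):
        in_band = (freqs >= edges[b]) & (freqs < edges[b + 1])
        energies[:, b] = power[:, in_band].sum(axis=1)

    log_energy = np.log(energies + 1e-12)  # small floor avoids log(0)

    # Cosine transform (DCT-II) of the log band energies.
    n = np.arange(num_bands)
    basis = np.cos(np.pi * np.outer(n, n + 0.5) / num_bands)
    return log_energy @ basis.T

# One second of a synthetic 440 Hz tone sampled at 8 kHz.
tone = np.sin(2.0 * np.pi * 440.0 * np.arange(8000) / 8000.0)
cepstra = cepstral_vectors(tone, sample_rate=8000)
print(cepstra.shape)  # (100, 12): 100 frames of 10 ms, 12 coefficients each
```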
Once a set of K cepstral vectors, $c_1, c_2, \ldots, c_K$, has been obtained, one for each of the K frames of the speech signal, a covariance matrix may be derived by the equation:

$$S=\frac{1}{K}\left[(c_1-m)^T(c_1-m)+(c_2-m)^T(c_2-m)+\cdots+(c_K-m)^T(c_K-m)\right] \tag{1}$$

where T indicates the transpose, m is the mean vector $m=(c_1+c_2+\cdots+c_K)/K$, each cepstral vector is treated as a row vector of N elements, and S is the N×N covariance matrix.
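As a minimal sketch of equation (1), the covariance matrix can be computed from an array holding one cepstral vector per row. The function name `covariance_matrix` and the toy input are assumptions for illustration; the normalization divides by the number of frames, as in the equation, rather than by one less than that number.

```python
import numpy as np

def covariance_matrix(cepstra):
    """Covariance of cepstral vectors following equation (1).

    cepstra: array of shape (num_frames, num_coeffs), one row vector per frame.
    """
    cepstra = np.asarray(cepstra, dtype=float)
    num_frames = cepstra.shape[0]
    centered = cepstra - cepstra.mean(axis=0)  # subtract the mean vector m
    # centered.T @ centered is the sum of the outer products (c - m)^T (c - m).
    return centered.T @ centered / num_frames

# Toy input: three frames of two coefficients each.
frames = np.array([[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]])
S_toy = covariance_matrix(frames)
print(S_toy)
```

The result agrees with `np.cov(frames, rowvar=False, bias=True)`, which uses the same normalization.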
Let $S$ and $\Sigma$ be the covariance matrices of the cepstral vectors of clips of testing and training speech signals, respectively; that is, $S$ is the matrix for the sample of speech to be identified and $\Sigma$ is the matrix for the voice signature of a known individual. If the sample and signature speech signals are identical, then $S=\Sigma$, which is to say that $S\Sigma^{-1}$ is an identity matrix, and the speaker is thereby identified as the known individual. Therefore, the matrix $S\Sigma^{-1}$ is a measure of the similarity of the two voice clips and is referred to as the “similarity matrix” of the two speech signals.
The arithmetic mean A, geometric mean G, and harmonic mean H of the eigenvalues $\lambda_i$ ($i=1,\ldots,N$) of the similarity matrix are defined as follows:

$$A(\lambda_1,\ldots,\lambda_N)=\frac{1}{N}\sum_{i=1}^{N}\lambda_i=\frac{1}{N}\,\mathrm{Tr}(S\Sigma^{-1}) \tag{2a}$$

$$G(\lambda_1,\ldots,\lambda_N)=\left(\prod_{i=1}^{N}\lambda_i\right)^{1/N}=\left(\mathrm{Det}(S\Sigma^{-1})\right)^{1/N} \tag{2b}$$

$$H(\lambda_1,\ldots,\lambda_N)=N\left(\sum_{i=1}^{N}\frac{1}{\lambda_i}\right)^{-1}=N\left(\mathrm{Tr}(\Sigma S^{-1})\right)^{-1} \tag{2c}$$

where Tr( ) is the trace of a matrix and Det( ) is the determinant of a matrix. The last equality in (2c) holds because the eigenvalues of $\Sigma S^{-1}=(S\Sigma^{-1})^{-1}$ are $1/\lambda_i$.
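A minimal sketch of how the three means might be computed from traces and a determinant, without an eigenvalue decomposition, is given below. The function name `similarity_means` and the randomly generated covariance matrices are assumptions for illustration; the harmonic mean uses the fact that the sum of the reciprocal eigenvalues equals the trace of the inverse similarity matrix. The eigenvalues are computed at the end only to cross-check the result.

```python
import numpy as np

def similarity_means(S, Sigma):
    """Arithmetic, geometric and harmonic means of the eigenvalues of
    S Sigma^-1, computed from traces and a determinant only."""
    S = np.asarray(S, dtype=float)
    Sigma = np.asarray(Sigma, dtype=float)
    N = S.shape[0]
    sim = S @ np.linalg.inv(Sigma)           # similarity matrix
    A = np.trace(sim) / N
    # slogdet returns the log of |det|, which is more stable than det itself.
    _, logdet = np.linalg.slogdet(sim)
    G = np.exp(logdet / N)
    H = N / np.trace(np.linalg.inv(sim))     # sum of 1/lambda_i = Tr((S Sigma^-1)^-1)
    return A, G, H

def random_covariance(rng, dim, num_frames=100):
    # Hypothetical helper: covariance of random vectors, for demonstration only.
    X = rng.normal(size=(num_frames, dim))
    X = X - X.mean(axis=0)
    return X.T @ X / num_frames

rng = np.random.default_rng(0)
S = random_covariance(rng, 6)
Sigma = random_covariance(rng, 6)
A, G, H = similarity_means(S, Sigma)

# Cross-check against an explicit eigenvalue computation.
eigenvalues = np.linalg.eigvals(S @ np.linalg.inv(Sigma)).real
print(np.isclose(A, eigenvalues.mean()))
```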
These values can be obtained without explicit calculation of the eigenvalues and are therefore computationally efficient. They also satisfy the following properties:

$$A\!\left(\frac{1}{\lambda_1},\ldots,\frac{1}{\lambda_N}\right)=\frac{1}{H(\lambda_1,\ldots,\lambda_N)} \tag{3a}$$

$$G\!\left(\frac{1}{\lambda_1},\ldots,\frac{1}{\lambda_N}\right)=\frac{1}{G(\lambda_1,\ldots,\lambda_N)} \tag{3b}$$

$$H\!\left(\frac{1}{\lambda_1},\ldots,\frac{1}{\lambda_N}\right)=\frac{1}{A(\lambda_1,\ldots,\lambda_N)} \tag{3c}$$
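These reciprocal relations can be checked numerically on any set of positive values. The sketch below uses an arbitrary list of numbers as stand-in eigenvalues; the function names are hypothetical.

```python
import numpy as np

def arithmetic_mean(x):
    return np.mean(x)

def geometric_mean(x):
    return np.exp(np.mean(np.log(x)))

def harmonic_mean(x):
    return len(x) / np.sum(1.0 / x)

lam = np.array([0.5, 1.0, 2.0, 4.0])  # arbitrary positive stand-in eigenvalues
inv = 1.0 / lam

check_3a = np.isclose(arithmetic_mean(inv), 1.0 / harmonic_mean(lam))
check_3b = np.isclose(geometric_mean(inv), 1.0 / geometric_mean(lam))
check_3c = np.isclose(harmonic_mean(inv), 1.0 / arithmetic_mean(lam))
print(check_3a, check_3b, check_3c)
```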
Various distance measures have been constructed based upon these mean values, primarily for purposes of speaker identification, the most widely known being:

$$d_1(S,\Sigma)=\frac{A}{H}-1 \tag{4a}$$

$$d_2(S,\Sigma)=\frac{A}{G}-1 \tag{4b}$$

$$d_3(S,\Sigma)=\frac{A^2}{GH}-1 \tag{4c}$$

$$d_4(S,\Sigma)=A-\log(G)-1 \tag{4d}$$

If the similarity matrix is positive definite, the mean values satisfy $A\ge G\ge H$, with equality if and only if $\lambda_1=\lambda_2=\cdots=\lambda_N$. Therefore, all of the above distance measures satisfy the positivity condition. However, if $S$ and $\Sigma$ are exchanged (that is, the positions of the sample and signature speech signals are swapped), then $S\Sigma^{-1}\to\Sigma S^{-1}$ and $\lambda_i\to 1/\lambda_i$. Because A and H are then replaced by 1/H and 1/A, respectively, the ratio A/H is unchanged, so $d_1$ satisfies the symmetry property while $d_2$, $d_3$, and $d_4$ do not.
The symmetry property is a basic mathematical requirement of a distance metric; $d_1$ is therefore in more widespread use than the others.
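The positivity and symmetry behavior of the four measures can be illustrated with a short numerical sketch. The function names, the randomly generated covariance matrices, and the per-dimension scaling used to make the two matrices differ are all assumptions for illustration, and the logarithm in d4 is assumed to be the natural logarithm. A dimension greater than two is used because for two eigenvalues the product AH equals G squared, which would hide the asymmetry of d2.

```python
import numpy as np

def means(S, Sigma):
    """Arithmetic, geometric, harmonic means of the eigenvalues of S Sigma^-1."""
    N = S.shape[0]
    sim = S @ np.linalg.inv(Sigma)
    A = np.trace(sim) / N
    _, logdet = np.linalg.slogdet(sim)
    G = np.exp(logdet / N)
    H = N / np.trace(np.linalg.inv(sim))
    return A, G, H

def distances(S, Sigma):
    """Distance measures d1 to d4 built from the three means."""
    A, G, H = means(S, Sigma)
    return {
        "d1": A / H - 1.0,
        "d2": A / G - 1.0,
        "d3": A * A / (G * H) - 1.0,
        "d4": A - np.log(G) - 1.0,  # natural logarithm assumed
    }

def random_covariance(rng, scales, num_frames=200):
    # Hypothetical helper: covariance of scaled random vectors.
    X = rng.normal(size=(num_frames, len(scales))) * np.asarray(scales)
    X = X - X.mean(axis=0)
    return X.T @ X / num_frames

rng = np.random.default_rng(1)
S = random_covariance(rng, [1.0, 2.0, 3.0, 4.0])      # "sample" matrix
Sigma = random_covariance(rng, [1.0, 1.0, 1.0, 1.0])  # "signature" matrix

forward = distances(S, Sigma)
backward = distances(Sigma, S)  # sample and signature exchanged
for name in forward:
    print(name, forward[name], backward[name])
```

In this sketch only the d1 column is expected to print the same value in both directions, while all eight values are non-negative.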
As stated, cepstral mean normalization assumes that the distortion is linear, but in fact it is not. When applied to cross-handset speaker identification (meaning that the handset used to create the signature matrices is different from the one used for the sample) using the Lincoln Laboratory Handset Database (LLHD), the cepstral mean normalization technique has an error rate in excess of about 20%. Given that the error rate for same-handset speaker identification is only about 7%, it can be seen that the channel distortion caused by the handset is not linear. What is needed is a method to remove the nonlinear components of handset distortion.