This invention relates to handset detection using cepstral covariance matrices and distance metrics.
Automatic verification or identification of a person by their speech is attracting greater interest as an increasing number of business transactions are performed over the phone, and many applications desire or require automatic speaker identification. Over the past several decades, three classes of techniques have been developed for speaker recognition, namely (1) Gaussian mixture model (GMM) methods, (2) vector quantization (VQ) methods, and (3) various distance measure methods. The invention is directed to the last class of techniques.
The performance of current automatic speech and speaker recognition technology is quite sensitive to certain adverse environmental conditions, such as background noise, channel distortions, speaker variations, and the like. Handset distortion is one of the main factors that degrade the performance of speech and speaker recognizers. In current speech technology, the common way to remove handset distortion is cepstral mean normalization, which is based on the assumption that handset distortion is linear; in fact, the distortion is not linear. This creates a problem in real-world applications because the handset used to record voice samples for identification purposes will more than likely be of a different type from the handset used by the person we wish to identify, commonly referred to as the “cross-handset” identification problem.
When applied to cross-handset speaker identification using the Lincoln Laboratory Handset Database (LLHD), the cepstral mean normalization technique has an error rate in excess of about 20%. Given that the error rate for same-handset speaker identification is only about 7%, it can be seen that the channel distortion caused by the handset is not linear. It is therefore desirable to remove the effects of these non-linear distortions, but before that is possible, it is first necessary to identify the handsets.
Disclosed is a method of automated handset identification, comprising receiving a sample speech input signal from a sample handset; deriving a cepstral covariance sample matrix from said sample speech input signal; calculating, with a distance metric, all distances between said sample matrix and one or more cepstral covariance handset matrices, wherein each said handset matrix is derived from a plurality of speech signals taken from different speakers through the same handset; and determining if the smallest of said distances is below a predetermined threshold value.
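The steps recited above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the disclosed implementation: the function names `cepstral_covariance` and `identify_handset` are hypothetical, the covariance is normalized by the number of frames minus one (the text does not fix a normalization), and the distance metric is passed in as a callable so that any of the metrics of the invention could be supplied.

```python
from typing import Callable, Dict, List, Optional, Sequence

Matrix = List[List[float]]


def cepstral_covariance(frames: Sequence[Sequence[float]]) -> Matrix:
    """Covariance matrix of a sequence of cepstral vectors, one vector per frame.

    Assumption: unbiased normalization (divide by n - 1).
    """
    n = len(frames)
    if n < 2:
        raise ValueError("need at least two cepstral frames")
    dim = len(frames[0])
    mean = [sum(f[i] for f in frames) / n for i in range(dim)]
    cov = [[0.0] * dim for _ in range(dim)]
    for f in frames:
        centered = [f[i] - mean[i] for i in range(dim)]
        for i in range(dim):
            for j in range(dim):
                cov[i][j] += centered[i] * centered[j]
    return [[value / (n - 1) for value in row] for row in cov]


def identify_handset(
    sample_matrix: Matrix,
    handset_matrices: Dict[str, Matrix],
    distance: Callable[[Matrix, Matrix], float],
    threshold: float,
) -> Optional[str]:
    """Return the label of the closest handset matrix, or None if the
    smallest distance is not below the threshold."""
    if not handset_matrices:
        return None
    # Calculate all distances between the sample matrix and the handset matrices.
    distances = {
        label: distance(sample_matrix, matrix)
        for label, matrix in handset_matrices.items()
    }
    best = min(distances, key=distances.get)
    return best if distances[best] < threshold else None
```

Returning `None` when no distance falls below the threshold corresponds to the case in which the sample handset matches none of the stored handset matrices.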
In another aspect of the method, said distance metric is selected from

$$d_1(S,\Sigma) = \frac{A}{H} - 1, \quad d_5(S,\Sigma) = A + \frac{1}{H} - 2, \quad d_6(S,\Sigma) = \left(A + \frac{1}{H}\right)\left(G + \frac{1}{G}\right) - 4,$$

$$d_7(S,\Sigma) = \frac{A}{2H}\left(G + \frac{1}{G}\right) - 1, \quad d_8(S,\Sigma) = \frac{A + \frac{1}{H}}{G + \frac{1}{G}} - 1, \quad d_9(S,\Sigma) = \frac{A}{G} + \frac{G}{H} - 2,$$

and fusion derivatives thereof.
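As an illustration only, the metrics listed above can be evaluated as follows. This passage does not define A, G, and H; the sketch assumes they denote the arithmetic, geometric, and harmonic means of the eigenvalues of the product of S and the inverse of Σ, as is customary for this family of covariance distance measures. The function names are hypothetical, and the eigenvalues are taken as an input rather than computed.

```python
import math
from typing import Dict, Sequence, Tuple


def eigenvalue_means(eigenvalues: Sequence[float]) -> Tuple[float, float, float]:
    """Arithmetic (A), geometric (G), and harmonic (H) means of positive eigenvalues.

    Assumption: the eigenvalues are those of S times the inverse of Sigma.
    """
    if not eigenvalues or any(e <= 0 for e in eigenvalues):
        raise ValueError("eigenvalues must be a non-empty sequence of positive numbers")
    n = len(eigenvalues)
    a = sum(eigenvalues) / n
    g = math.exp(sum(math.log(e) for e in eigenvalues) / n)
    h = n / sum(1.0 / e for e in eigenvalues)
    return a, g, h


def distance_metrics(a: float, g: float, h: float) -> Dict[str, float]:
    """Evaluate d1, d5, d6, d7, d8, and d9 from the three means."""
    g_term = g + 1.0 / g
    return {
        "d1": a / h - 1.0,
        "d5": a + 1.0 / h - 2.0,
        "d6": (a + 1.0 / h) * g_term - 4.0,
        "d7": a / (2.0 * h) * g_term - 1.0,
        "d8": (a + 1.0 / h) / g_term - 1.0,
        "d9": a / g + g / h - 2.0,
    }
```

Under the stated assumption, every metric evaluates to zero when all eigenvalues equal one, that is, when the sample matrix and the handset matrix coincide.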
In another aspect of the method, said handset matrices are stored in a database of handset matrices wherein each handset matrix is derived from a unique make and model of handset.
In another aspect of the method, said different speakers number ten or more.
In another aspect of the method, the number of said different speakers is no less than twenty.
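One way such a database of handset matrices might be populated is sketched below. The text does not specify how the speech of the different speakers is combined; this sketch assumes, purely for illustration, that one cepstral covariance matrix is computed per speaker and that the handset matrix is their element-wise average. The names `build_handset_matrix` and `add_handset`, and the keying of the database by a (make, model) pair, are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Matrix = List[List[float]]
HandsetDatabase = Dict[Tuple[str, str], Matrix]  # keyed by (make, model)


def build_handset_matrix(
    speaker_matrices: Sequence[Matrix], min_speakers: int = 10
) -> Matrix:
    """Average per-speaker covariance matrices recorded through one handset.

    Assumption: element-wise averaging; min_speakers enforces the
    "ten or more" (or twenty) speaker requirement.
    """
    count = len(speaker_matrices)
    if count == 0 or count < min_speakers:
        raise ValueError(f"need at least {max(min_speakers, 1)} speakers, got {count}")
    dim = len(speaker_matrices[0])
    return [
        [sum(m[i][j] for m in speaker_matrices) / count for j in range(dim)]
        for i in range(dim)
    ]


def add_handset(
    database: HandsetDatabase,
    make: str,
    model: str,
    speaker_matrices: Sequence[Matrix],
    min_speakers: int = 10,
) -> None:
    """Store one handset matrix per unique make and model."""
    key = (make, model)
    if key in database:
        raise KeyError(f"handset {key} is already in the database")
    database[key] = build_handset_matrix(speaker_matrices, min_speakers)
```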
Disclosed is a program storage device, readable by machine, tangibly embodying a program of instructions executable by the machine to perform method steps for automated handset identification, said method steps comprising receiving a sample speech input signal from a sample handset; deriving a cepstral covariance sample matrix from said sample speech input signal; calculating, with a distance metric, all distances between said sample matrix and one or more cepstral covariance handset matrices, wherein each said handset matrix is derived from a plurality of speech signals taken from different speakers through the same handset; and determining if the smallest of said distances is below a predetermined threshold value.
In another aspect of the invention, said distance metric is selected from

$$d_1(S,\Sigma) = \frac{A}{H} - 1, \quad d_5(S,\Sigma) = A + \frac{1}{H} - 2, \quad d_6(S,\Sigma) = \left(A + \frac{1}{H}\right)\left(G + \frac{1}{G}\right) - 4,$$

$$d_7(S,\Sigma) = \frac{A}{2H}\left(G + \frac{1}{G}\right) - 1, \quad d_8(S,\Sigma) = \frac{A + \frac{1}{H}}{G + \frac{1}{G}} - 1, \quad d_9(S,\Sigma) = \frac{A}{G} + \frac{G}{H} - 2,$$

and fusion derivatives thereof.
In another aspect of the invention, said handset matrices are stored in a database of handset matrices wherein each handset matrix is derived from a unique make and model of handset.
In another aspect of the invention, said different speakers number ten or more.
In another aspect of the invention, the number of said different speakers is no less than twenty.
Disclosed is an automated handset identification system, comprising means for receiving a sample speech input signal from a sample handset; means for deriving a cepstral covariance sample matrix from said sample speech input signal; means for calculating, with a distance metric, all distances between said sample matrix and one or more cepstral covariance handset matrices, wherein each said handset matrix is derived from a plurality of speech signals taken from different speakers through the same handset; and means for determining if the smallest of said distances is below a predetermined threshold value.
In another aspect of the invention, said means for receiving sample speech is in communication with an incoming line of communication.
In another aspect of the invention, said incoming line of communication is a phone line.