Speaker verification has received considerable attention because of its attractive applications in financial transaction authentication, secure access, security protection, human-computer interfaces and other real-world settings. It aims to verify a claimed speaker's identity using pre-stored information within an access-controlled system: under a given matching criterion, the speaker is either accepted as the target speaker or rejected as an impostor.
In general, speech not only conveys linguistic information but also characterizes the speaker's identity, and can thus be utilized for speaker verification. Traditionally, the acoustic speech signal has been the most natural modality for speaker verification. Although a purely acoustic speaker verification system is effective in its application domain, its performance degrades dramatically in environments corrupted by background noise or multiple talkers. Under such circumstances, as shown in FIG. 1, a speaker verification system that also takes video information into account, such as still frames of the face and temporal lip motions, has shown improved performance over acoustic-only systems. Nevertheless, access-controlled systems utilizing still face images are very susceptible to poor picture quality and to variations in pose or facial expression, and are easily deceived by a face photograph placed in front of the camera. In recent years, speaker verification utilizing or fused with lip motions has received wide attention. As a behavioral characteristic, lip motions, together with the accompanying lip shape variations and tongue and teeth visibility, contain rich information that characterizes the identity of a speaker. Nevertheless, the performance of existing lip motion based speaker verification systems remains far below expectations. The main reasons are two-fold: (1) the principal components of the features representing each lip frame are not always sufficient to distinguish the biometric properties of different speakers; (2) the traditional lip motion modeling approaches, e.g., a single Gaussian Mixture Model (GMM) or a single Hidden Markov Model (HMM), are not capable of providing optimal model descriptions for verifying some hard-to-classify speakers.
For instance, some lip motions of different speakers are so similar that the corresponding models learned by these conventional approaches are not discriminative enough to differentiate the speakers. To strengthen the security of speaker verification systems, some researchers have adopted multi-modal expert fusion systems that combine audio, lip motion sequences and face information to improve robustness and overall verification performance. Nevertheless, appropriate fusion of different modalities is extremely difficult, and it may not be easy to run multi-modal experts synchronously in real-world applications.
From a practical viewpoint, a password-protected, biometric-based speaker verification system provides double security: a speaker is not only verified by his or her natural biometric characteristics, but is also required to match a specific password. Unfortunately, acoustic signals carrying private password information are easily perceived and intercepted by nearby listeners, while still face images cannot be embedded with a secure password phrase directly. In contrast, a speaker verification system protected by a lip motion password (simply called lip-password hereinafter) is able to provide such double security. That is, the speaker is verified simultaneously by both the lip-password and the underlying behavioral characteristics of the lip motions. In addition, such a system has at least four merits: (1) the lip motion modality is completely insensitive to background noise; (2) the acquisition of lip motions is somewhat insensitive to distance; (3) such a system can be used by a mute person; (4) a lip-password protected speaker verification system has the unique advantages of being silent and concealed. Therefore, the development of an effective and efficient approach to lip-password based speaker verification is highly desirable.
FIG. 2 is a block/flow diagram illustrating the apparatus/procedure for the speaker registration phase within the lip-password based speaker verification system. An authorized speaker/user may, for example, silently utter his/her private password while facing a video camera connected to a computer processing system. The video camera and computer processing system then capture, process and analyze the recorded video sequence to obtain the desired lip-password sequence. According to the selected password style (separate: the lip-password can be segmented into several visibly distinguishable units of visual speech elements; non-separate: the lip-password cannot be easily divided into several visual speech elements), the system shall model/code the password sequence automatically so that a registered lip-password database can be established.
FIG. 3 is a block/flow diagram illustrating the apparatus/procedure for the speaker verification phase within the lip-password based speaker verification system. Facing a video camera connected to a computer processing system, an unknown speaker/user attempts to obtain access by uttering a password sequence. The video camera and computer processing system then capture, process, and analyze the recorded video sequence (e.g., lip region localization, feature extraction, lip motion segmentation) to extract the lip-password sequence of interest. The system shall then make a decision by matching this sequence against the pre-registered lip-password sequence, i.e., by lip motion matching and password information matching.
In previous work on lip motion based speaker verification, e.g., T. Wark, S. Sridharan, and V. Chandran, “An approach to statistical lip modelling for speaker identification via chromatic feature extraction,” in Proc. IEEE International Conference on Pattern Recognition, vol. 1, 1998, pp. 123-125, and L. L. Mok, W. H. Lau, S. H. Leung, S. L. Wang, and H. Yan, “Lip features selection with application to person authentication,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 3, 2004, pp. iii-397-400, the authors generally took the whole utterance as the basic processing unit. Although different speakers may have different lip motion activities, some of these motions are so similar that it is very difficult to tell them apart using global modeling methods, e.g., a single GMM or a single HMM. Such approaches are usually designed to support a small vocabulary of utterances such as isolated words, digits and letters, and may not be suitable for a longer utterance (e.g., a password). The main reason is that a large number of sample sequences would have to be collected to train all possible models that may appear in long speech. Furthermore, a lip-password protected system should be capable of detecting the target speaker saying the wrong password. Unfortunately, taking the whole utterance as the basic processing unit is inadequate for this task. In fact, a lip-password utterance generally comprises multiple subunits (i.e., the smallest visibly distinguishable units of visual speech elements). Each subunit covers a short period of lip motion, and the subunits differ in style from one element to another. Hence, to investigate lip motion characteristics in more detail, these subunits should be considered individually instead of jointly (i.e., as a whole utterance).
In this document, we mainly focus on the digital lip-password based speaker verification problem, i.e., the password is composed of the digits 0 to 9 only, although the underlying concept and techniques are also applicable to non-digit lip-passwords. To this end, we first extract a group of representative visual features to characterize each lip frame, and then propose a simple but effective algorithm to segment the digital lip-password sequence into a small set of distinguishable subunits. Subsequently, we integrate HMMs with a boosting learning framework, combined with the random subspace method (RSM) and a data sharing scheme (DSS), to model the segmented sequence of each subunit discriminatively so that a precise decision boundary is formed for subunit verification. Finally, whether the digital lip-password is spoken by the target speaker with the pre-registered password is determined from the verification results of all subunits, as learned by the multi-boosted HMMs. Experimental results have shown the efficacy of this approach.
Overview of Related Works
During the past decade, a few techniques such as the neural network (NN), GMM and HMM have been developed for lip motion based applications. In general, successful lip motion based speaker verification relies on a close investigation of the physical process of the corresponding lip motion activities, which contain strong temporal correlations between adjacent observed frames. Hence, among these methods, the HMM has been the most popular because its underlying state structure can model the temporal variations in lip motion activities. The following paragraphs first review discrimination analysis in HMM-based speaker verification, and then give an overview of the framework of HMM-based speaker verification and of Adaboost learning.
Discrimination Analysis
To the best of our knowledge, the performance of existing HMM-based speaker verification systems using lip motions is still far below expectations. The plausible reasons are two-fold: (1) the visual features extracted from the lip movements are not discriminative enough for lip motion modeling and subsequent similarity measurement; (2) the learned lip motion models do not sufficiently characterize the corresponding motion characteristics. For robust speaker verification, discriminative learning is still desired, and can be pursued roughly along two lines: discriminative feature selection and discriminative model learning.
Discriminative feature selection methods, which aim at minimizing the classification loss, not only emphasize the informative features but also filter out the irrelevant ones. Cetingul et al. in H. E. Cetingul, Y. Yemez, E. Erzin, and A. M. Tekalp, “Discriminative analysis of lip motion features for speaker identification and speech-reading,” IEEE Transactions on Image Processing, vol. 15, no. 10, pp. 2879-2891, 2006 adopted the strategy that the joint discrimination measure of any two features is less than the sum of their individual discrimination powers. Accordingly, they utilized a Bayesian technique to select the representative features of each frame discriminatively, provided that the feature components are statistically independent. However, it is very difficult, if not impossible, to determine which single feature component has more discrimination power. Moreover, the feature components belonging to the same feature category are often not statistically independent of each other.
Discriminative model learning approaches, which focus on parameter optimization, generally achieve better performance than non-discriminative learning approaches. In an HMM, the parameters are normally estimated by Maximum Likelihood Estimation (MLE). Recently, some researchers have shown that the decision boundary obtained via discriminative parameter learning algorithms is usually superior to the one obtained from MLE. Typical methods include maximum mutual information (MMI), conditional maximum likelihood (CML) and minimum classification error (MCE). These methods, which aim at maximizing the conditional likelihood or minimizing the classification error, usually achieve better performance than the MLE approach. Nevertheless, they cannot be implemented straightforwardly and are used for certain special tasks only.
However, the majority of existing HMM-based speaker verification systems adopt a fixed scheme that uses a single HMM for lip motion modeling and similarity measurement, which may not perform well because of its limited discrimination power. More recently, some systems based on multiple classifiers, trained on different data subsets or feature subsets, have yielded better results than single-classifier systems. These classifier ensemble approaches are capable of generating more discrimination power and thus better classification results.
Among the existing ensemble algorithms, Adaboost is the most popular and effective learning method. Unlike other traditional ensemble methods such as the sum rule and majority vote, Adaboost aims at building a strong classifier by sequentially training and combining a group of weak classifiers in such a way that the later classifiers focus more and more on hard-to-classify examples. Consequently, the mistakes made by the ensemble classifier are reduced. Recently, some sequence modeling and classification methods, e.g., GMM and HMM, have been successfully integrated with the boosting learning framework to form strong discriminative sequence learning approaches. Siu et al. in M. H. Siu, X. Yang, and H. Gish, “Discriminatively trained gmms for language classification using boosting methods,” IEEE Transactions on Audio, Speech and Language Processing, vol. 17, no. 1, pp. 187-197, 2009 have utilized a boosting method to discriminatively train GMMs for language classification. Foo et al. in S. W. Foo, Y. Lian, and L. Dong, “Recognition of visual speech elements using adaptively boosted hidden markov models,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 14, no. 5, pp. 693-705, 2004 have employed adaptively boosted HMMs to recognize visual speech elements. Their experimental results show that traditional single modeling and classification methods fail to identify some samples with low discrimination capability, while the boosted modeling and classification approaches hold the promise of identifying these hard-to-classify examples. Inspired by these findings, we integrate HMMs with the boosting learning framework to verify hard-to-classify lip-passwords.
Overview of HMM-based Speaker Verification
Let the video database comprise a group of lip motions generated by both the target speaker and impostors. Each lip motion contains a sequence of lip frames. For the HMM of the e-th lip motion, the model $\lambda^e = (\pi^e, A^e, B^e)$ is built with $N$ hidden states, denoted by $S^e = \{S_1^e, S_2^e, \ldots, S_N^e\}$. Suppose $\lambda^e$ is trained from the observed lip sequence $O^e = \{o_1^e, o_2^e, \ldots, o_{l_e}^e\}$, which is emitted from a sequence of hidden states $s^e = \{s_1^e, s_2^e, \ldots, s_{l_e}^e\}$, $s_i^e \in S^e$, where $l_e$ is the total number of frames. Let the output of an HMM take $M$ discrete values from a finite symbol set $V^e = \{v_1^e, v_2^e, \ldots, v_M^e\}$. For an N-state, M-symbol HMM, the parameters of the model $\lambda^e$ are summarized as follows:
1. The initial distribution of the hidden states $\pi^e = [\pi_i]_{1 \times N} = [P(s_1^e = S_i^e)]_{1 \times N}$ ($1 \le i \le N$), where $s_1^e$ is the first state in the state chain.
2. The state transition matrix $A^e = [a_{i,j}]_{N \times N} = [P(s_{t+1}^e = S_j^e \mid s_t^e = S_i^e)]_{N \times N}$ ($1 \le i, j \le N$, $1 \le t \le l_e$), where $s_{t+1}^e$ and $s_t^e$ represent the states at the (t+1)-th and t-th frame, respectively.
3. The symbol emission matrix $B^e = [b_j(k)]_{N \times M} = [P(v_k^e \text{ at } t \mid s_t^e = S_j^e)]_{N \times M}$ ($1 \le j \le N$, $1 \le k \le M$). It gives the probability distribution of a symbol output $v_k^e$ conditioned on the state $S_j^e$ at the t-th frame.
In general, an estimate of $\lambda^e$ can be computed iteratively using the Baum-Welch algorithm. The model obtained in this way can better describe the dynamics of the input sequence, and the method has the advantages of easy implementation and fast convergence. Given the test observation sequence $O^s = \{o_1^s, o_2^s, \ldots, o_{l_s}^s\}$, the goal of the speaker verification task is to reach a decision by computing the likelihood of the test sequence under the target speaker model $\lambda(T)$ and the impostor model $\lambda(I)$. By adopting conditional independence assumptions between the observed variables, the likelihood of the observation sequence conditioned on the specified model is computed as follows:
$$P(O^s \mid \lambda_i) = \prod_{t=1}^{l_s} P(o_t^s \mid \lambda_i), \qquad \lambda_i \in \{\lambda(T), \lambda(I)\}. \tag{1}$$
The likelihood score $P(o_t^s \mid \lambda_i)$ can be measured by means of the forward-backward algorithm, while its most probable path can be obtained via the Viterbi decoding algorithm [27].
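As an illustration of how a likelihood score can be obtained from an N-state, M-symbol model, the following is a minimal Python sketch of the forward pass of the forward-backward algorithm with per-frame rescaling. It is not taken from the source: the function name `forward_log_likelihood` and the toy values of `pi`, `A` and `B` are assumptions made purely for illustration, and observations are assumed to be already quantized to symbol indices.

```python
import math

def forward_log_likelihood(pi, A, B, obs):
    """Return log P(obs | lambda) for a discrete HMM lambda = (pi, A, B).

    pi[i]   : initial probability of hidden state i
    A[i][j] : probability of moving from state i to state j
    B[j][k] : probability of emitting symbol k in state j
    obs     : non-empty list of symbol indices, one per frame
    """
    n_states = len(pi)
    # Initialisation: alpha_1(i) = pi_i * b_i(o_1)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n_states)]
    log_lik = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Induction: alpha_t(j) = (sum_i alpha_{t-1}(i) * a_ij) * b_j(o_t)
            alpha = [
                sum(alpha[i] * A[i][j] for i in range(n_states)) * B[j][obs[t]]
                for j in range(n_states)
            ]
        # Rescale at every frame to avoid underflow on long sequences;
        # the logs of the scaling factors add up to log P(obs | lambda).
        scale = sum(alpha)
        log_lik += math.log(scale)
        alpha = [a / scale for a in alpha]
    return log_lik

# Hypothetical 2-state, 2-symbol model (values chosen only for illustration).
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
print(forward_log_likelihood(pi, A, B, [0, 1, 0]))
```

The sketch raises a `ValueError` from `math.log` if the sequence has zero probability under the model; a practical system would need to handle that case explicitly.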
In general, HMM-based speaker verification can be regarded as a binary classification between the target speaker and impostors, and can be broadly grouped into closed-set and open-set learning problems. In the closed-set case, the speakers of the testing utterances are known, so the models of both the target speaker and the impostors can be learned during the training phase. Given a test observation sequence $O^s = \{o_1^s, o_2^s, \ldots, o_{l_s}^s\}$, classification for this type of speaker verification problem is performed based on the log likelihood ratio (LLR):
$$\begin{aligned}
\mathrm{LLR}(O^s) &= \log \frac{P(O^s \mid \lambda(T))}{P(O^s \mid \lambda(I))} \\
&= \log \frac{\prod_{t=1}^{l_s} P(o_t^s \mid \lambda(T))}{\prod_{t=1}^{l_s} P(o_t^s \mid \lambda(I))} \\
&= \sum_{t=1}^{l_s} \left[ \log P(o_t^s \mid \lambda(T)) - \log P(o_t^s \mid \lambda(I)) \right]
\end{aligned} \tag{2}$$
If $\mathrm{LLR}(O^s) \ge \tau$: accepted. Otherwise: rejected.
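The closed-set decision rule in Eq. (2) can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `llr_decision`, the per-frame log-likelihood values and the threshold are hypothetical, and the per-frame scores are assumed to have been computed beforehand from the target and impostor models.

```python
def llr_decision(target_frame_logps, impostor_frame_logps, tau):
    """Apply Eq. (2): sum per-frame log-likelihood differences, then threshold.

    target_frame_logps[t]   : log P(o_t | target model)
    impostor_frame_logps[t] : log P(o_t | impostor model)
    tau                     : decision threshold
    Returns (llr, accepted).
    """
    if len(target_frame_logps) != len(impostor_frame_logps):
        raise ValueError("both score lists must cover the same frames")
    llr = sum(
        log_p_target - log_p_impostor
        for log_p_target, log_p_impostor
        in zip(target_frame_logps, impostor_frame_logps)
    )
    return llr, llr >= tau

# Hypothetical per-frame scores for a three-frame test sequence.
example_target = [-1.0, -1.2, -0.8]
example_impostor = [-1.5, -1.1, -1.6]
example_llr, example_accepted = llr_decision(example_target, example_impostor, tau=0.0)
print(example_llr, example_accepted)
```

Working in the log domain turns the ratio of products into a sum of differences, which avoids numerical underflow when the sequence has many frames.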
In the open-set case, the impostors are unknown. Hence, impostor models cannot be trained because impostors are arbitrary. Given observations recorded from unknown speakers, the task is to determine whether they belong to the target speaker registered in the database. Note that, in the digital lip-password scenario, utterance styles that differ from the registered one are treated as impostors even if they come from the same speaker. Further, the frame length of the utterance may vary slightly. Therefore, this kind of verification problem is handled using the normalized log likelihood (NLL):
$$\begin{aligned}
\mathrm{NLL}(O^s) &= \frac{1}{l_s} \log P(O^s \mid \lambda(T)) \\
&= \frac{1}{l_s} \sum_{t=1}^{l_s} \log P(o_t^s \mid \lambda(T)).
\end{aligned} \tag{3}$$
If $\mathrm{NLL}(O^s) \ge \tau$: accepted. Otherwise: rejected.
Overview of Adaboost Learning
Let us consider a two-class classification problem. We are given a set of $N_t$ labeled training samples $(x_1, y_1), (x_2, y_2), \ldots, (x_{N_t}, y_{N_t})$, where $y_i \in \{1, -1\}$ is the class label for the sample $x_i \in \mathbb{R}^n$. Each training sample has a weight $w_i$ (together forming a distribution), which is initialized to a uniform value. Let $h(x)$ denote a decision stump (weak classifier), which generates ±1 labels. The AdaBoost procedure involves $R$ boosting rounds of weak classifier learning and weight adjustment under a loss minimization framework, producing a decision rule as follows:
                                                        H              R                        ⁡                          (              x              )                                =                                    ∑                              m                =                1                            R                        ⁢                                          α                m                            ⁢                                                h                  m                                ⁡                                  (                  x                  )                                                                    ,                            (        4        )            where αm represents the vote (i.e. confidence) of the decision stump hm. In general, the optimal value of am is accomplished via minimizing an exponential loss function [23]:
$$\mathrm{Loss}\left(H_R(x)\right) = \sum_{i=1}^{N_t} \exp\left(-y_i H_R(x_i)\right). \tag{5}$$
Given the current ensemble classifier $H_{r-1}(x)$ and the newly learned weak classifier $h_r(x)$ at the $r$-th boosting round, the optimal coefficient $\alpha_r$ for the ensemble classifier $H_r(x) = H_{r-1}(x) + \alpha_r h_r(x)$ is the one that leads to the minimum loss:
$$\alpha_r = \arg\min_{\alpha} \left( \mathrm{Loss}\left(H_{r-1}(x) + \alpha\, h_r(x)\right) \right). \tag{6}$$
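According to the optimization algorithm [28], let $\varepsilon_r$ be the weighted training classification error, i.e.,
$$\varepsilon_r = \sum_{i=1}^{N_t} w_i^r \cdot \left[h_r(x_i) \neq y_i\right], \tag{7}$$
where the bracket $[\cdot]$ equals 1 if the enclosed condition holds and 0 otherwise. The resulting $\alpha_r$ and the updated weights $w_i$ are given by: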
$$\alpha_r = \frac{1}{2} \log\left(\frac{1 - \varepsilon_r}{\varepsilon_r}\right) \tag{8}$$
$$w_i^{r+1} = w_i^r \cdot \exp\left(-y_i \alpha_r h_r(x_i)\right). \tag{9}$$
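The closed form in equation (8) follows from the minimization in equation (6). The short derivation below is a sketch under two assumptions that the text does not state explicitly: the weights at round r are proportional to exp(−y_i H_{r−1}(x_i)), and they are normalized to sum to one. The symbol Z_{r−1} is introduced here only for the normalizing constant.

```latex
% Assumption: w_i^r \propto \exp(-y_i H_{r-1}(x_i)) with \sum_i w_i^r = 1.
% Since y_i h_r(x_i) = +1 for a correct decision and -1 for an error:
\begin{align*}
\mathrm{Loss}\bigl(H_{r-1}(x) + \alpha\, h_r(x)\bigr)
  &= \sum_{i=1}^{N_t} \exp\bigl(-y_i H_{r-1}(x_i)\bigr)\,
     \exp\bigl(-\alpha\, y_i h_r(x_i)\bigr) \\
  &= Z_{r-1} \sum_{i=1}^{N_t} w_i^r \exp\bigl(-\alpha\, y_i h_r(x_i)\bigr),
     \qquad Z_{r-1} = \sum_{i=1}^{N_t} \exp\bigl(-y_i H_{r-1}(x_i)\bigr) \\
  &= Z_{r-1} \bigl[(1-\varepsilon_r)\, e^{-\alpha} + \varepsilon_r\, e^{\alpha}\bigr].
\end{align*}
% Setting the derivative with respect to \alpha to zero:
\begin{align*}
-(1-\varepsilon_r)\, e^{-\alpha} + \varepsilon_r\, e^{\alpha} = 0
\;\Longrightarrow\;
e^{2\alpha} = \frac{1-\varepsilon_r}{\varepsilon_r}
\;\Longrightarrow\;
\alpha_r = \frac{1}{2}\log\frac{1-\varepsilon_r}{\varepsilon_r}.
\end{align*}
```

Under the same assumption, the update in equation (9) is the factor exp(−y_i α_r h_r(x_i)) that turns exp(−y_i H_{r−1}(x_i)) into exp(−y_i H_r(x_i)), followed by renormalization. As a numerical illustration, a weighted error of 0.3 gives a vote of about 0.42, and the vote is positive only when the error is below 0.5.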
Following this framework, as depicted in FIG. 4, the weights of hard-to-classify examples are increased. Meanwhile, the updated weights also determine the probability of each example being selected to form a new training data set for the subsequent component classifier. For instance, if a training sample is classified correctly, its chance of being selected again for the subsequent component classifier is reduced. Conversely, if the sample is misclassified, its chance of being selected again is raised. By calling the component classifier several times (i.e., over several boosting rounds), as long as the training error of each component classifier is less than 0.5, the training error of the ensemble classifier will decrease as the boosting rounds continue. In AdaBoost, the individual classifiers are built sequentially, each one trained on data reweighted according to the errors of its predecessors. Consequently, a strong classifier is generated by linearly combining these component classifiers, weighted by their votes, through a sequence of optimization iterations.
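The boosting loop described above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation of the invention: it uses plain threshold stumps as the weak classifiers, and it applies the weights of equation (9) directly in the weighted error of equation (7) instead of resampling a new training set. The function names (`stump_predict`, `train_stump`, `adaboost`, `predict`) and the toy one-dimensional data set are assumptions made for this example.

```python
import math

def stump_predict(row, feature, threshold, polarity):
    """Decision stump h(x): returns polarity if x[feature] > threshold, else -polarity."""
    return polarity if row[feature] > threshold else -polarity

def train_stump(X, y, w):
    """Exhaustively pick the stump with the lowest weighted error, as in equation (7)."""
    best = None
    for feature in range(len(X[0])):
        values = sorted(set(row[feature] for row in X))
        # Candidate thresholds: one below the minimum, then midpoints between values.
        thresholds = [values[0] - 1.0]
        thresholds += [(a + b) / 2.0 for a, b in zip(values, values[1:])]
        for threshold in thresholds:
            for polarity in (1, -1):
                err = sum(wi for row, yi, wi in zip(X, y, w)
                          if stump_predict(row, feature, threshold, polarity) != yi)
                if best is None or err < best[0]:
                    best = (err, feature, threshold, polarity)
    return best

def adaboost(X, y, rounds):
    """Return a list of (alpha, feature, threshold, polarity) tuples."""
    n = len(X)
    w = [1.0 / n] * n                      # uniform initial weights
    ensemble = []
    for _ in range(rounds):
        err, feature, threshold, polarity = train_stump(X, y, w)
        if err >= 0.5:                     # weak learner no better than chance
            break
        err = max(err, 1e-10)              # guard against log of infinity
        alpha = 0.5 * math.log((1.0 - err) / err)          # equation (8)
        ensemble.append((alpha, feature, threshold, polarity))
        # Equation (9), followed by renormalization (an assumption of this sketch).
        w = [wi * math.exp(-yi * alpha * stump_predict(row, feature, threshold, polarity))
             for row, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, row):
    """Sign of the weighted vote H_R(x) from equation (4)."""
    score = sum(alpha * stump_predict(row, f, t, p) for alpha, f, t, p in ensemble)
    return 1 if score >= 0 else -1

# Toy data: no single threshold separates the two classes.
X = [[float(v)] for v in range(10)]
y = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

ensemble = adaboost(X, y, rounds=3)
predictions = [predict(ensemble, row) for row in X]
print(predictions == y)
```

On this toy set the best single stump misclassifies three of the ten samples, while the three-round ensemble classifies all of them correctly, which illustrates how the reweighting concentrates later stumps on the samples that earlier ones got wrong.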
In the prior art, such as U.S. Pat. No. 6,219,639, United States Patent Application Publication No. 2011/0235870 and U.S. Pat. No. 6,421,453, it is disclosed that lip information is incorporated to enhance access security. Nonetheless, these references invariably require incorporating at least one other biometric modality, such as face, acoustic signals, voice-print, signature, fingerprint or retinal print, to achieve speaker verification, and often require more complicated procedures to achieve the security goals. To the best of our current knowledge, there is no known prior art that relies on the single modality of lip motions while at the same time embedding private password information as a double security for the access-controlled system, where the speaker is not only verified by his or her underlying dynamic characteristics of lip motions, but is also required to match a specific password embedded in the lip motion simultaneously.
The objective of the present invention is to provide a method and apparatus for a lip-password based speaker verification approach that utilizes only the single modality of lip motions, in which the private password information is embedded into the lip motions synchronously. A further objective of the present invention is to provide a method and apparatus that guarantees double security for an access-controlled system, where the speaker is not only verified by his or her underlying dynamic characteristics of lip motions, but is also required to match a specific password embedded in the lip motion simultaneously. That is, in the case where the target speaker says the wrong password, or even in the case where an impostor knows and says the correct password, the nonconformity will be detected and the authentication or access will be denied. Another objective of the present invention is to provide a method and apparatus that is not only easily implemented, but also offers at least the following four merits: (1) The modality of lip motion is completely insensitive to background noise; (2) The acquisition of lip motions is insusceptible to distance; (3) Such a system is easily usable by a mute person; (4) The lip-password protected speaker verification system has the unique advantages of silence and concealment.
Citation or identification of any reference in this section or any other section of this document shall not be construed as an admission that such reference is available as prior art for the present application.