On Jun. 30, 2000, Federal law made authenticated digital signatures legally binding, especially for E-Commerce. The present Public Key Infrastructure (PKI) uses digital certificates with encryption code as authenticated digital signatures for external point-to-point electronic transactions and transmissions. To complete the secured transmission, internal authenticated digital signatures are required of the personnel authorized to use these digital certificates. Personnel authentication must first be established, especially for remote authorization of the PKI.
Digital signatures require an uncompromisable (non-digital, biometric) core to authenticate the actual operators of PKI established transmissions for the PC/telephony infrastructure of our insecure digital world, as well as for in-person verification. Typically, cards having a coded magnetic strip are used, and these and the access codes enabling their use have to be distributed in advance.
Various biometric authentication methods that have been used, such as fingerprinting and iris or retinal scans, are difficult to implement because they require special hardware, and they can make people feel uncomfortable or can even transmit illness. Most importantly, these methods have not proven to be accurate enough for absolute identification.
The human voiceprint, comprising the amplitudes of each frequency component, and being both non-digital and non-intrusive, is ideally suited for authentication purposes given its worldwide usage within an already existing communications infrastructure, including wireless. With the ever-present utilization of telephones and microphones, voice authentication or verification, also known as speaker recognition, is a natural, and certainly most cost-effective, evolution of technology which can be utilized from anywhere, at any time.
In a paper entitled Nonlinear Speech Processing Applied To Speaker Recognition, Marcos Faundez-Zanuy points out that speaker recognition has many applications, including voice dialing, banking over the telephone, security controls, forensic systems and telephone shopping. This has raised great interest in improving the current speaker recognition systems. A key issue is the set of acoustic features extracted from the speech signal. This set is required to convey as much speaker dependent information as possible. The standard methodology to extract these features from the signal uses either filter bank processing or linear predictive coding (LPC). Both methods are, to some extent, linear procedures and are based on the underlying assumption that acoustic characteristics of human speech are mainly due to the vocal tract resonances, which form the basic spectral structure of the speech signal.
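By way of illustration only, the linear predictive coding mentioned above can be sketched with the textbook autocorrelation method and the Levinson-Durbin recursion. The function names (`autocorrelation`, `lpc`), the frame contents and the predictor order are assumptions made for this sketch and are not taken from the cited paper.

```python
def autocorrelation(frame, max_lag):
    """Autocorrelation r[0..max_lag] of one speech frame (list of floats)."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def lpc(frame, order):
    """Estimate predictor coefficients a[1..order] such that
    frame[n] is approximated by sum(a[k] * frame[n - k]).
    Returns (coefficients, residual_energy)."""
    r = autocorrelation(frame, order)
    a = [0.0] * (order + 1)          # a[0] is unused
    err = r[0]                       # prediction error energy
    for i in range(1, order + 1):
        if err <= 0.0:               # silent or degenerate frame
            break
        # Reflection coefficient for this model order.
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        err *= (1.0 - k * k)
    return a[1:], err


# Hypothetical test frame: a decaying exponential, i.e. x[n] = 0.9 * x[n-1].
frame = [0.9 ** n for n in range(200)]
coeffs, residual = lpc(frame, order=2)
```

For this synthetic frame the first coefficient comes out very close to 0.9 and the second close to zero, which is what a first-order linear model predicts. Real speech frames would normally be windowed and use a higher order.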
However, human speech is a nonlinear phenomenon, which involves nonlinear biomechanical, aerodynamic, acoustic and physiological factors. LPC-derived parameters can only offer a sub-optimal description of the speech dynamics. Therefore, there has been a growing interest for nonlinear models applied to speaker recognition applications. Speech signals are redundant and non-stationary in nature. LPC coding schemes take advantage of the redundancy, but do not offer a way to account for non-stationarity, and for nonlinear redundancies such as the fractal structure of frication, sub-harmonics in the voice source and nonlinear source tract interaction.
Therefore, further investigations are needed to identify the appropriate acoustic features. All agree that new time-frequency representations, both acoustical and perception-based, are needed. Moreover, although the human decoding of the speech signal is based on decisions in narrow frequency bands processed independently from each other, sub-band processing techniques, such as sub-band feature extraction algorithms, have not yet been exploited. Time dependent and multi-time dependent fractal dimensions, as well as Lyapunov exponents and the dimension and metric entropy of phoneme signals, which have mostly been used for speech recognition applications, have not yet been adapted, in combination, for speaker recognition.
Voice verification (or speaker verification) is to be distinguished from voice identification; the latter is a tougher problem, for which the existing technology does not at present provide an adequate solution. Voice verification, on the other hand, is easier because the decision to establish the voice's authenticity is essentially binary in nature. In voice verification, the principal question is whether the individual is who he claims to be, based on his utterance.
There are two approaches to voice verification, text dependent and text independent. In the text dependent approach, the user utters the same text (with possible variations on the order of the words constituting the text) whereas in the text independent approach, the user is not constrained to a single text or aggregation of words and can utter arbitrary text. The text independent approach, while having the advantage of user friendliness, requires extensive training and the performance is not satisfactory for practical applications.
In typical text dependent voice verification scenarios, the user registers a phrase by repeating it many times, and when he wishes to be verified he utters the same phrase. The system examines the phrase uttered during the registration phase, collects information regarding the spectral and temporal manner of the phrase utterance, and deduces whether the phrase uttered during verification possesses similar characteristics. Ultimately the success (or failure) of the process critically depends on user acceptance and user friendliness. One postulates the following axiom for making the voice verification process easier for the common person: the system should learn along with the user and adapt itself for harmonious performance.
There are two principal approaches to text dependent voice verification, namely Dynamic Time Warping (DTW) and Hidden Markov Modeling (HMM). The DTW approach, developed during the 1970's predominantly by Texas Instruments (Doddington's group, based on his doctoral dissertation on verification), breaks the utterance into feature vectors and finds scores between utterances by matching these feature vectors. During this pattern matching, it stretches or contracts speech segments (within certain constraints) for maximal scores, using an optimization principle called Dynamic Programming expounded by Bellman during the 1950's. The feature vector representing the initial utterance may or may not be updated by the next utterance (referred to as “smoothing”) based on the inter-utterance scores and the system discipline. Some verification systems may keep a cluster of feature vectors for each user whereas other systems may maintain only one feature vector.
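The DTW matching described above can be sketched, under stated assumptions, as a small dynamic program. The sketch below uses Euclidean distance between feature vectors and an optional band constraint on the warping path; the name `dtw_distance`, the `window` parameter and the toy feature vectors are all hypothetical, and a lower accumulated distance plays the role of a higher matching score.

```python
import math


def dtw_distance(a, b, window=None):
    """Accumulated distance of the best time alignment between two
    sequences of feature vectors (each vector a tuple of floats).
    `window` optionally limits how far the alignment may stray from
    the diagonal, which is one common form of warping constraint."""
    n, m = len(a), len(b)
    # The band must be at least |n - m| wide or no path would exist.
    w = max(n, m) if window is None else max(window, abs(n - m))
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - w), min(m, i + w) + 1):
            local = math.dist(a[i - 1], b[j - 1])
            # Stretch a, stretch b, or advance both (dynamic programming).
            cost[i][j] = local + min(cost[i - 1][j],
                                     cost[i][j - 1],
                                     cost[i - 1][j - 1])
    return cost[n][m]


# Hypothetical one-dimensional "utterances": the second is a slower
# rendition of the first, so the warped distance is zero.
enrolled = [(0.0,), (1.0,), (2.0,)]
slower = [(0.0,), (0.0,), (1.0,), (2.0,), (2.0,)]
different = [(0.0,), (1.0,), (3.0,)]
same_speaker_distance = dtw_distance(enrolled, slower)      # 0.0
other_distance = dtw_distance(enrolled, different)          # 1.0
```

A verification system built on such a measure would compare the distance against a threshold; how that threshold is chosen is outside the scope of this sketch.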
The related field of speech recognition (recognizing the content of what is said, as opposed to who is saying it) has traditionally used feature vectors comprising four or five dimensions. Because the task of speech recognition can tolerate a lower accuracy than that required for speaker recognition, the prior art has heretofore not applied feature vectors having added dimensions.
Nature exhibits chaotic, seemingly random behavior with an underlying, but unpredictable order. Fractals are objects with fractional dimension. They show self-similarity as the scale is changed. In evaluating a shape (e.g., for a voiceprint frequency spectrum), it is often desirable to find the capacity dimension of the shape. A line is one-dimensional, a filled square is two-dimensional, and a cube is three-dimensional. Fractal geometry concerns objects whose capacity dimension is fractional rather than integer. For example, four unit squares can fill a 2-by-2 square, where the unit squares are one-quarter sized identical copies of the 2-by-2 square. Recursively, each unit square can be precisely filled with four squares 1/16 the size of the original 2-by-2 square, etc. In nature, a fern is a fractal, where each branch and sub-branch, etc., are a small fractional reproduction of the overall fern. The fractal signal model often can be applied for signals used for spectral and expressive control of the voice.
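As an illustration of the capacity dimension discussed above, one common estimate is box counting: if N(s) boxes of side s are needed to cover a shape, the dimension is the slope of log N(s) against log(1/s). The following sketch applies that estimate to two synthetic point sets, a filled square and a Sierpinski triangle generated on an integer grid. The function names, the grid construction and the box sizes are assumptions chosen for the example and do not come from any system described herein.

```python
import math


def box_count(points, box_size):
    """Number of grid boxes of side `box_size` containing at least one point."""
    return len({(x // box_size, y // box_size) for x, y in points})


def capacity_dimension(points, box_sizes):
    """Least-squares slope of log N(s) versus log(1/s)."""
    xs = [math.log(1.0 / s) for s in box_sizes]
    ys = [math.log(box_count(points, s)) for s in box_sizes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


size = 2 ** 6
box_sizes = [1, 2, 4, 8, 16]

# A filled square: every grid point is occupied.
filled_square = [(x, y) for x in range(size) for y in range(size)]

# A Sierpinski triangle on the same grid: points whose coordinates
# share no set bit (x & y == 0).
sierpinski = [(x, y) for x in range(size) for y in range(size) if x & y == 0]

square_dim = capacity_dimension(filled_square, box_sizes)      # 2.0
sierpinski_dim = capacity_dimension(sierpinski, box_sizes)     # log 3 / log 2
```

The filled square yields the integer dimension 2, while the Sierpinski triangle yields log 3 / log 2, approximately 1.585, a fractional value lying between that of a line and that of a filled square.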
The set of states that a chaotic system visits turns out to be a fractal. Dimension is a measure of irregularity. Scaling behavior of various quantities can be exploited to define dimensions, and chaotic attractors typically have a non-integer dimension. Attractor dimensions thus provide a quantitative characterization of the attractor. Lyapunov exponents, λ, are useful to describe the stability properties of trajectories. A positive Lyapunov exponent is an indication of the presence of chaos, since for λ>0, sufficiently small deviations from a trajectory grow exponentially, which demonstrates a strong instability within the attractor. The inherent instability of chaotic systems implies limited predictability of the future if the initial state is known with only finite precision. Therefore, with the aid of attractor dimension and the Lyapunov exponent, chaos can be distinguished from noise.
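The role of the sign of λ can be illustrated with a system much simpler than speech. The sketch below estimates the Lyapunov exponent of the logistic map x → r·x·(1 − x) by averaging log|f′(x)| along a trajectory. The choice of the logistic map, the function name `logistic_lyapunov` and all parameter values are assumptions for illustration only; estimating λ from a measured speech signal requires phase-space reconstruction, which is not shown here.

```python
import math


def logistic_lyapunov(r, x0=0.2, burn_in=1000, steps=20000):
    """Estimate the Lyapunov exponent of x -> r*x*(1-x) as the
    trajectory average of log|f'(x)|, where f'(x) = r*(1 - 2x)."""
    x = x0
    for _ in range(burn_in):          # discard the transient
        x = r * x * (1.0 - x)
    total = 0.0
    for _ in range(steps):
        derivative = abs(r * (1.0 - 2.0 * x))
        # Guard against log(0) if the orbit lands exactly on x = 0.5.
        total += math.log(max(derivative, 1e-300))
        x = r * x * (1.0 - x)
    return total / steps


stable = logistic_lyapunov(2.5)    # negative: orbit settles on a fixed point
chaotic = logistic_lyapunov(4.0)   # positive: nearby orbits diverge
```

For r = 2.5 the orbit converges to a fixed point and the estimate is negative (ln 0.5). For r = 4 the analytic value is ln 2, and the numerical estimate is positive and close to it, consistent with chaotic behavior.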
In U.S. Pat. No. 6,510,415, to Talmor, dated Jan. 21, 2003 (filed 15 Mar. 2000), the inventor focuses on voice authentication algorithms. The system uses a voiceprint made in real-time to compare against stored voiceprints from a database of users. Access for a user of the system is only permitted on a one-time basis, if the fit is “most similar”. Talmor teaches the techniques of Cepstrum calculation and Linear Time Warping. Cepstrum calculation is a standard prior art parse, which uses constant/personal parameters, while doing calculations on user voices.
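For orientation, the general notion of a cepstrum can be sketched as the inverse Fourier transform of the logarithm of the magnitude spectrum of a frame. The code below is the textbook real cepstrum computed with a naive discrete Fourier transform; it is not the specific calculation of the Talmor patent, whose parameters are not reproduced here, and the names `dft`, `real_cepstrum` and `floor` are assumptions of this sketch.

```python
import cmath
import math


def dft(x):
    """Naive O(n^2) discrete Fourier transform of a real sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n))
            for k in range(n)]


def real_cepstrum(frame, floor=1e-12):
    """Real cepstrum: inverse DFT of log|DFT(frame)|.
    `floor` avoids log(0) for spectral bins with zero magnitude."""
    n = len(frame)
    log_magnitude = [math.log(max(abs(v), floor)) for v in dft(frame)]
    # The log-magnitude spectrum of a real frame is symmetric, so the
    # inverse transform is real; the imaginary residue is discarded.
    return [sum(log_magnitude[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]


# Hypothetical 8-sample frame and the same frame at twice the gain.
frame = [1.0, 0.5, 0.25, 0.125, 0.0, 0.0, 0.0, 0.0]
louder = [2.0 * v for v in frame]
cep = real_cepstrum(frame)
cep_louder = real_cepstrum(louder)
```

Doubling the gain changes only the zeroth cepstral coefficient (by ln 2) and leaves the remaining coefficients unchanged, which is one reason cepstral features are attractive when recording levels vary. A practical system would use a fast Fourier transform rather than the naive transform shown.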
The Linear Time Warping approach of Talmor is a form of optimized, weighted DTW and can also be applied to HMM or Gaussian Mixture Model (GMM). This involves finding the optimum path, which involves calculated parameters. Primarily this patent addresses the voiceprint matching problem using voice features that are distinctive and which most closely characterize and thus identify a speaker, but does not provide the details of the algorithm Talmor uses.
In U.S. Provisional Pat. No. 20030135740 dated Jul. 17, 2003 (filed Oct. 2, 2003) entitled “Biometric-based System and Method Enabling Authentication of E-messages Sent over a Network”, Talmor addresses the problems of PKI encryption using voiceprint analysis and comparison of voice prints to stored data. The patent describes the use of digital signatures, secure on-line transactions and secure messaging, as part of the system of the invention. Biometric data, including palm prints, finger prints, face prints, and voiceprints, become the private keys when utilizing a PKI system of authentication. The biometric-based system of the invention provides a web-server with a three-tier security structure: (1) biometric sample; (2) unique device ID, and (3) a PIN. Again, there is no disclosure of voiceprint authentication details.
In U.S. Pat. No. 6,535,582, Voice Verification System, Harris discloses a plurality of interactive voice response (IVR) units connected to a verification server. The technology generally relates to the network layers and connections based on the internet, with an API module. Accordingly, there are no mathematics or innovative workings disclosed.
In Canadian patent CA2130211C, System and Method for Passive Voice Verification in a Telephone Network, Bahler, et al, disclose verification of an identity that is derived from the customer calling card. The technical innovation and accuracy provided are minimal.
In U.S. Pat. No. 6,496,800, Speaker Verification System and Method Using Spoken Continuous, Random Length Digit String, Kong and Kim teach the use of passive voice verification in a telephone network: a conversation between calling and called parties is passively monitored to obtain a sample signal, which is compared with at least one reference set of speech features to determine whether the calling party is a customer of the telephone network. The invention makes use of thresholds for establishing security. There are no details about the algorithm or the elements of the component blocks.
Kannan, et al, in U.S. Pat. No. 6,728,677, disclose a Method and System for Dynamically Improving Performance of Speech Recognition or Other Speech Processing Systems. The invention provides a speech recognition performance improvement method that involves monitoring utilization of computing resources in a speech processing system, based on which performance of speech processing operations is improved. In U.S. Pat. No. 6,233,556, Teunen teaches A Voice Processing and Verification System. This patent discloses a voice processing method for automatic and interactive telephone systems that involves transforming an enrollment speech model to the corresponding equipment type for incoming user speech. No details are provided in either of these two patents assigned to Nuance, Inc.
In European patent No. EP1,096,473A3 assigned to Persay, Background Model Clustering for Speaker Identification and Verification, Toledo-Ronen discloses the target likelihood score of an unknown speaker, which measures the degree to which the input speech of the unknown speaker matches the model of a target speaker. The method normalizes the target likelihood score and includes the step of selecting one of a plurality of background models as a selected background model. The method also includes the steps of measuring the degree to which the input speech matches the selected background model, thus producing a background likelihood score, and dividing the target likelihood score by the background likelihood score.
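The normalization described above, dividing the target likelihood score by the background likelihood score, becomes a subtraction when the scores are kept as log-likelihoods. The sketch below is a deliberately simplified illustration: each model is a single one-dimensional Gaussian given as a (mean, standard deviation) pair, and the background model is selected as the one that best matches the input. Both of these choices, together with all names and numbers, are assumptions of this example and are not taken from the cited patent.

```python
import math


def gaussian_log_likelihood(samples, mean, std):
    """Average log-likelihood of the samples under a 1-D Gaussian model."""
    return sum(-0.5 * math.log(2.0 * math.pi * std * std)
               - (x - mean) ** 2 / (2.0 * std * std)
               for x in samples) / len(samples)


def normalized_score(samples, target, backgrounds):
    """Log of (target likelihood / selected background likelihood).
    Selection rule (an assumption): the best-matching background model."""
    target_ll = gaussian_log_likelihood(samples, *target)
    background_ll = max(gaussian_log_likelihood(samples, *bg)
                        for bg in backgrounds)
    return target_ll - background_ll


# Hypothetical models as (mean, std) pairs.
target_model = (1.0, 0.5)
background_models = [(0.0, 0.5), (3.0, 0.5)]

claimant = [1.0, 1.1, 0.9]     # resembles the target model
impostor = [3.0, 2.9, 3.1]     # resembles a background model

claimant_score = normalized_score(claimant, target_model, background_models)
impostor_score = normalized_score(impostor, target_model, background_models)
```

A positive normalized score indicates that the input matches the target model better than the selected background model, and a negative score indicates the opposite; an accept or reject decision would then compare the score against a threshold.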
Therefore, there is a need to utilize combinations of time dependent and multi-time dependent fractal dimensions, FFT, Lyapunov exponents and other non-linear techniques for speaker verification systems.