People have been interested in analyzing and simulating the human voice since ancient times. Recently, interest in speech processing has grown with modern communication systems. Improved methods of speech processing can provide us with more sophisticated means for recording, securing, transporting, and imparting information. To illustrate, modern communication systems could include talking computers, information systems secured with a voiceprint password, and typing machines that receive data through voice input (as opposed to manual or keyed input). Voiceprints can be used to control access to protected services, such as phone cards, voice mail, customer accounts, and cellular service. See, for example, U.S. Pat. No. 5,414,755, issued to Lawrence G. Bahler et al. on May 9, 1995, entitled SYSTEM AND METHOD FOR PASSIVE VOICE VERIFICATION IN A TELEPHONE NETWORK, and assigned to ITT Corporation, the assignee herein, which discloses the use of speaker verification in connection with long-distance telephone services.
Automated speech processing may take many forms. For example, speech can be input into a device to (1) confirm or verify the identity of the speaker (known as speaker verification); (2) identify an unknown talker (speaker recognition); (3) convert what the speaker says into a graphical representation (speech recognition); (4) translate the substance of what the speaker has said, or have the device respond to the speaker's utterances (speech understanding); or (5) determine whether specific subjects were discussed or words uttered (word spotting). The method of this invention is particularly useful with regard to the first of these methods, namely, speaker verification, although it is contemplated that it could be applied to other methods as well.
In speaker verification, “reference patterns” or Models of the speech phenomenon to be recognized are generally developed first. The Models are then compared to subsequently obtained speech samples (which may be unknown). In previous methods, both the Models and the unknown speech samples have been analyzed by dividing the speech utterances into short segments of speech called frames. Previous methods analyze the frames separately and then combine the results under the assumption that the frames are statistically independent.
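The frame-based analysis described above can be illustrated with a short Python sketch. It divides a sampled utterance into fixed-length frames and combines per-frame scores by plain summation, which is what the independence assumption amounts to when the scores are log-likelihoods. The function names, the frame length, and the hop size are hypothetical choices made for illustration and are not taken from any of the previous methods.

```python
def split_into_frames(samples, frame_len, hop):
    """Return consecutive frames of frame_len samples, advancing hop samples each time.

    Trailing samples that do not fill a whole frame are dropped.
    """
    if frame_len <= 0 or hop <= 0:
        raise ValueError("frame_len and hop must be positive")
    last_start = len(samples) - frame_len
    return [samples[i:i + frame_len] for i in range(0, last_start + 1, hop)]


def combine_frame_scores(frame_scores):
    """Combine per-frame log-likelihood scores as if the frames were independent.

    Under independence the joint log-likelihood is the sum of the
    per-frame log-likelihoods, so the combination is a simple sum.
    """
    return sum(frame_scores)


# Hypothetical usage: ten samples, frames of four samples, hop of two.
frames = split_into_frames(list(range(10)), frame_len=4, hop=2)
total = combine_frame_scores([0.5, -0.25, 1.0])
```

Each frame is scored on its own here, so nothing in the sum reflects how strongly neighboring frames resemble one another.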
Methods for measuring speech across spectral bands and for comparing models of speech are well known. Speech, when uttered, produces an acoustic waveform which can be measured as a function of frequency, time, and intensity. A typical speech analyzer consists of a series of band pass filters arranged in a set which covers all the frequencies of the speech spectrum (typically from 50–10,000 Hz). The frequency response characteristics of a good band pass filter set are well known. These band pass filters are operated over the entire speech band and have fixed band widths over a given frequency range. Automatic switching of a level indicator or other device to each of the filter outputs in a frequency sequence can give one an octave spectrum plot of all the frequencies. When the sequence is rapidly cycled, one can monitor time variations of the spectrum. Hence, spectrographs of speech waves can be produced. In speaker verification systems, the analog waveform of the speaker's voice is converted to digital form. This is done for both the reference Models developed from enrollment data and for the speech samples developed from test data. Computer processing is then applied to measure the Euclidean distance between the reference Models and the speech samples, which may be determined by applying a linear discriminant function (LDF), i.e., Fisher's LDF, which is well known and described in the literature. See, for example, an article entitled “The use of multiple measurements in taxonomic problems” by R. A. Fisher, published in Contributions to Mathematical Statistics (John Wiley, N.Y.) (1950). See also “Pattern Classification and Scene Analysis” pp. 114–114 by R. O. Duda and P. E. Hart (Wiley, N.Y.) (1973). If the distance measured is less than a predetermined value, or in other words produces a certain “score,” a decision is made to accept or reject the test data as having been uttered by the same person providing the enrollment data.
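The distance-and-threshold decision described above can be sketched in Python as follows. This is a minimal illustration only: it assumes that the reference Model and the test sample have each been reduced to a feature vector of equal length, it uses a plain Euclidean distance rather than Fisher's LDF, and the names euclidean_distance and verify_speaker and the threshold values are hypothetical.

```python
import math


def euclidean_distance(model, sample):
    """Euclidean distance between two feature vectors of equal length."""
    if len(model) != len(sample):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((m - s) ** 2 for m, s in zip(model, sample)))


def verify_speaker(model, sample, threshold):
    """Accept the claimed identity when the distance is below the threshold."""
    return euclidean_distance(model, sample) < threshold


# Hypothetical two-dimensional feature vectors for illustration.
distance = euclidean_distance([0.0, 0.0], [3.0, 4.0])
accepted = verify_speaker([0.0, 0.0], [3.0, 4.0], threshold=6.0)
```

Raising the threshold makes the verifier more permissive and lowering it makes the verifier stricter, so in practice the predetermined value trades false acceptances against false rejections.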
To date, however, voice verification methods have divided the speech utterances into frames and analyzed the frames separately. The results are then combined with the assumption that the frames are statistically independent. For general background regarding speaker verification systems and the use of frames of speech data, see, for example, U.S. Pat. No. 5,339,385, issued to Alan L. Higgins, the inventor herein, on Aug. 16, 1994, entitled SPEAKER VERIFIER USING NEAREST-NEIGHBOR DISTANCE MEASURE; U.S. Pat. No. 4,720,863 issued to Kung-Pu Li on Jan. 19, 1988, entitled METHOD AND APPARATUS FOR TEXT-INDEPENDENT SPEAKER RECOGNITION; and U.S. Pat. No. 4,837,830, issued to Wrench, Jr. on Jun. 6, 1989, entitled MULTIPLE PARAMETER SPEAKER RECOGNITION SYSTEM AND METHODS. Each of these three patents was assigned to ITT Corporation, the assignee herein, and they are hereby incorporated by reference.
The independence assumption in analyzing speech frames as used in previous methods is a mathematical convenience; however, it is incorrect, because the frames of a speech utterance are shaped by the mechanical characteristics and constraints of the speech production mechanism, which remains constant throughout the utterance. In addition, noise and distortions may be added to the speech as it is being measured. Often, these artifacts remain steady over time (e.g., electrical “hum”), adding to the correlation between measured frames. The independence assumption causes these correlated effects to be “counted” as new information at each frame, overstating their true information content and making the verifier unduly sensitive to them.
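The degree of over-counting can be illustrated with a simplified model, assumed here for illustration only and not taken from the invention: every pair of frames is given the same correlation coefficient rho. Under that model, the average of N equally variable frame measurements has the same variance as the average of N / (1 + (N - 1) rho) independent measurements. The Python sketch below computes this effective number of frames; the function name effective_frame_count is hypothetical.

```python
def effective_frame_count(n_frames, rho):
    """Number of independent frames carrying the same information as
    n_frames equally correlated frames (pairwise correlation rho).

    Follows from Var(mean) = (sigma^2 / N) * (1 + (N - 1) * rho)
    for N measurements with common variance sigma^2.
    """
    if n_frames < 1:
        raise ValueError("n_frames must be at least 1")
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie between 0 and 1 in this sketch")
    return n_frames / (1.0 + (n_frames - 1) * rho)


# Hypothetical utterance of 100 frames at three correlation levels.
independent = effective_frame_count(100, 0.0)   # frames truly independent
moderate = effective_frame_count(100, 0.5)      # substantial shared structure
identical = effective_frame_count(100, 1.0)     # every frame repeats the same effect
```

A method that assumes independence treats all 100 frames as separate evidence, whereas under this model a steady artifact that is identical in every frame contributes the information of only one frame.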
An object of this invention is, therefore, to provide a method for analyzing speech frames jointly, without assuming independence, in order to increase the accuracy of the speech processing. The invention herein is superior to previous methods because it frees the analysis from the faulty independence assumption.