The present invention is directed to an improved method and apparatus for identifying human sonic sources.
There are many situations where it is desirable to make a substantially positive identification of sounds originating through the mouth (oral-nasal cavity) of a human. In banking and credit card situations, correlation of the identity of the person presenting the card and the owner of the card requires some form of corroborating identification. As a further example, in police work, threatening, harassing and obscene telephone calls require substantially positive identification of the caller in conjunction with other evidence, before making an arrest of a suspect. As a further example, in extending credit over the telephone, vendors, in addition to credit card number verification, sometimes require caller correlation of fact indicia by voice with stored fact indicia, particularly where there is a controversy over whether or not certain goods were ordered.
The present invention is distinguished from speech recognition systems, in which a machine attempts to identify spoken words to produce a machine translation or identification thereof. In fact, in the practice of the present invention, the recognizable vowel, consonant and syllable sounds that make up recognizable speech (which may be disguised by the suspect) are discarded, and only the non-audible (typically infrasonic or sub-aural) portions of sound emitted orally through a human mouth are utilized in making a substantially positive recognition or identification of a suspect. The invention is based on the fact that, while the voice may be disguised, certain physical structures in the voice box of a suspect are, for all practical purposes, not controllable by humans; that an initial rush of air develops an inaudible sound, usually in the infrasonic range; and that such sounds, although inaudible (below the audible frequency range), are unique to each individual. Thus, while audible portions of the sound emission from the human mouth or oral cavity may be used for identification purposes, the present invention is concerned with the low energy level portion of sound emitted by the human in making speech or other sounds emitted through the oral cavity.
According to the invention, sound recordings are first digitized (e.g., sequential samples of an analog wave are converted to binary words), which may then be manipulated to first delete (in this preferred embodiment) the audible portion of the sound recording. The remaining portion is then analyzed time-wise, frequency-wise and amplitude-wise (three domains), and displayed on an electronic display and, if desired, on a printed display. The signals from which the display is prepared can be compared, either electronically or visually, with a suspect's recorded rendition of the same sound emissions to find points of correlation.
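By way of illustration only, the deletion of the audible portion of a digitized recording might be sketched as follows. The specification does not state how the deletion is performed; this sketch assumes a simple frequency-domain approach in which every frequency component at or above a chosen cutoff is zeroed. The function names (`dft`, `inverse_dft`, `keep_below`) are hypothetical, and the cutoff is a parameter to be chosen by the operator.

```python
import cmath
import math


def dft(samples):
    """Naive discrete Fourier transform (O(n^2)); adequate for short illustrative signals."""
    n = len(samples)
    return [
        sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def inverse_dft(spectrum):
    """Inverse of dft(); returns the real part, since the input signal was real."""
    n = len(spectrum)
    return [
        (sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


def keep_below(samples, sample_rate_hz, cutoff_hz):
    """Zero every frequency bin at or above cutoff_hz and return the filtered samples.

    Assumption: the cutoff separating "audible" from "inaudible" is supplied
    by the caller; it is not a value taken from the specification.
    """
    n = len(samples)
    spectrum = dft(samples)
    for k in range(n):
        # For real input, bins k and n - k carry the same physical frequency.
        freq_hz = min(k, n - k) * sample_rate_hz / n
        if freq_hz >= cutoff_hz:
            spectrum[k] = 0
    return inverse_dft(spectrum)
```

For example, applying `keep_below(mixed, 200, 20.0)` to a toy signal sampled at 200 samples per second that mixes a 5 Hz tone and a 50 Hz tone leaves only the 5 Hz component. A practical system would more likely use a fast Fourier transform or a digital low-pass filter; the naive transform is used here only to keep the sketch self-contained.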
Where telephone lines form a part of the path to the medium used to capture the voice sound pattern, because of the sharp spectral cut-off due to telephone bandwidth filtering, the voice sound wave to be recognized should be passed through essentially the same or similar paths before the comparison process of the present invention.
Mouth shapings, the tongue, the lips and other physical structures which affect the "sound" of speech can be used, as by training various muscles and their control, to disguise a voice. However, those structures which are responsible for forming the inaudible sounds at the leading or trailing portion of an audible speech wave are generally not subject to such control and are involuntary, and essentially the same and unique for a given person. The frequency band under about 500 Hz is found throughout the speech spectrum, and the infrasonic and sub-aural or inaudible portions of the speech spectrum are particularly useful according to the present invention.
Typically, most tape recorders are relatively good at recording the inaudible sounds emitted orally through the mouth and not so good at the high frequency ranges. Thus, while high quality recording equipment can be used in the practice of this invention, one of the advantages and features of the invention is that inexpensive magnetic tape recorders, such as voice activated recorders (VARs), and low cost sound processing equipment can be used in practicing the invention. Moreover, for acoustic surveillance of business establishments which are subject to robberies (banks, convenience stores, gas stations, etc.), low cost voice activated recorders which are on continuously during normal business hours, with automatic reversing tape cassettes, endless loops, etc., or digital recorders with large scale FIFO memory arrays, can be used to capture the infrasonic sounds. It will be appreciated that, if desired, the recording can be timed and dated and stored in a permanent memory. It will also be appreciated that photographic equipment such as video cameras may be used to record the transaction, and the resulting evidence used in conjunction with the present invention to obtain a positive identification of a suspect.
The digitized portions of the card owner's voice samples (of the owner speaking his or her name, for example), prepared as described herein, are recorded on the magnetic strip of a credit, bank, or charge card. At the point of use, the card holder is requested to speak his or her name, which is transduced to electrical signals, digitized and compared against the stored digital rendition using the principles of this invention.
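By way of illustration only, the electronic comparison of a freshly digitized sample against a stored rendition might be sketched as follows. The specification does not name a comparison metric; this sketch assumes a normalized (Pearson) correlation over two equal-length sample sequences. The function names and the threshold of 0.9 are hypothetical and would need to be established empirically.

```python
import math


def normalized_correlation(a, b):
    """Pearson correlation of two equal-length sample sequences, in [-1, 1]."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    denom = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    if denom == 0:
        # A constant sequence carries no shape information to correlate.
        return 0.0
    return sum(x * y for x, y in zip(da, db)) / denom


def is_probable_match(stored, spoken, threshold=0.9):
    """Hypothetical decision rule: accept when the correlation reaches the threshold."""
    return normalized_correlation(stored, spoken) >= threshold
```

Because the correlation is normalized, a sample that has the same shape as the stored rendition but a different overall loudness still scores close to 1. The two sequences are assumed to have been acquired at the same sampling rate and trimmed to the same length beforehand.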
According to one aspect of the invention, sound or voice activated recorders are located at a bank teller's window, for example, to record the infrasonic acoustical emissions from a bank robber, and the recording is later used to assist in making a positive identification of the bank robber. Each teller is equipped with an inconspicuously located voice activated recorder (VAR) or microphone of a VAR. Each VAR records continuously for 1/2 or 1 hour, recording over and over using an endless loop (e.g., 15 minutes) or an automatic tape reversing machine. It could also be a digital recorder, or a solid-state recorder in which an analog-to-digital converter feeds a FIFO memory. Then, when there is a bank hold-up, the VAR is activated by the bank robber stating, for example, "This is a hold-up--put all of the money in the bag", "I have a bomb", etc. These short, commonly used vocalizations of the bank robber are recorded in the event of a robbery, and are saved and given to the police for analysis according to the invention, to identify possible suspects.
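By way of illustration only, the endless-loop behavior of a digital recorder with a FIFO memory might be sketched as follows. This is a minimal sketch assuming a fixed-capacity buffer in which the oldest samples are discarded as new ones arrive; the class name `LoopRecorder` and its parameters are hypothetical.

```python
from collections import deque


class LoopRecorder:
    """Keeps only the most recent loop_seconds of samples, like an endless tape loop."""

    def __init__(self, sample_rate_hz, loop_seconds):
        # deque with maxlen silently drops the oldest items once it is full.
        self.buffer = deque(maxlen=int(sample_rate_hz * loop_seconds))

    def record(self, samples):
        """Append newly digitized samples, overwriting the oldest if the loop is full."""
        self.buffer.extend(samples)

    def save(self):
        """Return a copy of the current loop contents, oldest sample first."""
        return list(self.buffer)
```

In the event of a robbery, `save()` would be called to copy the loop contents to permanent storage before further recording overwrites them.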
As a further use of the invention, prerecorded samples of the inaudible sounds produced by a bona fide telephone customer may be used to validate further telephone orders by that customer. The invention can also be used to identify persons at banks or for admittance to protected or restricted areas.
The system of the invention involves first acquiring and storing a digitized voice sample. This is accomplished when a human subject speaks into a microphone connected to an analog-to-digital (AD) conversion circuit. Such AD circuit must be capable of sampling the analog input from the microphone at a rate of about 5,500 samples per second (or at least twice as high as the highest frequency needed) and must provide operator controls which permit precise selection of the sampling rate. The AD circuit is connected to the data bus of a computer which can retrieve the resulting stream of digital values for storage. The voice sample, thus converted, is stored on magnetic disk in what may be called a voice file.
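By way of illustration only, the relationship between the sampling rate and the highest frequency that can be captured might be sketched as follows. The function names are hypothetical; the factor of two follows the "at least twice as high" rule stated above.

```python
def min_sampling_rate(highest_frequency_hz):
    """Lowest sampling rate (samples per second) under the 'twice the highest frequency' rule."""
    return 2.0 * highest_frequency_hz


def highest_representable_frequency(sample_rate):
    """Highest frequency (Hz) that a given sampling rate can represent."""
    return sample_rate / 2.0
```

At the rate of 5,500 samples per second mentioned above, `highest_representable_frequency(5500)` gives 2,750 Hz, while capturing only the band under about 500 Hz would call for `min_sampling_rate(500)`, or 1,000 samples per second. In practice a rate somewhat above the theoretical minimum is normally chosen.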
The data in the voice file created by the above procedure is then imported by computer software capable of translating it into two types of on-screen graphic display. The first display type is a two-axis graph representing time (x) and amplitude (y). The software must permit selection ("marking") of a portion of the data thus visually represented using a pointing device such as a computer mouse. The software then translates that portion of the data represented by the selected ("marked") section of the visual display into a second type of on-screen display.
The second display type is a three-axis graph representing time (x), frequency (y) and amplitude (z). The selected portion of data is charted as multiple samples or "slices" of the original voice sample data. The number of slices charted is controlled by the software, which selects the values of each nth sample from the selected portion of the voice file data to create a display group of multiple slices separated by equal time intervals. By spacing each slice equidistant from the previous and the next slice, a cascade pattern is created in the display which, when complete, provides a three-dimensional visual array. For our purposes we term this display a "digital acoustic hologram", or simply a digital hologram.
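By way of illustration only, the data behind such a cascade display might be computed as follows. The specification does not say how each slice's frequency content is obtained; this sketch assumes that a slice begins at every nth sample of the selected data and that a naive discrete Fourier transform is applied to a fixed-length window starting there. The function names and the `step` and `window_len` parameters are hypothetical.

```python
import cmath
import math


def magnitude_spectrum(window):
    """Amplitude of each frequency bin from 0 up to half the sampling rate."""
    n = len(window)
    return [
        abs(sum(window[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))) / n
        for k in range(n // 2 + 1)
    ]


def cascade(samples, sample_rate_hz, step, window_len):
    """Return a list of (time_in_seconds, amplitudes) slices at equal time intervals.

    Bin k of each amplitudes list corresponds to k * sample_rate_hz / window_len Hz.
    """
    slices = []
    for start in range(0, len(samples) - window_len + 1, step):
        window = samples[start:start + window_len]
        slices.append((start / sample_rate_hz, magnitude_spectrum(window)))
    return slices
```

Each returned slice supplies one line of the display: the slice time is plotted along x, the bin frequencies along y, and the amplitudes along z. Plotting is left to whatever graphics facility is available.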
The juxtaposition of the lines and curves in any one slice to those of its neighbors, disclosed by the cascade pattern of the acoustic hologram, constitutes the fabric of the processed voice sample which is the basis for voice comparisons in the present invention. The contours and trends evident in the hologram are the "marks" by which a voice sample can be identified.
The voice identification system of this invention for police use is intended for comparison of a voice sample recorded on magnetic tape, obtained from a human subject either by telephone or microphone. A digitized sample is acquired and stored when the taped sample is replayed and the analog signal thus created is connected to an analog-to-digital (AD) conversion circuit. Such AD circuit must be capable of sampling the analog input acquired from the tape at a rate of at least 5,500 samples per second (or twice as high as the highest frequency needed) and must provide operator controls which permit precise selection of the sampling rate. The AD circuit is connected to the data bus of a computer which can retrieve the resulting stream of digital values for storage. The voice sample, thus converted, is stored on magnetic disk in what may be called a voice file.
The data in the voice file created by the above procedure is then imported by computer software capable of translating it into an on-screen graphic display. The display generated by the computer software is a two-axis graph representing time (x) and amplitude (y).
The software also provides for the simultaneous display of a second two-axis graph beside the first graph. The second graph is generated by the software from a second voice file containing the digitized values from a voice sample of a person of known identity. The two voice samples thus displayed side-by-side can be compared visually by the operator who then forms a judgment as to whether the two voice samples are sufficiently similar tentatively to be considered a match.
In order to obtain two graphs with comparable data samples, the software must permit editing of each sample to delete from the displayed graph extraneous noises, excessively long spoken phrases and periods or gaps of silence. Through such editing, each sample can be reduced such that it represents only a sound known to have been produced by the subject's vocal cords, exclusive of any irrelevant sounds such as coughs, sneezes, breathing, etc., and reduced to approximately the same length of time as the other sample. Editing the samples can best be achieved through the use of a pointing device such as a computer mouse which can allow "marking" and subsequent deletion of a portion of the data. The software then retraces the display showing the remaining data.
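By way of illustration only, the editing operations described above might be sketched as follows. This sketch assumes each voice sample is held as a Python list of amplitude values; the function names and the silence threshold are hypothetical. For simplicity, `trim_silence` removes only leading and trailing silence; interior gaps would be removed by marking them and calling `delete_marked`.

```python
def delete_marked(samples, start, end):
    """Remove the marked region samples[start:end] and return the remaining data."""
    return samples[:start] + samples[end:]


def trim_silence(samples, threshold):
    """Drop leading and trailing samples whose absolute value is below threshold."""
    loud = [i for i, s in enumerate(samples) if abs(s) >= threshold]
    if not loud:
        return []
    return samples[loud[0]:loud[-1] + 1]


def match_lengths(a, b):
    """Truncate the longer of two samples so that both cover the same number of values."""
    n = min(len(a), len(b))
    return a[:n], b[:n]
```

After each deletion the display would be retraced from the returned list, so that the operator always sees only the remaining data.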
It should be noted that the voice identification system of this invention is not intended to provide police agencies with a voice analysis for use as evidence in a criminal proceeding. While it may be possible that voice analyses could be accepted as evidence in the future, the purpose of the system as herein disclosed is to provide a tool for police use. It would be considered successful if, for example, it allowed officers to concentrate their efforts in an investigation on only a few suspects by eliminating dozens of others through the use of voice analysis.
It is to be clearly understood that, as discussed above, the invention does not require the use of the audible portion of recorded speech for identification purposes, but it may be used in conjunction with other speech recognition methods and apparatus.
The system described in U.S. Pat. No. 4,837,804 submits the analog voice signal as received by telephone to a process wherein numerical values are assigned to the differences between analog soundwave features. Although the language in the patent includes the term "digitize", it is referring to the arithmetic value assigned to the difference in waveform features, and does not involve digitization of the voice sample itself.
This invention differs from the model in U.S. Pat. No. 4,837,804 in that the voice samples are digitized prior to comparison. This allows a more precise measurement of waveform features, storage of the digitized sample on magnetic media, and manipulation of the sample to optimize the comparison process. Also, the system of U.S. Pat. No. 4,837,804, because it is intended to acquire a voice sample from telephone lines, which clip frequencies below 500 Hz, ignores the sub-aural portion of the frequency spectrum containing the waveforms most important in this comparison scheme.
The scheme disclosed in U.S. Pat. No. 4,827,518 includes digitization of voice samples and extraction of characteristic features from an articulated phrase. However, acknowledging that variations will occur from one voice sample to the next, it requires storage of a "plurality of cepstral coefficient sets" taken from multiple samples, apparently so as to match the one most similar to that from the test sample.
The system of the present invention differs from this scheme in that the patented method computes "closeness metrics" from multiple samples and bases its "decision" on those metrics. The system of this invention has no need to reference multiple samples because it analyzes only the sub-aural portion of the frequency spectrum, recognizing that activity in these lower frequencies does not change substantially from one voice sample to the next among samples taken from the same subject.
Although the system of U.S. Pat. No. 4,827,518, like the present system, provides digitization of its voice samples, it fails to acknowledge the limitations imposed by the Nyquist Effect, which limits the highest frequencies of the digitized sample to a value that is one-half the sampling rate. By failing to specify that samples must be acquired at a sample rate sufficient to reproduce at least all frequencies within the aural range of the spectrum (which it purports to measure), and that each of two samples to be compared must be acquired with the same sample rate, the system reduces itself to a level of performance that can never be of practical benefit.
Another feature of the system of U.S. Pat. No. 4,827,518 is that voice features are stored on a card. However, the system of the present invention proposes storing the digitized data on a magnetic strip mounted on a card similar in size to a credit card, whereas the card of U.S. Pat. No. 4,827,518 would actually contain printed circuits which would interact with the user interface terminals.
The system of U.S. Pat. No. 4,833,713, like the U.S. Pat. No. 4,837,518 system, submits an analog voice signal for comparison. The device recognizes that the waveforms comprising a spoken word or phrase will not be precisely the same from one sample to the next. It purports to compensate for this, not by storing multiple samples, but by storing a consolidated voice pattern created from the superposition of "a plurality of voice patterns". The present invention does not degrade or blur the stored voice sample by superimposition of multiple samples.