1. Field of the Invention
The present invention relates to the field of speech analysis, and in particular to the analysis of an individual's speech to determine psychological, physiological or other characteristics.
2. Description of the Related Art
Scientists have long known that qualities of the human voice may indicate the emotions of the speaker. Speech is the acoustic response to motion of the vocal cords and the vocal tract, and to the resonances of openings and cavities of the human head. Air pressure from the lungs is modulated by muscular tension of the vocal cords, among other influences. Human emotions, as well as certain physiological conditions not typically associated with the voice, affect this muscular tension, and thereby affect voice modulation. Further, speech may also be affected by certain physiological conditions, such as dementia, learning disabilities, and various organically-based speech and language disorders.
Others have attempted to associate emotional qualities quantitatively with physical speech characteristics. In U.S. Pat. No. 3,855,417, issued to Fuller, the normalized peak energy ratio from two frequency bands of a subject's voice is used to determine whether the subject is telling the truth. In U.S. Pat. No. 3,855,416, issued to Fuller, a skilled interrogator asks the subject questions designed to elicit a true or false response. Fuller's system weights a measure of the vibrato content of the subject's speech by the peak amplitude from a selected frequency band. The interrogator derives the veracity of the subject's statement through a comparison of the resulting quantity with that of a known truthful response.
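By way of illustration only, the two-band comparison attributed to U.S. Pat. No. 3,855,417 might be sketched as follows. The band edges, the frame length, the sample rate, and the particular normalization (peak of one band divided by the sum of both peaks) are assumptions made for this example and are not taken from the Fuller reference; the function names are likewise hypothetical.

```python
import cmath
import math


def power_spectrum(samples):
    """Naive DFT power spectrum for bins 0..N/2 (adequate for short frames)."""
    n = len(samples)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(samples))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def peak_in_band(spectrum, sample_rate, n, low_hz, high_hz):
    """Largest spectral power among bins whose center lies in [low_hz, high_hz]."""
    bin_hz = sample_rate / n
    values = [p for k, p in enumerate(spectrum)
              if low_hz <= k * bin_hz <= high_hz]
    if not values:
        raise ValueError("band contains no spectral bins")
    return max(values)


def normalized_peak_energy_ratio(samples, sample_rate, band_a, band_b):
    """Assumed normalization: peak_a / (peak_a + peak_b), a value in [0, 1]."""
    spec = power_spectrum(samples)
    n = len(samples)
    a = peak_in_band(spec, sample_rate, n, *band_a)
    b = peak_in_band(spec, sample_rate, n, *band_b)
    if a + b == 0:
        raise ValueError("both bands are silent")
    return a / (a + b)


# Synthetic frame (hypothetical values): a 200 Hz tone of amplitude 1.0
# plus a 1000 Hz tone of amplitude 0.5, sampled at 8000 Hz for 400 samples.
SAMPLE_RATE = 8000
frame = [math.sin(2 * math.pi * 200 * i / SAMPLE_RATE)
         + 0.5 * math.sin(2 * math.pi * 1000 * i / SAMPLE_RATE)
         for i in range(400)]
ratio = normalized_peak_energy_ratio(frame, SAMPLE_RATE, (100, 400), (800, 1200))
```

For this synthetic frame the peak power in the lower band is four times that in the upper band, so the ratio evaluates to approximately 0.8. How such a value would be mapped to a truthfulness decision is not addressed by this sketch.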
In U.S. Pat. No. 4,093,821, issued to Williamson, a speech analyzer operates on the frequency components within the first formant band of a subject's speech. The analyzer examines occurrence patterns in differential first formant pitch, rate of change of pitch, duration, and time distribution. The analyzer produces three outputs. The first output indicates the frequency of nulls or "flat" spots in an FM-demodulated first-formant speech signal. Williamson discloses that small differences in frequency between short adjacent nulls indicate stress, and that large differences in frequency between adjacent nulls indicate relaxation. The second output indicates the duration of the nulls. According to Williamson, the longer the nulls, the higher the stress level. The third output is proportional to (1) the ratio of the total duration of nulls during a word period to (2) the total length of the word period. According to Williamson, an operator can determine the emotional state of an individual based upon these three outputs.
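By way of illustration only, the null-duration and null-ratio quantities described above might be computed from a sampled, demodulated frequency track as sketched below. The definition of a "flat" spot used here (consecutive samples differing by no more than a tolerance, sustained for a minimum number of samples), the parameter values, and the function names are assumptions made for this example and are not taken from the Williamson reference.

```python
def find_nulls(track, tolerance, min_samples):
    """Return (start, end) index pairs of flat runs in the track.

    A run is flat while each sample differs from its predecessor by at
    most `tolerance`; runs shorter than `min_samples` are ignored.
    The end index is exclusive.
    """
    nulls = []
    start = 0
    for i in range(1, len(track) + 1):
        flat = i < len(track) and abs(track[i] - track[i - 1]) <= tolerance
        if not flat:
            if i - start >= min_samples:
                nulls.append((start, i))
            start = i
    return nulls


def null_durations(track, tolerance=1.0, min_samples=3):
    """Duration of each null, in samples (cf. the second output)."""
    return [end - start for start, end in find_nulls(track, tolerance, min_samples)]


def null_ratio(track, tolerance=1.0, min_samples=3):
    """Total null duration divided by word-period length (cf. the third output)."""
    if not track:
        raise ValueError("track must contain at least one sample")
    return sum(null_durations(track, tolerance, min_samples)) / len(track)


# Hypothetical frequency track (Hz) spanning one word period.
word_track = [500, 500, 500, 500, 520, 540, 560, 560, 560, 580]
durations = null_durations(word_track)   # [4, 3]
ratio = null_ratio(word_track)           # 0.7
```

In this example two flat runs are found, lasting four and three samples, so nulls occupy seven of the ten samples in the word period.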
U.S. Pat. No. 5,148,483, issued to Silverman, describes a method for detecting suicidal predisposition based upon speech. The voice analyzer examines the signal amplitude decay at the conclusion of an utterance by a test subject, and the degree of amplitude modulation of the utterance. The subject's speech is filtered and displayed on a time-domain strip chart recording. A strip chart recording of a similarly filtered speech signal from a mentally healthy person is obtained. A skilled operator compares the parameters of interest from these two strip charts to determine whether the test subject is predisposed to suicide.
U.S. Pat. No. 4,490,840, issued to Jones, is based upon a relationship between so-called "perceptual dimensions" and seven "vocal profile dimensions." The seven vocal dimensions include two voice and five speech dimensions, namely: resonance, quality, variability-monotone, choppy-smooth, staccato-sustain, attack-soft, and affectivity-control. The voice, speech and perceptual dimensions require assembly from 14 specific properties representative of the voice signal in the frequency domain, plus four arithmetic relationships among those properties, plus the average differences between several hundred consecutive samples in the time domain. To arrive at voice style "quality" elements, the system relies upon relationships between the lower set and the upper set of frequencies in the vocal utterance. The speech style elements, on the other hand, are determined by a combination of measurements relating to the pattern of vocal energy occurrences such as pauses and decay rates. The voice style "quality" elements emerge from three spectral analysis functions, whereas the speech style elements result from four other analysis functions. The voice style quality analysis elements include spectrum spread, spectrum energy balance, and spectrum envelope flatness. The speech style elements are spectrum variability, utterance pause ratio analysis, syllable change approximation, and high frequency analysis.
Jones relates the seven vocal dimensions and seven perceptual style dimensions only to the above-described sound style elements. Each dimension is described as a function of these selected sound style elements. According to Jones's theory, the seven perceptual style dimensions or even different perceptual, personality or cognitive dimensions can be described as a function of the seven sound style elements.
The limitation in the Jones system to seven speech elements apparently constrains the psychological characteristics that can be measured by the system. Jones states that "[t]he presence of specific emotional content such as fear, stress, or anxiety, or the probability of lying on specific words, is not of interest to the invention disclosed herein." Col. 5, lines 42-45.
Each prior art voice analyzer generally relies upon one or more highly specific frequency or time characteristics, or a combination thereof, in order to derive the emotional state of the speaker. None of the references provides flexibility in the frequency or time domain qualities that are analyzed. Jones allows a variation in the weighting of the seven sound style elements, but does not permit variation of the elements themselves. Further, all the known prior art characterizations of speech rely upon a priori knowledge of speech patterns, such as knowledge of vibrato content, properties of speech within the first formant, amplitude decay properties, staccato-sustain and attack-soft. The prior art does not contemplate allowing a flexible variation of the disclosed specific time and frequency qualities, even though such a variation may enable a speech-based assessment to correlate strongly with traditional psychological assessments, such as the Myers-Briggs test and the MMPI. Such flexibility is highly desirable given that the psychological profile of an individual is already difficult to quantify. Further, it is desirable to provide a speech analysis system that can also be easily adapted to assessing physiological traits of an individual.