1. Field of the Invention
This invention relates to a method and apparatus that segment vocal sounds, such as human voice and speech utterances, reassemble digital representations of the segments into a specified set of style elements, and, following specified instructions, derive from these elements style dimension values of voice, speech, and behaviorally related perceptual processes for measurement and research of a subject's vocal/perceptual profile.
2. Description of the Prior Art
The underlying concepts of the present invention can best be understood by first realizing that an excited voice, for instance, naturally sounds different from a dull voice, and that a newscaster at work talks differently from a softball coach coaching, a poet reading poetry, or a mother talking expressively or angrily to her baby. It is natural that there are literally hundreds of rather universal forms or styles of voice and speech, used in human communication for rather universal tasks or situations.
A second concept that is necessary for one to understand the technical aspects of this invention is that there is something naturally different in our "frame-of-mind" or perceptual processes when we are involved in different tasks, such as instructing children in algebra as compared to demanding to be heard at a town meeting, trying to sell something, or trying to talk a reluctant mate into making love.
This invention considers building blocks of "frame-of-mind" to be "perceptual dimensions" and then allows the user to compare both trial and built-in or machine standard perceptual dimensions with either built-in machine standard or trial voice or speech style elements, so that the user might then obtain an individual's profile or discover new relationships. This invention will thereby allow psychologists, and speech and cognition researchers, to measure machine standard style elements and dimensions or assist in the standardization of new ones. The user can measure a vocal sound and compare it to those of people or animals in similar environments or doing similar tasks. The user can thereby determine probabilities that the perceptual profile of the speaker, or even an animal, is similar to that of specified groups having a similar vocal profile.
A book by the inventor, a professional engineering psychologist, attempts to lay the foundation for understanding the sub-division of cognitive or perceptual processes of awareness into a specified set of dimensions that are sensitive to a similarly specified set of vocal elements.
A description of the differences in the disclosed invention and the prior art requires first an overview of the functional categories of theory and devices. Vocal measurement apparatus has consistently described psychological relationships to voice and/or speech primarily in terms of a specific single variable rather than the multiple dimensions necessary to constitute a vocal/perceptual profile of the subject. The variables referred to in other patents are generally too vague to be meaningful to psychologists.
"Stress", "emotion" and "normal", in relation to either vibrato, pitch, nasality or one or two voice formants, can only be useful to determine a single event, such as lying, or to help a subject alter his volume, pitch or nasality for better speech effect, or to indicate "stress" or "emotion".
The fact that all people are always under some degree of numerous stresses and always expressing a very complex array of a combination of many emotions and also other cognitive, perceptual, or awareness processes not related directly to "emotion" is generally ignored. The usual orientation is as if there is a single "degree" of stress, and a single kind of "emotion" which, as if by toggle-switch, is either on or off.
Psychologists give batteries of pencil and paper tests, inkblot tests, block placement tests, logic tests, I.Q. tests, etc. (hundreds have been developed) in an attempt to arrive at a composite profile of a subject that can be described in terms of numerical values along a meaningful set of several major psychological dimensions. At least three such dimensions are usually necessary to derive a single profile, and about half a dozen are frequently used. The unobtrusive derivation of a useful awareness or perceptual profile from a vocal sample has not only been unavailable; it has not been suggested or contemplated in any related scientific literature or prior patent art, except in the present disclosure and the inventor's scientific papers, reports, and book on the subject.
Before broad relationships between voice and mental processes could occur, an appropriate perceptual processing profile description with theory had to be evolved relating to a specific set of vocal dimensions. Psychological processes significant to a profile must include normal, not just abnormal processes, and also include those relating to logic utility by an individual and their occupation, sensory and abstract awareness, and self-to-system parameters. None of these necessarily accent either stress or emotion, but rather social hierarchy, altruism, beliefs and loyalties, planning and general social interaction dynamics.
The awareness attributes of these dynamics must be reducible to specific perceptual or cognitive processing dimensions, such as value sensitivities, self-other ratio, sensory-internal ratio, attachment variables such as love-repulsion or independence-dependence, and career or task affinities such as perceptual emphasis on feelings versus logic, or either versus physical attributes. Such a collection of several nearly orthogonal dimensions can constitute a meaningful perceptual profile of an individual.
The idea that the human voice conveys these complex relationships as dimensions of awareness or perceptual processes is unconventional, not theorized scientifically, not generally contemplated, and not now available through the prior art nor described in any index of scientific literature.
The set of vocal dimensions that relate to perceptual processes of awareness sufficient in number and selection to constitute a profile can simultaneously provide speech therapists with a vocal profile. This is because the speech dimensions of interest to speech therapists tend to be those which are frequently abused by patients and tend to relate to psychological problems. The inventor's book, cited below, details both the vocal and awareness or perceptual process interrelationships made possible for the first time by his own key discoveries (see presentation to scientific society below) and his theory (thirty years in development). The seven vocal profile dimensions include two voice and five speech dimensions, namely: resonance, quality, variability-monotone, choppy-smooth, staccato-sustain, attack-soft, and affectivity-control.
These vocal dimensions which relate to perceptual dimensions are not directly accessible by machine the way pitch, nasality, formants, vibrato or volume are. The voice, speech and perceptual dimensions of the present disclosure require assembly from fourteen specific fundamental properties representative of the voice signal in the frequency domain, plus four arithmetic relationships among these, plus the average differences between several hundred consecutive such "time slices" in the time domain. Only by such a complex assembly (using a cooperative arithmetic and logic algorithm) of a specific set of machine disassembled vocal signal properties can normal speech be unobtrusively related to speech, voice and perceptual dimensions. Unlike much preceding art, which analyzes an obtrusive, elicited, specified vocal sound or a specific phrase, the analysis of continuous, normal speech requires great flexibility and complexity in order to ascertain pertinent style dimensions.
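The time-domain part of the assembly described above, the average differences between consecutive "time slices", can be illustrated with a minimal sketch. It assumes that each time slice is a list of element values and that "difference" means the absolute change of each element from one slice to the next; neither detail is stated in this section, and the function name and sample values are hypothetical.

```python
from typing import List, Sequence


def mean_slice_differences(slices: Sequence[Sequence[float]]) -> List[float]:
    """Average absolute change of each element between consecutive time slices.

    Assumption (not from the source): differences are taken as absolute
    values, element by element, and averaged over all consecutive pairs.
    """
    if len(slices) < 2:
        raise ValueError("need at least two time slices")
    width = len(slices[0])
    if any(len(s) != width for s in slices):
        raise ValueError("all time slices must have the same number of elements")

    totals = [0.0] * width
    for previous, current in zip(slices, slices[1:]):
        for i in range(width):
            totals[i] += abs(current[i] - previous[i])

    pairs = len(slices) - 1
    return [total / pairs for total in totals]


# Three hypothetical slices of two elements each; a real run would use
# several hundred slices with many more elements per slice.
example = [[1.0, 4.0], [3.0, 4.0], [2.0, 1.0]]
print(mean_slice_differences(example))  # [1.5, 1.5]
```

A large average for an element would indicate that it changes strongly from slice to slice, which is the kind of information a variability-type dimension would presumably draw on.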
All related prior art attempts to measure specified voice or speech features directly, such as pitch, loudness, formant positions, etc. in order to demonstrate stress, preferred speech, or vocal qualities, etc. The present invention segments the vocal utterances into six peaks in the frequency domain, none of which is pitch, and not all in recognized ranges of specific formants, and develops specific ratios for these. No prior art does this. While stress, emotion, pitch and formants are specifically of interest to prior art in this area, none are of specific interest to the present disclosed invention.
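As an illustration only, the idea of reducing one frequency-domain slice to six peaks and forming ratios among them can be sketched as follows. The peak selection rule (strongest local maxima) and the choice of ratios (each peak against its lower-frequency neighbor) are assumptions made for this sketch; this section does not say how the peaks are chosen or which ratios are used. Band amplitudes are assumed to be non-negative linear values.

```python
from typing import List, Tuple

Peak = Tuple[int, float]  # (band index, amplitude)


def find_peaks(bands: List[float], count: int = 6) -> List[Peak]:
    """Return up to `count` of the strongest local maxima, in band order."""
    peaks = []
    for i in range(1, len(bands) - 1):
        # A local maximum rises above its left neighbor and does not
        # fall below its right neighbor.
        if bands[i] > bands[i - 1] and bands[i] >= bands[i + 1]:
            peaks.append((i, bands[i]))
    strongest = sorted(peaks, key=lambda p: p[1], reverse=True)[:count]
    return sorted(strongest)  # back to low-to-high band order


def adjacent_ratios(peaks: List[Peak]) -> List[float]:
    """Amplitude of each peak divided by that of the next lower peak."""
    return [hi / lo for (_, lo), (_, hi) in zip(peaks, peaks[1:])]


# Hypothetical band amplitudes for a single time slice.
slice_bands = [1.0, 5.0, 2.0, 8.0, 3.0, 4.0, 4.0, 1.0, 6.0, 2.0, 7.0, 1.0, 9.0, 0.0]
peaks = find_peaks(slice_bands)
print(peaks)
print(adjacent_ratios(peaks))
```

Note that nothing in this sketch looks for pitch or for named formants; it treats every local maximum in the band amplitudes as a candidate, which is the contrast with the prior art drawn above.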
Speech style dimensions are assembled from disassembled vocal elements to produce two voice and five speech dimensions, namely: "resonance" and "quality", "variability-monotone", "choppy-smooth", "staccato-sustain", "attack-soft", "affectivity-control".
In U.S. Pat. No. 4,335,276 to Bull, et al., the nasalization measurement apparatus has the same four major sections as does the present invention: (1) analog pre-processing using filters, (2) analog to digital conversion, (3) a microcomputer with controlling logic, (4) display with control logic and key-pad or keyboard for operator control.
However, Bull uses two inputs to two filters. One of the inputs is from an accelerometer mounted on the external nasal wall of the subject. The present disclosure is not concerned with nasal resonance and thus does not use a second input mounted on the speaker.
There are many kinds and degrees of resonance. The disclosed invention measures two voice quality parameters, one of which is labeled resonance, but it is not nasal resonance, has nothing to do with "nasality", and is not measured using any of the Bull apparatus, method or vocal aspects. The present disclosed invention does not attempt to measure or analyze nasal resonance disorders. Instead it unobtrusively measures the resonant properties of the voice that are associated with natural, social hierarchy elements of a normal subject's perceptual profile.
The present invention, unlike prior art and certainly unlike Bull's, is an unobtrusive measurement tool, a method for measuring perceptual or awareness processing without disturbing the subject. Disturbing the subject would otherwise distort the results, unless one is dealing strictly with some physical attribute. Even recorded speech samples can be used with the disclosed invention, and the presence of the participant is not necessary.
The present disclosure uses 1/3 octave filters covering the full audio range, unlike Bull's. The present disclosure provides twenty elements of the speech signal, multiple times per second, rather than one or two features, plus a composite psychological profile not even vaguely attempted by any other prior art.
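For readers unfamiliar with 1/3 octave filter banks, the following sketch lists band center frequencies and band edges across the audio range. It relies only on the general definition of a one-third octave band (adjacent centers differ by a factor of 2^(1/3), and each band extends a factor of 2^(1/6) to either side of its center). The 1000 Hz reference and the index range are assumptions chosen to span roughly 20 Hz to 20 kHz; the number of bands in the disclosed apparatus is not stated in this section.

```python
from typing import List, Tuple

Band = Tuple[float, float, float]  # (lower edge, center, upper edge) in Hz


def third_octave_bands(first_index: int = -17, last_index: int = 13,
                       reference_hz: float = 1000.0) -> List[Band]:
    """Exact (base-2) one-third octave bands around a reference frequency.

    The default index range is an assumption that spans roughly the
    audio range, from about 19.7 Hz to about 20.2 kHz.
    """
    half_band = 2.0 ** (1.0 / 6.0)  # center-to-edge frequency ratio
    bands = []
    for n in range(first_index, last_index + 1):
        center = reference_hz * 2.0 ** (n / 3.0)
        bands.append((center / half_band, center, center * half_band))
    return bands


for lower, center, upper in third_octave_bands():
    print(f"{lower:9.1f} {center:9.1f} {upper:9.1f}")
```

With these defaults the sketch yields 31 contiguous bands, so a filter bank of this kind reports the whole amplitude-frequency distribution rather than one or two selected regions.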
In U.S. Pat. No. 4,063,035 to Appellman, et al., the first two formants are converted to a single display point on a screen. While this, like Bull's invention, is useful and novel, it bears no relation to the present disclosure. Appellman uses banks of 1/3 octave filters, as does the present disclosure, and as has laboratory equipment for voice analysis dating back many years. However, Appellman uses only peaks from two regions, where prior literature has established that the first two formants are supposed to reside. My research shows that peaks not normally thought of as formants carry significant information. Appellman states " . . . there is a false frequency peak in the second formant region that may be of greater amplitude than the described formant". He uses considerable effort to pick the "right" two peaks.
By comparison, the present disclosure uses six peaks including the "false" peak, plus specific ratios, for a total of twenty different building block elements. Unlike prior art, all of the elements are accessible for manipulation by the operator. The machine can also be switched to automatic, so that built-in logic operates the algorithms to ultimately produce two voice quality and five speech dimensions. These then are utilized by algorithms to produce seven perceptual dimensions making up the final vocal/perceptual profile of the subject. In the present disclosure, all these values are displayed, rather than a single dot indicating the first two formants. Appellman also displays the full spectrum as a bar graph on an oscilloscope. This type of display has been in practice for years, as seen in frequency spectrum analyzers and sound balancing equipment operated by the sound controllers for bands and in recording studios.
The on-board controlling firmware for most if not all commercial 1/3 octave filter banks for use with small computers, including the one used in the present disclosure, displays these bar graphs upon user request. However, this is incidental and not described herein.
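The last stage of the assembly described above, in which the seven vocal dimensions are used to produce perceptual dimensions, can be pictured with a small sketch. It assumes that each perceptual value is a weighted sum of the vocal dimension values, with one row of weights per perceptual dimension; this linear form is an assumption made for illustration, and the weights, sample values and output names below are hypothetical placeholders rather than values from the disclosure.

```python
from typing import Dict, List

VOCAL_DIMENSIONS = [
    "resonance", "quality", "variability-monotone", "choppy-smooth",
    "staccato-sustain", "attack-soft", "affectivity-control",
]


def apply_coefficients(vocal: List[float],
                       coefficients: Dict[str, List[float]]) -> Dict[str, float]:
    """Weighted sum of the vocal dimension values for each named output.

    Assumption (not from the source): the mapping is linear, with one
    weight per vocal dimension in every row.
    """
    if len(vocal) != len(VOCAL_DIMENSIONS):
        raise ValueError("expected one value per vocal dimension")
    result = {}
    for name, row in coefficients.items():
        if len(row) != len(vocal):
            raise ValueError(f"row {name!r} needs {len(vocal)} weights")
        result[name] = sum(w * v for w, v in zip(row, vocal))
    return result


# Hypothetical vocal profile and a two-row placeholder coefficient array.
vocal_profile = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
placeholder_array = {
    "perceptual_a": [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    "perceptual_b": [0.5, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0],
}
print(apply_coefficients(vocal_profile, placeholder_array))
# {'perceptual_a': 1.0, 'perceptual_b': 1.5}
```

Keeping the weights in a separate table means a built-in table and an operator-supplied table can be run through the same routine, which matches the idea of operating either in automatic mode or with operator-chosen values.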
Several additional patents relate to speech. One is an apparatus used in teaching speech to the vocally handicapped, including the deaf, by providing information as to when loudness or frequency range limits are exceeded or are on target (U.S. Pat. No. 3,760,108). However, this prior art specifically limits the sound spectrum of interest to the fundamental frequency and cannot perform the function of assembling vocal utterance elements into voice, speech and perceptual style dimension values for measurement, comparisons, and research. Another patent teaches the measurement of pitch perturbations to determine an individual's emotional state (U.S. Pat. No. 4,142,067). However, this prior art specifically does not concern itself with most of the speech spectrum, measuring instead only the first formant region. The presence or absence of emotion is then determined. However, emotion of some degree is always present in people, including stress, and variations in pitch can indicate expressiveness not associated with stress. Also, this prior art does not address speech or cognitive style components.
Another patent teaches the measurement of the presence or absence of a low frequency vocal component as it relates to physiological stress (U.S. Pat. No. 3,971,034). However, this prior art is not concerned with most of the speech spectrum, and must be calibrated to each individual, meaning that the stress level obtained cannot be compared to a population mean or standard and does not involve normal perceptual style dimensions. Each of these inventions measures one or two specific vocal parameters and then indicates the presence or absence of these. The assumption is made that stress, lying, or proper speaking is, or is not, being exhibited by the user. These inventions do not measure the entire amplitude frequency distribution, determine speech or vocal style elements and dimensions, or relate these to perceptual style dimensions through both a built-in and a user supplied coefficient array.
Another speech analyzer reads lip and face movements, air velocities and acoustical sounds which are compared and digitally stored and processed (U.S. Pat. No. 3,383,466). A disadvantage is that the sound characteristics are not disassembled and reassembled into speech style elements or dimensions nor related to perceptual style dimensions. There is a great deal of art relating to speech recognition devices wherein a voice's digital representation is compared to a battery of previously stored ones. Some of these use filters, others use analytic techniques, but none relate normal and typical voice and speech styles to normal and typical perceptual or cognitive style dimensions.
Another technique for analyzing voice involves determining the emotional state of the subject as disclosed in Fuller, U.S. Pat. Nos. 3,855,416; 3,855,417; and 3,855,418. These analyze the amplitude of the speech, voice vibrato, and relationships between harmonic overtones of higher frequencies. However, these inventions are not concerned with natural and typical voice and speech style elements and dimensions and typical perceptual style dimensions, and are limited to stress measurement and the presence or absence of specific emotional states.
The presence of specific emotional content such as fear, stress, or anxiety, or the probability of lying on specific words, is not of interest to the invention disclosed herein. The invention disclosed herein also is not calibrated to a specific individual, such as is typical of the prior art, but rather measures all speakers against one standard because of the inventor's scientific discovery that there exist universal standards of style.
The user can evaluate the similarity of the various vocal style dimensions of his or her own voice (in biofeedback mode) or a client's voice (in a therapy setting) to those of target groups such as recording and entertainment stars, successful and unsuccessful people, psychologically dysfunctional people (or a variety of different dysfunctions), self-actualizing people, etc. Any and all naturally occurring groupings of people, occupationally or cognitively, can be assumed to have one or more specific and predictable vocal style components with ranges characteristic of that specific category of people, according to the following citations.
Jones, J. M., 98th Meeting: Acoustical Society of America, Fall 1979; Jones, J. M., Differences in the Amplitude-Frequency Distribution of Vocal Energy Among Ph.D. managers, Engineers, and Enlisted Military Personnel, Masters Thesis UWF 1979; Voice Style, Perceptual Style and Process Psychology, book in Press 1982; and, Jones, J., Vocal Differences Between Members of Two Occupations: An Example of Potential Vocal/Mental Relationships That May Affect Voice Measurement of Pilot Mental Workload, AD-TR-80-57, July 1980.