This invention relates to a method and apparatus for exploring the physical characteristics of voiced sounds, and more particularly to improvements in measuring the power distribution in the harmonics of voiced sound signals for spectrum analysis in real time.
There has been a growing interest in exploring the physical characteristics of voiced sounds for such purposes as machine synthesis of speech, machine recognition of speech for identification of an individual, and machine recognition of speech for operation of a typewriter that would thus take spoken dictation. The last of these purposes requires speech analysis in real time, and all of them would benefit from a method of analysis which permits speech recognition in real time.
Prior art techniques have not utilized the harmonic composition of speech as a recognition parameter. It is known that voiced sound may be described in terms of fundamental frequency, harmonic structure, phase and intensity. The pitch of the sound is due to the fundamental frequency, and the quality (timbre) is due to the harmonic structure.
In producing a voiced sound, the vocal cords produce small puffs of air, the repetition rate of which establishes the fundamental frequency. That rate depends primarily upon the mass, length and elasticity of folds in the vocal cords of the individual. Consequently, the pitch of a speaker normally lies in the range from about 80 Hz for men to about 350 Hz for women, although any increase of pressure in the air, as while speaking under tension, or with emphasis or intonation, will increase the fundamental frequency. The converse will, of course, produce the opposite effect, i.e., extreme relaxation while speaking will decrease the pressure of the air and thereby decrease the pitch.
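By way of illustration only, the relationship between the repetition rate of the puffs and the fundamental frequency can be sketched in Python. The sketch assumes a sampled signal and estimates the fundamental by searching for the lag at which the signal best matches a delayed copy of itself (autocorrelation), restricting the search to the 80 Hz to 350 Hz range mentioned above. The function names, the 8000 Hz sample rate and the synthetic test signal with 1/k harmonic amplitudes are assumptions made for this example and are not taken from the disclosure.

```python
import math


def synth_voiced(f0, sample_rate, duration, n_harmonics=5):
    """Hypothetical test signal: harmonics of f0 with amplitudes 1/k."""
    n = int(sample_rate * duration)
    return [
        sum(math.sin(2.0 * math.pi * k * f0 * i / sample_rate) / k
            for k in range(1, n_harmonics + 1))
        for i in range(n)
    ]


def estimate_f0(signal, sample_rate, f_min=80.0, f_max=350.0):
    """Estimate the fundamental as sample_rate / (lag of peak autocorrelation).

    Only lags corresponding to pitches between f_min and f_max are searched.
    """
    lag_min = int(sample_rate / f_max)
    lag_max = int(sample_rate / f_min)
    best_lag, best_score = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        # Unnormalized autocorrelation at this lag.
        score = sum(signal[i] * signal[i + lag]
                    for i in range(len(signal) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


if __name__ == "__main__":
    test_signal = synth_voiced(100.0, 8000, 0.1)
    print(estimate_f0(test_signal, 8000))
```

The resolution of such an estimate is limited to whole-sample lags, so a practical tracker would likely interpolate around the peak; that refinement is omitted here.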
Accompanying the fundamental frequency of voiced sound is a complex of simple harmonics which are modulated in intensity and phase by cavities controlled by the speaker. These cavities function as controlled resonators for the harmonics. Modulating the relative amplitude of the harmonic components will produce the different sounds of vowels and consonants. Significantly more power is contained in the sounds of vowels, so that voice recognition will depend largely on the sounds of vowels, although the sounds of consonants are not to be discounted altogether in the speech analysis.
Recognizing that the characteristics of voiced sounds are contained in the modulations of harmonics, the principal method of exploring the characteristics of voiced sounds is power spectrum analysis to determine the power and phase of the harmonic components. One could use a bank of filters, one filter for each harmonic, to isolate the harmonic components and measure the power of each. However, the fundamental frequency will vary significantly from one speaker to the next, and may vary from one moment to the next for an individual speaker. It is therefore sometimes necessary to record the speech sounds and employ repetitive filtering techniques with different banks of filters to determine the harmonic composition with accuracy. Consequently, speech recognition in real time with a high degree of accuracy is not possible with prior art filtering techniques.
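The difficulty with a fixed bank of filters can be illustrated with a minimal Python sketch. Each "filter" here is idealized as a correlation of the signal with a cosine and a sine at one frequency, which measures the power of a single sinusoidal component. The test signal, its 125 Hz fundamental, the harmonic amplitudes and the 100 Hz tuning of the fixed bank are all hypothetical values chosen for the example. Real filters have finite bandwidth and would pass some attenuated power rather than none, so the sketch exaggerates the effect.

```python
import math


def harmonic_power(signal, sample_rate, freq):
    """Mean power of the sinusoidal component of `signal` at `freq`."""
    n = len(signal)
    w = 2.0 * math.pi * freq / sample_rate
    c = sum(x * math.cos(w * i) for i, x in enumerate(signal))
    s = sum(x * math.sin(w * i) for i, x in enumerate(signal))
    amplitude = 2.0 * math.hypot(c, s) / n
    return amplitude ** 2 / 2.0


def harmonic_powers(signal, sample_rate, f0, n_harmonics):
    """Power at each of the first n_harmonics multiples of f0."""
    return [harmonic_power(signal, sample_rate, k * f0)
            for k in range(1, n_harmonics + 1)]


# Hypothetical speaker: 125 Hz fundamental, three harmonics.
SAMPLE_RATE = 8000
TRUE_F0 = 125.0
AMPLITUDES = [1.0, 0.5, 0.25]
signal = [
    sum(a * math.sin(2.0 * math.pi * (k + 1) * TRUE_F0 * i / SAMPLE_RATE)
        for k, a in enumerate(AMPLITUDES))
    for i in range(1600)  # 0.2 s, a whole number of pitch periods
]

# Bank tuned to the speaker's actual fundamental.
tracked = harmonic_powers(signal, SAMPLE_RATE, TRUE_F0, 3)
# Fixed bank tuned to an assumed 100 Hz fundamental.
fixed = harmonic_powers(signal, SAMPLE_RATE, 100.0, 3)

if __name__ == "__main__":
    print("tracked:", tracked)
    print("fixed:  ", fixed)
```

In this idealized case the bank tuned to the actual fundamental recovers the power of each harmonic (half the square of its amplitude), while the bank tuned to 100 Hz misses the harmonics almost entirely.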
An additional parameter useful in speech recognition is the phase of the harmonic components. Such a parameter has not heretofore been used, particularly in real time analysis. It would be desirable to track the harmonics of a voiced sound signal in order to continually measure not only the power but also the phase of the harmonics. Such phase data may aid in making more positive voice identification.
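As a minimal sketch of what measuring both quantities for one harmonic might look like, the following Python function correlates the signal with a complex exponential at the harmonic frequency and reads the amplitude and phase from the result. The phase is taken relative to a cosine starting at the beginning of the analysis window, which is a convention assumed for this example. The function name, sample rate, fundamental, amplitudes and phases are hypothetical, and the sketch presumes the fundamental is already known and the window spans a whole number of pitch periods.

```python
import cmath
import math


def harmonic_amplitude_and_phase(signal, sample_rate, freq):
    """Return (amplitude, phase in radians) of the component at `freq`.

    Phase is measured relative to cos(2*pi*freq*t) with t = 0 at the
    first sample of the window.
    """
    n = len(signal)
    w = 2.0 * math.pi * freq / sample_rate
    z = sum(x * cmath.exp(-1j * w * i) for i, x in enumerate(signal))
    return 2.0 * abs(z) / n, cmath.phase(z)


# Hypothetical voiced signal: two harmonics with known phases.
SAMPLE_RATE = 8000
F0 = 125.0
COMPONENTS = [(1.0, 0.3), (0.5, -1.0)]  # (amplitude, phase) per harmonic
signal = [
    sum(a * math.cos(2.0 * math.pi * (k + 1) * F0 * i / SAMPLE_RATE + phi)
        for k, (a, phi) in enumerate(COMPONENTS))
    for i in range(1600)  # 0.2 s, a whole number of pitch periods
]

measured = [
    harmonic_amplitude_and_phase(signal, SAMPLE_RATE, (k + 1) * F0)
    for k in range(len(COMPONENTS))
]

if __name__ == "__main__":
    for k, (amp, phase) in enumerate(measured, start=1):
        print(f"harmonic {k}: amplitude={amp:.4f}, phase={phase:.4f} rad")
```

The power of each harmonic follows from the amplitude as half its square, so a single correlation of this kind yields both parameters at once.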