Voiced human speech begins when air from the lungs passes between the vocal cords (the glottis), causing them to vibrate. This vibration interrupts the air flow at quasi-periodic intervals, and the repetition rate is perceived as pitch. The vocal cord wave is spectrally broad, mainly because the abrupt closure in every cycle produces a discontinuity in the volume velocity waveform. Heard in isolation, if that were possible, this wave would sound like a raspy buzz, rich in harmonics, like a trumpet mouthpiece blown without the trumpet. The vocal tract, from glottis to lips and from soft palate to nose (the nasal branch), is a complex resonant cavity that acts as a linear filter with multiple poles and zeros (1). Different vocal cord harmonics are selectively amplified or attenuated depending on the position of the articulators (tongue, lips, jaw, etc.). This results in an array of amplitude peaks called “formants” seen in the frequency domain. They are usually labeled F1, F2, F3, F4 . . . in order of increasing frequency. In continuous speech, the migration of these resonances to different frequencies encodes the linguistic information on the vocal cord wave. These formants must be both narrow enough to be well defined and loud enough to be perceived, i.e. the formant resonances must be of high Q. Table I shows the location of the first three formant peaks for the average adult male, along with the approximate location of the main point of constriction along the tract that is responsible for the vowel.
TABLE I (2)

Vowel                 F1 (Hz)   F2 (Hz)   F3 (Hz)   Main constriction
uniform tube, 17 cm     500      1500      2500     none
ee                      270      2290      3010     front
eh                      530      1840      2480     front
ah                      730      1090      2440     mid
ae                      660      1720      2410     mid
oo                      300       870      2240     back
er                      490      1350      1690     back
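The uniform-tube row of Table I can be checked with the standard quarter-wavelength resonator formula for a tube closed at one end (the glottis) and open at the other (the lips), F_n = (2n - 1)c / (4L). The following is a minimal sketch, not part of the original disclosure; the function name and the speed of sound of 340 m/s are illustrative assumptions chosen to agree with the tabulated values.

```python
def uniform_tube_resonances(length_m, n_formants=3, speed_of_sound=340.0):
    """Resonances (Hz) of an idealized lossless tube, closed at one end
    and open at the other: F_n = (2n - 1) * c / (4 * L).

    speed_of_sound is an assumed nominal value in m/s, not a figure
    taken from the text.
    """
    if length_m <= 0:
        raise ValueError("tube length must be positive")
    return [
        (2 * n - 1) * speed_of_sound / (4.0 * length_m)
        for n in range(1, n_formants + 1)
    ]


# 17 cm uniform tube, as in the first row of Table I
formants = uniform_tube_resonances(0.17)
print([round(f) for f in formants])  # prints [500, 1500, 2500]
```

The idealized model ignores wall losses and end corrections, so it predicts only where the resonances fall, not how narrow they are.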
The term “Q” stands for quality factor, a dimensionless figure equal to the resonance frequency divided by its bandwidth (Q = f/bw), so a narrow bandwidth is equivalent to a high Q. In speech, higher Q is also associated with greater formant amplitudes, which contribute to the perception of the formants. The formant bandwidths of the human vocal tract are astonishingly narrow (3), corresponding to Q values in excess of 40.
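To give a sense of scale, the relation Q = f/bw can be inverted to find the bandwidth implied by a given Q. The sketch below is illustrative only: the helper name is hypothetical, the formant frequencies are those of the uniform tube in Table I, and Q = 40 is used as the threshold figure quoted above rather than a measured value for any particular formant.

```python
def bandwidth_hz(center_hz, q):
    """Bandwidth implied by Q = f / bw, rearranged as bw = f / Q."""
    if q <= 0:
        raise ValueError("Q must be positive")
    return center_hz / q


# Uniform-tube formants from Table I, evaluated at the assumed Q of 40
uniform_tube = {"F1": 500.0, "F2": 1500.0, "F3": 2500.0}
bandwidths = {name: bandwidth_hz(f, 40.0) for name, f in uniform_tube.items()}

for name, bw in bandwidths.items():
    print(f"{name}: {bw} Hz")
# prints:
# F1: 12.5 Hz
# F2: 37.5 Hz
# F3: 62.5 Hz
```

At Q = 40, a 500 Hz resonance is only 12.5 Hz wide, which is the kind of narrowness the simulated tract wall has to support.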
In order to construct a device that will produce continuous speech, a material had to be found that produces high Q resonances when used as the wall of the simulated vocal tract. In addition, the material must be easily and quickly deformable so that it can transition smoothly through the needed cross sectional area profiles.
There have been numerous prior attempts both to simulate the natural human voice and to provide assisted speech for people with damaged vocal cords.
An early attempt at a speech simulator was the Von Kempelen Speaking Machine referenced in “Speech Analysis, Synthesis and Perception”, published 1983 by Springer-Verlag, pages 205-206, FIG. 10, which taught the provision of a reed vibrated by air from a bellows and a manually manipulated resonating tube made from leather. As disclosed on page 207 of the above reference, Reisz also teaches soft rubber for parts of an artificial mouth and pharynx of a speech simulator.
More recent prior art, “Speech Production by a Mechanical Model: Construction of a Vocal Tract and its Control by Neural Network” by Higashimoto et al, Faculty of Engineering, Kanagawa University, teaches, on page 2, construction of a vocal tract and cord from silicone rubber molded with the softness of human skin.
However, when we tried various non-porous, flexible materials in multiple attempts to duplicate simple resonances (uniform tube and cross section for vowel /a/), we found that the resonances observable on a spectrograph had bandwidths too wide to be perceived aurally. Also, some formants were misplaced in frequency and others were nonexistent. Vowel recognition for the static /a/ model was nil. Rubber and rubber-like materials, ranging from soft latex to semi-rigid tire-like rubber, were tried with unacceptable results. The same test equipment produced perfectly acceptable formants when a rigid material such as cast stone or hard plastic pipe was used. The question therefore arose as to how humans, with softly compressible, flexible tongues and cheeks, are able to produce such wonderfully defined formants with high Q.
We then began a survey, using a surface acoustic resonator, to find a material that was both soft and flexible, yet possessed acoustic properties similar to those of glass, stone, metal, or hard plastic. All these materials have very high acoustic reflectivities (4). Plastics and rubbers, even when only minimally deformable, were found to be too soft (acoustically absorbent) to make a flexible high-Q resonator directly.
The acoustic reflectivity of liquid water is almost total, and we realized, and then verified experimentally, that the liquid water contained within human and animal flesh is what makes narrow formant bandwidths possible.
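The near-total reflectivity of water for airborne sound can be illustrated with the textbook plane-wave reflection coefficient at normal incidence, R = (Z2 - Z1)/(Z2 + Z1), where Z is the characteristic acoustic impedance (density times sound speed). The sketch below is not taken from the original disclosure; the function names and the density and sound-speed figures are nominal assumed values, and real tissue and oblique incidence would give somewhat different numbers.

```python
def characteristic_impedance(density_kg_m3, sound_speed_m_s):
    """Characteristic acoustic impedance Z = rho * c, in Pa*s/m (rayl)."""
    return density_kg_m3 * sound_speed_m_s


def energy_reflection(z_incident, z_boundary):
    """Fraction of incident sound energy reflected at a plane boundary,
    normal incidence: ((Z2 - Z1) / (Z2 + Z1)) ** 2."""
    pressure_ratio = (z_boundary - z_incident) / (z_boundary + z_incident)
    return pressure_ratio ** 2


# Nominal room-temperature values (assumptions, not from the text)
z_air = characteristic_impedance(1.2, 343.0)        # about 4.1e2 rayl
z_water = characteristic_impedance(1000.0, 1480.0)  # about 1.5e6 rayl

reflected = energy_reflection(z_air, z_water)
print(f"Fraction of energy reflected at an air/water boundary: {reflected:.4f}")
```

With these assumed values, roughly 99.9% of the incident energy is reflected, which is consistent with the statement that the reflectivity of liquid water is almost total.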