The present invention relates to speech analyzing, synthesizing and coding.
The analysis, synthesis and coding of human speech encounter major difficulties resulting from the high complexity of the frequency spectrum of the produced sounds, the spectral closeness of similar phonemes, the number of different phonemes used within a single language and, a fortiori, across different languages and dialects, and above all the many ways in which sounds are actually formed as a function of the preceding or following sounds (co-utterance phenomena). It is therefore extremely difficult either (i) to identify a train of phonemes generated at a high rate in order to reconstitute the words that were spoken or (ii) to synthesize trains of sounds and words that will be effectively identified, together with their meaning, by those who hear them.
A well-known process for speech synthesis consists in using a device simulating the behaviour of an acoustic tube having a variable cross sectional area, representing the vocal tract through which human speech is produced. The vocal tract, starting with the vocal cords (which act as an excitation source at the upstream extremity of the tube), extends from the larynx to the lips, through the pharynx and the buccal cavity. The vocal tract forms a conduit having a variable cross sectional area over the length of the conduit. The cross sectional area of the vocal tract varies over a large range: it is approximately 2 cm² in the larynx, from 3 to 7 cm² in the pharynx, from 0 to 15 cm² in the buccal cavity, 0 cm² at the lips if they are closed, etc.
This vocal tract can be represented as an acoustic tube constituted by a series of individual portions having a constant length, the cross sectional area of each of which has a determined value at rest. The works of G. FANT, Acoustic Theory of Speech Production, 1960, Mouton and Co, Gravenhage, Netherlands, and J.L. FLANAGAN, Speech Analysis Synthesis and Perception, 1972, Springer-Verlag, New York, refer to this type of representation, wherein the vocal tract is divided into successive portions of about one centimeter in length, the cross sectional areas of which can be classified. The sound production can be expressed as a function of the cross sectional areas of the individual sections. It is possible to produce sounds recognizable as human speech phonemes by using a train of acoustic tube portions provided with an air flow source at the input, this source exhibiting characteristics similar to those of human vocal cords, and by causing the cross sectional areas of the various portions to vary.
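As an illustration of how sound production can be tied to the cross sectional areas of the individual sections, the following minimal sketch computes the reflection coefficient at each junction between adjacent tube portions, as is commonly done in lossless-tube models of the vocal tract. This formulation is not taken from the present text; the function name, the sign convention (positive when the tube widens toward the lips) and the requirement of strictly positive areas are assumptions made for the example.

```python
def reflection_coefficients(areas_cm2):
    """Return one reflection coefficient per junction between tube portions.

    areas_cm2 lists the cross sectional areas from the glottis end to the
    lip end. Assumed convention: r = (A_next - A) / (A_next + A), so r > 0
    where the tube widens and r < 0 where it narrows.
    """
    if any(a <= 0 for a in areas_cm2):
        # A fully closed portion (area 0) is outside the scope of this sketch.
        raise ValueError("all areas must be strictly positive")
    return [
        (a_next - a) / (a_next + a)
        for a, a_next in zip(areas_cm2, areas_cm2[1:])
    ]


# A uniform tube has no discontinuity, hence no reflection at any junction.
uniform = reflection_coefficients([2.0, 2.0, 2.0])

# A widening from 1 cm^2 to 3 cm^2 gives a single positive coefficient.
widening = reflection_coefficients([1.0, 3.0])
```

Varying the list of areas over time changes the coefficients, which is one way a simulated tube can be made to follow an articulation.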
With the advent of modern computer signal processing techniques, it is not necessary to construct a physical acoustic tube with mechanically variable cross sectional areas. Instead, the air source and the vocal tract can be simulated using either analog electric circuits or a digital computer, in which one is able to vary parameters representing, in particular, the tube cross sectional areas, the overall tube length, and the air flow spectrum from the source.
At the output, the computer supplies a loudspeaker (for speech synthesis) with an electric signal, the spectrum and spectrum variations of which reproduce as faithfully as possible the spectrum and spectrum variations of the sound or sound train it is desired to generate. For speech analysis, a microphone receives the acoustic message and converts it into electric signals, which are received and processed by the computer, for example after analog/digital conversion. The analysis result can be used directly in a speech recognition mode or can be coded and transmitted for speech reconstitution. Coding can be of a scalar or vectorial type.
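The difference between scalar and vectorial coding can be sketched as follows. In scalar coding each analysis parameter is quantized on its own; in vectorial coding the whole parameter set is replaced by the index of the closest entry in a codebook. The function names, the step size, the parameter values and the three-entry codebook below are all hypothetical and serve only to illustrate the two coding types.

```python
def scalar_quantize(values, step):
    """Code each parameter independently as an integer multiple of `step`."""
    return [round(v / step) for v in values]


def vector_quantize(vector, codebook):
    """Code the whole parameter vector as the index of the nearest codebook entry."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(
        range(len(codebook)),
        key=lambda i: squared_distance(vector, codebook[i]),
    )


# Hypothetical analysis result: three parameters expressed in Hz.
frame = (510.0, 1490.0, 2520.0)

# Scalar coding: one code per parameter, here with an assumed 100 Hz step.
scalar_codes = scalar_quantize(frame, step=100.0)

# Vectorial coding: a single index into an assumed, illustrative codebook.
codebook = [
    (300.0, 2300.0, 3000.0),
    (500.0, 1500.0, 2500.0),
    (700.0, 1200.0, 2600.0),
]
vector_code = vector_quantize(frame, codebook)
```

Scalar coding transmits one code per parameter, whereas vectorial coding transmits a single index, at the cost of storing the codebook at both ends.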
Although the principle of vocal tract simulation by means of a series of acoustic tube portions, each having a variable cross sectional area, is known, it has never been implemented in a way that satisfactorily permits analysis or synthesis of continuous speech. Most often, attempts are limited, for example, to vowels or consonant/vowel sets; it has thus far not been possible to synthesize or identify trains of sounds such as those produced by human speech.
This is because automatic control from text is difficult and not well understood. The vocal tract acoustic tube has to take a large number of parameters into account: there are many tube portions, the cross sectional area of each portion can vary widely (when articulating "a" or "o" it is clearly seen that the air flow volume between the lips varies) and, if one calls "surface function" the curve of the cross sectional area values along the successive tube portions, there is no direct relationship between the surface functions of the acoustic tube and the sounds produced.
On the other hand, the sound spectra generated by human speech are characterized by "formants" (which are successive maxima present in the spectrum: first formant for the lowest resonance frequency, second formant, third formant, etc.). Those formants represent the resonances of the vocal tract, i.e., resonances which modulate the spectrum of the sound source (vocal cords), resulting in a modulated spectrum at the vocal tract output. Vowels, for example, are characterized by constant values of the formant frequencies (that is, the frequency values at which the spectrum has a maximum amplitude). Consonants are characterized by relative variations in the formant frequencies.
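Since formants are defined above as successive maxima of the spectrum, a simple way to picture their extraction is to pick the local maxima of a sampled magnitude spectrum in order of increasing frequency. The sketch below does only that; the function name and the synthetic three-peak spectrum are assumptions made for the example, and practical analyzers generally smooth the spectrum first so that the harmonics of the source are not mistaken for formants.

```python
def formant_peaks(freqs_hz, magnitudes, max_formants=3):
    """Return the frequencies of the first local maxima of a magnitude spectrum."""
    peaks = []
    for i in range(1, len(magnitudes) - 1):
        # A local maximum rises from the left and does not rise to the right.
        if magnitudes[i - 1] < magnitudes[i] >= magnitudes[i + 1]:
            peaks.append(freqs_hz[i])
    return peaks[:max_formants]


def synthetic_spectrum(freqs_hz, centers_hz, width_hz=50.0):
    """Build an illustrative spectrum as a sum of bell-shaped resonance curves."""
    return [
        sum(1.0 / (1.0 + ((f - c) / width_hz) ** 2) for c in centers_hz)
        for f in freqs_hz
    ]


# Assumed example: resonances placed at 500, 1500 and 2500 Hz on a 50 Hz grid.
freqs = list(range(0, 4001, 50))
spectrum = synthetic_spectrum(freqs, centers_hz=(500, 1500, 2500))
first_three = formant_peaks(freqs, spectrum)
```

For a steady vowel the peak frequencies would stay constant from one analysis frame to the next, while for a consonant they would move.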
However, the combination of a train of syllables is difficult to express as a function of formant frequency variations because, for one element of the considered train, the formant frequencies depend upon the preceding and following sounds (co-uttering phenomenon).
It has been possible to realize speech synthesizers known as "formant synthesizers": they use (or simulate) resonant circuits, the resonant frequency of each of which can be individually controlled. By combining several resonance frequencies corresponding to the formant frequencies of a particular vowel, this vowel can be synthesized. By causing the circuit resonance frequencies to vary in the same way as the formant frequencies of a consonant, this consonant can be artificially reproduced.
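The principle of such a formant synthesizer can be sketched digitally with a cascade of second-order resonators, one per formant, driven by a periodic pulse train standing in for the vocal cords. The resonator recurrence used here is the common two-pole form with unity gain at zero frequency; the function names, the sampling rate, the pitch and the formant and bandwidth values are assumptions chosen for the example, not values given in the present text.

```python
import math


def resonator(signal, freq_hz, bandwidth_hz, sample_rate):
    """Filter `signal` through a two-pole resonator with unity gain at 0 Hz."""
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = (2.0 * math.exp(-math.pi * bandwidth_hz * t)
         * math.cos(2.0 * math.pi * freq_hz * t))
    a = 1.0 - b - c
    y1 = y2 = 0.0  # the two previous output samples
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def synthesize_vowel(formants, duration_s=0.2, pitch_hz=100.0, sample_rate=8000):
    """Pass a pulse train through one resonator per (frequency, bandwidth) pair."""
    n_samples = round(duration_s * sample_rate)
    period = round(sample_rate / pitch_hz)
    # Crude stand-in for the vocal cord source: one unit pulse per pitch period.
    signal = [1.0 if i % period == 0 else 0.0 for i in range(n_samples)]
    for freq_hz, bandwidth_hz in formants:
        signal = resonator(signal, freq_hz, bandwidth_hz, sample_rate)
    return signal


# Assumed formant set (Hz) and bandwidths (Hz) for a neutral, vowel-like sound.
samples = synthesize_vowel([(500.0, 60.0), (1500.0, 90.0), (2500.0, 120.0)])
```

Holding the resonator frequencies fixed gives a vowel-like sound; changing them from frame to frame is how a consonant transition would be imitated in this scheme.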
Generally, knowledge of the first three formants or of their variations as a function of time provides a good approximation for analyzing or synthesizing sounds. However, two formants could be sufficient for a simplified analysis or synthesis; conversely, four formants or even more could be included for a more sophisticated analysis or synthesis.
In the formant synthesis mode, one analyzes or reconstitutes signal spectra exhibiting amplitude maxima at determined frequencies. However, it is not known how to accurately analyze or reconstitute the whole spectrum and the spectrum variations which exactly determine the constitution of a given sound. The problem is even more complicated if, due to the co-uttering phenomenon between successive vowels and consonants, the spectra and spectrum variations of the signal are intermixed.