1. Field of the Invention
This invention relates to a speech processing system, and more particularly to such a system which makes use of the resonant modes of the human vocal tract associated with speech sounds, these being known as the formant frequencies.
2. Discussion of Prior Art
Formant frequencies usually appear as peaks in the short-term spectrum of speech signals. For many years it has been recognised that they are closely related to the phonetic significance of the associated speech sounds. This relationship means that there are many applications in automatic processing of speech signals for which an effective method of formant frequency measurement would be useful, such as:
(a) Formant vocoders, i.e. devices for coding speech for low-bit-rate transmission;
(b) Visual display of formant frequency variation with time, to aid the deaf in interpreting speech, or to assist in their speech training;
(c) Automatic authentication of identity from an individual's speech; and
(d) Speech signal analysis for input to an automatic speech recognition system.
The requirements of these applications could be met by determining the formant frequencies from a succession of spectral cross-sections taken at regular time intervals. It is also useful to determine the associated formant amplitudes, because the phonetic quality of speech sounds depends on both. For some sounds (vowels in particular) the relative formant amplitudes are determined largely by the pattern of formant frequencies. However, the relative amplitudes for most consonants will be very different from those typical of vowels, and even for vowels they will vary with vocal effort and from speaker to speaker.
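As an illustration of producing spectral cross-sections at regular time intervals, the following is a minimal sketch: frame the signal, apply a window, and take the magnitude spectrum of each frame. The frame length, step, window, and FFT size are illustrative assumptions, not values taken from this document.

```python
import numpy as np

def spectral_cross_sections(signal, fs, frame_ms=25.0, step_ms=10.0, nfft=512):
    """Short-term spectral cross-sections at regular time intervals (a sketch)."""
    frame = int(fs * frame_ms / 1000)
    step = int(fs * step_ms / 1000)
    win = np.hamming(frame)
    frames = []
    for start in range(0, len(signal) - frame + 1, step):
        seg = signal[start:start + frame] * win
        frames.append(np.abs(np.fft.rfft(seg, nfft)))
    return np.array(frames)  # shape: (num_frames, nfft // 2 + 1)

# Illustrative input: one second of a 1 kHz tone at an 8 kHz sample rate.
fs = 8000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 1000 * t)
S = spectral_cross_sections(x, fs)
peak_hz = S[0].argmax() * fs / 512  # strongest bin of the first cross-section
```

In a real analyser each row of `S` would be the cross-section examined for formant peaks at that time instant.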
Unfortunately, in spite of the usefulness of formant information, automatic formant-frequency measurement is notoriously difficult. The primary difficulty is that speech processing involves analysis of sounds of short duration to produce short-term spectral cross-sections, but the spectral peaks which define the formants are not necessarily clearly apparent in such cross-sections. The acoustic theory of speech production shows that under ideal conditions the human vocal tract has a series of resonant modes at an average frequency spacing of about 1 kHz, the actual frequencies of the resonances being determined by the precise positions of the jaw, tongue, lips and other articulators at any particular time. The fact that the formants are inherently associated with acoustic resonances of the human vocal system means that their frequencies will normally change smoothly with time as the articulatory organs move to produce different speech sounds.
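The roughly 1 kHz average spacing follows from idealising the vocal tract as a uniform tube, closed at the glottis and open at the lips, whose quarter-wavelength resonances are f_n = (2n - 1)c / 4L. A small illustration follows; the sound speed (350 m/s) and tube length (17.5 cm) are round illustrative values, not figures from this document.

```python
def tube_resonances(n_modes, c=350.0, length=0.175):
    """Quarter-wavelength resonances of a uniform tube closed at one end."""
    return [(2 * n - 1) * c / (4 * length) for n in range(1, n_modes + 1)]

freqs = tube_resonances(4)       # approximately 500, 1500, 2500, 3500 Hz
spacing = freqs[1] - freqs[0]    # approximately 1000 Hz average spacing
```

The real tract is not uniform, which is why the actual resonance frequencies move with the articulators while the average spacing stays near 1 kHz.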
The influence of the formant frequencies in determining the phonetic properties of speech almost entirely relates to only the lowest three of these resonances (usually referred to as F1, F2 and F3), and resonances above the third are of little importance. In fact resonances above F4 are often not detectable in speech signals because of bandwidth limitation. In the case of telephone bandwidth signals even F4 is often not present in the available signal.
There are many reasons why this elegant theory of speech production often does not yield a clear picture of the theoretical formants during real speech sounds. First, the theory treats only the response of the vocal tract, taking no account of the spectral properties of the sound sources which excite the tract. The main sound sources are air flow between the vibrating vocal folds, and turbulent noise caused by flow through a constriction in the vocal tract. Most of the time these sources have a spectral structure that is not likely to obscure the resonant pattern of the vocal tract response: their spectral trends as a function of frequency are either fairly flat (in the case of turbulent noise) or show a general decrease in intensity as frequency increases (in the case of flow between the vocal folds). However, in the latter case, particularly for some speakers, there will be occasions where the generally smooth spectral trend is disturbed at some frequencies, sometimes with minor spectral peaks, but more usually with pronounced dips in the spectrum. If such a dip coincides with a vocal tract resonance, the expected spectral peak of that formant may be almost completely obscured.
The second reason for the difficulty of identifying formant peaks, particularly during some consonant sounds, is that there can be a severe constriction of the vocal tract at some intermediate point, so that it is acoustically almost completely separated into two substantially independent sections. For these types of speech sound, the sound source is normally caused by air turbulence generated at the constriction. The sound radiated from the mouth in these circumstances is then influenced mainly by the resonant structure of the tract forward of the constriction, and the formants associated with the back cavity (notably F1) are so weakly excited that they are often not apparent at all in the radiated speech spectrum. In these cases F1 has no perceptual significance, but it is advantageous to associate other resonances with appropriate higher formant numbers from continuity considerations. The behaviour of formant frequencies as a function of time is described in terms of formant trajectories; each formant trajectory is a series of successive values of a respective individual formant frequency, such as F1, as a function of time. There is therefore a set of three formant trajectories for the formant frequencies F1, F2 and F3. Continuity considerations imply continuity of formant trajectories across vowel/consonant boundaries.
Turbulence-excited consonant sounds have a further difficulty for formant analysis because during these sounds the glottis (the space between the vocal folds, in the larynx) is open wide, so causing more damping of the formant resonances because of coupling into the sub-glottal system (the bronchi and lungs).
The third difficulty of formant analysis applies specifically to high-pitched speakers for which the frequency of vibration of the vocal folds may be fairly high, perhaps 400 Hz or even higher. This high frequency yields harmonics for which the spacing may be larger than the spectral bandwidth of the formant resonances. Thus a formant peak may lie between two harmonics and therefore not be obvious, and spectral peaks caused by harmonics may be mistaken for formants.
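The high-pitch effect can be made concrete with a toy calculation. The formant frequency (600 Hz) and fundamental frequencies below are illustrative assumptions, not values from this document: at a 400 Hz fundamental the nearest harmonic misses the formant by 200 Hz, far more than a typical formant bandwidth, whereas at 100 Hz a harmonic lands on the peak.

```python
def nearest_harmonic_distance(formant_hz, f0_hz):
    """Distance from a formant frequency to the nearest voicing harmonic."""
    n = max(round(formant_hz / f0_hz), 1)
    return abs(formant_hz - n * f0_hz)

# Hypothetical 600 Hz formant:
d_high_pitch = nearest_harmonic_distance(600, 400)  # harmonics at 400, 800 Hz
d_low_pitch = nearest_harmonic_distance(600, 100)   # harmonic exactly at 600 Hz
```

When this distance exceeds roughly half the formant bandwidth, no harmonic samples the resonance peak and the formant may be invisible in the cross-section.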
The fourth difficulty of formant analysis applies to nasalized sounds. The basic speech production theory does not apply to these sounds, because it is based on the response of an unbranched acoustic tube. In the presence of nasalization (either nasal consonants or nasalized vowels) the soft palate is lowered, and the nasal cavities become coupled with the vocal tract. The acoustic system then has a side branch, which introduces a complicated set of additional resonances and antiresonances into the response of the system. In these cases the simple description of a speech signal in terms of the three most important formants no longer strictly applies. However, some of the resonances of the vocal tract with nasal coupling are more prominent than others, and it is often possible to trace their temporal continuity into adjacent periods when nasalization is absent, so it can still be useful to describe nasal sounds in terms of F1, F2 and F3. Even so, the more complicated acoustic system usually causes the resonances to be less prominent than in non-nasal sounds, and it is thus often extremely difficult to decide, when looking at a spectral cross-section, what the formant frequencies should be.
Determination of the formant frequencies of speech sounds, particularly as features to use in automatic speech recognition, has been described by M. J. Hunt in "Delayed decisions in speech recognition--the case of formants", Pattern Recognition Letters 6, 1987, pp. 121-137. Here initial speech signal processing was by means of linear prediction analysis (LPA). A description of linear prediction techniques applied to speech signals is given, for example, in J. D. Markel and A. H. Gray, "Linear Prediction of Speech", Berlin, Springer, 1976.
Linear prediction is a technique which can model the human vocal tract as a linear filter with a small number of poles but no zeros in its transfer function. The poles can occur in complex conjugate pairs or they can be real. For the conjugate pairs, each such pair represents a resonator. If certain very idealised assumptions about the vocal tract and its excitation source are correct, it can be shown that these resonant poles correspond accurately to the formants of the vocal tract.
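The correspondence between a complex-conjugate pole pair and a resonance can be sketched with the standard conversion from a z-plane pole to a resonance frequency and 3 dB bandwidth. The sample rate and pole below are illustrative assumptions.

```python
import numpy as np

def pole_to_formant(pole, fs):
    """Convert a complex z-plane pole to a resonance frequency and bandwidth.

    freq comes from the pole angle; bandwidth from its distance inside
    the unit circle (the closer to the circle, the sharper the resonance).
    """
    freq = fs * np.angle(pole) / (2 * np.pi)
    bandwidth = -fs * np.log(np.abs(pole)) / np.pi
    return freq, bandwidth

fs = 8000
# A hypothetical pole placed to represent a resonance near 1 kHz:
pole = 0.95 * np.exp(2j * np.pi * 1000 / fs)
f, b = pole_to_formant(pole, fs)
```

A heavily damped resonance corresponds to a pole well inside the unit circle (large bandwidth), which is why such poles do not show up as clear spectral peaks.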
For those occasions when the formants of speech are well defined and well separated in frequency, LPA gives a reasonable description of the formant frequencies, at least for the lowest three formants. However, as discussed previously, some or all of the formants are often not well defined in the short-term spectrum of the signal. In these cases LPA will normally give at least one resonance corresponding to each clear peak in the spectrum, but some other poles (either heavily damped resonances or single real poles) will be set to improve the modelling of the general spectral shape. Pairs of formants that are fairly close in frequency will sometimes be modelled correctly by two resonators, but will often be modelled by only one, with the extra poles so released being used more effectively to model some other aspects of spectral shape. For sounds in which some resonances are of much lower intensity than others, LPA will rarely assign poles to the weaker resonances.
Where the speech power is extremely low at a formant frequency (e.g. F1 in a typical [s] sound), LPA would not assign a formant to the true lowest vocal tract resonance, so correct labelling would not be possible. Similar considerations apply to formant frequencies derived from a spectral cross-section obtained by other means, such as from a smoothed Fourier transform.
The problem of obtaining useful formant data is illustrated in the case of a formant vocoder, a typical application of formant frequencies. A vocoder is a system for coding speech signals for low-bit-rate transmission or storage; it depends on separating the general shape of the short-term spectrum of input sound from fine spectral detail, which is determined by the type of sound source exciting the speaker's vocal system at any given time. A number of different types of vocoder are known in the prior art, and they describe the short-term spectral shape in different ways; see for example J. L. Flanagan, "Speech Analysis Synthesis and Perception", Springer-Verlag, 1972. Vocal tract resonances mostly change smoothly with time, and they are dominant in determining the phonetic properties of speech signals. Thus the transmission parameters of formant vocoders offer the potential for good speech intelligibility at lower bit rates than other vocoders, such as channel vocoders and linear prediction vocoders. However, it has proved difficult to develop an acceptable formant vocoder in the absence of reliable formant data.
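The source/filter separation a formant vocoder's synthesiser relies on can be sketched with the common two-pole digital resonator building block: a pitch-rate impulse train (the source) is passed through a cascade of resonators (the formant filter). The pitch, formant frequencies, and bandwidths below are illustrative assumptions, not parameters from this document.

```python
import numpy as np

def resonator_coeffs(freq, bw, fs):
    """Two-pole digital resonator: denominator 1 - 2r*cos(w)z^-1 + r^2 z^-2."""
    r = np.exp(-np.pi * bw / fs)
    return [1.0], [1.0, -2 * r * np.cos(2 * np.pi * freq / fs), r * r]

def filter_iir(b, a, x):
    """Direct-form IIR filtering (a plain-loop sketch)."""
    y = np.zeros(len(x))
    for n in range(len(x)):
        acc = b[0] * x[n]
        for k in range(1, len(a)):
            if n - k >= 0:
                acc -= a[k] * y[n - k]
        y[n] = acc
    return y

fs = 8000
src = np.zeros(4000)
src[::fs // 100] = 1.0  # voiced source: impulse train at an assumed 100 Hz pitch
y = src
for f0, bw in ((500.0, 80.0), (1500.0, 90.0)):  # two illustrative formants
    b, a = resonator_coeffs(f0, bw, fs)
    y = filter_iir(b, a, y)

spectrum = np.abs(np.fft.rfft(y))
peak_hz = spectrum.argmax() * fs / len(y)  # strongest harmonic near 500 Hz
```

Changing only the resonator frequencies frame by frame reshapes the spectral envelope while the source supplies the fine harmonic detail, which is why the formant frequencies alone can carry the phonetically important information at low bit rates.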