1. Field of the Invention
The present invention relates to a method and apparatus for extracting information from speech. The invention has particular, although not exclusive, relevance to the extraction of articulatory feature information from a speaker as he/she speaks.
2. Description of the Prior Art
There are several known techniques for diagnosing speech disorders in individual speakers, most of which rely on a comparison of various articulatory features, i.e. the positions of the lips, tongue, mouth, etc., of a "normal" speaker with those of the individual being diagnosed. One technique relies on a clinician extracting from the individual's speech the phonetic content, i.e. the string of phones that make up the speech. Each phone is produced by a unique combination of simultaneously occurring distinct articulatory features, and therefore the articulatory features can be determined and compared with those of a "normal" speaker. However, there are several disadvantages of this technique.
The first disadvantage with this technique is that it is not practical to have a phone for every possible combination of articulatory feature values. Consequently, only the most frequent combinations of articulatory feature values are represented by the set of phones, and so many possible articulations are not represented.
A second disadvantage of a phonetic technique is that the speech is considered as being a continuous stream of phones. However, such a concept of speech is not accurate since it assumes that all the articulatory features change together at the phone boundaries. This is not true since the articulatory features change asynchronously in continuous speech, which results in the acoustic realisation of a phone being dependent upon its neighbouring phones. This phenomenon is called co-articulation. For example, for the phrase "did you" the individual phonemes making up this phrase are:
"/d ih d y uw/"
However, the phonetic realisation of the phrase given above during continuous speech would be:
"/d ih j h uw/"
The final d in "did" is modified and the word "you" becomes converted to a word that sounds like "juh".
A third disadvantage with this technique is that a clinician has to make a phonetic transcription of the individual's speech, which is (i) time consuming; (ii) costly, due to the requirement of a skilled clinician; and (iii) unreliable due to possible human error.
Another type of technique uses instruments to determine the positions of the articulatory structures during continuous speech. One such technique is cinefluorography, which involves the photographing of x-ray images of the speaker. In order to analyse movement of the articulatory structures, sequences of individual cinefluorographic frames are traced, and measurements are made from the tracings using radiopaque beads, skeletal structures, and/or articulators covered with radiopaque substances.
However, there are a number of disadvantages associated with the use of cinefluorographic techniques:
i) there is a danger of radiation exposure, and therefore the size of the speech sample must be restricted;
ii) the acquisition of data must be under the supervision of a skilled radiologist, which results in high cost;
iii) the body must be stabilised, which might result in an unnatural body posture which may affect the articulation; and
iv) variations in the x-ray data obtained from individual to individual result in reduced reliability of the data measurements.
Ultrasonic imaging is another instrumental technique that allows observation of the dynamic activity of the articulatory structures, but does not interfere with the activity of the articulatory structures, nor does it expose the subject to radiation. Ultrasonic imaging uses the reflection of ultrasonic waves from the interface between two media. Since the time between the initiation of the ultrasonic pulses and the return is proportional to the distance from the transmitter to the boundary, information relating to the reflected waves may be used to produce a time-amplitude display indicative of the structure reflecting the waves. This technique, however, suffers from the problem that the observer is not exactly sure of the point on the structure from which the return is being measured, and also the transmitter and receiver must be at 90° to the interface. Therefore, when trying to characterise speech disorders by structural anomalies, it may be particularly difficult to identify the point on the structure being monitored.
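The proportionality between echo delay and distance described above can be illustrated with a minimal sketch. The function name and the speed-of-sound value (about 1540 m/s, a commonly quoted figure for soft tissue) are assumptions made for illustration and are not taken from the techniques discussed here.

```python
# Assumed average speed of sound in soft tissue (m/s); illustrative value only.
SPEED_OF_SOUND_TISSUE_M_PER_S = 1540.0


def echo_distance_m(round_trip_time_s, speed_m_per_s=SPEED_OF_SOUND_TISSUE_M_PER_S):
    """Distance from the transmitter to the reflecting interface.

    round_trip_time_s is the delay between emitting the pulse and
    receiving its reflection.
    """
    if round_trip_time_s < 0:
        raise ValueError("round-trip time must be non-negative")
    # The pulse travels to the interface and back, so halve the path length.
    return speed_m_per_s * round_trip_time_s / 2.0


if __name__ == "__main__":
    for delay_us in (20.0, 40.0, 80.0):
        distance_mm = echo_distance_m(delay_us * 1e-6) * 1000.0
        print(f"{delay_us:5.1f} us -> {distance_mm:.1f} mm")
```

Doubling the delay doubles the computed distance, which is what allows the returns to be plotted as a time-amplitude display. The sketch says nothing about which point on the structure produced the echo, which is the limitation noted above.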
A technique for extracting articulatory information from a speech signal has been proposed in "A linguistic feature representation of the speech waveform" by Ellen Eide, J Robin Rohlicek, Herbert Gish and Sanjoy Mitter; International Conference on Acoustics, Speech and Signal Processing, April 1993, Minneapolis, USA, Vol. 2, pages 483-486. In this technique, a whole speech utterance, for example a sentence, is input into the speech analysis apparatus, the utterance then being segmented. This segmentation process uses a computationally intensive dynamic programming method that determines the most likely broad phonetic sequence within the utterance. Consequently, whilst this system allows analysis of the input speech to produce some indication of the positions of some of the articulators, delays are produced due to the necessity of inputting whole speech utterances before any analysis takes place.
U.S. Pat. No. 4,980,917 discloses an apparatus and method for determining the instantaneous values of a set of articulatory parameters. It achieves this by monitoring the incoming speech and selecting a frame of speech for further processing when the monitoring identifies a significant change in the energy of the input speech signal. The further processing includes a spectral analysis and a linear mapping function which maps the spectral coefficients from the spectral analysis into articulatory parameters. However, the system described in U.S. Pat. No. 4,980,917 does not process all the input speech, and those frames of input speech that are processed are treated as separate entities. In other words, the system does not use context information, i.e. it does not consider neighbouring frames, when it determines the articulatory parameter values.
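The general shape of such an energy-triggered, frame-by-frame scheme can be sketched as follows. This is only an illustration under stated assumptions and is not the implementation of U.S. Pat. No. 4,980,917: the frame length, the energy-ratio threshold, the crude band-energy analysis, and the weight matrix are all hypothetical.

```python
import math


def frame_energies(samples, frame_len):
    """Split samples into non-overlapping frames; return (frames, energies)."""
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len + 1, frame_len)]
    energies = [sum(x * x for x in frame) for frame in frames]
    return frames, energies


def select_frames(energies, ratio_threshold=2.0, floor=1e-12):
    """Indices of frames whose energy rose or fell by more than
    ratio_threshold relative to the preceding frame (assumed criterion)."""
    selected = []
    for i in range(1, len(energies)):
        prev = max(energies[i - 1], floor)
        cur = max(energies[i], floor)
        if max(cur / prev, prev / cur) > ratio_threshold:
            selected.append(i)
    return selected


def band_log_energies(frame, n_bands=4):
    """Crude spectral analysis: log energy in equal-width DFT bands."""
    n = len(frame)
    half = n // 2
    power = []
    for k in range(half):
        re = sum(frame[t] * math.cos(2 * math.pi * k * t / n) for t in range(n))
        im = -sum(frame[t] * math.sin(2 * math.pi * k * t / n) for t in range(n))
        power.append(re * re + im * im)
    width = half // n_bands
    return [math.log(sum(power[b * width:(b + 1) * width]) + 1e-12)
            for b in range(n_bands)]


def linear_map(weights, bias, features):
    """Map spectral features to parameters: one output per weight row."""
    return [sum(w * f for w, f in zip(row, features)) + b
            for row, b in zip(weights, bias)]


if __name__ == "__main__":
    # Synthetic input: silence, a short tone, then silence again.
    tone = [math.sin(2 * math.pi * 4 * t / 32) for t in range(64)]
    signal = [0.0] * 64 + tone + [0.0] * 64
    frames, energies = frame_energies(signal, frame_len=32)

    # Hypothetical 2x4 weight matrix and bias; real values would be trained.
    weights = [[0.1, 0.0, -0.1, 0.0], [0.0, 0.2, 0.0, -0.2]]
    bias = [0.0, 0.0]

    for idx in select_frames(energies):
        # Each selected frame is mapped on its own, with no context.
        params = linear_map(weights, bias, band_log_energies(frames[idx]))
        print(idx, params)
```

Only frames at an energy change are processed, and each is mapped in isolation. The sketch therefore reproduces both limitations noted above: parts of the input are skipped, and no neighbouring frames inform the result.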