1. Technical Field
The invention relates generally to automatic speech recognition. More specifically, the invention relates to techniques for improving automatic speech recognition by using the most robust and relevant aspects of the speech signal, including temporal information and patterns derived from perceptual clusters, and by processing that information using novel machine learning techniques.
2. Description of the Related Art
Speech perception information is non-uniformly distributed in frequency, amplitude and time. In every aspect, speech is highly variable. Most automatic speech recognition systems extract information at uniformly spaced intervals at a single scale. In human speech perception, some speech classes are known to be distinguished by appeal to temporal characteristics, but in typical state-of-the-art speech recognition systems the temporal aspects of speech are not fully exploited.
Most state-of-the-art automatic speech recognition systems include a process which extracts information from the speech signal at uniform time steps (typically 10-15 milliseconds) using uniform short duration (typically 20-30 milliseconds) analysis frames. Classification of speech based on a single short term observation vector is not reliable because the speech signal is highly dynamic and constantly transitioning as the various speech sounds are made. Indeed, longer term patterns must be employed to create usable systems.
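The fixed framing described above can be sketched as follows. This is a minimal illustration, not the front end of any particular recognizer; the 16 kHz sample rate, the 25 millisecond frame, the 10 millisecond step, and the helper names frame_bounds and frames_covering are all assumptions chosen for the example.

```python
def frame_bounds(num_samples, sample_rate, frame_ms=25.0, step_ms=10.0):
    """Return (start, end) sample indices of fixed-length analysis frames."""
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    step = int(round(sample_rate * step_ms / 1000.0))
    bounds = []
    start = 0
    # Frames are laid down at uniform steps, regardless of signal content.
    while start + frame_len <= num_samples:
        bounds.append((start, start + frame_len))
        start += step
    return bounds


def frames_covering(bounds, sample_index):
    """Return the indices of all frames that contain the given sample."""
    return [i for i, (s, e) in enumerate(bounds) if s <= sample_index < e]


# One second of audio at an assumed 16 kHz sample rate.
bounds = frame_bounds(num_samples=16000, sample_rate=16000)
# A hypothetical speech event at sample 1000 falls in several overlapping
# frames, at a different offset inside each one.
hits = frames_covering(bounds, 1000)
offsets = [1000 - bounds[i][0] for i in hits]
```

Because the frame grid is fixed in advance, the position of the event inside each frame that covers it depends only on where the grid happens to start, not on the speech itself.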
One method known in the art for making longer term patterns available is to retain a memory of a number of short term observation vectors, which are then presented to a speech classifier simultaneously. The classifiers used with this approach are often artificial neural networks or correlation templates. While retaining a memory of short term observation vectors brings improved results, there are several remaining problems.
First, the uniform time step sampling, common to all frame based methods, is not synchronous with the speech signal. Therefore the relationship of speech events and observation frames is random. This results in increased variability of extracted features and a quantizing of temporal details.
Next, extraction based on uniform analysis frames is not optimal. The information used for human perception of speech sounds occurs at many different time scales. For example the plosive burst of a spoken “t” sound may be as little as a few milliseconds in duration whereas a vowel may be sustained for more than a second. A sequence of many short term observations does not present the same information as a long term observation does and vice versa.
Some aspects of speech are highly variable in the temporal dimension. For example the length that a vowel is sustained depends on the speaker, the rate of speech, whether the vowel is in a stressed syllable or not, and where in the sentence the word containing the syllable is found. This temporal variability causes speech information to move to different relative observation frames, significantly increasing the variability of the extracted values for different examples of the same speech class and making the detection of meaningful patterns in the memory difficult.
Additionally, frame based systems typically treat all frames as equally important. In contrast, human perception uses the portions of the signal which have the best signal to noise ratio and which contain the characteristics most relevant and reliable to make the required distinctions.
Most state-of-the-art automatic speech recognition systems incorporate Hidden Markov Models. Hidden Markov Models are stochastic state machines. Hidden Markov Models map class probabilities estimated from observation vectors into likely sequences of hidden (unobserved) class productions. Using Hidden Markov Models, the temporal variability problem mentioned above is addressed by allowing each emitting state to transition to itself. By using self-transitioning states the temporal variability is “absorbed.” Unfortunately, unless the approach is modified to explicitly extract durational information, the approach removes both unwanted and desirable temporal information. The temporal relationships of speech events carry significant information for perception of speech sounds, particularly in the discrimination of plosives, affricates, and fricatives. Furthermore, robust estimation of class probabilities requires large quantities of training data. When the conditions of use differ from the training conditions, the probability estimates become very inaccurate, leading to poor recognition.
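The way a self-transition absorbs duration can be made concrete with a short sketch. It relies on a standard property of a first-order Markov state: if the self-transition probability is a, the probability of staying exactly d frames is a^(d-1) * (1 - a). The function names and the example probabilities are assumptions made for illustration only.

```python
def duration_pmf(self_loop_prob, max_frames):
    """P(state is occupied for exactly d frames), for d = 1..max_frames."""
    a = self_loop_prob
    return [(a ** (d - 1)) * (1.0 - a) for d in range(1, max_frames + 1)]


def expected_duration(self_loop_prob):
    """Mean number of frames spent in a state with the given self-loop."""
    return 1.0 / (1.0 - self_loop_prob)


# With an assumed self-loop probability of 0.5, each extra frame halves the
# probability, so the single most likely duration is always one frame.
pmf = duration_pmf(0.5, 3)
```

The implied duration distribution decays geometrically whatever the speech sound is, which is one way to see why durational cues are lost unless duration is modeled explicitly.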
The features used by most state-of-the-art automatic speech recognition systems are primarily derived from short term spectral profiles. That approach is often taken because many speech sounds have somewhat characteristic frequency peaks called formants. A very different approach employed by other current systems is to focus on the long term trajectories of frequency bands. In a method called TRAPs (Temporal Patterns) speech sounds are modeled as the mean long term (~1 sec.) trajectories of examples of the sounds. Classification is performed based on the correlation of the speech signal envelopes with each of the TRAP models. Some versions of this approach have results reported to be comparable to the short term spectral methods. These results show that information useful to the identity of speech sounds is spread over time beyond the bounds of phoneme segments. Because of the averaging and windowing used in the method, information near the center of the TRAP is emphasized over information further away. TRAPs capture gross trends but do not capture temporal details.
Yet another alternative to frame based feature extraction is to segment the speech at the location of certain detectable signal conditions called “events”. Each segmented portion is considered to have a single class identity. Usually temporal alignment with a model is performed by dynamic time warping, which allows the feature trajectories to be projected into a common time scale. Then, in the warped time scale the feature trajectory is re-sampled and correlated with a template or used as observations for a Hidden Markov Model. The process of dynamic time warping removes much of the time variability of the speech segments. However, finding reliable segmentation events presents a challenge for event based methods. Event insertions or deletions result in catastrophic misalignments.
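Dynamic time warping, as mentioned above, can be sketched with the textbook recurrence. This is a minimal version for one-dimensional feature trajectories using an absolute-difference local cost; the function name dtw_distance and the toy sequences are assumptions, and practical systems typically add path constraints and operate on feature vectors.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping cost between two 1-D sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] is the best cumulative cost aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b holds
                cost[i][j - 1],      # b advances, a holds
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# A sustained middle value is absorbed by the warp at no cost.
stretched = dtw_distance([1, 2, 3], [1, 2, 2, 2, 3])
```

In this toy case the stretched trajectory aligns with the short one at zero cost, which illustrates how warping removes time variability from a segment.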
Clearly there is a need in the art for improved techniques to increase the efficiency and effectiveness of automatic speech recognition.
Human perception of speech relies, in significant part, on the relative timing of events in the speech signal. The cues to speech perception occur over various time scales and may be offset in time from the perception itself. Changing the temporal relationships of speech events can change the perception of the speech. This is demonstrated in B. Repp, et al., Perceptual Integration of Acoustic Cues for Stop, Fricative, and Affricate Manner, Journal of Experimental Psychology: Human Perception and Performance 1978, Vol. 4, Num. 4, 621-637, by perceptual experiments where the durations of silence and frication were manipulated. One such experiment introduces a short interval of silence between the words “Say” and “Shop”, which causes listeners to hear “Say Chop.” Another example of how the relative timing of events influences perception is referred to as voice onset time, commonly abbreviated VOT. VOT is the length of time that passes from when a stop is released to when the vibration of the vocal cords begins. VOT is an important cue in distinguishing various stop consonants. The importance of timing also derives from the variability of the duration of speech phenomena. Some perceivable speech phenomena are very brief while others are quite long. For example, the TIMIT corpus of phonemically transcribed English speech has stop burst segments with durations of less than 5 milliseconds, while some vowel segments last more than 500 milliseconds.
Though relative event timings are important cues for perception, the most common methods of feature extraction are not sensitive to the timing of speech events. Almost all current speech and speaker recognition applications extract features by utilizing a signal segmentation approach based on fixed length analysis frames stepped forward in time by a fixed step size. Because these analysis frames are fixed in size, they are nearly always either significantly shorter or significantly longer than the lengths of the perceptual phenomena they are attempting to capture.
Though easy to implement, the common approach makes the extraction of features subject to the arbitrary relationship between the signal and the starting point of the first frame and to the arbitrary relationship between the size of the analysis frame and the time scale of various speech phenomena. In a frame-based speech recognition system described in S. Basu, et al., Time shift invariant speech recognition, ICSLP98, which is based on twenty-five millisecond frames stepped by ten milliseconds, shifts in the starting relationship of the signal and the first frame of less than ten milliseconds caused “significant modifications of the spectral estimates and [mel-frequency cepstral coefficients] produced by the front-end which in turn result in variations of up to [ten percent] of the word error rate on the same database.”
There are many sources of variability in speech signals, such as the speaker's vocal tract length, accent, speech rate, health, and emotional state, as well as background noise. However, the variation reported by Basu et al. is entirely due to using a method of feature extraction in which the frame size and frame alignment have arbitrary relationships with the signal. U.S. Pat. No. 5,956,671 (filed Jun. 4, 1997) to Ittycheriah et al. discloses techniques aimed at reducing feature variability caused by the arbitrary relationship between analysis frames and the speech signal. One aspect of their invention expands the variability of the training set by subjecting multiple time-shifted versions of the signal to the fixed frame analysis process as separate training examples. They also disclose a technique used at recognition time where the feature values are computed by averaging the results of fixed frame analysis applied to multiple time-delayed versions of the signal.
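The general idea of averaging fixed frame analysis over time-shifted copies of a signal can be illustrated with a toy sketch. This is not the method of the cited patent; mean frame energy stands in for a real feature, and the function names, the synthetic burst signal, and the chosen shifts are all assumptions made for the example.

```python
def frame_energies(signal, frame_len, step, offset=0):
    """Mean energy of each fixed-length frame, starting at the given offset."""
    out = []
    start = offset
    while start + frame_len <= len(signal):
        frame = signal[start:start + frame_len]
        out.append(sum(x * x for x in frame) / frame_len)
        start += step
    return out


def shift_averaged_energies(signal, frame_len, step, shifts):
    """Average the per-frame energies over several starting offsets."""
    per_shift = [frame_energies(signal, frame_len, step, s) for s in shifts]
    n = min(len(f) for f in per_shift)  # keep only frames present at every shift
    return [sum(f[i] for f in per_shift) / len(per_shift) for i in range(n)]


# A two-sample burst (samples 10 and 11) in an otherwise silent signal.
signal = [0.0] * 10 + [1.0] * 2 + [0.0] * 12
aligned = frame_energies(signal, 4, 4, offset=0)   # burst inside one frame
shifted = frame_energies(signal, 4, 4, offset=3)   # burst split across two frames
averaged = shift_averaged_energies(signal, 4, 4, shifts=[0, 3])
```

With the frame grid starting at sample 0 the burst falls entirely in one frame, while a three-sample shift splits it across two frames. Averaging the two analyses reduces the dependence on the starting point but also spreads the burst energy over neighboring frames.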
These techniques do not fully mitigate the problems caused by extracting features using fixed frames and fixed time steps. Moreover, expanding the number of examples increases training time and incorporates additional variability into the model which is not present in the original speech signal. Time-shifted averaging increases computational complexity and may “average out” some perceptually relevant speech characteristics.
In U.S. Pat. No. 6,470,311 (filed Oct. 15, 1999) to Moncur, a method of pitch synchronous segmentation of voiced speech based on the positive zero crossings of the output of a band pass filter with a center frequency approximately equal to the pitch partially addresses synchronization. Unvoiced speech is segmented using the average pitch period computed over some unspecified time frame. It should be noted that low signal-to-noise conditions and signals with small DC signal offsets are known to cause problems for zero crossing based segmentation. For high quality speech signals, Moncur's approach represents an improvement over the common fixed frame analysis method during voiced speech. Unfortunately for unvoiced speech the approach reverts to arbitrary fixed frames and time steps. The use of fixed frames and time steps still leaves the accurate location of events such as closures and stop bursts unsolved. Furthermore, no solution at all is provided for whispered speech.
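Segmentation at positive zero crossings can be sketched on a synthetic signal. The example below is an illustration only and omits the band pass filter of the cited method; a pure sine at an assumed 100 Hz pitch, sampled at an assumed 8 kHz, stands in for the filter output, and the function name positive_zero_crossings is hypothetical.

```python
import math


def positive_zero_crossings(samples):
    """Indices where the signal rises from below zero to zero or above."""
    return [
        i for i in range(1, len(samples))
        if samples[i - 1] < 0.0 <= samples[i]
    ]


# 100 Hz sine at 8 kHz gives a period of 80 samples. The half-sample phase
# offset keeps samples away from exact zero, avoiding rounding ambiguity.
period = 80
clean = [math.sin(2.0 * math.pi * (n + 0.5) / period) for n in range(400)]
marks = positive_zero_crossings(clean)

# Pitch-synchronous segments run from one crossing to the next.
segments = list(zip(marks[:-1], marks[1:]))

# A DC offset moves the crossings, and a large enough one removes them.
small_offset = positive_zero_crossings([x + 0.5 for x in clean])
large_offset = positive_zero_crossings([x + 1.5 for x in clean])
```

On the clean sine the crossings are exactly one pitch period apart. Adding a DC offset shifts the crossing locations, and an offset larger than the signal amplitude eliminates the crossings entirely, which is consistent with the sensitivity noted above.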
Clearly a solution is needed which extracts features synchronously with the events of the speech signal itself rather than by fixed uniform frames having arbitrary and changing relationships with speech phenomena. The segmentation technique should apply to the entire signal including both voiced and unvoiced speech. Additionally, speech analysis should be performed over time scales appropriate for each of the particular types of events being detected.
The typical automatic speech recognition engine of today waits for a detected silence before analyzing and producing output, because this allows for natural segmentation and therefore results in higher accuracy due to the increased context. Waiting until the end of an utterance may cause the output to be delayed anywhere from five to twenty-five seconds. When an application must produce output in near real time, as required in applications such as automatic production of closed captions for television broadcast, smaller segments reduce the context available for analysis, and lower accuracy results. For these types of applications, what is needed is high accuracy with low latency.