Signal processing techniques for estimating characteristic parameters of a signal or for transforming a signal into a more desirable form are well known. Such techniques are advantageously utilized in such diverse fields as acoustics, data communications, radar and speech recognition. For example only, in a speech recognition system, a speech signal is processed to extract characteristic information encoded in frequency, amplitude and time. This information is then processed to extract various recognizable features in the speech signal that are used to aid in the recognition process. Since the performance of the overall speech recognition system is largely dependent on the accuracy of the original extraction process, highly efficient signal processing techniques are required.
Speech recognition systems have evolved a great deal over the past twenty (20) years. Vocabulary sizes have increased, connected speech capabilities have improved, and speaker-independent systems have appeared in the commercial market. Of these three areas of development, speaker-independent recognition has improved the least. Vocabulary sizes for such systems are typically restricted to fewer than twenty (20) words. There are no commercially available systems capable of handling connected, speaker-independent word recognition.
All of the problems in machine recognition of speech can be described as problems of normalization. This applies to the amplitude, spectrum, and temporal attributes of speech. The variability within a single speaker has proven to be sufficiently difficult for machines to contend with, let alone the variability among speakers.
Thus, automatic speech recognition (ASR) has consistently proven to be one of the more difficult tasks that digital computers have been asked to do. Vast resources have been dedicated to this problem over the past four decades, and yet, even today, there is no consensus among researchers as to the "right" way to do any of the major tasks necessary for the automatic recognition of speech.
One of the most difficult facets of speech recognition is speaker independence. The variations in vocal quality and dialect across speakers are often much greater than the distinctions across words.
The three major tasks involved in speech recognition are: (1) signal processing, (2) training, and (3) matching.
In psychological terms, these three tasks might be called sensing, learning, and perceiving. All recognition systems perform these three basic tasks. However, the specific functions of these three tasks may vary from one recognition system to another. For example, a recognizer that uses linear time normalization does its normalization during the signal processing task exclusively. A recognizer using dynamic time warping (DTW) does its normalization during both the training and matching tasks. Normalization is accomplished based upon a reference, and the reference determines where normalization belongs in the recognition process. Linear time normalization uses a fixed number of frames, or histograms, as a reference, whereas DTW uses a word template as a reference. Since the number of frames used with linear time normalization is predetermined, this function must be done prior to training and matching. Since DTW requires a learned reference, its time normalization cannot precede training and matching.
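As a concrete illustration, linear time normalization can be sketched as a fixed-length resampling of the frame sequence. The function name, the nearest-frame index mapping, and the 32-frame target are illustrative assumptions, not the exact method of any particular recognizer:

```python
def linear_time_normalize(frames, target_len=32):
    """Resample a variable-length sequence of frames to a fixed number
    of frames by linear index mapping (nearest-frame selection).
    `frames` is a list of per-frame feature vectors; the name and the
    default target length of 32 are illustrative assumptions."""
    n = len(frames)
    if n == 0:
        raise ValueError("empty utterance")
    # Map each of the target_len output slots back onto the input sequence.
    return [frames[min(n - 1, (i * n) // target_len)] for i in range(target_len)]
```

Because the output length is fixed in advance, this step can run before any training or matching, which is the point made above.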
Philosophically, the issue being alluded to above is the issue of nature versus nurture. In other words, which of the functions necessary to speech recognition should be incorporated into the signal processing (sensory) phase of the process, and which should be learned?
In the development of the recognizer described herein, the position is taken that speech normalization is a sensory process rather than a perceptual one. This is congruent with current thought on language development in children. Recent research has shown that infants are born with the ability to discriminate among phonetic categories regardless of non-phonetic variables such as loudness and pitch.
In commonly assigned copending U.S. Ser. No. 372,230, filed June 26, 1989, a method and apparatus is described for generating a signal transformation that retains a substantial part of the informational content of the original signal required for speech processing applications. As described therein, the transformation is generated by converting all or part of the original signal into a sequence of data samples, selecting a reference position along a first sub-part of the sequence and generating a frame or histogram for the reference position according to the sliding average magnitude difference function (SAMDF). Thereafter, a reference position along a second sub-part of the sequence is selected and an additional histogram is generated for this reference position using the SAMDF function. The plurality of frames or histograms generated in this fashion comprises the transformation. This transformation is then used as the signal itself in signal processing applications.
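The SAMDF belongs to the family of average magnitude difference functions. For orientation only, a minimal sketch of the classic AMDF form is given below; the window length, lag count, and absence of weighting here are assumptions, and the precise SAMDF is as defined in Ser. No. 372,230:

```python
def amdf_histogram(samples, ref, num_lags=19):
    """Sketch of an average magnitude difference histogram taken at a
    reference position `ref` in `samples`. For each lag k, the average
    absolute difference between the windowed signal and a copy shifted
    by k is accumulated; 19 lags matches the nineteen measurements per
    histogram described later. This is the classic AMDF form, not
    necessarily the exact SAMDF of Ser. No. 372,230."""
    window = 64  # illustrative window length
    hist = []
    for k in range(1, num_lags + 1):
        total = 0.0
        for n in range(window):
            a = samples[(ref + n) % len(samples)]
            b = samples[(ref + n + k) % len(samples)]
            total += abs(a - b)
        hist.append(total / window)
    return hist
```

A periodic signal yields small values at lags near its period and large values elsewhere, which is what makes such histograms useful carriers of spectral information.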
The present invention describes a method and apparatus for normalizing the amplitude and spectral variations in signals related to such fields as acoustics, data communications, radar and speech recognition. While the invention relates to each of those fields, it will be disclosed herein, with respect to the preferred embodiment, as a method and apparatus for normalizing the time, amplitude and spectral variations among speakers for speaker-independent word and connected speech recognition. The two weighted histograms, generated as disclosed in commonly assigned U.S. Ser. No. 372,230, and incorporated herein by reference, are utilized as the output signals from the first stage of the present invention.
One of the SAMDF histograms is used in the second stage as a broadband digitized input signal while the other is used as a differentiated and infinitely peak-clipped input signal. The resulting signals serve as low-pass and high-pass versions of the input signal to be processed. Also in the second stage of the invention, three exponentially related "channels" are derived from the two SAMDF function signals. The signals are processed in thirty-two histograms per word or utterance, with nineteen data buckets or measurements per histogram. The first channel samples the input signal four measurements at a time from the infinitely peak-clipped version of the input signal. The second and third channels are derived from the broadband version of the input signal. The second channel is derived by compressing the first eight measurements into four by averaging adjacent pairs of measurements. The third channel is derived by compressing sixteen measurements of the broadband representation into four by averaging four adjacent measurements at a time. Thus, the outputs of the three channels from the averaging networks each include four consecutive measurements. This second stage of the invention results in three exponentially related channels of information about the signal, each consisting of four measurements. The three channels are said to be "logarithmically spaced" and emphasize spectral variations in the input signal.
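The channel derivation described above can be sketched as follows, assuming the four, eight, and sixteen measurements are taken from the leading positions of each nineteen-measurement histogram (that placement, and the function name, are assumptions):

```python
def derive_channels(clipped_hist, broadband_hist):
    """Sketch of the second-stage channel derivation. Each input is a
    19-measurement SAMDF histogram.

    Channel 1: four measurements taken directly from the peak-clipped histogram.
    Channel 2: first eight broadband measurements, adjacent pairs averaged.
    Channel 3: first sixteen broadband measurements, averaged four at a time."""
    ch1 = clipped_hist[:4]
    ch2 = [(broadband_hist[2 * i] + broadband_hist[2 * i + 1]) / 2 for i in range(4)]
    ch3 = [sum(broadband_hist[4 * i:4 * i + 4]) / 4 for i in range(4)]
    return ch1, ch2, ch3
```

Each successive channel spans twice the range of measurements of the previous one while producing the same four outputs, which is the sense in which the channels are "logarithmically spaced."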
The third stage derives a fourth channel that consists of average amplitude measurements from each of the three logarithmically spaced channels discussed in relation to the second stage, plus a global average taken across all three of the channels. Thus, the fourth channel consists of four measurements, or "buckets", representing the average signal amplitude of each of the three logarithmically spaced channels and the global amplitude. For the purposes of this application, a "channel" is defined as a set of related measurements that are operated on in a like manner.
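A minimal sketch of the fourth-channel computation, given the three four-measurement channels from the second stage (the function name is an assumption):

```python
def derive_amplitude_channel(ch1, ch2, ch3):
    """Sketch of the third-stage amplitude channel: one average amplitude
    per logarithmically spaced channel, plus a global average taken
    across all three channels, giving four measurements in all."""
    a1 = sum(ch1) / len(ch1)
    a2 = sum(ch2) / len(ch2)
    a3 = sum(ch3) / len(ch3)
    global_avg = (a1 + a2 + a3) / 3
    return [a1, a2, a3, global_avg]
```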
The preferred embodiment of the present invention relates to the recognition of an isolated word or phrase, meaning that each utterance recognized by the system must be preceded and followed by a short period of silence. The utterance is captured and framed by a process designed to determine the beginning and ending points of the utterance. This process is usually called an end point algorithm, and in the fourth stage of the present invention, the beginning and end points of the utterance are determined.
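A simple energy-threshold end point detector is sketched below for illustration only; the actual fourth-stage algorithm is not reproduced here, and the threshold and minimum-run parameters are assumptions:

```python
def find_endpoints(frame_energies, threshold, min_run=3):
    """Illustrative end point algorithm: an utterance is taken to begin
    at the first run of `min_run` consecutive frames whose energy
    exceeds `threshold`, and to end at the last frame of the last such
    run. Returns (start, end) frame indices, or (None, None) if no
    speech is found."""
    above = [e > threshold for e in frame_energies]
    start = end = None
    for i in range(len(above) - min_run + 1):
        if all(above[i:i + min_run]):
            if start is None:
                start = i
            end = i + min_run - 1
    return start, end
```

Requiring a run of frames, rather than a single frame, above threshold guards against clicks and other short noise bursts being framed as speech.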
Post-end point processing and normalization occur in the fifth stage of the present invention. In this stage, both time normalization and amplitude normalization of the data signal are accomplished.
In the sixth stage, matching, training and adaptation are performed. The present invention recognizes input speech by matching the input "test" utterance against the vocabulary templates. The template yielding the lowest score is returned by the system as being the recognized utterance, although, as is well known in the art, the recognition choice may be "rejected" if the score is too high or if the second-best match score is too close to the best score.
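The match-and-reject decision can be sketched as follows; the distance function and both rejection thresholds are illustrative parameters, not values specified by the invention:

```python
def recognize(test_template, vocabulary, distance, reject_abs=None, reject_margin=None):
    """Sketch of the sixth-stage match/reject decision. `vocabulary`
    maps each word to its averaged template, and `distance` returns a
    score where lower is better. Returns the best-matching word, or
    None if the match is rejected."""
    scored = sorted((distance(test_template, t), w) for w, t in vocabulary.items())
    best_score, best_word = scored[0]
    if reject_abs is not None and best_score > reject_abs:
        return None  # best score is too high: reject
    if reject_margin is not None and len(scored) > 1:
        second_score, _ = scored[1]
        if second_score - best_score < reject_margin:
            return None  # runner-up is too close to the best: reject
    return best_word
```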
The isolated word recognizer described herein is a template based system, meaning that the system matches input (or "unknown") utterances against a set of prototypes, or templates, representing the utterances accepted by the system (the "vocabulary"). In the preferred embodiment, one composite template is formed for each utterance in the vocabulary. A composite template is formed by merging a number of examples of the specified utterance spoken by a number of speakers. This composite template may then be "adapted" by a process described hereafter. A novel aspect of the present invention is the maintenance of two forms of the composite template for each utterance: the "summed" template that enables the system to add training to any template and the "averaged" template, generated from the summed template, that is the form matched against the unknown utterance.
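The dual summed/averaged template arrangement can be sketched as follows; the class and field names are assumptions, and real templates would hold per-frame, per-channel measurements rather than a flat vector:

```python
class CompositeTemplate:
    """Sketch of the dual-form composite template: a running "summed"
    template that can always accept further training, and an "averaged"
    template derived from it that is the form matched against the
    unknown utterance."""

    def __init__(self, length):
        self.sums = [0.0] * length
        self.count = 0

    def add_example(self, example):
        # Merge one (already time-normalized) training example into the sums.
        for i, v in enumerate(example):
            self.sums[i] += v
        self.count += 1

    def averaged(self):
        # The averaged form used for matching.
        return [s / self.count for s in self.sums]
```

Keeping the sums rather than only the averages is what makes it possible to add training to a template at any time without revisiting the original examples.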
When operating in the "adapt" mode, the present invention adjusts the vocabulary templates to the speaker. Adaptation is controlled using feedback as to the recognition accuracy. A novel feature of the adaptation process is the subtraction of the test template from intruding vocabulary templates.
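The subtraction of the test template from an intruding vocabulary template can be sketched as follows; the scale factor, and the representation of the summed template as a flat vector, are illustrative assumptions:

```python
def subtract_intruder(intruder_sums, test_template, weight=1.0):
    """Sketch of the adaptation step described above: when an intruding
    vocabulary template wrongly wins the match, the test template is
    subtracted from that intruder's summed template, moving the intruder
    away from the misrecognized utterance. `weight` is an illustrative
    scale factor."""
    return [s - weight * t for s, t in zip(intruder_sums, test_template)]
```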
The word "template" is normally used to mean "model". A true template represents the spaces between events. So long as comparisons fall within the variances (whether a signal is present or not), such comparisons "fall through the template". This means that in speech recognition, phonemes can be missing without creating adverse results in the speech recognition system. However, information which falls outside of the variance template presents evidence against a match.
One important feature of the invention is the logarithmic spacing of the channels, each of which analyzes the same number of data buckets or measurements, but each of which, through compression, encompasses a wider range of data buckets than the previous channel. Another important feature of the invention is the signal amplitude normalization of all four of the channels. Still another important feature of the present invention is the time normalization of all four of the channel signals using not only beginning and end points but also a center reference point.
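The three-point time normalization can be sketched as a piecewise-linear mapping that anchors the beginning, the center reference point, and the end of the utterance; the 32-frame target, the nearest-frame mapping, and the function names are assumptions:

```python
def normalize_with_center(frames, center, target_len=32):
    """Sketch of time normalization anchored at three points: the
    beginning, a center reference point, and the end. Each half of the
    utterance is linearly mapped onto half of the target, so the center
    reference always lands at slot target_len // 2."""
    assert 0 < center < len(frames), "center must be an interior frame index"
    half = target_len // 2
    left, right = frames[:center], frames[center:]

    def stretch(seg, m):
        # Linearly map len(seg) input frames onto m output slots.
        n = len(seg)
        return [seg[min(n - 1, (i * n) // m)] for i in range(m)]

    return stretch(left, half) + stretch(right, target_len - half)
```

Pinning the center as well as the end points gives the normalization one more alignment constraint than a simple beginning-to-end linear warp.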
Thus, in the present invention, linear normalization forces peaks on the four channels. Speech recognition with the present template is also related to the spacing between the peaks; thus, the information-bearing portion of the signal is the area between peaks. Each data bucket or measurement is compared to the average template peaks, and then variances are subtracted to give a point score per time increment; the total score for one word is the sum of the points for all time increments.
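The scoring rule described above, in which data falling within the template's variances contributes nothing against a match, can be sketched as follows; the absolute-difference distance and the zero floor are assumptions:

```python
def score_utterance(test_frames, template_means, template_vars):
    """Sketch of the scoring rule: per time increment, each measurement's
    distance from the averaged template is reduced by the template's
    allowed variance and floored at zero, so measurements falling within
    the variance "fall through the template" and cost nothing. Lower
    totals indicate better matches."""
    total = 0.0
    for frame, means, variances in zip(test_frames, template_means, template_vars):
        for x, m, v in zip(frame, means, variances):
            total += max(0.0, abs(x - m) - v)
    return total
```

Note how a missing or shifted measurement that stays inside the variance band adds no penalty, consistent with the observation that missing phonemes need not produce adverse results.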