The invention relates generally to speech recognition and more particularly to the use of nonacoustic information in combination with acoustic information for speech recognition and related speech technologies.
Speech Recognition
The development history of speech recognition (SR) technology has spanned four decades of intensive research. In the '50s, SR research was focused on isolated digits, monosyllabic words, speaker dependence, and phonetic-based attributes. Feature descriptions included a set of attributes like formants, pitch, voiced/unvoiced, energy, nasality, and frication, associated with each distinct phoneme. The numerical attributes of a set of such phonetic descriptions are called a feature vector. In the '60s, researchers addressed the problem that time intervals spanned by units like phonemes, syllables, or words are not maintained at fixed proportions of utterance duration, from one speaker to another or from one speaking rate to another. No adequate solution was found for aligning the sounds in time in such a way that statistical analysis could be used. Variability in phonetic articulation due to changes in speaker vocal organ positioning was found to be a key problem in speech recognition. Variability was in part due to sounds running together (often causing incomplete articulation), or half-way organ positioning between two sounds (often called coarticulation). Variability due to speaker differences was also very difficult to deal with. By the early '70s, the phonetic based approach was virtually abandoned because of the limited ability to solve the above problems. A much more efficient way to extract and store acoustic feature vectors, and relate acoustic patterns to underlying phonemic units and words, was needed.
In the 1970s, workers in the field showed that short "frames" (e.g., 10 ms intervals) of the time waveform could be well approximated by an all poles (but no zeros) analytic representation, using numerical "linear predictive coding" (LPC) coefficients found by solving covariance equations. Specific procedures are described in B. S. Atal and S. L. Hanauer, "Speech analysis and synthesis by linear prediction of the speech wave," J. Acoust. Soc. Am. 50(2), 637 (1971) and L. Rabiner, U.S. Pat. No. 4,092,493. Better coefficients for achieving accurate speech recognition were shown to be the Cepstral coefficients, e.g., S. Furui, "Cepstral analysis technique for automatic speaker verification," IEEE Trans. on Acoust. Speech and Signal Processing, ASSP-29 (2), 254, (1981). They are Fourier coefficients of the expansion of the logarithm of the absolute value of the corresponding short time interval power spectrum. Cepstral coefficients effectively separate excitation effects of the vocal cords from resonant transfer functions of the vocal tract. They also capture the characteristic that human hearing responds to the logarithm of changes in the acoustic power, and not to linear changes. Cepstral coefficients are related directly to LPC coefficients. They provide a mathematically accurate method of approximation requiring only a small number of values. For example, 12 to 24 numbers are used as the component values of the feature vector for the measured speech time interval or "frame" of speech.
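The definition of the Cepstral coefficients given above (Fourier coefficients of the logarithm of the short time power spectrum) can be illustrated with a minimal sketch. The function names, the naive transform, and the small floor value that guards the logarithm are assumptions made for illustration only; they are not taken from the cited references, and a practical system would use a fast transform and a windowed frame.

```python
import cmath
import math


def dft(x):
    """Naive O(N^2) discrete Fourier transform of one frame of samples."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def real_cepstrum(frame, num_coeffs=12, floor=1e-12):
    """Return the first num_coeffs Fourier coefficients of the log power spectrum.

    `floor` is an illustrative guard against log(0), not part of the definition.
    """
    n = len(frame)
    log_power = [math.log(max(abs(s) ** 2, floor)) for s in dft(frame)]
    # The log power spectrum of a real frame is real and even, so its inverse
    # transform reduces to a cosine sum and the result is real.
    ceps = [sum(log_power[k] * math.cos(2 * math.pi * k * q / n)
                for k in range(n)) / n
            for q in range(n)]
    return ceps[:num_coeffs]
```

In this sketch, scaling the frame by a constant gain changes only the zeroth coefficient, which is one way of seeing how the representation separates overall level from spectral shape.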
The extraction of acoustic feature vectors based on the LPC approach has been successful, but it has serious limitations. Its success relies on being able to simply find the best match of the unknown waveform feature vector to one stored in a library (also called a codebook) for a known sound or word. This process circumvented the need for a specific detailed description of phonetic attributes. The LPC-described waveform could represent a speech phoneme, where a phoneme is an elementary word-sound unit. There are 40 to 50 phonemes in American English, depending upon whose definition is used. However, the LPC information does not allow unambiguous determination of physiological conditions for vocal tract model constraints. For example, it does not allow accurate, unambiguous vocal fold on/off period measurements or pitch. Alternatively, the LPC representation could represent longer time intervals such as the entire period over which a word was articulated. Vector "quantization" (VQ) techniques assisted in handling large variations in articulation of the same sound from a potentially large speaker population. This helped provide speaker independent recognition capability, but the speaker normalization problem was not completely solved, and remains an issue today. Automatic methods were developed to time align the same sound units when spoken at a different rate by the same or different speaker. One successful technique was the Dynamic Time Warping algorithm, which performed a nonlinear time scaling of the feature coefficients. This provided a partial solution to the problem identified in the '60s as the nonuniform rate of speech.
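The nonlinear time alignment performed by Dynamic Time Warping can be sketched as follows. This is a minimal textbook form of the algorithm, given under stated assumptions: each utterance is a list of feature vectors, the frame distance is Euclidean, and the allowed steps are the three basic ones (advance either sequence, or both). The function names are hypothetical and no slope constraints or path normalization are applied.

```python
import math


def frame_distance(u, v):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def dtw_distance(seq_a, seq_b):
    """Accumulated distance along the best nonlinear time alignment."""
    inf = float("inf")
    n, m = len(seq_a), len(seq_b)
    # cost[i][j] is the best accumulated distance aligning the first i frames
    # of seq_a with the first j frames of seq_b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # seq_a advances alone
                                 cost[i][j - 1],      # seq_b advances alone
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]
```

With this formulation, an utterance and a copy of it spoken at half the rate (each frame repeated) align with zero accumulated distance, which is the property that makes the method tolerant of the nonuniform rate of speech.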
For medium size vocabularies (e.g., about 500 words), it is acceptable to use the feature vectors for the several speech units in a single word as basic matching units. During the late 1970s, many commercial products became available on the market, permitting limited vocabulary recognition. However, word matching also required the knowledge of the beginning and the end of the word. Thus sophisticated end-point (and onset) detection algorithms were developed. In addition, purposeful insertion of pauses by the user between words simplified the problem for many applications. This approach is known as discrete speech. However, for a larger vocabulary (e.g., >1000 words), the matching library becomes large and unwieldy. In addition, discrete speech is unnatural for human communications, but continuous speech makes end-point detection difficult. Overcoming the difficulties of continuous speech with a large size vocabulary was a primary focus of speech recognition (SR) research in the '80s. To accomplish this, designers of SR systems found that the use of shorter sound units such as phonemes or PLUs (phone-like units) was preferable, because of the smaller number of units needed to describe human speech.
In the '80s, a statistical pattern matching technique known as the Hidden Markov Model (HMM) was applied successfully in solving the problems associated with continuous speech and large vocabulary size. HMMs were constructed to first recognize the 50 phonemes, and to then recognize the words and word phrases based upon the pattern of phonemes. For each phoneme, a probability model is built during a learning phase, indicating the likelihood that a particular acoustic feature vector represents each particular phoneme. The acoustic system measures the qualities of each speaker during each time frame (e.g., 10 ms), and software corrects for speaker rates and forms Cepstral coefficients. In specific systems, other values such as total acoustic energy, differential Cepstral coefficients, pitch, and zero crossings are measured and added as components with the Cepstral coefficients, to make a longer feature vector. For example, assume 10 Cepstral coefficients are extracted from a continuous speech utterance every 10 ms. Since phonemes last about 100 ms on average, the HMM phonemic model would contain 10 states (i.e., ten 10 ms segments) with 10 symbols (i.e., Cepstral values) per state. The value of each symbol changes from state to state for each phoneme because the acoustic signal in each 10 ms time frame is characterized by a different set of acoustic features captured by the Cepstral coefficients. The HMM approach is to compute the statistics of frequencies of occurrence of the symbols in one state related to those in the next state from a large training set of speakers saying the same phonemes in the same and differing word series. In this way, a set of state transition probabilities and the accompanying 10 symbol by 10 state array of values that best describe each phoneme are obtained.
To recognize an unknown phoneme, the user computes the 10 by 10 array and matches it to the pre-computed probabilistic phonemic model using the maximum likelihood detection approach. The HMM statistical approach makes use of the fact that the probability of observing a given set of 10 states in a time sequence is high for only one set of phonemes.
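The maximum likelihood matching step can be illustrated with a minimal sketch of the standard forward algorithm for a discrete-observation HMM. Several simplifications are assumed here and are not taken from the text above: the feature vectors are taken to be already quantized to integer symbol indices, each phoneme model is a triple of start, transition, and emission probabilities, and the function names and the two-state toy models used to exercise the sketch are hypothetical.

```python
def forward_likelihood(obs, start, trans, emit):
    """P(obs | model) for a discrete-observation HMM (forward algorithm).

    obs   : list of integer symbol indices, one per time frame
    start : start[s] is the probability of beginning in state s
    trans : trans[p][s] is the probability of moving from state p to state s
    emit  : emit[s][k] is the probability that state s emits symbol k
    """
    n_states = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n_states)]
    for symbol in obs[1:]:
        alpha = [emit[s][symbol] *
                 sum(alpha[p] * trans[p][s] for p in range(n_states))
                 for s in range(n_states)]
    return sum(alpha)


def recognize(obs, models):
    """Return the name of the model that assigns obs the highest likelihood."""
    return max(models, key=lambda name: forward_likelihood(obs, *models[name]))
```

For utterances of realistic length the products above underflow, so a practical version would work with scaled or logarithmic probabilities; that refinement is omitted to keep the sketch short.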
The best laboratory performance of a highly trained, single user HMM based recognizer today is about 99% correct recognition of words. In a normal work place with ambient office noise, with average training, on large vocabulary natural speech, the accuracy drops well below 90%. For almost all applications, this is not adequate; for high value applications, a >10% error rate is intolerable. A typical error performance specification of a reliable human communication system is usually in the range from 1 error in 1000 to as low as 1 error in 10,000, depending upon how much error correction communication between speaker and listener is used or allowed.
Thus, to reach this goal from a present error rate of about 10% (1 error in 10), factors of 100 to 1000 improvement in speech recognition accuracy are required. HMM based recognizers, or variants thereon, have been in intense development for more than 15 years, and are unlikely to deliver such a major breakthrough in accuracy. One major reason for this is that the acoustic signal contains insufficient information to accurately represent all of the sound units used in a given human language. In particular, variability of these speech units through incomplete articulation or through coarticulation makes for great difficulty in dealing with day to day variations in a given speaker's speech. Yet, even greater problems occur with different speakers and with the inability to do complete speaker normalization, and finally with the problems of human speakers who like to use large vocabularies with rapid, run together speech. Even as computer processors and memories drop in price and size, the complexity of processing to supply all of the missing acoustic information, to correct mistakes in articulation, and to deal with noise and speaker variability will be difficult or impossible to handle. Such recognizers will not be able to supply real time recognition meeting the demands of the market place for accuracy, cost, and speed.
Present Example of Speech Recognition
J. L. Flanagan, "Technologies of Multimedia Communications", Proc. of IEEE 82, 590, April 1994 on p. 592 states: "The research frontier in speech recognition is large vocabularies and language models that are more representative of natural language . . . Systems for vocabularies greater than 1000 words are being demonstrated. But word error rate is typically around 5% or more, and hence sentence error rate is substantially higher."
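The relationship between word error rate and sentence error rate noted in the quotation can be made concrete with a short calculation. The sketch below assumes, purely for illustration, that word errors are independent and that a sentence counts as wrong if any word in it is wrong; neither assumption is stated by Flanagan, and the ten word sentence length is hypothetical.

```python
def sentence_error_rate(word_error_rate, words_per_sentence):
    """Probability that at least one word in a sentence is misrecognized,
    assuming independent word errors (an illustrative simplification)."""
    return 1.0 - (1.0 - word_error_rate) ** words_per_sentence


# A 5% word error rate on a hypothetical ten word sentence.
rate = sentence_error_rate(0.05, 10)
print(f"sentence error rate: {rate:.3f}")  # prints "sentence error rate: 0.401"
```

Under these assumptions a 5% word error rate leaves roughly four sentences in ten with at least one error.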
A current speech signal processing model with the characteristics described by Flanagan uses a microphone to detect acoustic speech information. The acoustic signals are processed by LPC algorithms to produce a feature vector which is compared to a stored feature vector library and further processed for word and sentence assembly. The details of estimating the feature vector are that it uses an open loop, 10th order, short time stationary model for the vocal tract. The excitation signal X(t) is assumed to be random broadband white noise. A fast Linear Predictive Coding (LPC) algorithm is used to compute the model coefficients. A direct mapping of LPC coefficients to the Cepstral coefficients provides a robust and compact representation of the short time power spectrum which is the basis for statistical matching. FIG. 1 shows the essential processes of a modern prior art speech recognition system.
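The chain from a frame of samples to LPC coefficients and then to Cepstral coefficients can be sketched as follows. The text above does not name the fast algorithm used, so the autocorrelation method with the Levinson-Durbin recursion is assumed here as one common choice. The predictor convention x[n] ≈ a1·x[n-1] + ... + ap·x[n-p], the function names, and the omission of the gain term c0 are likewise assumptions of this sketch.

```python
def autocorrelation(frame, max_lag):
    """Short time autocorrelation r[0..max_lag] of one frame."""
    n = len(frame)
    return [sum(frame[t] * frame[t + lag] for t in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the autocorrelation normal equations for an all-pole model.

    Returns (a, err): a[k-1] is the k-th predictor coefficient and err is the
    residual prediction error energy.
    """
    a = [0.0] * (order + 1)  # a[0] is unused
    err = r[0]
    for i in range(1, order + 1):
        k_i = (r[i] - sum(a[k] * r[i - k] for k in range(1, i))) / err
        new_a = a[:]
        new_a[i] = k_i
        for k in range(1, i):
            new_a[k] = a[k] - k_i * a[i - k]
        a = new_a
        err *= 1.0 - k_i * k_i
    return a[1:], err


def lpc_to_cepstrum(a, num_coeffs):
    """Map predictor coefficients to cepstral coefficients c[1..num_coeffs]
    of the all-pole transfer function 1 / (1 - sum_k a_k z^-k)."""
    p = len(a)
    c = [0.0] * (num_coeffs + 1)  # c[0] (log gain) is not computed here
    for n in range(1, num_coeffs + 1):
        acc = a[n - 1] if n <= p else 0.0
        for k in range(max(1, n - p), n):
            acc += (k / n) * c[k] * a[n - k - 1]
        c[n] = acc
    return c[1:]
```

For the 10th order model described above, a frame would be processed as `levinson_durbin(autocorrelation(frame, 10), 10)` and the resulting coefficients passed to `lpc_to_cepstrum`. Because the recursion works only from the all-pole coefficients, the resulting feature vector can carry no more information than the all-pole fit itself.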
The open loop speech processing model has many drawbacks. First, the unknown excitation signal is not really spectrally white, but it is a pattern of air bursts (for vocalized speech) that take place at a rate of 70 to 200 times per second. Second, the complexity of the vocal tract model changes as a function of voice patterns with the lips opening and closing, the nasal tract opening, the tongue touching the palate, and several other important organ configurations. Third, there is an inherent limitation in estimating both the tract model coefficients and the excitation source with an all pole LPC model from one acoustic signal. The reason is that zeros in the excitation function (i.e., zero air flow) and anti-resonances in the tract model (i.e., zeros in the transfer function) cannot be mathematically modeled with LPC, and their presence can not be measured unambiguously using a microphone. As a result, the presently estimated Cepstral (i.e., LPC derived) coefficients representing the transfer function which characterize the vocal system of a speaker are inaccurate and not uniquely correlated with only one specific articulator configuration. Such errors in the feature vector coefficients directly limit the statistical pattern matching performance. Thus searching for a better matching algorithm or using more computer processing power to enhance performance may be futile. In addition, artifacts associated with ambient noise, speaker articulation variations from day to day, and speaker to speaker variability add difficulty and also training expense. Finally, developing large vocabulary systems for multiple, natural speakers, in many languages, is very expensive, because automated techniques for this process can not be well defined. It has been estimated (Rabiner and Juang, "Fundamentals of Speech Recognition", p. 493, Prentice Hall, 1993) that using the best models, it will take 10 CRAY YMP-16 equivalents to do the highest desired quality speech recognition.
It has been long recognized by linguists and phoneticians that human speech organ motions and positions are associated with the speech sound. Olive et al. "Acoustics of American English Speech", Springer 1993, describe the vocal system for almost all singles, pairs, and triplets of phonemes in American English, and their associated sonograms. Many decades ago, workers at Bell Laboratories (see J. L. Flanagan "Speech Analysis, Synthesis, and Perception" Academic Press, 1965) used x-ray images of the vocal organs and detailed modeling to determine organ shapes for given sounds. These workers and others described how optical devices were used to measure the glottal area (i.e., vocal fold positions) vs. time for voiced speech, and published detailed models of the speech system based upon well understood acoustic principles.
All of these physical measurement techniques suffer from not being usable in real time, and the detailed models that connect the organ information into phoneme identification don't work because the primary organ measurements are not available in real time. Therefore the models can not be accurately or easily fit to the speaker's macroscopic characteristics such as vocal tract size, compliance, and speed of speech organs. In addition, very idiosyncratic physiological details of the vocal tract, such as sinus cavity structure, cross sectional pharynx areas, and similar details, are not possible to fit into present model structures. However, they are needed to quantify more exactly individual speech sounds. Nevertheless, the above studies all show that associated with any given speech phonetic unit (i.e., syllable, phoneme or PLU) the speech organ motions and positions are well defined. In contrast, however, these workers (e.g., J. Schroeter and M. M. Sondhi, IEEE ASSP, 2(1) 133 (1994) and references therein) also have shown that acoustic information alone is insufficient to do the inverse identification of the speech tract organ configuration used to produce a sound. It is this incapacity, using acoustic speech alone, that leads to many of the difficulties experienced with present speech recognizer systems.
Researchers have searched for methods to measure the positions and shapes of the vocal tract elements during speech, but have found no effective way of doing this in real time. Papcun of Los Alamos National Laboratory described a vocal-tract-constrained speech recognition system, in the Journal of the Acoustical Society of America, 92 (2) August 1992, pp. 688-700, "Inferring Articulation and Recognizing Gestures from Acoustics with a Neural Network Trained on X-Ray Microbeam Data" and in PCT/US91/00529 titled "Time Series Association Learning." He measured vocal organ motions and their constrained patterns and locations, by using low power x-ray images of gold balls glued to a subject speaker's tongue and other vocal organs. He used this information to improve recognition algorithms based upon conventional mathematical techniques, but with additional phoneme pattern constraints imposed by the measurements obtained using x-ray data. His algorithms are based upon allowed vocal tract motions, but do not use the motions in real time to enhance the word recognition reliability. He also showed that vocal organ positions and sequences of positions were uniquely associated with many speech sounds. However, it is both dangerous and impractical to consider using small x-ray machines for real time speech recognition.
U.S. Pat. No. 4,769,845 by H. Nakamura, issued Sep. 6, 1988, describes a "Method of Recognizing Speech Using a Lip Image". Several such patents describe electro-mechanical-optical devices that measure speech organ motion simultaneously with acoustic speech, e.g., U.S. Pat. No. 4,975,960. In this case, the formation of the lips helps define the identification of a phoneme in a given speech period, by the degree to which the acoustic identification agrees with the lip image shape. Such devices are helpful, but sufficiently expensive and limited in the information they provide, that they are not widely used for speech recognition. They have also been proposed for automatically synchronizing speech to the lip motions in movie or video frames.
U.S. Pat. No. 4,783,803, 1988 "Speech Recognition Apparatus and Method" by Baker et al. assigned to Dragon Inc. (a prominent U.S. speech recognition company) lays out the details of a modern all acoustic speech recognition system, followed by six more patents, the latest being U.S. Pat. No. 5,428,707, 1995 "Apparatus and Method for Training Speech Recognition . . . " by Gould et al. Similarly Kurzweil Applied Intelligence, Inc. has patented several ideas. In particular, U.S. Pat. No. 5,280,563 by Ganong in 1994 describes a method of a composite speech recognition expert (system). This patent describes how to use two separate sets of constraining rules for enhancing speech recognition--an acoustic set of rules and a linguistic set of rules. The probabilities of accuracy (i.e., "scores") from each system are combined into a joint probability (i.e., "score") and a multi-word hypothesis is selected. This method of joining constraining rule sets is common in speech recognition.
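The general idea of joining two rule sets by combining their scores can be sketched as follows. This is not the method of the Ganong patent, whose details are not given here; it is a minimal illustration that assumes each hypothesis carries an acoustic probability and a linguistic probability, that the two are treated as independent so that the joint score is their product, and that the hypothesis with the highest joint score is selected. The function names and example scores are hypothetical.

```python
def joint_score(acoustic_score, linguistic_score):
    """Combine two probabilities, assuming (for illustration) independence."""
    return acoustic_score * linguistic_score


def select_hypothesis(hypotheses):
    """Pick the multi-word hypothesis with the highest joint score.

    hypotheses: list of (text, acoustic_score, linguistic_score) tuples.
    """
    return max(hypotheses, key=lambda h: joint_score(h[1], h[2]))[0]


# Hypothetical scores: the second candidate fits the acoustics slightly
# better but is far less plausible linguistically.
candidates = [
    ("recognize speech", 0.40, 0.30),
    ("wreck a nice beach", 0.45, 0.05),
]
best = select_hypothesis(candidates)  # "recognize speech"
```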
EM Sensors
U.S. Pat. Nos. 5,345,471 and 5,361,070 by Thomas E. McEwan at LLNL describe a micropower impulse radar (MIR) receiver and motion sensor based on very simple, low cost electronic send-and-receive modules that have millimeter resolution over measuring distances of tens of centimeters to meters. These devices can be used for wood or metal "stud-finders" in building walls (U.S. Pat. No. 5,457,394), for automobile collision or obstacle avoidance radars, and for many other applications. In addition, McEwan, and others, have shown that the EM waves emitted from these devices at frequencies near 2 GHz (and at other frequencies) can propagate through human body tissue. He has also shown (Ser. No. 08/287,746) that such a propagating wave experiences enough of a dielectric (or more complex) discontinuity between human tissue and blood (e.g., heart) or human tissue and air (e.g., lungs), that the time varying reflected signal from a beating heart or other body organ motion can be detected and has value.
Professor Neville Luhmann, Director of the Department of Applied Science of the University of California at Davis, has described how low cost, solid state millimeter wave generators similar to the designs of McEwan and others can be made using microelectronics fabrication techniques. These can be fabricated into transmit-receive modules which provide millimeter resolution, and which can be tuned to optimize body transmission and minimize body tissue heating, or body chemical resonances.
U.S. Pat. Nos. 5,030,956 and 5,227,797 to Murphy describe a radar tomography system for examining a patient for medical purposes. A radar transmitter capable of transmitting rf or microwave frequencies is used with multiple receivers and time of flight timing units. The locations of body organs are measured from multiple depths into the patient and from multiple directions using both a multiplicity of receiver units (multistatic system), and by moving the transmitting unit to view the patient from multiple directions. A reflection tomography device uses EM wave reflections to build up an image of the interior of a patient's body for medical imaging applications. There is no description of the importance of time varying organ interface information, nor of the value of single directional, non-imaging systems. Murphy provided no experimental data on image formation that show his ideas can be reduced to practice, and several of the proposed embodiments in the Murphy patent are not expected to be technically realizable in any commercially important imaging system.
U.S. Pat. Nos. 3,925,774 to Amlung and 4,027,303 to Neuwirth et al. describe small radar units generating frequencies that can pass through human body tissue. Amlung describes a printed circuit board sized radar device, made from discrete components, that projects rf waves in a particular direction at a frequency of about 0.9 GHz. The principle of operation is that as long as there is no change in the reflected rf signal to the receiver unit from any objects in the line of sight of EM wave propagation within a defined time unit, appropriate filtering in the receiver provides a null signal to an alarm device. If an object moves into the field of the transmitting device at an appropriate rate, greater than the filter time, a signal is detected and an alarm can be made to drive a sounding unit. This device is called a field disturbance motion detection device. It and several other devices referenced by Amlung as prior art could have been used to detect vocal fold and other vocal organ motions as early as or earlier than 1975 in a fashion similar to the present invention. Neuwirth et al. describe similar devices.
Although it has been recognized for many decades in the field of speech recognition that speech organ position and motion information could be useful, and radar units were available to do the measurements for several decades, no one has previously suggested a speech recognition system using transmitted and reflected EM waves to detect motions and locations of speech organs and to use the information in an algorithm to identify speech.