1. Field of the Invention
The present invention relates generally to systems and methods for automatically describing human speech, and more particularly to systems and methods for characterizing voiced excitations of speech and acoustic signals, removing acoustic noise from speech, and synthesizing human/animate speech.
2. Discussion of Background Art
Sound characterization, simulation, and noise removal relating to human speech are important, ongoing areas of research and commercial practice. The use of EM sensors and acoustic microphones for purposes of human speech characterization has been described in the referenced U.S. patent application Ser. No. 08/597,596, which is incorporated herein by reference. Said patent application describes methods by which EM sensors can measure positions versus time of human speech articulators, along with substantially simultaneously measured acoustic speech signals, for purposes of more accurately characterizing each segment of human speech. Furthermore, said patent application describes valuable applications of said EM sensor and acoustic methods for purposes of improved speech recognition, coding, speaker verification, and other applications.
A second related U.S. patent, issued on Mar. 17, 1998 as U.S. Pat. No. 5,729,694, titled "Speech Coding, Reconstruction and Recognition Using Acoustics and Electromagnetic Waves," by J. F. Holzrichter and L. C. Ng, is also incorporated herein by reference. Patent '694 describes methods by which speech excitation functions of humans (or similar animate objects) are characterized using EM sensors, and the substantially simultaneously measured acoustic speech signal is then characterized using generalized signal processing techniques. The excitation characterizations described in '694, as well as in application Ser. No. 08/597,596, rely on associating experimental measurements of glottal tissue interface motions with models to determine an air pressure or airflow excitation function. The measured glottal tissue interfaces include vocal folds, related muscles, tendons, and cartilage, as well as sections of a windpipe (e.g., glottal region) directly below and above the vocal folds.
The procedures described in application Ser. No. 08/597,596 enable new and valuable methods for characterizing the substantially simultaneously measured acoustic speech signal, by using the non-acoustic EM signals from the articulators and acoustic structures as additional information. Those procedures use the excitation information, other articulator information, mathematical transforms, and other numerical methods, and describe the formation of feature vectors that numerically describe each speech unit, over each defined time frame, using the combined information. This characterizing speech information is then related to methods and systems, described in said patents and applications, for improving speech application technologies such as speech recognition, speech coding, speech compression, synthesis, and many others.
Another important patent application that is herein incorporated by reference is U.S. patent application Ser. No. 09/205,159, entitled "System and Method for Characterizing, Synthesizing, and/or Canceling Out Acoustic Signals From Inanimate Sound Sources," filed on Dec. 2, 1998 by G. C. Burnett, J. F. Holzrichter, and L. C. Ng. That application relates generally to systems and methods for characterizing, synthesizing, and/or canceling out acoustic signals from inanimate sound sources, and more particularly to using electromagnetic and acoustic sensors to perform such tasks.
Existing acoustic speech recognition systems suffer from inadequate information for recognizing words and sentences with high probability. The performance of such systems also drops rapidly when noise from machines, other speakers, echoes, airflow, and other sources is present.
In response to the concerns discussed above, what is needed is a system and method for automated human speech that overcomes the problems of the prior art. The inventions herein describe systems and methods to improve speech recognition and other related speech technologies.
The present invention is a system and method for characterizing voiced speech excitation functions (human or animate) and acoustic signals, for removing unwanted acoustic noise from a speech signal which often occurs when a speaker uses a microphone in common environments, and for synthesizing personalized or modified human (or other animate) speech upon command from a controller.
The system and method of the present invention is particularly advantageous because a low power EM sensor detects tissue motion in a glottal region of a human speech system before, during, and after voiced speech. Such tissue motion is easier to detect than motion of the glottis itself. From these measurements, a human voiced excitation function can be derived. The EM sensor can be optimized to measure sub-millimeter motions of wall tissues in either a sub-glottal or supra-glottal region (i.e., below or above vocal folds), as vocal folds oscillate (i.e., during a glottal open-close cycle). Motions of the sub-glottal wall or supra-glottal wall provide information on glottal cycle timing, on air pressure determination, and for constructing a voiced excitation function. Herein, the terms glottal EM sensor, glottal radar, and GEMS (i.e., glottal electromagnetic sensor) are used interchangeably.
Air pressure increases and decreases in the sub-glottal region, as vocal folds close (obstructing airflow) and then open again (enabling airflow), causing the sub-glottal walls to expand and then contract by dimensions ranging from less than 0.1 mm up to 1 mm. In particular, a rear wall (posterior) section of a trachea is observed to respond directly to increases in sub-glottal pressure as vocal folds close. Timing of air pressure increase is directly related to vocal fold closure (i.e., glottal closure). Herein "trachea" and "sub-glottal windpipe" refer to the same set of tissues. Similarly, supra-glottal walls in a pharynx region expand and contract, but in opposite phase to sub-glottal wall motion. For this document "pharynx" and the "supra-glottal region" are synonyms; also, "time segment" and "time frame" are synonyms.
Methods of the present invention describe how to obtain an excitation function by using a particular tissue motion associated with glottis opening and closing. These are wall tissue motions, which are measured by EM sensors, and then associated with air pressure versus time. This air pressure signal is then converted to an excitation function of voiced speech, which can be parameterized and approximated as needed for various applications. Wall motions are closely associated with glottal opening and closing and glottal tissue motions.
The windpipe tissue signals from the EM sensor also describe periods of no speech or of unvoiced speech. Using the statistics of the user's language, the user of these methods can estimate, to a high degree of certainty, time periods wherein no vocal-fold motion means time periods of no speech, and time periods where unvoiced speech is likely. In addition, unvoiced speech presence and qualities can be determined using information from the EM sensor measuring glottal region wall motion, from a spectrum of a corresponding acoustic signal, and (if used) signals from other EM sensors describing processes of vocal fold retraction, or pharynx diameter enlargement, jaw motions, or similar activities.
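The frame labeling described above can be illustrated with a minimal sketch. It is not the method of the invention; it only assumes that the glottal EM sensor signal and the acoustic signal are available as equal-rate sample lists, and it uses a simple per-frame energy threshold as a stand-in for the statistical estimate. The function names, the thresholds, and the label strings are hypothetical.

```python
from typing import List


def frame_energy(samples: List[float], frame_len: int) -> List[float]:
    """Mean squared amplitude of each non-overlapping frame (trailing partial frame dropped)."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame) / frame_len)
    return energies


def label_frames(em_signal: List[float],
                 acoustic_signal: List[float],
                 frame_len: int,
                 em_threshold: float,
                 acoustic_threshold: float) -> List[str]:
    """Label each time frame from EM (wall motion) and acoustic energy.

    Assumption for illustration only: wall motion above em_threshold means
    voiced speech; no wall motion but acoustic energy above
    acoustic_threshold marks a candidate for unvoiced speech; otherwise
    the frame is treated as no speech.
    """
    labels = []
    for e_em, e_ac in zip(frame_energy(em_signal, frame_len),
                          frame_energy(acoustic_signal, frame_len)):
        if e_em >= em_threshold:
            labels.append("voiced")
        elif e_ac >= acoustic_threshold:
            labels.append("unvoiced_candidate")
        else:
            labels.append("no_speech")
    return labels
```

In practice the thresholds would have to be set per user and per sensor, and an unvoiced candidate frame would still need the spectral and language-statistics checks described above before being accepted as unvoiced speech.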
The EM sensor signals that describe vocal tract tissue motions can also be used to determine acoustic signals being spoken. Vocal tract tissue walls (e.g., pharynx or soft palate), and/or tissue surfaces (e.g., tongue or lips), and/or other tissue surfaces connected to vocal tract wall-tissues (e.g., neck-skin or outer lip surfaces), vibrate in response to acoustic speech signals that propagate in the vocal tract. The EM sensors described in the '596 patent and elsewhere herein, and also methods of tissue response-function removal, enable determination of acoustic signals.
The invention characterizes qualities of a speech environment, such as background noise and echoes from electronic sound systems, separately from a user's speech so as to enable noise and echo removal. Background noise can be characterized by its amplitude versus time, its spectral content over determined time frames, and the correlation times with respect to its own time sequences and to the user's acoustic and EM sensed speech signals. The EM sensor enables removal of noise signals from voiced and unvoiced acoustic speech. EM sensed excitation functions provide speech production information (i.e., amplitude versus time information) that gives an expected continuity of a speech signal itself. The excitation functions also enable accurate methods for acoustic signal averaging over time frames of similar speech and threshold setting to remove impulsive noise. The excitation functions also can employ "knowledge filtering" techniques (e.g., various model-based signal processing and Kalman filtering techniques) and remove signals that don't have expected behavior in time or frequency domains, as determined by either excitation or transfer functions. The excitation functions enable automatic setting of gain, threshold testing, and normalization levels in automatic speech processing systems for both acoustic and EM signals, and enable obtaining ratios of voiced to unvoiced signal power levels for each individual user. A voiced speech excitation signal can be used to construct a frequency filter to remove noise, since an acoustic speech spectrum is restricted by spectral content of its corresponding excitation function. In addition, the voiced speech excitation signal can be used to construct a real time filter, to remove noise outside of a time domain function based upon the excitation impulse response. These techniques are especially useful for removing echoes or for automatic frequency correction of electronic audio systems.
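As one possible illustration of the frequency filter mentioned above, the sketch below keeps only those frequency bins of an acoustic frame in which the corresponding excitation frame has significant spectral content. This is a simplified stand-in under stated assumptions, not the filter of the invention: it assumes one voiced frame of equal length for both signals, uses a naive discrete Fourier transform for self-containment, and the names and the rel_floor parameter are hypothetical.

```python
import cmath
from typing import List


def dft(x: List[float]) -> List[complex]:
    """Naive O(n^2) discrete Fourier transform of a real sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft_real(spectrum: List[complex]) -> List[float]:
    """Inverse DFT, returning the real part of each output sample."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]


def excitation_mask(excitation: List[float], rel_floor: float = 0.05) -> List[bool]:
    """True for bins whose excitation magnitude is at least rel_floor times the peak."""
    mags = [abs(c) for c in dft(excitation)]
    peak = max(mags)
    if peak == 0.0:
        return [False] * len(mags)  # no excitation energy: nothing passes
    return [m >= rel_floor * peak for m in mags]


def filter_with_excitation(acoustic: List[float],
                           excitation: List[float],
                           rel_floor: float = 0.05) -> List[float]:
    """Zero the acoustic bins in which the excitation has no significant content."""
    if len(acoustic) != len(excitation):
        raise ValueError("frames must have equal length")
    mask = excitation_mask(excitation, rel_floor)
    kept = [c if keep else 0j for c, keep in zip(dft(acoustic), mask)]
    return idft_real(kept)
```

A hard binary mask is the simplest choice for a sketch; a smoother weighting of each bin would likely reduce audible artifacts. The mask applies to voiced frames only, since a frame with no excitation energy is zeroed entirely.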
Using the present invention's methods of determining a voiced excitation function (including determining pressure or airflow excitation functions) for each speech unit, a parameterized excitation function can be obtained. Its functional form and characteristic coefficients (i.e., parameters) can be stored in a feature vector. Similarly, a functional form and its related coefficients can be selected for each transfer function and/or its related real-time filter, and then stored in a speech unit feature vector. For a given vocabulary and for a given speaker, feature vectors having excitation, transfer function, and other descriptive coefficients can be formed for each needed speech unit, and stored in a computer memory, code-book, or library.
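The storage arrangement described above might be organized as in the following sketch. The class names, field names, and the functional-form labels used in the usage example are hypothetical and chosen only for illustration; the text does not prescribe a particular layout.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class SpeechUnitFeatureVector:
    """Parameters describing one speech unit for one speaker (illustrative layout)."""
    unit: str                       # e.g. a phone or diphone label
    excitation_form: str            # name of the excitation functional form
    excitation_coeffs: List[float]  # characteristic coefficients of that form
    transfer_form: str              # name of the transfer function (or filter) form
    transfer_coeffs: List[float]    # coefficients of the transfer function or filter
    other: Dict[str, float] = field(default_factory=dict)  # other descriptive values


class Codebook:
    """In-memory library of feature vectors keyed by (speaker, speech unit)."""

    def __init__(self) -> None:
        self._entries: Dict[Tuple[str, str], SpeechUnitFeatureVector] = {}

    def store(self, speaker: str, vector: SpeechUnitFeatureVector) -> None:
        self._entries[(speaker, vector.unit)] = vector

    def recall(self, speaker: str, unit: str) -> SpeechUnitFeatureVector:
        return self._entries[(speaker, unit)]  # raises KeyError if never stored

    def units_for(self, speaker: str) -> List[str]:
        return sorted(u for s, u in self._entries if s == speaker)
```

For example, `Codebook().store("speaker_1", SpeechUnitFeatureVector("ah", "form_a", [0.8, 0.1], "form_b", [1.0, -0.5]))` stores one unit, which `recall("speaker_1", "ah")` later returns unchanged.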
Speech can be synthesized by using a control algorithm to recall stored feature vector information for a given vocabulary, and to form concatenated speech segments with desired prosody, intonation, interpolation, and timing. Such segments can be comprised of several concatenated speech units. A control algorithm, sometimes called a text-to-speech algorithm, directs selection and recall of feature vector information from memory, and/or modification of information needed for each synthesized speech segment. The text-to-speech algorithm also describes how stored information can be interpolated to derive excitation and transfer function or filter coefficients needed for automatically constructing a speech sequence.
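The recall-and-interpolate step described above can be sketched as follows. This is only an assumed, simplified control loop: each stored speech unit is reduced to a single coefficient list, and transitions between units use plain linear interpolation, which is one of many possible interpolation rules. The function names and the codebook layout are hypothetical.

```python
from typing import Dict, List


def interpolate_coeffs(a: List[float], b: List[float], steps: int) -> List[List[float]]:
    """Return `steps` coefficient vectors evenly spaced strictly between a and b."""
    if len(a) != len(b):
        raise ValueError("coefficient vectors must have equal length")
    frames = []
    for i in range(1, steps + 1):
        frac = i / (steps + 1)
        frames.append([x + (y - x) * frac for x, y in zip(a, b)])
    return frames


def build_sequence(units: List[str],
                   codebook: Dict[str, List[float]],
                   transition_steps: int) -> List[List[float]]:
    """Recall stored coefficients for each unit and insert interpolated transitions."""
    sequence: List[List[float]] = []
    for index, unit in enumerate(units):
        current = codebook[unit]  # recall from memory; KeyError if unit is missing
        if index > 0:
            previous = codebook[units[index - 1]]
            sequence.extend(interpolate_coeffs(previous, current, transition_steps))
        sequence.append(list(current))
    return sequence
```

Each vector in the returned sequence would then drive the excitation and transfer function (or filter) for one synthesized time frame; prosody and timing control are omitted from this sketch.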
The present invention's method of speech synthesis, in combination with measured and parameterized excitation functions and a real time transfer function filter for each speech unit, as described in U.S. patent application Ser. No. 09/205,159 and in U.S. patent '694, enables prosodic and intonation information to be applied to synthesized speech sequences easily, enables compact storage of speech units, and enables interpolation of one sound unit to another unit, as they are formed into smoothly changing speech unit sequences. Such sequences are often comprised of phones, diphones, triphones, syllables, or other patterns of speech units.
These and other aspects of the invention will be recognized by those skilled in the art upon review of the detailed description, drawings, and claims set forth below.