This invention generally relates to machine-based speech and speaker recognition and, more particularly, to machine-based speech recognition using a learned relationship between acoustic and articulatory parameters. This invention is the result of a contract with the Department of Energy (Contract No. W-7405-ENG-36).
In conventional speech recognition, input acoustic waveforms are obtained and quantized for comparison with stored acoustic waveforms having correlative dictionary data. This approach requires substantial computer memory and requires that the speech patterns of the speaker be sufficiently similar to the stored patterns that a pattern match can be obtained. However, it will be appreciated that speech acoustics are affected by speaking rate, linguistic stress, emphasis, intensity, and emotion. Further, in fluent speech, speakers normally modify speech sounds by adding or deleting sounds and by assimilating sounds adjacent to each other.
U.S. Pat. No. 4,769,845, "Method of Recognizing Speech Using a Lip Image," issued Sep. 6, 1988, to Nakamura, teaches speech recognition using a computer with a relatively small memory capacity where a recognition template is formed to include at least lip pattern data. Lip pattern data are initially obtained from external sensor equipment and are collated with stored word templates. However, speech recognition then requires that an external device, such as a TV camera, also ascertain lip pattern data from the speaker whose speech is to be recognized.
It is hypothesized in J. L. Elman et al., "Learning the Hidden Structure of Speech," 83 J. Acoust. Soc. Am., No. 4, pp. 1615-1626 (April 1988), that inappropriate features have been selected as units for recognizing and representing speech. A backpropagation neural network learning procedure is applied to develop a relationship between input/output pattern pairs using only a single input time series. The network developed rich internal representations that included hidden units corresponding to such traditional distinctions as vowels and consonants. However, only abstract relationships were developed since only acoustic tokens were input.
It would be desirable to provide a representation of a speech signal that is relatively invariant under variations in speech rate, stress, and phonetic environment. It would also be desirable to train a system, e.g., an artificial neural network, to recognize speech independent of the speaker. These and other aspects of speech recognition are addressed by the present invention, wherein a relationship is learned between an acoustic signal and the articulatory mechanisms that generate it. The learned articulatory representation is then used to recognize speech by other speakers based solely on acoustic inputs.
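By way of illustration only, and not as a description of the implementation of the invention, the notion of learning a relationship between acoustic and articulatory parameters may be sketched as a small backpropagation network trained on paired data. Everything in the sketch is an assumption made for the example: the function synth_acoustics is a hypothetical stand-in for measured speech, the two articulatory values and three acoustic features are arbitrary, and the network size, learning rate, and epoch count are chosen only to keep the example short.

```python
import math
import random


def synth_acoustics(artic):
    """Hypothetical forward map: 2 articulatory values -> 3 acoustic features.
    This stands in for real paired measurements, which are not modeled here."""
    a, b = artic
    return [math.tanh(a + 0.5 * b), math.tanh(a - b), 0.5 * a * b]


def init_net(n_in, n_hidden, n_out, rng):
    """Create a one-hidden-layer network with small random weights."""
    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
    return {"W1": mat(n_hidden, n_in), "b1": [0.0] * n_hidden,
            "W2": mat(n_out, n_hidden), "b2": [0.0] * n_out}


def forward(net, x):
    """Return (hidden activations, output) for one acoustic input vector."""
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(net["W1"], net["b1"])]
    y = [sum(w * hi for w, hi in zip(row, h)) + b
         for row, b in zip(net["W2"], net["b2"])]
    return h, y


def mse(net, data):
    """Mean over samples of the summed squared error of the output vector."""
    total = 0.0
    for x, t in data:
        _, y = forward(net, x)
        total += sum((yi - ti) ** 2 for yi, ti in zip(y, t))
    return total / len(data)


def train(net, data, epochs=200, lr=0.05):
    """Per-sample gradient descent with backpropagation of the squared error."""
    for _ in range(epochs):
        for x, t in data:
            h, y = forward(net, x)
            dy = [yi - ti for yi, ti in zip(y, t)]  # output-layer error
            # Backpropagate through tanh before the output weights are changed.
            dh = [(1.0 - h[j] ** 2) *
                  sum(dy[k] * net["W2"][k][j] for k in range(len(dy)))
                  for j in range(len(h))]
            for k in range(len(dy)):
                for j in range(len(h)):
                    net["W2"][k][j] -= lr * dy[k] * h[j]
                net["b2"][k] -= lr * dy[k]
            for j in range(len(h)):
                for i in range(len(x)):
                    net["W1"][j][i] -= lr * dh[j] * x[i]
                net["b1"][j] -= lr * dh[j]
    return net


rng = random.Random(0)
artic = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(100)]
# Training pairs: acoustic features as input, articulatory values as target.
data = [(synth_acoustics(a), a) for a in artic]

net = init_net(3, 8, 2, rng)
loss_before = mse(net, data)
train(net, data)
loss_after = mse(net, data)
print("error before training:", loss_before)
print("error after training:", loss_after)
```

Once trained, such a network is given acoustic features alone and returns an estimate of the articulatory values, which mirrors the stated aim of recognizing speech from acoustic inputs without gestural input at recognition time.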
Accordingly, it is an object of the present invention to provide speech recognition under variations in speech rate, stress, and phonetic environment.
It is another object of the present invention to provide a learned relationship between speech acoustics and articulatory mechanisms.
One other object of the present invention is to obtain speech recognition from learned articulatory gestures without gestural input.
Additional objects, advantages and novel features of the invention will be set forth in part in the description which follows, and in part will become apparent to those skilled in the art upon examination of the following or may be learned by practice of the invention. The objects and advantages of the invention may be realized and attained by means of the instrumentalities and combinations particularly pointed out in the appended claims.