The present invention relates to a system and method for tracking individual voices in a group of voices through time so that the spoken message of an individual may be selected and extracted from the sounds of the other competing talker""s voices.
When listeners (whether they be human or machine) attempt to identify a single taker""s speech sounds that are imbedded in a mixture of sounds spoken by other takers, it is often very difficult to identify the specific sounds produced by the target talker. In this instance, the signal that the listener is trying to identify and the xe2x80x9cnoisexe2x80x9d the listener is trying to ignore have very similar spectral and temporal properties. Thus, simple filtering techniques to remove the noise are not able to remove only the unwanted noise without also removing the intended signal
Examples of situations where this poses a significant problem include operation of voice recognition software and hearing aids in noisy environments where multiple voices are present. Both hearing-impaired human listeners and machine speech recognition systems exhibit considerable speech identification difficulty in this type of multi-talker environment. Unfortunately, the only way to improve the speech understanding performance for these listeners is to identify the talker of interest and isolate just this voice from the mixture of competing voices. For stationary sounds, this may be possible. However, fluent speech exhibits rapid changes over relatively short time periods. To separate a single talker""s voice from the background mixture, there must therefore exist a mechanism that tracks each individual voice through time so that the unique sounds and properties of that voice may be reconstructed and presented to the listener. While there are currently available several models and mechanisms for speech extraction, none of these systems specifically attempt to put together the speech sounds of each individual talker as they occur through time.
To solve the foregoing problem, the present invention provides a system and method for tracking each of the individual voices in a multi-talker environment so that any of the individual voices may be selected for additional processing. The solution that has been developed is to estimate the fundamental frequencies of each of the voices present using a conventional analysis method and, then follow the trajectories of each individual voice through time using a neural network prediction technique. The result of this method is a time-series prediction model that is capable of tracking multiple voices through time, even if the pitch trajectories of the voices cross over one another, or appear to merge and then diverge.
In a preferred embodiment of the invention, the acoustic speech waveform comprised of the multiple voices to be identified is first analyzed to identify and estimate the fundamental frequency of each voice present in the waveform. Although this analysis can be carried out by using a frequency domain analysis technique, such as a Fast Fourier Transform (FFT), it is preferable to use a time domain analysis technique to increase processing speed, and decrease complexity and cost of the hardware or software employed to implement the invention. More preferably, the waveform is submitted to an average magnitude difference function (AMDF) calculation which subtracts successive time shifted segments of the waveform from the waveform itself As a person speaks, the amplitude of their voice oscillates at a fundamental frequency. As a result, because the AMDF calculation is subtractive, the pitch period of a particular voice will produce a small value near the frequency period F0 of the voice since the AMDF at that point is effectively subtracting a value from itself After the AMDF is calculated, the F0 of each voice present can then be estimated as the inverse of the AMDF minima.
Once the fundamental frequencies of the individual voices have been identified and estimated, the next step implemented by the system is to track the voices through time. This would be a simple matter if each voice was of a constant pitch, however, the pitch of an individual""s voice changes slowly over time as they speak. In addition, when multiple people are simultaneously speaking, it is quite common for the pitches of their voices to cross over each other in frequency as one person""s voice pitch is rising, while another""s is falling. This makes it extremely difficult to track the individual voices accurately.
To solve this problem, the present invention tracks the voices through use of a recursive neural network that predicts how each voice""s pitch will change in the future, based on past behavior. The recursive neural network predicts the F0 value for each voice at the next windowed segment. Because the predicted values are constrained by the frequency values of prior analysis frames, the F0 tracks tend to change smoothly, with no abrupt discontinuities in the trajectories. This follows what is normally observed with natural speech: the F0 contours of natural speech do not change abruptly, but vary smoothly over time. In this manner, the neural network thus predicts the next time value of the F0 for each talker""s F0 track.
The output from the neural network thus comprises tracking information for each of the voices present in the analyzed waveform This information can either be stored for future analysis, or can be used directly in real time by any suitable type of voice filtering or separating system for selective processing of the individual speech signals. For example, the system can be implemented in a digital signal processing chip within a hearing aid for selective amplification of an individual""s voice. Although the neural network output can be used directly for tracking of the individual voices, the system can also use the AMDF calculation circuit to estimate the F0 for each of the voices, and then use the neural network output to assign each of the AMDF-estimated F0""s to the correct voice.