This invention relates to detecting segments of usable speech in a speech-degraded environment, and, more specifically, to the detection of usable speech for identifying and separating out a speaker on a common channel carrying two or more simultaneous speakers and their corresponding speech patterns. Speech is defined as usable where an interfering signal (which may be speech or noise) does not significantly degrade the information content of the target speech. The prior art lacks a method and apparatus for making the decisions and algorithmic computations needed to extract usable speech and to identify each such speaker.
Most signal processing involves processing a signal without concern for its quality or information content. In speech processing, speech is processed on a frame-by-frame basis, usually with concern only for whether a frame contains speech or silence. However, knowing how reliable the information in a frame of speech is can be very important and useful. This is where usable speech detection and extraction can play a very important role. Usable speech frames can be defined as frames of speech that contain higher information content than unusable frames with respect to a particular application. The prior art lacks a speaker identification system that defines usable speech frames and then provides a method for identifying those frames as usable.
Speaker separation in an environment where multiple speakers speak simultaneously over a common channel has challenged researchers for thirty years. Traditional methods for speaker extraction from a common channel enhance the target (desired) speech or suppress the non-target (undesired) speech, or both. Various features, such as the speaker's voice pitch, have been used to (1) enhance the harmonic components of the target speaker's voice, (2) suppress the harmonic components of the non-target speaker's voice, or (3) simultaneously enhance and suppress the harmonic components of both speakers' voices. These methods then enable one skilled in the art to extract a particular speaker's voice from the composite of all speakers' voices on the channel.
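The pitch-based enhancement and suppression described above can be illustrated, under simplifying assumptions, with a feed-forward comb filter keyed to a speaker's pitch period. This is a minimal sketch, not the method of any particular prior-art system; the function names, the fixed gain, and the assumption of a known, sample-aligned pitch period are all illustrative.

```python
import numpy as np

def comb_enhance(frame, pitch_period, gain=0.5):
    """Reinforce a speaker's harmonics by adding a copy of the frame
    delayed by one pitch period (a feed-forward comb filter whose
    passbands fall on the harmonics of the pitch)."""
    delayed = np.concatenate([np.zeros(pitch_period), frame[:-pitch_period]])
    return frame + gain * delayed

def comb_suppress(frame, pitch_period, gain=0.5):
    """Attenuate an interfering speaker's harmonics by subtracting the
    pitch-delayed copy instead of adding it, placing notches on the
    harmonics of the interferer's pitch."""
    delayed = np.concatenate([np.zeros(pitch_period), frame[:-pitch_period]])
    return frame - gain * delayed
```

Applied to a voiced frame whose period matches `pitch_period`, the first filter raises the frame energy and the second lowers it, which is the basic mechanism behind both the enhancement and the suppression strategies described above.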
There are many drawbacks to these prior-art approaches to speaker separation. First, they have historically treated the entire speech detection process as being co-channel at all times. Though this approach yields results, it is suboptimal: at times only one of several speakers may be speaking on the channel, so no other speaker interferes with the target speech, and the channel actually contains usable co-channel speech. Processing such segments as though they were degraded obtains results only at the expense of performance, efficiency, and accuracy.
Furthermore, the prior art does not discriminate between usable and unusable segments of co-channel speech. Rather, all incoming co-channel speech is processed by enhancing target speech, suppressing non-target speech, or both. The result is that a segment of usable co-channel speech (i.e., two or more contiguous frames of usable speech) can become so degraded that information is lost through processing. Efficiency and speed of detection are sacrificed, and processing resources are wasted.
Historically, the prior art has not examined the structure of co-channel speech as part of the speaker detection and extraction process. Mid-1970s approaches to speech extraction examined relatively short frames of co-channel speech, about 10 to 30 milliseconds in duration, in which the target speech was enhanced. Methods to suppress non-target speech were developed in the 1980s, but they still processed relatively short (10 to 30 millisecond) co-channel speech frames.
Today, co-channel speech detection and extraction combines both target-speaker enhancement and non-target-speaker suppression through filtering. Co-channel speech is processed by computer to yield an output without any decision being made about the speech itself. The prior art takes no advantage of any possible fusion of time-, cepstral-, and frequency-domain attributes of a given sample of speech to identify usable segments.
In an operational environment, speech is degraded by many kinds of interference. The operation of many speech processing techniques is plagued by such interference. Usable speech extraction is a novel concept for processing degraded speech data: the idea is to identify and extract those portions of degraded speech that are useful to various speech processing systems. Yantorno [1] performed a study on co-channel speech and concluded that the Target-to-Interferer Ratio (TIR) was a good measure for quantifying usability for speaker identification. However, the TIR is not an observable value [1] from the co-channel speech data. A number of methods, termed usable speech measures, which serve as indicators of the TIR, have been developed and studied under co-channel conditions [2, 3, 4, 5, 6]. These measures are used as features in decision fusion systems to make an overall decision [7, 8]. Along similar lines, the effects of silence removal on the performance of speaker recognition were studied in [9]. In all of the methods mentioned above, usability in speech is considered to be application independent. However, the concept of usable speech is by definition application dependent; i.e., speech that is usable for speech recognition may not be usable for speaker identification, and vice versa.
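As a rough illustration of why the TIR quantifies usability yet is not observable from the mixture itself, a frame-level TIR can be computed only when the separate target and interferer signals are in hand. The sketch below assumes exactly that; the frame length and the 20 dB usability threshold are assumed illustrative values, not figures taken from the cited studies.

```python
import numpy as np

def frame_tir_db(target, interferer):
    """Target-to-Interferer Ratio for one frame, in decibels.
    Requires the separate target and interferer signals, which is why
    the TIR cannot be measured directly from co-channel speech."""
    return 10.0 * np.log10(np.sum(target ** 2) / np.sum(interferer ** 2))

def label_usable(target, interferer, frame_len=240, threshold_db=20.0):
    """Label each non-overlapping frame usable when its TIR exceeds a
    threshold (20 dB here is an assumed illustrative cutoff)."""
    labels = []
    n_frames = min(len(target), len(interferer)) // frame_len
    for i in range(n_frames):
        t = target[i * frame_len:(i + 1) * frame_len]
        v = interferer[i * frame_len:(i + 1) * frame_len]
        labels.append(frame_tir_db(t, v) >= threshold_db)
    return labels
```

Usable speech measures, in this framing, are features computed from the mixture alone that correlate with these hidden per-frame TIR labels.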