Co-channel audio source separation is a challenging task in audio processing. For audio sources that exhibit acoustic structure, such as pitch, current methods operate on short-time frames of mixture signals (e.g., harmonic suppression, sinusoidal analysis, modulation spectrum [1-3]) or on single units of a time-frequency distribution (e.g., binary masking [4]).
Estimating the pitch values of concurrent sounds from a single recording is a fundamental challenge in audio processing. Typical approaches involve processing of short-time and band-pass signal components along single time or frequency dimensions [12].
The Grating Compression Transform (GCT) has been explored [5-8] primarily for single-source analysis and is consistent with physiological modeling studies implicating 2-D analysis of sounds by auditory cortex neurons [9]. Ezzat et al. performed analysis and synthesis of a single speaker as the source using two-dimensional (2-D) demodulation of the spectrogram [7]. In [8], an alternative 2-D modulation model for formant analysis was proposed. Phenomenological observations in [5, 6] also suggest that the GCT provides separability of multiple sources.
In [10], the GCT's ability to analyze multi-pitch signals is demonstrated. Finally, U.S. Pat. No. 7,574,352 to Thomas F. Quatieri, Jr., the teachings of which are incorporated by reference in their entirety, relates to determining pitch estimates of voiced speech [13].
FIG. 1A illustrates schematic diagrams of a 2-D frequency-related representation and compressed frequency-related representations as described in [13]. FIG. 1A shows an acoustic signal waveform 104. In order to obtain pitch estimates, a first frequency-related representation of the acoustic signal is obtained over time 106. A 2-D compressed frequency-related representation 108 is then obtained by performing a 2-D transform 107 on the first frequency-related representation 106 with respect to a localized portion 110. The localized portions 110 are illustrated as windows. Each window 110 includes a frequency range that is substantially less than the frequency range of the first representation. The compressed frequency-related representations 108 that result from taking a two-dimensional transform of the localized portions 110 correspond to respective windows 110 in the frequency-related representation of the original waveform (shown by arrows connecting each compressed frequency-related representation to its corresponding window).
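The localized-transform step above can be sketched in code. The following is a minimal illustration, not the patented implementation: it computes a magnitude spectrogram and then takes the 2-D Fourier transform of one localized time-frequency patch (window 110), which is the essence of the GCT. The function names, window sizes, and the toy harmonic signal are all assumptions chosen for illustration.

```python
import numpy as np

def spectrogram(x, frame_len=256, hop=64):
    """Magnitude short-time spectrum: rows = frequency bins, cols = time frames."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)], axis=1)
    return np.abs(np.fft.rfft(frames, axis=0))  # shape: (freq_bins, n_frames)

def gct_of_patch(spec, f0, t0, patch_freq=32, patch_time=32):
    """2-D Fourier transform of a localized time-frequency patch (the GCT idea)."""
    patch = spec[f0 : f0 + patch_freq, t0 : t0 + patch_time]
    patch = patch - patch.mean()  # remove the mean so the origin does not dominate
    return np.fft.fftshift(np.fft.fft2(patch))

# Toy harmonic signal: 200 Hz fundamental plus harmonics at 8 kHz sampling.
fs = 8000
t = np.arange(fs) / fs
x = sum(np.sin(2 * np.pi * 200 * k * t) for k in range(1, 6))
S = spectrogram(x)
G = gct_of_patch(S, f0=0, t0=0)
print(G.shape)  # (32, 32)
```

In a full system, such patches would be taken over many window positions, each yielding its own compressed representation 108, mirroring the arrows in FIG. 1A.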
FIG. 1B is a flow diagram of the process of speaker separation described in [13]. A frequency-related representation 105 of an acoustic signal 104 is determined. This may be done by computing a spectrogram, which provides a first frequency-related representation of the acoustic signal over time. Subsequently, a 2-D transform 107, such as the GCT, is performed on the first frequency-related representation 106 with respect to localized portions. The resulting signal 108 is processed 109 to produce a speaker-separated speech signal 129. Specifically, the GCT is analyzed to find the location with a maximum value, and the distance from the GCT origin to that maximum is determined 125. The reciprocal of the distance 126 is computed to produce a pitch estimate 127. The pitch estimate is then used 208 to produce a speaker-separated speech signal 129.
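The peak-picking and reciprocal-distance step (125-127) can be illustrated as follows. This is a hedged sketch under simplifying assumptions, not the method of [13]: harmonics spaced f0 Hz apart ripple with a period of f0/freq_res bins across a patch of n_f frequency bins, so the GCT peak lands n_f * freq_res / f0 samples from the origin, and inverting that distance recovers f0. The function name and the synthetic test patch are hypothetical.

```python
import numpy as np

def pitch_from_gct(gct, freq_res_hz):
    """Locate the GCT maximum away from the origin, measure its distance,
    and invert it to obtain a pitch estimate (illustrative sketch).
    freq_res_hz is the spectrogram's frequency-bin spacing in Hz."""
    n_f = gct.shape[0]
    mag = np.abs(gct)
    cy, cx = gct.shape[0] // 2, gct.shape[1] // 2  # origin after fftshift
    mag[cy, cx] = 0                                # ignore the DC term
    iy, ix = np.unravel_index(np.argmax(mag), mag.shape)
    dist = np.hypot(iy - cy, ix - cx)              # distance in GCT samples
    return n_f * freq_res_hz / dist                # reciprocal distance -> pitch

# Synthetic patch: a harmonic ripple with an 8-bin period along frequency,
# constant over time, with a 31.25 Hz bin spacing (8 kHz / 256-point frames).
n_f, n_t = 64, 32
freq_res = 31.25
f_idx = np.arange(n_f)[:, None]
patch = np.cos(2 * np.pi * f_idx / 8) * np.ones((1, n_t))
gct = np.fft.fftshift(np.fft.fft2(patch - patch.mean()))
print(pitch_from_gct(gct, freq_res))  # 250.0
```

An 8-bin harmonic spacing at 31.25 Hz per bin corresponds to a 250 Hz pitch, which the reciprocal-distance rule recovers exactly in this idealized case.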