Audio processing generally refers to enhancing or analyzing audio signals for a variety of purposes, such as improving audio quality, removing noise or echo, providing full duplex communication systems, or improving the results of audio analysis engines and tools such as continuous speech recognition, word spotting, emotion analysis, speaker recognition or verification, and others. On a higher level, improving the results of such audio analysis engines and tools, and optionally using further tools such as text analysis tools to process the resulting text, may assist in retrieving information associated with or embedded within captured or recorded audio signals. Such information may prove useful in achieving many personal or business purposes, such as trend analysis, competition analysis, quality assurance, improving service, root cause analysis of problems conveyed over voice interactions, or others.
Many audio signals to be processed contain the speech of one or more speakers, who may at times talk simultaneously, and may also include music, tones, background noise on either side of the interaction, or the like. Audio analysis performance, as measured in terms of accuracy, detection, real-time efficiency and resource efficiency, depends heavily on the quality and integrity of the input audio signals, on the available computing resources and on the capabilities of the computer programs that constitute the audio analysis process.
Many of the analysis tasks are highly sensitive to the audio quality of the processed interactions. Multiple speakers, music, which is often present during hold periods, tones, background noises such as street noise and ambient noise, convolutional noises such as those introduced by channel type and handset type, keystrokes and the like, severely degrade the performance of these engines, sometimes to the point of complete uselessness.
One of the most in-demand yet challenging audio analysis applications is continuous speech recognition. Currently known speech recognition systems achieve reasonable performance in a noise-free environment, in which speech segments are easily recognized. A noise-free signal can be obtained, for example, by using a close-talking microphone worn near the mouth of the speaker.
Speech recognition systems may comprise, as a first stage, a phoneme recognition engine, and in particular a classification engine such as a Gaussian Mixture Model (GMM), a Hidden Markov Model (HMM), or others, for extracting phonemes out of an audio signal, wherein the engine outputs the most probable phoneme sequence for an input audio signal. Such classification engines operate under the assumption that the speech signal can be viewed as a piecewise stationary signal or a short-time stationary signal. Therefore, although the engines can generally report that a segment is silent and does not contain audio, they are not designed to indicate that the signal contains noise rather than speech, and will not avoid outputting phonemes. However, noise is generally not a stationary signal. Consequently, when noise is input into a classification engine, the engine still outputs a phoneme sequence, albeit a rather random or incidental one, which degrades the overall recognition quality.
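The forced-output behavior described above can be illustrated with a minimal sketch. It assumes a toy classifier with one single-dimensional Gaussian per phoneme, which is far simpler than a real GMM or HMM engine; the phoneme labels, model parameters, feature values and the names `PHONEME_MODELS`, `log_likelihood` and `classify_frame` are all hypothetical and chosen only for illustration.

```python
import math

# Hypothetical per-phoneme models: (mean, standard deviation) of a single
# scalar feature. Values are illustrative only, not taken from any real engine.
PHONEME_MODELS = {
    "aa": (1.0, 0.5),
    "iy": (3.0, 0.5),
    "s": (6.0, 0.8),
}


def log_likelihood(x, mean, std):
    """Log-density of a one-dimensional Gaussian evaluated at x."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)


def classify_frame(x):
    """Return the most probable phoneme for feature x, together with its score.

    The classifier always returns some label: it has no option to answer
    "this frame is noise, not speech".
    """
    scores = {p: log_likelihood(x, m, s) for p, (m, s) in PHONEME_MODELS.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# A frame close to a trained phoneme, and a frame far from every model
# (standing in for noise).
speech_label, speech_score = classify_frame(1.1)
noise_label, noise_score = classify_frame(40.0)

print(speech_label, round(speech_score, 2))
print(noise_label, round(noise_score, 2))
```

In this sketch the noise-like frame still receives a phoneme label, although with a much lower likelihood score than the speech-like frame. The label alone carries no indication that the input was not speech.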
Some known solutions for handling signals with low Signal-to-Noise Ratio (SNR) include the use of multiple microphones to capture the speech signal. Array-signal-processing techniques have been developed to combine multiple signals, as captured by an array of microphones, to achieve high quality results. Microphone array-based speech recognition may be performed in two independent stages: processing the captured waveforms, followed by passing the optionally-combined and improved output waveform to the speech recognition system.
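As one possible illustration of the first stage, the sketch below uses delay-and-sum combining, a basic array-processing technique that the text above does not name specifically and that is shown here only as an example. It assumes the per-microphone delays are already known as whole numbers of samples and that the noise is independent across microphones; the signal, noise level, delays and the function names `delay_and_sum` and `mean_squared_error` are hypothetical. The second stage, passing the combined waveform to a speech recognition system, is not shown.

```python
import math
import random


def delay_and_sum(channels, delays):
    """Align each channel by its integer sample delay and average the channels."""
    length = min(len(ch) - d for ch, d in zip(channels, delays))
    return [
        sum(ch[d + n] for ch, d in zip(channels, delays)) / len(channels)
        for n in range(length)
    ]


def mean_squared_error(a, b):
    """Mean squared difference over the overlapping part of two sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / min(len(a), len(b))


random.seed(0)

# Synthetic "clean" signal standing in for speech (assumption for illustration).
clean = [math.sin(2 * math.pi * n / 20) for n in range(200)]

# Assumed known arrival delays, in samples, at four hypothetical microphones.
delays = [0, 2, 4, 6]

# Each microphone records the delayed clean signal plus independent noise.
channels = [
    [random.gauss(0.0, 0.5) for _ in range(d)]
    + [s + random.gauss(0.0, 0.5) for s in clean]
    for d in delays
]

combined = delay_and_sum(channels, delays)

single_error = mean_squared_error(channels[0], clean)
combined_error = mean_squared_error(combined, clean)
print(single_error, combined_error)
```

Because the aligned speech components add coherently while the independent noise components partially average out, the combined waveform is expected to be closer to the clean signal than any single microphone. In practice the delays would have to be estimated, which this sketch does not attempt.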
However, not all audio required for processing purposes can be captured or recorded in noise-free environments. For example, it is often impractical or inconvenient for a speaker to wear a close-talking microphone, and as the distance between the speaker and the microphone increases, the speech signal may become increasingly susceptible to background noise and reverberation effects that significantly degrade speech recognition accuracy. This is especially problematic in situations where the locations of the microphones or the users are dictated by physical constraints of the environment, as in meetings, automobiles, on the street, or the like. In such situations, it is impractical to provide and use a plurality of microphones, and only a single input audio signal is available.
There is thus a need for a method and apparatus that enable obtaining improved results from an identification engine such as a phoneme identification engine, even for signals with low SNR.