1. Field
This disclosure relates to speech processing.
2. Background
An information signal may be captured in an environment that is unavoidably noisy. Consequently, it may be desirable to distinguish an information signal from among superpositions and linear combinations of several source signals, including a signal from a desired information source and signals from one or more interference sources. Such a problem may arise in various acoustic applications for voice communications (e.g., telephony).
One approach to separating a signal from such a mixture is to formulate an unmixing matrix that approximates an inverse of the mixing environment. However, realistic capturing environments often include effects such as time delays, multipaths, reflection, phase differences, echoes, and/or reverberation. Such effects produce convolutive mixtures of source signals that may cause problems with traditional linear modeling methods and may also be frequency-dependent. It is desirable to develop signal processing methods for separating one or more desired signals from such mixtures.
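The limitation described above can be illustrated with a minimal sketch, which is not part of the disclosed method. It assumes a hypothetical two-source, two-microphone arrangement; the mixing gains in `A`, the echo taps in `conv_filters`, and the short sample sequences are invented for illustration only. The sketch forms an unmixing matrix as the inverse of an instantaneous mixing matrix, then applies the same matrix to a mixture that also contains delayed paths.

```python
def convolve(h, x):
    """Causal FIR filtering of x by impulse response h, truncated to len(x)."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]

def mix(filters, sources):
    """filters[i][j] is the impulse response from source j to microphone i."""
    n = len(sources[0])
    mics = []
    for row in filters:
        paths = [convolve(h, s) for h, s in zip(row, sources)]
        mics.append([sum(p[t] for p in paths) for t in range(n)])
    return mics

def invert_2x2(a):
    """Inverse of a 2x2 matrix; raises if the matrix is (nearly) singular."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    if abs(det) < 1e-12:
        raise ValueError("mixing matrix is not invertible")
    return [[a[1][1] / det, -a[0][1] / det],
            [-a[1][0] / det, a[0][0] / det]]

def unmix(w, mics):
    """Apply a 2x2 unmixing matrix w to the two microphone signals."""
    n = len(mics[0])
    return [[w[i][0] * mics[0][t] + w[i][1] * mics[1][t] for t in range(n)]
            for i in range(2)]

def max_error(estimates, references):
    """Largest absolute sample difference between estimates and references."""
    return max(abs(e - r)
               for est, ref in zip(estimates, references)
               for e, r in zip(est, ref))

# Hypothetical source signals (illustrative values only).
speech = [0.0, 1.0, 0.5, -0.5, -1.0, 0.25, 0.0, 0.75]
noise = [0.3, -0.2, 0.4, -0.1, 0.2, -0.3, 0.1, 0.0]
sources = [speech, noise]

# Assumed instantaneous mixing gains and the corresponding unmixing matrix.
A = [[1.0, 0.6], [0.5, 1.0]]
W = invert_2x2(A)

# Instantaneous mixture: every path is a single gain (one-tap filter).
inst_filters = [[[A[i][j]] for j in range(2)] for i in range(2)]
err_instantaneous = max_error(unmix(W, mix(inst_filters, sources)), sources)

# Convolutive mixture: the cross paths also contain delayed echoes (assumed taps).
conv_filters = [[[1.0], [0.6, 0.0, 0.3]],
                [[0.5, 0.2], [1.0]]]
err_convolutive = max_error(unmix(W, mix(conv_filters, sources)), sources)

print("instantaneous error:", err_instantaneous)
print("convolutive error:", err_convolutive)
```

In this sketch the inverse matrix recovers the sources from the instantaneous mixture to within floating-point rounding, while the same matrix leaves a clearly nonzero residual on the convolutive mixture, because a single matrix of gains cannot undo delayed paths.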
A person may desire to communicate with another person using a voice communication channel. The channel may be provided, for example, by a mobile wireless handset or headset, a walkie-talkie, a two-way radio, a car kit, or another communication device. When the person speaks, microphones on the communication device receive the sound of the person's voice and convert it to an electronic signal. The microphones may also receive sound signals from various noise sources, and therefore the electronic signal may also include a noise component. Since the microphones may be located at some distance from the person's mouth, and the environment may have many uncontrollable noise sources, the noise component may be a substantial component of the signal. Such substantial noise may cause an unsatisfactory communication experience and/or may cause the communication device to operate in an inefficient manner.
An acoustic environment is often noisy, making it difficult to reliably detect and react to a desired informational signal. In one particular example, a speech signal is generated in a noisy environment, and speech processing methods are used to separate the speech signal from the environmental noise. Such speech signal processing is important in many areas of everyday communication, since noise is almost always present in real-world conditions. Noise may be defined as the combination of all signals interfering with or degrading the speech signal of interest. The real world abounds with multiple noise sources, including single-point noise sources, which often spread into multiple sounds, resulting in reverberation. Unless the desired speech signal is separated and isolated from background noise, it may be difficult to make reliable and efficient use of it. Background noise may include numerous noise signals generated by the general environment and signals generated by background conversations of other people, as well as reflections and reverberation generated from each of the signals. For applications in which communication occurs in noisy environments, it may be desirable to separate the desired speech signals from background noise.
Existing methods for separating desired sound signals from background noise signals include simple filtering processes. While such methods may be simple and fast enough for real-time processing of sound signals, they are not easily adaptable to different sound environments and can result in substantial degradation of a desired speech signal. For example, the process may remove components according to a set of predetermined assumptions of noise characteristics that are over-inclusive, such that portions of a desired speech signal are classified as noise and removed. Alternatively, the process may remove components according to a set of predetermined assumptions of noise characteristics that are under-inclusive, such that portions of background noise such as music or conversation are classified as the desired signal and retained in the filtered output speech signal.
Handsets like PDAs and cellphones are rapidly emerging as the mobile speech communication devices of choice, serving as platforms for mobile access to cellular and internet networks. More and more functions that were previously performed on desktop computers, laptop computers, and office phones in quiet office or home environments are being performed in everyday situations like the car, the street, or a café. This trend means that a substantial amount of voice communication is taking place in environments where users are surrounded by other people, with the kind of noise content that is typically encountered where people tend to gather. The signature of this kind of noise (including, e.g., competing talkers, music, babble, airport noise) is typically nonstationary and close to the user's own frequency signature, and therefore such noise may be hard to model using traditional single-microphone or fixed-beamforming methods. Such noise also tends to distract or annoy users in phone conversations. Moreover, many standard automated business transactions (e.g., account balance or stock quote checks) employ data inquiry based on voice recognition, and the accuracy of these systems may be significantly impeded by interfering noise. Therefore, advanced signal processing based on multiple microphones may be desirable, e.g., to support handset use in noisy environments.