The present invention relates to the processing of speech signals in a communications system, and more particularly to the enhancement of near-end speech in a signal that includes the near-end speech combined with an echo of far-end speech.
In the field of telecommunications, such as with speaker phones and in cellular telephony, it is often desirable to allow a user to operate communication equipment without requiring the continued occupation of one or more of the user's hands. This can be an important factor in environments, such as automobiles, where a driver's preoccupation with holding telephone equipment may jeopardize not only his or her safety, but also the safety of others who share the road. Freedom to use one's hands for something other than holding a microphone is useful in other applications as well, such as with internet communication by means of a personal computer, speech recognition by a computer, or with audio-visual presentation systems.
To accommodate these important needs, so-called "hands-free" equipment has been developed, in which microphones and loudspeakers are mounted within the hands-free environment, thereby obviating the need to hold them. For example, in an automobile application, a cellular telephone's microphone might be mounted on the sun visor, while the loudspeaker may be a dash-mounted unit, or may be one that is associated with the car's stereo equipment. With components mounted in this fashion, a cellular phone user may carry on a conversation without having to hold the cellular unit or its handset. Similarly, personal computers often have microphones and loudspeakers mounted, for example, in a monitor in relatively close proximity to each other.
One problem with a hands-free arrangement is that the microphone tends to pick up sound from the nearby loudspeaker, in addition to the voice of the user of the hands-free equipment (the so-called "near-end user"). This is also a problem in some non-hands-free devices, such as handheld mobile telephones, which are becoming smaller and smaller. (Because of the small size, a mobile telephone's microphone cannot entirely be shielded from the sound emitted by its loudspeaker.) This sensing by the microphone of sound generated by the loudspeaker can cause problems in many types of applications. For example, in communications equipment, delays introduced by the communications system as a whole can cause the sound from the loudspeaker to be heard by the individual on the other end of the call (the so-called "far-end") as an echo of his or her own voice. Such an echo degrades audio quality and its mitigation is desirable. A similar problem can exist, for example, in automated systems that synthesize speech through a loudspeaker, and include voice recognition components for recognizing and responding to spoken commands or other words sensed by the microphone. In such applications, the presence of an echo of synthesized speech in the microphone signal can severely degrade the performance of the speech recognition components. Solutions for ameliorating such echoes include utilizing an adaptive echo cancellation filter or an echo attenuator.
As a representative example of hands-free equipment in general, an exemplary "hands-free" mobile telephone, having a conventional echo canceler in the form of an adaptive filter arrangement, is depicted in FIG. 1. A hands-free communications environment may be, for example, an automotive interior in which the mobile telephone is installed. Such an environment can cause effects on an acoustic signal propagating therein, which effects are typically unknown. Henceforth, this type of environment will be referred to throughout this specification as an unknown system H(z). The microphone 105 is intended for detecting a user's voice, but may also have the undesired effect of detecting audio signals emanating from the loudspeaker 109. It is this undesired action that introduces the echo signal into the system.
Circuitry for reducing, if not eliminating, the echo includes an adaptive filter 101, such as an adaptive Finite Impulse Response (FIR) filter, an adaptation unit 103, such as a least mean square (LMS) cross correlator, and a subtractor 107. In operation, the adaptive filter 101 generates an echo estimate signal 102, which is commonly referred to as a û signal. The echo estimate signal 102 is the convolution of the far-end signal 112 and a sequence of m filter weighting coefficients (h_i) of the filter 101 (see Equation 1):
$$\hat{u}(n) = \sum_{i=0}^{m-1} h_i \, x(n-i) \qquad (1)$$
where:
x(n) is the input signal,
m is the number of weighting coefficients, and
n is the sample number.
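As an illustration only, Equation 1 can be sketched in a few lines of Python. The function name, the 3-tap coefficient values and the sample values below are hypothetical, and the sketch assumes that samples before the start of the signal are zero.

```python
def echo_estimate(h, x, n):
    """Return u_hat(n) = sum over i of h[i] * x[n - i], as in Equation 1.

    Samples before the start of the signal (n - i < 0) are treated as zero,
    which is an assumption of this sketch.
    """
    total = 0.0
    for i, h_i in enumerate(h):
        if 0 <= n - i < len(x):
            total += h_i * x[n - i]
    return total


# Hypothetical 3-tap filter and far-end samples, chosen only for illustration.
h = [0.5, 0.25, 0.125]
x = [1.0, 2.0, 4.0, 8.0]
u_hat = [echo_estimate(h, x, n) for n in range(len(x))]
print(u_hat)
```

Each output sample is a weighted sum of the current and the m-1 preceding far-end samples, which is why the number of coefficients m bounds the length of the echo path that the filter can model.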
When the weighting coefficients are set correctly, the adaptive filter 101 produces an impulse response that is approximately equal to the response produced by the loudspeaker 109 within the unknown system H(z). The echo estimate signal 102 generated by the adaptive filter 101 is subtracted from the incoming digitized microphone signal 126 (designated u(n) in Eq. 2) to produce an error signal e(n):
$$e(n) = u(n) - \hat{u}(n) \qquad (2)$$
Ideally, any echo response from the unknown system H(z), introduced by the loudspeaker 109, is removed from the digitized microphone signal 126 by the subtraction of the echo estimate signal 102. Typically, the number of weighting coefficients (henceforth referred to as "coefficients") required for effectively canceling an echo will depend on the application. For handheld phones, fewer than one hundred coefficients may be adequate. For a hands-free telephone in an automobile, about 200 to 400 coefficients will be required. A large room may require a filter utilizing over 1000 coefficients in order to provide adequate echo cancellation.
It can be seen that the effectiveness of the echo canceler is directly related to how well the adaptive filter 101 is able to replicate the impulse response of the unknown system H(z). This, in turn, is directly related to the set of coefficients, h_i, maintained by the filter 101.
It is advantageous to provide a mechanism for dynamically altering the coefficients, h_i, to allow the adaptive filter 101 to adapt to changes in the unknown system H(z). In a car having a hands-free cellular arrangement, such changes may occur when a window or car door is opened or closed. A well-known coefficient adaptation scheme is the Least Mean Square (LMS) process, which was first introduced by Widrow and Hoff in 1960, and is frequently used because of its efficiency and robust behavior. As applied to the echo cancellation problem, the LMS process is a stochastic gradient step method which uses a rough (noisy) estimate of the gradient, g(n)=e(n)x(n), to make an incremental step toward minimizing the energy of an echo signal in a microphone signal, e(n), where x(n) is in vector notation corresponding to an expression x(n)=[x(n) x(n−1) x(n−2) . . . x(n−m+1)]. The update information produced by the LMS process, e(n)x(n), is used to determine the value of a coefficient in a next sample. The expression for calculating a next coefficient value h_i(n+1) is given by:
$$h_i(n+1) = h_i(n) + \mu \, e(n) \, x(n-i), \quad i = 0, \ldots, m-1 \qquad (3)$$
where
x(n) is the digitized input signal,
h_i is a filter weighting coefficient,
i designates a particular coefficient,
m is the number of coefficients,
n is the sample number, and
μ is a step or update gain parameter.
The LMS method produces information in incremental portions, each of which may have a positive or a negative value. The information produced by the LMS process can be provided to a filter to update the filter's coefficients.
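The interplay of Equations 1 to 3 can be sketched as a short Python loop. This is a minimal illustration under stated assumptions, not the circuit of FIG. 1: the function name, the two-tap echo path, the step size of 0.1 and the random far-end signal are all hypothetical, there is no near-end speech or noise in the microphone signal, and samples before the start of the signal are taken as zero.

```python
import random


def lms_echo_canceller(x, u, m, mu):
    """Apply Equations 1-3 sample by sample.

    x  -- far-end samples, u -- microphone samples,
    m  -- number of coefficients, mu -- step (update gain) parameter.
    Returns (e, h): the error signal and the final coefficients.
    """
    h = [0.0] * m
    e = []
    for n in range(len(x)):
        # x(n), x(n-1), ..., x(n-m+1), zero-padded before the signal start.
        window = [x[n - i] if n - i >= 0 else 0.0 for i in range(m)]
        u_hat = sum(h_i * x_i for h_i, x_i in zip(h, window))       # Eq. 1
        err = u[n] - u_hat                                          # Eq. 2
        h = [h_i + mu * err * x_i for h_i, x_i in zip(h, window)]   # Eq. 3
        e.append(err)
    return e, h


# Hypothetical echo path and far-end signal, for illustration only.
rng = random.Random(0)
x = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
true_path = [0.6, -0.3]
u = [sum(true_path[i] * x[n - i] for i in range(len(true_path)) if n - i >= 0)
     for n in range(len(x))]

e, h = lms_echo_canceller(x, u, m=4, mu=0.1)
print([round(c, 4) for c in h])
```

In this noise-free setting the coefficients settle close to the assumed echo path and the error signal decays toward zero. With near-end speech present in u(n), the same update would be disturbed, which is one reason adaptation speed is often controlled by a voice activity decision.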
Referring back to FIG. 1, the conventional echo cancellation circuit includes a filter adaptation unit 103 in the form of an LMS cross correlator for providing coefficient update information to the filter 101. In this arrangement, the filter adaptation unit 103 monitors the corrected signal e(n) that represents the digitized microphone signal 126 minus the echo estimate signal 102 generated by the filter 101. The echo estimate signal 102 is generated, as described above, with the use of update information provided to the adaptive filter 101 by the filter adaptation unit 103. The coefficients, h_i, of the adaptive filter 101 accumulate the update information as shown in Eq. 3.
Having reduced the presence of the acoustic echo from the microphone signal, the resulting signal is then supplied to additional components for further processing which is application-specific. For example, in addition to the acoustic echo cancellation circuitry, such as that described above, transceivers such as the one depicted in FIG. 1 typically also include a near-end voice activity detector 150, which outputs a signal 153 that is indicative of whether or not a near-end user is speaking. The most commonly used approach to performing near-end voice activity detection employs a time domain power calculation. Typically, a decision regarding the presence or absence of voice activity is mainly based on a comparison between a threshold energy level (corresponding to background noise) and a measure of the bandpass filtered signal energy. The purpose of the bandpass filtering is to eliminate signal energy associated with background noise.
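A time domain power comparison of this kind might look like the following Python sketch. It is only an illustration: the function names, the frame length, the margin factor of 4 and the sample values are assumptions, and the bandpass filtering stage described above is omitted for brevity.

```python
def frame_energy(frame):
    """Mean power of one frame of samples."""
    return sum(s * s for s in frame) / len(frame)


def detect_voice(signal, frame_len, noise_energy, margin=4.0):
    """Flag each full frame whose energy exceeds margin * noise_energy.

    noise_energy stands in for the threshold energy level that corresponds
    to background noise; margin is a hypothetical safety factor.
    """
    flags = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        flags.append(frame_energy(frame) > margin * noise_energy)
    return flags


# Hypothetical signal: a quiet frame, a loud frame, then a quiet frame.
quiet = [0.01, -0.01] * 4
loud = [0.5, -0.5] * 4
flags = detect_voice(quiet + loud + quiet, frame_len=8, noise_energy=1e-4)
print(flags)
```

A detector of this form responds to any sufficiently energetic frame, so residual echo of far-end speech in the signal would be flagged in the same way as near-end speech.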
A signal that is indicative of the presence or absence of near-end speech may serve any of a number of purposes. For one thing, in cellular communications systems such as the Global System for Mobile communication (GSM), digitized speech signals are not sent through the network in their raw form, but are instead encoded in a manner that reduces the number of bits that actually need to be transmitted from one place to another. In GSM, the speech coder takes advantage of the fact that each participant in a normal conversation speaks on average for less than 40% of the time. By incorporating a voice activity detector as part of the speech coder functioning, GSM systems operate in a discontinuous transmission mode (DTX), in which the GSM transmitter is not active during silent periods (i.e., when the near-end voice activity detector 150 indicates that the near-end user is not speaking). This approach provides a longer subscriber battery life and reduces instantaneous radio interference. A comfort noise subsystem at the receiving end introduces a background acoustic noise to compensate for the annoying switched muting which occurs due to DTX.
Near-end voice activity detectors may also be employed to control an attenuation factor of an active acoustic echo canceler based on whether a speech signal includes a near-end speech component.
Furthermore, near-end voice activity detectors may also be used to control adaptation speed of the adaptive filter 101.
Voice activity detectors are not the only types of components that process a signal representative of near-end speech. Such a signal may be supplied, for example, to a speech recognizer module. Speech recognizer modules are well-known, and are useful in applications that permit users to control an apparatus or computer via voice control, and in applications that permit users to create electronic documents merely by dictating them.
Furthermore, a signal representative of near-end speech may also be fed back within the system for use in controlling the echo cancellation filter 101 itself, such as for controlling speed of adaptation.
Despite the presence of echo cancellation circuitry, such as that described above, the signals generated for further processing (e.g., for transmission to the far-end user in a communications system, or for near-end speech recognition or for controlling the operation of the echo cancellation filter 101) may very often still include echo components. This may occur, for example, because the adaptive filter has not yet converged to a fully adapted state, or even after such convergence whenever the unknown environment H(z) changes, thereby requiring the adaptation process to be repeated. The presence of strong echo signal components in the signal can cause degraded or even faulty operation of the down-stream processing components, since these echo signal components may be mistaken for near-end speech.
Conventional applications that process near-end speech signals, such as conventional voice activity detectors, speech recognition modules and the like, typically assume that no echo is present in the signal to be processed, and therefore do not have any ability to focus on the near-end speech to the exclusion of echo signal components, which may also be in the frequency range of human voice activity.
It is therefore an object of the present invention to provide methods and apparatuses that generate a signal in which near-end speech components are enhanced relative to echo signal components.
The foregoing and other objects are achieved in methods and apparatuses for generating an enhanced near-end voice signal. In accordance with one aspect of the invention, generating an enhanced near-end voice signal includes receiving an audio signal; generating an estimated acoustic echo signal; and generating a processed signal by removing the estimated acoustic echo signal from the audio signal. These steps are useful in, for example, a hands-free telephone apparatus, wherein loudspeaker signals, conveying information from the far-end user, are picked up as an acoustic echo by the microphone of the hands-free telephone apparatus. Next, a near-end enhancement spectrum is determined, wherein the near-end enhancement spectrum has at least one range of contiguous frequencies over which the near-end enhancement spectrum has a magnitude greater than a predetermined threshold, wherein the range of contiguous frequencies are those associated with a relatively high echo return loss in the processed signal. The processed signal is then filtered in accordance with the near-end enhancement spectrum, thereby generating an enhanced near-end voice signal.
In another aspect of the invention, the amount of energy contained in the enhanced near-end voice signal is measured. The presence or absence of near-end voice activity is then detected based on the measured energy of the enhanced near-end voice signal.
In accordance with yet another aspect of the invention, the enhanced near-end voice signal may be applied to a near-end speech recognizer, thereby obtaining improved speech recognition performance.
In accordance with another aspect of the invention, the above-described process is repeated periodically, so that the detection of whether near-end voice activity exists is dynamically adjustable to accommodate changing conditions.
In yet another aspect of the invention, determining the near-end enhancement spectrum comprises determining the near-end enhancement spectrum as a function of a weighted spectrum, wherein the weighted spectrum is defined as:
$$W(f) = \alpha \frac{\Gamma}{\Gamma_{\max}} + \beta \frac{E}{E_{\max}} + \gamma \frac{S}{S_{\max}}$$
where:
Γ is a spectrum of an estimate of an acoustic echo derived from a far-end signal;
E is an Echo Return Loss Enhancement spectrum that represents an echo canceling performance of step c);
N is a spectrum of the processed signal;
S is an echo spread spectrum that represents spectral spreading properties of the echo path;
Γ_max=max(Γ), E_max=max(E) and S_max=max(S); and
α, β and γ are constants, with α+β+γ greater than 0.
In still another aspect of the invention, α+β+γ=1.
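The weighted spectrum can be illustrated with a small Python sketch. It assumes that the three spectra are available as equal-length lists of non-negative values sampled at the same frequency bins, each with a positive maximum. The function name, the argument names and the numeric values are hypothetical, and the example weights are chosen so that α+β+γ=1.

```python
def weighted_spectrum(echo_spec, erle_spec, spread_spec, alpha, beta, gamma):
    """Return W(f) per frequency bin.

    echo_spec   -- samples of the echo estimate spectrum (Gamma)
    erle_spec   -- samples of the Echo Return Loss Enhancement spectrum (E)
    spread_spec -- samples of the echo spread spectrum (S)
    Each spectrum is normalized by its own maximum before weighting.
    """
    g_max = max(echo_spec)
    e_max = max(erle_spec)
    s_max = max(spread_spec)
    return [
        alpha * g / g_max + beta * e / e_max + gamma * s / s_max
        for g, e, s in zip(echo_spec, erle_spec, spread_spec)
    ]


# Hypothetical three-bin spectra and weights, for illustration only.
w = weighted_spectrum(
    echo_spec=[1.0, 2.0, 4.0],
    erle_spec=[2.0, 4.0, 8.0],
    spread_spec=[4.0, 2.0, 1.0],
    alpha=0.5, beta=0.25, gamma=0.25,
)
print(w)
```

Because each term is normalized to its own maximum, W(f) stays between 0 and 1 in this example whenever the weights are non-negative and sum to 1.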
In yet another aspect of the invention, determining the near-end enhancement spectrum as a function of the weighted spectrum comprises determining the detector spectrum in accordance with:
$$C = \frac{\sum_i \int_{\mathrm{Speech}_{\min}(i)}^{\mathrm{Speech}_{\max}(i)} W(f)\, df}{\int_0^{\mathrm{Spectrum}_{\mathrm{total\,max}}} W(f)\, df}$$
where:
Speech_min(i) is an ith frequency where N goes above a predetermined threshold;
Speech_max(i) is the ith frequency where N drops below the predetermined threshold; and
Spectrum_total max is a maximum frequency of interest in the weighted spectrum, W(f).
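A discrete approximation of this ratio is sketched below in Python. It assumes that W(f) and N are sampled on the same uniformly spaced frequency bins from 0 up to the maximum frequency of interest, so that each integral reduces to a sum and the bin width cancels. The function names and the sample values are hypothetical.

```python
def speech_bands(n_spec, threshold):
    """Return (start, stop) bin index pairs for each contiguous run in which
    n_spec is above threshold. start plays the role of Speech_min(i) and
    stop, which is exclusive, plays the role of Speech_max(i)."""
    bands = []
    start = None
    for k, value in enumerate(n_spec):
        if value > threshold and start is None:
            start = k
        elif value <= threshold and start is not None:
            bands.append((start, k))
            start = None
    if start is not None:
        bands.append((start, len(n_spec)))
    return bands


def speech_band_ratio(w, n_spec, threshold):
    """Approximate C: W(f) summed over the speech bands, divided by W(f)
    summed over the whole range of interest."""
    in_bands = sum(sum(w[a:b]) for a, b in speech_bands(n_spec, threshold))
    return in_bands / sum(w)


# Hypothetical six-bin example, for illustration only.
w = [1.0, 1.0, 2.0, 2.0, 1.0, 1.0]
n_spec = [0.0, 5.0, 5.0, 0.0, 5.0, 0.0]
bands = speech_bands(n_spec, threshold=1.0)
c = speech_band_ratio(w, n_spec, threshold=1.0)
print(bands, c)
```

In this reading, C is the fraction of the weighted spectrum that falls inside the bands where the processed signal spectrum N exceeds the threshold.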