The present disclosure relates to a sound processing apparatus, a method, and a program, and particularly to a sound processing apparatus, a method, and a program which are suitable for use in cancelling an acoustic echo.
In the related art, an acoustic echo canceller is used in a video telephone or the like to remove an echo caused by acoustic coupling between the loudspeaker and the microphone. In a video telephone, for example, when the voice of the talker is collected for transmission to the counterpart of the telephone call, the microphone picks up not only the talker's voice but also the voice of the counterpart, which is output by the loudspeaker. The acoustic echo canceller therefore removes this voice of the counterpart (the acoustic echo).
Specifically, the voice of the counterpart of the telephone call, which is to be output by the loudspeaker, is filtered by an adaptive digital filter to generate a pseudo echo signal, that is, an estimate of the counterpart's voice as collected by the microphone. A residual signal, obtained by subtracting the pseudo echo signal from the voice actually collected by the microphone, is transmitted to the counterpart of the telephone call as the voice of the talker.
At this time, the acoustic echo canceller continually updates the filter coefficients of the adaptive digital filter, using the received voice of the counterpart of the telephone call and the residual signal, in order to improve the accuracy with which the collected voice of the counterpart is estimated.
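The processing described above can be illustrated with a minimal sketch. The text does not name an update rule, so the normalized LMS (NLMS) rule used here is an assumption, as are the function name `nlms_echo_canceller`, the parameters `num_taps`, `step`, and `eps`, and the short simulated echo path. The sketch generates a pseudo echo from the far-end signal, subtracts it from the microphone signal to obtain the residual, and updates the coefficients from the far-end signal and the residual.

```python
import random


def nlms_echo_canceller(far_end, mic, num_taps=8, step=0.5, eps=1e-8):
    """Return (residual, weights) for far-end reference and microphone samples.

    NLMS is an assumed update rule; the source only says the coefficients are
    updated from the far-end voice and the residual signal.
    """
    weights = [0.0] * num_taps   # adaptive filter coefficients
    history = [0.0] * num_taps   # most recent far-end samples, newest first
    residual = []
    for x, d in zip(far_end, mic):
        history = [x] + history[:-1]
        # Pseudo echo: estimate of the far-end voice as picked up by the microphone.
        pseudo_echo = sum(w * h for w, h in zip(weights, history))
        # Residual: the signal that would be transmitted to the counterpart.
        e = d - pseudo_echo
        residual.append(e)
        # Coefficient update driven by the far-end samples and the residual.
        norm = sum(h * h for h in history) + eps
        weights = [w + step * e * h / norm for w, h in zip(weights, history)]
    return residual, weights


# Hypothetical echo path (room response), used only for this demonstration.
rng = random.Random(0)
echo_path = [0.6, -0.3, 0.15, 0.05]
far_end = [rng.gauss(0.0, 1.0) for _ in range(2000)]
# Simulated microphone signal containing only the echo (no near-end talker).
mic = [
    sum(h * far_end[n - k] for k, h in enumerate(echo_path) if n - k >= 0)
    for n in range(len(far_end))
]
residual, weights = nlms_echo_canceller(far_end, mic)
```

In this noise-free simulation the coefficients approach the simulated echo path and the residual decays toward zero. With a near-end talker present, the residual would instead carry that talker's voice.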
When such an acoustic echo canceller is applied to an actual video telephone, the echo in a room can last as long as several hundred milliseconds depending on the installation environment. The adaptive digital filter therefore needs several thousand taps to cover the echo length, which involves extensive computation. In addition, since it takes a long time for the values of the filter coefficients to converge, a sufficient degree of echo suppression is not obtained immediately after the start of the telephone call. That is, the acoustic echo is not sufficiently removed immediately after the start of the telephone call.
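The relationship between echo length and tap count can be checked with a short calculation. The figures below are illustrative assumptions rather than values from the text: a 300 msec echo, a 16 kHz sampling rate, and a cost model of one multiply-accumulate per tap for filtering plus one per tap for an LMS-type coefficient update. The helper names `required_taps` and `macs_per_second` are hypothetical.

```python
def required_taps(echo_length_ms, sample_rate_hz):
    """Number of filter taps needed to span an echo of the given duration."""
    return int(round(echo_length_ms * sample_rate_hz / 1000))


def macs_per_second(taps, sample_rate_hz):
    """Rough multiply-accumulate count per second.

    Assumes one MAC per tap for filtering and one per tap for the coefficient
    update, performed for every input sample.
    """
    return 2 * taps * sample_rate_hz


taps = required_taps(300, 16000)      # 4800 taps
load = macs_per_second(taps, 16000)   # 153,600,000 MACs per second
```

Under these assumptions a 300 msec echo already calls for 4800 taps, which is consistent with the "several thousand" taps mentioned above.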
Thus, a learning method has been proposed in which the filter coefficients are made to converge more rapidly after the start of a telephone call by causing the loudspeaker to output a training signal such as white noise prior to the telephone call and collecting the resulting sound to update the filter coefficients (see Japanese Unexamined Patent Application Publication No. 9-247246, for example).
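The effect of such prior training can be sketched in a small simulation. This is not the method of the cited publication, whose details are not given here. It is only an illustration under stated assumptions: an NLMS update rule, a hypothetical four-tap echo path, and Gaussian noise standing in for both the training signal and the far-end voice. The names `nlms_step`, `run`, and `energy` are hypothetical.

```python
import random


def nlms_step(weights, history, mic_sample, step=0.5, eps=1e-8):
    """One NLMS update (assumed rule); returns (residual, new_weights)."""
    pseudo_echo = sum(w * h for w, h in zip(weights, history))
    residual = mic_sample - pseudo_echo
    norm = sum(h * h for h in history) + eps
    new_weights = [w + step * residual * h / norm for w, h in zip(weights, history)]
    return residual, new_weights


def run(weights, reference, echo_path):
    """Play `reference` through the simulated room while adapting the filter."""
    history = [0.0] * len(weights)
    residuals = []
    for x in reference:
        history = [x] + history[:-1]
        # Simulated microphone sample: the reference convolved with the echo path.
        mic_sample = sum(h * s for h, s in zip(echo_path, history))
        e, weights = nlms_step(weights, history, mic_sample)
        residuals.append(e)
    return residuals, weights


def energy(samples):
    return sum(v * v for v in samples)


rng = random.Random(1)
echo_path = [0.6, -0.3, 0.15, 0.05]                      # hypothetical room response
num_taps = 8
training = [rng.gauss(0.0, 1.0) for _ in range(1000)]    # white-noise training signal
call = [rng.gauss(0.0, 1.0) for _ in range(50)]          # stand-in for far-end voice

# Train before the call, then compare the first samples of the call.
_, trained = run([0.0] * num_taps, training, echo_path)
cold, _ = run([0.0] * num_taps, call, echo_path)         # no prior training
warm, _ = run(trained, call, echo_path)                  # coefficients pre-trained
```

In this simulation `energy(warm)` is far smaller than `energy(cold)`, which reflects the point of the training step: the echo is already suppressed at the start of the call.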