1. Technical Field
The present invention relates to communication terminals and communication methods for transmitting and receiving audio signals, and particularly relates to a communication terminal and a communication method which transmit an audio signal on which echo cancellation processing has been performed.
2. Background Art
Recent years have seen the development of devices which transmit and receive large-screen video and audio extending into a high-frequency band, allowing users to enjoy high-quality communication and calls with a highly realistic sensation. In such devices, audio is often output from a speaker embedded in the display that displays the video. Furthermore, such a device is expected to be installed in, for example, a conference room in an office, a living room in a house, or another room large enough to accommodate at least a few people.
In such a case, sound of a speaker at the far end (i.e., in a destination communication terminal) is captured by a microphone at the near end (i.e., in a source communication terminal) and is then transmitted to the far end in the form of an echo. In order to remove this echo, the communication terminal includes an echo canceller. The echo canceller here removes an echo which is generated in the case where the speaker and the microphone are located farther apart than those of a mobile phone.
However, the above echo canceller requires an enormous amount of computation compared to a simple echo canceller which is used, for example, in a conventional mobile phone. There are two causes for this.
The first cause is that the reproduction band of audio is expanded with the aim of high-quality audio communication. In a mobile phone, for example, the reproduction band is slightly lower than 4 kHz, whereas the reproduction band of audio used in communication with a highly realistic sensation is, for example, 12 kHz.
The second cause is that the echo time is prolonged. In a conventional mobile phone, audio is output from a speaker at an ear and is then captured by a microphone at a mouth, with the result that the echo time is expected to be approximately 30 msec at most. In contrast, a communication system with a highly realistic sensation is provided with, as mentioned above, a speaker embedded in a display and adapted to a high volume of sound, and a microphone installed in a room. Since the room is large enough to accommodate at least a few people, the echo time is expected to be approximately 600 msec.
Generally, in a single echo cancellation scheme, the amount of computation of the echo canceller is proportional to the square of the reproduction band and is further proportional to the expected echo time. In the above example, the reproduction band is three times wider (12 kHz versus 4 kHz) and the echo time is 20 times longer (600 msec versus 30 msec), which means that the required amount of computation is 3×3×20=180 times greater.
The reason why the amount of computation of the echo canceller is proportional to the square of the reproduction band and is further proportional to the expected echo time is as follows.
FIG. 16 shows a basic principle of a conventional echo canceller 10. As shown in FIG. 16, the echo canceller 10 removes an echo originating from the sound output from a speaker 20 and captured by a microphone 30.
Specifically, the echo canceller 10 includes a pseudo-echo generation unit 11 and a subtractor 12. The pseudo-echo generation unit 11 estimates, using an input signal from the microphone 30 and a reference signal, a transfer function of the space where the speaker 20 and the microphone 30 are placed. The pseudo-echo generation unit 11 then generates a pseudo echo by driving an adaptive filter which has a predetermined number of taps and represents the estimated transfer function. The subtractor 12 then reduces the echo by subtracting, from the input signal captured by the microphone 30, the pseudo echo generated by the pseudo-echo generation unit 11.
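As a rough illustration of the principle in FIG. 16, the following Python sketch generates a pseudo echo with an adaptive filter and subtracts it from the microphone signal. The normalized LMS (NLMS) update rule, the step size `mu`, the made-up 3-tap echo path, and the names `nlms_echo_cancel` and `simulate_echo` are assumptions made for illustration only; the description above does not specify the adaptation algorithm.

```python
import random


def simulate_echo(reference, room):
    """Convolve the reference signal with a (hypothetical) room impulse response."""
    return [
        sum(room[k] * reference[n - k] for k in range(len(room)) if n - k >= 0)
        for n in range(len(reference))
    ]


def nlms_echo_cancel(reference, mic, num_taps, mu=0.5, eps=1e-8):
    """Subtract an adaptively estimated pseudo echo from the microphone signal.

    NLMS is an assumed choice of adaptation rule, not taken from the text.
    Returns (residual, weights).
    """
    weights = [0.0] * num_taps   # current estimate of the transfer function
    history = [0.0] * num_taps   # latest reference samples, newest first
    residual = []
    for x, d in zip(reference, mic):
        history = [x] + history[:-1]
        pseudo_echo = sum(w * h for w, h in zip(weights, history))
        e = d - pseudo_echo      # role of the subtractor 12
        power = sum(h * h for h in history) + eps
        weights = [w + mu * e * h / power for w, h in zip(weights, history)]
        residual.append(e)
    return residual, weights


if __name__ == "__main__":
    random.seed(0)
    ref = [random.uniform(-1.0, 1.0) for _ in range(2000)]
    room = [0.6, -0.3, 0.1]      # made-up 3-tap echo path
    residual, weights = nlms_echo_cancel(ref, simulate_echo(ref, room), num_taps=3)
    print("estimated taps:", [round(w, 3) for w in weights])
```

Each input sample requires one pass over all `num_taps` weights, so the per-sample cost grows with the filter length.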
The number of taps T of the adaptive filter representing the transfer function is determined by T=E×F, where E represents the echo time and F represents the sampling frequency of the signal.
The echo canceller 10 processes a filter having (E×F) taps for each sample, which means that the amount of computation per unit time is (E×F)×F. Thus, the amount of computation of the echo canceller 10 is proportional to the echo time E and is proportional to the square of the sampling frequency F.
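The relations T=E×F and (E×F)×F can be checked numerically with a short Python sketch. The function names `num_taps` and `ops_per_second` are hypothetical, and the sampling frequencies of 8 kHz and 24 kHz are assumptions (twice the 4 kHz and 12 kHz reproduction bands); the text itself states only the bands and the echo times.

```python
def num_taps(echo_time_s, sampling_hz):
    """T = E x F: filter length needed to cover the echo time."""
    return round(echo_time_s * sampling_hz)


def ops_per_second(echo_time_s, sampling_hz):
    """(E x F) x F: one T-tap filter evaluation for every input sample."""
    return num_taps(echo_time_s, sampling_hz) * sampling_hz


# Assumed sampling rates: twice the reproduction band (4 kHz and 12 kHz).
mobile = ops_per_second(0.030, 8_000)
room = ops_per_second(0.600, 24_000)
ratio = room / mobile
print(num_taps(0.030, 8_000), num_taps(0.600, 24_000), ratio)
```

Under these assumed rates the filter lengths are 240 and 14,400 taps, and the ratio of the two computation amounts is 180, which agrees with the factor 3×3×20 given for the example of the mobile phone and the room.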
There is a subband echo canceller (see Non-Patent Literature 1) as a known technique for reducing an amount of computation of an echo canceller which cancels an echo in the space where a speaker and a microphone are placed at a distance from each other.
The subband echo canceller divides an input signal into a plurality of subband signals and down-samples the signal at the same time. For example, if the signal is divided into 20 subbands and down-sampled to one sixteenth, the amount of computation is E×(F/16)×(F/16)×20+α, where α represents the amount of computation for the division into the subbands. When α is sufficiently small, the amount of computation can be reduced to 20/256 of that of a typical echo canceller.
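The 20/256 figure follows from dividing E×(F/16)×(F/16)×20 by the full-band amount E×F×F, as the following Python sketch shows. The function name `subband_cost_ratio` is hypothetical, and the band-splitting cost α is assumed to be negligible and is therefore omitted.

```python
from fractions import Fraction


def subband_cost_ratio(num_subbands, decimation):
    """Ratio of E*(F/D)*(F/D)*N to E*F*F, ignoring the band-splitting cost alpha."""
    return Fraction(num_subbands, decimation * decimation)


ratio = subband_cost_ratio(num_subbands=20, decimation=16)
print(ratio, float(ratio))
```

The fraction 20/256 reduces to 5/64, i.e., roughly 8% of the full-band amount of computation; E and F cancel out of the ratio.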
In addition, as a method of further reducing the amount of computation, Patent Literature 1 discloses a subband echo canceller in which the number of taps of the adaptive filter of the echo canceller in each band is increased or decreased depending on the sound source, thereby reducing the amount of computation.