A technology for removing a speech signal (a human voice or the like) from an acoustic signal may be used to make background sound that is masked by speech easier to hear, or to play a piece of music karaoke style by removing the singer's voice from music content. For example, a technology for removing a speech signal from acoustic signals of two channels, a right signal and a left signal, is known.
The signals of a two-channel acoustic signal can be related to each other in various ways. When the signals of the two channels are given as a left signal L and a right signal R, respectively, they are modeled in the following manner.

L = BL + CL + eL
R = BR + CR + eR
Here, BL and BR are background sound signals included in the left signal and the right signal, respectively. Also, CL and CR are speech signals included in the left signal and the right signal, respectively. Moreover, eL and eR are noises included in the left signal and the right signal, respectively. The noise includes microphone noise and encoding noise. Much content is created such that the speech signal is included equally in the left signal and the right signal (that is, CL=CR). Thus, there are four conditions on the left signal and the right signal, as follows, depending on whether the background sounds are equal and whether the noises are equal.
Condition 1: BL≠BR, eL=eR 
Condition 2: BL≠BR, eL≠eR 
Condition 3: BL=BR, eL=eR 
Condition 4: BL=BR, eL≠eR 
Conditions 1 and 2 are cases where the background sounds are different for the left signal and the right signal. For example, a stereo signal corresponds to Conditions 1 and 2. Conditions 3 and 4 are cases where the background sounds are equal between the left signal and the right signal. For example, a case where a monaural signal is input as a two-channel signal corresponds to Conditions 3 and 4.
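The four conditions can be made concrete with a small synthetic sketch. The function name make_two_channel, the Gaussian test signals, and the scale factors below are illustrative assumptions and do not come from the description; the sketch only follows the model L=BL+CL+eL, R=BR+CR+eR with the speech common to both channels.

```python
import random


def make_two_channel(condition, n=8, seed=0):
    """Synthesize (left, right) per L = BL + CL + eL and R = BR + CR + eR.

    Hypothetical helper: the signals are random placeholders, not real audio.
    """
    if condition not in (1, 2, 3, 4):
        raise ValueError("condition must be 1, 2, 3, or 4")
    rng = random.Random(seed)

    def draw(scale):
        return [rng.gauss(0.0, scale) for _ in range(n)]

    speech = draw(1.0)  # CL = CR: the speech is common to both channels

    b_left = draw(0.5)
    # BL = BR only under Conditions 3 and 4 (monaural-like background)
    b_right = b_left if condition in (3, 4) else draw(0.5)

    e_left = draw(0.01)
    # eL = eR only under Conditions 1 and 3
    e_right = e_left if condition in (1, 3) else draw(0.01)

    left = [b + c + e for b, c, e in zip(b_left, speech, e_left)]
    right = [b + c + e for b, c, e in zip(b_right, speech, e_right)]
    return left, right


if __name__ == "__main__":
    for cond in (1, 2, 3, 4):
        left, right = make_two_channel(cond)
        print(cond, "channels identical:", left == right)
```

Only Condition 3 yields sample-for-sample identical channels in this sketch; under Condition 4 the channels differ only by the small noise terms.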
Acoustic signals of TV broadcasting correspond, in many cases, to Condition 1. Acoustic signals recorded on some DVDs correspond to Condition 3. Other acoustic signals, such as the acoustic signals of videos on the Internet, include signals of various conditions, and it is not possible to know in advance to which condition an acoustic signal corresponds. Under Condition 3, the left signal and the right signal perfectly match each other, and thus Condition 3 is easy to recognize. However, because of the influence of noises, it is generally difficult to distinguish Condition 4 from Conditions 1 and 2 based on the input acoustic signals.
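The recognition problem can be illustrated with a minimal sketch. The helper name channel_difference_stats and the sample values are assumptions made for illustration; the sketch only shows that an exact-match test identifies Condition 3, while a small nonzero difference does not by itself reveal whether it comes from noise (Condition 4) or from the background sounds (Conditions 1 and 2).

```python
import math


def channel_difference_stats(left, right):
    """Return (identical, rms_difference) for two equal-length channels."""
    if len(left) != len(right):
        raise ValueError("channels must have the same length")
    diff = [a - b for a, b in zip(left, right)]
    identical = all(d == 0.0 for d in diff)
    rms = math.sqrt(sum(d * d for d in diff) / len(diff)) if diff else 0.0
    return identical, rms


if __name__ == "__main__":
    mono = [0.5, -0.25, 0.125, 0.0]

    # Condition 3: both channels carry exactly the same samples.
    print(channel_difference_stats(mono, list(mono)))

    # Condition 4: same background and speech, but different per-channel noise.
    noisy_left = [x + 0.001 for x in mono]
    noisy_right = [x - 0.001 for x in mono]
    # The residual is small but nonzero. A slightly different background in
    # each channel (Conditions 1 and 2) could produce the same statistic.
    print(channel_difference_stats(noisy_left, noisy_right))
```

An exact-equality check is therefore sufficient for Condition 3 only; any threshold on the residual for Condition 4 would be a heuristic choice.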
As described above, acoustic signals include signals of various conditions. However, the conventional technology for removing a speech signal from acoustic signals of two channels is effective only for the acoustic signals of Conditions 1 and 2, and cannot appropriately remove speech from the acoustic signals of Conditions 3 and 4. For example, speech cannot be removed from a monaural signal.
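The description does not specify how the conventional technology works. As an assumption for illustration only, the sketch below uses simple channel subtraction, a commonly used way of cancelling content that is identical in both channels. The function name subtract_channels and the sample values are hypothetical, and the noise terms are set to zero for clarity.

```python
def subtract_channels(left, right):
    """Assumed center cancellation: keep only what differs between channels."""
    return [(a - b) / 2.0 for a, b in zip(left, right)]


# Speech common to both channels (CL = CR); noise omitted (eL = eR = 0).
speech = [1.0, -1.0, 0.5, -0.5]
bg_left = [0.25, 0.0, -0.25, 0.5]
bg_right = [0.0, 0.25, 0.5, -0.25]

# Stereo case (BL != BR): speech cancels and (BL - BR) / 2 remains.
stereo_left = [b + c for b, c in zip(bg_left, speech)]
stereo_right = [b + c for b, c in zip(bg_right, speech)]
stereo_out = subtract_channels(stereo_left, stereo_right)

# Monaural case (BL = BR): the background cancels together with the speech.
mono = stereo_left
mono_out = subtract_channels(mono, mono)

if __name__ == "__main__":
    print("stereo output:", stereo_out)
    print("mono output:  ", mono_out)
```

Under this assumed method, a stereo input leaves a background residual with the speech removed, whereas a monaural input leaves nothing, which is consistent with the limitation stated above for Conditions 3 and 4.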