In speech recognition, an audio signal acquired with a microphone includes not only a voice signal of a user but also a non-voice signal such as background noise and music. A sound source separation technology extracts only a desired signal from an audio signal in which the voice signal and the non-voice signal are mixed.
For example, one sound source separation technology uses nonnegative matrix factorization. In this method, when the voice signal is separated from the audio signal, a basis matrix of the non-voice signal is first produced from a spectrogram of the audio signal in an interval in which the non-voice signal is highly likely to be included.
Then, using the basis matrix of the non-voice signal, a basis matrix and a coefficient matrix of the voice signal are produced from the spectrogram of the audio signal that is the separation target. The spectrogram of the voice signal is estimated from the product of the basis matrix and the coefficient matrix of the voice signal. Finally, the estimated spectrogram of the voice signal is transformed into a temporal signal, whereby the voice signal is separated from the audio signal.
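The steps described above can be illustrated with a minimal sketch. It is not the implementation of any particular apparatus. It assumes nonnegative magnitude spectrograms held as NumPy arrays (frequency bins by time frames) and uses the common multiplicative update rules for the Euclidean distance, which the text does not specify. The function names `nmf_noise_basis` and `separate_voice`, the ranks, and the iteration counts are hypothetical choices made only for illustration.

```python
import numpy as np

EPS = 1e-10  # small constant to avoid division by zero (assumed value)


def nmf_noise_basis(V_noise, rank, n_iter=200, seed=0):
    """Learn a non-voice basis matrix from a spectrogram of an interval
    assumed to contain mostly the non-voice signal."""
    rng = np.random.default_rng(seed)
    n_freq, n_frames = V_noise.shape
    W = rng.random((n_freq, rank)) + EPS
    H = rng.random((rank, n_frames)) + EPS
    for _ in range(n_iter):
        # Multiplicative updates keep W and H nonnegative.
        H *= (W.T @ V_noise) / (W.T @ W @ H + EPS)
        W *= (V_noise @ H.T) / (W @ H @ H.T + EPS)
    return W


def separate_voice(V_mix, W_noise, voice_rank, n_iter=200, seed=1):
    """Factorize the target spectrogram with the non-voice basis held fixed,
    learning only the voice basis and all coefficients."""
    rng = np.random.default_rng(seed)
    n_freq, n_frames = V_mix.shape
    noise_rank = W_noise.shape[1]
    W_voice = rng.random((n_freq, voice_rank)) + EPS
    H = rng.random((voice_rank + noise_rank, n_frames)) + EPS
    for _ in range(n_iter):
        W = np.hstack([W_voice, W_noise])
        # Update the coefficients of both voice and non-voice bases.
        H *= (W.T @ V_mix) / (W.T @ W @ H + EPS)
        # Update only the voice basis; W_noise is never modified.
        H_voice = H[:voice_rank]
        W_voice *= (V_mix @ H_voice.T) / (W @ H @ H_voice.T + EPS)
    H_voice = H[:voice_rank].copy()
    # Estimated voice spectrogram: product of voice basis and coefficients.
    V_voice = W_voice @ H_voice
    return W_voice, H_voice, V_voice


# Usage with random placeholder data (not real audio).
rng = np.random.default_rng(123)
noise_only = rng.random((64, 50))   # interval assumed to be non-voice
mixture = rng.random((64, 80))      # separation target
W_n = nmf_noise_basis(noise_only, rank=8)
W_v, H_v, V_v = separate_voice(mixture, W_n, voice_rank=8)
```

The final step, transforming the estimated spectrogram into a temporal signal, is omitted here. In practice it would typically require the phase of the mixture and an inverse short-time Fourier transform.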
However, in this method, the basis matrix of the non-voice signal cannot be produced correctly when the voice signal is mixed into the audio signal from which the basis matrix of the non-voice signal is obtained, and the audio signal separation performance is degraded as a result. A problem to be solved by the invention is to construct an audio signal processing apparatus in which the audio signal separation performance is improved.