Field of the Invention
The present invention relates to audio signal processing apparatuses and methods that separate an audio signal into a plurality of audio signals, for example a target sound and noise.
Description of the Related Art
Techniques for removing noise, which is a non-target sound, from an audio signal improve the audibility of a target sound present in the audio signal, and are important for improving recognition rates in voice recognition.
Non-negative matrix factorization is used as a technique for removing noise from audio signals. In this technique, a matrix obtained by performing time-frequency conversion on an audio signal is factorized into a basis matrix and an activity matrix by non-negative matrix factorization, on the assumption that each of these matrices can be divided into a partial matrix for the target sound and a partial matrix for noise. A target sound-restored signal from which noise has been removed is then generated using a target sound basis matrix, which is the partial basis matrix for the target sound, and a target sound activity matrix, which is the partial activity matrix for the target sound.
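The factorization and restoration described above can be illustrated by the following minimal sketch. It assumes multiplicative update rules for the Euclidean cost, which are one common way to carry out non-negative matrix factorization and are not stated in this description. The function names nmf and restore_target, the parameters n_bases, n_iter, eps and seed, and the index list target_idx are hypothetical names introduced only for illustration. How the target sound columns are chosen is outside the scope of the sketch.

```python
import numpy as np

def nmf(V, n_bases, n_iter=200, eps=1e-9, seed=0):
    """Factorize a non-negative matrix V (frequency x time) as W @ H.

    Uses multiplicative updates for the Euclidean cost (an assumed choice).
    """
    rng = np.random.default_rng(seed)
    n_freq, n_time = V.shape
    W = rng.random((n_freq, n_bases)) + eps  # basis matrix: each column is a base spectrum
    H = rng.random((n_bases, n_time)) + eps  # activity matrix: each row is an activation over time
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

def restore_target(W, H, target_idx):
    """Rebuild a magnitude spectrogram from the partial matrices selected by target_idx."""
    W_t = W[:, target_idx]  # target sound basis matrix (partial basis matrix)
    H_t = H[target_idx, :]  # target sound activity matrix (partial activity matrix)
    return W_t @ H_t
```

In this sketch, V would be a magnitude spectrogram obtained by time-frequency conversion, and the product returned by restore_target would still have to be converted back to a time-domain signal, a step that is omitted here.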
According to one past technique, a target sound and noise are prepared separately from the audio signal from which noise is to be removed, and a teacher basis matrix and a teacher activity matrix are obtained in advance for each of the target sound and the noise by training on these prepared sounds. The target sound-restored signal is then obtained by analyzing the matrix obtained through time-frequency conversion of the audio signal, using statistical information on the teacher basis matrices and the teacher activity matrices.
According to another past technique, two matrices are obtained by performing time-frequency conversion on each channel of a two-channel audio signal, and non-negative matrix factorization is then carried out on the obtained matrices. Then, among the base spectra that constitute the respective columns of each basis matrix, spectra with high interchannel correlation are taken as noise base spectra and the other spectra are taken as target sound base spectra. The target sound-restored signal is generated using a target sound basis matrix composed of the target sound base spectra and the target sound activity matrix corresponding thereto.
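The classification by interchannel correlation can be sketched as follows. The sketch assumes that the correlation is measured as a normalized inner product between columns of the two basis matrices and that a fixed threshold separates "high" correlation from the rest. Neither the measure nor the threshold value is specified in this description, and the names classify_by_interchannel_correlation, W_a, W_b and threshold are hypothetical.

```python
import numpy as np

def classify_by_interchannel_correlation(W_a, W_b, threshold=0.9):
    """Split the columns of W_a into noise and target sound base spectra.

    W_a, W_b: basis matrices (frequency x bases) of the two channels.
    Returns (noise_idx, target_idx) as index arrays into the columns of W_a.
    """
    eps = 1e-12
    # Normalize each base spectrum to unit length so the inner product is scale-free.
    A = W_a / (np.linalg.norm(W_a, axis=0, keepdims=True) + eps)
    B = W_b / (np.linalg.norm(W_b, axis=0, keepdims=True) + eps)
    corr = A.T @ B            # corr[i, j]: column i of W_a against column j of W_b
    best = corr.max(axis=1)   # highest correlation found in the other channel
    noise_idx = np.flatnonzero(best >= threshold)
    target_idx = np.flatnonzero(best < threshold)
    return noise_idx, target_idx
```

The returned target_idx would select the columns of the basis matrix, and the corresponding rows of the activity matrix, that are used to generate the target sound-restored signal.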
According to the first technique mentioned above, the basis matrix is trained in advance from sounds prepared separately, and the restored signal is then generated using that basis matrix. This is thought to be useful for separating musical instrument sounds in cases where the shape (harmonic structure) of the base spectrum is generally fixed, as with the scale notes of harmonic instruments (for example, for use in automatic transcription). In other cases, however, the restored signal may be generated using base spectra that differ from the sounds in the audio signal to be separated, and this technique can therefore degrade the audio quality.
The second technique described above obtains the basis matrix from the audio signal from which noise is to be removed, and a target sound-restored signal can likely be generated using the base spectra corresponding to the actual target sound as long as the target sound basis matrix and the noise basis matrix can be separated well. However, the target sound base spectra and the noise base spectra are classified based on interchannel correlation, and the technique therefore requires a multichannel audio signal.
Meanwhile, correlation is a quantity calculated from a pair of base spectra; for example, the Euclidean distance or the inner product between the base spectra is used. However, such a simple correlation index has no clear physical meaning and is not necessarily suited to the classification of base spectra.
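For reference, the two indices mentioned above can be computed for a pair of base spectra as in the following sketch. The function names and the example vectors u and v are hypothetical and serve only to illustrate the calculation.

```python
import numpy as np

def euclidean_distance(u, v):
    """Euclidean distance between two base spectra (smaller means more similar)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.linalg.norm(u - v))

def inner_product(u, v):
    """Inner product between two base spectra (larger means more similar)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v))

# Two illustrative base spectra with three frequency bins each.
u = np.array([1.0, 2.0, 2.0])
v = np.array([1.0, 0.0, 2.0])
d = euclidean_distance(u, v)
p = inner_product(u, v)
```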