Field of the Invention
The present invention relates to a technique of picking up a target sound while suppressing noise.
Description of the Related Art
The recent proliferation of camcorders, cameras, smartphones, and the like has made video shooting easy. Portable audio recorders capable of recording high-quality sound have also become widespread. As a result, there are more opportunities to record or pick up an ambient sound or the sound of a target object, both indoors and outdoors, with or without accompanying video.
If noise other than the target sound, for example the operating sound of an air conditioner or a PC indoors, or wind noise outdoors, is mixed into such a sound pickup signal, the result is unpleasant to hear and also impedes voice recognition. Hence, suppressing unnecessary noise in the sound pickup signal has long been important.
Some techniques of suppressing noise in an audio signal use non-negative matrix factorization (NMF). In these techniques, a short-time Fourier transform is applied to an audio signal, and a matrix in which the absolute amplitude values of the resulting coefficients are arranged in time series (to be referred to as an audio matrix hereinafter) is factorized into a basis spectrum matrix and an activity matrix by non-negative matrix factorization. On the assumption that these matrices can be separated into components derived from the individual sound sources, each matrix is divided into a submatrix concerning the target sound and a submatrix concerning noise. A noise-suppressed target sound reconstruction signal is then generated using a target sound basis spectrum matrix, that is, the basis spectrum submatrix concerning the target sound, and a target sound activity matrix, that is, the activity submatrix concerning the target sound. Note that an audio matrix displayed as a map colored according to its values is generally called a spectrogram.
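The factorization and reconstruction described above can be sketched as follows. This is a minimal illustration only, not the method of any cited literature: it assumes the Euclidean cost with the widely used multiplicative update rules, a small synthetic audio matrix in place of a real short-time Fourier transform, and hypothetical names (`nmf`, `H` for the basis spectrum matrix, `U` for the activity matrix). NMF alone does not indicate which basis belongs to the target sound, so the index chosen as the target here is arbitrary.

```python
import numpy as np

def nmf(V, rank, n_iter=200, seed=0, eps=1e-9):
    """Factorize a non-negative audio matrix V (frequency x time) into a
    basis spectrum matrix H (frequency x rank) and an activity matrix U
    (rank x time) using multiplicative updates for the Euclidean cost."""
    rng = np.random.default_rng(seed)
    n_freq, n_time = V.shape
    H = rng.random((n_freq, rank)) + eps
    U = rng.random((rank, n_time)) + eps
    for _ in range(n_iter):
        # Each update keeps the factors non-negative.
        U *= (H.T @ V) / (H.T @ H @ U + eps)
        H *= (V @ U.T) / (H @ (U @ U.T) + eps)
    return H, U

# Synthetic audio matrix (assumption): one peaked "target" spectrum that is
# intermittently active plus one stationary "noise" spectrum.
freqs = np.arange(16)
target_spec = np.exp(-0.5 * (freqs - 4.0) ** 2)
noise_spec = 1.0 / (1.0 + freqs)
target_act = (np.arange(40) % 10 < 5).astype(float)
noise_act = np.full(40, 0.5)
V = np.outer(target_spec, target_act) + np.outer(noise_spec, noise_act)

H, U = nmf(V, rank=2)

# Classification into target and noise is assumed to be given; index 0 is
# picked arbitrarily here purely to show the submatrix products.
target_idx, noise_idx = [0], [1]
V_target = H[:, target_idx] @ U[target_idx, :]  # target amplitude estimate
V_noise = H[:, noise_idx] @ U[noise_idx, :]     # noise amplitude estimate
```

In an actual system the target amplitude estimate would be combined with phase information and passed through an inverse short-time Fourier transform to obtain a time-domain signal; that step is omitted here.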
For example, in patent literature 1 (Japanese Patent Laid-Open No. 2009-128906), a target sound and noise are prepared separately from the audio signal that is the noise removal target and are learned in advance, thereby obtaining a teacher basis spectrum matrix and a teacher activity matrix for each of the target sound and noise. Then, using statistical information of the teacher basis spectrum matrices and the teacher activity matrices, a matrix obtained by time-frequency conversion of the audio signal is factorized to obtain a target sound reconstruction signal.
In patent literature 2 (Japanese Patent Laid-Open No. 2012-22120), two matrices obtained by time-frequency conversion of audio signals of two channels undergo non-negative matrix factorization. Among the basis spectra forming the columns of the basis matrix of each channel, basis spectra having high correlation between the channels are defined as noise basis spectra, and the rest are defined as target sound basis spectra. A target sound reconstruction signal is generated using a target sound basis matrix formed from the target sound basis spectra and the target sound activity matrix corresponding to it.
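The classification by inter-channel correlation described above may be illustrated by the following sketch. The correlation measure (Pearson correlation between basis spectra), the threshold value of 0.9, and all names are assumptions made for illustration; the cited literature may define the correlation and the decision rule differently.

```python
import numpy as np

def classify_bases_by_channel_correlation(H_a, H_b, threshold=0.9):
    """Split the columns (basis spectra) of H_a into noise and target
    indices according to their best correlation with any column of H_b.
    The threshold is a hypothetical value chosen for this example."""
    def standardize(H):
        # Center and normalize each column so that a dot product between
        # two columns equals their Pearson correlation.
        Hc = H - H.mean(axis=0, keepdims=True)
        norm = np.linalg.norm(Hc, axis=0, keepdims=True)
        return Hc / np.maximum(norm, 1e-12)

    corr = standardize(H_a).T @ standardize(H_b)  # shape (K_a, K_b)
    best = corr.max(axis=1)
    noise_idx = np.flatnonzero(best >= threshold)  # highly correlated
    target_idx = np.flatnonzero(best < threshold)  # the rest
    return noise_idx, target_idx

# Hand-made basis matrices (assumption): both channels share one spectrum
# up to a gain factor, and each also has a channel-specific spectrum.
shared = np.array([8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0])
only_a = np.array([0.0, 5.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0])
only_b = np.array([0.0, 0.0, 0.0, 3.0, 0.0, 0.0, 2.0, 0.0])
H_a = np.column_stack([shared, only_a])
H_b = np.column_stack([only_b, 2.0 * shared])

noise_idx, target_idx = classify_bases_by_channel_correlation(H_a, H_b)
H_a_target = H_a[:, target_idx]  # target sound basis matrix for channel A
```

The rows of the activity matrix with the same indices as `target_idx` would then form the corresponding target sound activity matrix.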
In the conventional techniques of separating sound sources using NMF, each basis spectrum is not necessarily derived from only one sound source, and components of a plurality of sound sources may be mixed in a single basis spectrum. Hence, when noise is suppressed by NMF, the reconstructed target sound degrades because target sound components included in part of the noise basis spectrum matrix are removed together with the noise.
For example, the technique disclosed in patent literature 1 attempts to strictly separate sound sources by learning basis spectra and activities in advance. However, if target sound components are nevertheless included in the noise basis spectrum matrix as a result of the separation, this cannot be corrected afterward. There is also related art that attempts to extract target sound components from a noise signal separated and reconstructed by NMF.
For example, the technique disclosed in patent literature 2 extracts residual components from a reconstructed noise signal based on the harmonic structure of a target sound signal reconstructed by NMF. However, such extraction is difficult with this method when the target sound signal has no harmonic structure.