Conventionally, a method has been proposed for a situation in which acoustic signals output from target sound sources and acoustic signals due to background noise are present in a mixed manner. From observation signals of sound collected by a plurality of microphones, the method estimates, for each of the target sound sources, the spatial correlation matrix that would be obtained if the observation signals included only that target sound source. Furthermore, when estimating the spatial correlation matrix, a mask is used in some cases; the mask is the proportion of each of the acoustic signals included in the observed acoustic signals.
The spatial correlation matrix is a matrix representing the auto-correlation of the signal at each microphone and the cross-correlation of the signals between microphones. It is used, for example, to estimate the position of the target sound source or to design a beamformer that extracts only the target sound source from the observation signals.
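As an illustration of the description above, the following is a minimal sketch, assuming NumPy and complex-valued microphone signals at a single frequency. The function name `spatial_correlation` and the random test data are hypothetical and do not appear in the conventional device; the sketch only shows that the diagonal entries hold the auto-correlation of each microphone and the off-diagonal entries hold the cross-correlation between microphones.

```python
import numpy as np

def spatial_correlation(x):
    """Hypothetical helper: x has shape (frames, mics), complex-valued.

    Returns the (mics, mics) matrix whose entry (m, n) is the time
    average of x[t, m] * conj(x[t, n]).
    """
    return np.einsum("tm,tn->mn", x, x.conj()) / x.shape[0]

# Synthetic data for illustration only (3 microphones, 200 frames).
rng = np.random.default_rng(0)
x = rng.standard_normal((200, 3)) + 1j * rng.standard_normal((200, 3))
R = spatial_correlation(x)

auto = R.diagonal().real   # auto-correlation (power) of each microphone
cross_01 = R[0, 1]         # cross-correlation between microphones 0 and 1
```

By construction the resulting matrix is Hermitian, so the entry for microphones (1, 0) is the complex conjugate of the entry for (0, 1).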
Here, a conventional spatial correlation matrix estimation device will be described with reference to FIG. 6. FIG. 6 is a diagram illustrating the configuration of the conventional spatial correlation matrix estimation device. As illustrated in FIG. 6, first, a time-frequency analysis unit 10a calculates an observation feature value vector for each time-frequency point extracted from the observation signals. Then, a mask estimation unit 20a estimates the masks associated with the target sound source and the background noise based on the observation feature value vectors. Furthermore, an observation feature value matrix calculation unit 30a calculates an observation feature value matrix by multiplying the observation feature value vector by the Hermitian transpose of that same observation feature value vector.
Then, a target sound feature value matrix time average calculation unit 40a calculates an average target sound feature value matrix that is the time average of the matrix obtained by multiplying the mask associated with the target sound source by the observation feature value matrix. Furthermore, a noise feature value matrix time average calculation unit 50a calculates an average noise feature value matrix that is the time average of the matrix obtained by multiplying the mask associated with the background noise by the observation feature value matrix. Lastly, a target sound feature value noise removal unit 60a estimates a spatial correlation matrix of the target sound source by subtracting the average noise feature value matrix from the average target sound feature value matrix.
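The processing of units 30a to 60a described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the device's actual implementation: it assumes NumPy, assumes the observation feature value vectors are arranged as an array of shape (frames, frequencies, microphones), assumes the masks are already available from the mask estimation unit, and takes the time average as a plain mean over frames. The function name `conventional_scm` and the random inputs are hypothetical.

```python
import numpy as np

def conventional_scm(X, mask_target, mask_noise):
    """Hypothetical sketch of the conventional estimation flow.

    X:           (frames, freqs, mics) complex observation feature vectors
    mask_target: (frames, freqs) mask associated with the target sound source
    mask_noise:  (frames, freqs) mask associated with the background noise
    Returns a (freqs, mics, mics) estimate of the target spatial
    correlation matrix for each frequency.
    """
    # Unit 30a: observation feature value matrix x x^H per time-frequency point.
    outer = np.einsum("tfm,tfn->tfmn", X, X.conj())
    # Unit 40a: time average of the target-mask-weighted matrices.
    avg_target = (mask_target[..., None, None] * outer).mean(axis=0)
    # Unit 50a: time average of the noise-mask-weighted matrices.
    avg_noise = (mask_noise[..., None, None] * outer).mean(axis=0)
    # Unit 60a: subtract the average noise matrix from the average target matrix.
    return avg_target - avg_noise

# Synthetic inputs for illustration only (50 frames, 4 frequencies, 3 microphones).
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 4, 3)) + 1j * rng.standard_normal((50, 4, 3))
mask_target = rng.uniform(size=(50, 4))
mask_noise = 1.0 - mask_target  # assumption: the two masks sum to one
R_target = conventional_scm(X, mask_target, mask_noise)
```

Because both averages are built from Hermitian matrices weighted by real-valued masks, the difference is also Hermitian for every frequency.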