The present disclosure relates to a signal processing device, a signal processing method, and a program, and more particularly, to a signal processing device, a signal processing method, and a program capable of identifying a piece of music from an input signal in which the piece of music is mixed with noise.
In the related art, a piece of music input as an input signal is identified by a matching process that compares the feature quantity of the input signal with the feature quantities of reference signals, which are the candidates for the piece of music to be identified. However, when, for example, the broadcast audio of a television program such as a drama is input as the input signal, the input signal often contains not only a signal component of a piece of music used as background music (BGM) but also noise components other than the piece of music (hereinafter also simply referred to as noise), such as human conversation or ambient noise. The variation in the feature quantity of the input signal caused by this noise affects the result of the matching process.
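The basic, unmasked matching process described above can be sketched as follows. This is a minimal illustration under assumptions that this disclosure does not state: the feature quantity is taken to be a log-magnitude spectrogram, the degree of similarity is cosine similarity, and the names `feature_matrix`, `similarity`, and `identify` are hypothetical.

```python
import numpy as np

def feature_matrix(signal, frame_len=256, hop=128):
    """Assumed feature quantity: log-magnitude spectrogram.

    Returns a matrix whose rows are frequency bins and whose columns are time frames.
    """
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.log1p(np.abs(np.fft.rfft(frames, axis=1))).T

def similarity(a, b):
    """Cosine similarity between two feature matrices of the same shape."""
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def identify(input_signal, references):
    """Return the name of the reference signal most similar to the input, plus all scores."""
    x = feature_matrix(input_signal)
    scores = {name: similarity(x, feature_matrix(ref))
              for name, ref in references.items()}
    return max(scores, key=scores.get), scores
```

Every component of the feature matrix contributes to the score here, so noise that spreads energy across all bins lowers the similarity to the correct reference. That effect is what the masking techniques below try to limit.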
Therefore, a technique has been proposed that performs the matching process using only highly reliable components, by applying a mask pattern that masks the low-reliability components in the feature quantity of the input signal.
Specifically, in one proposed technique, the input signal is transformed into a signal in the time-frequency domain and its feature quantity is expressed as a feature matrix. Plural types of mask patterns are prepared, each masking the matrix components that correspond to a predetermined time-frequency region. The feature quantity of the input signal is then matched against the feature quantities of plural reference signals in a database using all of the mask patterns, and the piece of music of the reference signal with the highest degree of similarity is identified as the piece of music of the input signal (see, for example, Japanese Unexamined Patent Application Publication No. 2009-276776).
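Matching with plural mask patterns can be sketched as follows. This is an illustrative sketch only, not the method of the cited publication: the band-shaped masks produced by the hypothetical `band_masks` helper, cosine similarity as the degree of similarity, and scoring each reference by its best score over all masks are all assumptions.

```python
import numpy as np

def masked_similarity(x, r, mask):
    """Cosine similarity computed only over matrix components where mask is True."""
    a, b = x[mask], r[mask]
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def band_masks(n_freq, n_time, n_bands=4):
    """Hypothetical mask patterns: each one hides a single frequency band in every frame."""
    edges = np.linspace(0, n_freq, n_bands + 1).astype(int)
    masks = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        m = np.ones((n_freq, n_time), dtype=bool)
        m[lo:hi, :] = False  # masked (unreliable) time-frequency region
        masks.append(m)
    return masks

def identify_with_masks(x, references, masks):
    """Score each reference by its best masked similarity (assumed rule); return the best."""
    scores = {name: max(masked_similarity(x, r, m) for m in masks)
              for name, r in references.items()}
    return max(scores, key=scores.get), scores
```

If the noise happens to fall inside a region that one of the prepared masks hides, that mask recovers the similarity to the correct reference, whichever region the noise occupies. The cost is that each reference is matched once per mask pattern.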
Another proposed technique assumes that the components in time intervals with high average power in the input signal are components on which noise other than the piece of music is superimposed, and creates a mask pattern so that the matching process uses only the feature quantity of time intervals with low average power in the input signal (see, for example, Japanese Unexamined Patent Application Publication No. 2004-326050).
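A power-based time mask of this kind can be sketched as follows. This is a minimal sketch under stated assumptions, not the procedure of the cited publication: the feature matrix is taken to hold non-negative power values (frequency bins by time frames), frames at or below the median frame power are treated as "low average power" (the `quantile=0.5` threshold is an illustrative choice), and `low_power_mask` and `identify_low_power` are hypothetical names.

```python
import numpy as np

def low_power_mask(power_spec, quantile=0.5):
    """Keep only time frames whose average power is at or below an assumed quantile.

    power_spec: (n_freq, n_time) array of non-negative power values.
    """
    frame_power = power_spec.mean(axis=0)          # average power of each time frame
    threshold = np.quantile(frame_power, quantile)
    keep_frames = frame_power <= threshold         # high-power frames assumed noisy
    return np.broadcast_to(keep_frames, power_spec.shape)  # same decision for every bin

def masked_similarity(x, r, mask):
    """Cosine similarity computed only over matrix components where mask is True."""
    a, b = x[mask], r[mask]
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def identify_low_power(x, references):
    """Match using only the low-power time frames of the input signal."""
    mask = low_power_mask(x)
    scores = {name: masked_similarity(x, r, mask) for name, r in references.items()}
    return max(scores, key=scores.get), scores
```

Unlike the plural-mask approach, only one mask is built and it is derived from the input signal itself. That rules out masks that cannot fit the input, but it assumes noise always raises the average power of the frames it occupies.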