Technology is known for estimating, from an observation signal, a target value that should be obtained for that signal. A typical example is “music emotion recognition,” which estimates, from a music audio signal, the emotion (the target value) that a listener feels when listening to the music. Conventional studies on music emotion recognition have focused on finding acoustic features that improve estimation accuracy and on proposing new methods of regression between the acoustic features and values indicating the emotion elicited by the music.
In conventional music emotion recognition, acoustic features are calculated from a music audio signal, and the calculated features are mapped into a space representing music emotions by regression or by cluster classification. In the field of psychological studies, it has been proposed that the emotion a human feels should be represented with two-dimensional values of Valence and Arousal (VA values) [Non-patent Document 1]. In music emotion recognition, the VA values are analyzed based on the music audio signal. More specifically, the VA values are estimated for segments of the music audio signal, each lasting 30 seconds. This is the problem setting employed in Emotion in Music at the MediaEval Workshop, in which participants compete on the performance of their music emotion recognition algorithms. This problem setting has driven the recent evaluation campaigns for music emotion recognition [Non-patent Documents 2 and 3]. FIG. 18 illustrates the VA values on a two-dimensional plane, that is, a space of emotion having the Valence and Arousal values as two-dimensional coordinates, with textual annotations of emotion at individual coordinate points in the space. The inventors prepared this figure based on a figure of Non-patent Document 5, which was redrafted from a figure of Non-patent Document 4, and annotated it with Japanese equivalents of the English annotations.
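The segment-wise problem setting described above may be illustrated by the following minimal sketch, which splits a signal into 30-second segments and calculates one feature vector per segment. The two features (RMS energy and zero-crossing rate), the function names, and the synthetic sine-wave signal are assumptions chosen only for illustration; they are not the acoustic features used in the cited documents.

```python
import math

SEGMENT_SECONDS = 30.0  # segment length of the problem setting described above


def split_into_segments(samples, sample_rate, segment_seconds=SEGMENT_SECONDS):
    """Split a mono signal into consecutive full-length segments.

    A trailing remainder shorter than one segment is dropped.
    """
    seg_len = int(round(sample_rate * segment_seconds))
    if seg_len <= 0:
        raise ValueError("segment length must be at least one sample")
    last_start = len(samples) - seg_len
    return [samples[i:i + seg_len] for i in range(0, last_start + 1, seg_len)]


def rms_energy(segment):
    """Root-mean-square amplitude of one segment (illustrative feature)."""
    return math.sqrt(sum(s * s for s in segment) / len(segment))


def zero_crossing_rate(segment):
    """Fraction of adjacent sample pairs whose signs differ (illustrative feature)."""
    crossings = sum(
        1 for a, b in zip(segment, segment[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(segment) - 1)


# Hypothetical input: 65 seconds of a 5 Hz sine wave sampled at 100 Hz.
sample_rate = 100
signal = [
    math.sin(2.0 * math.pi * 5.0 * n / sample_rate)
    for n in range(65 * sample_rate)
]

segments = split_into_segments(signal, sample_rate)
# One feature vector per 30-second segment; a regression model would then
# map each vector to VA values.
feature_vectors = [[rms_energy(s), zero_crossing_rate(s)] for s in segments]
```

In this sketch the 65-second signal yields two full segments, and the final 5 seconds are discarded; whether a remainder should be dropped, padded, or overlapped is a design choice not specified in the text above.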
In conventional studies on music emotion recognition, efforts have been made to find acoustic features that improve analysis performance. Methods have been proposed for mapping the chosen acoustic features into the emotion space using linear regression, such as multivariate regression analysis [Non-patent Documents 6 and 7]. Further, automatic selection of a combination of effective acoustic features using a feature selection algorithm has been discussed [Non-patent Document 8]. As an alternative to careful feature selection, a method based on multi-level regression has been proposed. In this method, regression models that take the acoustic features as input are constructed in advance, and another model is then used to aggregate the estimation results from the individual regression models [Non-patent Documents 9 and 10].
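The linear mapping from acoustic features into the emotion space described above may be sketched as follows, assuming an ordinary least-squares fit with one weight vector per emotion dimension. The feature values, the VA targets, and all function names are hypothetical and serve only to illustrate the idea; the sketch does not reproduce the method of any cited document.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [list(a[i]) + [b[i]] for i in range(n)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_linear_regression(features, targets):
    """Least-squares weights (bias first) for one target dimension."""
    rows = [[1.0] + list(x) for x in features]
    d = len(rows[0])
    # Normal equations: (X^T X) w = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(d)] for i in range(d)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(d)]
    return solve_linear_system(xtx, xty)


def predict(weights, x):
    """Evaluate the fitted linear model on one feature vector."""
    return weights[0] + sum(w * v for w, v in zip(weights[1:], x))


# Hypothetical training data: two acoustic features per segment, with
# Valence and Arousal annotations generated from an exactly linear rule.
features = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.25]]
valence = [0.1, 0.6, -0.1, 0.4, 0.3]
arousal = [-0.3, 0.5, 0.1, 0.9, 0.2]

valence_weights = fit_linear_regression(features, valence)
arousal_weights = fit_linear_regression(features, arousal)

# Estimate the VA values of a new, unseen segment.
new_segment = [0.2, 0.6]
estimated_va = (
    predict(valence_weights, new_segment),
    predict(arousal_weights, new_segment),
)
```

Fitting Valence and Arousal independently is the simplest choice. A multi-level scheme of the kind described above would instead feed the outputs of several such first-level models into a further aggregating model.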
In addition to the above, other proposed approaches use non-linear regression models and apply non-linear dimensionality reduction. Examples include analysis using neural networks [Non-patent Documents 11, 12, and 13], analysis using a support vector machine [Non-patent Document 14], and analysis using Gaussian process regression [Non-patent Documents 15 and 16].
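As one illustration of the non-linear regression mentioned above, the following minimal sketch computes the posterior mean of a Gaussian process regression. It assumes a zero-mean prior, a squared-exponential (RBF) kernel, and fixed hyperparameters; the kernel choice, the hyperparameter values, the training data, and all function names are assumptions made for illustration and are not taken from the cited documents.

```python
import math


def rbf_kernel(x, y, length_scale):
    """Squared-exponential kernel between two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * length_scale ** 2))


def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [list(a[i]) + [b[i]] for i in range(n)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_gp(features, targets, length_scale=0.5, noise_variance=1e-6):
    """Return alpha = (K + noise_variance * I)^{-1} y for a zero-mean GP."""
    n = len(features)
    gram = [
        [
            rbf_kernel(features[i], features[j], length_scale)
            + (noise_variance if i == j else 0.0)
            for j in range(n)
        ]
        for i in range(n)
    ]
    return solve_linear_system(gram, targets)


def gp_predict_mean(features, alpha, x, length_scale=0.5):
    """Posterior mean at x: sum over i of alpha_i * k(x_i, x)."""
    return sum(a * rbf_kernel(f, x, length_scale) for f, a in zip(features, alpha))


# Hypothetical training data: two acoustic features per segment and
# Arousal annotations that are not a linear function of the features.
features = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
arousal = [0.0, 0.8, 0.3, 0.5, 0.9]

alpha = fit_gp(features, arousal)
fitted = [gp_predict_mean(features, alpha, f) for f in features]
estimate = gp_predict_mean(features, alpha, [0.25, 0.75])
```

With the small noise variance assumed here, the posterior mean nearly interpolates the training annotations. In practice the length scale and noise variance would be tuned, for example by maximizing the marginal likelihood, which this sketch omits.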