In recent years, techniques have become known that analyze voice data and detect a state, such as an emotion, of a user. For example, a method is known in which intensity, speed, tempo, intonation representing intensity change patterns of an utterance, and the like are detected based on a voice signal, and an emotional state, such as sadness, anger, or happiness, is then derived from their change amounts (for example, refer to Patent Document 1). For another example, a method is known in which a voice signal is subjected to lowpass filtering to extract a feature of the voice signal, such as intensity or pitch, so as to detect an emotion (for example, refer to Patent Document 2). For still another example, a method is known in which a feature relating to a phonologic spectrum is extracted from voice information, and an emotional state is determined based on a state determination table provided in advance (for example, refer to Patent Document 3). Furthermore, a device is known that extracts a periodic fluctuation of the amplitude envelope of a voice signal and determines from the fluctuation whether a user is making an utterance in a forceful state, so as to detect anger or irritation of the user (for example, refer to Patent Document 4).

Patent Document 1: Japanese Laid-open Patent Publication No. 2002-091482.
Patent Document 2: Japanese Laid-open Patent Publication No. 2003-099084.
Patent Document 3: Japanese Laid-open Patent Publication No. 2005-352154.
Patent Document 4: Japanese Laid-open Patent Publication No. 2009-003162.
In most of the related art emotion detection techniques described above, specified user reference information indicating a state of a specified user is prepared in advance, as reference information for each user, from feature amounts of voice data that characterize an individual user, such as voice pitch, voice volume, and prosody information. An emotion of the user is then detected by comparing each feature amount of the voice data serving as a detection target with the specified user reference information. That is, the related art techniques presuppose that reference information has been prepared in advance for each specified user.
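The comparison against per-user reference information described above can be illustrated by the following minimal sketch. It is not taken from any of the cited patent documents; the names `Features`, `REFERENCE`, and `detect_state`, the two feature amounts, the reference values, and the 20% relative threshold are all assumptions introduced for illustration only.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Features:
    """Hypothetical feature amounts extracted from voice data."""
    pitch_hz: float
    volume_db: float


# Assumed reference information, prepared in advance for each specified user.
REFERENCE: Dict[str, Features] = {
    "user_a": Features(pitch_hz=120.0, volume_db=60.0),
}


def detect_state(user_id: str, observed: Features,
                 rel_threshold: float = 0.2) -> str:
    """Compare observed feature amounts with the user's stored reference.

    Returns "unusual" if any feature deviates from the reference by more
    than rel_threshold (an assumed value), otherwise "ordinary".
    """
    ref = REFERENCE.get(user_id)
    if ref is None:
        # No reference was prepared for this user, so no comparison is possible.
        raise KeyError(f"no reference information for user {user_id!r}")
    pitch_dev = abs(observed.pitch_hz - ref.pitch_hz) / ref.pitch_hz
    volume_dev = abs(observed.volume_db - ref.volume_db) / ref.volume_db
    if max(pitch_dev, volume_dev) > rel_threshold:
        return "unusual"
    return "ordinary"
```

In this sketch, a call for a user who has no entry in `REFERENCE` fails outright, which reflects the dependence of such schemes on reference information prepared per user.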
However, preparing reference information for each specified user in advance raises a problem in that application of the technique is limited to specified users, and cumbersome work is needed to produce reference information each time the technique is introduced.
Taking such a problem into consideration, the technique disclosed herein aims to provide an utterance state detection device and an utterance state detection method that can detect an utterance state without preparing reference information for each specified user in advance.