In recent years, telephone fraud and other malicious solicitations aimed at defrauding people of money have become a social problem. To address this, techniques have been proposed for estimating a speaker's state of mind by monitoring the speaker's voice during telephone communication (for example, refer to Japanese Laid-open Patent Publication Nos. 2011-242755 and 2012-168296 and U.S. Patent Application No. 2013/0006630).
For example, an utterance state detection apparatus disclosed in Japanese Laid-open Patent Publication No. 2011-242755 extracts high-frequency components from the results of the frequency analysis of a speaker's utterance data, and calculates the degree of variation of the high-frequency components per unit time. Then, the utterance state detection apparatus detects the vocal utterance state of the specific speaker, based on the statistics obtained from the specific speaker's utterance data, the statistics being calculated section by section based on a plurality of degrees of variation during a predetermined period of time.
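The processing chain described for Japanese Laid-open Patent Publication No. 2011-242755 can be sketched roughly as follows. This is only a minimal illustration, not the published method: the naive DFT, the cutoff frequency, the use of the absolute frame-to-frame difference in high-band power as the "degree of variation", the choice of mean and variance as the section statistics, and all function names are assumptions made for this example.

```python
import cmath
import math
from statistics import mean, pvariance


def high_band_power(frame, sample_rate, cutoff_hz):
    """Sum of squared DFT magnitudes for bins at or above cutoff_hz (naive DFT)."""
    n = len(frame)
    power = 0.0
    for k in range(n // 2 + 1):
        if k * sample_rate / n < cutoff_hz:
            continue  # skip low-frequency bins
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame))
        power += abs(coeff) ** 2
    return power


def variation_degrees(samples, sample_rate, frame_len, cutoff_hz):
    """Assumed 'degree of variation': absolute change in high-band power
    between consecutive non-overlapping frames."""
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len + 1, frame_len)]
    powers = [high_band_power(f, sample_rate, cutoff_hz) for f in frames]
    return [abs(b - a) for a, b in zip(powers, powers[1:])]


def section_statistics(degrees, section_len):
    """Mean and population variance of the variation degrees, section by section."""
    stats = []
    for i in range(0, len(degrees) - section_len + 1, section_len):
        section = degrees[i:i + section_len]
        stats.append((mean(section), pvariance(section)))
    return stats
```

A detector built on such a sketch would then compare the per-section statistics against values expected for the speaker's normal state; how that comparison is made is not specified here.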
On the other hand, a suppressed state detection apparatus disclosed in Japanese Laid-open Patent Publication No. 2012-168296 analyzes input voice by dividing the input voice into a plurality of frames, and calculates the average value of the analysis results. The suppressed state detection apparatus determines a threshold value, based on the calculated average value of the analysis results and on statistical data concerning the average values of the analysis results prestored for a plurality of speakers and the cumulative frequency distribution of the analysis results, and calculates the frequency of occurrence of analysis results having values larger than the threshold value among the plurality of analysis results. Then, based on the frequency of occurrence, the suppressed state detection apparatus judges the state of tension of the vocal cords producing the voice.
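The thresholding idea described for Japanese Laid-open Patent Publication No. 2012-168296 can be illustrated with a small sketch. The publication's actual rule for combining the speaker's average with the prestored statistics is not given in this section, so the rule used below (the speaker's average plus a percentile of the pooled reference deviations) is an assumption, as are the function names and the `percent` parameter.

```python
from statistics import mean


def percentile(sorted_values, percent):
    """Value at the given cumulative frequency (nearest-rank, integer percent)."""
    rank = max(1, (percent * len(sorted_values) + 99) // 100)  # ceiling division
    return sorted_values[rank - 1]


def threshold_for(speaker_values, reference_speakers, percent=90):
    """Assumed rule: speaker's own average plus the deviation found at the
    given percentile of the pooled, mean-centered reference data."""
    deviations = []
    for ref in reference_speakers:
        ref_avg = mean(ref)
        deviations.extend(v - ref_avg for v in ref)
    deviations.sort()
    return mean(speaker_values) + percentile(deviations, percent)


def exceedance_rate(speaker_values, threshold):
    """Frequency of occurrence of per-frame results larger than the threshold."""
    return sum(v > threshold for v in speaker_values) / len(speaker_values)
```

For example, with reference data `[[1, 2, 3, 4, 5], [11, 12, 13, 14, 15]]` and per-frame results `[0, 0, 0, 0, 10]`, `threshold_for(..., percent=80)` gives a threshold of 3, and one frame in five exceeds it.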
A state detection apparatus disclosed in U.S. Patent Application No. 2013/0006630 calculates a plurality of statistics for feature parameters from a speaker's utterance data. Then, based on the feature parameter statistics of the speaker's utterance data and those of reference utterance data representing vocal utterance in a normal state, the state detection apparatus creates pseudo-utterance data having at least one statistic that matches one of the statistics of the reference utterance data. Next, based on the feature parameter statistics of the speaker's utterance data and of the pseudo-utterance data, the state detection apparatus calculates feature parameter statistics of synthesized utterance data obtained by replacing portions of the pseudo-utterance data with the corresponding portions of the speaker's input utterance data. Finally, the state detection apparatus detects an abnormal state of the speaker, based on the difference between the feature parameter statistics of the synthesized utterance data and those of the reference utterance data.
The above techniques assume that the voice of the speaker at the transmitting end and the voice of the speaker at the receiving end are captured separately. To capture them separately, a voice communication recording adapter is connected, for example, between the telephone base unit and the handset. The state estimating apparatus then estimates the state of the speaker by acquiring, through the adapter, a voice signal from the transmitting end and a voice signal from the receiving end separately. In this case, the voice signals that can be acquired through the adapter are limited to those arising from voice communication conducted over the telephone unit to which the adapter is connected. Therefore, if a plurality of telephone units are connected to one telephone line and the adapter is connected to only one of them, the state estimating apparatus is unable to estimate the state of the speaker from voice communication conducted on any of the other telephone units.
On the other hand, if the voice communication recording adapter is connected between the modular rosette and the distributor, and the state estimating apparatus acquires voice signals from the adapter thus connected, voice signals can be acquired from voice communication conducted on any of the telephone units connected to the distributor. However, in this case, any voice signal obtained from the adapter contains the voice of the speaker at the transmitting end and the voice of the speaker at the receiving end mixed together.
Therefore, if the above techniques, which assume that the two speakers' voices are captured separately, are applied to such mixed voice signals, it is difficult to achieve sufficient estimation accuracy. This is because the voice of the other speaker is superimposed on the voice of the intended speaker, so the features used to estimate the state of the intended speaker also include features of the other speaker's voice. Meanwhile, a technique has been proposed that separates sounds from two sound sources by estimating parameters of a sine wave superimposition model (for example, refer to Japanese Laid-open Patent Publication No. 2008-304718).
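As a toy illustration of why superimposed voices degrade estimation, the sketch below mixes two synthetic tones standing in for the two speakers and compares a simple feature (mean power) computed on the intended speaker alone and on the mixture. The tones, frequencies, amplitudes, and function names are assumptions chosen for the example; real speech features and the cited techniques are considerably more involved.

```python
import math


def tone(freq_hz, amplitude, sample_rate, n):
    """Synthetic stand-in for one speaker's voice: a single sine tone."""
    return [amplitude * math.sin(2 * math.pi * freq_hz * t / sample_rate)
            for t in range(n)]


def mean_power(samples):
    """Average of squared sample values."""
    return sum(x * x for x in samples) / len(samples)


sample_rate = 8000
n = 800  # 0.1 s, a whole number of cycles for both tones

intended = tone(200, 1.0, sample_rate, n)  # speaker whose state is estimated
other = tone(300, 0.5, sample_rate, n)     # the other party on the line
mixed = [a + b for a, b in zip(intended, other)]

power_alone = mean_power(intended)
power_mixed = mean_power(mixed)
```

Here the mixture's mean power is the sum of the two individual powers, so a feature measured on the mixed signal is biased by the other speaker's contribution even though the intended speaker's voice has not changed.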