A pattern recognition device (speaker attribute recognition device) that determines a speaker's gender from voice is known. Such a device extracts frames from the sound waveform, each corresponding to a fixed period, classifies each frame as male, female, or silence, and then determines the gender by counting the number of frames classified as male and the number classified as female.
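The frame-counting scheme described above can be sketched as follows. This is a minimal illustration only: the function name `recognize_by_frame_count`, the label strings, and the handling of ties are assumptions made for the example and are not taken from the related art.

```python
from collections import Counter


def recognize_by_frame_count(frame_labels):
    """Decide the speaker gender from per-frame labels by simple counting.

    frame_labels: iterable of "male", "female", or "silence", one per frame.
    Returns "male", "female", or None when the counts tie (assumed behavior).
    """
    # Silence frames are ignored; only male/female frames are counted.
    counts = Counter(label for label in frame_labels if label in ("male", "female"))
    if counts["male"] > counts["female"]:
        return "male"
    if counts["female"] > counts["male"]:
        return "female"
    return None  # tie, or no frame was classified as male or female


# Hypothetical per-frame recognition results for one utterance.
labels = ["silence", "male", "male", "female", "silence", "male"]
decision = recognize_by_frame_count(labels)
```

In this sketch each frame contributes one vote regardless of how confident the frame-level recognizer was.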
The following device has also been proposed. First, the sound features of male speech, female speech, and silence are each modeled in advance by GMMs (Gaussian mixture models), and a voice feature is calculated for each frame extracted from the sound waveform over a fixed period. Next, the GMMs are used to perform pattern matching against the male, female, and silence models. The larger of the male and female likelihoods is compared with the silence likelihood to detect a series of voice segments that may include short silent pauses. The male and female likelihoods calculated for the frames are then added up over the series of voice segments, and the two sums are compared. In this way, both segment detection and gender recognition are performed by a single frame-level recognizer.
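The GMM-based procedure described above can be sketched as follows, under several assumptions that are not stated in the related art: the feature is a single scalar per frame, the model parameters in `MODELS` are toy values, likelihoods are accumulated in the log domain, and a pause of at most `max_pause` frames (a hypothetical parameter) is absorbed into the surrounding voice segment.

```python
import math


def gmm_log_likelihood(x, gmm):
    """Log-likelihood of scalar x under a 1-D GMM of (weight, mean, variance) tuples."""
    terms = [
        math.log(w) - 0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)
        for w, mu, var in gmm
    ]
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))  # log-sum-exp


# Toy models with assumed parameters, for illustration only.
MODELS = {
    "male": [(0.5, 110.0, 200.0), (0.5, 130.0, 300.0)],
    "female": [(0.5, 200.0, 300.0), (0.5, 230.0, 400.0)],
    "silence": [(1.0, 0.0, 25.0)],
}


def recognize_segments(features, models=MODELS, max_pause=2):
    """Detect voice segments and label each one by summed log-likelihood."""
    segments = []
    current = None          # running totals for the open segment
    pending = [0.0, 0.0]    # male/female totals of pause frames not yet committed
    pause = 0               # length of the current run of silence frames
    for i, x in enumerate(features):
        ll = {name: gmm_log_likelihood(x, gmm) for name, gmm in models.items()}
        # A frame is voiced when the better gender model beats the silence model.
        voiced = max(ll["male"], ll["female"]) > ll["silence"]
        if voiced:
            if current is None:
                current = {"start": i, "end": i, "male": 0.0, "female": 0.0}
            else:
                # Short pause inside the segment: its likelihoods are added too.
                current["male"] += pending[0]
                current["female"] += pending[1]
            pending, pause = [0.0, 0.0], 0
            current["male"] += ll["male"]
            current["female"] += ll["female"]
            current["end"] = i
        elif current is not None:
            pause += 1
            if pause > max_pause:
                segments.append(current)  # pause too long: close the segment
                current = None
            else:
                pending[0] += ll["male"]
                pending[1] += ll["female"]
    if current is not None:
        segments.append(current)
    return [
        (s["start"], s["end"], "male" if s["male"] > s["female"] else "female")
        for s in segments
    ]


# Hypothetical scalar features; values near 0 stand in for silence frames.
features = [0, 0, 115, 120, 0, 125, 0, 0, 0, 210, 220, 0]
segments = recognize_segments(features)
```

Note that the sketch deliberately accumulates the male and female likelihoods over short pauses inside a segment, since the description above sums over the whole series of voice segments.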
However, in the related art, both the detection of the voice segment and the recognition of the speaker attribute for that segment rely on the output of a single recognizer applied to frames extracted from the sound waveform over a fixed period. The frame-level recognition problem is therefore solved indirectly, by comparing the likelihoods of generative models, rather than directly by using a probability. Moreover, when the detected voice segment partly includes a silent segment, the likelihoods used to determine the speaker attribute are calculated and added up for the silent segment as well.