1. Field of the Invention
The present invention relates to a clustering system that clusters language models, a clustering method, a clustering program, and an attribute estimation system using such a clustering system.
2. Description of Related Art
Conventionally, systems that perform attribute estimation and speech recognition using an audio model are known. FIG. 26 shows the flow of data in a conventional attribute estimation apparatus. An attribute estimation apparatus 901 of FIG. 26 is intended to estimate attributes of speakers, such as age groups. When receiving input speech uttered by a speaker, the attribute estimation apparatus 901 estimates the age group of the speaker using an audio model concerning a plurality of age groups recorded beforehand, and outputs the estimation result. In the example of FIG. 26, an audio model is prepared by collecting the sounds contained in speech uttered by speakers in the respective age classes of ages 0 to 10, 11 to 20, . . . 61 to 70 and 70 or older. Conventionally, an audio model is prepared using such age classes determined by a human.
When the age classes are set by a human, it is difficult for the age classes of the audio model to reflect the ages at which a speaker's voice changes, the ages at which an adult's voice turns into the hoarse voice of the elderly, the ages at which word usage shifts from that of the young to that of adults, the ages at which word usage shifts from that of adults to that of the elderly, and the like. Thus, attribute classes of an audio model that a human sets as he/she sees fit inhibit improvement in the performance of an attribute estimation apparatus.
In order to determine the attribute classes accurately, it is preferable to cluster a model composed of a large amount of speaker data. Conventionally, technology for clustering audio models composed of a large amount of speaker data has been developed. For instance, a clustering apparatus has been proposed in which feature quantities of the vocal-tract configurations of a plurality of speakers are estimated from speech-waveform data of the respective speakers, and the speakers are clustered based on these feature quantities (e.g., see JP H11(1999)-175090A). Another method has been proposed in which a feature quantity of a speaker is extracted based on information about a vocal-tract length obtained from speech data of the speaker and information for correcting influences of his/her way of vocalization and habits, and the speaker is clustered using this feature quantity (e.g., see JP 2002-182682A). As a result of such clustering, the classes of the attributes of an audio model can be set accurately, and clustered audio models can be obtained.
FIG. 27 shows the flow of data in a conventional attribute estimation apparatus that performs attribute estimation using a language model and audio models subjected to clustering. The audio model group is clustered according to ages of speakers, where audio model 1, audio model 2 . . . audio model n are audio models recorded according to their clusters. When receiving an input of voice uttered by a speaker, an attribute estimation apparatus 902 estimates an age group of the speaker using the audio model group subjected to clustering into a plurality of age groups and the language model, and outputs the estimation result. The attribute estimation apparatus 902 of FIG. 27 uses the audio models subjected to clustering into clusters according to age groups, but uses a language model common to all of the age groups.
Therefore, the attribute estimation apparatus 902 can recognize a difference in voice between different age groups, but cannot recognize a difference in wording between different age groups. As one specific example in Japanese, a young person may say, “Boku-wa-genki-desu”, whereas an old person may express the same thing differently, as in, “Washi-wa-genki-jya”.
As one specific example in English, an old person may use the wording “Nature calls”, whereas a young person does not use such wording, but says “I have to go to the bathroom”.
In order to enable the attribute estimation with consideration given to such language information, it is necessary to cluster language models in which vocabularies appearing in voice uttered by or text written by a plurality of speakers are collected.
In this regard, although methods for clustering speakers based on speech data have already been developed, a method for clustering speakers based on a language model has not been established. In other words, a method for clustering language models has not been developed. The difficulty in clustering language models results from the fact that different language models contain different vocabularies, and therefore, when a plurality of different language models are to be clustered, they cannot simply be processed as vectors of the same form.
As a simple example, Japanese has a plurality of words representing the first person, such as “boku”, “washi”, “watashi”, “ore” and the like. The frequency of use of these first-person words differs between age groups and genders. In general, a 70-year-old man often uses “washi” as the first person, whereas a 20-year-old man often uses “boku” as the first person. Therefore, a language model for 70-year-old men will contain the word “washi”, but a language model for 20-year-old men will contain the word “boku” instead of “washi”.
As a simple example in English, a language model for 70-year-old men may contain the wording “Nature calls”, whereas a language model for 20-year-old men may contain “to the bathroom” instead of “Nature calls”.
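The vocabulary mismatch described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each language model is a table mapping a word to its occurrence probability; the names `lm_age_70`, `lm_age_20` and `to_vector` are hypothetical, and the probability values are invented. The sketch only demonstrates the problem and does not represent any particular clustering method.

```python
# Hypothetical unigram language models: word -> occurrence probability.
# The words are taken from the Japanese example above; the numbers are invented.
lm_age_70 = {"washi": 0.6, "wa": 0.3, "genki": 0.1}
lm_age_20 = {"boku": 0.5, "wa": 0.3, "genki": 0.1, "desu": 0.1}


def to_vector(model):
    """Naively turn a model into a vector with one component per word it contains."""
    words = sorted(model)
    return words, [model[w] for w in words]


axes_70, vec_70 = to_vector(lm_age_70)
axes_20, vec_20 = to_vector(lm_age_20)

# The two vectors differ in length, and their components refer to different words,
# so a component-wise comparison between them is not meaningful.
same_length = len(vec_70) == len(vec_20)
same_axes = axes_70 == axes_20

# Words that appear in only one of the two models.
only_in_70 = sorted(set(lm_age_70) - set(lm_age_20))
only_in_20 = sorted(set(lm_age_20) - set(lm_age_70))

print(same_length, same_axes, only_in_70, only_in_20)
```

Because the components of the two vectors do not correspond to the same words, a distance between them cannot be computed directly, which is why such models cannot simply be processed as the same kind of vector.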
With the foregoing in mind, it is an object of the present invention to provide a clustering system, a clustering method and a clustering program capable of clustering language models in which vocabularies appearing in voice uttered by or text written by a plurality of speakers are collected.