1. Field of the Invention
The present invention relates to generation of pattern models used for pattern recognition of predetermined data on unspecified objects. More particularly, it relates, first, to a data process unit, data process unit control program, pattern model search unit, pattern model search unit control program, and specific pattern model providing system which are suitable for generating pattern models for unspecified objects, taking into consideration the distribution of diversifying feature parameters under specific conditions consisting of a combination of such factors as a type of object and measurement environment of the predetermined data, and which are suitable for providing pattern models intended for unspecified speakers and adapted to pattern recognition of predetermined data on specified objects; second, to a data process unit, data process system, data process method, and data process unit control program which are suitable for evaluating the value of speech data of unidentified speakers using pattern models generated in relation to speech data of a plurality of speakers; and third, to a data process unit, data process system, data process method, and data process unit control program which are suitable for detecting, out of a plurality of speakers, a speaker who resembles a target speaker in speech, and which are suitable for providing information needed to enhance similarity in speech between the target speaker and the detected speaker.
2. Description of the Related Art
There is an information processing technology known as pattern recognition which involves observing or measuring some properties of objects and identifying and classifying the objects based on data obtained as a result of the observation or measurement.
Generally, a speech recognition unit, which performs a type of pattern recognition, comprises an acoustic analyzer and a speech matcher. The acoustic analyzer converts speech samples taken from a speaker into a series of feature parameters. The speech matcher matches the series of feature parameters obtained by the acoustic analyzer with information about feature parameters of vocabulary words prestored in a storage unit such as a memory or hard disk, and selects the vocabulary word with the highest similarity as a recognition result.
Known acoustic analysis methods for converting speech samples into a series of feature parameters include cepstrum analysis and linear prediction analysis, which are described in Non-Patent Document 1.
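As an illustration of the cepstrum analysis mentioned above, the following is a minimal sketch of a real cepstrum computed for a single frame, that is, the inverse discrete Fourier transform of the log magnitude spectrum. It is not taken from Non-Patent Document 1. The function names, the frame length, the magnitude floor, the test signals, and the choice of keeping the first 12 coefficients are all assumptions made for illustration, and a naive DFT is used only to keep the sketch free of external libraries.

```python
import cmath
import math


def dft(x):
    """Naive O(n^2) discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def real_cepstrum(frame, floor=1e-12):
    """Real cepstrum: inverse DFT of the log magnitude spectrum.

    `floor` (an assumed value) prevents log(0) for empty spectral bins.
    """
    n = len(frame)
    log_mag = [math.log(max(abs(s), floor)) for s in dft(frame)]
    # The log magnitude spectrum of a real frame is real and even,
    # so the inverse DFT is real; only the real part is kept.
    return [
        sum(log_mag[k] * cmath.exp(2j * math.pi * k * q / n) for k in range(n)).real / n
        for q in range(n)
    ]


# A unit impulse has a flat spectrum, so its cepstrum is zero everywhere.
impulse_cepstrum = real_cepstrum([1.0] + [0.0] * 15)

# A hypothetical decaying frame standing in for windowed speech samples.
frame = [0.9 ** t for t in range(32)]
cep = real_cepstrum(frame)
features = cep[:12]  # low-order coefficients kept as feature parameters (assumed count)
```

Repeating this for successive frames would yield the series of feature parameters referred to above.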
In speech recognition, a technique for recognizing speech of unspecified speakers is generally referred to as speaker independent speech recognition. Since information about feature parameters of vocabulary words is prestored in a storage unit, speaker independent speech recognition, unlike speaker dependent speech recognition, frees the user from the task of registering the words to be recognized.
Regarding methods for preparing information about feature parameters of vocabulary words and matching it with a series of feature parameters obtained by converting input speech, methods based on Hidden Markov models (HMMs) are in common use. In HMM-based methods, phonetic units such as syllables, half-syllables, phonemes, biphones, and triphones are modeled using HMMs. Pattern models of such phonetic units are generally referred to as acoustic models.
Methods for creating acoustic models are described in detail in Section 1 of Non-Patent Document 1.
Also, those skilled in the art can easily construct a speaker independent speech recognition unit based on the Viterbi algorithm described in Section 6.4 of Non-Patent Document 1.
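For reference, the Viterbi algorithm finds the most likely state sequence of an HMM for a given observation sequence. The following is a minimal sketch for a discrete-observation HMM, computed in the log domain. It is not the formulation of Non-Patent Document 1 and not a speech recognition unit. The two-state model, its state names, observation symbols, and probabilities are hypothetical values chosen only to make the example checkable by hand.

```python
import math


def viterbi(obs, states, log_start, log_trans, log_emit):
    """Return (best state path, its log-probability) for a discrete HMM."""
    # delta[s]: best log-probability of any path that ends in state s
    delta = {s: log_start[s] + log_emit[s][obs[0]] for s in states}
    backpointers = []
    for o in obs[1:]:
        new_delta, ptr = {}, {}
        for s in states:
            best_prev = max(states, key=lambda p: delta[p] + log_trans[p][s])
            new_delta[s] = delta[best_prev] + log_trans[best_prev][s] + log_emit[s][o]
            ptr[s] = best_prev
        delta = new_delta
        backpointers.append(ptr)
    # Trace back from the best final state.
    last = max(states, key=lambda s: delta[s])
    path = [last]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, delta[last]


def to_log(table):
    """Convert a (possibly nested) dict of probabilities to log-probabilities."""
    return {
        k: to_log(v) if isinstance(v, dict) else math.log(v)
        for k, v in table.items()
    }


# Hypothetical toy model (all values are illustrative assumptions).
states = ["s1", "s2"]
log_start = to_log({"s1": 0.6, "s2": 0.4})
log_trans = to_log({"s1": {"s1": 0.7, "s2": 0.3}, "s2": {"s1": 0.4, "s2": 0.6}})
log_emit = to_log({
    "s1": {"a": 0.5, "b": 0.4, "c": 0.1},
    "s2": {"a": 0.1, "b": 0.3, "c": 0.6},
})

path, log_prob = viterbi(["a", "b", "c"], states, log_start, log_trans, log_emit)
```

For this toy model the best path is s1, s1, s2 with probability 0.01512. In a recognizer, one such score would be computed against the model of each vocabulary word and the highest-scoring word selected.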
Conventionally, more than one acoustic model is often created according to sex (male/female), age (children/adults/the aged), and speech environment (which is dependent on noise).
Non-Patent Document 2 discloses a method for clustering high dimensional acoustic models automatically using distance among the acoustic models. The clustering method involves performing clustering repeatedly on a trial-and-error basis by specifying a large number of clustering conditions until a good clustering result is obtained.
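To illustrate why distance-based clustering can require repeated trials, the following is a minimal sketch of single-linkage agglomerative clustering with a distance threshold. It is a generic technique substituted for illustration and is not the tree-structured method of Non-Patent Document 2. Each acoustic model is represented here by a hypothetical 4-dimensional vector, the Euclidean distance stands in for a distance among acoustic models, and the threshold plays the role of one clustering condition. All names and values are assumptions.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def single_linkage(points, threshold):
    """Merge the closest pair of clusters until the closest pair is farther than `threshold`."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > 1:
        best = None  # (distance, index of first cluster, index of second cluster)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(
                    euclidean(points[a], points[b])
                    for a in clusters[i]
                    for b in clusters[j]
                )
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best[0] > threshold:
            break
        _, i, j = best
        clusters[i] = sorted(clusters[i] + clusters[j])
        del clusters[j]
    return clusters


# Hypothetical stand-ins for five acoustic models (illustrative values only).
models = [
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [5.0, 5.0, 0.0, 0.0],
    [5.0, 6.0, 0.0, 0.0],
    [20.0, 0.0, 0.0, 0.0],
]

tight = single_linkage(models, threshold=0.5)
medium = single_linkage(models, threshold=2.0)
loose = single_linkage(models, threshold=10.0)
```

The three thresholds yield five, three, and two clusters respectively. Without a view of the distances among the models, a suitable condition can only be found by rerunning the clustering in this way.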
(Non-Patent Document 1) L. Rabiner et al., “Fundamentals of Speech Recognition,” Prentice Hall, Inc., 1993.
(Non-Patent Document 2) T. Kosaka et al., “Tree-Structured Speaker Clustering for Fast Speaker Adaptation,” Proc. ICASSP, Vol. I, pp. I-245-248, Adelaide, Australia, 1994.
However, as described above, only a small number of acoustic models are typically created, divided at most according to sex (male/female), age (children/adults/the aged), and speech environment (which is dependent on noise). Consequently, to divide the acoustic models, there is no choice but to use a heuristic method based on a priori knowledge. Thus, there are limits to the recognition rates that can be achieved.
Regarding Non-Patent Document 2, there is no means of easily grasping, visually or otherwise, the interrelationship among acoustic models, such as the relative distances among them or the number and size of their clusters. It is therefore necessary to repeat calculations many times under a large number of clustering conditions until good clustering results are obtained, which requires a great deal of calculation time.
Generally, to implement high-accuracy speech recognition, acoustic models are generated using cepstrum (described above), MFCCs (Mel-Frequency Cepstral Coefficients), or other high dimensional (10- to 30-dimensional) feature parameters. It is therefore difficult to represent the interrelationship among a plurality of acoustic models visually.
The above items apply not only to acoustic models, but also to pattern models in image recognition and other fields.