1. Field of the Invention
The present invention relates to a speaker-independent speech recognition method, and more particularly, to a speech recognition method using speaker cluster models, which can be used in products involving speech recognition such as spoken dialogue systems and auto-attendant systems.
2. Description of the Related Art
In the related art, speaker cluster models have been applied to both speaker-independent speech recognition and speaker adaptation. Although the application fields differ, the speaker cluster models are built by the same training procedure. A training phase starts by dividing the speakers into different speaker clusters. A cluster-dependent model is then trained independently for each speaker cluster, using the speech data of the speakers belonging to that cluster. The collection of all cluster-dependent models forms a speaker cluster model. Most approaches to building speaker cluster models focus on how to divide speakers into clusters, and in particular on how to measure similarity across speakers. Some speaker clustering methods reported in the related art are as follows:
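The training procedure described above can be sketched as follows. This is a minimal illustration only: the cluster assignment is assumed to be given, each cluster-dependent "model" is reduced to a mean feature vector standing in for a real acoustic model, and the names `train_speaker_cluster_model`, `speaker_data` and `speaker_to_cluster` are hypothetical.

```python
from collections import defaultdict


def train_speaker_cluster_model(speaker_data, speaker_to_cluster):
    """Pool speech data per cluster and train one model per cluster.

    speaker_data: {speaker: [feature vector, ...]}
    speaker_to_cluster: {speaker: cluster id} (assumed to be given)
    Returns {cluster id: mean feature vector}; the mean is only a
    stand-in for a real cluster-dependent acoustic model.
    """
    # Step 1: gather the data of all speakers belonging to each cluster.
    pooled = defaultdict(list)
    for speaker, frames in speaker_data.items():
        pooled[speaker_to_cluster[speaker]].extend(frames)

    # Step 2: train each cluster-dependent model independently.
    model = {}
    for cluster, frames in pooled.items():
        dim = len(frames[0])
        model[cluster] = [
            sum(frame[d] for frame in frames) / len(frames) for d in range(dim)
        ]
    return model


toy_data = {"s1": [[0.0, 2.0]], "s2": [[2.0, 4.0]], "s3": [[10.0, 10.0]]}
toy_clusters = {"s1": "A", "s2": "A", "s3": "B"}
toy_model = train_speaker_cluster_model(toy_data, toy_clusters)
```

The returned dictionary plays the role of the speaker cluster model, that is, the collection of all cluster-dependent models.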
1. Using acoustic distances across speakers to measure similarities across speakers (T. Kosaka and S. Sagayama, “Tree-structured speaker clustering for fast speaker adaptation”, Proceeding of ICASSP94, pp.245-248, 1994; Y. Gao, M. Padmanabhan and M. Picheny, “Speaker adaptation based on pre-clustering training speakers”, Proceeding of EUROSPEECH97, pp.2091-2094, 1997)
2. Using vocal-tract-size related articulatory parameters to measure similarities across speakers (M. Naito, L. Deng and Y. Sagisaka, “Speaker clustering for speech recognition using the parameters characterizing vocal-tract dimensions”, Proceeding of ICASSP98, pp.981-984, 1998)
3. Clustering the speakers according to three classes of speaking rate: fast, medium and slow (T. J. Hazen and J. R. Glass, “A comparison of novel techniques for instantaneous speaker adaptation”, Proceeding of EUROSPEECH97, pp.2047-2050, 1997).
The three aforementioned speaker clustering methods differ in how they measure similarities across speakers. Speaker clustering algorithms can also be divided into two types according to clustering structure. The first is called the plain speaker cluster algorithm. This algorithm clusters all speakers directly using one of the aforementioned speaker clustering methods. The second is called the tree-structured speaker cluster algorithm. Please refer to FIG. 1, which illustrates a tree-structured speaker cluster model 10. The speaker cluster model 10 has a root speaker cluster A 100 to which all speakers belong. The speakers in the root speaker cluster A 100 are divided into a male speaker cluster M 102 and a female speaker cluster F 104 according to their gender. The male speakers in the male speaker cluster M 102 are further clustered into speaker clusters M1 112 and M2 114. The female speakers in the female speaker cluster F 104 are further clustered into speaker clusters F1 122 and F2 124.
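As an illustration of the tree structure of FIG. 1, the hierarchy can be represented as a small recursive data structure. The class name `SpeakerCluster` and its fields are hypothetical and chosen only for this sketch; the cluster names follow the figure.

```python
from dataclasses import dataclass, field


@dataclass
class SpeakerCluster:
    """One node of a tree-structured speaker cluster model (illustrative)."""
    name: str
    children: list = field(default_factory=list)

    def leaves(self):
        """Return the leaf clusters below this node, left to right."""
        if not self.children:
            return [self]
        return [leaf for child in self.children for leaf in child.leaves()]


# Root cluster A is split by gender; each gender cluster is split again.
root = SpeakerCluster("A", [
    SpeakerCluster("M", [SpeakerCluster("M1"), SpeakerCluster("M2")]),
    SpeakerCluster("F", [SpeakerCluster("F1"), SpeakerCluster("F2")]),
])
leaf_names = [cluster.name for cluster in root.leaves()]
```

Under the plain speaker cluster algorithm, the same structure would consist of a root whose children are all leaves.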
When the speaker cluster model is applied to speaker-independent speech recognition where the testing speaker who utters a speech signal is unknown, two specific decision rules are commonly employed:
I. Build a cluster pre-selection model in addition to the speaker cluster model; when receiving the speech signal, use the cluster pre-selection model to pre-select a speaker cluster to which the testing speaker who utters the speech signal most probably belongs, and only use the cluster-dependent model of the selected speaker cluster to recognize the speech signal.
II. Find a best candidate for each speaker cluster by using each cluster-dependent model as a recognition model to recognize the speech signal, and choose as the final recognition result the candidate with the highest score across all speaker clusters.
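The two decision rules above can be contrasted in a short sketch. It assumes, purely for illustration, that each cluster-dependent model is a function returning a score for every candidate and that the cluster pre-selection model is a function scoring each cluster for the signal. All function names and the toy scores are hypothetical.

```python
def recognize_with_preselection(signal, cluster_models, preselect):
    """Rule I: pre-select one cluster, then use only its model."""
    cluster = max(cluster_models, key=lambda c: preselect(signal, c))
    scores = cluster_models[cluster](signal)
    return max(scores, key=scores.get)


def recognize_best_across_clusters(signal, cluster_models):
    """Rule II: best candidate per cluster, then the best of those."""
    best_per_cluster = []
    for model in cluster_models.values():
        scores = model(signal)
        word = max(scores, key=scores.get)
        best_per_cluster.append((scores[word], word))
    return max(best_per_cluster)[1]


# Toy stand-ins: each model maps a signal to {candidate: score}.
toy_models = {
    "M": lambda signal: {"yes": -1.0, "no": -3.0},
    "F": lambda signal: {"yes": -2.5, "no": -0.5},
}
toy_preselect = lambda signal, cluster: {"M": 0.7, "F": 0.3}[cluster]

rule_one = recognize_with_preselection("utterance", toy_models, toy_preselect)
rule_two = recognize_best_across_clusters("utterance", toy_models)
```

With these toy scores the two rules disagree: rule I commits to cluster M and returns its best candidate, while rule II returns the single highest-scoring candidate found in any cluster. In both cases each cluster-dependent model is consulted as an independent recognizer.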
The present invention uses the speaker cluster model in speaker-independent speech recognition. Therefore, only the techniques related to that application are introduced here.
In the training phase of the speaker cluster model, the methods of the related art emphasize how to cluster speakers. Their purpose is to group speakers with similar characteristics into the same speaker cluster. However, the purpose of speech recognition is to correctly recognize a speech signal, so the two purposes are not exactly the same. In other words, improving the effectiveness of speaker clustering does not necessarily improve the accuracy of speech recognition. In the recognition phase, regardless of which related art recognition algorithm is used, each cluster-dependent model is treated as an independent recognition model. The dependency among different cluster-dependent models is never considered.
Cleanly separating speakers so that all speakers with similar characteristics fall into the same speaker cluster is a difficult task. Please refer to FIG. 2, which shows two speaker clusters 202, 204. The speaker clusters 202, 204 have an overlapping area 206. That is, although the speakers in each speaker cluster 202, 204 have substantially similar characteristics, some of the speakers in one speaker cluster have characteristics similar to those of speakers in the other speaker cluster. For example, suppose there are four speakers W, X, Y and Z. Speaker W and speaker X have similar characteristics; speaker X and speaker Y have similar characteristics; and speaker Y and speaker Z have similar characteristics. Assume that the speakers W and X are clustered into the speaker cluster 202 and the speakers Y and Z are clustered into the speaker cluster 204. Because the speakers X and Y have similar characteristics, they form the overlapping area 206. In the speech recognition phase, when a testing speaker who inputs a speech signal has characteristics between those of the speaker X and those of the speaker Y, and each cluster-dependent model is treated as an independent recognition model without considering how its dependency on the other cluster-dependent models influences recognition, the overlap generated by clustering may have a negative effect on recognition.
It is therefore an objective of the present invention to provide a speech recognition method for improving the performance of speech recognition.
To achieve the aforementioned objective, the present invention introduces the dependency among a plurality of cluster-dependent models in order to overcome recognition problems caused by between-speaker variability and thereby improve the performance of speech recognition. The speech recognition method of the present invention comprises the following steps: receiving a speech signal; recognizing the speech signal using a speaker cluster model obtained in a training phase, wherein the speaker cluster model is a collection of a plurality of cluster-dependent models and a score of each candidate is calculated according to a score function which is defined by taking the dependency among the cluster-dependent models into account; and obtaining a final recognition result according to a decision rule based on the score of each candidate.
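The text at this point does not give the form of the score function. As a rough illustration of what it can mean for a candidate's score to depend on all cluster-dependent models jointly, the sketch below assumes a weighted log-sum-exp combination of the per-cluster scores followed by an arg-max decision rule. The combination, the uniform default weights and the names `combined_score` and `recognize` are assumptions made for this example, not the score function of the invention.

```python
import math


def combined_score(per_cluster_scores, weights=None):
    """Combine one candidate's log scores from all cluster models.

    Assumed form (illustrative only): log of the weighted sum of
    exponentiated scores, computed stably by factoring out the maximum.
    """
    n = len(per_cluster_scores)
    weights = weights or [1.0 / n] * n
    top = max(per_cluster_scores)
    total = sum(w * math.exp(s - top) for w, s in zip(weights, per_cluster_scores))
    return top + math.log(total)


def recognize(candidate_scores):
    """Decision rule: pick the candidate with the highest combined score."""
    return max(candidate_scores, key=lambda c: combined_score(candidate_scores[c]))


# Toy log scores: {candidate: [score under cluster 1, score under cluster 2]}.
toy_scores = {"yes": [-1.0, -6.0], "no": [-1.2, -1.3]}
joint_choice = recognize(toy_scores)
independent_choice = max(toy_scores, key=lambda c: max(toy_scores[c]))
```

In this toy case, treating the models independently favors "yes" because of one high score from a single cluster, whereas the joint score favors "no", which is supported consistently by both clusters.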
The training phase comprises building an initialization model, and adjusting parameters of at least two cluster-dependent models of the initialization model by using a discriminative training method to obtain the speaker cluster model, wherein the discriminative training method uses minimum classification error as its training criterion, and a discriminant function of the discriminative training method is defined in the same manner as the score function.
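For readers unfamiliar with the minimum classification error criterion, the sketch below computes the loss for one training utterance in the commonly used formulation: a misclassification measure compares the discriminant score of the correct class with a soft maximum over the competing classes, and a sigmoid turns that measure into a smooth error count. The function name `mce_loss`, the smoothing parameters `eta` and `gamma`, and the toy scores are assumptions for this example; the exact formulation used in the training phase is not stated here.

```python
import math


def mce_loss(scores, correct, eta=1.0, gamma=1.0):
    """Smoothed classification error for one utterance (illustrative).

    scores: {class label: discriminant score}, higher is better.
    correct: label of the true class.
    """
    g_correct = scores[correct]
    rivals = [g for label, g in scores.items() if label != correct]

    # Soft maximum over competing classes, computed stably.
    top = max(rivals)
    mean_exp = sum(math.exp(eta * (g - top)) for g in rivals) / len(rivals)
    soft_max_rival = top + math.log(mean_exp) / eta

    # Misclassification measure: positive when a rival outscores the truth.
    d = soft_max_rival - g_correct

    # Sigmoid loss: close to 0 when correct, close to 1 when wrong.
    return 1.0 / (1.0 + math.exp(-gamma * d))


toy = {"yes": 2.0, "no": 0.0}
loss_when_right = mce_loss(toy, correct="yes")
loss_when_wrong = mce_loss(toy, correct="no")
```

Because the loss is differentiable in the discriminant scores, model parameters can be adjusted by gradient-based methods to reduce it, which ties training directly to recognition error rather than to clustering quality.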
The present invention is described in further detail below with reference to the drawings and the embodiments.