1. Field of the Invention:
The present invention relates to a method for processing data into clusters, each having a common homogeneity property, without prior knowledge of the number of clusters and, particularly, to a subword based speaker verification system having the capability of user-selectable passwords, in which speech is segmented into clusters, such as subwords, without prior linguistic knowledge, and to other speech processing systems.
2. Description of the Related Art:
Cluster analysis relates to the entire range of statistical techniques and unsupervised classification methods that group or classify unlabeled data points into clusters. The members within a given cluster exhibit some type of homogeneity property. The choice of the homogeneity measure results in several different clustering criteria. For example, conventional clustering methods have used an ordinal measure of similarity or dissimilarity between all pairs of objects, as described in Helmut Spath, "Cluster Analysis Algorithms For Data Reduction And Classification Of Objects", Ellis Horwood Limited, Chichester, England, 1980, Chapter 2, Chapter 6, Sec. 2. Other clustering methods have depended on distributional assumptions about the profile of the given measurements of the objects, as described in Douglas M. Hawkins, "Topics In Applied Multivariate Analysis", Cambridge University Press, Cambridge, England, 1982, Chapter 6, Sec. 4. In some clustering problems, cluster membership has also been restricted under a certain criterion; for example, in the segmentation of temporal signals, members of the same cluster can be required to be temporally contiguous, as described in Douglas M. Hawkins, "Topics In Applied Multivariate Analysis", Cambridge University Press, Cambridge, England, 1982, Chapter 6, Sec. 5.
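The clustering principle described above can be illustrated with a minimal sketch. The following is a hypothetical example, not the claimed method: a plain k-means pass over scalar data, where the homogeneity measure is squared distance to a cluster centroid. All names and data are illustrative.

```python
# Illustrative sketch of clustering by a homogeneity (distance) measure.
# Plain k-means on scalar data; names and data are hypothetical.

def kmeans_1d(points, k, iters=20):
    """Assign each point to the nearest of k centroids, then update centroids."""
    centroids = sorted(points)[::max(1, len(points) // k)][:k]
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for p in points:
            # homogeneity criterion: squared distance to the centroid
            i = min(range(len(centroids)), key=lambda c: (p - centroids[c]) ** 2)
            clusters[i].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return clusters, centroids

points = [1.0, 1.2, 0.8, 9.0, 9.5, 10.1]
clusters, centroids = kmeans_1d(points, k=2)
```

Note that this sketch presupposes the number of clusters k, which is exactly the prior knowledge that, as discussed below, is often unavailable in automatic speech segmentation.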
Typically, in automatic speech segmentation, the number of clusters is not known a priori. Hence, the clustering method must determine both the optimal number of clusters and the proper cluster assignment of the objects. Conventional attempts to determine the optimal number of clusters have included hierarchical and non-hierarchical techniques. These attempts have the shortcoming of needing prior knowledge of the number of clusters before the objects can be optimally assigned to the clusters.
The objective of a speaker verification system is to verify a claimed identity based on a given speech sample. The speech sample can be text dependent or text independent. Text dependent speaker verification systems identify the speaker after the utterance of a password phrase. The password phrase is chosen during enrollment and the same password is used in subsequent verification. Typically, the password phrase is constrained within a specific vocabulary such as digits. A text independent speaker verification system allows for user-selectable passwords: the user may speak freely during training and testing. Accordingly, there are no pre-defined password phrases.
Speech recognition and speaker verification tasks may involve large vocabularies in which the phonetic content of different vocabulary words may overlap substantially. Thus, storing and comparing whole word patterns can be unduly redundant, since the constituent sounds of individual words are treated independently regardless of their identifiable similarities. For these reasons, conventional large-vocabulary speech recognition and speaker verification systems build models based on phonetic subword units.
Conventional text dependent speaker verification systems have used the techniques of Hidden Markov Models (HMM) for modeling speech. For example, subword models, as described in A. E. Rosenberg, C. H. Lee and F. K. Soong, "Subword Unit Talker Verification Using Hidden Markov Models", Proceedings ICASSP, pages 269-272 (1990), and whole word models, as described in A. E. Rosenberg, C. H. Lee and S. Gokeen, "Connected Word Talker Recognition Using Whole Word Hidden Markov Models", Proceedings ICASSP, pages 381-384 (1991), have been considered for speaker verification and speech recognition systems. HMM techniques have the limitation of generally requiring a large amount of data to sufficiently estimate the model parameters.
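For context, HMM-based scoring of an utterance can be sketched with the standard forward algorithm over a discrete observation sequence. This is a generic textbook illustration, not the cited subword or whole-word models; the toy parameters below are hypothetical.

```python
# Hedged sketch of HMM likelihood scoring via the forward algorithm.
# pi: initial state probabilities, A: transition matrix, B: emission matrix.

def forward(obs, pi, A, B):
    """Return P(obs | model) for a discrete-symbol HMM."""
    n = len(pi)
    alpha = [pi[s] * B[s][obs[0]] for s in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[r] * A[r][s] for r in range(n)) * B[s][o]
                 for s in range(n)]
    return sum(alpha)

pi = [1.0, 0.0]                # always start in state 0
A = [[0.6, 0.4], [0.0, 1.0]]   # simple left-to-right topology
B = [[0.9, 0.1], [0.2, 0.8]]   # emission probabilities for symbols 0 and 1
p = forward([0, 0, 1], pi, A, B)
```

Reliable estimation of the A and B parameters from training utterances is what drives the large data requirement noted above.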
Other speech modeling attempts use Neural Tree Networks (NTN). The NTN is a hierarchical classifier that combines the properties of decision trees and neural networks, as described in A. Sankar and R. J. Mammone, "Growing and Pruning Neural Tree Networks", IEEE Transactions on Computers, C-42:221-229, March 1993. For speaker recognition, training data for the NTN consists of data for the desired speaker and data from other speakers. The NTN partitions feature space into regions that are assigned probabilities which reflect how likely a speaker is to have generated a feature vector that falls within the speaker's region.
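The region-and-probability idea can be sketched as follows. This is not the NTN algorithm itself, but a minimal one-level analogue: a single split of scalar feature space into two regions, each storing the fraction of training vectors there that came from the target speaker. All names, data, and the threshold are hypothetical.

```python
# Illustrative one-level partition of feature space with per-region
# target-speaker probabilities (a toy analogue of NTN leaf scoring).

def train_stump(features, is_target, threshold):
    left = [t for f, t in zip(features, is_target) if f < threshold]
    right = [t for f, t in zip(features, is_target) if f >= threshold]
    # per-region probability that a vector belongs to the target speaker
    p_left = sum(left) / len(left) if left else 0.5
    p_right = sum(right) / len(right) if right else 0.5
    return lambda f: p_left if f < threshold else p_right

feats = [0.1, 0.2, 0.3, 0.8, 0.9]
target = [1, 1, 1, 0, 0]        # 1 = target speaker, 0 = other speakers
score = train_stump(feats, target, threshold=0.5)
```

A real NTN grows many such splits recursively, so that the leaves carve feature space into finer regions than a single threshold can.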
The above described modeling techniques rely on speech being segmented into subwords. Traditionally, segmentation and labeling of speech data were performed manually by a trained phonetician using listening and visual cues. There are several disadvantages to this approach. Firstly, this task is extremely tedious and time consuming. Secondly, only a small number of experts have the level of skill and knowledge to achieve reliable labeling of subwords. The combination of these disadvantages also limits the amount of data that can be labeled in this manner. Thirdly, this manual process involves decisions that are highly subjective, which leads to a lack of consistency and reproducibility of results. The manual process is also prone to human error.
One solution to the problem of manual speech segmentation is to use automatic speech segmentation procedures. Typical automatic segmentation and labeling procedures use the associated linguistic knowledge, such as the spoken text and/or phonetic string to perform segmentation. In order to incorporate existing acoustic-phonetic knowledge, these procedures generally use bootstrap phoneme models trained from the pre-existing manually labeled speech corpus. Conventional automatic speech segmentation processing has used hierarchical and non-hierarchical approaches.
Hierarchical speech segmentation involves a multi-level, fine-to-coarse segmentation which can be displayed in a tree-like fashion called a dendrogram; see James R. Glass and Victor W. Zue, Proceedings of ICASSP, "Multi-Level Acoustic Segmentation of Continuous Speech", pp. 429-432, 1988, and Victor Zue, James Glass, Michael Phillips and Stephanie Seneff, Proceedings of ICASSP, "Acoustic Segmentation and Phonetic Classification in the SUMMIT System", pages 389-392, 1989. The initial segmentation is at a fine level, the limiting case being one segment per feature vector. Thereafter, a segment is chosen to be merged with either its left or right neighbor using a similarity measure. This process is repeated until the entire utterance is described by a single segment.
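The fine-to-coarse merging process described above can be sketched in a few lines. This is a hedged illustration, not the cited systems: frames are scalars, the similarity measure is the absolute difference of segment means, and at each level the most similar pair of adjacent segments is merged.

```python
# Illustrative fine-to-coarse hierarchical segmentation. Start with one frame
# per segment; repeatedly merge the most similar adjacent pair. The record of
# levels corresponds to the dendrogram described in the text.

def merge_once(segments):
    """Merge the pair of adjacent segments whose means are closest."""
    def mean(s):
        return sum(s) / len(s)
    i = min(range(len(segments) - 1),
            key=lambda j: abs(mean(segments[j]) - mean(segments[j + 1])))
    return segments[:i] + [segments[i] + segments[i + 1]] + segments[i + 2:]

frames = [1.0, 1.1, 5.0, 5.2, 9.0]
segs = [[f] for f in frames]      # finest level: one segment per frame
history = [segs]
while len(segs) > 1:              # merge until one segment spans the utterance
    segs = merge_once(segs)
    history.append(segs)
```

Each entry of `history` is one level of the dendrogram, from the finest segmentation down to a single segment.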
Non-hierarchical speech segmentation attempts to locate the optimal segment boundaries by using a knowledge engineering-based rule set or by extremizing a distortion or score metric. Knowledge engineering-based algorithms have been implemented as a set of knowledge sources that apply rules to speech parameters to locate the segment boundaries and also assign broad phonetic category labels to these segments, as described by Ronald Cole and Lily Hou, Proceedings of ICASSP, "Segmentation And Broad Classification of Continuous Speech", pp. 453-456, 1988; by Kaichiro Hatazaki, Yasuhiro Komori, Takeshi Kawabata and Kiyohiro Shikano, Proceedings of ICASSP, "Phoneme Segmentation Using Spectrogram Reading Knowledge", pp. 393-396, 1989; and by David B. Grayden and Michael S. Scordilla, Proceedings of ICASSP, "Phonemic Segmentation of Fluent Speech", pp. 73-76, 1994.
Other attempts at non-hierarchical segmentation use dynamic programming based techniques to find the set of optimal segment boundary points that minimize the overall within-segment distortion, as described by T. Svendsen and F. Soong, Proceedings of ICASSP, "On the Automatic Segmentation of Speech Signals", pp. 3.4.1-3.4.4, 1987, and by Sin-Horng Chen and Wen-Yuan Chen, IEEE Transactions on Speech and Audio Processing, "Generalized Minimal Distortion Segmentation For ANN-based Speech Recognition", 3(2):141-145, March 1995. The segmentation is achieved by forced alignment of the given speech sample with its corresponding subword sequence. The above-described conventional techniques for hierarchical and non-hierarchical speech segmentation have the limitation of needing prior knowledge of the number of speech segments and corresponding segment models.
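The dynamic programming formulation described above can be illustrated with a minimal sketch, under stated assumptions: frames are scalars, distortion is the sum of squared deviations from the segment mean, and the number of segments k must be supplied, which is precisely the prior knowledge the text identifies as a limitation. The names and data are hypothetical.

```python
# Hedged sketch of DP-based segmentation: place k-1 interior boundaries in a
# scalar sequence so as to minimize total within-segment distortion.

def distortion(x, i, j):
    """Within-segment distortion of x[i:j] about its mean."""
    seg = x[i:j]
    m = sum(seg) / len(seg)
    return sum((v - m) ** 2 for v in seg)

def dp_segment(x, k):
    n = len(x)
    INF = float("inf")
    # cost[s][j]: best total distortion covering x[:j] with s segments
    cost = [[INF] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    cost[0][0] = 0.0
    for s in range(1, k + 1):
        for j in range(s, n + 1):
            for i in range(s - 1, j):
                c = cost[s - 1][i] + distortion(x, i, j)
                if c < cost[s][j]:
                    cost[s][j], back[s][j] = c, i
    # backtrack to recover segment end indices
    bounds, j = [], n
    for s in range(k, 0, -1):
        bounds.append(j)
        j = back[s][j]
    return sorted(bounds)

ends = dp_segment([1.0, 1.1, 0.9, 6.0, 6.2, 5.8], k=2)
```

Here `ends` lists the exclusive end index of each segment; with the toy data the minimum-distortion boundary falls between the low-valued and high-valued frames.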