1. Field of the Invention
This invention pertains generally to speaker recognition, and more particularly, to a system and method for recognizing a speaker with an estimated density of data points located, in a data structure, closest to a data point from the speaker.
2. Description of Related Art
Advancements in speaker recognition bring applications of human-computer interaction further into our everyday lives. Such artificial intelligence enables, for example, voice instructions to a robot, hands-free productivity in a vehicle, and biometric security features. Furthermore, text-independent speaker recognition identifies speakers without relying on predetermined utterances such as a password. For example, a speaker enrolls by pronouncing an utterance of letters, and is recognized by pronouncing an utterance of numbers. Text-dependent systems, by contrast, use the same utterance for enrollment and recognition and make comparisons on similar portions of the two utterances (e.g., the same letter of the alphabet).
A difficulty presented by text-independent voice recognition is that significant amounts of information must be collected from each voice sample in order to provide a basis for reliable comparison. An application having a large number of registered speakers must be outfitted with a large and complex database and a high-end processing system to perform comparisons. Many of the current voice recognition techniques are problematic in this environment in that they cannot handle large data sets quickly enough or they take shortcuts by making inaccurate assumptions about the data.
Parametric (or generative) approaches to speaker recognition are too restrictive and inaccurate for real-world data distributions. Methods such as Gaussian Mixture Models assume Gaussian distributions in order to reduce the amount of computation necessary for making an identification. However, data distribution properties change over time, and are, consequently, not always amenable to such assumptions. Thus, parametric approaches do not provide sufficient accuracy for many applications of voice recognition.
Discriminative approaches to speaker recognition, although highly accurate, are not trainable for large data sets. Support vector machines, polynomial regression classifiers, relevance vector machines, and regularized least-squares classification, for example, use classifiers rather than parametric assumptions. Additionally, many discriminative approaches exhibit sparsity and other properties, making them computationally efficient by reducing the number of classifiers needed to make a comparison. Nevertheless, discriminative approaches are not scalable to large data sets (e.g., 500,000 or 1,000,000 data points) since training complexity can be quadratic in the number of data points. Because voice recognition performs pattern recognition of rich, high-dimensional data points, many nonparametric approaches are not tractable.
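By way of illustration only, the quadratic growth noted above can be made concrete with a short calculation. The sketch below assumes a kernel-based classifier that stores one pairwise entry for every pair of training points at 8 bytes per entry; the function name `gram_matrix_cost` and the storage assumption are hypothetical and are not drawn from any particular classifier named above.

```python
def gram_matrix_cost(n_points, bytes_per_entry=8):
    """Return (pairwise entries, storage in GiB) for a full n-by-n kernel matrix.

    Assumption: every pair of training points yields one stored entry.
    """
    entries = n_points * n_points
    gib = entries * bytes_per_entry / 2**30
    return entries, gib


# Data set sizes taken from the example figures in the text above.
for n in (500_000, 1_000_000):
    entries, gib = gram_matrix_cost(n)
    print(f"{n:,} points: {entries:.2e} pairwise entries, roughly {gib:,.0f} GiB")
```

Under this assumption, doubling the number of data points quadruples the number of pairwise entries. Practical solvers may avoid storing the full matrix, so the figures indicate scale only.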
Other nonparametric approaches require unacceptable computation time. Each test data point is compared against every training data point, causing test time to increase linearly with the number of training data points.
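The linear growth in test time can be seen in a minimal sketch of an exhaustive nearest-neighbor search, offered under stated assumptions rather than as any specific prior-art method. The function name `brute_force_nearest`, the Euclidean distance measure, and the two-dimensional points (standing in for high-dimensional voice feature vectors) are all hypothetical.

```python
import math
import random


def brute_force_nearest(train_points, query, k):
    """Return the k training points closest to query and the comparison count."""
    comparisons = 0
    scored = []
    for point in train_points:
        # One distance computation per training point: cost grows with len(train_points).
        scored.append((math.dist(point, query), point))
        comparisons += 1
    scored.sort(key=lambda pair: pair[0])
    return [point for _, point in scored[:k]], comparisons


random.seed(0)
for n in (1_000, 2_000, 4_000):
    train = [(random.random(), random.random()) for _ in range(n)]
    _, comparisons = brute_force_nearest(train, (0.5, 0.5), k=5)
    print(f"{n} training points: {comparisons} distance computations per query")
```

Each query performs exactly one distance computation per training data point, so doubling the training data doubles the work for every test data point.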
Accordingly, there is a need for a robust voice recognition system and method that maintains accuracy and computational efficiency in environments with large and feature rich data sets. During enrollment of a speaker, the solution should efficiently organize a speaker data structure such that, during recognition of an unidentified speaker, it can quickly produce a subset of speaker data points to use in reliably estimating a density function for identification.