1. Field of the Invention
This invention relates generally to matching test data to data within a database, and in particular to efficient fuzzy matching of data sampled from a noisy environment to samples within a large repository.
2. Background of the Invention
An important class of problems involves searching through a data repository for a match to a particular item of test data, where the data repository contains a large number of data segments. The repository typically contains a set of sequenced data that reflects known events or items, and the test segment is a sample acquired from an unknown event or item. The test segment is often, but not necessarily, a subset of (or smaller in size than) an individual stored data item. In this problem, the identity of the test segment is determined by matching the test segment to one or more data segments (or portions thereof) in the repository. Because of measurement noise and other real-world problems, the acquired test segment is not expected to match exactly with a segment in the repository. Accordingly, an approximate match may be considered sufficient to have a reasonable confidence in the match.
There are various specific applications of this problem. For example, the repository might include streams of feature vectors from audio samples in a database of songs, streams of feature vectors from video samples in a database of movies, or even portions of gene sequences in a database of DNA sequences. An obvious brute-force method to match a test segment to a segment in such a database is to keep a repository of all the streams and then attempt to match the test segment to each stream in the repository. This problem is made more difficult where the streams in the repository are longer than the test segment. In such a case, brute-force matching requires testing, for each stream, every substream of the same length as the test stream. Although such a brute-force method would likely give a correct answer, it can also be quite inefficient. In many applications, the repository could contain millions of streams, so searching every possible substream in the database for a match is impractical for real-world applications.
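The brute-force search described above can be illustrated with a minimal sketch. It is only an illustration of the baseline, not a method taken from any cited work: the function names `distance` and `brute_force_match`, the use of a mean Euclidean distance between feature vectors, and the `threshold` parameter are all assumptions introduced here.

```python
from math import sqrt


def distance(window, test):
    """Mean Euclidean distance between two equal-length sequences of
    feature vectors (an assumed, illustrative similarity measure)."""
    total = 0.0
    for u, v in zip(window, test):
        total += sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))
    return total / len(test)


def brute_force_match(repository, test, threshold):
    """Compare the test segment against every substream of the same length
    in every stored stream.

    repository: dict mapping a stream identifier to a list of feature vectors.
    test:       list of feature vectors sampled from the unknown item.
    threshold:  largest distance still accepted as an approximate match.

    Returns (stream_id, offset, distance) for the closest acceptable
    substream, or None if no substream is within the threshold.
    """
    n = len(test)
    if n == 0:
        return None
    best = None
    for stream_id, stream in repository.items():
        # Streams shorter than the test segment yield an empty range.
        for offset in range(len(stream) - n + 1):
            d = distance(stream[offset:offset + n], test)
            if d <= threshold and (best is None or d < best[2]):
                best = (stream_id, offset, d)
    return best
```

For a repository of S streams, each of length L, and a test segment of length n, this sketch performs on the order of S × (L − n + 1) × n vector comparisons, which is why the approach does not scale to millions of streams.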
Nearest-neighbor matching and approximate nearest-neighbor matching have been intensively studied for a number of years. But applying those solutions to this problem quickly becomes unmanageable for high dimensions, corresponding to a wide feature vector, as described in “Approximate Closest-Point Queries in High Dimensions,” by M. Bern, Information Processing Letters (1993). One approach for solving the approximate nearest-neighbor search problem is called “locality-sensitive hashing,” described in “Similarity Search in High Dimensions via Hashing,” by Gionis, Indyk et al. (1998). This solution, however, does not function well in the presence of noise levels of 20% or more. Searching time-sequenced data has also been studied, for example, in “Efficient Similarity Search in Sequence Databases,” by Agrawal, Faloutsos, and Swami, but the combination of multi-dimensional feature vectors plus time-sequencing is a difficult problem.
Accordingly, it is desirable to construct an appropriate data repository and provide a method for efficiently searching it, where the data repository and the test segment comprise high-dimensional data that may be affected by noise. Such a search may involve determining whether a test stream matches a stream already in the repository and finding that stream, or it may involve finding all streams in the repository that are sufficiently close to the given test stream to constitute a match. Preferably, the method should be sufficiently robust to function reliably in the presence of noise.