1. Field of the Invention
The present invention relates to techniques for estimating similarity between complex objects. More specifically, the present invention relates to a method and an apparatus that estimate similarity between complex objects by comparing object signatures for the complex objects.
2. Related Art
The explosion of data in the information age requires a growing number of computing applications to routinely process huge amounts of input data. For example, search engines on the Internet must comb through the billions of web pages that are presently accessible through the Internet and return relevant results within a fraction of a second. Traditionally, the computational approaches used by these applications assume that entire data objects can be stored in main memory while the data objects are being processed. However, it is unrealistic to keep all of the data objects in main memory when applications are dealing with large numbers of “massive” data objects, such as data objects from a genome database, multimedia files, or web page repositories.
The tremendous burden created by these massive data objects has led to the development of computing techniques that can process such data objects more efficiently. In particular, people have developed “streaming” techniques which operate by streaming individual elements in a data object sequentially through the processor and the memory, thereby reducing memory storage requirements at any given time. Furthermore, while streaming the data object, these streaming techniques can construct an object signature for the data object that captures relevant features of the elements in the data object, while occupying significantly less space than the original data object. These object signatures are useful because many operations on the original data objects (such as comparisons) can be performed more efficiently on the object signatures with significantly reduced memory and computational requirements. Moreover, these object signatures can be stored using very little space for future reuse.
Charikar has applied the object signature technique to estimate the similarity between arbitrarily complex objects (see Moses S. Charikar, “Similarity Estimation Techniques from Rounding Algorithms,” Proceedings of the 34th Annual ACM Symposium on Theory of Computing, 2002). Specifically, Charikar's model first computes an object signature for an object in a streaming manner, such that the elements of the object are fed one-by-one through the model, while maintaining an internal state for the object. More specifically, the model applies a hashing operation to each of the elements in the object, and the hashed value of the element is used to update the internal state for the object. When all elements of the object have been processed, the model uses the final internal state to compute a signature for the object. Note that the internal state for the object requires only a small amount of space, which in practice is independent of the size of the object.
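By way of illustration only, the streaming computation described above can be sketched as follows. This is a minimal sketch of one common realization of the scheme, not necessarily the exact formulation in the cited paper. The 64-bit signature width, the use of BLAKE2b as the hashing operation, the per-bit counter array used as the internal state, and the names `element_hash`, `signature`, and `similarity` are all assumptions introduced for this example.

```python
import hashlib

BITS = 64  # assumed signature width; the text above does not fix one


def element_hash(element: str) -> int:
    """Hash one element to a BITS-wide integer (BLAKE2b is an assumed choice)."""
    digest = hashlib.blake2b(element.encode("utf-8"), digest_size=BITS // 8).digest()
    return int.from_bytes(digest, "big")


def signature(elements) -> int:
    """Stream the elements one by one and return a BITS-wide signature.

    The internal state is a fixed array of BITS counters, so its size does
    not depend on how many elements the object contains.
    """
    state = [0] * BITS
    for element in elements:
        h = element_hash(element)
        for i in range(BITS):
            # Each hashed bit votes the matching counter up (1) or down (0).
            state[i] += 1 if (h >> i) & 1 else -1
    sig = 0
    for i, count in enumerate(state):
        if count > 0:  # final state -> one signature bit per counter
            sig |= 1 << i
    return sig


def similarity(sig_a: int, sig_b: int) -> float:
    """Estimate similarity as the fraction of signature bits that agree."""
    differing = bin(sig_a ^ sig_b).count("1")
    return 1.0 - differing / BITS
```

In this sketch, two objects are compared by computing `similarity(signature(a), signature(b))`, so the comparison touches only the two 64-bit signatures and never the original elements.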
Unfortunately, Charikar's model has a drawback. Specifically, while generating the object signature, Charikar's model tends to overemphasize the influence of multiple occurrences of an identical feature in an object. In other words, when the same feature occurs multiple times in the object, the influence of that feature on the resulting object signature increases dramatically, thereby degrading the utility of the object signature for many types of operations, such as comparisons.
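The effect of repeated features can be illustrated with a small sketch. The code below assumes a counter-based realization of the signature, in which each hashed bit of each element adds +1 or -1 to a per-bit counter; the 64-bit width, the BLAKE2b hash, and the element names are hypothetical choices made only for this example. Under these assumptions, a feature that occurs more often than all other elements combined determines every bit of the signature.

```python
import hashlib

BITS = 64  # assumed signature width


def element_hash(element: str) -> int:
    """Hash one element to a BITS-wide integer (assumed hash choice)."""
    digest = hashlib.blake2b(element.encode("utf-8"), digest_size=BITS // 8).digest()
    return int.from_bytes(digest, "big")


def signature(elements) -> int:
    """Counter-based signature: every occurrence of an element casts a vote."""
    state = [0] * BITS
    for element in elements:
        h = element_hash(element)
        for i in range(BITS):
            state[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i, count in enumerate(state) if count > 0)


# Hypothetical object: three distinct features plus one feature repeated
# ten times.
distinct = ["alpha", "beta", "gamma"]
repeated = distinct + ["delta"] * 10

# The repeated feature contributes +10 or -10 to every counter, while the
# other three features together contribute at most +3 or -3, so the sign of
# every counter is set by "delta" alone.
dominated = signature(repeated) == signature(["delta"])
```

Here `dominated` evaluates to `True`: the signature of the object is identical to the signature of an object containing only the repeated feature, and the three distinct features no longer influence any comparison made on the signature.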
Hence, what is needed is a method and an apparatus for generating an object signature for an object without the above-described problems.