The present invention relates generally to document information retrieval, and more particularly to a method and apparatus for computing a measure of similarity between arbitrary sequences of symbols.
An important aspect of document information retrieval, classification, categorization, clustering, routing, cross-lingual information retrieval, and filtering is the computation of a measure of similarity between two documents, each of which can be reduced to an arbitrary sequence of symbols. Most techniques for computing document similarity require the computation of pairwise similarities over large sets of documents. Experiments have shown that the adopted similarity measure greatly influences the performance of information retrieval systems.
One similarity measure known as the “string kernel” (also referred to herein as the “sequence kernel”) is disclosed by: Chris Watkins, in “Dynamic Alignment Kernels”, Technical Report CSD-TR-98-11, Department of Computer Science, Royal Holloway University of London, 1999; Huma Lodhi, Nello Cristianini, John Shawe-Taylor and Chris Watkins, in “Text Classification Using String Kernels”, Advances in Neural Information Processing Systems 13, the MIT Press, pp. 563-569, 2001; and Huma Lodhi, Craig Saunders, John Shawe-Taylor, Nello Cristianini, Chris Watkins, in “Text classification using string kernels”, Journal of Machine Learning Research, 2:419-444, 2002, which are all incorporated herein by reference.
Generally, the string kernel is a similarity measure between two sequences of symbols over the same alphabet, where similarity is assessed as the number of occurrences of (possibly noncontiguous) subsequences shared by the two sequences of symbols; the more subsequences in common, the greater the measure of similarity between the two sequences of symbols. The string kernel may be used to evaluate the similarity between different types of sequences of symbols (or “symbolic data”) such as sequences of: characters, words, lemmas, or other predefined sets of terms (e.g., amino acids or DNA bases).
More specifically, the string kernel is referred to herein as a function which returns the dot product of the feature vectors of two input strings. The vector space in which the feature vectors are defined is referred to as the “feature space”. The feature space of the string kernel is the space of all subsequences of length “n” characters in the input strings. The subsequences of characters may be contiguous or noncontiguous in the input strings; however, noncontiguous occurrences are penalized according to the number of gaps they contain.
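By way of illustration only, the definition above may be sketched as a direct (and deliberately inefficient) computation that builds both feature vectors explicitly and takes their dot product. The sketch assumes the weighting used in the Lodhi et al. references cited above, in which each occurrence of a subsequence contributes a decay factor raised to the length of the span it covers, so that occurrences with more gaps receive smaller weights. The names `feature_vector`, `string_kernel_naive`, and `lam` are hypothetical and are introduced here only for the example.

```python
from itertools import combinations


def feature_vector(s, n, lam):
    """Map a sequence s to a sparse feature vector over its length-n subsequences.

    Assumption (following Lodhi et al.): an occurrence whose first and last
    symbols sit at positions i_1 and i_n is weighted lam ** (i_n - i_1 + 1),
    so occurrences containing gaps are penalized when 0 < lam < 1.
    """
    phi = {}
    # Enumerate every increasing tuple of n positions, contiguous or not.
    for idx in combinations(range(len(s)), n):
        u = tuple(s[i] for i in idx)      # the subsequence, as a hashable key
        span = idx[-1] - idx[0] + 1       # length of s covered by this occurrence
        phi[u] = phi.get(u, 0.0) + lam ** span
    return phi


def string_kernel_naive(s, t, n, lam):
    """Dot product of the two feature vectors (exponential in n; for exposition)."""
    phi_s = feature_vector(s, n, lam)
    phi_t = feature_vector(t, n, lam)
    return sum(w * phi_t.get(u, 0.0) for u, w in phi_s.items())
```

For example, with n = 2 the strings "cat" and "car" share only the subsequence "ca", which occurs contiguously in each, so `string_kernel_naive("cat", "car", 2, lam)` evaluates to `lam ** 4`. Because the sketch indexes symbols generically, the inputs may equally be lists of words or lemmas rather than character strings.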
A limitation of existing implementations for computing the string kernel is the memory required to carry out the computation. Known implementations for computing the string kernel of two sequences of symbols rely on a dynamic programming technique that requires computing and storing a large number of intermediate results. Such known implementations store each intermediate result in its own variable (i.e., a component in a large array). Together, these intermediate results require memory storage that is proportional in size to the product of the lengths of the sequences being compared.
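The storage behavior described above may be illustrated with a minimal sketch of a dynamic programming computation based on the recursion published in the Lodhi et al. references cited above. It is offered as an assumption about the general shape of such implementations, not as a reproduction of any particular one. The names `string_kernel_dp`, `Kp`, and `lam` are hypothetical. Each level of the table `Kp` holds one entry per pair of prefix lengths, so for a fixed subsequence length n the storage grows with the product of the lengths of the two sequences.

```python
def string_kernel_dp(s, t, n, lam):
    """String kernel for subsequences of length n with decay factor lam.

    Kp[i][p][q] stores the auxiliary quantity K'_i of Lodhi et al. for the
    prefixes s[:p] and t[:q]. Storage: n * (len(s) + 1) * (len(t) + 1) floats.
    """
    ls, lt = len(s), len(t)
    Kp = [[[0.0] * (lt + 1) for _ in range(ls + 1)] for _ in range(n)]

    # Base case: K'_0 is 1 for every pair of prefixes.
    for p in range(ls + 1):
        for q in range(lt + 1):
            Kp[0][p][q] = 1.0

    for i in range(1, n):
        for p in range(1, ls + 1):
            kpp = 0.0  # running value of K''_i(s[:p], t[:q]) as q advances
            for q in range(1, lt + 1):
                kpp *= lam  # one more symbol of t lies after any earlier match
                if s[p - 1] == t[q - 1]:
                    kpp += lam * lam * Kp[i - 1][p - 1][q - 1]
                Kp[i][p][q] = lam * Kp[i][p - 1][q] + kpp

    # Final step: every matching pair of symbols closes a common subsequence.
    k = 0.0
    for p in range(1, ls + 1):
        for q in range(1, lt + 1):
            if s[p - 1] == t[q - 1]:
                k += lam * lam * Kp[n - 1][p - 1][q - 1]
    return k
```

Under these assumptions, comparing two sequences of 10,000 symbols each would require on the order of 10^8 stored values per level of the table, which illustrates why the storage requirement limits the length of the sequences that can be compared.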
Because existing techniques for computing this measure of similarity between arbitrary sequences of symbols require storage proportional to the product of the lengths of the sequences being compared, it would be advantageous to provide a technique for computing a string kernel that reduces this storage requirement, thereby enabling the computation of the string kernel for longer sequences of symbols.