The invention relates to the field of mathematics, and more particularly to the computation of sparse matrices. In mathematical terms, a vector is a sequence of integral keys and corresponding numeric values. A matrix is a two-dimensional grid of key-pairs with corresponding values, or equivalently a vector of vectors. In common practice, many matrices are "sparse," meaning that the majority of the numeric values are zero. The building and manipulation of vectors and matrices is a common mathematical task. Memory consumption and processor time are of critical interest in performing these tasks.
The word-indexing of a set of documents, such as those used in document search engines, can be modeled as a sparse matrix. Because it is desired to both build and query such matrices quickly, and because their data is essentially dispersed at random, they are particularly susceptible to any number of speed tradeoffs made in current sparse matrix implementations.
Typically, a large set of documents is indexed by the words they contain. The purpose of the index is to speed up the lookup of words. Each word, and each document containing it, is represented by a unique identifier, generally an integer. Such an index is a matrix wherein the word is the primary key, yielding vectors in which the document is the secondary key, and finally yielding values such as the occurrence frequency of said word. The invention presents a novel data structure representation for vectors and matrices. Further, the invention presents numerous implementations of common methods on such vectors and matrices. Finally, the invention is applied to improving the performance of document index building and querying.
Manipulation of large sparse matrices is a common mathematical task. There are many applications that require a space and time efficient data structure and related methods to perform these computations. However, existing implementations suffer from poor performance in particular tasks, or are specific to matrices of a particular pattern.
To analyze the computational time and memory required by an algorithm, standard asymptotic notation (O) will be used. "M" and "N" will represent vectors. "m" and "n" will represent the number of elements in the vectors, respectively.
Many data structures and methods have been used for sparse matrix tasks. The classic method is to simply use arrays to implement vectors, and two-dimensional arrays to implement matrices. There are several benefits to this approach. First and foremost, it is conceptually simple. The vector indices map conveniently onto the array indices, and the adjacency of the values stored in memory is pleasingly similar to how a human would represent them visually. Furthermore, it allows quick, i.e., constant, access time to each value. The usual primitive operations on vectors: addition, multiplication, etc., can all be implemented efficiently - in linear time with respect to the number of values.
However, the memory consumption of arrays is proportional to the product of the matrix dimensions. In other words, arrays store zero values and hence do not take advantage of the sparsity of the matrices. A word-document index may contain a large number of documents, as well as a large number of words. However, a majority of words in the language(s) do not occur in any particular document. Hence such an index is generally a large, but sparse, matrix. For this reason, there is considerable interest in other representations that only require memory allocation proportional to the actual number of nonzero values, while still achieving acceptable operation times.
One such representation makes use of "compressed vectors", which are generally stored as parallel arrays of key-value pairs. Pointers are eschewed, as they require additional memory to store the addresses. The key-value pairs can be sorted (by key) or unsorted. The usual tradeoffs apply. Sorted vectors permit a binary search algorithm, resulting in O(log(n)) access time. Unsorted vectors require a linear search algorithm, resulting in O(n) time. But maintaining sorted vectors results in extra overhead for insertion and deletion of entries, the worst case being O(n) time. A search engine then needs to process the vector of documents for each word in the query. In order to generate more relevant results from a search, it is typically desired that search engines be queried with multiple words, combined with an "AND" or "OR" operation. This in turn results in the merging of multiple word vectors, with a set-like operation of intersection or union. Keeping the vectors sorted thus yields much faster merging for multiple word queries, at the expense of more computation being done during the building of the index.
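The tradeoff for sorted compressed vectors can be illustrated with a minimal sketch in Python. The class name SortedCompressedVector and its method names are hypothetical and chosen only for illustration; the sketch assumes integer keys and treats absent keys as zero.

```python
from bisect import bisect_left


class SortedCompressedVector:
    """Sparse vector held as parallel arrays of keys and values, sorted by key.

    Hypothetical illustration; not taken from any particular library.
    """

    def __init__(self):
        self.keys = []
        self.values = []

    def get(self, key):
        # Binary search over the sorted keys: O(log n) comparisons.
        i = bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.values[i]
        return 0  # absent keys are implicit zeros

    def set(self, key, value):
        i = bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            self.values[i] = value
        else:
            # Inserting shifts every later entry: O(n) in the worst case.
            self.keys.insert(i, key)
            self.values.insert(i, value)
```

Lookup is logarithmic, but each insertion of a new key may shift the remainder of both arrays, which is the O(n) overhead described above.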
Compressed matrices consist of an array of compressed vector keys, with a corresponding full array of compressed vector values. However, unless all the vectors of the matrix have similar length, this will still result in a large amount of wasted memory. Such a representation is poor for document indexes because their vectors cannot be expected to be of similar size.
Compressed Sparse Row (or Column) is another common method for sparse matrix representation. The vectors are compressed in the usual manner and concatenated in an order according to their own key. A third array stores the start position of each vector. Access to a given vector is constant, access to a value is O(log(n)), and insertion/deletion is O(n^2). Several variants with the same basic tradeoffs are available.
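A minimal Python sketch of the Compressed Sparse Row layout follows. The function names csr_from_rows and csr_get are hypothetical, and the input format (one dictionary per row) is an assumption made only to keep the example short.

```python
from bisect import bisect_left


def csr_from_rows(rows):
    """Build CSR arrays from a list of {column_key: value} dicts, one per row."""
    col_keys, values, row_start = [], [], [0]
    for row in rows:
        for col in sorted(row):        # each row is stored sorted by column key
            col_keys.append(col)
            values.append(row[col])
        row_start.append(len(col_keys))  # third array: start of the next row
    return col_keys, values, row_start


def csr_get(csr, r, c):
    """Return the value at row r, column c (zero if not stored)."""
    col_keys, values, row_start = csr
    lo, hi = row_start[r], row_start[r + 1]  # constant-time access to the row
    i = bisect_left(col_keys, c, lo, hi)     # binary search within the row
    if i < hi and col_keys[i] == c:
        return values[i]
    return 0
```

Because all rows share the same concatenated arrays, inserting a new nonzero value requires shifting every later entry and updating the start positions of all following rows.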
Storing the matrix relative to a Diagonal (or Skyline) method is also common. Generally these are used if values can be expected to be concentrated along the main diagonal of the matrix. Otherwise, memory reduction is not significant.
All of the above methods offer relatively slow value lookups, O(log(n)) or worse, and extremely poor modification penalties, at least O(n) and generally O(n^2). Furthermore, irrespective of the data structure choice for matrices, unsorted compressed vectors and temporary full-size vectors are generally used in practice.
When a word-document index is queried, generally an entire vector is desired, not just one word-document value. Thus the relatively slow lookup times for individual values are not an issue. However, constructing such an index with O(n) insertion time results in an excruciatingly slow build time. Consequently, it is common to build a forward index first (document by word) and then invert the index (to word by document) for querying. By compiling all the values first, they can be sorted by word, thereby building the inverted index without the O(n) insertion hit.
But this common solution has several drawbacks. It requires a second pass through all the data, thus increasing the time to build the index. It also results in two distinct phases of indexing, a "build" phase and a "query" phase. Hence, the index cannot be effectively used while it is being built, and any updates to the documents cannot be quickly integrated into the index.
In addition to the above shortcomings, querying using compressed vectors can also yield sub-optimal performance times. A query is answered by acquiring the appropriate word vectors, and then merging them together. The merging operation may be as simple as adding the scores for each word-document value. Irrespective of what function is used to merge the values, however, there is the overhead of the set operation used to combine the values at issue. Assume there are M query words, each of which has an average word vector length of N. If the query is an OR of the words, these vectors must be unioned together. Hence, there are potentially O(M*N) result documents, which is a convenient lower bound for how long such a union operation can take. Unfortunately, if the vectors must be kept in a standard compressed format, even if they are sorted by document, the union operation can be shown to take O((M^2)*N). It is typical to unpack each vector into an uncompressed format to overcome this. However, this workaround assumes there is sufficient memory to unpack a vector. It may be the case that the range of possible vector values is orders of magnitude larger than the actual size of the vectors, making this unreasonable in practice.
The situation is even worse in an AND operation, where the vectors need to be intersected. It can be easily shown that the standard intersection operation must take at least O(M*N) time, as all the values need to be scanned. In practice it may take even longer, depending on how small the resulting vector is. Surprisingly, O(M*N) is not optimal in this case: all the values do not have to be scanned. Only the smallest vector remaining needs to be scanned to intersect it with another, and the result vector continues to get smaller. However, the standard algorithm cannot take advantage of this fact, nor does uncompressing the vector help the situation, because then the larger range of possible values must again be scanned.
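The observation that only the smallest remaining vector needs to be scanned can be sketched as follows, assuming each vector supports expected constant-time key lookup. Python dictionaries stand in for such vectors here, the function name intersect is hypothetical, and summing the values is merely one possible merge function.

```python
def intersect(vectors):
    """AND-merge dict-based sparse vectors, summing the values of shared keys.

    Hypothetical sketch: dicts stand in for any vector with O(1) expected lookup.
    """
    if not vectors:
        return {}
    ordered = sorted(vectors, key=len)  # start from the smallest vector
    result = dict(ordered[0])
    for other in ordered[1:]:
        # Only the current result is scanned; it can only shrink at each step.
        result = {k: v + other[k] for k, v in result.items() if k in other}
    return result
```

Under these assumptions, the work at each step is proportional to the size of the shrinking result rather than to the size of the larger vector.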
In conclusion, it has been shown that the prior art methods of vector and matrix representation have several shortcomings, especially as related to common search engine operations. Both indexing and querying have sub-optimal operations with respect to time. It is therefore desirable to create a new implementation which improves operations to their optimal limits, while keeping the amount of memory consumed to be proportional to existing methods. This new implementation will require the use of an existing data structure: a hash table.
A hash function refers to a mapping from an arbitrary domain to a restricted range, usually consisting only of integers and of smaller size. It performs a transformation that takes an input m and returns a fixed-size string, which is called the hash value h (that is, h=H(m)). Though the function cannot be one-to-one, it is desirable to have a minimal number of collisions (i.e., having more than one key map to the same position). Furthermore, it is generally desirable that the output of the function be statistically random, so that the hash values aren't clustered. Hash functions almost always end with a modulo operation, thereby yielding an integer in the desired range. For some applications, merely a modulo hash function is adequate.
Hash tables are a well-known data structure, in which keys are mapped to array positions by a hash function. The table is simply an array of key-value slots. Keys are processed through a hash function that yields an index of a slot in the table. Values are then inserted, deleted, or modified in the slot as necessary after a key lookup. The result is constant access time, albeit a higher constant than with arrays.
Collisions occur if two different keys hash to the same slot. There are generally two ways of resolving collisions. "Chaining" refers to putting multiple values together, linked to the slot. As this requires the use of pointers, which result in additional memory and time overhead, it is often not preferred. "Open addressing" (or "open indexing") is the alternate method of resolving collisions. In addition to the initial hash, there is a probe sequence based on the hash value that iterates until the correct slot, or an empty slot, is found. This probe sequence must be deterministic and never repeat a slot. There are several well-known algorithms for implementing such a sequence. Although collisions naturally increase access time, it can be shown that the statistical expectation of look-up time is constant with reasonably sized hash tables. It is typical to keep the hash table at most half full, and to double its size when growing, to achieve acceptable performance. Deletion is handled by inserting a dummy key into the vacated slot. The dummy key must be distinguishable from a null key, so that the probe sequence will continue past it if necessary.
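A minimal Python sketch of open addressing with dummy keys follows. It assumes integer keys, a modulo hash, and linear probing as the probe sequence; linear probing is only one of the well-known choices, and the class and sentinel names are hypothetical.

```python
EMPTY = object()    # null key: the slot has never been used
DELETED = object()  # dummy key: the slot was vacated, so probing must continue


class OpenAddressTable:
    def __init__(self, capacity=8):
        self.keys = [EMPTY] * capacity
        self.values = [None] * capacity
        self.used = 0  # slots holding either a live key or a dummy key

    def _probe(self, key):
        # Linear probing: deterministic, and visits each slot exactly once.
        n = len(self.keys)
        start = key % n
        for i in range(n):
            yield (start + i) % n

    def get(self, key, default=0):
        for slot in self._probe(key):
            k = self.keys[slot]
            if k is EMPTY:
                return default  # a null key ends the probe sequence
            if k is not DELETED and k == key:
                return self.values[slot]
        return default

    def set(self, key, value):
        if 2 * (self.used + 1) > len(self.keys):
            self._grow()  # keep the table at most half full
        first_free = None
        for slot in self._probe(key):
            k = self.keys[slot]
            if k is EMPTY:
                if first_free is None:
                    first_free = slot
                    self.used += 1  # a never-used slot becomes occupied
                break
            if k is DELETED:
                if first_free is None:
                    first_free = slot  # reuse the first dummy slot seen
            elif k == key:
                self.values[slot] = value
                return
        self.keys[first_free] = key
        self.values[first_free] = value

    def delete(self, key):
        for slot in self._probe(key):
            k = self.keys[slot]
            if k is EMPTY:
                return
            if k is not DELETED and k == key:
                self.keys[slot] = DELETED  # leave a dummy key behind
                self.values[slot] = None
                return

    def _grow(self):
        live = [(k, v) for k, v in zip(self.keys, self.values)
                if k is not EMPTY and k is not DELETED]
        self.keys = [EMPTY] * (2 * len(self.keys))  # double the table size
        self.values = [None] * len(self.keys)
        self.used = 0
        for k, v in live:
            self.set(k, v)
```

In this sketch, dummy keys count toward the load factor until the table grows, at which point only live entries are rehashed.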
It is an object of the invention to present a novel data structure and applicable methods for sparse vector and matrix computations with optimal operation limits, while keeping the amount of memory consumed proportional to that of existing methods. It is another object of the invention to present the following features: modification of values occurring in constant time; operations on vectors occurring in linear time as appropriate; and total memory consumption being linear with respect to the number of nonzero values.
It is a further object of the invention to combine these features to enable the sparse matrix representation to be used for a searchable word-document index to produce fast build and query times for document retrieval. The index could be built in linear time (with respect to the number of nonzero values), and can be queried in constant time. Furthermore, no conversion between build or query modes would be necessary. The index can be queried even as it is being built.
A new data structure and algorithms are presented which offer performance at least equal to the prior art in common sparse matrix tasks, and improved performance in additional tasks. Further, this representation is applied to a word-document index to produce fast build and query times for document retrieval.
Under this new data structure, matrices and vectors are represented as dictionaries of key-value pairs. A vector is a dictionary of integral keys and corresponding numeric values. A matrix is a dictionary of integral keys and corresponding vector values. The dictionaries themselves are implemented using hash tables. When a key is looked up, it is hashed, and its corresponding slot (which may be empty) is found in the hash table. The value of the slot is then retrieved, inserted, or deleted as appropriate. The hash function used is a modulo of the key itself.
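The dictionary representation can be sketched in Python as follows. Python's built-in dict is used as a stand-in hash table; it applies its own hash function rather than a plain modulo of the key, so this only approximates the structure described. The names index, add_occurrence, and lookup are hypothetical.

```python
from collections import defaultdict

# Matrix: word id -> vector. Vector: document id -> occurrence count.
index = defaultdict(dict)


def add_occurrence(index, word_id, doc_id, count=1):
    """Record occurrences of a word in a document (expected constant time)."""
    vector = index[word_id]  # creates an empty vector on first use
    vector[doc_id] = vector.get(doc_id, 0) + count


def lookup(index, word_id):
    """Return the document vector for a word, or an empty vector if unseen."""
    return index.get(word_id, {})  # .get does not create a new entry
```

Because insertion and lookup use the same structure, such an index could be queried at any point while occurrences are still being added.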
The departure from more traditional representations in the prior art is that the context of the values is no longer relevant. Values are not kept sorted, or even spatially organized at all. The tradeoff is the emphasis on constant time access and modification, which is generally unavailable in other methods. Though it is not surprising that such a dictionary could be used to build a vector, the fact that the random organization of the values will be shown to be advantageous to numerous computations is novel and nonobvious.
The resulting implementation yields memory use that is comparable to or better than that of other known methods, and at the same time yields computation efficiencies that are comparable to or better than those of other known methods. Further, certain computations will be shown to be superior to all known methods.
The constant insert and access time will be shown to implement an extremely fast document index. Additionally, fast set operations on the vectors will be shown to implement multiple word queries efficiently. See FIG. 1.