Manipulation of large sparse matrices is a common mathematical task. There are many applications that require a space- and time-efficient data structure and related methods to perform these computations. However, existing implementations suffer from poor performance in particular tasks, or are specific to matrices of a particular pattern.
To analyze the computational time and memory requirements of an algorithm, standard asymptotic notation (O) will be used. “M” and “N” will represent vectors, and “m” and “n” will represent the number of elements in those vectors, respectively.
Many data structures and methods have been used for sparse matrix tasks. The classic method is to simply use arrays to implement vectors, and two-dimensional arrays to implement matrices. There are several benefits to this approach. First and foremost, it is conceptually simple. The vector indices map conveniently onto the array indices, and the adjacency of the values stored in memory is pleasingly similar to how a human would represent them visually. Furthermore, it allows quick, i.e., constant, access time to each value. The usual primitive operations on vectors: addition, multiplication, etc., can all be implemented efficiently—in linear time with respect to the number of values.
However, the memory consumption of arrays is proportional to the product of the matrix dimensions. In other words, arrays store zero values and hence do not take advantage of the sparsity of the matrices. A word-document index may contain a large number of documents, as well as a large number of words. However, a majority of words in the language(s) do not occur in any particular document. Hence such an index is generally a large, but sparse, matrix. For this reason, there is considerable interest in other representations that only require memory allocation proportional to the actual number of nonzero values, while still achieving acceptable operation times.
One such representation makes use of “compressed vectors”, which are generally stored as parallel arrays of key-value pairs. Pointers are eschewed, as they require additional memory to store the addresses. The key-value pairs can be sorted (by key) or unsorted. The usual tradeoffs apply. Sorted vectors permit a binary search, resulting in O(log(n)) access time. Unsorted vectors require a linear search, resulting in O(n) time. But maintaining sorted vectors results in extra overhead for insertion and deletion of entries, the worst case being O(n) time. A search engine then needs to process the vector of documents for each word in the query. In order to generate more relevant results, search engines are typically queried with multiple words, combined with an “AND” or “OR” operation. This in turn results in the merging of multiple word vectors, with a set-like operation of intersection or union. Keeping the vectors sorted thus yields much faster merging for multiple-word queries, at the expense of more computation being done during the building of the index.
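The sorted variant described above can be sketched as follows. This is a minimal illustration only: the class name `SortedCompressedVector` and its method names are hypothetical, and Python lists stand in for the parallel arrays.

```python
from bisect import bisect_left


class SortedCompressedVector:
    """Sparse vector held as two parallel arrays, kept sorted by key."""

    def __init__(self):
        self.keys = []    # indices of the nonzero entries, ascending
        self.values = []  # values[i] belongs to keys[i]

    def get(self, key, default=0):
        # Binary search over the keys: O(log n).
        i = bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.values[i]
        return default

    def set(self, key, value):
        # Locating the position is O(log n), but inserting into the middle
        # of an array shifts the tail, so the worst case is O(n).
        i = bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            self.values[i] = value
        else:
            self.keys.insert(i, key)
            self.values.insert(i, value)


# Entries arrive out of order but are stored in key order.
vec = SortedCompressedVector()
for doc_id, score in [(42, 1.5), (7, 0.5), (19, 2.0)]:
    vec.set(doc_id, score)
```

An unsorted variant would append in constant time but would need a linear scan in `get`, which is the tradeoff the paragraph above describes.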
Compressed matrices consist of an array of compressed vector keys, with a corresponding full array of compressed vector values. However, unless all the vectors of the matrix have similar length, this will still result in a large amount of wasted memory. Such a representation is poor for document indexes because their vectors cannot be expected to be of similar size.
Compressed Sparse Row (or Column) is another common method for sparse matrix representation. The vectors are compressed in the usual manner and concatenated in an order according to their own key. A third array stores the start position of each vector. Access to a given vector takes constant time, access to a value is O(log(n)), and insertion/deletion is O(n^2). Several variants with the same basic tradeoffs are available.
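A small sketch of the Compressed Sparse Row layout may make the three arrays concrete. The matrix contents and the names `col_keys`, `values`, `row_start`, `csr_get`, and `csr_insert` are assumptions chosen for illustration, not part of any particular library.

```python
from bisect import bisect_left

# Hypothetical 3x5 matrix:
#   row 0: col 1 -> 4.0, col 3 -> 1.0
#   row 1: (empty)
#   row 2: col 0 -> 2.0, col 2 -> 5.0, col 4 -> 3.0
col_keys = [1, 3, 0, 2, 4]           # column keys, rows concatenated
values = [4.0, 1.0, 2.0, 5.0, 3.0]   # values[i] belongs to col_keys[i]
row_start = [0, 2, 2, 5]             # row r spans row_start[r]:row_start[r + 1]


def csr_get(r, c):
    """Look up one value: O(1) to find the row, O(log n) within it."""
    lo, hi = row_start[r], row_start[r + 1]
    i = bisect_left(col_keys, c, lo, hi)
    if i < hi and col_keys[i] == c:
        return values[i]
    return 0.0


def csr_insert(r, c, v):
    """Insert or overwrite one value.

    A new entry shifts every later entry in both arrays and bumps the
    start position of every later row, which is why modification is costly.
    """
    lo, hi = row_start[r], row_start[r + 1]
    i = bisect_left(col_keys, c, lo, hi)
    if i < hi and col_keys[i] == c:
        values[i] = v
        return
    col_keys.insert(i, c)
    values.insert(i, v)
    for k in range(r + 1, len(row_start)):
        row_start[k] += 1
```

Note that a single insertion into an early row touches the bookkeeping of all rows after it, even though those rows did not change.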
Storing the matrix relative to a Diagonal (or Skyline) method is also common. Generally these are used if values can be expected to be concentrated along the main diagonal of the matrix. Otherwise, memory reduction is not significant.
All of the above methods offer relatively slow value lookups, O(log(n)) or worse, and extremely high modification penalties: at least O(n) and generally O(n^2). Furthermore, irrespective of the data structure choice for matrices, unsorted compressed vectors and temporary full-size vectors are generally used in practice.
When a word-document index is queried, generally an entire vector is desired, not just one word-document value. Thus the relatively slow lookup times for individual values are not an issue. However, constructing such an index with O(n) insertion time results in an excruciatingly slow build time. Consequently, it is common to build a forward index first (document by word) and then invert the index (to word by document) for querying. By compiling all the values first, they can be sorted by word, thereby building the inverted index without the O(n) insertion hit.
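The build-then-invert approach can be sketched as below. The sample documents, counts, and the names `forward_index`, `triples`, and `inverted_index` are hypothetical; the point is only that a single sort replaces many individual sorted insertions.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical forward index: document id -> {word: count}.
forward_index = {
    0: {"sparse": 2, "matrix": 1},
    1: {"hash": 3, "matrix": 1},
    2: {"sparse": 1, "hash": 1},
}

# First pass: flatten everything into (word, doc, count) triples.
triples = [
    (word, doc, count)
    for doc, words in forward_index.items()
    for word, count in words.items()
]

# Second pass: one sort by (word, doc) groups each word's entries
# together and leaves each word vector sorted by document.
triples.sort()

inverted_index = {
    word: [(doc, count) for _, doc, count in group]
    for word, group in groupby(triples, key=itemgetter(0))
}
```

Each word vector in `inverted_index` comes out already sorted by document id, but nothing can be queried until both passes have finished.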
But this common solution has several drawbacks. It requires a second pass through all the data, thus increasing the time to build the index. It also results in two distinct phases of indexing, a “build” phase and a “query” phase. Hence, the index cannot be effectively used while it is being built, and any updates to the documents cannot be quickly integrated into the index.
In addition to the above shortcomings, querying using compressed vectors can also yield sub-optimal performance times. A query is answered by acquiring the appropriate word vectors, and then merging them together. The merging operation may be as simple as adding the scores for each word-document value. Irrespective of what function is used to merge the values, however, there is the overhead of the set operation used to combine the values at issue. Assume there are M query words, each of which has an average word vector length of N. If the query is an OR of the words, these vectors must be unioned together. Hence, there are potentially O(M*N) result documents, which is a convenient lower bound for how long such a union operation can take. Unfortunately, if the vectors must be kept in a standard compressed format, even if they are sorted by document, the union operation can be shown to take O((M^2)*N). It is typical to unpack each vector into an uncompressed format to overcome this. However, this workaround assumes there is sufficient memory to unpack a vector. It may be the case that the range of possible vector values is orders of magnitude larger than the actual size of the vectors, making this unreasonable in practice.
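The quadratic factor can be seen in a sketch of the union over sorted compressed vectors. The function names `union_sorted` and `union_all` are hypothetical, and adding scores is assumed as the merge function, as in the simple case mentioned above.

```python
def union_sorted(a, b):
    """Merge two sorted (doc, score) lists, adding scores for shared docs."""
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i][0] == b[j][0]:
            out.append((a[i][0], a[i][1] + b[j][1]))
            i += 1
            j += 1
        elif a[i][0] < b[j][0]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out


def union_all(vectors):
    """Fold M sorted vectors into one by repeated two-way merging."""
    result = []
    for vec in vectors:
        # The accumulated result can grow by up to N entries per step and is
        # rescanned each time, so step k costs O(k*N) and the whole fold is
        # O(M^2 * N) in the worst case.
        result = union_sorted(result, vec)
    return result


word_vectors = [
    [(1, 1.0), (4, 2.0)],
    [(2, 1.0), (4, 0.5)],
    [(1, 3.0), (9, 1.0)],
]
merged = union_all(word_vectors)
```

Each two-way merge is linear in its inputs; the cost comes from rescanning the growing intermediate result once per query word.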
The situation is even worse in an AND operation, where the vectors need to be intersected. It can be easily shown that the standard intersection operation takes at least O(M*N) time, as all the values need to be scanned. The wasted effort grows with how small the resulting vector is. Surprisingly, O(M*N) is not optimal in this case: all the values don't have to be scanned. Only the smallest vector remaining needs to be scanned to intersect it with another, and the result vector continues to get smaller. However, the standard algorithm cannot take advantage of this fact, nor does uncompressing the vector help the situation, because then the larger range of possible values must again be scanned.
In conclusion, it has been shown that the prior art methods of vector and matrix representation have several shortcomings, especially as related to common search engine operations. Both indexing and querying have sub-optimal operations with respect to time. It is therefore desirable to create a new implementation which improves operations to their optimal limits, while keeping the amount of memory consumed proportional to that of existing methods. This new implementation will require the use of an existing data structure: a hash table.
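The cost of the standard intersection can be illustrated with a two-pointer merge over sorted vectors. This is a sketch under stated assumptions: the function name `intersect_sorted`, the step counter, and the sample vectors are all hypothetical, and scores are combined by addition.

```python
def intersect_sorted(a, b):
    """Standard merge intersection of two sorted (doc, score) lists.

    Returns the intersection and the number of loop steps taken.
    """
    out = []
    steps = 0
    i = j = 0
    while i < len(a) and j < len(b):
        steps += 1
        if a[i][0] == b[j][0]:
            out.append((a[i][0], a[i][1] + b[j][1]))
            i += 1
            j += 1
        elif a[i][0] < b[j][0]:
            i += 1
        else:
            j += 1
    return out, steps


# A two-entry vector intersected with a thousand-entry vector.
small = [(500, 1.0), (999, 1.0)]
large = [(d, 0.5) for d in range(1000)]
result, steps = intersect_sorted(small, large)
```

Although only the two entries of `small` can possibly survive, the merge still walks through `large` entry by entry; the work is governed by the larger vector rather than the smaller one.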
A hash function refers to a mapping from an arbitrary domain to a restricted range, usually consisting only of integers and of smaller size. It performs a transformation that takes an input m and returns a fixed-size string, which is called the hash value h (that is, h=H(m)). Though the function cannot be one-to-one, it is desirable to have a minimal number of collisions (i.e., having more than one key map to the same position). Furthermore, it is generally desirable that the output of the function be statistically random, so that the hash values aren't clustered. Hash functions almost always end with a modulo operation, thereby yielding an integer in the desired range. For some applications, merely a modulo hash function is adequate.
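As a brief illustration of the simplest case, a modulo-only hash over integer keys might look like the following. The function name `modulo_hash` and the table size of 8 are assumptions for the example.

```python
def modulo_hash(key, table_size):
    """Map a non-negative integer key into the range 0..table_size - 1."""
    return key % table_size


# Keys that differ by a multiple of the table size land on the same slot.
slots = [modulo_hash(k, 8) for k in (3, 11, 19, 20)]
```

Here the keys 3, 11, and 19 all map to the same position, which is the kind of collision that must be resolved by the table.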
Hash tables are a well-known data structure, in which keys are mapped to array positions by a hash function. The table is simply an array of key-value slots. Keys are processed through a hash function that yields an index of a slot in the table. Values are then inserted, deleted, or modified in the slot as necessary after a key lookup. The result is constant access time, albeit a higher constant than with arrays.
Collisions occur if two different keys hash to the same slot. There are generally two ways of resolving collisions. “Chaining” refers to putting multiple values together, linked to the slot. As this requires the use of pointers, which result in additional memory and time overhead, it is often not preferred. “Open addressing” (or “open indexing”) is the alternate method of resolving collisions. In addition to the initial hash, there is a probe sequence based on the hash value that iterates until the correct slot, or an empty slot, is found. This probe sequence must be deterministic and never repeat a slot. There are several well-known algorithms for implementing such a sequence. Although collisions naturally increase access time, it can be shown that the statistical expectation of look-up time is constant with reasonably sized hash tables. It is typical to keep the hash table half full, and double its size when growing, to achieve acceptable performance. Deletion is handled by inserting a dummy key into the vacated slot. The dummy key must be distinguishable from a null key, so that the probe sequence will continue past it if necessary.
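Open addressing with dummy keys can be sketched as follows. Several choices here are assumptions rather than anything prescribed above: linear probing is used as the probe sequence, Python's built-in `hash` followed by a modulo serves as the hash function, and the names `OpenAddressTable`, `_EMPTY`, and `_DELETED` are hypothetical.

```python
_EMPTY = object()    # null key: slot never used, probing may stop here
_DELETED = object()  # dummy key: slot vacated, probing must continue


class OpenAddressTable:
    def __init__(self, capacity=8):
        self._keys = [_EMPTY] * capacity
        self._vals = [None] * capacity
        self._used = 0  # slots holding a live key or a dummy key

    def _probe(self, key):
        # Linear probing: deterministic, visits every slot exactly once.
        n = len(self._keys)
        start = hash(key) % n
        for k in range(n):
            yield (start + k) % n

    def get(self, key, default=None):
        for i in self._probe(key):
            if self._keys[i] is _EMPTY:
                return default
            if self._keys[i] is not _DELETED and self._keys[i] == key:
                return self._vals[i]
        return default

    def set(self, key, value):
        # Keep the table at most half full; double its size when growing.
        if 2 * (self._used + 1) > len(self._keys):
            self._grow()
        first_free = None
        for i in self._probe(key):
            k = self._keys[i]
            if k is _EMPTY:
                if first_free is None:
                    first_free = i
                    self._used += 1
                break
            if k is _DELETED:
                if first_free is None:
                    first_free = i  # reuse the dummy slot if key is absent
            elif k == key:
                self._vals[i] = value
                return
        self._keys[first_free] = key
        self._vals[first_free] = value

    def delete(self, key):
        for i in self._probe(key):
            if self._keys[i] is _EMPTY:
                return False
            if self._keys[i] is not _DELETED and self._keys[i] == key:
                self._keys[i] = _DELETED  # leave a dummy key behind
                self._vals[i] = None
                return True
        return False

    def _grow(self):
        live = [(k, v) for k, v in zip(self._keys, self._vals)
                if k is not _EMPTY and k is not _DELETED]
        self._keys = [_EMPTY] * (2 * len(self._keys))
        self._vals = [None] * len(self._keys)
        self._used = 0
        for k, v in live:
            self.set(k, v)


# With 8 slots, the keys 1, 9, and 17 all hash to slot 1 and collide.
table = OpenAddressTable()
for key, val in [(1, "a"), (9, "b"), (17, "c")]:
    table.set(key, val)
table.delete(9)
```

After the deletion, a lookup of key 17 still succeeds because the probe sequence continues past the dummy key left in the slot that key 9 occupied; had that slot been reset to the null key, the lookup would have stopped early and missed the entry.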