Computer-based information systems can store large amounts of data. Despite the potentially enormous size of such data collections, responses to information retrieval queries over a dataset should be as informative, rapid, and accurate as possible. Information retrieval systems therefore often employ indexing techniques to improve precision and recall and to rapidly access specific information within a dataset.
Textual data stored in an information retrieval system can be indexed using a term-by-document matrix. In that context, a “term” means a word or phrase and a “document” is a collection of terms. However, generalized meanings of “term” and “document” can apply, as discussed hereinafter. A term-by-document matrix represents each term as a row and each document as a column. For the column representing a particular document, the elements going down the column can represent some function of the occurrence of terms within the document. For example, if term A is not used in document B, then the element of the term-by-document matrix in row A and column B can be zero, representing the absence of the term from the document. Alternatively, if term A is used X times in document B, then the element in row A and column B can be X, representing that term A occurs X times in the document.
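The construction described above can be sketched as follows. This is a minimal illustration using a hypothetical toy corpus; the document identifiers and texts are invented for the example and are not part of the original description.

```python
from collections import Counter

# Hypothetical toy corpus: document IDs and texts are illustrative only.
documents = {
    "d1": "hexagon angle polygon hexagon",
    "d2": "treaty empire war",
    "d3": "polygon area angle",
}

# The vocabulary of terms (rows) and the ordered document IDs (columns).
terms = sorted({t for text in documents.values() for t in text.split()})
doc_ids = sorted(documents)

# Each row corresponds to a term and each column to a document; the element
# is the number of times the term occurs in that document (zero if absent).
counts = {d: Counter(text.split()) for d, text in documents.items()}
matrix = [[counts[d][t] for d in doc_ids] for t in terms]
```

Here the element function is a simple occurrence count; the zero/nonzero convention described above falls out as the special case of absence versus presence.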
This term-by-document matrix structure enables responses to keyword search queries. The information retrieval system examines the row of the term-by-document matrix that corresponds to the queried keyword. Nonzero elements in that row indicate that the keyword term appears in the documents represented by those columns, and the information retrieval system returns those documents in response to the keyword query. Thus, the search identifies the documents containing a specific keyword by examining a single matrix. Once the term-by-document matrix is constructed, the individual documents within a dataset do not need to be searched when forming a response to a keyword query.
Furthermore, the elements of the term-by-document matrix can include a measure of the relevance of the term (given by the row) to the document (given by the column). This measure can be as simple as a count of how many times the term occurs within the document. Likewise, a more involved metric can be employed. Forming a term-by-document matrix with such elements lends itself to a statistical treatment of the matrix and enables more detailed query responses. For example, a response to a keyword query can be a list of documents containing the keyword, ordered such that the documents most relevant to the keyword are listed first. The most relevant documents can be those that include the most instances of the keyword.
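The single-row lookup and relevance ordering described above can be sketched as follows, using occurrence counts as the relevance measure. The corpus and function name are hypothetical, chosen only for illustration.

```python
from collections import Counter

# Hypothetical corpus; occurrence counts serve as the relevance measure.
documents = {
    "d1": "hexagon angle polygon hexagon",
    "d2": "hexagon history",
    "d3": "polygon area angle",
}
terms = sorted({t for text in documents.values() for t in text.split()})
doc_ids = sorted(documents)
counts = {d: Counter(text.split()) for d, text in documents.items()}
matrix = {t: [counts[d][t] for d in doc_ids] for t in terms}

def keyword_query(term):
    """Examine the single matrix row for `term` and return the matching
    documents, most relevant (highest count) first, without rescanning
    any of the underlying documents."""
    row = matrix.get(term, [0] * len(doc_ids))
    hits = [(d, c) for d, c in zip(doc_ids, row) if c > 0]
    return [d for d, c in sorted(hits, key=lambda pair: -pair[1])]
```

Note that only the one row for the queried term is consulted; the documents themselves play no part in answering the query once the matrix exists.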
For extremely large sets of documents with many keyword terms, the term-by-document matrix can become too large to manipulate during a keyword query. For this reason, simplification techniques can be employed that approximate the term-by-document matrix with a simpler matrix that is less time consuming to manipulate. Conventional Latent Semantic Indexing (LSI) employs a reduced rank version of the term-by-document matrix as an approximation of the original matrix. The approximation obtained has also been shown to improve overall information retrieval performance.
The LSI approach seeks to factor the term-by-document matrix using Singular Value Decomposition (SVD) and then makes some of the smallest singular values equal to zero, thereby leaving a reduced rank approximation of the term-by-document matrix. To achieve an approximation of the term-by-document matrix that is of reduced rank k, the conventional LSI approach only retains the k largest singular values and sets all of the other singular values to zero. The resultant matrix is an approximation of the original term-by-document matrix but with a lower rank of k (i.e., including only the k largest singular values).
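The truncation step described above can be sketched with NumPy as follows. The matrix values are illustrative; the choice of k = 2 is arbitrary for the example.

```python
import numpy as np

# Illustrative term-by-document matrix (rows: terms, columns: documents).
A = np.array([[2.0, 0.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0],
              [2.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])

# Factor A via Singular Value Decomposition; s holds the singular
# values in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                 # target rank for the approximation
s_k = np.copy(s)
s_k[k:] = 0.0         # zero out all but the k largest singular values
A_k = (U * s_k) @ Vt  # reduced rank-k approximation of A
```

By the Eckart–Young theorem, this truncation is the best rank-k approximation of A in the spectral norm, and the approximation error equals the largest discarded singular value.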
Generating a reduced rank approximation of the term-by-document matrix is useful for reducing the computational complexity of indexed information retrieval. The reduced rank matrix is also said to be less “noisy.” Furthermore, a reduced rank matrix can retrieve related entries that, due to synonymy, would have been excluded under the original term-by-document matrix: the reduced rank matrix can associate words that never actually appear together in the same document.
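The synonymy effect described above can be demonstrated on a hypothetical three-term, two-document corpus in which two terms never co-occur but both co-occur with a third term. The term and document names are invented for the example.

```python
import numpy as np

# Hypothetical corpus: "car" and "automobile" never co-occur, but each
# co-occurs with "truck".  Rows: car, automobile, truck; columns: d1, d2.
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
s[1:] = 0.0         # keep only the largest singular value
A_1 = (U * s) @ Vt  # rank-1 approximation

# The (car, d2) element was 0 in A, but is positive in the rank-1
# approximation, associating "car" with a document it never appears in.
```

The shared co-occurrence with "truck" is what the dominant singular direction captures, so the reduced rank matrix links "car" and "automobile" despite their never appearing together.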
Such rank reduction is not lossless. Making some of the singular values equal to zero reduces the rank of a matrix and invariably removes some information. When using the conventional LSI techniques, one example of a loss that can be introduced is a loss of topical coverage. A topic is generally conceptualized as a subject addressed within the documents of the dataset. Mathematically, a topic can be considered a probability distribution over all terms. For example, the term “hexagon” is perhaps more probabilistically likely to be related to a topic of a mathematical nature than it would be to a topic of a historical nature.
Conventional LSI rank reduction does not always maintain coverage of all topics. Blindly selecting the k largest singular values can remove the information that connects a topic to certain keywords. Retaining only the largest singular values can allow the term-document relationships of the more common topics to dominate the reduced rank matrix at the cost of removing the term-document relationships of less frequently represented topics.
Thus, there is a need in the art for a rank reduction technique that retains the general benefits of the conventional LSI approach while attempting to maintain topical coverage during rank reduction of the term-by-document indexing matrix. More particularly, a need exists in the art for selectively identifying the singular values of interest related to a dataset that has been annotated and stored in some matrix format.