1. Field of the Invention
The present invention relates generally to Information Storage and Retrieval systems, and more particularly to means and methods for Content Analysis and Indexing especially as related to such systems and their algorithms implemented in hardware.
2. Discussion of the Related Art
There is a large demand for text retrieval as a critical component of information retrieval technology. Electronic text collections, and the ability to search such collections over the World Wide Web, for example, have led to ever-increasing demands for fast and accurate document indexing techniques. Several data structures have been used for Content Analysis and Indexing within the field of Information Storage and Retrieval systems. Two such structures are the inverted index file structure and the signature file structure. The commonly used inverted index file structure is fast, but may suffer from excessive storage and index maintenance overheads. Signature files require little storage overhead but require extra processing time and may result in false positive indications of the presence of a term within a document. In general, such text retrieval structures and techniques are software controlled and require relatively high processor overhead to run the information retrieval software routines.
Referring to FIG. 1, as noted above, one popular form of data indexing used to support the efficient searching of documents is the inverted index structure 21. An inverted index comprises a term list 23, e.g., the terms being words, phrases, stems, etc. Each term, e.g. term 25, has an associated posting list 27. A “posting list” 27 is a series of posting entries, collectively 29. A “posting entry” is data identifying at least a document 26 containing the term and an indication of the significance of the term in the given document, herein referred to as “weight”. For example, weight may be, but is not limited to, the number of occurrences 28 of the term within the document. Other indicators of significance, i.e., weights, can rely on a composition function of the number of occurrences and term weighting such as inverted document frequencies or other such measures as known in the art. Without limitation and for simplicity of explanation, the remaining description only uses term occurrence. As used herein, a “posting” is a memory space for one posting entry. Thus, a posting list 27 will occupy a series of postings. In a typical inverted index structure, there may be an unlimited capacity for storing the posting entries corresponding to the documents associated with a term. As seen in the example of FIG. 1, the posting entries are not necessarily ordered in the posting list 27 by weight or by the document identifier. However, a sorted ordering according to any designated value or set of values within the posting entries is possible.
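By way of illustration only, the inverted index structure described above can be sketched in software as follows. The function name `build_inverted_index`, the whitespace tokenization, and the sample documents are assumptions made for this sketch and are not taken from FIG. 1. Weight is taken to be the raw number of occurrences, consistent with the simplification stated above.

```python
from collections import Counter, defaultdict


def build_inverted_index(documents):
    """Map each term to a posting list of (document id, weight) entries.

    Hypothetical helper: terms are lowercase whitespace-separated words,
    and weight is the number of occurrences of the term in the document.
    """
    index = defaultdict(list)
    for doc_id, text in documents.items():
        occurrences = Counter(text.lower().split())
        for term, count in occurrences.items():
            # One posting entry per (term, document) pair.
            index[term].append((doc_id, count))
    return dict(index)


# Illustrative documents only.
docs = {
    1: "the cat sat on the mat",
    2: "the dog chased the cat",
    3: "a dog barked",
}
index = build_inverted_index(docs)
print(index["the"])  # posting list for the term "the"
```

In this sketch the posting entries appear in the order in which the documents were processed; as noted above, no particular ordering by weight or document identifier is required.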
Referring to FIG. 2, a “pruned” inverted index data structure 31, e.g., a known technique such as set forth in the paper A. Soffer, et al., “Static Index Pruning for Information Retrieval Systems,” Proceedings of the 24th Annual ACM SIGIR Conference on Research and Development in Information Retrieval, ACM Press, New York, N.Y., September 2001, pp. 43-50, limits the posting list 33 to a certain number of documents, e.g., space for a maximum of only 500 postings per term, as illustrated for the first term 35. Further, the posting list is sorted by weight 37, i.e., the frequency of term occurrence, with the first posting 39 being occupied by the document reference with the greatest number of occurrences of the listed term. Pruned inverted indexes are known in the art as a highly efficient data structure for information retrieval. As is known in the art, only the top few retrieval listings in a document search are likely to be considered by the searcher to be highly relevant. Thus, a pruned inverted index structure, as shown in the previously cited Soffer, et al. article, often reduces the number of posting entries stored in the index while still providing comparable accuracy in query processing. For example, by storing posting entries only for those documents in which a given term appears frequently, the posting list size of the index is potentially dramatically reduced, thus improving runtime performance and reducing processor overhead.
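As a minimal sketch of the fixed-capacity, weight-sorted posting list described above, the following may be considered. It illustrates only the sorting and truncation of a single posting list; it is not the pruning algorithm of the cited Soffer, et al. paper. The function name `prune_posting_list`, the parameter `max_postings`, and the sample posting entries are assumptions made for illustration.

```python
def prune_posting_list(postings, max_postings=500):
    """Keep at most max_postings entries, highest weight first.

    Each posting entry is a (document id, weight) pair, where weight
    is assumed to be the term occurrence count.
    """
    ranked = sorted(postings, key=lambda entry: entry[1], reverse=True)
    return ranked[:max_postings]


# Illustrative posting entries only: (document id, occurrences).
postings = [(7, 3), (2, 12), (9, 1), (4, 8)]
pruned = prune_posting_list(postings, max_postings=2)
print(pruned)  # the two entries with the greatest occurrence counts
```

With a capacity of two, only the entries for documents 2 and 4 are retained, and the first posting holds the document with the greatest number of occurrences.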
In the past, certain hardware assisted Information Retrieval systems were suggested. These hardware assisted Information Retrieval systems relied on pattern matching operations utilizing VLSI oriented design architectures and often delivered a marginal cost/benefit ratio over the ever more efficient general processors running software algorithms to maintain the inverted index.
Pattern matching involves a logical character-by-character comparison of the entire (full text character) source string with the characters of the term comprising the search pattern. If a sub-string within the source string matches the desired term, a match is detected, and the term is considered present within the source string. The source string is often, but is not limited to, the entire document collection. In such a pattern matching approach, the pre-processing step of creating an index is generally avoided, reducing the storage overhead and preprocessing time. This reduction often comes at the expense of lengthier query processing times associated with the need to scan the entire document collection instead of merely accessing those documents that were predetermined to contain the term, as designated in the index.
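The character-by-character comparison described above can be sketched in software as a naive sub-string scan. This is an illustrative software analogue only and does not represent any particular VLSI design; the names `naive_find` and `scan_collection` and the sample documents are assumptions made for this sketch.

```python
def naive_find(source, pattern):
    """Return the first offset at which pattern occurs in source, or -1."""
    n, m = len(source), len(pattern)
    for start in range(n - m + 1):
        offset = 0
        # Compare the pattern against the source one character at a time.
        while offset < m and source[start + offset] == pattern[offset]:
            offset += 1
        if offset == m:
            return start
    return -1


def scan_collection(documents, term):
    """Scan every document in full; no index is built or consulted."""
    return [doc_id for doc_id, text in documents.items()
            if naive_find(text, term) != -1]


# Illustrative documents only.
docs = {
    1: "the cat sat on the mat",
    2: "the dog chased the cat",
    3: "a dog barked",
}
matches = scan_collection(docs, "dog")
print(matches)  # identifiers of documents containing the sub-string "dog"
```

Every query in this sketch touches every character of every document, which reflects the trade-off noted above: no index storage or preprocessing, at the cost of query time that grows with the size of the collection.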
Therefore, there is a need for a system of hardware assisted Information Retrieval using inverted index structures that offers a favorable cost/benefit ratio, can be plugged in, or added to, present information retrieval systems, and provides low storage and index maintenance overheads as compared to present systems.