The following relates to information indexing, storage, retrieval, processing, analysis, and related arts. It particularly relates to indexing, content retrieval, content analysis, and related processing of text-based documents, and is described with particular reference thereto. However, the following relates more generally to the indexing, retrieval, content analysis, and related processing of information of any kind, such as images, documents, semantic analyses, translation databases, lexicons, information archives, and so forth.
Electronic information storage is ubiquitous, and massive quantities of information are stored on the Internet, in corporate databases, in governmental electronic archives, and so forth. A key technology for facilitating use of such stored information is effective indexing and retrieval of selected contents. Indexing can use a pre-defined system based on selected keywords or the like. However, pre-defined indexing is limited in scope and usefulness. For example, a pre-defined index is not useful if the keywords used by a person searching the content differ from those selected by the designers of the index system. Moreover, indexing by keywords is only one approach; more generally, it is desirable to provide an indexing system that can locate or analyze occurrences of events, where an event is a general concept that may include, for example: an ordered sequence of words, possibly with some gaps or discontinuities; occurrence of a semantic structure in semantically annotated documents; the existence of a particular feature vector for characterizing images; or so forth.
Automated indexing is known, in which the information is analyzed to extract likely indexing keywords or the like. In such approaches, a tradeoff is made between the size of the index, on the one hand, and the level of indexing specificity on the other hand. For example, in automated keyword indexing it is common to index only those words (or perhaps phrases) that occur more than a threshold number of times in the documents. Increasing the threshold makes the index more compact, but at the cost of less specificity and reduced query efficiency, since infrequent keywords are lost. Unfortunately, in some cases it is precisely the infrequent words or phrases omitted from a compact index that are of most interest.
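By way of illustration only, the threshold tradeoff described above can be sketched as follows. This is a minimal sketch rather than any particular known indexing system: the function name `build_threshold_index`, the whitespace tokenization, and the three toy documents are all assumptions made for the example.

```python
from collections import Counter, defaultdict

def build_threshold_index(docs, threshold):
    """Index only words occurring more than `threshold` times across all documents.

    `docs` maps a document id to its text. Tokenization is a plain lowercase
    whitespace split, which is an assumption made to keep the sketch short.
    """
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    counts = Counter(word for words in tokenized.values() for word in words)

    index = defaultdict(set)
    for doc_id, words in tokenized.items():
        for word in words:
            if counts[word] > threshold:
                index[word].add(doc_id)
    return dict(index)

# Hypothetical toy corpus.
docs = {
    "d1": "the cat sat on the mat",
    "d2": "the dog sat on the log",
    "d3": "the quokka sat quietly",
}

compact_index = build_threshold_index(docs, threshold=1)  # frequent words only
full_index = build_threshold_index(docs, threshold=0)     # every word

print(sorted(compact_index))       # ['on', 'sat', 'the']
print("quokka" in compact_index)   # False: the rare word is lost
print(full_index["quokka"])        # {'d3'}
```

In this toy corpus the compact index holds three entries instead of nine, but a query on the rare word "quokka", which identifies a single document, can no longer be answered from it.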
For events more complex than keywords or phrases, automated indexing continues to suffer from the undesirable compactness-versus-effectiveness tradeoff, and also suffers substantial difficulty in identifying events of interest in the documents. A straightforward approach for such identification is to scan the document using a suitable search algorithm, and keep a count of each event discovered by the search. However, this approach can be computationally intensive. Moreover, since it is not known a priori which events are frequent enough to justify indexing, this approach typically involves accumulating storage of a count for each event (no matter how rare) during the scan, which can be expensive in terms of temporary data storage allocation.
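The storage cost of the straightforward scan-and-count approach can be illustrated with a minimal sketch. The event type chosen here (an ordered pair of words separated by at most `max_gap` intervening words), the function name `count_gapped_pairs`, and the sample text are assumptions made for illustration; other event types would be counted analogously.

```python
from collections import Counter

def count_gapped_pairs(tokens, max_gap):
    """Count every ordered word pair (a, b) in which b follows a with at most
    `max_gap` words in between. A counter is kept for every pair seen,
    however rare, because frequency is only known once the scan ends."""
    counts = Counter()
    for i, first in enumerate(tokens):
        last = min(i + 1 + max_gap, len(tokens) - 1)
        for j in range(i + 1, last + 1):
            counts[(first, tokens[j])] += 1
    return counts

# Hypothetical sample text.
tokens = "the cat sat on the mat near the cat".split()
counts = count_gapped_pairs(tokens, max_gap=1)

# Only after the full scan can the frequent events be selected.
frequent = {event: n for event, n in counts.items() if n > 1}

print(len(counts))   # 14 distinct events held in temporary storage
print(frequent)      # {('the', 'cat'): 2}
```

Even for this nine-word text, fourteen counters are accumulated in order to retain a single frequent event. The number of distinct events grows with the corpus size and with the permitted gap, which is the temporary storage burden noted above.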
Still further, the type of event that is useful for indexing may vary depending upon the corpus being indexed, the subject matter of the corpus, and so forth. For example, a keyword-based index is useful for some tasks and some corpora, but may be ineffective for other tasks or other corpora.
Accordingly, it would be useful to provide indexing methods and systems that produce compact indices that are nonetheless useful for querying on infrequent or rare events, that can construct such indices without using excessive amounts of computational and storage resources, that are flexible as to the event type upon which the indexing is based, and that overcome other deficiencies of existing indexing methods and systems.