The amount of data available for search continues to grow rapidly. At the same time, users have come to expect their search engines to provide rapid response and accurate results regardless of the complexity of the queries that they pose.
A variety of query processing strategies are known in the art. For large corpora of data, an object-oriented document-at-a-time (DAAT) approach is widely used. This sort of approach is described, for example, by Burrows in U.S. Pat. No. 5,809,502. The index (often referred to in the art as an “inverted index”) to a collection of documents is organized as a plurality of index entries, wherein each index entry comprises a word and an ordered list of locations where the word occurs in the collection. The index entries are ordered first according to the documents in the collection, and second according to the locations of each associated word within the document.
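By way of illustration only, the index organization described above may be sketched as follows. The function name `build_inverted_index`, the whitespace tokenization, and the sample documents are assumptions made for this sketch and are not taken from the cited patent.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to its locations as (doc_id, position) pairs.

    Documents are visited in ascending doc_id order and words in
    ascending position order, so every location list ends up sorted
    first by document and second by position within the document.
    """
    index = defaultdict(list)
    for doc_id in sorted(docs):
        # Assumption: lowercase whitespace tokenization is sufficient here.
        for position, word in enumerate(docs[doc_id].lower().split()):
            index[word].append((doc_id, position))
    return dict(index)

# Hypothetical two-document collection.
docs = {1: "the quick brown fox", 2: "the lazy dog and the fox"}
index = build_inverted_index(docs)
```

Because each location list is already sorted, a reader can scan it sequentially or skip ahead with a binary search.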
A query is parsed into terms and operators. Each term is associated with a corresponding index entry, while the operators relate the terms. A basic stream reader object is generated for each term of the query. The basic stream reader object sequentially reads the locations of the corresponding index entry to determine a target location. A compound stream reader object is generated for each operator. The compound stream reader object references the basic stream reader objects associated with the terms related by the operator. The compound stream reader object returns locations of words within a single document according to the operator.
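A minimal sketch of the reader objects described above is given below, assuming that locations are (document, position) pairs and that the only operator is a conjunction. The class names `BasicStreamReader` and `AndStreamReader` and the method name `seek` are hypothetical and are not drawn from the cited patent.

```python
import bisect

class BasicStreamReader:
    """Reads the sorted (doc_id, position) locations of one index entry."""

    def __init__(self, locations):
        self.locations = locations
        self.cursor = 0

    def seek(self, target):
        """Advance to the first location >= target; return it, or None if exhausted."""
        self.cursor = bisect.bisect_left(self.locations, target, self.cursor)
        if self.cursor < len(self.locations):
            return self.locations[self.cursor]
        return None

class AndStreamReader:
    """Compound reader that references one child reader per related term."""

    def __init__(self, children):
        self.children = children

    def seek(self, target):
        """Return the earliest location >= target in a document holding every term."""
        while True:
            hits = [child.seek(target) for child in self.children]
            if any(hit is None for hit in hits):
                return None
            top = max(hit[0] for hit in hits)
            if all(hit[0] == top for hit in hits):
                return min(hits)
            # Some child lags behind: retry from the start of the furthest document.
            target = (top, 0)

# Hypothetical location lists for two query terms.
fox = BasicStreamReader([(1, 3), (2, 5)])
dog = BasicStreamReader([(2, 2), (4, 1)])
both = AndStreamReader([fox, dog])
first_match = both.seek((0, 0))
```

In this sketch the compound reader never scans a document on its own; it only repositions its child readers, which is what allows operators to be nested.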
Sheinwald et al. describe a DAAT method for searching a corpus of documents in U.S. Patent Application Publication 2007/0033165, whose disclosure is incorporated herein by reference. A query processor receives a complex query, which includes a plurality of words conjoined by operators including a root operator and at least one intermediate operator. Respective advancement potentials are assigned to the words in the complex query. The query processor applies a consultation method to the words and operators in the complex query in order to choose one of the words responsively to the advancement potentials. The query processor then advances through the index in order to find a document containing the chosen word, and evaluates the document to determine whether the document satisfies the complex query.
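The general idea of choosing a word by its advancement potential may be sketched as follows. This is not the method of the cited publication: the sketch assumes a flat conjunctive query, assumes that the potential of a word is simply the reciprocal of the number of documents containing it, and chooses the word once rather than consulting the operators repeatedly. The names `search_and` and `contains` are hypothetical.

```python
import bisect

def contains(doc_ids, doc):
    """Binary search for doc in a sorted list of document identifiers."""
    i = bisect.bisect_left(doc_ids, doc)
    return i < len(doc_ids) and doc_ids[i] == doc

def search_and(postings, start_doc=0):
    """Return the documents, from start_doc onward, that contain every word."""
    if not postings or any(not docs for docs in postings.values()):
        return []
    # Assumption: a rarer word skips more of the index per step,
    # so it receives the higher advancement potential.
    potential = {word: 1.0 / len(docs) for word, docs in postings.items()}
    chosen = max(potential, key=potential.get)
    matches = []
    first = bisect.bisect_left(postings[chosen], start_doc)
    for doc in postings[chosen][first:]:
        # Evaluate the candidate document against the whole query.
        if all(contains(postings[word], doc) for word in postings):
            matches.append(doc)
    return matches

# Hypothetical document lists for three query words.
postings = {
    "search": [1, 2, 3, 5, 8, 9],
    "index": [2, 5, 9],
    "annotation": [5, 7],
}
result = search_and(postings)
```

Here the word "annotation" is chosen, so only documents 5 and 7 are evaluated rather than all six documents containing "search".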
Methods are known in the art for automatically annotating and indexing documents. For example, Aswani et al. describe such a method in “Indexing and Querying Linguistic Metadata and Document Content,” Proceedings of Fifth International Conference on Recent Advances in Natural Language Processing (RANLP-2005), 2005. This paper presents the ANNIC system, which can index documents not only by content, but also by their linguistic annotations and features. It is said to enable users to formulate queries mixing keywords and linguistic information. The result consists of the matching texts in the corpus, displayed within the context of linguistic annotations.
A variety of tools are available for automatic semantic and linguistic tagging of documents. For example, the Unstructured Information Management Architecture (UIMA) developed by IBM Corporation (Armonk, N.Y.) is an open platform for creating, integrating and deploying unstructured information management solutions from combinations of semantic analysis and search components. It allows easy authoring of annotators, for example, annotators that express the format of telephone numbers, dates, or meeting rooms. Given a set of text documents, the UIMA tool then applies the authored annotators, thereby automatically labeling segments of text with the corresponding annotations. IBM product platforms that expose the UIMA interfaces include the OmniFind Enterprise Edition and Analytics Edition. The former features UIMA for building full-text and semantic search indexes, and the latter deploys UIMA for information extraction and text analysis. Further information regarding UIMA is available on the IBM Research Web site (www.research.ibm.com/UIMA/).
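The notion of an annotator that expresses a text format may be illustrated by the following sketch. It does not use the UIMA interfaces; the `Annotation` type, the `ANNOTATORS` table, and both regular expressions are assumptions chosen for illustration only.

```python
import re
from typing import NamedTuple

class Annotation(NamedTuple):
    label: str   # annotation type, e.g. "PhoneNumber"
    begin: int   # offset of the first character of the segment
    end: int     # offset one past the last character of the segment
    text: str    # the annotated segment itself

# Hypothetical annotators, each expressing one textual format.
ANNOTATORS = {
    "PhoneNumber": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "Date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
}

def annotate(text):
    """Apply every annotator to the text and return annotations in text order."""
    found = []
    for label, pattern in ANNOTATORS.items():
        for match in pattern.finditer(text):
            found.append(Annotation(label, match.start(), match.end(), match.group()))
    return sorted(found, key=lambda a: a.begin)

sample = "Call 555-123-4567 before 2024-03-15."
annotations = annotate(sample)
```

Each annotation records its offsets in the source text, so the annotated segments can later be indexed alongside the words they cover.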