Conventional information retrieval systems (also known as text retrieval systems or text search engines) view document collections as standalone text corpora with little or no structured information associated with them. However, there are two primary reasons why such a view is no longer tenable. First, modern enterprise applications for customer relationship management, collaboration, technical support, etc., regularly create, manipulate, and process data that contains a mix of structured and unstructured information. In such applications, there is inherently a fair amount of structured information associated with every document. Second, advances in natural language processing techniques have led to the increased availability of powerful and accurate text analysis engines. These text analysis engines are capable of extracting structured semantic information from text. Such semantic information, usually extracted in the form of semantic annotations, has the potential to significantly improve the quality of free text search and retrieval.
However, the architectures of conventional information retrieval systems are not explicitly designed to take advantage of semantic annotations. In particular, semantic annotations provide the capability for describing content in terms of types and relationships, that is, concepts that are not intrinsic to conventional information retrieval systems. For example, a particular document in a corpus may contain a person name “John” and a telephone number for John, “555-1234”, but not the actual word “telephone”. A person may search that corpus using the keyword phrase “John telephone”. However, a conventional retrieval system does not find the document because the keyword “telephone” is not present. In essence, conventional information retrieval systems recognize only keywords, not the types into which a word or phrase may be categorized or the relationships between such types.
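The “John telephone” example can be made concrete with a minimal sketch. Everything in it is an illustrative assumption rather than part of any described system: the function names, the all-terms-must-match query semantics, and the toy `annotate` function, which stands in for a real text analysis engine by tagging seven-digit phone patterns with a hypothetical type named `telephone`.

```python
import re

# Hypothetical stand-in for a text analysis engine: it recognizes only
# one type, "telephone", using a simple NNN-NNNN pattern.
PHONE_PATTERN = re.compile(r"\b\d{3}-\d{4}\b")


def annotate(text):
    """Return the set of type names detected in the text (toy annotator)."""
    types = set()
    if PHONE_PATTERN.search(text):
        types.add("telephone")
    return types


def words_of(text):
    """Lowercase the text and split it into word-like tokens."""
    return set(re.findall(r"[\w-]+", text.lower()))


def keyword_match(text, query):
    """Assumed conventional behavior: every query term must occur as a word."""
    words = words_of(text)
    return all(term in words for term in query.lower().split())


def type_aware_match(text, query):
    """A query term matches if it occurs as a word or names a detected type."""
    words = words_of(text)
    types = annotate(text)
    return all(term in words or term in types for term in query.lower().split())


document = "John: 555-1234"
print(keyword_match(document, "John telephone"))     # keyword-only matching
print(type_aware_match(document, "John telephone"))  # matching with type annotations
```

Under these assumptions, the keyword-only matcher rejects the document because the word “telephone” never appears, while the type-aware matcher accepts it because the number is annotated with the `telephone` type.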
A conventional information retrieval system is typically designed to return a ranked list of matching documents in response to a keyword search query comprising search words or tokens. In a standard implementation of such a system, an entire corpus of documents is processed in advance to build an inverted index. This inverted index maps each token to a list of occurrences of that token. A token is usually a word or a phrase; however, a token can also be a more complex entity.
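As a rough illustration of the inverted index described above, the following sketch maps each token to a list of its occurrences. The tokenizer, the `(doc_id, position)` occurrence format, and the names `tokenize` and `build_inverted_index` are assumptions made for this example; production systems typically use more elaborate tokenization and compressed posting lists.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Assumed tokenizer: lowercase alphanumeric runs, keeping internal hyphens."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


def build_inverted_index(corpus):
    """Map each token to a list of (doc_id, position) occurrences.

    `corpus` is a dict of doc_id -> document text. The whole corpus is
    processed in advance, before any query is received.
    """
    index = defaultdict(list)
    for doc_id, text in corpus.items():
        for position, token in enumerate(tokenize(text)):
            index[token].append((doc_id, position))
    return dict(index)


corpus = {
    "d1": "John called",
    "d2": "Call John: 555-1234",
}
index = build_inverted_index(corpus)
print(index["john"])         # occurrences of the token "john"
print("telephone" in index)  # the word "telephone" was never indexed
```

Because the index is keyed only by tokens that literally occur in the text, a lookup for “telephone” finds nothing here, even though document `d2` contains a telephone number.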
Upon receiving a keyword query, the system uses the inverted index to compute a list of candidate documents that are potentially relevant to the query. Each of these candidate documents is assigned a rank using a pre-designed ranking formula. The rank-ordered list of candidate documents is then presented to the user. Although this technology has proven to be useful, it would be desirable to present additional improvements. The tightly integrated architecture of conventional information retrieval systems directly maps a query to storage and index structures. Consequently, it becomes difficult to exploit available semantic annotations. In conventional information retrieval systems, the available semantic annotations can only be exploited in an ad-hoc fashion by hand-crafting specialized ranking formulae. Such ad-hoc ranking formulae are difficult to construct and are very often not portable across document collections. As a result, every time an information retrieval system is deployed over a new document collection, a significant amount of time and effort is required to craft a ranking formula appropriate to that collection.
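The conventional query flow described above (index lookup, candidate selection, ranking) can be sketched as follows. This is a simplified illustration under stated assumptions: whitespace tokenization, a candidate set formed from any document containing at least one query token, and a ranking formula that simply counts matching token occurrences. The names `build_index` and `search` are hypothetical, and real systems use more sophisticated formulae.

```python
from collections import Counter, defaultdict


def build_index(corpus):
    """Map each token to the list of doc_ids in which it occurs (one entry per occurrence)."""
    index = defaultdict(list)
    for doc_id, text in corpus.items():
        for token in text.lower().split():
            index[token].append(doc_id)
    return dict(index)


def search(index, query):
    """Return candidate doc_ids ordered by a fixed, pre-designed ranking formula.

    Assumed formula: score = total number of occurrences of query tokens
    in the document. Ties are broken by doc_id for a stable order.
    """
    scores = Counter()
    for token in query.lower().split():
        for doc_id in index.get(token, []):
            scores[doc_id] += 1
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


corpus = {
    "d1": "reset the printer and reinstall the printer driver",
    "d2": "driver update for the scanner",
    "d3": "office holiday schedule",
}
index = build_index(corpus)
print(search(index, "printer driver"))  # ['d1', 'd2']
```

Note that the ranking formula is hard-wired into `search` and operates directly on the index structure. Taking semantic annotations into account would mean rewriting that formula by hand, which mirrors the portability problem described above.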
What is therefore needed is a system, a computer program product, and an associated method for exploiting semantic annotations in executing keyword queries over a collection of text documents, allowing a user to search on a corpus and locate information based on types and relationships found in the corpus by, for example, a text analysis engine. The need for such a solution has heretofore remained unsatisfied.