The present invention relates to automated document searching, and in particular to the introduction of conceptual/terminological structure to a document set based on textual content.
The exponential growth of the Internet has provided consumers with the ability to access vast quantities of information, so much so that guiding consumers to the information they desire is now an industry. Commercial "search engines" such as ALTAVISTA, accessible over the Internet, maintain massive databases of Internet-accessible documents and accept user queries to search these documents.
The search engine may maintain the documents in an unstructured form, in which case the user searches by "keyword." Essentially, the search engine accepts one or more words that the user considers relevant to the topic of interest, and electronically identifies documents containing the entered words. Search sophistication can be increased by means of Boolean capability, which allows the user to concatenate search terms into strings in accordance with operators such as AND and OR. In practice, it is found that simple keyword queries, while easily composed, tend to underspecify the set of desired documents (retrieving large numbers of irrelevant documents). Such problems arise from the user's lack of knowledge of the subject matter giving rise to the information need, unfamiliarity with the underlying document collection and its content with respect to that need, and the difficulty of translating even a well-defined need into an effective linguistic formulation.
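The keyword and Boolean matching described above can be illustrated with a minimal sketch. The function names, the tokenizer, and the sample documents are hypothetical and chosen only for illustration; a production search engine would use an inverted index rather than scanning every document.

```python
import re

def tokenize(text):
    """Lowercase a text and return the set of alphabetic words it contains."""
    return set(re.findall(r"[a-z]+", text.lower()))

def keyword_search(documents, terms, operator="AND"):
    """Return the indices of documents matching the terms under AND or OR."""
    wanted = {term.lower() for term in terms}
    hits = []
    for index, text in enumerate(documents):
        words = tokenize(text)
        if operator == "AND":
            matched = wanted <= words          # every term must be present
        elif operator == "OR":
            matched = bool(wanted & words)     # at least one term is present
        else:
            raise ValueError("operator must be 'AND' or 'OR'")
        if matched:
            hits.append(index)
    return hits

# Hypothetical three-document collection.
documents = [
    "Solar panel installation guide",
    "Control panel settings for the operating system",
    "Solar energy tax credits",
]
single_hits = keyword_search(documents, ["panel"])
and_hits = keyword_search(documents, ["solar", "panel"], "AND")
or_hits = keyword_search(documents, ["solar", "panel"], "OR")
```

In this toy collection the single keyword "panel" also retrieves the unrelated operating-system document, which is the underspecification problem noted above; adding a second term with AND narrows the result.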
Current search interfaces typically offer a query-refinement loop that allows the user to enter the initial search expression, evaluate the results returned, and then modify the query by addition of keywords. Evaluating search results can be a time- and energy-consuming task, however. In surveying a potentially long list of titles and document summaries, the user must not only evaluate the likely relevance of the retrieved documents, but also assess the likelihood that the database will eventually be able to satisfy the information need (or part of it); assess the degree to which the current query formulation has expressed the need; learn about the information space and the vocabulary used to describe the domain within this particular database; and ultimately decide on an appropriate query reformulation strategy to the extent necessary.
To help the user focus his or her search without this kind of extensive analysis, the documents may be organized according to content, allowing the user to browse through a category of documents or at least to confine a keyword search within such a category. "Clustering" techniques are frequently employed to categorize related documents within a document corpus. But generating the categories and placing the documents within them is an arduous task. Clustering can, for example, be accomplished manually, with each document being individually examined by a clerk who assigns it to the proper category. Naturally, this approach is prohibitive for commercial Internet search engines that store millions of documents.
Clustering can also be performed automatically. "Bottom-up" and "top-down" clustering techniques utilize algorithms that generate a hierarchical category structure and assign each document to one or more categories. These techniques are computationally demanding, however, and do not necessarily generate document categories that ultimately prove meaningful to users.
Another approach to providing users interactive feedback to assist searching is to display terminology "relevant" to the search. The difficulty here is two-fold: first, determining which of the thousands of potentially related terms are likely to be most useful in this instance for query reformulation and, second, arranging those terms in some way that helps to elucidate the search space. A manually constructed thesaurus or a database of term-to-term correlations derived from statistical corpus analysis can be used to identify terms that are semantically or statistically related to terms in a user's query expression. Alternatively, a result list can be analyzed at run-time for frequently occurring terms or for phrases containing query terms. In most cases, the terms are simply presented as an unstructured list (perhaps ordered alphabetically or by frequency).
The present invention facilitates searching by extracting, from a collection of documents within a corpus, terms representing key informational concepts (herein referred to as "facets" of the document collection). When the user performs a keyword or other conventional search, the facets pertaining to the documents retrieved by the search are returned to the user along with the documents (which are generally presented in summary form in a results list). The facets may be used directly to refine the search, but also serve to educate the user about the information content of the document corpus and the result list as these relate to the information need.
The invention constructs "faceted" representations of documents by identifying a set of lexical dimensions that roughly characterize concepts likely to have informational relevance. It is found that lexical items signifying key concepts within a domain often tend to co-occur with other useful concepts within certain syntactic contexts, such as noun phrases. Consequently, facets are chosen heuristically based on "lexical dispersion," a measure of the number of different words with which a particular word co-occurs within such syntactic contexts. The greater the level of dispersion (i.e., the more different words with which the given word appears in the documents within the allowed syntactic context), the greater is the likelihood that the given word (along with the lexical constructs in which it occurs) will represent a useful conceptual category relevant to the query topic. The facets and their corresponding lexical constructs effectively provide a concise, structured summary of the contents of a result set as well as a set of candidate terms for iterative query reformulation.
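As a rough illustration of lexical dispersion, the sketch below counts, for each word, the number of distinct other words with which it shares a phrase, and ranks words by that count. It assumes the noun phrases have already been extracted by some parser; the function names, the choice to count every other word in the phrase as a co-occurrence, and the sample phrases are assumptions made for this example rather than details taken from the text.

```python
from collections import defaultdict

def lexical_dispersion(phrases):
    """Map each word to the number of distinct words it co-occurs with in a phrase."""
    neighbors = defaultdict(set)
    for phrase in phrases:
        words = phrase.lower().split()
        for word in words:
            # Record every other word of the same phrase as a co-occurring word.
            neighbors[word].update(w for w in words if w != word)
    return {word: len(others) for word, others in neighbors.items()}

def rank_facets(phrases, top_n=3):
    """Return the top_n (word, dispersion) pairs, highest dispersion first."""
    dispersion = lexical_dispersion(phrases)
    # Ties are broken alphabetically so the ordering is deterministic.
    ranked = sorted(dispersion.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]

# Hypothetical noun phrases assumed to have been extracted from a corpus.
noun_phrases = [
    "solar panel", "control panel", "panel installation",
    "solar energy", "wind energy", "tax credit",
]
facets = rank_facets(noun_phrases)
```

Here "panel" co-occurs with three different words ("solar", "control", "installation") and so ranks above "energy" and "solar", each of which co-occurs with two.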
Accordingly, in a first aspect, the invention comprises a method of selecting and organizing documents from a document corpus in response to a user-provided search expression. Preferably, the document corpus is first analyzed to identify potential facets; this is accomplished by searching the textual content of the documents for lexical constructs conforming to a selected syntactic pattern, such as a noun phrase. The lexical constructs, in turn, are examined at query time to derive dispersion rates for words within the constructs. The dispersion rates are assumed to indicate the conceptual relevance of the words to which they relate, and these words are ranked in accordance with their dispersion rates.
The user's conventional search is processed in the usual fashion, returning to the user a list of documents conforming to the search criteria. The user also receives a list of the facets contained in the retrieved documents. The facets, and the lexical constructs within which they appear, may be used for query reformulation in various ways. The user may, for example, recognize a particular construct as especially relevant to the information need and choose to see a list of documents containing this lexical construct. Alternatively, the user may choose to augment the original search expression with a selected word or lexical construct.
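The two reformulation options described above can be sketched as follows. The sketch assumes that each retrieved document is already associated with the lexical constructs found in it and that the facet words have already been selected; the data layout, function names, and sample values are hypothetical.

```python
def facets_for_results(result_phrases, facet_words):
    """Group the constructs found in retrieved documents under each facet word."""
    grouped = {}
    for facet in facet_words:
        matching = sorted({
            phrase
            for phrases in result_phrases.values()
            for phrase in phrases
            if facet in phrase.split()
        })
        if matching:
            grouped[facet] = matching
    return grouped

def documents_containing(result_phrases, construct):
    """List the retrieved documents that contain the selected construct."""
    return sorted(
        doc_id for doc_id, phrases in result_phrases.items() if construct in phrases
    )

def augment_query(query_terms, selection):
    """Append the words of a selected word or construct to the original query."""
    return query_terms + [w for w in selection.split() if w not in query_terms]

# Hypothetical result set: document id -> constructs found in that document.
result_phrases = {
    "doc1": ["solar panel", "panel installation"],
    "doc2": ["solar energy", "tax credit"],
    "doc3": ["solar panel", "wind energy"],
}
feedback = facets_for_results(result_phrases, ["panel", "energy"])
narrowed = documents_containing(result_phrases, "solar panel")
new_query = augment_query(["solar"], "panel installation")
```

The `feedback` mapping corresponds to the facet list shown alongside the results, `narrowed` to the documents listed when the user selects a construct, and `new_query` to the augmented search expression.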