The World Wide Web (“Web”) is an open-ended digital information repository into which new information is continually posted. The information on the Web can, and often does, originate from diverse sources, including authors, editors, collaborators, and outside contributors commenting, for instance, through a Web log, or “Blog.” Such diversity suggests a potentially expansive topical index, which, like the underlying information, continuously grows and changes.
Topically organizing an open-ended information source, like the Web, can facilitate information discovery and retrieval, such as described in commonly-assigned U.S. Patent Application, entitled “System and Method for Performing Discovery of Digital Information in a Subject Area,” Ser. No. 12/190,552, filed Aug. 12, 2008, pending, the disclosure of which is incorporated by reference. Books have long been organized with topical indexes. However, constraints on codex form limit the size and page counts of books, and hence index sizes. In contrast, Web materials lack physical bounds and can require more extensive topical organization to accommodate the full breadth of subject matter covered.
The lack of topical organization makes effective searching of open-ended information repositories, like the Web, difficult. A user may not know the subject matter being searched, or could be unaware of the information available. Even if knowledgeable, the user may be unable to specify the exact information desired, or might stumble over problematic variations in vocabulary. Search results alone often lack needed topical signposts, yet even when topically organized, only a subpart of a full index of all Web topics may be germane to a given subject.
Conventional Web search engines retrieve information, such as articles, in response to a search query that is typically composed of only a few search terms. When a corpus is extensive, such as when articles gathered from the Web span wide-ranging topics, users may encounter ambiguity in identifying the precise information needed. Furthermore, Web search engines often return information in a disorganized jumble that intermixes the information over disparate topics, thereby making assimilation of the results and new query formulation hard.
Conventional Web search engines also operate without an index or knowledge of the topical organization of an underlying subject area. Keywords in context (“KWIC”) are sometimes available to emphasize search results that match query terms, but a sense of topicality is still lacking. Moreover, even when a form of categorizing is applied, Web search engines generally either rely on separating search results by source, collecting common queries as search representative, or applying clustering techniques to channel search results along popular themes. As a result, search results are often jumbled and topically intermixed sets of articles.
Thus, several interacting challenges for topic search exist. One challenge is that the input to search is minimal. When searching, users want to enter as little as possible in their information requests. Empirically, most user queries contain only one or two words. A second challenge is that the response to an information request must be short, and yet provide a guide to the information desired. Providing too much information can be a distraction to the user. A focused index can address this challenge by giving an estimate of the most relevant topics, together with selected related topics, in case the user's information need is misidentified. The dual challenge of providing a high-precision response given a low-precision request is at the heart of a topic search.
One approach to providing focused topical sub-indexes uses finite state patterns, as used in search engine query languages. A finite state pattern can be used to determine which topics within a topical index correspond to a given query. However, most queries are simply too short to provide enough “content signal” to match against those finite state patterns that are suitable for identifying the topics.
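By way of illustration only, the mismatch between short queries and finite state patterns can be sketched as follows. The sketch assumes that regular expressions stand in for finite state patterns; the topic labels, the patterns, and the function name `matching_topics` are hypothetical and are not drawn from any particular search engine query language.

```python
import re

# Hypothetical topic patterns: each regular expression plays the role of a
# finite state pattern attached to one entry of a topical index.
TOPIC_PATTERNS = {
    "Housing / Mortgage Rates": re.compile(
        r"\bmortgage\b.*\b(rate|rates|interest)\b"
    ),
    "Housing / Foreclosures": re.compile(
        r"\b(foreclosure|foreclosures)\b.*\b(bank|lender)\b"
    ),
}


def matching_topics(text):
    """Return the labels of all topics whose pattern matches the text."""
    lowered = text.lower()
    return [
        label
        for label, pattern in TOPIC_PATTERNS.items()
        if pattern.search(lowered)
    ]


# A sentence from an article carries enough signal to select a topic.
article_hits = matching_topics("Mortgage interest rates climbed again")

# A one-word query does not satisfy any pattern.
query_hits = matching_topics("mortgage")
```

In this sketch, the article sentence satisfies the first pattern, whereas the one-word query satisfies none, which reflects the shortage of “content signal” described above.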
Another approach to creating focused topical sub-indexes uses term similarity assessment. Techniques, such as generalized latent semantic analysis and spreading activation, are combined to compute a “term relatedness” score, which measures similarity in the use of terms. Word pair co-occurrence is used as a proxy for term similarity. As a pre-computation, word-pair occurrences are counted within a sliding window over the corpus. The counts for word pairs that co-occur are kept in a large sparse matrix. The matrix can then be used to find words related to search terms. The query terms are matched against the matrix to find other words that co-occur with them. The matching creates a list of related terms. The process is repeated for each of the words added, which can trigger further words to be added to the list. The influences of word-pair occurrences are combined when more than one path leads to an added word. At the same time, index labels can also be used as seeds for another spreading activation process. The process continues where the wave of words related to the query terms intersects the wave of words related to the index terms. After several iterations, the index entries whose label words have been identified as related to the query terms are gathered. Variations on this process can pre-compute the words related to label words. When the related index entries are identified, a sub-index is created containing the index entries having scores sufficiently high to relate their labels to the query terms. A difficulty with these techniques is that they require large co-occurrence matrices at search time, which is generally not practicable in light of the wide range of query terms possible.
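By way of illustration only, the co-occurrence counting and spreading activation steps described above can be sketched as follows. The sketch is a simplification under stated assumptions: activation spreads outward from the query terms only, rather than from both the query terms and the index labels; a nested dictionary stands in for the large sparse matrix; and the function names, the `decay` factor, and the `threshold` value are hypothetical.

```python
from collections import Counter, defaultdict


def cooccurrence_counts(tokens, window=3):
    """Count word pairs that co-occur within a sliding window.

    A window of 3 pairs each word with the two words that follow it.
    The nested dictionary acts as a symmetric sparse matrix.
    """
    counts = defaultdict(Counter)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + window]:
            if other != word:
                counts[word][other] += 1
                counts[other][word] += 1
    return counts


def spread_activation(counts, seeds, iterations=2, decay=0.5):
    """Spread activation from seed words along co-occurrence links.

    Activation arriving at a word over more than one path is summed.
    The decay factor (an assumption) damps each successive wave.
    """
    activation = {word: 1.0 for word in seeds}
    frontier = dict(activation)
    for _ in range(iterations):
        next_frontier = defaultdict(float)
        for word, energy in frontier.items():
            neighbors = counts.get(word, {})
            total = sum(neighbors.values())
            if total == 0:
                continue
            for other, n in neighbors.items():
                next_frontier[other] += decay * energy * n / total
        for word, energy in next_frontier.items():
            activation[word] = activation.get(word, 0.0) + energy
        frontier = next_frontier
    return activation


def focused_sub_index(counts, query_terms, index_labels, threshold=0.1):
    """Keep index entries whose label words are sufficiently activated."""
    activation = spread_activation(counts, query_terms)
    scores = {
        label: sum(activation.get(w, 0.0) for w in label.lower().split())
        for label in index_labels
    }
    return {label: s for label, s in scores.items() if s >= threshold}


tokens = "mortgage rates fall and housing prices rise".split()
counts = cooccurrence_counts(tokens, window=3)
sub_index = focused_sub_index(counts, ["mortgage"], ["Interest Rates", "Sports"])
```

Even in this reduced form, the co-occurrence structure must be available when the query arrives, which illustrates why the matrix size becomes a concern at search time.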
Therefore, a need remains for providing a dynamically focused and topically-related sub-index in concert with a digital information corpus search.