The World Wide Web (“Web”) is an open-ended digital information repository into which new information is continually posted and read. The information on the Web can, and often does, originate from diverse sources, including authors, editors, bloggers, collaborators, and outside contributors commenting, for instance, through a Web log, or “blog.” Such diversity suggests a potentially expansive topical index, which, like the underlying information, continuously grows and changes.
Topically organizing an open-ended information source, like the Web, can facilitate information discovery and retrieval, such as described in commonly-assigned U.S. patent application, entitled “System and Method for Performing Discovery of Digital Information in a Subject Area,” Ser. No. 12/190,552, filed Aug. 12, 2008, pending, the disclosure of which is incorporated by reference. Books have long been organized with topical indexes. However, constraints on codex form limit the size and page counts of books, and hence index sizes. In contrast, Web materials lack physical bounds and can require more extensive topical organization to accommodate the full breadth of subject matter covered.
The lack of topical organization makes effective searching of open-ended information repositories, like the Web, difficult. A user may be unfamiliar with the subject matter being searched, or could be unaware of the extent of the information available. Even when knowledgeable, a user may be unable to properly describe the information desired, or might stumble over problematic variations in terminology or vocabulary. Moreover, search results alone often lack much-needed topical signposts, yet even when topically organized, only part of a full index of all Web topics may be germane to a given subject.
One approach to providing topical indexes uses finite state patterns to form an evergreen index built through social indexing, such as described in commonly-assigned U.S. patent application, entitled “System and Method for Performing Discovery of Digital Information in a Subject Area,” Ser. No. 12/190,552, filed Aug. 12, 2008, pending, the disclosure of which is incorporated by reference. Social indexing applies supervised machine learning to bootstrap training material into fine-grained topic models for each topic in the evergreen index. Once trained, the evergreen index can be used for index extrapolation to automatically categorize incoming content into topics for pre-selected subject areas.
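By way of illustration only, index extrapolation with pattern-based topic models can be sketched as follows. The sketch assumes that each trained topic model reduces to a single regular expression standing in for a finite state pattern; the topic names, the patterns, and the `categorize` function are hypothetical and are not taken from the referenced application.

```python
import re

# Hypothetical topic models: each topic label maps to one regular
# expression that stands in for a trained finite state pattern.
TOPIC_PATTERNS = {
    "housing market": re.compile(r"\b(mortgage|foreclosure)s?\b", re.IGNORECASE),
    "solar energy": re.compile(r"\bsolar\s+(panel|cell|power)s?\b", re.IGNORECASE),
}


def categorize(article_text, patterns=TOPIC_PATTERNS):
    """Return the sorted topic labels whose pattern matches the article text."""
    return sorted(
        topic for topic, pattern in patterns.items()
        if pattern.search(article_text)
    )


if __name__ == "__main__":
    # Incoming content is filed under every topic whose pattern it matches.
    for text in (
        "New solar panels cut installation costs",
        "Foreclosures rise as mortgage rates climb",
        "Local sports roundup",
    ):
        print(text, "->", categorize(text))
```

In this simplified form, an article is assigned to a topic whenever any part of its text matches the topic's pattern, so a matching term that appears anywhere on a page is sufficient for assignment.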
Fine-grained social indexing systems use high-resolution topic models that precisely describe when articles are “on topic.” However, the same techniques that make such models “fine-grained” also render the models sensitive to non-responsive “noise” words that can appear on Web pages as advertising, side-links, commentary, or other content that has been added to the core article, often after the fact, and that detracts from the core article. As well, recognizing articles that are good candidates for broadening a topic definition can be problematic using fine-grained topic models alone. The problem can arise when a fine-grained topic model is trained too narrowly and is unable to find articles that are near to, but not exactly on the same topic as, the fine-grained topic.
Therefore, a need remains for providing topical organization to a corpus that facilitates topic definition with the precision of a fine-grained topic model, yet with resilience to word noise and over-training.