The World Wide Web (“Web”) is an open-ended digital information repository into which new information is continually posted. The information on the Web can, and often does, originate from diverse sources, including authors, editors, collaborators, and outside contributors commenting, for instance, through a Web log, or “Blog.” Such diversity suggests a potentially expansive topical index, which, like the underlying information, continuously grows and changes.
Social indexing systems provide information and search services that organize evergreen information according to the topical categories of indexes built by their users. Topically organizing an open-ended information source, like the Web, into an evergreen social index can facilitate information discovery and retrieval, such as described in commonly-assigned U.S. patent application, entitled “System and Method for Performing Discovery of Digital Information in a Subject Area,” Ser. No. 12/190,552, filed Aug. 12, 2008, pending, the disclosure of which is incorporated by reference.
Social indexes organize evergreen information by topic. A user defines topics for the social index and organizes the topics into a hierarchy. The user then interacts with the system to build robust models to classify the articles under the topics in the social index using, for instance, example-based training, such as described in Id. Through the training, the system builds fine-grained topic models by generating finite-state patterns that appropriately match positive-example articles and do not match negative-example articles.
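As a rough illustration of example-based training, the following minimal Python sketch searches for patterns that match every positive-example article and no negative-example article. It is not the method of the referenced application: the finite-state patterns are replaced here by simple word conjunctions, and the names `tokenize`, `candidate_patterns`, `matches`, and `select_patterns` are hypothetical.

```python
from itertools import combinations


def tokenize(text):
    """Split an article into a set of lowercase words (assumed tokenization)."""
    return set(text.lower().split())


def candidate_patterns(positives, max_terms=2):
    """Yield word conjunctions of up to max_terms words drawn from the positives.

    Assumption: a conjunction of words stands in for a finite-state pattern.
    """
    vocab = sorted(set().union(*(tokenize(p) for p in positives)))
    for size in range(1, max_terms + 1):
        for combo in combinations(vocab, size):
            yield frozenset(combo)


def matches(pattern, text):
    """A pattern matches when the article contains every word in the pattern."""
    return pattern <= tokenize(text)


def select_patterns(positives, negatives, max_terms=2):
    """Keep patterns that match all positive examples and no negative example."""
    return [
        pattern
        for pattern in candidate_patterns(positives, max_terms)
        if all(matches(pattern, article) for article in positives)
        and not any(matches(pattern, article) for article in negatives)
    ]
```

In this sketch, a single word that also occurs in a negative example is rejected, whereas a conjunction of two words can survive when no negative example contains both.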
In addition, the system can build coarse-grained topic models based on population sizes of characteristic words, such as described in commonly-assigned U.S. Pat. No. 8,010,545, issued Aug. 30, 2011, the disclosure of which is incorporated by reference. The coarse-grained topic models are used to recognize whether an article is roughly on topic. Articles that match the fine-grained topic models, yet have statistical word usage far from the norm of the positive training example articles, are recognized as “noise” articles. The coarse-grained topic models can also suggest “near misses,” that is, articles that are similar in word usage to the training examples, but which fail to match any of the preferred fine-grained topic models, such as described in commonly-assigned U.S. Provisional Patent Application, entitled “System and Method for Providing Robust Topic Identification in Social Indexes,” Ser. No. 61/115,024, filed Nov. 14, 2008, pending, the disclosure of which is incorporated by reference.
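The interplay of the two kinds of topic models can be sketched in Python as follows. This is a simplified stand-in rather than the method of the referenced patent: characteristic words are assumed to be words appearing in at least a given share of the positive examples, the score is assumed to be the fraction of those words found in an article, and the names `characteristic_words`, `coarse_score`, `triage`, and the threshold values are hypothetical.

```python
from collections import Counter


def characteristic_words(positives, min_share=0.5):
    """Words that occur in at least min_share of the positive examples (assumed rule)."""
    doc_counts = Counter()
    for text in positives:
        doc_counts.update(set(text.lower().split()))
    cutoff = min_share * len(positives)
    return {word for word, count in doc_counts.items() if count >= cutoff}


def coarse_score(article, words):
    """Fraction of the characteristic words that appear in the article."""
    if not words:
        return 0.0
    return len(words & set(article.lower().split())) / len(words)


def triage(article, fine_match, words, threshold=0.75):
    """Combine a fine-grained match result with the coarse-grained score."""
    roughly_on_topic = coarse_score(article, words) >= threshold
    if fine_match and roughly_on_topic:
        return "match"
    if fine_match:
        return "noise"  # matches a pattern, but word usage is far from the norm
    if roughly_on_topic:
        return "near miss"  # similar word usage, but no pattern matched
    return "off topic"
```

The `fine_match` argument is supplied by the caller, so the sketch can be paired with any fine-grained matcher.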
To a large extent, the success of social indexing depends upon the ease of creating new indexes, yet index creation can be the most difficult step for new users, particularly when indexes are built through example-based training of index topics. The example-based approach yields well-tuned topic models for the indexes and creates patterns without requiring a user to master the skills of writing potentially-complex queries. Example-based training also provides valuable feedback for tuning topic models. Nonetheless, example-based training requires significant work and understanding. As a preliminary step, a new user must create and name each topic, and place that topic into a topic tree. Much more work is required for training. The user must identify one or more positive-example articles for each topic and train the index using the positive-example articles. Following training, the system reports the matching articles for each topic and their scores, plus candidate “near misses” for each topic. If one or more of the near misses belong under a topic, the user can add those articles to the set of positive training examples. As well, if the system reports one or more off-topic articles as matching, the user can add those articles as negative training examples.
Through this routine, a user engages in an open-ended iterative process of tuning topics. Sometimes, several cycles of adding positive and negative training examples are required until satisfactory results are obtained from the topic models. For new users wanting to see quick results from their efforts, the labor of example-based training can be a disincentive.
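The iterative tuning cycle described above can be sketched in Python as a loop that retrains until a review pass adds no new examples. The sketch makes several assumptions: the trainer is a toy that requires words shared by all positive examples and excludes words seen only in negative examples, a near miss is an article containing at least half of the required words, and the callback `is_on_topic` stands in for the user's judgment. All function names are hypothetical.

```python
def tokens(text):
    """Split an article into a set of lowercase words (assumed tokenization)."""
    return set(text.lower().split())


def train(positives, negatives):
    """Toy trainer; positives must be non-empty."""
    required = set.intersection(*(tokens(p) for p in positives))
    positive_vocab = set().union(*(tokens(p) for p in positives))
    excluded = set()
    for article in negatives:
        excluded |= tokens(article) - positive_vocab
    return required, excluded


def matches(model, article):
    """Match when all required words are present and no excluded word is."""
    required, excluded = model
    words = tokens(article)
    return required <= words and not (excluded & words)


def is_near_miss(model, article):
    """Non-matching article that contains at least half of the required words."""
    required, _ = model
    overlap = len(required & tokens(article))
    return not matches(model, article) and overlap * 2 >= len(required)


def tune(corpus, seed_positives, is_on_topic, max_rounds=5):
    """Retrain until a review pass adds no new training examples."""
    positives, negatives = list(seed_positives), []
    model = train(positives, negatives)
    for _ in range(max_rounds):
        model = train(positives, negatives)
        changed = False
        for article in corpus:
            if article in positives or article in negatives:
                continue
            hit = matches(model, article)
            if hit and not is_on_topic(article):
                negatives.append(article)  # off-topic match becomes a negative
                changed = True
            elif not hit and is_on_topic(article) and is_near_miss(model, article):
                positives.append(article)  # near miss promoted to a positive
                changed = True
        if not changed:
            break
    return model, positives, negatives
```

Even in this small sketch, each added example forces another training round, which mirrors the repeated effort the user must invest before the topic models settle.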