The lack of meaningful topical indexing makes effective searching of open-ended information repositories, especially the World Wide Web (“Web”), difficult. Topical indexing provides helpful context, which can be crucial to successful information discovery, as search results alone often lack much-needed topical signposts or other contextual clues. Moreover, the user may be unfamiliar with the subject matter being searched, or could be unaware of the full extent of the information available in the repository. And even when knowledgeable about the subject matter, a user may still be unable to properly describe the information desired, may stumble over problematic variations in terminology, vocabulary, or language, or may simply be unable to formulate a usable search query.
Topical indexing can help alleviate these difficulties. For instance, open-ended information repositories can be organized through evergreen topical indexes that use finite state patterns built through curator-guided social indexing, such as described in commonly-assigned U.S. patent application, entitled “System and Method for Performing Discovery of Digital Information in a Subject Area,” Ser. No. 12/190,552, filed Aug. 12, 2008, pending, the disclosure of which is incorporated by reference. This form of social indexing applies supervised machine learning to bootstrap curator-selected training material into fine-grained topic models as expressed through discrete Boolean queries for each topic in the topical index. Once trained, the topical index can be used for index extrapolation to categorize incoming content into topics under pre-selected subject areas.
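By way of illustration only, the categorization of incoming content against a topical index of discrete Boolean queries can be sketched as follows. The sketch assumes a nested-tuple query representation, and the names `tokenize`, `matches`, `categorize`, and `TOPIC_INDEX`, as well as the example topics and terms, are hypothetical. The queries are written by hand here, whereas the referenced application derives topic models from curator-selected training material through supervised machine learning.

```python
import re

def tokenize(text):
    """Lowercase the text and return the set of distinct word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def matches(query, words):
    """Evaluate a nested Boolean query against a set of words.

    A query is either a bare term (str) or a tuple whose first element is
    "and", "or", or "not", followed by one or more sub-queries.
    """
    if isinstance(query, str):
        return query in words
    op, *args = query
    if op == "and":
        return all(matches(arg, words) for arg in args)
    if op == "or":
        return any(matches(arg, words) for arg in args)
    if op == "not":
        return not matches(args[0], words)
    raise ValueError(f"unknown operator: {op!r}")

# Hypothetical topical index: one hand-written Boolean query per topic.
TOPIC_INDEX = {
    "internet censorship": ("and", ("or", "censorship", "firewall"), "internet"),
    "cyber attacks": ("and", ("or", "hackers", "malware"), ("not", "movie")),
}

def categorize(text):
    """Return the sorted names of all topics whose query the text satisfies."""
    words = tokenize(text)
    return sorted(topic for topic, query in TOPIC_INDEX.items()
                  if matches(query, words))
```

For instance, `categorize("China tightens internet censorship rules")` places the article under the hypothetical “internet censorship” topic only, because the article contains none of the terms required by the other query.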
Fine-grained social indexing uses high-resolution topic models, such as discrete Boolean queries expressed as finite state patterns, that precisely describe when articles are “on topic.” However, the same techniques that make such topic models “fine-grained” also render the models sensitive to non-responsive “noise” words and other distractions that can appear on Web pages as advertising, side-links, commentary, or other content that has been added to the Web page, often after the fact, and which detracts from the core article contained on the Web page. Further, fine-grained topic models alone can have difficulty recognizing articles that are good candidates for topic broadening. This difficulty can occur when a fine-grained topic model is trained too narrowly and is unable to find articles that are near to, but not exactly on, the same topic as the fine-grained topic.
Coarse-grained topic models use weighted term vectors of characteristic words to characterize the population of words associated with each topic. Combining fine-grained social indexing with characteristic word topic models can introduce resilience to noise, while providing robustness against over-training that can result in overly-narrow fine-grained topic models. For instance, for each topic, a fine-grained topic model can be combined with a coarse-grained topic model, such as described in commonly-assigned U.S. patent application, entitled “System and Method for Providing Robust Topic Identification in Social Indexes,” Ser. No. 12/608,929, filed Oct. 29, 2009, pending, the disclosure of which is incorporated by reference. Characteristic words are selected from the articles in the repository, scored using term frequency-inverse document frequency (TF-IDF) weighting, and normalized to form coarse-grained topic models. A term vector is then created for each coarse-grained topic model that characterizes the populations of characteristic words found in training examples. In combination, the fine-grained and coarse-grained topic models allow a curator to readily identify pages containing unacceptable “noise” content, propose near-miss candidate articles as positive training examples to broaden a topic, and propose candidate articles as negative training examples to narrow a topic.
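A minimal sketch of forming a coarse-grained term vector from characteristic words is given below. Several details are assumptions not taken from the referenced application: term frequency is counted over a topic's training articles, inverse document frequency is smoothed and computed over the whole corpus, only the `top_k` highest-scoring words are kept, and the vector is normalized to unit length. The function names are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens in order."""
    return re.findall(r"[a-z]+", text.lower())

def characteristic_word_vector(topic_articles, corpus_articles, top_k=5):
    """Return {word: weight} for the top_k TF-IDF words, L2-normalized."""
    n_docs = len(corpus_articles)
    # Document frequency: number of corpus articles containing each word.
    doc_freq = Counter()
    for article in corpus_articles:
        doc_freq.update(set(tokenize(article)))
    # Term frequency: occurrences of each word in the topic's training articles.
    term_freq = Counter()
    for article in topic_articles:
        term_freq.update(tokenize(article))
    # Smoothed TF-IDF; a word found in every corpus article scores zero.
    scores = {
        word: tf * math.log((1 + n_docs) / (1 + doc_freq[word]))
        for word, tf in term_freq.items()
    }
    # Keep the highest-scoring words, breaking ties alphabetically.
    top = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
    norm = math.sqrt(sum(score * score for _, score in top)) or 1.0
    return {word: score / norm for word, score in top}

def cosine(vector, text):
    """Cosine similarity between a unit-length topic vector and an article's raw word counts."""
    counts = Counter(tokenize(text))
    dot = sum(weight * counts[word] for word, weight in vector.items())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return dot / norm if norm else 0.0
```

Under these assumptions, an article whose words overlap the characteristic words of a topic scores above zero even when the article fails the topic's Boolean query, which is one way that near-miss candidates could be surfaced to a curator.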
Notwithstanding, fine-grained social indexing, when used either alone or with coarse-grained topic models, and other forms of topical indexing, generally assume that each topic has only one core single-layer meaning. Articles are classified either as “on-topic,” if sufficiently similar to a representation of a single core meaning, or as “off-topic.”
In contrast, some forms of topics have multiple and equally-applicable core meanings. Natural topics, for instance, are created through folksonomies or related collaborative approaches to tagging and categorizing content. Under these approaches, the set of acceptable core meanings assigned to a topic depends upon the perspective of the reader: what one reader considers “on-topic” could equally be considered “off-topic” by another reader. However, both readers are correct; each simply desires different core meanings for the same topic as a reflection of their interpretation of what is, or is not, considered to be “on topic.” Typically, the curator for the index has overall responsibility for determining the meanings for the topics.
Similarly, each topic can have subtopics, which in turn can each have multiple core meanings. This layering of topics results in a richer hierarchy of index entries that resembles a fractal-like nesting of core meanings. Each layer of subtopics has the same complexity as preceding layers, but within the scope of a specific topic. Existing topic models can also be organized hierarchically, yet topical diversity and semantic density are lacking, and similarity duplication of articles can still occur across seemingly unrelated branches of the hierarchy.
Consequently, natural topics have a polysemic nature, in which a topic has several core meanings that apply equally and the perspective of the reader determines whether an article is on-topic or off-topic. As well, a natural topic can have hierarchically-related meanings that are contextually embedded in a recursive manner. Conventional fine-grained topic models can be adapted for natural topics, such as by defining distinct finite state patterns for each core meaning. However, this approach raises further difficulties. One problem is that the overall pattern, which combines or excludes multiple meanings, can become cumbersome, complex, and thereby difficult to maintain. A second problem is that the articles themselves may cover multiple topics. This problem leads to a need for a nuanced and gradual approach to classifying articles to indicate whether an article is mainly on topic, or close to a topic, or mainly off-topic, or far from a topic. A third problem occurs when a topic has subtopics. Subtopics introduce a potential for overlap in the classification of articles to topics, and duplication in the presentation of articles. For example, in 2010, the news covered the conflict between Google, a U.S.-based online search provider, and the government of China. News articles falling under that conflict could be classified under multiple general news topics. From one perspective, the articles are about Internet censorship. From another perspective, the articles are about the economic futures of Google and its competitors in China. From yet another perspective, the articles are about cyber attacks. From still another perspective, the articles are about trade between the U.S. and China. Depending on the topics or subtopics being presented, showing the same article on the same page under multiple topics should be avoided. That is, “topic-similarity duplication” in article presentation ought to be reduced.
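The goal of reducing topic-similarity duplication can be illustrated with a simple sketch that places each article under at most one topic on a page. The sketch is an assumption for illustration and not a method described above: it presumes that each article already carries a graded per-topic score between 0 and 1, and the function name, the threshold value, and the tie-breaking rule are hypothetical.

```python
def assign_without_duplication(scores, threshold=0.5):
    """Place each article under at most one topic on a page.

    scores maps an article id to {topic: score in [0, 1]}. An article is
    shown under its highest-scoring topic if that score meets the
    threshold; ties are broken alphabetically by topic name. Articles
    whose best score falls below the threshold are not shown.
    """
    page = {}
    for article, by_topic in scores.items():
        if not by_topic:
            continue
        # Highest score first; alphabetical topic name on equal scores.
        topic, best = min(by_topic.items(), key=lambda kv: (-kv[1], kv[0]))
        if best >= threshold:
            page.setdefault(topic, []).append(article)
    return page
```

With hypothetical scores for an article about the Google and China conflict, such as 0.8 for Internet censorship, 0.6 for cyber attacks, and 0.3 for trade, the article would appear only under Internet censorship instead of being repeated under each topic.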
Therefore, a need remains for providing topical organization to a corpus that accommodates natural topics both horizontally, through co-equal core meanings, and vertically, through hierarchically embedded yet non-duplicative meanings.