1. Field of the Invention
The present invention relates, generally, to a process, system and article of manufacture for organizing and indexing information items such as documents by topic, and in preferred embodiments, to such a process, system and article which employ a topic hierarchy and involve a determination of discriminating terms and stop terms at each internal node in the topic hierarchy.
2. Description of Related Art
With modern advances in computer technology, modem speeds and network and internet technologies, vast amounts of information have become readily available in homes, businesses and educational and government institutions throughout the world. Indeed, many businesses, individuals and institutions rely on computer-accessible information on a daily basis. This global popularity has further increased the demand for even greater amounts of computer-accessible information. However, as the total amount of accessible information increases, locating specific items of information within the totality becomes increasingly difficult.
The format with which the accessible information is arranged also affects the level of difficulty in locating specific items of information within the totality. For example, searching through vast amounts of information arranged in a free-form format can be substantially more difficult and time consuming than searching through information arranged in a pre-defined order, such as by topic, date, category, or the like. However, due to the nature of certain on-line systems, such as the internet, much of the accessible information is placed on-line in the form of free-format text. Moreover, the amount of on-line data in the form of free-format text continues to grow very rapidly.
Search schemes employed to locate specific items of information among the on-line information content typically depend upon the presence or absence of key words (words included in the user-entered query) in the searchable text. Such search schemes identify those textual information items that include (or omit) the key words. However, in systems such as the web or large intranets, where the total information content is relatively large and free-form, key word searching can be problematic. For example, it can identify numerous text items that contain (or omit) the selected key words but are not relevant to the subject matter to which the user intended to direct the search.
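By way of illustration only, the presence-or-absence key word scheme described above might be sketched as follows. The function names and the three sample documents are hypothetical and are not taken from any actual search engine.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of alphabetic word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))

def keyword_search(documents, required=(), excluded=()):
    """Return ids of documents that contain every required word and no excluded word."""
    hits = []
    for doc_id, text in documents.items():
        words = tokenize(text)
        has_required = all(w in words for w in required)
        has_excluded = any(w in words for w in excluded)
        if has_required and not has_excluded:
            hits.append(doc_id)
    return hits

# Hypothetical collection, invented for this illustration.
documents = {
    "car_parts": "Replacement gearbox parts for a classic Jaguar built for speed.",
    "big_cat": "A page about the running speed of the jaguar, a large cat.",
    "car_review": "The new Jaguar car reaches a high speed on the track.",
}

print(keyword_search(documents, required=["jaguar", "speed"]))
print(keyword_search(documents, required=["jaguar", "speed"], excluded=["car", "auto"]))
```

In this sketch, excluding "car" and "auto" still returns the gearbox page, because that page never uses either word. Matching on words alone therefore does not capture the topic of a document.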
As text repositories grow in number and size and global connectivity improves, there is a pressing need to support efficient and effective information retrieval (IR), searching and filtering. A manifestation of this need is the recent proliferation of over one hundred commercial text search engines that crawl and index the web, and several subscription-based information multicast mechanisms. Nevertheless, there is little structure on the overwhelming information content of the internet.
Common practices for managing such information complexity on the internet or in database structures typically involve tree-structured hierarchical indices. Many internet directories, such as Yahoo!.TM. (http://www.yahoo.com) and Infoseek (http://www.infoseek.com) are largely manually organized in preset hierarchies. International Business Machines Corporation has implemented a patent database (http://www.ibm.com/patents) which is organized by the U.S. Patent Office's class codes, which form a preset hierarchy. Digital libraries that mimic hardcopy libraries support some form of subject indexing such as the Library of Congress Catalogue, which is also hierarchical. Such topic hierarchies are referred to herein as "taxonomies." Taxonomies can provide a means for designing vastly enhanced searching, browsing and filtering systems. Querying with respect to a topic can be more reliable than depending only on the presence or absence of specific words in documents. By the same token, multicast systems such as PointCast (http://www.pointcast.com) are likely to achieve higher quality by registering a user profile in terms of classes in a taxonomy rather than key words.
The danger in querying or filtering by key words alone is that there may be many aspects to, and often different interpretations of, the key words, and many of these aspects and interpretations are irrelevant to the subject matter that the searcher intended to find.
Consider, for example, a situation in which a wildlife researcher is attempting to find information about the running speed of the jaguar, using the conventional Alta Vista.TM. internet search engine (http://www.altavista.digital.com), with the query "jaguar speed". In a test search conducted with the above-noted search engine and query, a variety of responses were generated, spanning the car, the Atari.TM. video game, the football team, and a LAN server, in no particular order. The first page about the animal was ranked 183, and was directed to a fable.
To eliminate the responses on cars, the test query was then changed to "jaguar speed -car -auto". The top response in the generated results read as follows:
"If you own a classic Jaguar, you are no doubt aware how difficult it can be to find certain replacement parts. This is particularly true of gearbox parts."
The words car and auto do not occur on this page. There was no cat in the first 50 pages of the generated response. Some search engines such as Alta Vista.TM. propose additional keywords to refine the query, but, at the time of writing, all of the keywords were related to cars or football.
Even the query "jaguar speed +cat" gave unsatisfactory results. The responses included the word "cat", but were often about automobiles. The 25th page was the first with information about jaguars, but did not contain the desired information.
In contrast, if a topic taxonomy such as Yahoo!.TM. is used, there is no problem in insisting that the user seeks documents containing "jaguar" in the topical context of animals, not cars. Unfortunately, it is labor-intensive to maintain Yahoo!.TM. manually as the web changes and grows faster than ever. In our test case, even though the search was easily restricted to within animals, no answer could be found within the relatively small collection returned.
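By way of illustration only, restricting a key word query to a topical context in a taxonomy might be sketched as follows. The class name, the topic names and the sample documents are hypothetical, and the sketch does not represent how any actual directory is implemented.

```python
import re

class TopicNode:
    """A node in a tree-structured topic hierarchy (taxonomy)."""

    def __init__(self, name):
        self.name = name
        self.children = []
        self.documents = []  # documents filed directly under this topic

    def add_child(self, name):
        child = TopicNode(name)
        self.children.append(child)
        return child

    def find(self, name):
        """Depth-first search for the node with the given topic name."""
        if self.name == name:
            return self
        for child in self.children:
            found = child.find(name)
            if found is not None:
                return found
        return None

    def subtree_documents(self):
        """All documents filed under this topic or any of its subtopics."""
        docs = list(self.documents)
        for child in self.children:
            docs.extend(child.subtree_documents())
        return docs

def search_in_topic(root, topic, keyword):
    """Return documents under `topic` that contain `keyword` as a whole word."""
    node = root.find(topic)
    if node is None:
        return []
    keyword = keyword.lower()
    return [d for d in node.subtree_documents()
            if keyword in re.findall(r"[a-z]+", d.lower())]

# Hypothetical taxonomy and documents, invented for this illustration.
root = TopicNode("root")
animals = root.add_child("animals")
cats = animals.add_child("cats")
autos = root.add_child("automobiles")
animals.documents.append("migration of birds")
cats.documents.append("jaguar habitat and hunting behavior")
autos.documents.append("jaguar gearbox replacement parts")

print(search_in_topic(root, "animals", "jaguar"))
print(search_in_topic(root, "root", "jaguar"))
```

The first query returns only the document filed under the animals subtree, whereas the unrestricted query at the root also returns the automobile document. The sketch assumes the documents have already been assigned to topics, which is the labor-intensive step noted above.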
Search engines are still an immature technology. Other areas were researched intensively long before web search engines were devised, and the discussion below surveys the following overlapping areas of related research: Information Retrieval (IR) systems and text databases, data mining, statistical pattern recognition, and machine learning.
For data mining, machine learning, and pattern recognition, the supervised classification problem has been addressed in statistical decision theory (both classical, as in Wald, Statistical Decision Functions, 1950, and Bayesian, as in Berger, Statistical Decision Theory and Bayesian Analysis, 1985, each of which is incorporated herein by reference), in statistical pattern recognition (as in Duda and Hart, Pattern Classification and Scene Analysis, 1973 and Fukunaga, An Introduction to Statistical Pattern Recognition, 1990, each of which is incorporated herein by reference), and in machine learning (as in Weiss and Kulikowski, Computer Systems that Learn, 1990, Natarajan, Machine Learning: A Theoretical Approach, 1991, and Langley, Elements of Machine Learning, 1996, each of which is incorporated herein by reference).
Classifiers can be parametric or non-parametric. Two well-known classes of non-parametric classifiers are decision trees, such as CART (as in Breiman et al, Classification and Regression Trees, 1984, which is incorporated herein by reference) and C4.5 (as in Quinlan, C4.5: Programs for Machine Learning, 1993, which is incorporated herein by reference), and neural networks (as in Hush and Horne, Progress in Supervised Neural Networks, 1993, Lippmann, Pattern Classification using Neural Networks, 1989, and Jain et al, Artificial Neural Networks, 1996, each of which is incorporated herein by reference). For such classifiers, feature sets larger than 100 are considered extremely large. Document classification may require more than 50,000 features.
The most mature ideas in IR systems and text databases, which are also successfully integrated into commercial text search systems such as Verity, ConText, and Alta Vista, involve processing at a relatively syntactic level (e.g., stopword filtering, tokenizing, stemming, building inverted indices, computing heuristic term weights, and computing similarity measures between documents and queries in the vector-space model, as described by Rijsbergen, Information Retrieval, 1979, Salton and McGill, Introduction to Modern Information Retrieval, 1983, or Frakes and Baeza-Yates, Information Retrieval: Data Structures and Algorithms, 1992, each of which is incorporated herein by reference). More recent work includes statistical modeling of documents, unsupervised clustering (where documents are not labeled with topics and the goal is to discover coherent clusters, as described in Anick and Vaithyanathan, Exploiting Clustering and Phrases for Content-based Information Retrieval, 1997, which is incorporated herein by reference), supervised classification (as in Apte et al, Automated Learning of Decision Rules for Text Categorization, 1994, and Cohen and Singer, Context Sensitive Learning Methods for Text Categorization, 1996, each of which is incorporated herein by reference), and query expansion (as in Schutze et al, A Comparison of Classifiers and Document Representations for the Routing Problem, 1995, and Voorhees, Using WordNet to Disambiguate Word Senses for Text Retrieval, 1993, each of which is incorporated herein by reference).
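By way of illustration only, several of the syntactic-level steps listed above (stopword filtering, tokenizing, building an inverted index, computing heuristic term weights, and computing a vector-space similarity) might be sketched as follows. The stopword list, the weight tf * log(1 + N/df), the function names and the sample documents are assumptions chosen for brevity; stemming is omitted, and the sketch is not the method of any cited reference or commercial system.

```python
import math
import re
from collections import Counter, defaultdict

# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"the", "a", "of", "and", "is", "in", "to"}

def tokenize(text):
    """Lowercase, split into alphabetic tokens, and drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]

def build_index(documents):
    """Build an inverted index mapping term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index

def weight_vector(tokens, index, n_docs):
    """Heuristic term weights: tf * log(1 + N/df), one common variant."""
    vec = {}
    for term, tf in Counter(tokens).items():
        df = len(index.get(term, {}))
        if df:  # terms unseen in the collection get no weight
            vec[term] = tf * math.log(1 + n_docs / df)
    return vec

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank(query, documents):
    """Return document ids ordered by decreasing similarity to the query."""
    index = build_index(documents)
    n_docs = len(documents)
    q = weight_vector(tokenize(query), index, n_docs)
    scores = {doc_id: cosine(q, weight_vector(tokenize(text), index, n_docs))
              for doc_id, text in documents.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical collection, invented for this illustration.
documents = {
    "d1": "the jaguar is a big cat of the rain forest",
    "d2": "jaguar cars and gearbox parts",
    "d3": "stock prices in the market",
}
print(rank("jaguar cat", documents))
```

For simplicity, the sketch recomputes document vectors at query time; a practical system would instead traverse the inverted index so that only documents sharing a term with the query are scored.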
Singular value decomposition on the term-document matrix has been found to cluster semantically related documents together even if they do not share keywords (as discussed in Deerwester et al, Indexing by Latent Semantic Analysis, 1990, and Papadimitriou et al, Latent Semantic Indexing: A Probabilistic Analysis, 1996, each of which is incorporated herein by reference). None of these works addresses the supervised topic analysis problem in a hierarchy, how to use context-dependent words for indexing, how to automatically and efficiently compute good feature sets, or how to maintain disk data structures as the training documents and the topic structure change with time.
Koller and Sahami, Hierarchically Classifying Documents Using Very Few Words, International Conference on Machine Learning, July 1997 and Yang and Pedersen, A comparative study on feature selection in text categorization, International Conference on Machine Learning, July 1997 discuss classification. Koller et al propose a sophisticated feature selection algorithm that uses a Bayesian net to learn inter-term dependencies. The complexity in the number of features is supralinear (e.g., quadratic in the number of starting terms and exponential in the degree of dependence between terms). Consequently, the reported experiments have been restricted to a few thousand features and documents. Yang and Pedersen's experiments appear to indicate that much simpler methods suffice, in particular, that the approach of Apte et al of picking a fixed fraction of most frequent terms per class performs reasonably. This fraction may be very sensitive to corpus and methodology (e.g., whether stemming and stopwording are performed). This is indicated by the poor performance of such methods observed in recent work by Mladenic, Feature Subset Selection In Text Learning, 10th European Conference on Machine Learning, 1998.
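By way of illustration only, the simple idea of picking a fixed fraction of the most frequent terms per class might be sketched as follows. The function name, the tie-breaking rule, the rounding of the fraction and the sample data are assumptions; the sketch follows only the one-line description given above and is not the published procedure of Apte et al.

```python
import math
from collections import Counter

def top_terms_per_class(labeled_docs, fraction):
    """Select, for each class, the given fraction of its most frequent terms.

    labeled_docs: iterable of (class_label, list_of_tokens) pairs.
    fraction: share of each class's distinct terms to keep (0 < fraction <= 1).
    """
    counts = {}
    for label, tokens in labeled_docs:
        counts.setdefault(label, Counter()).update(tokens)

    selected = {}
    for label, counter in counts.items():
        # Keep at least one term; round the fraction up (an assumption).
        k = max(1, math.ceil(fraction * len(counter)))
        # Sort by descending count, breaking ties alphabetically.
        ranked = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
        selected[label] = [term for term, _ in ranked[:k]]
    return selected

# Hypothetical labeled training documents, invented for this illustration.
training = [
    ("animals", ["jaguar", "cat", "cat", "forest"]),
    ("animals", ["cat", "jaguar", "prey"]),
    ("cars", ["jaguar", "engine", "engine", "gearbox", "engine"]),
]
print(top_terms_per_class(training, fraction=0.5))
```

In this sketch the term "jaguar" is kept for the animals class but not for the cars class, and changing the fraction or the preprocessing changes the selected sets, which is consistent with the sensitivity noted above.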