The present invention relates to the field of data mining, and particularly to a software system and associated method for the automatic construction of a generalization-specialization hierarchy of terms from a database of terms and associated meanings, including but not limited to large text databases of unstructured information such as the World Wide Web (WWW). More specifically, the present invention relates to the automatic and iterative recognition of relevant terms by association mining and refinement of co-occurrences using, for example, the Least General Generalization (LGG) model.
The World Wide Web (WWW) is a vast and open communications network where computer users can access available data, digitally encoded documents, books, pictures, and sounds. With the explosive growth and diversity of WWW authors, published information is oftentimes unstructured and widely scattered. Although search engines play an important role in furnishing desired information to the end users, the organization of the information lacks structure and consistency. Web spiders crawl web pages and index them to serve the search engines. As the web spiders visit web pages, they could look for and learn pieces of information that would otherwise remain undetected.
Current search engines are designed to identify pages with specific phrases and offer limited search capabilities. For example, search engines cannot search for phrases that relate in a particular way, such as books and authors. Bibliometrics involves the study of the world of authorship and citations. It measures the co-citation strength, which is a measure of the similarity between two technical papers on the basis of their common citations. Statistical techniques are used to compute these measures. In typical bibliometric situations the citations and authorship are explicit and do not need to be mined. One of the limitations of bibliometrics is that it cannot be used to extract information buried in the text.
Exemplary bibliometric studies are reported in: R. Larson, “Bibliometrics of the World Wide Web: An Exploratory Analysis of the Intellectual Structure of Cyberspace,” Technical report, School of Information Management and Systems, University of California, Berkeley, 1996. http://sherlock.sims.berkeley.edu/docs/asis96/asis96.html; K. McCain, “Mapping Authors in Intellectual Space: A Technical Overview,” Journal of the American Society for Information Science, 41(6):433-443, 1990. A Dual Iterative Pattern Relation Expansion (DIPRE) method that addresses the problem of extracting (author, book) relationships from the web is described in S. Brin, “Extracting Patterns and Relations from the World Wide Web,” WebDB, Valencia, Spain, 1998.
Another approach to identifying a set of related information on the World Wide Web is Hyperlink-Induced Topic Search (HITS). HITS is a system that identifies authoritative web pages on the basis of the link structure of web pages. It iteratively identifies good hubs, that is, pages that point to good authorities, and good authorities, that is, pages pointed to by good hub pages. This technique has been extended to identify communities on the web, and to target a web crawler. One of HITS' limitations resides in the link topology of the pattern space, where the hubs and the authorities are of the same kind, i.e., they are all web pages. HITS is not defined on the text of web pages in the form of phrases containing relations in specific patterns. Exemplary HITS studies are reported in: D. Gibson et al., “Inferring Web Communities from Link Topology,” HyperText, pages 225-234, Pittsburgh, Pa., 1998; J. Kleinberg, “Authoritative Sources in a Hyperlinked Environment,” Proc. of 9th ACM-SIAM Symposium on Discrete Algorithms, May 1997; R. Kumar, “Trawling the Web for Emerging Cyber-Communities,” published on the WWW at URL: http://www8.org/w8-papers/4a-search-mining/trawling/trawling.html as of Nov. 13, 1999; and S. Chakrabarti et al., “Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery,” Proc. of The 8th International World Wide Web Conference, Toronto, Canada, May 1999.
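The mutual reinforcement between hubs and authorities described above can be illustrated with a minimal sketch. It is not taken from the cited references; the function name `hits`, the fixed iteration count, and the toy link graph are assumptions made only for illustration.

```python
import math

def hits(links, iterations=50):
    """Illustrative hub/authority iteration.

    links: dict mapping a page to the list of pages it points to
    (a hypothetical toy representation of the link structure).
    """
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A good authority is pointed to by good hubs.
        auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                auth[q] += hub[p]
        # A good hub points to good authorities.
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
        # Normalize both score vectors so the values stay bounded.
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth

# Toy graph: h1 points to both authorities, h2 to only one.
toy_links = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
hub_scores, auth_scores = hits(toy_links)
```

In this toy graph, a1 receives links from both hubs and therefore ends with a higher authority score than a2, while h1 ends with a higher hub score than h2. Note that every node is a page, which is the limitation noted above.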
Furthermore, not only is the quantity of WWW material increasing, but the types of digitized material are also increasing. For example, it is possible to store alphanumeric texts, data, audio recordings, pictures, photographs, drawings, images, video and prints. However, such large quantities of material are of little value unless the desired information is readily retrievable. While, as discussed above, certain techniques have been developed for accessing certain types of textual materials, these techniques are at best moderately adequate for accessing graphic, audio or other specialized materials. Consequently, there are large bodies of published materials that remain inaccessible and thus unusable or significantly underutilized.
A common technique for accessing textual materials is by means of a “keyword” combination, generally with Boolean connections between the words or terms. This searching technique suffers from several drawbacks. First, the use of this technique is limited to text and is not usable for other types of material. Second, in order to develop a searchable database of terms, the host computer must usually download the entire documents, which is a time-consuming process, and does not normally provide an association between related terms and concepts.
Exemplary work in scalable data mining technology is described in the following references: R. Agrawal et al., “Mining Association Rules Between Sets of Items in Large Databases,” Proceedings of ACM SIGMOD Conference on Management of Data, pp. 207-216, Washington, D.C., May 1993; R. Agrawal et al., “Fast Algorithms for Mining Association Rules,” Proc. of the 20th Int'l Conference on VLDB, Santiago, Chile, September 1994; and S. Brin, “Extracting Patterns and Relations from the World Wide Web,” WebDB, Valencia, Spain, 1998, supra. Such work has been successfully applied to identify co-occurring patterns in many real world problems including market basket analysis, cross-marketing, store layout, and customer segmentation based on buying patterns.
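As a rough illustration of how co-occurring patterns are identified in market basket analysis, the following sketch computes support and confidence for item pairs. It is a simplified stand-in rather than the algorithms of the cited references; the function name `pair_rules`, the thresholds, and the sample baskets are hypothetical.

```python
from collections import Counter
from itertools import combinations

def pair_rules(transactions, min_support=0.5, min_confidence=0.6):
    """Return rules (lhs, rhs, support, confidence) over item pairs.

    support    = fraction of transactions containing both items
    confidence = support(lhs and rhs) / support(lhs)
    """
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for t in transactions:
        items = sorted(set(t))  # ignore duplicates within one basket
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue  # the pair does not co-occur often enough
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return rules

# Hypothetical baskets used only to exercise the sketch.
baskets = [
    ["bread", "milk"],
    ["bread", "butter"],
    ["bread", "milk", "butter"],
    ["milk"],
]
rules = pair_rules(baskets)
```

With these baskets, the pair (butter, milk) appears in only one of four transactions and is dropped by the support threshold, whereas butter implies bread with confidence 1.0 because every basket containing butter also contains bread.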
Early work on applying association to texts can be found in the FACT system, described in R. Feldman et al., “Mining Associations in Text in the Presence of Background Knowledge,” Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, Aug. 2-4, 1996, Portland, Oreg. It discovers associations amongst only keywords representing the topics of the document. The FACT system assumes that a set of predefined keywords describing the document is available. Such an assumption might not be too unrealistic for a set of well annotated documents or for a class of documents for which a text categorization system automatically produces reasonably good annotations with keywords. However, the assumption generally does not hold true for WWW pages since a major portion of the WWW pages is not well annotated. Annotation of the WWW pages by general text categorization techniques can perform poorly, in that these techniques use natural language processing (NLP) that expects grammatically correct sentences, and WWW pages frequently consist of irregular sentences.
There is therefore a great and still unsatisfied need for a software system and associated methods for the automatic construction of a generalization-specialization hierarchy of terms from an unstructured database of terms and associated meanings, with a high degree of accuracy and confidence, and with minimal human interference.
In accordance with the present invention, a computer program product is provided as an automatic mining system to build a generalization hierarchy of terms from a database of terms and associated meanings. The system and methods enable the automatic and iterative recognition of relevant terms by association mining and refinement of co-occurrences using, for example, the Least General Generalization (LGG) model.
The automatic mining system is generally comprised of a terms database, an augmentation module, a generalization detection module and a hierarchy database. The terms database stores the sets of terms (Ai) and their associated meanings (Mi), and the hierarchy database stores the generalization hierarchy (Hi) mined by the automatic mining system. The set of terms (Ai) includes the set of generalizations (Li) that have been mined by the automatic mining system, and the generalization hierarchy (Hi) is defined by a set of edges (Ei) and a set of terms (Ai).
One function of the augmentation module is to update the set of terms (Ai), knowing the terms (ai) stored in the terms database. This feature is implemented by a generalization technique such as the “Least General Generalization” or LGG model. The generalization detection module maps the LGG sets (Li−1) that are stored in the terms database and the LGG terms {li} that are derived by the augmentation module, updates the set of edges (Ei), and derives a generalization hierarchy. In operation, the automatic mining system begins with no predefined taxonomy of the terms and derives a generalization hierarchy such as an LGG model constructed as a Directed Acyclic Graph (DAG), from the set of terms (Ai). The generalization hierarchy maps the generalization and specialization relationships between the terms (ai).
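The interplay of the augmentation module and the generalization detection module may be pictured with the following minimal sketch. It rests on strong assumptions that are not part of the description above: each meaning (Mi) is modeled as a set of features, the LGG of two terms is approximated by the intersection of their feature sets, and the names `lgg` and `build_hierarchy` are hypothetical. The actual LGG model may differ.

```python
from itertools import combinations

def lgg(meaning_a, meaning_b):
    # Assumed stand-in for the LGG: the features shared by both meanings.
    return frozenset(meaning_a) & frozenset(meaning_b)

def build_hierarchy(terms):
    """terms: dict mapping a term to its set of features (assumed model).

    Returns (meanings, edges), where each edge (g, s) states that
    term g is a direct generalization of term s.
    """
    meanings = {t: frozenset(m) for t, m in terms.items()}
    known = {m: t for t, m in meanings.items()}
    # Augmentation: iteratively add LGG terms until nothing new appears.
    changed = True
    while changed:
        changed = False
        for a, b in combinations(list(meanings), 2):
            g = lgg(meanings[a], meanings[b])
            if g and g not in known:
                name = "lgg(" + ",".join(sorted(g)) + ")"
                meanings[name] = g
                known[g] = name
                changed = True
    # Generalization detection: g generalizes s when g's features are a
    # proper subset of s's features and no other term lies in between.
    # Proper subsets cannot form a cycle, so the result is a DAG.
    edges = set()
    for g, gm in meanings.items():
        for s, sm in meanings.items():
            if gm < sm and not any(gm < mm < sm for mm in meanings.values()):
                edges.add((g, s))
    return meanings, edges

# Hypothetical terms and feature sets used only for illustration.
sample_terms = {
    "sparrow": {"animal", "bird", "small"},
    "eagle": {"animal", "bird", "large"},
    "trout": {"animal", "fish"},
}
all_terms, hierarchy_edges = build_hierarchy(sample_terms)
```

Starting from three terms and no predefined taxonomy, the sketch derives two generalizations, one covering both birds and a more general one covering all three terms, and links them with edges that run from generalization to specialization.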