The present invention relates to the field of data mining, and particularly to a software system and associated methods for automatically discovering terms that are relevant to a given target topic from a large database of unstructured information such as the World Wide Web (WWW). More specifically, the present invention relates to the automatic and iterative recognition of relevant terms by association mining and refinement of co-occurrences.
The World Wide Web (WWW) is a vast and open communications network where computer users can access available data, digitally encoded documents, books, pictures, and sounds. With the explosive growth and diversity of WWW authors, published information is oftentimes unstructured and widely scattered. Although search engines play an important role in furnishing desired information to the end users, the organization of the information lacks structure and consistency. Web spiders crawl web pages and index them to serve the search engines. As the web spiders visit web pages, they could look for and learn pieces of information that would otherwise remain undetected.
Current search engines are designed to identify pages with specific phrases and offer limited search capabilities. For example, search engines cannot search for phrases that relate in a particular way, such as books and authors. Bibliometrics involves the study of the world of authorship and citations. It measures the co-citation strength, which is a measure of the similarity between two technical papers on the basis of their common citations. Statistical techniques are used to compute this measure. In typical bibliometric situations the citations and authorship are explicit and do not need to be mined. One of the limitations of bibliometrics is that it cannot be used to extract information buried in the text.
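As a minimal sketch of the co-citation measure mentioned above, the following Python fragment counts, for each pair of papers, how many citing papers cite both of them. The raw count is only one possible definition of co-citation strength, and the function name, data layout, and sample citation data are illustrative assumptions rather than anything taken from the cited studies.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(citations):
    """Count how often each pair of papers is cited together.

    `citations` (hypothetical layout) maps a citing paper to the set of
    papers it cites. The count for a pair is the number of citing papers
    whose reference lists contain both members of the pair.
    """
    counts = Counter()
    for cited in citations.values():
        # Sort so that each unordered pair has a single canonical key.
        for a, b in combinations(sorted(set(cited)), 2):
            counts[(a, b)] += 1
    return counts


# Illustrative data only: three citing papers and their reference lists.
citations = {
    "P1": {"A", "B", "C"},
    "P2": {"A", "B"},
    "P3": {"B", "C"},
}
counts = cocitation_counts(citations)
```

Because the input here is an explicit citation list, no mining of free text is involved, which is the limitation the paragraph above points out.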
Exemplary bibliometric studies are reported in: R. Larson, “Bibliometrics of the World Wide Web: An Exploratory Analysis of the Intellectual Structure of Cyberspace,” Technical report, School of Information Management and Systems, University of California, Berkeley, 1996. http://sherlock.sims.berkeley.edu/docs/asis96/asis96.html; K. McCain, “Mapping Authors in Intellectual Space: A Technical Overview,” Journal of the American Society for Information Science, 41(6):433-443, 1990. A Dual Iterative Pattern Relation Expansion (DIPRE) method that addresses the problem of extracting (author, book) relationships from the web is described in S. Brin, “Extracting Patterns and Relations from the World Wide Web,” WebDB, Valencia, Spain, 1998.
Another approach to identifying a set of related information on the World Wide Web is Hyperlink-Induced Topic Search (HITS). HITS is a system that identifies authoritative web pages on the basis of the link structure of web pages. It iteratively identifies good hubs, that is, pages that point to good authorities, and good authorities, that is, pages pointed to by good hub pages. This technique has been extended to identify communities on the web, and to target a web crawler. One limitation of HITS resides in the link topology of the pattern space, where the hubs and the authorities are of the same kind, i.e., they are all web pages. HITS is not defined over the text of web pages in the form of phrases containing relations in specific patterns.
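The mutual reinforcement between hubs and authorities can be sketched as a simple iteration. The Python fragment below is an illustrative simplification, not the implementation used in the cited studies: the link graph, the function name `hits`, the fixed iteration count, and the Euclidean normalization are all assumptions made for the example.

```python
import math


def hits(links, iterations=50):
    """Iteratively compute hub and authority scores.

    `links` (hypothetical layout) maps a page to the pages it links to.
    """
    nodes = set(links) | {t for targets in links.values() for t in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}

    def normalize(scores):
        norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
        return {n: v / norm for n, v in scores.items()}

    for _ in range(iterations):
        # A good authority is pointed to by good hubs.
        auth = {n: 0.0 for n in nodes}
        for src, targets in links.items():
            for t in targets:
                auth[t] += hub[src]
        auth = normalize(auth)
        # A good hub points to good authorities.
        hub = normalize(
            {n: sum(auth[t] for t in links.get(n, ())) for n in nodes}
        )
    return hub, auth


# Illustrative link graph: three hub-like pages and two authority-like pages.
links = {"h1": ["a1", "a2"], "h2": ["a1"], "h3": ["a1", "a2"]}
hub, auth = hits(links)
```

Note that both score tables are keyed by pages only, which reflects the limitation described above: hubs and authorities are of the same kind, and nothing in the computation looks at phrases inside the page text.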
Exemplary HITS studies are reported in: D. Gibson et al., “Inferring Web Communities from Link Topology,” HyperText, pages 225-234, Pittsburgh, Pa., 1998; J. Kleinberg, “Authoritative Sources in a Hyperlinked Environment,” Proc. of 9th ACM-SIAM Symposium on Discrete Algorithms, May 1997; R. Kumar, “Trawling the Web for Emerging Cyber-Communities,” published on the WWW at URL: http://www8.org/w8-papers/4a-search-mining/trawling/trawling.html as of Nov. 13, 1999; and S. Chakrabarti et al., “Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery,” Proc. of The 8th International World Wide Web Conference, Toronto, Canada, May 1999.
Furthermore, not only is the quantity of WWW material increasing, but the types of digitized material are also increasing. For example, it is possible to store alphanumeric texts, data, audio recordings, pictures, photographs, drawings, images, video and prints. However, such large quantities of material are of little value unless the desired information is readily retrievable. While, as discussed above, certain techniques have been developed for accessing certain types of textual materials, these techniques are at best moderately adequate for accessing graphic, audio or other specialized materials. Consequently, there are large bodies of published materials that remain inaccessible and thus unusable or significantly underutilized.
A common technique for accessing textual materials is by means of a “keyword” combination, generally with Boolean connections between the words or terms. This searching technique suffers from several drawbacks. First, the use of this technique is limited to text and is not usable for other types of material. Second, in order to develop a searchable database of terms, the host computer must usually download the entire documents, which is a time-consuming process, and the technique does not normally provide an association between related terms and concepts.
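To make the keyword technique concrete, the following is a minimal Python sketch of Boolean search over an inverted index. The function names, the whitespace tokenization, and the sample documents are assumptions made for illustration; real search engines use considerably more elaborate indexing.

```python
from collections import defaultdict


def build_index(docs):
    """Build an inverted index: word -> set of ids of documents containing it.

    Note that the full text of every document is needed to build the index.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search_and(index, *terms):
    """Documents containing every term (Boolean AND)."""
    sets = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*sets) if sets else set()


def search_or(index, *terms):
    """Documents containing at least one term (Boolean OR)."""
    return set().union(*(index.get(t.lower(), set()) for t in terms))


# Illustrative documents only.
docs = {
    1: "data mining on the web",
    2: "web search engines",
    3: "mining association rules",
}
index = build_index(docs)
both = search_and(index, "web", "mining")
either = search_or(index, "web", "mining")
```

The index answers only whether the literal words occur. It carries no notion that two different terms are related, which is the second drawback noted above.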
Exemplary work in scalable data mining technology is described in the following references: R. Agrawal et al., “Mining Association Rules Between Sets of Items in Large Databases,” Proceedings of ACM SIGMOD Conference on Management of Data, pp. 207-216, Washington, D.C., May 1993; R. Agrawal et al., “Fast Algorithms for Mining Association Rules,” Proc. of the 20th Int'l Conference on VLDB, Santiago, Chile, September 1994; and S. Brin, “Extracting Patterns and Relations from the World Wide Web,” WebDB, Valencia, Spain, 1998, supra. Such work has been successfully applied to identify co-occurring patterns in many real world problems including market basket analysis, cross-marketing, store layout, and customer segmentation based on buying patterns.
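As an illustration of association mining in the market basket setting, the following Python sketch derives rules between pairs of items using the usual support and confidence measures. It is restricted to item pairs and scans the data once, so it is a simplification and not the algorithms of the cited references; the function name, thresholds, and basket data are assumptions chosen for the example.

```python
from collections import Counter
from itertools import combinations


def pair_rules(transactions, min_support, min_confidence):
    """Return rules (x, y, support, confidence) meaning "x implies y".

    support    = fraction of transactions containing both x and y
    confidence = fraction of transactions containing x that also contain y
    """
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        # Each frequent pair can yield a rule in either direction.
        for x, y in ((a, b), (b, a)):
            confidence = count / item_counts[x]
            if confidence >= min_confidence:
                rules.append((x, y, support, confidence))
    return rules


# Illustrative baskets only.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
rules = pair_rules(baskets, min_support=0.5, min_confidence=0.6)
```

With these thresholds, the pair (butter, milk) is dropped because it appears in only one of the four baskets, while the rule from butter to bread is kept with confidence 1.0 because every basket containing butter also contains bread.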
Early work on applying association to texts can be found in the FACT system, described in R. Feldman et al., “Mining Associations in Text in the Presence of Background Knowledge,” Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, Aug. 2-4, 1996, Portland, Oreg. It discovers associations only amongst keywords representing the topics of the document. The FACT system assumes that a set of predefined keywords describing the document is available. Such an assumption might not be too unrealistic for a set of well annotated documents or for a class of documents for which a text categorization system automatically produces reasonably good annotations with keywords. However, the assumption generally does not hold true for WWW pages since a major portion of the WWW pages is not well annotated. Annotation of the WWW pages by general text categorization techniques can perform poorly, in that these techniques use natural language processing (NLP) that expects grammatically correct sentences, and WWW pages frequently consist of irregular sentences.
There is therefore a great and still unsatisfied need for a software system and associated methods for the automatic discovery of terms that are relevant to a given target topic from the World Wide Web, with a high degree of accuracy and confidence.
In accordance with the present invention, a computer program product is provided as an automatic mining system to discover terms that are relevant to a given target topic from a large database of unstructured information such as the World Wide Web (WWW). The system and methods enable the automatic and iterative recognition of relevant terms by association mining and refinement of co-occurrences.
The operation of the automatic mining system is performed in three stages: the first stage is carried out by a new terms discoverer for discovering the terms in a document d_i; the second stage is carried out by a candidate terms discoverer for discovering potentially relevant terms; and the third stage is carried out by a relevant terms discoverer for refining or testing the discovered relevance to filter false (or insignificant) relevance.
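The three-stage flow can be pictured with the following Python sketch. It is only a toy illustration under stated assumptions: each stage is replaced by a simple placeholder (tokenization, a co-occurrence count threshold, and a small stop word list), and every function name, the stop word list, the threshold, and the sample documents are hypothetical rather than the modules described in this specification.

```python
from collections import Counter

# Hypothetical stop word list used by the placeholder third stage.
STOP_WORDS = {"the", "a", "of", "and", "is", "in", "for", "to", "on"}


def discover_new_terms(document):
    """Stage 1 placeholder: the terms of one document as lowercase tokens."""
    return {w.strip(".,;:()").lower() for w in document.split()} - {""}


def discover_candidate_terms(docs_terms, topic, min_cooccurrence):
    """Stage 2 placeholder: terms co-occurring with the topic often enough."""
    counts = Counter()
    for terms in docs_terms:
        if topic in terms:
            counts.update(terms - {topic})
    return {t for t, c in counts.items() if c >= min_cooccurrence}


def discover_relevant_terms(candidates):
    """Stage 3 placeholder: filter insignificant candidates (stop words)."""
    return {t for t in candidates if t not in STOP_WORDS}


def mine(documents, topic, min_cooccurrence=2):
    """Run the three placeholder stages in sequence."""
    docs_terms = [discover_new_terms(d) for d in documents]
    candidates = discover_candidate_terms(docs_terms, topic, min_cooccurrence)
    return discover_relevant_terms(candidates)


# Illustrative documents only.
documents = [
    "Java is the language of the applet.",
    "The applet runs in a Java virtual machine.",
    "Coffee from Java is strong.",
]
relevant = mine(documents, "java")
```

In this toy run the second stage proposes "is", "the", and "applet" as candidates for the topic "java", and the third stage discards the first two as insignificant, which mirrors the role of refinement in filtering false relevance.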
The new terms discoverer includes a system for the automatic mining of patterns and relations, a system for the automatic mining of new relationships, and a system for selecting new terms from relations. In one embodiment, the system for the automatic mining of patterns and relations identifies a set of related terms on the WWW with a high degree of confidence, using a duality concept, and includes a terms database and two identifiers: a relation identifier and a pattern identifier.
The system for the automatic mining of new relationships enables the discovery of new relationships by association mining and refinement of co-occurrences, using automatic and iterative recognition of new binary relations through phrases that embody related pairs. The system for the automatic mining of new relationships comprises a database, a knowledge module, and a statistics module. In one embodiment, the knowledge module includes one or more of the following units: a stemming unit, a synonym check unit, and a domain knowledge check unit. New terms are obtained from relations discovered by the system for automatic mining of patterns and relations of the same kind by selecting an item (or a column) of a pair.
The candidate terms discoverer comprises a metadata extractor, a document vector module, an association module, a filtering module, and a database for storing the mined sets of relevant terms. The relevant terms discoverer includes a stop word filter and a system for the automatic construction of a generalization-specialization hierarchy of terms. The system for the automatic construction of a generalization-specialization hierarchy of terms includes a terms database, an augmentation module, a generalization detection module, and a hierarchy database.