The present invention relates to the field of data mining, and particularly to a software system and associated method for identifying relevant terms from a large text database of unstructured information, such as the World Wide Web (WWW). More specifically, the present invention relates to the automatic and iterative recognition of new binary relations through phrases that embody related pairs, by applying lexicographic and statistical techniques to classify the relations, and further by applying a minimal amount of domain knowledge of the relevance of the terms and relations.
The World Wide Web (WWW) is a vast and open communications network where computer users can access available data, digitally encoded documents, books, pictures, and sounds. With the explosive growth and diversity of WWW authors, published information is oftentimes unstructured and widely scattered. Although search engines play an important role in furnishing desired information to end users, the organization of the information lacks structure and consistency. Web spiders crawl web pages and index them to serve the search engines. As the web spiders visit web pages, they could look for and learn pieces of information that would otherwise remain undetected.
Current search engines are designed to identify pages with specific phrases and offer limited search capabilities. For example, search engines cannot search for terms that relate in a particular way, such as books and authors. Bibliometrics involves the study of the world of authorship and citations. It measures the co-citation strength, which is a measure of the similarity between two technical papers on the basis of their common citations. Statistical techniques are used to compute these measures. In typical bibliometric situations the citations and authorship are explicit and do not need to be mined. One of the limitations of bibliometrics is that it cannot be used to extract information buried in the text.
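As a minimal sketch of the co-citation measure mentioned above, the following Python counts, for each pair of papers, how many citing papers cite both. The raw count as the strength measure, the function name `cocitation_counts`, and the toy data are assumptions made for illustration; they are not taken from the cited bibliometric studies.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(citations):
    """Count, for each pair of cited papers, how many citing papers cite both.

    `citations` maps a citing paper's id to the set of paper ids it cites.
    """
    counts = Counter()
    for cited in citations.values():
        # Sort so that each unordered pair has a single canonical key.
        for a, b in combinations(sorted(cited), 2):
            counts[(a, b)] += 1
    return counts


# Hypothetical toy data: three citing papers and the papers they cite.
citations = {
    "p1": {"A", "B", "C"},
    "p2": {"A", "B"},
    "p3": {"B", "C"},
}
strength = cocitation_counts(citations)
```

Note that the input here is an explicit citation table, which is exactly the situation the paragraph describes: nothing has to be mined from the running text.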
Exemplary bibliometric studies are reported in: R. Larson, “Bibliometrics of the World Wide Web: An Exploratory Analysis of the Intellectual Structure of Cyberspace,” Technical report, School of Information Management and Systems, University of California, Berkeley, 1996, <http://sherlock.sims.berkeley.edu/docs/asis96/asis96.html>; K. McCain, “Mapping Authors in Intellectual Space: A Technical Overview,” Journal of the American Society for Information Science, 41(6):433-443, 1990. A Dual Iterative Pattern Relation Expansion (DIPRE) method that addresses the problem of extracting (author, book) relationships from the web is described in S. Brin, “Extracting Patterns and Relations from the World Wide Web,” WebDB, Valencia, Spain, 1998.
Another approach to identifying a set of relevant information on the World Wide Web is Hyperlink-Induced Topic Search (HITS). HITS is a system that identifies authoritative web pages on the basis of the link structure of web pages. It iteratively identifies good hubs, that is, pages that point to good authorities, and good authorities, that is, pages pointed to by good hubs. This technique has been extended to identify communities on the web, and to target a web crawler. One of HITS' limitations resides in the link topology of the pattern space, where the hubs and the authorities are of the same kind, i.e., they are all web pages. HITS is not defined over the text of web pages in the form of phrases containing relations in specific patterns.
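The mutual reinforcement between hubs and authorities can be illustrated with a short sketch. The Python below follows the commonly described form of the iteration (authority score as the sum of incoming hub scores, hub score as the sum of outgoing authority scores, followed by normalization). The fixed iteration count, the function names, and the toy link graph are assumptions for illustration only.

```python
import math


def normalize(scores):
    """Scale scores to unit Euclidean length (no-op if all are zero)."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {page: v / norm for page, v in scores.items()}


def hits(links, iterations=50):
    """Return (hub, authority) scores for a link graph.

    `links` maps each page to the list of pages it points to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    hub = {page: 1.0 for page in pages}
    auth = {page: 1.0 for page in pages}
    for _ in range(iterations):
        # A page's authority is the sum of the hub scores pointing to it.
        new_auth = {page: 0.0 for page in pages}
        for src, targets in links.items():
            for dst in targets:
                new_auth[dst] += hub[src]
        # A page's hub score is the sum of the authorities it points to.
        new_hub = {
            page: sum(new_auth[dst] for dst in links.get(page, ()))
            for page in pages
        }
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return hub, auth


# Hypothetical toy graph: h1 and h2 are hubs, a1 and a2 are authorities.
links = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
hub, auth = hits(links)
```

Both score tables range over the same set of pages, which reflects the limitation noted above: the hubs and the authorities are of the same kind.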
Exemplary HITS studies are reported in: D. Gibson et al., “Inferring Web Communities from Link Topology,” HyperText, pages 225-234, Pittsburgh, Pa., 1998; J. Kleinberg, “Authoritative Sources in a Hyperlinked Environment,” Proc. of 9th ACM-SIAM Symposium on Discrete Algorithms, May 1997; R. Kumar, “Trawling the Web for Emerging Cyber-Communities,” published on the WWW at URL: <http://www8.org/w8-papers/4a-search-mining/trawling/trawling.html> as of Nov. 13, 1999; and S. Chakrabarti et al., “Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery,” Proc. of The 8th International World Wide Web Conference, Toronto, Canada, May 1999.
Furthermore, not only is the quantity of WWW material increasing, but the types of digitized material are also increasing. For example, it is possible to store alphanumeric texts, data, audio recordings, pictures, photographs, drawings, images, video and prints. However, such large quantities of material are of little value unless the desired information is readily retrievable. While, as discussed above, certain techniques have been developed for accessing certain types of textual materials, these techniques are at best moderately adequate for accessing graphic, audio or other specialized materials. Consequently, there are large bodies of published materials that remain inaccessible and thus unusable or significantly underutilized.
A common technique for accessing textual materials is by means of a “keyword” combination, generally with Boolean connections between the words or terms. This searching technique suffers from several drawbacks. First, the use of this technique is limited to text and is not usable for other types of material. Second, in order to develop a searchable database of terms, the host computer must usually download entire documents, which is a time-consuming process and does not normally provide an association between relevant terms.
Exemplary work in scalable data mining technology is described in the following references: R. Agrawal et al., “Mining Association Rules Between Sets of Items in Large Databases,” Proceedings of ACM SIGMOD Conference on Management of Data, pp. 207-216, Washington, D.C., May 1993; R. Agrawal et al., “Fast Algorithms for Mining Association Rules,” Proc. of the 20th Int'l Conference on VLDB, Santiago, Chile, September 1994; and S. Brin, “Extracting Patterns and Relations from the World Wide Web,” WebDB, Valencia, Spain, 1998, supra. Such work has been successfully applied to identify co-occurring patterns in many real world problems including market basket analysis, cross-marketing, store layout, and customer segmentation based on buying patterns.
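To make the notion of co-occurring patterns concrete, the following sketch computes the two standard association-rule measures, support and confidence, over a handful of market baskets. The function names and the basket contents are hypothetical, and the brute-force scan shown here is only an illustration of the definitions, not one of the scalable algorithms described in the cited references.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) over the transactions."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0  # The rule never applies, so report zero confidence.
    return support(transactions, set(antecedent) | set(consequent)) / base


# Hypothetical market baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
bread_milk_support = support(baskets, {"bread", "milk"})
bread_to_milk_confidence = confidence(baskets, {"bread"}, {"milk"})
```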
Early work on applying association mining to text can be found in the FACT system, described in R. Feldman et al., “Mining Associations in Text in the Presence of Background Knowledge,” Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, Aug. 2-4, 1996, Portland, Oreg. FACT discovers associations only among keywords representing the topics of a document, and it assumes that a set of predefined keywords describing the document is available. Such an assumption might not be too unrealistic for a set of well-annotated documents or for classes of documents for which a text categorization system automatically produces reasonably good keyword annotations. However, the assumption generally does not hold for WWW pages, since a major portion of them is not well annotated. Annotation of WWW pages by general text categorization techniques can perform poorly, in that these techniques use natural language processing (NLP) that expects grammatically correct sentences, and WWW pages frequently consist of irregular sentences.
There is therefore a great and still unsatisfied need for a software system and associated methods for automatically identifying relevant terms on the World Wide Web. The system and methods should enable the automatic and iterative recognition of binary relations through phrases that embody related pairs, by applying lexicographic and statistical techniques to classify the relations, and further by applying a minimal amount of domain knowledge of the relevance of terms and relations.
In accordance with the present invention, a computer program product is provided as an automatic mining system to identify a set of related terms from a large text database of unstructured information, such as the World Wide Web (WWW), with a high degree of confidence.
The automatic mining system includes a software program that enables the discovery of new relationships by association mining and refinement of co-occurrences, using automatic and iterative recognition of new binary relations through phrases that embody related pairs, by applying lexicographic and statistical techniques to classify the relations, and further by applying a minimal amount of domain knowledge of the relevance of the terms and relations.
The foregoing and other features and advantages of the present invention can be accomplished by an automatic mining system that includes a database for storing the mined sets of relevant terms and relations, and a software package comprised of a knowledge module and a statistics module. Using the document di, the previously identified sets of pairs Pi−1, and relations Ri−1, the knowledge module inquires whether or not the relation ri exists in the set of relations Ri−1. If the relation ri is deemed not to exist in the set of relations Ri−1, the knowledge module forwards the pair pi and the derived relation ri to the statistics module for optimizing and increasing the confidence level in the relation ri being considered. The derived relation ri is stored in the database for recognizing additional pairs pi+1, and relations ri+1. If the knowledge module determines that the relation exists in the set of relations Ri−1, the knowledge module terminates the mining process for that relation and proceeds to mine additional pairs and relations.
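The control flow described above can be sketched as follows. This is an illustration under stated assumptions, not the implementation of the invention: the class `MiningDatabase`, the functions `derive_candidates` and `mine_document`, the regular-expression method of deriving a relation as the short phrase between the two terms of a known pair, and the `min_support` threshold used as the statistics step are all hypothetical.

```python
import re
from collections import Counter


class MiningDatabase:
    """Holds the previously identified pairs and relations (hypothetical)."""

    def __init__(self, pairs, relations):
        self.pairs = set(pairs)
        self.relations = set(relations)
        self.counts = Counter()  # Sightings of not-yet-accepted relations.


def derive_candidates(document, pairs, max_words=4):
    """Yield (pair, phrase) where a known pair is joined by a short phrase."""
    middle = r"\s+((?:\w+\s+){1,%d})" % max_words
    for a, b in pairs:
        pattern = re.escape(a) + middle + re.escape(b)
        for match in re.finditer(pattern, document, flags=re.IGNORECASE):
            yield (a, b), match.group(1).strip().lower()


def mine_document(document, db, min_support=2):
    """Process one document against the stored pairs and relations."""
    accepted = []
    for pair, relation in derive_candidates(document, db.pairs):
        if relation in db.relations:
            # Knowledge step: the relation is already known, so move on.
            continue
        # Statistics step (assumed form): count sightings and accept the
        # relation once it has been seen `min_support` times.
        db.counts[relation] += 1
        if db.counts[relation] >= min_support:
            db.relations.add(relation)
            accepted.append(relation)
    return accepted


# Hypothetical seed data and documents.
db = MiningDatabase(
    pairs={("Dickens", "Oliver Twist"), ("Austen", "Emma")},
    relations={"is the author of"},
)
documents = [
    "Dickens wrote Oliver Twist in 1838.",
    "Austen wrote Emma.",
    "Austen is the author of Emma.",
]
results = [mine_document(doc, db) for doc in documents]
```

In this toy run the phrase "wrote" is accepted only after its second sighting, while "is the author of" is skipped because it is already stored. Recognizing additional pairs from the newly accepted relations is left out of the sketch.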
In one embodiment, the knowledge module includes one or more of the following units: a stemming unit, a synonym check unit, and a domain knowledge check unit. The stemming unit determines if the relation ri being analyzed shares a common root with a previously mined relation ri−1 in the database. The synonym check unit identifies the synonyms of the relation ri. The domain knowledge check unit considers the content of the document di for indications that would further clarify the relationship of the relations being mined. The statistics module optimizes and increases the confidence level in the relationship on the basis of the previous usage of the relations.
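A rough sketch of how the three units of the knowledge module might cooperate is given below. Everything in it is an assumption made for illustration: the crude suffix-stripping stemmer stands in for a real stemming algorithm, the `SYNONYMS` table is a hand-made example, and the domain check simply looks for hypothetical cue words in the document.

```python
import re

SUFFIXES = ("ing", "ed", "es", "s")

# Hypothetical synonym table mapping a word to a canonical form.
SYNONYMS = {"penned": "wrote", "authored": "wrote"}


def crude_stem(word):
    """Strip one common suffix, keeping at least three letters of the word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize_relation(relation):
    """Apply the synonym table, then the stemmer, to each word."""
    words = relation.lower().split()
    return tuple(crude_stem(SYNONYMS.get(word, word)) for word in words)


def is_known(relation, known_relations):
    """True if the relation matches a stored one after normalization."""
    key = normalize_relation(relation)
    return any(normalize_relation(known) == key for known in known_relations)


def domain_check(document, cue_words):
    """True if the document contains at least one domain cue word."""
    tokens = set(re.findall(r"\w+", document.lower()))
    return bool(tokens & set(cue_words))


known = {"published", "wrote"}
same_root = is_known("publishing", known)      # shares the root "publish"
same_meaning = is_known("penned", known)       # synonym of "wrote"
unrelated = is_known("illustrated", known)
in_domain = domain_check("A novel published in 1838", {"novel", "book"})
```

Applying the synonym table before stemming is a design choice of this sketch; it keeps the table keyed on surface words rather than on stems.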