The present invention relates to the field of data mining, and particularly to a software system and associated methods for identifying relevant terms from a large text database of unstructured information, such as the World Wide Web (WWW). More specifically, the present invention relates to the automatic and iterative recognition of relevant terms by association mining and refinement of co-occurrences using hypertext link metadata.
The World Wide Web (WWW) is a vast and open communications network where computer users can access available data, digitally encoded documents, books, pictures, and sounds. With the explosive growth and diversity of WWW authors, published information is oftentimes unstructured and widely scattered. Although search engines play an important role in furnishing desired information to the end users, the organization of the information lacks structure and consistency. Web spiders crawl web pages and index them to serve the search engines. As web spiders visit web pages, they could look for and learn pieces of information that would otherwise remain undetected.
Current search engines are designed to identify pages with specific phrases and offer limited search capabilities. For example, search engines cannot search for phrases that relate in a particular way, such as books and authors. Bibliometrics involves the study of the world of authorship and citations. It measures the co-citation strength, which is a measure of the similarity between two technical papers on the basis of their common citations. Statistical techniques are used to compute these measures. In typical bibliometric situations the citations and authorship are explicit and do not need to be mined. One of the limitations of bibliometrics is that it cannot be used to extract information buried in the text.
Exemplary bibliometric studies are reported in: R. Larson, “Bibliometrics of the World Wide Web: An Exploratory Analysis of the Intellectual Structure of Cyberspace,” Technical report, School of Information Management and Systems, University of California, Berkeley, 1996. http://sherlock.sims.berkeley.edu/docs/asis96/asis96.html; K. McCain, “Mapping Authors in Intellectual Space: A Technical Overview,” Journal of the American Society for Information Science, 41(6):433-443, 1990. A Dual Iterative Pattern Relation Expansion (DIPRE) method that addresses the problem of extracting (author, book) relationships from the web is described in S. Brin, “Extracting Patterns and Relations from the World Wide Web,” WebDB, Valencia, Spain, 1998.
Another approach to identifying a set of related information on the World Wide Web is the Hyperlink-Induced Topic Search (HITS). HITS is a system that identifies authoritative web pages on the basis of the link structure of web pages. It iteratively identifies good hubs, that is, pages that point to good authorities, and good authorities, that is, pages pointed to by good hub pages. This technique has been extended to identify communities on the web, and to target a web crawler. One of HITS's limitations resides in the link topology of the pattern space, where the hubs and the authorities are of the same kind, i.e., they are all web pages. HITS is not defined over the text of web pages, that is, over phrases containing relations in specific patterns. Exemplary HITS studies are reported in: D. Gibson et al., “Inferring Web Communities from Link Topology,” HyperText, pages 225-234, Pittsburgh, Pa., 1998; J. Kleinberg, “Authoritative Sources in a Hyperlinked Environment,” Proc. of 9th ACM-SIAM Symposium on Discrete Algorithms, May 1997; R. Kumar, “Trawling the Web for Emerging Cyber-Communities,” published on the WWW at URL: http://www8.org/w8-papers/4a-search-mining/trawling/trawling.html as of Nov. 13, 1999; and S. Chakrabarti et al., “Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery,” Proc. of The 8th International World Wide Web Conference, Toronto, Canada, May 1999.
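By way of illustration only, the mutual reinforcement between hubs and authorities described above can be sketched as follows. This is a minimal sketch and not the implementation of any of the cited works: the function name `hits`, the dictionary representation of the link graph, the fixed iteration count, and the Euclidean normalization are all assumptions made for the example.

```python
from math import sqrt


def hits(links, iterations=50):
    """Hypothetical helper: iterate hub and authority scores.

    links maps each page to the pages it points to.
    Returns two dicts: (hub_scores, authority_scores).
    """
    # Collect every page that appears as a source or as a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # A good authority is pointed to by good hubs.
        new_auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                new_auth[dst] += hub[src]
        norm = sqrt(sum(v * v for v in new_auth.values())) or 1.0
        auth = {p: v / norm for p, v in new_auth.items()}

        # A good hub points to good authorities.
        new_hub = {p: sum(auth[d] for d in links.get(p, ())) for p in pages}
        norm = sqrt(sum(v * v for v in new_hub.values())) or 1.0
        hub = {p: v / norm for p, v in new_hub.items()}

    return hub, auth


# Toy graph (assumed data): pages "a" and "b" both point to page "c".
hub_scores, auth_scores = hits({"a": ["c"], "b": ["c"], "c": []})
```

In this toy graph, page "c" receives all of the authority weight while pages "a" and "b" share the hub weight equally. Both score sets are attached to web pages, which reflects the limitation noted above that hubs and authorities are of the same kind.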
Furthermore, not only is the quantity of WWW material increasing, but the types of digitized material are also increasing. For example, it is possible to store alphanumeric texts, data, audio recordings, pictures, photographs, drawings, images, video and prints. However, such large quantities of materials are of little value unless the desired information is readily retrievable. While, as discussed above, certain techniques have been developed for accessing certain types of textual materials, these techniques are at best moderately adequate for accessing graphic, audio or other specialized materials. Consequently, there are large bodies of published materials that remain inaccessible and thus unusable or significantly underutilized.
A common technique for accessing textual materials is by means of a “keyword” combination, generally with Boolean connections between the words or terms. This searching technique suffers from several drawbacks. First, the use of this technique is limited to text and is not usable for other types of material. Second, in order to develop a searchable database of terms, the host computer must usually download the entire documents, which is a time-consuming process, and does not normally provide an association between relevant terms.
Exemplary work in scalable data mining technology is described in the following references: R. Agrawal et al., “Mining Association Rules Between Sets of Items in Large Databases,” Proceedings of ACM SIGMOD Conference on Management of Data, pp. 207-216, Washington, D.C., May 1993; R. Agrawal et al., “Fast Algorithms for Mining Association Rules,” Proc. of the 20th Int'l Conference on VLDB, Santiago, Chile, September 1994; and S. Brin, “Extracting Patterns and Relations from the World Wide Web,” WebDB, Valencia, Spain, 1998, supra. Such work has been successfully applied to identify co-occurring patterns in many real world problems including market basket analysis, cross-marketing, store layout, and customer segmentation based on buying patterns.
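For illustration, the two standard association rule metrics used in this body of work, support and confidence, can be computed as sketched below. This is a minimal sketch under stated assumptions: the function names, the representation of each transaction as a Python set, and the market basket data are hypothetical and are not taken from the cited references.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = frozenset(itemset)
    matches = sum(1 for t in transactions if itemset <= t)
    return matches / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated probability of the consequent given the antecedent,
    i.e. support(antecedent and consequent) / support(antecedent)."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0  # the rule never fires, so report no confidence
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / base


# Assumed toy market baskets, one set of items per transaction.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

rule_support = support(baskets, {"bread", "milk"})          # 2 of 4 baskets
rule_confidence = confidence(baskets, {"bread"}, {"milk"})  # 2 of 3 bread baskets
```

A rule such as bread implies milk is typically retained only when both values exceed chosen thresholds; the thresholds themselves are domain dependent.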
Early work on applying association to texts can be found in the FACT system, described in R. Feldman et al., “Mining Associations in Text in the Presence of Background Knowledge,” Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, Aug. 2-4, 1996, Portland, Oreg. It discovers associations only among keywords representing the topics of the document. The FACT system assumes that a set of predefined keywords describing the document is available. Such an assumption might not be too unrealistic for a set of well annotated documents or for a class of documents for which a text categorization system automatically produces reasonably good annotations with keywords. However, the assumption generally does not hold true for WWW pages since a major portion of the WWW pages is not well annotated. Annotation of the WWW pages by general text categorization techniques can perform poorly, in that these techniques use natural language processing (NLP) that expects grammatically correct sentences, and WWW pages frequently consist of irregular sentences.
There is therefore a great and still unsatisfied need for a software system and associated methods for automatically identifying relevant terms on the World Wide Web. The system and methods enable the automatic and iterative recognition of relevant terms by association mining and refinement of co-occurrences using hypertext link metadata, such as link annotations.
In accordance with the present invention, a computer program product is provided as an automatic mining system to identify a set of relevant terms from a large text database of unstructured information, such as the World Wide Web (WWW), with a high degree of confidence.
One feature of the present invention is to design metrics that address the learning process of relevant terms by finding associations among terms that appear as link annotations, and to minimize the association errors resulting from one or more of the following sources:
False associations governed by the rules of association algorithms.
The unknowability of the optimal metric of significance for a domain.
The large amount of noise contained within the web pages. Reference is made to R. Agrawal, et al., “Mining Association Rules Between Sets of Items in Large Databases,” Proceedings of ACM SIGMOD Conference on Management of Data, pp. 207-216, Washington, D.C., May 1993.
The foregoing and other features and advantages can be accomplished by the present automatic mining system that includes a computer program product such as a software package, which is comprised of a metadata extractor, a document vector module, an association module, and a filtering module. The automatic mining system further includes a database for storing the mined sets of relevant terms. The set of relevant terms is continuously and iteratively broadened by the automatic mining system.
The automatic mining system allows the users to conduct searches expeditiously on all types of linked annotations. In order to automate the mining process, the system is provided with a novel metric that can be used to sift strongly relevant terms from the association mining result, as well as the standard metrics, confidence and support, used by the data mining community. To this end, the automatic mining system scans the downloaded hypertext link annotations in the downloaded pages, rather than the entire body of the documents for related information. As a result, the crawler is not required to provide a relatively lengthy download of the document content, and the automatic mining system minimizes the download and processing time.
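As a hedged illustration of scanning link annotations rather than full document bodies, the sketch below collects the text attached to each hyperlink in a page using Python's standard `html.parser` module. It assumes, for the purpose of the example only, that a link annotation is the anchor text of an `<a href=...>` element; the class name `LinkAnnotationExtractor` and the sample page are hypothetical and do not describe the metadata extractor of the present system.

```python
from html.parser import HTMLParser


class LinkAnnotationExtractor(HTMLParser):
    """Collect (href, anchor text) pairs, ignoring all other page text."""

    def __init__(self):
        super().__init__()
        self.annotations = []  # list of (href, annotation text)
        self._href = None      # href of the link currently open, if any
        self._parts = []       # text fragments seen inside that link

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._parts = []

    def handle_data(self, data):
        # Keep text only while inside a link that has an href.
        if self._href is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace in the collected anchor text.
            text = " ".join("".join(self._parts).split())
            self.annotations.append((self._href, text))
            self._href = None


# Assumed sample page content.
page = (
    '<p>See <a href="http://example.com/x">Data  Mining</a> and '
    '<a href="/y"><b>Web</b> crawlers</a> for background.</p>'
)
extractor = LinkAnnotationExtractor()
extractor.feed(page)
annotations = extractor.annotations
```

Only the terms inside the two links are retained; the surrounding paragraph text is never stored, which mirrors the reduced processing described above.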