1. Field of the Invention
The present invention relates to the field of data mining, and particularly to a software system and associated method for identifying a set of related information on the World Wide Web. More specifically, the present invention relates to the automatic and iterative mining and refinement of patterns of occurrences and relations using a duality concept.
2. Description of Related Art
The World Wide Web (WWW) is a vast and open communications network where computer users can access available data, digitally encoded documents, books, pictures, and sounds. With the explosive growth and diversity of WWW authors, published information is often unstructured and widely scattered. Although search engines play an important role in furnishing desired information to end users, the organization of the information lacks structure and consistency. Web spiders crawl web pages and index them to serve the search engines. As the web spiders visit web pages, they could look for and learn pieces of information that would otherwise remain undetected.
Current search engines are designed to identify pages with specific phrases and offer limited search capabilities. For example, search engines cannot search for phrases that relate in a particular way, such as books and authors. Bibliometrics involves the study of the world of authorship and citations. It measures the co-citation strength, which is a measure of the similarity between two technical papers on the basis of their common citations. Statistical techniques are used to compute these measures. In typical bibliometric situations the citations and authorship are explicit and do not need to be mined. One of the limitations of bibliometrics is that it cannot be used to extract information buried in the text.
Exemplary bibliometric studies are reported in: R. Larson, “Bibliometrics of the World Wide Web: An Exploratory Analysis of the Intellectual Structure of Cyberspace,” Technical report, School of Information Management and Systems, University of California, Berkeley, 1996. http://sherlock.sims.berkeley.edu/docs/asis96/asis96.html; K. McCain, “Mapping Authors in Intellectual Space: A Technical Overview,” Journal of the American Society for Information Science, 41(6):433-443, 1990. A Dual Iterative Pattern Relation Expansion (DIPRE) method that addresses the problem of extracting (author, book) relationships from the web is described in S. Brin, “Extracting Patterns and Relations from the World Wide Web,” WebDB, Valencia, Spain, 1998.
Another approach to identifying a set of related information on the World Wide Web is the Hyperlink-Induced Topic Search (HITS). HITS is a system that identifies authoritative web pages on the basis of the link structure of web pages. It iteratively identifies good hubs, that is, pages that point to good authorities, and good authorities, that is, pages pointed to by good hub pages. This technique has been extended to identify communities on the web, and to target a web crawler. One of the limitations of HITS resides in the link topology of the pattern space, where the hubs and the authorities are of the same kind, i.e., they are all web pages. HITS is not defined over the text of web pages, in the form of phrases containing relations in specific patterns. Exemplary HITS studies are reported in: D. Gibson et al., “Inferring Web Communities from Link Topology,” HyperText, pages 225-234, Pittsburgh, Pa., 1998; J. Kleinberg, “Authoritative Sources in a Hyperlinked Environment,” Proc. of 9th ACM-SIAM Symposium on Discrete Algorithms, May 1997; R. Kumar, “Trawling the Web for Emerging Cyber-Communities,” published on the WWW at URL: http://www8.org/w8-papers/4a-search-mining/trawling/trawling.html as of Nov. 13, 1999; and S. Chakrabarti et al., “Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery,” Proc. of The 8th International World Wide Web Conference, Toronto, Canada, May 1999.
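By way of illustration only, the mutually reinforcing hub and authority computation described above may be sketched as follows. This is a minimal sketch and not the implementation of any of the cited studies: the function names `hits` and `normalize`, the dictionary-based link graph, the fixed iteration count, and the Euclidean normalization are all assumptions made for this example.

```python
from math import sqrt


def normalize(scores):
    """Scale scores to unit Euclidean length (an assumed normalization)."""
    norm = sqrt(sum(v * v for v in scores.values())) or 1.0
    return {page: v / norm for page, v in scores.items()}


def hits(links, iterations=50):
    """Hypothetical helper: `links` maps each page to the pages it points to."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    hub = {page: 1.0 for page in pages}
    auth = {page: 1.0 for page in pages}

    for _ in range(iterations):
        # A good authority is pointed to by good hubs.
        new_auth = {page: 0.0 for page in pages}
        for source, targets in links.items():
            for target in targets:
                new_auth[target] += hub[source]

        # A good hub points to good authorities.
        new_hub = {
            page: sum(new_auth[target] for target in links.get(page, ()))
            for page in pages
        }

        auth = normalize(new_auth)
        hub = normalize(new_hub)

    return hub, auth


# Toy link graph (invented for illustration): h1 and h2 link to a1, h1 also links to a2.
toy_links = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
hub_scores, auth_scores = hits(toy_links)
print(max(auth_scores, key=auth_scores.get))
print(max(hub_scores, key=hub_scores.get))
```

Both score tables in this sketch are keyed by web pages, which reflects the limitation noted above: the hubs and the authorities are of the same kind.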
There is therefore a great and still unsatisfied need for a software system and associated method for automatically identifying and mining sets of related information on the World Wide Web, using the duality concept for quality enhancement.
In accordance with the present invention, a computer program product is provided as an automatic mining system to identify a set of related information on the WWW, with a high degree of confidence, using a duality concept. Duality problems arise, for example, when a user attempts to identify a pair of related phrases such as (book, author); (name, email); (acronym, expansion); or other similar relations. The mining system addresses the duality problems by iteratively refining mutually dependent approximations to their identifications. Specifically, the mining system iteratively refines (i) pairs of terms that are related in a specific way, and (ii) the patterns of their occurrences in web pages, i.e., the ways in which the related phrases are marked in the web pages. The automatic mining system runs in an iterative fashion for continuously and incrementally refining the relations and patterns.
The automatic mining system includes a computer program product such as a software package, which is generally comprised of a database and two identifiers: a relation identifier and a pattern identifier. The database contains the previously identified pairs or sets of relations Ri−1 that have been identified by the relation identifier, and the set of patterns Pi−1 that have already been identified by the pattern identifier. Initially, the database begins with small seed sets of relations R0 and patterns P0 that are continuously and iteratively broadened by the automatic mining system.
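By way of illustration only, the interplay between the database, the relation identifier, and the pattern identifier may be sketched as follows. This is a toy sketch under stated assumptions and not the claimed implementation: the function names `identify_patterns`, `identify_relations`, and `mine` are hypothetical, a pattern is assumed to be the (prefix, middle, suffix) text surrounding a known pair on a single line, pages are assumed to be plain strings, and the seed pattern set P0 is assumed to be empty in the usage example.

```python
import re


def identify_patterns(pages, relations):
    """Toy pattern identifier: derive (prefix, middle, suffix) from known pairs."""
    patterns = set()
    for page in pages:
        for line in page.splitlines():
            for first, second in relations:
                i = line.find(first)
                if i < 0:
                    continue
                j = line.find(second, i + len(first))
                if j <= i + len(first):
                    continue  # second term missing, or nothing between the terms
                patterns.add((line[:i], line[i + len(first):j], line[j + len(second):]))
    return patterns


def identify_relations(pages, patterns):
    """Toy relation identifier: extract pairs from lines that fit a known pattern."""
    relations = set()
    for prefix, middle, suffix in patterns:
        regex = re.compile(
            re.escape(prefix) + "(.+?)" + re.escape(middle) + "(.+?)" + re.escape(suffix)
        )
        for page in pages:
            for line in page.splitlines():
                match = regex.fullmatch(line)
                if match:
                    relations.add((match.group(1), match.group(2)))
    return relations


def mine(pages, seed_relations, seed_patterns=frozenset(), max_iterations=10):
    """Broaden the stored sets (R, P) until neither set grows any further."""
    relations, patterns = set(seed_relations), set(seed_patterns)
    for _ in range(max_iterations):
        new_patterns = patterns | identify_patterns(pages, relations)
        new_relations = relations | identify_relations(pages, new_patterns)
        if new_patterns == patterns and new_relations == relations:
            break
        relations, patterns = new_relations, new_patterns
    return relations, patterns


# Invented sample pages and a single seed pair, for illustration only.
sample_pages = [
    "<li><b>Hamlet</b> by William Shakespeare</li>\n<li><b>Emma</b> by Jane Austen</li>",
    "Title: Hamlet; Author: William Shakespeare\nTitle: Ulysses; Author: James Joyce",
]
found_relations, found_patterns = mine(sample_pages, {("Hamlet", "William Shakespeare")})
print(sorted(found_relations))
```

In this sketch the single seed pair yields two patterns, one per page layout, and those patterns in turn yield the two further (book, author) pairs. A practical system would also need to score patterns and relations to keep overly general patterns from admitting false pairs; that step is omitted here.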