The present invention relates to the field of data mining, and particularly to a software system and associated method for identifying a set of related information on the World Wide Web. More specifically, the present invention relates to the automatic and iterative mining of acronyms and their expansions through patterns of occurrences and formation rules using a duality concept.
The World Wide Web (WWW) is a vast and open communications network where computer users can access available data, digitally encoded documents, books, pictures, and sounds. With the explosive growth and diversity of WWW authors, published information is oftentimes unstructured and widely scattered. Although search engines play an important role in furnishing desired information to the end users, the organization of the information lacks structure and consistency. Web spiders crawl web pages and index them to serve the search engines. As the web spiders visit web pages, they could look for and learn pieces of information that would otherwise remain undetected.
Current search engines are designed to identify pages with specific phrases and offer limited search capabilities. For example, search engines cannot search for phrases that relate in a particular way, such as books and authors. Bibliometrics involves the study of the world of authorship and citations. It measures the co-citation strength, which is a measure of the similarity between two technical papers on the basis of their common citations. Statistical techniques are used to compute this measure. In typical bibliometric situations the citations and authorship are explicit and do not need to be mined. One of the limitations of bibliometrics is that it cannot be used to extract information buried in the text.
Exemplary bibliometric studies are reported in: R. Larson, "Bibliometrics of the World Wide Web: An Exploratory Analysis of the Intellectual Structure of Cyberspace," Technical report, School of Information Management and Systems, University of California, Berkeley, 1996. http://sherlock.sims.berkeley.edu/docs/asis96/asis96.html; K. McCain, "Mapping Authors in Intellectual Space: A Technical Overview," Journal of the American Society for Information Science, 41(6):433-443, 1990. A Dual Iterative Pattern Relation Expansion (DIPRE) method that addresses the problem of extracting (author, book) relationships from the web is described in S. Brin, "Extracting Patterns and Relations from the World Wide Web," WebDB, Valencia, Spain, 1998.
Another approach to identifying a set of related information on the World Wide Web is the Hyperlink-Induced Topic Search (HITS). HITS is a system that identifies authoritative web pages on the basis of the link structure of web pages. It iteratively identifies good hubs, that is, pages that point to good authorities, and good authorities, that is, pages pointed to by good hub pages. This technique has been extended to identify communities on the web, and to target a web crawler. One of HITS' limitations resides in the link topology of the pattern space, where the hubs and the authorities are of the same kind, i.e., they are all web pages. HITS is not defined on the text of web pages, in the form of phrases containing relations in specific patterns.
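The mutual reinforcement between hubs and authorities described above can be illustrated with a minimal sketch. It assumes the link structure is given as a dictionary from each page to the pages it points to; the function names `hits` and `normalize`, the fixed iteration count, and the Euclidean normalization are choices made for this illustration and are not taken from the cited studies.

```python
import math


def normalize(scores):
    """Scale scores to unit Euclidean length so they stay bounded."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {page: v / norm for page, v in scores.items()}


def hits(links, iterations=20):
    """Return (hub, authority) scores for a link graph.

    links: dict mapping a page to the list of pages it points to
    (an assumed input format for this sketch).
    """
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A good authority is pointed to by good hubs.
        new_auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                new_auth[q] += hub[p]
        # A good hub points to good authorities.
        new_hub = {p: sum(new_auth[q] for q in links.get(p, [])) for p in pages}
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return hub, auth
```

In this sketch both score tables are keyed by web pages, which reflects the limitation noted above: hubs and authorities are objects of the same kind, and nothing in the computation looks at phrases inside the page text.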
Exemplary HITS studies are reported in: D. Gibson et al., "Inferring Web Communities from Link Topology," HyperText, pages 225-234, Pittsburgh, Pa., 1998; J. Kleinberg, "Authoritative Sources in a Hyperlinked Environment," Proc. of 9th ACM-SIAM Symposium on Discrete Algorithms, May 1997; R. Kumar, "Trawling the Web for Emerging Cyber-Communities," published on the WWW at URL: http://www8.org/w8-papers/4a-search-mining/trawling/trawling.html as of Nov. 13, 1999; and S. Chakrabarti et al., "Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery," Proc. of The 8th International World Wide Web Conference, Toronto, Canada, May 1999.
The problem of information organization and lack of structure and consistency is further exacerbated in technical and other fields that are acronym driven. The diversity and non-uniformity in the use of acronyms would oftentimes obscure the understanding of the subject matter being described, unless clear expansions are provided to the readers.
There is therefore a great and still unsatisfied need for a software system and associated method for automatically identifying and mining acronym-expansion pairs on the World Wide Web, using the duality concept and strict formation rules for quality enhancement.
In accordance with the present invention, a computer program product is provided as an automatic mining system to identify a set of related information on the WWW using a duality concept. Duality problems arise, for example, when a user attempts to identify a pair of related phrases such as (book, author); (name, email); (acronym, expansion); or other similar relations. The mining system addresses the duality problems by iteratively refining mutually dependent approximations to their identifications. Specifically, the mining system iteratively refines (i) pairs of phrases related in a specific way; (ii) the patterns of their occurrences in web pages, i.e., the ways in which the related phrases are marked in the web pages; and (iii) the formation rules.
In one embodiment, the automatic mining system addresses a particular paradigmatic duality problem, namely identifying (acronym, expansion) pairs in terms of the patterns of their occurrences in the web pages. The solution to this problem involves two mutually dependent duality problems: the first is the duality between the related pairs and their patterns, and the second is the duality between the related pairs and the acronym formation rules. The automatic mining system runs iteratively, continuously and incrementally refining the sets of (acronym, expansion) pairs, patterns, and formation rules.
The automatic mining system is generally comprised of a database and three identifiers: a formation rule identifier, an acronym-expansion pair identifier, and a pattern identifier. The database contains the (acronym, expansion) pairs Ri−1 that have already been identified by the acronym-expansion pair identifier; the patterns Pi−1 that have already been identified by the pattern identifier; and the sets of formation rules that have already been identified by the formation rule identifier. Initially, the database begins with small seed sets of (acronym, expansion) pairs R0, patterns P0, and formation rules E0, which are continuously and iteratively broadened by the automatic mining system.
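The iterative broadening of the seed sets can be pictured with the following minimal sketch, which is an illustration under stated assumptions and not the claimed implementation. Web pages are assumed to be plain strings; a pattern is assumed to be a (middle, right) pair of delimiter strings surrounding an acronym that follows its expansion; and a single hypothetical formation rule, `initials_rule`, is held fixed rather than learned, so only the pair and pattern sets grow. All function names here are invented for the example.

```python
import re

WORDS = r"((?:[A-Za-z]+ )*[A-Za-z]+)"  # candidate expansion: a run of words
ACRO = r"([A-Z]{2,})"                  # candidate acronym: two or more capitals


def initials_rule(acronym, expansion):
    """Hypothetical seed rule: the acronym spells the initials of the expansion."""
    words = expansion.split()
    return len(words) == len(acronym) and all(
        w[0].upper() == c for w, c in zip(words, acronym))


def find_patterns(pages, pairs):
    """Pattern identifier: record the delimiters around known pairs."""
    patterns = set()
    for acronym, expansion in pairs:
        regex = re.compile(
            re.escape(expansion) + r"(.{1,6}?)" + re.escape(acronym) + r"(\W)")
        for page in pages:
            for m in regex.finditer(page):
                patterns.add((m.group(1), m.group(2)))
    return patterns


def find_pairs(pages, patterns, rules):
    """Pair identifier: apply patterns, keep candidates that satisfy a rule."""
    pairs = set()
    for middle, right in patterns:
        regex = re.compile(WORDS + re.escape(middle) + ACRO + re.escape(right))
        for page in pages:
            for m in regex.finditer(page):
                acronym = m.group(2)
                # Consider only as many trailing words as the acronym has letters.
                expansion = " ".join(m.group(1).split()[-len(acronym):])
                if any(rule(acronym, expansion) for rule in rules):
                    pairs.add((acronym, expansion))
    return pairs


def mine(pages, seed_pairs, seed_patterns, rules, max_iterations=5):
    """Alternate between the two identifiers until neither set grows."""
    pairs, patterns = set(seed_pairs), set(seed_patterns)
    for _ in range(max_iterations):
        new_patterns = patterns | find_patterns(pages, pairs)
        new_pairs = pairs | find_pairs(pages, new_patterns, rules)
        if new_pairs == pairs and new_patterns == patterns:
            break
        pairs, patterns = new_pairs, new_patterns
    return pairs, patterns
```

Starting from a single seed pair such as ("WWW", "World Wide Web"), a page containing "World Wide Web (WWW)" yields a parenthesis pattern, which in turn exposes further pairs written the same way, and those pairs may reveal additional patterns on other pages. The formation rule acts as the quality filter: a candidate that matches a pattern but does not satisfy any rule is discarded.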