1. Field of the Invention
This invention relates in general to computer-implemented classifiers, and, in particular, to enhanced hypertext categorization using hyperlinks.
2. Description of Related Art
The Internet is a collection of computer networks that exchange information via Transmission Control Protocol/Internet Protocol ("TCP/IP"). The Internet consists of many internet networks, each of which is a single network that uses the TCP/IP protocol suite. Currently, the use of the Internet for commercial and non-commercial purposes is exploding. Via its networks, the Internet enables many users in different locations to access information stored in databases at different locations.
The World Wide Web (also known as "WWW" or the "Web") is a facility on the Internet that links documents. The Web is a hypertext information and communication system used on the Internet computer network with data communications operating according to a client/server model. Typically, Web clients will request data stored in databases from Web servers, which are connected to the databases. The Web servers will retrieve the data and transmit the data to the clients. With the fast growing popularity of the Internet and the Web, there is also a fast growing demand for Web access to databases.
The Web operates using the HyperText Transfer Protocol (HTTP) and the HyperText Markup Language (HTML). This protocol and language result in the communication and display of graphical information that incorporates hyperlinks (also called "links"). Hyperlinks are network addresses that are embedded in a word, phrase, icon or picture and that are activated when the user selects a highlighted item displayed in the graphical information. HTTP is the protocol used by Web clients and Web servers to communicate between themselves using these hyperlinks. HTML is the language used by Web servers to create and connect together documents that contain these hyperlinks.
As the total amount of accessible information on the Web increases, locating specific items of information within the totality becomes increasingly difficult. The format in which the accessible information is arranged affects the level of difficulty in locating specific items of information within the totality. For example, searching through vast amounts of information arranged in a free-form format can be substantially more difficult and time consuming than searching through information arranged in a pre-defined order, such as by topic, date, category, or the like. However, due to the nature of certain on-line systems, such as the Internet, much of the accessible information is placed on-line in the form of free-format text.
Search schemes employed to locate specific items of information among the on-line information content typically depend upon the presence or absence of key words (words included in the user-entered query) in the searchable text. Such search schemes identify those textual information items that include (or omit) the key words. However, in systems such as the Web, where the total information content is relatively large and free-form, key word searching can be problematic. For example, it can result in the identification of numerous text items that contain (or omit) the selected key words, but which are not relevant to the actual subject matter to which the user intended to direct the search.
As text repositories grow in number and size and global connectivity improves, there is a pressing need to support efficient and effective information retrieval (IR), searching and filtering. Some conventional systems manage information complexity on the Internet or in database structures, typically by using hierarchy structures. A hierarchy could be any directed acyclic graph, but, for purposes of simplifying the description, the present disclosure discusses hierarchies primarily in the form of trees.
Many Internet directories, such as Yahoo!™ (http://www.yahoo.com), are organized in preset hierarchies. International Business Machines Corporation has implemented a patent database (http://patent.womplex.ibm.com) that is organized by the PTO class codes, which form a preset hierarchy.
Taxonomies can provide a means for designing vastly enhanced searching, browsing and filtering systems. For example, they can be used to relieve the user from the burden of sifting specific information from the large and low-quality response of most popular search engines. Search querying with respect to a taxonomy can be more reliable than search schemes that depend only on presence or absence of specific key words in all of the searchable documents. By the same token, multicast systems such as PointCast (http://www.pointcast.com) are likely to achieve higher quality by registering a user profile in terms of classes in a taxonomy rather than key words.
Some conventional systems use text-based classifiers to classify documents. A text-based classifier classifies the documents based only on the text contained in the documents. However, documents on the Web typically contain hyperlinks. These hyperlinks are ignored by text-based classifiers, although the hyperlinks contain useful information for classification. Text classification without hyperlinks has been extensively studied. Some experiments show that classification techniques that are designed for, and perform well on, text often perform poorly on heterogeneous hyperlinked corpora such as the Web. Valuable information in the vicinity of a hyperlinked document is lost upon a purely-text-based classifier. Existing text classification research pays no heed to the vital information latent in such hyperlink structure.
So far, classifiers that use a statistical document model capture only the distribution of terms in documents. Information from other documents has been used in the form of term expansion: using associations over the entire corpus, terms in a document can be padded with strongly associated terms before running the document through a text-based classifier, as discussed in U.S. Pat. No. 5,325,298, for "Methods for Generating or Revising Context Vectors for a plurality of Word Stems", issued to S. Gallant, on Jun. 28, 1994, which is incorporated by reference herein.
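The general idea of term expansion can be illustrated with a minimal sketch. It is not the method of the patent cited above: simple document-level co-occurrence counting stands in for whatever association measure a real system would use, and the names `build_associations`, `expand_terms`, and `min_count` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_associations(corpus, min_count=2):
    """Map each term to the terms it co-occurs with in at least
    `min_count` documents. Co-occurrence counting is an assumed,
    simplified stand-in for a corpus-wide association measure."""
    pair_counts = Counter()
    for doc in corpus:
        # Count each unordered pair of distinct terms once per document.
        for a, b in combinations(sorted(set(doc)), 2):
            pair_counts[(a, b)] += 1
    assoc = {}
    for (a, b), n in pair_counts.items():
        if n >= min_count:
            assoc.setdefault(a, set()).add(b)
            assoc.setdefault(b, set()).add(a)
    return assoc


def expand_terms(doc, assoc):
    """Pad a document's terms with strongly associated terms it lacks."""
    extra = set()
    for term in doc:
        extra |= assoc.get(term, set())
    return list(doc) + sorted(extra - set(doc))


corpus = [
    ["antenna", "radio"],
    ["antenna", "radio", "wave"],
    ["wave", "ocean"],
]
assoc = build_associations(corpus)
expanded = expand_terms(["antenna"], assoc)
```

The expanded term list would then be passed to a text-based classifier in place of the original terms. Note that the expansion draws only on term statistics, not on any link structure between documents.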
Classification of entities from relational data is well-explored in the machine learning and data mining literature, which is discussed further in L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, "Classification and Regression Trees", Wadsworth and Brooks/Cole, 1984; R. A. M. Mehta and J. Rissanen, "SLIQ: A Fast Scalable Classifier for Data Mining", Proc. of the Fifth International Conference on Extending Database Technology, Avignon, France, March 1996; M. M. J. C. Shafer, R. Agrawal, "SPRINT: A Scalable Parallel Classifier for Data Mining", Proc. of the 22nd International Conference on Very Large Databases, Bombay, India, September 1996; all of which are incorporated by reference herein. These typically work on a single relational table giving attributes of each entity (e.g., a customer).
Little appears to have been done about relationships between entities within the machine learning and data mining literature. In the case in which entities are patients, relationships could be useful in, say, diagnosing diseases. This is a very difficult problem in general. A few specific situations have been handled in the inductive logic programming literature, as discussed further in R. Quinlan, "Learning Logical Definitions from Relations", Machine Learning, 5(3):239-266, 1990; S. Muggleton and C. Feng, "Efficient Induction of Logic Programs", Proc. of the Workshop on Algorithmic Learning Theory, Japanese Society for Artificial Intelligence, 1990; M. Pazzani, C. Brunk, and G. Silverstein, "A Knowledge-intensive Approach to Learning Relational Concepts", Machine Learning: Proc. of the Eighth International Workshop (ML91), Ithaca, N.Y., 1991; N. Lavrac and S. Dzeroski, "Inductive Logic Programming: Techniques and Applications", Chichester, 1994 [hereinafter "Lavrac and Dzeroski"]; all of which are incorporated by reference herein. However, these techniques have not yet been applied to large-scale data mining scenarios.
Another difference lies in dimensionality, as discussed in S. Chakrabarti, B. Dom, R. Agrawal, and P. Raghavan, "Using Taxonomy, Discriminants, and Signatures for Navigating in Text Databases", VLDB, Athens, Greece, August 1997, [hereinafter "Chakrabarti, Dom, Agrawal, and Raghavan"]. Decision-tree classifiers handle up to hundreds of features or attributes, whereas text corpora often have a lexicon in the hundreds of thousands.
In other conventional systems, image processing and pattern recognition are used. Edge detection is an important operation in image analysis. The goal is to "classify" each pixel as part of some edge in the image, or an interior region. Thus, there are two classes. First, a spatial differentiation is applied, e.g., from each pixel, the average of its four neighbors is subtracted. This typically results in a noisy edge map. Continuity indicates that the class of a pixel depends on both its own brightness in the differentiated image and the class of nearby pixels.
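The spatial differentiation step described above can be sketched as follows. This is an illustrative sketch only: the function name `four_neighbor_difference` is hypothetical, and the treatment of border pixels (averaging only the neighbors that exist) is an assumption, since the text does not say how borders are handled.

```python
def four_neighbor_difference(image):
    """Subtract from each pixel the average of its up, down, left and
    right neighbors. `image` is a list of equal-length rows of numbers.
    Border handling is an assumption: only existing neighbors are averaged."""
    rows, cols = len(image), len(image[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            neighbors = [
                image[nr][nc]
                for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                if 0 <= nr < rows and 0 <= nc < cols
            ]
            if neighbors:  # a 1x1 image has no neighbors; leave 0.0
                out[r][c] = image[r][c] - sum(neighbors) / len(neighbors)
    return out


# A vertical step edge between the second and third columns.
step = [[0, 0, 10, 10] for _ in range(3)]
diff = four_neighbor_difference(step)
```

In this example the differentiated values are nonzero only in the two columns adjacent to the step, which is where an edge classifier would look. On a real image, noise produces many small nonzero responses, which is why the class of nearby pixels is also taken into account.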
Hypertext classification is a generalization of this pattern recognition scenario. In particular, there may be many classes, including topic taxonomies; an average document has many more neighbors than a pixel does; and the neighbors' classes provide information in an indirect way (pixels near edge pixels are more likely to be edge pixels, whereas patents citing patents on antennas can also be about transmitters). In pattern recognition, pixel classification is studied in the more general context of relaxation labeling, whose convergence properties are studied using Markov random fields, as discussed further in R. A. Hummel and S. W. Zucker, "On the Foundations of Relaxation Labeling Processes", IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-5(3):267-287, May 1983, [hereinafter "Hummel and Zucker"]; L. Pelkowitz, "A Continuous Relaxation Labeling Algorithm for Markov Random Fields", IEEE Transactions on Systems, Man and Cybernetics, 20(3):709-715, May 1990, [hereinafter "Pelkowitz"]; W. Dent and S. S. Iyengar, "A New Probabilistic Relaxation Scheme and Its Application to Edge Detection", IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(4):432-437, April 1996, [hereinafter "Dent and Iyengar"]; all of which are incorporated by reference herein.
There is a need in the art for an improved classifier that can classify documents containing hyperlinks.
To overcome the limitations in the prior art described above, and to overcome other limitations that will become apparent upon reading and understanding the present specification, the present invention discloses a method, apparatus, and article of manufacture for a computer-implemented hypertext classifier.
In accordance with the present invention, a new document containing citations to and from other documents is classified. Initially, documents within a neighborhood of the new document are identified. For each document and each class, an initial probability is determined that indicates the probability that the document fits a particular class. Next, iterative relaxation is performed to identify a class for each document using the initial probabilities. A class is selected into which the new document is to be classified based on the initial probabilities and identified classes.
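The flow of these steps can be illustrated with a simplified sketch. The update rule below (a weighted blend of each document's initial probabilities with the average of its neighbors' current probabilities) is an assumption chosen for brevity and is not asserted to be the update used by the invention. The names `relax`, `classify`, `weight`, and `iterations` are hypothetical, and the initial probabilities are assumed to come from some text-based classifier.

```python
def relax(initial, neighbors, iterations=10, weight=0.5):
    """Iteratively refine class probabilities for every document.

    initial:   {doc: {cls: probability}} from a text-only classifier (assumed)
    neighbors: {doc: [docs it cites or is cited by]}
    weight:    how much neighbor evidence counts (assumed blending factor)
    """
    current = {doc: dict(probs) for doc, probs in initial.items()}
    for _ in range(iterations):
        updated = {}
        for doc, prior in initial.items():
            nbrs = [n for n in neighbors.get(doc, []) if n in current]
            scores = {}
            for cls, p in prior.items():
                if nbrs:
                    support = sum(current[n].get(cls, 0.0) for n in nbrs) / len(nbrs)
                else:
                    support = p  # no neighbors: fall back to the text evidence
                scores[cls] = (1 - weight) * p + weight * support
            total = sum(scores.values())
            updated[doc] = (
                {cls: s / total for cls, s in scores.items()} if total > 0 else dict(prior)
            )
        current = updated  # synchronous update of all documents
    return current


def classify(doc, initial, neighbors, **kwargs):
    """Select the most probable class for `doc` after relaxation."""
    probs = relax(initial, neighbors, **kwargs)[doc]
    return max(probs, key=probs.get)


# The text alone slightly favors class "b" for the new document,
# but both of its neighbors are confidently class "a".
initial = {
    "new": {"a": 0.45, "b": 0.55},
    "n1": {"a": 0.9, "b": 0.1},
    "n2": {"a": 0.9, "b": 0.1},
}
neighbors = {"new": ["n1", "n2"], "n1": ["new"], "n2": ["new"]}
label = classify("new", initial, neighbors)
```

In this example the neighborhood evidence overturns the text-only decision for the new document, while a document with no neighbors keeps its text-only class. A larger radius of influence would correspond to including neighbors of neighbors in the `neighbors` map.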
An object of the invention is to provide a classifier for classifying documents that contain hyperlinks. Another object of the invention is to provide a classifier for classifying documents on the Web using any radius of influence.