1. Field of the Invention
The present invention relates generally to high-dimensional similarity searches, and more particularly to methods for classifying a large number of Web documents in a Web taxonomy.
2. Description of the Related Art
High-dimensional search is an important operation in multimedia databases that contain text documents, audio, and video. To facilitate such searching, database objects can be classified in a taxonomy that has a tree-like structure. For illustration, text documents are used herein as examples of such database objects, it being understood that the present invention applies equally to other genre of objects.
In developing a classification taxonomy, supervised learning can be used, wherein a few training documents initially are assigned to the various nodes of a taxonomy and subsequent documents are then classified based on comparisons with the training documents. Generally, a document will be classified at a leaf node in a taxonomy when the leaf node contains training documents that are xe2x80x9cclosestxe2x80x9d to the document sought to be classified. For example, in so-called Bayesian classifiers, each node xe2x80x9ccxe2x80x9d (also referred to herein as a xe2x80x9cclassxe2x80x9d or xe2x80x9cclassificationxe2x80x9d) in a taxonomy tree has an associated document model that is defined by the training documents. When a test document xe2x80x9cdxe2x80x9d is to be classified, a posterior probability that measures the likelihood that the test document xe2x80x9cdxe2x80x9d could have been generated by the class xe2x80x9ccxe2x80x9d, denoted Pr[c|d], is determined for each class xe2x80x9ccxe2x80x9d. The test document is classified as belonging to the class xe2x80x9ccxe2x80x9d having the highest posterior probability for that document.
In undertaking the classification process, Bayesian classifiers collect term occurrences and estimate statistical parameters xcex8(c,t) which are measures of the fractional rate at which the term xe2x80x9ctxe2x80x9d occurs in the class xe2x80x9ccxe2x80x9d. For large data sets, e.g., databases of Web pages, the parameters xcex8(c,t) cannot be cached entirely in local memory, but must instead be stored on, e.g., a disk or other storage device. As will be appreciated, data accesses to disks and other such storage devices consume a great deal of time, compared to data accesses to local memory.
Thus, in contrast to an index probe being routed down a so-called xe2x80x9cB-Treexe2x80x9d or xe2x80x9cR-Treexe2x80x9d, the above-discussed high-dimensional classification search operation cannot cache the comparison data that is to be used at each node in memory. Instead, in the present high-dimensional application both the probe and the comparison models at each node are large complex objects, and each step in the classification decision process consequently requires an xe2x80x9coutsidexe2x80x9d data access, typically to a disk. As recognized by the present invention, it would be advantageous to optimize such xe2x80x9coutsidexe2x80x9d data accesses for the case of high-dimensional operations, particularly in contexts, such as classifying millions of Web pages, which require near-constant classification of a seemingly endless supply of large documents.
The invention is a general purpose computer programmed according to the inventive steps herein to classify documents in a taxonomy. The invention can also be embodied as an article of manufacturexe2x80x94a machine componentxe2x80x94that is used by a digital processing apparatus and which tangibly embodies a program of instructions that are executable by the digital processing apparatus to undertake the logic disclosed herein. This invention is realized in a critical machine component that causes a digital processing apparatus to undertake the inventive logic herein.
In accordance with the present invention, the computer includes a data storage device including a computer usable medium having computer readable code means for document classification. The code means include computer readable code means for establishing plural data tables, with the data tables including a taxonomy table containing data related to a classification taxonomy. Also, one of the tables contains data representing statistics related to occurrences of terms in nodes of the taxonomy. Computer readable code means receive documents, and computer readable code means are provided for classifying the documents with respect to the taxonomy by undertaking at least one table join using the plural data tables and the document, such that data access is optimized.
In a preferred embodiment, the classifying means requires no random data input/output (I/O) access, nor does it require in-place table updates. In this way, redundant probes for terms that occur in many documents are eliminated.
In a particularly preferred embodiment, an inner table join and a left outer table join are executed by the classifying means. The table joins can be represented by the expression (using Bayesian notation) logprior[c]+xcexa3t∈d∩cfreq[d,t](logxcex8[c,t]+logdenom[c])xe2x88x92logdenom[c]xcexa3t∈dfreq[d,t], wherein d represents the document, t represents at least one term, and c represents a taxonomy.
To enhance the effectiveness of the above-summarized join operations, computer readable code means can process a group of documents using the means for classifying by testing all documents in the group at a test node in the taxonomy, prior to testing any document at a node other than the test node. Thus, entire taxonomy nodes at a time are processed. For such bulk processing, the data tables include an expand table and a result table, and the means for processing the group of documents recursively deletes rows in the expand table and fills the rows with entries from the result table. Moreover, the means for classifying executes a left outer table join, and the result table is populated by results from the left outer table join.
In another aspect, a computer system for wave-front classification processing of a group of documents relative to a taxonomy includes a computer including program structure that processes the group of documents by testing all documents in the group at a test node in the taxonomy, prior to testing any document at a node other than the test node.
In yet another aspect, a computer-implemented method is disclosed for classifying at least one group of documents with respect to a taxonomy while optimizing I/O access. The method includes establishing plural data tables including a taxonomy table containing data related to the taxonomy and a statistic table containing data representing statistics related to occurrences of terms in nodes of the taxonomy. Documents in the group of documents are classified with respect to the taxonomy by executing an inner table join and an outer table join. In accordance with present principles, at least one of the joins uses elements from the taxonomy table and the statistic table, and all documents in the group are tested at a test node in the taxonomy prior to testing any document at a node other than the test node.
In still another aspect, the above-summarized method is stored as a computer program on a computer program device. The program device includes a computer program storage device that is readable by a digital processing apparatus, and the program is on the program storage device. The program includes instructions that can be executed by the digital processing apparatus for performing the method steps of the present invention.
The details of the present invention, both as to its structure and operation, can best be understood in reference to the accompanying drawings, in which like reference numerals refer to like parts, and in which: