1. Technical Field
The invention relates to automatic document classification. More particularly, the invention relates to a method and apparatus for automatic document classification using document clustering and document sketch techniques.
2. Description of the Prior Art
Typically, document similarities are measured based on the content overlap between the documents. Such approaches do not permit efficient similarity computations. Thus, it would be advantageous to provide an approach that performs such measurements in a computationally efficient manner.
Documents come in varying sizes and formats. The large size and many formats of the documents make any computation on them very inefficient. Comparing two documents is a frequently performed computation. Therefore, it would be useful to compute a fingerprint or a sketch of a document that satisfies at least the following requirements:
    It is unique in the document space. Only identical documents share the same sketch.
    The sketch is small, thereby allowing efficient computations such as similarity and containment.
    Its computation is efficient.
    It can be efficiently computed on a collection of documents (or sketches).
    The sketch admits partial matches between documents. For example, a 60% similarity between two sketches implies 60% similarity between the underlying documents.
There are known algorithms that compute document fingerprints. Broder's implementation, which is based on document shingles, is a widely used algorithm (see Andrei Z. Broder, Some applications of Rabin's fingerprinting method, in Renato Capocelli, Alfredo De Santis, and Ugo Vaccaro, editors, Sequences II: Methods in Communications, Security, and Computer Science, pages 143-152, Springer-Verlag, 1993). This algorithm is very effective when computing near similarity or total containment of documents. It is not very effective, however, when the documents being compared overlap with one another to varying degrees, in which case similarities of varying degrees must be computed. To this end, it would be desirable to provide a method for computing document sketches that, among other requirements, allows for effective and efficient similarity computations.
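By way of illustration only, the shingle-based comparison referred to above can be sketched roughly as follows. This is a simplified example and not Broder's full method: it omits Rabin fingerprinting and any sampling of the shingle set, and it is not the method of the invention. The function names, the sample documents, and the window size w=3 are assumptions chosen for the example.

```python
def shingles(text, w=3):
    """Return the set of contiguous w-word sequences (shingles) in text."""
    words = text.lower().split()
    if len(words) < w:
        # A document shorter than one window yields a single short shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def resemblance(a, b):
    """Fraction of shingles shared by both sets, relative to their union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def containment(a, b):
    """Fraction of the shingles of a that also appear in b."""
    return len(a & b) / len(a) if a else 1.0


# Hypothetical sample documents used only for this illustration.
doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox sleeps under the lazy dog"
doc_c = "quick brown fox jumps"

sa, sb, sc = shingles(doc_a), shingles(doc_b), shingles(doc_c)

print("resemblance(a, b):", resemblance(sa, sb))
print("containment(a in b):", containment(sa, sb))
print("containment(c in a):", containment(sc, sa))
```

In this example the full shingle sets are compared directly, so the cost grows with document length. A practical sketch would retain only a small, fixed-size summary of each set, which is the property called for in the requirements listed above.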