Embodiments of the present invention relate to a system and method for categorizing documents and, more particularly, to document categorization that takes advantage of word length distribution analysis.
Automatic classification or categorization is an important function of a complete electronic document management system. It permits automatic filing of documents, in which scanned document images are automatically routed to directories that contain similar material. For example, the user may wish to automatically store newspaper articles with other newspaper articles and scientific journal articles with other scientific journal articles.
One known technique analyzes general visual features of document images and matches them to distributions from other documents to derive decisions. This technique, however, does not take text semantics into account. Systems embodying this technique are available from Documagix of San Jose, Calif. and Visioneer of Palo Alto, Calif.
Textual features are another possible basis for categorization. The document management system would search a newly scanned document for keywords associated with categories. This categorization procedure, however, requires optical character recognition (OCR), which does not operate well on degraded images. Also, this procedure requires that each document be easily classified by the keywords found within it.
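By way of illustration only, the keyword-based procedure described above may be sketched as follows. The category names, the keyword lists, and the function name categorize_by_keywords are hypothetical and are not drawn from any particular prior art system; the sketch assumes that OCR has already converted the scanned image to a text string.

```python
import re

# Hypothetical keyword lists; a real system would define its own.
CATEGORY_KEYWORDS = {
    "newspaper": {"reporter", "editorial", "headline"},
    "scientific": {"abstract", "hypothesis", "references"},
}

def categorize_by_keywords(ocr_text, category_keywords=CATEGORY_KEYWORDS):
    """Return the category whose keywords occur most often in the OCR text,
    or None when no keyword is found at all."""
    # Reduce the text to a set of lowercase alphabetic words.
    words = set(re.findall(r"[a-z]+", ocr_text.lower()))
    # Score each category by the number of its keywords present in the text.
    scores = {cat: len(words & kws) for cat, kws in category_keywords.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

In this sketch, a string containing misrecognized characters, such as "Abstrac7: we tes7 the hyp0thesis", matches no keyword and is left uncategorized, which is consistent with the sensitivity to degraded images noted above.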
Yet another possible technique utilizes character transition probabilities, i.e., given a particular character, the probability that another particular character follows it. This technique also relies on OCR to identify characters. One prior art system accepts a document and returns other documents relating to similar topics by comparing character transition probability distributions. This system retrieves only semantically similar documents, so that only documents on similar topics are classed together. This is a narrower categorization than is necessary in many applications that only require distinguishing among generic classes such as newspaper stories or scientific articles. Although highly accurate when working with high quality images, this technique is computationally intensive and slow.
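By way of illustration only, character transition probabilities of the kind described above may be estimated from OCR output as sketched below. The function names and the sum-of-absolute-differences comparison are assumptions made for this sketch; the comparison actually used by the prior art system is not specified here.

```python
from collections import Counter, defaultdict

def transition_probabilities(text):
    """Estimate P(next character | current character) from adjacent pairs."""
    pair_counts = defaultdict(Counter)
    for current, following in zip(text, text[1:]):
        pair_counts[current][following] += 1
    # Normalize each row of counts so that it sums to 1.
    return {
        current: {ch: n / sum(followers.values()) for ch, n in followers.items()}
        for current, followers in pair_counts.items()
    }

def distribution_distance(p, q):
    """Sum of absolute differences between two transition tables.
    This metric is an assumption of the sketch, chosen for simplicity."""
    total = 0.0
    for current in set(p) | set(q):
        row_p, row_q = p.get(current, {}), q.get(current, {})
        for ch in set(row_p) | set(row_q):
            total += abs(row_p.get(ch, 0.0) - row_q.get(ch, 0.0))
    return total
```

Under these assumptions, a smaller distance between the tables of two documents would indicate more similar character statistics. Each table may hold an entry for every pair of characters in the alphabet, which suggests why comparisons of this kind can be computationally intensive.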
What is needed is a document categorization system that is capable of classifying documents into broad categories and that is able to accept degraded images as input.