Advances in computer technology have provided users with numerous options for creating electronic documents. For example, many common software applications executable on a typical personal computer enable users to generate various types of useful electronic documents. Electronic documents can also be obtained from image acquisition devices such as scanners or digital cameras, or they can be read into memory from a data storage device (e.g., in the form of a file). Modern computers enable users to electronically obtain or create vast numbers of documents varying in size, subject matter, and format. Such documents may be located, for example, on personal computers, networks, or other storage media.
As the number of electronic documents grows, the ability to classify and manage such documents (e.g., through the use of networks such as the Internet) becomes increasingly important. Document classification enables users to more easily locate related documents and is an important step in a variety of document processing tasks such as archiving, indexing, re-purposing, data extraction, and other automated document management operations. In general, document classification involves assigning a document to one or more sets or classes of documents with which it has commonality, usually as a consequence of shared topics, concepts, ideas, and subject areas.
A variety of document classification engines or algorithms have been developed in recent years. Performance varies from engine to engine, but is generally limited because computers have historically been poor at performing heuristic tasks. Common commercial document classifier engines use a single technology and approach for solving the problem of classifying documents. Expert users can tune these classifier engines to obtain better results, but this requires significant training of such engines using high-quality example documents.