The subject application relates to scanned document type classification. While the systems and methods described herein relate to identifying an original document type for a scanned document, it will be appreciated that the described techniques may find application in other classification systems, other xerographic applications, and/or other document analysis systems.
A large portion of office documents are generated using slideshow presentation applications (e.g., PowerPoint, etc.), word processing tools (e.g., Word, WordPerfect, etc.), and spreadsheet applications (e.g., Excel, etc.). For scanned documents, the original document types (e.g., slideshow, spreadsheet, word-processing) are generally unknown. However, document type information can be useful for many applications, particularly in scanning services. For example, document type information can be used in a database system as a search key. Document type identification can also be applied to guide next-level categorization, recognition, and processing. For instance, word-processing documents may be further recognized (e.g., at the next level) as an office memo, resume, letter, journal article, etc. Spreadsheet documents can be sent for further data extraction. Slideshow slides, which are usually generated with templates, can be efficiently compressed by exploiting page-to-page correlation. Conventional scanning and electronic document storage systems do not provide such information, especially when the electronic document is generated by scanning a paper hard copy of the document.
Accordingly, there is an unmet need for systems and/or methods that facilitate scanned document type classification while overcoming the aforementioned deficiencies.