The present invention relates generally to batch scanning and, in particular, to a method and apparatus for discriminating between documents in batch scanned document files.
Document scanners are well known in the art. Generally, document scanners are devices that optically scan a printed page to provide data representative of the scanned page (e.g., a scan file) which data can be stored and/or manipulated, typically by a computer.
Batch scanners, a particular type of document scanner, are becoming more prevalent. In a batch scanner, several documents are gathered together and scanned into a computer all at once. However, unless techniques are employed to discriminate between documents during the scanning process, the scan data resulting from the scanning of the documents will be encompassed by a single scan file. As a result, users of the scan file are not provided with any indication within the scan data where one document ends and another one begins. It thus becomes a manual process for a user to inspect the resulting scan data and determine where various documents begin and end. If the user wishes to store separate documents in separate scan files, he/she must manually separate the documents from the scan data and save them as separate files.
Currently, various techniques are used to discriminate when one document ends and another starts, thereby allowing the creation of separate scan files when using a batch scan process. One solution is to include separator pages comprising some type of indicia (e.g., bar codes, predetermined patterns, blank pages, etc.) that make them recognizable by the scanner as separator pages. Based on the occurrence of a separator page, separate scan files can be generated, either by the scanner itself or by a computer that receives the scan data. While separator pages function adequately for this purpose, they do require a user to manually insert them between documents. Another solution is to put an indicator marking the first or last page of a document directly onto the pages of each of the documents. Again, this solution requires user intervention prior to the scanning operation.
Thus, a need exists for a technique for discriminating between documents during batch processing that does not require, as prior art techniques do, user intervention or manual manipulation of the documents prior to batch scanning. Such a technique should preferably allow batch scanning to be automated such that separations between documents in scan files are provided automatically, with user intervention being only optional.
The present invention overcomes prior art limitations by providing a technique for discriminating between documents based on various analyses of the documents. The data provided by the various analyses are compared with each other to determine whether successive pages belong to the same document.
In one embodiment, scanning the documents results in a page sequence. The page sequence is then analyzed to extract at least one feature attribute for each page using, for example, an optical character recognition (OCR)/layout process to extract text and layout features, and an image feature process to extract image features. Data representative of the at least one feature attribute is then added to the page sequence to provide an extended page sequence. The extended page sequence is then subjected to a decision process to determine breaks between documents, resulting in a segmented page sequence. The decision process, in a preferred embodiment, comprises four different comparison processes (listed in order of decreasing specificity): (i) text feature analysis; (ii) specific layout analysis; (iii) general layout analysis; and (iv) image feature analysis. Regardless of the particular type used, each of the analyses provides a different measurement of the similarity between two successive pages. In a preferred embodiment, the features for a current page are compared with the features of at least one previous page, generally in order of decreasing specificity. If a sufficient likelihood of similarity is found, then the compared pages are deemed to be from the same document; otherwise, they are deemed to be from different documents, indicating the existence of a document break. The one or more document breaks thus identified may be indicated within the segmented page sequence. Through the display of the segmented page sequence, a user may optionally modify the location of one or more document breaks. Based on the document breaks, separate scan files may be established. In this manner, the present invention represents an advancement over prior art batch scanning techniques in that user intervention is no longer required.
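By way of illustration only, the decision process summarized above might be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the `Page` fields, the three similarity measures (`jaccard`, `match_ratio`, `histogram_overlap`), and every threshold value are hypothetical choices made for the example. Only the overall flow is taken from the description: four analyses tried in order of decreasing specificity, a current page compared with the previous page, and a document break declared when no analysis finds sufficient similarity.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Page:
    """One entry of the extended page sequence (all field names are hypothetical)."""
    text: frozenset = frozenset()   # e.g. OCR words or header strings
    specific_layout: tuple = ()     # e.g. positions of particular text blocks
    general_layout: tuple = ()      # e.g. column count, margin class
    image: tuple = ()               # e.g. coarse intensity histogram


def jaccard(a: frozenset, b: frozenset) -> float:
    """Set overlap in [0, 1]; 0 when either side has no features."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def match_ratio(a: tuple, b: tuple) -> float:
    """Fraction of positions holding equal values; 0 if shapes differ."""
    if not a or len(a) != len(b):
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def histogram_overlap(a: tuple, b: tuple) -> float:
    """Histogram intersection, normalized by the larger total."""
    if not a or len(a) != len(b):
        return 0.0
    total = max(sum(a), sum(b))
    return sum(min(x, y) for x, y in zip(a, b)) / total if total else 0.0


# Analyses in order of decreasing specificity. Thresholds are assumed values.
ANALYSES = [
    ("text feature", lambda p, c: jaccard(p.text, c.text), 0.5),
    ("specific layout", lambda p, c: match_ratio(p.specific_layout, c.specific_layout), 0.8),
    ("general layout", lambda p, c: match_ratio(p.general_layout, c.general_layout), 0.9),
    ("image feature", lambda p, c: histogram_overlap(p.image, c.image), 0.9),
]


def same_document(prev: Page, cur: Page) -> bool:
    """True as soon as any analysis finds sufficient similarity."""
    return any(score(prev, cur) >= threshold for _name, score, threshold in ANALYSES)


def find_breaks(pages: Sequence[Page]) -> List[int]:
    """Indices of pages that start a new document (page 0 is implicit)."""
    return [i for i in range(1, len(pages))
            if not same_document(pages[i - 1], pages[i])]


def segment(pages: Sequence[Page]) -> List[List[Page]]:
    """Split the page sequence at the detected breaks, one list per document."""
    if not pages:
        return []
    starts = [0] + find_breaks(pages) + [len(pages)]
    return [list(pages[s:e]) for s, e in zip(starts, starts[1:])]
```

In this sketch each page is compared only with its immediate predecessor, whereas the description allows comparison with more than one previous page. The list returned by `find_breaks` corresponds to the document breaks that a user could optionally adjust before separate scan files are written.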