1. Field of the Invention
The present invention relates to a system and method for effectively finding document or subdocument boundaries in a sequence of images, such as those produced from a digital scanner.
2. Description of the Related Art
Finding document or subdocument boundaries is useful in the context of processing large quantities of documents and/or subdocuments in accordance with their document or subdocument type. As used herein, the term “document” refers generally to information contained in a medium having a beginning boundary (e.g., first page, first paragraph, etc.) and an ending boundary (e.g., last page, last paragraph, etc.) and a “subdocument” may be any definable subset of information contained in a “document” (e.g., page(s), section(s), paragraph(s), etc.). Hereinafter, “documents” and “subdocuments” are collectively referred to as “documents.”
Current methods commonly employed for high volume digital document scanning and subsequent processing of documents include using physical separator sheets to sort the documents, as described in U.S. Pat. No. 6,118,544 to Rao, for example. In large scanning operations, the manual effort of inserting physical separator pages prior to scanning can be extremely costly and time consuming. For example, a large loan processing company in the United States currently estimates that, in order to process 20 million loan images a month, it spends $1 million a year on the printing of separator pages. Additionally, it estimates at least 20 seconds of manual effort per loan document. Using separator pages can therefore consume a substantial portion of the total document preparation costs, and the level of effort scales linearly with the volume of forms processed.
Under similar volume assumptions, human-constructed rule-based systems, wherein the categorization and/or separation rules are specified by a human operator, do quite well for certain kinds of tasks. However, while the costs of such a rule-based system do not scale linearly with the number of documents processed, they can scale even more poorly as the number of combinations of document types and business rules increases. This is because over time the system is forced to adapt to new constraints, and ensuring that the interaction of behaviors between new and old rules is correct can be cumbersome and time consuming, and can require highly skilled (and consequently expensive) labor.
Only recently has work been done to automate the process of rule generation. The work described in Collins-Thompson et al., “A Clustering-Based Algorithm for Automatic Document Separation,” ACM Special Interest Group in Information Retrieval (SIGIR), 2002 (hereinafter “Collins-Thompson”), takes a batch of documents, with the pages in no particular order, and automatically groups together the pages that belong to the same document. This work uses a three step method. First, each pair of pages is assigned four similarity scores based on document structure information, text layout information, text similarity, and general image content features. Second, these scores are used to compute an overall similarity between the two pages. Lastly, the pairs of pages are clustered by similarity score in order to obtain larger groups of pages that are all similar to each other. The result is a separation of a large set of pages from multiple documents into individual documents.
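By way of illustration only, the three step method summarized above may be sketched as follows. This is a minimal sketch and not the implementation of Collins-Thompson: the equal weighting of the four scores, the fixed threshold, the single-link grouping, and the names `overall_similarity`, `cluster_pages`, and `example_scores` are all assumptions introduced here for illustration.

```python
def overall_similarity(scores, weights=(0.25, 0.25, 0.25, 0.25)):
    """Step 2: combine the four per-pair scores into one value.

    A weighted mean is assumed here; the cited work may combine them differently.
    """
    return sum(w * s for w, s in zip(weights, scores))


def cluster_pages(num_pages, pair_scores, threshold=0.5):
    """Step 3: group pages whose combined similarity meets the threshold.

    pair_scores maps a (page_a, page_b) tuple to its four step-1 scores:
    (structure, layout, text, image). Grouping is single-link via union-find,
    which is an assumption made for this sketch.
    """
    parent = list(range(num_pages))

    def find(i):
        # Follow parent links to the group representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for (a, b), scores in pair_scores.items():
        if overall_similarity(scores) >= threshold:
            parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for page in range(num_pages):
        groups.setdefault(find(page), []).append(page)
    return sorted(groups.values())


# Hypothetical step-1 scores for four unordered pages.
example_scores = {
    (0, 1): (0.9, 0.8, 0.9, 0.7),
    (0, 2): (0.1, 0.2, 0.1, 0.1),
    (0, 3): (0.2, 0.1, 0.1, 0.2),
    (1, 2): (0.1, 0.1, 0.2, 0.1),
    (1, 3): (0.1, 0.2, 0.1, 0.1),
    (2, 3): (0.8, 0.9, 0.7, 0.8),
}
print(cluster_pages(4, example_scores))  # [[0, 1], [2, 3]]
```

In this sketch the output is a set of page groups only; nothing in it assigns a document type to any group.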
Although the method proposed by Collins-Thompson partitions pages into groups that correspond to documents, it does not attempt to identify what types of documents exist in the collection. This approach therefore falls short of addressing the total business problem. Quite often, separator pages are inserted between documents both to instruct the computer where one document begins and another ends and to identify the type of document that follows the separator page. Both pieces of information are critical to power certain business processes. The identification of the type of document is used to determine what further processing needs to be done on that particular document. The following example illustrates the value of completing both steps:
A mortgage refinance company wants to automate the document preparation of loan refinance applications. The preparation process today involves inserting bar code separators between each document. The separators tell the computer where one document begins and ends. The bar code tells the computer what document type is behind the separator. Based upon the document type, automated extraction and routing technology can pull out the correct information from each document. Previously, all this work had to be done by hand. Without document type identification, the savings from technology are much reduced. Documents would be separated, but unidentified. A human operator would need to look at each document to determine its identification. This process is just as lengthy as looking at each document and inserting a bar code separator page.
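The separator-based process in the foregoing example may be sketched as follows, showing how a separator page supplies both the boundary and the document type. This is a minimal sketch under stated assumptions: the `Page` and `Document` structures, the `separator_code` field standing in for a decoded bar code, the `UNKNOWN` type for pages preceding any separator, and the function `split_on_separators` are hypothetical and are not taken from any cited system.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Page:
    content: str
    # Decoded bar code value; set only on separator sheets (assumed representation).
    separator_code: Optional[str] = None


@dataclass
class Document:
    doc_type: str
    pages: List[Page] = field(default_factory=list)


def split_on_separators(pages: List[Page]) -> List[Document]:
    """Split a scanned page sequence into typed documents.

    Each separator page starts a new document and names its type;
    the separator sheet itself is not kept as a document page.
    """
    documents: List[Document] = []
    current: Optional[Document] = None
    for page in pages:
        if page.separator_code is not None:
            current = Document(doc_type=page.separator_code)
            documents.append(current)
        else:
            if current is None:
                # Pages scanned before any separator have no known type.
                current = Document(doc_type="UNKNOWN")
                documents.append(current)
            current.pages.append(page)
    return documents


# Hypothetical batch: two separator sheets followed by their documents.
batch = [
    Page("", separator_code="DEED"),
    Page("deed page 1"),
    Page("deed page 2"),
    Page("", separator_code="TAX_FORM"),
    Page("tax form page 1"),
]
for doc in split_on_separators(batch):
    print(doc.doc_type, len(doc.pages))
```

The sketch makes the dependency explicit: if the separator carried no type code, the boundaries would still be found, but every document would remain unidentified.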
Additionally, the system as described by Collins-Thompson was built to separate document pages from each other according to one particular criterion: whether the pages came from the same document. However, it may be useful to redefine the grouping criteria for a business process. For example, the division of deeds from tax forms might be one separation task. In another business process, identifying all forms belonging to a single person might be the desired separation task. The methods used in Collins-Thompson do not allow the user of the system to easily redefine what it means to be similar and thus redefine the separation task. Instead, the user would need to reprogram the classification and clustering system as well as reengineer the features that the system extracts from the documents it receives as input.