The exemplary embodiment relates to document processing. It finds particular application in connection with a system and method for the unsupervised determination of a logical structure of a document based on trailing and leading pages.
Many existing documents lack a table of contents (TOC), or have a TOC that is out of date and unreliable. This may occur when a large book is digitally scanned to generate a digital document, or when the content of a digital document is updated without creating and/or updating a corresponding TOC. The lack of a reliable TOC makes it more difficult to determine quickly and efficiently whether a document contains a particular piece of information and, if so, where in the document the information is located. Existing methods for determining the logical structure of a document (and, by extension, a TOC for the document) are computationally expensive and are prone to skewed results. An example of such a method is described in Emmanuel Giguet, Alexandre Baudrillart and Nadine Lucas, “Resurgence for the Book Structure Extraction Competition,” INEX 2009 Workshop Pre-proceedings (hereinafter, “Giguet”). Giguet uses a four-page sliding window to detect chapter transitions. However, this method is computationally expensive because it computes data for at least four pages at a time and compares the four pages within each four-page window to determine chapter transitions. Accordingly, it is desirable to have a more efficient and reliable method of determining the logical structure of a document.