The exemplary embodiment relates to document processing. It finds particular application in connection with a system and method for the unsupervised detection of page frames applicable to a given document.
To enable processing of a book in digital form, it is often necessary to scan a hardcopy book. The pages can then be OCR processed or otherwise analyzed digitally. One problem which arises is that when a page of a book is scanned or photocopied, the resulting image often contains noise in addition to the content of the current page. The noise may be “textual noise,” which in the present context is content of a neighboring page, i.e., a page which is previous or subsequent to the current page being scanned. The textual noise may be text content or, in some cases, image content. Additionally, there may be “non-textual noise,” which is generally noise that does not arise from the content of the current or neighboring pages. Non-textual noise can include, for example, black borders around the document page and speckles, often arising from the spine, which creates a shadow in the margin between the current page and the neighboring page.
It is desirable to remove such noise before further processing of the scanned document pages. Various methods have been developed for identifying what is referred to as the “page frame,” also called the “page body” or, by typographers, the “type area.” These methods include filtering out non-textual noise and identifying connected components. The aim of many of these approaches is generally to identify “the smallest rectangle that encloses all the foreground elements of the document page.” See, for example, Faisal Shafait, Geometric Layout Analysis of Scanned Documents, PhD thesis, Technical University of Kaiserslautern, 2008. A related function found in some OCR engines is the Dual Splitting function. This function recognizes the situation where the input image is composed of two pages, as occurs when two consecutive pages of a book are scanned together.
One problem with current approaches is that a portion of a neighboring page may be considered as part of the current page. In addition, because the smallest rectangle depends on the content of each individual page, the approach of Shafait can lead to two pages of the same book having very different smallest rectangles. For example, a page with a large amount of white space will have a smaller rectangle than a page which is densely filled. Recognition of some typographical elements, such as headers and footers, based on page location, can thus be difficult.
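By way of illustration only, the content dependence of the smallest enclosing rectangle can be sketched as follows. This is a minimal sketch and not the implementation of Shafait or of any OCR engine. It assumes a page is represented as a binary grid in which 1 denotes a foreground pixel, and the names smallest_enclosing_rectangle, blank_page, and fill are hypothetical helpers introduced for this example.

```python
from typing import List, Optional, Tuple

# (top, left, bottom, right), all inclusive; a hypothetical convention.
Box = Tuple[int, int, int, int]


def smallest_enclosing_rectangle(page: List[List[int]]) -> Optional[Box]:
    """Return the tightest box around all foreground (non-zero) pixels."""
    rows = [r for r, line in enumerate(page) if any(line)]
    if not rows:
        return None  # blank page: no foreground elements at all
    cols = [c for c in range(len(page[0])) if any(line[c] for line in page)]
    return rows[0], cols[0], rows[-1], cols[-1]


def blank_page(height: int, width: int) -> List[List[int]]:
    """Create an all-background page."""
    return [[0] * width for _ in range(height)]


def fill(page: List[List[int]], top: int, left: int, bottom: int, right: int) -> None:
    """Mark an inclusive rectangular region as foreground."""
    for r in range(top, bottom + 1):
        for c in range(left, right + 1):
            page[r][c] = 1


# A densely filled page.
full_page = blank_page(10, 10)
fill(full_page, 1, 2, 8, 7)

# A page with much white space, e.g. the last page of a chapter.
sparse_page = blank_page(10, 10)
fill(sparse_page, 1, 2, 3, 7)

# The same sparse page with a sliver of the neighboring page at the left edge.
noisy_page = blank_page(10, 10)
fill(noisy_page, 1, 2, 3, 7)
fill(noisy_page, 1, 0, 8, 0)

print(smallest_enclosing_rectangle(full_page))    # (1, 2, 8, 7)
print(smallest_enclosing_rectangle(sparse_page))  # (1, 2, 3, 7)
print(smallest_enclosing_rectangle(noisy_page))   # (1, 0, 8, 7)
```

In this sketch, the sparse page yields a much shorter rectangle than the full page even though both share the same layout, and the sliver of the neighboring page stretches the rectangle to the edge of the image. A footer sought at a fixed position relative to the rectangle would therefore be looked for in a different place on each page.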