1. Field of the Invention
This invention relates in general to optical character recognition systems and in particular to determining the layout of pages to identify the proper order of text elements to be read.
2. Description of the Related Art
Optical character recognition (OCR) systems need page layout analysis to extract text from complex pages, such as those in books, magazines, journals, newspapers, letters, and reports. Without page layout analysis, an OCR system would attempt to recognize line drawings, graphics, and photographs as text, and would jumble the reading order of words in multi-column text. Physical page layout analysis, one of the first steps of optical character recognition, divides an image into areas of text and non-text, and splits text into columns. Physical page layout analysis is distinct from logical layout analysis, which identifies headers, footers, body text, and numbered lists, and segments the page into articles.
Physical layout analysis is essential to enable an OCR engine to process images of arbitrary pages. Existing physical layout analysis methods divide roughly into two categories: bottom-up analysis methods and top-down analysis methods. Each of these methods has associated disadvantages.
Bottom-up methods are the oldest methods. They classify small parts of the image (pixels, groups of pixels, or connected components) and gather together like types to form regions. The key advantage of bottom-up methods is that they can handle arbitrarily shaped regions with ease. The key disadvantage is that they struggle to take into account higher-level structures in the image, such as columns, and as a result they often produce over-fragmented regions.
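By way of illustration only, the bottom-up approach described above can be sketched as follows. The sketch assumes a binary image given as a list of rows in which 1 denotes ink, finds 4-connected components, and then greedily merges bounding boxes that lie close together. The function names `connected_components` and `merge_nearby` and the parameter `max_gap` are hypothetical and are not taken from any particular prior art system.

```python
from collections import deque


def connected_components(image):
    """Return inclusive bounding boxes (top, left, bottom, right) of the
    4-connected ink regions in a binary image (1 = ink, 0 = background)."""
    height = len(image)
    width = len(image[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first flood fill from this unvisited ink pixel.
            seen[y][x] = True
            queue = deque([(y, x)])
            top, left, bottom, right = y, x, y, x
            while queue:
                cy, cx = queue.popleft()
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                               (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and image[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


def merge_nearby(boxes, max_gap):
    """Greedily merge boxes separated by at most max_gap empty pixels
    both vertically and horizontally (hypothetical grouping rule)."""
    def close(a, b):
        # Number of empty rows/columns between the boxes; negative on overlap.
        v_gap = max(a[0], b[0]) - min(a[2], b[2]) - 1
        h_gap = max(a[1], b[1]) - min(a[3], b[3]) - 1
        return v_gap <= max_gap and h_gap <= max_gap

    regions = list(boxes)
    merged = True
    while merged:
        merged = False
        for i in range(len(regions)):
            for j in range(i + 1, len(regions)):
                if close(regions[i], regions[j]):
                    a, b = regions[i], regions[j]
                    regions[i] = (min(a[0], b[0]), min(a[1], b[1]),
                                  max(a[2], b[2]), max(a[3], b[3]))
                    del regions[j]
                    merged = True
                    break
            if merged:
                break
    return regions
```

Because the grouping rule looks only at local distances, nothing in this sketch is aware of columns; a single threshold that is too small fragments a paragraph, while one that is too large merges text across a column gap.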
Top-down methods cut the image recursively in vertical and horizontal directions along whitespaces that are expected to be column boundaries or paragraph boundaries. Although top-down methods have the advantage that they start by looking at the largest structures on the page, they are unable to handle the variety of formats that occur in many magazine pages, such as non-rectangular regions and cross-column headings that blend seamlessly into the columns below.
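The recursive cutting described above can be illustrated with a minimal sketch of a recursive X-Y cut over projection profiles. It assumes a binary image given as a list of rows in which 1 denotes ink, uses half-open region coordinates, and always cuts along the widest empty run of rows or columns. The names `widest_gap` and `xy_cut` and the threshold `min_gap` are hypothetical choices made for this example.

```python
def widest_gap(profile):
    """Return (start, length) of the longest run of zeros in a profile."""
    best_start, best_len = 0, 0
    run_start = None
    for i, value in enumerate(profile + [1]):  # sentinel closes a trailing run
        if value == 0:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            if i - run_start > best_len:
                best_start, best_len = run_start, i - run_start
            run_start = None
    return best_start, best_len


def xy_cut(image, top, left, bottom, right, min_gap):
    """Recursively split the half-open region [top, bottom) x [left, right)
    along its widest whitespace gap; return rectangular leaf regions."""
    if min_gap < 1:
        raise ValueError("min_gap must be at least 1")
    # Shrink the region to the bounding box of its ink.
    rows = [y for y in range(top, bottom) if any(image[y][left:right])]
    cols = [x for x in range(left, right)
            if any(image[y][x] for y in range(top, bottom))]
    if not rows or not cols:
        return []
    top, bottom = rows[0], rows[-1] + 1
    left, right = cols[0], cols[-1] + 1
    # Projection profiles: ink count per row and per column.
    row_profile = [sum(image[y][left:right]) for y in range(top, bottom)]
    col_profile = [sum(image[y][x] for y in range(top, bottom))
                   for x in range(left, right)]
    r_start, r_len = widest_gap(row_profile)
    c_start, c_len = widest_gap(col_profile)
    if max(r_len, c_len) < min_gap:
        return [(top, left, bottom, right)]  # no gap wide enough: leaf
    if c_len >= r_len:
        cut = left + c_start  # vertical cut between columns
        return (xy_cut(image, top, left, bottom, cut, min_gap)
                + xy_cut(image, top, cut + c_len, bottom, right, min_gap))
    cut = top + r_start  # horizontal cut between blocks
    return (xy_cut(image, top, left, cut, right, min_gap)
            + xy_cut(image, cut + r_len, left, bottom, right, min_gap))
```

Every leaf produced by this sketch is a rectangle, and a cut is only possible where an empty run spans the full width or height of the current region, which is why such methods cannot follow non-rectangular regions.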
A third category of methods is based on analysis of the whitespace in an image. These methods address some of the flaws of the recursive top-down methods by locating the gaps between columns through a bottom-up analysis that looks explicitly for white rectangles. However, they still cannot handle non-rectangular regions.
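The search for white rectangles can be illustrated, in simplified form, by finding the single largest all-white rectangle in a binary image. The sketch below uses the standard histogram-and-stack technique for the largest empty rectangle, which is a stand-in chosen for brevity and is not asserted to be the algorithm used by any particular whitespace-analysis method; a practical system would need many such rectangles rather than one. The function name `largest_white_rectangle` is hypothetical, and the image is assumed to be a list of rows in which 0 denotes white.

```python
def largest_white_rectangle(image):
    """Return (area, top, left, bottom, right), with half-open bounds, of the
    largest rectangle containing only 0 (white) pixels."""
    if not image or not image[0]:
        return (0, 0, 0, 0, 0)
    width = len(image[0])
    heights = [0] * width  # run of white pixels ending at the current row
    best = (0, 0, 0, 0, 0)
    for y, row in enumerate(image):
        for x in range(width):
            heights[x] = heights[x] + 1 if row[x] == 0 else 0
        stack = []  # column indices whose heights are strictly increasing
        for x in range(width + 1):
            h = heights[x] if x < width else 0  # 0 at the end flushes the stack
            while stack and heights[stack[-1]] >= h:
                height = heights[stack.pop()]
                left = stack[-1] + 1 if stack else 0
                area = height * (x - left)
                if area > best[0]:
                    best = (area, y - height + 1, left, y + 1, x)
            stack.append(x)
    return best
```

A tall, narrow result of this kind is a candidate column gap. Because the search is restricted to axis-aligned rectangles, whitespace that bends around a non-rectangular region is not captured as a single separator.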