A typical paper form used in an office environment consists of characters, symbols, lines, and charts; the characters are denoted as text components. The conversion of a paper form into an editable electronic form is an essential function in automated office environments. Achieving this function requires a computer-based system capable of performing: (a) automatic separation of text and graphics components in a digitized document; (b) recognition of symbols and structures of charts; and (c) automatic data organization to preserve spatial relations of graphical components and text strings. In addition, an optical character recognition system is required to convert text strings into ASCII strings, and a user interface tool incorporating a database is necessary for form editing and creation.
Runlength smearing, projection, and connected component analysis are three techniques for component classification in a document. The runlength smearing method operates on the bitmap image such that any two black pixels (1's) which are less than a certain threshold apart are merged into a continuous run of black pixels; runs of white pixels (0's) at or beyond the threshold are left unchanged. The smearing is applied row-by-row and then column-by-column, and the two resulting bitmaps are combined by applying a logical AND at each pixel location. The resulting image contains a smear wherever printed material appears on the original image. Graphics and text are discriminated by classifying features extracted from the smear regions. Smearing fails if a document is skewed, and the smear region features are not suitable for graphics/text separation because they are alike in a text region and in a dense region of graphics.
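The smearing step described above can be illustrated with a minimal Python sketch. It is not taken from any particular implementation: the function names `smear_runs` and `rlsa`, the separate horizontal and vertical thresholds, and the choice to measure the distance between two black pixels as the number of white pixels between them are all assumptions made for illustration.

```python
def smear_runs(line, threshold):
    """Fill white gaps (0's) lying between two black pixels (1's) when the
    gap is shorter than `threshold`. Gap length is counted as the number of
    white pixels between the two black pixels (an assumption)."""
    out = list(line)
    last_black = None
    for i, px in enumerate(line):
        if px == 1:
            if last_black is not None:
                gap = i - last_black - 1
                if 0 < gap < threshold:
                    for j in range(last_black + 1, i):
                        out[j] = 1
            last_black = i
    return out


def rlsa(image, h_threshold, v_threshold):
    """Smear row-by-row and column-by-column, then AND the two bitmaps."""
    by_rows = [smear_runs(row, h_threshold) for row in image]
    # zip(*image) yields the columns; smear each, then transpose back.
    smeared_cols = [smear_runs(list(col), v_threshold) for col in zip(*image)]
    by_cols = [list(row) for row in zip(*smeared_cols)]
    return [[a & b for a, b in zip(r1, r2)]
            for r1, r2 in zip(by_rows, by_cols)]
```

For example, `smear_runs([1, 0, 0, 1, 0, 0, 0, 0, 1], 3)` fills the two-pixel gap but leaves the four-pixel gap white. Because the final bitmap is the AND of the two passes, a pixel turns black only where both the horizontal and the vertical smear cover it.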
The projection method uses recursive projection profile cuts to decompose a document into a set of blocks. At each step of the recursive process, the projection profile is computed along both the horizontal and vertical directions; a projection along a line parallel to, for example, the X-axis is simply the sum of all the pixel values along that line. Subdivision along the two directions is then accomplished by making cuts at deep valleys in the projection profile whose widths are larger than a predetermined threshold. Block classification requires feature extraction from each block and heuristic thresholds to distinguish text line blocks, graphics blocks, and halftone blocks. This method does not work well with skewed documents and requires intensive computation for further extraction of image information within the blocks.
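The recursive cutting procedure can be sketched as follows. This is a simplified illustration under stated assumptions: a "deep valley" is taken to be a run of profile values equal to zero, only the widest interior valley is cut at each step, and the names `profile`, `find_cut`, `xy_cut`, and `valley_threshold` are hypothetical. Blocks are returned as `(top, bottom, left, right)` tuples with exclusive `bottom` and `right`.

```python
def profile(image, top, bottom, left, right, axis):
    """Projection profile of a block: one sum per row (axis 0) or column (axis 1)."""
    if axis == 0:
        return [sum(image[y][left:right]) for y in range(top, bottom)]
    return [sum(image[y][x] for y in range(top, bottom))
            for x in range(left, right)]


def find_cut(prof, valley_threshold):
    """Return (start, end) of the widest interior run of zeros whose width
    is larger than valley_threshold, or None if there is no such run."""
    best, i, n = None, 0, len(prof)
    while i < n:
        if prof[i] != 0:
            i += 1
            continue
        j = i
        while j < n and prof[j] == 0:
            j += 1
        interior = i > 0 and j < n  # ignore margins touching the block edge
        if interior and j - i > valley_threshold:
            if best is None or j - i > best[1] - best[0]:
                best = (i, j)
        i = j
    return best


def xy_cut(image, valley_threshold, box=None):
    """Recursively split a block at valleys; return the list of leaf blocks."""
    if box is None:
        box = (0, len(image), 0, len(image[0]))
    top, bottom, left, right = box
    for axis in (0, 1):
        cut = find_cut(profile(image, top, bottom, left, right, axis),
                       valley_threshold)
        if cut is None:
            continue
        a, b = cut
        if axis == 0:  # horizontal valley: split into upper and lower blocks
            first, second = (top, top + a, left, right), (top + b, bottom, left, right)
        else:          # vertical valley: split into left and right blocks
            first, second = (top, bottom, left, left + a), (top, bottom, left + b, right)
        return (xy_cut(image, valley_threshold, first)
                + xy_cut(image, valley_threshold, second))
    return [box]
```

The sketch only produces the block decomposition. The feature extraction and heuristic thresholds needed to label each block as text, graphics, or halftone are not shown.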
The connected component analysis method first determines the individual connected components, which comprise individual characters and other, larger figures. Possible features for component classification are size, geometrical branching structure, and shape measures. The disadvantages of this technique are its large processing memory requirement, long computation times, and a data structure that is inefficient for post-processing.
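As an illustration of this approach, the sketch below labels connected components on a bitmap and classifies each one by size alone. The use of 8-connectivity, the breadth-first traversal, the function names, and the `max_text_side` threshold are assumptions chosen for the example; branching structure and shape measures are not modeled.

```python
from collections import deque


def connected_components(image):
    """Find 8-connected components of black pixels (1's).
    Returns one dict per component with its bounding box
    (top, left, bottom, right, all inclusive) and pixel count."""
    height = len(image)
    width = len(image[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    components = []
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or seen[y][x]:
                continue
            seen[y][x] = True
            queue = deque([(y, x)])
            top = bottom = y
            left = right = x
            area = 0
            while queue:
                cy, cx = queue.popleft()
                area += 1
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append({"bbox": (top, left, bottom, right), "area": area})
    return components


def classify_by_size(component, max_text_side):
    """Hypothetical rule: small bounding boxes are text, larger ones graphics."""
    top, left, bottom, right = component["bbox"]
    longest_side = max(bottom - top + 1, right - left + 1)
    return "text" if longest_side <= max_text_side else "graphics"
```

The `seen` array and the per-pixel queue show why the memory and time costs noted above grow with image size: every pixel of the bitmap is visited, and each component remains a set of pixels rather than a geometric description.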
Beyond the drawbacks specific to the three techniques mentioned above, any bitmap-based technique has three further shortcomings. First, a large memory is required in the process of classification. Second, only a low degree of precision is achieved, since the method lacks any geometrical information for component classification. Third, each classified component is simply a collection of connected pixels, which is an inefficient data structure for post-recognition processing. For these reasons, the classification results produced by bitmap methods are not desirable for high-speed and relatively high-precision applications such as those dealing with forms conversion and creation.
A typical paper form used in an office environment consists of characters, symbols, lines, charts, and bit-reversed regions. Characters are denoted as text components. Symbols, lines, charts, and light text (bit-reversed) are usually defined as graphic components. The conversion of a paper form into an editable electronic form is an essential function in the automated office environment. Achieving this function requires a computer-based system capable of performing (a) automatic separation of text and graphic components in a digitized document; (b) recognition of symbols and structures of charts; and (c) automatic data organization to preserve spatial relations of graphic entities and text strings. In addition, an optical character recognition system is required to convert text strings into ASCII strings, and a user interface incorporating a database is necessary for form editing and creation.
Much of the research performed on automated document analysis systems has concerned engineering drawings and diagrams. A major barrier to building a practical system is the lack of a fast and reliable algorithm for graphics/text separation that also provides an efficient data structure for subsequent graphics decomposition and recognition.