Forms are widely used in many consumer and commercial environments. Presently, forms are being processed by computers to enhance efficiency. As used herein, a form is a particular type of document. A form has a number of horizontal and vertical separators, such as lines, therein. Regions surrounded by these lines are called cells. A cell may be a header field, that is, a region preprinted on the form itself with a label such as "Name," or a text field in which a user fills in, for example, his or her name or address. These fields are arranged on the form in a predefined layout.
Generally, a system which processes a form captures an image of the form (a bitmap) using a reader device such as an optical character reader (OCR) so that the form processing system can extract necessary information from the form. The layout of the form is determined by comparing the image against a format previously stored in the memory of the computer system. The format is a model for analyzing the layout of a form. The analysis enables the system to identify what information, such as an address or a name, exists at a predefined position on the form, and to recognize the images of characters, numbers and symbols actually existing at that position as text by using well known character recognition techniques.
A simple conventional format is defined on the basis of the positions and lengths of lines previously printed on the form, and of characters previously printed on the form (headers). In other words, the format is a blank form in which nothing is entered. Based on such a format, the blank form is erased from the bitmap of the form actually read by the reader device, and the remaining image is analyzed against the layout of the form. Thus, the information entered on the form is identified.
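By way of illustration only, the erasing step described above can be sketched as a pixel-wise subtraction. The sketch assumes that the filled-in form and the blank form are already aligned bitmaps of identical size; the function name `erase_blank_form` and the tiny sample bitmaps are hypothetical and are not taken from any particular system.

```python
# Hypothetical illustration: a bitmap is a list of rows, 1 = black pixel.

def erase_blank_form(filled, blank):
    """Return only the pixels of `filled` that are absent from `blank`."""
    same_shape = len(filled) == len(blank) and all(
        len(f_row) == len(b_row) for f_row, b_row in zip(filled, blank)
    )
    if not same_shape:
        raise ValueError("bitmaps must have identical dimensions")
    return [
        [1 if f and not b else 0 for f, b in zip(f_row, b_row)]
        for f_row, b_row in zip(filled, blank)
    ]

# Blank form: a single cell outlined by preprinted lines.
blank = [
    [1, 1, 1, 1, 1],
    [1, 0, 0, 0, 1],
    [1, 1, 1, 1, 1],
]
# The same form after a user has written inside the cell.
filled = [
    [1, 1, 1, 1, 1],
    [1, 0, 1, 1, 1],
    [1, 1, 1, 1, 1],
]

entered = erase_blank_form(filled, blank)
# Only the entered pixels remain; the preprinted lines are gone.
```

A practical system would also need to register the two images before subtracting them, which this sketch omits.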
FIG. 1 shows samples of specific forms. The three samples A, B and C differ from each other in the positions and lengths of the lines therein. Conventionally, forms which can be handled with one format are generally limited to those in which the positions and lengths of the lines in the form exactly match the information of the format. Thus, the layouts of the three samples of FIG. 1 generally cannot be readily analyzed with one format, and a different format is generally provided for each of the samples A, B and C. Since the memory area for storing the data of the respective formats increases with the number of formats, excessive memory may be used. In addition, every time the form processing system receives a form, it may need to determine to which format the form corresponds, so that the speed of processing the form may decrease. The decrease in processing speed may become significant as the number of form types to be processed increases.
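The limitation of exact matching can be illustrated with a minimal sketch. It assumes that a format is stored as a set of line records, each holding an orientation, a starting position and a length; the record type `Line`, the function `matches_strictly` and all coordinates are hypothetical and do not correspond to the actual dimensions in FIG. 1.

```python
from typing import NamedTuple

class Line(NamedTuple):
    orientation: str  # "h" for horizontal, "v" for vertical
    x: int            # starting x coordinate
    y: int            # starting y coordinate
    length: int

def matches_strictly(form_lines, format_lines):
    """A form matches only if every line has the same position and length."""
    return set(form_lines) == set(format_lines)

# Hypothetical stored format: one horizontal and one vertical divider.
stored_format = [Line("h", 0, 50, 200), Line("v", 100, 0, 100)]

# A form printed exactly as the format describes.
same_form = [Line("v", 100, 0, 100), Line("h", 0, 50, 200)]

# A form whose vertical divider is shifted by 20 units.
shifted_form = [Line("h", 0, 50, 200), Line("v", 120, 0, 100)]

exact = matches_strictly(same_form, stored_format)       # True
shifted = matches_strictly(shifted_form, stored_format)  # False
```

Under this kind of comparison, even a small shift in one line calls for a separate stored format, which is the source of the memory and lookup costs noted above.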
In order to accommodate a plurality of forms having different layouts with a single format, a technique has been proposed which defines a format based on the order in which lines are drawn in the form, instead of using strict matching of the positions or lengths of lines as the reference. In the sample B of FIG. 1, for example, the entire form is first divided into upper and lower areas by a line traversing it horizontally (Line 1). Then, the upper and lower areas are each divided by a vertical line (Lines 2 and 3), so that the upper area is divided into a "Name area" and a "Zip area," and the lower area is divided into an "Address area" and a "Phone area." Furthermore, each of these areas is divided by a vertical line into a header area and a text field (Lines 4, 5, 6, and 7).
The format for sample B, which is defined by the order of line drawing in the form, can also be applied to a form in which Line 5 in FIG. 1(B) is replaced with a Line 5', because the order of line drawing is unchanged. However, this format generally cannot be applied to the sample C, in which the order differs from that of the sample B. In the case of sample C, the entire form is first divided by a vertical line (Line 1) into left and right areas. Then, the left and right areas are each divided by a horizontal line (Lines 2 and 3).
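One way to picture a format defined by the order of line drawing is as a tree of successive divisions in which line positions are recorded but ignored during comparison. The following is a minimal sketch under that assumption and is not the representation used by the proposed technique itself; the class `Split`, the helper functions and all position values are hypothetical. The text does not state in which cell Line 5 lies, so the sketch arbitrarily moves one header/text divider, and sample C is modeled only down to the two levels of division described above.

```python
from dataclasses import dataclass, field
from typing import Union

@dataclass(frozen=True)
class Split:
    orientation: str  # "h" = divided by a horizontal line, "v" = by a vertical line
    first: "Node"     # upper or left area
    second: "Node"    # lower or right area
    # The line position is stored but excluded from equality comparison.
    position: float = field(default=0.5, compare=False)

Node = Union[Split, str]  # a leaf is just a label for the cell

def header_and_text(position=0.3):
    """An area divided by a vertical line into a header and a text field."""
    return Split("v", "header", "text", position)

def sample_b(moved_divider=0.3):
    """Order of drawing for sample B: horizontal first, then vertical."""
    upper = Split("v", header_and_text(), header_and_text(moved_divider))
    lower = Split("v", header_and_text(), header_and_text())
    return Split("h", upper, lower)

def sample_c():
    """Order of drawing for sample C: vertical first, then horizontal."""
    left = Split("h", "cell", "cell")
    right = Split("h", "cell", "cell")
    return Split("v", left, right)

# Moving one divider keeps the order of drawing, so the format still applies.
b_matches_moved = sample_b() == sample_b(moved_divider=0.45)
# Sample C starts with a vertical division, so the format for B does not apply.
b_matches_c = sample_b() == sample_c()
```

Because equality ignores `position`, the two variants of sample B compare equal, while sample C differs at the very first division.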
Since the forms of FIGS. 1(B) and 1(C) differ from each other in the order of line drawing, they are generally represented by different formats. Therefore, this approach may not be able to handle the three samples shown in FIG. 1 with a single format. In addition, when a format is produced from an actual form, or when a format is updated with a new form, expertise may be necessary to perform such an operation.