In fields where forms in various formats are handled, apparatuses that automatically identify forms by format have been proposed. This type of automatic identification is based on the similarity between the formats of the forms. The method used to determine the origin position of a form greatly affects the calculated similarity between form formats. If the upper left corner of an image read through a scanner is used as the form origin as is, any displacement of the form placed on the scanner displaces the form origin, preventing the form from being properly recognized. Therefore, form format data is generated in a manner that corrects the form origin position. This method will be described below. When a scanner that reads an image against a black background (hereinafter called a “black back scanner”) is used to read an image, the outer rim of the form in the read image appears in black. Therefore, a process for deleting the black rim (black rim correction) is performed so that the shape of the form is correctly recognized. Means for generating form format data then generates the format data using, as the origin, the upper left corner of the image that has undergone the black rim correction (FIG. 2A).
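The black rim correction described above can be illustrated with a minimal sketch. It assumes a grayscale image stored as a list of rows (0 = black, 255 = white), a form placed without skew, and a fixed darkness threshold. The names `black_rim_origin` and `place_on_black` are hypothetical and are not taken from any conventional apparatus. The sketch discards border rows and columns that are entirely dark and returns the upper left corner of what remains as the origin, in (row, column) order.

```python
from typing import List, Tuple

Image = List[List[int]]  # grayscale rows; 0 = black, 255 = white (assumed encoding)


def black_rim_origin(image: Image, threshold: int = 128) -> Tuple[int, int]:
    """Return (row, col) of the form's upper left corner after rim removal.

    Assumption: the form is not skewed, so the black rim consists of whole
    rows and columns in which every pixel is darker than `threshold`.
    """
    if not image or not image[0]:
        raise ValueError("empty image")
    # Rows and columns that contain at least one light pixel belong to the form.
    light_rows = [r for r, row in enumerate(image)
                  if any(p >= threshold for p in row)]
    if not light_rows:
        raise ValueError("no form found: image is entirely dark")
    light_cols = [c for c in range(len(image[0]))
                  if any(row[c] >= threshold for row in image)]
    return light_rows[0], light_cols[0]


def place_on_black(form: Image, top: int, left: int,
                   height: int, width: int) -> Image:
    """Simulate a black back scan: paste `form` onto an all-black canvas."""
    canvas = [[0] * width for _ in range(height)]
    for r, row in enumerate(form):
        for c, pixel in enumerate(row):
            canvas[top + r][left + c] = pixel
    return canvas


# A tiny white form with one dark printed mark at (1, 1).
form = [
    [255, 255, 255, 255],
    [255,   0, 255, 255],
    [255, 255, 255, 255],
]

# The same form placed at two different positions on the scanner glass.
scan_a = place_on_black(form, top=1, left=2, height=6, width=8)
scan_b = place_on_black(form, top=2, left=3, height=6, width=8)

origin_a = black_rim_origin(scan_a)
origin_b = black_rim_origin(scan_b)
```

In this sketch the two scans yield different origins in scanner coordinates, but the printed mark lies at the same offset from each origin, which is the property that makes the corrected upper left corner usable as the form origin.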
When a scanner that reads an image against a white background (hereinafter called a “white back scanner”) is used, the outer rim of the form in the read image appears in white. In many cases the rim is thus the same color as the form itself, so the black rim correction cannot be applied. Because no color other than white, which is the background color, appears on the outer rim of the form, features of the image are instead extracted to determine the positions of table blocks and text blocks, and the origin is decided from those positions. For example, the upper left corner of a rectangle encompassing the blocks as a whole is used as the origin to generate the format data. Although the upper left corner of the form cannot be used as the origin in this method, the origin of forms in the same format can be determined uniquely when the background color is white (FIG. 2B).
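The block-based origin can be sketched as follows, assuming that block extraction has already produced axis-aligned rectangles for the table blocks and text blocks. The `Block` type and the functions `block_origin` and `to_form_coordinates` are hypothetical names introduced only for illustration; the extraction of the blocks themselves is outside the scope of this sketch.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Block(NamedTuple):
    """Axis-aligned rectangle of a table or text block, in scanner pixels."""
    left: int
    top: int
    right: int
    bottom: int


def block_origin(blocks: Iterable[Block]) -> Tuple[int, int]:
    """Return (x, y) of the upper left corner of the rectangle enclosing all blocks."""
    blocks = list(blocks)
    if not blocks:
        raise ValueError("no blocks were extracted from the image")
    return min(b.left for b in blocks), min(b.top for b in blocks)


def to_form_coordinates(blocks: Iterable[Block],
                        origin: Tuple[int, int]) -> List[Block]:
    """Express each block relative to the chosen origin."""
    ox, oy = origin
    return [Block(b.left - ox, b.top - oy, b.right - ox, b.bottom - oy)
            for b in blocks]


# The same form read twice, the second time displaced by (+15, +7) pixels.
blocks_a = [Block(40, 50, 200, 80), Block(40, 100, 200, 300)]
blocks_b = [Block(b.left + 15, b.top + 7, b.right + 15, b.bottom + 7)
            for b in blocks_a]

format_a = to_form_coordinates(blocks_a, block_origin(blocks_a))
format_b = to_form_coordinates(blocks_b, block_origin(blocks_b))
```

Under these assumptions `format_a` and `format_b` are identical, which illustrates why the origin of forms in the same format is determined uniquely even though the corner of the paper itself is not visible.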
However, the method of determining the origin for the black back scanner differs from the method used for the white back scanner. Therefore, in an environment where various types of scanners are used, features are calculated by different methods, preventing document formats from being correctly identified. In addition, an apparatus for identifying document formats is often used in a relatively large client-server system environment. When the conventional automatic identification method described above is used in such a system having a number of clients, a single type of scanner must be used in all of those clients, or some other restriction must be introduced.
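The mismatch can be made concrete with a small numeric sketch. The block coordinates below are hypothetical values, given relative to the corner of the paper, and `relative_blocks` is an illustrative helper rather than part of any conventional apparatus. The same form is expressed once with the paper corner as the origin, as with a black back scanner, and once with the upper left corner of the enclosing rectangle of the blocks as the origin, as with a white back scanner.

```python
def relative_blocks(blocks, origin):
    """Shift (left, top, right, bottom) rectangles so that `origin` becomes (0, 0)."""
    ox, oy = origin
    return [(l - ox, t - oy, r - ox, b - oy) for (l, t, r, b) in blocks]


# Hypothetical block positions measured from the corner of the paper.
blocks = [(20, 30, 180, 60), (20, 80, 180, 200)]

# Black back scanner: origin is the paper corner left after black rim correction.
black_back_origin = (0, 0)

# White back scanner: origin is the upper left corner of the rectangle
# enclosing all blocks, because the paper corner cannot be detected.
white_back_origin = (min(b[0] for b in blocks), min(b[1] for b in blocks))

black_back_format = relative_blocks(blocks, black_back_origin)
white_back_format = relative_blocks(blocks, white_back_origin)
```

Here the two sets of format data differ by the margin between the paper corner and the first block, (20, 30) in this example, so a similarity calculation that compares them directly would treat one and the same form as two different formats.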