1. Field of the Invention
The present invention relates to a system for converting documents and drawings into image data through an input device such as a scanner, adding management information to the image data, and accumulating the resultant data; to an apparatus for identifying the structure of ruled lines in an image for image recognition; and to a method of performing the above-described processes.
2. Description of the Related Art
In recent years, the conventional practice of storing information on paper has been giving way to storing data on electronic media. For example, an electronic filing system converts documents stored on paper into document images using an opto-electrical converter such as an image scanner, and stores the converted document images on an optical disk, a hard disk, or the like, together with management information such as a keyword for retrieval.
Since documents are stored as image data in the above-described method, a larger disk capacity is required than in a method in which all characters in the documents are encoded by character recognition technology and then stored. However, the image-based method is easy to carry out and fast, and pictures and tables containing data other than characters can be stored as they are. On the other hand, the stored information must be retrieved using additional management information, such as keywords or numbers, stored together with the document images. Conventional systems require much time and effort to assign keywords and are therefore not user-friendly.
To reduce this burden in the conventional systems, the title of a document can be treated as a keyword: the title is automatically extracted, recognized as characters, and encoded for storage with the document image.
At present, character recognition runs at up to several tens of characters per second, and processing a normal document page (approximately 21 cm × 29.5 cm) takes from about 30 seconds to several minutes. Therefore, it is preferable not to recognize all characters of an entire document, but to first extract the necessary title from the image of the document and then recognize only that title.
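The timing argument above can be illustrated with a short calculation. This is only a sketch: the page and title character counts and the exact recognition speed are hypothetical values chosen to fall within the ranges stated above, and the function name is illustrative.

```python
def recognition_seconds(char_count, chars_per_second):
    """Time needed to recognize char_count characters at a fixed speed."""
    return char_count / chars_per_second


# Hypothetical figures, not taken from the source document.
PAGE_CHARS = 2000   # assumed number of characters on a full page
TITLE_CHARS = 30    # assumed number of characters in a title
SPEED = 30          # assumed speed, i.e. "several tens" of characters per second

full_page = recognition_seconds(PAGE_CHARS, SPEED)
title_only = recognition_seconds(TITLE_CHARS, SPEED)
print(f"full page: {full_page:.0f} s, title only: {title_only:.0f} s")
```

Under these assumed figures, recognizing the whole page takes roughly a minute, while recognizing only the title takes about one second.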
A conventional technology for extracting a part of a document, for example, the title of the document, from a document image obtained by reading the document through an opto-electrical converter is described in "TITLE EXTRACTING APPARATUS FOR EXTRACTING TITLE FROM DOCUMENT IMAGE AND METHOD THEREOF" (U.S. patent application Ser. No. 08/694,503, now U.S. Pat. No. 6,035,061 issued Mar. 7, 2000, and Japanese Patent Application H7-341983), filed by the Applicant of the present invention. FIG. 1A shows the principle of the title extracting apparatus.
The title extracting apparatus shown in FIG. 1A comprises a character area generation unit 1, a character string area generation unit 2, and a title extraction unit 3. The character area generation unit 1 labels connected components of picture elements (pixels) to extract partial patterns, such as parts of characters, from a document image input through a scanner or the like. It then generates a character area by integrating several partial patterns. The character string area generation unit 2 integrates a plurality of character areas to generate a character string area. The title extraction unit 3 extracts, as a title area, a character string area that is likely to be a title.
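The labelling and integration steps performed by units 1 and 2 can be sketched as follows. This is a minimal illustration, not the method of the cited application: it assumes a binary image given as a list of rows (1 for a black pixel), uses 8-connectivity, and merges bounding boxes into string areas with a single horizontal gap threshold. The function names and the `max_gap` parameter are hypothetical.

```python
from collections import deque


def label_components(image):
    """Return bounding boxes (top, left, bottom, right) of the
    8-connected components of black pixels (value 1)."""
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first search over one connected component.
            top, left, bottom, right = r, c, r, c
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                top, left = min(top, y), min(left, x)
                bottom, right = max(bottom, y), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        inside = 0 <= ny < rows and 0 <= nx < cols
                        if inside and image[ny][nx] == 1 and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


def merge_into_strings(boxes, max_gap):
    """Merge boxes, processed left to right, into string areas when they
    overlap vertically and the horizontal gap is at most max_gap pixels."""
    strings = []
    for box in sorted(boxes, key=lambda b: b[1]):
        for i, s in enumerate(strings):
            overlaps_vertically = box[0] <= s[2] and s[0] <= box[2]
            gap = box[1] - s[3] - 1
            if overlaps_vertically and gap <= max_gap:
                strings[i] = (min(s[0], box[0]), min(s[1], box[1]),
                              max(s[2], box[2]), max(s[3], box[3]))
                break
        else:
            strings.append(box)
    return strings


image = [
    [1, 1, 0, 1, 0, 0, 0, 0, 1],
    [1, 0, 0, 1, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 0, 0, 0],
]
boxes = label_components(image)
string_areas = merge_into_strings(boxes, max_gap=2)
```

In this toy image, the two components at the top left are close enough to be merged into one string area, while the component at the far right and the one on the bottom row remain separate.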
To estimate how likely a character string area is to be a title, the title extraction unit 3 uses features such as a position at the top and center of the page, a character size larger than that of the body of the document, and underlining. This likelihood is expressed as a score for each of the character string areas, finally yielding a plurality of candidates for the title area ranked from the highest score to the lowest. With the above-described process, title areas can be extracted from documents containing no tables.
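The scoring described above might look like the following sketch. The feature set follows the text (top position, centering, larger character size, underlining), but the weights, thresholds, and all names (`StringArea`, `title_score`, `rank_title_candidates`) are assumptions made for illustration and are not taken from the cited application.

```python
from dataclasses import dataclass


@dataclass
class StringArea:
    text: str
    top: int
    left: int
    height: int
    width: int
    underlined: bool = False


def title_score(area, page_width, page_height, body_height):
    """Sum hypothetical points for each title-like feature of an area."""
    score = 0
    if area.top < page_height * 0.25:                    # near the top of the page
        score += 2
    center = area.left + area.width / 2
    if abs(center - page_width / 2) < page_width * 0.1:  # horizontally centered
        score += 2
    if area.height > body_height * 1.2:                  # larger than body text
        score += 2
    if area.underlined:
        score += 1
    return score


def rank_title_candidates(areas, page_width, page_height, body_height):
    """Return the areas ordered from the highest score to the lowest."""
    return sorted(
        areas,
        key=lambda a: title_score(a, page_width, page_height, body_height),
        reverse=True,
    )


areas = [
    StringArea("Date", top=50, left=800, height=20, width=100),
    StringArea("Quarterly Report", top=100, left=300, height=40, width=400,
               underlined=True),
    StringArea("body text", top=600, left=100, height=20, width=800),
]
ranked = rank_title_candidates(areas, page_width=1000, page_height=1400,
                               body_height=20)
```

Here the large, centered, underlined string near the top collects points for every feature and is ranked first.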
On the other hand, when a document contains a table, the character string area generation unit 2 first extracts the character string areas in the table, and the title extraction unit 3 then extracts a title area by taking the number of characters into account. For example, an item name implying the existence of a title, such as 'Subject' or 'Name', has comparatively few characters, whereas a character string representing the title itself, such as '... relating to ...', is likely to have many. Thus, a character string that is likely to be a title can be detected among adjacent character strings by using the number of characters in each string.
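The character-count heuristic above can be sketched as follows, assuming the character strings of a table row are already available in left-to-right order. The thresholds `max_item_chars` and `min_title_chars` and the function name are hypothetical, and the sample strings are made up for illustration.

```python
def find_title_by_length(strings, max_item_chars=8, min_title_chars=10):
    """Return (item name, title candidate) pairs in which a short string
    is immediately followed by a long one."""
    pairs = []
    for item, candidate in zip(strings, strings[1:]):
        if len(item) <= max_item_chars and len(candidate) >= min_title_chars:
            pairs.append((item, candidate))
    return pairs


row = ["Subject", "Report relating to scanner maintenance", "Name", "J. Smith"]
pairs = find_title_by_length(row)
```

With this row, only the pair starting with 'Subject' qualifies, because the string following 'Name' is too short to count as a title candidate. The sketch only compares horizontally adjacent strings, so it inherits the limitation that a title written below its item name is not found.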
However, many documents, such as slips, are laid out as tables using ruled lines, and the above-described conventional technology has the problem that a title is unlikely to be extracted successfully from a table.
For example, when a title is written at the center or near the bottom of a table, the title may not be correctly extracted merely by giving priority to character strings near the top. Furthermore, as shown in FIG. 1B, an approval column 11 may be located at the top of the table. If the approval column 11 contains a number of extraneous character strings such as 'general manager', 'manager', 'sub-manager', 'person in charge', etc., then these character strings are extracted with priority, and the title is not extracted correctly.
As shown by the combination of an item name 12 and a title 13, a title may be written below the item name 12 rather than on its right-hand side. In this case, the relative positions of the item name and the title cannot be determined from the number of characters in adjacent character strings alone. Furthermore, in Japanese, item names are written not only horizontally but also vertically, which makes it very hard to correctly specify the position of the item name. When a document contains two tables, the title may be located somewhere in the smaller table.
Since documents containing tables can be written in various formats, the features that indicate a title vary from document to document, and the precision of extracting a title from a table is lowered. If the quality of the input document image is poor, the extraction precision is lowered further.
In an electronic filing system, an extracted title area is subjected to character recognition by an optical character reader (OCR) to generate character codes, which are added to the image as management information. The image in a database can then be retrieved using the character codes.
This approach works if the character string in the title area is readable by the OCR. However, if the background has a textured pattern or the characters are written in a decorative font, current OCRs cannot recognize the character string, and management information therefore cannot be added to the image.