1. Field of the Invention
The present invention relates to a system for converting documents and drawings into image data through an input device such as a scanner, etc., adding management information to the image data, and accumulating resultant data; to an apparatus for identifying the structure of the ruled lines in the image for image recognition; and to a method of performing the above described processes.
2. Description of the Related Art
In recent years, the conventional practice of storing information on paper has been giving way to storing data on electronic media. For example, an electronic filing system converts documents stored on paper into document images using an opto-electrical converter such as an image scanner, and stores the converted document images on an optical disk, a hard disk, etc., with management information such as a keyword for retrieval added to them.
Since documents are stored as image data in the above described method, a larger disk capacity is required than in a method in which all characters in the documents are encoded by character recognition technology before storage. However, the above described method is simple to carry out, runs at a high process speed, and can store pictures and tables containing data other than characters as is. On the other hand, the stored information must be retrieved using additional management information, such as keywords and numbers, stored together with the document images. Conventional systems require much time and effort to assign keywords, and are therefore not user-friendly.
To solve the problem of the awkwardness of the conventional systems, the title of a document can be assumed to be a keyword, automatically extracted, recognized as characters, and encoded for storage with document images.
At present, the speed of recognizing characters is up to several tens of characters per second, and it takes from about 30 seconds to several minutes to process a normal document page (approximately 21 cm × 29.5 cm). Therefore, it is recommended not to recognize all characters of an entire document, but to first extract necessary titles from the images of the document and then recognize them.
The conventional technology of extracting a part of a document, for example, a title of the document from a document image obtained by reading the document through an opto-electrical converter is described in “TITLE EXTRACTING APPARATUS FOR EXTRACTING TITLE FROM DOCUMENT IMAGE AND METHOD THEREOF, U.S. patent application Ser. No. 08/694,503 which is now U.S. Pat. No. 6,035,061, and Japanese Patent Application H7-341983” filed by the Applicant of the present invention. FIG. 1A shows the principle of the title extracting apparatus.
The title extracting apparatus shown in FIG. 1A comprises a character area generation unit 1, a character string area generation unit 2, and a title extraction unit 3. The character area generation unit 1 extracts, by labelling connected components of picture elements, a partial pattern such as a part of a character, etc. from a document image input through a scanner, etc. Then, it extracts (generates) a character area by integrating several partial patterns. The character string area generation unit 2 integrates a plurality of character areas and extracts (generates) a character string area. The title extraction unit 3 extracts as a title area a character string area which is probably a title.
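The labelling step performed by the character area generation unit 1 can be illustrated with a minimal sketch. It assumes a binary image given as a list of rows in which 1 denotes a black picture element, and it uses 8-connectivity; the function name `label_components` and the bounding-box output format are assumptions made for illustration, not details taken from the cited application.

```python
from collections import deque


def label_components(image):
    """Return bounding boxes (top, left, bottom, right) of the 8-connected
    groups of black picture elements (value 1) in a binary image."""
    rows = len(image)
    cols = len(image[0]) if image else 0
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Start a new connected component at this unvisited black pixel.
            seen[r][c] = True
            queue = deque([(r, c)])
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                # Visit all eight neighbours of the current pixel.
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes
```

Each returned box corresponds to one partial pattern; integrating nearby boxes into character areas and character string areas would be a separate step not shown here.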
At this time, the title extraction unit 3 utilizes notable points such as a top and center position, a character size larger than that of the body of the document, an underlined representation, etc. as the probability of a title area. The probability is expressed as a score for each of the character string areas to finally obtain a plurality of candidates for the title area in the order from the highest score to the lowest one. In the above described process, title areas can be extracted from documents containing no tables.
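The scoring described above can be sketched as follows. This is only an illustration: the `StringArea` fields, the one-point-per-feature weighting, and the thresholds (upper quarter of the page, 10% centering tolerance, 1.2 times the body character height) are all assumptions chosen for the example, not values given in the cited application.

```python
from dataclasses import dataclass


@dataclass
class StringArea:
    text: str
    x: float          # left edge of the character string area
    y: float          # top edge of the character string area
    width: float
    height: float     # used here as a proxy for character size
    underlined: bool = False


def title_score(area, page_width, page_height, body_height):
    """Score one character string area; a higher score is more title-like."""
    score = 0
    # Near the top of the page (assumed threshold: upper quarter).
    if area.y < 0.25 * page_height:
        score += 1
    # Horizontally centered (assumed tolerance: 10% of the page width).
    center = area.x + area.width / 2
    if abs(center - page_width / 2) <= 0.10 * page_width:
        score += 1
    # Characters larger than those of the body (assumed factor: 1.2).
    if area.height > 1.2 * body_height:
        score += 1
    # Underlined representation.
    if area.underlined:
        score += 1
    return score


def rank_title_candidates(areas, page_width, page_height, body_height):
    """Return the areas ordered from the highest score to the lowest."""
    return sorted(
        areas,
        key=lambda a: title_score(a, page_width, page_height, body_height),
        reverse=True,
    )
```

The sorted list plays the role of the plurality of title candidates ordered by score.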
On the other hand, when a document contains a table, the title extraction unit 3 extracts a title area in consideration of the condition of the number of characters after the character string area generation unit 2 extracts a character string area in the table. For example, the number of characters indicating the name of an item implying the existence of the title is comparatively small such as ‘Subject’, ‘Name’, etc. The number of characters forming a character string representing the title itself is probably large such as ‘. . . relating to . . .’. Thus, a character string which is probably a title can be detected from adjacent character strings by utilizing the number of characters in the character strings.
However, many documents, such as slips, are formatted as tables using ruled lines, and the above described conventional technology has the problem that there is little probability that a title can be successfully extracted from a table.
For example, when a title is written at the center or around the bottom in a table, the title may not be correctly extracted only by extracting character strings from the top by priority. Furthermore, as shown in FIG. 1B, an approval column 11 is located at the top in the table. If there are a number of excess character strings such as ‘general manager’, ‘manager’, ‘sub-manager’, ‘person in charge’, etc. in the approval column 11, then these character strings are extracted by priority, thereby failing in correctly extracting the title.
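The character-count condition can be sketched as a simple filter over adjacent character strings. The function name and the two thresholds (`max_item_chars`, `min_title_chars`) are hypothetical values chosen for illustration; the source states only that item names tend to be short and titles tend to be long.

```python
def find_item_title_pairs(strings, max_item_chars=8, min_title_chars=12):
    """Return (item_name, title) pairs of adjacent strings in which a short
    string is immediately followed by a long one."""
    pairs = []
    for left, right in zip(strings, strings[1:]):
        # A short string followed by a long one suggests an item name
        # such as 'Subject' followed by the title itself.
        if len(left) <= max_item_chars and len(right) >= min_title_chars:
            pairs.append((left, right))
    return pairs


candidates = find_item_title_pairs(
    ["Subject", "Report relating to annual sales", "Name", "Sato"]
)
```

Because the filter looks only at the order and length of adjacent strings, it carries no information about whether the title lies to the right of or below the item name.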
As shown by a combination of an item name 12 and a title 13, a title may be written below the item name 12, not on the right hand side of the item name 12. In this case, the relative positions of the item name and the title cannot be recognized only according to the information about the number of characters of adjacent character strings. Furthermore, item names are written not only horizontally but also vertically in Japanese. Therefore, it is very hard to correctly specify the position of the item name. When a document contains two tables, the title may be located somewhere in a smaller table.
Since a document containing tables can be written in various formats, the probability of a title depends on each document, and the precision of extracting a title in a table is lowered. If the state of an input document image is not good, the extraction precision is lowered further.
In an electronic filing system, an extracted title area is character-recognized by an optical character reader (OCR) to generate a character code and add it to the image as management information. Thus, the image in a database can be retrieved using a character code.
In this case, there is no problem if the character string in a title area is readable by an OCR. However, if the background has a textured pattern or the characters are written in designed fonts, the current OCR cannot recognize the character string. Therefore, in this case, management information cannot be added to an image.
The present invention aims at providing an apparatus and method of extracting appropriate management information for use in managing an image in a document in various formats, and an apparatus and method of accumulating images according to the management information.
An image management system having the management information extraction apparatus and the image accumulation apparatus according to the present invention includes a user entry unit, a computation unit, a dictionary unit, a comparison unit, an extraction unit, a storage unit, a group generation unit, and a retrieval unit.
According to the first aspect of the present invention, the computation unit computes the position of the management information contained in an arbitrary input image according to the position information about the position of a ruled line relative to the outline portion of a table area contained in the input image. The extraction unit extracts the management information from the input image based on the position computed by the computation unit.
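One way to read the first aspect is that a position expressed relative to the outline of the table area can be mapped onto any input image once that image's table outline is known. The sketch below illustrates this reading only; representing boxes as `(x, y, width, height)` tuples and normalizing by the outline's width and height are assumptions made for the example, not details stated in this summary.

```python
def to_relative(box, outline):
    """Express a box as fractions of the table outline's size and origin."""
    ox, oy, ow, oh = outline
    x, y, w, h = box
    return ((x - ox) / ow, (y - oy) / oh, w / ow, h / oh)


def to_absolute(rel, outline):
    """Map a relative box back onto the outline found in an input image."""
    ox, oy, ow, oh = outline
    rx, ry, rw, rh = rel
    return (ox + rx * ow, oy + ry * oh, rw * ow, rh * oh)


# Position recorded on a reference form, then applied to an input image
# whose table outline has a different offset and scale.
rel = to_relative((150, 100, 200, 50), (100, 50, 400, 200))
title_box = to_absolute(rel, (20, 10, 800, 400))
```

Under these assumptions the computed position tolerates translation and scaling of the table between the reference form and the input image.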
In the second aspect of the present invention, the dictionary unit stores the features of the structures of the ruled lines of one or more table forms, and the position information about the management information in each of the table forms. The comparison unit compares the feature of the structure of the ruled lines of the input image with the feature of the structure of the ruled lines stored in the dictionary unit. The extraction unit refers to the position information about the management information stored in the dictionary unit based on the comparison result from the comparison unit, and extracts the management information about the input image. The user entry unit enters the position of the management information specified by the user in the dictionary unit.
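The interplay of the dictionary unit, the comparison unit, and the extraction unit can be sketched as a lookup keyed on a ruled line feature. Everything concrete here is hypothetical: the feature is assumed to be a pair of counts (horizontal and vertical ruled lines), the distance is a simple sum of absolute differences, and the names `FORM_DICTIONARY`, `match_form`, and `lookup_title_position` are invented for the example.

```python
# Hypothetical dictionary: one entry per registered table form.
FORM_DICTIONARY = {
    "expense_slip": {"feature": (5, 4), "title_position": (0.10, 0.20, 0.50, 0.10)},
    "order_form": {"feature": (9, 6), "title_position": (0.30, 0.05, 0.40, 0.08)},
}


def match_form(feature, dictionary, max_distance=1):
    """Return the name of the closest registered form, or None if no form
    is within max_distance of the input feature."""
    best_name, best_dist = None, None
    for name, entry in dictionary.items():
        dist = sum(abs(a - b) for a, b in zip(feature, entry["feature"]))
        if best_dist is None or dist < best_dist:
            best_name, best_dist = name, dist
    if best_dist is not None and best_dist <= max_distance:
        return best_name
    return None


def lookup_title_position(feature, dictionary):
    """Return the stored position of the management information for the
    matched form, or None when no registered form matches."""
    name = match_form(feature, dictionary)
    return dictionary[name]["title_position"] if name else None
```

In this sketch, registering a new form through the user entry unit would amount to adding an entry whose `title_position` is the position specified by the user.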
According to the third aspect of the present invention, the storage unit stores image information as management information for an accumulated image. The retrieval unit retrieves the image information.
According to the fourth aspect of the present invention, the storage unit stores ruled line information about a table form. The group generation unit obtains a plurality of possible combinations between the ruled line extracted from an input image and the ruled line contained in the ruled line information in the storage unit, and extracts a group containing two or more compatible combinations from the plurality of combinations in such a way that no combinations of another group can be contained. The comparison unit compares the input image with the table form according to the information about combinations contained in one or more extracted groups.