1. Field of the Invention
The present invention relates to a ruled line extracting apparatus for extracting a ruled line portion from an arbitrary document image read by a photoelectric converter or the like, and to a method thereof.
2. Description of the Related Art
In recent years, demand has increased for electronic filing systems that convert a paper document into electronic form and store it on an optical disc or the like, in order to improve the efficiency of operations within a company. With a conventional electronic filing system, a paper document is converted into an image by a photoelectric converter such as an image scanner, and the image is stored on an optical disc or a hard disk with a search keyword attached. However, since the keyword must be input from a keyboard, the input operation is troublesome.
To overcome this troublesome operation, the present applicant previously filed “Title Extracting Apparatus for Extracting Title from Document Image and Method Thereof” (U.S. patent application Ser. No. 08/694,503, Japanese patent application H7-341983). With this method, a document title included in an image is automatically extracted and registered as a keyword. Additionally, management information such as a title, a destination, and a transmitting source can be automatically extracted from various document images, including table format documents. For example, it has been shown that a title outside a table can be extracted with approximately 90% accuracy.
A title inside a table, however, can be extracted with only 55% accuracy, which is insufficient for practical use. To extract a keyword such as a title from inside a table with high accuracy, the ruled lines structuring the table must be accurately extracted. Techniques for extracting ruled lines have been developed mainly for spreadsheets in which characters and the like are regularly arranged.
As conventional techniques for extracting a ruled line, reference can be made to “Image Extracting Method” (Japanese patent laid-open H6-309498) and “Image Extracting Apparatus” (Japanese patent laid-open H7-28937). With these techniques, a frame can be extracted from or removed from a spreadsheet without requiring input of information such as the frame position. A spreadsheet which can be processed is a sheet composed of one-character frames or block frames (horizontal one-line frames or free format frames), or a sheet in which the frames are rectangular and the horizontal frame lines are regularly arranged.
Additionally, as techniques for extracting a ruled line according to former applications in Japan by the present applicant, reference can be made to “Frame Extracting Apparatus and Rectangle Extracting Apparatus” (Japanese patent application H7-203259), “Pattern Area Extracting Apparatus and Pattern Extracting Apparatus” (Japanese patent application H7-282171), and “Pattern Extracting Apparatus and Pattern Area Extracting Method” (Japanese patent application H8-107568).
With these techniques, a frame can be extracted or removed whether the outer periphery of the frames is rectangular, as shown in FIG. 1A, or not rectangular, as shown in FIG. 1B. Furthermore, the frame of a table formed by a rectangle that is enclosed by a frame and partitioned into smaller portions, like the shaded portion shown in FIG. 1B, can also be extracted and removed. The process is outlined below.
(1) thinning: with a mask process, horizontal and vertical segments are made thinner, so that the difference between the thickness of a character and that of a frame is eliminated.
(2) segment extraction: a relatively long straight line is extracted with the adjacency projection method according to the “Image Extracting Method” (Japanese patent laid-open H6-309498). The adjacency projection method takes, as the final projection value of a specific row or column, the sum of the projection value of the pixels in that row or column and the projection values of the pixels in the surrounding rows or columns. With this method, the pixel distribution around a particular row or column can be identified globally.
(3) straight line extraction: extracted segments are sequentially searched, and it is examined whether or not there is an empty space of a predetermined length between segments. If there is no such empty space, the segments are sequentially linked, so that a long straight line is extracted.
(4) straight line integration: extracted straight lines are again integrated. Straight lines separated into two or more portions due to a blur are integrated into one straight line.
(5) straight line extension: a straight line which has been made shorter due to a blur is extended and restored to its original length, only when the spreadsheet is determined to be regular.
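Steps (2) and (3) above can be illustrated with a minimal sketch. It is not the implementation of the cited applications: the image representation (a list of rows of 0/1 pixels), the function names `row_projection`, `adjacency_projection`, and `link_segments`, and the parameters `reach` and `max_gap` are all assumptions introduced here for illustration. The sketch shows only row-wise projection and one-dimensional linking of segments lying on the same line.

```python
from typing import List, Tuple

# Assumed representation: 1 = black pixel, 0 = white pixel.
Image = List[List[int]]


def row_projection(image: Image) -> List[int]:
    """Plain projection: the number of black pixels in each row."""
    return [sum(row) for row in image]


def adjacency_projection(image: Image, reach: int = 1) -> List[int]:
    """For each row, add the projection values of the rows within
    `reach` rows of it to the row's own projection value."""
    plain = row_projection(image)
    n = len(plain)
    result = []
    for i in range(n):
        lo = max(0, i - reach)          # clip the window at the image border
        hi = min(n, i + reach + 1)
        result.append(sum(plain[lo:hi]))
    return result


def link_segments(segments: List[Tuple[int, int]],
                  max_gap: int) -> List[Tuple[int, int]]:
    """Link segments on the same line, given as inclusive (start, end)
    positions, unless the empty space between them is `max_gap` pixels
    or longer."""
    linked: List[Tuple[int, int]] = []
    for start, end in sorted(segments):
        if linked and start - linked[-1][1] - 1 < max_gap:
            prev_start, prev_end = linked[-1]
            linked[-1] = (prev_start, max(prev_end, end))
        else:
            linked.append((start, end))
    return linked


# A slightly skewed horizontal line that straddles rows 1 and 2.
skewed = [
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 0, 0],
]
print(row_projection(skewed))                           # [0, 3, 3, 0]
print(adjacency_projection(skewed, reach=1))            # [3, 6, 6, 3]
print(link_segments([(0, 4), (6, 9), (20, 25)], 3))     # [(0, 9), (20, 25)]
```

In the example, the plain projection of each row reaches only half the line length, whereas the adjacency projection of rows 1 and 2 equals the full length of six pixels, which is why a skewed or slightly broken line is easier to detect. The one-pixel space between the first two segments is shorter than `max_gap`, so they are linked; the ten-pixel space before the third segment is not bridged.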
However, the above-described techniques have the following problems.
According to the techniques disclosed in the former applications, a spreadsheet can be processed whether the shape of its frame is regular or irregular, as long as it is a table frame composed of rectangular regions. A target ruled line can be processed whether it is a solid line or a dotted line, and regardless of the existence of a blur. Furthermore, a straight line which has been made shorter due to an extreme blur is extended only when the table is determined to be regular.
A normal input image may sometimes include characters of a thick font, or a shaded portion in a table, as shown in FIG. 1C. In such a case, a ruled line is erroneously extracted from a defaced character string in which characters touch one another, and ruled lines which are erroneously extracted may sometimes be integrated with correct ruled lines.
Additionally, a ruled line which touches a group of black pixels such as a shaded portion, or a ruled line which touches a character, cannot be extracted. To avoid these problems, it is desirable to restrict the process target to table documents, such as spreadsheets, whose ruled-line structure is known beforehand.
However, since it is unknown beforehand what type of table an ordinary document handled by electronic filing includes, there is a high probability that various images, including ones with defaced characters, are input. Accordingly, ruled lines are not necessarily extracted correctly by the techniques of the former applications as they stand.