1. Field of the Invention
The present invention relates to a pattern extraction apparatus and a pattern extraction method, and is specifically applicable to a case where a box and a ruled line indicating the range of a pattern containing characters, graphics, symbols, images, etc. are extracted in a hand-written character recognition apparatus, a printed character recognition apparatus, a graphics recognition apparatus, etc.
2. Prior Art Technology
Recently, there has been an increasing demand for a hand-written character recognition apparatus such as an optical character reader as a peripheral unit for inputting financial documents, business documents, etc.
A conventional optical character reader performs a character segmenting process on each character of a character pattern from an input image before recognizing a character. To attain a high character recognition rate for each character, an optical character reader has to correctly segment a character as a pre-recognition process.
Therefore, to attain a high recognition rate, a conventional optical character reader requires that each character be written in a specified range of a document, such as a listing in which the character input position is specified (not with a drop-out color but with, for example, a black rectangular box or a ruled line of a color or density similar to that of a character).
However, the conventional optical character reader has the problem that the character recognition rate is low because a character cannot be correctly segmented when the ruled line or rectangular box indicating a specified input range touches or intersects the character. For example, a current optical character reader cannot recognize a slight obliqueness, concavity, or convexity of a rectangular box when the rectangular box is removed. As a result, if the position or the line width of a rectangular box is changed, a part of a character to be recognized may be lost or a part of the rectangular box may remain unremoved.
When a range for inputting characters in a listing is specified, the information about the position and the thickness of a ruled line must be stored in advance, and the information about the character input range must be updated whenever the listing format is changed. Therefore, the conventional system places a heavy burden on the user. Furthermore, a system that relies on a specified character range cannot process an unknown listing format.
In the previous Japanese patent application (Tokuganhei) No. 7-203259, the Applicant proposed a technology for extracting and removing a rectangular box without inputting format information about the position or size of the rectangular box. The listings to which this technology is applicable include a single rectangular box, a block rectangular box (containing a single horizontal row of characters, or a free-format rectangular box), and a table of rectangular boxes with regularly arranged horizontal lines. Furthermore, the technology can process listings having no rectangular tables, listings having more complicated table structures, and listings in which dotted lines and solid lines coexist.
Described below is the outline of the process performed by the pattern extraction apparatus described in the specification and the attached drawings of the previous Japanese patent application (Tokuganhei) No. 7-203259.
First, an input image is labelled, and a partial pattern formed from pixels linked to each other in any of eight directions, that is, horizontally, vertically, and diagonally, is extracted as a linked pattern.
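The labelling step can be sketched as follows. This is a minimal illustration only, not the implementation of the cited application; the function name `label_linked_patterns` and the image representation (a list of rows in which 1 denotes a black pixel) are assumptions made for the example.

```python
from collections import deque


def label_linked_patterns(image):
    """Group black pixels (value 1) into 8-connected linked patterns.

    Returns a list of patterns, each a sorted list of (row, col) pixels.
    Hypothetical helper for illustration only.
    """
    rows = len(image)
    cols = len(image[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    patterns = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first search from an unvisited black pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                # Visit the eight neighbours: horizontal, vertical, diagonal.
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (dy or dx) and 0 <= ny < rows and 0 <= nx < cols \
                                and image[ny][nx] == 1 and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            patterns.append(sorted(pixels))
    return patterns
```

Because diagonal neighbours count as linked, two pixels that touch only at a corner fall into the same linked pattern.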
Then, the horizontal or vertical lines are thinned to reduce the difference in line thickness between a character and a rectangular box by performing a masking process on a linked pattern extracted by labelling an input image. In the masking process, the entire image of the linked pattern is scanned using two types of masks, that is, a horizontal mask and a vertical mask. The proportion of the mask occupied by the pattern is computed. If the proportion is above a predetermined value, then the entire mask is recognized as a pattern. If it is equal to or below the predetermined value, then vertical and horizontal elements are extracted by deleting the pattern in the mask.
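A minimal sketch of such a masking process is given below. The source does not state the mask sizes, the threshold, or how the mask is moved over the image; the sketch assumes non-overlapping mask positions, and the name `apply_mask` and its parameters are hypothetical.

```python
def apply_mask(image, mask_h, mask_w, threshold):
    """Scan a binary image with a mask_h x mask_w mask (illustrative sketch).

    If the proportion of black pixels inside the mask is above `threshold`,
    the whole mask area is set to black; otherwise it is cleared.
    Assumption: the mask is placed at non-overlapping positions.
    """
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for top in range(0, rows, mask_h):
        for left in range(0, cols, mask_w):
            cells = [(r, c)
                     for r in range(top, min(top + mask_h, rows))
                     for c in range(left, min(left + mask_w, cols))]
            black = sum(image[r][c] for r, c in cells)
            if black / len(cells) > threshold:
                for r, c in cells:
                    out[r][c] = 1
    return out


image = [[1, 1, 1, 0],
         [0, 1, 0, 0]]
horizontal = apply_mask(image, mask_h=1, mask_w=4, threshold=0.5)
vertical = apply_mask(image, mask_h=2, mask_w=1, threshold=0.5)
print(horizontal)  # [[1, 1, 1, 1], [0, 0, 0, 0]]
print(vertical)    # [[0, 1, 0, 0], [0, 1, 0, 0]]
```

In this example the wide mask keeps only the horizontal element and the tall mask keeps only the vertical element.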
Then, the masked pattern is divided into a plurality of pieces vertically or horizontally, and a contiguous projection value of the pattern is computed in each of the ranges divided vertically and horizontally. Based on the contiguous projection values, a line of a predetermined length, or a part of a straight line, is detected and approximated by a rectangle. A contiguous projection value is obtained by adding the projection value of a target row or a target column to the projection values of the rows or columns close to the target row or the target column.
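The contiguous projection defined above can be illustrated with a short sketch. The number of neighbouring rows that are added (the `reach` parameter) is not given in the source and is an assumption, as are the function names.

```python
def row_projection(image):
    """Ordinary projection: number of black pixels in each row."""
    return [sum(row) for row in image]


def contiguous_projection(projection, reach=1):
    """Add to each row's projection the projections of the rows within
    `reach` rows of it (hypothetical parameter, not from the source)."""
    n = len(projection)
    return [sum(projection[max(0, i - reach):min(n, i + reach + 1)])
            for i in range(n)]


# A slightly oblique line: it drops by one row halfway across.
image = [[1, 1, 1, 0, 0, 0],
         [0, 0, 0, 1, 1, 1],
         [0, 0, 0, 0, 0, 0]]
plain = row_projection(image)
print(plain)                         # [3, 3, 0]
print(contiguous_projection(plain))  # [6, 6, 3]
```

No single row contains the full six-pixel line, but the contiguous projection reaches six, which is why this kind of projection tolerates a slight obliqueness.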
Next, among the lines each forming part of a rectangle obtained by the contiguous projection method, adjacent lines forming part of a rectangle are combined into a long line. Thus, the obtained lines form an approximate rectangle, and can be recognized as candidates for horizontal or vertical ruled lines of a listing.
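The combination of adjacent partial lines into a long line might look like the following sketch. The rectangle representation `(top, bottom, left, right)`, the tolerances `max_gap` and `max_row_shift`, and the greedy left-to-right strategy are all assumptions; the source states only that adjacent lines are combined.

```python
def merge_segments(segments, max_gap=1, max_row_shift=1):
    """Greedily merge horizontally adjacent rectangles into longer ones.

    segments: iterable of (top, bottom, left, right), inclusive coordinates.
    Two rectangles are merged when the horizontal gap between them is at most
    `max_gap` columns and they lie within `max_row_shift` rows of each other.
    Illustrative sketch; the tolerances are hypothetical.
    """
    merged = []
    for top, bottom, left, right in sorted(segments, key=lambda s: s[2]):
        for i, (m_top, m_bottom, m_left, m_right) in enumerate(merged):
            close_in_x = left - m_right - 1 <= max_gap
            close_in_y = (top <= m_bottom + max_row_shift
                          and bottom >= m_top - max_row_shift)
            if close_in_x and close_in_y:
                # Extend the existing rectangle to cover the new segment.
                merged[i] = (min(top, m_top), max(bottom, m_bottom),
                             m_left, max(right, m_right))
                break
        else:
            merged.append((top, bottom, left, right))
    return merged


parts = [(5, 5, 0, 9), (6, 6, 10, 19), (20, 20, 0, 9)]
print(merge_segments(parts))  # [(5, 6, 0, 19), (20, 20, 0, 9)]
```

The first two parts are joined into one approximate rectangle two rows high, while the distant third part stays a separate candidate.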
Then, the horizontal or vertical lines recognized as candidates for ruled lines are searched to detect the left and right margins for the horizontal lines, and the upper and lower margins for the vertical lines.
Next, small patterns arranged at predetermined intervals are detected to extract dotted lines, and an approximate rectangle is obtained from the dotted lines in the same way as for the above described lines.
A set of two horizontal lines forming part of a rectangular box is determined from among the horizontal lines detected in the above described process. Two horizontal lines are sequentially extracted from the top. When the two extracted horizontal lines have the same length or the lower horizontal line is longer than the upper horizontal line, the two horizontal lines are recognized as a set of horizontal lines. Unless the two extracted horizontal lines have the same length or the lower horizontal line is longer than the upper horizontal line, the two lines are recognized as a set even if the lower line is shorter.
Then, from among the vertical lines detected in the above described process, those whose upper and lower ends both reach the above described set of two horizontal lines, recognized as a set of two horizontal ruled lines, are determined to be vertical ruled lines.
Then, the range of a rectangle encompassed by the above described set of two horizontal lines and the two vertical ruled lines both upper and lower ends of which reach the set of the two horizontal lines is extracted as a cell. A line forming part of the cell is recognized as a ruled line. A line not forming part of the cell is recognized as a pattern other than a ruled line.
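The determination of vertical ruled lines and the extraction of cells can be sketched as follows. The data representation, the `tolerance` parameter, and the name `extract_cells` are assumptions made for illustration; the source does not specify them.

```python
def extract_cells(upper_y, lower_y, vertical_lines, tolerance=1):
    """Find vertical ruled lines and cells between a pair of horizontal lines.

    upper_y, lower_y: row positions of the set of two horizontal lines.
    vertical_lines: iterable of (x, y_top, y_bottom) line candidates.
    A candidate is kept as a vertical ruled line only if both of its ends
    reach the horizontal lines (within `tolerance` rows, a hypothetical value).
    Returns (ruled_x_positions, cells), each cell as (top, bottom, left, right).
    """
    ruled = sorted(x for x, y_top, y_bottom in vertical_lines
                   if y_top <= upper_y + tolerance
                   and y_bottom >= lower_y - tolerance)
    # Each pair of neighbouring vertical ruled lines encloses one cell.
    cells = [(upper_y, lower_y, left, right)
             for left, right in zip(ruled, ruled[1:])]
    return ruled, cells


candidates = [(0, 0, 10), (5, 0, 10), (8, 3, 7), (12, 0, 10)]
ruled, cells = extract_cells(0, 10, candidates)
print(ruled)  # [0, 5, 12]
print(cells)  # [(0, 10, 0, 5), (0, 10, 5, 12)]
```

The candidate at x = 8 reaches neither horizontal line, so it forms no cell and would be treated as a pattern other than a ruled line.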
When the rectangle encompassed by the horizontal and vertical ruled lines determined in the above described process is further divided into smaller rectangular areas, the rectangle is newly defined as a table. By repeating the above described process, the rectangular areas are divided into still smaller rectangles.
Thus, according to the conventional technology, any table formed by rectangular areas can be processed regardless of whether the structure of the rectangular boxes is regular or irregular. The process can be performed on both solid lines and dotted lines as ruled lines.
However, the above described pattern extraction apparatus selects as a candidate for a ruled line an area having a high density of pixels. If characters are close to each other or touch each other, the density of the pixels becomes high around the characters, and the character area can be regarded as a candidate for a ruled line.
For example, in FIG. 1A, when a character string 201 is entered in a listing 200, the density of the pixels of the pattern in a rectangular area 202 is high. Therefore, the pattern is recognized as a candidate for a ruled line although it is part of the character string 201. However, since the rectangular area 202 does not touch any of the ruled lines forming the listing 200, the rectangular area 202 cannot form a cell. Therefore, the pattern can be recognized as not being a ruled line.
In FIG. 1B, when a character string 204 is entered in a listing 203, the density of the pixels of the pattern in a rectangular area 205 is high. Therefore, the pattern is recognized as a candidate for a ruled line although it is part of the character string 204. The rectangular area 205 touches vertical ruled lines 207 and 208, and the rectangular area 205 can form a cell with the vertical ruled lines 207 and 208, and a horizontal ruled line 206. Therefore, a part of the character string 204 is regarded as a ruled line, and it is difficult to correctly segment the character string 204, thereby causing the problem that a character cannot be correctly recognized.
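The weakness of a purely density-based criterion can be illustrated with a small sketch. The threshold value and the function names are hypothetical; the source states only that an area with a high density of pixels is selected as a candidate.

```python
def pixel_density(image, top, bottom, left, right):
    """Proportion of black pixels in a rectangular area (inclusive bounds)."""
    area = (bottom - top + 1) * (right - left + 1)
    black = sum(image[r][c]
                for r in range(top, bottom + 1)
                for c in range(left, right + 1))
    return black / area


def is_ruled_line_candidate(image, rect, threshold=0.8):
    """Density test only; `threshold` is an assumed value for illustration."""
    return pixel_density(image, *rect) >= threshold


image = [
    [1, 1, 1, 1, 1, 1, 1, 1],  # a genuine ruled line
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 1, 1, 1, 1],  # strokes of characters that nearly touch
]
print(is_ruled_line_candidate(image, (0, 0, 0, 7)))  # True
print(is_ruled_line_candidate(image, (2, 2, 0, 7)))  # True
```

Both rows pass the density test, so density alone cannot separate the touching characters from the ruled line.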
The present invention aims at providing a pattern extraction apparatus for correctly determining whether or not a pattern is a ruled line.
The pattern extraction apparatus according to the present invention includes a pattern input unit; a convexity/concavity computation unit; a pattern distinction unit; a shift frequency calculation unit; a first search unit; a second search unit; a count unit; an obliqueness detection unit; a computation unit; an adjustment unit; a regulation unit; an intersection number count unit; a ruled line distinction unit; a character distinction unit; a linked pattern extraction unit; a ruled line candidate extraction unit; a search unit; a mask process unit; a cell area extraction unit; a ruled line removal unit; a segment detection unit; a straight line detection unit; a length computation unit; a length comparison unit; a convexity/concavity obtaining unit; a ruled line exclusion unit; a listing distinction unit; and a partial convexity/concavity computation unit.
According to the first aspect of the present invention, the pattern input unit inputs a pattern. The convexity/concavity computation unit computes the convexity and concavity of the above described pattern. The pattern distinction unit distinguishes the attribute of the above described pattern based on the convexity and concavity.
According to the second aspect of the present invention, the linked pattern extraction unit extracts a partial pattern formed by linked pixels from the input original image data. The ruled line candidate extraction unit extracts as a candidate for a ruled line a rectangular area having a high density of pixels from the above described partial pattern. The search unit searches the above described partial pattern in the rectangular area. The convexity/concavity computation unit computes the convexity/concavity of the above described partial pattern based on the search result of the above described search unit. The ruled line distinction unit distinguishes based on the above described convexity/concavity whether or not the above described partial pattern forms a ruled line.
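How the convexity/concavity is computed is not spelled out at this point in the text. One plausible reading, suggested only by the "shift frequency calculation unit" named above, is to count how often a search along the pattern has to shift to a different row. The sketch below follows that assumption; the function name, the use of the upper contour, and any threshold applied to the result are hypothetical and not taken from the source.

```python
def shift_count(image, top, bottom, left, right):
    """Count row shifts of the upper contour inside a rectangular area.

    Scans the columns from left to right, takes the uppermost black pixel in
    each column, and counts how often its row differs from the previous one.
    Assumed measure of convexity/concavity, for illustration only.
    """
    previous = None
    shifts = 0
    for c in range(left, right + 1):
        current = next((r for r in range(top, bottom + 1)
                        if image[r][c] == 1), None)
        if current is None:
            continue  # empty column: nothing to follow here
        if previous is not None and current != previous:
            shifts += 1
        previous = current
    return shifts


ruled_line = [[0, 0, 0, 0, 0, 0],
              [1, 1, 1, 1, 1, 1],
              [0, 0, 0, 0, 0, 0]]
characters = [[1, 0, 1, 0, 1, 0],
              [0, 1, 0, 1, 0, 1],
              [0, 0, 0, 0, 0, 0]]
print(shift_count(ruled_line, 0, 2, 0, 5))  # 0
print(shift_count(characters, 0, 2, 0, 5))  # 5
```

Under this assumed measure a straight ruled line yields few shifts while character strokes yield many, so a pattern with many shifts would be judged not to form a ruled line.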