1. Field of the Invention
The present invention relates to an apparatus for extracting a text region in a document image, which automatically discriminates and extracts a text region from a document image containing a mixture of text, drawings and pictures. The apparatus eliminates the preparatory input work that an operator would otherwise perform to designate a text region before the character symbols in that text region are recognized by a character recognition apparatus.
2. Description of the Prior Art
A well-known prior art apparatus first processes binary document image data in a simple manner in order to eliminate noise in the data. A peripheral distribution of filled (black) pixels in the document image is then estimated by means of a projection calculation, which requires only comparatively simple computational operations. Subsequently, a blank portion, that is, an unfilled pixel region where the peripheral distribution is substantially equal to zero, is detected to determine a boundary line between adjacent text regions.
However, this type of prior art apparatus relies on detecting a blank portion where the peripheral distribution is substantially equal to zero. Consequently, when a document image having a complex layout is to be recognized, it may happen that no portion of the peripheral distribution has a projection calculation value equal to zero, even though a blank portion exists as a boundary between adjacent text regions. In this case, not all of the boundary lines can be detected automatically, and accordingly an operator must input preparatory information for extracting the text regions.
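The failure case can be made concrete with a small, hypothetical example. The layout below is an assumption chosen for illustration: a two-column area sits above a full-width block, so the blank gap between the two columns does not span the full height of the page.

```python
from typing import List

Image = List[List[int]]  # binary image: 1 = filled (black) pixel, 0 = unfilled


def column_profile(image: Image) -> List[int]:
    """Count filled pixels in each column (vertical projection)."""
    return [sum(col) for col in zip(*image)]


# Hypothetical page layout.
page = [
    [1, 1, 0, 1, 1],  # two-column area, blank gap at column index 2
    [1, 1, 0, 1, 1],
    [0, 0, 0, 0, 0],  # blank row separating the areas
    [1, 1, 1, 1, 1],  # full-width block that covers the gap column
]

# Projection over the whole page: the full-width block contributes to
# every column, so no column sums to zero.
whole_page = column_profile(page)
gap_found_whole = any(value == 0 for value in whole_page)

# Projection restricted to the two-column area: the gap is visible.
top_area = column_profile(page[:2])
gap_found_top = any(value == 0 for value in top_area)
```

Projected over the whole page, the column gap is masked by the full-width block, so a zero-valued search finds no vertical boundary even though one exists between the two columns. Restricting the projection to the upper area exposes the gap, but choosing that area is exactly the kind of preparatory information an operator would otherwise have to supply.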