Text line extraction is a key step in document image processing. Conventional methods for obtaining text lines from images fall into two main types. The first type uses layout analysis to separate text paragraphs from images and then extracts the text lines. The second type borrows the idea of text extraction from natural scene images. Reference can be made to the following relevant technical documents: F. Shafait, D. Keysers, T. Breuel, “Performance evaluation and benchmarking of six page segmentation algorithms”, IEEE Trans. on Pattern Analysis and Machine Intelligence, v 30, n 6, pp 941-954, Nov. 30, 2007 (hereinafter referred to as technical document 1); and E. Kim, et al., “Scene text extraction using focus of mobile camera”, Proceedings of the 10th International Conference on Document Analysis and Recognition, pp 166-170, Jul. 26-29, 2009, Barcelona (hereinafter referred to as technical document 2), the contents of both of which are incorporated herein by reference.
Here, the purpose of text line extraction is to determine the orientation of the scanned page by performing character recognition on the extracted text lines. The key requirements of the text line extraction are:
1. It is not necessary to extract all text lines from the image.
2. The extraction should be as fast as possible.
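As a non-limiting illustration of why these two requirements go together, the following Python sketch accumulates orientation votes from recognized text lines and stops as soon as one orientation clearly leads, so the remaining text lines are never extracted or recognized. It is only a sketch under stated assumptions, not the method of the present invention: the function name `decide_orientation`, the four candidate orientations, the `margin` threshold, and the form of each vote (an orientation in degrees paired with a recognition confidence) are all hypothetical and do not appear in the technical documents cited above.

```python
from typing import Iterable, Tuple

# Candidate page orientations in degrees (an assumption for this sketch).
ORIENTATIONS = (0, 90, 180, 270)


def decide_orientation(line_votes: Iterable[Tuple[int, float]],
                       margin: float = 2.0) -> Tuple[int, int]:
    """Return (best orientation, number of text lines consumed).

    Each vote is (orientation, confidence) for one extracted text line,
    as produced by some hypothetical character recognizer. Votes are
    consumed lazily, and processing stops once the leading orientation
    is ahead of the runner-up by at least `margin`.
    """
    scores = {o: 0.0 for o in ORIENTATIONS}
    used = 0
    for orientation, confidence in line_votes:
        scores[orientation] += confidence
        used += 1
        ranked = sorted(scores.values(), reverse=True)
        if ranked[0] - ranked[1] >= margin:
            break  # enough evidence; later text lines are not needed
    # With no votes at all, this falls back to the first candidate (0).
    best = max(scores, key=scores.get)
    return best, used


if __name__ == "__main__":
    votes = [(0, 0.9), (180, 0.4), (0, 0.8), (0, 0.9), (90, 0.5), (0, 0.7)]
    print(decide_orientation(votes))
```

With the sample votes shown, the decision is reached after four of the six text lines. If `line_votes` is a generator that extracts and recognizes one text line per iteration, the early exit also skips the extraction work for the unused lines.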
The conventional methods have difficulty meeting the above two requirements. Layout analysis based methods cannot meet the speed requirement. Moreover, they analyze the whole document image, so when the structure of the image is very complex, text line extraction usually fails. The second type of method (see technical document 2) is very fast, but it is mainly intended for extracting horizontal text lines from outdoor natural scene images. When the second type of method is applied to scanned document images, a major problem is how to find the correct direction of the text lines when the scanned document includes horizontal text lines, vertical text lines, and images. The object of the present invention is fast and reliable text line extraction from scanned document images.