A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.
The present invention relates to an image/text filtering system and method for use with optical character recognition (OCR) techniques.
The use of word processors and personal computers in the office automation marketplace has dramatically increased in the past several years. With this growth has come an awareness that optical character recognition (OCR) machines can aid productivity by decreasing the time needed to enter typed documents into word processors, personal computers and databases.
The documents fed into OCR equipment consist mainly of typed text, although image features such as company logos, signatures, editing marks, graphs, and pictures are not unusual. Consequently, it is important that OCR machines be sophisticated enough not to let image features degrade recognition throughput or, even worse, let image features be recognized as, or interfere with, valid text. The present invention provides a hardware/software system to identify and erase image regions from digitized text documents prior to character recognition.
Earlier OCR systems have achieved varying success with image filtering by following one of two methods. The first approach relies on the fact that text characters are generally separated from adjacent characters and can be easily isolated, whereas images usually have comparatively longer strings of contiguous pixels. This is a good technique, but it is computationally intensive, especially in systems that have limited memory to devote to video buffers. In addition, the large number of image fragments that can be generated by this method must be rejected during the recognition process. Throughput in these systems can fall off dramatically when documents with large numbers of small, isolated image elements are being processed. Image fragments that have been mistaken for valid text must then be edited by the user. Another drawback is that on some documents valid text is ignored because of its proximity to image fragments.
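By way of illustration only, the first approach may be sketched as follows. The sketch assumes a binary bitmap held as a list of rows (1 for a black pixel, 0 for white), isolates groups of contiguous pixels by a flood fill, and labels each group by the size of its bounding box. The function names and the `max_char_size` threshold are hypothetical and are not taken from any particular earlier system.

```python
from collections import deque


def connected_components(bitmap):
    """Return a list of components, each a list of (row, col) black pixels.

    Pixels are grouped using 4-connectivity (up, down, left, right).
    """
    rows, cols = len(bitmap), len(bitmap[0])
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if bitmap[r][c] and not seen[r][c]:
                seen[r][c] = True
                queue = deque([(r, c)])
                pixels = []
                while queue:
                    y, x = queue.popleft()
                    pixels.append((y, x))
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if (0 <= ny < rows and 0 <= nx < cols
                                and bitmap[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                components.append(pixels)
    return components


def classify_component(pixels, max_char_size=3):
    """Hypothetical rule: a bounding box larger than max_char_size in
    either dimension is treated as image, anything smaller as text."""
    ys = [y for y, _ in pixels]
    xs = [x for _, x in pixels]
    height = max(ys) - min(ys) + 1
    width = max(xs) - min(xs) + 1
    return "image" if max(height, width) > max_char_size else "text"


# A small isolated blob (character-like) and a long contiguous stroke.
bitmap = [
    [1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 1, 1, 1, 1, 0],
]
labels = [classify_component(p) for p in connected_components(bitmap)]
```

In this sketch the entire bitmap and a same-sized `seen` array are held in memory at once, which reflects the buffer demands noted above. A small, isolated image fragment would also fall under the size threshold and be passed on as text, which corresponds to the fragment-rejection burden described for this method.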
The second approach initially segments the document into large regions by vertical and horizontal smearing and then attempts to classify these regions as text or image by using statistical attributes of the region size and internal pixel distribution. This technique requires more computation than the first, but because the algorithm is fairly regular, a faster hardware implementation is possible. There are three main drawbacks to this method. First, if a text area and an image area on a document overlap (or in some cases are just in close proximity to each other), they will be identified as one block, thus creating a classification error for a potentially large portion of the document. Second, the statistical attributes of a region are sometimes misleading, again causing a classification error. Third, text completely surrounded by image regions may be classified as image. Despite these drawbacks, this technique works well over a large variety of documents.
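By way of illustration only, the smearing step of the second approach may be sketched as follows. The sketch assumes that smearing means filling short runs of white pixels lying between two black pixels, applied along rows and then along columns of a binary bitmap (1 for black, 0 for white). The function names and the `max_gap` parameter are hypothetical; how the two passes are combined and how the resulting regions are classified statistically are not shown.

```python
def smear_row(row, max_gap):
    """Fill runs of white (0) pixels no longer than max_gap that lie
    between two black (1) pixels. Returns a new list."""
    out = list(row)
    last_black = None
    for i, pixel in enumerate(row):
        if pixel:
            if last_black is not None and 0 < i - last_black - 1 <= max_gap:
                for j in range(last_black + 1, i):
                    out[j] = 1
            last_black = i
    return out


def smear_horizontal(bitmap, max_gap):
    """Apply smear_row to every row of the bitmap."""
    return [smear_row(row, max_gap) for row in bitmap]


def smear_vertical(bitmap, max_gap):
    """Apply smear_row to every column by transposing, smearing, and
    transposing back."""
    columns = [list(col) for col in zip(*bitmap)]
    smeared = [smear_row(col, max_gap) for col in columns]
    return [list(row) for row in zip(*smeared)]


# Two nearby marks separated by a 2-pixel gap merge into one run;
# the mark 4 pixels further along stays separate.
row = [1, 0, 0, 1, 0, 0, 0, 0, 1]
merged = smear_row(row, max_gap=2)
```

Each output pixel depends only on a simple scan along one row or column, which is consistent with the regularity that makes a hardware implementation attractive. The same sketch shows the first drawback: any text and image marks that lie within `max_gap` of each other are joined into a single region.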