This specification relates to shape clustering and optical character recognition.
Optical character recognition (OCR) uses computer software, which will be referred to generically as an OCR engine, to process digital images of printed, typewritten, handwritten, or other written text, whether originally on paper, microfilm, or another medium, and to produce machine-readable and editable text from the images. The digital image of a document processed by an OCR engine may include images of multiple pages of written material. The images of the text to be processed by the OCR engine may be obtained by various imaging methods, including the use of an image scanner to capture digital images of the text.
An OCR engine generally produces rectangular bounding boxes intended to enclose collectively the text written on each page. Generally, when the document image has grayscale or color information, the OCR engine binarizes the image so that each image pixel is determined to be either a foreground pixel (e.g., black text) or a background pixel (e.g., a white region). Each bounding box normally encloses one or more connected groups of text pixels of one character perceived by the OCR engine, but may also overlap part of, or in extreme cases all of, an adjacent character. In such situations, several methods exist to separate the pixels identified by the OCR engine as belonging to the interior of the bounding box from those that belong to a different but overlapping bounding box. These methods include generating mask images by thresholding and connected component analysis, constructing outline polygons, and constructing parallelogram bounding boxes. An OCR engine generally assigns to each bounding box one or more OCR character codes. Each OCR character code identifies one or more characters that the engine has recognized in the bounding box. If an OCR engine fails to recognize any character in a bounding box, it may assign no OCR character code to the bounding box. Each character identified by an OCR character code can be represented in a standard character encoding, e.g., an ASCII or Unicode encoding.
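As an illustration of the binarization and bounding-box steps described above, the following is a minimal sketch that uses only the Python standard library. It is not the method of any particular OCR engine. The function names `binarize` and `bounding_boxes`, the fixed threshold of 128, and the use of 4-connectivity are assumptions made for the example. For simplicity the sketch produces one box per connected group of foreground pixels, whereas an actual engine may merge several groups (e.g., the dot and stem of an "i") into one box.

```python
from collections import deque


def binarize(gray, threshold=128):
    """Map each gray value to 1 (dark foreground) or 0 (light background).

    The fixed threshold is an assumption; real engines often adapt it.
    """
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def bounding_boxes(binary):
    """Return one (top, left, bottom, right) box, with inclusive bounds,
    for each 4-connected group of foreground pixels."""
    rows = len(binary)
    cols = len(binary[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if binary[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill from this unvisited foreground pixel.
            top, left, bottom, right = r, c, r, c
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and binary[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


# A tiny grayscale page with two dark vertical strokes.
page = [
    [255, 255, 255, 255, 255],
    [255,   0, 255,  10, 255],
    [255,   0, 255,  10, 255],
    [255, 255, 255, 255, 255],
]
boxes = bounding_boxes(binarize(page))
```

In this sketch each stroke yields its own rectangular box. Handling overlap with an adjacent character, as discussed above, would require additional steps such as mask images or outline polygons, which are not shown.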
Each bounding box can be thought of as a clipping path that isolates a portion or small image of the document image, whether in its original form or in a binarized form. Because these small images can be thought of as being clipped from the document image by their respective bounding boxes, these small images will be referred to as clips or clip images. Because each clip image is tied to a bounding box, the OCR character code or codes, and hence the character or characters, assigned to a bounding box can also be referred to or identified as the codes or the characters assigned to the clip image. Unless otherwise noted, the term clip or clip image will refer to an image that is a portion of a document image and that is processed for character recognition by an OCR engine.
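The relationship between a bounding box, its clip image, and the assigned OCR character codes can be sketched as a small data structure. This is an illustrative example only. The `Clip` class, the `clip_image` function, and the (top, left, bottom, right) box layout with inclusive bounds are hypothetical names and conventions chosen for the sketch, not part of any OCR engine's interface.

```python
from dataclasses import dataclass, field


@dataclass
class Clip:
    """A clip image tied to its bounding box and OCR character codes."""
    box: tuple                                   # (top, left, bottom, right), inclusive
    pixels: list                                 # rows of pixel values cut from the page
    codes: list = field(default_factory=list)    # empty if nothing was recognized


def clip_image(image, box, codes=()):
    """Cut the region enclosed by `box` out of `image` (a list of rows)."""
    top, left, bottom, right = box
    pixels = [row[left:right + 1] for row in image[top:bottom + 1]]
    return Clip(box=box, pixels=pixels, codes=list(codes))


# A binarized page fragment containing an L-like shape.
image = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
recognized = clip_image(image, (1, 1, 2, 2), codes=["L"])
unrecognized = clip_image(image, (1, 1, 2, 2))
```

Because the codes are stored with the clip, the characters assigned to a bounding box can equally be read as the characters assigned to the clip image. An empty `codes` list models the case in which the engine assigns no OCR character code.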
An OCR engine may make errors during the processing. For example, an OCR engine may improperly segment the original image by, e.g., including only a portion of a character in a bounding box, or including in a single bounding box multiple characters that are recognized as a single character. As another example, an OCR engine may assign an incorrect OCR character code to a bounding box, either because the clip image enclosed by the bounding box resembles a reference image for a different character code or because the digital images received by the OCR engine are of poor quality.