1. Field of the Invention
This invention relates to document image processing, and in particular, it relates to word segmentation, i.e. segmenting a document image into sub-images corresponding to words of the document.
2. Description of Related Art
A “document image” refers to a digital image representing a document which includes a substantial amount of text. For example, a document image may be generated by scanning a hard copy document, taking a photograph of a hard copy document, converting a text-based electronic document (e.g. a Word™ document) into an image format (e.g. PDF™), etc. “Document image processing” refers to various types of processing performed on document images. One example of document image processing is optical character recognition (OCR), which aims to extract the textual content of the document. Another example of document image processing is document authentication, which aims to determine whether a target document image is the same as an original document image or whether it has been altered.
In some document image processing methods, a document image is segmented at various levels into blocks such as paragraphs of text or photos, text line segments, word segments, and/or symbol segments. These steps are sometimes referred to as paragraph (or block) segmentation, line segmentation, etc., and are collectively referred to as document segmentation. Here, paragraph segment, line segment, etc. refer to sub-images that represent a paragraph, line, etc. of the document. In this disclosure, a paragraph segment, line segment, etc. is sometimes simply called a paragraph, line, etc., but it should be clear from the context of the disclosure that these terms refer to sub-images rather than the text of the paragraph, line, etc.
Word segmentation refers to segmentation of lines into words. Many word segmentation methods are known. Some of these methods examine spacing segments (white spaces) in a text line to distinguish word spacing (space between neighboring words) from character spacing (space between neighboring characters within words). For example, Soo H. Kim, Chang B. Jeong, Hee K. Kwag, Ching Y. Suen, “Word segmentation of printed text lines based on gap clustering and special symbol detection”, 16th International Conference on Pattern Recognition (2002) (hereinafter “Kim et al. 2002”), describes a method which applies a hierarchical clustering method to spacing segments in a text line to distinguish word spacing from character spacing. Commonly owned, co-pending patent application publication US 2014/0270526, published Sep. 18, 2014, describes a word segmentation method that uses a k-means clustering algorithm to classify the spacing segments as either character spacing or word spacing. Many word segmentation methods use vertical projections of the text line image to determine the locations and sizes of the white spaces before attempting to distinguish word spacing from character spacing.
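By way of illustration only, the general projection-and-clustering approach described above may be sketched as follows. This sketch is not the method of Kim et al. 2002 or of US 2014/0270526; it assumes a binarized text line image given as a list of rows of 0 (background) and 1 (ink) values, and it substitutes a plain two-cluster one-dimensional k-means on gap widths for whatever clustering those references actually use. The function names find_gaps, classify_gaps and segment_words are hypothetical.

```python
def find_gaps(line_image):
    """Return (start_column, width) for each interior run of blank columns.

    line_image: list of rows, each a list of 0 (background) / 1 (ink).
    Leading and trailing blank margins are not reported as gaps.
    """
    # Vertical projection: number of ink pixels in each column.
    projection = [sum(col) for col in zip(*line_image)]
    gaps = []
    start = None
    seen_ink = False
    for x, count in enumerate(projection):
        if count == 0:
            if seen_ink and start is None:
                start = x  # a blank run begins after some ink
        else:
            if start is not None:
                gaps.append((start, x - start))  # blank run closed by ink
                start = None
            seen_ink = True
    return gaps


def classify_gaps(widths, iterations=20):
    """Two-cluster 1-D k-means on gap widths.

    Returns a list of booleans, True where a gap is assigned to the
    wider cluster (assumed here to be word spacing).
    """
    if not widths:
        return []
    lo, hi = float(min(widths)), float(max(widths))
    if lo == hi:
        # Only one distinct width: nothing to separate, so assume
        # every gap is character spacing.
        return [False] * len(widths)
    for _ in range(iterations):
        small = [w for w in widths if abs(w - lo) <= abs(w - hi)]
        large = [w for w in widths if abs(w - lo) > abs(w - hi)]
        new_lo = sum(small) / len(small)
        new_hi = sum(large) / len(large)
        if (new_lo, new_hi) == (lo, hi):
            break  # centers stopped moving
        lo, hi = new_lo, new_hi
    return [abs(w - lo) > abs(w - hi) for w in widths]


def segment_words(line_image):
    """Return half-open column ranges (left, right) of the words in a line."""
    projection = [sum(col) for col in zip(*line_image)]
    ink = [x for x, count in enumerate(projection) if count > 0]
    if not ink:
        return []
    gaps = find_gaps(line_image)
    is_word_gap = classify_gaps([width for _, width in gaps])
    words = []
    left = ink[0]
    for (start, width), word_gap in zip(gaps, is_word_gap):
        if word_gap:
            words.append((left, start))
            left = start + width
    words.append((left, ink[-1] + 1))
    return words
```

For example, calling segment_words on a one-row image [[0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0]] finds gaps of widths 1, 4 and 1, treats only the width-4 gap as word spacing, and returns the column ranges (1, 4) and (8, 11). A projection of this kind depends on blank columns separating neighboring characters, which is the assumption that slanted text tends to violate.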
Connected component based methods have been used for word segmentation of italic text lines, but their computational cost is relatively high.