1. Field of the Invention
This invention relates to document image processing, and in particular, it relates to word segmentation, i.e. segmenting an image of a text line into sub-images corresponding to words.
2. Description of Related Art
A “document image” refers to a digital image representing a document which includes a substantial amount of text. For example, a document image may be generated by scanning a hard copy document, taking a photograph of a hard copy document, converting a text-based electronic document (e.g. a Word™ document) into an image format (e.g. PDF™), etc. “Document image processing” refers to various kinds of processing performed on document images. One example of document image processing is optical character recognition (OCR), which aims to extract the textual content of the document. Another example is document authentication, which aims to determine whether a target document image is the same as an original document image or whether it has been altered.
In some document image processing methods, a document image is segmented at various levels into blocks (e.g. paragraphs of text, photos, etc.), text line segments, word segments, and/or symbol segments. These steps are sometimes referred to as paragraph (or block) segmentation, line segmentation, etc., and are collectively referred to as document segmentation. Here, paragraph segment, line segment, etc. refer to sub-images that represent a paragraph, line, etc. of the document. In this disclosure, a paragraph segment, line segment, etc. is sometimes simply called a paragraph, line, etc., but it should be clear from the context of the disclosure that these terms refer to sub-images rather than to the text of the paragraph, line, etc.
Word segmentation refers to segmenting lines into words. Many word segmentation methods are known. Some of these methods examine spacing segments (white spaces) in a text line to distinguish word spacing (space between neighboring words) from character spacing (space between neighboring characters within a word). For example, Soo H. Kim, Chang B. Jeong, Hee K. Kwag, Ching Y. Suen, “Word segmentation of printed text lines based on gap clustering and special symbol detection”, 16th International Conference on Pattern Recognition (2002) (hereinafter “Kim et al. 2002”), describes a method which applies a hierarchical clustering method to the spacing segments in a text line to distinguish word spacing from character spacing.
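As an illustration of what "spacing segments" may look like in practice, the following minimal sketch collects the runs of blank columns in a binarized text line image. It is not taken from Kim et al. 2002; the column-projection approach, the function name spacing_segments, and the 0/1 pixel convention are all assumptions made for this example.

```python
def spacing_segments(line_image):
    """Return (start_column, length) for each run of blank columns between ink.

    Assumption: line_image is a list of rows, each a list of pixels where
    1 means ink and 0 means background. Leading and trailing margins are
    not reported, since they are not spacings between characters or words.
    """
    if not line_image:
        return []
    width = len(line_image[0])
    # A column is treated as blank when no row has ink in it.
    has_ink = [any(row[x] for row in line_image) for x in range(width)]
    if True not in has_ink:
        return []
    first = has_ink.index(True)
    last = width - 1 - has_ink[::-1].index(True)

    gaps = []
    run_start = None
    for x in range(first, last + 1):
        if not has_ink[x]:
            if run_start is None:
                run_start = x          # a new blank run begins here
        elif run_start is not None:
            gaps.append((run_start, x - run_start))
            run_start = None           # the blank run ended at an ink column
    return gaps
```

The lengths of the returned runs are the values that a clustering step would then separate into character spacings and word spacings.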
Commonly owned U.S. patent application publication 2014/0270526, published Sep. 18, 2014 (hereinafter “the '526 application”), describes a word segmentation method which applies clustering analysis to the spacing segments of a line. Taking advantage of the bimodal distribution of spacing lengths in typical text lines, a k-means clustering algorithm is used, with the number of clusters pre-set to two, to classify the spacing segments into character spacings and word spacings. Moreover, k-means++ initialization is used to improve the performance of the cluster analysis.
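The general idea of two-cluster k-means on spacing lengths with k-means++ seeding can be sketched as follows. This is a simplified illustration and not the implementation of the '526 application; the function name classify_spacings, the seed and max_iter parameters, and the handling of lines with only one distinct spacing length are assumptions made for this example.

```python
import random


def classify_spacings(lengths, seed=0, max_iter=100):
    """Label each spacing length 'char' or 'word' using 1-D k-means with k = 2."""
    if len(set(lengths)) < 2:
        # Assumption: with fewer than two distinct lengths there is nothing
        # to separate, so every spacing is treated as a character spacing.
        return ['char'] * len(lengths)

    rng = random.Random(seed)
    # k-means++ seeding: the first center is drawn uniformly from the data,
    # the second with probability proportional to the squared distance
    # from the first center.
    c0 = rng.choice(lengths)
    weights = [(v - c0) ** 2 for v in lengths]
    c1 = rng.choices(lengths, weights=weights, k=1)[0]
    centers = sorted((float(c0), float(c1)))  # centers[0] is the smaller one

    labels = []
    for _ in range(max_iter):
        # Assignment step: each length goes to the nearer center.
        labels = [0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
                  for v in lengths]
        # Update step: each center becomes the mean of its members.
        new_centers = []
        for k in (0, 1):
            members = [v for v, lab in zip(lengths, labels) if lab == k]
            new_centers.append(sum(members) / len(members) if members else centers[k])
        if new_centers == centers:
            break  # assignments are stable
        centers = new_centers

    # The cluster with the larger center holds the word spacings.
    return ['word' if lab == 1 else 'char' for lab in labels]
```

For example, given the spacing lengths [2, 3, 2, 14, 3, 2, 15, 2, 3, 13, 2] (in pixels), the three long spacings are labeled 'word' and the rest 'char'.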