The following relates to the information processing arts, information retrieval arts, object classification arts, object recognition arts, object comparison arts, word spotting arts, and related arts.
Automated processing of handwritten documents can facilitate processing of forms, enable detection of documents relating to a certain subject, and so forth. Such processing begins by generating a digital image of the handwritten document, for example by optical scanning. In a technique known as optical character recognition (OCR), the digital image is processed to identify characters, such as letters or digits in the case of English textual content. For standardized text printed by a typewriter or marked by a printer using a computer-generated font, OCR can typically identify most or all of the text. However, OCR is less effective for handwritten textual content because handwriting varies far more than standardized molded typewriter letters or standardized computer type fonts. Substantial handwriting variation exists even for a single human writer, and such variations are magnified further for documents written by numerous different human writers, as is often the case for bulk processing of filled-out business or government forms.
Word spotting employs a fundamentally different approach from OCR. Instead of attempting to identify each letter in order to convert the document image into text and then processing the OCR text, word spotting operates at the word or phrase level, in the image domain, to identify or “spot” words that are similar to a standard or representative word. Word spotting advantageously utilizes information about the overall form or shape of a word, and accordingly can be more accurate for handwritten textual content as compared with OCR. Moreover, word spotting is computationally efficient because it does not convert each individual letter of a word or phrase.
Word spotting is used, for example, to sort or classify documents by identifying one, two, or a few keywords or key phrases in a document. Although described herein with respect to handwritten text, word spotting can also be applied to computer-generated or typewritten text in image form.
The basic word spotting process for handwritten documents entails: (1) optical scanning of the document pages; (2) segmenting the images to identify discrete word or phrase images; (3) extracting a quantitative representation of each word or phrase image; and (4) comparing the quantitative representations with a reference representation that was extracted from a word or phrase image (or a training set of word or phrase images) known to correspond to the word or phrase to be spotted. In a typical approach for the extraction operation (3), feature vectors are extracted at various points along the word or phrase image, with the points selected from left to right (or from right to left) so that an ordered sequence of feature vectors is generated. The comparison operation (4) then employs any suitable comparative measure for comparing the input and reference sequences of feature vectors.
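As an illustration of the extraction operation (3), the following minimal sketch slides a narrow window from left to right across a binarized word image and emits one feature vector per window position. The function name `extract_feature_sequence`, the three features chosen (ink density, upper contour, lower contour), and the window parameters are assumptions made for this example only; they are not prescribed by the process described above.

```python
from typing import List, Sequence


def extract_feature_sequence(
    image: Sequence[Sequence[int]],
    window_width: int = 2,
    step: int = 1,
) -> List[List[float]]:
    """Return a left-to-right ordered sequence of feature vectors.

    `image` is a list of pixel rows with 1 for ink and 0 for background.
    Each feature vector is [ink density, upper contour, lower contour],
    where the contour positions are row indices normalized by image height.
    The feature choice is illustrative (hypothetical), not a fixed standard.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    sequence: List[List[float]] = []
    for left in range(0, width - window_width + 1, step):
        cols = range(left, left + window_width)
        # Rows that contain at least one ink pixel inside the window.
        ink_rows = [r for r in range(height) if any(image[r][c] for c in cols)]
        ink_count = sum(image[r][c] for r in range(height) for c in cols)
        density = ink_count / (height * window_width)
        if ink_rows:
            upper = ink_rows[0] / height
            lower = ink_rows[-1] / height
        else:
            # Empty window: no contour exists, so use zeros as a placeholder.
            upper = lower = 0.0
        sequence.append([density, upper, lower])
    return sequence
```

Calling the function on a small word image yields one vector per window position, in reading order, which is the form of input expected by the comparison operation (4).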
A commonly used comparative measure is the dynamic time warping (DTW) distance measure, described for example in Sakoe et al., “Dynamic programming algorithm optimization for spoken word recognition”, IEEE Transactions on Acoustics, Speech and Signal Processing, pp. 159-65 (1978), which is incorporated herein by reference in its entirety. This distance measure finds a best-matching alignment between the two sequences and then accumulates a vector-by-vector comparison along that alignment. The DTW distance measure is computationally efficient, and accounts for the elastic deformation of the handwritten word or phrase which is sometimes observed in handwriting comparisons. However, DTW utilizes a limited amount of information, namely the information contained in the input and reference vector sequences. DTW also does not provide a convenient way to incorporate a training set of reference word or phrase images.
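The DTW computation can be sketched as a short dynamic program. The version below is a textbook formulation with a Euclidean local distance and no path constraints or length normalization; the names `dtw_distance` and `euclidean` are chosen for this example, and the slope constraints and weighting discussed by Sakoe et al. are deliberately omitted.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def euclidean(u: Vector, v: Vector) -> float:
    """Local vector-by-vector distance (an assumed choice for this sketch)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def dtw_distance(
    seq_a: Sequence[Vector],
    seq_b: Sequence[Vector],
    local_distance: Callable[[Vector, Vector], float] = euclidean,
) -> float:
    """Accumulated local distance along the best alignment of two sequences."""
    n, m = len(seq_a), len(seq_b)
    if n == 0 or m == 0:
        raise ValueError("both sequences must be non-empty")
    # cost[i][j] = best accumulated cost aligning seq_a[:i] with seq_b[:j].
    cost: List[List[float]] = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = local_distance(seq_a[i - 1], seq_b[j - 1])
            # Extend the cheapest of the three admissible predecessor cells.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]
```

Because a vector in one sequence may be matched to several consecutive vectors in the other, a stretched copy of a sequence aligns with the original at zero cost, which is how the measure absorbs elastic deformation. The sketch also makes the limitation noted above visible: the only inputs are the two sequences themselves.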
Jebara et al., “Spectral clustering and embedding with hidden Markov models”, in the 2007 European Conference on Machine Learning (ECML 2007), which is incorporated herein by reference, describes a probability product kernel (PPK)-based distance measure. In this approach, the extraction operation (3) employs a continuous hidden Markov model (C-HMM), i.e. an HMM with continuous state emission probabilities. The state emission probabilities are generally Gaussian mixture models (GMMs). The input and reference word or phrase models are thus C-HMMs. The comparison operation (4) then computes distances between the C-HMMs of the input and reference word or phrase images using a PPK. In this model-based approach, a reference model can be generated based on a training set of handwritten samples of the reference word or phrase, thus providing a substantial advantage over the DTW measure. However, the approach of Jebara et al. is computationally intensive, because the PPK computation essentially entails computing distances between states, i.e. between GMMs.
Moreover, neither DTW nor the approach of Jebara et al. incorporates a priori information about handwriting in general. That is, neither approach takes advantage of the knowledge that each handwritten word or phrase is made up of a set of handwritten letters or digits (or, more generally, symbols) chosen from an alphabet or other set of a priori known symbols. The inability to utilize this knowledge can adversely impact computational speed, accuracy, or both.
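To give a sense of the state-to-state computation involved, the sketch below evaluates a probability product kernel between two single Gaussians with diagonal covariance, using the exponent 1/2 (the Bhattacharyya kernel), for which a closed form exists. This is only an assumed building block for illustration: it is not the full C-HMM kernel of Jebara et al., which must combine such terms over pairs of states and over mixture components. The function name and argument layout are hypothetical.

```python
import math
from typing import Sequence


def bhattacharyya_kernel_diag_gaussians(
    mean_p: Sequence[float],
    var_p: Sequence[float],
    mean_q: Sequence[float],
    var_q: Sequence[float],
) -> float:
    """Integral of sqrt(p(x) * q(x)) for two diagonal-covariance Gaussians.

    Returns a value in (0, 1]; it equals 1 only when p and q coincide.
    All variances must be strictly positive.
    """
    log_k = 0.0
    for m1, v1, m2, v2 in zip(mean_p, var_p, mean_q, var_q):
        # Per-dimension closed form, accumulated in the log domain:
        # sqrt(2*s1*s2 / (v1 + v2)) * exp(-(m1 - m2)^2 / (4 * (v1 + v2)))
        log_k += 0.5 * math.log(2.0 * math.sqrt(v1 * v2) / (v1 + v2))
        log_k -= (m1 - m2) ** 2 / (4.0 * (v1 + v2))
    return math.exp(log_k)
```

Even this simplest case costs one pass over the feature dimensions per pair of Gaussians. Repeating it for every pair of mixture components in every pair of states illustrates why the model-based comparison is considerably more expensive than DTW.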
While word spotting is used herein as an illustrative example, the following relates more generally to the quantitative comparison of objects by (1) modeling the objects with ordered vector sequences and (2) computing a distance measure or other quantitative comparison of the ordered vector sequences. Such quantitative comparison finds application in speech recognition (where the objects are audio sequences representing spoken words or phrases), image content spotting (where the objects may be image portions showing an object of interest), biological sequence processing (where the objects may be DNA sub-sequences, protein sub-sequences, or so forth), as well as online and offline handwriting recognition applications such as the word spotting described above.