Handwriting is one of the basic human communication tools, alongside speech, sign and expression. It is widely used in daily life. For example, people sign bank cheques and letters. Students acquire knowledge in class when teachers write lecture notes on the blackboard. Businesses recruit new employees by way of graphology, which is the study of handwriting shapes and patterns to determine the personality and behaviour of the writer. Although handwriting is such an efficient communication tool, little is known about how handwritten signals are mentally represented in the brain and what kinds of functional mechanisms underlie handwriting recognition. Automatic analysis and recognition of handwritten signals by computers can help us better understand this problem to some extent.
General off-line cursive handwriting recognition is a very challenging task, although considerable progress [1][2][3][4][5][6][7] has been made in this domain over the last few years. Most recognition systems [8][9] achieve good performance, but that performance depends greatly on the constraints imposed, such as contextual knowledge, size of the vocabulary, writing style and experimental conditions. Recently an off-line cursive recognition system dealing with large-vocabulary unconstrained handwritten texts has been investigated [10]. Instead of modelling a word, this recognition system models a handwritten line by integrating Hidden Markov models and N-gram word statistical language models, in order to avoid the problem of handwritten word segmentation and make efficient use of contextual information. Although the authors have shown that the use of language models improves the performance on some databases, the computational cost of this system is much higher than that of a system based on isolated words. It is well known that linguistic information plays an important role in cursive word recognition. From a biological point of view, computational efficiency is as important as accuracy in a human's recognition system. Therefore, a computer-based cursive word recognition system into which linguistic information is integrated should abide by the principle of computational efficiency.
Although a considerable number of off-line cursive handwriting recognition systems have been presented in the literature, the solutions to several key problems related to handwritten word recognition remain unknown. One of the most important problems is the representation of a cursive word image for a classification task. Intuitively, although a handwritten word is formed by concatenating, from left to right, characters drawn from a small set (52 characters in English), its shape exhibits many variations, which depend on the uncertainty of human writing. The boundaries between characters in a handwritten word are intrinsically ambiguous due to overlapping and inter-connections. The changes in the appearance of a character usually depend on the shapes of neighbouring characters (coarticulation effects). In the current literature, the representation methods for cursive words usually fall into the categories described hereunder.
The image of the given word is considered as a whole entity, and the difficult problem of segmenting a word into its individual characters is completely avoided. A word is characterized by a sequence of features such as length, loops, ascenders and descenders. No sub-models are used as a part of the classification strategy. The recognition method based on this representation is called the “holistic approach” (see a recent survey in [11]). This method can model coarticulation effects. However, no uniform framework for extracting those features is presented in the current literature. It is not clear how to solve the correspondence problem of feature points if some features are used as local shape descriptors. Moreover, the method does not make use of information from sub-models. As a result, information cannot be shared across different words. It is difficult to apply this method to cursive word recognition with a large lexicon, since the samples available for each word are not sufficient.
The word image is segmented into a sequence of graphemes in left-to-right order. A grapheme may be one character or a part of a character. After the segmentation, all possible combinations of adjacent graphemes, up to a maximum number, are considered and fed into a recognizer for isolated characters. Then a dynamic programming technique is used to choose the best sequence of characters. There are two problems related to this method. One is that segmentation and grapheme recombination are both based on heuristic rules that are derived by human intuition. They are error-prone. The other is that the proposed framework is not computationally efficient since a character recognizer has to be used to evaluate each grapheme combination. For a large lexicon, the computational cost is prohibitively high.
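The dynamic programming step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of any cited system: `toy_recognizer` and its score table are hypothetical stand-ins for an isolated-character recognizer, and the grapheme names are invented for the example.

```python
from typing import Callable, List, Sequence, Tuple

Group = Tuple[str, ...]

def best_character_sequence(
    graphemes: Sequence[str],
    score: Callable[[Group], Tuple[str, float]],
    max_group: int = 3,
) -> Tuple[str, float]:
    """Pick the best grouping of adjacent graphemes into characters.

    score(group) plays the role of an isolated-character recognizer and
    returns (best_label, log_score) for a tuple of adjacent graphemes.
    """
    n = len(graphemes)
    neg_inf = float("-inf")
    best = [neg_inf] * (n + 1)          # best[i]: best total for graphemes[:i]
    back: List[Tuple[int, str]] = [(-1, "")] * (n + 1)
    best[0] = 0.0
    for end in range(1, n + 1):
        for size in range(1, min(max_group, end) + 1):
            start = end - size
            if best[start] == neg_inf:
                continue
            # One recognizer call per candidate group: this is the costly part.
            label, log_score = score(tuple(graphemes[start:end]))
            if best[start] + log_score > best[end]:
                best[end] = best[start] + log_score
                back[end] = (start, label)
    labels = []
    i = n
    while i > 0:                        # trace the best path backwards
        start, label = back[i]
        labels.append(label)
        i = start
    return "".join(reversed(labels)), best[n]

# Hypothetical recognizer scores for a word written as five graphemes.
TOY_SCORES = {
    ("a1", "a2"): ("a", -0.1), ("n1", "n2"): ("n", -0.2), ("d",): ("d", -0.1),
    ("a1",): ("c", -1.0), ("a2",): ("i", -1.0),
    ("n1",): ("r", -1.5), ("n2",): ("i", -1.5),
}

def toy_recognizer(group: Group) -> Tuple[str, float]:
    return TOY_SCORES.get(group, ("?", -5.0))

word, total = best_character_sequence(
    ["a1", "a2", "n1", "n2", "d"], toy_recognizer, max_group=2
)
```

The recognizer is called once for every candidate group, which is where the computational cost noted above comes from; matching against each lexicon entry multiplies that cost further.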
Features are extracted in a left-to-right scan over the word by a sliding window. No segmentation is required. There are two main problems related to this method. One is that some topological information, such as stroke continuity, is partially lost, although stroke continuity is an important constraint for handwritten signals. The other is how to determine the optimal width of the sliding window. From a signal viewpoint, the method based on a sliding window can be regarded as a one-dimensional uniform sampling of a two-dimensional signal. In general, the appropriate sampling width depends on the sampling position, so some information is lost under uniform sampling.
The other important problem is how to integrate orthography (or phonology) into the recognition system effectively. It is known that orthography and phonology play important roles in human word reading [12][13]. Orthography and phonology impose strong constraints on cursive word recognition. Most of the existing methods use statistical language models, such as character (or word) N-grams, as a post-processing tool in the recognition. These language models are basically built from a large text corpus. To our knowledge, no work has been done to investigate how orthographic representations develop directly from primitive visual representations (word images, visual features).
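As an illustration of the sliding-window scheme, the sketch below scans a small binary image from left to right with a fixed window width and step. The two features per window (ink density and normalized vertical centroid) are assumptions chosen for brevity; the systems discussed here use richer observation vectors.

```python
from typing import List

def sliding_window_features(
    image: List[List[int]], width: int, step: int
) -> List[List[float]]:
    """Return [ink_density, vertical_centroid] for each window.

    image is a list of rows with 1 for ink and 0 for background.
    The centroid is normalized to [0, 1] and set to 0.0 for empty windows.
    """
    if width <= 0 or step <= 0:
        raise ValueError("width and step must be positive")
    height = len(image)
    n_cols = len(image[0]) if height else 0
    if n_cols == 0:
        return []
    features = []
    for left in range(0, max(n_cols - width, 0) + 1, step):
        cols = range(left, min(left + width, n_cols))
        ink_rows = [r for r in range(height) for c in cols if image[r][c]]
        density = len(ink_rows) / (height * len(cols))
        if ink_rows:
            centroid = sum(ink_rows) / len(ink_rows) / max(height - 1, 1)
        else:
            centroid = 0.0
        features.append([density, centroid])
    return features

# A 4 x 6 toy image: a blob on the left and a vertical bar on the right.
IMAGE = [
    [0, 0, 0, 0, 1, 0],
    [1, 1, 0, 0, 1, 0],
    [1, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
]
feats = sliding_window_features(IMAGE, width=2, step=2)
```

The middle window falls on background only, which illustrates how a uniform scan can produce windows with no perceptual meaning.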
In the following subsections, a general viewpoint of cursive word recognition from several disciplines, such as visual perception and linguistics, is first presented in order to understand the essential nature of this problem. Then the literature related to word skew/slant correction and word representation is reviewed.
A. Perspective of Cursive Word Recognition
1) Size of Vocabulary:
How many words are there in English? There is no single sensible answer to this question. It is impossible to count all words. English words have many inflections, such as the plural of a noun or the tense of a verb. Is “hot dog” really two words, since we may also find “hot-dog” or even “hotdog”? In addition, many words from other languages enter English, and new scientific terms are sometimes coined.
In order to obtain an approximate size, one can resort to the Oxford English Dictionary. The Second Edition of the Oxford English Dictionary (OED) [14] contains full entries for 171,476 words in current use, and 47,156 obsolete words. To this may be added around 9,500 derivative words included as subentries. Over half of these words are nouns, about a quarter adjectives, and about a seventh verbs; the rest is made up of interjections, conjunctions, prepositions, suffixes, etc. These figures take no account of entries with senses for different parts of speech (such as noun and adjective). This suggests that there are, at the very least, a quarter of a million distinct English words, excluding inflections and words from technical and regional vocabulary not covered by the OED, or words not yet added to the published dictionary, of which perhaps 20 percent are no longer in current use. If distinct senses were counted, the total would probably approach three quarters of a million.
Because the vocabulary is so large, it is impossible at the current stage to work on all word entries in the Oxford English Dictionary for research on cursive word recognition. We therefore have to choose a part of the vocabulary. One important criterion for word selection is word frequency, which can be calculated from a large language corpus. Some dictionaries, such as the Collins COBUILD Advanced Learner's English Dictionary, provide information about word frequency. Another strategy is to cluster the vocabulary according to some similarity measure and then to focus the research on cursive recognition within individual groups.
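Frequency-based vocabulary selection can be sketched as below. This is only an illustration: the tokenization rule and the one-sentence corpus are assumptions, and a real selection would be computed from a large corpus.

```python
import re
from collections import Counter
from typing import List

def select_vocabulary(corpus_text: str, size: int) -> List[str]:
    """Return the `size` most frequent word forms in corpus_text."""
    # Assumed tokenization: lowercase alphabetic runs, optional apostrophe part.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", corpus_text.lower())
    counts = Counter(tokens)
    return [word for word, _ in counts.most_common(size)]

top_words = select_vocabulary("The cat and the dog and the bird", 2)
```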
2) Cursive Word Visual Perception:
Visually, handwritten word images mainly consist of primitives such as lines, arcs, and dots. FIG. 1 shows some examples.
In FIG. 1, it can be observed that neighbouring characters are usually connected and that it is very difficult to segment a word image into character components. This suggests that crude segmentation by means of heuristic rules is not robust, due to the intrinsic ambiguity of character boundaries. From FIG. 1, we can also observe other characteristics of handwritten signals. For example, the characters ‘n’ and ‘r’ in images (h) and (l), respectively, are almost skipped. For images (i) and (k), it is difficult to identify individual characters. Intuitively, the useful information seems to exist in the global shapes, which are characterized by some extreme points. Image (g) would be identified as “aud” by pure shape recognition. But orthography imposes strong constraints on word identity, and humans can easily recognize it as “and”. This indicates that orthography plays an important role in cursive word recognition. The identity of word image (o) is ambiguous. It can be “care” or “case”. In this case, the contextual information in a sentence is required to identify it. Usually humans can identify most isolated words without the contextual information in a sentence. We can infer from this fact that the word image and orthography (and phonology) may provide enough information for recognition without higher-level linguistic information. From the computational viewpoint, the computational structure can then be modularized easily and the dependence between functional modules at different levels is reduced. As a result, computational efficiency is enhanced.
What is a good representation for cursive word recognition? Although the complete answer to this question is still unknown, we may obtain some clues from research in computer vision, psychology, and human reading. Marr [15] suggested that the representations underlying visual recognition are hierarchical and involve a number of levels during early visual processing. Each level involves a symbolic representation of the information in the retinal image. For example, the primal sketch defined by Marr consists of some primitives and makes explicit important information about two-dimensional images. Edges, blobs, contours and curvilinear organization contain useful information for visual recognition. A cursive word image is a binary 2D shape, which is not a function of depth. Moreover, it consists of line drawing patterns, such as lines and arcs. Important information such as curvatures, orientations, loops, global shape, and convex and concave properties can be derived from a word image contour. Biederman [16][17] proposed a theory of entry-level object recognition that assumes that a given view of an object is represented as an arrangement of simple, viewpoint-invariant, volumetric primitives called geons. The positional relationships among the geons are specified so that the same geons in different relations represent different objects. These geons are activated by local image features. This view of part-based representation is attractive for cursive word recognition. Although the size of the vocabulary is large, each word basically consists of a small number of letters. But letters in a word are possibly activated at a high-level stage, since at the image level it is hard to solve the segmentation problem. McClelland and Rumelhart [18] proposed an interactive activation model of word reading. Bottom-up and top-down processes are integrated into this model. This indicates that letter representation is driven by bottom-up (low-level features to letter) and top-down (word to letter) information. Learning must play an important role in the representation.
Is wavelet-based coding a good representation of cursive word image? Although wavelet-based coding is mathematically complete or over-complete, the wavelet code does not meet the explicit criteria [19]. A wavelet code is simply a linear transform of the original image into a set of new images. There is no interpretation or inference in the process. The structures and features are not explicitly represented. For example, for cursive word recognition, we know that loops and word length are useful information for recognition. It is hard to extract them from redundant wavelet codes.
3) Word Linguistic Information:
Words in the English language have a specified structure. These structural constraints are usually imposed by orthography (the way a word is spelled) and phonology (the way a word is pronounced). For words of length 10 (10 letters), although the maximal number of letter combinations is $26^{10}$, valid words exist only in a small subset. In the context of cursive word recognition, a statistical language model such as n-grams is usually used to improve the performance [10]. Those models usually have several shortcomings. First, the accuracy of language models is very sensitive to the text domain. When the domain is changed, the language model has to be trained with data from the new domain. In some cases, a large text corpus in the domain may not be available. Second, the orthographic information is not directly encoded from local image features. The statistical language model introduces extra complexity that may not be necessary to infer the identity of a word image. As a result, the system accuracy will be degraded. In our view, a connectionist approach could be applied to implement the nonlinear transformation from visual features to orthographic representations. FIG. 2 shows the transformation framework concept.
In FIG. 2, the phonology transformation network is enclosed with a dotted line. In current research, it is not very clear whether phonological information is applied to visual word recognition. The distribution code can be slot-based [18] or a relational unit [20][13]. In the first case, each letter goes to its corresponding position. In the second case, the relational unit (called a grapheme) consists of one or more letters (e.g. L, T, TCH). A word can be translated into several graphemes. When a multi-letter grapheme is present, its components are activated. For example, “TCH” will activate ‘T’ and ‘CH’. The main characteristic of the above representation is that the strength of the orthographic units' output depends not only on co-occurrence frequency but also on network structure and current input data.
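To make the orthographic constraint concrete, the sketch below counts the possible 10-letter strings and trains a character bigram model with add-one smoothing on a toy word list. It illustrates the conventional n-gram post-processing discussed above, not the connectionist approach suggested here; the word list and the smoothing choice are assumptions.

```python
import math
from collections import Counter
from typing import Iterable, Tuple

# Number of possible 10-letter strings over a 26-letter alphabet.
MAX_COMBINATIONS = 26 ** 10  # 141167095653376

def train_bigrams(words: Iterable[str]) -> Tuple[Counter, Counter]:
    """Count character bigrams, with '^' and '$' as start and end markers."""
    pair_counts: Counter = Counter()
    context_counts: Counter = Counter()
    for word in words:
        padded = "^" + word + "$"
        for a, b in zip(padded, padded[1:]):
            pair_counts[(a, b)] += 1
            context_counts[a] += 1
    return pair_counts, context_counts

def log_prob(word: str, pair_counts: Counter, context_counts: Counter,
             n_symbols: int = 27) -> float:
    """Add-one smoothed log-probability; 27 = 26 letters + end marker."""
    padded = "^" + word + "$"
    total = 0.0
    for a, b in zip(padded, padded[1:]):
        total += math.log(
            (pair_counts[(a, b)] + 1) / (context_counts[a] + n_symbols)
        )
    return total

# Hypothetical toy lexicon; a real model would be trained on a large corpus.
pairs, contexts = train_bigrams(["and", "hand", "land", "sand", "audio"])
score_and = log_prob("and", pairs, contexts)
score_aud = log_prob("aud", pairs, contexts)
```

On this toy list the model scores “and” above “aud”, mirroring the way orthography resolves the shape ambiguity of image (g) in FIG. 1. The dependence on the training list also illustrates the domain sensitivity noted above.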
4) Handwriting Process:
Handwriting production is a complex process which involves a large number of highly cognitive functions including vision, motor control and natural language understanding. The production of handwriting requires a hierarchically organized flow of information through different transforms [21]. FIG. 3 shows this process.
The writer starts with the intention to write a message. Then this message is converted to words by means of lexical and syntactical levels. During the writing, the writer plans to select the suitable allographs (shape variants of letters) in advance of the execution process. The choice may depend on the context of neighbouring allographs. This indicates that a visual feedback mechanism is involved in the writing process. Hollerbach [22] proposed an oscillatory motion model of handwriting. In this model, cursive handwriting is described by two independent oscillatory motions superimposed on a constant linear drift along the line of writing. The parametric form is given by:

\[
\begin{aligned}
x(t) &= A_x(t)\cos\bigl(\omega_x (t - t_0) + \phi_x\bigr) + C\,(t - t_0),\\
y(t) &= B_y(t)\cos\bigl(\omega_y (t - t_0) + \phi_y\bigr),
\end{aligned}
\tag{1}
\]

where $\omega_x$ and $\omega_y$ are the horizontal and vertical angular velocities, respectively, $A_x(t)$ and $B_y(t)$ are the horizontal and vertical amplitude modulations, respectively, $\phi_x$ and $\phi_y$ are the corresponding phases, and $C$ is the horizontal drift. In online handwriting, a general pen trajectory can be encoded by the parameters of the above model [23]. The simplified model indicates what kinds of information are important in cursive handwriting signals. This information can guide the extraction of features in offline cursive word recognition.
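The oscillatory model of equation (1) can be evaluated numerically as in the sketch below. The parameter values are arbitrary illustrative choices, not values from [22] or [23]; the amplitude modulations are passed as functions of time so that they may vary along the word.

```python
import math
from typing import Callable, Iterable, List, Tuple

def oscillatory_trajectory(
    times: Iterable[float],
    a_x: Callable[[float], float],
    b_y: Callable[[float], float],
    w_x: float,
    w_y: float,
    phi_x: float,
    phi_y: float,
    c: float,
    t0: float = 0.0,
) -> List[Tuple[float, float]]:
    """Evaluate x(t) and y(t) of equation (1) at the given times."""
    points = []
    for t in times:
        x = a_x(t) * math.cos(w_x * (t - t0) + phi_x) + c * (t - t0)
        y = b_y(t) * math.cos(w_y * (t - t0) + phi_y)
        points.append((x, y))
    return points

# Illustrative parameters: unit amplitudes, one oscillation per time unit,
# a quarter-period phase lag between x and y, and unit horizontal drift.
samples = [i / 20 for i in range(41)]          # t = 0.0, 0.05, ..., 2.0
trajectory = oscillatory_trajectory(
    samples,
    a_x=lambda t: 1.0,
    b_y=lambda t: 1.0,
    w_x=2 * math.pi,
    w_y=2 * math.pi,
    phi_x=-math.pi / 2,
    phi_y=0.0,
    c=1.0,
)
```

With these settings the pen traces a chain of loops drifting to the right, which resembles a row of cursive loops; changing the amplitudes or phases over time changes the letter shapes.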
The studies of handwriting have found that the atomic movement unit in handwriting is a stroke, which is a movement trajectory bounded by two points of high curvature (or a trajectory between two velocity minima [24]). Handwriting signals can be segmented reliably into strokes based on this method [24]. This important information indicates that neither letters nor graphemes are basic units at the stage of low-level feature extraction.
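Segmentation of an online pen trajectory into strokes at velocity minima can be sketched as follows. This is a simplified illustration, not the method of [24]: it assumes uniformly sampled pen positions, estimates speed by finite differences, and applies no smoothing, which real pen data would need.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]

def pen_speeds(points: Sequence[Point], dt: float = 1.0) -> List[float]:
    """Speed at each sample: central differences, one-sided at the ends.

    Requires at least two points.
    """
    n = len(points)
    speeds = []
    for i in range(n):
        lo, hi = max(i - 1, 0), min(i + 1, n - 1)
        (x0, y0), (x1, y1) = points[lo], points[hi]
        speeds.append(math.hypot(x1 - x0, y1 - y0) / ((hi - lo) * dt))
    return speeds

def segment_strokes(points: Sequence[Point], dt: float = 1.0) -> List[Tuple[int, int]]:
    """Split a trajectory into (start, end) index pairs at local speed minima."""
    if len(points) < 2:
        return []
    speeds = pen_speeds(points, dt)
    boundaries = [0]
    for i in range(1, len(speeds) - 1):
        if speeds[i] < speeds[i - 1] and speeds[i] <= speeds[i + 1]:
            boundaries.append(i)
    boundaries.append(len(points) - 1)
    return list(zip(boundaries, boundaries[1:]))

# Toy trajectory along the x axis: the pen speeds up, slows down near
# sample 5, then speeds up again, giving two strokes.
PEN = [(x, 0.0) for x in (0, 1, 3, 5, 6, 6.2, 7, 9, 11, 12)]
strokes = segment_strokes(PEN)
```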
5) Handwriting Analysis:
Handwriting analysis (also called graphology) is the study of handwriting shapes and patterns to determine the personality and behaviour of the writer. Graphologists (forensic document examiners) examine features such as loops, dotted “i's” and crossed “t's,” letter and line spacing, slants, heights, ending strokes, etc., and they believe that such handwriting features are physical manifestations of unconscious mental functions. There is a basic principle underlying graphology: handwriting is brain-writing. From the viewpoint of ecology, interactions of individual experiences and social environments have an effect on handwriting, since handwriting is a graphic realization of natural human communication. Although this area is less closely related to cursive word recognition than computer vision and psychology, it shows that many features are shared by different individuals. The features examined by graphologists could provide researchers in handwriting recognition with some guidance on feature extraction.
6) Summary:
Although there is no high-performance system for large-scale cursive word recognition, the above perspective suggests that the development of such a system should follow these principles:
Computation efficiency is as important as accuracy. Parallel computation is desirable.
The recognition system must be hierarchical. For example, the units such as strokes, graphemes, letters and words must be constructed in an increasing level.
The orthography information must be integrated directly into the system.
Biological relevance must be compatible with established facts from neuroscience.
Perceptual relevance must conform with well-established experiments and principles from Gestalt Psychology.
Most of the parameters must be obtained by learning.
B. Previous Studies for Word Skew/Slant Corrections
In most cursive word recognition systems, correcting the skew (deviation of the baseline from the horizontal direction; see FIG. 4(a)) and the slant (deviation of average near-vertical strokes from the vertical direction; see FIG. 4(b)) is an important pre-processing step [25]. The skew (also referred to below as slope) and the slant are introduced by writing styles. Both corrections reduce the writer-dependent variability of handwritten word shapes and help later operations such as segmentation and feature extraction.
For the skew and slant corrections, the crucial problem is to detect the skew and slant angles correctly. Once the two angles are found, skew and slant corrections are implemented by a rotation and by a shear transformation, respectively. In the literature, several methods have been proposed to deal with this problem. In [6], the horizontal and vertical density histograms are used to estimate the middle zone. Then a reference line is estimated by fitting through stroke local minima in the middle zone. In [26], the image contour is used to detect those minima. Marita et al. [27] proposed a method based on mathematical morphology to obtain a pseudo-convex hull image. Minima are then detected on the pseudo-convex image and a reference line is fitted through those points. The primary challenge for these methods is the rejection of spurious minima. Also, the regression-based methods do not work well on short words because too few minima points are available. The other approaches for the detection of the slope angle are based on the density distribution. In [28], several histograms are computed for different y (vertical) projections. Then the entropy is calculated for each of them. The histogram with the lowest entropy determines the slope angle. In [29], the Wigner-Ville distribution is calculated for several horizontal projection histograms. The slope angle is selected as the one whose Wigner-Ville distribution has the maximal intensity. The main problem for those distribution-based methods is a high computational cost, since the image has to be rotated for each angle. Also, these methods do not perform well for short words.
For slant estimation, the most common method is the calculation of the average slope of near-vertical strokes [1][5][7]. These methods use different criteria to select near-vertical strokes. The slopes of the selected strokes are estimated from contours. The main disadvantage of those methods is that many heuristic parameters have to be specified. Vinciarelli et al. [30] proposed a technique based on a cost function which measures slant absence across the word image. The cost function is evaluated on multiple shear-transformed word images. The angle with the maximal cost is taken as the slant estimate. Kavallieratou et al. [29] proposed a slant estimation algorithm based on the vertical projection profile of word images and the Wigner-Ville distribution. The approaches based on optimization are relatively robust. However, the above two methods are computationally heavy, since multiple shear-transformed word images corresponding to different angles in an interval have to be calculated.
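The search over shear angles used by optimization-based slant estimation can be sketched as follows. The cost used here, the sum of squared column ink counts after shearing, is a simplified stand-in chosen for illustration and is not the cost function of [30]; the candidate angle grid and the toy stroke are likewise assumptions.

```python
import math
from collections import Counter
from typing import Iterable, List, Tuple

Pixel = Tuple[int, int]  # (row, col), row 0 at the top

def shear(points: Iterable[Pixel], angle_deg: float, base_row: int) -> List[Pixel]:
    """Shear ink pixels horizontally about the baseline row.

    A positive angle undoes a rightward slant of that many degrees.
    """
    t = math.tan(math.radians(angle_deg))
    return [(r, round(c - (base_row - r) * t)) for r, c in points]

def slant_absence(points: Iterable[Pixel]) -> int:
    """Toy cost: large when ink is concentrated in few columns."""
    column_counts = Counter(c for _, c in points)
    return sum(n * n for n in column_counts.values())

def estimate_slant(points: List[Pixel], base_row: int,
                   candidates: Iterable[int] = range(-45, 46, 5)) -> int:
    """Return the candidate angle whose shear maximizes the cost."""
    # One shear of the whole image per candidate angle: the expensive part.
    return max(candidates,
               key=lambda a: slant_absence(shear(points, a, base_row)))

# Toy stroke of height 10 slanted 45 degrees to the right.
STROKE = [(r, 10 + (9 - r)) for r in range(10)]
angle = estimate_slant(STROKE, base_row=9)
upright = shear(STROKE, angle, base_row=9)
```

Every candidate angle requires shearing the whole image, which is the source of the computational cost noted above; a finer angle grid increases that cost proportionally.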
C. Previous Studies for Handwritten Word Representation
One of the important problems in cursive word recognition is to extract discriminative features. There has been extensive research on the extraction of features at different levels for handwritten words. Ascenders and descenders [31] and word length [32] are perceptual features in human reading. These features are usually used in holistic recognition of handwritten words [33]. But the accurate detection of these features becomes a challenge due to uneven writing and curved baselines. Highly local, low-level structural features, such as the stroke direction distribution based on image gradients [34], that have been successfully applied to character recognition are generally unsuitable for offline cursive word recognition due to wide variation in style. In [35], [10], a word image is represented as a sequence of slice windows from left to right. In each small window, an observation vector is extracted and used in Hidden Markov Models. Although this strategy attempts to avoid segmentation, it has several shortcomings. The slice window does not correspond to any perceptual unit such as a stroke or a letter. In most cases, these windows contain meaningless fragments, which are not compatible with the Gestalt principles of perception (similarity) [36]. Moreover, no inference is made during the process. In contrast to the method of slice windows, the other strategy is to segment a word image into graphemes that are ordered from left to right [9], [7] and then to extract geometrical and structural features for each grapheme. But the segmentation is usually done in the horizontal direction and is still one-dimensional, while handwriting is a two-dimensional signal.
A review of the above methods shows that none of them indicates where the important information is located in a word image or how to organize it efficiently. Research in psychology and computer vision has indicated that a good representation underlying visual recognition is hierarchical and involves a number of levels [15]. Edges, blobs, contours and curvilinear organization contain useful information for visual recognition. A cursive word image is usually a binary 2D shape, which is not a function of depth. Moreover, it consists of line drawing patterns such as lines and arcs. Since words are written by humans, handwriting signals may satisfy some physical constraints, such as Newton's laws of motion. When a word image is viewed as a 2D curve, most of the important information, such as curvatures, orientations, loops, global shape, and local convex and concave properties, can be derived from an image contour. The corners on a contour exhibit more invariant properties than other points.
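As a small illustration of deriving local shape information from a contour, the sketch below computes the signed turning angle at each vertex of a closed polygonal contour and marks vertices with a large turning angle as corners. The polygonal representation, the threshold, and the toy square are assumptions; a practical detector would first smooth the contour and work at a chosen scale.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]

def turning_angles(contour: Sequence[Point]) -> List[float]:
    """Signed turning angle (radians) at each vertex of a closed contour."""
    n = len(contour)
    angles = []
    for i in range(n):
        (px, py), (cx, cy) = contour[i - 1], contour[i]
        nx, ny = contour[(i + 1) % n]
        ux, uy = cx - px, cy - py        # incoming direction
        vx, vy = nx - cx, ny - cy        # outgoing direction
        angles.append(math.atan2(ux * vy - uy * vx, ux * vx + uy * vy))
    return angles

def corner_indices(contour: Sequence[Point],
                   threshold: float = math.pi / 4) -> List[int]:
    """Indices of vertices whose absolute turning angle exceeds threshold."""
    return [i for i, a in enumerate(turning_angles(contour))
            if abs(a) > threshold]

# Toy contour: a square with an extra vertex at the middle of each side.
SQUARE = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (1, 2), (0, 2), (0, 1)]
corners = corner_indices(SQUARE)
```

The sign of the turning angle distinguishes convex from concave vertices for a fixed traversal direction, which relates to the convex and concave properties mentioned above.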