Generally, in language, a word is the smallest element that can stand on its own with meaning. A written language is the representation of a language by means of a writing system. An alphanumeric writing system may use a set of symbols, letters, and/or numbers to form a word. In a logographic writing system, a logogram, which is a single written character, is used to represent a complete grammatical word or morpheme, where a morpheme is the smallest semantically meaningful unit in a language. For example, some Chinese characters are logograms.
Text is the representation of written language. Printed text can be scanned to create an electronic image of the text, and optical character recognition (OCR) can then be applied to that image. OCR is the electronic conversion of scanned images into machine-encoded text. The converted machine-encoded text may then be electronically searched and/or used in various machine processes, such as text mining and machine translation. When an OCR application is run on a scanned image, boundary information for the text is created. In character recognition, a boundary can be a real or imaginary rectangle that serves as the delimiter between consecutive letters, numbers, and/or symbols in alphanumeric words and between lines in character words (e.g., Chinese character words). The boundary information can include the rectangular coordinates for the lines that make up Chinese character words and for the letters, numbers, and/or symbols in alphanumeric words.
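As a rough illustration of the boundary information described above, the following Python sketch stores one rectangle per recognized character and derives the rectangle that encloses a whole word. The `Box` type, the `enclosing_box` helper, and all coordinate values are hypothetical and chosen only for illustration; they do not reflect the output format of any particular OCR application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in image pixel coordinates (hypothetical layout)."""
    left: int
    top: int
    right: int
    bottom: int


def enclosing_box(boxes):
    """Return the smallest rectangle that contains every box in `boxes`."""
    return Box(
        left=min(b.left for b in boxes),
        top=min(b.top for b in boxes),
        right=max(b.right for b in boxes),
        bottom=max(b.bottom for b in boxes),
    )


# Illustrative values only: per-character boundaries for the word "cat".
char_boxes = [
    ("c", Box(10, 20, 18, 34)),
    ("a", Box(20, 20, 28, 34)),
    ("t", Box(30, 16, 36, 34)),
]

# The word-level boundary is the union of its character boundaries.
word_box = enclosing_box([box for _, box in char_boxes])
print(word_box)  # Box(left=10, top=16, right=36, bottom=34)
```

In this sketch the word boundary is simply the union of the character rectangles, so any error in a character rectangle carries over to the word rectangle.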
Typically, when a scanned image is of poor quality or contains logographic characters (e.g., Chinese characters), the OCR application may make mistakes in detecting the boundaries, and applications and processes that rely on the boundary information may generate incorrect results. For example, a Chinese character word or an alphanumeric word may be split into multiple parts, causing typographical and grammatical errors. Editors may spend a significant amount of time trying to detect and correct the errors.
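To make the effect of such a split concrete, the following Python sketch shows a simple exact-match search failing once a word has been divided into two boundary entries. The `find_word` helper, the tuple layout, and the coordinates are assumptions made up for this example, not part of any real OCR output.

```python
def find_word(ocr_words, query):
    """Return the indices of OCR word entries whose text equals `query`."""
    return [i for i, (text, _box) in enumerate(ocr_words) if text == query]


# Each entry is (recognized text, (left, top, right, bottom)); values are made up.
correct = [
    ("optical", (10, 20, 80, 34)),
    ("recognition", (90, 20, 200, 34)),
]

# Same line of text, but the boundary detection split "recognition" in two.
split = [
    ("optical", (10, 20, 80, 34)),
    ("recog", (90, 20, 140, 34)),
    ("nition", (143, 20, 200, 34)),
]

print(find_word(correct, "recognition"))  # [1]
print(find_word(split, "recognition"))    # []
```

With correct boundaries the word is found; with the split boundaries the same search returns nothing, which is the kind of incorrect downstream result an editor would otherwise have to find and fix by hand.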