1. Field of the Invention
The present invention generally relates to a method for recognizing characters and, more particularly, to a method for recognizing characters by categorizing the characters with typographical lines.
2. Description of Related Art
In this era of information explosion, people frequently have to read vast quantities of books, newspapers and journals. When a valuable section or an important point is found in an article, it is typically filed by photocopying or clipping, or marked directly with a pen. However, a person who frequently works with text must first type any useful data from an article into a computer before the data can be edited or filed. As a result, a great deal of time and labor is wasted.
To resolve this problem, optical recognition techniques have been developed so that a useful document can be scanned into a graphical file with a common scanner, and the characters in the graphical file can then be extracted by character recognition software and converted into corresponding digital characters. As a result, the user can quickly obtain an electronic file of the document for editing or processing. At present, optical recognition techniques are widely applied. For example, the filing of literary data in a library, the management of internal documents in an enterprise, and the recognition of identity cards and receipts can all be easily achieved by using optical recognition techniques. Therefore, not only can the data be accurately recognized, but a great deal of time and labor otherwise spent comparing and verifying data can also be saved.
Optical character recognition, commonly abbreviated as OCR, is mainly used for recognizing the characters of an existing paper document. First, the document to be recognized is scanned into a graphical file using a flatbed or a handheld scanner. Because of dirt on the document, blurred characters or the limited resolution of the scanner, some noise may exist in the input image and affect the accuracy of the subsequent character recognition. Therefore, the OCR software first performs tilt correction, noise removal and image edge sharpening on the graphical file of the scanned document. Next, the OCR software separates the graphics from the text in the processed graphical file, so that the words, graphics and tables in the document are all separated, and characters whose boundaries are unclear are correctly cut apart or combined. Thereafter, the OCR software performs a recognition process by comparing the graphical image of each character with the characters in a database. A corrective function then checks phrases and related words so that an accurate recognition result is output. The recognized characters can be directly saved in a Word, PDF or plain text file. As a result, not only can the burden of data input be reduced, but the speed and accuracy of data input can also be increased.
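The processing order described above (noise removal, separation of characters, comparison with a database) can be illustrated by a minimal sketch. It assumes a binary bitmap of a single text line, a blank column between adjacent characters, and a database that holds one bitmap per character. All names in it (`remove_isolated_pixels`, `split_characters`, `match_glyph`, `recognize`, `TEMPLATES`, `PAGE`) are hypothetical, and the pixel-by-pixel comparison is a simplified stand-in, not the method of any particular OCR software.

```python
from typing import Dict, List

Image = List[List[int]]  # rows of pixels: 1 = ink, 0 = background


def remove_isolated_pixels(img: Image) -> Image:
    """Noise removal: clear ink pixels that have no ink neighbour."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(h):
        for x in range(w):
            if img[y][x] == 0:
                continue
            neighbours = sum(
                img[ny][nx]
                for ny in range(max(0, y - 1), min(h, y + 2))
                for nx in range(max(0, x - 1), min(w, x + 2))
                if (ny, nx) != (y, x)
            )
            if neighbours == 0:
                out[y][x] = 0
    return out


def split_characters(img: Image) -> List[Image]:
    """Separation: cut the line wherever a column contains no ink."""
    w = len(img[0])
    has_ink = [any(row[x] for row in img) for x in range(w)]
    glyphs, start = [], None
    for x in range(w + 1):
        ink = x < w and has_ink[x]
        if ink and start is None:
            start = x  # a character begins at this column
        elif not ink and start is not None:
            glyphs.append([row[start:x] for row in img])
            start = None
    return glyphs


def match_glyph(glyph: Image, templates: Dict[str, Image]) -> str:
    """Comparison: pick the template with the most agreeing pixels."""
    def score(t: Image) -> int:
        if len(t) != len(glyph) or len(t[0]) != len(glyph[0]):
            return -1  # assumption: only same-sized bitmaps are compared
        return sum(a == b for rt, rg in zip(t, glyph) for a, b in zip(rt, rg))

    best = max(templates, key=lambda label: score(templates[label]))
    return best if score(templates[best]) >= 0 else "?"


def recognize(img: Image, templates: Dict[str, Image]) -> str:
    cleaned = remove_isolated_pixels(img)
    return "".join(match_glyph(g, templates) for g in split_characters(cleaned))


# Hypothetical two-character database and a scanned line with one noise pixel.
TEMPLATES: Dict[str, Image] = {
    "I": [[1], [1], [1]],
    "L": [[1, 0], [1, 0], [1, 1]],
}
PAGE: Image = [
    [1, 0, 0, 1, 0, 1],  # the last pixel in this row is noise
    [1, 0, 0, 1, 0, 0],
    [1, 1, 0, 1, 0, 0],
]
print(recognize(PAGE, TEMPLATES))
```

In this sketch a glyph whose size differs from every stored bitmap is reported as `?`, which hints at why a purely shape-based comparison struggles when the same shape appears at different sizes.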
However, some problems still exist in current OCR software. These problems often lead to recognition errors or to outright failure of recognition, and cause much inconvenience to the user. For example, the document may be inappropriately positioned during scanning, so that the scanned graphical file is tilted, inverted (horizontally shifted) or distorted in ratio (vertically shifted). In addition, when a character appears in different sizes but with the same shape, the large and small forms of the character cannot be distinguished, and punctuation marks, which have small shapes, are also difficult to recognize.