1. Field of the Invention
The present invention relates to a document reader device for optically reading characters on a document and converting them into code information, and particularly to a tabular document reader device for recognizing a string of characters on a tabular document such as a list of names that is formatted in a table.
2. Description of the Related Art
To prepare a database from a list of names (a roster) or an address book, a scanner reads the roster or the address book and provides image data of the names and addresses. Based on the image data, a document reader extracts the characteristics of the characters contained in the image data and recognizes the characters. There is, however, a need to improve the recognition accuracy of the document reader.
Although optical character recognition techniques and their accuracy are improving, the accuracy of recognizing poor-quality characters, such as badly printed characters and deformed handwritten characters, is still not acceptable.
Some languages, including Japanese, use many kinds of characters and involve similar characters such as " " and " " and visually identical characters such as "カ" (KA) in Japanese Katakana and "力" (CHIKARA) in Japanese Kanji. These characters often degrade recognition accuracy.
If the accuracy of directly recognizing characters is poor as mentioned above, it is necessary to limit an object to be recognized. For example, in reading telephone numbers, only numerals will be recognized. It is also effective to carry out, after a first cycle of a recognition process, a postprocess using word information and context information to select a proper choice among several candidates.
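The two measures described above can be illustrated with a minimal Python sketch. It is not taken from any particular reader device: it assumes that a first recognition cycle yields, for each character position, a list of hypothetical (character, score) candidates, and the function names `restrict`, `best_string`, and `pick_word` are illustrative only.

```python
from itertools import product

DIGITS = frozenset("0123456789")


def restrict(candidates, allowed):
    """Drop every candidate whose character is outside the allowed set."""
    return [[(ch, s) for ch, s in pos if ch in allowed] for pos in candidates]


def best_string(candidates):
    """Take the highest-scoring candidate at each position ('?' if none remain)."""
    return "".join(
        max(pos, key=lambda c: c[1])[0] if pos else "?" for pos in candidates
    )


def pick_word(candidates, dictionary):
    """Postprocess: among all candidate combinations, return the dictionary
    word with the highest total score, or None if no combination is a word."""
    best, best_score = None, float("-inf")
    for combo in product(*candidates):
        word = "".join(ch for ch, _ in combo)
        score = sum(s for _, s in combo)
        if word in dictionary and score > best_score:
            best, best_score = word, score
    return best


# Assumed first-cycle output for a telephone-number fragment.
phone = [[("O", 0.6), ("0", 0.5)], [("l", 0.7), ("1", 0.6)], [("2", 0.9)]]
unrestricted = best_string(phone)                 # letters win on raw score
numerals_only = best_string(restrict(phone, DIGITS))

# Assumed first-cycle output for a word, checked against word information.
word = [[("c", 0.6), ("e", 0.5)], [("a", 0.9)], [("t", 0.8), ("r", 0.7)]]
chosen = pick_word(word, {"eat", "ear"})
```

In this sketch the exhaustive search over combinations is practical only for short strings; it is used here purely to keep the example small.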
Limiting the object to be recognized and using the postprocess have so far been achieved only in optical character recognition (OCR) of formatted documents such as order slips. A formatted document has a preprinted format, and the positions of the characters that are written on the document and read by the document reader are fixed. By defining the attributes of the characters to be written in each blank of the formatted document, it is possible to limit what should be recognized in each blank as well as to use the postprocess with word information provided for each blank.
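As a rough sketch of how such per-blank attributes might be represented, the following Python fragment defines a hypothetical order slip. The `Blank` class, the field names, the pixel coordinates, and the word list are all assumptions made for illustration and do not come from any actual form definition.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Blank:
    name: str
    box: tuple                        # (left, top, right, bottom), fixed in advance
    allowed: frozenset                # characters that may appear in this blank
    words: Optional[frozenset] = None  # word information for the postprocess


# Hypothetical formatted document: every blank has a known position and attribute.
ORDER_SLIP = [
    Blank("item", (50, 100, 380, 130),
          frozenset("abcdefghijklmnopqrstuvwxyz"),
          frozenset({"bolt", "nut", "washer"})),
    Blank("quantity", (400, 100, 480, 130), frozenset("0123456789")),
]


def blank_at(form, x, y):
    """Return the blank whose box contains the point (x, y), or None."""
    for blank in form:
        left, top, right, bottom = blank.box
        if left <= x < right and top <= y < bottom:
            return blank
    return None


# A character found at (410, 110) lies in the quantity blank,
# so only numerals need to be considered for it.
hit = blank_at(ORDER_SLIP, 410, 110)
```

Because the lookup depends entirely on positions fixed in advance, the same approach does not carry over to a document whose layout is not known beforehand.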
If a document to be read is not a formatted document, it is impossible to limit the object to be recognized in advance and it is difficult to use the postprocess with word information. It may be possible to carry out a general postprocess for general documents. In this case, however, it is impossible to limit the object to be recognized to a specific word group, so that the recognition accuracy may not be improved.
Some documents such as lists of names are not completely formatted but have ruled lines between items of repeatedly appearing names, addresses, telephone numbers, etc., of the lists. In this sort of tabular document, it is possible to predetermine the attribute of each item according to the kinds of documents.
In this case, however, the kind of each tabular document must be fixed in advance. Otherwise, the tabular document cannot be processed in the same manner as the formatted document, because attributes such as the position of each item and the characteristics of the written items cannot be defined in advance. In other words, some particular types of tabular documents can be handled in the same manner as the formatted document, but other types of tabular documents cannot.
To solve the above-described problems, an object of the present invention is to provide a document reader that achieves an improved accuracy in reading tabular documents such as lists of names and addresses.