1. Field of the Invention
The present invention relates to a document identification device suitable for use in a character recognition device that processes a plurality of documents. More particularly, it relates to a document identification device suitable for identifying, in accordance with a document definition, a plurality of documents in which the character strings to be identified are not arranged in fixed positions.
Furthermore, the present invention relates to a document definition method used in carrying out document identification in accordance with a plurality of identification items set for each of a plurality of documents to be identified, and to a document identification method for identifying a document in accordance with a document definition based on the plurality of identification items.
2. Description of the Related Art
Heretofore, in order for an OCR to read characters recorded on a plurality of documents, it has generally been necessary to use documents designed exclusively for OCR reading, on which an ID number for document identification is prerecorded (printed) in a predetermined position so that the OCR can identify the document. Format information (also called form information or format control information (FC)) corresponding to each ID number has been prestored in the OCR, and the format information corresponding to the identified document type (ID number) has been used to read characters from the target document. The format information includes, for example, information specifying the position on the document of each field in which characters to be read are recorded, the number of digits of the characters to be read, the character pitch, the type of character recognition dictionary used for reading each character type, and the like.
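The relationship between an ID number and its format information can be pictured as a simple lookup table. The following is a minimal sketch only, not the format actually used by any OCR: the names FieldFormat, FORMAT_TABLE and lookup_format, the ID values, and all field values are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class FieldFormat:
    """Hypothetical format information for one field on a document."""
    name: str
    position: Tuple[int, int, int, int]  # x, y, width, height of the field
    num_digits: int                      # number of digits (characters) to read
    char_pitch: float                    # spacing between adjacent characters
    dictionary: str                      # recognition dictionary to use


# Hypothetical prestored table: document ID number -> format information.
FORMAT_TABLE: Dict[str, List[FieldFormat]] = {
    "0001": [
        FieldFormat("amount", (120, 40, 200, 12), 8, 2.5, "numeric"),
    ],
    "0002": [
        FieldFormat("account", (30, 60, 150, 12), 10, 2.5, "numeric"),
        FieldFormat("name", (30, 80, 300, 12), 20, 3.0, "alphabetic"),
    ],
}


def lookup_format(document_id: str) -> List[FieldFormat]:
    """Return the format information registered for an ID number."""
    if document_id not in FORMAT_TABLE:
        raise KeyError(f"no format information registered for ID {document_id!r}")
    return FORMAT_TABLE[document_id]
```

In this sketch, reading a document amounts to recognizing the printed ID number first and then calling lookup_format with it to learn where each field is and how it should be read.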
On the other hand, there has recently been an increase in requests to read characters recorded on existing documents that were not designed exclusively for OCR reading. To read characters recorded on such existing documents, it has been necessary to register format information in the OCR, manually sort the existing documents by type into one bundle per type, and specify the format information for each bundle (one batch) before reading the characters.
Furthermore, there is a method in which coordinate information of the character writing sections (ruled lines) on each document and the format information are preregistered in the OCR in association with each other. Coordinate information of the ruled lines is obtained from the image of a document entered into the OCR and compared with the preregistered ruled-line coordinate information for each document in order to identify the document, and character reading is then executed based on the format information corresponding to the identified document. This method eliminates the need to sort documents manually by type before reading and to process documents of the same type in one batch during reading. Thus, documents of a plurality of types can be read while mixed together.
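The ruled-line comparison described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the method of any particular OCR: the representation of a ruled line as endpoint coordinates, the tolerance parameter tol, the greedy one-to-one matching, and the function names are all hypothetical. Ruled-line extraction from the image is assumed to have been done already.

```python
from typing import Dict, List, Optional, Tuple

# Assumed representation of a ruled line: endpoint coordinates (x1, y1, x2, y2).
Line = Tuple[int, int, int, int]


def lines_match(a: Line, b: Line, tol: int) -> bool:
    """Two ruled lines match if every coordinate agrees within tol."""
    return all(abs(p - q) <= tol for p, q in zip(a, b))


def layout_matches(extracted: List[Line], registered: List[Line], tol: int) -> bool:
    """Greedily pair each registered line with one unused extracted line."""
    if len(extracted) != len(registered):
        return False
    remaining = list(extracted)
    for reg in registered:
        hit = next((e for e in remaining if lines_match(e, reg, tol)), None)
        if hit is None:
            return False
        remaining.remove(hit)
    return True


def identify_document(
    extracted: List[Line],
    registry: Dict[str, List[Line]],
    tol: int = 3,
) -> Optional[str]:
    """Return the single registered document type whose ruled lines match.

    Returns None when no type matches or when more than one type matches
    (an assumption of this sketch, made so that ambiguity is not hidden).
    """
    matches = [
        doc_type
        for doc_type, lines in registry.items()
        if layout_matches(extracted, lines, tol)
    ]
    return matches[0] if len(matches) == 1 else None
```

The identified type would then be used to select the corresponding format information for character reading. Because the comparison relies only on ruled-line coordinates, its result depends on how accurately those lines were extracted and on the layouts being distinct across document types.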
However, in the above-described method, if the entered document image is blurred, ruled lines cannot always be extracted accurately, and the document may consequently be identified incorrectly. Moreover, documents of different types whose ruled-line positions are completely identical cannot be distinguished from each other.