1. Technical Field
The present invention relates to a technique of extracting information contained in a document by performing a character recognition process on an image of the document.
2. Related Art
Electronic filing, in which a paper document is scanned and the scanned document image is stored as an electronic document file, is widely used. In such electronic filing, it is common to perform optical character recognition (OCR) on the image obtained by scanning and to integrate the results of the character recognition into the electronic document file to increase retrievability. If the source material is a fixed-format document such as a form, the location on the document image of each information item (also called an “attribute”), such as an address or a charge on a debit note, is often known. Utilizing this knowledge, it is also common to read a character string at a specific position on the scanned document image and integrate the character string into the electronic document file as the value of a specific information item. To do this, the location (or range) of the value of each information item on the document is measured beforehand, and this location information is stored in an electronic filing device. The electronic filing device can thereby retrieve a desired character string from the specific location predetermined for each information item on the document image.
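The location-based reading described above can be illustrated with a minimal sketch. It assumes that the OCR step has already produced words with pixel bounding boxes; the names Box, Word, DEBIT_NOTE_TEMPLATE, SAMPLE_WORDS, and extract_items, as well as all coordinates, are hypothetical and chosen only for illustration. A word is assigned to an information item when the center of its bounding box falls inside the region stored for that item.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in image pixel coordinates."""
    left: int
    top: int
    right: int
    bottom: int

    def center(self):
        return ((self.left + self.right) // 2, (self.top + self.bottom) // 2)

    def contains(self, x, y):
        return self.left <= x <= self.right and self.top <= y <= self.bottom


@dataclass(frozen=True)
class Word:
    """One recognized word and where it sits on the document image."""
    text: str
    box: Box


# Hypothetical template: the pre-measured region of each information item.
DEBIT_NOTE_TEMPLATE = {
    "address": Box(50, 100, 400, 140),
    "charge": Box(450, 300, 600, 330),
}

# Hypothetical OCR output for one scanned debit note.
SAMPLE_WORDS = [
    Word("DEBIT", Box(50, 20, 120, 50)),
    Word("12", Box(60, 105, 80, 125)),
    Word("Main", Box(90, 105, 140, 125)),
    Word("Street", Box(150, 105, 220, 125)),
    Word("$150.00", Box(460, 305, 540, 325)),
]


def extract_items(words, template):
    """Collect, for each item, the words whose centers lie in its region."""
    items = {}
    for name, region in template.items():
        inside = [w for w in words if region.contains(*w.box.center())]
        # Simplification: order by top edge, then left edge. Real OCR output
        # would need line grouping that tolerates small vertical jitter.
        inside.sort(key=lambda w: (w.box.top, w.box.left))
        items[name] = " ".join(w.text for w in inside)
    return items


if __name__ == "__main__":
    print(extract_items(SAMPLE_WORDS, DEBIT_NOTE_TEMPLATE))
```

The sketch also shows the limitation of this approach: the template must be measured per form layout, so it does not apply when the locations of information items are unknown.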
In electronic filing of a document in which the locations of information items are unknown, a further technique for increasing retrievability is used: a keyword is extracted from the result of character recognition, and the extracted keyword is integrated into the electronic document file.
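As a rough illustration of the keyword approach, the sketch below picks the most frequent non-trivial words from recognized text and stores them alongside the document record. The text above does not specify how keywords are selected; frequency counting with a small stop-word list is an assumption made here, and the names STOPWORDS, extract_keywords, and file_document are hypothetical.

```python
import re
from collections import Counter

# Assumed stop-word list; a real system would use a fuller, language-specific one.
STOPWORDS = frozenset(
    {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on", "this"}
)


def extract_keywords(ocr_text, top_n=3):
    """Return the top_n most frequent words, ignoring stop words and short tokens."""
    tokens = re.findall(r"[a-z]+", ocr_text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS and len(t) > 2)
    return [word for word, _ in counts.most_common(top_n)]


def file_document(image_path, ocr_text):
    """Build a record that attaches the OCR text and keywords to the image."""
    return {
        "image": image_path,
        "text": ocr_text,
        "keywords": extract_keywords(ocr_text),
    }


if __name__ == "__main__":
    text = "Invoice for printer toner. The toner invoice is due; toner ships Monday."
    print(file_document("scan.png", text)["keywords"])
```

Such keywords improve retrievability, but unlike the location-based reading they are not tied to a specific information item such as an address or a charge.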