Conventionally, in order to recognize word (or character) information or the like written on a paper document and automatically input the recognized data, a technique is used that identifies the type of the document based on document type identifying information stored in advance in a database. As used herein, the term “document type identifying information” refers to word information, ruled line information, an identification (ID) specifying the document type, or the like appearing on the document.
For example, Japanese Laid-open Patent Publication No. 2001-202466 discloses a technique that identifies the document type by matching grouped word strings, which are extracted based on a word recognition result obtained from input document data, against document type identifying keywords (words frequently used in each document type) stored in advance in the database for each document type.
The conventional technique described above, however, has a problem in that it fails to identify the document type accurately, for the following reasons.
Because the input document data includes many unwanted word strings, such as explanatory statements or remarks, it is difficult to extract exactly those grouped word strings that correspond to the document type identifying keywords stored in advance in the database. For example, when the word string “packing list form” is stored in the database as a document type identifying keyword for one of the document types, and the grouped word string “packing list form (and receipt)” is extracted from the received document data, the document type identifying keyword does not match the grouped word string, and the document type is identified with poor accuracy.
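The mismatch described above can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that the conventional matching is a whole-string comparison; the dictionary `KEYWORDS`, the function `identify_document_type`, and the label `packing_list` are hypothetical names and are not taken from the cited publication.

```python
from typing import Optional

# Hypothetical keyword database: document type identifying keyword -> document type.
KEYWORDS = {"packing list form": "packing_list"}


def identify_document_type(grouped_word_string: str) -> Optional[str]:
    """Return the document type whose keyword equals the extracted string exactly.

    Assumption: matching is a whole-string comparison, so any extra words
    in the extracted string prevent a match and None is returned.
    """
    return KEYWORDS.get(grouped_word_string)


# A string that contains only the title words matches the stored keyword.
clean = identify_document_type("packing list form")

# The same title with a trailing remark no longer matches any keyword.
noisy = identify_document_type("packing list form (and receipt)")
```

Under these assumptions, `clean` resolves to the stored document type while `noisy` resolves to no document type at all, even though both strings contain the identifying keyword.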
Further, when the input document data includes a word string consisting of three words, one of which is recognized incorrectly, the word recognition rate of this word string is approximately 67% (two of three words), and the word string typically fails to match the keywords and is not extracted. As a result, a three-word title string that is important for identifying the document type, such as “card application form”, “packing list form”, or “list of quotation”, is not extracted, and the document type is identified with poor accuracy.
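The arithmetic behind the 67% figure can be sketched as follows. The function name `word_recognition_rate`, the misrecognized sample word, and the acceptance threshold of 0.8 are assumptions made for illustration only; the text above does not state how a recognition rate is compared against a threshold.

```python
from typing import Sequence


def word_recognition_rate(recognized: Sequence[str], expected: Sequence[str]) -> float:
    """Fraction of positions at which the recognized word equals the expected word."""
    if len(recognized) != len(expected) or not expected:
        raise ValueError("word sequences must be non-empty and of equal length")
    correct = sum(r == e for r, e in zip(recognized, expected))
    return correct / len(expected)


expected = ["packing", "list", "form"]
recognized = ["packing", "lisl", "form"]  # hypothetical misrecognition of "list"

rate = word_recognition_rate(recognized, expected)  # 2 of 3 words are correct

# Assumed threshold: the title string is kept only if the rate reaches it.
ASSUMED_THRESHOLD = 0.8
extracted = rate >= ASSUMED_THRESHOLD
```

With one of three words wrong, `rate` is 2/3, which rounds to 67%. Under the assumed threshold the title string is discarded, which is the situation described above.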