1. Field of the Invention
The present invention relates to a technique of classifying plural document data files.
2. Description of Related Art
It is known to provide a technique of reading a handwritten document using a scanner or other image reading device, recognizing characters by applying an OCR (Optical Character Recognition) process to the resulting document data file, and extracting the recognized characters as text data. According to the technique, by converting information described in a handwritten document into text data, a computer can use that information for a variety of purposes. For example, a computer can sort plural document data files based on character strings included in the text data, or prepare statistics based on plural document data files.
However, a character string having a given property can be expressed in a variety of formats, all conveying the same meaning, in accordance with the preferences of the creator of the document. For example, when a character string having the property of “date”, e.g. a character string expressing May 15, 2004, is written in a document, one user may write it in the format of “2004.05.15”, while another may write it in the format of “May 15, 2004”. That is to say, although the character strings convey the same meaning, they are written in different formats, and a computer therefore cannot recognize them as the same text data.
Accordingly, when the text data “2004.05.15” and “May 15, 2004” are to be classified on the basis of the common property of “date”, a computer cannot recognize that the two character strings share that property by simply comparing their first characters, “2” and “M”.
Therefore, it might not be possible to identify the common property shared by all formats of a character string expressing one specific meaning.
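The difficulty described above can be illustrated with a short sketch. It is given only as an illustration of the problem and is not the technique of the present invention. The function names `same_first_character` and `has_date_property` and the two regular expressions are hypothetical, chosen to cover only the two example formats.

```python
import re

# Hypothetical patterns covering only the two example formats above.
DATE_PATTERNS = [
    re.compile(r"\d{4}\.\d{2}\.\d{2}"),         # e.g. 2004.05.15
    re.compile(r"[A-Z][a-z]+ \d{1,2}, \d{4}"),  # e.g. May 15, 2004
]


def same_first_character(a: str, b: str) -> bool:
    """Naive comparison: do the two strings begin with the same character?"""
    return a[0] == b[0]


def has_date_property(text: str) -> bool:
    """Return True if the whole string matches one of the assumed date formats."""
    return any(p.fullmatch(text) is not None for p in DATE_PATTERNS)


a, b = "2004.05.15", "May 15, 2004"
print(a == b)                                         # literal comparison
print(same_first_character(a, b))                     # compares "2" with "M"
print(has_date_property(a) and has_date_property(b))  # format-aware check
```

The literal comparison and the first-character comparison both fail to relate the two strings, whereas a check that takes each format into account can attribute the property of “date” to both. Such a check succeeds only for formats that have been anticipated in advance.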
The present invention has been made in view of the problems discussed above and provides a technique of appropriately classifying plural document data files that share one common property but express it in different formats.