a) Field of the Invention
The invention relates to a system, a method and a recording medium for automatically classifying documents, and more particularly, to a system, a method and a recording medium for automatically classifying documents having a plurality of objects.
b) Description of the Related Art
With technological advances in recent years, a great deal of information can be stored digitally, and document digitization is one example. Digitizing documents effectively reduces the space required for storing them, and digitized documents are easy to query and manage. However, an operating company generates a large quantity of documents of different types, such as financial, personnel, research, quality assurance and more, all of which must be managed, and managing these varied types of documents creates another form of management overhead.
Classifying documents is an important step in document management because it helps narrow the search range and thus improves management efficiency. In the past, Optical Character Recognition (OCR) has been used to automatically classify documents, but OCR requires more processes and operations, which in turn requires better hardware equipment and consumes a lot of resources. Therefore, unless documents need to be classified down to the content written therein, it is best to avoid using OCR for automatic document classification.
Another method for automatically classifying documents imitates human vision, capturing important characteristics of documents to determine whether two documents are the same. For example, the table format in a document is used as a template for selecting and capturing characteristics; the characteristics are, in general, the straight lines in the table or the columns outlined by those lines. However, during the input stage of the digitization process, paper documents are more or less tilted, displaced, or scaled due to different resolutions, and these problems interfere with the automatic classification of documents. Although related information such as vectors, angle of tilt, and slope can be obtained to eliminate the aforementioned interference, this method still requires a lot of operations and hardware resources.
Therefore, the goals to be achieved are to eliminate the aforementioned factors that interfere with automatic classification of documents and to classify digitized documents effectively with relatively modest hardware requirements.