1. Field of the Invention
The present invention relates to a document classification program, a document classification method, and a document classification apparatus that classify documents in a second domain according to categories for classifying documents in a first domain. More particularly, the invention relates to a document classification program, a document classification method, and a document classification apparatus that perform highly accurate classification at a low cost, and to a vector transformation program and a lexical-distortion cancellation program applied to the document classification program. In this specification, patent documents are explained as the documents in the first domain, and papers are explained as the documents in the second domain. That is, classification of papers according to the International Patent Classification (IPC) will be explained.
2. Description of the Related Art
A method of classifying documents in which a classification rule is learnt from correctly classified data (correct solution data) and documents are then classified by using that rule is widely used from the viewpoint of efficiency (see, for example, Japanese Patent Application Laid-Open No. 2002-222083). When papers are classified according to the IPC by using such a method, one of the following two procedures is used.
1. When the patent documents are used as the correct solution data:
    creating a classification rule from the correct solution data (patent documents) by using a learning machine; and
    classifying the papers by using the classification rule.
2. When papers to which the IPC has been added are used as the correct solution data:
    classifying the papers manually according to the IPC;
    creating a classification rule from the correct solution data (papers) by using the learning machine; and
    classifying the papers by using the classification rule.
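As an illustration only, the first procedure (learning a classification rule from classified patent documents and applying it to a paper) can be sketched as follows. The learning machine is not specified in this section, so the sketch assumes a simple multinomial naive Bayes classifier over whitespace-separated words. The function names `learn_rule` and `classify`, the toy patent texts, and their IPC labels are hypothetical and are not taken from the specification.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: lowercase whitespace tokenization stands in for real morphological analysis.
    return text.lower().split()


def learn_rule(labeled_docs):
    """Learn a classification rule (per-category word counts) from correct solution data."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, category in labeled_docs:
        doc_counts[category] += 1
        word_counts[category].update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocab


def classify(rule, text):
    """Return the category with the highest naive Bayes score for the given text."""
    word_counts, doc_counts, vocab = rule
    total_docs = sum(doc_counts.values())
    best, best_score = None, -math.inf
    for category in sorted(doc_counts):
        total_words = sum(word_counts[category].values())
        score = math.log(doc_counts[category] / total_docs)
        for w in tokenize(text):
            if w in vocab:  # words never seen in the training domain are ignored
                count = word_counts[category][w]
                # Add-one smoothing so unseen (category, word) pairs keep a nonzero probability.
                score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best, best_score = category, score
    return best


# Hypothetical correct solution data: patent-like texts with illustrative IPC subclasses.
patents = [
    ("semiconductor device comprising a gate electrode and substrate", "H01L"),
    ("semiconductor substrate with an insulating layer", "H01L"),
    ("apparatus for retrieving documents from a database index", "G06F"),
    ("document retrieval apparatus using an index database", "G06F"),
]
rule = learn_rule(patents)

# A paper-like text in the second domain, classified with the rule learnt from patents.
paper = "we study retrieval of documents using a database index"
print(classify(rule, paper))
```

In this toy case the paper happens to share enough words with the patent texts to be classified. The second procedure would differ only in that the labeled input to `learn_rule` consists of manually classified papers.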
However, when the patent documents are used as the correct solution data, although a large number of patent documents classified according to the IPC are available, the lexis (the way in which words are used) differs between the patent documents and the papers, so the papers may not be classified successfully even if learning is performed on the patent documents. Conversely, when the papers to which the IPC has been added are used as the correct solution data, creating the correct solution data in advance by classifying papers according to the IPC is costly, and the large number of already classified patent documents cannot be used effectively.
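The lexical difference described above can be made concrete with a rough measure: the fraction of word occurrences in the papers that never appear in the patent documents. The following is a minimal sketch under stated assumptions. The function name `out_of_vocabulary_rate` and the toy texts are hypothetical, and this measure is only one possible illustration of lexical mismatch, not a technique taken from the specification.

```python
def vocabulary(docs):
    # Set of distinct lowercase words appearing in a collection of documents.
    return {w for d in docs for w in d.lower().split()}


def out_of_vocabulary_rate(train_docs, target_docs):
    """Fraction of target-domain word occurrences never seen in the training domain."""
    known = vocabulary(train_docs)
    tokens = [w for d in target_docs for w in d.lower().split()]
    if not tokens:
        return 0.0
    unseen = sum(1 for w in tokens if w not in known)
    return unseen / len(tokens)


# Hypothetical texts: patent-style wording versus paper-style wording on the same topic.
patent_docs = [
    "said apparatus comprising a retrieval means",
    "a plurality of said retrieval means",
]
paper_docs = [
    "we propose a retrieval method",
    "we evaluate the method",
]

rate = out_of_vocabulary_rate(patent_docs, paper_docs)
print(f"{rate:.2f}")
```

A high rate suggests that a rule learnt from the patent documents sees few of the words that actually occur in the papers, which is one way the classification can fail.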
Generally, when cases in a domain B are classified according to the categories of a domain A, even if a large number of cases in the domain A have already been classified according to those categories, the documents pre-classified in the domain A cannot be used effectively because the domain A and the domain B are different, and the correct solution cases must be created by using the documents in the domain B.