Generally, when a computer or the like performs document classification or similar processing on an electronic document, feature information about the electronic document is used.
Here, document classification (also called document categorization) means classifying electronic documents into several classes based on their contents. Document classification is used, for example, in the spam filter of an e-mail system or the like. The spam filter determines whether a newly received mail is spam by referring to information about previously received mails that have been marked as spam. When the mail is determined to be spam, the spam filter sorts it into a specific folder or the like.
In document classification tasks such as this, a document is expressed as a multidimensional vector, which is also called a feature vector. For example, each dimension of the feature vector corresponds to a word that appears in the document. The value of each component is then, for example, the number of appearances of the word corresponding to that dimension, or its TF (Term Frequency)-IDF (Inverse Document Frequency) value, which represents the importance or the like of the word. The number of dimensions (the number of components) of the feature vector corresponds, for example, to the total number of words to be dealt with.
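The TF-IDF weighting mentioned above can be illustrated with a minimal sketch. TF-IDF has several variants; the one assumed here multiplies the raw count of a word in a document by log(N / df), where N is the number of documents and df is the number of documents containing the word. The function name `tf_idf_vectors` and the sample mails are hypothetical and serve only as an illustration.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return (vocabulary, vectors) using tf * log(N / df) weighting.

    Assumption: whitespace tokenization and the raw-count TF variant.
    """
    tokenized = [doc.lower().split() for doc in docs]
    # One dimension per distinct word, in alphabetical order.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append(
            [counts[word] * math.log(n_docs / df[word]) for word in vocabulary]
        )
    return vocabulary, vectors


# Hypothetical sample mails.
vocabulary, vectors = tf_idf_vectors(
    ["the cheap pills", "the meeting agenda", "the cheap meeting"]
)
```

Under this variant, a word that appears in every document (here "the") receives a weight of 0, while a word confined to one document receives the largest weight per occurrence.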
As a specific example, the following document d1 can be expressed by the following feature vector v1.
Document d1: “A document is represented as a vector. Each dimension corresponds to a separate term. If a term occurs in the document, its value in the vector is no zero.”
Feature vector v1: [4, 0, . . . , 1, . . . , 2, . . . , 0]
Here, the feature vector v1 includes a plurality of components, each of which corresponds to a single word.
The components are arranged in alphabetical order of the corresponding words, and the value of each component is the number of appearances of the corresponding word in the document d1. In other words, the first component of the feature vector v1 corresponds to the word “a”, which appears 4 times in the document d1. The second component corresponds to the second word in alphabetical order among the words to be dealt with. Supposing that this word is “about”, the value is 0 because “about” does not appear in the document d1. Looking at the components of the feature vector v1 in order, the next word in alphabetical order that appears in the document d1 is “as”. The word “as” appears once in the document d1, so the value of the corresponding component is 1. The next word that appears, “document”, appears twice in the document d1, so its value is 2. By expressing a document as a feature vector in this way, the document can be mapped to a single point in a high-dimensional vector space.
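The counting described for the document d1 can be reproduced with a short sketch. The vocabulary used here is an assumption made for illustration: it consists of the words of d1 plus the word “about”, whereas the text leaves the full set of words to be dealt with unspecified. The helper names `tokenize` and `count_vector` are likewise hypothetical.

```python
import re
from collections import Counter

# Text of document d1, as quoted above.
D1 = (
    "A document is represented as a vector. Each dimension corresponds "
    "to a separate term. If a term occurs in the document, its value in "
    "the vector is no zero."
)


def tokenize(text):
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def count_vector(text, vocabulary):
    """Return one appearance count per vocabulary word, in vocabulary order."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]


# Hypothetical vocabulary: the words of d1 plus "about", alphabetically sorted.
vocabulary = sorted(set(tokenize(D1)) | {"about"})
v1 = count_vector(D1, vocabulary)

for word in ("a", "about", "as", "document"):
    print(word, v1[vocabulary.index(word)])
```

With this vocabulary, the components for “a”, “about”, “as”, and “document” are 4, 0, 1, and 2, matching the values described for the feature vector v1.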
In document classification, a document is expressed as a feature vector using various methods, including the example mentioned above, and is classified based on the distance between feature vectors. Accordingly, generation of the feature vector is a very important technical element of document classification, because the precision of the classification depends largely on the precision of the feature vector.
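Classification based on the distance between feature vectors can be sketched as follows. The text does not specify the distance measure or the decision rule, so this sketch assumes Euclidean distance and a nearest-neighbor rule; the labeled vectors and the function names are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def classify(vector, labeled_vectors):
    """Return the label of the labeled vector nearest to `vector`."""
    nearest = min(labeled_vectors, key=lambda item: euclidean(vector, item[0]))
    return nearest[1]


# Hypothetical feature vectors of mails that have already been marked.
labeled_vectors = [
    ([3, 0, 1], "spam"),
    ([0, 2, 0], "not spam"),
]

label = classify([2, 0, 1], labeled_vectors)
print(label)  # prints "spam"
```

The new vector lies at distance 1 from the vector marked as spam and at distance 3 from the other, so it is assigned to the spam class. Because the decision depends only on these distances, any imprecision in the feature vectors carries over directly to the classification result.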
Examples of such techniques for extracting feature information from a document are disclosed in patent document 1 and patent document 2.
Next, when a document includes several self-contained blocks of content, there are applications that apply document classification to each block. For example, when a seminar notification sent by e-mail contains information on a plurality of seminars, such an application breaks the mail down into the information on each seminar and classifies each piece into a category. Hereinafter, a self-contained block of content included in a document is called a partial document, and classifying a partial document is called partial document classification.
In such partial document classification, each partial document needs to be specified correctly within the whole document. For example, a document can be divided into partial documents by detecting changes in the tendency of the phrases that appear in it. Here and in the following description, a phrase may be a single word or a sequence of words. However, when the partial documents are related to each other, the appearing phrases change little, so this approach is not sufficiently effective.
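One way to detect a change in the tendency of appearing phrases is sketched below. The text does not specify a concrete method, so this sketch assumes a simple lexical-cohesion rule: adjacent sentences are compared by the cosine similarity of their word counts, and a boundary is placed wherever the similarity falls below a threshold. The threshold value, the function names, and the sample sentences are hypothetical.

```python
import math
import re
from collections import Counter


def cosine(c1, c2):
    """Cosine similarity between two word-count Counters."""
    dot = sum(c1[word] * c2[word] for word in c1)
    norm = math.sqrt(sum(v * v for v in c1.values())) * math.sqrt(
        sum(v * v for v in c2.values())
    )
    return dot / norm if norm else 0.0


def split_on_vocabulary_shift(sentences, threshold=0.1):
    """Group sentences, starting a new group where adjacent similarity drops."""
    if not sentences:
        return []
    bags = [Counter(re.findall(r"[a-z]+", s.lower())) for s in sentences]
    parts, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if cosine(bags[i - 1], bags[i]) < threshold:
            parts.append(current)  # vocabulary changed: close the group
            current = []
        current.append(sentences[i])
    parts.append(current)
    return parts


# Hypothetical sentences covering two unrelated topics.
parts = split_on_vocabulary_shift([
    "Seminar on database tuning",
    "Database tuning seminar starts Monday",
    "Cooking class teaches pasta",
    "Pasta cooking class needs aprons",
])
```

In this sample the two topics share no words, so a single boundary is found between them. If both partial documents were, say, database seminars, adjacent sentences across the boundary would share much of their vocabulary, the similarity would stay above the threshold, and no boundary would be detected, which is the weakness noted above.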
For semi-structured documents having structural information, a heuristic for specifying partial documents has been proposed, as described in non-patent document 1. The partial document specifying method disclosed in non-patent document 1 focuses on the tendency that, when a semi-structured document includes several topics (partial documents), elements having the same name appear side by side at the same level of the document tree. When tracing the elements of the document tree from a leaf toward the root, this method determines the first element that has a sibling element with the same name to be the root of a partial document tree. With this method, a partial document in a semi-structured document can be specified, its feature information can be extracted using the technology disclosed in patent document 1 or 2, and the partial document can be classified.
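The leaf-to-root tracing described above can be sketched as follows. This is one reading of the heuristic as summarized here, not the implementation of non-patent document 1; the sample XML, the element names, and the function name `partial_document_roots` are hypothetical.

```python
import xml.etree.ElementTree as ET


def partial_document_roots(root):
    """Return the elements judged to be roots of partial document trees.

    From each leaf, walk toward the root and stop at the first element
    that has a sibling with the same tag name.
    """
    # ElementTree keeps no parent links, so build a child -> parent map.
    parent = {child: p for p in root.iter() for child in p}
    found = []
    for leaf in root.iter():
        if len(leaf) > 0:  # skip elements that have children
            continue
        node = leaf
        while node in parent:
            same_name = [s for s in parent[node] if s.tag == node.tag]
            if len(same_name) > 1:
                if node not in found:
                    found.append(node)
                break
            node = parent[node]
    return found


# Hypothetical seminar notification mail as a semi-structured document.
SAMPLE = """
<mail>
  <header><subject>Seminar news</subject></header>
  <body>
    <seminar><title>A</title><date>May 1</date></seminar>
    <seminar><title>B</title><date>June 2</date></seminar>
  </body>
</mail>
"""

roots = partial_document_roots(ET.fromstring(SAMPLE))
titles = [r.findtext("title") for r in roots]
```

For this sample, tracing upward from each `title` or `date` leaf stops at the enclosing `seminar` element, because two `seminar` elements are siblings under `body`. Tracing from `subject` reaches the document root without finding a same-name sibling, so the header is not treated as a partial document.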