1. Field of the Invention
The present invention relates to a document processing method, and more particularly to a method for merging document clusters, which is suitable for merging associated web page clusters or document clusters.
2. Related Art
As word processing software has come into wide use, the number of digital documents has increased greatly. When digital documents are processed or managed, functions for automatically detecting or comparing documents are usually needed. For example, basic vocabulary comparison technology is needed when producing and using digital text, and digital documents need similar functions; that is, the object of comparison is raised from the vocabulary level to the document level. A document here means a paragraph or an article formed of natural-language vocabularies. For example, a common article, a paragraph of an article, a sentence of an article, a field (such as the topic of an official document), a question raised by a user, or an answer given by service personnel may each be regarded as a document.
In order to classify various documents, document clusters (i.e., document collections) are generally classified by using a support vector machine (SVM), proposed by Vladimir Vapnik and co-workers in the 1990s. The SVM is based on the structural risk minimization principle of statistical learning theory, and obtains an optimal hyperplane in the feature space that separates positive samples from negative samples. Many modifications and applications of the SVM continue to be proposed.
A document cluster is a collection of many documents, and each document has one or more key vocabularies. In the SVM, each document is regarded as a vector, and the number of key vocabularies determines the dimensionality of that vector. Classification results may therefore be unsatisfactory when the dimensionality of the feature space is too high.
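The relationship between key vocabularies and vector dimensions can be illustrated with a minimal sketch. It assumes a simple term-count representation in which every distinct key vocabulary across the cluster occupies one dimension; the function names `build_vocabulary` and `to_vector` and the sample documents are hypothetical and do not come from the method described here.

```python
from typing import Dict, List


def build_vocabulary(documents: List[List[str]]) -> Dict[str, int]:
    """Assign one dimension index to each distinct key vocabulary."""
    vocabulary: Dict[str, int] = {}
    for terms in documents:
        for term in terms:
            if term not in vocabulary:
                vocabulary[term] = len(vocabulary)
    return vocabulary


def to_vector(terms: List[str], vocabulary: Dict[str, int]) -> List[int]:
    """Represent a document as a term-count vector over the vocabulary."""
    vector = [0] * len(vocabulary)
    for term in terms:
        vector[vocabulary[term]] += 1
    return vector


# Hypothetical documents, each already reduced to its key vocabularies.
documents = [
    ["merge", "cluster", "document"],
    ["web", "page", "cluster"],
    ["support", "vector", "machine", "document"],
]

vocabulary = build_vocabulary(documents)
vectors = [to_vector(terms, vocabulary) for terms in documents]

# Each new key vocabulary adds a dimension, so the vector length
# grows with the cluster's vocabulary rather than with any one document.
print("dimensions:", len(vocabulary))
for vector in vectors:
    print(vector)
```

Under this assumed representation, three short documents already require eight dimensions, and most entries of each vector are zero. A real cluster with thousands of key vocabularies yields correspondingly high-dimensional, sparse vectors, which is the situation the paragraph above describes.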