1. Field of the Invention
This invention relates to a technique of classifying semistructured data such as XML (Extensible Markup Language) or HTML (Hypertext Markup Language) data.
2. Related Art
In recent years, semistructured data such as XML or HTML data has been attracting attention as a data format for use in various databases and data exchanges, and developing data mining techniques for semistructured data is becoming an important issue.
Consider Supervised Learning (Classification Learning), a basic task of data mining, applied to classifying instances having a structure (such as the above-mentioned semistructured data) into two classes.
Supervised Learning in this context is to learn a classifier, such that, given a set of instances that belong to either of two classes (called a positive class and a negative class), the classifier correctly classifies a new instance into the positive class or the negative class when it is not known which class the new instance belongs to. The instances that belong to the positive class are called positive instances, and the instances that belong to the negative class are called negative instances.
Generally, Supervised Learning involves representing an instance as a point (vector) in a feature space. Learning the classifier is to determine a rule (a set of separating planes) for appropriately classifying a set (point set) of positive instances and negative instances in the feature space (more properly, classifying unknown points yet to be given).
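As an illustration of learning a separating plane in a feature space, the following is a minimal sketch of a perceptron in Python. The perceptron is chosen here only as a simple example learner; the function names `train_perceptron` and `classify` and the sample points are hypothetical and are not part of the described technique.

```python
# Minimal perceptron sketch: learns a separating plane w.x + b = 0
# from positive (+1) and negative (-1) instances given as vectors.
# All names and data here are illustrative assumptions.

def train_perceptron(points, labels, epochs=100):
    """Return (w, b) for a linearly separable point set."""
    dim = len(points[0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(points, labels):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:  # misclassified (or on the plane): update
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:  # every training point is on the correct side
            break
    return w, b


def classify(w, b, x):
    """Classify an unseen point as +1 (positive) or -1 (negative)."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1


# Two positive and two negative instances in a 2-dimensional feature space.
points = [(2.0, 2.0), (3.0, 1.5), (-1.0, -1.0), (-2.0, 0.5)]
labels = [1, 1, -1, -1]
w, b = train_perceptron(points, labels)

print(classify(w, b, (2.5, 2.0)))    # a new point near the positive instances
print(classify(w, b, (-1.5, -0.5)))  # a new point near the negative instances
```

Note that the learner reads the coordinates of each vector directly, which presupposes that the attributes spanning the feature space are already known.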
If an instance is easily represented as a vector, in other words, if the attributes required for classification (the basis of the feature space) can be determined by a person, the attributes may simply be passed to various learning algorithms. However, an instance having a structure, i.e., an instance represented as an array, a tree, a graph, and so on, cannot be directly represented as a vector. In many such cases, it may be effective to define substructures of the instance as its attributes. For example, a chemical compound can be represented as a graph, and its activity can be determined (to some extent) from the stacked substructures.
The above-mentioned semistructured data can be modeled as a labeled ordered tree. One strategy of Supervised Learning for processing instances having such a structure is Relational Learning. Relational Learning defines basic relations between elements, and these relations form substructures. The substructures, used as attributes, are constructed successively as the learning proceeds. However, the problem of searching for an optimal hypothesis is NP-hard in general, so a heuristic search approach must be used. The following reference 1, for example, describes Relational Learning in detail.
Reference 1: Furukawa, Ozaki, Ueno. Kinou-ronri Programming. Kyoritsu shuppan, 2001
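To make the labeled ordered tree model of semistructured data concrete, the following sketch converts a small XML document into such a tree using Python's standard `xml.etree.ElementTree` module. The `Node` class, the helper functions, and the sample document are hypothetical. As a simplifying assumption, element tags serve as labels, while text content and XML attributes are ignored.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A node of a labeled ordered tree (illustrative structure)."""
    label: str
    children: List["Node"] = field(default_factory=list)  # order matters


def to_labeled_ordered_tree(element):
    """Use each element tag as the label; keep children in document order."""
    return Node(element.tag, [to_labeled_ordered_tree(c) for c in element])


def to_bracket_string(node):
    """Render the tree as label(child child ...) for inspection."""
    if not node.children:
        return node.label
    inner = " ".join(to_bracket_string(c) for c in node.children)
    return node.label + "(" + inner + ")"


doc = "<book><title/><author/><author/><year/></book>"
tree = to_labeled_ordered_tree(ET.fromstring(doc))
print(to_bracket_string(tree))  # prints: book(title author author year)
```

The two `author` children carry the same label and are distinguished only by their position, which is what makes the tree both labeled and ordered.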
Another strategy of Supervised Learning for processing instances having a structure is to find partial patterns, define frequent patterns as attributes, and pass them to a standard learning algorithm. This approach has the advantage that it can process data that has no class information. However, the step of finding the patterns is again NP-hard in most cases.
Thus, neither of these two approaches assures that processing will be completed in polynomial time.
Now, there is a practical approach to Supervised Learning for processing instances having a complex structure like a tree structure, which uses a Kernel method such as Support Vector Machine. The following reference 2, for example, describes Support Vector Machine in detail.
Reference 2: V. Vapnik. The Nature of Statistical Learning Theory. Springer Verlag, 1995.
One of the important features of the Kernel method is accessing instances using a Kernel. While common learning algorithms directly access a vector representation of an instance, the Kernel method involves accessing through the inner product of vector representations of two instances. Therefore, given an efficient way of computing the inner product of vector representations, the dimension of the vector representations does not appear explicitly in the computational complexity of learning and classification processes, however high the dimension is. The function that provides the inner product is called a Kernel, with which the Kernel method can realize efficient learning.
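The following sketch illustrates the point that a Kernel method touches instances only through the Kernel. It uses a kernel perceptron rather than a Support Vector Machine, purely to keep the example short, together with a polynomial Kernel of degree 2. All function names and the sample data are hypothetical.

```python
# Kernel perceptron sketch: the learner never reads the coordinates of a
# feature vector; it only calls kernel(x, z). Names and data are illustrative.

def poly_kernel(x, z, degree=2):
    """Inner product in the implicit feature space of degree-2 monomials."""
    return (1.0 + sum(a * b for a, b in zip(x, z))) ** degree


def decision_score(alpha, xs, ys, kernel, x):
    """Weighted sum of Kernel values between x and the training instances."""
    return sum(a * yj * kernel(xj, x) for a, xj, yj in zip(alpha, xs, ys))


def train_kernel_perceptron(xs, ys, kernel, epochs=100):
    """Return one non-negative weight per training instance."""
    alpha = [0] * len(xs)
    for _ in range(epochs):
        mistakes = 0
        for i, (x, y) in enumerate(zip(xs, ys)):
            if y * decision_score(alpha, xs, ys, kernel, x) <= 0:
                alpha[i] += 1
                mistakes += 1
        if mistakes == 0:
            break
    return alpha


def predict(alpha, xs, ys, kernel, x):
    return 1 if decision_score(alpha, xs, ys, kernel, x) > 0 else -1


# XOR-like data: not separable by a plane in the original 2-D space,
# but separable in the feature space implied by the degree-2 Kernel.
xs = [(1, 1), (-1, -1), (1, -1), (-1, 1)]
ys = [1, 1, -1, -1]
alpha = train_kernel_perceptron(xs, ys, poly_kernel)
print([predict(alpha, xs, ys, poly_kernel, x) for x in xs])
```

Because `train_kernel_perceptron` and `predict` depend on the instances only through the `kernel` argument, replacing `poly_kernel` with a Kernel defined on trees would leave the learner unchanged.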
As described above, the Kernel method enables building a system that is feasible for learning with instances having a structure. Therefore, to perform data mining of semistructured data such as XML or HTML data, it is necessary to obtain the Kernel for labeled ordered trees that represent such a structure.
Some Kernels that use all possible substructures as attributes have been proposed. The following reference 3 discloses a Kernel for syntactic analysis trees of natural language sentences.
Reference 3: M. Collins and Nigel Duffy. Parsing with a Single Neuron: Convolution Kernel for Natural Language Problems. unpublished, 2000.
The method disclosed in the above literature uses the number of appearances of each subtree in a syntactic analysis tree as an attribute for a vector representation of the syntactic analysis tree. Here, because the number of subtrees appearing in one syntactic analysis tree increases exponentially with the size of the tree, explicitly counting the number would be problematic in view of computation time. Therefore, this method proposes recursively computing the inner product of the vector representations of two trees without counting the number of subtree appearances one by one.
However, the trees that can be processed by the Kernel described in reference 3 are limited to those, like a syntactic analysis tree, in which child nodes can be distinguished from each other within the set of child nodes of a given node. Therefore, the method cannot be applied to a general tree such as a labeled ordered tree.
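The recursive computation for syntactic analysis trees can be sketched as follows. This is a reading of the commonly described recursion for the Kernel of reference 3, not code taken from that reference: two nodes contribute nothing if their productions differ, contribute 1 if they are identical pre-terminals, and otherwise contribute the product over corresponding children of (1 + the children's contribution). The tuple encoding of trees and all function names are assumptions made for illustration.

```python
from functools import lru_cache

# A tree is a nested tuple: (label, child, child, ...).
# A word (leaf) is a 1-tuple such as ("dog",). This encoding is an assumption.


def production(node):
    """The node's label together with the ordered labels of its children."""
    return (node[0], tuple(child[0] for child in node[1:]))


def internal_nodes(tree):
    """All non-leaf nodes of the tree, in preorder."""
    if len(tree) == 1:
        return []
    nodes = [tree]
    for child in tree[1:]:
        nodes.extend(internal_nodes(child))
    return nodes


@lru_cache(maxsize=None)
def common_subtrees(n1, n2):
    """Number of common subtrees rooted at both n1 and n2."""
    if len(n1) == 1 or len(n2) == 1:  # words are not counted as roots
        return 0
    if production(n1) != production(n2):
        return 0
    if all(len(child) == 1 for child in n1[1:]):  # pre-terminal node
        return 1
    result = 1
    for c1, c2 in zip(n1[1:], n2[1:]):
        result *= 1 + common_subtrees(c1, c2)
    return result


def tree_kernel(t1, t2):
    """Inner product of subtree-count vectors, without enumerating subtrees."""
    return sum(common_subtrees(a, b)
               for a in internal_nodes(t1)
               for b in internal_nodes(t2))


t1 = ("NP", ("D", ("the",)), ("N", ("dog",)))
t2 = ("NP", ("D", ("a",)), ("N", ("dog",)))
print(tree_kernel(t1, t1))  # 6
print(tree_kernel(t1, t2))  # 3
```

In this sketch, the recursion pairs the j-th child of one node with the j-th child of the other, which is possible only because equal productions fix how the children correspond.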
Thus, this invention proposes a Kernel method for a labeled ordered tree, and the object of this invention is to realize classification of semistructured data with this Kernel method.