1. Field of the Invention
The present invention relates to data mining, and particularly to a hierarchal clustering method for large XML data.
2. Description of the Related Art
Data clustering is defined as the problem of grouping similar objects such that similarity between objects of the same group is higher than the similarity between those objects and objects of other groups.
An XML (Extensible Markup Language) document is composed of elements, each of which represents a logical component of the document. Elements can contain other elements and/or text (character data). The boundary of each element is marked with a start tag and an end tag. A start tag starts with the “<” character and ends with the “>” character. An end tag starts with “</” and ends with “>”. The root element contains all other elements in the document. For example, an XML document may have a root element named “paper”. Children of an element are elements that are directly contained in that element. In some XML documents, the element is not enough to describe its content. Such documents are called text-centric documents. Attributes are descriptive information attached to elements. The values of attributes are set inside the start tag of an element. For example, the expression <reference xlink="./paper/xmlql"> sets the value of the attribute xlink to “./paper/xmlql”. The main difference between elements and attributes is that attributes cannot contain other attributes or elements. Values are sequences of characters that appear between an element's start tag and end tag. Like attributes, values cannot contain elements. For example, “2004” and “Tom” are values.
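The terms above can be illustrated with a minimal sketch that uses the ElementTree parser from the Python standard library. The sample document is hypothetical; it is assembled only from the names mentioned in the text (“paper”, “reference”, “xlink”, “2004”, “Tom”), and the “author”, “name”, and “year” elements are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document built from the names used in the text.
DOC = """
<paper>
  <author><name>Tom</name></author>
  <year>2004</year>
  <reference xlink="./paper/xmlql"/>
</paper>
"""

root = ET.fromstring(DOC.strip())

root_name = root.tag                               # the root element contains all others
child_names = [child.tag for child in root]        # elements directly contained in the root
xlink_value = root.find("reference").get("xlink")  # attribute value set inside the start tag
year_value = root.find("year").text                # value between the start tag and end tag

print(root_name, child_names, xlink_value, year_value)
```

In this sketch, “author”, “year”, and “reference” are the children of “paper”, while “name” is a child of “author” only.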
Due to its nested structure, XML is commonly modeled as a rooted and labeled tree. Nodes of the tree correspond to elements, attributes, and text in XML documents. Edges represent element-sub-element, element-attribute and element-text relationships. This tree model reflects the logical structure of an XML document and can be used to store and query XML data. A path is an ordered series of nodes from the root node to an internal node or a leaf node. An exemplary path is “/paper/author/name”. The W3C XML specification provides detailed information about XML.
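As an illustration of the tree model, the following sketch enumerates the root-to-node path of every element, attribute, and text node of a small hypothetical document. The “@” prefix for attributes and the “text()” step for text nodes are notational assumptions borrowed from XPath; they are not taken from the description above.

```python
import xml.etree.ElementTree as ET

def node_paths(element, prefix=""):
    """Yield the root-to-node path of every element, attribute, and text node."""
    path = f"{prefix}/{element.tag}"
    yield path
    for attr_name in element.attrib:
        yield f"{path}/@{attr_name}"      # element-attribute edge
    if element.text and element.text.strip():
        yield f"{path}/text()"            # element-text edge
    for child in element:                 # element-sub-element edges
        yield from node_paths(child, path)

# Hypothetical sample document.
DOC = "<paper><author><name>Tom</name></author><year>2004</year></paper>"
paths = list(node_paths(ET.fromstring(DOC)))
print(paths)
```

Paths that end at an element with children lead to internal nodes, while paths that end in “text()” lead to leaf nodes.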
An XML document is a self-describing document. XML elements can either be simple or complex. Simple elements contain only values or attributes. On the other hand, complex elements can additionally contain other elements, and therefore a nesting structure is formed. This structure can have any level of nesting.
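The distinction between simple and complex elements, and the resulting nesting depth, can be sketched as follows. The helper names is_complex and nesting_depth, and the sample document, are hypothetical and serve only as an illustration.

```python
import xml.etree.ElementTree as ET

def is_complex(element):
    """An element is complex when it contains at least one other element."""
    return len(element) > 0

def nesting_depth(element):
    """Number of element levels from this element down to its deepest descendant."""
    return 1 + max((nesting_depth(child) for child in element), default=0)

# Hypothetical sample document.
root = ET.fromstring("<paper><author><name>Tom</name></author><year>2004</year></paper>")
kinds = {child.tag: ("complex" if is_complex(child) else "simple") for child in root}
depth = nesting_depth(root)
print(kinds, depth)
```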
Some XML documents have to conform to a Document Type Definition (DTD). A DTD specifies the elements, the attributes, and the structure of an XML document. Unlike relational database tables, XML documents are semi-structured. A newer specification for XML documents is the XML schema. The XML schema can impose more constraints on an XML document than the DTD. It also has a hierarchal structure that specifies the name and the data type of XML elements. The flexibility of defining the XML structure allows XML to represent any kind of data, but it also makes XML data more difficult to process.
There are several algorithms for clustering XML data. Nearly all XML clustering algorithms follow a similar approach. First, the XML dataset is read. The dataset can consist of XML documents, XML schemas, or both. Second, the data is optionally represented in a model, such as a tree model or a Vector Space Model (VSM). After that, a similarity function measures the distance between any two XML objects, or parts of the model. Finally, these objects are grouped as an array of clusters or as a hierarchy structure. The main concerns in designing such algorithms include the choice of similarity function, the handling of null values, and scalability.
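The general steps described above (read, model, measure similarity, group) can be sketched as follows. This is a minimal illustration under stated assumptions, not any particular published algorithm: the bag-of-paths vector, the cosine similarity, the threshold of 0.5, and the greedy grouping rule are all choices made for the example.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter

def path_vector(xml_text):
    """Model a document as a bag of element paths (a simple vector space model)."""
    counts = Counter()

    def walk(element, prefix):
        path = f"{prefix}/{element.tag}"
        counts[path] += 1
        for child in element:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return counts

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def group(vectors, threshold):
    """Add each vector to the first cluster whose first member is similar enough."""
    clusters = []
    for index, vector in enumerate(vectors):
        for cluster in clusters:
            if cosine_similarity(vectors[cluster[0]], vector) >= threshold:
                cluster.append(index)
                break
        else:
            clusters.append([index])
    return clusters

# Hypothetical input documents.
docs = [
    "<paper><author><name>Tom</name></author><year>2004</year></paper>",
    "<paper><author><name>Ann</name></author></paper>",
    "<book><title>XML</title></book>",
]
clusters = group([path_vector(d) for d in docs], threshold=0.5)
print(clusters)
```

The result here is a flat array of clusters, given as lists of document indices; a hierarchical output would instead record which clusters were merged into which.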
The main data clustering approaches are as follows. In the Partitioning Approach, algorithms start by taking n data points and then classifying them into k (k ≤ n) partitions. Examples of this approach are k-means, k-medoids and CLARANS.
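As an illustration of the Partitioning Approach, the following sketch applies the standard k-means iteration to one-dimensional points. The restriction to one dimension, the fixed starting centers, and the fixed iteration count are simplifying assumptions made for the example.

```python
def k_means(points, centers, iterations=10):
    """Partition 1-D points around the given starting centers (iterations >= 1)."""
    centers = list(centers)
    for _ in range(iterations):
        # Assignment step: each point joins the partition of its nearest center.
        partitions = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            partitions[nearest].append(p)
        # Update step: each center moves to the mean of its partition.
        centers = [
            sum(part) / len(part) if part else center
            for part, center in zip(partitions, centers)
        ]
    return partitions, centers

# Hypothetical data: n = 6 points, k = 2 partitions.
partitions, centers = k_means([1.0, 2.0, 3.0, 10.0, 11.0, 12.0], centers=[1.0, 12.0])
print(partitions, centers)
```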
A Hierarchical Approach creates a hierarchical decomposition of the given set of data objects. The decomposition can be performed either top-down (divisive) or bottom-up (agglomerative). Hierarchical approaches produce a tree in which each cluster is itself composed of smaller clusters.
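The bottom-up (agglomerative) variant can be sketched as follows for one-dimensional points. The use of single linkage (the distance between two clusters is the distance between their closest members) and the nested-tuple representation of the resulting tree are assumptions made for the example.

```python
def agglomerate(points):
    """Bottom-up clustering of a non-empty list of 1-D points with single linkage.

    Returns a nested tuple: leaves are points, inner nodes are (left, right) merges.
    """
    # Each working cluster is a pair (tree, members).
    clusters = [(p, [p]) for p in points]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                distance = min(
                    abs(a - b) for a in clusters[i][1] for b in clusters[j][1]
                )
                if best is None or distance < best[0]:
                    best = (distance, i, j)
        _, i, j = best
        # Merge the two closest clusters into one new tree node.
        merged = (
            (clusters[i][0], clusters[j][0]),
            clusters[i][1] + clusters[j][1],
        )
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]

# Hypothetical data.
tree = agglomerate([1.0, 2.0, 9.0, 10.0, 30.0])
print(tree)
```

Each inner tuple is a cluster of clusters: the two tight pairs are merged first, then joined with each other, and the outlying point is attached last at the root.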
One example of the Hierarchical Approach is the BIRCH algorithm (Balanced Iterative Reducing and Clustering using Hierarchies). BIRCH, in its first phase, creates a tree that summarizes the input data. This tree is called the Clustering-Feature tree (CF-tree). Each node in the CF-tree stores a few attributes that summarize the statistical features of its descendant nodes.
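The summarizing attributes can be illustrated with a minimal sketch of a clustering feature. In the published BIRCH algorithm, a clustering feature is the triple (N, LS, SS): the number of points, their linear sum, and their sum of squares. The class below, its one-dimensional simplification, and its method names are assumptions made for illustration; tree construction and node splitting are not shown.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ClusteringFeature:
    count: int         # N: number of summarized points
    linear_sum: float  # LS: sum of the points
    square_sum: float  # SS: sum of the squared points

    @staticmethod
    def of_point(x):
        """Summary of a single 1-D point."""
        return ClusteringFeature(1, x, x * x)

    def merge(self, other):
        """Summaries are additive, so a parent can summarize its children."""
        return ClusteringFeature(
            self.count + other.count,
            self.linear_sum + other.linear_sum,
            self.square_sum + other.square_sum,
        )

    @property
    def centroid(self):
        return self.linear_sum / self.count

leaf = ClusteringFeature.of_point(1.0).merge(ClusteringFeature.of_point(3.0))
parent = leaf.merge(ClusteringFeature.of_point(5.0))
print(leaf, parent, parent.centroid)
```

Because the three attributes are additive, a node can summarize all of its descendants without storing the individual data points.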
The Density-based Approach continues growing a given cluster as long as the density (number of objects or data points) in the neighborhood does not fall below a certain threshold. Examples of this approach include DBSCAN, OPTICS and DENCLUE.
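The idea of growing a cluster while the neighborhood remains dense can be sketched in the style of DBSCAN. The restriction to one-dimensional points, the parameter names eps and min_points, and the convention that a point counts as its own neighbor are assumptions made for the example.

```python
def dbscan(points, eps, min_points):
    """Label each 1-D point with a cluster id, or -1 for noise."""
    labels = [None] * len(points)

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if abs(points[i] - q) <= eps]

    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_points:
            labels[i] = -1  # noise for now; may later become a border point
            continue
        cluster_id += 1
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster_id  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            more = neighbors(j)
            if len(more) >= min_points:  # dense enough: keep growing the cluster
                queue.extend(more)
    return labels

# Hypothetical data.
labels = dbscan([1.0, 1.5, 2.0, 8.0, 8.4, 8.8, 20.0], eps=1.0, min_points=2)
print(labels)
```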
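The Grid-based Approach relies on creating a grid structure. This grid is finite and is created by quantizing the data object space. This approach is known to be efficient, because its processing time depends mainly on the number of grid cells rather than on the number of data objects.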
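The quantization step can be sketched as follows for two-dimensional points. The square cells, the cell size, and the minimum count used to call a cell dense are assumptions made for the example; a full grid-based method would go on to join neighboring dense cells into clusters.

```python
import math
from collections import Counter

def grid_cells(points, cell_size):
    """Quantize 2-D points into square cells and count the points per cell."""
    return Counter(
        (math.floor(x / cell_size), math.floor(y / cell_size)) for x, y in points
    )

def dense_cells(points, cell_size, min_count):
    """Cells that hold at least min_count points."""
    cells = grid_cells(points, cell_size)
    return {cell for cell, count in cells.items() if count >= min_count}

# Hypothetical data.
points = [(0.2, 0.3), (0.8, 0.9), (0.5, 0.1), (3.1, 3.7), (3.9, 3.2), (7.5, 0.4)]
cells = grid_cells(points, cell_size=1.0)
dense = dense_cells(points, cell_size=1.0, min_count=2)
print(cells, dense)
```

After quantization, later steps operate on the cell counts only, which is why the cost of those steps is governed by the number of cells.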
The Model-based Approach uses machine learning techniques that learn from the distribution of data points. Examples of this approach are self-organizing feature map (SOM) and COBWEB.
Given a large homogeneous XML dataset, the aforementioned approaches have difficulty clustering both the content and the structure of the dataset while producing an output in the form of hierarchal clusters.
Thus, a hierarchal clustering method for large XML data solving the aforementioned problems is desired.