1. Field of the Invention
The present invention is directed toward the field of analyzing clusters or relationships of data sets, and more particularly toward identifying clusters of terminology, organized in a knowledge base, for terminological systems.
2. Art Background
Various types of data are collected and subsequently analyzed in numerous applications. For example, in scientific experiments, data is collected by researchers, scientists and engineers. Typically, the data includes multiple variables or attributes. In general, the data points of a data set represent "n" variables or attributes. For example, if a data point represents a coordinate in three-dimensional space, then the data point, when expressed in rectangular coordinates, consists of the variables {x, y, and z}. In some applications, the variables in a data set are independent (i.e., there are no relationships between the variables of a data point). However, in other applications, one or more variables of a data point may have a predetermined relationship.
Often researchers desire to determine whether there is any correlation among the various data points. For example, a set of data points may be analyzed to determine whether some or all of the data points lie on a line, in a plane, or in some other correlative manner.
Techniques have been developed to determine relationships for data, wherein the variables of the data set are independent. These techniques are generally referred to as multivariate analysis. In general, multivariate analysis determines if there are any relationships among the independent variables or attributes in a data set. For example, if each independent variable is plotted in n-dimensional space (i.e., each independent variable is a separate dimension), then multivariate techniques may be applied to determine whether there is a relationship among the variables or attributes of the data set as depicted in the n-dimensional spatial representation. One goal of multivariate analysis is to identify data points that generally form a "cluster" when the data points are mapped in an n-dimensional space. This "cluster" effect shows a correlation among the data points in the cluster. Although prior art multivariate techniques identify clusters for data mapped in n-dimensional space, these techniques assume that the variables are independent. Accordingly, it is desirable to develop "clustering techniques" that are optimized to identify clusters of data points, wherein the variables or attributes are related.
Methods and apparatus for determining focal points of clusters in a tree structure are described herein. The clustering techniques of the present invention have application for use in terminological systems, wherein terms are mapped to categories of a classification system, and the clustering techniques are used to identify categories in the classification system that best reflect the terms input to the terminological system.
A cluster processing system determines at least one focal node on a hierarchically arranged tree structure of nodes based on attributes of a data set. The tree structure comprises a plurality of nodes, wherein each node includes an attribute. The tree structure is arranged in a hierarchy to depict relationships among the tree structure attributes. The data set comprises a plurality of data set attributes with associated weight values. The cluster processing system selects a set of nodes from the tree structure with tree structure attributes that correspond with the data set attributes, and then assigns quantitative values to nodes in the set of nodes from the weight values in the data set. At least one cluster of nodes is selected, based on proximity in the tree structure, and at least one focal node on the tree structure for the cluster of nodes is selected. The focal node comprises an attribute most representative of the data set attributes, and selection of the focal node includes evaluating the nodes of the cluster starting from a node at the top of the hierarchy of the tree structure and analyzing downward to select the focal node based on the quantitative values and the relationships of the attributes in the tree structure. The cluster processing system has application for use in a terminological system to learn the meaning of terms (attributes of a data set) by identifying categories (nodes) from a knowledge catalog (tree structure).
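The top-down selection summarized above may be illustrated with a minimal Python sketch. It is offered for illustration only and is not the claimed implementation. The category names, the `PARENT` map, the `focal_node` function, and the `threshold` descent rule are all hypothetical assumptions. For simplicity, the sketch treats all matched nodes beneath a single root as one cluster rather than forming clusters by proximity.

```python
from collections import defaultdict

# Hypothetical knowledge catalog: child -> parent (None marks the root).
PARENT = {
    "science": None,
    "biology": "science",
    "physics": "science",
    "genetics": "biology",
    "ecology": "biology",
    "optics": "physics",
}

def children_of(parent_map):
    """Invert the child -> parent map into parent -> list of children."""
    kids = defaultdict(list)
    for child, parent in parent_map.items():
        if parent is not None:
            kids[parent].append(child)
    return kids

def subtree_weights(parent_map, weights):
    """Add each matched node's weight to itself and to all of its ancestors."""
    totals = defaultdict(float)
    for node, w in weights.items():
        cur = node
        while cur is not None:
            totals[cur] += w
            cur = parent_map[cur]
    return totals

def focal_node(parent_map, data_set, root, threshold=0.7):
    """Descend from the root while one child subtree dominates.

    Assumed rule (not taken from the source): move to the heaviest child
    only if its subtree carries at least `threshold` of the current
    node's subtree weight; otherwise stop at the current node.
    """
    # Select only data set attributes that correspond to tree nodes.
    weights = {a: w for a, w in data_set.items() if a in parent_map}
    totals = subtree_weights(parent_map, weights)
    kids = children_of(parent_map)
    node = root
    while kids[node]:
        best = max(kids[node], key=lambda c: totals[c])
        if totals[node] == 0 or totals[best] / totals[node] < threshold:
            break
        node = best
    return node
```

Under these assumptions, a data set such as `{"genetics": 5, "ecology": 3, "optics": 1}` stops at `"biology"`: the biology subtree holds 8 of the 9 weight units, but neither of its children alone reaches the 0.7 share. A lower threshold pushes the focal node deeper into the tree, and a higher one keeps it more general.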