The following relates to the information technology arts, and more particularly to the clustering, categorization, and related arts.
Clustering of items is a typical task in information technology. The term “item” in various applications may refer to documents, images, data entries, musical files or content, video files or content with or without soundtracks, Internet URLs, content of a local area network (LAN), hard drive, or other storage medium, or so forth. Each item is characterized by a plurality of dimensions of information, where each dimension refers to an aspect descriptive of items to be grouped. In one approach, the items to be grouped can be represented in a table-like form in which the rows correspond to individual items, and the columns correspond to dimensions (or vice versa). In semi-supervised or supervised clustering (the latter sometimes being referred to as categorization), some or all items may have a class or group label or annotation. Clustering algorithms typically assign items to groups of “like items” based on some indication of similarity. A dataset can be large in terms of the number of items, the number of dimensions, or both.
Existing algorithms for automated clustering include K-means and its variants, Principal Component Analysis, Fisher Discriminant Analysis, and so forth. Existing automated clustering algorithms are less robust than would ideally be desirable, and sometimes produce deficient results for irregularly or otherwise arbitrarily shaped groupings. User interaction in existing automated clustering algorithms is also limited, typically to user selection of the initial conditions. If the automated clustering algorithm produces a deficient result, the user's only option may be to restart the automated clustering from the beginning using different initial conditions.
In view of these difficulties, manual clustering is sometimes used. Manual clustering by a human actor has the disadvantage of being labor-intensive, and the quality of the results is heavily dependent upon the skill and effort of the human actor. Generally, it is difficult for a human actor to sort through a large number of items having a large number of dimensions in order to identify useful groupings.
A third approach is interactive visual clustering. In this approach, the N-dimensional collection of items (where N is the number of dimensions used in characterizing the items) is mapped to a two- or three-dimensional space that is more easily visualized by a human actor. The mapping employs mapping parameters that can be adjusted by the human actor in order to produce visually perceptible groupings of items in the two- or three-dimensional mapping. An example of this approach is the star coordinate visual clustering method, in which each dimension is mapped to a two-dimensional mapping space using mapping parameters including a selected angle (denoted by θ) and unit length (denoted by α). An item P in the N-dimensional space having dimensional values (x_1p, x_2p, . . . , x_Np) is mapped to a point Q(x,y) in the two-dimensional star coordinate space by starting at an origin in the two-dimensional star coordinate space and summing vectors of the form [α_i·x_ip·cos(θ_i), α_i·x_ip·sin(θ_i)]^T, where θ_i and α_i are the mapping parameters for the i-th dimension, x_ip is the value of the i-th dimension of the item P, and the vector sum is over i=1, . . . , N for the N dimensions.
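By way of illustration only, the star coordinate mapping described above can be sketched in Python as follows. The function name star_coordinates, the argument names, and the choice of three evenly spaced axes are assumptions made for this sketch and are not taken from any particular implementation of the method.

```python
import math


def star_coordinates(item, alphas, thetas):
    """Map one N-dimensional item to a 2-D point Q(x, y).

    Each dimension i contributes the vector
    [alpha_i * x_i * cos(theta_i), alpha_i * x_i * sin(theta_i)],
    and the contributions are summed starting from the origin.
    """
    if not (len(item) == len(alphas) == len(thetas)):
        raise ValueError("item, alphas, and thetas must have the same length")
    x = sum(a * v * math.cos(t) for v, a, t in zip(item, alphas, thetas))
    y = sum(a * v * math.sin(t) for v, a, t in zip(item, alphas, thetas))
    return (x, y)


# Hypothetical setup: three dimensions whose axes are spread evenly
# around the origin, each with unit length 1.0.
thetas = [2 * math.pi * i / 3 for i in range(3)]
alphas = [1.0, 1.0, 1.0]

# An item with a value only in the first dimension lies along that axis.
point = star_coordinates([1.0, 0.0, 0.0], alphas, thetas)

# An item with equal values in all three dimensions falls near the origin,
# because the three evenly spaced contributions cancel.
balanced = star_coordinates([1.0, 1.0, 1.0], alphas, thetas)
```

In this sketch, distinct N-dimensional items can map to the same two-dimensional point, which is one reason the mapping parameters are left adjustable by the human actor.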
Interactive visual clustering provides a mechanism to assist the human actor in identifying suitable groupings of items. For example, in the star coordinate visualization, two dimensions that have about the same angular θ values will generally be aggregated together. On the other hand, adjusting the α value controls the contribution to the visualization of the corresponding dimension. Items of a candidate group can be denoted by a characteristic color or shape to enable visual tracking of the candidate group as changes are made in the θ and α parameters.
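The effect of the θ and α parameters can be sketched with a small numerical example. The following Python fragment is illustrative only; the helper name star_point and the sample values are assumptions. It reads "aggregated together" as meaning that dimensions sharing an angle contribute along the same direction, so that only their α-weighted sum affects the mapped point.

```python
import math


def star_point(item, alphas, thetas):
    """Sum the per-dimension star coordinate vectors for one item."""
    x = sum(a * v * math.cos(t) for v, a, t in zip(item, alphas, thetas))
    y = sum(a * v * math.sin(t) for v, a, t in zip(item, alphas, thetas))
    return (x, y)


# Two dimensions assigned the same angle (hypothetical value of 45 degrees).
thetas = [math.pi / 4, math.pi / 4]
alphas = [1.0, 1.0]

# Items (3, 1) and (1, 3) differ in each dimension but have the same sum,
# so they land on the same point when the two axes coincide.
p = star_point([3.0, 1.0], alphas, thetas)
q = star_point([1.0, 3.0], alphas, thetas)

# Setting alpha to zero for the second dimension removes its contribution:
# the item (3, 1) then maps to the same point as the item (3, 0).
muted = star_point([3.0, 1.0], [1.0, 0.0], thetas)
first_only = star_point([3.0, 0.0], [1.0, 1.0], thetas)
```

Separating the two angles would pull p and q apart again, which is the kind of adjustment the human actor makes when looking for visually distinct groupings.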
Interactive visual clustering is an aid to manual clustering, and accordingly retains the deficiencies of manual clustering. The approach remains labor-intensive, and the quality of the results remains dependent upon the skill and effort of the human actor. The interactive visualization assists the human actor in sorting through the items to identify useful groupings. However, it is still difficult for the human actor to deal with a large number of items, dimensions, or both, even with the assistance of interactive visual clustering. Linkages between the θ and α parameters of the visualization and the dimensions of the items may be difficult to grasp. Thus, it may not be apparent to the human actor how best to adjust the mapping parameters to make a particular group manifest in the visualization.