In current computing technologies, it is often useful to cluster data. In order to cluster data, a data corpus that contains a number of different objects is presented for clustering. Hidden groups within the data objects are identified and those groups form clusters.
There are a variety of different clustering techniques that find many different applications in current computing technologies. For instance, assume that the group of objects presented for clustering is a group of documents which are to be clustered in terms of topics. The documents and the words in the documents are referred to as observed values because they can be seen by a user. However, the topics or clusters are hidden values because the user does not obviously see the topics from the observed data. Clustering is the act of identifying the hidden topics in the observed data. Therefore, in this example, clustering is the act of finding topics referred to by the various documents in the data set and grouping the documents relative to those topics.
Clustering is often described in terms of “hard clustering” and “soft clustering”. Hard clustering means that a data object can belong to only one cluster. Soft clustering means that a single data object can belong to multiple clusters, and the membership in each of those clusters is described using fractional values. For instance, if a document to be clustered discusses traveling to Miami for the Super Bowl, the document may belong to a topic cluster identified as “travel” and to a topic cluster identified as “sports”. A soft clustering of the document might indicate that the document belongs to the “travel” cluster with a probability of sixty percent and that it belongs to the “sports” cluster with a probability of forty percent.
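The distinction can be illustrated with a minimal sketch. The function names and the dictionary representation below are assumptions made for illustration only. The sketch stores the soft memberships from the example above as fractions that sum to one, and derives a hard assignment by keeping only the cluster with the largest membership.

```python
def is_valid_soft_assignment(memberships, tol=1e-9):
    """Check that memberships are non-negative fractions summing to one."""
    weights = memberships.values()
    return all(w >= 0 for w in weights) and abs(sum(weights) - 1.0) <= tol


def hard_assignment(memberships):
    """Collapse a soft assignment to the single most likely cluster."""
    return max(memberships, key=memberships.get)


# Soft clustering of the example document: 60% travel, 40% sports.
doc_memberships = {"travel": 0.6, "sports": 0.4}

print(is_valid_soft_assignment(doc_memberships))
print(hard_assignment(doc_memberships))
```

Collapsing to a hard assignment discards the forty percent “sports” membership, which is the information a soft clustering is meant to retain.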
Data ranking presents a problem similar to that of clustering. For instance, when one wishes to find all documents that are important to the topic of “travel”, one might desire to have all the documents in the data set ranked according to their relevance to travel. Ranking the documents might provide results indicating that a given document is relevant to the topic of travel with a probability of 0.9, while other documents are relevant with probabilities of 0.8, 0.7, etc. It can thus be seen that creating a ranking of this type, based on relevance, can also be thought of as a soft clustering, in that documents can be relevant to multiple topics (i.e., belong to multiple clusters) to varying degrees.
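The relationship between ranking and soft clustering can be sketched as follows. The document identifiers, the scores for the “sports” topic, and the function name are hypothetical values chosen for illustration. Each document carries a relevance score per topic, and ranking for one topic is simply a sort on that topic's scores.

```python
def rank_for_topic(relevance, topic):
    """Return document ids sorted by descending relevance to one topic."""
    return sorted(relevance, key=lambda doc: relevance[doc][topic], reverse=True)


# Hypothetical per-topic relevance scores; each row is also a soft membership.
relevance = {
    "doc_b": {"travel": 0.8, "sports": 0.5},
    "doc_a": {"travel": 0.9, "sports": 0.1},
    "doc_c": {"travel": 0.7, "sports": 0.6},
}

travel_ranking = rank_for_topic(relevance, "travel")
sports_ranking = rank_for_topic(relevance, "sports")
print(travel_ranking)
print(sports_ranking)
```

The same table yields a different ordering for each topic, which reflects the observation that a document can be relevant to several topics to varying degrees.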
Indexing also presents a similar problem to that of clustering and ranking. Indexing is the process by which a reduced representation of a document is created for easy storage and retrieval such that the distances between documents in the complete representation are preserved as much as possible. This can be done through clustering, or in other ways.
In the past, there have been two principal kinds of methods for ranking, indexing or clustering: spectral methods and probabilistic methods. In general, a spectral method refers to a technique that extracts eigenvectors, eigenvalues, singular values, or singular vectors from a matrix. One example of a spectral method is latent semantic indexing (LSI), which is commonly used for document indexing. Another currently known spectral method is referred to as the “HITS” method, which is used to rank web pages as authority or hub pages.
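As an illustration of a spectral method, the following is a minimal sketch of the HITS iteration, written under stated assumptions: the link graph, page names, and function name are hypothetical, and a fixed iteration count stands in for a convergence test. For a link matrix A, the repeated updates amount to power iteration, so the authority scores approach the principal eigenvector of AᵀA and the hub scores approach the principal eigenvector of AAᵀ.

```python
import math


def hits(links, iterations=50):
    """Compute hub and authority scores for a graph given as page -> linked pages."""
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority of a page: sum of hub scores of the pages linking to it.
        auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                auth[q] += hub[p]
        # Hub score of a page: sum of authority scores of the pages it links to.
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        # Normalize both score vectors to unit Euclidean length.
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth


# Hypothetical four-page web graph.
links = {"portal": ["news", "wiki"], "blog": ["wiki"], "news": [], "wiki": []}
hub, auth = hits(links)
print(max(auth, key=auth.get))  # page with the highest authority score
print(max(hub, key=hub.get))    # page with the highest hub score
```

In this small graph, the page linked to by the most hubs receives the highest authority score, and the page linking to the most authorities receives the highest hub score.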
Spectral methods have a number of advantages. For instance, they have been shown to be optimal in the sense that they capture the most information possible given the reduced representation of the data. Also, under some conditions, a spectral method always converges to a globally optimal solution. Further, fast methods exist to solve spectral problems, and since eigenvectors are orthogonal, spectral methods like latent semantic indexing, when applied to documents, give topics with minimum overlap.
However, spectral methods do have some disadvantages. They do not have a probabilistic interpretation, which can be very useful when making decisions among multiple systems. Similarly, systems like latent semantic indexing do not allow sharing among representations in the same way that probabilistic models do. Also, spectral methods generally require some type of external restriction and guidance, and spectral clustering generally provides a hard clustering output.
The second type of technique for clustering, ranking, and indexing is the probabilistic technique. A probabilistic technique uses probabilistic, statistical models as generative models to represent the data. Examples of commonly known probabilistic techniques are latent Dirichlet allocation (LDA) and probabilistic latent semantic indexing (PLSI).
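The idea of a generative model can be sketched with a model much simpler than latent Dirichlet allocation or PLSI: a mixture of unigrams, in which each document is assumed to be generated by choosing one topic and then drawing words independently from that topic's word distribution. The topic priors, word probabilities, and function name below are hypothetical values for illustration. Bayes' rule then gives the probability of each topic given the observed words, which serves as a soft cluster membership.

```python
import math


def topic_posterior(words, topic_prior, word_given_topic):
    """Return P(topic | words) under a mixture-of-unigrams model."""
    log_scores = {}
    for topic, prior in topic_prior.items():
        # log P(topic) + sum of log P(word | topic) over the observed words.
        log_p = math.log(prior)
        for w in words:
            log_p += math.log(word_given_topic[topic][w])
        log_scores[topic] = log_p
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(log_scores.values())
    unnorm = {t: math.exp(s - m) for t, s in log_scores.items()}
    total = sum(unnorm.values())
    return {t: v / total for t, v in unnorm.items()}


# Hypothetical model parameters (each word distribution sums to one).
topic_prior = {"travel": 0.5, "sports": 0.5}
word_given_topic = {
    "travel": {"flight": 0.5, "hotel": 0.4, "game": 0.1},
    "sports": {"flight": 0.1, "hotel": 0.1, "game": 0.8},
}

posterior = topic_posterior(["flight", "game"], topic_prior, word_given_topic)
print(posterior)
```

Because the output is a probability distribution over topics, it can be compared or combined with the outputs of other probabilistic systems, which is the kind of interpretation that spectral methods lack.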
In general, the advantages associated with probabilistic techniques are the converse of the disadvantages of the spectral methods mentioned above: probabilistic techniques have a probabilistic interpretation, allow sharing among representations, and can provide a soft clustering output. However, the probabilistic techniques have disadvantages as well. Since they do not have an orthogonal representation, the clusters can be mixed with one another. Similarly, probabilistic techniques do not have the discriminative power of spectral techniques, and therefore the generative models tend to be descriptive and explanatory.
The discussion above is merely provided for general background information and is not intended to be used as an aid in determining the scope of the claimed subject matter.