1. Technical Field
The disclosure relates generally to a method and system for information visualization.
2. Discussion of Technical Background
An information search and retrieval system locates relevant documents stored on a storage medium and renders the documents as a result set in response to a query. The query may come from a user input, and the retrieved documents may be rendered to the user in a ranked order based on relevance, time, or other criteria. To help a user quickly identify the main concepts within the result set, various visualization techniques have been implemented to display the retrieved documents in a two-dimensional space.
Due to the computational complexity and limited effectiveness of projecting documents from a high-dimensional term space to a two-dimensional space, an intermediate procedure may be applied to reduce the number of dimensions involved in the projection process. Document classification has been applied to classify retrieved documents into predefined classes, the number of which is smaller than the number of terms in the documents. The classes are projected to a two-dimensional map, and the documents are then placed on the map relative to their classes. Document clustering provides another way to reduce the dimensionality, by grouping retrieved documents into clusters. The cluster centers are projected to a two-dimensional map, and the documents are placed on the map relative to the cluster centers. In both approaches, the rendered map identifies main concepts through labels: class labels in document classification and cluster labels in document clustering.
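By way of illustration only, the clustering-based approach described above may be sketched as follows. The sketch is not taken from any particular prior system; it assumes raw term-frequency vectors, a plain k-means procedure with hand-picked seed documents, and a principal-axis projection computed by power iteration. All function names, the sample documents, and the choice of the heaviest term as a cluster label are hypothetical.

```python
import math
from collections import Counter

def tf_vectors(docs):
    """Represent each document as a term-frequency vector over a shared vocabulary."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    vecs = []
    for d in docs:
        counts = Counter(d.lower().split())
        vecs.append([float(counts[w]) for w in vocab])
    return vocab, vecs

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def kmeans(vecs, seeds, iters=10):
    """Plain k-means; the seed document indices are supplied by the caller (assumption)."""
    centers = [list(vecs[i]) for i in seeds]
    assign = [0] * len(vecs)
    for _ in range(iters):
        # Assign every document to its nearest center, then recompute the centers.
        assign = [min(range(len(centers)), key=lambda c: sq_dist(v, centers[c]))
                  for v in vecs]
        for c in range(len(centers)):
            members = [v for v, a in zip(vecs, assign) if a == c]
            if members:
                centers[c] = mean(members)
    return centers, assign

def top_axes(points, n_axes=2, iters=200):
    """Leading principal axes of the points, via power iteration with deflation."""
    mu = mean(points)
    X = [[x - m for x, m in zip(p, mu)] for p in points]
    dim = len(mu)
    axes = []
    for a in range(n_axes):
        v = [1.0 / (i + 1 + a) for i in range(dim)]  # deterministic start vector
        for _ in range(iters):
            proj = [sum(xi * vi for xi, vi in zip(row, v)) for row in X]
            w = [sum(row[j] * p for row, p in zip(X, proj)) for j in range(dim)]
            for u in axes:  # remove components along axes already found
                dot = sum(wi * ui for wi, ui in zip(w, u))
                w = [wi - dot * ui for wi, ui in zip(w, u)]
            norm = math.sqrt(sum(wi * wi for wi in w))
            if norm < 1e-12:
                break
            v = [wi / norm for wi in w]
        axes.append(v)
    return mu, axes

def project(p, mu, axes):
    """Coordinates of p on the two-dimensional map spanned by the axes."""
    return tuple(sum((x - m) * u for x, m, u in zip(p, mu, axis)) for axis in axes)

# Hypothetical result set for a query.
docs = [
    "apple banana fruit", "banana fruit juice",
    "engine car wheel", "car wheel road",
    "stock market bank", "bank market trade",
]
vocab, vecs = tf_vectors(docs)
centers, assign = kmeans(vecs, seeds=[0, 2, 4])

# Project the cluster centers, then place documents on the same axes (assumption).
mu, axes = top_axes(centers)
center_xy = [project(c, mu, axes) for c in centers]
doc_xy = [project(v, mu, axes) for v in vecs]

# Label each cluster with the heaviest term of its center (first one wins ties).
labels = [vocab[max(range(len(vocab)), key=lambda j: c[j])] for c in centers]
```

In this sketch each document receives exactly one cluster assignment, and each label is chosen from a single cluster center without reference to the other clusters.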
However, both document classification and document clustering have drawbacks. First, the semantic relatedness between documents is not clearly represented on the two-dimensional map, so that documents placed close to each other are not necessarily more related than documents placed far apart. Second, the choice of the cluster or class to which a particular document is assigned may appear arbitrary when the document covers multiple topics represented by different clusters/classes and/or when multiple clusters/classes describe similar topics. Third, the placement of labels representing main concepts does not take into account the global distribution of concepts across classes/clusters on the two-dimensional map. As a result, concepts that occur in documents scattered across multiple classes/clusters are likely to be under-represented, i.e., not significant enough within any single class/cluster to be selected as labels.
Accordingly, there exists a need for a document visualization technique that overcomes the above drawbacks.