A number of well-known techniques exist for organizing and visualizing documents in a file system. Several such techniques are described in Readings in Information Visualization: Using Vision to Think, edited by Stuart K. Card et al., Morgan Kaufmann Publishers, Inc., San Francisco, Calif. (1999). For example, Wise et al., “Visualizing the Non-Visual: Spatial Analysis and Interaction with Information from Text Documents” (441-450), Proceedings of IEEE Information Visualization '95, 51-58 (1995), discuss various attempts to visualize large quantities of textual information, most importantly the “Galaxies” visualization which “displays cluster and document inter-relatedness by reducing a high dimensional representation of documents and clusters to a 2D scatter plot of ‘docupoints’ that appear as do stars in the night sky.” Hendley et al., “Case Study: Narcissus: Visualizing Information” (503-509), Proceedings of IEEE Information Visualization '95, 90-96 (1995), discuss a self-organizing representation of a three-dimensional information space. Points, such as web pages, exert a repulsive force on one another that is proportional to their dissimilarity, and the system eventually reaches a steady state.
Typically, files are maintained in a file system that uses a hierarchical structure. While such hierarchical structures provide an effective mechanism for organizing files in the file system, they suffer from a number of limitations, which, if overcome, could significantly increase the efficiency and consistency of file systems. Specifically, such hierarchical structures must rely on the computer user(s) to maintain the hierarchy. Thus, a number of self-organizing techniques have been disclosed or suggested for organizing file systems. For example, associative memory techniques have been applied in file systems. An associative memory relies more on associated recollections to pick out a particular memory than on absolute memory locations. See, for example, T. Kohonen, Self-Organization and Associative Memory, Springer-Verlag, New York, 1987 and T. Kohonen, “The self-organizing map,” Proceedings of the IEEE, 78(9), 1990: 1464-1480, each incorporated by reference herein. Kohonen's self-organizing feature map algorithm addresses the problem of preserving the relative distances among points when performing a dimensionality reduction from N>2 dimensions to two. For example, in three dimensions, it is possible to have four points that are equidistant from one another (i.e., the vertices of a regular tetrahedron), but it is not possible to preserve this equidistance relationship when projecting these points to a plane, since on a plane at most three points can be mutually equidistant.
Addressing this problem, Kohonen devised an algorithm for representing higher dimensional objects in two dimensions, by considering them as collections of grid points, all enclosed by a convex region not containing any other grid points. The distance between higher dimension points when interpreted in this kind of two dimensional “projection” is equal to the distance between closest grid points. These Kohonen feature maps do a better job of preserving relative distance than do standard projections. Dimensionality reduction is important in rendering a navigation system through a high dimension document space.
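The tetrahedron limitation noted above can be checked numerically. The following is only an illustrative sketch, not part of any cited method: the vertex coordinates and the choice of a simple orthogonal projection onto the xy-plane are assumptions made for the example, and the helper names are hypothetical.

```python
from itertools import combinations
from math import dist, isclose

# Assumed coordinates: alternating corners of a cube form a regular tetrahedron.
TETRAHEDRON = [(1, 1, 1), (1, -1, -1), (-1, 1, -1), (-1, -1, 1)]


def pairwise_distances(points):
    """Return the Euclidean distance between every pair of points."""
    return [dist(p, q) for p, q in combinations(points, 2)]


def project_to_xy(points):
    """Orthogonal projection onto the xy-plane (drop the z coordinate)."""
    return [(x, y) for x, y, _ in points]


d3 = pairwise_distances(TETRAHEDRON)
d2 = pairwise_distances(project_to_xy(TETRAHEDRON))

# In three dimensions all six pairwise distances agree.
all_equal_3d = all(isclose(d, d3[0]) for d in d3)
# After projection the four points form a square, whose sides and
# diagonals differ, so the equidistance relationship is lost.
all_equal_2d = all(isclose(d, d2[0]) for d in d2)

print(all_equal_3d, all_equal_2d)
```

Any other planar projection would also fail to keep all six distances equal, since at most three points in a plane can be mutually equidistant.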
A number of information retrieval mechanisms exist for accessing information based on a semantic analysis of documents. For example, vector space methods in information retrieval identify relevant documents by determining a similarity between two documents. The most important vector space information retrieval models include the Vector Space Method (VSM), the Generalized Vector Space Method (GVSM), described in S. Wong et al., “Generalized Vector Space Model in Information Retrieval,” ACM SIGIR Conference on Research and Development of Information Retrieval 1985: 18-25, and the method of Latent Semantic Indexing (LSI), described in S. Deerwester et al., “Indexing by Latent Semantic Analysis,” Journal of the American Society for Information Science 1990 41(6): 391-407.
Generally, the VSM information retrieval model looks at a document as a vector of frequencies of words, where the similarity between two documents, d and d′, is the vector dot product. The GVSM information retrieval model tries to solve the problem in VSM where virtually synonymous words are treated as orthogonal. GVSM uses a training collection of documents, or training matrix, to “condition” the dot product. In VSM, single word documents will have a zero similarity if they differ. In GVSM, single word documents will have a similarity equal to how well the words are correlated to one another in the training documents. The LSI information retrieval model goes in a different direction, trying to get at the problem of polysemy, where words can have different meanings, but in comparing word frequencies in documents, analogous meanings of the same words are removed. Aside from this, LSI is a very useful technique for determining principal components for dimensionality reduction.
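The contrast between the VSM and GVSM similarities for single word documents can be sketched as follows. This is a minimal illustration under stated assumptions, not the formulation of the cited references: the three-word vocabulary, the training counts, and the function names are hypothetical, and the “conditioning” is taken here to mean projecting each document onto the training documents before taking the dot product.

```python
def dot(u, v):
    """Plain vector dot product."""
    return sum(a * b for a, b in zip(u, v))


def vsm_similarity(d1, d2):
    """VSM: similarity is the dot product of the word-frequency vectors."""
    return dot(d1, d2)


def gvsm_similarity(d1, d2, training):
    """GVSM sketch: map each document to its dot products with the
    training documents, then take the dot product of those images."""
    p1 = [dot(d1, t) for t in training]
    p2 = [dot(d2, t) for t in training]
    return dot(p1, p2)


# Hypothetical vocabulary order: car, automobile, banana.
TRAINING = [
    [2, 1, 0],  # training document using "car" and "automobile" together
    [1, 2, 0],  # another document using both words
    [0, 0, 3],  # unrelated document about "banana"
]

car = [1, 0, 0]
automobile = [0, 1, 0]
banana = [0, 0, 1]

print(vsm_similarity(car, automobile))             # differing words are orthogonal
print(gvsm_similarity(car, automobile, TRAINING))  # co-occurring words are similar
print(gvsm_similarity(car, banana, TRAINING))      # never co-occur in training
```

In this toy data the two near-synonyms are orthogonal under VSM but receive a positive GVSM similarity because they co-occur in the training documents, while the unrelated pair stays at zero.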
Vector space methods use either word frequencies, normalized word frequencies, or some other term weighting scheme to coordinatize documents within the vector space. The most popular term weighting schemes are based on the term-frequency (tf) multiplied by the inverse document frequency (idf), often referred to as “tf×idf.” See, for example, G. Salton and C. Buckley, “Term-Weighting Approaches in Automatic Text Retrieval,” Information Processing & Management 1988 24(5): 513-523 and S. Robertson et al., “Okapi at TREC-3,” The Third Text Retrieval Conference, National Institute of Standards and Technology Special Publication, 1995: 500-525.
The most traditional tf×idf term weighting is f*log(N/n), where f is the frequency of the word in the current document, N is the total number of documents in the local corpus, and n is the number of documents in the local corpus containing the word. Once these weights are determined, they are normalized to ensure document vectors of length one (1). Normalization allows distance between documents to be viewed as the angle between document vectors, and the cosine of the angle is then a measure of the similarity between the vectors, which may be computed by taking the coordinate-by-coordinate dot product. Many other forms of tf×idf have been proposed, some of which do not use normalization. In any case, the key to tf×idf term weighting is the idf term. If a document is viewed purely as a vector of word counts, then very commonly occurring words would dominate, and documents could be seen to be close if they use commonly occurring words, such as “and” and “the,” in similar numbers. The inverse document frequency solves this problem by giving such words a very low idf. Since words such as “and” and “the” will occur in virtually every document, the N/n in the tf×idf term weighting computation will be close to one and the log of N/n will be close to 0. Thus, these commonly occurring words will have negligible term weights.
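The traditional weighting described above can be sketched in a few lines. This is a minimal illustration only: the three toy documents, the whitespace tokenization, and the function names are assumptions made for the example, and the natural logarithm is used because the text does not specify a base.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Weight each term as f * log(N / n) and scale each document vector
    to unit length. `docs` is a list of token lists."""
    n_docs = len(docs)
    # n: number of documents in the corpus that contain each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)  # f: frequency of each term in this document
        weights = {t: f * math.log(n_docs / doc_freq[t]) for t, f in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm > 0:
            weights = {t: w / norm for t, w in weights.items()}
        vectors.append(weights)
    return vectors


def cosine(u, v):
    """Dot product of two unit-length sparse vectors, i.e. the cosine
    of the angle between them."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())


# Hypothetical toy corpus.
DOCS = [
    "the cat and the hat".split(),
    "the cat and the dog".split(),
    "the stock and the bond".split(),
]

vecs = tfidf_vectors(DOCS)
print(vecs[0]["the"])             # occurs in every document, so its weight is zero
print(cosine(vecs[0], vecs[1]))   # share the discriminating term "cat"
print(cosine(vecs[0], vecs[2]))   # share only "the" and "and"
```

Because “the” and “and” appear in all three documents, N/n equals one and their weights vanish, so the first and third documents have no similarity even though they use those words in identical numbers.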
LSI and other vector space methods use only a single corpus when coordinatizing documents within the vector space using term weighting schemes. The use of LSI in conjunction with standard term-weighting schemes enables the most discriminable terms or phrases to rise to the top of the decomposition, as the principal right singular vectors. However, with a single corpus it is not possible to distinguish the discriminable terms, phrases and concepts from the “important” terms, phrases and concepts. In particular, such single corpus term weighting schemes do not evaluate “importance” from a personal standpoint. The indistinguishability between importance and discriminability is borne out in an article by F. Jiang and M. Littman entitled “Approximate Dimension Equalization in Vector-based Information Retrieval,” Proceedings of the Seventeenth International Conference on Machine Learning, Morgan Kaufmann, 2000: 423-430. Jiang and Littman provide theoretical and experimental evidence to show that the dimensions that remain after an LSI-based dimension reduction should be weighted uniformly. In fact, they show how GVSM, because of its effective weighting, acts like a severely dimensionally reduced version of LSI, and so consistently under-performs LSI in retrieval tasks. Thus, with single corpus tf×idf term weighting schemes, the first few singular vectors should not be viewed as the most “important” in any sense of the word.
Existing single-corpus information retrieval methods do not allow the concept of “importance” to be assessed from the vantage point of a given individual, or otherwise. For example, suppose an artificial intelligence researcher has a number of documents that use the terms “artificial” and “intelligence.” Examination of the researcher's own documents does not permit an assessment of the importance of these terms for the researcher. It could, in fact, be that these terms are simply very commonly used.
Therefore, a need exists for an improved self-organizing personal file (and navigational) system. A further need exists for a file management system which requires minimal user involvement for organization. In a landscape of pervasive computing devices, for example, with information coming at users from all directions, much of which a user would like to save within his or her personal collection, it is not practical to have to save every document within a personally created hierarchy. Yet another need exists for a computer filing system that is highly interactive, and gives the user a navigational space, with landmarks to get his/her bearings within the search space, along with improved search facilities based on the underlying semantics of documents. An object of this invention is to provide an improved method for determining the relevance of a document to a query, or proximity of one document to another based on two-corpus, relative term weighting. An additional object of this invention is to provide the user with a rich spatial representation of files, that is highly interactive and optimized for efficient navigation.