1. Technical Field
The disclosed embodiments generally relate to the field of database creation, ordering and management.
2. Description of the Related Art
Document collections can be modeled by using term-frequency vectors. A term-frequency vector is a vector having a plurality of entries, each corresponding to a particular term that is present in one or more document collections. For a given document collection, each entry tallies the number of occurrences, in that document collection, of the term to which the entry corresponds. A conventional method of generating document collections from a corpus is described in U.S. Pat. No. 5,442,778 to Pedersen et al., entitled “Scatter-Gather: A Cluster-Based Method and Apparatus for Browsing Large Document Collections,” the disclosure of which is incorporated herein by reference in its entirety.
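By way of illustration only, the construction of term-frequency vectors described above may be sketched as follows. The function name `term_frequency_vectors`, the whitespace tokenization, the lowercasing, and the alphabetical ordering of the vocabulary are assumptions made for this sketch and are not taken from the disclosure or from the incorporated reference.

```python
from collections import Counter


def term_frequency_vectors(collections):
    """Build one term-frequency vector per document collection.

    `collections` maps a collection name to a list of document strings.
    Tokenization (lowercase, split on whitespace) is an assumption of
    this sketch, not a requirement of the disclosure.
    """
    # Tally term occurrences across all documents of each collection.
    counts = {
        name: Counter(tok for doc in docs for tok in doc.lower().split())
        for name, docs in collections.items()
    }
    # One entry per term present in one or more collections.
    vocabulary = sorted(set().union(*counts.values()))
    # A Counter returns 0 for terms absent from a collection.
    vectors = {name: [c[t] for t in vocabulary] for name, c in counts.items()}
    return vocabulary, vectors


if __name__ == "__main__":
    vocab, vecs = term_frequency_vectors(
        {"a": ["red fish blue fish"], "b": ["one fish", "two fish"]}
    )
    print(vocab)
    print(vecs)
```

In this sketch every collection shares the same vocabulary, so entries at the same position in different vectors refer to the same term and the vectors can be compared directly.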
Term-frequency vectors have been viewed as vectors in a high-dimensional vector space. However, mathematical difficulties with clustering high-dimensional random vectors using distances are described in K. S. Beyer, J. Goldstein, R. Ramakrishnan and U. Shaft: “When Is ‘Nearest Neighbor’ Meaningful?”, Proceedings 7th International Conference on Database Theory (ICDT'99), pp. 217-235, Jerusalem, Israel (1999). As a result, information-theoretic methods have been developed in which the vectors are viewed as either empirical distributions (histograms) or outcomes of multinomial distributions.
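By way of illustration only, the two information-theoretic views mentioned above may be sketched as follows: a term-frequency vector normalized into an empirical distribution (histogram), and the same vector scored as the outcome of a multinomial distribution with given term probabilities. The function names `empirical_distribution` and `multinomial_log_likelihood` are hypothetical, and the sketch is not the method of any cited work.

```python
import math


def empirical_distribution(tf):
    """Normalize a term-frequency vector into a histogram summing to 1."""
    total = sum(tf)
    if total == 0:
        raise ValueError("term-frequency vector has no occurrences")
    return [count / total for count in tf]


def multinomial_log_likelihood(tf, probs):
    """Log-probability of observing counts `tf` under term probabilities `probs`.

    Uses log n! = lgamma(n + 1) so that large counts do not overflow.
    """
    n = sum(tf)
    # Log of the multinomial coefficient n! / (c_1! * c_2! * ...).
    log_lik = math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in tf)
    for count, p in zip(tf, probs):
        if count == 0:
            continue  # a term that never occurs contributes nothing
        if p == 0.0:
            return -math.inf  # observed a term the model says is impossible
        log_lik += count * math.log(p)
    return log_lik


if __name__ == "__main__":
    tf = [1, 2, 0, 1, 0]
    print(empirical_distribution(tf))
    print(multinomial_log_likelihood(tf, empirical_distribution(tf)))
```

Under the multinomial view, the empirical distribution of a vector is the maximum-likelihood estimate of its term probabilities, which is one reason the two views are often used together.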
What is needed is a method and system for grouping documents into collections such that documents that are contiguous in time and topic appear in the same cluster.
A need exists for a method and system for efficiently selecting an optimal partitioning of time-ordered document clusters.
A further need exists for a method and system for determining an optimal partitioning of time-ordered document clusters based on the number of parameters used to describe the partitions.
The present disclosure is directed to solving one or more of the above-listed problems.