The disclosed embodiments generally relate to the field of database management, and more particularly to clustering a set of documents in a document repository into cluster groups, and then organizing the clustered groups into an ordered reading list based upon their relational strength and usefulness to a topic. Such an ordered reading list comprises a document trail for efficient topical reading by a user.
The ability to store documents electronically has led to an information explosion. Information bases such as the Internet, corporate digital data networks, electronic government record warehouses, and so forth, store vast quantities of information, which motivates development of effective information organization systems. Two commonly used organizational approaches are categorization and clustering. In categorization, a set of classes is predefined, and documents are grouped into those classes based on content similarity measures. Clustering is similar, except that no classes are predefined; rather, documents are grouped or clustered based on similarity, and the resulting groups of similar documents define the set of classes. U.S. Pat. Nos. 7,539,653 and 7,711,747 are typical examples of clustering techniques.
The use of such a clustering management system to facilitate organization, or the manual organization of documents into groups, is usually followed by readers/users manually reading through the documents in the clustered groups and then making subjective judgment calls about whether or not a document is relevant or useful to a related topic. The problem is that such a judgment can be made only by manually reading the entire document itself. Manual reading of related documents usually wastes considerable time because of redundancy and overlap among the documents. It is not uncommon for each document in a series to duplicate much of the information already provided by documents earlier in the series. People reading such a series of documents often must spend a significant amount of time trying to determine what novel content exists in each subsequent document in the series. This frequently leads to “skimming,” where readers attempt to quickly parse documents at some level of granularity (e.g., by paragraph) to determine whether the information provided is novel. Skimming can itself waste time and cause information to be missed.
Thus, there is a need for improved systems and methods for further organizing a document repository for more efficient reader/user review of accessible documents by minimizing the presentation of overlapping, redundant or non-useful information, and by highlighting information that is new, particularly useful or strongly related to the desired topic.
The present embodiments are directed to solving one or more of the problems specified above and to fulfilling the identified needs.