The disclosed embodiments generally relate to the field of database management, and more particularly to clustering a set of documents in a document repository into cluster groups, and then organizing the clustered groups into an ordered reading list based upon relational strength and usefulness to a topic. Such an ordered reading list comprises a document trail for efficient topical reading by a user. The documents are displayed to a reader/user with visual cues associated with document fragments, the cues indicating characteristic aspects of each fragment.
The ability to store documents electronically has led to an information explosion. Information bases such as the Internet, corporate digital data networks, electronic government record warehouses, and so forth, store vast quantities of information, which motivates development of effective information organization systems. Two commonly used organizational approaches are categorization and clustering. In categorization, a set of classes is predefined, and documents are grouped into those classes based on content similarity measures. Clustering is similar, except that no classes are predefined; rather, documents are grouped or clustered based on similarity, and the resulting groups of similar documents define the set of classes. U.S. Pat. Nos. 7,539,653 and 7,711,747 are typical examples of clustering techniques.
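The distinction between categorization and clustering drawn above can be illustrated with a minimal sketch. It is not taken from the referenced patents; it assumes, purely for illustration, that content similarity is measured as Jaccard overlap of word sets, that each predefined class is described by a short keyword string, and that clustering is a greedy single pass with a hypothetical threshold. The names `categorize`, `cluster`, and `threshold` are illustrative only.

```python
from typing import Dict, List, Set


def tokens(text: str) -> Set[str]:
    """Lowercased word set; an assumed, deliberately crude content model."""
    return set(text.lower().split())


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two word sets (assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def categorize(docs: List[str], classes: Dict[str, str]) -> Dict[str, List[str]]:
    """Categorization: classes are predefined; each document joins the most similar one."""
    groups: Dict[str, List[str]] = {name: [] for name in classes}
    for doc in docs:
        best = max(classes, key=lambda name: jaccard(tokens(doc), tokens(classes[name])))
        groups[best].append(doc)
    return groups


def cluster(docs: List[str], threshold: float = 0.3) -> List[List[str]]:
    """Clustering: no predefined classes; groups emerge from document similarity."""
    clusters: List[List[str]] = []
    for doc in docs:
        for group in clusters:
            # Compare against the first member of each existing group.
            if jaccard(tokens(doc), tokens(group[0])) >= threshold:
                group.append(doc)
                break
        else:
            clusters.append([doc])  # no similar group found, so start a new one
    return clusters


docs = [
    "printer toner cartridge replacement",
    "replacement toner cartridge order",
    "network router firmware update",
    "router firmware update schedule",
]
classes = {"printing": "printer toner cartridge", "networking": "network router firmware"}

print(categorize(docs, classes))
print(cluster(docs))
```

In this toy data set both approaches produce the same two groups; the difference is that categorization requires the class descriptions to be supplied in advance, whereas clustering derives the groups from the documents themselves.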
Whether documents are organized by such a clustering management system or grouped manually, readers/users of the clustered groups typically must then read through the documents themselves and make subjective judgment calls about whether or not each document is relevant or useful to a related topic. The problem is that such a judgment can be made only by manually reading the entire document. Manual reading of related documents usually wastes considerable time because of redundancy and overlap among the documents. It is not uncommon for a document in a series to duplicate much of the information already provided by documents earlier in the series. People reading such a series often must spend a significant amount of time trying to determine what novel content exists in each subsequent document. This frequently leads to “skimming,” where readers attempt to quickly parse documents at some level of granularity (e.g., by paragraph) to determine whether the information provided is novel or useful. Skimming can itself waste time and cause information to be missed.
Many proposed solutions in the conceptual space shared by the subject embodiments attempt to make decisions on behalf of a user. Documents are split into fragments of information (typically at the paragraph level) and those fragments are grouped into categories by topic. Sophisticated text analysis techniques are used to determine whether two paragraphs (often written by different authors in different documents) convey the same basic idea. In many cases, information fragments deemed “redundant” are discarded before the user has a chance to see them and decide for himself or herself, which can result in a loss of context. Stitching together fragments from different documents (written in different voices by different authors, with potentially different sentiments and points of view) can result in a compilation of cobbled-together concepts that is difficult to understand. Additionally, many solutions in this space use “seed documents” or search engine results to determine the starting position and ranking order of the documents, which loses contextual information such as chronology or dependency.
Thus, there is a need for improved systems and methods for further organizing a document repository for more efficient reader/user review of accessible documents by minimizing presented overlap, redundancy or non-useful information, and highlighting desired new, particularly useful or strongly related information to the desired topic. Such needed systems and methods would keep the original documents in the document trail sequence completely intact and only highlight the fragments of information and the preselected intended characteristic aspects using clear visual cues that allow the users to immediately identify at least information in the following categories:
    New: information that appears later in the document sequence, but is seen for the first time in the current document;
    Novel: unique information that only appears in the current document;
    Redundant: duplicate information that has appeared previously in the document sequence; and,
    Current position in the trail: where the document that the reader is currently reviewing exists in the overall trail of documents.
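By way of illustration only, the categories listed above can be sketched as a labeling pass over an ordered document trail. The sketch below is not the disclosed method; it assumes that each document has already been split into fragments (e.g., paragraphs) and that two fragments count as the same information when they match exactly after case and whitespace normalization, which is a far cruder test than real text analysis. The names `normalize` and `label_fragments` are hypothetical.

```python
from typing import List, Tuple


def normalize(fragment: str) -> str:
    """Assumed matching rule: exact match after lowercasing and collapsing whitespace."""
    return " ".join(fragment.lower().split())


def label_fragments(trail: List[List[str]], position: int) -> List[Tuple[str, str]]:
    """Label each fragment of the document at `position` in the ordered trail.

    redundant: the fragment already appeared in an earlier document
    new:       first seen here, and appears again later in the trail
    novel:     appears only in the current document
    """
    earlier = {normalize(f) for doc in trail[:position] for f in doc}
    later = {normalize(f) for doc in trail[position + 1:] for f in doc}

    labels: List[Tuple[str, str]] = []
    for fragment in trail[position]:
        key = normalize(fragment)
        if key in earlier:
            label = "redundant"
        elif key in later:
            label = "new"
        else:
            label = "novel"
        labels.append((fragment, label))  # the original fragment text is kept intact
    return labels


# Hypothetical trail of three documents, each given as a list of fragments.
trail = [
    ["Alpha overview.", "Setup steps."],
    ["Setup steps.", "Tuning guide.", "Known issue in v2."],
    ["Tuning guide.", "Release notes."],
]

position = 1
for fragment, label in label_fragments(trail, position):
    print(f"[{label}] {fragment}")
print(f"Current position in the trail: document {position + 1} of {len(trail)}")
```

Note that the sketch only attaches labels and never removes or reorders fragments, consistent with the stated need to keep the original documents intact and leave the decision to the reader.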
The present embodiments are directed to solving one or more of the problems specified above and to fulfilling the stated needs.