The present invention relates generally to the field of information analysis and display. More specifically, it provides methods for partitioning tree-structured textual material into topically related clusters of adjacent items, then developing digests of each cluster. The digests include both shorter overviews and arbitrarily long summaries. The tree-structured material involved could be for example, but is not limited to, trees containing the messages and postings of an archived discussion within a newsgroup, discussion list, or on-line forum. This invention also provides methods for partitioning a two-dimensional tree visualization, called a treetable, into conveniently sized segments for detailed exploration. The segments may be grouped into regions corresponding to the topically related clusters.
To establish some terminology, a “tree” or “tree structure” is a standard term denoting an abstract data structure that models information as a set of nodes connected by directed edges such that: (a) there is exactly one element having no incoming edges, called the “root”; and (b) all other nodes have exactly one incoming edge. A leaf node is a node with no outgoing edges. All nodes besides the root node and the leaf nodes can be called “interior nodes”. The “parent” of a node is the source of its incoming edge, and the “children” of a node are the targets of its outgoing edges. A “subtree” of a tree is a set of nodes consisting of a “subtree root” node that has no parent in the subtree, and other nodes all having parents in the subtree.
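By way of illustration only, the terminology above can be expressed as a small data structure. The following is a minimal sketch, not part of the described methods; the class name `Node`, its fields, and its method names are hypothetical. The `subtree` method returns only the complete subtree rooted at a node, which is one instance of the more general definition of a subtree given above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Node:
    """One tree node; `text` stands in for the textual material of the node."""
    node_id: str
    text: str = ""
    # The parent is the source of the node's single incoming edge (None for the root).
    parent: Optional["Node"] = field(default=None, repr=False)
    # The children are the targets of the node's outgoing edges.
    children: List["Node"] = field(default_factory=list)

    def add_child(self, child: "Node") -> "Node":
        child.parent = self
        self.children.append(child)
        return child

    def is_root(self) -> bool:
        return self.parent is None

    def is_leaf(self) -> bool:
        return not self.children

    def is_interior(self) -> bool:
        return not self.is_root() and not self.is_leaf()

    def subtree(self) -> List["Node"]:
        """Return the complete subtree rooted at this node, in preorder."""
        nodes = [self]
        for child in self.children:
            nodes.extend(child.subtree())
        return nodes
```

For example, a discussion in which message m1 receives replies m2 and m3, and m2 receives reply m4, would be built by calling `add_child` three times, after which m2 is an interior node and m3 and m4 are leaf nodes.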
The present invention is intended for use in connection with tree structures whose interior nodes represent substantial amounts of logically related textual information. For example, in the tree structures formed by archived discussions, the nodes represent individual messages or contributions, and a message represented by a child node is a response to the message represented by its parent. The parent-child links in archived discussions can be established by a combination of conventional means utilizing header information, and deeper means, as described in U.S. Patent Application Publication No. US 2002/0073157, filed Dec. 8, 2000, incorporated by reference hereinabove.
Tree-structured archived discussions on a particular subject are usually represented for exploration by indented lists. Each contribution is represented by some identification information, such as contributor name and date, indented under the identification information for its parent. The individual contributions may then be accessed for reading by selecting one of the list items. However, archived discussions pay varying amounts of attention to the ostensible subject and initial contribution, and often branch into several subtopics, so the reader cannot determine, based only on the ostensible subject, whether any portion of the discussion is actually of interest, and, if so, which parts of the discussion. A more informative representation of the overall content of an archived discussion is described in U.S. Pat. No. 7,003,724, issued Feb. 21, 2006, incorporated by reference hereinabove. In that representation, initial substantive fragments of each contribution, containing actual text of the message rather than quotes or quote introduction, are embedded within a reduced-width linear tree tailored to text embedding. This representation is suitable as a level of presentation of the discussion, and also as the content of an emailed digest summarizing activity in the discussion list, or as a client-side digest of such activity. Client-side accumulation of email from discussion lists, involving concatenating, or sampling, messages from all mail received in a particular period, is introduced in U.S. patent application Ser. No. 09/717,278, filed Nov. 22, 2000, titled “Systems and Methods for Performing Sender-Independent Managing of Electronic Messages”, by Michelle Baldonado, Paula Newman, and William Janssen, incorporated by reference hereinabove.
Yet another method of representing the overall content of an archived discussion is described in U.S. Pat. No. 6,976,212, issued Dec. 13, 2005, and U.S. Pat. No. 6,944,818, issued Sep. 13, 2005, both incorporated by reference hereinabove. In this method, the conversation tree is presented in a two-dimensional tabular form called a “treetable”. In such a treetable, each cell represents a single node and exactly spans the cells representing its children, if any, and a substantive initial fragment of the message associated with the node is displayed in the cell, to the extent that space allows. The individual columns and subtrees of the treetable may be selected for expansion (reducing other parts of the tree), to view more of the associated texts, and the full texts of each column may be selected for display in auxiliary windows or frames. (Note that a similar representation is described in an article entitled “Structured Graphs: a visualization for scalable graph-based case tools” by M. Sifer and J. Potter, in the Australian Computer Journal, Volume 28 Number 1, and also in later papers authored by M. Sifer and other colleagues, but in these references the potential of the structure is not exploited for purposes of exploring trees whose nodes have associated significant text.)
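The spanning rule for treetable cells can be illustrated with a short sketch. The following is a hypothetical illustration rather than the layout procedure of the referenced patents: it assumes that each leaf node occupies exactly one column, that a node's row equals its depth, and that the tree is supplied as a dictionary `children` mapping each node to its ordered list of children. The function name `treetable_layout` is likewise an assumption.

```python
from typing import Dict, List, Tuple


def treetable_layout(
    children: Dict[str, List[str]], root: str
) -> Dict[str, Tuple[int, int, int]]:
    """Return {node: (row, first_column, column_span)} for every node.

    Each cell exactly spans the cells of its children, so a node's span
    is the sum of its children's spans; a leaf spans one column.
    """
    layout: Dict[str, Tuple[int, int, int]] = {}

    def place(node: str, row: int, first_col: int) -> int:
        kids = children.get(node, [])
        if not kids:
            layout[node] = (row, first_col, 1)
            return 1
        span = 0
        for kid in kids:
            # Each child starts where the previous sibling's cells ended.
            span += place(kid, row + 1, first_col + span)
        layout[node] = (row, first_col, span)
        return span

    place(root, 0, 0)
    return layout
```

Under these assumptions the number of columns equals the number of leaf nodes, which suggests why a treetable for a large discussion, confined to a single window, leaves little width per cell.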
While the latter two methods (reduced-width linear trees and treetables) with embedded initial fragments are useful methods of providing overviews for smaller discussions and other tree-structured textual material, they are less useful for larger discussions. For example, for a stored conversation consisting of 93 messages, a reduced-width linear tree containing initial fragments requires over 11 standard-size display windows. Alternatively, if such a conversation is represented in a treetable that can be contained in a single window, the cells are too small to contain any indicative content, and there are too many columns to expand individually to determine if there is content of interest.
Therefore, more accessible digests of such larger discussions are needed. Current approaches to text processing address some related problems. Methods have been developed for segmenting individual documents into extents dealing with different approximate subtopics, and for identifying the topics covered by the most indicative words, as described in U.S. Pat. No. 7,130,837, issued Oct. 31, 2006, incorporated by reference hereinabove. Methods have also been developed for summarizing identified topic extents by collections of extracted sentences, and for associating summary elements with the text extents covered, as described, for example, in a paper by Branimir Boguraev and Mary Neff entitled “Discourse Segmentation in Aid of Document Summarization”, in the Proceedings of the Hawaii International Conference on System Sciences (2000). Methods have also been developed for summarizing collections of separate documents by grouping them by topic, generally using centroid-based clustering methods, and then extracting sentences dealing with each topic. An example of such an approach is described by Dragomir Radev, Hongyan Jing, and Malgorzata Budzikowska in the paper “Centroid-based summarization of multiple documents: sentence extraction” in the Proceedings of the ANLP/NAACL 2000 Workshop on Automatic Summarization (Seattle, Wash., April, 2000) pages 21-29. However, tree-structured discussions are neither single documents nor collections of independent documents, and specialized methods are needed for their segmentation and summarization.
Two limited approaches seem to have been developed, to date, relating to segmenting tree-structured discussions, but none, as far as can be ascertained at this time, to summarizing those discussions. A paper by K. Tajima, Y. Mizuuchi, M. Kitagawa, and K. Tanaka entitled “Cut as a querying unit for WWW, Netnews, and E-mail”, in the Proceedings of the 9th ACM Conference on Hypertext and Hypermedia (1998) describes a method for identifying overlapping subtrees of a discussion as units of information retrieval, to put retrieved messages into a useful context. The clustering method processes the thread tree bottom-up and, at each step, combines a parent with currently open child subtrees, separately or together, if the similarity between the parent word vector and the centroid vector of the child subtree or subtrees exceeds an (unspecified) absolute input threshold. The word vectors used to represent the messages handle quoted passages by reducing the weights of quoted words, in order to keep inter-message distances from being too small. While no results are given, if the threshold is set relatively high, this method would probably lead to shallow subtrees, suitable as query results. However, it is unlikely that the method would lead to clustering results suitable for subtopic identification or digesting. Based on our experiments, quoted words require more detailed treatment, and some trials of a similar single-link clustering method using distances between a node and the centroid of an adjacent cluster produced unsatisfactory results.
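The general shape of such a bottom-up, threshold-based combination can be sketched as follows. This is a simplified reading of the description above, offered for illustration only, and is not the algorithm of the cited paper: it considers each child subtree separately (never several together), produces non-overlapping clusters, uses raw word counts with cosine similarity, and applies no down-weighting of quoted words. All function names, the whitespace tokenization, and the `threshold` parameter are assumptions.

```python
import math
from collections import Counter
from typing import Dict, List


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors: List[Counter]) -> Counter:
    """Component-wise mean of a non-empty list of word vectors."""
    total: Counter = Counter()
    for v in vectors:
        total.update(v)
    return Counter({w: c / len(vectors) for w, c in total.items()})


def cluster_bottom_up(
    children: Dict[str, List[str]],
    texts: Dict[str, str],
    root: str,
    threshold: float,
) -> List[List[str]]:
    """Merge each parent with a child's open cluster when similarity exceeds threshold."""
    vec = {n: Counter(t.lower().split()) for n, t in texts.items()}
    closed: List[List[str]] = []

    def visit(node: str) -> List[str]:
        cluster = [node]
        for kid in children.get(node, []):
            kid_cluster = visit(kid)  # children are processed before the parent
            sim = cosine(vec[node], centroid([vec[m] for m in kid_cluster]))
            if sim > threshold:
                cluster.extend(kid_cluster)
            else:
                closed.append(kid_cluster)  # child cluster is not absorbed
        return cluster

    closed.append(visit(root))
    return closed
```

Because each decision depends on a single absolute threshold, a high setting tends to close clusters early, which is consistent with the expectation stated above that the results would be shallow subtrees.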
Another approach related to discussion tree segmentation is described in a paper by H. Ozaku, K. Uchimoto, M. Murata, and H. Isahara entitled “Topic Search for Intelligent Network News Reader HISHO”, in the Proceedings of the 2000 ACM Symposium on Applied Computing. This paper describes a method for retrieving many discussions relating to a query topic, and then attempting to filter out discussion subtrees irrelevant to the topic. The method uses, for the most part, noun keywords to represent messages, and tries to find “topic changing articles” where the proportion of never-seen keywords shifts, and “topic branching articles” where a message gives rise to several responses distinguished by their keyword usage and their referenced quotes. This strategy is reported as having limited success in finding topic changing articles (recall=57%) and greater success in finding topic branching articles.
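One simple reading of the “never-seen keyword” test can be sketched as follows. This is an illustrative assumption, not the procedure of the cited paper: a message is flagged when the fraction of its keywords that do not occur in any of its ancestors reaches a cutoff. The function names, the precomputed `keywords` sets, and the `min_ratio` cutoff are all hypothetical.

```python
from typing import Dict, List, Set


def unseen_ratio(keywords: Set[str], seen: Set[str]) -> float:
    """Fraction of a message's keywords that have not been seen before."""
    if not keywords:
        return 0.0
    return len(keywords - seen) / len(keywords)


def topic_change_candidates(
    children: Dict[str, List[str]],
    keywords: Dict[str, Set[str]],
    root: str,
    min_ratio: float = 0.5,
) -> List[str]:
    """Flag non-root messages whose keywords are mostly new relative to their ancestors."""
    flagged: List[str] = []

    def visit(node: str, seen: Set[str]) -> None:
        if node != root and unseen_ratio(keywords[node], seen) >= min_ratio:
            flagged.append(node)
        # Keywords accumulate along each root-to-leaf path, not across siblings.
        seen_below = seen | keywords[node]
        for kid in children.get(node, []):
            visit(kid, seen_below)

    visit(root, set())
    return flagged
```

A fixed cutoff of this kind is a crude stand-in for detecting a shift in the proportion; it is shown only to make the idea concrete.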
The present invention incorporates methods of dividing a tree-structured discussion into major subtopics, and of developing digests containing segments for each such subtopic. Two types of digests, which may be inspected in sequence, are developed. Shorter digests, which we will call “overviews”, choose a set of texts in each subtopic based on topic-relevance and potential for providing coherent sequences, and represent each such text by one or more extracted sentences. Potentially longer digests, which we will call “summaries”, choose a set of extracted sentences representing a proportion of the text associated with a subtopic, by a combination of features resting on inherent properties of the sentences, and on the content of a developing summary.
The present invention also provides methods for pre-segmenting a large tree or treetable for purposes of visualization and deeper exploration of individual nodes, with the segments sized so as to allow inclusion of at least some amount of content-indicative text for each node. There have been many approaches developed to allow investigation of detailed areas of large visualizations, usually distinguished as either “fisheye” approaches, that expand part of a visualization at the expense of other parts, or “focus plus context” approaches that extract and expand part of a visualization into another window. Some examples of these approaches as applied to trees and treetables are: (a) in-situ expansions of nodes in the neighborhood of a selected node within a “Degree of Interest Tree”, described in U.S. Pat. No. 6,944,830, issued Sep. 13, 2005, incorporated by reference hereinabove, (b) in-situ expansion of treetable columns and complete subtrees (all nodes descended from a given node), and extraction of sets of columns and complete subtrees into another window, as described in U.S. Pat. No. 6,976,212, issued Dec. 13, 2005, incorporated by reference hereinabove, and (c) iterative restriction of the display to subtrees or user-defined sets of nodes, as provided in a treetable-like visualization described in the papers “The SGF metadata framework and its support for social awareness on the World Wide Web”, by O. Liechti et al. in World Wide Web (Baltzer), 1999, 2(4), and “Zooming in One Dimension Can Be Better Than Two: An Interface for Placing Search Results in Context with a Restricted Sitemap” by M. Sifer and O. Liechti, in Proceedings of the 1999 IEEE Symposium on Visual Languages (Tokyo, Japan) 72-79.
These methods all have some problems when used in connection with large trees. In-situ expansion of neighborhoods, columns, or subtrees can be disorienting when the expanded nodes are to contain significant amounts of text, because the shape of the tree changes dramatically, and little space is left for unexpanded nodes. Also, for large trees, subtree extraction (or restriction of the display to a subtree) tends to be an iterative process. This may be suitable when the tree represents a generalization hierarchy, so that higher level nodes provide good cues as to the content of lower level ones, but not otherwise, for example when the trees represent discussions, or the network of linked nodes on a website, reduced to a tree by removing cyclic paths. Finally, leaving the specification of sets of nodes to be extracted to users is problematic both because it is laborious, and because successive extractions are still generally needed to make sufficient node-identification information visible to permit an intelligent selection.
For this reason, the methods provided in this invention pre-partition a tree or treetable into segments of related nodes whose approximate maximum dimension permits significant text to be presented for each node. The segments can be visually differentiated in an outline depiction of the tree or treetable as a whole, and individual segments then extracted for deeper exploration. The segments may also be constrained to represent only nodes within the same logical grouping, which may be an identified subtopic, or collection of less-focused material, or other type of grouping. When the segments are so constrained, regions of adjacent segments associated with each such grouping can also be visually differentiated from other such regions.
The method for partitioning is related to, but different from, the large body of work on partitioning graphs (collections of nodes linked by edges, but not necessarily hierarchic) into roughly equal-size subgraphs, given either the number of subgraphs to be found or a maximum size per subgraph, and possibly some additional constraints on the subgraphs, with the purpose being to minimize the number of edges between subgraphs. Much of the work derives from an algorithm described by B. W. Kernighan and S. Lin in the paper “An efficient heuristic procedure for partitioning graphs”, in The Bell System Technical Journal, 49(2) 1970, in which a greedy algorithm obtains an initial partitioning, and then nodes are iteratively moved among subgraphs to improve the quality of the partition. Such methods have applications in VLSI design, distribution of processes and data among processors, and sparse matrix representations. The problem addressed by the methods of the present invention is different, in that the permissible subgraphs are far more constrained; subgraphs must represent either subtrees or sets of subtrees whose roots have a common parent (and sometimes are part of the same logical grouping), and must largely respect a given layout dimensionality. This, in turn, permits the careful initial partitioning algorithm described in this invention to be sufficient to the purpose.
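The structural constraint on permissible subgraphs can be made concrete with a small check. The following sketch is illustrative only and is not the partitioning algorithm of the invention: it tests whether a set of nodes forms either a single subtree or a set of subtrees whose roots have a common parent, and it does not address logical groupings or layout dimensions. The function names and the `parent` mapping (each node to its parent, with `None` for the root) are assumptions.

```python
from typing import Dict, Optional, Set


def segment_roots(parent: Dict[str, Optional[str]], segment: Set[str]) -> Set[str]:
    """Nodes of the segment whose parent lies outside the segment."""
    return {n for n in segment if parent.get(n) not in segment}


def is_permissible_segment(
    parent: Dict[str, Optional[str]], segment: Set[str]
) -> bool:
    """True if the segment is one subtree, or several subtrees sharing one parent."""
    roots = segment_roots(parent, segment)
    if not roots:
        return False  # an empty segment is not a subgraph of interest
    if len(roots) == 1:
        return True  # every other node has its parent inside the segment
    # Several subtree roots: all must hang from the same parent node.
    return len({parent.get(r) for r in roots}) == 1
```

For a tree in which A has children B and C, B has children D and E, and C has child F, the sets {B, D, E} and {D, E} pass the check, while {D, F} does not, because D and F have different parents.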
Further advantages of the invention will become apparent as the following description proceeds.