1. Technical Field
The present invention relates to a method and system for generating relevant threads of documents from a collection of time-stamped documents.
2. Related Art
Organizing and searching document collections for the purpose of perusal by human users may relate to the following categories: automatic construction of hypertext and hyperlinks; burst analysis and prediction of news events; and clustering and automatic identification of communities.
There has been a lot of work on automatically generating hypertext, beginning with the thesis of Allan [4]. Works closely related to the present invention are Dalamagas and Dunlop [10] and Dalamagas [9]. The preceding works consider the problem of automatic creation of hyperlinks for news hypertext which is tailored to the domain of newspaper archives. The method in the preceding works is based on traditional clustering tools. Dalamagas [9] also explores the use of elementary graph-theoretic tools such as connected components to identify threads in news articles and builds a prototype system. Smeaton and Morrissey [21] use standard information retrieval techniques to compute a graph that is based both on node-node similarity and overall layout of the hypertext; they then use this graph to automatically create hyperlinks. Blustein [8] explored the problem of automatically creating hyperlinks between journal articles. Green [13] develops a notion of semantic relatedness to generate hypertext. However, most of these works focus on examining the text of a news article and adding hyperlinks to other news articles based on terms, dates, events, people, etc.; they do not address the problem of identifying threads in a news collection.
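As a rough illustration of the connected-components idea mentioned above, the following sketch links documents whose term sets are similar and reports each connected component as a candidate thread ordered by time stamp. It is not the method of Dalamagas [9]; the Jaccard similarity measure, the 0.3 threshold, the function name find_threads, and the sample documents are all assumptions made for the example.

```python
from collections import defaultdict


def find_threads(docs, threshold=0.3):
    """Group documents into candidate threads via connected components.

    docs: list of (doc_id, timestamp, set_of_terms) tuples.
    Two documents are linked when the Jaccard similarity of their term
    sets is at least `threshold` (an assumed, illustrative criterion).
    Returns a list of threads, each a list of doc_ids in timestamp order.
    """
    # Union-find structure: every document starts in its own component.
    parent = {doc_id: doc_id for doc_id, _, _ in docs}

    def find(x):
        # Follow parent pointers to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def jaccard(a, b):
        union = a | b
        return len(a & b) / len(union) if union else 0.0

    # Merge the components of every sufficiently similar pair.
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if jaccard(docs[i][2], docs[j][2]) >= threshold:
                root_i, root_j = find(docs[i][0]), find(docs[j][0])
                if root_i != root_j:
                    parent[root_i] = root_j

    # Collect the members of each component and order them by timestamp.
    groups = defaultdict(list)
    for doc_id, timestamp, _ in docs:
        groups[find(doc_id)].append((timestamp, doc_id))
    threads = [[doc_id for _, doc_id in sorted(g)] for g in groups.values()]
    return sorted(threads)


# Hypothetical sample collection of time-stamped documents.
sample_docs = [
    ("a", 1, {"election", "vote", "senate"}),
    ("b", 2, {"election", "vote", "poll"}),
    ("c", 3, {"storm", "flood"}),
    ("d", 4, {"storm", "flood", "rescue"}),
    ("e", 5, {"football"}),
]
sample_threads = find_threads(sample_docs)
```

A pairwise comparison of this kind is quadratic in the number of documents, so it is suitable only for small collections; it is shown here solely to make the graph-theoretic notion of a thread concrete.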
Automatically identifying threads in document collections is also related to burst analysis and event analysis. Kleinberg [17] models the generation of bursts by a two-state automaton and proceeds to automatically detect bursts in a sequence of events; Kleinberg looks for the burst of a single keyword. News event prediction and analysis are important topics that have been explored in several contexts. Clustering and text retrieval techniques were used by Yang et al. [25; 5] to automatically detect novel events from a temporally-ordered sequence of news stories. Statistical methods were used by Swan and Allan [23] to automatically generate an interactive timeline displaying major events in a corpus. Uramoto and Takeda [24] describe methods for relating multiple newspaper articles based on a graph constructed from the similarity matrix.
News articles and newsgroups have been analyzed in the context of search and data mining. Agrawal et al. [2] study the use of link-based methods and a graph-theoretic approach to partition authors into opposite camps within a given topic in the context of newsgroups. Finding news articles on the web that are relevant to news currently being broadcast was explored by Henzinger et al. [14]. Allan et al. [6] propose several methods for constructing one-sentence temporal summaries of news stories. Smith [22] examines collocations of dates/place names to detect events in a digital library of historical documents.
Community identification has been studied extensively in the context of web pages, web sites, and search results. Trawling refers to the process of automatically enumerating communities from a crawl of the web, where a community is defined to be a dense bipartite subgraph; an algorithm to do trawling via pruning and the Apriori algorithm [3] was presented in [19]. A network flow approach to identifying web communities was given by Flake et al. [11]. Local search methods were used to identify communities satisfying certain special properties in the work of [18]; their interest was to extract storylines from search results.