Repositories for documents are well known in the art. Within such repositories, literally thousands of documents of various types—text, spreadsheets, presentations, diagrams, ad hoc databases, programming code, etc.—may be stored according to any desired hierarchy. Given the sheer quantity of documents within such repositories, it is desirable to provide systems and techniques for navigating within the repositories. For example, U.S. Patent Application Publication No. U.S. 2005/00659930 filed on Sep. 12, 2003 in the name of Swaminathan et al. and entitled “Navigating A Software Project Repository” (“the '930 application”) describes a repository navigation tool comprising a backend system for processing documents in a repository and a front end system for accessing the processed documents. FIG. 1 illustrates the backend system of the repository navigation tool system described in the '930 application. As shown, the backend system 100 extracts relevant files from the various project repositories 110 using repository adapters 121. The extracted files are treated by the extraction tool 120 as essentially uniform materials that are subsequently stored in a file store 125.
As shown, the extraction tool 120 communicates with a classification tool 130, a segmentation tool 140, and a linking tool 150. The classification tool 130 operates to classify each document provided by the extraction tool 120 into one of a plurality of categories. In turn, the segmentation tool 140 divides the extracted and classified documents into one or more segments. As used herein, and as further described in the '930 application, a segment of a document comprises a subset of information that is grouped in some distinguishable and well-delineated manner from surrounding information such that the segmentation tool 140 is able to discern an author's intent to communicate to a reader that the subset of information may be treated as a single, discrete piece of information. Further still, the linking tool 150 is operative to analyze the resulting segments for the existence of relationships between the various segments, and subsequently store information concerning the discovered relationships in a link repository 155. Based on the links established in this manner, the front end system illustrated and described in the '930 application may be used to identify documents that are related to each other by virtue of similarity of their corresponding segments.
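The backend pipeline described above—classification, segmentation, and linking of extracted documents—may be illustrated by the following minimal sketch. The sketch is not taken from the '930 application; all function names and the trivial classification, blank-line segmentation, and word-overlap linking heuristics are hypothetical stand-ins for the tools 130, 140, and 150.

```python
def classify(doc):
    """Classify a document into one of a plurality of categories.

    Hypothetical heuristic: documents containing a function
    definition are treated as programming code, all else as text.
    """
    return "code" if "def " in doc else "text"


def segment(doc):
    """Divide a document into one or more segments.

    Hypothetical heuristic: blank-line-delimited blocks stand in for
    the author-delineated units discerned by the segmentation tool.
    """
    return [s.strip() for s in doc.split("\n\n") if s.strip()]


def link(segments_a, segments_b):
    """Record a relationship between any pair of segments that share
    vocabulary, as a crude stand-in for the linking tool's analysis."""
    links = []
    for a in segments_a:
        for b in segments_b:
            if set(a.split()) & set(b.split()):
                links.append((a, b))
    return links
```

Under this sketch, two documents would be related to each other whenever the linking step discovers at least one pair of overlapping segments between them.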
The '930 application describes a particular technique for operation of the segmentation tool 140. In particular, the '930 application describes segmentation of documents based on structure of a document associated with that document's specific type, as well as the content of the document. For example, in the case of a Word document (i.e., a document produced using Microsoft's “WORD” text editor application), the segmentation tool 140, using a so-called component object model (COM) application programming interface (API), accesses the content of a document to discover various structural features specific to a Word document, e.g., titles, outline levels, section indicia and the relationship of various paragraphs to these structures. Based on this information, the segmentation tool 140 infers the existence of segments consisting of text associated with the high-level structural features such as sections. In another example, slides within a presentation developed using Microsoft's “POWERPOINT” application are accessed via a corresponding COM API to determine the existence of various slides, shapes and shape text within the document, which features are again used to infer segments. In the case of documents developed according to templates, the segmentation tool 140 is provided with an additional tool for determining segments to the extent that the known structure of the template can be used to identify segment boundaries.
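The structure-driven segmentation described above may be sketched as follows. Rather than accessing a Word document through the COM API as the '930 application describes, this hypothetical illustration treats lines beginning with "# " as high-level structural features (titles or section headings) and associates the paragraphs that follow with the most recent such feature.

```python
def segment_by_structure(lines):
    """Infer segments from structural features of a document.

    Hypothetical stand-in for structure-based segmentation: a line
    beginning with '# ' marks a section title; each segment consists
    of a title and the non-empty lines appearing beneath it.
    """
    segments, current = [], None
    for line in lines:
        if line.startswith("# "):
            # New structural feature encountered: close out the
            # previous segment, if any, and begin a new one.
            if current is not None:
                segments.append(current)
            current = {"title": line[2:], "body": []}
        elif current is not None and line.strip():
            # Associate this paragraph with the enclosing section.
            current["body"].append(line)
    if current is not None:
        segments.append(current)
    return segments
```

The same pattern generalizes to other document types: whatever structural features the type exposes (slides and shapes in a presentation, fields in a template) play the role of the "# " markers here.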
Other approaches to segmentation employ simpler means, such as segmenting documents into fixed-size units, or segmenting into minimal entities (such as paragraphs) and then grouping subsequent paragraphs based on similarity to create the segments. In the former approach, the segments thus formed may not be as expected by the user (they can be either too large or too small), and the approach clearly does not take the user's perspective into consideration. With regard to the latter approach, the computational complexity required to first interpret the semantic content of each minimal entity, and subsequently infer similarity between minimal entities, is very high.
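The two simpler approaches mentioned above can be sketched as follows. Both functions are hypothetical illustrations: the first chunks text into fixed-size units regardless of content, and the second groups adjacent paragraphs using a crude word-overlap similarity in place of true semantic interpretation.

```python
def fixed_size_segments(text, size):
    """Segment a document into fixed-size units, ignoring content.

    Segment boundaries fall wherever the character count dictates,
    which is why the resulting units may split or merge what an
    author intended as discrete pieces of information.
    """
    return [text[i:i + size] for i in range(0, len(text), size)]


def group_by_similarity(paragraphs, threshold=0.2):
    """Group adjacent paragraphs into segments by similarity.

    A Jaccard word-overlap score stands in for the (much costlier)
    semantic comparison of minimal entities described above.
    """
    def sim(a, b):
        wa, wb = set(a.split()), set(b.split())
        return len(wa & wb) / max(len(wa | wb), 1)

    groups = [[paragraphs[0]]] if paragraphs else []
    for p in paragraphs[1:]:
        if sim(groups[-1][-1], p) >= threshold:
            groups[-1].append(p)   # similar enough: extend the segment
        else:
            groups.append([p])     # dissimilar: start a new segment
    return groups
```

Note that even this toy similarity grouping compares every paragraph against its predecessor; with a genuine semantic model in place of the word-overlap score, that per-comparison cost is what drives the high computational complexity noted above.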
While the segmentation approach described in the '930 application and the other techniques described above have been useful additions to the prior art, further refinements for performing segment identification would represent an advancement in the art.