Repositories for documents are well known in the art. Within such repositories, literally thousands of documents of various types (text, spreadsheets, presentations, diagrams, ad hoc databases, programming code, etc.) may be stored according to any desired hierarchy. Given the sheer quantity of documents within such repositories, it is desirable to provide systems and techniques for navigating within the repositories. For example, U.S. Pat. No. 7,383,269 in the name of Swaminathan et al. and entitled “Navigating A Software Project Repository” (“the '269 patent”) describes a repository navigation tool comprising a backend system for processing documents in a repository and a front end system for accessing the processed documents. FIG. 1 illustrates the backend system of the repository navigation tool described in the '269 patent. As shown, the backend system 100 extracts relevant files from the various project repositories 110 using repository adapters 121. The extracted files are treated by the extraction tool 120 as essentially uniform materials, which are subsequently stored in a file store 125.
As shown, the extraction tool 120 communicates with a classification tool 130, a segmentation tool 140, and a linking tool 150. The classification tool 130 operates to classify each document provided by the extraction tool 120 into one of a plurality of categories. In turn, the segmentation tool 140 divides the extracted and classified documents into one or more segments. As used herein, and as further described in the '269 patent, a segment of a document comprises a subset of information that is grouped in some distinguishable and well-delineated manner from surrounding information such that the segmentation tool 140 is able to discern an author's intent to communicate to a reader that the subset of information may be treated as a single, discrete piece of information. Further still, the linking tool 150 is operative to analyze the resulting segments for the existence of relationships between the various segments, and subsequently to store information concerning the discovered relationships in a link repository 155. Based on the links established in this manner, the front end system illustrated and described in the '269 patent may be used to identify documents that are related to each other by virtue of the similarity of their corresponding segments.
The '269 patent describes a particular technique for operation of the linking tool 150. In particular, the '269 patent describes characterization of each segment as an n-dimensional vector, where n is the number of keywords in the available “universe” of keywords extracted from the segments. For each segment, the vector is populated by the frequency of each of the n different keywords within that segment. That is, the magnitude of a segment's vector along a particular keyword dimension is equal to the frequency of that keyword in the segment. Using this representation, similarity of segments may be determined using so-called cosine similarity analysis, i.e., by determining the dot product between segment vectors, normalized by the magnitudes of those vectors. While the repository navigation tool described in the '269 patent has been a useful addition to the prior art, further refinements for determining segment similarity (i.e., for discovering relationships between segments) would represent an advancement in the art.
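By way of illustration only, the keyword-frequency vector representation and cosine similarity analysis summarized above may be sketched as follows. This is a minimal sketch and not the implementation of the '269 patent: the function names `keyword_vector` and `cosine_similarity`, the whitespace tokenization, and the sample keywords are all assumptions made for the example. Cosine similarity is computed in its conventional form, as the dot product of two vectors divided by the product of their magnitudes.

```python
import math
from collections import Counter


def keyword_vector(segment_text, keywords):
    """Build an n-dimensional vector for a segment (hypothetical helper).

    Each component is the frequency of one keyword from the keyword
    universe. Tokenization by lowercasing and splitting on whitespace
    is an assumption made for this sketch.
    """
    counts = Counter(segment_text.lower().split())
    return [counts[keyword] for keyword in keywords]


def cosine_similarity(u, v):
    """Return the dot product of u and v divided by the product of
    their magnitudes. Returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Illustrative keyword universe and segments (invented for this example).
keywords = ["login", "password", "report"]
segment_a = keyword_vector("login password login", keywords)
segment_b = keyword_vector("password report", keywords)
similarity = cosine_similarity(segment_a, segment_b)
```

In this sketch, a similarity near 1 would indicate that two segments use the keywords in similar proportions, while a similarity of 0 would indicate that they share no keywords; a linking tool could record a relationship between segments whose similarity exceeds some chosen threshold, the threshold itself being a further assumption not drawn from the '269 patent.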