Many visualization systems have been built to help the information analyst sift through massive quantities of expository text stored in electronic form in computer databases and the like. Systems of this type have been critically important in identifying key documents for intensive analysis. Ultimately, however, the relevant documents that are identified must still be read, which is a time-consuming effort.
Efforts to speed this process have led to research in the area of Information Retrieval (IR), which has set a precedent for certain approaches, as has research in applied mathematics and statistics. An example of this work is automatic text theme identification, the aim of which is to provide automated textual summaries of documents. ["Automatic Text Theme Generation and the Analysis of Text Structure", Salton, G and Amit Singhal, July 1994, TR 94-1438, Cornell Univ, Dept of Computer Science.] The mathematical basis for this approach is the standard Vector Space Model (VSM) used in IR. In the VSM, each document is represented as a vector of weights, with each weight corresponding to a particular word or concept in the text. Each paragraph is represented as a vector based on the words contained in the whole document. Similarities between paragraphs are calculated using a cosine measurement (normalized dot product) and are used to create a text relationship map. In the text relationship map, nodes are the paragraphs and links are the paragraph similarities. All groups of three mutually related (based on the similarity measure) paragraphs are identified and merged. These groups are then shown as triangles on the map. For each triangle, a centroid vector is created. A theme similarity parameter may then be used to merge triangles. The merging stops when further merges would fall outside the specified parameter range. The resulting merged triangles may then be associated with themes. A "tour" or summary of a document may be produced by arranging the merged triangles in chronological order and producing a summary for each of the merged triangle sets.
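The text relationship map and its triangles can be illustrated with a minimal sketch. The sketch assumes raw word counts as the vector weights, whitespace tokenization, and a link threshold of 0.3; none of these values comes from the cited report, and the function names are hypothetical. Centroid creation and triangle merging are not shown.

```python
import math
from collections import Counter
from itertools import combinations


def term_vector(text):
    """Represent a paragraph as raw word counts (an assumed weighting)."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Normalized dot product of two sparse term vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0


def relationship_map(paragraphs, threshold=0.3):
    """Return the links (i, j), i < j, whose similarity meets the threshold."""
    vectors = [term_vector(p) for p in paragraphs]
    return {
        (i, j)
        for i, j in combinations(range(len(vectors)), 2)
        if cosine(vectors[i], vectors[j]) >= threshold
    }


def triangles(n, links):
    """Return every group of three mutually linked paragraphs."""
    return [
        (a, b, c)
        for a, b, c in combinations(range(n), 3)
        if {(a, b), (a, c), (b, c)} <= links
    ]


paragraphs = [
    "stocks fell as markets closed",
    "markets closed lower and stocks fell",
    "stocks fell while markets closed early",
    "the recipe needs flour and sugar",
]
links = relationship_map(paragraphs)
groups = triangles(len(paragraphs), links)
```

In this toy input the first three paragraphs share most of their words and form a single triangle, while the fourth paragraph is linked to none of them.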
Another example used in IR is an algorithm for finding sub-topic structure in expository text that uses a moving window approach. [Multi-Paragraph Segmentation Of Expository Text, Marti A. Hearst, ACL '94, Las Cruces, NM]. Rather than using existing sentences and paragraphs, the words from the text are divided into token-sequences and blocks, each having a preselected length. For example, 20 words may be assigned as a token-sequence, which may then be described as a pseudo-sentence, and 6 token-sequences may then be assigned as a block, which may then be described as a pseudo-paragraph. Adjacent blocks are compared using a cosine similarity measure on the full set of words within each block. Two adjacent blocks form a window. By shifting the window over by one token-sequence, a comparison may be made for the next pair of adjacent blocks. The cosine calculation for each window is centered over the gap between its two blocks. Boundaries for topic changes are found by identifying the points of greatest change in the smoothed cosine-gap sequence from the moving windows after applying a set of rules. A typical set of rules might include requiring at least three intervening token-sequences between boundaries and specifying that all boundaries must be moved to the end of the nearest paragraph.
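The moving-window comparison can be sketched as follows. This is a simplified illustration, not the published algorithm: it assumes raw word counts as weights, lets blocks at the two ends of the text be shorter than the nominal block size, and omits the smoothing step and the boundary rules. The function names and the toy input are hypothetical.

```python
import math
from collections import Counter


def cosine(u, v):
    """Normalized dot product of two sparse term vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0


def gap_scores(words, seq_len=20, block_size=6):
    """Score each token-sequence gap by the similarity of its two blocks."""
    # Split the word list into fixed-length token-sequences.
    seqs = [words[i:i + seq_len] for i in range(0, len(words), seq_len)]
    scores = []
    for gap in range(1, len(seqs)):
        # The block before the gap and the block after it form one window.
        left = seqs[max(0, gap - block_size):gap]
        right = seqs[gap:gap + block_size]
        left_counts = Counter(w for s in left for w in s)
        right_counts = Counter(w for s in right for w in s)
        scores.append(cosine(left_counts, right_counts))
    return scores


# Toy text: eight words on one topic followed by eight on another.
words = ["cat", "dog"] * 4 + ["tax", "law"] * 4
scores = gap_scores(words, seq_len=2, block_size=2)
lowest_gap = scores.index(min(scores)) + 1  # gaps are numbered from 1
```

With this input the lowest score falls at the gap after the fourth token-sequence, which is where the vocabulary changes.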
In the VSM, certain "filters" are often used to identify the best words to characterize a document. Examples include filters that discard words that occur too frequently or too infrequently to allow documents within a corpus, or pieces within a document, to be successfully contrasted with one another. Articles, conjunctions, and certain adverbs (collectively called stop words) are thought to be devoid of theme content and are usually omitted from the document in VSM-based analysis. [Faloutsos, Christos, and Douglas Oard, "A survey of Information Retrieval and Filtering Methods"] Another useful and much more sophisticated filter is described by Bookstein, whereby words that occur non-randomly in blocks of expository text are identified and selected as key topic words for thematic evolution. [Bookstein, A., S. T. Klein, and T. Raita (1995) Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 319:327].
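A stop-word filter combined with a document-frequency filter might look like the sketch below. The stop-word list, the thresholds `min_df` and `max_df_ratio`, and the function name are all illustrative assumptions; the sketch does not implement the Bookstein filter.

```python
from collections import Counter

# A tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "and", "or", "but", "of", "on", "very"}


def filter_vocabulary(documents, min_df=2, max_df_ratio=0.9):
    """Keep non-stop words whose document frequency is neither too low nor too high."""
    doc_freq = Counter()
    for doc in documents:
        # Count each word once per document.
        doc_freq.update(set(doc.lower().split()))
    n = len(documents)
    return {
        word
        for word, df in doc_freq.items()
        if word not in STOP_WORDS and min_df <= df <= max_df_ratio * n
    }


documents = [
    "news the market fell",
    "news the market rose",
    "news the weather report",
    "news report on the market weather",
]
kept = filter_vocabulary(documents)
```

Here "news" is dropped because it appears in every document, "fell" and "rose" are dropped because each appears in only one, and "the" and "on" are dropped as stop words.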
Various methods in IR have also been used to compress vocabulary by looking at how words are associated with one another. In one approach, for example, a conditional probability matrix may be built such that each (i,j) entry represents the probability that word i occurs in a document (or corpus) given that word j also occurs. [Charniak, Eugene, "Statistical Language Learning", 1993, MIT Press]
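One way to estimate such a matrix is from document co-occurrence counts, as in the sketch below. It assumes the probability is estimated as the fraction of documents containing word j that also contain word i, and it stores the matrix as a dictionary keyed by (i, j); both choices, and the function name, are illustrative and are not taken from the cited text.

```python
def conditional_probabilities(documents):
    """Estimate P(word i occurs | word j occurs) from document co-occurrence."""
    doc_sets = [set(d.lower().split()) for d in documents]
    vocab = sorted(set().union(*doc_sets))
    matrix = {}
    for j in vocab:
        # Every vocabulary word occurs in at least one document,
        # so this list is never empty.
        docs_with_j = [s for s in doc_sets if j in s]
        for i in vocab:
            matrix[(i, j)] = sum(i in s for s in docs_with_j) / len(docs_with_j)
    return vocab, matrix


documents = ["stock market", "stock price", "market price stock"]
vocab, matrix = conditional_probabilities(documents)
p_stock_given_market = matrix[("stock", "market")]
p_market_given_stock = matrix[("market", "stock")]
```

The matrix is not symmetric: in this toy corpus "stock" occurs in every document that contains "market", but "market" occurs in only two of the three documents that contain "stock".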
Very generally in the VSM, the n-dimensional vector used to characterize the vocabulary for a particular document can be viewed as a signal, although the order of the terms in the vector is not related to chronological or narrative order. Both Hearst and Salton have created mathematical signals to represent a particular text, as noted above. Hearst creates a smoothed cosine-gap sequence that corresponds to the narrative order of the text. In Salton's approach, the merged paragraphs may also form a narrative-based signal.
While all of these methods have advantages for IR, there still exists a need for an improved method of automatically partitioning an unstructured electronically formatted natural language document into its sub-topic structure.