There are several known approaches to the problem of automatic summarization of stored electronic documents. These approaches include (1) the use of different kinds of statistics gathered from the text, (2) information extraction based on a word's position in the text or on the design of the document, (3) searching for “cue words” that mark text important enough to be represented in the summary, and (4) discourse analysis of the text to identify the elements that represent the center of a document's subtopic discussion.
These methods were modified as the means of linguistic text analysis evolved. At the earliest stages, such analysis could only divide text into words and sentences and perform elementary morphological analysis of words. Commonly, the summary was made up of the sentences of the initial text that received the highest rank or that met some other criterion. In such cases, the statistics were collected on the frequency of word usage in the text: the more often a word was found in the text, the weightier it was considered. Auxiliary words and other words considered insignificant were filtered out according to a set of predetermined lists.
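The frequency-based ranking described above can be illustrated with a minimal sketch. It is not taken from any of the cited systems: the stop list, the naive sentence splitter, and the function names are assumptions chosen for illustration only.

```python
import re
from collections import Counter

# Hypothetical stop list; the text above only refers to "predetermined lists".
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "and", "or", "is", "are", "to", "it"}


def split_sentences(text):
    # Naive split after sentence-final punctuation (an assumption, not a real tokenizer).
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    # Lowercased alphabetic tokens; no morphological analysis is attempted.
    return re.findall(r"[a-z']+", sentence.lower())


def summarize_by_frequency(text, n_sentences=1):
    sentences = split_sentences(text)
    # Word weight = number of occurrences in the whole text, stop words excluded.
    counts = Counter(
        w for s in sentences for w in tokenize(s) if w not in STOP_WORDS
    )

    def score(sentence):
        # Sentence rank = sum of the weights of its significant words.
        return sum(counts[w] for w in tokenize(sentence) if w not in STOP_WORDS)

    ranked = sorted(
        range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True
    )
    # Return the top-ranked sentences in their original order.
    return [sentences[i] for i in sorted(ranked[:n_sentences])]
```

Because the score is a plain sum, longer sentences tend to rank higher; this is one of the weaknesses of such shallow statistics.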
Alternatively, the so-called “tf*idf” word estimation was used, in which the distribution of a word across a document set is taken into consideration. Such estimation is discussed in U.S. Pat. No. 6,128,634 to Golovchinsky, et al., for highlighting significant terms so that a quick review of the document is relatively easy.
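As an illustration of tf*idf estimation, the following sketch weights each word by its count in a document multiplied by the logarithm of the inverse fraction of documents that contain it. This is one common variant, assumed here for illustration; it is not necessarily the formula used in the cited patent, and the function name is hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """documents: list of token lists. Returns one {word: weight} dict per document."""
    n_docs = len(documents)
    # Document frequency: in how many documents each word appears at least once.
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in documents:
        term_freq = Counter(tokens)
        # Assumed variant: raw term count times log(N / document frequency).
        weights.append(
            {w: term_freq[w] * math.log(n_docs / doc_freq[w]) for w in term_freq}
        )
    return weights
```

Under this variant, a word that occurs in every document of the set receives a weight of zero regardless of how often it occurs, while a word confined to a single document is weighted most heavily.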
A similar approach is used in U.S. Pat. No. 5,924,108 to Fein, et al., where the estimate for a sentence is computed as the arithmetic mean of the estimates of its words. The “cue words” method in this patent relies on the presence of certain words or their combinations in the sentence. U.S. Pat. No. 5,638,543 to Peterson, et al. describes a method for extracting single sentences.
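Scoring a sentence as the arithmetic mean of its word estimates, together with a simple cue-word check, might look like the sketch below. The cue-word list, the default weight of zero for unknown words, and the function names are assumptions for illustration and do not reproduce the cited patent.

```python
# Hypothetical cue words; the actual lists are not given in the text above.
CUE_WORDS = {"conclusion", "summary", "significant"}


def sentence_score(tokens, word_weights):
    # Arithmetic mean of per-word weights; words without a weight count as 0.0.
    if not tokens:
        return 0.0
    return sum(word_weights.get(t, 0.0) for t in tokens) / len(tokens)


def has_cue_word(tokens):
    # True when at least one cue word is present in the sentence.
    return any(t in CUE_WORDS for t in tokens)
```

Unlike a plain sum of weights, the mean does not favor long sentences, although a single heavily weighted word in a short sentence can dominate the score.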
Some systems use different combinations of the aforementioned approaches. For example, U.S. Pat. No. 5,978,820 to Mase, et al. determines a document's type with the help of different statistical values, such as the average number of words and symbols in a sentence, the average number of sentences in a paragraph, and so on. The topic of the document is then determined on the basis of specific word usage. A summary is compiled as the set of sentences that are included in the original document or that contain certain predetermined words.
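Statistical values of the kind mentioned above can be computed as in the following sketch. It assumes that paragraphs are separated by blank lines, that sentences end with ".", "!" or "?", and that "symbols" means characters; the function name and dictionary keys are hypothetical, and the mapping from these values to a document type is not shown.

```python
import re


def document_statistics(text):
    # Assumption: paragraphs are separated by a blank line.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    # Assumption: a sentence ends with ., ! or ? followed by whitespace.
    sentences = [
        s for p in paragraphs for s in re.split(r"(?<=[.!?])\s+", p) if s
    ]
    if not sentences:
        return {
            "avg_words_per_sentence": 0.0,
            "avg_chars_per_sentence": 0.0,
            "avg_sentences_per_paragraph": 0.0,
        }
    return {
        "avg_words_per_sentence": sum(len(s.split()) for s in sentences) / len(sentences),
        "avg_chars_per_sentence": sum(len(s) for s in sentences) / len(sentences),
        "avg_sentences_per_paragraph": len(sentences) / len(paragraphs),
    }
```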
In Kupiec, et al., “A Trainable Document Summarizer”, ACM Press, Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1995, pp. 68–73, the probability of a sentence being included in the summary is computed on the basis of such characteristics as sentence length, presence of certain words, sentence position, and presence of frequent words and proper names.
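A probability of this kind can be estimated with a naive Bayes classifier over binary sentence features, which treats the features as independent. The sketch below is a simplified illustration under that assumption, without smoothing; the feature names, data layout, and function names are hypothetical and are not taken from the cited paper.

```python
def train(examples):
    """examples: list of (features, in_summary), where features maps a
    feature name to a bool and in_summary is a bool label."""
    n = len(examples)
    n_pos = sum(1 for _, label in examples if label)
    model = {"prior": n_pos / n, "given_summary": {}, "overall": {}}
    for name in examples[0][0]:
        pos_true = sum(1 for f, label in examples if label and f[name])
        all_true = sum(1 for f, _ in examples if f[name])
        # P(feature | sentence in summary) and P(feature) over all sentences.
        model["given_summary"][name] = pos_true / n_pos if n_pos else 0.0
        model["overall"][name] = all_true / n
    return model


def inclusion_score(model, features):
    # Naive Bayes: prior times the product of P(F | summary) / P(F).
    score = model["prior"]
    for name, value in features.items():
        p_pos = model["given_summary"][name]
        p_all = model["overall"][name]
        if not value:
            p_pos, p_all = 1.0 - p_pos, 1.0 - p_all
        if p_all == 0.0:
            continue  # feature value never observed in training; skip it
        score *= p_pos / p_all
    return score
```

Sentences are then ranked by this score, and the highest-ranked ones are selected for the summary. With little training data the estimates are unreliable, so a practical system would add smoothing.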
However, all of these prior works carry out only shallow text analysis, which cannot produce high accuracy. All of these prior methods fail to determine the significance of information content. The use of more advanced means of text analysis, such as tagging, improves these methods through more exact determination of significant words, the use of lemmas in the calculation of statistics, and the search for patterns. Nevertheless, these improvements are limited and do not make the methods efficient.
The next stage in the development of means of linguistic text analysis applied to abstracting is the appearance of systems that mark out syntactic structures, such as noun phrases, surface subjects, actions and objects in sentences, but for very limited purposes. That is, as implemented, it is possible to perform the simplest semantic text analysis to reveal deep text objects and the relations between them. For example, results of deep text analysis are used in U.S. Pat. No. 6,185,592 to Boguraev, et al., where, for text segments, the most significant noun phrase groups are marked on the basis of their usage frequency in weighted semantic roles. The resulting document summary report presents the number of these noun phrases and their context.
Thus, a limited attempt has been made to build a summary report on the basis of automatically extracted knowledge from a text document at the object level. However, determining deep semantic relations between the objects themselves, in particular knowledge on the level of facts and also the main functional relations between the facts themselves, has not been considered.