One of the major problems at present in the field of automatic processing of text information presented in natural languages is the synthesis of text based on information objects extracted from text data. One of the applied problems of text synthesis based on extracted information is automatic text annotation.
Automatic annotation is a text data processing routine that extracts the basic information from the data for further processing. Existing methods of automatic annotation may be divided into two types. In the first type, known as “extraction-based summarization”, the annotation text consists of sentences taken from the source text. In the second type, “abstraction-based summarization”, the annotation text is synthesized on the basis of the content of the source text. Given the technical complexity of automatic text synthesis and of extracting information from text, most methods of annotation in use are of the “extraction-based summarization” type. Examples include TextRank, the method of annotation based on terminology and semantics, and the method of annotation based on latent semantic analysis.
The TextRank annotation method is an extremely simple algorithm for automatic annotation which presents the source text in the form of a graph whose vertices are sentences and whose edges are the “relation” between two sentences. The relation is defined by the number of identical words in the given sentences. Each edge in the graph has a weight, while each vertex is assigned a rating computed on the basis of two criteria: the number of edges arriving from other vertices, and the rating of these edges.
The vertices with the highest rating contain the sentences which will be used in the annotation text. The chief defect of this method of annotation is that it makes practically no allowance for the text semantics, and therefore the annotation is not always true and accurate.
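The graph-based ranking described above can be illustrated with a minimal sketch. It assumes a simple regular-expression tokenizer, an edge weight equal to the raw count of shared words, and a PageRank-style iteration with a damping factor of 0.85; the function names and parameter values are illustrative assumptions, not part of any published TextRank implementation.

```python
import re


def tokenize(sentence):
    """Return the set of lowercase words in a sentence (assumed tokenizer)."""
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))


def textrank_scores(sentences, damping=0.85, iterations=50):
    """Rate each sentence from the weights of edges arriving from other sentences."""
    words = [tokenize(s) for s in sentences]
    n = len(sentences)
    # weight[i][j]: number of identical words in sentences i and j (no self-loops).
    weight = [
        [len(words[i] & words[j]) if i != j else 0 for j in range(n)]
        for i in range(n)
    ]
    total = [sum(row) for row in weight]  # total edge weight at each vertex
    scores = [1.0] * n
    for _ in range(iterations):
        updated = []
        for i in range(n):
            # Each neighbour j passes on a share of its rating in proportion
            # to the weight of the edge between j and i.
            incoming = sum(
                weight[j][i] / total[j] * scores[j]
                for j in range(n)
                if total[j] > 0
            )
            updated.append((1 - damping) + damping * incoming)
        scores = updated
    return scores


def summarize(sentences, k=2):
    """Return the k highest-rated sentences in their original order."""
    scores = textrank_scores(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]
```

Because the edge weight depends only on word overlap, two sentences that express the same idea in different words receive no edge at all, which is the lack of semantic allowance noted above.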
The annotation algorithm based on terminology and semantics ranks the sentences of the source text by using metrics based on terms extracted from the text. With the aid of an ontology, a correlation is established between each term from the text and the terms from the heading, and on this basis the weight of each term is computed. The weight of a sentence is computed as the sum of the weights of all the terms used therein.
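The weighting scheme can be sketched as follows. The sketch assumes that the ontology is available as a lookup table `relatedness` mapping a (term, heading term) pair to a numeric correlation, and that the weight of a term is the sum of its correlations with the heading terms; both the table and the aggregation rule are hypothetical, since the description above does not fix them.

```python
def term_weight(term, heading_terms, relatedness):
    """Weight of a term: summed correlation with the heading terms (assumed rule)."""
    return sum(relatedness.get((term, h), 0.0) for h in heading_terms)


def sentence_weight(sentence_terms, heading_terms, relatedness):
    """Weight of a sentence: sum of the weights of all the terms used in it."""
    return sum(term_weight(t, heading_terms, relatedness) for t in sentence_terms)


def rank_sentences(sentences_terms, heading_terms, relatedness):
    """Return sentence indices ordered from the heaviest to the lightest."""
    weights = [
        sentence_weight(terms, heading_terms, relatedness)
        for terms in sentences_terms
    ]
    return sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
```

For example, with the heading term "car" and invented correlations of 0.9 for "engine" and 0.7 for "wheel", a sentence containing both terms outranks a sentence containing only "wheel".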
The method based on latent semantic analysis is also based on a ranking of sentences with the aid of terms. The foundation of the method is the principle of selecting the sentences having maximum importance in terms of a particular topic. However, this method also has drawbacks. Since a sentence is selected only if its importance is a maximum in at least one topic, a sentence whose importance is high in all topics, but not a maximum in any of them, will not make it into the annotation. In addition, topics of slight importance are not filtered out, so that the annotation may be larger than is needed.
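The first drawback can be seen in a small sketch of the selection rule. It assumes that the importance of each sentence in each topic has already been computed, for instance by a decomposition of a term-sentence matrix; the function name and the numeric values below are invented for illustration only.

```python
def select_by_topic_maximum(importance):
    """Select, for every topic, the sentence with the maximum importance.

    importance[t][s] is the importance of sentence s in topic t.
    """
    selected = set()
    for topic_row in importance:
        best = max(range(len(topic_row)), key=lambda s: topic_row[s])
        selected.add(best)
    return sorted(selected)


# Hypothetical importance values: rows are topics, columns are sentences.
importance = [
    [0.9, 0.1, 0.8],  # topic 0: sentence 0 is the maximum
    [0.1, 0.9, 0.8],  # topic 1: sentence 1 is the maximum
]
selected = select_by_topic_maximum(importance)
```

Sentence 2 scores 0.8 in both topics but is never the maximum, so it is left out of the selection. Every topic also contributes a sentence regardless of how important the topic is, which is why the annotation can grow larger than needed.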
The specification discloses a method of automatic annotation of text data of the “abstraction-based summarization” type, which remedies the deficiencies of the existing methods and enables a text synthesis with high accuracy based on data (information objects) extracted from the text.