Developments in computer technology have allowed processing of large volumes of data, including computer processing of texts. Computer-based generation of summaries or abstracts of a document is one of the challenging tasks of computational linguistics. The main challenge in computer-based generation of a summary of a document is twofold: (i) processing speed (as some summaries need to be generated “on the fly”) and (ii) accuracy (i.e., providing the summary without losing the overall meaning of the document).
Such computer-generated summaries are used in several areas of computer technology, such as search engines (for generating snippets for inclusion into a Search Engine Results Page or, simply, SERP), providing users with summaries of various documents to enable a more efficient computer search, generating newsfeeds from news articles, maintaining databases of textual information, computer-based translation of texts, and the like.
Generally speaking, there exist two types of computer-based approaches to generating a summary of a given document. The first type is a generative method of summarization, which involves selecting words or phrases (not whole sentences) from a particular document. The method then generates a summary based upon the selected words or phrases.
The second type, extractive summarization, is a process of selecting and extracting “text spans” (typically, sentences) from the document. The extracted text spans are then arranged in some order to form a summary.
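By way of a non-limiting illustration, the extractive approach described above can be sketched as follows. The sketch assumes a naive punctuation-based sentence splitter and an average word-frequency score for ranking sentences; the function names `split_sentences` and `extractive_summary` and the parameter `max_sentences` are hypothetical and are not taken from any of the cited references.

```python
import re
from collections import Counter


def split_sentences(text):
    # Naive splitter (an assumption): break on whitespace that follows ., ! or ?
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extractive_summary(text, max_sentences=2):
    sentences = split_sentences(text)
    # Word frequencies over the whole document act as a crude importance signal.
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        # Average frequency, so long sentences are not favoured automatically.
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    # Select the highest-scoring text spans ...
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    # ... and arrange them in their original document order.
    chosen = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in chosen)


if __name__ == "__main__":
    doc = "Cats sleep. Cats eat fish. Dogs bark loudly at night."
    print(extractive_summary(doc, max_sentences=2))
```

The selected sentences are emitted in document order here; other orderings (for example, by score) are equally possible under the description above.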
US2015/0293905 discloses a method for summarizing a document. A concept is detected for each sentence in the document. Relevance measures between the sentences are computed according to the detected concepts. A concept-aware graph is then constructed, wherein a node in the graph represents a sentence in the document and an edge between two nodes represents a relevance measure between the two corresponding sentences.
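Purely for illustration, a graph of this general kind (sentences as nodes, relevance measures as edge weights) might be built as in the sketch below. The cited publication's actual concept detection and relevance measure are not reproduced here: the sketch assumes a hypothetical keyword lexicon (`concept_lexicon`) for detecting concepts and uses Jaccard overlap of the detected concept sets as a stand-in relevance measure.

```python
from itertools import combinations


def detect_concepts(sentence, concept_lexicon):
    # Assumption: a concept is "detected" when any of its trigger terms occurs.
    tokens = {t.strip(".,;:!?").lower() for t in sentence.split()}
    return {concept for concept, terms in concept_lexicon.items() if tokens & terms}


def relevance(concepts_a, concepts_b):
    # Stand-in relevance measure: Jaccard overlap of the two concept sets.
    union = concepts_a | concepts_b
    return len(concepts_a & concepts_b) / len(union) if union else 0.0


def build_concept_graph(sentences, concept_lexicon):
    concepts = [detect_concepts(s, concept_lexicon) for s in sentences]
    # Adjacency map: node index -> {neighbour index: edge weight}.
    graph = {i: {} for i in range(len(sentences))}
    for i, j in combinations(range(len(sentences)), 2):
        weight = relevance(concepts[i], concepts[j])
        if weight > 0:  # keep only edges between related sentences
            graph[i][j] = weight
            graph[j][i] = weight
    return graph


if __name__ == "__main__":
    lexicon = {"animal": {"cat", "dog"}, "food": {"fish", "bone"}}
    sents = ["The cat eats fish.", "The dog chews a bone.", "Rain fell all day."]
    print(build_concept_graph(sents, lexicon))
```

A ranking step over such a graph (for example, scoring nodes by the total weight of their edges) could then be used to pick sentences for a summary; that step is omitted from the sketch.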
U.S. Pat. No. 7,899,666 discloses a method and system for automatically extracting relations between concepts included in electronic text. Aspects of the exemplary embodiment include a semantic network comprising a plurality of lemmas that are grouped into synsets representing concepts, each of the synsets having a corresponding sense, and a plurality of links connected between the synsets that represent semantic relations between the synsets. The semantic network further includes semantic information comprising at least one of: 1) an expanded set of semantic relation links representing: hierarchical semantic relations, synset/corpus semantic relations, verb/subject semantic relations, verb/direct object semantic relations, and fine grain/coarse grain semantic relationships; 2) a hierarchical category tree having a plurality of categories, wherein each of the categories contains a group of one or more synsets and a set of attributes, wherein the set of attributes of each of the categories is associated with each of the synsets in the respective category; and 3) a plurality of domains, wherein one or more of the domains is associated with at least a portion of the synsets, wherein each domain adds information regarding the linguistic context in which the corresponding synset is used in a language. A linguistic engine uses the semantic network to perform semantic disambiguation on the electronic text using one or more of the expanded set of semantic relation links, the hierarchical category tree, and the plurality of domains to assign a respective one of the senses to elements in the electronic text independently from contextual reference.
“Context-Based Similarity Analysis for Document Summarization” by Prabha et al. relates to document summarization. Document classification methods are used to assign categories to the documents, and a Bernoulli model of randomness is used for the document summarization process. The Bernoulli model of randomness is used to find the probability of the co-occurrence of two terms in a large corpus. The lexical association between terms is used to produce a context-sensitive weight for the document terms. The document indexing and summarization scheme is enhanced with a linguistic analysis mechanism. The context-sensitive index model is improved with semantic weight values. Concept-relationship-based lexical association measure estimation is performed for the indexing process. The Bernoulli lexical association measure is used to perform the document classification process. The Java language and an Oracle relational database are used for system development. The proposed model gives a higher weight to the content-carrying terms and, as a result, the most informative sentences appear at the top of the summary, which has a positive impact on the quality of the summary.