The need to review large amounts of text is very common. Magazines and databases of articles typically provide a summary for each article to enable the reader to determine its main subject, content, and relevance to his interests. These summaries are typically prepared by experts, for example, by the author of the article or by another expert who reads the article and prepares a summary of it. However, there are many cases in which the reader faces an article with no summary, or in which the provided summary is not suitable for his needs, is too long or too short, etc. In other cases, there are database providers that prepare their own summaries of articles. The art has tried to provide automated summarization of articles, but the automated methods suggested thus far have not been successful in providing accurate summaries. Moreover, those suggested methods are typically language dependent, i.e., they require adaptation of the software for each specific language, and such an adaptation is typically very complicated.
Document (e.g., an article) summaries should use a minimum number of words to express a document's main ideas. As such, high quality summaries can significantly reduce the information overload that many professionals in a variety of fields must contend with on a daily basis, assist in the automated classification and filtering of documents, and increase the precision of search engines. Automated summarization methods can use different levels of linguistic analysis: morphological, syntactic, semantic and discourse/pragmatic. Although the summary quality is expected to improve when a summarization technique includes language-specific knowledge, the inclusion of that knowledge impedes the use of the summarizer on multiple languages. Only systems that perform equally well on different languages without language-specific knowledge (including linguistic analysis) can be considered language-independent summarizers.
The publication of information on the Internet in an ever-increasing variety of languages dictates the importance of developing multilingual summarization approaches. There is a particular need for language-independent statistical techniques that can be readily applied to text in any language without depending on language-specific linguistic tools. In the absence of such techniques, the only alternative to language-independent summarization would be the labor-intensive translation of the entire document into a common language.
Linear combinations of several statistical sentence ranking methods were applied in the MEAD (Radev et al. 2001; Experiments in single and multidocument summarization using mead; First Document Understanding Conference) and SUMMA (Saggion et al., 2003; Robust generic and query-based summarization; In EACL '03: Proceedings of the tenth conference on European chapter of the Association for Computational Linguistics) approaches, both of which use the vector space model for text representation and a set of predefined or user-specified weights for a combination of position, frequency, title, and centroid-based (MEAD) features.
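The linear combination of statistical sentence features described above can be illustrated with a minimal Python sketch. It is not the MEAD or SUMMA implementation: the function names, the three simplified features (position, frequency, title overlap), their normalization, and the regular-expression tokenizer are all assumptions made for illustration, and the centroid-based feature is omitted.

```python
import re
from collections import Counter


def tokenize(text):
    # Naive tokenizer (an assumption): lowercase runs of word characters.
    return re.findall(r"\w+", text.lower())


def score_sentences(sentences, title, weights):
    """Score each sentence as a weighted sum of simple statistical features."""
    tokenized = [tokenize(s) for s in sentences]
    doc_freq = Counter(w for toks in tokenized for w in toks)
    max_freq = max(doc_freq.values(), default=1)
    title_words = set(tokenize(title))
    n = len(sentences)
    scores = []
    for i, toks in enumerate(tokenized):
        features = {
            # Earlier sentences receive a higher position score.
            "position": 1.0 - i / n,
            # Average document frequency of the sentence's words, scaled to [0, 1].
            "frequency": (sum(doc_freq[w] for w in toks) / (len(toks) * max_freq)
                          if toks else 0.0),
            # Fraction of title words that appear in the sentence.
            "title": (len(title_words & set(toks)) / len(title_words)
                      if title_words else 0.0),
        }
        scores.append(sum(weights[name] * features[name] for name in weights))
    return scores


def summarize(sentences, title, weights, k=2):
    """Return the k highest-scoring sentences in their original order."""
    scores = score_sentences(sentences, title, weights)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]
```

In this sketch the weights are supplied by the caller as a dictionary keyed by feature name, mirroring the predefined or user-specified weights mentioned above.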
Kallel et al. 2004 (Summarization at laris laboratory; In Proceedings of the Document Understanding Conference) and Liu et al. 2006b (Multiple documents summarization based on genetic algorithm; Lecture Notes in Computer Science, 4223:355) used genetic algorithms (GAs), which are known as prominent search and optimization methods, to find sets of sentences that maximize summary quality metrics, starting from a random selection of sentences as the initial population. In this capacity, however, the high computational complexity of GAs is a disadvantage: to choose the best summary, multiple candidates must be generated and evaluated for each document (or document cluster). Following a different approach, Turney 2000 (Learning algorithms for keyphrase extraction; Information Retrieval, 2(4):303-336) used a GA to learn an optimized set of parameters for a keyword extractor embedded in the Extractor tool. Orăsan et al. (2000; Enhancing preference-based anaphora resolution with genetic algorithms, Proceedings of the Second International Conference on Natural Language Processing, volume 1835, pages 185-195, Patras, Greece, June 2-4) enhanced the preference-based anaphora resolution algorithms by using a GA to find an optimal set of values for the outcomes of fourteen indicators, and applied the optimal combination of values obtained from data on one text to a different text. With such an approach, training may be the only time-consuming phase in the operation.
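The second use of GAs mentioned above, learning a set of parameters once during training, can be sketched as follows. This is a generic illustration and not the procedure of any cited work: the function name `evolve_weights`, the truncation selection, the single-point crossover, the Gaussian mutation, and all default values are assumptions, and the `fitness` callable is a hypothetical stand-in for a summary quality metric computed on training data.

```python
import random


def evolve_weights(fitness, n_weights, pop_size=20, generations=30,
                   mutation_rate=0.2, seed=0):
    """Search for a weight vector that maximizes `fitness` with a simple GA."""
    rng = random.Random(seed)
    # Initial population: random weight vectors in [-1, 1].
    population = [[rng.uniform(-1.0, 1.0) for _ in range(n_weights)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[:pop_size // 2]   # truncation selection
        children = [ranked[0][:]]          # elitism: keep the best unchanged
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            # Single-point crossover between the two parents.
            cut = rng.randrange(1, n_weights) if n_weights > 1 else 0
            child = a[:cut] + b[cut:]
            # Gaussian mutation applied independently to each weight.
            child = [w + rng.gauss(0.0, 0.1) if rng.random() < mutation_rate else w
                     for w in child]
            children.append(child)
        population = children
    return max(population, key=fitness)
```

Because the best individual is carried over unchanged in every generation, the returned vector is never worse than the best member of the initial population. All evaluations of `fitness` take place inside this training loop, so applying the learned weights afterwards requires no further search.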
It is an object of the present invention to provide an automated summarization method which is more accurate than the prior art.
It is another object of the present invention to provide such an automated summarization method which is language independent, and which can be performed almost equally well on different languages.
It is still another object of the present invention to provide an automated summarization method which, after a one-time training stage, is applied to accurately summarize articles in a real-time stage.
It is another object of the present invention to provide a cross-lingual summarization method, which is trained on a human-generated corpus in one language, and is then applied in a real-time stage to summarize documents in other languages.
Other objects and advantages of the present invention will become apparent as the description proceeds.