Multi-document summarization is the process of generating a generic or topic-focused summary by reducing documents in size while retaining the main characteristics of the original documents. Because one cause of data overload is that many documents share the same or similar topics, automatic multi-document summarization has attracted much attention in recent years. The explosive growth of documents on the Internet has driven the need for summarization applications. For example, informative snippet generation in web search can assist users in exploring the results further, and a Question/Answer system often requires a question-based summary to provide the information asked for in the question. Another example is short summaries for news groups in news services, which help users better understand the news articles in each group. Document summarization can be either generic or query-relevant. Generic multi-document summarization should reflect the general content of the documents without any additional information. Query-relevant multi-document summarization should focus on the information expressed in the given query, i.e., the summaries must be biased toward the given query. The system can handle both generic and query-relevant multi-document summarization.
The major issues for multi-document summarization are as follows. First, the information contained in different documents often overlaps, so it is necessary to find an effective way to merge the documents while recognizing and removing redundancy. Second, the important differences between documents must be identified and the informative content covered as fully as possible. Current multi-document summarization approaches usually focus on the sentence-by-term matrix: they perform either matrix factorization or sentence similarity analysis on it and group the sentences into clusters. The summaries are then created by extracting representative sentences from each sentence cluster. The problem with these existing approaches is that they ignore the context dependency of the sentences and treat them as independent of each other during sentence clustering and extraction. However, sentences within the same document or the same document cluster do influence one another, and this influence can be used as additional knowledge to help the summarization. Thus, given a collection of documents, discovering the hidden topics in the documents by document clustering can benefit the sentence context analysis during summarization.
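The conventional pipeline described above (sentence-by-term matrix, similarity analysis, one representative sentence per cluster) can be sketched roughly as follows. This is a minimal illustration, not the method of any particular system: the tokenizer, the use of raw term counts, cosine similarity, and the helper names `sentence_term_matrix`, `cosine` and `representative` are all assumptions made for the example, and the clusters are taken as given rather than computed.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    # Assumption: lowercase the text and keep alphabetic tokens only.
    return re.findall(r"[a-z]+", sentence.lower())


def sentence_term_matrix(sentences):
    # Rows are sentences, columns are vocabulary terms, entries are raw counts.
    counts = [Counter(tokenize(s)) for s in sentences]
    vocab = sorted(set().union(*counts))
    matrix = [[c[t] for t in vocab] for c in counts]
    return vocab, matrix


def cosine(u, v):
    # Cosine similarity between two rows of the sentence-by-term matrix.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def representative(matrix, cluster):
    # Hypothetical selection rule: choose the sentence index whose total
    # similarity to the members of its cluster is highest.
    return max(
        cluster,
        key=lambda i: sum(cosine(matrix[i], matrix[j]) for j in cluster),
    )
```

Note that every step here looks only at the sentences themselves. Nothing records which document a sentence came from, which is exactly the context information the existing approaches discard.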
To demonstrate the usefulness of the hidden topics embedded in the document clusters, a simple example is shown in Table 1. The synthetic dataset contains four very short articles, each of which contains only two sentences (8 sentences in total). The task is to generate a two-sentence generic summary for these articles.
TABLE 1
D1    S1: Apple Inc. is a corporation manufacturing consumer electronics.
      S2: Apple's design seems a lot more revolutionary to most Americans.
D2    S4: The design of Apple's products is more revolutionary than others in the market.
D3    S5: Apple is a corporation manufacturing consumer electronics.
      S6: The prices of Apple's machines are relatively high with the same performance.
D4    S7: Apple is a corporation manufacturing consumer electronics.
      S8: With the similar performance, Apple's machines have higher price than others.
In the illustrative example of Table 1, Di represents the ith document and Sj is the jth sentence. Looking at the data directly, D1 and D2 talk about the nice design of Apple's products, and D3 and D4 are related to the high prices. A high-quality summary should include both features of Apple's products. However, if the eight sentences were clustered into two groups solely on the basis of sentence similarity, S1, S5 and S7, which are essentially the same, would be assigned to one cluster, and the remaining sentences would form the other group, which discusses Apple's products. If the summary were limited to two sentences, one sentence would be drawn from each cluster, so the summary could cover only one feature of Apple's products, either nice design or high price. Thus, the summary is not comprehensive.
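The effect can be checked on the sentences of Table 1 with a small sketch. It uses Jaccard overlap of word sets as a stand-in for whatever sentence similarity a real system would use, and the 0.5 threshold and the function names are assumptions chosen only for illustration. S3 does not appear in the table as given, so it is omitted here.

```python
import re
from itertools import combinations

# Sentences as they appear in Table 1 (S3 is absent from the table as given).
SENTENCES = {
    "S1": "Apple Inc. is a corporation manufacturing consumer electronics.",
    "S2": "Apple's design seems a lot more revolutionary to most Americans.",
    "S4": "The design of Apple's products is more revolutionary than others in the market.",
    "S5": "Apple is a corporation manufacturing consumer electronics.",
    "S6": "The prices of Apple's machines are relatively high with the same performance.",
    "S7": "Apple is a corporation manufacturing consumer electronics.",
    "S8": "With the similar performance, Apple's machines have higher price than others.",
}


def tokens(text):
    # Assumption: lowercase alphabetic tokens, treated as a set of words.
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a, b):
    # Jaccard overlap: shared words divided by all distinct words.
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)


def similar_pairs(sentences, threshold=0.5):
    # Pairs of sentence labels whose word overlap reaches the threshold.
    return {
        (i, j)
        for i, j in combinations(sentences, 2)
        if jaccard(sentences[i], sentences[j]) >= threshold
    }


if __name__ == "__main__":
    for pair in sorted(similar_pairs(SENTENCES)):
        print(pair, round(jaccard(SENTENCES[pair[0]], SENTENCES[pair[1]]), 3))
```

Under these assumptions, the only pairs that reach the threshold are those among S1, S5 and S7. The design sentences and the price sentences overlap with each other far less, so a similarity-only split into two clusters separates the definitional sentences from everything else instead of separating design from price.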