Traditionally, a summary of a corpus (for example, a collection of online documents) is an abstraction of the text constructed from keywords. These keywords are extracted according to a certain importance measure. Previous work in keyword extraction has only considered frequency of occurrence in the corpus, and not the diversity of the keywords. As a result, common keywords are emphasized, while less obvious and less frequent but equally pertinent keywords or phrases are left unrevealed.
For keyword extraction, most traditional approaches have used supervised learning, which requires manually labeled training samples. Only a few studies have addressed unsupervised keyword extraction for general-content documents. For example, words may be ranked by a weighting method referred to as “TF*PDF,” where “TF” refers to Term Frequency and “PDF” refers to Proportional Document Frequency. This method gives extra weight to keywords appearing in many documents, based on the assumption that words which appear in more documents are more important than others. Another weighting method is Latent Semantic Analysis (LSA), which generates information on the interrelationships between keywords, documents, and concepts. LSA alone suffers from a problem similar to that of TF*PDF because it also relies only on document and keyword counts. As a result, relevant topics may again be obscured by the infrequency of their keywords in the corpus.
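To make the frequency bias concrete, the following is a minimal sketch of a TF*PDF-style weighting. The text above does not give the formula, so the sketch assumes the commonly cited form (a term's normalized frequency multiplied by the exponential of the proportion of documents containing it), collapsed to a single document source. The function name tf_pdf_weights and the sample tokens are hypothetical and used only for illustration.

```python
import math
from collections import Counter


def tf_pdf_weights(documents):
    """Return a TF*PDF-style weight for each term in a tokenized corpus.

    documents: list of token lists, one per document.
    Assumption: a single source ("channel"); the weight is
    (normalized term frequency) * exp(proportion of documents with the term).
    """
    n_docs = len(documents)
    term_freq = Counter()  # total occurrences of each term in the corpus
    doc_freq = Counter()   # number of documents containing each term
    for tokens in documents:
        term_freq.update(tokens)
        doc_freq.update(set(tokens))

    # Normalize term frequencies by their Euclidean norm over all terms.
    norm = math.sqrt(sum(f * f for f in term_freq.values()))
    if norm == 0:
        return {}

    return {
        term: (term_freq[term] / norm) * math.exp(doc_freq[term] / n_docs)
        for term in term_freq
    }


# Hypothetical three-document corpus.
docs = [
    ["storm", "flood"],
    ["storm", "relief"],
    ["storm", "flood", "levee", "levee", "levee"],
]
weights = tf_pdf_weights(docs)
ranked = sorted(weights, key=weights.get, reverse=True)
```

In this sample, "storm" and "levee" each occur three times, but "storm" appears in every document and "levee" in only one, so "storm" receives the higher weight. A term confined to a single document is pushed down the ranking however pertinent it may be, which is the limitation described above.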