Keywords can provide a compact representation of the essential content in a document or a corpus of documents. Keywords are widely used to define queries within information retrieval (IR) systems because they are easy to define, revise, remember, and share. In comparison to mathematical signatures, keywords can be independent of a given corpus and can be applied across multiple corpora and IR systems. Keywords have also been applied to improve the functionality of IR systems and to enrich the presentation of search results. However, despite their utility for analysis, indexing, and retrieval, most documents do not have assigned keywords.
Most existing approaches to assigning keywords rely on manual assignment by professional curators, who may use a fixed taxonomy, or on the author's judgment to provide a representative list. Accordingly, state-of-the-art methods commonly focus on automatically extracting words from documents, either to suggest keywords to a professional indexer or to generate summary features for documents that would otherwise be inaccessible. Early approaches to automatic keyword extraction evaluated corpus-level statistics of individual words. However, corpus-level approaches can miss important keywords that are not statistically discriminating across the corpus. These corpus-oriented methods also tend to operate on single words, which can be a limitation because a single word is often used in multiple, different contexts.
Keyword extraction methods that operate on individual documents can be applied to avoid the drawbacks of corpus-oriented methods. However, many of these document-oriented approaches rely on natural language processing and are not highly efficient. Accordingly, a need exists for automatic keyword extraction methods and apparatus that operate on individual documents and that are unsupervised, domain-independent, and language-independent.