Books are one of the oldest forms of written communication and have been used for thousands of years to store and transmit information. Nevertheless, because a large fraction of the electronic documents available online and elsewhere consists of short texts such as Web pages, news articles, and scientific reports, natural language processing techniques to date have focused on automated methods that target short documents. This is changing, however: more and more books are becoming available in electronic format through projects such as the Million Books project, the Gutenberg project, and Google Book Search. Similarly, many of the books published in recent years are available in electronic format, either for purchase or through libraries. Language processing techniques able to handle very large documents such as books are therefore becoming increasingly important.
A back-of-the-book index typically consists of the most important keywords addressed in a book, with pointers to the relevant pages inside the book. The construction of such indexes is one of the few tasks related to publishing that still requires extensive human labor. Although a certain degree of computer assistance exists, in the form of tools that help the professional indexer organize and edit the index, there are no methods that allow for complete or nearly complete automation of the task.
In addition to helping professional indexers in their task, an automatically generated back-of-the-book index can also be useful for the automatic storage and retrieval of a document; as a quick reference to the content of a book for potential readers, researchers, or students; or as a starting point for generating ontologies tailored to the content of the book.
Keywords are not only used as entries in back-of-the-book indexes; they can also give a concise, high-level description of a document's contents that helps to determine the document's relevance, or serve as a low-cost measure of similarity between documents. They are also used in topic search, in which a keyword is entered into a search engine and all documents with that keyword attached are returned to the user. Improved keyword extraction methods therefore have a wide range of applications, both for short documents and in back-of-the-book index generation for large documents.
Unfortunately, only a small fraction of documents have keywords assigned to them, and manually attaching keywords to existing documents is a very laborious task. Automating this process using artificial intelligence techniques such as machine learning is therefore of interest. In keyword extraction, any phrase in a new document can be identified (extracted) as a keyword. Machine learning or another computational technique is then used to determine the properties that distinguish candidate phrases that are keywords from those that are not.
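The candidate step described above can be illustrated with a minimal sketch. It is not taken from any system named in this document: the function name `candidate_phrases`, the tiny stopword list, the `max_len` limit, and the rule that a candidate may not begin or end with a stopword are all assumptions chosen for illustration. For simplicity the sketch also ignores sentence boundaries.

```python
# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"a", "an", "the", "of", "in", "and", "to", "is", "for"}


def candidate_phrases(text, max_len=3):
    """Return the set of word n-grams (1..max_len words) that could be keywords.

    A candidate is kept only if it neither starts nor ends with a stopword.
    """
    # Crude tokenization: split on whitespace, strip surrounding punctuation.
    words = [w.strip(".,;:!?()\"'").lower() for w in text.split()]
    words = [w for w in words if w]

    candidates = set()
    for i in range(len(words)):
        for n in range(1, max_len + 1):
            gram = words[i:i + n]
            if len(gram) < n:  # ran past the end of the text
                break
            if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                continue
            candidates.add(" ".join(gram))
    return candidates
```

Each candidate produced this way would then be passed to a learned or rule-based filter that decides whether it is actually a keyword.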
The state of the art in keyword extraction is currently represented by supervised learning methods, in which a system is trained to recognize keywords in a text based on lexical and syntactic features. This approach was first suggested in Turney, 1999, and in U.S. Pat. No. 6,470,307, where parameterized heuristic rules are combined with a special-purpose genetic algorithm into a system for keyword extraction (GenEx) that automatically identifies keywords in a document. Training GenEx on a new collection is computationally very expensive. A different learning algorithm was used in Kea [Frank et al., 1999]. Very briefly, Kea is a supervised system that uses a Naïve Bayes learning algorithm and several features, including information-theoretic features such as tf.idf and positional features reflecting the position of the words with respect to the beginning of the text. Training Kea is much quicker than training GenEx. Finally, in more recent work [Hulth, 2003], a system for keyword extraction from abstracts was proposed that uses supervised learning with lexical and syntactic features, which were shown to improve keyword extraction significantly over previously published results.
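The two kinds of features mentioned for Kea, a tf.idf score and the position of a word relative to the beginning of the text, can be sketched as follows. This is a simplified illustration and not Kea's actual formulation: the function names, the crude tokenizer, the add-one smoothing in the idf term, and the use of single words rather than phrases are all assumptions.

```python
import math
from collections import Counter


def tokenize(text):
    # Hypothetical, crude tokenizer: lowercase alphabetic runs only.
    cleaned = "".join(c.lower() if c.isalpha() else " " for c in text)
    return cleaned.split()


def tfidf_and_position(doc, corpus):
    """Map each word of `doc` to (tf.idf score, relative first occurrence).

    `corpus` is a list of background documents used for document frequencies.
    The relative first occurrence is 0.0 for the first word of the document
    and approaches 1.0 for words that first appear near the end.
    """
    tokens = tokenize(doc)
    counts = Counter(tokens)
    corpus_vocab = [set(tokenize(d)) for d in corpus]
    n_docs = len(corpus)

    features = {}
    for word, count in counts.items():
        tf = count / len(tokens)
        df = sum(1 for vocab in corpus_vocab if word in vocab)
        # Add-one smoothing (an assumption) avoids division by zero.
        idf = math.log((n_docs + 1) / (df + 1))
        first_occurrence = tokens.index(word) / len(tokens)
        features[word] = (tf * idf, first_occurrence)
    return features
```

In a supervised setting, feature pairs like these, computed for words labeled as keywords or non-keywords, would form the training data for a classifier such as Naïve Bayes.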
A related task that requires keyword extraction is that of annotating a document with links to sources of additional information. An example of a collection of such documents is found in Wikipedia, an online encyclopedia whose articles carry manually assigned keywords in the form of annotations: hyperlinks, embedded within the text of each article, to pages within or outside Wikipedia. These annotations are currently made by hand by the human contributors of Wikipedia articles, following a Wikipedia “manual of style” that gives guidelines concerning the selection of important concepts in a text, as well as the assignment of links to appropriate related articles. A system that could automatically perform the annotation task would aid contributors to Wikipedia, and many other applications could also benefit from such a system.
Thus, the ability to automatically extract keywords, and to automatically annotate electronic text using this ability, offers many benefits and has a large number of potential applications. However, even state-of-the-art systems for automatic keyword extraction and annotation still perform at relatively low levels on the standard information retrieval metrics of precision, recall, and F-measure, and they often fail to produce keywords or annotations approaching the quality of those manually constructed by human authors or professional indexers. There is, therefore, a need for improved keyword extraction methods and systems to enhance the quality of automatically generated indexes and for use in linking other relevant information to electronic documents.
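For reference, the three metrics named above can be computed by comparing a set of automatically extracted keywords against a manually constructed set. The sketch below uses the standard definitions with the balanced F-measure (F1). The function name and the exact-string matching of keywords are assumptions made for illustration.

```python
def precision_recall_f(extracted, gold):
    """Return (precision, recall, F-measure) for two keyword collections.

    precision = correct / number of extracted keywords
    recall    = correct / number of gold (manually assigned) keywords
    F-measure = harmonic mean of precision and recall
    """
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure
```

For example, if a system extracts four keywords of which two appear in a gold set of three, precision is 2/4, recall is 2/3, and the F-measure is 4/7.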