A "topic" in a document is any entity, concept or event explicitly referred to therein, and a "significant topic" is a topic central to what is sometimes called the "aboutness" of a document. Significant topics are thus those topics that constitute the central thrust of a document or part of a document. The notion "significant," like the notion "relevant," is both task and user dependent. What is significant for an application that answers specific questions is different from what is significant for an application that conveys the sense of particular documents; what is significant in a domain for a naive user may be quite different from what is significant to an expert.
In order to identify significant topics in a document, a significance measure is needed, i.e., a method for determining which concepts in the document are relatively important. In the absence of reliable full-scale syntactic and semantic parsing, frequency measures are often used to determine significance.
One of the earliest statistical techniques for identifying significant topics in a document for use in creating automatic abstracts was proposed by Luhn, who developed a method of making a list of stems and/or words, sometimes called keywords, removing keywords on a stop list, and then calculating the frequency of the remaining keywords. See H. P. Luhn, "The Automatic Creation of Literature Abstracts," IBM Journal of Research and Development, vol. 2(2), pp. 159-165 (1958). This method, which is based on the intuition that frequency of reference to a concept is significant, can be used to locate at least some important concepts in full text, especially when the frequency of a keyword in a document is calculated relative to its frequency in a large corpus, as in standard information retrieval (IR) techniques. See G. Salton, Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer (Addison-Wesley, Reading, Mass., 1989).
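A minimal sketch of this frequency-based approach might look as follows; the stop list, the tokenization, and the ratio-based scoring function are illustrative simplifications, not Luhn's or Salton's exact procedures:

```python
import re
from collections import Counter

# Illustrative stop list; real systems use much larger ones.
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "on", "and", "is", "that", "for"}

def keyword_frequencies(text):
    """Tokenize, drop stop-listed words, and count the remaining keywords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)

def relative_significance(doc_counts, corpus_counts):
    """Score each keyword by its document frequency relative to its corpus
    frequency, in the spirit of standard IR weighting (a simplified ratio,
    not true tf-idf)."""
    total = sum(corpus_counts.values()) or 1
    return {
        word: count / (corpus_counts.get(word, 0) / total + 1e-9)
        for word, count in doc_counts.items()
    }
```

Keywords that are frequent in the document but rare in the corpus receive the highest scores, matching the intuition that such terms carry the document's aboutness.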
However, the ambiguity of stems (trad might refer to trader or tradition) and of isolated words (state might be a political entity or a mode of being) means that lists of keywords have not usually been used to represent the aboutness of a document to human beings. Instead, techniques such as identifying sentences with multiple keywords have been used since Luhn for automatic creation of abstracts. See C. D. Paice, "Constructing Literature Abstracts by Computer: Techniques and Prospects," Information Processing & Management, vol. 26(1), pp. 171-186 (1990).
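The multiple-keyword sentence heuristic can be sketched as follows; the regular-expression sentence splitter and the hit-count scoring scheme are simplifying assumptions rather than any particular published abstracting algorithm:

```python
import re

def extract_summary(text, keywords, min_hits=2, max_sentences=3):
    """Select sentences containing multiple significant keywords, in the
    tradition of Luhn-style automatic abstracting."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    scored = []
    for position, sentence in enumerate(sentences):
        words = set(re.findall(r"[a-z]+", sentence.lower()))
        hits = len(words & keywords)
        if hits >= min_hits:
            scored.append((hits, position, sentence))
    # Prefer sentences with more keyword hits; break ties by document order.
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [sentence for _, _, sentence in scored[:max_sentences]]
```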
The challenge in preparing an abbreviated representation of an article is to identify heuristics which make it possible to represent to the user the sense in which an author used an expression in the document, without performing full sense disambiguation. In an important sense, every document can be viewed as forming its own "self-contained" world. A document is written to get across a particular idea or set of ideas. The task of the author, at least in documents intended for public distribution, is to convey to the reader what general knowledge is assumed and to inform the reader of the context so that the referents of ambiguous expressions can be easily identified. These references are governed by certain standard conventions.
For example, in an edited document such as a newspaper article, the first reference to a named entity such as a person, place or organization typically uses a relatively full form of the name in a version which is sufficient to disambiguate the reference for the expected audience. Later in the document, the same entity is usually referred to by a shorter, more ambiguous form of the name. See N. Wacholder, Y. Ravin and M. Choi, "Disambiguation of Proper Names in Text," Proceedings of the Applied Natural Language Processing Conference, pp. 202-208 (Washington, D.C., March 1997). An article might first refer to Columbia University or, more formally, Columbia University in the City of New York, and later refer only to Columbia. Without the initial disambiguating reference, Columbia by itself is quite ambiguous. It might be a city (Columbia, Md.), a bank (Columbia Savings and Loan), the space shuttle Columbia, Columbia Pictures, or one of many other entity names containing the word Columbia.
Nominator, a domain-general software module developed at IBM's T. J. Watson Research Center, is capable of identifying and disambiguating proper names in a text document. See id. Nominator categorizes these names and links expressions in the same document which refer to the same entity. See id. The module first builds a list of proper names in each document and then applies heuristics in order to link names which refer to the same entity, e.g., Hillary Clinton and Mrs. Clinton, but not Bill Clinton. Although the Nominator technique produces reliable links between references to the same entity in a document, the technique is strictly limited to identifying and conveying a list of proper nouns for indexing and information retrieval purposes. Nominator is not capable of identifying common noun phrases or the aboutness of a particular article.
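A token-containment heuristic of the general kind described above can be sketched as follows; this is an illustrative approximation, not Nominator's actual linking algorithm, and the honorific list is a placeholder:

```python
# Illustrative honorific list; a real system would use a fuller inventory.
HONORIFICS = {"mr.", "mrs.", "ms.", "dr."}

def may_corefer(full_name, short_name):
    """Heuristic: a shorter name may refer to the same entity as a fuller
    name if all of its content tokens (ignoring honorifics) appear in the
    fuller name."""
    short_tokens = [t for t in short_name.lower().split() if t not in HONORIFICS]
    full_tokens = set(full_name.lower().split())
    return all(t in full_tokens for t in short_tokens)
```

Under this heuristic, Mrs. Clinton links to Hillary Clinton because its only content token, Clinton, is contained in the fuller name, while Bill Clinton does not link because the token Bill is absent.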
Common noun phrases (NP's) also manifest a pattern of referential linking in documents, although it is more subtle and complicated than the proper name behavior. Any article of more than minimal length contains repeated references to important concepts. In general, when a word appears as the head of an NP, i.e., the noun that typically contributes the most syntactically and semantically to the meaning of the NP, it is used in the same sense throughout the document, especially in articles of newspaper length. Some of the references to the head are elliptical and therefore very ambiguous, at least out of context, but some of the references are usually longer and therefore more specific and more informative. The different references to a concept implicitly or explicitly refer to each other and collectively form an abstract construct that conveys the sense that the author presumably intended to convey. See M. Kameyama, "Recognizing Referential Links: An Information Extraction Perspective," Computational Linguistics (Jul. 7, 1997).
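Grouping the different references to a concept by their head noun can be approximated as follows; taking the last word of a phrase as its head is a rough heuristic for English NPs without post-modifiers, since true head identification requires parsing:

```python
from collections import defaultdict

def group_by_head(noun_phrases):
    """Group NP references by their head noun, approximated as the last word
    of each phrase. Longer members of a group (e.g., 'the first tax bill')
    are typically more informative than elliptical ones (e.g., 'the bill')."""
    groups = defaultdict(list)
    for np in noun_phrases:
        head = np.lower().split()[-1]
        groups[head].append(np)
    return dict(groups)
```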
Recently, efforts have been made to develop techniques for domain-independent content characterization. See B. Boguraev and C. Kennedy, "Technical Terminology for Domain Specification and Document Characterization," Information Extraction: A Multi-Disciplinary Approach to an Emerging Information Technology, pp. 73-96 (Lecture Notes in Computer Science Series, Springer-Verlag, Berlin, 1997). Boguraev and Kennedy take as a starting point the question of the applicability to document characterization of the approach of Justeson and Katz to identify technical terms in a corpus. See J. S. Justeson and S. M. Katz, "Technical Terminology: Some Linguistic Properties and an Algorithm for Identification in Text," Natural Language Engineering, vol. 1(1), pp. 9-27 (1995). Justeson and Katz developed a well-defined algorithm for identifying technical terminology, i.e., repeated multi-word phrases such as central processing unit in the computer domain or word sense in the lexical semantic domain. This algorithm identifies candidate technical terms in a corpus by locating NP's consisting of nouns, adjectives, and sometimes prepositional phrases. Technical terms are defined as those NP's, or their subparts, which occur above some frequency threshold in a corpus.
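The core of the Justeson and Katz filter can be sketched as follows; this version handles only the adjective-and-noun portion of their pattern, omits the prepositional-phrase variant, and assumes the text has already been part-of-speech tagged with 'A' (adjective) and 'N' (noun) labels:

```python
from collections import Counter

def candidate_terms(tagged_tokens, min_freq=2):
    """Collect multi-word spans matching ((Adj|Noun)+ Noun) from a list of
    (word, tag) pairs, counting subspans as well, then keep those at or
    above a frequency threshold, per the Justeson-Katz definition."""
    counts = Counter()
    n = len(tagged_tokens)
    for i in range(n):
        for j in range(i + 2, n + 1):  # multi-word spans only
            span = tagged_tokens[i:j]
            tags = [tag for _, tag in span]
            if tags[-1] == "N" and all(t in ("A", "N") for t in tags):
                counts[" ".join(word for word, _ in span)] += 1
    return {term: c for term, c in counts.items() if c >= min_freq}
```

Because subparts are counted along with full phrases, a repeated phrase such as central processing unit also yields its frequent subterms, e.g., processing unit.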
However, as Boguraev and Kennedy observe, the technical term technique cannot simply be adapted to the task of content characterization of documents. For an open-ended set of documents and document types, there is no domain to restrict the technical terms. Moreover, patterns of lexicalization of technical terms in a corpus do not necessarily apply to individual documents, especially short ones. Boguraev and Kennedy therefore propose relaxing the notion of a technical term to include an exhaustive list of "discourse referents" in a wide variety of text documents, and determining which referents are important by some measure of discourse prominence.