This invention relates to the indexing of textual material, and it relates more particularly to a method for locating subjects during indexing of a manuscript.
Many documents could be made more useful by the addition of a subject-page (or back-of-the-book) index. This is particularly true of reference documents where, often, the reader wants the answer to a specific question quickly. In this case, a table of contents will be somewhat helpful; but, typically, a table of contents does not have the level of detail that a subject-page index has. However, producing a back-of-the-book index manually is difficult and time-consuming.
In order to produce an index manually, a person typically reads through the manuscript circling subjects as they appear in the text. Often, someone else then transfers each subject and the page number to index cards. Page numbers are collated, and the subjects alphabetized to produce a subject-page index. Of course, if the document is altered, and the page numbers change, the index will have to be changed also.
Several references, discussed below, illustrate the types of work which have been done in the indexing field. G. J. Carney has published an article entitled "Computer-Assisted Index Preparation," which appears at pp. 329-338 in Proceedings of the American Documentation Institute 1966 Annual Meeting, Vol. 3 (1966). Carney actually uses a human indexer to whom the system presents a display of significant words of a text in sentence context, along with their page numbers. The indexer reviews such words and selects desired items for the index. An article by H. Borko entitled "Experiments in Book Indexing by Computer," appearing in Information Storage and Retrieval, Vol. 6, pp. 5-16, Pergamon Press, 1970, shows a one-pass system which generates an index while also generating the subjects used in the index. However, final selection in that arrangement was accomplished by a human, because no algorithm had been found that would not include phrases of little or no real utility for index users. A J. M. Janas article, "Automatic Recognition of the Part-of-Speech for English Texts," appeared in Information Processing and Management, Vol. 13, pp. 205-213, Pergamon Press, 1977. Janas does not deal directly with the problem of preparing an index from a subject list; but the author does discuss briefly the idea of recognizing parts of speech using, in part, word endings which are characteristic of the parts of speech. A U.S. patent to Cassada, U.S. Pat. No. 3,947,825, deals with an index search machine which compares abstracts of groups of words in stored information with an abstract of a user search request. Only a comparison showing a match causes the corresponding words underlying the abstracted material in the store to be searched in detail.
The task of indexing can be partly eased if a manuscript is prepared with a text processor, such as that represented by the known troff and nroff text processors available with the UNIX™ operating system licensed by the American Telephone and Telegraph Company. In this case, the author, with a subject list in hand, inserts a macro along with the name of a subject each time that the subject appears in the text. When the document is formatted, the macro separately prints the subject name and the page number. Page numbers still must be collated, and the subjects alphabetized; but those two functions can be performed by known techniques. In this way, it would be easier to recreate the index if the document were altered significantly and page numbers changed. However, the user still must decide which combinations of words constitute an occurrence of a particular subject and try to locate every occurrence of every subject in the document.
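The collating and alphabetizing functions mentioned above can be illustrated with a minimal sketch. It assumes that the formatter's macro output has already been reduced to a list of (subject, page number) pairs; the function name build_index, the sample entries, and the "subject, pages" output layout are hypothetical and are not taken from troff, nroff, or any of the cited references.

```python
from collections import defaultdict


def build_index(entries):
    """Collate (subject, page) pairs into alphabetized index lines.

    `entries` is assumed to be the subject/page output of the formatting
    pass; this input shape is an illustrative assumption.
    """
    pages_by_subject = defaultdict(set)
    for subject, page in entries:
        # A set drops repeated references to the same page.
        pages_by_subject[subject].add(page)

    lines = []
    # Alphabetize subjects without regard to case.
    for subject in sorted(pages_by_subject, key=str.lower):
        pages = ", ".join(str(p) for p in sorted(pages_by_subject[subject]))
        lines.append(f"{subject}, {pages}")
    return lines


# Hypothetical subject/page pairs as a formatting pass might emit them.
entries = [
    ("troff", 12),
    ("index cards", 3),
    ("troff", 4),
    ("macros", 12),
    ("troff", 12),
]

for line in build_index(entries):
    print(line)
```

Because the index is derived entirely from the emitted pairs, rerunning such a step after the document is reformatted would regenerate the page references. It does nothing, however, to locate occurrences of a subject in the text, which remains the user's burden.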