1. Field of the Invention
The present invention relates generally to apparatuses for automatically creating indexes for books, documents and the like, and more specifically, to an apparatus for automatically generating an index, glossary, concordance, a keyword list and the like and editing the same as required.
2. Description of the Related Art
Indexes are attached to many manuals explaining the operation of machines and computers, as well as to many books systematically describing technical matters. Indexes enhance the value of such books. Without an index, the value of even a book with excellent contents is reduced by half. No one looks through even half of a bulky book to find just one word.
An index is useful not only for operating a machine; it is also important for studies in the humanities. Were it not for concordances of the Bible or the works of Shakespeare, researchers would be at a loss when facing the originals.
Before documents could be prepared with electronic apparatuses such as word processors, indexes were created as follows.
First, an author and/or an editor of a book or a manual marks, with a specific code, strings of characters thought to be appropriate as index entries. Thereafter, preferably a plurality of persons copy the marked strings onto separate pieces of paper together with the pages on which the strings occur. At this time, one word is entered on each piece of paper. When many strings are expected to be extracted, it is preferable that the pieces of paper be small.
After all the marked strings are copied onto the pieces of paper, the pieces of paper are rearranged, for example alphabetically, in accordance with the strings entered thereon. The strings entered on the rearranged pieces of paper are adopted as index entries in that order.
Creating indexes manually as described above has the advantage of adaptability in extracting entries. For example, any code can be used as a mark. There is no limitation on the size of the mark.
However, the above-described manner has the following problems. First, a large number of man-hours are required for marking entries, copying the marked strings onto pieces of paper and alphabetically rearranging the pieces of paper on which the strings are entered. As the number of strings to be extracted increases, the required time increases ever more rapidly.
Secondly, errors are highly likely to occur. As the number of strings to be extracted and the number of people working increase, the number of errors in the completed index is expected to grow.
The third problem is that it is difficult to create an index before the document is completed. When the document is revised, the index must be reviewed for the strings related to the revision. Such a problem can occur even after the completion of the document.
The fourth problem is that when the author of the document is different from the person creating the index, the created index might not exactly match the contents of the document. Such a problem could occur, for example, in the selection of index entries and reference pages. In such a case, a user of the index will be frustrated by failing to find the desired and adequate information.
The fifth problem is that when strings to be adopted as entries are marked by different persons, the completed indexes may differ. This results from the fact that the criteria used in marking the strings are not shared. Such common criteria are lacking because the knowledge necessary for creating indexes has not been collected.
A large part of the above-described problems has been resolved by the development of computerized text processing systems such as word processors. A document is ordinarily stored as coded textual data in a computer.
Assume that strings to be adopted as index entries in this textual data are given some mark. As long as the marks are distinguishable from other characters used in the document, the computer is capable of rearranging the marked strings very easily, together with the pages on which they occur, in alphabetical order or in accordance with other arbitrary rules.
The above-described process is deterministic. A computer is capable of carrying out this process with unerring precision unless a hardware error or a program error occurs. The development of text processing systems supported by computers eliminates restrictions and errors from the work requiring the most manpower in index making.
Even in this case, however, the paper is simply replaced by an electronic display apparatus. It is still the task of an author or an expert editor to apply a particular symbol to each string of characters to be adopted as an index entry. Accordingly, the problem relating to the selection of entries still remains to be solved.
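The rearrangement of marked strings described above can be illustrated by the following minimal sketch. The mark syntax (double braces around an entry), the representation of the document as a list of page texts, and the name build_index are assumptions made only for illustration; the text above does not specify any of them.

```python
import re
from collections import defaultdict

# Hypothetical mark: an index entry is wrapped in double braces, e.g. {{entry}}.
MARK = re.compile(r"\{\{(.+?)\}\}")

def build_index(pages):
    """Collect marked strings with their pages and sort them alphabetically.

    pages is a list of page texts; page numbers are assumed to start at 1.
    """
    occurrences = defaultdict(set)
    for page_no, text in enumerate(pages, start=1):
        for entry in MARK.findall(text):
            occurrences[entry].add(page_no)
    # Case-insensitive alphabetical order; any other ordering rule could be used.
    ordered = sorted(occurrences.items(), key=lambda item: item[0].casefold())
    return [(entry, sorted(page_set)) for entry, page_set in ordered]

pages = [
    "The {{index}} enhances the value of a book.",
    "A {{concordance}} is a kind of {{index}}.",
]
print(build_index(pages))
```

Replacing the sort key is enough to apply another ordering rule, which corresponds to the "other arbitrary rules" mentioned above.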
Several methods of solving this problem have been proposed. One is described in "Support System of Autoindex of Manuals and Autoglossary" (Collected Articles of the 37th National Conference of '88 society of Information Processing) and the other is disclosed in Japanese Patent Application entitled "System of Extracting Japanese Text Keyword" (Japanese Patent Laying-Open No. 63-217418) by Kobayashi et al.
Before explaining the above-described two proposals, characteristics of the Japanese language dealt with by the two proposals will be described.
As is well known in the field of linguistics, the Japanese language is written with two syllabaries, "hiragana" and "katakana", each having about 50 characters, and with logographic, rather than alphabetic, "kanji" (Chinese characters). The kanji characters are read in "on" pronunciation originating from the Chinese language and in "kun" pronunciation originating from the Japanese language. The pronunciation of a kanji character can be resolved by taking the surrounding words and phrases into consideration.
Conventionally, in the Japanese language, a newly introduced concept has been expressed commonly by "kango" which is a combination of a plurality of kanji characters. Recently, a foreign word expressing a new concept is often transliterated and expressed in "katakana". Accordingly, for example, most of the technical words are expressed in kanji characters or katakana.
The proposal by Takahashi et al. is based on such characteristics of the Japanese language. According to Takahashi et al., the delimitation of a sentence by hiragana enables extraction of words comprising only kango or katakana, thereby enabling extraction of a technical word.
However, this proposal is applicable only to the Japanese language. While Takahashi et al. state that they formed an index creating apparatus having the above-described functions, they do not disclose a constitution thereof.
FIG. 1 is a block diagram of the apparatus proposed by Kobayashi et al. Referring to FIG. 1, the apparatus comprises a display unit 502, a keyboard 504, a Japanese text file 508, a Japanese text editor 506, a string extracting module 510, a keyword extracting module 512, a non-keyword file 518, a string occurrence tally table 520, a database updating module 516 and a keyword database 514.
Japanese text file 508 is for storing a coded Japanese text. Japanese text editor 506 edits a Japanese text file based on input from keyboard 504 by an operator. String extracting module 510 is for extracting technical words included in the document by selecting a word comprising only kanji or katakana in the same manner as by Takahashi et al.
Non-keyword file 518 is for storing a set of words which are not characteristic of the text. Keyword extracting module 512 is for selecting only appropriate strings by comparing the strings extracted by string extracting module 510 with the set of words stored in non-keyword file 518, which are thought to be inappropriate for keywords, for extracting, among the selected strings, strings of frequent occurrence as keywords, and for displaying the same in display unit 502. String occurrence tally table 520 is a working table for storing the string occurrences counted by keyword extracting module 512.
Keyword database 514 is for storing the extracted set of keywords. Database updating module 516 is for adding to keyword database 514 appropriate keywords selected by the operator among the keywords extracted by keyword extracting module 512 and displayed in display unit 502 in response to the input from keyboard 504 by the operator. Module 516 is also for adding to non-keyword file 518 the inappropriate keywords among the displayed keywords designated by the operator through keyboard 504.
In the processing of the Japanese language, different and separate ranges of internal codes are assigned to hiragana, katakana and kanji. Accordingly, such distinction among kanji, katakana and hiragana as described above can be easily made by checking the range of codes.
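The code-range check described above can be sketched as follows. The sketch assumes Unicode code points, which is an assumption of this example only; the internal codes actually used by the apparatus are not specified in the text above. The function name script_of is likewise hypothetical.

```python
def script_of(ch):
    """Classify one character by the range its code point falls in.

    Assumes Unicode: Hiragana block U+3040 to U+309F, Katakana block
    U+30A0 to U+30FF, CJK Unified Ideographs U+4E00 to U+9FFF.
    """
    cp = ord(ch)
    if 0x3040 <= cp <= 0x309F:
        return "hiragana"
    if 0x30A0 <= cp <= 0x30FF:
        return "katakana"
    if 0x4E00 <= cp <= 0x9FFF:
        return "kanji"
    return "other"

for ch in "あカ漢a":
    print(ch, script_of(ch))
```

Because each script occupies a separate range, the classification needs only a few integer comparisons per character and no dictionary lookup.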
Referring to FIG. 1, this apparatus operates as follows. Keyboard 504 applies Japanese text data to Japanese text editor 506. Japanese text editor 506 converts characters of the inputted Japanese words to internal codes. Japanese text editor 506 stores the coded Japanese text in Japanese text file 508 to enable the later revision.
String extracting module 510 receives the coded text data from Japanese text editor 506. String extracting module 510 determines whether each inputted character is hiragana, katakana or kanji by checking the range of its code as described above, and extracts the strings including only katakana and/or kanji.
Keyword extracting module 512 compares each string extracted by string extracting module 510 with the set of non-keywords stored in non-keyword file 518. Module 512 selects only the strings not stored in non-keyword file 518. Every time a keyword considered appropriate occurs, keyword extracting module 512 counts the frequency of the occurrence of the keyword by using string occurrence tally table 520.
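The extraction of strings including only katakana and/or kanji can be sketched as follows. This is an illustration under the assumption that the text is held as Unicode, not the implementation of string extracting module 510; the name extract_candidate_strings and the sample sentence are hypothetical.

```python
import re

# Runs of characters from the Katakana block (U+30A0 to U+30FF, which also
# contains the prolonged sound mark) or the CJK Unified Ideographs block
# (U+4E00 to U+9FFF). Hiragana and all other characters act as delimiters.
CANDIDATE = re.compile(r"[\u30A0-\u30FF\u4E00-\u9FFF]+")

def extract_candidate_strings(text):
    """Return every maximal run of katakana and/or kanji, in order of occurrence."""
    return CANDIDATE.findall(text)

sample = "索引を自動的に作成するワードプロセッサ"
print(extract_candidate_strings(sample))
# prints ['索引', '自動的', '作成', 'ワードプロセッサ']
```

In the sample, the hiragana particles and verb endings delimit the sentence, so only the kanji compounds and the katakana loanword remain as candidates.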
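The filtering and counting step can be sketched as follows, with a set standing in for the non-keyword file and a Counter standing in for the string occurrence tally table. These data structures and the name tally_keywords are assumptions of this example; the text above does not state how the file or the table is organized.

```python
from collections import Counter

def tally_keywords(strings, non_keywords):
    """Count occurrences of each extracted string that is not a registered non-keyword."""
    tally = Counter()
    for s in strings:
        if s not in non_keywords:
            tally[s] += 1
    return tally

extracted = ["索引", "作成", "索引", "場合", "索引"]
non_keywords = {"場合"}  # hypothetical contents of the non-keyword file
print(tally_keywords(extracted, non_keywords).most_common())
# prints [('索引', 3), ('作成', 1)]
```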
Database updating module 516 first displays the keywords extracted by keyword extracting module 512 in display unit 502. The operator checks the list of the keywords displayed in display unit 502. When a word inappropriate for a keyword is displayed, the operator instructs the database updating module 516 through keyboard 504 to exclude the word from the list of the keywords. Database updating module 516 adds to non-keyword file 518 the string designated by the operator as inappropriate for a keyword. Database updating module 516 also adds to keyword database 514 the strings not designated as inappropriate by the operator.
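The updating step can be sketched as follows. Sets stand in for the keyword database and the non-keyword file, and the operator's designations are passed in as a set of rejected words; all of these, together with the name update_databases, are assumptions of this example rather than details given above.

```python
def update_databases(displayed, rejected, keyword_db, non_keywords):
    """Route each displayed keyword according to the operator's decision.

    Words designated as inappropriate go to the non-keyword file;
    all remaining displayed words go to the keyword database.
    """
    for word in displayed:
        if word in rejected:
            non_keywords.add(word)
        else:
            keyword_db.add(word)

keyword_db, non_keywords = set(), {"場合"}
update_databases(["索引", "作成", "以下"], {"以下"}, keyword_db, non_keywords)
print(sorted(keyword_db), sorted(non_keywords))
```

Note that a word added to the non-keyword set is filtered out in every later run, which is the behavior criticized in the discussion of the problems of this apparatus.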
In the apparatus shown in FIG. 1, strings which can be used as keywords are extracted from the text in accordance with the characteristics of the Japanese language. Furthermore, strings inappropriate for keywords are excluded from the extracted strings. No manual operation is required for extracting the strings from the text. In addition, the contents of non-keyword file 518 are referred to in extracting the appropriate keywords. Accordingly, it is considered that the criteria of selection of words in creating the index are normalized to some extent.
However, the proposal by Kobayashi et al. has the following problems. In the apparatus according to Kobayashi et al., extraction of strings is carried out based only on the characteristics of the Japanese language. Kobayashi et al. do not disclose applicability of the apparatus to other languages.
In addition, in the apparatus according to Kobayashi et al., the extracted strings might be deleted by referring to the non-keyword file. Once registered in the non-keyword file, a word will not be adopted as an index entry thereafter. Besides, the determination as to whether or not a word is to be registered in the non-keyword file is made by an operator. Accordingly, when a plurality of operators use this apparatus, the choice of the keywords is limited because the non-keyword file is a sum of the respective non-keyword files of the operators.
Furthermore, in the apparatus according to Kobayashi et al., a word adopted as a keyword should have a high frequency of occurrence in the document. While a word of frequent occurrence in one document is surely thought to be important, the converse is not true. Namely, there could be a case where a string occurring just once in the document expresses an important concept. It is possible for the apparatus according to Kobayashi et al. to allow such an important word to be left out of the index.
Furthermore, in the disclosure by Kobayashi et al., no consideration is given to the field of knowledge of the document for which the index is used. Thus, the difference in words between different fields is ignored. The apparatus according to Kobayashi et al. allows important words to be left out of an index, or unimportant words to be included in it, costing much time in editing the index.
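The drawback of selecting keywords by frequency alone can be illustrated by the following minimal sketch. The use of a fixed threshold min_count is an assumption of this example; the text above states only that strings of frequent occurrence are extracted, not how "frequent" is decided.

```python
from collections import Counter

def frequent_keywords(strings, min_count):
    """Keep only the strings occurring at least min_count times (hypothetical criterion)."""
    tally = Counter(strings)
    return {word for word, count in tally.items() if count >= min_count}

occurrences = ["index", "index", "index", "glossary"]
print(frequent_keywords(occurrences, min_count=2))
# prints {'index'}
```

Here "glossary" occurs only once and is dropped, even though it may express an important concept in the document.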