Heretofore, search apparatuses for characteristic words have been developed in order to extract necessary information from a large number of documents. One conceivable method of realizing a characteristic word search is to read each document in order according to an input list of document numbers, count the occurrences of the words included in each document, and extract the most frequent words as characteristic words. However, because this reading process is random access and the document data must be read repeatedly, there is a problem that the search speed is slow. Further, although an approach can be considered which samples the documents and reads only a part of them, this approach has the problem that accuracy is greatly reduced.
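The counting method described above can be sketched as follows. This is only an illustrative sketch: the function name characteristic_words, the read_document callback, and the whitespace tokenization are assumptions made for the example and do not come from any cited system.

```python
from collections import Counter

def characteristic_words(doc_ids, read_document, top_n=3):
    """Count words over the listed documents and return the most frequent ones.

    read_document is a hypothetical callback that fetches the text of one
    document by its number; in a real system each call would be a random read.
    """
    counts = Counter()
    for doc_id in doc_ids:
        text = read_document(doc_id)        # one read per listed document
        counts.update(text.lower().split())  # naive whitespace tokenization
    return counts.most_common(top_n)

# Toy corpus standing in for document storage (assumed data).
docs = {
    1: "index search index",
    2: "search speed",
    3: "index speed index",
}
result = characteristic_words([1, 3], docs.__getitem__, top_n=2)
```

Every query repeats the read and the tokenization for each listed document, which is the cost the paragraph above identifies.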
In order to address this problem, Non Patent document 1, for example, discloses a search system which compacts the list of words appearing in each document, with the document number as a key, and performs a search while the compacted list is held in memory as data associating documents with words. Since the search system disclosed in Non Patent document 1 can use the data in memory to refer at high speed to the sequence of words included in the documents of the input document list, it can return related words at high speed.
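A minimal sketch of such an in-memory structure is shown below, under stated assumptions. The compaction scheme used here (mapping each word to an integer ID and storing the IDs of each document in a typed array) is an assumption chosen for illustration; the scheme actually used in Non Patent document 1 is not described in this text. The names build_forward_index and related_words are hypothetical.

```python
from array import array
from collections import Counter

def build_forward_index(docs):
    """Build a document-number -> word-ID array map, plus the ID -> word table."""
    vocab = {}    # word -> integer ID (assumed compaction: small ints, not strings)
    forward = {}  # document number -> compact array of distinct word IDs
    for doc_id, text in docs.items():
        ids = sorted({vocab.setdefault(w, len(vocab)) for w in text.split()})
        forward[doc_id] = array("I", ids)
    words = [None] * len(vocab)
    for word, word_id in vocab.items():
        words[word_id] = word
    return forward, words

def related_words(forward, words, doc_ids, top_n):
    """Count, for each word, how many of the listed documents contain it."""
    counts = Counter()
    for doc_id in doc_ids:
        counts.update(forward[doc_id])  # in-memory lookup keyed by document number
    return [(words[i], c) for i, c in counts.most_common(top_n)]

docs = {1: "index search index", 2: "search speed", 3: "index speed"}
forward, words = build_forward_index(docs)
top = related_words(forward, words, [1, 3], top_n=1)
```

Because the word lists are already tokenized and held in memory, a query only walks the arrays of the listed documents instead of rereading document text.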
Moreover, Non Patent document 2 discloses a search system whose components include a frequency-ordered index, obtained by sorting the inverted indexes of a document set in order of frequency, and a means for accepting queries to this frequency-ordered index.
In response to a query, the search system disclosed in Non Patent document 2 first reads the frequency-ordered index from the top, that is, starting with the most frequent words. Next, this search system compares the list of document numbers for each word with the input document list, and determines the frequency of each word within the document set specified by the input document list.
This process ends when the frequency f(k) of the k-th word among those read so far becomes greater than the frequency, in the whole document set to be searched, of the word to be read next from the frequency-ordered index. A word cannot appear more often in the specified document set than in the whole document set, so from that point on no unread word can displace the top k words. As described above, since the reading process is performed in the same order every time according to the frequency-ordered index, the reading process can be realized as sequential access. Therefore, according to the search system disclosed in Non Patent document 2, it is considered that the search speed can be improved.
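The reading and termination steps described above can be sketched as follows. This is a hedged illustration, not the implementation of Non Patent document 2: frequency is taken here to mean the number of documents containing a word, f(k) is read as the k-th largest frequency found so far, and the names build_frequency_ordered_index and top_k_words are hypothetical.

```python
import heapq

def build_frequency_ordered_index(docs):
    """Return (word, set of document numbers) pairs, most frequent word first."""
    postings = {}
    for doc_id, text in docs.items():
        for word in set(text.split()):
            postings.setdefault(word, set()).add(doc_id)
    # Sort by descending frequency in the whole document set; ties by word.
    return sorted(postings.items(), key=lambda item: (-len(item[1]), item[0]))

def top_k_words(index, input_docs, k):
    """Read the index sequentially and stop once the top k cannot change."""
    input_docs = set(input_docs)
    heap = []          # min-heap of (frequency within input_docs, word)
    entries_read = 0
    for word, posting in index:
        # Stop when f(k) exceeds the whole-set frequency of the next word.
        if len(heap) == k and heap[0][0] > len(posting):
            break
        entries_read += 1
        freq = len(posting & input_docs)  # compare with the input document list
        if len(heap) < k:
            heapq.heappush(heap, (freq, word))
        elif freq > heap[0][0]:
            heapq.heapreplace(heap, (freq, word))
    return sorted(heap, reverse=True), entries_read

docs = {
    1: "index search memory",
    2: "index search",
    3: "index query",
    4: "index search speed",
    5: "word",
}
index = build_frequency_ordered_index(docs)
top, entries_read = top_k_words(index, input_docs=[1, 2, 4], k=2)
```

In this toy corpus the index holds six words, but the loop stops after reading two of them, because both already have a frequency of 3 in the input documents and the next word appears in only one document overall.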