This invention relates to a full-text data base retrieving device which stores a plurality of texts (character code sequences) as a full-text data base and retrieves a text from the full-text data base on the basis of a retrieving condition such as key words, and relates to a full-text index producing device for producing a complementary file (full-text index) which is used in retrieval.
In order to retrieve a text from a full-text data base at a high speed, a complementary file (full-text index) is produced in association with the full-text data base and is referred to on retrieving the text from the full-text data base. In general, the full-text index is of any one of first through fifth types. The full-text indexes of the first through the fifth types may be called first through fifth type indexes, respectively.
A single word is used as a key in the first type index. A character sequence having a predetermined length is used as the key in the second type index. A character sequence consisting of characters of the same sort is used as the key in the third type index. The single word and the character sequence are used as the key in the fourth type index. The single word and the character sequence are used as the key in the fifth type index. As the retrieval operations described below indicate, a retrieval using the fourth type index yields the texts which have the key, whereas a retrieval using the fifth type index yields the text ID and the location in the text. A full-text index file may have a combination of any one of the first through the third type indexes and any one of the fourth and the fifth type indexes.
In a text retrieval of English text, use is often made of a full-text index file having the first and the fourth type indexes or the first and the fifth type indexes. The full-text index file of the type described will be called a first full-text index file. In English text, words are delimited by spaces. On the other hand, in a text retrieval of Japanese text, which is written solid without spaces between words, it is necessary to divide the text into words with reference to a word dictionary in order to produce the first full-text index file. This process will be called a morphological analysis. A full-text data base retrieving device having the first full-text index file will be called a first full-text data base retrieving device. In the first full-text data base retrieving device, a word which is at least partially coincident with a key word is retrieved from a key group of the first full-text index file when the key word is given as a query. When a coincident key (word) exists in the first full-text index file, the first full-text data base retrieving device reads the text ID or the location in the text as a retrieval result from the first full-text index file.
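The operation of the first full-text data base retrieving device on English text can be illustrated by the following minimal sketch. It is an illustration under stated assumptions, not the device itself: the function names `build_word_index` and `search_word`, the sample texts, and the use of lower-casing are all hypothetical, and only text IDs are recorded.

```python
from collections import defaultdict

def build_word_index(texts):
    """Map each lower-cased word (the key) to the set of text IDs having it."""
    index = defaultdict(set)
    for text_id, text in texts.items():
        # English words are delimited by spaces, so split() is sufficient here.
        for word in text.lower().split():
            index[word].add(text_id)
    return index

def search_word(index, key_word):
    """Return the IDs of texts having a key at least partially coincident
    with key_word (here: the key contains key_word as a substring)."""
    key_word = key_word.lower()
    result = set()
    for key, text_ids in index.items():
        if key_word in key:
            result |= text_ids
    return result

# Hypothetical sample texts keyed by text ID.
texts = {
    1: "full text retrieval",
    2: "index retrieving device",
    3: "word dictionary",
}
index = build_word_index(texts)
print(sorted(search_word(index, "retriev")))  # [1, 2]
```

For Japanese text, the call to `split()` would have to be replaced by a dictionary-based morphological analysis, which is the costly step discussed later in this section.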
A full-text index file having the second and the fourth type indexes will be called a second full-text index file. A full-text data base retrieving device having the second full-text index file will be called a second full-text data base retrieving device. In the second full-text data base retrieving device, a character sequence of a key word is divided when the key word is given as a query. It will be assumed that the key word is "" and that the predetermined length is equal to one. The second full-text data base retrieving device divides "" into "", "", and "" each of which is a key character. The second full-text data base retrieving device searches the second full-text index file on the basis of each key character to obtain a set of texts each of which has "", a set of texts each of which has "", and a set of texts each of which has "". On the basis of these sets, the second full-text data base retrieving device obtains a set of texts each of which has all of "", "", and "".
It will be assumed that the key word is "" and that the predetermined length is equal to two. The second full-text data base retrieving device divides "" into "" and "" each of which is a key character sequence. The second full-text data base retrieving device searches the second full-text index file on the basis of each key character sequence to obtain a set of texts each of which has "" and a set of texts each of which has "". On the basis of these sets, the second full-text data base retrieving device obtains a set of texts each of which has both "" and "". The set of texts may include rubbish, namely, texts which do not actually have the key word. More specifically, three characters may not be arranged in order of "" even if three characters of "", "", and "" are included in a text. For example, the text including the character sequence of ". . . . . . " becomes the rubbish. In order to remove the rubbish, it is necessary to carry out character string matching between the key word and each text of the retrieval result.
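The occurrence of the rubbish can be illustrated by the following sketch. It is a minimal sketch under stated assumptions: the key word is divided into overlapping character sequences of the predetermined length, only text IDs are recorded, and ASCII strings stand in for the Japanese examples. The names `ngrams`, `build_ngram_index`, `candidates`, and `verified` are hypothetical.

```python
from collections import defaultdict

def ngrams(s, n):
    """Divide s into overlapping character sequences of length n."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def build_ngram_index(texts, n):
    """Map each length-n character sequence to the set of text IDs having it."""
    index = defaultdict(set)
    for text_id, text in texts.items():
        for gram in ngrams(text, n):
            index[gram].add(text_id)
    return index

def candidates(index, key_word, n):
    """Intersect the text sets of all key character sequences (may hold rubbish)."""
    sets = [index.get(gram, set()) for gram in ngrams(key_word, n)]
    return set.intersection(*sets) if sets else set()

def verified(texts, candidate_ids, key_word):
    """Remove the rubbish by character string matching against each text."""
    return {tid for tid in candidate_ids if key_word in texts[tid]}

# Hypothetical sample texts: only text 1 actually has "abc".
texts = {1: "xxabcxx", 2: "cxbxa", 3: "abxbc"}

for n in (1, 2):
    index = build_ngram_index(texts, n)
    found = candidates(index, "abc", n)
    print(n, sorted(found), sorted(verified(texts, found, "abc")))
# 1 [1, 2, 3] [1]
# 2 [1, 3] [1]
```

With a predetermined length of one, texts 2 and 3 are rubbish. With a length of two, text 3 remains as rubbish because it has both "ab" and "bc" at separate locations. In both cases the matching step in `verified` has to read the candidate texts themselves.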
A full-text index file having the second and the fifth type indexes will be called a third full-text index file. A full-text data base retrieving device having the third full-text index file will be called a third full-text data base retrieving device. In the third full-text data base retrieving device, a character sequence of a key word is divided when the key word is given as a query. It will be assumed that the key word is "" and that the predetermined length is equal to one. The third full-text data base retrieving device divides "" into "", "", and "" each of which is a key character. The third full-text data base retrieving device searches the third full-text index file on the basis of each key character to obtain a set of pairs of the text ID and the location in the text at which "" appears, a set of pairs of the text ID and the location in the text at which "" appears, and a set of pairs of the text ID and the location in the text at which "" appears. The third full-text data base retrieving device combines the elements of these sets to obtain a location at which three characters of "", "", and "" appear as a character sequence of "" in the same text.
It will be assumed that the key word is "" and that the predetermined length is equal to two. The third full-text data base retrieving device divides "" into "" and "" each of which is a key character sequence. The third full-text data base retrieving device judges the location at which the character sequence of "" appears, in a manner similar to that described above. The rubbish does not occur in the third full-text data base retrieving device.
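The combination of locations can be illustrated by the following sketch. It is a minimal sketch under stated assumptions: each key records pairs of the text ID and the location in the text, the key word is divided into overlapping character sequences, and ASCII strings stand in for the Japanese examples. The names `build_positional_index` and `find_occurrences` are hypothetical.

```python
from collections import defaultdict

def build_positional_index(texts, n):
    """Map each length-n character sequence to a set of (text ID, location)."""
    index = defaultdict(set)
    for text_id, text in texts.items():
        for pos in range(len(text) - n + 1):
            index[text[pos:pos + n]].add((text_id, pos))
    return index

def find_occurrences(index, key_word, n):
    """Return (text ID, location) pairs at which key_word appears contiguously."""
    grams = [key_word[i:i + n] for i in range(len(key_word) - n + 1)]
    if not grams:
        return set()
    # Start from the locations of the first key character sequence ...
    hits = set(index.get(grams[0], set()))
    # ... and keep a location only if every following key character sequence
    # appears in the same text at the expected offset from it.
    for offset, gram in enumerate(grams[1:], start=1):
        postings = index.get(gram, set())
        hits = {(tid, pos) for (tid, pos) in hits if (tid, pos + offset) in postings}
    return hits

# Hypothetical sample texts: only text 1 actually has "abc" (at location 2).
texts = {1: "xxabcxx", 2: "cxbxa", 3: "abxbc"}

for n in (1, 2):
    index = build_positional_index(texts, n)
    print(n, sorted(find_occurrences(index, "abc", n)))
# 1 [(1, 2)]
# 2 [(1, 2)]
```

Because adjacency is checked on the recorded locations, no character string matching against the texts is needed. The cost is that the index holds one entry per location rather than one per text.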
A full-text index file having the third and the fourth type indexes will be called a fourth full-text index file. The fourth full-text index file uses, as a key character sequence, a character sequence obtained by dividing a text into runs of characters of the same sort, such as Chinese characters (kanji), the Japanese cursive syllabary (hiragana), and the square Japanese syllabary (katakana). A full-text data base retrieving device having the fourth full-text index file will be called a fourth full-text data base retrieving device.
It will be assumed that the text is "". Each of "", "", and "" becomes the key character sequence. The key word of the query is divided in a manner similar to that described above. The fourth full-text data base retrieving device searches the fourth full-text index file on the basis of the divided key word. For example, the key word is divided into "" and "" when the key word is "". The fourth full-text data base retrieving device searches the fourth full-text index file to obtain a text including "" and "".
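The division by character sort can be illustrated by the following sketch. It is a minimal sketch under stated assumptions: the character sort is judged from Unicode code point ranges, only text IDs are recorded, and the sample texts and key word are hypothetical and are not the examples of the original description. The names `char_sort`, `split_by_sort`, `build_sort_index`, and `search` are likewise hypothetical.

```python
from collections import defaultdict
from itertools import groupby

def char_sort(ch):
    """Classify a character by Unicode block (an assumed, simplified rule)."""
    code = ord(ch)
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"      # CJK Unified Ideographs
    if 0x3040 <= code <= 0x309F:
        return "hiragana"   # Japanese cursive syllabary
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"   # square Japanese syllabary
    return "other"

def split_by_sort(text):
    """Divide a text into runs of consecutive characters of the same sort."""
    return ["".join(group) for _, group in groupby(text, key=char_sort)]

def build_sort_index(texts):
    """Map each same-sort run (the key) to the set of text IDs having it."""
    index = defaultdict(set)
    for text_id, text in texts.items():
        for run in split_by_sort(text):
            index[run].add(text_id)
    return index

def search(index, key_word):
    """Divide the key word the same way and intersect the text sets."""
    sets = [index.get(run, set()) for run in split_by_sort(key_word)]
    return set.intersection(*sets) if sets else set()

# Hypothetical sample texts keyed by text ID.
texts = {1: "東京タワーにのぼる", 2: "タワーは東京にある"}
print(split_by_sort(texts[1]))  # ['東京', 'タワー', 'にのぼる']

index = build_sort_index(texts)
print(sorted(search(index, "東京タワー")))  # [1, 2]
```

Text 2 has both key character sequences but not the key word itself, so it is returned as rubbish. An index that records only text IDs therefore still requires character string matching against the texts.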
A full-text index file having the third and the fifth type indexes may be called a fifth full-text index file. A full-text data base retrieving device having the fifth full-text index file will be called a fifth full-text data base retrieving device. The fifth full-text data base retrieving device is operable in a manner similar to the fourth full-text data base retrieving device.
In the first full-text data base retrieving device, the above-mentioned morphological analysis must be used on producing the first full-text index file in case of Japanese text. In this analysis, it is necessary to divide each text into words with reference to a word dictionary having from a hundred thousand to several hundred thousand words. Therefore, it takes a long time to produce the first full-text index file. Furthermore, some texts may have a word which is not included in the word dictionary. As a result, it is difficult to analyze all of the texts with a high accuracy. Namely, it is difficult for the first full-text index file to have a high accuracy.
As described above, the retrieval results may include the rubbish in the second full-text data base retrieving device. Similarly, the retrieval results may include the rubbish in the fourth full-text data base retrieving device as known in the art. In order to remove the rubbish, it is necessary to carry out character string matching between the key word and each text of the retrieval result. As a result, it is difficult to carry out the retrieval at a high speed in each of the second and the fourth full-text data base retrieving devices.
On the other hand, it is possible to produce each of the third and the fifth full-text index files in a short time. The rubbish does not occur in each of the third and the fifth full-text data base retrieving devices on retrieval.
However, it is difficult to carry out a retrieval at a high speed with the full-text index file having a small capacity in any one of the third and the fourth full-text data base retrieving devices as will be described later.