The present invention relates to a system and method for searching, at high speed and with a desired degree of ambiguity, a large volume of data stored on a disk in, for example, document file format.
Conventionally, there has been demand for high-speed searching of large volumes of data written in natural language and stored on a disk, such as newspaper articles, patent publications, or scientific and engineering literature. Various search methods have been proposed. They can be roughly divided as follows.
Keyword search scheme: This scheme indexes, in advance, each individual document together with keywords indicating the contents of the document. Current methods for determining the keywords include automatic keyword extraction by morphological analysis, manual assignment of keywords, and a combination of the two. However, this scheme can search only for character strings that have been indexed as keywords. In addition, because the accuracy of automatic keyword extraction by morphological analysis depends on the accuracy of a dictionary, the scheme has the disadvantage of requiring much manpower to maintain the dictionary.
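The keyword search scheme can be illustrated with a minimal Python sketch. The names `build_keyword_index` and `keyword_search` and the sample documents are hypothetical and are not taken from any cited system; the sketch assumes the keywords have already been determined for each document.

```python
from collections import defaultdict

def build_keyword_index(doc_keywords):
    """Map each keyword to the set of document ids indexed under it."""
    index = defaultdict(set)
    for doc_id, keywords in doc_keywords.items():
        for keyword in keywords:
            index[keyword].add(doc_id)
    return index

def keyword_search(index, term):
    """Return the ids of documents indexed under exactly this keyword."""
    return sorted(index.get(term, set()))

# Hypothetical documents with their pre-assigned keywords.
doc_keywords = {
    1: ["pencil", "stationery"],
    2: ["search", "index"],
}
keyword_index = build_keyword_index(doc_keywords)

hits_indexed = keyword_search(keyword_index, "pencil")        # a registered keyword
hits_unindexed = keyword_search(keyword_index, "red pencil")  # never registered as a keyword
```

In this sketch a string that was never registered as a keyword yields no hits, even if it appears in the text of a document, which is the limitation noted above.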
Full text search scheme without index: This scheme does not use an index, but instead scans all documents to be searched every time a search character string is designated. Some systems use special hardware to increase the search speed. However, special hardware increases cost, and may restrict the types of machine that can be used in a client/server environment.
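As a rough software-only illustration of scanning without an index, the following Python sketch checks every document on each query. The function name `scan_search` and the sample documents are hypothetical.

```python
def scan_search(documents, query):
    """Scan every document for the query string; no index is consulted."""
    return [doc_id for doc_id, text in documents.items() if query in text]

# Hypothetical document collection.
documents = {
    1: "This is a pencil",
    2: "This is a red pencil",
    3: "Full text search with an index",
}

scan_hits = scan_search(documents, "pencil")
```

Every query touches the full text of every document, so the work per query grows with the total volume of stored data.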
Full text search scheme with index: The present invention relates to a scheme for full text search using an index. Known techniques intended to increase the speed of full text search by using an index include the following.
Japanese Published Unexamined Patent Application (PUPA) No. 4-205560 discloses creating a search file as follows: a character string to be searched is divided into search units, each being a unit on which the search is performed; ascending codes are appended to the search units; an attribute code indicating the logical partition of the search unit is appended to each divided search unit; and a character position sequence code indicating the position of the character within its search unit is appended to every character in the character string to be searched. Character position information consisting of the search unit identification code, the character position sequence code, and the attribute code is thereby created and stored in areas by character type.
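One possible reading of such a character position file is sketched below in Python. This is an interpretation for illustration only, not the implementation disclosed in the cited application: the names `build_char_index` and `find`, the tuple layout, and the sample search units are all assumptions.

```python
from collections import defaultdict

def build_char_index(units):
    """units: list of (attribute, text) pairs, one per search unit.

    Returns a mapping from each character to a list of
    (unit_id, position, attribute) entries, where unit_id is an
    ascending code and position is the offset within the unit.
    """
    index = defaultdict(list)
    for unit_id, (attribute, text) in enumerate(units):
        for position, ch in enumerate(text):
            index[ch].append((unit_id, position, attribute))
    return index

def find(index, query):
    """Return sorted (unit_id, start) pairs at which query occurs."""
    if not query:
        return []
    # Candidate starts: every occurrence of the first query character.
    candidates = {(u, p) for u, p, _ in index.get(query[0], [])}
    # Keep a candidate only if each later character sits at the right offset.
    for offset, ch in enumerate(query[1:], start=1):
        positions = {(u, p) for u, p, _ in index.get(ch, [])}
        candidates = {(u, p) for u, p in candidates if (u, p + offset) in positions}
    return sorted(candidates)

# Hypothetical search units with attribute codes "title" and "body".
units = [("title", "red pencil"), ("body", "this is a pencil")]
char_index = build_char_index(units)
pencil_hits = find(char_index, "pencil")
```

Because the entries are grouped by character, a query consults only the lists for the characters it contains rather than scanning the stored text.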
Japanese PUPA No. 4-215181 discloses creating a search file that reduces the number of character string collations in search processing and enables a general purpose information processing system to perform high speed collation. In the search file, character set position information, which indicates the position in the character string to be searched at which each constituent character set appears, is grouped by character set type.
It is frequently necessary to search not only for documents containing character strings that fully match a search character string, but also for documents containing character strings that partially match it. For example, the user may be uncertain of the search character string, or the search character string may appear in so many variations that it is difficult to enumerate them all. A typical prior art method for designating a partial character string is to use a regular expression. With this method, it is possible to designate repetition of any character zero or more times, repetition of any character one or more times, the end of a line, the beginning of a line, or any character within a range of specific character codes.
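The regular expression constructs listed above can be illustrated with Python's standard `re` module. The sample lines and patterns are hypothetical and chosen only for illustration.

```python
import re

lines = ["This is a pencil", "This is a red pencil", "A pencil case"]

# "^" anchors the beginning of a line, "$" the end of a line, and
# ".*" matches any character repeated zero or more times.
zero_or_more = re.compile(r"^This is a .*pencil$")
matches_star = [line for line in lines if zero_or_more.search(line)]

# ".+" matches any character repeated one or more times, so at least
# one character must appear between "a " and "pencil".
one_or_more = re.compile(r"a .+pencil")
matches_plus = [line for line in lines if one_or_more.search(line)]

# "[a-z]" matches any single character within a range of character codes.
char_range = re.compile(r"pe[a-z]cil")
matches_range = char_range.search("This is a pemcil") is not None
```

Each pattern has to anticipate where and how the text may vary; the regular expression syntax itself offers no direct way to state an overall degree of ambiguity.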
In addition, Japanese PUPA No. 63-99830 discloses a system having a function for partial matching between search character string data and character string data to be searched, the system comprising a table which accumulates data on synonym relationship in the search character string data, and data indicating whether the search character string data appears in any character string data to be searched.
Furthermore, Japanese PUPA No. 62-221027 discloses that, when a character string taken from the beginning of a character string for partial search is not found in a dictionary, the number of invalid searches is reduced by performing a forward search for the next extracted character string, whose length is incremented by one, so that a word can be extracted relatively quickly and efficiently.
Furthermore, Japanese PUPA Nos. 4-326164 and 5-174067 disclose a database search system comprising search means which stores self-correlation information for every article to be searched, finds the degree of matching between the self-correlation information of the search keys and the self-correlation information of the articles to be searched, and outputs the number of articles in descending order of the degree of matching.
However, with these prior art character string search techniques it is difficult to designate, for example, the degree of ambiguity of the character strings to be searched, so that the result of a search may not be what the user wants, or may contain many unnatural character strings. In addition, when the search character string is "This is a pencil", these techniques cannot recognize all of the following as similar to the search character string: "This is a pemcil" (one wrong character), "This is a red pencil" (one interposed word), and "This is a red pemcil" (one wrong character and one interposed word).
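The following Python sketch illustrates this limitation in its simplest form. It uses plain substring matching only as a stand-in for exact matching and does not reproduce any of the cited techniques; the variable names are hypothetical.

```python
query = "This is a pencil"

variants = [
    "This is a pemcil",      # one wrong character
    "This is a red pencil",  # one interposed word
    "This is a red pemcil",  # one wrong character and one interposed word
]

# Exact substring matching finds the query in none of the variants.
missed = [text for text in variants if query not in text]
```

All three variants are missed, although a reader would regard each as close to the query.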