1. Field of the Invention
The present invention relates to an information retrieval system, and more specifically to a string collating system for searching for a character string of arbitrary length within a given distance from a reference character string.
2. Description of Related Art
At present, character string collating systems are used for retrieval from text databases, collation of feature sequences in pattern recognition systems, extraction of key words from texts drafted with word processors, aids to language translation, address filtering of electronic mail, etc. In other words, string collating systems are indispensable to current data processing systems.
In string collating systems, it has been desired to extract from a plurality of character strings not only a character string that exactly matches a reference string, but also a character string having some degree of similarity to the reference string. The reasons for this are that (1) in text database retrieval, a text may contain one or more misspelled words, and (2) retrieval often has to be performed with an uncertain key word. In addition, when feature sequences are collated with a reference feature sequence in pattern recognition, a feature sequence that completely matches the reference feature sequence is rarely found. Therefore, it has been required to find, from a number of feature sequences, the feature sequence having the highest degree of similarity to a reference feature sequence.
As one means for measuring the degree of similarity between a reference character string and character strings to be collated, the concept called "distance" has been used. Assuming that a unitary operation is defined as deletion of one character, substitution of one character, or insertion of one character, the distance between two given character strings is defined as the minimum number of unitary operations required to change one of the two given character strings into the other.
The concept of "distance" is described in detail in "Approximate String Matching" by Patrick Hall and Geoff Dowling in Computing Surveys, 1980, Vol. 12, No. 4, Page 381.
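The definition of distance given above can be illustrated by the following minimal sketch in Python. It uses the well-known dynamic-programming computation of this distance and is offered only as an illustration of the definition, not as the collating method of any system discussed here; the function name `edit_distance` is an assumption chosen for this example.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of unitary operations (deletion, substitution,
    or insertion of one character) needed to change string a into b."""
    # prev[j] is the distance between the characters of a processed so far
    # (initially none) and the first j characters of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # changing a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute ca by cb (free if equal)
            ))
        prev = curr
    return prev[-1]


print(edit_distance("ABCD", "ABD"))      # one deletion
print(edit_distance("ABCD", "AXBXCXD"))  # three insertions
```

Under this definition, identical strings have a distance of 0, and the distance is the same in both directions, since every deletion in one direction corresponds to an insertion in the other.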
Referring to FIGS. 1A, 1B and 1C, there are illustrated examples of the unitary operations, namely deletion of one character, substitution of one character, and insertion of one character. In these Figures, "ABCD" is indicated as an original character string, and the modified strings obtained by performing one unitary operation are shown below each original character string "ABCD". In these Figures, "C̄" (a "C" with an upper bar) means any character excluding "C", and "X" means any arbitrary character. In addition, a character having an upper bar and "X" have the same meanings in the following description.
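The three unitary operations can also be sketched in Python as follows. This is an illustrative enumeration only and is not taken from the Figures; the function name `single_edit_variants` and the use of a literal placeholder character are assumptions made for this example. In the substitution case, the placeholder is assumed to stand for a character different from the one it replaces.

```python
def single_edit_variants(s: str, x: str = "X") -> dict:
    """List the strings obtained from s by exactly one unitary operation,
    using x as a placeholder for the substituted or inserted character."""
    return {
        # remove the character at position i
        "deletion": [s[:i] + s[i + 1:] for i in range(len(s))],
        # replace the character at position i by the placeholder
        "substitution": [s[:i] + x + s[i + 1:] for i in range(len(s))],
        # insert the placeholder before position i (or at the end)
        "insertion": [s[:i] + x + s[i:] for i in range(len(s) + 1)],
    }


variants = single_edit_variants("ABCD")
print(variants["deletion"])      # ['BCD', 'ACD', 'ABD', 'ABC']
print(variants["substitution"])  # ['XBCD', 'AXCD', 'ABXD', 'ABCX']
print(variants["insertion"])     # ['XABCD', 'AXBCD', 'ABXCD', 'ABCXD', 'ABCDX']
```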
Referring to FIG. 1D, there is shown a table indicating examples of character strings separated from a character string "ABCD" by a distance of up to "3". It will be seen from FIG. 1D that "ABCD" is separated from the character string "ABCD" by a distance "1", and therefore is nearer to the character string "ABCD" than "ACXD", which is separated from the character string "ABCD" by a distance "2".
Japanese Patent Application Laid-open No. 61-95442 and a corresponding European Patent Application Laid-open No. 0178651 disclose a character string collating system capable of searching for a character string within a distance "1" from a reference string. However, the extraction of character strings within a distance "1" is not sufficient for using the character string collating system in a pattern recognition system for voice recognition or handwritten letter recognition. In voice recognition, for example, a feature sequence extracted from a given voice (the feature sequence corresponds to a string to be collated) involves various fluctuations due to differences in the age, sex, native place, etc. of the speaker. Therefore, a feature sequence extracted from a given voice is rarely within a distance "1" from a template of a prepared feature sequence (corresponding to a reference string). Accordingly, in order to use the character string collating system in a pattern recognition system, it is necessary to extract a group of character strings within a larger distance, and to select from the group of extracted character strings the character string having the smallest distance. The above-mentioned laid-open application has disclosed a string collating system meeting this requirement.
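The two-step requirement described above, namely extracting the strings within a larger distance and then selecting the nearest one, can be sketched in Python as follows. This is a software illustration under stated assumptions and does not represent the system of the above-mentioned laid-open application; the names `edit_distance` and `nearest_within` and the parameter `max_distance` are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Distance counted in deletions, substitutions, and insertions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def nearest_within(reference: str, candidates: list, max_distance: int):
    """Return (distance, string) for the candidate nearest to reference
    among those within max_distance, or None if no candidate qualifies."""
    # Step 1: extract the group of strings within the allowed distance.
    group = [(edit_distance(reference, c), c) for c in candidates]
    group = [(d, c) for d, c in group if d <= max_distance]
    if not group:
        return None
    # Step 2: select the nearest string; ties keep the earliest candidate.
    return min(group, key=lambda pair: pair[0])


print(nearest_within("ABCD", ["AXBXCXD", "ACXD", "ABXD"], 3))  # (1, 'ABXD')
```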
For example, the string collating system disclosed in the above-mentioned laid-open application can search for a character string "AXBXCXD" separated from a reference string "ABCD" by a distance "3". However, because the string collating system disclosed in the above-mentioned laid-open application does not take deletion of a character (which is one of the unitary operations) into consideration, it cannot search for a character string "AD" separated from a reference string "ABCD" by a distance "2". In other words, the string collating system disclosed in the above-mentioned laid-open application cannot evenly extract all character strings of different lengths within a predetermined distance from a reference string. From a different viewpoint, it can be said that the string collating system disclosed in the above-mentioned laid-open application may fail to extract a character string having a high degree of similarity to a reference string, and often searches for only a character string having a low degree of similarity
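The effect of leaving deletion out of consideration can be illustrated by the following Python sketch. It is only a simplified model written for this explanation and is not the mechanism of the above-mentioned laid-open application; the function name `distance` and the flag `allow_deletion` are assumptions. Deletion here means removing a character of the reference string.

```python
import math


def distance(reference: str, candidate: str, allow_deletion: bool = True):
    """Minimum number of unitary operations changing reference into
    candidate; returns math.inf if the change is impossible with the
    permitted operations."""
    # Changing "" into candidate[:j] takes j insertions.
    prev = list(range(len(candidate) + 1))
    for i, cr in enumerate(reference, start=1):
        # Changing reference[:i] into "" needs i deletions.
        curr = [i if allow_deletion else math.inf]
        for j, cc in enumerate(candidate, start=1):
            best = min(curr[j - 1] + 1,            # insertion
                       prev[j - 1] + (cr != cc))   # substitution
            if allow_deletion:
                best = min(best, prev[j] + 1)      # deletion
            curr.append(best)
        prev = curr
    return prev[-1]


print(distance("ABCD", "AXBXCXD", allow_deletion=False))  # 3
print(distance("ABCD", "AD", allow_deletion=False))       # inf
print(distance("ABCD", "AD"))                             # 2
```

In this model, a candidate shorter than the reference string can never be reached once deletion is excluded, which is why "AD" is missed even though it lies at a distance of only "2".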