Speech recognition, optical reading, correspondence searches of gene and protein sequences in bioinformatics, and database searches in general are examples of situations in which there is a need to find a specific input symbol string among a set of symbol strings. A symbol string can be made up, for example, of consecutive characters or of consecutive symbols representing phonemes. Often there is a danger that the input symbol string is not completely correct. The aim is nevertheless to find, for instance among the symbol strings of a database, the symbol string that completely corresponds to the input symbol string or, if a fully corresponding symbol string cannot be found, the symbol string that resembles it the most.
A previously known solution searches for the symbol string among symbol strings arranged into a trie data structure. The symbol strings are grouped into branches in such a manner that all symbol strings starting with the same symbols belong to the same branch. The symbol strings in one branch divide into new branches at the symbols from which onwards they differ from each other.
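The grouping described above can be illustrated with a minimal sketch. It assumes a nested-dictionary representation; the names build_trie and END and the sample strings are hypothetical and chosen only for illustration, not taken from the known solution.

```python
END = None  # hypothetical marker key: a complete symbol string ends at this node


def build_trie(symbol_strings):
    """Arrange symbol strings into a trie of nested dictionaries.

    Strings that start with the same symbols share the same branch and
    split into new branches at the first symbol where they differ.
    """
    root = {}
    for string in symbol_strings:
        node = root
        for symbol in string:
            # Reuse the existing branch for this symbol or open a new one.
            node = node.setdefault(symbol, {})
        node[END] = string  # remember which stored string ends here
    return root


trie = build_trie(["car", "cart", "cat", "dog"])
# "car", "cart" and "cat" share the branch c -> a and split after "a".
print(sorted(trie["c"]["a"].keys()))
```

In this example, "car" and "cart" continue to share a branch after the symbol "r", while "cat" leaves the common branch directly after "a".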
The “tree-like” trie data structure has been employed in the search for symbol strings in such a manner that the branches of the data structure are searched all the way to the leaves. Each new symbol encountered on a branch indicates a calculation point. At each calculation point, a distance is calculated between the searched input symbol string and a sample symbol string formed by the symbols of the calculation point and of the calculation points preceding it, by comparing them in alternative ways. The distance refers to any reference value that describes how many changes are required to make the compared symbol strings correspond to each other. One known way of calculating the distance is the Levenshtein algorithm.
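As an illustration of such a distance, the following is a minimal sketch of the Levenshtein distance between two symbol strings, assuming a cost of one for each insertion, deletion and substitution. The function name levenshtein is chosen here for illustration only.

```python
def levenshtein(a, b):
    """Return the number of single-symbol insertions, deletions and
    substitutions needed to turn sequence a into sequence b."""
    # prev[j] is the distance between the symbols of a processed so far
    # and the first j symbols of b.
    prev = list(range(len(b) + 1))
    for i, symbol_a in enumerate(a, start=1):
        curr = [i]  # turning i symbols of a into an empty string: i deletions
        for j, symbol_b in enumerate(b, start=1):
            cost = 0 if symbol_a == symbol_b else 1
            curr.append(min(
                prev[j] + 1,         # delete symbol_a
                curr[j - 1] + 1,     # insert symbol_b
                prev[j - 1] + cost,  # substitute, or keep a matching symbol
            ))
        prev = curr
    return prev[-1]


print(levenshtein("kitten", "sitting"))  # prints 3
```

Because the function only compares symbols for equality, it applies equally to lists of phoneme symbols and to character strings.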
The calculation ends when the distances for all calculation points of all branches of the trie data structure have been calculated. After this, a comparison is made to find the shortest distance. To produce a response, the symbol string of the branch, or the symbol strings of the branches, with the shortest distance at the last calculation point is selected.
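The exhaustive procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the prior-art implementation itself: it assumes the Levenshtein distance with unit costs, a non-empty set of stored strings, and hypothetical names (build_trie, exhaustive_search, the "children" and "word" keys).

```python
def build_trie(symbol_strings):
    """Build a trie; each node holds its child branches and, if a stored
    string ends there, that string."""
    root = {"children": {}, "word": None}
    for string in symbol_strings:
        node = root
        for symbol in string:
            node = node["children"].setdefault(
                symbol, {"children": {}, "word": None})
        node["word"] = string
    return root


def exhaustive_search(root, query):
    """Visit every calculation point, then select the shortest distance.

    Returns (closest stored strings, their distance, calculation points).
    """
    results = []         # (distance, stored string) for every stored string
    points = 0           # number of calculation points evaluated
    stack = [(root, list(range(len(query) + 1)))]
    while stack:
        node, prev = stack.pop()
        if node["word"] is not None:
            results.append((prev[-1], node["word"]))
        for symbol, child in node["children"].items():
            # One calculation point: extend the Levenshtein table by the
            # row that belongs to this symbol of the branch.
            row = [prev[0] + 1]
            for j, query_symbol in enumerate(query, start=1):
                cost = 0 if query_symbol == symbol else 1
                row.append(min(prev[j] + 1, row[j - 1] + 1,
                               prev[j - 1] + cost))
            points += 1
            stack.append((child, row))
    # Only now, after all points are calculated, is the shortest chosen.
    best = min(distance for distance, _ in results)
    closest = sorted(word for distance, word in results if distance == best)
    return closest, best, points


root = build_trie(["car", "cart", "cat", "dog"])
print(exhaustive_search(root, "cas"))  # prints (['car', 'cat'], 1, 8)
```

In this example all eight calculation points are evaluated, including those on the branch of "dog", before the closest symbol strings are selected.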
The most significant weakness of the above-mentioned prior-art solution is that it requires a relatively large amount of calculation. The best possible symbol string, i.e. the one closest to the input symbol string, can only be found after all calculation points in the trie data structure have been calculated. Because the number of symbol strings is extremely large in database searches, for instance, the number of required calculations becomes very large, and obtaining a response to the input therefore takes a long time.