1. Field of the Invention
The present invention relates to a string-processing device that predicts the probability of occurrence of a character following a given string and, in particular, to a string-processing device that predicts the probability of occurrence of each character of a string from the left context of that character in the string.
2. Description of the Related Art
PPM (Prediction by Partial Matching) is widely used as a statistical language model in text compression. PPM* is a variant of PPM (see ‘Unbounded Length Context for PPM’, The Computer Journal, Vol. 40, No. 2/3, 1997, pages 67-75, written by J. G. Cleary and M. J. Teahan of the Department of Computer Science, University of Waikato, Hamilton, New Zealand, and ‘Japanese Word Segmentation by a PPM* Model’, NL report, 128-2 (1998.11.5), pages 9-16, written by Hiroki Oda and Kenji Kita of the Faculty of Engineering, Tokushima University). PPM* is characterized in that no upper limit is set on the order n (context length) of the model.
PPM* requires a string indexing structure that stores past contexts compactly and allows them to be looked up, added to, and deleted from flexibly and at high speed. In the related art, a trie (see the above-mentioned document ‘Unbounded Length Context for PPM’) or the like is used as such a string indexing structure.
However, when a trie or the like is used as the string indexing structure, a large storage capacity is required as the amount of stored context increases.
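The storage growth can be illustrated with a minimal sketch. It assumes a naive, uncompacted suffix trie that holds every context of unbounded length; the function name `count_suffix_trie_nodes` is hypothetical, and the tries used in the related art may apply compaction that this sketch does not model.

```python
def count_suffix_trie_nodes(text: str) -> int:
    """Count the nodes of a naive suffix trie built over `text`.

    Each distinct substring of `text` gets its own node, so the node
    count can grow much faster than the length of the text.
    """
    root: dict = {}
    nodes = 1  # the root node
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            if ch not in node:
                node[ch] = {}  # a new context that was not stored before
                nodes += 1
            node = node[ch]
    return nodes


if __name__ == "__main__":
    for sample in ("abc", "abcdefgh", "abcdefghijklmnop"):
        print(len(sample), count_suffix_trie_nodes(sample))
```

For a text whose characters are all different, the node count in this sketch is n(n+1)/2 + 1 for a text of length n, that is, quadratic in the text length.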
Further, PPM* in the related art uses a relatively simple context-selection method, and its performance in predicting the appearance probability of each character of an input string is not sufficient.
An object of the present invention is to provide a string-processing device in which the size of a string indexing structure to be stored can be reduced even when the scale of context increases.
Another object of the present invention is to provide a string-processing device having high performance in predicting an appearance probability of each character of an input string.
In order to achieve the above-mentioned objects, a device for processing strings according to the present invention comprises:
a corpus-DB portion in which a corpus is stored;
an index portion in which a series of position numbers built for the corpus is stored;
a searching portion which searches for positions of occurrences of a given string in the corpus using the series of position numbers; and
a predicting portion which, using the result of search performed by the searching portion, predicts a probability of occurrence of a character following the given string.
In this arrangement, when the probability of occurrence of a character following the given string is predicted by an algorithm such as PPM*, a series of position numbers (such as a suffix array) built for the corpus is used instead of a trie or the like. Thereby, in comparison to the related art in which a trie or the like is used, it is possible to search for the positions of occurrences of the given string at high speed through binary search. As a result, it is possible to improve the performance of predicting the probability of occurrence of a subsequent character. Furthermore, it is possible to reduce the amount of storage required for the string indexing structure.
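The arrangement can be sketched as follows, purely for illustration. The function names are hypothetical, the suffix array is built by plain sorting, and the prediction is a simple relative frequency of the characters that follow each occurrence. The escape and context-selection mechanisms of PPM* are not modeled here.

```python
from collections import Counter


def build_suffix_array(corpus: str) -> list[int]:
    """Index portion: position numbers sorted by the suffix starting there."""
    return sorted(range(len(corpus)), key=lambda i: corpus[i:])


def find_occurrences(corpus: str, sa: list[int], pattern: str) -> list[int]:
    """Searching portion: binary search for all positions of `pattern`."""
    m = len(pattern)

    def prefix(k: int) -> str:
        return corpus[sa[k]:sa[k] + m]

    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose prefix is >= pattern
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:  # first suffix whose prefix is > pattern
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


def predict_next(corpus: str, sa: list[int], context: str) -> dict[str, float]:
    """Predicting portion: relative frequency of each following character."""
    n = len(context)
    counts = Counter(
        corpus[p + n]
        for p in find_occurrences(corpus, sa, context)
        if p + n < len(corpus)  # skip an occurrence at the very end
    )
    total = sum(counts.values())
    return {ch: c / total for ch, c in counts.items()} if total else {}


if __name__ == "__main__":
    corpus = "abracadabra"  # corpus-DB portion (toy example)
    sa = build_suffix_array(corpus)
    print(find_occurrences(corpus, sa, "abra"))
    print(predict_next(corpus, sa, "a"))
```

In this sketch the index holds one integer per corpus position, and each search takes a number of string comparisons that is logarithmic in the corpus length.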
Other objects and further features of the present invention will become more apparent from the following detailed description when read in conjunction with the accompanying drawings.