1. Field of the Invention
The present invention relates to an information processing apparatus, an information processing method, and a computer program product for reduction in the number of states in a deterministic finite state automaton.
2. Description of the Related Art
A finite state automaton (FSA), which models behavior as a finite number of states, transitions, and actions, is also called a finite automaton or a finite state machine (FSM) and is used in various fields. A finite state automaton in which the transition destination state is uniquely determined when an input symbol is provided is called a deterministic finite state automaton (DFA). DFAs include those in which every state has outgoing transitions for all input symbols, and those in which states have outgoing transitions for only some input symbols.
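As an illustration of the definition above, the following is a minimal sketch of a DFA whose states have outgoing transitions for only some input symbols. The class name `DFA`, its fields, and the example automaton are assumptions made for illustration and do not appear in the cited references.

```python
from typing import Dict, Iterable, Optional, Set, Tuple


class DFA:
    """Illustrative DFA: each (state, symbol) pair maps to at most one state."""

    def __init__(self, start: int, finals: Iterable[int],
                 transitions: Dict[Tuple[int, str], int]) -> None:
        self.start = start
        self.finals: Set[int] = set(finals)
        # A missing (state, symbol) key means "no outgoing transition".
        self.transitions = dict(transitions)

    def step(self, state: int, symbol: str) -> Optional[int]:
        """Return the unique destination state, or None if there is none."""
        return self.transitions.get((state, symbol))

    def accepts(self, symbols: Iterable[str]) -> bool:
        """Return True if the input leads from the start state to a final state."""
        state: Optional[int] = self.start
        for symbol in symbols:
            state = self.step(state, symbol)
            if state is None:  # no transition for this symbol: reject
                return False
        return state in self.finals


# Hypothetical example: accepts exactly the strings "ab" and "ac".
dfa = DFA(start=0, finals={2},
          transitions={(0, "a"): 1, (1, "b"): 2, (1, "c"): 2})
```

In this sketch, state 0 has no transition for the symbol "b", so any input beginning with "b" is rejected as soon as that symbol is read.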
One example of the use of a DFA is in the field of speech recognition, where a dictionary that records the phoneme sequences of recognizable words is represented by a DFA. In this example, processing efficiency can be enhanced by reducing the number of states in the DFA. Another application is text search. When a text is to be searched for plural keywords, a DFA is created based on the keywords, and when a final state is reached, a notification is issued that one of the keywords has been found. In such processing as well, the processing efficiency can be enhanced by reducing the number of states in the DFA.
It is known that, for any given DFA, there exists an equivalent DFA with a minimal number of states. Various methods for obtaining the DFA with a minimal number of states have been proposed. Known conventional methods include a method of Hopcroft and Ullman (see Introduction to Automata Theory, Languages, and Computation, Second Edition, Chap. 4, Sec. 4 “Equivalence and Minimization of Automata”, John E. Hopcroft, Rajeev Motwani, Jeffrey D. Ullman, 2000), a method of Aho et al. (see Alfred V. Aho, Ravi Sethi, Jeffrey D. Ullman, Compilers: Principles, Techniques, and Tools, 1985, pp. 142-143), and a method of Hopcroft (see J. E. Hopcroft, An n log n algorithm for minimizing the states in a finite automaton, Theory of Machines and Computations, Academic Press, New York, 1971, pp. 189-196).
However, in the above methods, when the processing for minimizing the number of states is interrupted, an intermediate result cannot be obtained. That is, the number of states in a large-scale DFA, for which the minimization process requires much time, cannot be reduced incrementally. As methods for solving this problem, the methods disclosed in “An incremental DFA minimization algorithm, by Bruce W. Watson, Workshop on Finite State Method in Natural Language Processing (FSMNLP '01), 2001”, and “An efficient incremental DFA minimization algorithm, by Bruce W. Watson and Jan Daciuk, Natural Language Engineering, 9(1), 2003, pp. 49 to 64” (hereinafter, “methods of Watson et al.”) are conventionally known. With these methods, when the processing for reducing the number of states is interrupted, the partially reduced DFA as of the time of the interruption can be obtained, and when the processing is completed, a DFA with a minimized number of states can be obtained.
The methods of Watson et al. allow the processing to be interrupted because the distinguishability between two states is checked incrementally, one pair at a time, for all pairs of states. Here, the set of all strings of input symbols that lead from a certain state p to a final state is defined as a language L(p). Two states p and q are distinguishable when L(p)≠L(q). Conversely, the two states p and q are indistinguishable when L(p)=L(q). When two states are found to be indistinguishable, one of them can be regarded as identical to the other and can be deleted at that time. Therefore, even when the processing is interrupted, a DFA can be obtained in which the number of states has already been reduced by merging the states determined to be indistinguishable in the processing performed up to that point.
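The check of whether L(p)=L(q) and the subsequent deletion of one state can be sketched as follows. This is a plain pairwise exploration written for illustration under stated assumptions; it is not the algorithm of Watson et al. The function names `indistinguishable` and `merge`, the dictionary representation of transitions, and the example automaton are all hypothetical.

```python
from collections import deque


def indistinguishable(delta, finals, p, q):
    """Return True if L(p) == L(q).

    delta maps (state, symbol) -> state; a missing key means no transition.
    None is used below as an implicit dead state whose language is empty.
    """
    symbols = {symbol for (_, symbol) in delta}
    seen = {(p, q)}
    queue = deque([(p, q)])
    while queue:
        x, y = queue.popleft()
        # One state is final and the other is not: some string separates them.
        if (x in finals) != (y in finals):
            return False
        for symbol in symbols:
            nx = delta.get((x, symbol)) if x is not None else None
            ny = delta.get((y, symbol)) if y is not None else None
            if nx == ny:  # same destination (or both dead): nothing to check
                continue
            if (nx, ny) not in seen:
                seen.add((nx, ny))
                queue.append((nx, ny))
    return True


def merge(delta, finals, keep, drop):
    """Delete `drop`, redirecting its incoming transitions to `keep`.

    Assumes `drop` is not the start state and that the two states were
    already found indistinguishable.
    """
    new_delta = {}
    for (state, symbol), target in delta.items():
        if state == drop:
            continue  # outgoing transitions of the deleted state are discarded
        new_delta[(state, symbol)] = keep if target == drop else target
    return new_delta, set(finals) - {drop}


# Hypothetical DFA: states 1 and 2 both accept exactly the string "c".
delta = {(0, "a"): 1, (0, "b"): 2, (1, "c"): 3, (2, "c"): 3}
finals = {3}
if indistinguishable(delta, finals, 1, 2):
    delta, finals = merge(delta, finals, keep=1, drop=2)
```

Because each successful merge leaves a valid DFA that accepts the same strings, stopping after any merge yields a usable automaton with fewer states, which is the property that makes interruption possible.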
However, among the methods of Watson et al., the method described in “An incremental DFA minimization algorithm, Workshop on Finite State Method in Natural Language Processing (FSMNLP '01), 2001” checks the distinguishability between two states for all pairs of states and merges indistinguishable states into one. Therefore, the amount of processing increases rapidly as the number of states increases. In the method described in “An efficient incremental DFA minimization algorithm, Natural Language Engineering, 9(1), 2003, pp. 49 to 64”, 2-tuples of states determined to be distinguishable need to be stored to shorten the processing time. Therefore, in the worst case a memory area proportional to the square of the number of states is needed, and accordingly a large memory area is required when the number of states is large.
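The quadratic growth mentioned above can be made concrete with a short calculation. The state count n = 10^6 and the storage cost of one bit per pair are assumptions chosen only for illustration; neither figure comes from the cited references.

```latex
% Number of unordered pairs of distinct states among n states
\binom{n}{2} = \frac{n(n-1)}{2}

% Illustrative case, assuming n = 10^6 states
\frac{10^{6}\,(10^{6}-1)}{2} = 499{,}999{,}500{,}000 \approx 5 \times 10^{11} \text{ pairs}

% Assuming one bit of storage per pair
\frac{5 \times 10^{11} \text{ bits}}{8 \text{ bits/byte}} \approx 6.25 \times 10^{10} \text{ bytes} \approx 62.5 \text{ GB}
```

Under these assumptions, doubling the number of states roughly quadruples both the number of pairs to be checked and the worst-case memory needed to record the pairs.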