A regular expression, commonly abbreviated as regex, is a formal language notation that describes a set of strings according to specific rules, and it is widely used to compare strings, or to search for and find strings, by using processing devices such as computers. According to formal language theory, the field of computer science that deals with regular expressions, a regular expression is built from ε, which represents an empty string, and from single characters (e.g., a, b, or c), and various patterns of strings may be represented by combining characters with operators such as concatenation (e.g., abc, bbbb, or baba), selection (e.g., ab|c or ab|ba), and repetition (e.g., c*). Further, because regular expressions written with only these operators may become too long or complicated, new regular expressions that add various grammar extensions have been introduced for convenience. For example, the new regular expressions include Perl compatible regular expressions (PCRE), which are implemented according to the method used in Perl, a computer programming language, and POSIX regular expressions, which are defined in the standards for UNIX-like computer operating system environments.
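The three basic operators and one extended form can be illustrated with a short sketch. It assumes Python's standard `re` module, whose syntax is similar to, but not identical with, PCRE and POSIX; the helper name `matches` and the list `EXAMPLES` are hypothetical names introduced only for this illustration.

```python
import re


def matches(pattern: str, text: str) -> bool:
    """Return True if the whole of `text` belongs to the language of `pattern`."""
    return re.fullmatch(pattern, text) is not None


# (pattern, text, expected result)
EXAMPLES = [
    ("abc", "abc", True),       # concatenation: a, then b, then c
    ("ab|c", "ab", True),       # selection: either "ab" ...
    ("ab|c", "c", True),        # ... or "c"
    ("ab|c", "abc", False),     # but not both in sequence
    ("c*", "", True),           # repetition: zero copies gives the empty string
    ("c*", "ccc", True),        # repetition: three copies
    ("[ab]{3}", "bab", True),   # extended syntax, shorthand for (a|b)(a|b)(a|b)
]

for pattern, text, expected in EXAMPLES:
    assert matches(pattern, text) == expected
```

The last entry shows why extended grammars are convenient: the character class and the counted repetition say in seven characters what would otherwise require three parenthesized selections.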
Even when strings are searched by using regular expressions, search time and memory usage depend largely on the search method employed, and studies on how to search strings effectively by using regular expressions are therefore actively conducted.
First of all, as one of the conventional technologies for string search using regular expressions, a method for searching a string by converting a regular expression to a nondeterministic finite automaton (NFA) has been introduced.
FIG. 1 is a drawing exemplarily representing a general NFA converted under the conventional technology. In FIG. 1, each numbered circle represents a state of the NFA, and each arrow together with the character written alongside it indicates a transition from one state to another along the arrow when that character is inputted. Further, the numbered circle drawn as a concentric circle represents a final state. If the final state is reached as a result of one or more transitions starting from the initial state, it means that a desired string has been found. Referring to FIG. 1, the desired string is either a sequence of three characters each of which is ‘a’ or ‘b’, or the three-character sequence ‘bad’; it may be expressed as “[ab]{3}|bad” in a regular expression and may be illustrated in the form of an NFA as shown in FIG. 1.
The course of searching for a desired string in an input string “gjekf3jmbab0d1f” by using the NFA illustrated in FIG. 1 will now be examined.
First, the characters in the input string are read one by one and, for each character, whether a transition from one state to another is possible is tested, starting from state 0 as the initial state. Because “g” is inputted as the first character of the input string and there is no transition from state 0 to another state corresponding to “g”, the NFA remains in state 0. Since there is no transition corresponding to the second character “j” either, the next character is read while the NFA remains in state 0. Accordingly, in such a case, the number of states for which a state transition must be tested, i.e., the number of active states, is one, since the only active state is state 0.
If the front part of the input string, i.e., “gjekf3jm”, is inputted while the NFA remains in state 0 in this way, the next character is “b”. If “b” is inputted in the NFA in FIG. 1, there are transitions from state 0 to states 1 and 2. Therefore, state 0 moves to states 1 and 2, respectively. Accordingly, just after “b” is inputted, the states for which the state transition must be tested, i.e., the active states, are states 0, 1, and 2, for a total of three states.
The next inputted character is “a”. If “a” is inputted, there exist transitions in the NFA in FIG. 1 through which the active states 0, 1, and 2 move to states 1, 3, and 4, respectively, so that states 0, 1, and 2 move to states 1, 3, and 4, respectively. Accordingly, just after the characters “ba” are inputted, the states for which the state transition must be tested, i.e., the active states, are states 0, 1, 3, and 4, for a total of four states.
The next inputted character is “b”. If the character “b” is inputted, there exist transitions in the NFA in FIG. 1 through which the active state 0 moves to states 1 and 2, the active state 1 moves to state 3, and the active state 3 moves to state 5, so that state 0 moves to states 1 and 2, state 1 moves to state 3, and state 3 moves to state 5. Accordingly, just after “bab” is inputted, the states for which the state transition must be tested, i.e., the active states, are states 0, 1, 2, 3, and 5, for a total of five states. Because the final state of the NFA in FIG. 1 has thereby been reached, the desired string, i.e., “bab”, is deemed to have been found in the given input string.
As explained above, in the NFA according to the conventional technology illustrated in FIG. 1, there may exist a transition through which one state moves to two or more states in response to a single inputted character. Accordingly, if a desired string is searched for by inputting the characters of an input string one by one into the NFA according to the conventional technology, the number of states to be tested, i.e., the number of active states, increases. Whenever a character of the input string is inputted, all the active states must be tested. Therefore, there occurs a problem that the search speed declines in proportion to the increase in the number of active states.
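The walk-through above can be reproduced with a minimal simulation sketch. Because FIG. 1 is not reproduced here, the transition table below is an assumption reconstructed from the description: the transitions on “a” and “b” follow the text, while the target of the “d” transition from state 4 is assumed to be the single final state 5. The names `TRANSITIONS`, `INITIAL`, `FINAL`, and `trace_active_states` are hypothetical, and state 0 is assumed to stay active after every character so that a match may begin at any position.

```python
# Assumed reconstruction of the NFA of FIG. 1 for "[ab]{3}|bad".
TRANSITIONS = {
    (0, "a"): {1},
    (0, "b"): {1, 2},   # one state, one character, two target states
    (1, "a"): {3},
    (1, "b"): {3},
    (2, "a"): {4},
    (3, "a"): {5},
    (3, "b"): {5},
    (4, "d"): {5},      # assumption: the "bad" branch ends in the same final state
}
INITIAL, FINAL = 0, 5


def trace_active_states(text):
    """Feed `text` to the NFA and record the active states after each character.

    Stops as soon as the final state becomes active.
    """
    active = {INITIAL}
    trace = []
    for ch in text:
        following = {INITIAL}              # state 0 is assumed to remain active
        for state in active:               # every active state must be tested
            following |= TRANSITIONS.get((state, ch), set())
        active = following
        trace.append((ch, sorted(active)))
        if FINAL in active:
            break
    return trace


for ch, states in trace_active_states("gjekf3jmbab0d1f"):
    print(ch, states)
```

Under these assumptions the active set stays at one state for “gjekf3jm” and then grows to three, four, and five states for “b”, “a”, and “b”, matching the counts given in the description. The inner loop runs once per active state, which is where the per-character cost comes from.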
As another example of the conventional technologies that search for a desired string by using regular expressions, a method for searching for the desired string by converting the NFA to a deterministic finite automaton (DFA) has been introduced as well.
When the desired string is searched for by using the DFA, there is an advantage that the search speed may be improved and the course of processing an input string may be simplified because the number of active states is always kept at one, but there exists a problem that considerable memory resources are required in the course of converting the NFA to the DFA. In addition, if there are multiple desired strings to search for, i.e., multiple regular expressions, or if their patterns are complicated, the memory usage for processing them increases drastically, and there is even a limit in that it is impracticable to process multiple or long-patterned regular expressions by using the DFA. In particular, as the demand for computer security has recently been increasing, it is common to search strings by integrating various patterns of strings to search for, but in such a case, the corresponding regular expressions may be so complicated that it is impossible to perform the string search by using the DFA within normal computer memory resources.
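The memory growth described above can be observed with a small sketch of the standard subset construction. The pattern family used, (a|b)*a(a|b){n-1} (“the n-th character from the end is a”), is a well-known textbook example chosen here only for illustration and does not come from FIG. 1; the function names `nth_from_end_nfa` and `count_dfa_states` are hypothetical.

```python
from collections import deque


def nth_from_end_nfa(n):
    """Build an NFA for (a|b)*a(a|b){n-1} with states 0..n (n is final).

    State 0 loops on both characters; on "a" it also starts a chain
    1 -> 2 -> ... -> n that advances on either character.
    """
    delta = {(0, "a"): {0, 1}, (0, "b"): {0}}
    for state in range(1, n):
        delta[(state, "a")] = {state + 1}
        delta[(state, "b")] = {state + 1}
    return delta


def count_dfa_states(delta, alphabet="ab", start=0):
    """Subset construction: every reachable set of NFA states is one DFA state."""
    start_set = frozenset({start})
    seen = {start_set}
    queue = deque([start_set])
    while queue:
        current = queue.popleft()
        for ch in alphabet:
            target = frozenset(
                t for s in current for t in delta.get((s, ch), ())
            )
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return len(seen)


for n in range(1, 6):
    print(n + 1, "NFA states ->", count_dfa_states(nth_from_end_nfa(n)), "DFA states")
```

For this family the NFA has n + 1 states while the DFA state count doubles with each increment of n, which suggests why combining many or long patterns into one DFA can exceed ordinary memory resources.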
To solve the problem that may occur with the method for searching strings by using the general DFA, a method for reducing memory usage by compressing the DFA has been introduced. In other words, it is a method for integrating multiple states whose transitions are similar, or for gathering multiple transitions commonly used at several states and integrating them into one transition. However, according to this method, because states or transitions which are not completely identical to one another may be integrated, there may occur a case in which input strings cannot be processed normally, or a transition failure. Because additional memory references are required when such a transition failure occurs, the number of memory references increases compared with that before compression, even though the memory usage is less than that before compression. As a result, it is impossible to prevent the search speed from declining.
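The trade-off between memory usage and memory references can be sketched as follows. This is only one simplified way of sharing common transitions, assuming a per-state default pointer that is followed when a state does not store a transition for the inputted character; it is not asserted to be the specific compression method referred to above. The three-state DFA and the names `FULL`, `LABELED`, `DEFAULT`, and `next_state` are hypothetical.

```python
# Uncompressed DFA over the alphabet "abc": 3 states x 3 characters = 9 entries.
FULL = {
    0: {"a": 1, "b": 0, "c": 0},
    1: {"a": 1, "b": 2, "c": 0},
    2: {"a": 1, "b": 0, "c": 0},
}

# Compressed form: states 1 and 2 store only the transitions that differ from
# state 0 and otherwise defer to it, leaving 3 + 1 + 0 = 4 stored entries.
LABELED = {
    0: {"a": 1, "b": 0, "c": 0},
    1: {"b": 2},
    2: {},
}
DEFAULT = {1: 0, 2: 0}   # state 0 stores every transition and needs no default


def next_state(state, ch):
    """Return (target state, number of table lookups) for one input character."""
    references = 1
    while ch not in LABELED[state]:
        state = DEFAULT[state]    # transition failure: consult the default state
        references += 1
    return LABELED[state][ch], references


stored = sum(len(row) for row in LABELED.values())
print("stored transitions:", stored, "instead of", 9)
print(next_state(1, "b"))   # stored directly in state 1
print(next_state(1, "a"))   # missing in state 1, found via the default state
```

In this sketch the compressed table yields the same target states as the full table, but any character that a state does not store costs an extra lookup, which corresponds to the increase in memory references noted above.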
Accordingly, there is a need for a technology that allows strings to be searched rapidly by minimizing the number of active states and using less memory while the NFA is used.