A regular expression (RE) describes a pattern for matching character strings. It is used to check whether a character string contains a certain type of substring, to replace a matched substring, or to extract a substring that meets the matching condition from a character string. An RE is a text pattern formed from ordinary characters (e.g., the characters a to z) and special characters (e.g., the metacharacters * and /). The RE serves as a template against which a searched character string is matched, and it is the form of expression commonly used in string pattern matching. Generally, an RE merely expresses a pattern and must be converted into a finite automaton (FA) before a computer can use it for efficient pattern matching. An FA includes a number of states, and each state transfers to another state after receiving one or more characters. Each FA has start states and accept states. During matching, the FA begins at a start state and takes the characters of the target character string, in sequence, as the input of the current state. This process repeats until an accept state is reached or the whole target character string has been processed. If the final state is an accept state, the matching is successful; otherwise, the matching fails.
Taking the RE “(time\x20|now\t)\d{3,5}s” as an example, “\x20” is the hexadecimal expression of the ASCII value of a space, “\d” represents any numeral from 0 to 9, “\t” represents a tab, “|” represents a logical “or”, “time\x20” represents a continuous character string, and “{3,5}” indicates that the preceding numeral is repeated at least three and at most five times. The RE matches a character string consisting of the keyword “time” followed by a space, or the keyword “now” followed by a tab, then three to five numerals, and finally the time unit of seconds (s). FIG. 1 is a schematic structural view of the state transfers of the FA obtained by converting the RE. As shown in FIG. 1, each circle represents a state, and the numeral in the circle is the index of the state. Starting from the start state “0”, one character is processed at a time, and each time an effective character (a character tagged on the arrow between two states) is input, the next state is entered. During matching, when an input character is not acceptable to the current state, the FA returns to the start state “0”. In state “11” and state “12”, if the input character is “\d”, the FA enters the corresponding next state, “12” or “13”. If the input character is “s”, the FA enters state “14”, which is an accept state and is drawn as two concentric circles. If the accept state is reached, the matching is successful; if the accept state has not been reached after the whole target character string is processed, the matching fails. The two-dimensional storage structure corresponding to the FA is shown in Table 1.
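The behavior of the example RE can be checked with an ordinary RE engine. The following is a minimal sketch using Python's standard `re` module; the function name `contains_match` is an assumption introduced only for illustration, and the `re.ASCII` flag is used so that “\d” is limited to the numerals 0 to 9, as described above.

```python
import re

# The example RE from the text. re.ASCII restricts \d to the numerals 0-9.
PATTERN = re.compile(r"(time\x20|now\t)\d{3,5}s", re.ASCII)

def contains_match(text: str) -> bool:
    """Return True if some substring of `text` matches the example RE."""
    return PATTERN.search(text) is not None

print(contains_match("time 123s"))     # True: "time", space, three numerals, "s"
print(contains_match("now\t12345s"))   # True: "now", tab, five numerals, "s"
print(contains_match("time 12s"))      # False: only two numerals
print(contains_match("time 123456s"))  # False: six numerals before "s"
```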
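TABLE 1

Index | Accept state | Others | t | i | m | e | \x20 | n | o | w | \t | 0-9 | s
------+--------------+--------+---+---+---+---+------+---+---+---+----+-----+----
0     | 0            | 0      | 1 |   |   |   |      | 5 |   |   |    |     |
1     | 0            | 0      |   | 2 |   |   |      |   |   |   |    |     |
2     | 0            | 0      |   |   | 3 |   |      |   |   |   |    |     |
3     | 0            | 0      |   |   |   | 4 |      |   |   |   |    |     |
4     | 0            | 0      |   |   |   |   | 8    |   |   |   |    |     |
5     | 0            | 0      |   |   |   |   |      |   | 6 |   |    |     |
6     | 0            | 0      |   |   |   |   |      |   |   | 7 |    |     |
7     | 0            | 0      |   |   |   |   |      |   |   |   | 8  |     |
8     | 0            | 0      |   |   |   |   |      |   |   |   |    | 9   |
9     | 0            | 0      |   |   |   |   |      |   |   |   |    | 10  |
10    | 0            | 0      |   |   |   |   |      |   |   |   |    | 11  |
11    | 0            | 0      |   |   |   |   |      |   |   |   |    | 12  | 14
12    | 0            | 0      |   |   |   |   |      |   |   |   |    | 13  | 14
13    | 0            | 0      |   |   |   |   |      |   |   |   |    |     | 14
14    | 1            | 0      | 0 | 0 | 0 | 0 | 0    | 0 | 0 | 0 | 0  | 0   | 0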
Generally, each column of an FA state table represents one ASCII character, so Table 1 should have 256 columns. For convenience of description and display, however, the characters that do not appear in the RE “(time\x20|now\t)\d{3,5}s” are represented collectively by a single “Others” column, in which every cell is “0”. Each state in FIG. 1 corresponds to one row of Table 1, and the blank cells in the table are “0”.
When a target character string is matched according to Table 1, the current state is initialized to the start state “0”, and the characters of the target character string are read in sequence. For each character, the current state number is used as the row index and the character as the column index to look up the target state value, which is then assigned to the current state. It is then determined whether the “Accept state” column of the row for the current state is 1; if so, the matching is successful and ends. If no accept state has been reached by the time the entire target character string has been input, the matching fails.
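The table lookup described above can be sketched as follows. This is only an illustrative sketch, not a definitive implementation: the names `build_table` and `fa_match` are assumptions, the table is expanded to the full 256 columns, and the transitions are filled in from the description of FIG. 1 (literal characters through States 0 to 7, numerals through States 8 to 13, and “s” leading to State 14 from States 11, 12 and 13). Characters whose code is 256 or above are treated like the “Others” column.

```python
NUM_COLUMNS = 256  # one column per 8-bit character code
NUM_STATES = 15    # States 0 to 14 of FIG. 1

def build_table():
    """Return (table, accept) for the example FA; every unset cell is 0."""
    table = [[0] * NUM_COLUMNS for _ in range(NUM_STATES)]
    accept = [0] * NUM_STATES
    # Literal branches: "time\x20" and "now\t" both end in State 8.
    literal_moves = [
        (0, "t", 1), (1, "i", 2), (2, "m", 3), (3, "e", 4), (4, " ", 8),
        (0, "n", 5), (5, "o", 6), (6, "w", 7), (7, "\t", 8),
    ]
    for state, ch, nxt in literal_moves:
        table[state][ord(ch)] = nxt
    # Numeral chain for \d{3,5}: States 8 to 12 advance on any numeral 0-9.
    for state in range(8, 13):
        for digit in "0123456789":
            table[state][ord(digit)] = state + 1
    # After three, four or five numerals, "s" leads to the accept state.
    for state in (11, 12, 13):
        table[state][ord("s")] = 14
    accept[14] = 1
    return table, accept

def fa_match(text, table, accept):
    """Row index = current state, column index = character code."""
    state = 0  # start state
    for ch in text:
        code = ord(ch)
        state = table[state][code] if code < NUM_COLUMNS else 0
        if accept[state] == 1:
            return True  # accept state reached: matching succeeds and ends
    return False  # input exhausted without reaching the accept state

table, accept = build_table()
print(fa_match("time 123s", table, accept))  # True
print(fa_match("time 12s", table, accept))   # False
```

Because the accept check is made after every character, the loop ends as soon as State 14 is entered, without reading the rest of the target character string.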
In the RE “(time\x20|now\t)\d{3,5}s”, the string “time\x20” corresponds to State 0 to State 4, and these five states are similar to one another: when the effective character is input, the state transfers to the next state; otherwise, it transfers to State 0. Similarly, “\d{3,5}” corresponds to State 8 to State 12, and these five states are also similar to one another: when a numeral is input, the state transfers to the next state; when “s” is input in a state that satisfies the required number of numerals, the state transfers to State 14; otherwise, it transfers to State 0. Thus, in the two-dimensional storage structure shown in Table 1, when the states for consecutive characters of the RE have similar transfers, the storage structure contains a great deal of redundancy, and as the number of states represented by the RE increases, the redundancy and the occupied memory space increase in proportion. Moreover, in the two-dimensional storage structure shown in Table 1, when the number of states is less than 256, the state index in each cell can be represented by one byte, so each row requires 256 bytes of storage. When the number of states is in the range of 256 to 65536, the state index in each cell must be represented by two bytes, so each row requires 512 bytes. Therefore, the more states there are, the more bytes each cell occupies, and accordingly the more memory space is occupied.
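The storage figures above follow from simple arithmetic, sketched below. The function names are assumptions introduced for illustration, the thresholds are those stated in the text, and the “Accept state” column is left out of the count, as it is in the per-row figures above. State counts above 65536 are not covered by the text, so the sketch rejects them rather than guessing a cell width.

```python
NUM_COLUMNS = 256  # one column per ASCII character code

def bytes_per_cell(num_states: int) -> int:
    """Width of one state index, using the thresholds given in the text."""
    if num_states < 256:
        return 1
    if num_states <= 65536:
        return 2
    raise ValueError("more than 65536 states is not covered by the text")

def row_bytes(num_states: int) -> int:
    """Storage needed by one row of the two-dimensional table."""
    return NUM_COLUMNS * bytes_per_cell(num_states)

def table_bytes(num_states: int) -> int:
    """Storage needed by the whole table (one row per state)."""
    return num_states * row_bytes(num_states)

print(row_bytes(15), table_bytes(15))    # 256 3840
print(row_bytes(300), table_bytes(300))  # 512 153600
```

For the 15-state FA of FIG. 1, the table therefore occupies 3840 bytes, although only a few dozen cells hold a non-zero state index.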