A common problem with using a deterministic finite state machine (FSM) for regular-expression matching is that certain combinations of regular expressions can require very large numbers of states and state transitions when mapped onto the same FSM, an effect often called a “state explosion”, resulting in very large storage requirements. This applies in particular to regular expressions that contain many “overlaps.” “Overlaps” are portions of patterns for which matching strings in the input stream are likely to also match other patterns or portions of patterns.
There are many types of “overlaps.” One type is caused by the use of a metacharacter such as “.”, which matches any character, followed by a quantifier. Examples of quantifiers include “*” (zero or more repetitions) and “+” (one or more repetitions). A quantifier allows the element that precedes it to match multiple consecutive characters. Another type of “overlap” is caused by the use of character classes containing a large number of characters, also followed by a quantifier.
Two sample descriptions of these two types of regular expressions are:
<regex subexpression 1a>.*<regex subexpression 1b>
and
<regex subexpression 2a>[0-9a-zA-Z]+<regex subexpression 2b>
Each regex subexpression can be any regular expression or string. Two examples of the above expressions are:
regex1=abcd.*efgh
regex2=pqrs[^\n]*tuvw
The first expression, regex1, specifies that the input should contain the string “abcd” followed by the string “efgh”, with any number of characters in between. The second expression, regex2, specifies that the input should contain the string “pqrs” followed by the string “tuvw”, with any number of characters in between, except that none of these characters is allowed to be a newline character. This restriction is expressed by the negated character class “[^\n]”, in which “^” means “not” and “\n” denotes a newline.
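The behavior of the two example expressions can be checked with a short Python sketch. This is only an illustration using Python's standard re module, not the FSM-based matcher discussed here. The helper name matches and the sample inputs are assumptions made for the example. Because “.” is described above as matching any character, the sketch compiles regex1 with re.DOTALL; without that flag, Python's “.” would not match a newline.

```python
import re

# regex1: "abcd", then any characters (including newlines), then "efgh".
# re.DOTALL is an assumption made so that "." really matches any character.
regex1 = re.compile(r"abcd.*efgh", re.DOTALL)

# regex2: "pqrs", then any characters except a newline, then "tuvw".
regex2 = re.compile(r"pqrs[^\n]*tuvw")


def matches(text):
    """Return (regex1 found, regex2 found) for an unanchored search in text."""
    return (regex1.search(text) is not None,
            regex2.search(text) is not None)


# Hypothetical sample inputs chosen for illustration.
samples = [
    "xxabcd123efghyy",      # only regex1 matches
    "pqrs 42 tuvw",         # only regex2 matches
    "pqrs\ntuvw",           # newline between the parts: regex2 fails
    "pqrs abcd tuvw efgh",  # the two patterns interleave: both match
]

for text in samples:
    print(repr(text), matches(text))
```

The last sample shows the kind of overlap described above: while the characters between “pqrs” and “tuvw” are being consumed, the same input is also advancing the match of regex1.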
If these regular expressions were mapped directly onto a single state diagram, the state vector would have to represent whether, at any given moment during the match operation, the first subexpressions of the regular expressions, namely “abcd” and “pqrs”, have already been found in the input stream. Because of the “.*” and “[^\n]*”, several combinations are possible regarding which of these subpatterns have been found: neither of the first subpatterns has been found, only subpattern “abcd” has been found, only subpattern “pqrs” has been found, or both subpatterns have been found. All of these combinations have to be encoded into the state vector, since the expressions are scanned in parallel and processed at the same time. With many such regular expressions, the number of different combinations that have to be encoded in the state vector can increase very rapidly, in the worst case doubling with each additional expression of this kind, resulting in a possible “state explosion” and a corresponding increase in storage requirements.
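The growth in the number of combinations can be made concrete with a small sketch that counts the reachable states of a combined machine. It rests on several assumptions: each expression is simplified to the form head.*tail with non-empty literal strings, the newline restriction of regex2 is not modeled, each expression is tracked by a Knuth-Morris-Pratt style matcher, and the combined state is simply the tuple of the per-expression states. All function names and the third pattern are hypothetical and serve only the illustration.

```python
from collections import deque
from functools import lru_cache


@lru_cache(maxsize=None)
def kmp_table(word):
    """fail[i] = length of the longest proper border of word[:i]."""
    fail = [0] * (len(word) + 1)
    k = 0
    for i in range(1, len(word)):
        while k and word[i] != word[k]:
            k = fail[k]
        if word[i] == word[k]:
            k += 1
        fail[i + 1] = k
    return fail


def kmp_step(word, pos, ch):
    """Advance a matcher for word from position pos on character ch."""
    fail = kmp_table(word)
    if pos == len(word):          # leaving the accepting position
        pos = fail[pos]
    while pos and word[pos] != ch:
        pos = fail[pos]
    return pos + 1 if word[pos] == ch else pos


def step(pattern, state, ch):
    """One transition of the machine for head.*tail.

    state is (phase, pos): phase 0 searches for head, phase 1 for tail.
    """
    head, tail = pattern
    phase, pos = state
    if phase == 0:
        pos = kmp_step(head, pos, ch)
        return (1, 0) if pos == len(head) else (0, pos)
    return (1, kmp_step(tail, pos, ch))


def count_states(patterns):
    """Count reachable states when all patterns are scanned in parallel."""
    # Alphabet: every character used by a pattern, plus one stand-in
    # character ("\0") for all other input characters.
    alphabet = sorted({c for p in patterns for part in p for c in part})
    alphabet.append("\0")
    start = tuple((0, 0) for _ in patterns)
    seen = {start}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for ch in alphabet:
            nxt = tuple(step(p, s, ch) for p, s in zip(patterns, state))
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return len(seen)


p1 = ("abcd", "efgh")   # regex1 from the text
p2 = ("pqrs", "tuvw")   # regex2, with [^\n]* simplified to .*
p3 = ("0123", "4567")   # hypothetical third pattern

for group in ([p1], [p2], [p1, p2], [p1, p2, p3]):
    print(len(group), "pattern(s):", count_states(group))
```

In this simplified model, the combined machine for two patterns has more states than the two separate machines together, because every found or not-found combination of the first subpatterns needs its own set of states. Adding the third pattern increases the count again.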