1. Technical Field
The present disclosure relates to regular expression processing. In particular, the present disclosure relates to apparatuses and methods for matching an input string with a regular expression.
2. Discussion of Related Art
Regular expression set matching may be used to locate all occurrences of substrings of a given input string that match a regular expression. Exact set matching, also known as keyword matching or keyword scanning, is widely used in a number of applications, such as virus scanning and intrusion detection. However, keyword-based approaches only allow static keywords to be defined.
Intrusion detection software and virus scanners may use a regular expression (regex) to capture more precise information and to perform deep packet scanning. A regular expression is a string of symbols (for example, characters, letters, and digits) that defines a pattern used in a search for a matching input string. The symbols used by the regular expression and by the input string are drawn from a set of symbols known as the alphabet of the regex.
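The difference between a static keyword and a regular expression can be illustrated with a minimal sketch. The sample string, the keyword, the pattern, and the helper names `find_keyword` and `find_pattern` are hypothetical and chosen only for illustration; they are not part of the disclosure.

```python
import re


def find_keyword(text, keyword):
    """Return the start offset of every occurrence of a fixed keyword."""
    hits = []
    pos = text.find(keyword)
    while pos != -1:
        hits.append(pos)
        # Resume one symbol later so overlapping occurrences are also found.
        pos = text.find(keyword, pos + 1)
    return hits


def find_pattern(text, pattern):
    """Return (start offset, matched substring) for each non-overlapping regex match."""
    return [(m.start(), m.group()) for m in re.finditer(pattern, text)]


# Hypothetical input: three requests, two of which name an .exe file.
sample = "GET /a.exe GET /b.exe GET /c.txt"

# A static keyword locates only the one literal string it was given.
keyword_hits = find_keyword(sample, "/a.exe")

# A regular expression describes the whole family of ".exe" paths.
pattern_hits = find_pattern(sample, r"/[a-z]+\.exe")

print(keyword_hits)   # [4]
print(pattern_hits)   # [(4, '/a.exe'), (15, '/b.exe')]
```

In this sketch the keyword scan would need one keyword per file name, whereas a single pattern covers every name built from the assumed alphabet of lowercase letters.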
Deep packet inspection enables advanced security functions as well as internet data mining, eavesdropping, and censorship. The processing needs of applications that perform deep packet inspection are increasing due to the combined increase in network speeds and network threats, such as viruses, malicious software (malware), and network attacks. Regular expressions can be used to express families of patterns. Matching input data against a set of regular expressions can be a very complex task, and its complexity depends greatly on the features implemented in the regular expressions. Several different formalisms are available, each building on the features of a “simpler syntax” and adding more features.
Several programming languages (for example, Perl) directly provide regular expression support to ease programmer tasks when dealing with text analysis. Extended context free grammars (i.e., context free grammars with regular expressions) may be used in a high level parser generator. Antivirus software may use regular expressions to scan for virus signatures in files and data. Genome researchers need to match deoxyribonucleic acid (DNA) base sequences and patterns in their data. While very basic patterns can be searched for using keywords, more advanced patterns require a construct that is able to express more general patterns.
One approach handles regular expressions by building either a Deterministic Finite Automaton (DFA) or a Non-Deterministic Finite Automaton (NFA) from the expression set and then simulating the execution of the finite automaton (also known as a finite state automaton, or state machine). A DFA is a state machine in which, for each state and corresponding input, there is exactly one transition to a subsequent state. In contrast, an NFA is a state machine in which, for each state and corresponding input, there may be a number of possible subsequent states.
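The distinction between the two kinds of automaton may be sketched as follows. The example language (strings over the assumed alphabet {a, b} that end in "ab"), the transition tables `DFA_TABLE` and `NFA_TABLE`, and the function names `run_dfa` and `run_nfa` are hypothetical and serve only to illustrate the definitions above.

```python
def run_dfa(table, start, accepting, text):
    """Simulate a DFA: exactly one next state per (state, symbol) pair."""
    state = start
    for symbol in text:
        state = table[(state, symbol)]
    return state in accepting


def run_nfa(table, start, accepting, text):
    """Simulate an NFA by tracking the set of all states it could be in."""
    current = {start}
    for symbol in text:
        current = {nxt for s in current for nxt in table.get((s, symbol), ())}
    return bool(current & accepting)


# Hypothetical DFA for "ends in ab": state 1 = just saw 'a', state 2 = just saw "ab".
DFA_TABLE = {
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 1, (1, "b"): 2,
    (2, "a"): 1, (2, "b"): 0,
}

# Hypothetical NFA for the same language: in state 0, reading 'a' may either
# stay in state 0 or guess that the final "ab" has begun and move to state 1.
NFA_TABLE = {
    (0, "a"): {0, 1}, (0, "b"): {0},
    (1, "b"): {2},
}

for word in ["aab", "aba", "ab", ""]:
    print(word, run_dfa(DFA_TABLE, 0, {2}, word), run_nfa(NFA_TABLE, 0, {2}, word))
```

The DFA loop performs a single table lookup per input symbol, while the NFA loop may follow several transitions per symbol, one for each state in the current set.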
A DFA approach must use a separate state for every possible partial match of every possible pattern instance, which can lead to exponential memory requirements. Further, a DFA approach cannot count symbols, which forces a complete expansion of every alternative and again leads to exponential space requirements.
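The growth in DFA states can be observed with a small sketch based on the standard subset construction, which is a textbook technique and not necessarily the construction used by any particular matcher. The example pattern, (a|b)*a(a|b){n-1} (the n-th symbol from the end is 'a'), and the function names `nth_from_end_nfa` and `count_dfa_states` are assumptions made for illustration.

```python
def nth_from_end_nfa(n):
    """Build an (n + 1)-state NFA for (a|b)*a(a|b){n-1} over the alphabet {a, b}."""
    table = {(0, "a"): {0, 1}, (0, "b"): {0}}
    for i in range(1, n):
        # States 1..n-1 simply count the symbols that follow the guessed 'a'.
        table[(i, "a")] = {i + 1}
        table[(i, "b")] = {i + 1}
    return table, 0, {n}


def count_dfa_states(table, start, alphabet="ab"):
    """Count the DFA states reachable when applying the subset construction."""
    start_set = frozenset({start})
    seen = {start_set}
    work = [start_set]
    while work:
        subset = work.pop()
        for symbol in alphabet:
            # Each DFA state is the set of NFA states reachable on this symbol.
            target = frozenset(
                t for s in subset for t in table.get((s, symbol), ())
            )
            if target not in seen:
                seen.add(target)
                work.append(target)
    return len(seen)


for n in (1, 2, 3, 10):
    table, start, _ = nth_from_end_nfa(n)
    print(n, "NFA states:", n + 1, "DFA states:", count_dfa_states(table, start))
```

For this pattern the NFA grows linearly with the count n, while the number of reachable DFA states doubles with each increment of n, because the DFA has to remember which of the last n symbols were 'a'.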
An NFA approach may require more than one state traversal per input character, and is therefore potentially very slow in practice and may require a large amount of history memory. Further, an NFA approach is non-deterministic, which may lead to exponential time to simulate it using backtracking, or exponential space to encode every possible output state after each transition.
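The cost of backtracking can be made concrete with a minimal sketch that counts the transitions tried. The ambiguous automaton `AMBIGUOUS`, the input of ten 'a' symbols, and the function names `backtrack_steps` and `state_set_steps` are hypothetical; the state-set simulation is included only as a point of comparison.

```python
def backtrack_steps(table, state, accepting, text, pos=0):
    """Depth-first NFA simulation; returns (matched, transitions tried)."""
    if pos == len(text):
        return state in accepting, 0
    steps = 0
    for nxt in table.get((state, text[pos]), ()):
        matched, sub_steps = backtrack_steps(table, nxt, accepting, text, pos + 1)
        steps += 1 + sub_steps
        if matched:
            return True, steps
    # Every alternative from this state failed, so the caller must backtrack.
    return False, steps


def state_set_steps(table, start, accepting, text):
    """Breadth-first NFA simulation; returns (matched, transitions followed)."""
    current = {start}
    steps = 0
    for symbol in text:
        following = set()
        for s in current:
            targets = table.get((s, symbol), ())
            steps += len(targets)
            following.update(targets)
        current = following
    return bool(current & accepting), steps


# Hypothetical NFA: on 'a', states 0 and 1 each have two possible successors;
# only a 'b' read in state 0 reaches the accepting state 2.
AMBIGUOUS = {(0, "a"): [0, 1], (1, "a"): [0, 1], (0, "b"): [2]}

text = "a" * 10  # contains no 'b', so every path eventually fails
print(backtrack_steps(AMBIGUOUS, 0, {2}, text))   # (False, 2046)
print(state_set_steps(AMBIGUOUS, 0, {2}, text))   # (False, 38)
```

On a failing input of k symbols, the backtracking simulation of this automaton tries 2^(k+1) - 2 transitions, whereas the state-set simulation follows at most four transitions per symbol but has to keep the whole set of active states in memory.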