This invention relates generally to memory technology and in particular to a regular expression compiler for a new high performance intelligent content search memory.
Many modern applications depend on fast information search and retrieval. With the advent of the world-wide-web and the phenomenal growth in its usage, content search has become a critical capability. A large number of servers get deployed in web search applications due to the performance limitations of the state of the art microprocessors for regular expression driven search.
There have been significant research and development resources devoted to the topic of searching of lexical information or patterns in strings. Regular expressions have been used extensively since the mid 1950s to describe the patterns in strings for content search, lexical analysis, information retrieval systems and the like. Regular expressions were first studied by S. C. Kleene in mid-1950s to describe the events of nervous activity. It is well understood in the industry that regular expression (RE) can also be represented using finite state automata (FSA). Non-deterministic FSA (NFA) and deterministic FSA (DFA) are two types of FSAs that have been used extensively over the history of computing. Rabin and Scott were the first to show the equivalence of DFA and NFA as far as their ability to recognize languages in 1959. In general a significant body of research exists on regular expressions. Theory of regular expressions can be found in “Introduction to Automata Theory, Languages and Computation” by Hopcroft and Ullman and a significant discussion of the topics can also be found in book “Compilers: Principles, Techniques and Tools” by Aho, Sethi and Ullman.
Computers are increasingly networked within enterprises and around the world. These networked computers are changing the paradigm of information management and security. Vast amount of information, including highly confidential, personal and sensitive information is now being generated, accessed and stored over the network. This information needs to be protected from unauthorized access. Further, there is a continuous onslaught of spam, viruses, and other inappropriate content on the users through email, web access, instant messaging, web download and other means, resulting in significant loss of productivity and resources.
Enterprise and service provider networks are rapidly evolving from 10/100 Mbps line rates to 1 Gbps, 10 Gbps and higher line rates. Traditional model of perimeter security to protect information systems pose many issues due to the blurring boundary of an organizaton's perimeter. Today as employees, contractors, remote users, partners and customers require access to enterprise networks from outside, a perimeter security model is inadequate. This usage model poses serious security vulnerabilities to critical information and computing resources for these organizations. Thus the traditional model of perimeter security has to be bolstered with security at the core of the network. Further, the convergence of new sources of threats and high line rate networks is making software based perimeter security to stop the external and internal attacks inadequate. There is a clear need for enabling security processing in hardware inside core or end systems beside a perimeter security as one of the prominent means of security to thwart ever increasing security breaches and attacks.
FBI and other leading research institutions have reported in recent years that over 70% of intrusions in organizations have been internal. Hence a perimeter defense relying on protecting an organization from external attacks is not sufficient as discussed above. Organizations are also required to screen outbound traffic to prevent accidental or malicious disclosure of proprietary and confidential information as well as to prevent its network resources from being used to proliferate spam, viruses, worms and other malware. There is a clear need to inspect the data payloads of the network traffic to protect and secure an organization's network for inbound and outbound security.
Data transported using TCP/IP or other protocols is processed at the source, the destination or intermediate systems in the network or a combination thereof to provide data security or other services like secure sockets layer (SSL) for socket layer security, Transport layer security, encryption/decryption, RDMA, RDMA security, application layer security, virtualization or higher application layer processing, which may further involve application level protocol processing (for example, protocol processing for HTTP, HTTPS, XML, SGML, Secure XML, other XML derivatives, Telnet, FTP, IP Storage, NFS, CIFS, DAFS, and the like). Many of these processing tasks put a significant burden on the host processor that can have a direct impact on the performance of applications and the hardware system. Hence, some of these tasks need to be accelerated using dedicated hardware for example SSL, or TLS acceleration. As the usage of XML increases for web applications, it is creating a significant performance burden on the host processor and can also benefit significantly from hardware acceleration. Detection of spam, viruses and other inappropriate content require deep packet inspection and analysis. Such tasks can put huge processing burden on the host processor and can substantially lower network line rate. Hence, deep packet content search and analysis hardware is also required.
Internet has become an essential tool for doing business at small to large organizations. HTML based static web is being transformed into a dynamic environment over last several years with deployment of XML based services. XML is becoming the lingua-franca of the web and its usage is expected to increase substantially. XML is a descriptive language that offers many advantages by making the documents self-describing for automated processing but is also known to cause huge performance overhead for best of class server processors. Decisions can be made by processing the intelligence embedded in XML documents to enable business to business transactions as well as other information exchange. However, due to the performance overload on the best of class server processors from analyzing XML documents, they cannot be used in systems that require network line rate XML processing to provide intelligent networking. There is a clear need for acceleration solutions for XML document parsing and content inspection at network line rates which are approaching 1 Gbps and 10 Gbps, to realize the benefits of a dynamic web based on XML services.
Regular expressions can be used to represent the content search strings for a variety of applications like those discussed above. A set of regular expressions can then form a rule set for searching for a specific application and can be applied to any document, file, message, packet or stream of data for examination of the same. Regular expressions are used in describing anti-spam rules, anti-virus rules, anti-spyware rules, anti-phishing rules, intrusion detection rules, intrusion prevention rules, extrusion detection rules, extrusion prevention rules, digital rights management rules, legal compliance rules, worm detection rules, instant message inspection rules, VOIP security rules, XML document security and search constructs, genetics, proteomics, XML based protocols like XMPP, web search, database search, bioinformatics, signature recognition, speech recognition, web indexing and the like. These expressions get converted into NFAs or DFAs for evaluation on a general purpose processor. However, significant performance and storage limitations arise for each type of the representation. For example an N character regular expression can take up to the order of 2N memory for the states of a DFA, while the same for an NFA is in the order of N. On the other hand the performance for the DFA evaluation for an M byte input data stream is in the order of M memory accesses and the order of (N*M) processor cycles for the NFA representation on modern microprocessors.
When the number of regular expressions increases, the impact on the performance deteriorates as well. For example, in an application like anti-spam, there may be hundreds of regular expression rules. These regular expressions can be evaluated on the server processors using individual NFAs or DFAs. It may also be possible to create a composite DFA to represent the rules. Assuming that there are X REs for an application, then a DFA based representation of each individual RE would result up to the order of (X*2N) states however the evaluation time would grow up to the order of (X*N) memory cycles. Generally, due to the potential expansion in the number of states for a DFA they would need to be stored in off chip memories. Using a typical access time latency of main memory systems of 60 ns, it would require about (X*60 ns* N*M) time to process an X RE DFA with N states over an M byte data stream. This can result in tens of Mbps performance for modest size of X, N & M. Such performance is obviously significantly below the needs of today's network line rates of 1 Gbps to 10 Gbps and beyond. On the other hand, if a composite DFA is created, it can result in an upper bound of storage in the order of 2N*X which may not be within physical limits of memory size for typical commercial computing systems even for a few hundred REs. Thus the upper bound in memory expansion for DFAs can be a significant issue. Then on the other hand NFAs are non-deterministic in nature and can result in multiple state transitions that can happen simultaneously. NFAs can only be processed on a state of the art microprocessor in a scalar fashion, resulting in multiple executions of the NFA for each of the enabled paths. X REs with N characters on average can be represented in the upper bound of (X*N) states as NFAs. However, each NFA would require M iterations for an M-byte stream, causing an upper bound of (X*N*M* processor cycles per loop). Assuming the number of processing cycles are in the order of 10 cycles, then for a best of class processor at 4 GHz, the processing time can be around (X*N*M*2.5 ns), which for a nominal N of 8 and X in tens can result in below 100 Mbps performance. There is a clear need to create high performance regular expression based content search acceleration which can provide the performance in line with the network rates which are going to 1 Gbps and 10 Gbps.
The methods for converting a regular expression to Thompson's NFA and DFA are well known. The resulting automata are able to distinguish whether a string belongs to the language defined by the regular expression however it is not very efficient to figure out if a specific sub-expression of a regular expression is in a matching string or the extent of the string. Tagged NFAs enable such queries to be conducted efficiently without having to scan the matching string again. For a discussion on Tagged NFA refer to the paper “NFAs with Tagged Transitions, their Conversion to Deterministic Automata and Application to Regular Expressions”, by Ville Laurikari, Helsinki University of Technology, Finland.