History of Regular Expression Side-Effect Development
The current state of regular expression side-effects in the marketplace is as follows: (1) Perl-like regex grammars offer the capture of the N-th parenthetical expression and can return a string produced by globally replacing matches of a regex with a specific string literal; (2) Perl itself additionally offers the ability to embed side-effect statements in the regex, but these are seriously flawed in that they execute as the backtracking engine encounters them in its forward and backward movements through the stream, rather than executing as true side-effects if and only if they belong to a matching sub-expression of the final best match. Compiling regular expression side-effects to Java™ or C# code builds upon two previous patent documents (written by the applicant) on regular expression side-effects. Those documents show how a finite automaton can offer side-effects that are truly accurate and grammatically very powerful for the programmer, correcting the flaws of Perl side-effects. The '231 patent shows that functional statements of the host grammar can be embedded in DoPatterns: the pattern matching characteristics of a sub-expression are specified with the Pattern composition grammar, and the functional statements that are to execute if and only if this subpattern is found to be part of the best total match are wrapped in a “pre-list” and a “post-list” as comma-separated statements. The '231 patent also teaches that the DoPattern can contain variables scoped to the DoPattern, and that the statements in the “pre-list” and “post-list” that wrap the matching characteristics can use not only the variables defined and initialized in the DoPattern “pre-list” but also variables in outer scopes, such as the parameters of the rule in which the DoPattern is declared.
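The difference noted above, between side-effects that run while a backtracking engine searches and side-effects that run only for the sub-expressions of the final match, can be illustrated with a small Java sketch. It is not the engine of the '231 patent and uses none of its grammar; the class `DeferredSideEffects`, the hard-coded pattern `a*ab`, and the counters are all hypothetical names chosen for illustration. The hand-written matcher attaches a "count one" action to every `a` consumed by `a*`, running it once eagerly and once as a recorded, deferred action.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: a hand-written matcher for the fixed pattern a*ab.
class DeferredSideEffects {
    // Incremented immediately whenever a* consumes an 'a' (eager side-effect).
    static int eagerCount;

    /**
     * Matches the whole input against a*ab using greedy consumption plus backtracking.
     * Returns the recorded actions of the winning path, or null if there is no match.
     * The caller decides when (and whether) to run the returned actions.
     */
    static List<Runnable> match(String input, int[] deferredCount) {
        List<Runnable> pending = new ArrayList<>();
        int max = 0;
        // Greedy phase: a* consumes as many 'a' characters as it can.
        while (max < input.length() && input.charAt(max) == 'a') {
            eagerCount++;                           // eager: runs now and cannot be undone
            pending.add(() -> deferredCount[0]++);  // deferred: only recorded for later
            max++;
        }
        // Backtracking phase: give back one 'a' at a time until the tail "ab" fits.
        for (int n = max; n >= 0; n--) {
            if (input.startsWith("ab", n) && n + 2 == input.length()) {
                return pending;                     // actions of the final match only
            }
            if (n > 0) {
                pending.remove(n - 1);              // drop the action of the abandoned 'a'
            }
        }
        return null;
    }

    public static void main(String[] args) {
        int[] deferred = {0};
        List<Runnable> actions = match("aaab", deferred);
        if (actions != null) {
            actions.forEach(Runnable::run);         // run side-effects after the match is known
        }
        // a* first consumes three 'a's, then backtracks to two so that "ab" can match.
        System.out.println("eager=" + eagerCount + " deferred=" + deferred[0]);
        // prints "eager=3 deferred=2"
    }
}
```

On the input `aaab` the eager counter reports three, although only two `a` characters belong to `a*` in the final match; the deferred actions report two because the action recorded for the abandoned character is discarded before anything runs.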
The '892 patent extended the '231 patent to allow the DoPattern, and the closely related CapturePattern, to access at side-effect time (in the pre-list and post-list) the member variables of the struct in which the DoPattern is declared. Thus member functions (previously called “rules”) of the struct would simply define and return regular expressions (of data-type Pattern). One best practice identified is to declare such a rule in the base class so that it specifies only the regular expression matching characteristics, and then to declare a sub-class that redefines the rule to include side-effects while duplicating the matching characteristics defined by the base class.
In both cases, a virtual machine was described that was capable of executing the regex match and, additionally, of accumulating the side-effect instruction opcode stream corresponding to the DoPattern pre-list and post-list statements (of all DoPattern sub-expressions involved in the best match to the data), to be executed as a result of the match. The resultant expressivity of such a grammar was disclosed, with examples of how it offers a new and easier approach to solving matching problems. In short, a document-level expression, or at least an expression that matches a large recognition unit of the document, can be composed sub-expression by sub-expression, with the side-effects that actually solve the problem at hand embedded in the regular expression. This approach contrasts with what regex programmers typically have to do, which is to match the stream against one fine-grained regex at a time, query for substring matches, and then do something with those matches, which means switching back and forth between matching mode and functional programming mode. The inventions instead embed the side-effects of the regex match in the regex itself: if the goal is to capture an array of matches, the capture into the array is bound into a regex, and a repeat operator is then added to repeat the regex match one or more times. This technique contrasts with the art, which for repeated complex capturing requires executing the target match against the stream one match at a time and storing the parenthetical matches in functional code interleaved with reapplications of the regex to the stream.
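For comparison, the conventional interleaved style described above looks roughly as follows when written against Java's standard `java.util.regex` package. The class name `InterleavedCapture`, the `name=number;` item format, and the sample input are assumptions made only for this sketch. The repetition lives in the host-language loop rather than in the regex, and each capture is stored by functional code between successive applications of the fine-grained pattern.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of the conventional approach, not of the invention.
class InterleavedCapture {
    // Fine-grained regex for one item of the assumed form name=number;
    private static final Pattern ITEM = Pattern.compile("(\\w+)=(\\d+);");

    /** Collects the numeric part of every item by reapplying ITEM in a loop. */
    static List<String> captureValues(String input) {
        Matcher m = ITEM.matcher(input);
        List<String> values = new ArrayList<>();
        while (m.find()) {              // matching mode: apply the regex once more
            values.add(m.group(2));     // functional mode: store the parenthetical match
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(captureValues("a=1;b=22;c=333;"));
        // prints "[1, 22, 333]"
    }
}
```

In the approach of the earlier patents, the array capture would instead be bound into the sub-expression and a repeat operator applied to it, so that a single match of the composed expression fills the array.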
Novelty of the New Invention
Compiling regular expression side-effects to Java™ or C# code discloses and demonstrates that, with certain modifications to the regex engine of the '231 patent, this grammar does not need its own virtual machine. Rather, scripts written in the grammar of the invention can be compiled directly to Java™ classes or, alternatively, C# classes. This means that the side-effect statements are also compiled to Java™ code and benefit from HotSpot compilation, resulting in a program whose regular expression side-effects run at speeds comparable to those of regular Java™ functional programming.
Translating a grammar into Java™ code is not in itself novel, and the advantages of being able to do so are well known. This is why, for example, the Groovy and Scala languages are gaining so many adopters. What is new in compiling regular expression side-effects to Java™ or C# code is that the side-effects of a regular expression, going far beyond capturing the N-th parenthetical match, can be offered without embedding another virtual machine into a library hosted by Java™ or C#; that is, DoPattern and CapturePattern side-effects can be compiled into Java™ code or C# code.