1. Field of the Invention
The present invention relates, in general, to linguistic analysis, and, more particularly, to software, systems and methods for pattern matching in text files.
2. Relevant Background
Linguistic analysis is a field of study involving analytic procedures, concepts, and techniques that enable machine-assisted analysis of speech, grammar, and language use. Linguistic analysis techniques are often used to analyze information, typically represented as text, so as to gain understanding of the meaning of the information. By understanding the meaning of the information, machines such as programmed computers can analyze the vast amount of information stored and communicated in digitized form and take actions in response to the meaning. Common uses for linguistic analysis techniques include searching text, searching and replacing text, testing for certain conditions in a text file or data stream, and email filtering, as well as a variety of other functions. Text analysis is used in such diverse fields as data compression, pattern recognition, computational biology, database searching, and network security.
Unfortunately, information represented as text is difficult to analyze. Complex language constructs such as phrases, sentences, paragraphs and documents exhibit a variety that makes it difficult to extract meaning from the text. Moreover, even single words themselves may exhibit sufficient variation in spelling, addition of prefixes and suffixes, misspelling, and the like that performing activities such as text matching becomes problematic.
Many programming languages include functions for performing rudimentary text processing. Such operations include string concatenation, string sorting, string matching, and other simple text manipulation operations that take advantage of the fact that text is represented as a binary value (e.g., an ASCII-coded value) in most computer systems, which allows mathematical operations to be extended to text. However, information represented as text does not exhibit the same rigid syntax and construction that is typical of numeric information. Hence, linguistic analysis often involves defining patterns that describe a range of possible string values. The defined patterns are then used to perform various functions such as matching and searching. Most programming languages do not contain constructs for efficiently defining text patterns. As a result, string operations implemented in conventional programming languages (or macros such as those generated by the major word processors and editors) tend to be slow, inflexible and relatively prone to errors.
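By way of illustration only, a pattern that describes a range of possible string values might be sketched as follows. The sketch uses Python's standard re module, which is not part of the original disclosure; the pattern, the function name find_variants, and the sample text are hypothetical and chosen solely to show one pattern covering spelling and suffix variants of a single word.

```python
import re

# Hypothetical pattern: "color" or "colour", optionally followed by
# one of the suffixes "s", "ed", or "ing", matched as a whole word.
PATTERN = re.compile(r"\bcolou?r(?:s|ed|ing)?\b", re.IGNORECASE)

def find_variants(text):
    """Return every substring of text that the pattern matches, in order."""
    return [m.group(0) for m in PATTERN.finditer(text)]

sample = ("The colour faded; colors were coloured again, "
          "Coloring the wall. Colorado is a state.")
print(find_variants(sample))
# ['colour', 'colors', 'coloured', 'Coloring']
```

The word "Colorado" is not reported because the trailing word-boundary requirement fails after "Color". A comparable result built from exact string comparisons alone would require enumerating each variant separately.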
Regular expressions, sometimes referred to as regexes or “REs”, have developed as a system for defining text patterns in linguistic analysis applications. Regular expressions are a notation system that defines sets of symbols and syntactic elements used to define text patterns. Once defined, text patterns are used by application software and algorithms to perform various text operations such as matching text, searching and replacing text, testing for certain conditions in a text or data stream, email filtering and other text manipulation tasks. Regular expressions are not software applications themselves but instead are a standardized way of representing text patterns so that they can be readily used by a variety of software applications.
Regular expressions desirably exhibit “correctness” (i.e., the ability to produce desired results when evaluated by a software application) as well as efficiency (i.e., the ability to be evaluated quickly on a computing platform using a reasonable amount of computing resources). Also, it is desirable that a regular expression be maintainable (i.e., that it can be changed so that continued correctness and/or efficiency are obtained in light of changes in the execution environment). Although regular expressions are standardized, there are often a variety of equivalent or substantially equivalent forms in which a text pattern can be represented. “Substantially equivalent” means that, in the context of a particular application, the result of evaluating either form will be the same. Even when two regular expressions produce the same results, the different forms may execute quite differently on a particular application or computing platform. For example, the processing resources, memory, execution speed, and the like may differ significantly between the alternate forms. Because of this, regular expressions benefit from being optimized for a particular algorithm, application, operating system, hardware platform, or other characteristics that define the run-time environment in which the expression will be used.
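As a non-limiting sketch of two equivalent forms, consider the following Python fragment. The two patterns, the helper names same_result and time_search, and the sample inputs are assumptions introduced here for illustration. Both patterns accept one or more "a" characters followed by "b", but the nested form can require far more backtracking in a backtracking engine when the input contains no "b". The measured times depend on the platform and library version and are printed without any claim as to their values.

```python
import re
import time

NESTED = re.compile(r"(?:a+)+b")  # hypothetical authored form
FLAT = re.compile(r"a+b")         # equivalent, simpler form

def same_result(text):
    """True if both forms find the same match span (or both find none)."""
    m1, m2 = NESTED.search(text), FLAT.search(text)
    return (m1 and m1.span()) == (m2 and m2.span())

def time_search(pattern, text):
    """Wall-clock seconds taken by one search of text with pattern."""
    start = time.perf_counter()
    pattern.search(text)
    return time.perf_counter() - start

samples = ["aab", "xxaaabyy", "b", "", "a" * 18]
print(all(same_result(s) for s in samples))

# Same answer (no match), potentially very different cost.
for name, pattern in (("nested", NESTED), ("flat", FLAT)):
    print(name, time_search(pattern, "a" * 18))
```

The input length of 18 is kept deliberately small so that the sketch completes quickly even where the nested form backtracks heavily.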
The special knowledge of words and language needed to implement text manipulation and linguistic analysis functions is very different from the knowledge required by other software specialists. As a result, regular expressions for a particular application are often written and maintained by linguistic specialists rather than computer programmers or information technology (IT) specialists. This situation creates a difficulty in that a linguistic analyst is responsible for achieving correctness of the regular expression while a programmer or IT specialist may be responsible for achieving efficiency of the regular expression through optimization.
Optimizing involves rewriting or transforming a set of regular expressions into a form that executes more efficiently in a particular software application, operating system environment, or hardware environment. A number of regular expression optimizations are described in Mastering Regular Expressions, 2nd edition, by Jeffrey Friedl (July 2002). However, optimizations are typically performed by optimization routines within platform-specific regular expression compilers. For example, in a Perl programming environment a regular expression is compiled, along with other programming constructs that define an application being implemented, into a binary internal representation before it is executed. During this compilation certain optimizations may be performed; however, it is difficult if not impossible for the regular expression author to control or evaluate these optimizations. Performing optimizations at compilation and/or run time does not work interactively with a linguistic specialist authoring a regular expression. Even when debugging processes are available and used, an optimized regular expression generated by a compiler may look quite different from the originally authored regular expression, making it difficult for the author to validate that the optimized expression will perform as intended. Hence, the linguistic specialist may author a set of regular expressions designed to behave in a particular manner but will not find out whether the intended behavior is achieved until after the regular expressions are optimized and implemented in a particular application. Moreover, certain types of misbehavior may only be expressed in rare circumstances, in which case errors may take months or years to detect.
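The following Python sketch illustrates, under stated assumptions, how a rewritten expression may agree with the authored expression on most inputs yet differ on a less common one. The rewrite shown (factoring a common prefix out of an alternation) is a hypothetical transformation chosen for illustration; it is not asserted to be one that any particular compiler performs. The names ORIGINAL, REWRITTEN, and first_match are likewise hypothetical.

```python
import re

ORIGINAL = re.compile(r"cat|category")    # as authored
REWRITTEN = re.compile(r"cat(?:egory)?")  # hypothetical rewrite: prefix factored out

def first_match(pattern, text):
    """Return the text of the first match, or None if there is no match."""
    m = pattern.search(text)
    return m.group(0) if m else None

for text in ["a cat sat", "no match here", "product category"]:
    print(repr(text), first_match(ORIGINAL, text), first_match(REWRITTEN, text))
```

In Python's backtracking engine the two forms agree on the first two inputs. On "product category" the authored form reports "cat", because the first alternative succeeds, while the rewritten form reports "category". An application that only tests whether a match exists would see no difference, whereas an application that replaces the matched text would behave differently.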
Hence, a need exists for systems, methods and software for generating and optimizing regular expressions that provide source-level messages about the optimization process to regular expression authors so that authors can focus on correctness rather than efficiency of the regular expressions. Further, there is a need for a regular expression authoring environment in which regular expressions are optimized and evaluated prior to run-time so that appropriate messages can be generated in response to determining that the optimization will affect correctness of the regular expression.