Numerous aspects of using computer software require a user to be able to search for particular text in a text buffer, e.g., keywords, tags, data values, etc., and use resulting information (e.g., the text itself or information about its context) for some purpose, such as testing against expected values, making changes, debugging, and the like. In the most general sense, the text buffer is some memory or data structure in which the target text is stored, and upon which the user can operate with the assistance of suitable software. Potential sources of the text are numerous and include, for example, pure text files, mark-up language files, source code files, text captured from printed sources, text produced by other software, etc. Additionally, the text can exist in numerous well-known formats, use standardized encodings (e.g., ASCII and Unicode), and be presented in various languages.
Most techniques for manipulating text as described above include some type of parsing operation. In general, parsing involves dividing the text into small components that can be analyzed, sometimes referred to as lexical analysis, and determining the meaning of the components, sometimes referred to as semantic parsing. Of course, most types of text to be analyzed include some lexical information describing how the text should be broken into components or tokens, e.g., punctuation or other keys. For example, parsing this sentence would involve dividing it into words and phrases based on the spaces and punctuation, identifying the type of each component (e.g., verb, adjective, or noun), and then potentially determining more information about the components (e.g., a noun's meaning). Similarly, compiling high-level computer language source code into executable machine code includes, among other steps, lexical analysis, e.g., classifying text strings according to whether they match known keywords, symbols, or data types for various computer language constructs, and semantic parsing, e.g., converting the entire sequence of tokens into a parse tree or other expression that describes the computer program's structure.
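By way of illustration only, the lexical analysis step described above can be sketched in Python as follows. The keyword set, the token categories, and the function name tokenize are hypothetical choices made for this sketch and are not taken from any particular compiler; the semantic parsing step is not shown.

```python
import re

# Hypothetical keyword set for a toy language (an assumption for this sketch).
KEYWORDS = {"if", "else", "while", "return"}

# One named group per token category; whitespace is matched so it can be skipped.
TOKEN_RE = re.compile(r"""
    (?P<NUMBER>\d+)
  | (?P<NAME>[A-Za-z_]\w*)
  | (?P<SYMBOL>[-+*/=<>(){};])
  | (?P<SKIP>\s+)
""", re.VERBOSE)

def tokenize(text):
    """Divide text into (category, value) tokens, left to right."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        kind, value = m.lastgroup, m.group()
        pos = m.end()
        if kind == "SKIP":
            continue  # whitespace only separates tokens
        if kind == "NAME" and value in KEYWORDS:
            kind = "KEYWORD"  # a name that matches a known keyword
        tokens.append((kind, value))
    return tokens

tokens = tokenize("if x < 10 return x + 1;")
for kind, value in tokens:
    print(kind, value)
```

In this sketch the spaces and symbols play the role of the lexical information that indicates where one token ends and the next begins; a subsequent semantic parsing stage would consume the resulting token list.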
In the context of computer software testing, a user is often tasked with identifying certain text expressions in source code (e.g., code in a compiled language, scripting language, markup language, etc.), comparing text generated by other programs with expected values to determine proper operation of the programs, and otherwise automating the processing of text for some purpose. Often, the tools available to the user are either overly simplistic or overly complex.
For example, a simple tool familiar to most computer users is the basic text searching and replacing commands of programs such as text editors, word processors, and mark-up language browsers. These tools are relatively easy to use, but typically lack flexibility and sophistication. On the other hand, tools that employ so-called regular expressions are typically much more difficult to use. A regular expression is a string that describes or matches a set of strings, according to certain syntax rules. Regular expressions are used by many text editors and utilities to search and manipulate bodies of text based on certain patterns. Many programming languages support regular expressions for string manipulation. For example, Perl and Tcl each provide a powerful built-in regular expression engine. Regular expressions can also be used to compactly describe a set, without having to list all elements of the set. For example, the set containing the three strings Handel, Händel, and Haendel can be described by the pattern “H(ä|ae?)ndel”. Most regular expression formalisms provide operations such as alternation, grouping, and quantification. While they are powerful, using regular expressions can require significant skill, i.e., familiarity with the supported operations and syntax, and experience in constructing bug-free expressions. Still other parsing techniques use similarly complicated scripting languages.
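As a brief illustration, the pattern given above can be exercised with the regular expression module of the Python standard library. The candidate strings and the helper name matches_handel are hypothetical and chosen only for this sketch.

```python
import re

# Pattern from the text: grouping "( )", alternation "|", and the
# quantifier "?" (zero or one occurrence of the preceding "e").
HANDEL = re.compile(r"H(ä|ae?)ndel")

def matches_handel(name):
    """Return True if the whole string is one of the described spellings."""
    return HANDEL.fullmatch(name) is not None

# Hypothetical candidate strings used only for this illustration.
candidates = ["Handel", "Händel", "Haendel", "Hendel", "Hndel"]
matched = [c for c in candidates if matches_handel(c)]
print(matched)
```

The single pattern accepts all three spellings and rejects the other two candidates, whereas a basic search command would require three separate literal searches to obtain the same result.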
Accordingly, it is desirable to have tools and techniques for performing various aspects of text parsing operations that provide adequate power and flexibility, while still being easy to use.