Many of today's communication systems use text-based signaling protocols. The messages of these protocols are pure ASCII text and do not contain information intended to be interpreted as binary values. The syntax of such messages is specified by way of formal grammars. For the receiver to understand such a message, the message must be syntactically analyzed and broken into its constituent parts. The act of doing so is traditionally referred to as “parsing,” and software processes designed to do so are referred to as “parsers.”
A significant amount of effort is devoted to parsing. Such effort is mainly directed to the design of compilers, and more recently, natural language translators. There are many tools available for producing parsers. These tools are typically geared to producing parsers for computer languages, in the form of compilers.
Parsing tools have traditionally operated in a two-step fashion. First, a text string is scanned and grouped into manageable pieces called tokens. In the English language sentence, “Jack ran up the hill,” “ran” would be tokenized as a verb, “the” as an article, and so forth. This phase of parsing is typically referred to as “tokenizing,” and is based on the theory of regular languages and finite-state automata.
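By way of illustration only, the tokenizing phase described above can be sketched in Python with a regular expression. The five-word lexicon, the token class names, and the function name `tokenize` are assumptions made for this example and do not come from any particular tool.

```python
import re

# Hypothetical lexicon: each named group is one token class.
# A real tokenizer would be generated from a much larger specification.
TOKEN_PATTERN = re.compile(
    r"(?P<NOUN>Jack|hill)\b"
    r"|(?P<VERB>ran)\b"
    r"|(?P<PREPOSITION>up)\b"
    r"|(?P<ARTICLE>the)\b"
    r"|(?P<SKIP>[\s,.]+)"  # whitespace and punctuation are discarded
)


def tokenize(text):
    """Scan text left to right and return a list of (kind, lexeme) pairs."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = TOKEN_PATTERN.match(text, pos)
        if match is None:
            raise ValueError(f"unrecognized input at position {pos}: {text[pos:]!r}")
        if match.lastgroup != "SKIP":
            # lastgroup names the alternative that matched, i.e. the token class.
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


print(tokenize("Jack ran up the hill,"))
```

Note that the classification here depends only on the characters of each word, which is the defining property of this phase.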
After being tokenized, the string is syntactically analyzed by the parser to determine whether the sequence of tokens is grammatically correct. This phase is based on the theory of context-free languages and pushdown automata. Traditionally, the parser is not allowed to direct, or otherwise influence, the tokenizing process. This constraint is imposed in part by the decomposition of the process into two steps with different theories underlying each step, and in part by the need of such parsers for “look ahead” tokens.
In protocols such as the Media Gateway Control Protocol (MGCP) defined in Request for Comments (RFC) 2705 of the Internet Architecture Board (IAB), Megaco, as defined in RFCs 2885 and 2886, and many others used in telecommunications, there are many instances where tokenization cannot take place independently of syntax analysis. In MGCP, the string “AB05” could represent an identifier, such as part of an endpoint name, or it could represent a hexadecimal number. The correct interpretation of the string depends on the context in which the string is encountered.
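The ambiguity can be made concrete with a short Python sketch. The two patterns below are simplified assumptions chosen for illustration and are not the actual MGCP token definitions; `possible_kinds` is likewise a hypothetical helper.

```python
import re

# Assumed, simplified token definitions (not taken from RFC 2705).
CANDIDATE_PATTERNS = {
    "IDENTIFIER": re.compile(r"[A-Za-z][A-Za-z0-9]*"),
    "HEX_NUMBER": re.compile(r"[0-9A-Fa-f]+"),
}


def possible_kinds(lexeme):
    """Return every token kind whose pattern matches the whole lexeme."""
    return [
        kind
        for kind, pattern in CANDIDATE_PATTERNS.items()
        if pattern.fullmatch(lexeme)
    ]


print(possible_kinds("AB05"))  # matches both patterns: ambiguous
print(possible_kinds("XY05"))  # 'X' and 'Y' are not hex digits
print(possible_kinds("0505"))  # an identifier must start with a letter here
```

Under these assumptions, “AB05” satisfies both definitions, so the character string alone cannot settle which token it is.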
For an English language analogy of this problem, consider the following sentences:
a) “Honey, I forgot to duck;” and
b) “The duck does not like honey.”
In sentence a), “duck” is a verb, and the tokenizer should identify it as such. In sentence b), “duck” is a noun, and this also should be identified as such by the tokenizer. Obviously, the tokenizer cannot classify “duck” as a verb or noun based on the character string “duck” alone. More information is needed, and in particular, syntactical information is needed. The parser would thus have to tell the tokenizer what kind of token to expect.
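One way to picture a parser telling the tokenizer what to expect is a scanning function that takes the expected token kind as an argument. The following Python sketch is an illustration under assumed, simplified patterns; the name `next_token` and its signature are hypothetical and do not describe any existing tool.

```python
import re

# Assumed, simplified token definitions for illustration only.
PATTERNS = {
    "IDENTIFIER": re.compile(r"[A-Za-z][A-Za-z0-9]*"),
    "HEX_NUMBER": re.compile(r"[0-9A-Fa-f]+"),
}


def next_token(text, pos, expected_kind):
    """Scan one token of the kind requested by the caller.

    Returns ((kind, value), new_position). The caller, standing in for
    the parser, supplies the syntactic context via expected_kind.
    """
    match = PATTERNS[expected_kind].match(text, pos)
    if match is None:
        raise ValueError(f"expected {expected_kind} at position {pos}")
    value = match.group()
    if expected_kind == "HEX_NUMBER":
        value = int(value, 16)  # interpret the digits in base 16
    return (expected_kind, value), match.end()


# The same characters yield different tokens depending on the request.
print(next_token("AB05", 0, "IDENTIFIER"))
print(next_token("AB05", 0, "HEX_NUMBER"))
```

In this sketch the same four characters are returned as the identifier “AB05” in one context and as the integer value of hexadecimal AB05 in the other.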
Two methods have traditionally been employed to handle this problem. First, a parser can be hand-written. Here, the translation from the protocol grammar to the parser source code is done completely by the parser designer or programmer. The result is a specialized, one-of-a-kind parser. Second, standard tools can be used but with significant special cases introduced in the tokenizer. The result is the duplication of a large part of the parser logic within the tokenizer.
Both of these approaches have significant drawbacks, particularly from the viewpoint of maintainability. The first option produces a parser where the connection between the grammar and the parser code is unclear. A small revision of the protocol grammar may require a major rewrite of the parser. For the second option, a standard tool maintains the connection between the grammar and the parser code, but it is completely ignorant of the duplicated logic within the tokenizer. Changing such a parser is therefore a highly error-prone software engineering task.
In essence, traditional parsing systems using a tokenizer 10 and parser 12 may be illustrated as shown in FIG. 1. A text-based input string 14 is provided to the tokenizer 10, wherein the string is broken down into its respective tokens 16. The tokens 16 are provided to the parser 12, which processes the tokens 16 and provides a corresponding parser output 18 based on the processed tokens 16. There is no reciprocal interplay between the tokenizer 10 and the parser 12.
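The one-way flow of FIG. 1 can be sketched in Python as a generator feeding a consumer. The toy grammar (pairs of a word followed by a decimal number), the input strings, and the function names are assumptions made for this example; only the reference numerals in the comments come from the figure description.

```python
def tokenize(input_string):
    """Tokenizer 10: break the input string 14 into tokens 16.

    Classification uses the characters alone; nothing flows back
    from the parser.
    """
    for lexeme in input_string.split():
        kind = "NUMBER" if lexeme.isdigit() else "WORD"
        yield kind, lexeme


def parse(tokens):
    """Parser 12: check WORD NUMBER pairs and build the parser output 18."""
    output = {}
    tokens = iter(tokens)
    for kind, lexeme in tokens:
        if kind != "WORD":
            raise ValueError(f"expected WORD, got {kind} {lexeme!r}")
        value_kind, value = next(tokens, ("END", ""))
        if value_kind != "NUMBER":
            raise ValueError(f"expected NUMBER, got {value_kind} {value!r}")
        output[lexeme] = int(value)
    return output


print(parse(tokenize("port 5060 ttl 30")))
```

Because the tokenizer has already committed to a classification before the parser sees the token, an input such as "port AB05" is rejected in this sketch even if the position would have permitted a hexadecimal value; the parser has no means of asking for a different interpretation.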
As such, there is a need for a parsing system wherein the tokenizer and parsing processes can communicate with one another in an effective manner to facilitate parsing without prior knowledge of syntax associated with the string. Further, there is a need for a parsing system that is readily maintained and that allows either the tokenizer or the parsing process to be modified without significantly affecting the other.