Generally speaking, a parser is a program that checks whether an input file conforms to the grammar of a language and that builds a data structure describing the input. For example, the input file may be in a markup language such as XML (Extensible Markup Language). In contrast to computer languages such as C or Pascal, the result of processing a markup language input file is typically not executable code. Rather, the result typically includes tagged text. For example, XML provides for the creation of customized tags, enabling the definition, transmission, validation, and interpretation of data between applications and between organizations.
FIG. 1 illustrates an example of an excerpt from an XML source file. The excerpt defines a table record with two row records. Text of the form <X> and </X> denotes start and end tokens (tags), respectively, which indicate the start and end of a record. It can be seen from FIG. 1 that the records may be nested such that tags are provided at various depths. More particularly, a subsequent start token may be provided before a matching end token is provided for a preceding start token.
Referring to the FIG. 1 example, and taking the <table> token as being at depth 1, the <row> token is at depth 2, and the <id>, <firstname>, <lastname>, <street>, <city>, <state> and <zip> tokens are at depth 3. The FIG. 1 example is a relatively simple one. However, the structures may be arbitrarily complex, so long as the nesting rules are met (i.e., that an end token is provided matching the last provided start token, before any other end token is provided).
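By way of illustration only, the depth assignment and the nesting rule described above can be sketched with a stack of open start tokens. The function name tag_depths, the regular expression, and the sample string are hypothetical and are not taken from the figures; attributes, comments, and self-closing tags are deliberately ignored.

```python
import re

# Matches simple start tags such as <row> and end tags such as </row>.
# Assumption: no attributes, comments, or self-closing tags appear.
TAG_RE = re.compile(r"<(/?)([A-Za-z_][\w.-]*)>")


def tag_depths(text):
    """Return (name, depth) for each start tag; raise ValueError on bad nesting."""
    stack = []   # names of start tags that have not yet been closed
    depths = []
    for match in TAG_RE.finditer(text):
        is_end = match.group(1) == "/"
        name = match.group(2)
        if not is_end:
            stack.append(name)
            depths.append((name, len(stack)))
        elif not stack or stack[-1] != name:
            # Nesting rule: an end tag must match the last provided start tag.
            raise ValueError(f"end tag </{name}> does not match the last start tag")
        else:
            stack.pop()
    if stack:
        raise ValueError(f"start tag <{stack[-1]}> is never closed")
    return depths


# Hypothetical sample, shaped like the table/row records described above.
sample = "<table><row><id>1</id><city>Reno</city></row></table>"
print(tag_depths(sample))
# prints [('table', 1), ('row', 2), ('id', 3), ('city', 3)]
```

The depth of a start token is simply the number of start tokens still open when it is pushed, which is why the outermost token is at depth 1.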
FIG. 2 illustrates the contents of a character array buffer, corresponding to the XML source file excerpt in FIG. 1. Conventionally, an XML stream is loaded into such a character array buffer, and the stream is parsed from the character buffer.
With reference to the pseudocode in FIG. 3, we now describe a conventional parsing flow for parsing a markup language stream (using XML as a specific example). In the main XML parser, at step 1, the XML stream is loaded into the character array buffer (such as is shown in FIG. 2). At step 2, the stream is processed until the start of a new element is detected. In this case, the start of the new element is a start tag, delimited by “<” and “>.”
At step 3, the process_element( ) process (also shown in FIG. 3) is called. Finally, at step 4, processing continues until either the start of a new element is detected or processing of the source file is finished. If a new element is detected, then processing returns to step 3 (indicated by the step label “new_element”).
We now discuss the process_element process, with reference still to FIG. 3. As just described, the process_element process is called by the main XML parser program process. In addition, as will be described, the process_element process is also called recursively.
At step 1 of the process_element process, the start tag is read for the element being processed. At step 2 of the process_element process, the get_stringobject process is called to get a string object, from a symbol table, for the string of the start tag (the “start string”).
The get_stringobject process will be described in greater detail below. At step 3 of the process_element process, the string object is pushed onto a stack. Finally, at step 4, processing continues until a new element is detected, in which case the process_element process is called recursively, or the end of the element is detected. If the end of the element is detected, then the string object for the element is popped off the stack and processing returns to the calling program (which, it should be recalled, may be the process_element process itself).
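The flow of the main parser and of the recursive process_element process may be sketched as follows. This is a minimal sketch under stated assumptions and is not the pseudocode of FIG. 3: the buffer is a Python string, tags carry no attributes, get_stringobject is a simplified stand-in, and the parse function and the trace list are hypothetical names added only so the behavior can be observed.

```python
def get_stringobject(name, symbol_table):
    # Simplified stand-in: one shared string per distinct tag name.
    return symbol_table.setdefault(name, name)


def process_element(buf, pos, stack, symbol_table, trace):
    """Process the element whose '<' is at buf[pos]; return the index just past its end tag."""
    end = buf.index(">", pos)
    name = buf[pos + 1:end]                      # step 1: read the start tag
    obj = get_stringobject(name, symbol_table)   # step 2: get its string object
    stack.append(obj)                            # step 3: push onto the stack
    trace.append((obj, len(stack)))              # record the depth (illustration only)
    pos = end + 1
    while True:                                  # step 4: continue processing
        lt = buf.index("<", pos)                 # ValueError if the element never ends
        if buf.startswith("</", lt):
            end = buf.index(">", lt)
            if buf[lt + 2:end] != name:
                raise ValueError(f"</{buf[lt + 2:end]}> does not close <{name}>")
            stack.pop()                          # end of element: pop and return
            return end + 1
        # A new element starts here: recurse, then resume after it.
        pos = process_element(buf, lt, stack, symbol_table, trace)


def parse(stream):
    buf = stream                                 # step 1: load the stream into a buffer
    stack, symbol_table, trace = [], {}, []
    pos = buf.find("<")                          # step 2: find the start of an element
    while pos != -1:
        pos = process_element(buf, pos, stack, symbol_table, trace)  # step 3
        pos = buf.find("<", pos)                 # step 4: next top-level element, if any
    return trace, symbol_table


trace, table = parse("<table><row><id>1</id></row><row><id>2</id></row></table>")
print(trace)
# prints [('table', 1), ('row', 2), ('id', 3), ('row', 2), ('id', 3)]
```

In this sketch the stack depth at the moment of each push equals the nesting depth of the element, and the symbol table ends up with one entry per distinct tag name, however many times that name occurs.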
We now describe the get_stringobject process, still with reference to FIG. 3. At step 1 of the get_stringobject process, each character of the start string is checked to determine that it is a valid XML character according to the XML 1.0 specification. At step 2, the symbol table is checked to determine whether there is already a string object for the start string. At step 3, if there is already a string object for the start string in the symbol table, then the already-present string object is returned as the result of the get_stringobject process. Otherwise, a new string object is created in the symbol table for the start string, and the newly-created string object is returned as the result of the get_stringobject process.
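The three steps of the get_stringobject process may be sketched as follows. This is an illustrative sketch only: the StringObject class and the is_xml_char helper are hypothetical, the symbol table is assumed to be a Python dict (the organization of the symbol table is not specified above), and the character check uses the Char production of the XML 1.0 specification.

```python
class StringObject:
    """Hypothetical string object; one instance is shared per distinct start string."""

    def __init__(self, text):
        self.text = text


def is_xml_char(ch):
    # XML 1.0 Char production:
    # #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def get_stringobject(start_string, symbol_table):
    # Step 1: check every character of the start string.
    for ch in start_string:
        if not is_xml_char(ch):
            raise ValueError(f"invalid XML character U+{ord(ch):04X}")
    # Step 2: check the symbol table for an existing string object.
    obj = symbol_table.get(start_string)
    # Step 3: return the existing object, or create, store, and return a new one.
    if obj is None:
        obj = StringObject(start_string)
        symbol_table[start_string] = obj
    return obj


symbol_table = {}
first = get_stringobject("row", symbol_table)
second = get_stringobject("row", symbol_table)
print(first is second, len(symbol_table))
# prints True 1
```

Note that both the character check in step 1 and the lookup in step 2 touch every character of the start string on every call, even when the same tag name has been seen before.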
Referring specifically to XML, but also relevant to other markup languages, there are at least two relatively “expensive” functions (i.e., functions that take a lot of time and/or processing power) that the parser performs. Two such functions dealt with here are within the get_stringobject processing. One such expensive function is checking that each character of the start string is a valid XML character. Another expensive function is checking the symbol table to determine whether there is already a string object for the start string, particularly as the number of string objects referenced in the symbol table increases.