A markup language parser is, in general terms, a program that determines the syntactic structure of a string of symbols in a markup language. A markup language (or, perhaps more precisely, a markup specification language) describes information (text or data), usually for storage, transmission, or processing by a program. The markup language typically does not specify what should be done with the data.
FIG. 1 illustrates a conventional markup language parser 100 from a simplified point of view. In broad terms, the parser 100 processes markup language source from a file 106 and provides processed data for use by one or more applications 101. From the simplified point of view illustrated in FIG. 1, the parser 100 can be considered to include two primary components: a reader 102 and a scanner 104.
The reader 102 reads the contents of the file 106 to be processed (including markup language statements which, in the example, are XML) and stores the contents into a buffer 108, typically of fixed, predetermined size. If the file 106 is larger than the buffer 108, then the buffer 108 is refreshed with the unread markup language data once the scanner 104 has processed the data that is currently in the buffer 108.
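The buffer refresh described above can be illustrated with a minimal sketch. The function name `refill_and_process`, the `process` callback, and the 16-byte buffer size are hypothetical and chosen only for illustration. The sketch also ignores tokens that straddle a buffer boundary, which a real reader would have to handle.

```python
import io

def refill_and_process(source, process, buffer_size=16):
    """Repeatedly fill a fixed-size buffer from `source` and hand each
    filled portion to `process`; return the number of fills performed."""
    buffer = bytearray(buffer_size)   # stands in for buffer 108
    fills = 0
    while True:
        n = source.readinto(buffer)   # overwrite the buffer with unread data
        if not n:                     # 0 bytes read means end of input
            break
        process(bytes(buffer[:n]))    # only the first n bytes are valid
        fills += 1
    return fills

# Usage: a 31-byte document read through a 16-byte buffer needs two fills.
document = b"<a><b>text</b><c attr='1'/></a>"
chunks = []
fill_count = refill_and_process(io.BytesIO(document), chunks.append)
```

Here the buffer is refilled only after `process` returns, mirroring the description in which the buffer 108 is refreshed once the data currently in it has been processed.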
The reader 102 is configured to check for valid markup language characters, to tokenize the markup language content (e.g., for XML, into XMLNames, values, and content), and to provide the tokens to the scanner 104.
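As a rough illustration of the tokenizing step, the following sketch splits a small XML fragment into names, attribute values, and content. The token kinds (`NAME`, `END_NAME`, `ATTR_NAME`, `VALUE`, `CONTENT`) and the function name `tokenize` are assumptions made for this example. The name pattern is deliberately much narrower than the XML Name production, and comments, processing instructions, CDATA sections, and entity references are not handled.

```python
import re

# Simplified name pattern (an assumption); real XML names allow far more characters.
NAME = r"[A-Za-z_:][A-Za-z0-9_:.\-]*"
TAG_RE = re.compile(r"<(/?)(" + NAME + r")([^<>]*?)(/?)>")
ATTR_RE = re.compile(r"(" + NAME + r""")\s*=\s*(?:"([^"<]*)"|'([^'<]*)')""")

def tokenize(text):
    """Return a list of (kind, text) tokens for a small XML fragment."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos] == "<":
            match = TAG_RE.match(text, pos)
            if match is None:
                # Stands in for the reader's validity check.
                raise ValueError(f"malformed markup at offset {pos}")
            closing, name, attrs, _self_closing = match.groups()
            tokens.append(("END_NAME" if closing else "NAME", name))
            for attr in ATTR_RE.finditer(attrs):
                double_quoted, single_quoted = attr.group(2), attr.group(3)
                value = double_quoted if double_quoted is not None else single_quoted
                tokens.append(("ATTR_NAME", attr.group(1)))
                tokens.append(("VALUE", value))
            pos = match.end()
        else:
            end = text.find("<", pos)
            if end == -1:
                end = len(text)
            tokens.append(("CONTENT", text[pos:end]))
            pos = end
    return tokens
```

For the fragment `<a id="1">hi</a>`, this sketch yields an element name, an attribute name and its value, a run of content, and a closing name, in that order.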
The scanner 104 is configured to process the tokens generated by the reader 102 and to provide string objects and/or values (generically, data 103) to the application 101 based on the tokens. For example, the scanner 104 may operate as a state machine. The string objects and/or values provided to the application 101 by the scanner 104 may be, for example, an XMLName (element name, attribute name), attribute value, element content, etc.
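One way a scanner might operate as a state machine is sketched below. The states, the token kinds (`NAME`, `END_NAME`, `ATTR_NAME`, `VALUE`, `CONTENT`), and the `deliver` callback standing in for the application 101 are all hypothetical. The sketch only enforces that an attribute name appears inside a tag and is followed by a value; it does not check element nesting.

```python
from enum import Enum, auto

class State(Enum):
    CONTENT = auto()     # between tags
    IN_TAG = auto()      # after an element name; attributes may follow
    ATTR_VALUE = auto()  # after an attribute name; a value must follow

def scan(tokens, deliver):
    """Consume (kind, text) tokens and pass data items to `deliver`."""
    state = State.CONTENT
    pending_attr = None
    for kind, text in tokens:
        if state is State.ATTR_VALUE:
            if kind != "VALUE":
                raise ValueError(f"expected a value for attribute {pending_attr!r}")
            deliver(("attribute", pending_attr, text))
            state = State.IN_TAG
        elif kind == "NAME":
            deliver(("element", text))
            state = State.IN_TAG
        elif kind == "ATTR_NAME":
            if state is not State.IN_TAG:
                raise ValueError(f"attribute {text!r} outside of a tag")
            pending_attr = text
            state = State.ATTR_VALUE
        elif kind == "CONTENT":
            deliver(("content", text))
            state = State.CONTENT
        elif kind == "END_NAME":
            deliver(("end", text))
            state = State.CONTENT
        else:
            raise ValueError(f"unexpected token kind {kind!r}")
    if state is State.ATTR_VALUE:
        raise ValueError(f"attribute {pending_attr!r} has no value")

# Usage: collect what the application would receive.
received = []
scan([("NAME", "a"), ("ATTR_NAME", "id"), ("VALUE", "1"),
      ("CONTENT", "hi"), ("END_NAME", "a")], received.append)
```

Keeping the current state explicit lets the scanner reject a token sequence as soon as it becomes invalid, rather than after the whole input has been consumed.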
Data is conventionally passed between the reader 102 and the scanner 104 as follows. The scanner 104 passes pointer objects to the reader 102. The pointer objects passed by the scanner 104 to the reader 102 are essentially empty shells, to be populated by the reader 102. After processing by the reader 102, a pointer object points to a token in the buffer 108, and control is returned to the scanner 104. More particularly, the pointer object indicates the offset of the token in the buffer 108 as well as the length of the token. Then, depending on the type of token being processed, the scanner 104 uses the populated pointer object either to create a string object or to copy data into a buffer 110 in the scanner 104.
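The exchange of pointer objects can be sketched as follows. The class names `TokenPointer` and `Reader`, the two token kinds, and the rule that names become string objects while content is copied into the scanner's own buffer are assumptions for illustration; the description above does not say which token types take which path.

```python
from dataclasses import dataclass

@dataclass
class TokenPointer:
    """An empty shell until the reader fills in offset and length."""
    offset: int = -1
    length: int = 0

class Reader:
    def __init__(self, data):
        self.buffer = data   # stands in for buffer 108
        self.pos = 0

    def next_token(self, ptr):
        """Populate `ptr` for the next token; return its kind, or None at end."""
        buf = self.buffer
        if self.pos >= len(buf):
            return None
        if buf[self.pos:self.pos + 1] == b"<":
            end = buf.index(b">", self.pos)   # raises ValueError if unterminated
            ptr.offset, ptr.length = self.pos + 1, end - self.pos - 1
            self.pos = end + 1
            return "NAME"
        end = buf.find(b"<", self.pos)
        if end == -1:
            end = len(buf)
        ptr.offset, ptr.length = self.pos, end - self.pos
        self.pos = end
        return "CONTENT"

def scan(data):
    """Drive the reader with pointer shells; return (names, copied content)."""
    reader = Reader(data)
    names = []
    content_buffer = bytearray()   # stands in for buffer 110
    while True:
        ptr = TokenPointer()               # the scanner hands over an empty shell
        kind = reader.next_token(ptr)      # the reader populates it
        if kind is None:
            break
        token = reader.buffer[ptr.offset:ptr.offset + ptr.length]
        if kind == "NAME":
            names.append(token.decode("utf-8"))   # create a string object
        else:
            content_buffer += token               # copy into the scanner's buffer
    return names, bytes(content_buffer)
```

In this sketch every token costs a pointer object plus either a new string object or a copy, which is the kind of per-token overhead the conventional arrangement incurs.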
It is desired to streamline the operation of the parser.