1. Field of the Invention
This invention relates to XML parsers, and particularly to a method for performing well-formedness checking, validation, and datatype conversion for simple types in a single pass.
2. Description of Background
XML (Extensible Markup Language) has begun to work its way into the business computing infrastructure and underlying protocols such as the Simple Object Access Protocol (SOAP) and Web services. In the performance-critical setting of business computing, however, the flexibility of XML becomes a liability because of the potentially significant performance penalty it incurs. XML processing is conceptually a multitiered task, an attribute it inherits from the multiple layers of specifications that govern its use, including XML, XML namespaces, XML Information Set (Infoset), and XML Schema. Traditional XML processor implementations reflect these specification layers directly. Bytes, read off the “wire” or from disk, are converted to some known form and tokenized. Attribute values and end-of-line sequences are normalized. Namespace declarations and prefixes are resolved, and the tokens are then transformed into some representation of the document Infoset. The Infoset is optionally checked against an XML Schema grammar (also referred to herein as an XML schema or simply a schema) for validity and rendered to the user through some application programming interface (API), such as the Simple API for XML (SAX) or the Document Object Model (DOM).
With the widespread adoption of SOAP and Web services, XML-based processing, and parsing of XML documents in particular, is becoming a performance-critical aspect of business computing. In such scenarios, XML is invariably constrained by an XML Schema grammar, which can be used during parsing to improve performance. Although traditional grammar-based parser generation techniques could be applied to the XML Schema grammar, the expressiveness of XML Schema does not lend itself well to the generic intermediate representations associated with these approaches.
Indeed, for parsing in domains other than XML (e.g., programming languages), grammars have long been used to generate optimized special-purpose parsers that operate much more efficiently than their generic counterparts while still performing validation. The XML specifications were designed to enable the compilation of an XML Schema grammar to a special-purpose parser. However, generic XML parsers, by performing tasks in separate passes, degrade the performance of the overall application.
In particular, in validating XML data against XML Schema simple types, it is common practice to scan the document for syntactic constructs such as angle brackets, quotes, and entity references before validating the scanned data against the simple type production. When deserialization of the data into datatype-specific objects is also needed, typical applications reparse the input data to perform the conversion, resulting in poor performance. That is, traditional XML parsers that validate against W3C XML Schema simple types first check the data for well-formedness, then validate it against the specific type, and then convert it to a datatype-specific form, so the same data is processed in multiple passes. In many cases, the upconverted but otherwise raw data is then passed to an application, which reconverts it to an application-specific form. These extra passes are time-consuming and considerably slow down parsing.
Traditional XML parsers are not capable of performing all necessary tasks in a single pass. Thus, it is desirable to design and implement an XML parser that can perform well-formedness checking, validation, and datatype conversion for simple types in a single pass.
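By way of illustration only, the difference between the multi-pass practice described above and a single pass can be sketched for one simple type, xs:int. The function name parse_int_content, its parameters, and its return convention are hypothetical and are not taken from any existing parser. The sketch also assumes a deliberately simplified content model: entity references, character references, CDATA sections, and comments inside the element content are not handled and are simply rejected.

```python
def parse_int_content(buf: str, pos: int = 0):
    """Scan element character content from buf[pos] up to the next '<'.

    Hypothetical sketch. In one left-to-right scan it:
      (a) checks that the content is terminated by '<' (a simplified
          stand-in for well-formedness checking),
      (b) checks the xs:int lexical form: optional whitespace, optional
          sign, one or more ASCII digits, optional whitespace,
      (c) accumulates the integer value while the digits are read.
    Returns (value, index of the terminating '<').
    """
    whitespace = " \t\r\n"
    n = len(buf)
    i = pos

    # Skip leading whitespace (xs:int collapses whitespace).
    while i < n and buf[i] in whitespace:
        i += 1

    # Optional sign.
    sign = 1
    if i < n and buf[i] in "+-":
        if buf[i] == "-":
            sign = -1
        i += 1

    # Digits: validated and converted in the same loop.
    value = 0
    digits = 0
    while i < n and "0" <= buf[i] <= "9":
        value = value * 10 + (ord(buf[i]) - ord("0"))
        digits += 1
        i += 1
    if digits == 0:
        raise ValueError(f"expected a digit at offset {i}")

    # Skip trailing whitespace.
    while i < n and buf[i] in whitespace:
        i += 1

    # The content must end at markup; anything else is an error.
    if i >= n:
        raise ValueError("unexpected end of input before '<'")
    if buf[i] != "<":
        raise ValueError(f"invalid character {buf[i]!r} at offset {i}")

    # Range facet of xs:int (32-bit signed).
    value *= sign
    if not -2147483648 <= value <= 2147483647:
        raise ValueError("value out of range for xs:int")
    return value, i
```

For example, calling parse_int_content("<n>17</n>", 3) would return the integer 17 together with the offset of the "<" that begins the end tag, without the characters "17" ever being scanned a second time. A multi-pass design would instead first locate the end of the content, then match the extracted text against the xs:int lexical pattern, and only then convert it to an integer.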