Field of the Invention
The present invention relates to document processing and more particularly to error handling during document processing.
Description of the Related Art
A markup language is a modern system for annotating a text in a way that is syntactically distinguishable from that text. The idea and terminology evolved from the “marking up” of manuscripts, namely the revision instructions that editors traditionally wrote with a blue pencil on the manuscripts of authors. The concept of markup continued through early forms of the word processor, in which presentation and formatting instructions indicated the text to which they were to be applied by tagging a start and finish position in the text to denote the target text. The concept of presentation markup continued in the context of content distribution, most notably with the widespread use of the hypertext markup language (HTML).
In HTML, content is structured for presentation according to tags embedded in the content directing the manner in which the tagged content is to be presented. Advanced forms of HTML provide support for embedded programmatic logic and the referencing of embedded scripts or external executable or interpretable instructions. Of import, HTML has a set of predefined presentation semantics, meaning its specification prescribes how the structured data is to be presented. Other markup languages, like the extensible markup language (XML), have no predefined semantics and often require a plan or schema setting forth a permissible structure for the document.
The XML specification defines an XML document as a text which is “well-formed”; that is, it satisfies a list of syntax rules provided in the XML specification. The definition of an XML document excludes text which contains violations of the “well-formedness rules.” An XML processor encountering such a violation is required to report the error and to cease normal processing. This policy, occasionally referred to as draconian, stands in notable contrast to the behavior of programs which process HTML, which are designed to produce a reasonable result even in the presence of severe markup errors.
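The contrast between the two error-handling policies can be illustrated with a minimal sketch, assuming Python's standard library parsers as stand-ins for an XML processor and an HTML-processing program. The sample strings and the helper names `parse_xml` and `collect_html_tags` are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

WELL_FORMED = "<order><item>pen</item></order>"
ILL_FORMED = "<order><item>pen</order>"  # <item> is never closed


def parse_xml(text):
    """Return the root tag, or None when a well-formedness rule is violated."""
    try:
        return ET.fromstring(text).tag
    except ET.ParseError:
        # The XML parser stops at the first violation; no partial tree is returned.
        return None


class TagCollector(HTMLParser):
    """Records every start tag seen, without checking that tags are balanced."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def collect_html_tags(text):
    """Feed text to the tolerant HTML parser and return the start tags it saw."""
    collector = TagCollector()
    collector.feed(text)
    collector.close()
    return collector.tags
```

Under these assumptions, `parse_xml(ILL_FORMED)` yields nothing usable, while `collect_html_tags(ILL_FORMED)` still reports both start tags despite the missing end tag.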
In addition to being well-formed, an XML document may also be valid. To be valid, the XML document must contain a reference to a schema or grammar, typically embodied within a Document Type Definition (DTD), and the elements and attributes of the XML document must be declared in that DTD and follow the grammatical rules for the elements and attributes that the DTD specifies. As such, XML processors are classified as validating or non-validating depending on whether or not the XML processors check XML documents for validity. A processor which discovers a validity error must be able to report it, but may continue normal processing.
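The distinction between a validity error and a well-formedness error can be sketched as follows. This is not DTD validation: the `GRAMMAR` table is a hypothetical, greatly simplified stand-in for a DTD that only lists which child elements each element may contain, and `validity_errors` is an illustrative helper name. The point shown is that validity errors are collected and reported while processing continues.

```python
import xml.etree.ElementTree as ET

# Hypothetical grammar standing in for a DTD: each declared element name
# maps to the set of child element names it is permitted to contain.
GRAMMAR = {
    "order": {"item", "note"},
    "item": set(),
    "note": set(),
}


def validity_errors(text, grammar=GRAMMAR):
    """Return a list of validity errors; an empty list means the text is valid.

    A well-formedness violation still raises ET.ParseError, because the
    document cannot be parsed at all in that case.
    """
    root = ET.fromstring(text)
    errors = []
    for parent in root.iter():  # document order, starting at the root
        if parent.tag not in grammar:
            errors.append(f"undeclared element <{parent.tag}>")
            continue  # report the error and keep checking the rest
        for child in parent:
            if child.tag not in grammar[parent.tag]:
                errors.append(f"<{child.tag}> not allowed inside <{parent.tag}>")
    return errors
```

For example, a document containing an undeclared `<price>` element inside `<order>` is reported with two errors, and the remaining elements are still checked.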
An XML document must be parsed into a usable format for other programs to use. But during processing, the document may fail to be parsed correctly; or, alternatively, parsing may be completed, but the parsed document may fail validation against a schema or data format definition. In either case, full processing of the well-formed document cannot continue without problems, as portions of the incoming data will be missing or incomplete. Although there may, in theory, be enough data to continue processing in a limited way through certain paths of application logic, there is no way to determine if there is sufficient data to successfully traverse a path of application logic. In other words, there is no process that allows the parsing of a well-formed document to continue even in a restrictive way.
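The all-or-nothing behavior described above can be sketched as a conventional load routine. This is an illustrative assumption rather than any particular system's implementation: the `REQUIRED_FIELDS` tuple is a hypothetical stand-in for a schema or data format definition, and `load_order` is a hypothetical function name.

```python
import xml.etree.ElementTree as ET

# Hypothetical data format definition: child elements every order must carry.
REQUIRED_FIELDS = ("id", "customer", "total")


def load_order(text):
    """Return (status, payload) for the two failure modes and the success case."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        # Failure mode 1: the document could not be parsed at all.
        return ("parse_failed", None)

    missing = [name for name in REQUIRED_FIELDS if root.find(name) is None]
    if missing:
        # Failure mode 2: parsing completed, but validation failed.
        # The caller learns what is missing, not which application
        # paths could still run on the data that is present.
        return ("validation_failed", missing)

    return ("ok", {name: root.findtext(name) for name in REQUIRED_FIELDS})
```

In this sketch, a document that supplies only `id` is rejected outright, even if some path of application logic needs nothing more than `id`.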