1. Field of the Invention
The present invention generally relates to the parsing and validation of documents, such as XML™ documents, for use in individual data processors interconnected by a network and, more particularly, to hardware validating processors for accelerating the validation of such documents.
2. Description of the Prior Art
The field of digital communications between computers and the linking of computers into networks has developed rapidly in recent years, similar in many ways to the proliferation of personal computers of a few years earlier. This increase in interconnectivity and the possibility of remote processing has greatly increased the effective capability and functionality of individual computers in such networked systems. Nevertheless, the variety of uses of individual computers and systems, the preferences of their users and the state of the art when computers are placed into service have resulted in a substantial degree of variety in the capabilities and configurations of individual machines and their operating systems, collectively referred to as “platforms”, which are generally incompatible with each other to some degree, particularly at the level of operating system and programming language.
This incompatibility of platform characteristics, together with the simultaneous requirement for communication and remote processing and for a sufficient degree of compatibility to support them, has resulted in the development of object oriented programming (which accommodates the concept of assembling an application, as well as data, as a group of more or less generalized modules through a referencing system of entities, attributes and relationships) and a number of programming languages to embody it. Extensible Markup Language™ (XML™) is one such language which has come into widespread use and can be transmitted as a document over a network of arbitrary construction and architecture.
In such a language, certain character strings correspond to certain commands or identifications, including special characters and other important data (collectively referred to as control words), which allow data or operations to, in effect, identify themselves so that they may thereafter be treated as “objects”. Associated data and commands can then be translated into the appropriate formats and commands of different applications in different languages in order to engender a degree of compatibility of respective connected platforms sufficient to support the desired processing at a given machine. The detection of these character strings is performed by an operation known as parsing, similar to the more conventional usage of resolving the syntax of an expression, such as a sentence, into its component parts and describing them grammatically.
When parsing an XML™ document, a large portion and possibly a majority of the central processing unit (CPU) execution time is spent traversing the document searching for control words, special characters and other important data as defined for the particular XML™ specification being processed. This is typically done by software which examines each character and determines whether it belongs to a predefined set of strings of interest, for example, a set of character strings comprising “<command>”, “<data type=dataword>”, “</command>”, etc. If any of the target strings is detected, a token is saved with a pointer to the location in the document of the start of the token and the length of the token. These tokens are accumulated until the entire document has been parsed.
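By way of illustration only, the character-by-character search described above may be sketched in software as follows. The function name `find_tokens` and the constant `TARGETS` are hypothetical and are introduced solely for this example; the target strings are those given as examples in the preceding paragraph, and each token is represented simply as a (start, length) pair.

```python
from typing import List, Tuple

# Hypothetical set of strings of interest, taken from the examples in the text.
TARGETS = ("<command>", "<data type=dataword>", "</command>")


def find_tokens(document: str) -> List[Tuple[int, int]]:
    """Return a (start, length) pair for every target string found."""
    tokens = []
    pos = 0
    while pos < len(document):
        for target in TARGETS:
            # Test whether a target string begins at the current position.
            if document.startswith(target, pos):
                tokens.append((pos, len(target)))
                pos += len(target)
                break
        else:
            # No target begins here; advance by one character.
            pos += 1
    return tokens


doc = "<command><data type=dataword>42</command>"
tokens = find_tokens(doc)
```

Each saved pair can later be resolved back to the original text as `doc[start:start + length]`, so that the document itself need not be copied while tokens are accumulated.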
This process must then be followed by processing to evaluate the tokens against rules and definitions contained in a “document model”, such as a document type definition (DTD) or an XML™ schema, in order to assure that the collection of tokens and the character strings they represent in the document are well-constructed to form an unambiguous and internally consistent document in its entirety. This processing is known as validation and generally proceeds in much the same fashion as the processing for finding character strings of interest discussed above, but operates on sixteen-bit (or longer) tokens corresponding to sequences of bytes rather than on single eight-bit (or longer) bytes representing characters. Validation also checks for consistency between tokens and the content or arguments of other tokens in order to accommodate the self-definition characteristics and properties of languages such as XML™, SGML™ (of which XML™ is a simplified form) and HTML™ (which is essentially a special case of XML™) which support platform independence and interconnectivity.
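A much simplified sketch of validation against a document model is given below for purposes of explanation. It assumes, hypothetically, that the tokens have already been reduced to ("open", name) and ("close", name) pairs and that the document model is a small table, here called `CONTENT_MODEL`, listing the child elements permitted within each element. An actual DTD or XML™ schema expresses considerably richer constraints (ordering, cardinality, attributes and data types) which are not modeled here.

```python
# Hypothetical document model: each element maps to its permitted children.
CONTENT_MODEL = {
    "order": {"item", "note"},
    "item": {"name"},
    "name": set(),
    "note": set(),
}


def validate(tokens, root="order"):
    """Check nesting and parent/child consistency against CONTENT_MODEL."""
    stack = []          # currently open elements, innermost last
    root_seen = False
    for kind, name in tokens:
        if kind == "open":
            if name not in CONTENT_MODEL:
                return False            # element not declared in the model
            if not stack:
                if name != root or root_seen:
                    return False        # wrong root or a second root
                root_seen = True
            elif name not in CONTENT_MODEL[stack[-1]]:
                return False            # child not permitted in this parent
            stack.append(name)
        elif kind == "close":
            if not stack or stack.pop() != name:
                return False            # close does not match innermost open
        else:
            return False                # unknown token kind
    return root_seen and not stack      # every opened element was closed


good = [("open", "order"), ("open", "item"), ("open", "name"),
        ("close", "name"), ("close", "item"), ("close", "order")]
bad = [("open", "order"), ("open", "name"), ("close", "name"),
       ("close", "order")]
```

In this sketch `validate(good)` succeeds, while `validate(bad)` fails because the hypothetical model does not permit a "name" element directly within an "order" element.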
Both the parsing for finding tokens and the parsing for validation are generally implemented using a conceptually table-based finite state machine (FSM) or state table to search for these strings of interest or for consistency between the elements found and represented by tokens. The state table resides in memory and is designed to search for the specific patterns of characters or tokens in the document. For parsing to find character strings of interest, the current state is used as the base address into the state table and the ASCII representation of the input character, or the token, is used as an index into the table. Character strings of interest may be of any of several types, such as an element, an attribute or attribute list, or data; elements may be simple elements or aggregates and may be nested. The parsing for validation principally looks at the types of character strings presented and at the nesting itself to determine which elements or tokens are associated with one or more other specific tokens and the hierarchical relationship between them.
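The table lookup described above, in which the current state supplies a base address and the character code supplies an index, may be illustrated by the following minimal sketch. The two states `TEXT` and `IN_TAG`, the flat list layout and the function names are assumptions made only for this example, as is the restriction to seven-bit ASCII input; the sketch merely detects spans beginning with "<" and ending with ">" rather than any particular set of control words.

```python
ALPHABET = 128          # seven-bit ASCII input is assumed
TEXT, IN_TAG = 0, 1     # hypothetical states for this illustration
NUM_STATES = 2


def build_table():
    """Build a flat state table addressed as state * ALPHABET + char code."""
    table = [0] * (NUM_STATES * ALPHABET)
    for code in range(ALPHABET):
        table[TEXT * ALPHABET + code] = TEXT        # default: stay in text
        table[IN_TAG * ALPHABET + code] = IN_TAG    # default: stay in tag
    table[TEXT * ALPHABET + ord("<")] = IN_TAG      # "<" opens a tag
    table[IN_TAG * ALPHABET + ord(">")] = TEXT      # ">" closes a tag
    return table


def scan(document, table):
    """Return (start, length) for each "<...>" span found via the table."""
    state = TEXT
    start = 0
    tokens = []
    for pos, ch in enumerate(document):
        # Current state is the base address; the character code is the index.
        next_state = table[state * ALPHABET + ord(ch)]
        if state == TEXT and next_state == IN_TAG:
            start = pos
        elif state == IN_TAG and next_state == TEXT:
            tokens.append((start, pos - start + 1))
        state = next_state
    return tokens


spans = scan("<a>hi</a>", build_table())
```

One memory access is made per input character in this arrangement, which suggests why the size and layout of the state table bear directly on parsing speed.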
The goal of this processing is not only to determine that the document is a valid document that conforms to the language (e.g. XML™) standard and has the correct structure as defined by a DTD or XML™ schema in its entirety, but also to develop a hierarchical data structure, such as a tree structured document object, in which the structure will fully represent the informational content of the data. Therefore, while parsing to find character strings of interest is very time consuming and processor intensive, parsing for validation is much more so. That is, since the XML™ data, for example, are textual, and not only the data but also the data structure, which may be freely specified to express the informational content, must be extracted from such text, it can be readily appreciated that the required processing is particularly time consuming and processor intensive.
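For illustration, the development of a tree structured document object from a stream of parsed events may be sketched as follows. The `Node` class, the event format of ("open", name), ("text", value) and ("close", name) pairs, and the function `build_tree` are hypothetical, and the sketch assumes that the events have already been found to be well-formed and properly nested.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    """One element of the hypothetical tree structured document object."""
    name: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def build_tree(events: List[Tuple[str, str]]) -> Optional[Node]:
    """Build a tree from events assumed to be well-formed and nested."""
    root = None
    stack = []                      # open elements, innermost last
    for kind, value in events:
        if kind == "open":
            node = Node(value)
            if stack:
                stack[-1].children.append(node)   # attach to current parent
            else:
                root = node                       # first element is the root
            stack.append(node)
        elif kind == "text":
            stack[-1].text += value               # data of innermost element
        elif kind == "close":
            stack.pop()
    return root


events = [("open", "order"), ("open", "item"), ("text", "widget"),
          ("close", "item"), ("open", "note"), ("text", "rush"),
          ("close", "note"), ("close", "order")]
tree = build_tree(events)
```

The resulting structure carries both the data (the text of each element) and the hierarchy (the parent and child relationships), both of which must be recovered from the textual document.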
At the same time, the potential complexity of the processing needed to properly handle aggregate elements and flexible nesting that may be in multiple hierarchical levels complicates the use of special purpose or hardware processors to reduce the processing load on the CPU of the local computer. That is, while it is generally recognized that special purpose or hardware processors can often provide increased processing speed in comparison with general purpose processors due to the reduced overhead for control of the general purpose processor, itself, it is not assured that a special purpose processor will be feasible or provide any significant advantage in performance as the processing function becomes more complex or with increased requirements for flexibility. In general, increased complexity and/or requirements for flexibility of function can only be accommodated by much increased hardware requirements which may not be economically justified for many applications or for the performance gain that may be possible. It is for this reason that validation parsing has been performed on programmed general purpose computers despite the processing time required.