Extensible Markup Language (XML) is a meta-markup language that provides a format for describing structured data. XML is similar to HTML in that it is a tag-based language. By virtue of its tag-based nature, XML defines a strict tree structure or hierarchy. XML is a derivative of Standard Generalized Markup Language (SGML) that provides a uniform method for describing and exchanging structured data in an open, text-based format. XML utilizes the concepts of elements and namespaces. Compared to HTML, which is a display-oriented markup language, XML is a general-purpose language for representing structured data without including information that describes how to format the data for display.
XML “elements” are structural constructs that consist of a start tag, an end or close tag, and the information or content that is contained between the tags. A “start tag” is formatted as “<tagname>” and an “end tag” is formatted as “</tagname>”. In an XML document, start and end tags can be nested within other start and end tags. All elements that occur within a particular element must have their start and end tags occur before the end tag of that particular element. This defines a tree-like structure that is representative of the XML document. Each element forms a node in this tree, and potentially has “child” or “branch” nodes. The child nodes represent any XML elements that occur between the start and end tags of the “parent” node.
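By way of illustration only, the parent and child relationship described above can be sketched in Python using the standard-library xml.etree.ElementTree module. The sample document and the helper name `outline` are assumptions introduced for this sketch and are not part of the disclosure.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document: <b> and <c> are children of <a>; <d> is a child of <c>.
DOC = "<a><b>one</b><c><d>two</d></c></a>"

def outline(node, depth=0):
    """Return (depth, tag) pairs in document order, one per element node."""
    rows = [(depth, node.tag)]
    for child in node:  # iterating an Element yields its direct child elements
        rows.extend(outline(child, depth + 1))
    return rows

root = ET.fromstring(DOC)
tree_outline = outline(root)
for depth, tag in tree_outline:
    print("  " * depth + tag)
```

Each pair records how deep an element sits in the tree, so the nesting of start and end tags in the text maps directly onto the depth of the corresponding node.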
XML accommodates an unlimited number of schemas. A schema is a set of rules for constraining the structure and articulating the information set of XML documents. A schema describes which data structures, shapes, and content are valid in XML documents for a given application. For example, one schema might describe how documents used in an online banking exchange are structured. Other schemas may describe XML documents for email, or XML documents for purchasing blue jeans or music over the Internet.
To illustrate a tree structure constructed from XML data, consider an exemplary XML data exchange between different entities, such as client and server computers, in the form of requests and responses. A client might generate a request for information or a request for a certain server action, and a server might generate a response to the client that contains the information or confirms whether the certain action has been performed. The contents of these requests and responses are XML documents. In many cases, the process of generating these XML documents involves building, in memory, a hierarchical tree structure. Once the hierarchical tree structure is built in its entirety, the actual XML document in proper syntactic form can then be assembled. Consider the following exemplary XML code:

    <trans:orders xmlns:person="http://www.schemas.org/people"
                  xmlns:dsig="http://dsig.org"
                  xmlns:trans="http://www.schemas.org/transactions">
        <trans:order>
            <trans:sold-to>
                <person:name>
                    <person:last-name>Layman</person:last-name>
                    <person:first-name>Andrew</person:first-name>
                </person:name>
            </trans:sold-to>
            <trans:sold-on>1997-03-17</trans:sold-on>
            <dsig:digital-signature>1234567890</dsig:digital-signature>
        </trans:order>
    </trans:orders>
This code includes three XML namespace declarations that are each designated with “xmlns”. A “namespace” refers to a dictionary or set of element names defined by the schema. Namespaces ensure that element names do not conflict, and clarify who defined which term. They do not give instructions on how to process the elements. Readers still need to know what the elements mean and decide how to process them. Namespaces simply keep the names straight.
Within an XML document, namespace declarations occur as attributes of start tags. Namespace declarations are of the form “xmlns:[prefix]=[uri]”. A namespace declaration indicates that the XML document contains element names that are defined within a specified namespace or schema. The “prefix” is an arbitrary designation that will be used later in the XML document as an indication that an element name is a member of the namespace identified by the uniform resource identifier (URI) “uri”. The prefix is valid only within the context of the specific XML document. The URI is either a path to a document describing a specific namespace or schema or a globally unique identifier of a specific namespace or schema. The URI is valid across all XML documents. Namespace declarations are “inherited”, which means that a namespace declaration applies to the element in which it was declared as well as to all elements contained within that element.
With reference to the above XML code, the namespace declarations include the prefixes “person”, “dsig”, and “trans”, and the expanded namespaces to which the prefixes refer, namely “http://www.schemas.org/people”, “http://dsig.org”, and “http://www.schemas.org/transactions” respectively. This code tells any reader that if an element name begins with “dsig” its meaning is defined by whoever owns the “http://dsig.org” namespace. Similarly, elements beginning with the “person” prefix have meanings defined by the “http://www.schemas.org/people” namespace and elements beginning with the “trans” prefix have meanings defined by the “http://www.schemas.org/transactions” namespace.
It is noted that another XML document that incorporated elements from any of the namespaces included in this sample might declare prefixes that are different from those used in this example. As noted earlier, prefixes are arbitrarily defined by the document author and have meaning only within the context of the specific element of the specific document in which they are declared.
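As a minimal sketch of this point, the following Python fragment parses two documents that use different prefixes for the same namespace and compares the resulting element names. It relies on the standard-library xml.etree.ElementTree module, which reports each element name in the expanded form “{uri}local-name”. The second document, the prefix “p”, and the helper name `expanded_names` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

PEOPLE = "http://www.schemas.org/people"

# Same namespace URI, two different prefixes chosen by two document authors.
DOC_A = (f'<person:name xmlns:person="{PEOPLE}">'
         '<person:last-name>Layman</person:last-name></person:name>')
DOC_B = (f'<p:name xmlns:p="{PEOPLE}">'
         '<p:last-name>Layman</p:last-name></p:name>')

def expanded_names(xml_text):
    """List every element name in ElementTree's '{uri}local' form."""
    return [el.tag for el in ET.fromstring(xml_text).iter()]

names_a = expanded_names(DOC_A)
names_b = expanded_names(DOC_B)
print(names_a)
print(names_a == names_b)
```

The child element in each document carries no declaration of its own. It resolves its prefix through the declaration inherited from the enclosing element, and both documents yield identical expanded names despite the differing prefixes.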
FIG. 1 shows a hierarchical tree structure 18 that represents the structure of the above XML code. The tree nodes correspond to elements parsed from the XML document. Such a structure is typically constructed in memory, with each node containing all data necessary for the start and end tags of that node. It has been typical in the past to build the entire tree structure before generating the XML document itself.
In XML 1.0, data types in the schemas are defined using document type definitions (DTDs). XML documents have two kinds of constraints: well-formedness and validity. The “well-formedness” constraints are those imposed by the definition of XML itself (such as the rules for the use of the < and > characters and the rules for proper nesting of elements). The “validity” constraints are constraints on document structure provided by a particular DTD or XML-Data schema. Schema or DTD validation is very useful in the Internet realm, because entities are able to validate whether data structures received from unknown or anonymous sources are appropriate for a given context. Suppose, for example, that a company receives XML data from some unknown user. The company does not necessarily trust the data at this point, and hence utilizes a validation process to determine whether the XML data is good or whether it is noise that can be rejected outright or sent to a system administrator for special consideration.
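The distinction between the two kinds of constraints can be sketched as follows. The standard-library xml.etree.ElementTree parser checks well-formedness only and does not validate against a DTD, so the validity check here uses a deliberately simplified, hypothetical content model (`CONTENT_MODEL`) as a stand-in for a DTD or schema. The function names and sample documents are likewise assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical content model: element name -> exact ordered list of child names.
CONTENT_MODEL = {"Book": ["author", "title"], "author": [], "title": []}

def is_well_formed(xml_text):
    """True if the text obeys XML syntax rules (tag nesting, delimiters)."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

def is_valid(xml_text):
    """True if well-formed AND every element's children match CONTENT_MODEL."""
    if not is_well_formed(xml_text):
        return False
    for el in ET.fromstring(xml_text).iter():
        if el.tag not in CONTENT_MODEL:
            return False
        if [child.tag for child in el] != CONTENT_MODEL[el.tag]:
            return False
    return True

good = "<Book><author>X</author><title>Y</title></Book>"
misnested = "<Book><author>X</title></author></Book>"   # violates nesting rules
wrong_shape = "<Book><title>Y</title></Book>"           # well-formed, author missing

for doc in (good, misnested, wrong_shape):
    print(is_well_formed(doc), is_valid(doc))
```

A document such as `wrong_shape` passes the well-formedness check yet fails validation, which is the case a receiving party would want to reject or route for special handling.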
FIG. 2 shows current software architecture 20 for processing XML documents. The architecture 20 includes an XML parser 22 that receives and parses XML data. The XML data may arrive in a variety of ways, including as a stream, a URL (uniform resource locator), or text. Parsing the XML data results in a list of events. For example, suppose the XML data describes an author and title name for a book, as follows:

    <Book>
        <author>X</author>
        <title>Y</title>
    </Book>
The parser 22 parses the XML data and returns the following list:

    1. “Book” element
    2. BeginChildren
    3. “author” element
    4. BeginChildren
    5. “X”, text
    6. EndChildren
    7. “title” element
    8. BeginChildren
    9. “Y”, text
    10. EndChildren
    11. EndChildren
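A list of the same shape can be produced with an event-driven parser. The following sketch uses Python's standard-library xml.sax module as a stand-in for the parser 22. The class name `EventLister` and the mapping of SAX callbacks onto the BeginChildren and EndChildren events are assumptions made for illustration, not a description of the parser 22 itself.

```python
import xml.sax

class EventLister(xml.sax.ContentHandler):
    """Records parser callbacks as a flat, ordered list of events."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(f'"{name}" element')
        self.events.append("BeginChildren")

    def characters(self, content):
        if content.strip():  # skip whitespace-only text between tags
            self.events.append(f'"{content}", text')

    def endElement(self, name):
        self.events.append("EndChildren")

XML_DATA = b"<Book><author>X</author><title>Y</title></Book>"
handler = EventLister()
xml.sax.parseString(XML_DATA, handler)

for number, event in enumerate(handler.events, start=1):
    print(f"{number}. {event}")
```

The handler never holds more than the current event, which is what allows downstream components to consume the document as a sequential stream.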
As the parser 22 parses the XML data, it calls to one or more node factories. A “node factory” is a callback interface that builds node objects used to construct an in-memory tree representation of the XML document. The node factory may also be used to search the XML document, without building a node object. Custom node factories can be constructed to build different kinds of object hierarchies that reflect the XML document.
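The callback arrangement can be sketched with the `target` parameter of Python's xml.etree.ElementTree.XMLParser, which invokes `start`, `data`, `end`, and `close` methods on an object supplied by the caller. This is an analogy to the node factory interface, not the interface itself. The two classes below, `DictTreeFactory` and `TitleFinder`, and the helper `run` are hypothetical names: the first builds an in-memory tree of plain dictionaries, and the second searches the document without building any node objects.

```python
import xml.etree.ElementTree as ET

class DictTreeFactory:
    """Builds a tree of plain dict nodes from parser callbacks."""

    def __init__(self):
        self.root = None
        self.stack = []  # currently open elements, innermost last

    def start(self, tag, attrib):
        node = {"name": tag, "text": "", "children": []}
        if self.stack:
            self.stack[-1]["children"].append(node)
        else:
            self.root = node
        self.stack.append(node)

    def data(self, text):
        if self.stack:
            self.stack[-1]["text"] += text

    def end(self, tag):
        self.stack.pop()

    def close(self):
        return self.root

class TitleFinder:
    """Collects the text of <title> elements without building nodes."""

    def __init__(self):
        self.in_title = False
        self.found = []

    def start(self, tag, attrib):
        self.in_title = (tag == "title")

    def data(self, text):
        if self.in_title:
            self.found.append(text)

    def end(self, tag):
        self.in_title = False

    def close(self):
        return "".join(self.found)

def run(factory, xml_text):
    """Feed xml_text through a parser that calls back into factory."""
    parser = ET.XMLParser(target=factory)
    parser.feed(xml_text)
    return parser.close()  # returns whatever factory.close() returns

XML_DATA = "<Book><author>X</author><title>Y</title></Book>"
book = run(DictTreeFactory(), XML_DATA)
title = run(TitleFinder(), XML_DATA)
print(book["name"], [c["name"] for c in book["children"]], title)
```

Because the parser depends only on the callback methods, swapping one factory for another changes the kind of object hierarchy produced without any change to the parsing code.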
In architecture 20, there are four node factories: a namespace node factory 24, a DTD node factory 26, a tree builder node factory 28, and a validation node factory 30. The XML parser 22 calls the namespace node factory 24, which outputs a sequence of name tokens. DTD events are passed to the DTD node factory 26 and XML data events are passed to the validation node factory 30. The DTD node factory 26 builds DTD objects 32 from the DTD events. The DTD objects 32 are used in the process of validating the XML data. The DTD node factory 26 may also delegate to the tree builder node factory 28, which builds XML DOM (Document Object Model) fragments 34 for pieces of the tree structure, or XML DOM.
The validation node factory 30 receives the XML data events from the namespace node factory 24 and uses the DTD objects 32 to evaluate whether the data complies with certain constraints defined by the DTD objects. If the XML data is valid, the tree builder node factory 28 builds a complete XML DOM 36 for the XML data. Some elements of the XML DOM 36 may reference fragments 34.
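The delegation between a validating stage and a tree-building stage can be sketched as a chain of objects that share one event interface. Everything in this fragment is a simplifying assumption: the `begin` and `end` method names, the `DTD_OBJECTS` table standing in for the DTD objects 32, and the choice to forward each event as soon as it passes its check rather than after the whole document has been validated.

```python
# Hypothetical stand-in for DTD objects: allowed child names per element.
DTD_OBJECTS = {"Book": {"author", "title"}, "author": set(), "title": set()}

class TreeBuilderFactory:
    """Final stage: builds (name, children) tuples from begin/end events."""

    def __init__(self):
        self.root = None
        self.stack = []

    def begin(self, name):
        node = (name, [])
        if self.stack:
            self.stack[-1][1].append(node)
        else:
            self.root = node
        self.stack.append(node)

    def end(self, name):
        self.stack.pop()

class ValidationFactory:
    """Checks each event against DTD_OBJECTS, then delegates downstream."""

    def __init__(self, next_factory):
        self.next = next_factory
        self.open = []  # names of currently open elements

    def begin(self, name):
        if name not in DTD_OBJECTS:
            raise ValueError(f"undeclared element: {name}")
        if self.open and name not in DTD_OBJECTS[self.open[-1]]:
            raise ValueError(f"{name} not allowed inside {self.open[-1]}")
        self.open.append(name)
        self.next.begin(name)

    def end(self, name):
        self.open.pop()
        self.next.end(name)

def drive(events, factory):
    """Replay (kind, name) events into the first factory of a chain."""
    for kind, name in events:
        getattr(factory, kind)(name)

EVENTS = [("begin", "Book"), ("begin", "author"), ("end", "author"),
          ("begin", "title"), ("end", "title"), ("end", "Book")]
builder = TreeBuilderFactory()
drive(EVENTS, ValidationFactory(builder))
print(builder.root)
```

An event stream containing an element that the table does not declare raises an error at the validating stage, so the tree builder never receives it.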
The architecture 20 is configured for DTD-specific considerations. DTD objects have an advantage in that they are known and previously defined. DTD objects also have a drawback, however, in that they are not extensible. Thus, new data type definitions cannot be easily created. Due to this problem, more recent evolutions of XML are beginning to use XML-data schemas as an additional or alternative way to define data types in schemas. XML-data schemas are not restricted like DTD objects, but are extensible and new ones can be created as needed.
Thus, there is a desire to adapt the architecture of FIG. 2 to handle the more extensible XML-data schemas in addition to DTD objects. One problem with this adaptation is that the node factory interface provides a sequential ordered stream of XML tokens, whereas the XML-data schemas define items in a way that is order independent. This means that the node factory has to store certain states until it knows it can process those states.
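The need to hold state can be illustrated with a small sketch in which an element declaration refers to a type that has not yet appeared in the token stream. The class `SchemaFactory`, its method names, and the sample type “date” are hypothetical and serve only to show one way a factory might defer work until a later token makes it possible.

```python
class SchemaFactory:
    """Resolves element-to-type references that may arrive in any order."""

    def __init__(self):
        self.types = {}     # type name -> definition seen so far
        self.resolved = {}  # element name -> definition
        self.pending = {}   # type name -> element names waiting on it

    def declare_type(self, type_name, definition):
        self.types[type_name] = definition
        # Release every element that was waiting for this type.
        for element in self.pending.pop(type_name, []):
            self.resolved[element] = definition

    def declare_element(self, element, type_name):
        if type_name in self.types:
            self.resolved[element] = self.types[type_name]
        else:
            # Type not seen yet: store the state until it can be processed.
            self.pending.setdefault(type_name, []).append(element)

    def unresolved(self):
        return sorted(e for waiting in self.pending.values() for e in waiting)

factory = SchemaFactory()
factory.declare_element("sold-on", "date")   # reference arrives first
waiting_before = factory.unresolved()
factory.declare_type("date", "ISO 8601 date")  # definition arrives later
waiting_after = factory.unresolved()
print(waiting_before, waiting_after, factory.resolved)
```

The pending table grows with the number of forward references, which is the bookkeeping cost that a sequential interface imposes on order-independent schema definitions.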
Accordingly, there is a need for an improved architecture built around the node factory design, which handles XML-data schemas to build in-memory tree representations and DTD objects for validation purposes. More particularly, the improved architecture should leverage existing components (e.g., DTD validation, namespace node factory, and XML parser) for creating an in-memory representation of the schema and be roughly as fast as the existing architecture. The architecture should also maximize code reuse.