The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.
The Extensible Markup Language (XML) is a standard for data and documents that has found wide acceptance in the computer industry. XML describes and provides structure to a body of data, such as a file or data packet. The XML standard provides for tags that delimit sections of an XML entity; these sections are referred to as XML elements. The following XML document A is provided to illustrate XML.
XML document A: <a c="foo"><b>5</b>  <d>10</d> </a>
XML elements are delimited by a start tag and a corresponding end tag. For example, XML document A contains the start tag <b> and the end tag </b> to delimit an element. The data between the start tag and the end tag is referred to as the element's content.
An element has a name and is referred to herein by its name. The name of the element delimited by the start tag <b> and the end tag </b> is b, and the element is thus referred to herein as element b or just b.
An element's content may include the element's value, one or more attributes, and one or more other elements. An element that contains other elements is referred to as a complex element. Element a is a complex element and contains two elements, b and d. An element that is contained by another element is referred to as a descendant of that element. Thus, elements b and d are descendants of element a. An element's attributes are also referred to as being contained by the element. An element that contains no other elements is referred to as a simple or leaf element.
An attribute is a name-value pair. Element a has attribute c, which has the value ‘foo’.
Element b has the value 5 and element d has the value 10. Element a does not have a value.
By defining elements that contain attributes and descendant elements, an XML document defines a hierarchical tree relationship between the elements, descendant elements, and attributes of the elements.
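The terminology above can be illustrated with a minimal sketch that parses XML document A using Python's standard xml.etree.ElementTree module. The variable names (DOC_A, children, values, is_leaf) are chosen here for illustration only and do not come from the description.

```python
import xml.etree.ElementTree as ET

# XML document A from the text, written with straight quotes so it parses.
DOC_A = '<a c="foo"><b>5</b>  <d>10</d> </a>'

a = ET.fromstring(DOC_A)

# Element a is a complex element: it has an attribute and contains b and d.
print(a.tag)     # name of the element
print(a.attrib)  # attributes as name-value pairs

# Names of the elements directly contained by a.
children = [child.tag for child in a]

# Values of the contained elements, keyed by element name.
values = {child.tag: child.text for child in a}

# An element that contains no other elements is a simple (leaf) element.
is_leaf = {el.tag: len(el) == 0 for el in a.iter()}

print(children, values, is_leaf)
```

In this sketch, element a reports no text value of its own, which matches the statement that element a does not have a value.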
Node Tree Model
XML documents are represented as a hierarchy of nodes that reflects the XML document's hierarchical nature. A hierarchy of nodes is composed of nodes at multiple levels. The nodes at each level are each linked to one or more nodes at a different level. Each node at a level below the top level is a child node of one or more of the parent nodes at the level above. Nodes at the same level are sibling nodes.
In a tree hierarchy or node tree, each child node has only one parent node, but a parent node may have multiple child nodes. A node that has no parent node linked to it is the root node, and a node that has no child nodes linked to it is a leaf node. A tree hierarchy has a single root node. In a node tree that represents an XML document, a node can correspond to an element, and each child node of that node corresponds to an attribute or another element contained in the element.
For convenience of expression, an element or attribute of an XML document is referred to as the node that corresponds to that element or attribute within the node tree that represents the XML document. Thus, referring to 5 as the value of node b is just a way of expressing that the value of the element b is 5.
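As a rough sketch of the node tree model, the following builds a tree for XML document A in which both attributes and contained elements become child nodes of the containing element's node. The Node class and the build_tree and leaves helpers are hypothetical names introduced for illustration; they are not part of any approach described herein.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Node:
    kind: str                      # "element" or "attribute"
    name: str
    value: Optional[str] = None
    parent: Optional["Node"] = field(default=None, repr=False)
    children: List["Node"] = field(default_factory=list)


def build_tree(elem, parent=None):
    """Convert a parsed element into a Node; attributes become child nodes."""
    text = (elem.text or "").strip() or None
    node = Node("element", elem.tag, text, parent)
    for name, value in elem.attrib.items():
        node.children.append(Node("attribute", name, value, node))
    for child in elem:
        node.children.append(build_tree(child, node))
    return node


def leaves(node):
    """Return the names of all nodes below (or at) node that have no children."""
    if not node.children:
        return [node.name]
    return [name for child in node.children for name in leaves(child)]


root = build_tree(ET.fromstring('<a c="foo"><b>5</b>  <d>10</d> </a>'))

print(root.name, root.parent)              # the root node has no parent
print([c.name for c in root.children])     # sibling nodes under a
print(leaves(root))                        # leaf nodes of the tree
```

Here node a is the root, and nodes c, b, and d are siblings that each have node a as their single parent.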
XML Schemas
Information about the structure of specific types of XML documents may be specified in documents referred to as “XML schemas”. For example, the XML schema for a particular type of XML document contains declarations that specify the names for the elements contained in that type of XML document, the hierarchical relationship between the elements contained in that type of XML document, and the data type of values contained in that particular type of XML document. Standards governing XML schemas include: XML Schema, Part 0, Part 1, Part 2, W3C Recommendation, 2 May 2001, the contents of which are incorporated herein by reference; XML Schema Part 1: Structures, Second Edition, W3C Recommendation 28 Oct. 2004, the contents of which are incorporated herein by reference; XML Schema 1.1 Part 2: Datatypes, W3C Working Draft 17 Feb. 2006, the contents of which are incorporated herein by reference; and XML Schema Part 2: Datatypes Second Edition, W3C Recommendation 28 Oct. 2004, the contents of which are incorporated herein by reference. XML schemas as described in this document are not restricted to W3C XML Schemas but include any other mechanisms for describing the structural and/or typing information of XML documents, for example, RELAX NG.
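To make the three kinds of declarations concrete (element names, hierarchical relationships, and data types of values), the following is a toy sketch only. TOY_SCHEMA is a hypothetical Python dictionary standing in for a schema, and conforms is a hypothetical checker; neither is W3C XML Schema, RELAX NG, or the approach described herein.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy schema for documents shaped like XML document A.
# Each entry declares an element name, the child elements it contains
# (in order), the attribute names it may carry, and the type of its value.
TOY_SCHEMA = {
    "a": {"children": ["b", "d"], "attributes": ["c"], "type": None},
    "b": {"children": [], "attributes": [], "type": int},
    "d": {"children": [], "attributes": [], "type": int},
}


def conforms(elem, schema=TOY_SCHEMA):
    """Return True if elem and its descendants match the toy declarations."""
    decl = schema.get(elem.tag)
    if decl is None:
        return False                      # undeclared element name
    if [child.tag for child in elem] != decl["children"]:
        return False                      # wrong hierarchical relationship
    if set(elem.attrib) - set(decl["attributes"]):
        return False                      # undeclared attribute
    if decl["type"] is not None:
        try:
            decl["type"]((elem.text or "").strip())
        except ValueError:
            return False                  # value does not have the declared type
    return all(conforms(child, schema) for child in elem)


print(conforms(ET.fromstring('<a c="foo"><b>5</b><d>10</d></a>')))
print(conforms(ET.fromstring('<a c="foo"><b>five</b><d>10</d></a>')))
```

A real schema language expresses far more than this (optional and repeated elements, namespaces, derived types); the sketch only shows the kind of information a schema carries.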
Often, for large bodies of XML documents, no XML schema document has been developed or engineered by developers. Described herein are approaches for automatically determining an XML schema to which a collection of XML documents may conform to varying degrees.