XML is short for eXtensible Markup Language, a markup language developed by the World Wide Web Consortium (W3C). Both XML and HTML are derived from SGML (Standard Generalized Markup Language), which is widely used for large documentation projects and is the standard for defining document structure. XML is a simplified subset of SGML. XML is “extensible” because, unlike HTML, XML markup symbols are unlimited and self-defining.
HTML is widely used to display web pages on the Internet, although HTML can also be used for documentation purposes and need not be rendered in a browser. HTML describes the content of a web page (mainly text and graphic images) only in terms of how it is to be displayed and interacted with. For example, in HTML the letter “p” placed within markup tags (“<p>”) informs the browser that the text that follows should be displayed as a new paragraph. The content to be displayed as the new paragraph is delimited by “</p>”, which signals the end of the paragraph. Thus, in HTML, the content and the tags that control its presentation are intermingled. Hence it is cumbersome to author an HTML document that displays the first word of every sentence in bold, because a tag indicating “start bolding” would have to be inserted before the first word of every sentence and a tag indicating “stop bolding” would have to be inserted after it.
XML is conceptually related to HTML and is an HTML-like formatting language, but has more functionality than HTML. Like HTML, XML makes use of tags and attributes. But while HTML specifies what each tag and attribute means, and often, how the text between them will look in a browser, XML uses the tags only to delimit pieces of data, and leaves the interpretation of the data to the application that processes the XML file. Thus, a “<p>” in an XML file may be a price, a parameter, a person, an order number, etc. For example, “<p>” could indicate that the data that followed it was a telephone number. If the XML file were processed purely as data by a program, perhaps the telephone number would be dialed. If the XML file were stored with similar data on another computer, the phone number might be stored. If, like an HTML file, the XML file were displayed, perhaps the phone number would be displayed. Hence, XML allows designers to create their own customized tags, thus expanding the amount and kinds of information that can be provided about the data held in files and enabling the definition, transmission, validation, and interpretation of formatted data between applications and between organizations.
The rules for XML files are strict. A forgotten tag or an attribute value without quotes makes an XML file malformed (not well-formed) and unusable, while in HTML such a practice is tolerated and is often explicitly allowed. The official W3C XML specification prohibits applications from trying to guess what the creator of a malformed XML file meant to do. If the file is malformed, an application processing the file has to stop and report an error. Thus, it is helpful to validate an XML file before using it, and it is especially helpful to have an automated tool to do the validating. It is even more helpful to be able to define a valid structure of an XML file so that the automated validation tool would be able to either verify that a file is correct or list the mistakes that were found in the XML file. Such an enabling file structure definition is called a “schema”.
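The stop-and-report behavior can be illustrated with a short Python sketch. It assumes the standard library's xml.etree.ElementTree parser; the helper name check_well_formed and the sample strings are hypothetical and are not taken from any specification.

```python
import xml.etree.ElementTree as ET


def check_well_formed(text):
    """Return None if the text parses as XML, else the parser's error message."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        # The parser does not guess at the author's intent; it stops here.
        return str(err)
    return None


# Hypothetical samples: one correct, one with a forgotten end tag,
# and one with an attribute value that is not quoted.
good = '<item id="7"><p>555-0100</p></item>'
missing_end_tag = '<item id="7"><p>555-0100</item>'
unquoted_attribute = '<item id=7><p>555-0100</p></item>'

for sample in (good, missing_end_tag, unquoted_attribute):
    problem = check_well_formed(sample)
    print("ok" if problem is None else "rejected: " + problem)
```

An HTML browser would typically render all three samples in some fashion; the XML parser accepts only the first.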
“Schema” is a term borrowed from the database world where it is used to describe the structure of data in relational tables. In the context of XML, a schema describes a model for a class of files. For example, an XML schema can describe the possible arrangement of tags and text in a valid document.
In schemas, models are described in terms of constraints. A constraint defines what can appear in any given context. A content model constraint describes which elements may appear, in what order, and how many times. A datatype constraint describes valid units of data.
For example, a schema might describe a valid <address> with the content model constraint that it consist of a <name> element, followed by one or more <street> elements, followed by exactly one <city>, <state>, and <zip> element. The content of a <zip> might have a further datatype constraint that it consist of either a sequence of exactly five digits or a sequence of five digits, followed by a hyphen, followed by a sequence of exactly four digits. No other text is a valid ZIP code.
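The datatype constraint on the ZIP code can be written as a single regular expression. The following Python sketch is one possible rendering; the names ZIP_PATTERN and is_valid_zip are illustrative assumptions.

```python
import re

# Exactly five digits, optionally followed by a hyphen and exactly four digits.
# [0-9] is used instead of \d so that only ASCII digits are accepted.
ZIP_PATTERN = re.compile(r"[0-9]{5}(-[0-9]{4})?")


def is_valid_zip(text):
    """True only if the whole string has one of the two permitted forms."""
    return ZIP_PATTERN.fullmatch(text) is not None


for candidate in ("02139", "02139-4307", "0213", "02139-43", "blue"):
    print(candidate, is_valid_zip(candidate))
```

Using fullmatch rather than search matters here: a string that merely contains five digits somewhere inside it is still rejected.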
A schema enables machine validation of document structure. Every specific, individual file that does not violate any of the constraints of the schema is, by definition, valid according to that schema. For example, using the schema described above, a parser (validation tool) would be able to detect that the following address is not valid:
<address><name>John J. Jones</name><street>256 Eight Bit Lane</street><city>East Yabip</city><state>MA</state><state>CT</state><zip>blue</zip></address>
The address above violates two constraints of the schema: it does not contain exactly one <state> and the ZIP code is not of the proper form. Therefore, the parser is able to flag the above address as invalid with respect to the <state> and <zip>.
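A validating parser driven by a real schema would perform these checks itself. As a rough illustration only, the two constraints can be hand-coded in Python with the standard library. The names CONTENT_MODEL, ZIP_PATTERN, and validate_address are hypothetical, and the sample string follows the address example with the street tags written consistently in lower case so that the text parses.

```python
import re
import xml.etree.ElementTree as ET

# Content model as a regular expression over the sequence of child tag names:
# one name, one or more street, then exactly one city, state, and zip.
CONTENT_MODEL = re.compile(r"name (street )+city state zip ")
# Datatype constraint for the zip element.
ZIP_PATTERN = re.compile(r"[0-9]{5}(-[0-9]{4})?")


def validate_address(xml_text):
    """Return a list of constraint violations; an empty list means valid."""
    address = ET.fromstring(xml_text)
    errors = []
    if address.tag != "address":
        errors.append("root element is not <address>")
    child_tags = "".join(child.tag + " " for child in address)
    if not CONTENT_MODEL.fullmatch(child_tags):
        errors.append("content model violated by: " + child_tags.strip())
    for zip_element in address.findall("zip"):
        if not ZIP_PATTERN.fullmatch(zip_element.text or ""):
            errors.append("bad zip datatype: %r" % zip_element.text)
    return errors


invalid_address = (
    "<address><name>John J. Jones</name><street>256 Eight Bit Lane</street>"
    "<city>East Yabip</city><state>MA</state><state>CT</state>"
    "<zip>blue</zip></address>"
)
for message in validate_address(invalid_address):
    print(message)
```

The sketch reports one content model violation (the repeated state element) and one datatype violation (the zip content), matching the two problems identified in the text.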
There are many excellent reasons to validate an XML file, for example:
to determine that a purchase order received from a customer is not missing anything and doesn't have anything extra, and that everything the purchase order has is the right datatype (e.g., quantities are all positive numbers, prices are all decimal numbers with two digits after the decimal point, etc.).
to determine that information received from one corporate database is valid before the received data is converted and inserted into the target database. Invalid transactions should be rejected immediately so that the target database is not corrupted.
to verify that the XML file that will control an overnight batch process will be understood by the processor so that 2:00 am telephone calls can be avoided.
to verify that an XML stylesheet will correctly present each of 1000 XML documents being published on a CD-ROM without proofing each document manually.
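The purchase order case can be sketched in Python as follows. The element names purchaseOrder, item, quantity, and price, the helper check_purchase_order, and the sample data are all assumptions made for illustration, as is the decision to treat quantities as whole numbers; in practice a schema and a validating parser would express these rules declaratively.

```python
import re
import xml.etree.ElementTree as ET

QUANTITY_PATTERN = re.compile(r"[0-9]+")          # unsigned whole number
PRICE_PATTERN = re.compile(r"[0-9]+\.[0-9]{2}")   # two digits after the point


def check_purchase_order(xml_text):
    """Return a list of datatype problems found in the order's items."""
    order = ET.fromstring(xml_text)
    problems = []
    for index, item in enumerate(order.findall("item"), start=1):
        quantity = (item.findtext("quantity") or "").strip()
        price = (item.findtext("price") or "").strip()
        # int() is only reached when the pattern has already matched.
        if not QUANTITY_PATTERN.fullmatch(quantity) or int(quantity) <= 0:
            problems.append("item %d: quantity is not a positive number" % index)
        if not PRICE_PATTERN.fullmatch(price):
            problems.append("item %d: price lacks two decimal digits" % index)
    return problems


sample_order = """
<purchaseOrder>
  <item><quantity>10</quantity><price>4.25</price></item>
  <item><quantity>-3</quantity><price>4.5</price></item>
</purchaseOrder>
"""
for problem in check_purchase_order(sample_order):
    print(problem)
```

Only the second item is reported, once for its quantity and once for its price, so the order could be rejected before it reaches a target database.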
Using a schema and a validating parser offers one way to check XML files. (It is understood that even the most advanced validating parser can fail to detect some kinds of errors. Valid files can still contain the wrong content, e.g., a purchase order may ask for a hundred boxes of staples when only ten were actually wanted.) One way to think of a schema is that it is a contract between a producer of information and a consumer of information. The contract is enforced through validation of a particular document against the schema.
One way to define schemas is through the XML Schema Definition (XSD) language. XSD enables the definition of the structure and data types of XML files. A schema (i.e., an XSD schema) defines elements, attributes, and data types in conformance with the W3C XML Schema Part 1: Structures and XML Schema Part 2: Datatypes specifications. This reference is based on the W3C Apr. 4, 2001 Proposed Recommendation for Datatypes and W3C Mar. 30, 2001 Proposed Recommendation for Structures.
XML schemas as defined by the W3C standard can define a rich set of datatypes including booleans, dates, times, URIs (Uniform Resource Identifiers), integers, decimal numbers, real numbers, currencies and intervals of time. In addition to these simple, predefined types, other types, including aggregate types and user-defined types, can be defined. For example, a user could define a “PostalAddress” datatype and then define two elements, “ShippingAddress” and “BillingAddress”, to be of that type. Attribute grouping enables the grouping of several attributes that apply to a number of elements. Substitution groups enable different flavors of elements to be substituted based on features of the data content and express the relationship between similar kinds of elements. Substitution groups are typically used when one of several different elements would be appropriate to use in a given context. For example, a purchase order might permit an “address” to be used, but not necessarily specify what type of address should be used in a particular document. The definition of an “address” substitution group with the elements “USAddress” and “CanadianAddress” as members of that substitution group is a way of indicating that an “address” must either be a valid “USAddress” or a valid “CanadianAddress”. Substitution groups also provide an easy way to add new members, such as “UKAddress”. Substitution groups facilitate the modification of XML files over time and are analogous to the idea of “inheritance” in object-oriented programming, but are applied to data only.
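A small XSD fragment makes the “PostalAddress” type and the “address” substitution group concrete. The Python sketch below embeds such a fragment as a string and reads it with the standard library, which can parse the schema text but does not itself perform XSD validation. The child elements of PostalAddress, the choice to give every element the same type, and the helper substitution_group_members are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"  # the XML Schema namespace

# Illustrative schema fragment: one user-defined type, two elements of that
# type, and an abstract head element with two substitution group members.
SCHEMA_TEXT = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:complexType name="PostalAddress">
    <xs:sequence>
      <xs:element name="name" type="xs:string"/>
      <xs:element name="street" type="xs:string" maxOccurs="unbounded"/>
      <xs:element name="city" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>
  <xs:element name="ShippingAddress" type="PostalAddress"/>
  <xs:element name="BillingAddress" type="PostalAddress"/>
  <xs:element name="address" type="PostalAddress" abstract="true"/>
  <xs:element name="USAddress" type="PostalAddress"
              substitutionGroup="address"/>
  <xs:element name="CanadianAddress" type="PostalAddress"
              substitutionGroup="address"/>
</xs:schema>
"""

schema = ET.fromstring(SCHEMA_TEXT)


def substitution_group_members(head_name):
    """Names of top-level elements declared as substitutable for head_name."""
    return [
        element.get("name")
        for element in schema.findall("{%s}element" % XS)
        if element.get("substitutionGroup") == head_name
    ]


print(substitution_group_members("address"))
```

Adding a “UKAddress” member would take one more xs:element declaration with substitutionGroup="address"; nothing that refers to “address” would need to change.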
Because the available datatypes and data structures are so rich, schemas can be extremely complex, running to thousands of lines. To complicate matters, applications typically are not static. In many cases, as applications and user requirements evolve, it is necessary to make global changes to complex schemas according to a set of predefined rules, which is a time-consuming, difficult, and repetitive task. Hence, maintaining these complex schemas can become a task of enormous proportions with correspondingly enormous potential for the introduction of errors.
Thus it would be very helpful to have a way to describe certain attributes and element definitions external to the schema and to automatically generate an enhanced schema from a simpler input schema.