1. Field of the Invention
The present invention relates to computer software, and deals more particularly with techniques for applying dynamically-variable abstraction levels when parsing and validating structured documents according to a schema (which may have been extended).
2. Description of the Related Art
The popularity of distributed computing networks and network computing has increased tremendously in recent years, due in large part to growing business and consumer use of the public Internet and the subset thereof known as the “World Wide Web” (or simply “Web”). Other types of distributed computing networks, such as corporate intranets and extranets, are also increasingly popular. As solutions providers focus on delivering improved Web-based computing, many of the solutions which are developed are adaptable to other distributed computing environments. Thus, references herein to the Internet and Web are for purposes of illustration and not of limitation.
Use of structured documents encoded in a structured markup language has become increasingly prevalent in recent years as a means for exchanging information between computers in distributed computing networks. In addition, many of today's software products are written to produce and consume information which is represented using these types of structured documents. The Extensible Markup Language, or “XML”, for example, is a markup language which has proven to be extremely popular for encoding structured documents for exchange between parties (and also for describing structured data). XML is very well suited for encoding objects and document content covering a broad spectrum, and has become the standard means of providing a technology-independent representation. XML has also been used as a foundation for many other derivative markup languages, such as the Wireless Markup Language (“WML”), VoiceXML, MathML, and so forth (as is well known in the art). Encoding objects and other document content in XML (or a similar markup language) facilitates exchanging information between disparate systems. (Hereinafter, references to objects represented with markup language encoding in structured documents should also be construed as including document content that may be rendered in object form.)
For the early uses of structured documents, and in particular for XML version 1.0, a Document Type Definition (“DTD”) was used for specifying the grammar for a particular structured document (or set of documents). That is, a DTD was used to specify the set of allowable markup tags, where this set indicates the permissible elements and attributes to be used in the document(s). In more recent years, a “schema” is commonly used instead of a DTD. A schema contains information similar to that in a DTD, but is much more functionally rich, and attempts to specify more requirements for the structured documents which adhere or conform to it. As stated by the World Wide Web Consortium (“W3C”) on its “XML Schema” Web page, “XML Schemas express shared vocabularies and allow machines to carry out rules made by people. They provide a means for defining the structure, content and semantics of XML documents.” Use of schemas for structured languages is well known in the art.
A schema may be defined within a single file or document, or it may be defined using a collection of documents that are linked together using syntactical elements of the schema notation. The definition within a schema may be extended using a separate document, for example, to provide consumer-specific refinements. The original schema then serves as a base, and the extensions are applied as refinements to that base. In this approach, the base definition is known to each consumer, but each extension is typically known only by its specific consumer. The use of schema extensions in this manner will now be described with reference to several examples. (More details on schema extensions may be found at the W3C web site or in a number of readily-available documents that describe the schema notation.)
FIG. 1 depicts a base schema 100, which specifies that a valid “person” element in a structured document contains child elements (i.e., nested elements) for the person's name and address and may optionally contain attributes for the person's height and weight. That is, the schema 100 defines a “person” element as being of type “personType” (see 110), and personType is then defined at 120 as being a complex type. The elements of this complex type are a “name” element 130 and an “address” element 140, both of which are specified as required (by setting minOccurs and maxOccurs both to “1”, in this example). The optional “height” and “weight” attributes are defined at 150 and 160, respectively.
The sample markup document 200 in FIG. 2 defines a valid person element 210 that conforms to this base schema 100. The syntax at 220 of this sample document identifies the schema to which the document conforms. That is, according to the W3C documents defining the schema notation, the value of the “schemaLocation” attribute shown at 220 is used to “provide hints” as to where the schema can be found. In this example, the schema is identified using a Uniform Resource Identifier (“URI”) with “base.xsd” as the resource name, and might therefore refer to the sample schema 100 in FIG. 1. The manner in which the base schema 100 of FIG. 1 may be extended to support alternative syntax and structures in conforming structured documents will now be described.
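By way of illustration only, and not of limitation, the constraints that the base schema 100 is described as imposing on a person element may be sketched in Python as follows. This sketch is not taken from the figures: the dictionary BASE_PERSON_TYPE, the function validate_person, and the sample values are hypothetical, namespaces and the schemaLocation attribute are omitted, and the check on child elements and attributes is a simplified stand-in for a validating parser rather than an implementation of the schema notation.

```python
import xml.etree.ElementTree as ET

# Hypothetical in-memory stand-in for the "personType" of base schema 100:
# two required child elements and two optional attributes.
BASE_PERSON_TYPE = {
    "required_children": ["name", "address"],    # minOccurs = maxOccurs = 1
    "optional_attributes": {"height", "weight"},
}


def validate_person(element, person_type):
    """Return a list of validation errors (empty when the element is valid)."""
    errors = []
    if element.tag != "person":
        errors.append(f"unexpected element <{element.tag}>")
    # Each required child element must occur exactly once.
    for child_name in person_type["required_children"]:
        count = len(element.findall(child_name))
        if count != 1:
            errors.append(f"<{child_name}> must occur exactly once, found {count}")
    # Only attributes declared for the type are permitted.
    for attribute_name in element.attrib:
        if attribute_name not in person_type["optional_attributes"]:
            errors.append(f"attribute '{attribute_name}' is not permitted")
    return errors


# Illustrative document content; the values are invented for this sketch.
SAMPLE = (
    '<person height="180" weight="75">'
    "<name>A. Person</name><address>1 Main Street</address>"
    "</person>"
)
result = validate_person(ET.fromstring(SAMPLE), BASE_PERSON_TYPE)
```

In this sketch, result is an empty list for the sample element, whereas an element lacking the address child would yield a single error message.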
A first schema extension 300 is defined in FIG. 3A. A “redefine” element, as shown at 310, is used to specify that this is a schema extension. In a redefine element, the base schema to which the extensions apply is named as the value of the “schemaLocation” attribute. Thus, the redefinition specified in document 300 applies to a base schema in a document stored at “base.xsd”, in this example. See reference number 311 in FIG. 3A. The body 320 of the schema extension 300 specifies that what is being redefined is the complex type named “personType”. See reference number 321. Furthermore, the syntax at 322 specifies that this complex type is being used as a base type that is being extended, and the syntax at 323 indicates that the extension of “personType” comprises adding a “gender” attribute.
A second schema extension 330 is defined in FIG. 3B. Again, a redefine element is used, as shown at 340, and specifies that this extension redefines the base schema in the document stored at “base.xsd”. In this sample extension 330, the body 350 of the schema extension again specifies that the complex type named “personType” is being redefined (see reference number 351) and that this complex type is being used as a base type that is being extended (see reference number 352). This time, however, the base “personType” is being extended to include an “age” attribute. See reference number 353.
FIG. 3C provides a third schema extension document 360. The redefine element at 370 again refers to the base schema in the document stored at “base.xsd”, and the body 380 again specifies that the complex type named “personType” is being redefined (see reference number 381) and that this complex type is being used as a base type that is being extended (see reference number 382). In this extension, the base type is being extended to include a “maritalStatus” attribute. See reference number 383.
FIGS. 4A-4C provide sample XML documents that conform to the schema extensions specified in FIGS. 3A-3C, respectively. As can be seen by review of these sample documents 400, 430, 460, each document includes the additional attributes defined in the respective schema extension.
As has been demonstrated with the examples of FIGS. 3A-3C, the markup language notation for extending a schema is simple and intuitive. Schema extensions defined in this manner are readily supported by XML parsers of the prior art. However, in the prior art, the object-oriented notion of abstract classes and type casting (also referred to as “object casting”) is beyond the scope of the markup languages and the parsers that process them. As a result, the application that consumes a parsed XML document (referred to hereinafter as a consumer or consumer application) is restricted to a specific extension of an extended schema. That is, a prior art parser will only render objects according to a specific schema extension. Typically, this is an (extended) schema that is referenced within the document to be parsed. Referring again to FIG. 4A, for example, the schema location attribute at 410 specifies that the resource name for the schema is “ext1.xsd”. This is intended, in the examples used herein, to refer to the extended schema 300 in FIG. 3A. Similarly, in FIGS. 4B and 4C, elements 440 and 470 specify resource names of “ext2.xsd” and “ext3.xsd” for the schema location attribute, and these resource names are intended to refer to the extended schemas 330 and 360 of FIGS. 3B and 3C, respectively.
Selectively specifying which schema should be used as input to the parser is illustrated in FIG. 5. As shown therein, a base schema 500 is extended by three separate schema extensions 510, 511, 512. This scenario corresponds to the examples which have been described, wherein base schema 500 is exemplified by schema document 100 of FIG. 1 and wherein the schema extensions 510, 511, 512 are exemplified by schema extension documents 300, 330, 360 of FIGS. 3A-3C. (As will be obvious, a base schema and its extensions may be much more complicated than the simple examples provided herein for purposes of illustration.) A particular consumer application, a collection of which are represented in FIG. 5 by Consumer 1, Consumer 2, and Consumer 3 at reference number 540, requests that parser 520 parse a particular input document. The parser may use the specific schema identified by the schema location attribute of that input document. Alternatively, the consumer application may instruct the parser 520 as to which schema extension should be used. In either case, the parser generates its output to the consumer application in a form that adheres to the specified schema extension, as indicated generally at reference number 530. So, for example, if Consumer 1 requests parsing according to the schema extension in extension document 510 (“Ext 1”, in the figure), then the input document being parsed must adhere to the syntax of that extension and the parser's output will use the syntax of that extension as well.
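The two ways of selecting a schema that are described above (using the hint carried in the input document, or using an instruction from the consumer application) may be sketched as follows, for purposes of illustration only. The function names schema_hint and choose_schema, the “http://example.com” URIs, and the sample document are assumptions made for this sketch; only the schemaLocation attribute of the XML Schema instance namespace, whose value is a whitespace-separated list of namespace and location pairs, comes from the schema notation itself.

```python
import xml.etree.ElementTree as ET

# Namespace of the XML Schema instance attributes (e.g. xsi:schemaLocation).
XSI = "http://www.w3.org/2001/XMLSchema-instance"


def schema_hint(document_text):
    """Return the resource name of the last schema location hint, or None."""
    root = ET.fromstring(document_text)
    value = root.get(f"{{{XSI}}}schemaLocation")
    if not value:
        return None
    # The value holds whitespace-separated (namespace, location) pairs;
    # this sketch simply takes the final location and keeps its last segment.
    location = value.split()[-1]
    return location.rsplit("/", 1)[-1]


def choose_schema(document_text, consumer_choice=None):
    """Prefer the consumer's explicit instruction, else the document's hint."""
    return consumer_choice or schema_hint(document_text)


# Hypothetical input document; the URIs and the attribute value are invented.
DOC = (
    '<person xmlns="http://example.com/person" '
    f'xmlns:xsi="{XSI}" '
    'xsi:schemaLocation="http://example.com/person http://example.com/ext1.xsd" '
    'gender="female"/>'
)

from_document = choose_schema(DOC)              # uses the document's own hint
from_consumer = choose_schema(DOC, "ext2.xsd")  # consumer overrides the hint
```

In either case, a parser of the kind described above would then validate the input document, and generate its output, according to the single schema extension so selected.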
With reference to the sample schema extensions in FIGS. 3A-3C, for example, Consumer 1 might be adapted for processing person elements that include a gender attribute, Consumer 2 might be adapted for processing person elements that include an age attribute, and Consumer 3 might be adapted for processing person elements that include a marital status attribute. Because of the extensibility of XML documents and the wide distribution that is possible due to their transportability, it may frequently happen that a receiver of an XML document makes additions to, or changes in, the syntax of that document. For example, an application might receive a document containing person elements that include only the child elements and attributes that were defined in the base schema 100, and might then modify that document to include age attributes in conformance with schema extension 330.
Extensions of this type present problems during the parsing process. XML documents that conform to an extended schema cannot be validated and processed by tools designed for the base (i.e., non-extended) type. Therefore, a validating parser that uses the base schema 100 when parsing one of the extended-schema documents 400, 430, 460 will regard the additional gender, age, and marital status attributes as invalid syntax. An exception will be generated, and the consumer application will not receive the value of the corresponding attribute.
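This failure mode may be sketched as follows. The sketch is a simplified stand-in for a validating parser, not the behavior of any particular prior art product: the class ValidationError, the function parse_person, and the sample attribute values are hypothetical, and validation is reduced to checking attribute names against the set declared in the base schema.

```python
import xml.etree.ElementTree as ET


class ValidationError(Exception):
    """Hypothetical exception raised when undeclared syntax is encountered."""


# Attributes declared for "personType" in the base schema (height, weight).
BASE_ATTRIBUTES = {"height", "weight"}


def parse_person(text, allowed_attributes):
    """Validate a person element, then deliver its attributes to the caller."""
    element = ET.fromstring(text)
    for name in element.attrib:
        if name not in allowed_attributes:
            raise ValidationError(f"attribute '{name}' is not declared")
    return dict(element.attrib)


# A document written against an extension that adds "gender"; values invented.
EXTENDED_DOC = (
    '<person height="170" gender="female">'
    "<name>N</name><address>A</address>"
    "</person>"
)

try:
    delivered = parse_person(EXTENDED_DOC, BASE_ATTRIBUTES)
except ValidationError as exc:
    # The consumer receives nothing for this element, not even "height".
    delivered = None
    message = str(exc)
```

Here the exception is raised on the gender attribute, so the consumer application receives neither that attribute nor the otherwise valid height attribute.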
In addition, it may happen that the proper schema is identified for validating the extended syntax of the XML document, but that the consumer application is not adapted for dealing with the extensions. Suppose, for example, that the XML document 400 in FIG. 4A is received as input to an application that only knows about the base schema 100 in FIG. 1. Assuming that the parser 520 in FIG. 5 uses the extended schema identified at 410 in FIG. 4A in the parsing process, the parser will deliver objects or events that may include the gender attribute defined in this schema extension. This may cause problems for the consumer application, which may need to include special code to deal with such “unexpected” input.
Furthermore, schema extensions may be cumulative (i.e., nested), which exacerbates this problem for prior art parsers. Suppose, for example, that the schema extension 330 in FIG. 3B referred to the location of the schema extension 300 in FIG. 3A as its base (e.g., by specifying an attribute such as “ . . . schemaLocation=“ . . . /ext1.xsd” at 340), and the schema extension 360 in FIG. 3C referred to the schema extension 330 as its base (e.g., by specifying “ . . . schemaLocation=“ . . . /ext2.xsd” at 370). In that case, a valid XML document could contain person elements having gender, age, and marital status attributes (in addition to the height and weight attributes from the base schema definition 100). FIG. 6 illustrates, in a composite form, a schema 600 that corresponds to the result of applying these nested extensions. (Note that this schema document 600 is provided only for illustrative purposes. The schema extensions still remain in distinct documents, as in FIGS. 3A-3C.) A document conforming to this nested extended schema is illustrated at 700 in FIG. 7. Pictorially, the nested extensions and their cumulative or composite effect are illustrated in FIG. 8 (see, generally, reference number 800).
In this situation, the validation of document 700 must use the most-specific schema extension, in order to avoid generating exceptions for those attributes that have been added to the base schema. In many cases, the consumer application may not want all of these attribute values, and in fact, receiving the values from the parser may cause problems in the consumer application if it is not adapted for dealing with those attributes (as was noted earlier). Suppose that some consumer application needs (or can process, when present) the gender and age attributes, but does not know about (and therefore cannot use) the marital status attribute. If the objects delivered to this consumer application from the parser are created according to the most-specific schema extension, the parser will not generate syntax errors or exceptions when parsing document 700, but the consumer application will receive an attribute value (i.e., marital status) that it does not recognize. This “extra” attribute may cause the application to fail. Alternatively, programmers may have to write additional error checking logic to deal with such unexpected input values. If, on the other hand, the parsing is performed according to the next-most-specific schema extension (i.e., including the gender and age attributes), then the parser will generate a syntax error during the validation process when it encounters a person element with a “maritalStatus” attribute. This may prevent the consumer application from receiving any of the data for the element that has been flagged by the parser as having invalid syntax, which is obviously an undesirable result.
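The dilemma described above may be sketched as follows, again for purposes of illustration only. The level names “base”, “ext1”, “ext2”, and “ext3”, the function parse, and the sample attribute values are assumptions of this sketch; each nested extension is modeled simply as the set of attributes it adds to the level it extends.

```python
import xml.etree.ElementTree as ET

# Hypothetical model of nested extensions: each level adds attributes to the
# level before it, so the permitted set accumulates along the chain.
CHAIN = [
    ("base", {"height", "weight"}),
    ("ext1", {"gender"}),
    ("ext2", {"age"}),
    ("ext3", {"maritalStatus"}),
]
LEVELS = {}
cumulative = set()
for level_name, added in CHAIN:
    cumulative = cumulative | added
    LEVELS[level_name] = cumulative


def parse(text, level):
    """Validate against one level; return (attributes or None, rejected names)."""
    element = ET.fromstring(text)
    rejected = [name for name in element.attrib if name not in LEVELS[level]]
    if rejected:
        return None, rejected
    return dict(element.attrib), []


# A document using every attribute of the nested extensions; values invented.
DOC = (
    '<person height="170" weight="60" gender="female" age="30" '
    'maritalStatus="single"><name>N</name><address>A</address></person>'
)

# The consumer in this example understands gender and age, but nothing more.
consumer_understands = LEVELS["ext2"]

# Option 1: validate with the most-specific level; the consumer is handed an
# attribute that it does not recognize.
delivered, _ = parse(DOC, "ext3")
unexpected = set(delivered) - consumer_understands

# Option 2: validate with the level the consumer understands; validation
# fails and no data at all is delivered for the element.
withheld, rejected = parse(DOC, "ext2")
```

With the first option, unexpected contains the marital status attribute; with the second, withheld is None because that same attribute was rejected. Neither outcome suits the consumer application.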
In the prior art, validation is often turned off in the parser to avoid problems of the types described above. Therefore, the unrecognized syntax in the parsed document is simply ignored. However, this “workaround” then hides true errors in the syntax of input documents. This is also undesirable.
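The drawback of this workaround may be sketched as follows. The function parse_without_validation, the set KNOWN_TO_CONSUMER, and the sample document (which deliberately misspells “height”) are hypothetical; the sketch merely assumes that, with validation turned off, unrecognized attributes are dropped without any report.

```python
import xml.etree.ElementTree as ET

# Attributes that the hypothetical consumer application understands.
KNOWN_TO_CONSUMER = {"height", "weight", "gender", "age"}


def parse_without_validation(text):
    """Deliver only recognized attributes; silently ignore everything else."""
    element = ET.fromstring(text)
    return {
        name: value
        for name, value in element.attrib.items()
        if name in KNOWN_TO_CONSUMER
    }


# "maritalStatus" is a legitimate extension attribute, whereas "hieght" is a
# genuine authoring error; both are discarded in exactly the same way.
DOC = '<person hieght="170" weight="60" maritalStatus="single"/>'
result = parse_without_validation(DOC)
```

In this sketch only the weight attribute is delivered, and nothing tells the consumer application that the height value was lost to a misspelling rather than omitted by the document's author.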
Accordingly, what is needed are improvements to the processing of documents created according to extended schemas.