A regular expression is a description of a pattern composed from combinations of symbols and operators. For example, the following text is a regular expression:
Name :=(Firstname, Middlename?, Lastname)
The above regular expression represents that a “Name” consists of exactly one “Firstname,” followed by zero or one “Middlenames” (the “zero or one” being denoted by the question mark in the expression), followed by exactly one “Lastname.”
The above regular expression can also be represented in the form of a tree. FIG. 1 is a diagram that illustrates a tree structure that represents a regular expression. The illustrated tree indicates that the regular expression is a “SEQUENCE” of three nodes: a “Firstname” node, an “OPTIONAL” node, and a “Lastname” node. The presence of the “OPTIONAL” node indicates that child nodes of that node are not mandatory within instances that conform to the regular expression. Thus, the “OPTIONAL” node corresponds functionally to the question mark that is associated with the “Middlename” in the regular expression discussed above. Because the “Middlename” node is a child of the “OPTIONAL” node, instances that conform to the regular expression can, but do not need to, contain a “Middlename.”
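The tree described above can be sketched in Java roughly as follows. This is an illustration only: the type names (RegexNode, ElementNode, OptionalNode, SequenceNode, NameTree) and the match and conforms methods are hypothetical and are not taken from FIG. 1. The matching is greedy with no backtracking, which is sufficient for this particular expression but would not be correct for regular expressions in general.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical node types for a regular-expression tree; names are illustrative only.
interface RegexNode {
    // Returns the index just past the match that starts at pos, or -1 if there is no match.
    int match(List<String> names, int pos);
}

// Leaf node: matches exactly one element with the given name.
class ElementNode implements RegexNode {
    private final String name;
    ElementNode(String name) { this.name = name; }
    public int match(List<String> names, int pos) {
        return (pos < names.size() && names.get(pos).equals(name)) ? pos + 1 : -1;
    }
}

// OPTIONAL node: its child may be present or absent.
class OptionalNode implements RegexNode {
    private final RegexNode child;
    OptionalNode(RegexNode child) { this.child = child; }
    public int match(List<String> names, int pos) {
        int next = child.match(names, pos);
        return next >= 0 ? next : pos; // an absent child consumes nothing
    }
}

// SEQUENCE node: its children must match one after another, in order.
class SequenceNode implements RegexNode {
    private final List<RegexNode> children;
    SequenceNode(RegexNode... children) { this.children = Arrays.asList(children); }
    public int match(List<String> names, int pos) {
        for (RegexNode child : children) {
            pos = child.match(names, pos);
            if (pos < 0) return -1;
        }
        return pos;
    }
}

class NameTree {
    // Name := (Firstname, Middlename?, Lastname)
    static final RegexNode NAME = new SequenceNode(
        new ElementNode("Firstname"),
        new OptionalNode(new ElementNode("Middlename")),
        new ElementNode("Lastname"));

    // A sequence of element names conforms if the tree consumes all of it.
    static boolean conforms(List<String> names) {
        return NAME.match(names, 0) == names.size();
    }
}
```

Under these assumptions, NameTree.conforms accepts the sequence Firstname, Lastname as well as Firstname, Middlename, Lastname, and rejects a sequence that omits Lastname or repeats Middlename.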
When a regular expression is used to indicate the structure of an Extensible Markup Language (XML) document, the regular expression is called a “schema.” An XML document conforms to a schema if the elements within that XML document follow the structure indicated in the schema. For example, the following XML data conforms to the regular expression discussed above:
<Name><Firstname>Kohsuke</Firstname><Lastname>Kawaguchi</Lastname></Name>
The XML data comprises exactly one “Firstname” element (“Kohsuke”) followed by exactly one “Lastname” element (“Kawaguchi”) as required by the regular expression. The XML data conforms to the structure of the regular expression despite the absence of a “Middlename” element, because the regular expression permits zero or one “Middlenames.” In this case, the XML data comprises zero “Middlename” elements, which is acceptable. XML data that conforms to a schema is often called a “valid instance” with respect to that schema.
Regular expressions can be used to describe data types. Taken together, multiple regular expressions such as the one discussed above may be seen as defining a “type system.” For example, the regular expression discussed above defines the structure of a “Name” type in such a type system.
It is often useful to generate programs that read or write XML data that conforms to a regular expression. For example, one might write a JAVA program that contains a class specifically designed to read instances of the “Name” type from one or more XML documents. Such a class might have an interface similar to the following:
class Name {
    Firstname getFirstname();
    Lastname getLastname();
    Middlename getMiddlename();
}
Using constraints available within the JAVA type system, the above class represents the constraints of the type system defined by the regular expression. To read a “Firstname,” “Lastname,” or “Middlename” element from an XML document, a computer program may invoke the appropriate “getFirstname,” “getLastname,” or “getMiddlename” method of the “Name” class. Within the program code, each method may be implemented specifically to read and return the appropriate type of element. For example, the interface of the “getFirstname” method specifically indicates that the “getFirstname” method is to return data of a “Firstname” type. Thus, the type system defined by the schema is preserved in the return types of the methods.
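One way such a class might be implemented is sketched below using the DOM parser that ships with the JDK. Only the three getter signatures come from the text above; the wrapper types with a “value” field, the “parse” factory method, and the choice to return null for an absent “Middlename” are assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical wrapper types, one per element type in the schema.
class Firstname { final String value; Firstname(String v) { value = v; } }
class Middlename { final String value; Middlename(String v) { value = v; } }
class Lastname { final String value; Lastname(String v) { value = v; } }

class Name {
    private final Element root;
    private Name(Element root) { this.root = root; }

    // Assumed factory: parses a string that holds a single <Name> element.
    static Name parse(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return new Name(doc.getDocumentElement());
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    // Returns the text of the first descendant element with the given tag, or null if absent.
    private String text(String tag) {
        NodeList nodes = root.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent();
    }

    // Each getter returns the specific type named in the schema, not a generic Object.
    Firstname getFirstname() { return new Firstname(text("Firstname")); }
    Lastname getLastname() { return new Lastname(text("Lastname")); }

    // Middlename is optional in the schema, so this sketch returns null when it is absent.
    Middlename getMiddlename() {
        String t = text("Middlename");
        return t == null ? null : new Middlename(t);
    }
}
```

With the XML data shown earlier, Name.parse(...).getFirstname() would yield a Firstname wrapping “Kohsuke,” and getMiddlename() would yield null because the instance contains no “Middlename” element.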
Because such classes are so useful, it is beneficial to automate, to the extent possible, the generation of the interfaces of these classes and of their methods. A computer program that receives a type system, such as one or more regular expressions (i.e., a schema), and attempts to automatically generate classes and methods that correspond to the type system, is called a “schema compiler.” The process of generating classes and methods that correspond to such a type system is called “data binding.”
Sometimes the automatic generation of classes and methods is relatively straightforward. However, complications can arise when the regular expressions to which the interfaces correspond are more complex.
For example, one might define a type “X” in the following manner:
X :=(A, B?, C?)|(B, C?)|C
In plain English, this complex regular expression reads as, “X consists of any combination of A, B, and C, in that order, but there must be at least one of them,” as that is typically the intention of a schema author who writes a regular expression like this. The following sequences are all of those which conform to the constraints of this complex regular expression: “A,” “AB,” “AC,” “ABC,” “B,” “BC,” and “C.”
At first glance, it might seem that this complex regular expression could be expressed in simpler terms. However, the constraints defined by this complex regular expression are not the same as the constraints defined by either of the following other simpler regular expressions:
X :=(A|B|C)
X :=(A?, B?, C?)
The sequences “AB,” “AC,” “BC,” and “ABC,” which conform to the complex regular expression discussed previously, don't conform to the first of these other regular expressions. Additionally, the empty sequence, which doesn't conform to the complex regular expression discussed previously, conforms to the second of these other regular expressions.
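The difference among the three expressions can be checked mechanically. The following sketch assumes that each element name is a single letter, so that the schema expressions can be transcribed into java.util.regex patterns, with the sequence operator “,” becoming plain concatenation. The class name SchemaComparison and the method accepted are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class SchemaComparison {
    // Transcriptions of the three schema expressions into java.util.regex syntax.
    static final Pattern COMPLEX = Pattern.compile("AB?C?|BC?|C"); // (A, B?, C?)|(B, C?)|C
    static final Pattern CHOICE = Pattern.compile("A|B|C");        // (A|B|C)
    static final Pattern ALL_OPTIONAL = Pattern.compile("A?B?C?"); // (A?, B?, C?)

    // Returns every in-order subsequence of "ABC", including the empty
    // sequence, that the given pattern accepts as a whole.
    static List<String> accepted(Pattern p) {
        List<String> result = new ArrayList<>();
        for (int mask = 0; mask < 8; mask++) {
            StringBuilder sb = new StringBuilder();
            if ((mask & 4) != 0) sb.append('A');
            if ((mask & 2) != 0) sb.append('B');
            if ((mask & 1) != 0) sb.append('C');
            if (p.matcher(sb).matches()) result.add(sb.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println("complex:      " + accepted(COMPLEX));
        System.out.println("choice:       " + accepted(CHOICE));
        System.out.println("all optional: " + accepted(ALL_OPTIONAL));
    }
}
```

Of the eight candidate sequences, the complex expression accepts the seven non-empty ones, the choice expression accepts only the three single-element sequences, and the all-optional expression accepts all eight, including the empty sequence.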
Unfortunately, existing schema compilers do not handle complex regular expressions in an optimal manner. For example, if an existing schema compiler received, as input, the complex regular expression discussed above, the existing schema compiler might generate the following class and method:
class X {
    List<Object> getContent();
}
The above interface is not very specific. The method “getContent” would merely read and return a list of elements of non-specific “Object” types. In JAVA, “Object” is the most general type. When data is stored in an “Object” type, the more specific information that might have been available concerning that data's original type is not preserved.
Yet, one of the prime reasons that data is stored in an XML document is so that the specific types (e.g., “Firstname,” “Middlename,” “Lastname”) of the data stored therein are defined. Failing to preserve the specific types of data specified within an XML document tends to defeat the purpose of storing the data in XML format in the first place. Thus, the non-specific “getContent” method is not very useful.
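The practical cost of the non-specific interface described above can be seen from the caller's side. In the following sketch, the element classes A, B, and C, the constructor of X, and the UntypedCaller class are hypothetical; only the “getContent” signature comes from the text above.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical element types for the schema X := (A, B?, C?)|(B, C?)|C.
class A {}
class B {}
class C {}

// The non-specific interface: content is exposed only as a list of Object.
class X {
    private final List<Object> content;
    X(Object... content) { this.content = Arrays.asList(content); }
    List<Object> getContent() { return content; }
}

class UntypedCaller {
    // With only List<Object>, the caller must rediscover each element's type at run time.
    static String describe(X x) {
        StringBuilder sb = new StringBuilder();
        for (Object o : x.getContent()) {
            if (o instanceof A) sb.append('A');
            else if (o instanceof B) sb.append('B');
            else if (o instanceof C) sb.append('C');
            else sb.append('?'); // nothing in the signature rules this case out
        }
        return sb.toString();
    }
}
```

In this sketch the compiler accepts new X("not an element") just as readily as new X(new A(), new C()), which suggests that a “List<Object>” signature captures neither the element types nor the ordering constraints of the schema.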
Existing schema compilers are limited in effectiveness by their inability to generate, automatically, type-specific methods based on complex regular expressions.