A forerunner of XML, HTML (“Hypertext Markup Language”) was conceived as an easily understandable language for the exchange of scientific and other technical documents. HTML addressed the problem of SGML (“Standard Generalized Markup Language”) complexity by specifying a small set of structural and semantic tags suitable for authoring relatively simple documents. In addition to simplifying the document structure, HTML added support for hypertext. Multimedia capabilities were added later.
Within a short time, HTML became very popular and quickly outgrew its original purpose. Since HTML's inception, many new elements have been devised, both for use within HTML as a standard and for adapting HTML to vertical, highly specialized markets.
A severe shortcoming of HTML is its lack of data structuring mechanisms. HTML documents capture the presentation and rendering aspects of marked-up documents, but the formalism does not lend itself to describing the structure of data exchanged between computing entities in typical client-server applications.
XML is a proposed standard for describing the structure of semi-structured data. The formalism supports the description of concrete mark-up languages which allow the specification of hierarchical (i.e. tree-like) data structures. Concrete mark-up languages can be specially adapted to particular application domains, such as the airline industry or the finance industry, where they allow the data entities used by the applications of that domain to be modeled. XML can also be used to specify HTML as a concrete mark-up language. More information about XML can be found in “Extensible Markup Language (XML) 1.0: W3C Recommendation Feb. 10, 1998”, http://www.w3.org/TR/REC-xml, and in E. H. Harold, XML Extensible Markup Language, (IDG Books 1998).
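The tree-like nature of such data can be illustrated with a minimal sketch using Python's standard xml.etree.ElementTree module. The airline-style element and attribute names (flight, origin, passenger, seat) are hypothetical and chosen only for illustration; they are not taken from any actual industry mark-up language.

```python
import xml.etree.ElementTree as ET

# Hypothetical airline-domain mark-up; all element names are illustrative only.
FLIGHT_XML = """<flight number="LX100">
  <origin>ZRH</origin>
  <destination>JFK</destination>
  <passengers>
    <passenger seat="12A">Ada</passenger>
    <passenger seat="12B">Alan</passenger>
  </passengers>
</flight>"""


def walk(element, depth=0):
    """Return (depth, tag) pairs in document order to expose the tree shape."""
    pairs = [(depth, element.tag)]
    for child in element:
        pairs.extend(walk(child, depth + 1))
    return pairs


root = ET.fromstring(FLIGHT_XML)
tree_shape = walk(root)

# Print each element indented by its depth in the hierarchy.
for depth, tag in tree_shape:
    print("  " * depth + tag)
```

Each element has exactly one parent, so the document as a whole forms a tree rooted at the flight element.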
XML, proposed by the W3C (the World Wide Web Consortium, the standardization body for the World Wide Web), has found widespread acceptance in the industry and is rapidly becoming the lingua franca for data representation and description mechanisms used throughout the World Wide Web. It is an open specification, and several major industry leaders, among them Microsoft and IBM, are pushing for the use of XML-formatted data exchanged between the various IT systems and sub-systems which make up an enterprise as well as a personal computing environment.
Abstract Syntax Notation (“ASN”, currently version One, “ASN.1”) serves a similar purpose: data structures can be described abstractly using the ASN.1 syntax. Its initial intention was to provide a scheme for specifying the structure of data to be exchanged between computer systems in a system-independent, common representation. To that end, ASN.1 also provides several concrete standardized transfer encodings such as the “Basic Encoding Rules” (BER), the “Distinguished Encoding Rules” (DER) etc. Due to the complexity of the ASN.1 data description language as well as the multiplicity of encoding rules, ASN.1 usage has been restricted to a limited set of IT applications, mainly related to IT security (e.g., directories, public key infrastructure).
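To give a sense of what such a transfer encoding looks like, the following minimal sketch encodes ASN.1 INTEGER values and a SEQUENCE of them in the tag-length-value layout used by DER. It covers only these two types; the function names der_length, der_integer and der_sequence are hypothetical helpers introduced for this illustration and do not belong to any particular library.

```python
def der_length(n: int) -> bytes:
    """Encode a content length: short form below 128, long form otherwise."""
    if n < 0x80:
        return bytes([n])
    body = n.to_bytes((n.bit_length() + 7) // 8, "big")
    return bytes([0x80 | len(body)]) + body


def der_integer(value: int) -> bytes:
    """Encode an INTEGER (tag 0x02) as minimal two's complement content."""
    # Number of magnitude bits, excluding the sign bit.
    magnitude = value if value >= 0 else ~value
    size = magnitude.bit_length() // 8 + 1
    content = value.to_bytes(size, "big", signed=True)
    return b"\x02" + der_length(len(content)) + content


def der_sequence(*items: bytes) -> bytes:
    """Wrap already encoded items in a SEQUENCE (tag 0x30)."""
    body = b"".join(items)
    return b"\x30" + der_length(len(body)) + body


encoded = der_sequence(der_integer(1), der_integer(2))
print(der_integer(5).hex())     # 020105
print(der_integer(128).hex())   # 02020080
print(encoded.hex())            # 3006020101020102
```

The leading zero byte in the encoding of 128 keeps the value from being read as negative, which is one example of the detail that a single abstract description entails at the encoding level.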
Most data on the Internet is stored in legacy databases, many of which use a relational database model and, in some cases, ASN.1 encoding. In both cases, it is of commercial value to externalize such data into an XML compliant format in order to enable new Internet based applications.
For relational databases as well as for ASN.1 data, some work has been done to map actual data into XML compliant data formats. However, these transformations do not generate the meta-data description, embodied by an XML Document Type Definition (DTD). For example, one proposed translation for ASN.1 encoded data, published at http://asf.gils.net/xer/standard.html, only defines a mapping from one concrete ASN.1 data module onto one concrete XML data representation without generating the XML DTD.
What is needed, therefore, is an efficient method to externalize legacy data into an XML compliant format where the format is specified by an automatically generated XML meta-description (i.e. DTD). Such procedures will enable the access and processing of legacy data by new, Internet generation applications.
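The kind of output called for here, namely data in XML form together with a DTD derived from the same source, can be sketched for the relational case as follows. This is only an illustrative assumption about one possible mapping (one element per row, one #PCDATA child element per column) and is not the method of the invention. The function name table_to_xml_with_dtd, the employee table and the naming of the root element are hypothetical; the sketch uses Python's standard sqlite3 module and assumes a trusted table name and non-NULL column values.

```python
import sqlite3
from xml.sax.saxutils import escape


def table_to_xml_with_dtd(conn, table):
    """Return (dtd, xml) for one table, both derived from the database itself."""
    # Column names are read from the database catalog, i.e. the schema.
    # The table name is interpolated directly, so it must be trusted input.
    columns = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    root = f"{table}_rows"

    # Meta-description: one declaration per level of the generated tree.
    dtd_lines = [
        f"<!ELEMENT {root} ({table}*)>",
        f"<!ELEMENT {table} ({', '.join(columns)})>",
    ]
    dtd_lines += [f"<!ELEMENT {column} (#PCDATA)>" for column in columns]
    dtd = "\n".join(dtd_lines)

    # Data: one element per row, one child element per column value.
    rows = []
    for record in conn.execute(f"SELECT * FROM {table}"):
        fields = "".join(
            f"<{column}>{escape(str(value))}</{column}>"
            for column, value in zip(columns, record)
        )
        rows.append(f"<{table}>{fields}</{table}>")
    xml = f"<{root}>{''.join(rows)}</{root}>"
    return dtd, xml


# Hypothetical legacy table used only to exercise the sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO employee VALUES (?, ?)", [(1, "Ada"), (2, "R&D")])

dtd, xml = table_to_xml_with_dtd(conn, "employee")
print(dtd)
print(xml)
```

Because both the DTD and the XML instance are produced from the same schema information, the two cannot drift apart, which is the property that a hand-written mapping for a single concrete data set does not provide.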