A key part of the background to the invention is the emergence of the Internet and the World Wide Web as widespread information and communications technologies in the mid 1990s and, more recently, in 1998, the invention of Extensible Mark-up Language (XML). Both the web and XML are derivatives of a much older computer technology, Standard Generalized Markup Language (SGML), which originated in the IBM laboratories in the early 1970s as a framework for the documentation of technical text, such as computer manuals. In the late 1980s, Tim Berners-Lee began working on a vastly simplified version of SGML, which was to become the heart of the World Wide Web: HyperText Mark-up Language (HTML). The key deficiency of HTML (progressively removed by version 4 of HTML) was that it was something of a conceptual jumble, mixing historical typesetting tags (presentational concepts) with structural and semantic tags.
The widespread and increasing use of XML reflects an enormous conceptual leap in data and document definition, in two regards. First, XML is not itself a mark-up language but a mark-up language for mark-up languages: a framework, in other words, within which any mark-up language can be created. Second, XML rigorously separates mark-up for structure and semantics from presentation, which is handled independently in a ‘stylesheet transformation’ area. The great benefit of the XML approach is that content can be ‘multi-purposed’ by means of different stylesheet transformations. A page of text, for example, can be rendered as a web page, or a printed page, or as an image on a portable reading device, or as synthesised voice. This level of flexibility functions perfectly well when multiple stylesheets are applied to a document formed within a single XML schema, as embodied in a Document Type Definition (DTD). The key problem addressed by this invention is the integration of data created within multiple and varied DTDs.
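The ‘multi-purposing’ of a single source can be illustrated with a minimal sketch. It is not taken from the invention: the `article`, `title` and `para` tags are an invented schema, and two plain Python functions stand in for what would normally be stylesheet transformations (for example, XSLT).

```python
import xml.etree.ElementTree as ET

# Hypothetical document in an invented schema; the tags describe structure only.
SOURCE = """<article>
  <title>Interoperability</title>
  <para>XML separates structure from presentation.</para>
  <para>One source can feed many renderings.</para>
</article>"""


def render_html(doc):
    """Render the document as a minimal HTML fragment for screen display."""
    title = doc.findtext("title")
    paras = "".join(f"<p>{p.text}</p>" for p in doc.findall("para"))
    return f"<h1>{title}</h1>{paras}"


def render_plain(doc):
    """Render the same document as plain text, e.g. for a print or voice pipeline."""
    lines = [doc.findtext("title").upper(), ""]
    lines += [p.text for p in doc.findall("para")]
    return "\n".join(lines)


doc = ET.fromstring(SOURCE)
html_view = render_html(doc)   # one rendering of the source
text_view = render_plain(doc)  # a second rendering of the same source
```

Both renderings read the same structural tags, which is why the approach works only while every document conforms to the one schema the rendering functions were written for.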
XML creates the conditions of interoperability in two important regards: first, it allows alternative stylesheet transformations which may cross rendering platforms (print, screen, audio etc.); and second, it provides a universal platform for the creation of Document Type Definitions. The former interoperability is limited insofar as the claimed flexibility is restricted to a single DTD. The latter operates at such a high level of generality that it provides no practical or workable basis for DTD-to-DTD interoperability. It is this latter mechanism, DTD-to-DTD interoperability, which has been created by the invention described here.
XML has become ubiquitous. Alongside this ubiquity, however, has come a burgeoning of varied and functionally overlapping schemas, some of which are regarded as industry ‘standards’ and some of which are commercial. This functional overlap has not, however, produced functional interoperability of the kind achieved by this invention.
To take the example of the publishing industry, a number of key XML-based schemas have emerged, with an enormous amount of functional overlap, but without any real possibility of achieving interoperability given the current state of the art. Major areas include:
a) Document formation (DocBook, Text Encoding Initiative);
b) Electronic content creation (HTML and XHTML and their derivatives such as Open eBook and Digital Talking Book);
c) Print rendering (Job Definition Format);
d) B-2-B e-commerce (ONIX, or the Online Information Exchange standard for publishers and booksellers);
e) Library cataloguing (principally the Library of Congress MARC, MODS and METS standards);
f) Digital Rights Management (Extensible Digital Rights Management Language and the MPEG21 Rights Data Dictionary);
g) Internet syndication and resource discovery (Dublin Core, RSS, Atom);
h) E-learning (the Shareable Content Object Reference Model and the Instructional Management Systems standards).
These standards are an example of what is now called the ‘semantic web’. Each XML schema is an ‘ontology’ consisting of a content tagging schema which describes the scope of a particular software application. These are the basis either of Document Type Definitions (or DTDs in XML file format) or of database structures (which can, in turn, produce exports into XML files based on the database structure). Tim Berners-Lee predicts that this is the next great step in the development of the internet, and one which promises more accurate resource discovery, machine translation and, eventually, artificial intelligence.
There is one great barrier to this vision, and that is the problem of interoperability. Even though each standard or XML DTD has its own functional purpose, there is a remarkable amount of overlap between these standards. The overlap, however, often involves the use of tags in mutually incompatible ways. Our extensive preliminary R&D investigating the approximately twenty major standards that apply in just one industry, publishing, shows that, on average, each standard shares seventy percent of its semantic range with neighbouring standards. Despite this, it is simply not possible to transfer data from one standard to another, as each standard has been designed as its own independent, stand-alone DTD. This, in fact, points to one of the key deficiencies of XML as a meta-mark-up framework: it does not provide a way for DTDs to relate to each other. In fact, its very openness invites a proliferation of DTDs, and with this proliferation, the problem of interoperability compounds itself.
This produces a practical, commercial problem. In the book publishing and manufacturing supply chain, for instance, different links in the chain use different standards: typesetters, publishers (internet, e-book and print), booksellers, printers, manufacturers of electronic rendering devices and librarians. This disrupts the digital file flow, hindering supply chain integration and the possibilities of automating key aspects of supply chain, manufacturing and distribution processes. Precisely the same practical problems of interoperability are now arising in other areas of the electronic commerce environment.
A common task in today's IT world involves sharing data between systems. XML has emerged as the so-called ‘syntactic sugar’ to facilitate this task. As an example, Company A may have a commercial obligation to provide Company B with metadata about a series of documents, such as their titles, authors, classificatory categories and ISBNs. Both parties must agree on a common DTD to allow this to happen, which may be devised by the parties or based on an existing standard. In addition, each party must map its internal systems to this common DTD. Finally, a further set of information must be agreed before the metadata can be transferred: security constraints, transactional characteristics, network protocols and messaging conditions (whether responses must be synchronous or asynchronous). This complexity arises even in the relatively simple transfer of information between two conferring parties.
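The two-party exchange described above can be sketched as follows. This is an illustration under stated assumptions only: the field names, the common tags (`title`, `author`, `category`, `isbn`) and the sample record are all hypothetical, and the security, transactional and messaging agreements are left out entirely.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping: Company A's internal field names -> agreed common tags.
A_TO_COMMON = {
    "book_title": "title",
    "writer": "author",
    "subject": "category",
    "isbn13": "isbn",
}

# Hypothetical mapping: agreed common tags -> Company B's internal field names.
COMMON_TO_B = {
    "title": "Title",
    "author": "Creator",
    "category": "Classification",
    "isbn": "Identifier",
}


def export_record(record, mapping):
    """Serialise an internal record as XML using the agreed common tags."""
    root = ET.Element("document")
    for field, tag in mapping.items():
        ET.SubElement(root, tag).text = record[field]
    return ET.tostring(root, encoding="unicode")


def import_record(xml_text, mapping):
    """Parse a common-format message into the receiver's internal fields."""
    root = ET.fromstring(xml_text)
    return {mapping[child.tag]: child.text for child in root if child.tag in mapping}


# Made-up sample data, for illustration only.
record_a = {
    "book_title": "An Example Title",
    "writer": "Jane Example",
    "subject": "Publishing",
    "isbn13": "0000000000000",
}

message = export_record(record_a, A_TO_COMMON)  # Company A sends this
record_b = import_record(message, COMMON_TO_B)  # Company B receives it
```

Each party maintains only its own mapping to the common format, but both mappings have to be rewritten whenever the agreed format or either internal system changes.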
However, in a scenario where there are many more than two parties, where the information is not covered by a single standard, where the resources and skills of the parties cannot facilitate costly and time-consuming integration, a different approach is needed—one which caters for the complexity of the messages, while providing tools which simplify the provision and extraction of metadata. This approach is one which has been termed semantic and structural interoperability. It involves providing a systematic mapping of associated XML standards to a common XML ‘mesh’, which must track semantic overlays and gaps, schema versioning, namespace resolution, language and encoding variances, and which must provide a comprehensive set of rules covering the data transfer—such as security, transactional and messaging issues.
The idea of a ‘meta-schema’—a schema to cover other related schemas—was initially considered to be sufficient. Research has demonstrated, however, that this is not enough, being subject to many of the same problems as the individual schemas being mapped—versioning, terminological differences and so on.
Mark-up ontologies or software tagging systems use a variety of encoding formats, including Extensible Mark-up Language (XML) and Resource Description Framework (RDF). Ontologies promise to overcome two of the most serious limitations of the World Wide Web:
1. the fact that search algorithms primarily locate semantically undifferentiated strings of characters; and
2. the fact that rendering alternatives are mostly limited by data entry methods: printed web pages do not live up to the historical standards of design and readability of printed text, and alternative non-visual renderings, such as digital talking books, are at best poor.
Specific ontologies are designed to provide more accurate search results than is the case with computer or web-based search engines. Examples include the Dublin Core Metadata Framework and the MARC electronic library cataloguing system. However, metadata harvested in one scheme cannot readily or effectively be used in another.
Specific ontologies are also designed for a particular rendering option. For instance, amongst ontologies describing the structure of textual content, HTML is designed for use in web browsers, DocBook for the production of printed books, Open eBook for rendering to hand held reading devices and Digital Talking Book for voice synthesis. Very limited interoperability is available between these different ontologies for the structure of textual data, and only then if it has been designed into the ontology and its associated presentational stylesheets.
Furthermore, it is not practically possible to harvest accurate metadata from data, as data structuring ontologies and ontologies for metadata are mutually exclusive. The field of the semantic web attempts to improve the inherent deficiencies in current digital technologies both in the area of resource discovery (metadata-based search functions) and rendering (defining structure and semantics in order to be able to support, via stylesheet transformations, alternative rendering options).
Its success, however, has been very limited, primarily because of the semantic dissimilarities between overlapping ontologies and because of the limited rendering options catered for in ontologies which define data structure. At most, one-to-one, schema-to-schema ‘crosswalks’ have been created.
Creating a single crosswalk is a large and complex task. As a consequence, the sheer number of significant overlapping ontologies in a domain presents a barrier to achieving interoperability. For instance, our research has identified more than twenty major ontologies pertaining to the domain of authorship and publishing. Using the ‘crosswalk’ approach, every tag in a schema needs to be mapped tag by tag against every tag in every other schema with which interoperability is required.
Each crosswalk in fact involves two translations: Ontology A defined tag by tag in terms of Ontology B, and Ontology B defined tag by tag in terms of Ontology A. Using the crosswalk method, the number of one-way mappings needed to achieve interoperability between N tagging schemas is 2 × (N/2)(N − 1), that is, N(N − 1). In a terrain encompassing twenty-one ontologies, for instance, 420 such mappings would be required (see FIG. 1). Moreover, new ontologies are regularly emerging, and each new ontology sharply increases the scale of the task of achieving interoperability, since the total grows with the square of N.
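The arithmetic behind the crosswalk count can be checked with a short sketch. The function names are illustrative only; the formula is the one given in the text, N(N − 1) one-way mappings for N schemas.

```python
def pairwise_mappings(n):
    """One-way mappings needed so that each of n schemas maps to every other."""
    return n * (n - 1)


def added_by_next_schema(n):
    """Extra one-way mappings needed when one more schema joins n existing ones."""
    return pairwise_mappings(n + 1) - pairwise_mappings(n)


total_for_21 = pairwise_mappings(21)      # 21 * 20 = 420
extra_for_22nd = added_by_next_schema(21)  # 2 * 21 = 42
```

Adding one schema to N existing ones requires 2N further mappings, so the cost of each newcomer rises with the number of schemas already in use.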
The present invention addresses fundamental problems that currently arise in the area of interoperability of data and metadata. These can be summarised as follows:
1. The failure of ‘the semantic web’ to improve the search mechanisms of computers and the Internet across even similar domains of knowledge, information and data. As a consequence, searching still functions primarily on the basis of a semantically and structurally agnostic process of matching of strings of characters.
2. There is limited interoperability between ontologies for metadata tagging, and when there is, it is a consequence of the laborious manual crosswalks approach.
3. There is a limited range of rendering options, even when mark-up for structure and semantics is separated from the rendering apparatus of the stylesheet.