XML (Extensible Markup Language) is a language designed to improve the functionality of the World Wide Web by identifying data in a more flexible and adaptable manner than was previously possible. The term “extensible” is used because the language does not have a fixed format like its predecessor HTML (a single, predefined markup language). Instead, XML is a “metalanguage” (a language for describing other languages) that gives a designer the freedom to define a customized markup language for different types of documents. XML's flexibility is possible because it is written in SGML, the international standard metalanguage for text markup systems (ISO 8879). The result is an extremely simple dialect of SGML that enables generic SGML to be served, received, and processed on the Web in a way that is not possible with HTML.
Data in XML is organized by means of a Document Type Definition (DTD) schema or an XML Schema. A DTD is a formal description, in XML Declaration Syntax, of a particular type of document. It establishes what names are to be used for the different types of elements, where those elements may occur, and how they fit together. A DTD thus gives applications advance notice of the names and structures that can be used in a particular document type. To facilitate usage, thousands of DTDs already exist for a variety of applications.
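The role of a DTD described above can be sketched in a few lines of Python. This is a minimal illustration, not a DTD parser: the `book` and `chapter` declarations, the `CONTENT_MODELS` table, and the `children_are_valid` helper are all hypothetical names introduced here, and each content model has been translated by hand into a regular expression over child element names.

```python
import re

# Hypothetical DTD fragment, shown only for illustration:
#   <!ELEMENT book    (title, author+, chapter*)>
#   <!ELEMENT chapter (title, (para | figure)*)>
# Each content model is hand-translated into a regular expression over
# child element names, with every name followed by a single space.
CONTENT_MODELS = {
    "book": r"(title )(author )+(chapter )*",
    "chapter": r"(title )(para |figure )*",
}


def children_are_valid(element_name, child_names):
    """Return True if the child sequence fits the element's content model."""
    model = CONTENT_MODELS.get(element_name)
    if model is None:
        # The element name is not declared in this (hypothetical) DTD.
        return False
    encoded = "".join(name + " " for name in child_names)
    return re.fullmatch(model, encoded) is not None
```

For instance, `children_are_valid("book", ["title", "author", "chapter"])` holds, while a `book` whose children appear in the order `author`, `title` is rejected because the content model fixes where each element may occur.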
Schema matching is a problem that arises in many data management applications, including schema evolution and integration, data exchange, and data archiving and warehousing. Given two database schemas S1 and S2, the goal of the schema-matching process is to identify the elements or types in the two schemas that semantically correspond to each other. This process is a critical step in, for example, mapping messages between different formats in e-business applications, or identifying points of integration between heterogeneous source schemas and a global, integrated schema (e.g., for web-data integration). Currently, schema matching is a tedious, time-consuming process that is performed largely by hand, perhaps with the support of a graphical user interface.
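To make the matching task concrete, the following sketch pairs element names from S1 with element names from S2 using plain string similarity. It is a deliberately naive baseline and an assumption of this illustration, not a method described here: lexical similarity is only a rough stand-in for semantic correspondence, and the `match_schemas` function, its `threshold` parameter, and the sample element names are hypothetical.

```python
from difflib import SequenceMatcher


def match_schemas(s1_names, s2_names, threshold=0.6):
    """Map each S1 element name to its most similar S2 name, if any.

    Similarity is purely lexical (difflib ratio on lowercased names);
    pairs scoring below the threshold are left unmatched.
    """
    matches = {}
    for a in s1_names:
        best, best_score = None, 0.0
        for b in s2_names:
            score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
            if score > best_score:
                best, best_score = b, score
        if best is not None and best_score >= threshold:
            matches[a] = best
    return matches


# Hypothetical element names for two schemas.
S1 = ["author", "title", "isbn"]
S2 = ["authors", "book_title", "price"]
candidates = match_schemas(S1, S2)
```

A matcher of this kind only proposes candidate correspondences from names; it looks at neither the structure of the content models nor whether the resulting mapping preserves information, so its output would still need manual review.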
Some existing solutions address different forms of the schema-matching problem and offer partially automated processes for several application domains. However, none of these earlier efforts has addressed the general problem of matching DTD schemas defined in terms of complex regular expressions containing conjunction, disjunction, and Kleene star operators. Furthermore, most earlier work has ignored the issue of information preservation. Informally, an information-preserving matching of schema S1 to schema S2 implies that all the information in the S1-structured local database can be transformed losslessly into the integrated schema S2. In other words, a systematic mapping of instances of S1 onto instances of S2 can be obtained without losing any information or structure in the original data. In addition, user queries posed over the local S1 schema instances can be translated, based on the underlying schema matching, into equivalent queries over S2 that return exactly the same results. Given the rapidly growing number of available web data sources, as well as the constantly increasing complexity and diversity of the underlying database schemas, there is a need for tools that can effectively automate the schema-matching process.
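The informal notion of information preservation can be illustrated with a small sketch. Everything in it is hypothetical and chosen only for illustration: the record layouts standing in for S1 and S2 instances, the mapping functions, and the sample data. The checks run over a finite sample, so they can reveal information loss but cannot prove that a mapping is lossless in general.

```python
import json


def to_s2(rec):
    """Map an S1 instance to an S2 instance, keeping both name parts."""
    return {"person": {"given": rec["first"], "family": rec["last"]}}


def to_s1(doc):
    """Inverse mapping: recover the original S1 instance from S2."""
    person = doc["person"]
    return {"first": person["given"], "last": person["family"]}


def lossy_to_s2(rec):
    """A mapping that merges two fields and so discards structure."""
    return {"person": {"name": rec["first"] + " " + rec["last"]}}


def is_lossless_on(instances, forward, backward):
    """True if every sample instance survives a round trip unchanged."""
    return all(backward(forward(r)) == r for r in instances)


def is_injective_on(instances, forward):
    """True if distinct sample instances map to distinct images.

    Assumes the sample instances are pairwise distinct.
    """
    images = {json.dumps(forward(r), sort_keys=True) for r in instances}
    return len(images) == len(instances)


def query_s1(instances, family_name):
    """A query over S1 instances: select by last name."""
    return [r for r in instances if r["last"] == family_name]


def query_s2(documents, family_name):
    """The same query translated to run over S2 instances."""
    return [d for d in documents if d["person"]["family"] == family_name]


# Two distinct S1 instances that the lossy mapping cannot tell apart.
people = [
    {"first": "Mary Ann", "last": "Lee"},
    {"first": "Mary", "last": "Ann Lee"},
]
```

On this sample, `to_s2` passes both checks, and the translated query returns the images of exactly the records selected by the S1 query. By contrast, `lossy_to_s2` sends both records to the same S2 instance, so the original S1 data can no longer be recovered from it.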