Although vast amounts of information are stored and are accessible through computer systems today, access across systems is not always possible. For example, some computer systems are legacy systems, which are self-contained and which have little or no flexibility in terms of data output and communication. Other systems rely on proprietary data formats, and therefore may also lack flexibility for interoperability or integration between systems.
In general, document formats may be divided into three broad categories. A structured document has a completely defined format, which places data in known positions within the document. A structured document is generally easy to transform: the document is parsed to extract the required data from the known positions, and the data is then mapped into a different format. For example, a document in a first Extensible Markup Language (XML) structure (corresponding to a specified XML Schema Definition, or XSD) may be transformed into a second XML structure (corresponding to a different XSD).
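The parse, extract and map sequence for a structured document can be sketched as follows. This is only a minimal illustration using Python's standard xml.etree.ElementTree module; the element names (order, Invoice, BillTo, Amount) and the sample data are hypothetical and are not taken from any particular XSD.

```python
import xml.etree.ElementTree as ET

# Hypothetical source document in the "first" XML structure.
SOURCE_XML = (
    "<order>"
    "<id>42</id>"
    "<customer><name>Ada</name></customer>"
    '<total currency="USD">19.90</total>'
    "</order>"
)


def transform(source_xml: str) -> str:
    """Map the hypothetical <order> structure onto a hypothetical <Invoice> structure."""
    src = ET.fromstring(source_xml)

    # Step 1: extract the required data from known positions.
    order_id = src.findtext("id")
    customer_name = src.findtext("customer/name")
    total = src.find("total")
    if order_id is None or customer_name is None or total is None:
        raise ValueError("source document does not match the expected structure")

    # Step 2: map the extracted data into the target structure.
    dst = ET.Element("Invoice", {"number": order_id})
    ET.SubElement(dst, "BillTo").text = customer_name
    amount = ET.SubElement(dst, "Amount", {"unit": total.get("currency", "")})
    amount.text = total.text

    return ET.tostring(dst, encoding="unicode")


if __name__ == "__main__":
    print(transform(SOURCE_XML))
```

Because every field sits at a fixed, known path, the transformation reduces to a lookup followed by a rewrite. That is the property which unstructured and semi-structured documents lack.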
An unstructured document is a document that is kept in human-readable form, such as a Microsoft Word, Microsoft Excel, or Adobe PDF document. Unstructured documents present a particular challenge to interoperability. While such documents may have an implicit structure, the current art is not able to identify and extract the relevant data that is required in order to transform the unstructured data into a different format.
A semi-structured document is a document that is mostly structured but has parts that are not well defined. An example is a COBOL message that has an associated copybook, where the copybook contains a “redefine” (a REDEFINES clause, which allows the same storage area to be interpreted according to more than one layout). The presence of such loosely defined regions within a semi-structured document may make the document difficult to transform to another format.
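The difficulty posed by a redefined region can be illustrated with a small sketch. The fixed-width layout below is entirely hypothetical: an 11-character record whose first character is assumed to indicate how the remaining ten characters should be read. In practice a copybook does not necessarily state which interpretation applies to a given record, which is what makes such regions hard to transform.

```python
# Hypothetical fixed-width record layout (not taken from any real copybook):
#   column 0      record kind: 'P' (person) or 'C' (company)
#   columns 1-10  payload, redefined according to the record kind:
#                   'P' -> first name (5 chars) + last name (5 chars)
#                   'C' -> registration number (10 digits)
RECORD_LENGTH = 11


def parse_record(line: str) -> dict:
    """Interpret the shared payload region according to the record kind."""
    if len(line) != RECORD_LENGTH:
        raise ValueError(f"expected {RECORD_LENGTH} characters, got {len(line)}")

    kind, payload = line[0], line[1:]

    if kind == "P":
        # Same ten characters, read as two text fields.
        return {
            "kind": "person",
            "first": payload[:5].rstrip(),
            "last": payload[5:].rstrip(),
        }
    if kind == "C":
        # Same ten characters, read as a single numeric field.
        return {"kind": "company", "reg_no": int(payload)}

    raise ValueError(f"unknown record kind: {kind!r}")


if __name__ == "__main__":
    print(parse_record("PAda  Lovel"))
    print(parse_record("C0000012345"))
```

Position alone does not determine meaning here: the same ten characters are either two names or one number, so a transformer needs additional knowledge (in this sketch, the assumed kind indicator) before it can map the region to another format.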
Additionally, organizations which rely upon computer systems, such as corporations, have increasing expectations that their computer systems should be able to communicate more flexibly and efficiently with each other. Background art describes how multiple computer systems should be able to communicate, in order to fulfill the expectations of the organizations which operate them. A background art system may be divided into two sections: an internal section and an external section. The internal section typically resides within an organization, and includes one or more databases and one or more internal applications. The databases and internal applications in turn communicate through a combination of network hardware and one or more interfaces, which may be viewed as a local network interface.
On the other side, the external section may be outside the organization, or alternatively may represent another part of the organization. For example, an organization may have multiple branches, which may be connected through a wide area network (WAN) or another type of network connection. The external section may also represent a different type of computer system, such as a legacy system. If the external section is outside the organization, the external section may belong to an external supplier, such as for business-to-business (B2B) communication or for communication within organizations or companies. The external section also features one or more databases and one or more external applications. These databases and external applications in turn communicate through an external network interface, which could be the Internet, for example.
In order for the internal section and the external section to communicate effectively, data and messages should be passed between them in a suitable data format. However, if different data formats are required, then some type of transformation process must be performed. Such a process can be thought of as a “black box” process, because there is currently no universal, broadly effective solution to the problem. For example, an organization could choose to implement a specific transformation solution, which would transform data in one type of format into another type of data format, and/or which would specifically permit two systems to understand different messaging formats.
One example of a black box solution which is available in the art is the Mercator mapping tools (Mercator Software Inc., USA, acquired by Ascential™, now part of IBM®). This technology enables a programmer to create a specific transformation mechanism from one type of data, such as a proprietary format, to a second type of data, such as a commercial database format. However, it is limited to predefined, fixed transformations, such that each pair of different data types requires the programmer to produce a separate transformation mechanism. Thus, this type of solution clearly has significant disadvantages. Additionally, Mercator uses a centralized broker configuration, which has its own disadvantages: among others, the broker becomes a single point of failure and an administrative bottleneck.
There is thus a widely recognized need for, and it would be highly advantageous to have, a system and method for data format transformation devoid of the above limitations.