The present invention relates to data transfer, and more particularly to an automated system to effect data interchange.
Networks and networked applications have grown dramatically in number, size and complexity over the past decade. While the Internet is the most prominent example, internal LANs (Intranets) and distributed computing are also part of this growth. By definition, all networked applications need to send and receive information over a network, often communicating with other applications. The great variety of formats in existence makes integration of applications and data sources a difficult and expensive problem. Current data encoding standards are constantly replaced by newer technologies, further complicating the problem of providing connectivity between network nodes. From bit-encodings of low-level network transport protocols to HTML and XML, the problem of data and protocol translation is a complex and difficult one, because of the need to provide both high flexibility and high performance.
One of the more recent data encoding formats enjoying wide adoption, especially on the Internet, has been XML, a part of the SGML family of document description languages.
The proliferation of interconnected sites or domains known as the World Wide Web ("Web") was initially developed largely using the document description language known as HyperText Markup Language (HTML). HTML was used predominantly to specify the structure of Web documents or "pages" using logical terms. HTML, however, has inherent limitations for Web document design, primarily resulting from the limited, predefined tags available in HTML to define elements in a document or page. Nonetheless, HTML-defined documents continue to exist in significant quantities on the Web.
EXtensible Markup Language (XML) was developed as a document format protocol or language for the Web that is more flexible than HTML. XML allows tags used to define elements of a page or document to be flexibly defined by the developer of the page. Thus Web pages can be designed to effectively function like database records with selectively defined tags representing data items for specific applications (e.g. product code, department, price in the context of a purchase order or invoice document or page).
In the world of Web content, the use of XML is growing as it becomes the preferred data format in both business-to-business (B2B) and business-to-consumer (B2C) Web commerce sectors (e-business). The tremendous and continuing growth of XML in B2B applications has led to a great number of different XML e-business vocabularies and schemas. There are standardization efforts driven by industry associations, consortia, governments, academia and even the United Nations. Merely storing or transmitting e-business data "in XML" is not a guarantee of interoperability between e-business commercial entities or sites. Even the method of specifying a particular structure for an XML document has not been agreed upon, with several incompatible methods in wide use. It is therefore necessary to perform conversions between different XML formats to achieve server-to-server transfer of invoices, purchase orders and other business data in the e-business context. The problem of interoperability is exacerbated by the commingling of XML and HTML e-business sites on the Web.
Successful B2B and B2C sites are being called upon to support a greater variety of clients and client protocols. That is, sites must be accessible by different browsers running on clients, e.g. Netscape or Internet Explorer, and by different versions of these (and other) browsers. Additionally, the nature of clients and client protocols is changing and adding to the problem of interoperability. Different clients, in the form of Personal Digital Assistants (PDAs) and WAP (Wireless Application Protocol) enabled cellular phones, process XML content but need to convert it to different versions of HTML and WAP to ensure a broad and seamless reach across all kinds of web clients, from phones to powerful Unix workstations. As the diversity of web-connected devices grows, so grows the need to provide dynamic conversion, such as XML-to-HTML and XML-to-WAP, for e-business applications.
The World Wide Web Consortium has defined eXtensible Stylesheet Language (XSL) as a standard method for addressing both XML-HTML and XML-XML conversions. There are several freely available and commercial XSL processor implementations for Java and C/C++ e-business applications. However, standards-compliance, stability and performance vary widely across implementations. Additionally, even the fastest current implementations are much slower than necessary to meet the throughput requirements for either B2C or B2B applications. The great flexibility provided by XML encoding generally means that such conversions are complex and time-consuming.
The XSL World Wide Web Consortium Recommendation, which addresses the need to transform data from one XML format into another or from an XML format into an HTML or other "output" format, as currently specified includes three major components in an XSL processor: an XSL transformation engine (XSLT), a node selection and query module (XPath), and a formatting and end-user presentation layer specification (Formatting Objects). XML-to-XML data translation is primarily concerned with the first two modules, while the Formatting Objects are most important for XML-to-HTML or XML-to-PDF document rendering. A typical XSL implementation comprises a parser for the transform, a parser for the source data, and an output stream generator: three distinct processes. Known XSL transformation engines (XSLT) typically rely on recursive processing of trees of nodes, where every XML element, attribute or text segment is represented as a node. Because of this, implementations suggested in the prior art simply optimize the transformation algorithms and will necessarily result in limitations on performance.
An XSL stylesheet is itself an XML file that contains a number of template-based processing instructions. The XSLT processor parses the stylesheet file and applies any templates that match the input data. It operates by conditionally selecting nodes in an input tree, transforming them in a number of ways, copying them to an output tree and/or creating new nodes in the output tree. Known XSLT implementations suffer from severe performance limitations. While suitable for Java applets or small-scale projects, they are not yet fit to become part of the infrastructure. Benchmarks of the most popular XSLT processors show that throughput of 10-150 kilobytes/second is typical. This is 10 times slower than an average diskette drive and roughly equivalent to a 128 Kbit/s ISDN line. Many websites today have sustained bandwidths at or above T1 speeds (1500 Kbit/s) and the largest ones require 100 Mbit/s or faster connections to the Internet backbone. Clearly, unless XSLT processing is to become the chief performance barrier in B2C and B2B operations, its performance has to improve by orders of magnitude.
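The gap between the quoted XSLT throughput and the quoted link speeds can be checked with simple unit conversion. The following is a minimal sketch, assuming 1 Kbit = 1000 bits and 1 kilobyte = 1000 bytes; the function and constant names are illustrative only and do not come from the specification.

```python
# Unit conversion for the throughput figures quoted in the text.
# Assumption: 1 Kbit = 1000 bits, 1 kilobyte = 1000 bytes, 8 bits per byte.

def kbit_to_kbyte_per_s(kbit_per_s: float) -> float:
    """Convert a link speed in Kbit/s to kilobytes/s."""
    return kbit_per_s / 8.0

# Figures taken from the text above.
XSLT_LOW_KBYTE_S = 10.0    # low end of typical XSLT throughput
XSLT_HIGH_KBYTE_S = 150.0  # high end of typical XSLT throughput
LINKS_KBIT_S = {"ISDN": 128.0, "T1": 1500.0, "100 Mbit/s": 100_000.0}

def speedup_needed(link_kbit_s: float, xslt_kbyte_s: float) -> float:
    """Factor by which the processor must speed up to saturate the link."""
    return kbit_to_kbyte_per_s(link_kbit_s) / xslt_kbyte_s

if __name__ == "__main__":
    for name, kbit in LINKS_KBIT_S.items():
        kbyte = kbit_to_kbyte_per_s(kbit)
        print(f"{name}: {kbyte:.1f} kilobytes/s, "
              f"needs {speedup_needed(kbit, XSLT_HIGH_KBYTE_S):.1f}x to "
              f"{speedup_needed(kbit, XSLT_LOW_KBYTE_S):.1f}x the quoted XSLT rate")
```

Under these assumptions a 100 Mbit/s link carries 12500 kilobytes/s, which is between roughly 83 and 1250 times the quoted XSLT throughput range.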
There are a number of reasons for such poor performance. To transform one XML vocabulary to another, the processor must parse the transform, parse the source data, walk the two parse trees to apply the transform and finally output the data into a stream. Some of the better implementations allow the transform parsing as a separate step, thereby avoiding the need to repeat that step for every document or data record to be processed by the same transform. However, the transformation step is extremely expensive and consumes an overwhelming portion of processing time. Because XSLT relies on recursive processing of trees of nodes, where every XML element, attribute or text segment is represented as a node, merely optimizing the implementation of the algorithms cannot attain the necessary results. Thus current state-of-the-art XSLT implementations have to sacrifice performance in order to maintain the flexibility that is the very essence of XSLT and XML itself. So while XML and XSLT offer greater flexibility than older data interchange systems through the use of direct translation, self-describing data and dynamic transformation stylesheets, this flexibility comes with a great performance penalty.
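The parse, walk and output steps described above can be illustrated with a small tree-based translator. This is only a sketch of the general node-tree approach, not an XSLT processor: the tag-renaming table stands in for a parsed stylesheet, and the names TEMPLATES, transform and translate are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a parsed transform: source tag -> output tag.
TEMPLATES = {"order": "purchaseOrder", "sku": "productCode", "cost": "price"}

def transform(node: ET.Element) -> ET.Element:
    """Recursively copy one source node into a new output tree, renaming tags."""
    out = ET.Element(TEMPLATES.get(node.tag, node.tag), dict(node.attrib))
    out.text = node.text
    out.tail = node.tail
    for child in node:  # every child element is visited as a separate node
        out.append(transform(child))
    return out

def translate(source_xml: str) -> str:
    source_tree = ET.fromstring(source_xml)              # parse the source data
    output_tree = transform(source_tree)                 # walk it, build the output tree
    return ET.tostring(output_tree, encoding="unicode")  # write the output stream

if __name__ == "__main__":
    print(translate("<order><sku>A1</sku><cost>9.50</cost></order>"))
```

Even in this reduced form, each document requires building an input tree, allocating a second tree node by node, and serializing the result, which is the per-node overhead the text identifies as the bottleneck.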
Other known transformation or translation solutions implement "middleware" translation mechanisms. As represented in FIG. 1, in the middleware solution of the prior art, a large number of different platforms A-F, 101, 107 each may be arranged to communicate with each other. Each platform implements a format translator 103 to convert data streams between the local platform 101 and an agreed or common middleware format Z. The data stream in format Z can then be exchanged with any other node in the network. Each receiving node 107 then uses its own platform specific translator 105 to convert the data streams into a format preferred by the receiving node. Disadvantageously, such solutions require platform specific static drivers for each format. Conversion is laboriously performed by converting from the first platform format or protocol (A) to the common middleware format (Z) and then converting from the middleware format to the second platform protocol. In addition to the deficiencies in terms of time to effect such conversions, if formats change there is a need to stop or interrupt platform operations and install modified drivers in accordance with the format change(s). So while performance is often better than that of XML/XSLT solutions, flexibility is almost non-existent; performance is also considerably worse than that possible by using direct translation operating on the same formats.
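The two-step conversion through a common format Z can be sketched as follows. The record layouts, field names and driver tables are assumptions made purely for illustration; they are not taken from FIG. 1 or from any particular middleware product.

```python
from typing import Callable, Dict

Record = Dict[str, str]

# Hypothetical static drivers: each platform ships one converter to the
# common format Z and one converter from it.
TO_COMMON: Dict[str, Callable[[Record], Record]] = {
    "A": lambda r: {"id": r["order_no"], "amount": r["total"]},
    "B": lambda r: {"id": r["OrderID"], "amount": r["Amount"]},
}
FROM_COMMON: Dict[str, Callable[[Record], Record]] = {
    "A": lambda z: {"order_no": z["id"], "total": z["amount"]},
    "B": lambda z: {"OrderID": z["id"], "Amount": z["amount"]},
}

def exchange(src: str, dst: str, record: Record) -> Record:
    """Convert a record from platform src to platform dst via format Z."""
    common = TO_COMMON[src](record)   # first conversion: source format -> Z
    return FROM_COMMON[dst](common)   # second conversion: Z -> destination format

if __name__ == "__main__":
    print(exchange("A", "B", {"order_no": "17", "total": "9.50"}))
```

Every exchange pays for two conversions, and a change to either platform's format means replacing that platform's entries in both tables, which mirrors the driver reinstallation burden described above.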
Direct translation between two different formats or, more generally, two different protocols is the oldest method of achieving data interchange. By writing custom computer source code that is later compiled and installed on the target platform, it is possible to achieve interoperability between two different data formats. If the source code is carefully tuned by someone very skilled in the art, the resulting translator will be a high-performance one. However, it will not work if any change in data format or protocol occurs, and will require additional programming and installation effort to adapt to any such change. Direct translation can offer excellent performance, but it is even less flexible than the static adapters used by "middleware" systems.
Instead of a static adapter or custom-coded direct translator, it is the use of some kind of data or protocol description that can offer greater flexibility and, thereby, connectivity. U.S. Pat. No. 5,826,017 to Holzmann (the Holzmann implementation) generically describes a known apparatus and method for communicating data between elements of a distributed system using a general protocol. The apparatus and method employs protocol descriptions written in a device-independent protocol description language. A protocol interpretation means or protocol description language interpreter executes a protocol to interpret the protocol description. Each entity in a network must include a protocol apparatus that enables communication via a general protocol for any protocol for which there is a protocol description. The general protocol includes a first general protocol message which includes a protocol description for a specific protocol. The protocol apparatus at a respective entity or node in a network which receives the first protocol message employs a protocol description language interpreter to interpret the included protocol description and thereby execute the specific protocol.
Again, disadvantageously, the Holzmann implementation requires a protocol apparatus at each networked entity to interpret the protocol description. That is, the implementation is "node-centric" in that each node requires and depends on a respective translation function to a predetermined and fixed target format. Clearly, if one has the ability to equip every node in the network with a protocol interpreter such as the one described, one could conceivably equip every node in the network with a much simpler standard protocol stack to enable communication. On vast global networks, such as the Internet, it is practically impossible to change all network nodes over to a new protocol or data format, and this in turn drives the need for data interchange methods and devices.
Additionally, the implementation involves interpretation of protocol descriptions, which is a very resource-consuming process. The trade-off of Holzmann is quite similar to that made by XML/XSLT: by using self-describing data packets and a generalized interpreter, the implementation sacrifices a great deal of performance to achieve better flexibility and interoperability. Also Holzmann does not address the needs of next-generation Layer 6 and Layer 7 protocols (such as those based on XML-encoded data) for protocol translation, dealing instead with lower-level (Layer 3) protocols only.
The existing solutions to the general problem of exchanging data between disparate systems and enabling connectivity between networked applications provide either performance or flexibility, but never both.
Further disadvantages of the existing solutions include the fact that their performance is limited by the requirements of static interpretation between limited sets of static constructs. The higher the performance of the typical interpreter, the less flexibility its designers permit in the specifications of the formats. Also, even where the prior art has made provisions for adapting a format specification to changes, only one side of a specification can be changed while the other remains fixed. However, this generates a further disadvantage since it creates a "node-centric" system requiring all nodes to be changed in order to accommodate each new format specification. In addition, the typical data translators that operate as interpreters are relegated to the more stable protocols in the lower layers of the OSI model, thus severely limiting their usefulness in a rapidly changing environment.
The present invention provides a high level transformation method and apparatus for converting data formats in the context of e-business applications, among other places. A flexible transformation mechanism is provided that facilitates generation of translation code on the fly.
According to the invention, a data translator is dynamically generated by a translator compiler engine. The translator compiler engine receives a data map (DMAP) and a pair of formal machine-readable format descriptions (FMRFDs). The first FMRFD is a formal description for data coming from a source node and the second FMRFD is a formal description of data for a destination node. All three data structures (i.e. the two dynamically selected FMRFDs and the DMAP) are used to generate executable machine code (i.e. object code), for running on the CPU of the host platform, to effect the translation from the source format to the destination format. When fed an input data stream, the data translator generates an output data stream by executing the native object code (which was previously generated on-the-fly by the translator compiler engine). In addition, the data translator may be configured to perform a bidirectional translation between the two streams.
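The idea of generating a translator from two format descriptions and a data map can be sketched in a few lines. This is an illustration under strong assumptions, not the claimed implementation: each FMRFD is reduced to an ordered list of field names for a comma-separated record, the DMAP is a dictionary from destination field to source field, and Python's built-in compile() and exec() stand in for the generation of native object code. The name compile_translator is hypothetical.

```python
from typing import Callable, Dict, List

def compile_translator(src_fmrfd: List[str],
                       dst_fmrfd: List[str],
                       dmap: Dict[str, str]) -> Callable[[str], str]:
    """Generate and compile a translator specialized for one format pair.

    src_fmrfd, dst_fmrfd: ordered field names of comma-separated records
    (a simplified stand-in for a formal format description).
    dmap: destination field name -> source field name.
    """
    # Resolve every field lookup once, at generation time, so the generated
    # code contains only fixed indices and no per-record name lookups.
    indices = [src_fmrfd.index(dmap[name]) for name in dst_fmrfd]
    parts = ", ".join(f"f[{i}]" for i in indices)
    source = (
        "def translate(record):\n"
        "    f = record.split(',')\n"
        f"    return ','.join([{parts}])\n"
    )
    namespace: dict = {}
    # Python bytecode here plays the role of the generated object code.
    exec(compile(source, "<generated translator>", "exec"), namespace)
    return namespace["translate"]

if __name__ == "__main__":
    translate = compile_translator(
        ["sku", "qty", "price"],                    # source format description
        ["price", "product_code"],                  # destination format description
        {"price": "price", "product_code": "sku"},  # data map
    )
    print(translate("A1,3,9.50"))
```

The point of the sketch is the division of work: the descriptions and the map are consulted once, when the translator is generated, and the generated function then runs on each record without interpreting them again.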
In further accord with the invention, formal machine-readable format descriptions (FMRFDs) can be defined for each data format and/or network protocol. An FMRFD describes the structural layout of the packets or data streams or other data structures being translated. An FMRFD may also include descriptions of a protocol, being a sequence of data structures being exchanged. These FMRFDs may be manually or semi-automatically loaded into the system by operators familiar with each node, or may be developed, discovered or modified automatically during communication exchanges. For example, a table of FMRFDs can be configured for each node, and a new translator created on the fly for each new FMRFD-pair encountered. Alternatively, a translator can be built for specified packet types exchanged between nodes, and applied as the corresponding packet type is encountered. As another alternative, a translator can be supplied or generated according to the source and destination node identifiers, along with identified protocols, formats, and schemas. The translator is then re-used for further transactions between the identified communicants. Furthermore, a set of predefined or standardized schemas may be accessed according to transaction types.
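Creating a translator for each new FMRFD pair and re-using it for later transactions amounts to a keyed cache. The following is a minimal sketch, assuming formats are identified by simple string names; TranslatorTable and build_stub are hypothetical names, and build_stub is only a placeholder for real translator generation.

```python
from typing import Callable, Dict, Tuple

Translator = Callable[[str], str]

class TranslatorTable:
    """Builds a translator on first use of a format pair, then re-uses it."""

    def __init__(self, build: Callable[[str, str], Translator]) -> None:
        self._build = build
        self._cache: Dict[Tuple[str, str], Translator] = {}
        self.builds = 0  # number of translators actually generated

    def get(self, src_format: str, dst_format: str) -> Translator:
        key = (src_format, dst_format)
        if key not in self._cache:
            # New format pair encountered: generate a translator once.
            self._cache[key] = self._build(src_format, dst_format)
            self.builds += 1
        return self._cache[key]

def build_stub(src: str, dst: str) -> Translator:
    """Placeholder for real generation; it only tags the data it is given."""
    return lambda data: f"[{src}->{dst}] {data}"

if __name__ == "__main__":
    table = TranslatorTable(build_stub)
    first = table.get("formatA", "formatB")
    second = table.get("formatA", "formatB")  # served from the cache
    print(first("invoice 17"), table.builds, first is second)
```

The same structure could be keyed by packet type or by source and destination node identifiers instead of format names, matching the alternatives listed above.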
In another illustrative embodiment, where the protocol is XML (eXtensible Markup Language), and the conversion map is described by an XSL (eXtensible Stylesheet Language) file, an XML stream translator can be completely replaced or augmented by an optimized translator operated according to the present invention. Machine instructions, in object code, are directly executed and produce the desired output. This illustrative embodiment comprises an optimized contiguous memory algorithm, the performance of which approaches that of a memory-to-memory copy utility at speeds orders of magnitude faster than an XSLT. However, unlike a hardwired optimization, which trades flexibility for performance, the present invention preserves the flexibility through the dynamic use of the FMRFDs derived from the XSL and their corresponding data map (DMAP).
Features of the invention include provision of a data translation mechanism that is not node-centric and avoids the need for a translation apparatus or mechanism at each networked entity. The method and apparatus facilitates the efficient exchange of data between network nodes of different protocols by dynamically adapting to protocol and format changes. The present invention provides a unique solution to the growing problem of integrating disparate or incompatible computer systems, file formats, network protocols or other machine data. It allows many more formats and protocols to be accommodated transparent to the users. The mechanism is flexible in that any protocol or format that can be formally described can be used. Older systems can be retrofitted according to the invention to take advantage of next generation protocols. High performance is obtained from dynamic code generation. The need to create, install, and maintain individual, customized translators is obviated thus providing flexibility and high performance in the same data exchange apparatus.