The present invention is related to methods and apparatus for exchanging data between parties using different schemas to represent the data. More specifically, the invention is related to an apparatus and a set of processes that automatically translate between two different data representation schemas through the use of one or more shared examples.
1. Background
In any application involving the processing of data that is not centralized, one entity will have to send data to, or receive data from, other entities. The fundamental challenge is that different entities may use different schemas (or formats) to represent data. On the Internet, this data is often described using a structured markup language such as eXtensible Markup Language (or XML). The definition of the format of the data described using XML is captured by an associated Document Type Definition (DTD). Consequently, the two entities may use different XML DTDs to represent the data formats of the corresponding data.
This invention describes a new solution in which the two entities, by sharing one or more examples, are able to perform the translation between the two schemas automatically.
The possible applications, in the Internet domain, include but are not limited to: EDI; search engines; content ingestion; content customization; data delivery; and data retrieval.
2. Statement of Problems With the Prior Art
There are some existing ways of solving this schema translation problem:
1. Explicit Translation Rules: In this case, translation rules between two schemas are manually derived and coded (i.e., as software). A new set of schemas requires new translation rules. Furthermore, writing software is time consuming and expensive.
2. Standardized Schemas: In this case, both entities use a standard schema. Examples include numerous data formats such as Hypertext Markup Language (HTML) and Channel Definition Format (CDF). Nevertheless, agreeing on a common schema is time consuming and difficult.
In summary, the existing techniques for solving the schema translation problem are time consuming and expensive. This does not allow them to be deployed in scenarios where the translation is required for short engagements.
The present invention includes the advantages of:
1. No translation programs need to be written.
2. No standard schemas have to be agreed upon.
3. Examples may be needed anyway to explain the schemas.
An objective of this invention is to provide a mechanism for translating data between different representation schemas by automatic and simple means.
In accordance with the aforementioned needs, the present invention is directed to a method, computer program product or program storage device for automatically generating a translator, and for translating data between different representation schemas using the translator.
An example of a method for data representation schema translation having features of the present invention, comprises the steps of: identifying data encoded in a first data representation schema; converting the data encoded in the first data representation schema to an encoding in a second data representation schema; and automatically generating a translator based on common data encoding, in response to the identifying and converting steps, wherein the translator is adapted for translating data between said first data representation schema and said second data representation schema.
Another example of a method for data representation schema translation having features of the present invention, comprises the steps of: identifying one or more shared examples encoded in two data representation schemas; and automatically generating a translator based on the shared examples, wherein the translator is adapted for translating data between said two data representation schemas.
One embodiment of the present invention comprises the further steps of: parsing data in said one or more shared examples into trees, each tree representing one of the data representation schemas; generating a path table for each tree, in response to the parsing step; wherein the step of automatically generating the translator comprises the step of generating a translation table from path tables generated for each tree.
In a preferred embodiment, the data representation is in the form of a tree and the same data example is encoded in the two different schemas, sA and sB. Let the example data D={x1,x2, . . . ,xi, . . . xn} be encoded as dA and dB in sA and sB, respectively. Each element xi in D, represented in schema sj, has a unique path pji from the root to xi.
We build a path conversion table T that has an entry {pAi, pBi} for each xi in D. This can be done, for example, by making a table of {xi, pAi} for dA in sA and {xi, pBi} for dB in sB. Rows of these tables with common element xi are merged to generate T.
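The table construction described above can be sketched as follows. This is a minimal illustration, not the claimed implementation: it assumes the shared example is available as two XML strings, that each leaf value xi is unique within the example (so that the value itself identifies the row to merge on), and that paths are plain tuples of tag names. The function names `path_table` and `build_translation_table` and the sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET

def path_table(xml_text):
    """Return {x_i: path} mapping each leaf's text to its root-to-leaf tag path."""
    table = {}

    def walk(element, prefix):
        path = prefix + (element.tag,)
        children = list(element)
        if not children:
            # Leaf node: its text is the data element x_i.
            table[(element.text or "").strip()] = path
        for child in children:
            walk(child, path)

    walk(ET.fromstring(xml_text), ())
    return table

def build_translation_table(example_a, example_b):
    """Merge the two per-schema tables on the common element x_i to obtain T."""
    rows_a = path_table(example_a)  # {x_i: p_Ai} for dA in sA
    rows_b = path_table(example_b)  # {x_i: p_Bi} for dB in sB
    return {rows_a[x]: rows_b[x] for x in rows_a if x in rows_b}

# The same example data D encoded in two hypothetical schemas sA and sB.
d_a = "<person><name>Ada</name><contact><email>ada@example.org</email></contact></person>"
d_b = "<user><email>ada@example.org</email><fullName>Ada</fullName></user>"

T = build_translation_table(d_a, d_b)
```

Keying the intermediate tables on the leaf value is what makes the merge a simple dictionary join; an example with duplicate values would need some other way of identifying xi.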
When new data Y={y1, y2, . . . ,yi, . . . ,ym} is encoded in sA as yA, and needs to be converted to yB represented in sB, we first compute the path pAi for each yi in Y. Then we use the path conversion table T to look up the entry with a matching path value in the first column, which gives the resulting path pBi for yi in yB via the second column of T. We incrementally construct the tree for yB by adding the data item yi at path pBi in yB.
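The lookup and incremental construction described above might look like the sketch below. It assumes a path conversion table T is already available as a dictionary from source tag paths to target tag paths (the literal table here is hypothetical), that all target paths share one root tag, and that no element repeats among its siblings. The names `leaf_paths` and `translate` are illustrative only.

```python
import xml.etree.ElementTree as ET

# Hypothetical path conversion table T: first column p_Ai, second column p_Bi.
T = {
    ("person", "name"): ("user", "fullName"),
    ("person", "contact", "email"): ("user", "email"),
}

def leaf_paths(element, prefix=()):
    """Yield (path, text) for every leaf under element."""
    path = prefix + (element.tag,)
    children = list(element)
    if not children:
        yield path, (element.text or "").strip()
    for child in children:
        yield from leaf_paths(child, path)

def translate(y_a_text, table):
    """Build yB by placing each data item y_i at the path p_Bi looked up in the table."""
    root_b = None
    for path_a, value in leaf_paths(ET.fromstring(y_a_text)):
        path_b = table[path_a]  # KeyError means the examples never exercised this path
        if root_b is None:
            root_b = ET.Element(path_b[0])
        node = root_b
        for tag in path_b[1:]:
            child = node.find(tag)
            if child is None:
                # Create intermediate nodes on demand along p_Bi.
                child = ET.SubElement(node, tag)
            node = child
        node.text = value
    return ET.tostring(root_b, encoding="unicode")

y_a = "<person><name>Grace</name><contact><email>grace@example.org</email></contact></person>"
y_b = translate(y_a, T)
```

Swapping the keys and values of the table gives the translation in the opposite direction, provided the mapping is one-to-one.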
The path conversion table T acts as an automatic program for converting data between the two schemas sA and sB.
Alternatively, when the representation is in the form of a graph, a method is used to compute unique paths, for example, depth-first traversal. The depth-first traversal of a rooted graph will give us a tree composed of the tree edges and the nodes, thus reducing the graph to an equivalent tree (reference “Knuth”). Other data representations can be converted to trees or graphs.
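As an illustration of the graph case, the sketch below performs a recursive depth-first traversal of a rooted graph and records, for each node, the path along the tree edges by which it was first reached. The adjacency-list representation, the function name `dfs_paths`, and the sample graph are assumptions made for the example.

```python
def dfs_paths(graph, root):
    """Return {node: unique path from root} using only depth-first tree edges."""
    paths = {}

    def visit(node, path):
        paths[node] = path
        for neighbor in graph.get(node, ()):
            if neighbor not in paths:
                # First visit: (node, neighbor) is a tree edge.
                visit(neighbor, path + (neighbor,))
            # Otherwise the edge is a non-tree edge and is ignored.

    visit(root, (root,))
    return paths

# Hypothetical rooted graph: "c" is reachable two ways and also links back to the root.
graph = {
    "root": ["a", "b"],
    "a": ["c"],
    "b": ["c"],
    "c": ["root"],
}

paths = dfs_paths(graph, "root")
```

The resulting paths depend on the order in which neighbors are listed, so both parties would need to traverse in a consistent order for the paths to be comparable.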
The path computed may be augmented by storing at each node along the path, its position among its siblings. This augmented path allows us to preserve ordering within siblings.
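One possible way to represent such an augmented path is sketched below: each step of the path is a pair of the node's tag and its zero-based position among its siblings. The pair representation, the name `augmented_paths`, and the sample document are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

def augmented_paths(element, position=0, prefix=()):
    """Yield (augmented_path, text) per leaf; each step is (tag, position among siblings)."""
    path = prefix + ((element.tag, position),)
    children = list(element)
    if not children:
        yield path, (element.text or "").strip()
    for index, child in enumerate(children):
        yield from augmented_paths(child, index, path)

# Hypothetical document with repeated sibling elements.
doc = ET.fromstring("<order><item>pen</item><item>ink</item></order>")
rows = list(augmented_paths(doc))
```

Here the two `item` leaves share the same non-augmented path but differ in the position stored at the last step, which is what allows their order to be preserved.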
The example D, or a set of examples {D}, must exercise each possible path p that is to be encountered in subsequent data exchanges. These paths p are the non-augmented paths. Differences in augmented paths can be automatically handled by using machine readable schemas sA and sB or, in the absence of schemas, using heuristics, such as defaults for repetitions of nodes (for example, the default could be that each node appears zero or more times).
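The coverage requirement can be checked mechanically. The sketch below assumes augmented paths are tuples of (tag, position) pairs and that the table is keyed on non-augmented tag paths; it strips the positions and reports any path in new data that the shared examples never exercised. The names `strip_positions` and `uncovered_paths` and the sample data are hypothetical.

```python
def strip_positions(augmented_path):
    """Drop sibling positions, leaving the non-augmented path of tags."""
    return tuple(tag for tag, _position in augmented_path)

def uncovered_paths(table, augmented_data_paths):
    """Return the non-augmented paths in the data that have no entry in the table."""
    seen = {strip_positions(path) for path in augmented_data_paths}
    return sorted(seen - set(table))

# Hypothetical table built from examples that only exercised order/item.
T = {("order", "item"): ("purchase", "line")}

data_paths = [
    (("order", 0), ("item", 0)),
    (("order", 0), ("item", 1)),  # repetition: same non-augmented path, so it is covered
    (("order", 0), ("note", 2)),  # never seen in the examples
]

missing = uncovered_paths(T, data_paths)
```

Treating repeated siblings as covered corresponds to the "zero or more times" default mentioned above; a stricter policy would need the machine readable schemas.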
Depending on the nature of the machine readable schemas sA and sB, data (format) conversions that can be deduced automatically from the two schemas can also be added to this schema conversion process.