It is a common requirement to combine formatted data originating from potentially disparate data sources employing different data formats into a single common data categorization scheme. For example, product information for different, related or identical products originating from different product suppliers can be formatted differently by each supplier. A single recipient of such data (such as a central Product Information Management System) requires a common data categorization scheme suitable for receiving and categorizing the differing product information in a useful way.
FIG. 1 illustrates a mapping system for data originating from disparate data sources to a common data categorization scheme in the prior art. Disparate data sources 102, 104 and 106 provide data items having differing formats. For example, data sources 102, 104 and 106 provide data items formatted as XML documents, although alternative data formats could be used. Such data items are formatted to include one or more data elements such that each data element includes an element name and an element value. For example, a simple data item can be defined in XML as:
<ProductType> Apple </ProductType><Color> GREEN </Color>
Exemplary Data Item 1
Exemplary data item 1 has two elements: ProductType and Color. The ProductType element has an associated value of “Apple” and the Color element has an associated value of “GREEN”. The particular format of the elements of data (i.e. their names, data types and nesting) will differ between data sources. For example, an alternative format for a similar data item might be:
<UniqueID> 8392 </UniqueID><Specification><Type> Apple </Type><Variety> Granny Smith </Variety><Color> GREEN </Color><Size> LARGE </Size><Origin> United Kingdom </Origin></Specification>
Exemplary Data Item 2
Exemplary data item 2 includes further details as additional data elements nested within a Specification data element, including Type, Variety, Size and Origin. A requirement exists for all data to be available in a single common data categorization scheme 114, such as a Product Information Management System. To categorize data from each of the disparate data sources 102, 104 and 106, it is necessary to first define an appropriate categorization structure of the common categorization scheme 114. An appropriate categorization structure includes a definition of one or more categories, and such a definition will be partly determined by the data items themselves. For example, factors such as the particular data elements present in data items and the values associated with such elements will influence the structure of the common categorization scheme 114.
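By way of illustration only, the structural difference between exemplary data items 1 and 2 can be made concrete by listing the path of each data element that carries a value. The following Python sketch is not part of the arrangement of FIG. 1; the function name element_paths, the synthetic Item wrapper element and the use of '->' as a path separator are assumptions made purely for the example.

```python
import xml.etree.ElementTree as ET

# Exemplary data items 1 and 2, copied from the text above.
ITEM_1 = "<ProductType> Apple </ProductType><Color> GREEN </Color>"
ITEM_2 = (
    "<UniqueID> 8392 </UniqueID><Specification><Type> Apple </Type>"
    "<Variety> Granny Smith </Variety><Color> GREEN </Color>"
    "<Size> LARGE </Size><Origin> United Kingdom </Origin></Specification>"
)


def element_paths(fragment):
    """Return {path: value} for every leaf data element of an XML fragment."""
    # The fragments have several top-level elements, so they are wrapped in a
    # synthetic <Item> root (an assumption made only so that they parse).
    root = ET.fromstring("<Item>" + fragment + "</Item>")
    paths = {}

    def walk(element, prefix):
        for child in element:
            path = prefix + [child.tag]
            if len(child) == 0:
                # Leaf element: record its name path and trimmed value.
                paths["->".join(path)] = (child.text or "").strip()
            else:
                # Element with nested data elements: descend into it.
                walk(child, path)

    walk(root, [])
    return paths
```

Under these assumptions, element_paths(ITEM_1) yields the root paths ProductType and Color, whereas element_paths(ITEM_2) yields UniqueID together with five paths beginning with Specification->, which shows that the same kind of information sits at different names and nesting depths in the two formats.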
It is further necessary to define data mappings 108, 110 and 112 to map data items from each data source 102, 104 and 106 to the common categorization scheme 114. The data mappings 108, 110 and 112 define how elements of formatted data items from each of the data sources 102, 104 and 106 are to be categorized into the common data categorization scheme 114. For example, an appropriate categorization for the exemplary data items above may be to categorize by the product type, which is defined as ProductType in data item 1 and Specification->Type (‘->’ indicating that Type is a nested element within Specification) in data item 2. This represents an element with a common meaning in both data items, despite the naming and nesting conventions differing. The ProductType element in data item 1 is a root data element (i.e. it is not nested within other data elements), whilst the Type element in data item 2 is a nested data element (within Specification). Such differing formats of data items and differing terminology in the naming of data elements require that each data item be mapped into such a categorization scheme using mappings 108, 110 and 112, which may need to be defined manually.
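A manually defined mapping of this kind might be sketched as follows. This is a minimal Python illustration under stated assumptions, not a definitive implementation of mappings 108, 110 and 112: the source labels source_102 and source_104, the association of each exemplary data item with a particular data source, the representation of data items as nested dictionaries, and the names MAPPINGS, lookup and categorize are all hypothetical.

```python
# Exemplary data items 1 and 2 represented as nested dictionaries
# (a simplification of the XML shown above).
ITEM_1 = {"ProductType": "Apple", "Color": "GREEN"}
ITEM_2 = {
    "UniqueID": "8392",
    "Specification": {
        "Type": "Apple",
        "Variety": "Granny Smith",
        "Color": "GREEN",
        "Size": "LARGE",
        "Origin": "United Kingdom",
    },
}

# One hand-written mapping per data source (hypothetical labels). Each
# mapping names the element path that holds the product type in that
# source's format.
MAPPINGS = {
    "source_102": ("ProductType",),            # root data element
    "source_104": ("Specification", "Type"),   # nested data element
}


def lookup(item, path):
    """Follow a path of element names down through nested data elements."""
    for name in path:
        item = item[name]
    return item


def categorize(scheme, source, item):
    """File an item under its product type in the common scheme."""
    # Raises KeyError if no mapping has been defined for the source.
    category = lookup(item, MAPPINGS[source])
    scheme.setdefault(category, []).append(item)
    return category
```

With an initially empty dictionary as the common scheme, both exemplary data items are filed under the category Apple even though one mapping names a root element and the other a nested element. A data item from a source with no entry in MAPPINGS cannot be categorized until a further mapping is written by hand.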
Whilst the arrangement of FIG. 1 is effective in providing a common data categorization 114 for data items having different formats, the arrangement is reliant upon an appropriately defined common categorization scheme 114 and appropriately defined mappings 108, 110 and 112 between each data source and the common categorization scheme 114. These mappings need to be defined in view of both the format of data from a data source and the required categorization. There is therefore a tight coupling between the data sources 102, 104 and 106 and the common categorization scheme 114 in the form of the mappings 108, 110 and 112. This approach further incurs a high overhead in defining the mappings, which can be both time-consuming and costly. An additional disadvantage of such a tightly coupled arrangement is that the common categorization scheme 114 cannot be defined dynamically in response to a new data item having a new format. Such a new format would require the definition of a new mapping and possible amendments to the categorization scheme 114 itself to accommodate the new data item.
Thus it would be advantageous to provide for the dynamic generation of a categorization scheme for data items of disparate origin and/or format without a need for the generation of an intermediate mapping between data items and the categorization scheme.