1. Field of the Invention
The present invention generally relates to data processing and, more particularly, to architecting a relationship between different representations of data.
2. Description of the Related Art
In various organizations, such as research organizations, critical data frequently exists in many different formats and is generated and managed by a diverse set of individuals or groups. These individuals or groups often begin by sharing data informally, for instance, via CD-ROM exchange or email attachments. As the size of such an organization grows, however, these informal data sharing methods become impractical and frequently lead to repeated data capture. Therefore, different methods for sharing data across individuals or groups have been developed to minimize the cost associated with redundant data capture. Such methods aim to facilitate analysis and research against the shared pool of data collected within (and sometimes beyond) the organization. In particular, methods have been developed which are directed towards non-IT (Information Technology) professionals, for making uniquely captured data available for use within the broader organizational context. Accordingly, several existing software solutions are used for managing a shared pool of data.
One approach to managing a shared pool of data is to implement the shared pool of data as a physically centralized database. Databases are computerized information storage and retrieval systems. The most prevalent type of database is the relational database, a tabular database in which data is defined so that it can be reorganized and accessed in a number of different ways. A relational database management system is a computer database management system (DBMS) that uses relational techniques for storing and retrieving data. An object-oriented database is one in which the stored data is congruent with the data defined in object classes and subclasses.
One difficulty in implementing the shared pool of data as a physically centralized database is the requirement of transforming data arriving from an incoming data source of the shared pool of data into a data format which is suitable for the physically centralized database. More specifically, the physically centralized database can be managed using database management software products supporting a variety of data import or load features. Products like IBM's DB2 support data import and load functions that can take data in one of a list of supported data formats and store this data into a target database schema. However, there are a number of limitations to this approach. For instance, data can be imported only into a single database table or a view. Using a view allows imported data to be “spread” across multiple database tables; however, a new view must be created in the database each time a new set of data involving a new set of target tables is required. Furthermore, the set of supported data formats is fixed: data in a format that is not supported must first be converted to a supported format before it can be stored. Moreover, there is little or no accommodation for data cleansing, i.e., the need to transform incoming data into the correct data types and value domains supported by the target database schema. For example, incoming data may represent a gender as “Female”, whereas the gender column in the corresponding target database table may represent gender as “0” and “1” or as “M” and “F”. Furthermore, the mapping rules that guide the importation of data are generally encoded as part of the specific application being used for data import, so changes to the mapping rules, or a new set of mapping rules, require changes to that application. This is especially true of mapping rules that assume a given target database schema: changes to the underlying target schema require changes to the mapping rules.
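The data cleansing problem described above can be sketched as follows. In this minimal, hypothetical example (the column names and mapping rules are illustrative assumptions, not part of any particular product), the rules translating incoming values into the target schema's value domains are kept as data rather than hard-coded in the import application:

```python
# Hypothetical, declarative cleansing rules: each target column maps
# raw incoming values onto the value domain of the target schema.
CLEANSING_RULES = {
    "gender": {"Female": "F", "Male": "M"},
    "smoker": {"yes": 1, "no": 0},
}

def cleanse_row(raw_row):
    """Transform one incoming record into the target schema's value domains."""
    cleansed = {}
    for column, value in raw_row.items():
        rules = CLEANSING_RULES.get(column)
        if rules is None:
            cleansed[column] = value          # no rule for this column: pass through
        elif value in rules:
            cleansed[column] = rules[value]   # translate to the target domain
        else:
            raise ValueError(f"Unmappable value {value!r} for column {column!r}")
    return cleansed

row = cleanse_row({"gender": "Female", "smoker": "no", "age": 42})
# row is {"gender": "F", "smoker": 0, "age": 42}
```

Because the rules live in a data structure rather than in application code, a change to the target schema's value domains only requires editing the rule table, illustrating why encoding mapping rules inside the import application itself is brittle.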
Additionally, all of the data required by the target database schema may not exist in the single file that is the source of a load or import operation; data beyond what is contained in the imported file may also be required.
Another approach to managing a shared pool of data is a federated data approach, which consists of managing the shared pool of data as a federation of different data sources. To this end, the shared pool of data can be implemented as a federated or distributed database, i.e., one that can be dispersed or replicated among different points in a network.
The federated data approach is supported by products like IBM's DB2 Information Integrator, which allow a diverse set of data source types to be viewed and accessed as if they were part of a single, logical relational schema. However, federated query implementations rely on each data source being configured in the query execution engine, where this configuration defines the location and type of each data source. This approach does not work well in an environment where new data sources are continually being created and where the data sources are not in a centrally accessible location. This is, however, typical in organizations where researchers are continuously creating new data sources (e.g., new data spreadsheets) and where data locations are inaccessible to a federated query engine (e.g., data stored on CD-ROM or on a user's private PC). Federated query solutions are also inefficient in cases where there is a desire to aggregate information from a number of different data sources. For instance, assume a situation where 100 researchers each have a set of data to contribute and one wishes to run a query that spans and aggregates information across all of these data sources. In the federated query model, each data source is viewed as a separate logical table, requiring complex join logic and/or a union of data across all of the data sources for any query spanning them. Thus, the federated data approach is not suitable for efficiently sharing data from a large pool of data.
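The aggregation problem described above can be made concrete with a small sketch. Assuming each contributor's data surfaces as a separate logical table (the table and column names below are hypothetical), a query spanning all sources must union every per-source table, so the query text, and the work of maintaining it, grows linearly with the number of sources:

```python
def federated_aggregate_sql(num_sources):
    """Build an aggregate query spanning every per-contributor table in a
    federated schema (hypothetical table names RESEARCHER_1 .. RESEARCHER_N)."""
    # Each data source appears as its own logical table, so a spanning
    # query must explicitly union all of them.
    union = "\nUNION ALL\n".join(
        f"SELECT sample_count FROM RESEARCHER_{i}"
        for i in range(1, num_sources + 1)
    )
    return f"SELECT SUM(sample_count) FROM (\n{union}\n) AS all_sources"

# For 100 contributing researchers, the query unions 100 tables.
sql = federated_aggregate_sql(100)
```

By contrast, with a physically centralized store the same aggregate is a single scan of one table; the sketch only illustrates the query-complexity growth noted above.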
Therefore, there is a need for a more efficient method for managing a shared pool of data consisting of heterogeneous data sources in order to improve data sharing.