Large masses of data reside in multiple databases, applications, file systems, repositories, and specialized data stores. These data stores comprise multiple models of multiple products from multiple vendors or manufacturers, all of which use different data structures and different database management systems, including different user interfaces into their respective underlying databases. The data structures within databases vary even among versions of the same model from the same manufacturer. Adding to the complexity, many data stores are not databases as such; they comprise, for example, repositories of electronic files or documents stored in file systems under hierarchical directory structures.
Data integration is intended to enable a customer using one repository to make use of data residing in another repository. Data integration customers typically need to locate data in a source repository, transform the data from a source format to a destination format, and transfer the data from the source to the destination.
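The three steps named above (locate, transform, transfer) can be sketched in a few lines of Python. This is only an illustrative sketch, not a description of any particular product: the in-memory lists standing in for repositories, the field names, and the functions `locate`, `transform`, `transfer`, and `integrate` are all assumptions introduced for the example.

```python
from typing import Callable

# Hypothetical record type: a flat mapping of field name to value.
Record = dict[str, object]


def locate(source: list[Record], predicate: Callable[[Record], bool]) -> list[Record]:
    """Find the records of interest in the source repository."""
    return [record for record in source if predicate(record)]


def transform(record: Record, field_map: dict[str, str]) -> Record:
    """Rename source fields to destination fields; unmapped fields are not copied."""
    return {dest: record[src] for src, dest in field_map.items() if src in record}


def transfer(records: list[Record], destination: list[Record]) -> None:
    """Write the converted records into the destination repository."""
    destination.extend(records)


def integrate(
    source: list[Record],
    destination: list[Record],
    predicate: Callable[[Record], bool],
    field_map: dict[str, str],
) -> int:
    """Run locate, transform, and transfer in order; return the number of records moved."""
    found = locate(source, predicate)
    converted = [transform(record, field_map) for record in found]
    transfer(converted, destination)
    return len(converted)


# Assumed sample data: two well records in a source format.
source_repo: list[Record] = [
    {"WELL_NM": "A-1", "TD_FT": 9500},
    {"WELL_NM": "B-7", "TD_FT": 4200},
]
destination_repo: list[Record] = []

moved = integrate(
    source_repo,
    destination_repo,
    predicate=lambda r: r["TD_FT"] > 5000,
    field_map={"WELL_NM": "well_name", "TD_FT": "total_depth_ft"},
)
```

In this sketch the mapping between source and destination field names is supplied by the caller, which is where most of the practical difficulty of data integration lies when the source structures differ from vendor to vendor.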
The most ambitious attempt in the prior art to solve the problem of data integration is data warehousing based upon a standard data model. The idea of the standard model is that an industry, for example the seismic data processing industry or the geophysical data processing industry, gathers in committee and agrees on standard data formats for seismic data. The geophysical data processing industry is a good example of the need for data integration because the industry utilizes extremely large volumes of geophysical data regarding wells, well logs, and log curves. If the industry could agree on a standard data model, then the industry could build application programs to convert the multiple data models of the various source databases into one standard model and use the data in that standard form to transfer data among customers.
In one application of a standard model, data in the standard form is physically stored in a central location called a data warehouse which is then made available to subscribing customers who can make use of the data through applications designed to operate against the standard data model. It is useful to note that data warehousing, as the term is usually used in the data integration industry, does not require use of an industry-wide standard model. In fact, many data warehousing projects start with a group within a corporate entity establishing a local standard model for their own internal warehouse. This local standard model may or may not be based on any industry standard. However, when such a local standard model is established and used as a corporate standard, it behaves identically to an industry-based standard with all its inherent flaws and weaknesses.
The standard data model does, to some extent, ease access to data across structure types. The standard data model, however, exhibits problems that seem intractable within the standard model itself. One problem is that the standard data model utilizes a completely static standard structure. That is, there is no method or system within the standard model for giving effect to routine changes in source system data structures. After the structure of a standard model is standardized by an industry standards committee (or a local data management group), the standard model structure is locked in place until changed by the committee. The source data structures in the databases integrated by the standard model, however, change daily. The only way to change the standard model data structures to keep up with the changes in structures in industry databases is to gather a list of desired changes, take them to the industry standards committee, and request changes in the standard model. After the committee approves changes in the standard model, all applications that are to use the new standard model, as well as the software processes, if any, comprising the model itself, must be rewritten, which is an extremely laborious, expensive, and time-consuming process.
A second problem with the standard model is data loss. The static nature of the standard model means that all data structure changes in industry databases not yet integrated into the standard model result in data loss every time data from an external repository is transferred into the standard model. In addition, the fact that the standard model data structure is established by committee means that it is a compromise practically never capable of including all fields from all databases for any record type. Neither the initial implementation of a standard model nor subsequent upgrades typically include all fields from all repositories contributing transferred data for a record type. For these reasons, actual utilization of a standard model for data integration almost always results in data loss.
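The data loss described above can be illustrated with a minimal Python sketch. It assumes a hypothetical standard model whose well record has a fixed set of three fields; the field names, the constant `STANDARD_WELL_FIELDS`, and the function `to_standard` are invented for illustration and do not correspond to any actual industry standard.

```python
# Hypothetical fixed field set of a standard-model well record.
# Once agreed upon, this set stays the same until the committee changes it.
STANDARD_WELL_FIELDS = frozenset({"well_id", "depth_m", "operator"})


def to_standard(record, standard_fields=STANDARD_WELL_FIELDS):
    """Copy a source record into the standard model.

    Returns the stored record together with a sorted list of the source
    fields that had no place in the standard structure and were dropped.
    """
    kept = {name: value for name, value in record.items() if name in standard_fields}
    lost = sorted(set(record) - set(standard_fields))
    return kept, lost


# Assumed source record: the source system has added a field
# ("mud_weight") that the standard model does not yet include.
source_record = {
    "well_id": "W-1",
    "depth_m": 2500.0,
    "operator": "Acme",
    "mud_weight": 1.2,
}

stored, dropped = to_standard(source_record)
```

Here `dropped` contains `mud_weight`: the value exists in the source repository but is absent from the stored record, and it stays absent on every transfer until the standard structure itself is revised.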
For these reasons, and for other good reasons that will occur to the reader, there is an ongoing need for improved methods and systems for data integration.