Large masses of data reside in multiple databases, applications, file systems, repositories, or specialized data stores. These data stores comprise multiple models of multiple products from multiple vendors or manufacturers, all of which use different data structures and different database management systems, including different user interfaces into their respective underlying databases. Data structures within databases vary even among versions of the same model from the same manufacturer. Adding to the complexity, many data stores are not databases as such but comprise, for example, repositories of electronic files or documents stored in file systems under hierarchical directory structures.
Data integration is intended to enable a customer using one repository to make use of data residing in another repository. Data integration customers typically need to locate data in a source repository, transform the data from a source format to a destination format, and transfer the data from the source to the destination.
The most ambitious attempt in prior art to solve the problem of data integration is data warehousing based upon a standard data model. The idea of the standard model is that an industry, for example the seismic data processing industry or the geophysical data processing industry, gathers in committee and agrees on standard data formats for seismic data. The geophysical data processing industry is a good example of the need for data integration because the industry utilizes extremely large volumes of geophysical data regarding wells, well logs, and log curves. If the industry could agree on a standard data model, then the industry could build application programs to convert the multiple data models from various source databases into one standard model and use the data in standard form to transfer data among customers.
In one application of a standard model, data in the standard form is physically stored in a central location called a data warehouse which is then made available to subscribing customers who can make use of the data through applications designed to operate against the standard data model. It is useful to note that data warehousing, as the term is usually used in the data integration industry, does not require use of an industry-wide standard model. In fact, many data warehousing projects start with a group within a corporate entity establishing a local standard model for their own internal warehouse. This local standard model may or may not be based on any industry standard. However, when such a local standard model is established and used as a corporate standard, it behaves identically to an industry-based standard with all its inherent flaws and weaknesses.
The standard data model does, to some extent, ease access to data across structure types. The standard data model, however, demonstrates problems that seem intractable within the standard model itself. One problem is that the standard data model utilizes a completely static standard structure. That is, there is no method or system within the standard model for giving effect to routine changes in source system data structures. After the structure of a standard model is standardized by an industry standards committee (or a local data management group), the standard model structure is locked in place until changed by the committee. The source data structures in the databases integrated by the standard model, however, change daily. The only way to change the standard model data structures to keep up with the changes in structures in industry databases is to gather a list of desired changes, take them to the industry standards committee, and request changes in the standard model. After the committee approves changes in the standard model, all applications desiring to use the new standard model, as well as the software processes, if any, comprising the model itself, must be rewritten, an extremely laborious, expensive, and time-consuming process.
A second problem with the standard model is data loss. The static nature of the standard model means that all data structure changes in industry databases not yet integrated into the standard model result in data loss every time data from an external repository is transferred into the standard model. In addition, the fact that the standard model data structure is established by committee means that it is a compromise practically never capable of including all fields from all databases for any record type. Neither the initial implementation of a standard model nor subsequent upgrades typically include all fields from all repositories contributing transferred data for a record type. For these reasons, actual utilization of a standard model for data integration almost always results in data loss.
For these reasons, and for other good reasons that will occur to the reader, there is an ongoing need for improved methods and systems for data integration.
Aspects of the present invention include methods, systems, and products for data integration based upon dynamic common models. Aspects of the present invention typically include adapters as data communications interfaces between native data repositories and data integration applications. Aspects of the present invention typically include loose coupling between adapters and data integration applications. Aspects of the invention are summarized here in terms of methods, although persons skilled in the art will immediately recognize the applicability of this summary equally to systems and to products.
A first aspect of the invention includes methods of data integration including extracting a first native record from a first native repository, through a first adapter for the first native repository. In typical embodiments, the first adapter is loosely coupled for data integration to a data integration application, wherein the first native record from the first native repository has a first native format, and the first native format belongs to a category of formats identified as a datatype.
Typical embodiments include transforming, through the first adapter, the format of the first native record having the first native format to a dynamic common format, the dynamic common format being a subset of a dynamic common model, the dynamic common model comprising mappings specifying transformations to and from the dynamic common format for all data elements in all formats of all native records in all datatypes, whereby is produced a first native record having the dynamic common format.
Typical embodiments include transforming, through a second adapter, the format of the first native record having the dynamic common format from the dynamic common format to a second native format of a second native repository, the second native format belonging to a category of formats identified as a datatype, wherein the second adapter is loosely coupled for data integration to the data integration application, whereby is produced a first native record having attributes in the second native format. Typical embodiments include inserting, through the second adapter, the first native record having the second native format into the second native repository.
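The extract, transform, transform, and insert sequence described above can be illustrated with a minimal sketch. Everything named here is hypothetical and chosen only for illustration: the Adapter class, its method names, the field names, and the simplification that a format is a flat dictionary whose fields are merely renamed. It is not the implementation of any embodiment.

```python
from dataclasses import dataclass
from typing import Dict, List

Record = Dict[str, object]


@dataclass
class Adapter:
    """Hypothetical adapter: the only component that touches its native repository."""
    repository: List[Record]
    to_common: Dict[str, str]  # native field name -> common field name (assumed mapping)

    def extract(self, index: int) -> Record:
        # Return a copy so the native repository is not modified by later steps.
        return dict(self.repository[index])

    def native_to_common(self, record: Record) -> Record:
        # Rename native fields to their dynamic common format names.
        return {self.to_common[k]: v for k, v in record.items() if k in self.to_common}

    def common_to_native(self, record: Record) -> Record:
        # Invert the mapping to go from the common format back to this native format.
        from_common = {c: n for n, c in self.to_common.items()}
        return {from_common[k]: v for k, v in record.items() if k in from_common}

    def insert(self, record: Record) -> None:
        self.repository.append(record)


def integrate(source: Adapter, destination: Adapter, index: int) -> Record:
    """Move one record between repositories using only the adapter interface."""
    native = source.extract(index)
    common = source.native_to_common(native)
    converted = destination.common_to_native(common)
    destination.insert(converted)
    return converted


# Two repositories that name the same data elements differently (invented names).
first = Adapter(
    repository=[{"WELL_NM": "A-1", "DEPTH_FT": 9500}],
    to_common={"WELL_NM": "well_name", "DEPTH_FT": "total_depth"},
)
second = Adapter(
    repository=[],
    to_common={"wellName": "well_name", "td": "total_depth"},
)

result = integrate(first, second, 0)
print(result)
```

In this sketch the integrate function never refers to either repository's field names, which is one simple way to picture the loose coupling between adapters and the data integration application.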
Other aspects of the invention include methods of creating systems implementing a dynamic common model, the systems typically including data integration applications, the methods typically including developing a first adapter for a first native repository, the first adapter being loosely coupled for data integration to the data integration application, the first native repository comprising first native records having first native formats, the first native formats belonging to categories of formats identified as datatypes. Typical embodiments further include developing a second adapter for a second native repository, the second adapter being loosely coupled for data integration to the data integration application, the second native repository comprising second native records having second native formats, the second native formats belonging to categories of formats identified as datatypes.
Typical embodiments include creating mappings specifying transformations of records: from the first native format to a first dynamic common format, from the first dynamic common format to the first native format, from the second native format to a second dynamic common format, and from the second dynamic common format to the second native format. Typical embodiments also include providing a transformation service capable of transforming formats in dependence upon the mappings, the transformation service coupled for data communications to the first adapter and to the second adapter. In typical embodiments, the data integration application is coupled for data communications to a multiplicity of native repositories through a multiplicity of adapters, and the multiplicity of adapters includes the first adapter and the second adapter.
In typical embodiments, all the adapters among the multiplicity of adapters are loosely coupled for data integration to the data integration application, and the data integration application includes the transformation service. In typical embodiments the dynamic common format is a subset of a dynamic common model, and the dynamic common model has the capability of specifying transformations to and from the dynamic common format for all formats of records in all datatypes of the multiplicity of native repositories.
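A transformation service that operates in dependence upon mappings might be sketched as follows. The class name TransformationService, the format labels, the field names, and the representation of a mapping as a dictionary of per-field rules are all assumptions made for this example. For brevity, both native formats map to a single hypothetical common format, whereas the text above speaks of a first and a second dynamic common format.

```python
from typing import Callable, Dict, Tuple

Record = Dict[str, object]
FieldRule = Tuple[str, Callable[[object], object]]  # (target field, value converter)
Mapping = Dict[str, FieldRule]                      # source field -> rule


class TransformationService:
    """Hypothetical service: transforms records by looking up registered mappings."""

    def __init__(self) -> None:
        self._mappings: Dict[Tuple[str, str], Mapping] = {}

    def add_mapping(self, source_format: str, target_format: str, mapping: Mapping) -> None:
        self._mappings[(source_format, target_format)] = mapping

    def transform(self, record: Record, source_format: str, target_format: str) -> Record:
        mapping = self._mappings[(source_format, target_format)]
        out: Record = {}
        for name, value in record.items():
            target, convert = mapping[name]
            out[target] = convert(value)
        return out


service = TransformationService()

# Four mappings: each native format to the common format, and back again.
service.add_mapping("native_a", "common_well", {
    "WELL_NM": ("well_name", str),
    "DEPTH_M": ("total_depth_cm", lambda m: m * 100),
})
service.add_mapping("common_well", "native_a", {
    "well_name": ("WELL_NM", str),
    "total_depth_cm": ("DEPTH_M", lambda cm: cm // 100),
})
service.add_mapping("native_b", "common_well", {
    "wellName": ("well_name", str),
    "tdCm": ("total_depth_cm", int),
})
service.add_mapping("common_well", "native_b", {
    "well_name": ("wellName", str),
    "total_depth_cm": ("tdCm", int),
})

record_a = {"WELL_NM": "A-1", "DEPTH_M": 2900}
common = service.transform(record_a, "native_a", "common_well")
record_b = service.transform(common, "common_well", "native_b")
round_trip = service.transform(common, "common_well", "native_a")
print(common, record_b, round_trip)
```

Because the mappings are data held by the service, supporting a changed or additional format in this sketch means registering new mappings, not rewriting the service.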
A further aspect of the present invention includes methods of integrating an additional native repository with a system implementing a dynamic common model, the system including a data integration application. In typical embodiments, methods include developing an additional adapter for the additional native repository, the additional adapter being loosely coupled for data integration to the data integration application, the additional native repository comprising additional native records having at least one additional native format, the additional native format belonging to at least one category of formats identified as a datatype. Typical embodiments of this aspect include creating mappings specifying transformations of records: from the at least one additional native format to an additional dynamic common format, and from the additional dynamic common format to the at least one additional native format.
It is usual to view data in native repositories as sets of data elements. In this view, the integration achieved by the standard model is never more than an intersection of sets. The dynamic common model, however, is capable of a true union of all data elements selected for integration from all source repositories integrated through an embodiment of the invention. Because the standard model is static and includes from the beginning only agreed subsets of source data elements, the standard model never represents more than an intersection. In contrast, the dynamic common model of the present invention is capable at all times of transforming and transferring each and every data element from each and every source repository. If, as a practical matter, users elect to integrate less than a full union of all data elements in all integrated native repositories for a particular embodiment, the dynamic common model nevertheless remains capable of quickly effecting a full union if desired, a capability never available in the standard model for data integration.
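The set view described above can be made concrete with a short sketch. The repository contents and field names are invented for illustration, and the standard model is represented as exactly the intersection, which is the upper bound the passage gives for it.

```python
# Data elements offered by three hypothetical native repositories.
repo_a_fields = {"well_name", "total_depth", "operator"}
repo_b_fields = {"well_name", "total_depth", "spud_date"}
repo_c_fields = {"well_name", "log_curve", "operator"}
repositories = [repo_a_fields, repo_b_fields, repo_c_fields]

# Standard model, per the passage: at most the elements common to every repository.
standard_model_fields = set.intersection(*repositories)

# Dynamic common model: capable of carrying every element from every repository.
dynamic_common_fields = set.union(*repositories)

# Elements that would be dropped on transfer into the standard model.
lost_fields = dynamic_common_fields - standard_model_fields

print(sorted(standard_model_fields))
print(sorted(dynamic_common_fields))
print(sorted(lost_fields))
```

With these invented repositories, only one element survives in the intersection while the union retains five, so four elements would be lost on transfer into the standard model.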
The standard model itself provides no mechanism for changing or updating source data structures. In contrast, the dynamic common model itself comprises elements useful for automatically upgrading the dynamic common model to include changes in source repository structures. In fact, changes typically are administered in a similar manner as additions of new repositories. “Automatic upgrading” in this sense means that upon activation, a new adapter automatically registers itself and its new repository with a data integration application to which it is coupled for data communications and a spider then automatically enters in a catalog identifying information for all the records in the new repository served by the new adapter. The process for changing existing repositories or adding new repositories is extremely flexible and efficient, especially in contrast with the standard model in which such changes or additions are almost impossible and are not provided for within the model itself.
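The self-registration and cataloging behavior described above might look like the following sketch. The class names, method names, and the shape of a catalog entry (adapter name paired with record identifier) are assumptions made for illustration; the text does not specify how the spider or the catalog is built.

```python
from typing import Dict, List, Tuple


class DataIntegrationApplication:
    """Hypothetical application holding registered adapters and a record catalog."""

    def __init__(self) -> None:
        self.adapters: Dict[str, "Adapter"] = {}
        self.catalog: List[Tuple[str, str]] = []  # (adapter name, record id)

    def register(self, adapter: "Adapter") -> None:
        # Registration triggers the spider, so no manual cataloging step is needed.
        self.adapters[adapter.name] = adapter
        self._spider(adapter)

    def _spider(self, adapter: "Adapter") -> None:
        # Enter identifying information for every record the adapter serves.
        for record_id in adapter.list_record_ids():
            self.catalog.append((adapter.name, record_id))


class Adapter:
    """Hypothetical adapter that registers itself when activated."""

    def __init__(self, name: str, records: Dict[str, dict]) -> None:
        self.name = name
        self._records = records

    def list_record_ids(self) -> List[str]:
        return sorted(self._records)

    def activate(self, application: DataIntegrationApplication) -> None:
        application.register(self)


app = DataIntegrationApplication()
new_adapter = Adapter("wells_db", {"w2": {"name": "B-7"}, "w1": {"name": "A-1"}})
new_adapter.activate(app)
print(app.catalog)
```

In this sketch, adding a repository requires no change to the application class: activating the new adapter is the entire integration step.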
The foregoing and other objects, features and advantages of the invention will be apparent from the following more particular description of a preferred embodiment of the invention, as illustrated in the accompanying drawings wherein like reference numbers generally represent like parts of the invention.