The field of the disclosure relates generally to computer data warehousing (CDW), and more specifically, to methods and systems for metadata driven data capture for a temporal normalized data warehouse.
There is a need to quickly load and time sequence varying volumes of incoming data with a single general purpose design without resorting to sequential methods. Sequential methods are generally too inefficient for initialization and higher volume incoming data events. In addition, there is a need to minimize sometimes intensive pre-processing to detect changes within the data or to ensure unique valid time periods to allow creation of a load set of candidate rows for every target table, regardless of interface type. Finally, there is a need to identify changes of all types and avoid loading new data rows with no new content beyond a new authoring timestamp (valid time), which would save data space by collapsing consecutive duplicate rows of data within a temporal time period.
Currently, complex custom data load programs typically running on large external application servers are one solution to loading a temporal data warehouse. Such programs must process and apply data serially by primary key, resulting in very long run-times and extensive relatively intrusive updates to the target tables which are continually being queried by end users. In some cases, two sets of target tables are used and swapped when loading is complete to continuously support users. Typically, some data already in the database is removed, processed externally on an application server along with incoming data and re-loaded to achieve the data load, further stressing the network and database. Existing solutions also tend to only deal with anticipated situations rather than all possible situations, breaking, aborting the load or rejecting data in unanticipated cases (e.g. valid time tie within a primary key).
Some contemplated solutions have other downsides, for example, a design that is hard-coded to accept particular types of incoming data and exact target schemas is not desirable due to development costs. Maintenance costs are a concern when addressing primary key or attribute changes to the data source, data target, or method of interface. Use of extract, transform, and load (ETL) tools to perform the work outside of a database on a server is one possible solution, but is inefficient and can be affected by the amount of network traffic. Loss of efficiency in contemplated solutions is particularly large when using external or row-at-a-time solutions on the massively parallel processing (MPP) architecture widely used by data warehouses. Also, proprietary database tools require specialized knowledge and are not portable to other platforms (e.g. Oracle PL/SQL). These solutions are inefficient for larger volumes of data, making near real-time non-intrusive loading impossible (no active data warehouse) and requiring different coding for initialization or large volumes of data to achieve acceptable performance.