The field of the disclosure relates generally to a computer data warehouse (CDW), and more specifically, to methods and systems for metadata-driven data capture for a temporal normalized data warehouse.
A need exists to quickly load and time-sequence varying volumes of incoming data with a single, general-purpose design, without resorting to sequential methods. Sequential methods are generally inefficient both for initialization and for higher-volume incoming data events. In addition, there is a need to reduce the sometimes-intensive pre-processing required to detect changes within the data and/or to ensure unique valid time periods, so that a load set of candidate rows can be created for every target table regardless of the interface type. Finally, because of the costs associated with data storage, there is a need to identify data changes of all types and to avoid loading new data rows that carry no new content beyond a new authoring timestamp (valid time). Such practices may reduce storage usage by collapsing consecutive duplicate rows of data within a time period.
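The collapsing of consecutive duplicate rows described above can be sketched as follows. This is a minimal illustration, not the disclosed method: it assumes each candidate row carries a primary key, a valid-time start timestamp, and a tuple of attribute values, and all names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Row:
    key: str          # primary key (hypothetical single-column key)
    valid_from: int   # authoring timestamp (start of valid time)
    payload: tuple    # non-temporal attribute values

def collapse_consecutive_duplicates(rows):
    """Drop rows whose content merely repeats the immediately
    preceding row for the same primary key, so that a new valid
    time alone does not produce a new stored row."""
    ordered = sorted(rows, key=lambda r: (r.key, r.valid_from))
    kept = []
    for r in ordered:
        prev = kept[-1] if kept else None
        if prev and prev.key == r.key and prev.payload == r.payload:
            continue  # no new content beyond a new timestamp: skip
        kept.append(r)
    return kept
```

In a set-based warehouse load this comparison would typically be expressed over the whole candidate set at once (for example, by comparing each row to its predecessor in valid-time order) rather than row by row as shown here.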
One current solution, implemented in an attempt to load a temporal data warehouse, uses complex custom data load programs, typically running on large external application servers. Such programs process and apply data serially by primary key, which may result in long run-times and extensive, relatively intrusive updates to the target tables. In some instances, to continuously support users, two sets of target tables are used and swapped when loading is complete. In such systems, however, some data already in the database is typically removed, processed externally on an application server along with the incoming data, and re-loaded to achieve the data load, which further stresses the network and the database. Other known solutions also tend to accommodate only anticipated situations rather than all possible situations, breaking, aborting the load, or rejecting data in unanticipated cases (e.g., a valid-time tie within a primary key).
Other contemplated solutions have their own shortcomings. For example, a design that is hard-coded to accept particular types of incoming data and exact target schemas is undesirable due to development costs, and maintenance costs become a concern whenever primary keys or attributes change in the data source, the data target, or the method of interface. Using extract, transform, and load (ETL) tools to perform the work outside the database on a server is one possible solution, but it is inefficient and sensitive to the amount of network traffic. The loss of efficiency is particularly large when external or row-at-a-time solutions are used on the massively parallel processing (MPP) architecture widely used by data warehouses. Proprietary database tools (e.g., Oracle PL/SQL) also require specialized knowledge and are not portable to other platforms. These solutions are inefficient for larger volumes of data, which may render near-real-time, non-intrusive loading impossible and require different coding for initialization or for large volumes of data to achieve acceptable performance.