The subject matter described herein generally relates to systems and methods for efficiently updating and analyzing the contents of a data warehouse.
A data warehouse (DW) can be thought of as a repository of data that has been extracted from one or more data sources. The data may be derived and integrated from heterogeneous and autonomous distributed data sources using, for example, an extract, transform and load (ETL) process. In an ETL process, generally for each data source, a source-specific data extractor retrieves the data. This data is converted into a uniform relational (warehouse) format. The data is then loaded into the DW.
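By way of illustration only, the extract, transform and load steps described above may be sketched as follows. The two data sources, their field layouts, the `sales` table, and all function names are assumptions introduced for this example and are not taken from any particular system. An in-memory SQLite database stands in for the DW.

```python
import sqlite3

# Hypothetical source data: two sources with different, source-specific layouts.
SOURCE_A = [{"store": "A1", "region": "North", "units": "120"}]
SOURCE_B = [("B7", "NORTH", 80), ("B9", "SOUTH", 45)]


def extract_a(records):
    """Source-specific extractor for dict records with string unit counts."""
    for r in records:
        # Transform: normalize the region name and convert units to an integer.
        yield (r["store"], r["region"].lower(), int(r["units"]))


def extract_b(records):
    """Source-specific extractor for (store, region, units) tuples."""
    for store, region, units in records:
        yield (store, region.lower(), units)


def load(conn, rows):
    """Load rows that are already in the uniform (store, region, units) format."""
    conn.executemany(
        "INSERT INTO sales (store, region, units) VALUES (?, ?, ?)", rows
    )
    conn.commit()


def build_warehouse():
    """Create the stand-in warehouse and run one ETL pass per source."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (store TEXT, region TEXT, units INTEGER)")
    load(conn, extract_a(SOURCE_A))
    load(conn, extract_b(SOURCE_B))
    return conn
```

Each extractor hides the layout of its own source, so the load step only ever sees rows in the uniform warehouse format.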
To update a DW, several conventional techniques have been utilized. A DW may be updated periodically, wherein the data updates are kept at data sources and sent to the DW on a periodic basis according to some predefined period. Alternatively, DW updates may be done on-demand. That is, on-demand updates occur whenever an update request is sent from a DW administrator. Finally, “real time” updates for a DW have been implemented wherein each data source update is propagated to the DW as soon as it happens.
Analysis of the data stored in a DW often begins by using querying techniques to identify desired data that satisfies one or more conditions. For example, a DW may run an application for complex aggregation queries. Many of these queries are used to execute contingent events in the form of triggers, where some predefined action is performed when a trigger condition is satisfied on the DW. As an example, in a DW having sales data from various department stores, a trigger may issue a notification when total sales in stores located in a particular geographic region exceed 10,000 units.
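By way of illustration only, a trigger of the kind described above may be sketched as an aggregation query paired with a predefined action. The `sales` table, the sample rows, and the function names are assumptions made for this example. The 10,000-unit threshold follows the example in the text.

```python
import sqlite3


def make_demo_warehouse():
    """Build a small stand-in warehouse with hypothetical sales rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (store TEXT, region TEXT, units INTEGER)")
    conn.executemany(
        "INSERT INTO sales (store, region, units) VALUES (?, ?, ?)",
        [("N1", "north", 6000), ("N2", "north", 4500), ("S1", "south", 3000)],
    )
    conn.commit()
    return conn


def total_sales(conn, region):
    """Aggregation query: total units sold in one region (0 if no rows)."""
    row = conn.execute(
        "SELECT COALESCE(SUM(units), 0) FROM sales WHERE region = ?", (region,)
    ).fetchone()
    return row[0]


def evaluate_trigger(conn, region, threshold, action):
    """Run the query and perform the action if the condition is satisfied."""
    total = total_sales(conn, region)
    if total > threshold:
        action(region, total)
        return True
    return False


if __name__ == "__main__":
    dw = make_demo_warehouse()
    evaluate_trigger(
        dw, "north", 10_000,
        lambda region, total: print(f"notify: {region} sold {total} units"),
    )
```

Note that every call to `evaluate_trigger` re-runs the full aggregation query, which is the cost the following paragraph discusses.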
To evaluate a trigger condition, the DW has to evaluate the query (for example, “retrieve total sales in the particular geographic region this week”) with each data update. This may be costly and/or inconsistent. If data updates are sent to the DW periodically with a short period/interval, the query will be executed very frequently, which may be costly in terms of computing resources. At the extreme, if data updates are sent to the DW in real time, the trigger query will be executed with every individual data source update. On the other hand, if the period is long, the query will not be executed frequently enough and updates will not reach the DW in a timely fashion, so a satisfied trigger condition may be detected late.
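By way of illustration only, the tradeoff described above may be sketched with a small simulation that counts how often the trigger query runs and when the trigger condition is first detected. The update stream, the `simulate` function, and the convention that a period of 0 denotes real-time propagation are assumptions made for this example. The sketch also assumes a non-empty update list sorted by time.

```python
import math


def simulate(updates, period, threshold):
    """Count query evaluations and find the first detection time.

    updates:   non-empty list of (time, units), sorted by time
    period:    batch interval; 0 means each update is propagated immediately
    threshold: the trigger fires when the running total exceeds this value
    Returns (evaluations, detected_at); detected_at is None if never fired.
    """
    total = 0
    evaluations = 0
    detected_at = None

    if period == 0:
        # Real-time updates: one query evaluation per source update.
        for t, units in updates:
            total += units
            evaluations += 1
            if detected_at is None and total > threshold:
                detected_at = t
        return evaluations, detected_at

    # Periodic updates: one query evaluation per batch arrival.
    end = math.ceil(updates[-1][0] / period) * period
    i = 0
    tick = period
    while tick <= end:
        while i < len(updates) and updates[i][0] <= tick:
            total += updates[i][1]
            i += 1
        evaluations += 1
        if detected_at is None and total > threshold:
            detected_at = tick
        tick += period
    return evaluations, detected_at


if __name__ == "__main__":
    # Hypothetical stream: 200 units sold each minute for one hour.
    stream = [(t, 200) for t in range(1, 61)]
    for p in (0, 5, 30):
        print(p, simulate(stream, p, 10_000))
```

Under these assumptions, real-time propagation evaluates the query 60 times and detects the condition at minute 51, a 5-minute period evaluates it 12 times and detects at minute 55, and a 30-minute period evaluates it twice and detects only at minute 60.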
Triggers defined over a DW need to be evaluated frequently enough to determine when they should be tripped, but without unnecessarily wasting resources. However, determining an effective and efficient implementation of triggers has proven difficult.