1. Field of the Invention
This invention is related to the field of computer systems and, more particularly, to a refresh mechanism for databases.
2. Description of the Related Art
A data warehouse may be a repository of historical data that may be used for analyzing business trends, operational efficiency, customer preference, decision effects, etc. A data warehouse is “just another database”, and may be a consolidation of other transactional databases and systems. Data warehouse administrators may create different aggregates, indexes, and views of data along different dimension of the business. Data warehouses may be components of an enterprise's Decision Support System. In a data warehouse, an enterprise may collect and organize data into a schema that can then be searched and mined for information using Business Intelligence solutions. These collections of data may serve as the basis of crucial business decisions.
A typical enterprise's data warehouse environment may include, but is not limited to:                One or more data sources—for example, transactional system, point of sales system, ERP system etc.        Utilities/programs to extract data from the data sources.        Utilities/programs to transfer data into a “canonical” form.        Utilities/programs to load canonical data into the data warehouse repository.        Enterprise Data Model to represent the business intelligence.        Data warehouse application software to further aggregate the data for business analysis.        Customized query and report generation/presentation tools.        Smaller “data marts” that include highly aggregated data produced from the data warehouse repository. Customers may, for example, implement data marts with OLAP (Online Analytical Processing) software.        
Enterprises generally do not like to mix their data protection environment with the data warehouse. Enterprises may prefer to have dedicated storage for mirrors or on-line backup and may not be comfortable with multiplexing those mirrors for data warehouse purposes. Enterprises may prefer to have separate storage for the data warehouse. Enterprises may prefer to take advantage of idle machine cycles on backup and standby servers. A majority of data warehouse databases are created on regular file system files because of the manageability of file systems.
The complex, sometimes long-running queries of data warehouse applications may have a substantial impact on OLTP (online transaction processing). With the reduction in server and storage cost, many enterprises prefer to run the data warehouse in a separate environment. Separating the data warehouse from production data brings up the issue of refreshing the data warehouse with the up-to-date view. Enterprises may want their data warehouse to be kept as current as possible. The term ‘real-time’ data warehouse is mentioned a lot.
Enterprises typically need to refresh their data warehouse periodically. Data warehouses are typically updated nightly, but some may be updated weekly, hourly, or at other intervals based on business needs. In general, when a data warehouse is being loaded, administrators take the warehouse off limit for general users. Depending on the amount of data being loaded/processed, this data loading window may be very long. In today's global business environment, it is increasingly difficult to find such a window. The off hours in one location could be the peak hours in another.
There are many different ways to load new data into a data warehouse. Once the new data is loaded, the DBMS may need to perform post process including, but not limited to: rebuilding the index to include the new data and regenerating any aggregates and summaries with the new data. The data is typically made available to users after the post processing finishes.
A typical load process is to take the data warehouse offline, and then do a bulk load via a DBMS vendor's bulk load utility such as the Oracle sqlloader or DB2 load utilities. A common practice is to partition the data monthly. Each month will have its partition or table/tablespace. Only the current month data will be impacted during the load process. Only the index and aggregates related to the last month's data may be rebuilt.
In some environments, administrators may use DBMS vendor's online loading utility to minimize the downtime. With the online load, vendors may pose restrictions on data that may limit the usefulness of the feature. It may be difficult to maintain a point-in-time image of the database unless DBMS keeps a before image of all the related indexes, tables, aggregates, and rules. The cost to maintain a before image copy of data in the database may be significant. The database needs to be carefully set up by an experienced DBA to make use of the online loading feature.
Some administrators may do batch inserts if the changed data is relatively small. All the aggregates and indexes may be updated automatically by the DBMS. However, such cases are not common in real-world situations.
One prior art approach to refreshing data warehouses is to use duplicates of the database and to switch back and forth between the duplicates. For a data warehouse, the size of data may be large, and may exceed many Terabytes. The cost of maintaining two large duplicate sets of data may be very high.
Another prior art approach is to create a copy-on-write/point-in-time snapshot for current users to query the data warehouse, and then to load the new data on to the current file system. In order to maintain the point-in-time copy of the snapshot, any changes to the current file system will be ‘pushed’ to the snapshots. Administrators will switch the data warehouse back to use current file system once the load process is done. A problem in this process is that the push process creates unnecessary runtime and space usage overhead.
In addition, both of these approaches require running the load process on the same host as the data warehouse. The load process may become a disruptive query in the data warehouse environment itself.
Therefore, it is desirable to provide a refresh mechanism that reduces the impact of data loading on a production data warehouse. It is also desirable to provide a space-efficient mechanism to refresh data warehouses.