In the context of database technologies, databases may be characterized as data warehouses or operational databases. A data warehouse may be designed as a database to store large amounts of data from one or more operational systems. The operational systems may each be supported by one or more operational databases. Operational databases are commonly designed using normalized entity-relationship modeling. When designing an operational database, the database designer may seek to model the business domain in a manner to support the business applications and avoid recording redundant data as much as possible. The third-normal form is often strived for in such design. In general, normalized database schemas are tuned to support fast updates and inserts by minimizing the number of rows that must be changed when recording new data.
Data warehouses differ from operational databases in the way they are designed and used. Data warehouses are designed to be efficient for querying. Data warehouses usually provide a simplified version of the data in the operational databases. Instead of being updated by end users, updates to a data warehouse are propagated from operational systems.
One logical design technique used when designing a data warehouse is dimensional modeling. Schemas produced with dimensional modeling are known as star schemas. Star schemas include fact tables at the center of the star and dimension tables around the fact tables. Fact tables may be large tables and contain basis-level detailed data. Dimension tables contain attributes related to foreign keys in the fact tables.
A basis-level fact table tends to grow large and consequently the performance of queries against such a large basis-level fact table tends to suffer. The problem worsens as more data is stored in the basis-level fact table. One technique to alleviate or mitigate this problem is to provide summary-level fact tables where data from the basis-level fact table is aggregated. Although adding summary-level fact tables increase query performance, adding summary-level fact tables also adds to the complexity of the data warehouse. Users may be overwhelmed when faced with the task of choosing the correct table against which to direct their query. Because of this, users may be inclined to choose a fact table with every column needed and repetitively use this table for all the different queries. This option requires more system resources and provides lower overall performance.
Some attempts have been made to isolate the complexity of multiple summary tables from users by implementing a query rewrite facility in between the end user and the database that takes a user query and modifies it in order to generate a more efficient SQL statement. Query rewrite facilities, however, raise new issues of their own. For example, in order to implement a query rewrite facility, one approach is to add a layer of virtual tables, which are exposed to the user. This approach adds to the administrative burden of setting up the data warehouse and coordinating changes between the virtual table layer and the physical table layer.
Another approach to isolate the complexity of multiple summary tables from users is to utilize query rewrite and materialized view facilities at the database level. This approach has various limitations. First, it relies on the cleverness of a relational database management system (RDBMS) optimizer. Second, such facilities may only work for some SQL statements. Further, performance may depend on the cleverness of the SQL writer in order to fully utilize the RDBMS query optimizer. The database administrator may also have to create and maintain the required materialized views. In addition, query performance may be unpredictable as the same query is sometimes rewritten and other times not—depending on a complicated set of rules. The query rewrite approach is transparent to the users and database application; this can be a good thing, but for a professional software engineer it can also be seen as an opaque and unpredictable layer impeding the software engineer. Still another problem with the materialized-view/query rewrite approach is that the performance of materialized refresh operations may easily become too slow for practical use.
Another problem with summary tables is the reliability of data, in particular, ensuring that the summary tables are up-to-date and consistent with the basis-level fact table.
Another technique to improve performance in a data warehouse is to partition fact tables using partitioning options. Because fact tables commonly contain time-related transactions, partitioning on a time dimension is commonly used. For example, a fact table of retail sales may be partitioned by month. This may improve performance of queries that could utilize this partitioning. However, the problem with this is that in order to realize performance improvements, the query optimizer needs to be smart enough to know when partition elimination can be employed and only visit partitions that are relevant to the query. In practice, this partition elimination tends to work only in certain special cases. In other cases, the query optimizer may fall back to an execution plan that visits all available partitions, and the query performance suffers accordingly. This is especially likely to happen when the end user is not a database expert who understands the particular quirks of a SQL query optimizer. In such cases, the advantages gained from the partitioning may be limited to ease of manageability for the database administrator, with very little performance benefit for the end users.
For data warehouses with one or more large fact tables there are a number of issues with loading data and keeping both fact and summary data current and consistent with the source systems. Techniques for refreshing data in a data warehouse include a full refresh, an incremental record-by-record refresh, and an incremental partition-by-partition refresh. The technique employed may vary from table to table.
Using a full refresh technique for large fact tables can place a high burden on the operational system while extracting data and on the data warehouse when loading data. After a full refresh of a fact table, indexes need to be rebuilt, which adds time and computer resources to this task. Full refresh techniques may be limited to smaller fact tables.
The incremental record-by-record refresh technique may be faster, but also has issues. For example, in order to decide which records need to be added, this technique sometimes relies on a timestamp column in a source table in order to decide if a row is “new” in relation to the last time the refresh operation was executed. Uncommitted changes in the source table may be permanently omitted because a record-creation-timestamp value is not visible for other sessions until it is committed, and in particular, not visible for a session that is replicating data. A new record could be committed in the source table where the new record includes a record-creation-timestamp value prior to the replication-check-timestamp value. On subsequent execution of incremental replication this record is not selected because the replication-check-timestamp value is later then the record-created-timestamp value. This record is thus wrongfully omitted from all subsequent refresh cycles. This causes such incremental record-by-record algorithms to miss records from time to time and thus over time more and more propagation errors are accumulated, which causes divergence between the source and target database to develop. Another problem with an incremental record-by-record refresh is that it does not perform well when the table has many indexes, especially if the fact table is indexed with bitmap indexes. Yet another problem with an incremental record-by-record refresh is that handling updates and deletes to source rows is cumbersome to propagate correctly to the data warehouse.
Data in a data warehouse should correctly reflect the data in the operational systems with as little delay as possible. However, given the above-mentioned problems, this is often not attained in current systems. The present disclosure addresses these and other problems.