The present application relates to database systems and more particularly to techniques for efficiently loading data into partitioned database tables.
Many conventional database systems organize and store data in tables, with each table having a fixed number of columns and a variable number of rows. In many conventional database systems, database tables may be partitioned in order to increase manageability of the data, to increase performance of queries accessing and/or updating the table, and to increase the availability of the data to be stored in the table. Partitioning may be used to divide the table into a plurality of partitions, with each partition comprising a subset of the rows of data stored in the table. Various actions, such as data loads, creation and rebuilding of indices, and backup and recovery operations, may be performed on each individual partition rather than on the entire table.
Often data in a table may need to be reloaded by deleting existing data in a table and inserting new data. One benefit of partitioning a table is that data in one partition may be reloaded without affecting access to data stored in other partitions. However, data in the partition being reloaded will be inaccessible during the reloading process, and reloading a data partition may involve copying of large amounts of data and/or performing complex calculations on raw data to generate the data to be used to populate the partition, such as calculating rollup values across various dimensions. Thus, data in the partition may be inaccessible for a significant amount of time.
One approach taken in many conventional database systems to minimize the amount of time that a partition is unavailable during reloading is to create a temporary table and populate the temporary table with the data that is to be loaded into the partition. Once the data has been created in the temporary table, the data from the temporary table may be copied into the partition, and any data that was in the partition prior to copying the data from the temporary table may be deleted. Once the copy and delete steps have been completed, the temporary table is no longer needed and may be deleted. This process minimizes the amount of time that the partition is unavailable by generating the data and inserting the data into the temporary table before copying the data to the partition.
During the process of copying the data from the temporary table to the partition, a redo log may be created. The redo log tracks every change to the partition (such as data inserted, data modified, data deleted) that is made during the copying process. The redo log is created to enable the changes to the partition to be rolled back and the previous state of the partition to be restored if an error occurs during the copying process. Creating the redo log may be extremely resource intensive, because the number of records in the snapshot data may be very large. Furthermore, the data in the partition will be unavailable while the data is being copied from the temporary table to the partition, which may take a significant amount of time depending upon how much data is to be copied.
Another problem with the copy and delete process described above is that the high water mark for the partition may become artificially inflated as a result of copying the data from the temporary table into the partition before deleting data that already exists in the partition. The high water marks represents a maximum amount of data that has been stored in the partition. As data is added to the partition, additional storage space is formatted for storing data and is associated with the partition. The high water mark may not, however, represent the current amount of data stored in the partition. If data is deleted from the partition, the amount of storage space formatted for storing data does not decrease even though the amount of storage space actually being used may be less than the amount of storage space that has been formatted and allocated to the partition.
Storage space below the high water mark cannot be deallocated. As a result, due to a high water mark for a partition, the actual amount of data stored in the partition may be small compared to the amount of formatted but unused storage space allocated to the partition, thereby resulting in a substantial amount of unused but allocated storage space for the partition that cannot be deallocated. Due to the inflated high water mark, queries and other operations run against the partition will have to examine all of the allocated storage space including the unused portions. Therefore, execution of queries in such a partition may cause unnecessary processing resulting in inefficiencies and wasted use of resources significantly impact response times.
Accordingly, techniques that overcome the problems of conventional systems and methods described above and provide for efficient data loads into partitioned tables are desired.