1. Field of the Invention
The present invention relates generally to computer databases, and more particularly to methods and apparatus for balancing partitioned database contents by generating bucket-to-bin mappings and by efficiently populating data bins that are newly added to existing databases.
2. Description of the Related Art
A computer database comprises a set of data organized into collections of interrelated structures, frequently referred to as "database objects". In a relational database, the data structures that embody data relations are typically rendered in set-oriented structures such as tables. A table in a relational database typically includes a number of records that correspond to rows of the table and that are related by attributes that correspond to columns of the table.
To reduce the time required to search a database for a particular record, and to facilitate parallel processing (i.e., the association of multiple computers or multiple storage devices with a single database), it can be useful to store records of a table apart from each other. Such separate storage is referred to as "de-clustering" or "partitioning" the table. When de-clustering, it is advantageous to first map the records to logical "buckets", and then to map the buckets to physical storage locations referred to as "bins". Table records are typically distributed among buckets by means of a randomizing function that produces a Bucket Id (or Bucket identifier) for each record. The randomizing function might involve a transformation (hash) of the value of one or more data fields of a respective record, which fields are otherwise called the partition key. Distributing records among a number of storage locations also reduces the search domain when a record is accessed by its partition key. One bin usually has several buckets assigned to it. To further reduce the time required to search the database for data, the number of buckets per bin should be approximately equal from bin to bin.
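By way of illustration only, the two-level mapping described above (record to bucket, then bucket to bin) might be sketched as follows. The bucket count, the bin count, the choice of CRC-32 as the randomizing function, and all function names are assumptions made for this example and are not taken from the present disclosure.

```python
import zlib
from collections import Counter

NUM_BUCKETS = 16  # assumed number of logical buckets
NUM_BINS = 4      # assumed number of physical bins


def bucket_id(partition_key: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Hash a record's partition key to a Bucket Id (CRC-32 is an assumed choice)."""
    return zlib.crc32(partition_key.encode("utf-8")) % num_buckets


def bin_for_bucket(bucket: int, num_bins: int = NUM_BINS) -> int:
    """Map a bucket to a bin; a simple modulo rule is assumed here."""
    return bucket % num_bins


def buckets_per_bin(num_buckets: int = NUM_BUCKETS, num_bins: int = NUM_BINS) -> Counter:
    """Count how many buckets each bin receives under the modulo rule."""
    return Counter(bin_for_bucket(b, num_bins) for b in range(num_buckets))


if __name__ == "__main__":
    key = "customer-1001"  # hypothetical partition key value
    b = bucket_id(key)
    print(f"key={key!r} -> bucket {b} -> bin {bin_for_bucket(b)}")
    print(dict(buckets_per_bin()))
```

With sixteen buckets and four bins, each bin in this sketch receives four buckets, which corresponds to the evenly loaded condition described above.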
Most databases, however, are not static. Instead, as a database expands, additional storage is acquired and new bins are frequently added. When new bins are added, existing buckets must be reassigned to the new bins to populate them, and a new mapping scheme must be generated that reflects the addition of the new bins, before the database can be used again. Buckets cannot be reassigned to bins haphazardly, however, because at the end of the reassignment it is important that each bin of the database has assigned to it about the same number of buckets as the other bins. The reason for this is that, as mentioned above, search time for data retrieval is reduced when the bins are evenly loaded. When the buckets are unevenly distributed among the bins, the database is said to be "skewed".
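As a rough illustration of skew, the following sketch counts the buckets assigned to each bin and flags a mapping as skewed when the bin loads differ by more than one bucket. That threshold, the list-based representation of the mapping, and the function names are assumptions made for this example only.

```python
from collections import Counter


def bin_loads(bucket_to_bin: list[int], num_bins: int) -> list[int]:
    """Return the number of buckets assigned to each bin, including empty bins."""
    counts = Counter(bucket_to_bin)
    return [counts.get(b, 0) for b in range(num_bins)]


def is_skewed(bucket_to_bin: list[int], num_bins: int) -> bool:
    """Assumed criterion: skewed if any two bin loads differ by more than one bucket."""
    loads = bin_loads(bucket_to_bin, num_bins)
    return max(loads) - min(loads) > 1


if __name__ == "__main__":
    mapping = [b % 4 for b in range(16)]  # 16 buckets spread evenly over 4 bins
    print(bin_loads(mapping, 4), is_skewed(mapping, 4))
    # Two bins are added, but no buckets have been reassigned to them yet.
    print(bin_loads(mapping, 6), is_skewed(mapping, 6))
```

In this sketch, adding two bins without reassigning any buckets leaves the new bins empty, so the mapping is reported as skewed until buckets are moved to populate them.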
Accordingly, to ensure even loading of bins after new bins have been added to a database, the assignment of buckets to the bins typically is done in a round-robin fashion. Unfortunately, this requires moving a large amount of the data between bins each time new bins are added, which is computationally intensive and time-consuming, and which consequently prolongs the period during which the database cannot be used. Alternatively, buckets can be individually mapped to bins, which would both ensure even data distribution among the bins and minimize the movement of data when new bins are added. Unfortunately, such individual mapping requires that a relatively large storage space be permanently allocated, for each database table, for a directory that lists the bin location of each bucket, thus undesirably increasing the memory overhead requirements of the database system.
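The trade-off between the two conventional approaches described above can be made concrete with a small sketch. It compares how many buckets change bins when a fifth bin is added to sixteen buckets on four bins, first under a round-robin rule (assumed here to be bucket number modulo bin count) and then under a per-bucket directory. The directory update shown, which takes one bucket at a time from the most heavily loaded existing bin, is merely one assumed way of maintaining such a directory; it is not the method of the present invention, and all names are hypothetical.

```python
from collections import Counter


def round_robin_map(num_buckets: int, num_bins: int) -> list[int]:
    """Round-robin assignment, assumed here to be bucket number modulo bin count."""
    return [b % num_bins for b in range(num_buckets)]


def buckets_moved(old: list[int], new: list[int]) -> int:
    """Count the buckets whose bin differs between two mappings."""
    return sum(1 for o, n in zip(old, new) if o != n)


def directory_add_bin(directory: list[int], num_bins: int) -> list[int]:
    """Add one bin (numbered num_bins) to a per-bucket directory.

    Moves only enough buckets to give the new bin its even share, each time
    taking a bucket from the most heavily loaded existing bin.
    """
    new_dir = list(directory)
    share = len(directory) // (num_bins + 1)  # even share for the new bin
    for _ in range(share):
        loads = Counter(new_dir)
        donor = max(range(num_bins), key=lambda b: loads.get(b, 0))
        new_dir[new_dir.index(donor)] = num_bins
    return new_dir


if __name__ == "__main__":
    old = round_robin_map(16, 4)
    print("round robin:", buckets_moved(old, round_robin_map(16, 5)), "buckets moved")
    print("directory:  ", buckets_moved(old, directory_add_bin(old, 4)), "buckets moved")
```

In this example the round-robin remapping moves twelve of the sixteen buckets, whereas the directory update moves three. The directory, however, must store one entry per bucket for each table, which is the memory overhead noted above.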
As recognized herein, however, it is possible to reassign buckets to new bins in a database, without requiring the movement of large amounts of data, without resulting in database skew, and without requiring the maintenance of a large directory of database object mappings. Accordingly, it is an object of the present invention to provide a system for mapping buckets to an expandable collection of storage bins in a database. Another object of the present invention is to provide a system for adding new bins to a database, while ensuring even data distribution among the bins afterward. Still another object of the present invention is to provide a system for adding new bins to a database and generating a mapping thereof, while ensuring even data distribution among the bins afterward and while minimizing the amount of data movement necessary to populate the new bins with data. Yet another object of the present invention is to provide a system for adding, populating, and then mapping new bins which does not require excessive memory overhead. Another object of the present invention is to provide a system for adding new data storage bins to a database that is easy to use and cost effective.