Computer systems generally include one or more processors interfaced to a temporary data-storage device such as a memory device and one or more persistent data-storage devices such as disk drives. Each disk drive generally has an associated disk controller. Data is transferred between the disk drives and the disk controllers, and between the disk controllers and the memory device over a communications bus or similar interconnect.
Data organization in a computer system such as the one described above is important in relational database systems that deal with complex queries against large volumes of data. Relational database systems allow data to be stored in tables that are organized as a set of columns and a set of rows. Standard commands are used to define the columns and rows of tables, and data is subsequently entered in accordance with the defined structure.
The defined table structure is logically maintained but may not correspond to the physical organization of the data. In a parallel shared-nothing relational database, data can be stored across multiple data-storage facilities, each data-storage facility in turn including one or more disk drives. Data partitioning can be performed in order to enhance parallel processing across multiple data-storage facilities.
Hash partitioning is a partitioning scheme in which a predetermined hash function and hash map are used to assign rows in a table to respective processing modules and data-storage facilities. The hash function generates a hash bucket number for each row, and the hash bucket numbers are mapped to data-storage facilities.
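The two-step assignment described above (row to hash bucket, hash bucket to data-storage facility) can be illustrated with a minimal sketch. The bucket count, the facility count, the use of CRC-32 as the hash function, the round-robin bucket map, and all function names are assumptions chosen for illustration only; they are not taken from any particular database system.

```python
import zlib

NUM_BUCKETS = 1024      # hypothetical number of hash buckets
NUM_FACILITIES = 4      # hypothetical number of data-storage facilities

# Hash map: bucket number -> data-storage facility.
# A round-robin assignment is assumed here purely for illustration.
BUCKET_MAP = [bucket % NUM_FACILITIES for bucket in range(NUM_BUCKETS)]


def hash_bucket(key: str) -> int:
    """Return the hash bucket number for a partitioning-column value.

    CRC-32 stands in for the predetermined hash function (an assumption).
    """
    return zlib.crc32(key.encode("utf-8")) % NUM_BUCKETS


def facility_for(key: str) -> int:
    """Look up the data-storage facility that owns the key's hash bucket."""
    return BUCKET_MAP[hash_bucket(key)]


def partition_rows(rows, key_column):
    """Group rows by the facility their partitioning-column value maps to."""
    facilities = {f: [] for f in range(NUM_FACILITIES)}
    for row in rows:
        facilities[facility_for(str(row[key_column]))].append(row)
    return facilities


if __name__ == "__main__":
    sample = [{"order_id": i, "amount": 10 * i} for i in range(8)]
    for facility, assigned in partition_rows(sample, "order_id").items():
        print(facility, [r["order_id"] for r in assigned])
```

Keeping the bucket map separate from the hash function means the bucket-to-facility mapping can be changed without changing the hash function itself.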
Natural data skew can be particularly troublesome for certain customer scenarios. It is important that the primary index chosen for a big table distributes the rows fairly evenly across the processing modules. It is also important to ensure that joins to other tables are efficient in a data warehouse implemented on such a system.
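The effect of a primary index choice on row distribution can be illustrated with a small sketch that counts rows per processing module and reports the ratio of the busiest module to the average. The CRC-32 hash, the modulo mapping, the `skew_ratio` measure, and the function names are all assumptions made for illustration, not a description of any specific system.

```python
import zlib
from collections import Counter


def rows_per_module(key_values, num_modules):
    """Count how many rows hash to each processing module.

    CRC-32 modulo the module count is an assumed stand-in for the
    hash function and hash map.
    """
    counts = Counter(
        zlib.crc32(str(value).encode("utf-8")) % num_modules
        for value in key_values
    )
    return [counts.get(module, 0) for module in range(num_modules)]


def skew_ratio(counts):
    """Return max/mean of the per-module row counts (1.0 means perfectly even)."""
    mean = sum(counts) / len(counts)
    return max(counts) / mean if mean else 0.0


if __name__ == "__main__":
    modules = 4
    # A key where every row carries the same value lands on a single module.
    skewed = rows_per_module(["US"] * 1000, modules)
    # A unique key spreads rows across the modules.
    unique = rows_per_module(range(1000), modules)
    print("single-valued key:", skewed, skew_ratio(skewed))
    print("unique key:", unique, skew_ratio(unique))
```

When every row carries the same key value, all rows land on one module and the ratio equals the number of modules, which is the worst case for parallel processing.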