One or more aspects of the invention relate generally to storing data in a distributed database.
Nowadays, distributed database appliances often allow the processing of relational SQL queries with little tuning effort. One exception, however, is the distribution key, which defines how data is distributed among the computing nodes of a distributed database appliance. Oftentimes, a distributed database appliance is essentially a massively parallel processing system with a partitioned database, often with partitioned tables. For efficient processing of complex queries such as joins, it is crucial that the rows to be joined from multiple tables reside on the same computing node. In general, it is also important to reach a data distribution without significant skew, so that all computing nodes can work on the data simultaneously. The distribution key determines how the table data is distributed across the nodes. The ideal distribution key depends on the table contents as well as on the queries to be processed, which makes picking it manually difficult. Picking the wrong distribution key can have a severe performance impact, making query execution, e.g., 10× slower or more.
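By way of illustration only, the effect of a distribution key on skew can be sketched as follows. The sketch assumes hash-based distribution, which is one common scheme and is not necessarily the one used by any particular appliance; the helper names `node_for`, `distribute` and `skew`, the column names, and the sample rows are all hypothetical.

```python
import zlib
from collections import Counter


def node_for(key_value, num_nodes):
    """Map a distribution-key value to a node index using a stable hash (assumed scheme)."""
    return zlib.crc32(str(key_value).encode("utf-8")) % num_nodes


def distribute(rows, key, num_nodes):
    """Count how many rows land on each node when `key` is the distribution key."""
    counts = Counter({node: 0 for node in range(num_nodes)})
    for row in rows:
        counts[node_for(row[key], num_nodes)] += 1
    return counts


def skew(counts):
    """Ratio of the fullest node's row count to the mean; 1.0 means perfectly even."""
    mean = sum(counts.values()) / len(counts)
    return max(counts.values()) / mean if mean else 0.0


# Hypothetical sample table: a unique id column and a low-cardinality country column.
rows = [{"order_id": i, "country": "US" if i % 10 else "DE"} for i in range(10_000)]

by_id = distribute(rows, "order_id", num_nodes=4)
by_country = distribute(rows, "country", num_nodes=4)

print("skew with order_id as key:", round(skew(by_id), 2))
print("skew with country as key: ", round(skew(by_country), 2))
```

In this sketch the `country` column has only two distinct values, so at least two of the four nodes receive no rows at all, whereas the unique `order_id` column spreads rows far more evenly. Because `node_for` depends only on the key value, rows of two tables that share the same key value map to the same node, which is the co-location property that joins benefit from.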
In today's computing centers, customers often use only a fraction of the available disk space. Disk space is inexpensive, and a high number of spindles is required to reach good performance, so more disk capacity is typically installed than the data itself needs. It is also typical that distributed database appliances have a lot of idle time over the day. So, despite peaks of high usage, there is unused capacity that can be used without impacting the productive workload.