The present invention relates to query optimization in database distribution systems.
Key-value stores have recently been used for scale-out (horizontal scaling) data management, especially for web applications. Data is divided into small fragments and distributed over multiple storage nodes. A key is associated with each fragment, and the key-value store provides key-based operations (such as put and get) that enable an application to access data fragments by key without knowing their physical location. Key-based operations provide an abstraction layer over the data and make it possible to scale out data stores: the system can add and remove storage nodes without disrupting applications that access the data through such operations.
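The put and get abstraction described above can be illustrated with a minimal sketch. The class name `KeyValueStore`, the hash-based placement, and the naive rebalancing in `add_node` are assumptions made for illustration only; they do not describe any particular key-value store.

```python
import zlib


class KeyValueStore:
    """Toy key-value store that spreads fragments over in-memory nodes."""

    def __init__(self, num_nodes):
        # Each "storage node" is modeled as a plain dictionary (assumption).
        self.nodes = [dict() for _ in range(num_nodes)]

    def _node_for(self, key):
        # Deterministic hash so that placement is reproducible across runs.
        index = zlib.crc32(key.encode("utf-8")) % len(self.nodes)
        return self.nodes[index]

    def put(self, key, value):
        self._node_for(key)[key] = value

    def get(self, key):
        # The caller never learns which node holds the fragment.
        return self._node_for(key).get(key)

    def add_node(self):
        # Naive rebalancing: collect every fragment and place it again.
        fragments = [item for node in self.nodes for item in node.items()]
        self.nodes = [dict() for _ in range(len(self.nodes) + 1)]
        for key, value in fragments:
            self.put(key, value)


store = KeyValueStore(num_nodes=3)
store.put("user:1", {"name": "Ada"})
store.put("user:2", {"name": "Grace"})
store.add_node()  # the application code below does not change
print(store.get("user:1"))
```

The application only ever calls put and get, so the number of nodes can change without the application being modified.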
However, key-based operations also make it non-trivial to efficiently process more complicated data access, such as a relational query that includes a join. A traditional relational database management system (RDBMS) relies on various methods to access data stored on disk. In particular, the scan operation plays a key role in letting the RDBMS efficiently read a set of data in a table. Unfortunately, key-value stores usually do not support such scan operators. A query must be executed using only key-based lookup operations (i.e., get operations) that retrieve data fragments one by one, which can be much more expensive than a scan operator because of the response time overhead of each operation.
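To make the cost of one-by-one lookups concrete, the following sketch joins a list of orders with a customer table that can only be reached through get. The `CountingStore` class, the sample records, and the 2 ms latency figure are hypothetical values chosen for illustration.

```python
LOOKUP_LATENCY_MS = 2.0  # assumed round-trip time of one get, illustrative only


class CountingStore:
    """Hypothetical key-value table that counts how many gets it serves."""

    def __init__(self, data):
        self._data = dict(data)
        self.get_calls = 0

    def get(self, key):
        self.get_calls += 1
        return self._data.get(key)


def lookup_join(orders, customers):
    """Join orders with customers using one get per order (no scan)."""
    joined = []
    for order in orders:
        customer = customers.get(order["customer_id"])
        if customer is not None:
            joined.append((order["order_id"], customer["name"]))
    return joined


customers = CountingStore({"c1": {"name": "Ada"}, "c2": {"name": "Grace"}})
orders = [
    {"order_id": "o1", "customer_id": "c1"},
    {"order_id": "o2", "customer_id": "c2"},
    {"order_id": "o3", "customer_id": "c1"},
]
rows = lookup_join(orders, customers)

# When lookups are issued one after another, their latencies add up.
sequential_latency_ms = customers.get_calls * LOOKUP_LATENCY_MS
print(rows, sequential_latency_ms)
```

Under this assumption the lookup overhead grows linearly with the number of joined records, whereas a scan would pay its setup cost once.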
On the other hand, one of the inherent features of such stores is the capability to respond to multiple requests at the same time, i.e., to process requests in parallel. In systems that use key-value stores for backend storage while providing a relational interface to applications, the optimizer for relational queries should be able to take advantage of the parallelization capabilities of the underlying key-value stores.
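As a rough illustration of issuing several lookups at once, the sketch below uses a thread pool from the Python standard library. The `get` function, the sample data, and the 10 ms delay are stand-ins for a remote key-value store and are assumptions, not part of any real client API.

```python
import time
from concurrent.futures import ThreadPoolExecutor

LOOKUP_DELAY_S = 0.01  # assumed response time of one lookup

# Hypothetical data set standing in for fragments held by a remote store.
DATA = {f"k{i}": i * i for i in range(20)}


def get(key):
    time.sleep(LOOKUP_DELAY_S)  # stands in for a network round trip
    return DATA.get(key)


def parallel_get(keys, max_workers):
    """Issue up to max_workers lookups at the same time."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in the same order as the input keys.
        return list(pool.map(get, keys))


keys = list(DATA)
values = parallel_get(keys, max_workers=10)
print(values[:5])
```

Because the threads spend most of their time waiting, the waits overlap and the total elapsed time is shorter than issuing the same lookups one after another.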
One challenge here is to make the optimization aware of effective parallelism: the degree of parallelism that actually reduces execution time. Parallel key lookup is effective if it can hide the latency of each lookup, but excessive parallelism does not improve performance if query execution is already busy (i.e., it has become CPU bound). The effective parallelism depends on the ratio between the response time of a key lookup and the computation time of a query, which differs across environments. Thus, an automated approach based on optimization is crucial for efficiently executing a query on key-value stores.
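The dependence on the ratio between lookup response time and computation time can be sketched with a deliberately simplified cost model. The model below is an assumption made for illustration, not the optimization method itself: it supposes that each retrieved item needs a fixed amount of computation and that lookups overlap perfectly up to the chosen degree of parallelism.

```python
import math


def estimated_time_ms(num_lookups, lookup_ms, compute_ms, parallelism):
    """Estimated execution time under the simplified model (assumption).

    Per item, the query is limited either by computation or by the
    lookup latency divided across the parallel requests.
    """
    per_item_ms = max(compute_ms, lookup_ms / parallelism)
    return num_lookups * per_item_ms


def effective_parallelism(lookup_ms, compute_ms):
    """Smallest degree of parallelism that fully hides lookup latency."""
    return max(1, math.ceil(lookup_ms / compute_ms))


# Illustrative numbers only: 2 ms per lookup, 0.5 ms of CPU work per item.
best = effective_parallelism(lookup_ms=2.0, compute_ms=0.5)
for degree in (1, 2, best, 2 * best):
    print(degree, estimated_time_ms(1000, 2.0, 0.5, degree))
```

In this model, raising the degree of parallelism beyond the value returned by `effective_parallelism` leaves the estimate unchanged, which mirrors the CPU-bound case described above.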