Database engineering often involves different kinds of optimizations relative to queries on the database, and notably aims at performing workload prediction. The time taken to perform a query is called Workload, or simply Execution time, or Query runtime. This is in general the quantity that the optimizer will minimize even if it also often has to take into account other costs as the memory space and resource used. It is important to notice that the time needed to answer a query is the sum of the time to compute the query plan, and the time to execute the query plan. Some technologies tends to find a trade-off between these two quantities (such as in document US20050071331A1).
The most important application of Query Runtime Prediction is query optimization, which relies on these predictions to select a particular execution plan from an often very large number of candidates. In real-world applications, a database system has to answer numerous queries in a few time, that is why it operates a Query Scheduling (as detailed in the article “Distribution-Based Query Scheduling”, by Chi, Hacigum, Hsiung, Naughton, 2013), this scheduling being based on different criteria as the expected query runtime, the priority of the user from which the query was sent, the priority of the concerned task.
Query runtime is thus a central information that one needs to estimate in order to schedule the queries. In particular, one wants to avoid bottleneck queries that could delay other important queries. Moreover, it could also be interesting to evaluate the runtime of a query in order to quantify how much resource to put on its computation such that the query will be performed before a given time limit. This is explained in document US20050192937. As mentioned, predicting the runtime of a query is at the core of query scheduling and therefore this problem has been intensively studied.
One natural way to predict the runtime of a query is to look for the runtime of “similar” queries that have already been performed (and for which one has stored the time used). In order to implement this kind of methods, one has to find good representations in order to compare the queries between them and to learn how to model their runtimes, using for example metrics as in the article “Predicting Multiple Metrics for Queries: Better Decisions Enabled by Machine Learning”, by Ganapathi, Kuno, Dayal, Wiener, Fox, Jordan, and Patterson (2009), or model selection as in the article “Learning to Predict Response Times for Online Query Scheduling”, by Macdonald, Tonellotto, and Ounis (2011). The choice of the representation of the queries and the way to model the runtime according to this representation depends on the language on which the queries are built. For example, the research report “Predicting SPARQL Query Execution Time and Suggesting SPARQL Queries Based on Query History”, by Hasan and Gandon deals with the case of the SPARQL language.
The prediction of the runtime of a query can then be evaluated after a Machine Learning algorithm has been applied to a training set. In particular, the thesis “Dynamic Cost Models for Query Runtime Prediction” by Lima analyses the case of relational databases as PostgreSQL and tests different classical Machine Learning algorithms using a particular representation of the query. Undersupervised averaging, Undersupervised Linear Regression, Undersupervised K-Nearest-Neighbor Regression, Undersupervised Nadaraya-Watson Regression are all concepts at use.
Several articles have demonstrated that a powerful meta-method consists in dividing the queries of the training set in ranges of execution runtime and then apply the Machine Learning algorithm on each of the range. In particular, the article “PQR: Predicting Query Execution Times for Autonomous Workload Management”, by Gupta, Mehta, Dayal (2008) uses this separation in time. Some methods also apply this paradigm recursively and obtain a PQR tree (as in documents U.S. Pat. No. 7,895,192B2 and US20080270346), where each node of the tree uses a predictor learned on the training data (recall that the training data is a set of queries whose execution time is known). One of the most robust general method consists in clustering the query times of the training set queries and then predicting the execution time of a given (new) query finding the cluster from which it is the closest (using a particular notion of similarity, often based on the similarity between the query and the centroids of the clusters) and then compute the query time according to the query times of the cluster. The above-mentioned research report “Predicting SPARQL Query Execution Time and Suggesting SPARQL Queries Based on Query History” by Hasan and Gandon and the article by Wang and Wong, “Optimal k-means Clustering in One Dimension by Dynamic Programming” show for example that this last computation can be performed by evaluating a predicted model learned on the cluster (that can be a simple average—in which case this last computation amounts in a way to a quantization (i.e. replacement of an input value by a closest one in a predetermined set of values, according to a predetermined distance)—or a more complex Machine Learning method as SVM). This framework relates to the more general field of cluster analysis.
Cluster Analysis concerns the task of partitioning a set of objects in groups (called clusters), so that in each group the data are similar (see the article of Jain et al., “Data Clustering: A Review”). It appears as a central problem in Data Mining (see the article of Chen et al., “Data mining: an overview from a database perspective”), Machine Learning (see the book of Murphy, “Machine Learning, A Probabilistic Perspective”), and Large Scale Search (see the article of Goodrum, “Image Information Retrieval: An Overview of Current Research”). Cluster Analysis is an important tool for Quantization: assigning a center to each cluster, one has a simple quantization that consists in quantizing each point to the center of its cluster.
The K-means clustering problem is the most famous problem of Cluster Analysis and was introduced by Stuart Lloyd in 1957 at Bell Laboratories, as a technique for Pulse-Code Modulation. The Lloyd algorithm takes as input a collection of p-dimensional points and outputs a partition of these points that aims to minimize the “total distortion”. This algorithm is only a heuristic (it does not provide the optimal clustering). But in fact we cannot hope for an exact algorithm since the K-means clustering problem is NP-hard in the non-one-dimensional case. The Lloyd algorithm is nowadays still widely used. Several variants have also been proposed (see J. A. Hartigan (1975), “Clustering algorithms”, John Wiley & Sons, Inc.”).
The one-dimension application is particularly important. One of the most famous algorithms for this problem is called Jenks natural breaks optimization developed in 1967 (see the book of Jenks, “The Data Model Concept in Statistical Mapping”, in International Yearbook of Cartography) and was introduced for cartographic purpose. As Lloyd algorithm, it is only a heuristic. In 2011 an exact algorithm, called CKmeans, was developed by Wang and Song (see the article of Wang and Wong, “Optimal k-means Clustering in One Dimension by Dynamic Programming”). This algorithm is the corner stone of document U.S. Pat. No. 1,543,036A. It runs in time O(K*n^2) where K is the requested number of clusters and n is the number of real numbers. Even more recently (in 2013), Maarten Hilferink has developed a more efficient algorithm and provides an implementation of it. This implementation was dedicated to cartography, more precisely for choropleth maps.
All these existing methods are however limited because either they do not produce the optimal K-means clustering, or they are too slow. Within this context, there is still a need for an improved solution to cluster queries for runtime prediction.