Databases that are distributed across many computers, including some SQL databases, Hadoop databases (e.g., eBay's® Hadoop® database), etc., receive queries and, in some cases, perform those queries in parallel on different parts of a data set that are distributed across different servers. The database then returns a response to the originator of the query and/or to one or more other designated locations. Responses to queries are generally returned fastest when the data being queried is evenly distributed among multiple servers. All other things being equal, some distributed databases take longer to respond to queries when the data set(s) being queried are concentrated entirely or mostly on a small number of servers (or on one server). Thus, long delays can be caused by such concentrations of data (sometimes called “data skew”). However, all other things are not always equal. There may be various other causes for a long-delayed response to a query, such as resource congestion. That is, the distributed database may receive large numbers of queries at the same time. In such a case, even a query on well-distributed data may take a long time.
Such congestion occurs because various queries are designed and sent to the database servers by various users and groups of users. There is generally little to no coordination among users to ensure that the system receives a steady supply of queries, rather than receiving many queries during some periods (e.g., weekdays from 9 AM-5 PM) and few queries at other times (e.g., weekends at 2 AM).
Although some database systems redistribute data based on the queries that come in, in some cases this is not necessary, even when the data for a particular query is highly skewed. For example, some queries may be so simple, and/or be performed on so little data, that even when a high percentage of that data is concentrated on one server, the response would come in an acceptable amount of time, unless the database (or the specific server on which the data for that query is concentrated) is congested. That is, even in a case where there is a high data skew and a long delay, the high data skew may not be at fault for the long delay.
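By way of illustration only, the distinction drawn above can be sketched in a few lines of Python. The sketch is not part of any described system; the function names, the use of per-server row counts as a stand-in for work, and the throughput and delay figures are all assumptions chosen for the example. It shows two data sets with the same skew ratio, only one of which is large enough for the skew itself to plausibly account for a long delay.

```python
from statistics import mean


def skew_ratio(rows_per_server):
    """Largest per-server row count divided by the average (1.0 means perfectly even).

    Hypothetical measure of data skew, assumed for illustration.
    """
    if not rows_per_server:
        return 0.0
    avg = mean(rows_per_server)
    return max(rows_per_server) / avg if avg else 0.0


def skew_could_explain_delay(rows_per_server, rows_per_second, acceptable_seconds):
    """Return True if the busiest server alone, with no congestion, would
    need longer than the acceptable delay to process its share.

    rows_per_second is an assumed uncongested per-server throughput.
    """
    if not rows_per_server:
        return False
    busiest = max(rows_per_server)
    return busiest / rows_per_second > acceptable_seconds


# Two data sets with identical skew (90% of rows on one server).
small_job = [900, 50, 50]
large_job = [9_000_000, 500_000, 500_000]

for name, rows in (("small", small_job), ("large", large_job)):
    ratio = round(skew_ratio(rows), 2)
    blamed = skew_could_explain_delay(rows, rows_per_second=10_000, acceptable_seconds=5)
    print(name, ratio, blamed)
```

Under these assumed figures, the small job finishes well within the acceptable delay on the busiest server despite its skew, so a long delay for that job would point toward congestion rather than skew; the large job, with the same skew ratio, would not.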
The performance of a query by the database servers and the return of the response may be referred to herein as a “job” and the delay between submitting the query and receiving the response may be referred to as a “job delay”. Currently, it is very difficult to know whether a job delay is caused by a data skew issue or a congestion issue. This is a problem for users of distributed databases, because a data skew issue can cause their jobs to run extremely slowly, which may not only affect the business contracts for the use of the database, but also degrade the performance of the entire database due to resource contention. Accordingly, there is a need for a user-friendly tool to determine the cause of long job delays.
The headings provided herein are merely for convenience and do not necessarily affect the scope or meaning of the terms used.