1. Background
Computers and computing systems have affected nearly every aspect of modern living. Computers are generally involved in work, recreation, healthcare, transportation, entertainment, household management, etc.
2. Relevant Art
Further, computing system functionality can be enhanced by a computing system's ability to be interconnected to other computing systems via network connections. Network connections may include, but are not limited to, connections via wired or wireless Ethernet, cellular connections, or even computer-to-computer connections through serial, parallel, USB, or other connections. The connections allow a computing system to access services at other computing systems and to quickly and efficiently receive application data from other computing systems.
Interconnection of computing systems has facilitated distributed computing systems. In some distributed systems, nodes of a distributed system each perform portions of work to accomplish an overall computing task or set of tasks. Some distributed systems may implement a distributed database where different rows of a distributed table are stored at different nodes. Such distributed databases work best when rows are evenly distributed. In particular, if one node has significantly more rows than other nodes, that node can become a bottleneck when operations, such as joins, on the database are performed.
To ensure even distribution of rows, databases will often hash a particular column using a hash function that spreads values uniformly and then distribute the rows according to the hash. However, this process does not work for columns that have a high percentage of one value as compared to other values, i.e., “skewed” columns. Additionally, even though rows may be distributed evenly based on one column, a join on a skewed column may result in a bottleneck scenario. For example, consider an order database that stores information about orders received by an on-line retailer. The order database may have a table that identifies an order number, a customer, and a date. If the table were distributed based on a hash of the order number, the rows would be distributed very evenly, because each order number is unique. In fact, the order number itself could probably be used without needing to perform a complex hash on the order number.
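By way of illustration only, the hash-based distribution described above may be sketched as follows. The sketch assumes a hypothetical four-node system, a synthetic order table, and SHA-256 as the hash; the names NUM_NODES, node_for, and orders are illustrative assumptions and are not drawn from any particular database product.

```python
import hashlib
from collections import Counter

NUM_NODES = 4  # assumed cluster size, for illustration only


def node_for(value, num_nodes=NUM_NODES):
    """Map a column value to a node index using a stable hash (SHA-256 here)."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


# Hypothetical order table: (order_number, customer, date), one row per order.
orders = [(n, f"customer_{n % 50}", "2024-01-01") for n in range(1, 10001)]

# Distribute rows by a hash of the unique order number.
rows_per_node = Counter(node_for(order_number) for order_number, _, _ in orders)

# Because order numbers are unique and sequential, the number itself
# (taken modulo the node count) also spreads rows evenly without a hash.
modulo_per_node = Counter(order_number % NUM_NODES for order_number, _, _ in orders)

for node in range(NUM_NODES):
    print(node, rows_per_node[node], modulo_per_node[node])
```

In this sketch each node receives roughly one quarter of the rows under the hash, and exactly one quarter under the simple modulo, which corresponds to the observation that a unique key may not require a complex hash.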
However, suppose that after the table was distributed, a join were to be performed on the customer column, and that one customer has an unusually high number of orders as compared to other customers. The portion of the join corresponding to that customer would contain an unusually high percentage of the join result, all of which would be stored on one node. That node would be required to do significantly more work than the other nodes, which would degrade the performance of the entire system. A similar analysis may be performed based on the date column. For example, Cyber Monday would have an unusually large number of sales as compared to other days of the year.
Further, if the table were initially distributed based on the customer column or the date column, the table data itself would be skewed from the outset.
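The skew described above may be illustrated with a minimal sketch in which rows are placed on nodes according to a hash of the customer column. The four-node system, the synthetic data, and the assumption that one hypothetical customer ("big_customer") placed 6,000 of 10,000 orders are all illustrative and are not taken from any real workload.

```python
import hashlib
from collections import Counter

NUM_NODES = 4  # assumed cluster size, for illustration only


def node_for(value, num_nodes=NUM_NODES):
    """Map a column value to a node index using a stable hash (SHA-256 here)."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


# Hypothetical orders: one customer accounts for 6,000 of the 10,000 rows;
# every other customer placed exactly one order.
orders = []
for n in range(1, 10001):
    customer = "big_customer" if n <= 6000 else f"customer_{n}"
    orders.append((n, customer))

# Place rows by a hash of the customer column, as a join on customer
# (or an initial distribution on customer) would require.
rows_per_node = Counter(node_for(customer) for _, customer in orders)

# Every row for the skewed customer hashes to the same node.
hot_node, hot_rows = rows_per_node.most_common(1)[0]
print("hot node:", hot_node, "rows:", hot_rows)
print("all nodes:", dict(rows_per_node))
```

Because identical values always hash to the same node, no choice of hash function can spread the skewed customer's rows; the node holding that customer ends up with at least 60% of the rows in this sketch, while the remaining nodes share the rest.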
Thus, it would be helpful to reduce bottlenecks in distributed database systems caused by skewed distributions or joins.
The subject matter claimed herein is not limited to embodiments that solve any disadvantages or that operate only in environments such as those described above. Rather, this background is only provided to illustrate one exemplary technology area where some embodiments described herein may be practiced.