1. Field of the Invention
The present invention relates, in general, to methods and systems for managing data storage in tables and databases, and, more particularly, to methods and systems for providing improved partitioning of a table of data to support adding a node to a data storage cluster.
2. Relevant Background
In the data storage or information technology industry, relational database management systems (RDMS) provide important support for a wide variety of commercial applications, and there is growing demand for methods and devices for effectively and efficiently storing large quantities of data in a manner that allows it to be quickly retrieved and reliably stored. In databases, information is typically stored in rows of data fields storing pieces of information (e.g., one field may store a person's name, another field may store the person's address, and so on within a row) with one or more of the fields providing a key (e.g., a primary key may be included that uniquely identifies each row of data in a database or table). For example, clustered, high-availability databases are used by telecommunication companies and many other service providers such as financial institutions, Web-based retailers and service providers, and the like. The rate of increasing size of databases causes many challenges within the data storage industry including how to add additional storage devices (such as disk drives, tape drives, optical drives, servers, and the like) and/or how to provide more nodes for storing larger and larger tables. For example, more storage nodes may be added to handle rapidly increasing volumes of data stored in rows of a table, and such node addition may require modification of a database cluster to handle the additional information.
Every RDMS developer eventually encounters a situation in which a table stores a huge amount of historical data, but users typically only retrieve small, distinct portions at any particular time. For example, a financial institution may track millions of records related to trades of stocks spanning years, but a user may only need to retrieve trade data for a small time period such as a particular month. To improve query performance as well as storing growing volumes of data in a table, a RDMS developer often splits a large table into separate tables with the same structure (e.g., same fields/columns). A table typically is partitioned horizontally with each member or separate table having the same number of columns/fields as the original table, and each column has the same attributes (such as data type, size, and the like) as the corresponding column in the original table. An ongoing challenge for the RDMS developer is how to partition tables of data in a manner that provides a better utilization of hardware and allows for quick reproduction and/or reorganization of data.
For example, effective hardware utilization may involve deciding how to partition tables in relational database management systems implementing shared-nothing architectures to provide database management. With a shared-nothing approach, each processor has its own data storage as well as local disks or data storages. Except for the communication network, no other resources are shared among the processors. For example, the MySQL™ NDB Cluster storage engine is a distributed, shared-nothing storage engine with synchronous replication with a cluster and automatic horizontal data partitioning across the nodes that store the distributed data (e.g., buckets or partitions of a table). With this storage engine, any given row of data of a table is eligible to be stored in any of the partitions on the clustered nodes. The table's definition specifies which rows map to which partitions based on a partitioning function, and, as a result, it is important to choose or design the partitioning function to achieve effective reorganization of a table of data when a node is being added to a cluster. Partitioning is a significant design problem for other storage engine products as inefficient partitioning can quickly result in undesirable distributions of data (e.g., with some nodes storing larger portions of a table) and inefficient use of storage during reorganization processes.
It may be useful to utilize a table reorganization mechanism that would be used by the storage engine for adding partitions to a table that remain online and available. Design requirements for such a table reorganization mechanism may be that online transactions (e.g., reads, scans, and updates) should not be blocked. Additionally, the reorganization could in some cases be completed without duplication of an entire table, but, instead, only the rows that are moved to a new partition (e.g., in an added node or in an existing node) exist in two places in memory or storage. For example, when the table reorganization module is used in combination with an add node operation, this means that no extra data storage is used on the original older nodes. Further, it may be useful to provide a no-extra-storage-on-old-nodes property because adding a node is often a reaction to a shortage of old nodes.
Many storage engines support a number of differing mechanisms for partitioning tables including by data ranges (e.g., partition a table into 12 partitions coinciding with the months of the year), by use of linear hashing, by modulo hashing, and the like. Linear hashing only needs to split a bucket or partition when adding a new bucket or partition, but linear hashing introduces a skew in data distribution among the data buckets or partitions. An advantage in partitioning by linear hashing is that the adding, dropping, merging, and splitting of partitions is made much faster, which can be beneficial when dealing with tables containing extremely large amounts of data. But, as mentioned above, a disadvantage is that data is less likely to be evenly distributed between partitions as compared with the distribution using modulo hashing in partitioning. Modulo hashing has no skew of data, but, unfortunately, all rows of a table need to be moved when the number of buckets or partitions is changed.
To understand partitioning using linear hashing and modulo hashing, it may be useful to consider a simple example. Consider the following data (‘id’, ‘name’), where the primary key is ‘id’ that may make up the rows of a table with two fields or columns: (1, “Tomas”); (2, “Kent”); (3, “Jonas”); (4, “Frazer”); (5, “John”); (6, “Eliza”); (7, “ ”); (8, “ ”); (9, “ ”); (10, “ ”); (11, “ ”); and (12, “ ”). When this set of data is stored by a storage engine (such as within a MySQL Cluster with the ndb storage engine), the data may be separated onto different nodes. For example, the storage engine may use a modulo hashing distribution for storing the data in two buckets on two nodes, and the data may look like: Node 1—(1, “Tomas”); (3, “Jonas”); (5, “John”); (7, “ ”); (9, “ ”); and (11, “ ”) and Node 2—(2, “Kent”); (4, “Frazer”); (6, “Eliza”); (8, “ ”); (10, “ ”); and (12, “ ”). For a three node configuration, the data may be distributed as: Node 1—(1, “Tomas”); (4, “Frazer”); (7, “ ”); and (10, “ ”); Node 2—(2, “Kent”); (5, “John”); (8, “ ”); and (11, “ ”); and Node 3—(3, “Jonas”); (6, “Eliza”); (9, “ ”); and (12, “ ”). This example has been simplified, and, in practice, modulo hashing involved taking the modulo of the hash of the primary key (rather that of the primary key itself as it may appear here). It can be seen that the data is evenly distributed in both cases with each partition having 6 rows in the first example and each partition having 4 rows of data in the second example. However, the data resides very differently on three nodes compared to two nodes with only four data entries being on the same node in the 2 cases (i.e., (1, “Tomas”); (2, “Kent”); (7, “ ”); and (8, “ ”) are on the same nodes in each partitioning implementation). Even distribution is desirable but reshuffling of data or all rows when adding nodes or partitions can make online addition of nodes or reorganization of partitioned tables expensive, e.g., if 1 gigabyte of data is stored in a partitioned table, adding a node using modulo hashing may require 1 gigabyte of data to be moved.
Alternatively, the storage engine may use a linear hash partitioning for storing this same data in two buckets on two nodes, and the data may look like: Node 1—(1, “Tomas”); (3, “Jonas”); (5, “John”); (7, “ ”); (9, “ ”); and (11, “ ”) and Node 2—(2, “Kent”); (4, “Frazer”); (6, “Eliza”); (8, “ ”); (10, “ ”); and (12, “ ”). For three nodes, the partitions or buckets may be: Node 1—(1, “Tomas”); (5, “John”); and (9, “ ”); Node 2—(2, “Kent”); (6, “Eliza”); and (10, “ ”); and Node 3—(3, “Jonas”); (4, “Frazer”); (7, “ ”); (8, “ ”); (11, “ ”); and (12, “ ”). In another implementation of linear hashing, one of the original nodes may have the same entries as it had before or originally, while the other original node may have half of its original entries with the other half being on the new node. The 3-node partitioning example shows the desirable property that Nodes 1 and 2 do not have any new data as compared to the 2-node configuration. It is, however, evident that the distribution of data is very uneven and skewed when we have 3 nodes. This is apparent with Node 3 having twice as many data entries or rows as Node 1 and Node 2.
Conventional methods of partitioning including use of linear hashing and modulo hashing have significant drawbacks. For example, linear hashing as may be used in partitioning tables with database storage engines (such as MySQL™ NDB Cluster storage engine) allows for minimal copying during a table reorganization (such as to add a node to a clustered data store). However, linear hashing does not give an even data distribution unless the new number of nodes/buckets is some power of two times the original number. Use of modulo hashing distributions provides even distribution over the nodes or partitions on such nodes, but it often requires a great deal of data shuffling as the data on the original nodes is not retained.