In recent years, the amount of data handled by data management systems has grown at a much higher rate than the storage capacity of storage media. Data management systems have evolved to cope with this growth by distributing data over multiple servers. This distribution is accompanied by various techniques that address different aspects of data management. One such technique is sharding, a type of distribution that minimizes query time for predetermined queries. Another is replication, which provides fault tolerance in a distributed data management system by keeping multiple copies of the data on different servers.
A database program or a data management system generally displays data as two-dimensional tables, of columns and rows. However, data are typically stored as one-dimensional strings. A row-based store typically serializes the values in a row together, then the values in the next row, and so on.
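As a minimal illustration of the row-based layout described above, the following sketch flattens a small table into a single one-dimensional sequence, row by row. The table contents and the function name serialize_row_store are hypothetical and chosen only for illustration; an actual store would additionally encode the values into bytes.

```python
# Hypothetical two-row table; the column names and values are illustrative only.
columns = ["id", "name", "postal_code"]
rows = [
    (1, "Ada", "70112"),
    (2, "Bo", "10001"),
]


def serialize_row_store(rows):
    """Flatten rows into one sequence: all values of row 1, then row 2, and so on."""
    return [value for row in rows for value in row]


flat = serialize_row_store(rows)
print(flat)  # [1, 'Ada', '70112', 2, 'Bo', '10001']
```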
Currently available relational database management systems partition a database by applying specified criteria to split it. In general, a partitioning key is used to assign each row to a partition based on those criteria. Commonly used approaches include range partitioning, list partitioning, hash partitioning, round robin partitioning, and composite partitioning. In range partitioning, a partition can be defined by determining whether the partitioning key falls inside a certain range. For example, a partition can be created to include all rows in which the values in a column of postal codes are between 70000 and 79999.
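The postal code example can be sketched as follows. This is only an illustrative sketch: the function name range_partition, the dictionary-based row format, and the sample rows are assumptions, not part of any particular database system.

```python
def range_partition(rows, key, low, high):
    """Split rows into those whose partitioning key lies in [low, high] and the rest."""
    inside = [row for row in rows if low <= row[key] <= high]
    outside = [row for row in rows if not (low <= row[key] <= high)]
    return inside, outside


# Hypothetical sample rows; postal codes are stored as integers for comparison.
sample_rows = [
    {"id": 1, "postal_code": 70112},
    {"id": 2, "postal_code": 10001},
    {"id": 3, "postal_code": 79999},
]

in_range, out_of_range = range_partition(sample_rows, "postal_code", 70000, 79999)
```

Here in_range holds the rows with ids 1 and 3, and out_of_range holds the row with id 2.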
In list partitioning, a partition can be assigned a list of values, and the partition is chosen if the partitioning key has one of the values on the list. For example, a partition built to include data relating to Nordic countries can include all rows in which a column of country names contains one of the text string values Iceland, Norway, Sweden, Finland, Denmark, and the like. In hash partitioning, the value of a hash function can determine membership in a partition. For example, for a partitioning scheme in which there are four partitions, the hash function can return a value from 0 to 3 to designate one of the four partitions. Round robin partitioning can be used to distribute storage and/or processing loads among multiple data partitions and/or servers or server processes according to a pre-set rotation among the available partitions or servers or server processes. As an example, a first data unit can be directed to a first partition of three partitions, a second data unit to the second partition, a third data unit to the third partition, a fourth data unit to the first partition, and so forth. In composite partitioning, certain combinations of other partitioning schemes can be allowed, for example by first applying a range partitioning and then a hash partitioning.
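The list, hash, round robin, and composite schemes can each be sketched in a few lines. All function names and partition labels here are hypothetical, and the choice of CRC32 as the hash function is an assumption made only so that the result is stable across runs; any deterministic hash would serve.

```python
import zlib

# Values taken from the Nordic countries example in the text.
NORDIC = {"Iceland", "Norway", "Sweden", "Finland", "Denmark"}


def list_partition(country):
    """List partitioning: choose the partition whose value list contains the key."""
    return "nordic" if country in NORDIC else "other"


def hash_partition(key, num_partitions=4):
    """Hash partitioning: a hash of the key, reduced to 0..num_partitions-1."""
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions


def round_robin_partition(index, num_partitions=3):
    """Round robin: the i-th data unit (zero-based) goes to partition i mod n."""
    return index % num_partitions


def composite_partition(postal_code, num_hash_partitions=4):
    """Composite: apply a range partitioning first, then a hash partitioning."""
    range_part = "70000-79999" if 70000 <= postal_code <= 79999 else "other"
    return range_part, hash_partition(postal_code, num_hash_partitions)
```

With three partitions numbered from zero, round_robin_partition assigns the first four data units to partitions 0, 1, 2, and 0, matching the rotation described above.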
Data replication refers to storing the same data on multiple storage devices. Current systems for processing data for calculation of metrics take two different approaches. In one approach, known as batch processing, data is distributed equally and arbitrarily among the different servers, and the computation is then done in two stages, known as the map stage and the reduce stage. In the map stage, each server does all the processing that can be done on a per-row basis; in the reduce stage, the remaining computation is done. This requires that all data taking part in the calculation for the same key value be moved to the same server after the map stage and before the reduce stage. The process of moving data between the map and reduce stages is called the shuffle stage. For certain kinds of metrics, an additional stage, called a combination stage, can improve performance by reducing network traffic during the shuffle. The combination stage is very similar to the reduce stage and can be regarded as a local reduce stage, in which the data is processed on the server where the map finished running. Some well-known systems taking this approach are Hadoop and Spark. The Hadoop ecosystem also provides a resilient storage layer, the Hadoop Distributed File System (HDFS), which, by means of replication of data, enables the computation to continue if a few servers fail. However, recovery may only work up to a maximum number of failures, which is one less than the replication factor, that is, the number of copies of the data that are saved.
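The map, combination, shuffle, and reduce stages can be simulated in a single process as a rough sketch. This is not how Hadoop or Spark are implemented; servers are modeled as plain lists, the metric is assumed to be a per-key sum, and all function names, the row fields country and amount, and the CRC32-based routing are hypothetical.

```python
import zlib
from collections import defaultdict


def map_stage(rows):
    """Per-row processing: emit a (key, value) pair for each row."""
    return [(row["country"], row["amount"]) for row in rows]


def combine_stage(pairs):
    """Local reduction on the mapping server, so fewer pairs cross the network."""
    local = defaultdict(int)
    for key, value in pairs:
        local[key] += value
    return list(local.items())


def shuffle_stage(per_server_pairs, num_servers):
    """Move all pairs that share a key to the same destination server."""
    buckets = [defaultdict(list) for _ in range(num_servers)]
    for pairs in per_server_pairs:
        for key, value in pairs:
            target = zlib.crc32(key.encode("utf-8")) % num_servers
            buckets[target][key].append(value)
    return buckets


def reduce_stage(bucket):
    """Finish the computation for every key held by one server."""
    return {key: sum(values) for key, values in bucket.items()}


def run_batch(servers):
    """Run map, combine, shuffle, and reduce over a list of per-server row lists."""
    combined = [combine_stage(map_stage(rows)) for rows in servers]
    result = {}
    for bucket in shuffle_stage(combined, len(servers)):
        result.update(reduce_stage(bucket))
    return result
```

In this sketch the combination stage collapses repeated keys on each server before the shuffle, which is where the reduction in network traffic would come from.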
The second approach is to distribute the data so that all data related to one value of a column remains on one server. This way, a shuffle is no longer necessary before reduction. However, the distribution can follow only a single column, or a combination of columns treated as a single column. Replication is achieved by making identically partitioned copies. A subset of these systems are stream processing systems, which do not deal with storage of data; that is to say, if a server fails, the data needs to be reprocessed in the system.
Conventional database systems allow partitioning of rows based on a primary key (partitioning key). This is equivalent to partitioning data along one dimension. This is problematic: if queries along multiple dimensions are run against such a system, or a group-by is done along a dimension different from the partition dimension, a large amount of data must move across nodes, thereby reducing the overall performance of the queries.
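The cost of grouping along a dimension other than the partition dimension can be illustrated by counting how many rows would have to change servers. This is a simplified sketch under stated assumptions: rows are placed by a CRC32 hash of one column, a group-by is assumed to require each row on the server that the same hash assigns to the grouping column, and the names server_for and rows_to_move are hypothetical.

```python
import zlib


def server_for(value, num_servers):
    """Assumed placement rule: hash the column value onto one of the servers."""
    return zlib.crc32(str(value).encode("utf-8")) % num_servers


def rows_to_move(rows, partition_col, group_col, num_servers):
    """Count rows whose current server differs from the server the group-by needs."""
    moved = 0
    for row in rows:
        current = server_for(row[partition_col], num_servers)
        needed = server_for(row[group_col], num_servers)
        if current != needed:
            moved += 1
    return moved


# Hypothetical rows with two candidate dimensions.
orders = [
    {"customer": "c1", "product": "p1"},
    {"customer": "c1", "product": "p2"},
    {"customer": "c2", "product": "p1"},
    {"customer": "c3", "product": "p3"},
]

same_dimension = rows_to_move(orders, "customer", "customer", 3)
other_dimension = rows_to_move(orders, "customer", "product", 3)
```

When the grouping column is the partitioning column, no row moves; grouping by the other column generally forces some rows across nodes, which is the data movement described above.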
Accordingly, there exists a need for a system and method providing a storage-efficient multi-key sharding approach for use in a data management system.