As the size and organizational complexity of large data sets (e.g., what is commonly referred to as “big data”) continue to increase, it has become commonplace to use grids of multiple node devices to store such data sets. The use of grids of node devices may provide improved speed of access to data sets and/or redundancy of storage as a protection against loss of data due to device failures. Over time, different approaches to organizing large data sets for storage among grids of node devices have arisen, each of which is directed to achieving somewhat different goals.
In answer to the sheer size of some large data sets, an approach sometimes referred to as “normalization” may be adopted, in which various techniques may be used to reduce overall size by identifying and taking advantage of opportunities to combine otherwise separate data structures, thereby eliminating redundant entries among them and/or instances of null values. Data compression may also be used to further reduce overall size. Unfortunately, such an approach has the disadvantage that retrieving data typically requires various “denormalization” techniques and/or decompression to be performed in a centralized manner, which in turn requires exchanges of relatively large portions of the data set among devices and thereby slows the speed of data retrieval.
In an opposing approach, a large data set may be stored in denormalized form, deliberately retaining the redundant entries that would otherwise be eliminated by normalization, and tolerating what may be a considerable degree of sparsity in which many entries are filled with null values. Such an approach may greatly improve the speed of retrieval. Unfortunately, the degree of denormalization that may be required to achieve desired speeds of retrieval may increase the overall size to an extent that requires a prohibitively large, complex and costly grid of node devices to provide storage.
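By way of illustration only, the tradeoff between the two approaches may be sketched with a small Python example. The order and customer records, the field names, and the helper functions `normalize`, `denormalize` and `stored_values` are all hypothetical and are not drawn from any particular system; counting stored field values is used here merely as a rough proxy for storage size.

```python
# Hypothetical example data: field names and values are illustrative only.
# The wide (denormalized) form repeats customer details and stores nulls.
DENORMALIZED = [
    {"order_id": 1, "customer": "Acme", "city": "Oslo", "item": "bolt", "note": None},
    {"order_id": 2, "customer": "Acme", "city": "Oslo", "item": "nut", "note": None},
    {"order_id": 3, "customer": "Birch", "city": "Rome", "item": "bolt", "note": "rush"},
    {"order_id": 4, "customer": "Acme", "city": "Oslo", "item": "gear", "note": None},
]


def normalize(rows):
    """Split wide rows into a customer table and an order table, dropping nulls."""
    customers = {}  # customer name -> details, stored once per customer
    orders = []
    for row in rows:
        customers.setdefault(row["customer"], {"city": row["city"]})
        order = {
            "order_id": row["order_id"],
            "customer": row["customer"],
            "item": row["item"],
        }
        if row["note"] is not None:  # keep the sparse field only when present
            order["note"] = row["note"]
        orders.append(order)
    return customers, orders


def denormalize(customers, orders):
    """Rebuild the wide rows; this join is the extra work done at retrieval."""
    return [
        {
            "order_id": o["order_id"],
            "customer": o["customer"],
            "city": customers[o["customer"]]["city"],
            "item": o["item"],
            "note": o.get("note"),
        }
        for o in orders
    ]


def stored_values(customers, orders):
    """Count stored field values (customer key plus its fields, plus order fields)."""
    return sum(1 + len(c) for c in customers.values()) + sum(len(o) for o in orders)


customers, orders = normalize(DENORMALIZED)
wide_count = sum(len(row) for row in DENORMALIZED)
narrow_count = stored_values(customers, orders)
```

In this toy data the denormalized form holds 20 field values while the normalized form holds 17, and the saving grows as more orders share the same customer. The cost appears in `denormalize`, which must bring the customer table and the order table together before any wide row can be returned; where those tables reside on different node devices, that step corresponds to the exchange of data among devices described above.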