When working with large volumes of data, it is often the case that only a small fraction of the data is relevant to a given task. In those cases, identifying the relevant subset early avoids accessing irrelevant data and can thereby result in dramatic performance improvements. Current technologies available in the computing industry focus on organizing data in a specific manner in order to achieve acceptable results. One such technology relies on primary indices, wherein a table is forced to be ordered on some dimension. The downside of this approach, however, is that maintaining strict ordering is often expensive and only a single dimension can be used for retrieval in any given storage scheme.
Another common technology relies on secondary indices, wherein auxiliary data structures are created that provide quick access to records that match desired criteria. Shortcomings of such systems are that they commonly result in random access patterns, are expensive to maintain under updates, and consume additional disk and memory within a system. Additionally, they require user participation in the form of manual or semi-automatic decisions about which indices should be created.
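By way of illustration only, a secondary index may be sketched as an auxiliary mapping from column values to row positions. The function names `build_secondary_index` and `lookup`, and the dictionary-based row layout, are hypothetical and are assumed here solely for the example; they are not taken from any particular system.

```python
from collections import defaultdict

def build_secondary_index(rows, column):
    """Build an auxiliary structure mapping each value of `column`
    to the positions of the rows that hold that value."""
    index = defaultdict(list)
    for position, row in enumerate(rows):
        index[row[column]].append(position)
    return index

def lookup(rows, index, value):
    """Fetch matching rows through the index instead of scanning every row.
    The positions may be scattered, which is the random access pattern
    noted above."""
    return [rows[position] for position in index.get(value, [])]

# Hypothetical sample data for the sketch.
rows = [
    {"id": 1, "city": "Oslo"},
    {"id": 2, "city": "Lima"},
    {"id": 3, "city": "Oslo"},
]
city_index = build_secondary_index(rows, "city")
oslo_rows = lookup(rows, city_index, "Oslo")
```

In this sketch the index occupies memory in addition to the rows themselves and must be rebuilt or updated whenever a row is added or changed, which corresponds to the maintenance and storage costs described above.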
Another common technology relies on table partitioning. Using this approach, a system may allow explicit partitions along certain dimensions. For example, each week can be formed into a separate subset of the data, and then filters on the partitioning dimensions may be optimized so as to access only the relevant partitions (i.e., those that contain data from relevant ranges). A problem with this approach is that it typically works only with a relatively small number of partitions (tens to hundreds) and requires manual tuning by an administrator.
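The weekly partitioning example above may be sketched, under stated assumptions, as follows. The helper names `week_key`, `partition_by_week`, and `query_range` are hypothetical, and ISO calendar weeks are assumed as the partitioning rule purely for illustration.

```python
from collections import defaultdict
from datetime import date

def week_key(day):
    """Partition key: (ISO year, ISO week number) of the given date."""
    iso = day.isocalendar()
    return (iso[0], iso[1])

def partition_by_week(records):
    """Group (date, payload) records into one partition per ISO week."""
    partitions = defaultdict(list)
    for day, payload in records:
        partitions[week_key(day)].append((day, payload))
    return dict(partitions)

def query_range(partitions, start, end):
    """Return the rows dated within [start, end], reading only those
    partitions whose week can contain such dates."""
    low, high = week_key(start), week_key(end)
    touched = [key for key in sorted(partitions) if low <= key <= high]
    rows = [
        record
        for key in touched
        for record in partitions[key]
        if start <= record[0] <= end
    ]
    return rows, touched

# Hypothetical sample data for the sketch.
records = [
    (date(2024, 1, 1), "a"),
    (date(2024, 1, 3), "b"),
    (date(2024, 1, 8), "c"),
    (date(2024, 1, 17), "d"),
    (date(2024, 1, 29), "e"),
]
partitions = partition_by_week(records)
rows, touched = query_range(partitions, date(2024, 1, 8), date(2024, 1, 17))
```

Here the filter on the date dimension is resolved to two of the four weekly partitions before any rows are examined; partitions outside the requested range are never read.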
Yet another common approach relies on Min-Max indices (also called “zone maps”), which automatically maintain simple statistics (usually the minimum and maximum values for all columns) about different dimensions within the data. Based on these statistics, relevant ranges of data can be readily identified. This approach is most useful when it can exploit the natural order of the data, such as when the data is loaded in batches that have very little overlap (as is common, e.g., for time-based loading). Min-Max indices can also be used when there is a primary index on the data.
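A minimal sketch of a Min-Max index over a single integer column is given below for illustration. The `Block` class, the fixed block size, and the function names are assumptions made for the example and do not describe any specific implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Block:
    """A contiguous range of values together with its min/max statistics."""
    values: List[int]
    min_value: int
    max_value: int

def build_blocks(values: List[int], block_size: int) -> List[Block]:
    """Split the column into fixed-size blocks, recording min and max per block."""
    blocks = []
    for start in range(0, len(values), block_size):
        chunk = values[start:start + block_size]
        blocks.append(Block(chunk, min(chunk), max(chunk)))
    return blocks

def scan_range(blocks: List[Block], low: int, high: int) -> Tuple[List[int], int]:
    """Return the values in [low, high] and the number of blocks actually read.
    A block is skipped when its [min, max] interval cannot overlap the query."""
    matches: List[int] = []
    blocks_read = 0
    for block in blocks:
        if block.max_value < low or block.min_value > high:
            continue  # statistics prove the block holds no relevant value
        blocks_read += 1
        matches.extend(v for v in block.values if low <= v <= high)
    return matches, blocks_read

# Values arriving in their natural order, as with time-based loading.
ordered_blocks = build_blocks(list(range(100)), block_size=10)
matches, blocks_read = scan_range(ordered_blocks, 25, 34)
```

With naturally ordered input, the query reads 2 of the 10 blocks. If the same values were stored in an order where every block spanned nearly the full value range, no block could be skipped, which is why this technique depends on batches having little overlap.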
The systems and methods described herein provide an improved approach to data storage and data retrieval that addresses and reduces the impact of the above-identified limitations of existing systems by efficiently managing the data and data operators being processed within a data processing system.