On-line analytical processing (“OLAP”) is a way of organizing data in which the data is arranged across one or more categories or “dimensions.” For example, cellular phone calls made in the US (the data) may be classified/organized/accessed in multiple ways: by the US state in which each call was made (the first dimension), the date of each call (the second dimension), the type of phone used to make each call (the third dimension), and the carrier used to make each call (the fourth dimension). Arranging the data in this way allows a user to easily extract information from the data involving multiple variables by accessing the intersection or “union” of two or more dimensions. For example, the various dimensions may be efficiently queried to determine the number of cellular phone calls made in Massachusetts in July 2012, or the number of cellular phone calls made by ANDROID devices on the SPRINT network.
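As an illustration only, the multi-dimensional queries described above can be sketched in a few lines of Python. The record layout, the sample data, and the name `count_calls` are hypothetical and are not drawn from any particular OLAP product; a real cube would index its dimensions rather than scan every record.

```python
# Hypothetical call records; each tuple describes one call along four dimensions.
DIMENSIONS = ("state", "month", "device", "carrier")
CALLS = [
    ("MA", "2012-07", "ANDROID", "SPRINT"),
    ("MA", "2012-07", "IPHONE", "VERIZON"),
    ("NY", "2012-07", "ANDROID", "SPRINT"),
    ("MA", "2012-08", "ANDROID", "ATT"),
    ("CA", "2012-07", "ANDROID", "SPRINT"),
]


def count_calls(records, **criteria):
    """Count the records that match every dimension=value criterion given."""
    positions = {name: DIMENSIONS.index(name) for name in criteria}
    return sum(
        1
        for record in records
        if all(record[positions[name]] == value for name, value in criteria.items())
    )


# Calls made in Massachusetts in July 2012 (two dimensions combined).
massachusetts_july = count_calls(CALLS, state="MA", month="2012-07")

# Calls made by ANDROID devices on the SPRINT network (two other dimensions).
android_on_sprint = count_calls(CALLS, device="ANDROID", carrier="SPRINT")
```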
The dimensions of data may be represented in a data structure called a “cube,” which may have any number of dimensions (i.e., while referred to as a “cube,” the OLAP cube is not limited to a physical cube's three dimensions). The OLAP cube is, in one sense, another way of storing data, and thus may be maintained (albeit using a specialized layer implementing the cube's functionality) in a traditional database. In many cases, queries to the cube request only the number of unique values (or “cardinality”) of a given dimension (e.g., the number of calls made in a certain area) rather than information specific to each value (e.g., the names of the people who made those cellular phone calls). In these cases, the cube may store only a representation of the cardinality (a “measure”) instead of the full dimension (e.g., the number of cellular calls instead of the entire data record for each cellular call). Data may be first stored as a dimension and then later condensed into a measure; the act of condensing a dimension into a measure is known as “folding,” and the data structure used to hold the information in a measure is called a “set.”
In many OLAP applications, additional data may arrive in real time (i.e., new items of data may arrive every minute, second, millisecond, or at any other timeframe), and the new data may be added to the existing data already in the cube. For example, new cellular-phone usage data may arrive from a cellular carrier, representing recent cellular-phone activity, and this new data may be added to the appropriate sets in the cube. The cardinality of a given set may thus vary greatly, from a few to a few million, depending on the type of dimension, the amount of input data, the length of time of data collection, and/or many other factors. At small cardinalities, the data may be stored straightforwardly and uncompressed in the set; at larger cardinalities, a straightforward storage of the data takes up too much memory space, and the data is therefore stored in a compressed form (at the expense of the loss of information beyond the mere cardinality of the data and/or the loss of some accuracy in estimating the exact cardinality of the data). The set that holds the values has to accommodate this wide range of sizes while also supporting other functions, such as adding a new element as new data arrives (“set insertion”), determining (or accurately estimating, in the case of large cardinalities) the number of unique items in a set (“querying set cardinality”), and merging sets (“performing a set union”) in response to a query. These and other operations must be performed in an amount of time that does not grow as cardinality grows, in order to keep up with the real-time input data. Finally, because OLAP cubes are typically implemented on top of traditional databases, the sets themselves must have a small memory footprint (e.g., approximately four kilobytes, which is often the maximum size of a field in a typical database). By “memory footprint” is meant the amount of main memory actually occupied by, or allocated to, the OLAP cube data.
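The three set operations named above (set insertion, querying set cardinality, and performing a set union) can be sketched for the simplest case, an uncompressed set that stores its elements directly. The class name `UncompressedSet` and the default capacity are assumptions made for illustration; the capacity stands in for whatever number of elements fits in the available field.

```python
class UncompressedSet:
    """Small exact set backed by an unsorted list with a fixed maximum size."""

    def __init__(self, capacity=1000):
        self.capacity = capacity  # hypothetical limit imposed by the storage field
        self.items = []

    def insert(self, item):
        """Set insertion. Returns False when the set is full and cannot accept
        a new element, which is the point at which a compressed form is needed."""
        if item in self.items:
            return True
        if len(self.items) >= self.capacity:
            return False
        self.items.append(item)
        return True

    def cardinality(self):
        """Querying set cardinality: exact, because every element is stored."""
        return len(self.items)

    def union(self, other):
        """Performing a set union. The result is given enough room for both
        inputs; this is a simplification, since a real set would be bounded."""
        merged = UncompressedSet(self.capacity + other.capacity)
        for item in self.items + other.items:
            merged.insert(item)
        return merged
```

Because the list is bounded by a fixed capacity, the cost of scanning it on each insertion is bounded as well; beyond that capacity, this representation no longer applies.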
The data in a set may be stored in a data structure in many different ways; some representations may be more efficient, accurate, and/or faster at small cardinalities, while other representations may be more efficient, accurate, and/or faster at medium or large cardinalities. For example, an uncompressed, fixed-size unsorted list (or, alternatively, a set using D-left hashing) may be preferable at low cardinalities (e.g., less than 1,000 elements); sets of this size do not consume an appreciable amount of memory space, and the uncompressed data may be useful if, for example, further information is required from each set member. At higher set cardinalities (e.g., greater than 1,000 elements), the uncompressed representation consumes too much memory (i.e., more than the four kilobytes available in a typical database) and, at this number of elements, the cardinality of the set may be more useful or interesting than the details of a particular member of the set. Thus, at this mid-range set cardinality (e.g., between 1,000 and 20,000 elements), a linear counter may be preferable; this data structure compresses the data elements (such that details beyond the cardinality of the set are lost) and provides only an estimate of the cardinality, but occupies a small, fixed memory space (independent of the cardinality of the set). Linear counters, however, become inaccurate at very high cardinalities (e.g., more than 20,000 elements); at these cardinalities, the set may be better represented by a probabilistic counter with stochastic averaging (“PCSA”) or a log-log counter. These data structures, while inaccurate at low cardinalities, provide accurate estimates of the cardinality of large sets.
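A minimal sketch of the two compressed representations may make the trade-off concrete. It follows the textbook formulations of linear counting and of PCSA rather than any particular system; the class names, the helper `hash64`, the bitmap sizes, and the choice of BLAKE2b as the hash are all assumptions made for illustration.

```python
import hashlib
import math


def hash64(item):
    """Deterministic 64-bit hash of an item's string form (illustrative choice)."""
    digest = hashlib.blake2b(str(item).encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


class LinearCounter:
    """Fixed-size bitmap; cardinality is estimated from the fraction of zero bits."""

    def __init__(self, num_bits=32768):  # 32768 bits = four kilobytes
        self.num_bits = num_bits
        self.bitmap = 0  # a Python int used as a bit array

    def insert(self, item):
        self.bitmap |= 1 << (hash64(item) % self.num_bits)

    def cardinality(self):
        zero_bits = self.num_bits - bin(self.bitmap).count("1")
        if zero_bits == 0:
            return math.inf  # bitmap saturated: no usable estimate remains
        return self.num_bits * math.log(self.num_bits / zero_bits)


class PCSA:
    """Probabilistic counting with stochastic averaging over several small bitmaps."""

    PHI = 0.77351  # correction constant from the probabilistic counting literature

    def __init__(self, num_maps=256):
        self.num_maps = num_maps
        self.maps = [0] * num_maps

    def insert(self, item):
        value = hash64(item)
        index = value % self.num_maps   # selects which bitmap to update
        rest = value // self.num_maps   # remaining bits choose the bit to set
        # Position of the least significant 1 bit; 63 if no bit is set at all.
        rho = (rest & -rest).bit_length() - 1 if rest else 63
        self.maps[index] |= 1 << rho

    def cardinality(self):
        total = 0
        for bitmap in self.maps:
            lowest_zero = 0
            while bitmap & (1 << lowest_zero):
                lowest_zero += 1
            total += lowest_zero
        return self.num_maps / self.PHI * 2 ** (total / self.num_maps)
```

Neither structure retains the inserted elements, and inserting the same element twice leaves both unchanged. An empty `PCSA` in this sketch already reports an estimate in the hundreds, which is one way of seeing why it is unsuitable at low cardinalities, while a `LinearCounter` degrades as its bitmap fills.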
Thus, while none of the aforementioned data structures is optimal for all cardinalities of a set, each one is optimal for a given range of cardinalities. Existing systems describe the use of different data structures at different levels of cardinality; an uncompressed list at low cardinalities, for example, transitioning to a linear counter and/or PCSA at higher cardinalities. Data is first stored in a first data structure and, once the cardinality of the set increases past a threshold, the data is converted to a different data structure. Thus, the set is always represented optimally by one of a plurality of data structures.
As mentioned above, however, the conversion of an uncompressed list to either a linear counter or PCSA represents a loss of information. That is, while the uncompressed list stores actual data elements, the linear counter and PCSA store only enough information to estimate the number of data elements. Conversion from the uncompressed list to either the linear counter or PCSA is trivial; it simply requires a number of set insertions equal to the number of elements stored in the uncompressed list. Without the availability of the uncompressed list, however, existing systems provide no method of converting the data to the linear counter or PCSA; once the original data is abandoned, no further conversion is possible.
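The threshold-triggered transition and the insertion-by-insertion conversion described above can be sketched as follows. This is an illustration under stated assumptions, not the method of any existing system: the class name `TransitioningSet`, the threshold of 1,000 elements, the four-kilobyte bitmap, and the hash function are all hypothetical.

```python
import hashlib
import math

NUM_BITS = 32768   # hypothetical linear-counter size: four kilobytes
THRESHOLD = 1000   # hypothetical cardinality at which the list is abandoned


def bit_position(item):
    """Map an item to one bit of the linear counter (illustrative hash choice)."""
    digest = hashlib.blake2b(str(item).encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big") % NUM_BITS


class TransitioningSet:
    """Starts as an uncompressed list, then converts once to a linear counter."""

    def __init__(self):
        self.items = []     # exact representation while the set is small
        self.bitmap = None  # linear counter, created at conversion time

    def insert(self, item):
        if self.bitmap is not None:
            self.bitmap |= 1 << bit_position(item)
            return
        if item not in self.items:
            self.items.append(item)
        if len(self.items) > THRESHOLD:
            self._convert()

    def _convert(self):
        """One set insertion into the counter per element stored in the list."""
        self.bitmap = 0
        for item in self.items:
            self.bitmap |= 1 << bit_position(item)
        self.items = None  # the original elements are discarded for good

    def cardinality(self):
        if self.bitmap is None:
            return len(self.items)  # exact count
        zero_bits = NUM_BITS - bin(self.bitmap).count("1")
        if zero_bits == 0:
            return math.inf
        return NUM_BITS * math.log(NUM_BITS / zero_bits)  # estimate only
```

After `_convert` runs, only the bitmap remains, so the individual elements cannot be replayed into any other structure; this is the loss of information the paragraph above refers to.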
As one solution, existing systems maintain both a linear counter and PCSA in parallel. The uncompressed set is used until it becomes untenable, at which point its elements are inserted into both the linear counter and PCSA (which is, as noted above, a trivial conversion). Both of these compressed sets are maintained as the cardinality increases, though for medium cardinalities, only the linear counter is trusted (i.e., because the PCSA provides accurate cardinality estimates only at very high cardinalities). Once a threshold cardinality is crossed, the PCSA is used instead of the linear counter. The parallel maintenance of the PCSA is thus wasted overhead unless and until the cardinality of the set increases to the point at which the PCSA becomes preferred over the linear counter. This dual maintenance of the two data structures consumes an unacceptable amount of memory space (given, in particular, the four-kilobyte limit of a field in a typical database). Further, because the linear counter's upper limit of cardinality depends upon the amount of space available to store the counter, and because that space is lessened by the need to also maintain the PCSA, the effectiveness of the linear counter is further reduced. A need therefore exists for a way to transition between the uncompressed, linear counter, and PCSA representations (or their equivalents) without wasting the memory space required to maintain two representations simultaneously (e.g., both the linear counter and PCSA).
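The cost of sharing a single field between the two structures can be put in rough numbers. The sketch below assumes, purely for illustration, that the four-kilobyte field is split evenly between the linear counter and the PCSA; the error expression is the standard-error approximation commonly quoted for linear counting, and the function names are hypothetical.

```python
import math

FIELD_BYTES = 4096              # field size cited in the text
FULL_BITS = FIELD_BYTES * 8     # linear counter alone in the field
SHARED_BITS = FULL_BITS // 2    # assumed even split with a PCSA


def expected_zero_fraction(num_bits, cardinality):
    """Expected fraction of bits still zero after `cardinality` distinct inserts."""
    return math.exp(-cardinality / num_bits)


def relative_std_error(num_bits, cardinality):
    """Approximate relative standard error of the linear-counting estimate."""
    load = cardinality / num_bits
    return math.sqrt(num_bits * (math.exp(load) - load - 1)) / cardinality


for label, bits in (("counter alone", FULL_BITS), ("shared with PCSA", SHARED_BITS)):
    zero = expected_zero_fraction(bits, 20000)
    error = relative_std_error(bits, 20000)
    print(f"{label}: {bits} bits, zero fraction {zero:.3f}, relative error {error:.4f}")
```

Under these assumptions, halving the bitmap leaves fewer zero bits at the same cardinality and raises the estimation error, which is the reduced effectiveness of the linear counter noted above.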