1. Field of the Invention
The invention generally relates to arrangements for managing and summarizing data in a time-efficient manner so as to represent the data using less storage space in data storage devices. More particularly, the invention relates to arrangements for managing and summarizing data by using an intermediate summary structure to ultimately form a hierarchical histogram that is nearly optimal for multidimensional data, even multidimensional data that is subject to changes.
2. Related Art
To conserve memory space in data storage devices, especially in large database scenarios, and for visualizing data and computing approximately with data, it is desirable to represent data by summarizing it and placing summary data in a summary data structure that occupies a substantially smaller amount of memory than the original data. Symbolically, data may be an array A of numbers that is indexed by two or more integer keys. In a two-dimensional case, the (i,j)th datum is denoted A[i,j]. A histogram is another array H with indices that match A's, such that H[i,j] is constant on rectangles of (i,j)'s. A goal is to find a histogram that minimizes the sum, over all i and j, of the square of |A[i,j]-H[i,j]|.
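The error measure described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each bucket is given as a tuple (row_lo, row_hi, col_lo, col_hi, height) with inclusive bounds and that the buckets tile the array; the names render_histogram and sum_squared_error are hypothetical and are not taken from the invention.

```python
from typing import List, Tuple

# Hypothetical bucket layout (an assumption for this sketch):
# (row_lo, row_hi, col_lo, col_hi, height), with inclusive bounds.
Bucket = Tuple[int, int, int, int, float]


def render_histogram(buckets: List[Bucket], n_rows: int, n_cols: int) -> List[List[float]]:
    """Expand a list of rectangular buckets into the array H."""
    H = [[0.0] * n_cols for _ in range(n_rows)]
    for row_lo, row_hi, col_lo, col_hi, height in buckets:
        for i in range(row_lo, row_hi + 1):
            for j in range(col_lo, col_hi + 1):
                H[i][j] = height  # H is constant on each rectangle
    return H


def sum_squared_error(A: List[List[float]], H: List[List[float]]) -> float:
    """Sum over all (i, j) of the square of |A[i,j] - H[i,j]|."""
    return sum(
        (a - h) ** 2
        for row_a, row_h in zip(A, H)
        for a, h in zip(row_a, row_h)
    )


# A small 2 x 4 example array summarized by two buckets.
A = [
    [1.0, 1.0, 5.0, 5.0],
    [1.0, 3.0, 5.0, 7.0],
]
buckets = [
    (0, 1, 0, 1, 1.5),  # left 2 x 2 block, height = mean of its entries
    (0, 1, 2, 3, 5.5),  # right 2 x 2 block, height = mean of its entries
]
H = render_histogram(buckets, n_rows=2, n_cols=4)
error = sum_squared_error(A, H)
print(error)  # 6.0
```

In this example each bucket height is set to the mean of the entries it covers, which is the height that minimizes the squared error for a fixed rectangle; choosing the rectangles themselves is the harder part of the problem.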
As used herein, there are several notions of efficiency, including space efficiency, time efficiency, and communication efficiency. (The following discussion does not constitute an admission that the discussed concepts constitute “prior art.”)
Concerning space efficiency, a B-bucket histogram is a space-efficient representation because it requires about 5B numbers to store the boundaries and heights of each bucket. (There are more efficient ways to store histograms that are hierarchical.) Also, there is space efficiency of a histogram sketch (in a dynamic data scenario, in which data is subject to change) and of the method's workspace (in a static data scenario, in which the data does not change). Typically, the size of a histogram sketch is somewhat bigger than 5B numbers, but much smaller than the N² numbers needed to store the entire dataset.
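As a rough illustration of this storage comparison, the following sketch counts stored numbers under two assumptions that are not stated in the text: each two-dimensional bucket is recorded as four boundary values plus one height, and the dataset is an N-by-N array. The function names and the example sizes are hypothetical.

```python
def histogram_numbers(num_buckets: int) -> int:
    """Numbers stored for a B-bucket histogram, assuming four
    boundaries and one height per bucket (5 numbers each)."""
    return 5 * num_buckets


def dataset_numbers(n: int) -> int:
    """Numbers stored for the full dataset, assuming an n-by-n array."""
    return n * n


# Hypothetical sizes chosen only to show the scale of the difference.
B, N = 50, 1000
hist = histogram_numbers(B)
full = dataset_numbers(N)
print(hist, full)  # 250 1000000
```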
Time efficiency relates to the performance of various operations, including top-level operations such as updating sketches and constructing histograms, as well as their constituent operations.
Finally, the size of structures such as sketches is related to communication efficiency. Larger structures consume greater amounts of communication bandwidth, and, accordingly, it would be desirable to use smaller data structures if communication thereof is needed, provided the data structures do not unduly sacrifice accuracy of the data they represent.
Various known arrangements may be considered efficient in one or another of these respects. However, conventional arrangements have not been simultaneously efficient in space, time, and communication, especially for multidimensional data.
Of course, an overriding concern is that the transformation of the data to the summary data structure retain as much of the original data's meaning as possible, so that the summary structure accurately represents the original data. That is, there should be quality guarantees (guarantees of how accurately the summary data represents the original data). Concurrently, it is desirable that there be useful guarantees about the time, space, and bandwidth used, especially for multidimensional data.