The present invention is directed to a method and a system for managing collections of data. More specifically, the present invention is directed to a method and a system for managing a hierarchy of subsets of data.
There are many environments in which it is desirable to monitor system operations and/or collect sets of data over certain time periods or in connection with the occurrence of certain events. These sets of data can be considered to be samples of data for a given time interval or with regard to the occurrence of some event or state transition. One environment in which this periodic sampling is done is the communications network arena. For example, it may be desirable to collect netflow data from routers in a wide area network (WAN) or local area network (LAN). In this arrangement, the netflow information can be gathered by dedicated servers referred to as “collectors”. It is known that it may be appropriate to take samples of the collected data rather than to store all of the raw data in a database. A sample may be made up of the relevant data that corresponds to a predetermined time interval or to the occurrence of a particular event. The time interval or event occurrence selected defines a sampling “granularity”. One such data sampling technique is referred to as smart sampling. An example of an algorithm for smart sampling is:
smart sampling algorithm
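// data is passed by pointer so that the sampling factor set here is visible to the caller
int smartSample (DataType *data, int z) {
    static int count = 0;   // running total of x carried across calls
    if (data->x > z) {
        data->samplingFactor = 1.0;   // larger than the threshold z: always sample
    } else {
        count += data->x;
        if (count < z) {
            return 0; //drop
        } else {
            data->samplingFactor = ((double)z) / data->x;
            count = count % z;   // carry the remainder forward
        }
    }
    return 1; //sample
}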
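By way of illustration only, the routine above might be exercised as in the following sketch. The names Record, SamplerState, smart_sample and sample_all are hypothetical and do not appear in the original listing. The running counter is held in a caller-supplied structure rather than a static variable, which is an assumption made here so that each sampling pass can start from a clean state.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record type: x is the measured size (for example, bytes in a flow). */
typedef struct {
    int x;
    double samplingFactor;
} Record;

/* Running counter, kept outside the function so it can be reset (assumption). */
typedef struct {
    int count;
} SamplerState;

/* Returns 1 if the record is sampled, 0 if it is dropped. */
int smart_sample(SamplerState *st, Record *rec, int z)
{
    assert(z > 0);
    if (rec->x > z) {
        rec->samplingFactor = 1.0;      /* above the threshold: always kept */
        return 1;
    }
    st->count += rec->x;
    if (st->count < z)
        return 0;                       /* not enough accumulated yet: drop */
    rec->samplingFactor = (double)z / rec->x;
    st->count %= z;                     /* carry the remainder forward */
    return 1;
}

/* Samples recs in place: kept records are compacted to the front of the
   array and the number kept is returned. */
size_t sample_all(Record *recs, size_t n, int z)
{
    SamplerState st = {0};
    size_t kept = 0;
    for (size_t i = 0; i < n; i++) {
        if (smart_sample(&st, &recs[i], z))
            recs[kept++] = recs[i];
    }
    return kept;
}
```

With a threshold z of 100 and records whose x values are 150, 40, 30, 50 and 10, this sketch keeps the first record with a sampling factor of 1.0 and the fourth record with a sampling factor of 2.0, and drops the rest.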
For ease of description, the remainder of this example will focus on a sampling algorithm which samples data over a given time interval, such as every five minutes. One of skill in the art will recognize, though, that the duration of the time interval is variable, as is the decision to use time intervals to define sampling intervals.
Once the raw data is sampled, it can be ingested into a database. The initial sampling interval is taken to be the initial, and smallest, sampling granularity. In this example, the size of the granularity, that is, the sampling interval, can be set by the data collector.
In the desired working environment it may be helpful to look at samples of data over larger granularities or time intervals. For example, it may be desirable to know what the samples of data are for a one hour period, or a one day period, rather than the five minute interval of the smallest granularity. Using a composable sampling algorithm, that is, an algorithm that can successively sample, with increasing granularity, the resulting set from each previous round of sampling, a system can derive data for a larger sampling granularity from the set of data collected at the smaller granularity. The derived data set would be equivalent to a data set that could have been collected if the larger granularity had been used at the collection stage.
In the example given above each sample set for each five minute interval could be considered a separate bin of data. To derive data for a one hour time interval the sampling algorithm would be run over twelve “bins” of data corresponding to the smallest granular level. The derived data would be equivalent to data that would have been collected if the original granularity or time interval had been set for one hour. This derived data set is smaller than the data set in the twelve bins from which it was derived, but there is a corresponding loss of detail.
The derived data set for hour long intervals could be sampled again to create a data set for a higher level of granularity, for example a day. Thus 24 “one hour” bins of data would be sampled to create another data set, even further reduced. This set would be equivalent to the data that would have been collected if the original granularity had been selected to be a 24 hour interval rather than the original 5 minute interval.
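The relationship between the bins at the different granularities in this example can be sketched as simple integer arithmetic. The sketch below assumes that timestamps are expressed in seconds from an arbitrary origin and that each coarser interval is an exact multiple of the finer one. The function names are hypothetical and are introduced only for illustration.

```c
#include <assert.h>

/* Bin widths, in seconds, for the granularities used in the example. */
enum { FIVE_MIN = 300, ONE_HOUR = 3600, ONE_DAY = 86400 };

/* Index of the bin that contains timestamp t at the given bin width. */
long bin_index(long t, long width)
{
    assert(t >= 0 && width > 0);
    return t / width;
}

/* Number of finer bins that are sampled together to derive one coarser bin. */
long bins_per_coarse(long fine, long coarse)
{
    assert(fine > 0 && coarse % fine == 0);
    return coarse / fine;
}

/* Index of the first finer bin that falls inside coarse bin c. */
long first_fine_bin(long c, long fine, long coarse)
{
    return c * bins_per_coarse(fine, coarse);
}
```

Under these assumptions, bins_per_coarse(FIVE_MIN, ONE_HOUR) yields the twelve bins mentioned above, and bins_per_coarse(ONE_HOUR, ONE_DAY) yields twenty-four.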
One problem that arises in this repeated smart sampling of the data is the problem of making sure that the sampled data are appropriately associated with the respective defined levels of granularity.
A couple of solutions have been proposed to this problem, but they each have drawbacks.
One solution involves replicating, within the database, the data that corresponds to each of the granularity levels. In this arrangement any data record that appears in each granularity level actually appears multiple times in the database, each instantiation having associated with it a key or code or identifier that indicates the particular granularity level that instantiation is associated with. While this solution arguably simplifies the process of sorting through the database for records for each granularity level, the replication and duplication increases the storage requirements of the database arrangement.
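The storage cost of this first approach can be illustrated with a short sketch. It assumes that the number of records in each granularity level is known and that the first entry is the finest level, which holds every record. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Rows stored when each granularity level keeps its own copy of its records. */
size_t replicated_rows(const size_t *level_sizes, size_t levels)
{
    size_t total = 0;
    for (size_t i = 0; i < levels; i++)
        total += level_sizes[i];
    return total;
}

/* Extra rows relative to storing each record once. level_sizes[0] is
   assumed to be the finest level, which contains every record. */
size_t duplicate_rows(const size_t *level_sizes, size_t levels)
{
    assert(levels > 0);
    return replicated_rows(level_sizes, levels) - level_sizes[0];
}
```

For purely illustrative level sizes of 1000, 400, 100 and 20 records, the replicated arrangement would store 1520 rows, 520 of which are duplicates.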
In a second proposed solution the data records are not replicated. Instead, each data record receives a separate identifier or key in connection with each granularity that is introduced into the system. As an example, consider bins of 5 minute time intervals sampled and re-sampled so as to create granularities of 1 hour, 24 hours, and seven days. Thus three additional levels of granularity will have been introduced. All of the data records get examined when one conducts a search or query at the smallest or finest level of granularity; a first subset of data records, something less than all of the data records, is in the next level of granularity, the one-hour bins; a second subset, something less than the data records of the first subset, is in the third granular level, and so on. In the second proposed solution a flag for each granularity level is associated with each data record. If “0” indicates that the record is not contained at a particular granularity level and “1” indicates that it is, then a data record with a key of 0011 is in the five minute interval set and the one hour subset, but not the one day or one week subsets (the flags in this example are arranged with the smallest granularity on the right and increasing granularity going from right to left; alternative arrangements for the flags may be possible). This arrangement eliminates the need to replicate the database. However, this arrangement requires that a new key or identifier or code be added to every data record every time a new level of granularity is created. That is, a new flag must be added to each data record with each sampling of the data so as to accurately and completely reflect those granularity levels with which the data records are associated.
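The flag arrangement of this second approach can be sketched as a bit mask, with the finest granularity in the least significant bit as in the 0011 example. The level names and function names below are hypothetical, and holding the flags in an unsigned integer is an assumption made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical level numbering: bit 0 is the finest granularity. */
enum { LEVEL_5MIN = 0, LEVEL_HOUR = 1, LEVEL_DAY = 2, LEVEL_WEEK = 3, LEVEL_COUNT = 4 };

/* Returns 1 if the record's flags mark it as present at the given level. */
int in_level(unsigned flags, int level)
{
    assert(level >= 0 && level < LEVEL_COUNT);
    return (int)((flags >> level) & 1u);
}

/* Returns the flags with the given level marked as present. */
unsigned add_level(unsigned flags, int level)
{
    assert(level >= 0 && level < LEVEL_COUNT);
    return flags | (1u << level);
}

/* Counts the records present at a level. Every record's flags must be
   inspected, and every record needs another flag when a level is added. */
size_t count_in_level(const unsigned *flags, size_t n, int level)
{
    size_t hits = 0;
    for (size_t i = 0; i < n; i++)
        hits += (size_t)in_level(flags[i], level);
    return hits;
}
```

A record whose flags are 0x3 (binary 0011) tests as present at LEVEL_5MIN and LEVEL_HOUR and absent at LEVEL_DAY and LEVEL_WEEK, matching the example in the text.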
It is desirable to have a data records management arrangement that avoids the need for duplication of records while also avoiding the need to introduce multiple keys or flags or identifiers for each data record.