This invention relates generally to analyzing the data population of a dataset to determine information that characterizes the data population, and more particularly to determining the data values at predetermined percentiles of a data population that is distributed across multiple nodes of a distributed parallel database.
It is frequently desirable to characterize the data in a data population in order to better understand the nature of the data. Important characteristics of the data population include the data values which occur at certain percentile levels. For example, determining data values at the median (50th percentile), the 90th percentile, or the 99th percentile levels is important, especially for financial data where such values are needed to satisfy legal reporting and regulatory requirements, because percentile values afford insight into the underlying data and permit the data to be summarized meaningfully. Percentile values are determined using inverse distribution functions, which differ from other types of mathematical calculations that characterize a data distribution in that they produce the actual data values in the data distribution at the desired percentiles. The median of a data distribution, for instance, differs from the average because it is the real value of the middle data element in the distribution. Moreover, it is unaffected by an outlying value that could significantly skew the average value.
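By way of illustration only, the difference between the median and the average may be sketched as follows. The sketch assumes a small in-memory list of hypothetical account balances and a hypothetical helper, percentile_disc, that uses a nearest-rank definition of a discrete percentile; neither the data nor the helper is drawn from any particular database system.

```python
import math
import statistics


def percentile_disc(values, p):
    """Return an actual element of `values` at percentile fraction p (0 < p <= 1).

    Hypothetical helper using the nearest-rank definition: the data must
    first be placed in sorted order, then the element at rank ceil(p * n)
    is returned.
    """
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered)))  # 1-based rank
    return ordered[rank - 1]


# Hypothetical balances; the last value is a deliberate outlier.
balances = [120, 135, 150, 160, 175, 180, 9_000_000]

mean = statistics.mean(balances)
median = percentile_disc(balances, 0.5)
p90 = percentile_disc(balances, 0.9)

print("mean:", mean)
print("median:", median)
print("90th percentile:", p90)
```

In this sketch the median is 160, a value that actually occurs in the data, whereas the average exceeds one million because of the single outlying value.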
While performing inverse distribution operations to determine the data values at selected percentiles on a small dataset is relatively straightforward, doing so on a large parallel database where the data is distributed across clusters of multiple computers is exceptionally difficult. This is because there is an ordering constraint upon the data population which requires that the data be placed into a particular order before percentile levels can be determined. It is not possible to calculate inverse distribution functions in parallel on separate subsets of the data and combine the answers in a way that yields correct results for the overall dataset. The median of a data distribution, for instance, is not in general equal to the median of the medians of its subsets. Ordering of the data in a large distributed parallel database has not generally been possible in a way that does not revisit the data multiple times or require massive movements of large amounts of data. Accordingly, known approaches to performing inverse distribution function operations on parallel databases are inefficient, costly and difficult.
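The reason that per-subset results cannot simply be combined may be seen in the following minimal sketch. It assumes three hypothetical nodes, each modeled as an in-memory list holding an unequal share of the data; the node names and values are illustrative only.

```python
import statistics

# Hypothetical partitions of one dataset across three nodes of unequal size.
node_a = [1, 2, 3]
node_b = [4, 5, 6]
node_c = [7, 8, 9, 10, 11, 12, 13, 14, 15]
nodes = [node_a, node_b, node_c]

# Naive approach: compute a median on each node independently,
# then take the median of those local medians.
local_medians = [statistics.median(part) for part in nodes]
median_of_medians = statistics.median(local_medians)

# Correct approach for comparison: gather all of the data, order it,
# and take the middle element. This is the step that is costly at scale.
all_values = [v for part in nodes for v in part]
global_median = statistics.median(all_values)

print("local medians:", local_medians)
print("median of medians:", median_of_medians)
print("global median:", global_median)
```

Here the local medians are 2, 5 and 11, so the median of medians is 5, whereas the median of the full population is 8. The correct value in this sketch is obtained only by bringing all of the data together and ordering it, which is precisely the data movement that is burdensome in a large distributed database.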
It is desirable to provide systems and methods which address the foregoing and other known problems of characterizing distributed datasets by enabling inverse distribution operations to determine, efficiently and cost-effectively, the data values at selected percentile levels of a data population that is distributed across multiple nodes of a parallel database system, and it is to these ends that the present invention is directed.