1. Field of the Invention
The invention relates to a method and system for aggregating data distribution models. More specifically, the invention provides a system and method of aggregating or combining data distribution models while correctly maintaining the applicability of statistical and analytical techniques.
2. Description of the Prior Art and Related Information
When a researcher or engineer analyzes a large quantity of data, called a data distribution, some summarization of the data elements in the data distribution is generally necessary because of the limitations of existing hardware and software. Generally, summary measures of the distribution's center and spread, including the mean, sigma (standard deviation), and the number of data elements in the data distribution, are used. Alternatively, sampling may be used to extract a smaller, more manageable subset of data that can be analyzed.
Unfortunately, summarizing the data elements with a mean and sigma assumes that the distribution is Gaussian, or normal, and that it represents a single distribution rather than a mixture of independent distributions. Similarly, sampling may miss important elements of the distribution, for example outliers or bimodal patterns, unless the sample is sufficiently large.
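The limitation of a mean-and-sigma summary can be illustrated with a minimal Python sketch. The two data sets and the `summarize` helper below are hypothetical and chosen only for illustration; they are not taken from the invention. Both data sets yield the same count, mean, and sigma, although one is bimodal and the other has a single central peak with two outliers.

```python
from collections import Counter
from statistics import mean, pstdev


def summarize(data):
    """Return the (count, mean, sigma) summary of a list of values."""
    return len(data), mean(data), pstdev(data)


# Hypothetical data: two equal clusters at -1 and +1 (bimodal).
bimodal = [-1, 1] * 4
# Hypothetical data: a single peak at 0 with one outlier on each side.
peaked = [-2, 0, 0, 0, 0, 0, 0, 2]

summary_bimodal = summarize(bimodal)
summary_peaked = summarize(peaked)

# Counting occurrences of each value exposes the shape that the
# three-number summary cannot distinguish.
shape_bimodal = Counter(bimodal)
shape_peaked = Counter(peaked)

print("bimodal summary:", summary_bimodal)
print("peaked summary: ", summary_peaked)
print("bimodal shape:  ", sorted(shape_bimodal.items()))
print("peaked shape:   ", sorted(shape_peaked.items()))
```

In this sketch the two summaries agree while the value counts differ, which is the kind of shape information a mean-and-sigma summary discards.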
Thus, there is a need for a system for advanced analysis that provides the benefit of maintaining the overall shape and characteristics of the data distribution while keeping the data storage requirements to a minimum. There is a further need for a system that can perform aggregation of small subgroups of the data distribution, thus keeping computation needs to a minimum. There is a further need for such a system with which statistical tests can be properly performed without making assumptions about the data distribution. There is a further need for a system with which complex analytics can be performed, including basic statistical functions, such as mean, minimum, maximum, and standard deviation, as well as complicated correlation and modeling studies. There is a further need for a system that naturally weights the highest data concentrations with the greatest accuracy in the approximation, wherein outliers are de-emphasized but not removed. There is a further need for a system with which an approximation of the original data distribution can be rebuilt from the model and estimates of the errors in this rebuilding can be made.