The present invention generally relates to modeling data. More particularly, the present invention relates to joint modeling of a mean and a dispersion of the data.
The joint modeling of a mean and dispersion refers to estimating the mean and the dispersion concurrently. The estimated mean and dispersion may or may not depend on each other. The (sample) mean is a measure of the central value of the data, and generally refers to an average value among sample data drawn from a population, i.e., a set of data about which statistical inferences are to be drawn. The (sample) dispersion is a measure of the spread of the data about its central value, and is generally measured by one or more of: (1) range, (2) mean absolute deviation, (3) standard deviation, (4) variance and (5) covariance. The range refers to the difference between the largest value and the smallest value among the sample data. If the sample data is composed of x1, x2, . . . , xn, then the mean, m, is (x1+x2+ . . . +xn)/n. The mean absolute deviation is (|x1−m|+|x2−m|+ . . . +|xn−m|)/n. The variance is {(x1−m)^2+(x2−m)^2+ . . . +(xn−m)^2}/n, and the standard deviation is the square root of the variance. When two variables, x and y, are independent of each other, their covariance is 0. Generally, the variance of (x+y), Var(x+y), is Var(x)+Var(y)+2*Cov(x,y), where Cov(x,y) refers to the covariance of x and y. Cov(x,y) is {x1y1+x2y2+ . . . +xnyn}/n−mean(x)*mean(y), where x is {x1,x2, . . . , xn}, y is {y1,y2, . . . , yn}, n is the number of elements in x or y, mean(x) is the mean of x and mean(y) is the mean of y. A covariate refers to a possibly predictive variable that is related to an outcome of a modeling.
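The measures defined above can be illustrated with a minimal sketch in Python. The function names and the sample values are hypothetical and chosen only for illustration; the formulas follow the divide-by-n definitions given in the text.

```python
from math import sqrt

def mean(xs):
    # Average value of the sample data.
    return sum(xs) / len(xs)

def value_range(xs):
    # Difference between the largest and the smallest value.
    return max(xs) - min(xs)

def mean_absolute_deviation(xs):
    m = mean(xs)
    return sum(abs(x - m) for x in xs) / len(xs)

def variance(xs):
    # Divide-by-n form, as in the text.
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def standard_deviation(xs):
    return sqrt(variance(xs))

def covariance(xs, ys):
    # {x1y1 + ... + xnyn}/n - mean(x)*mean(y)
    return sum(a * b for a, b in zip(xs, ys)) / len(xs) - mean(xs) * mean(ys)

# Hypothetical sample data.
x = [2.0, 4.0, 4.0, 6.0]
y = [1.0, 3.0, 2.0, 6.0]
s = [a + b for a, b in zip(x, y)]  # element-wise sum x + y

# Var(x+y) agrees with Var(x) + Var(y) + 2*Cov(x,y) up to rounding.
lhs = variance(s)
rhs = variance(x) + variance(y) + 2 * covariance(x, y)
print(lhs, rhs)
```

For this sample, the variance of the sum and the right-hand side of the identity agree, which is a quick check that the covariance is taken between x and y rather than of their product.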
In regression applications (e.g., linear regression) involving a conditional response distribution from the exponential dispersion family, a primary interest is typically in a regression model for the mean parameter. The corresponding dispersion parameter is either a fixed constant (e.g., unity for the Poisson and Bernoulli distributions), or is an unspecified constant that is estimated from residual deviance values of the mean regression model. The linear model refers to any approach to modeling the relationship between one variable (or possibly more variables) termed the response and denoted by y, and one or more variables termed the covariates and denoted by x, such that the model depends linearly on unknown parameters to be estimated from sample data. The conditional response distribution refers to distributional changes of y across the sample data for fixed values of the covariates x. A sample mean and variance (i.e., a measure of the central value and the spread about the central value) are statistics computed from sample data. The sample mean is an estimate of the mean parameter (μ) of the population from which the sample data are drawn. The residual deviance value refers to a measurement of the deviance contributed by each sample data point. Regression models are used to predict one variable, termed the response, from one or more other variables termed the covariates. Regression models allow a user to make predictions about past, present or future events from sample data about past or present events. Mean regression refers to a regression model for the mean parameter. It is distinct from regression toward the mean, which refers to a tendency of statistical outliers to move toward the mean when sample data is tested over and over again. A regression relation between a variable x and a variable y is one from which the most probable value of y can be predicted for any value of x. Regression refers to a process for determining a line or curve that best represents a general trend of sample data.
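As an illustration of a mean regression whose dispersion is a single unspecified constant estimated from the residuals, the following minimal Python sketch fits a straight line by ordinary least squares. It assumes a Gaussian response with an identity link, for which the deviance is the residual sum of squares; the function names and the data are hypothetical.

```python
def fit_line(xs, ys):
    # Ordinary least squares for y = intercept + slope * x.
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = my - slope * mx
    return intercept, slope

def estimate_dispersion(xs, ys, intercept, slope):
    # Assumption: Gaussian response, so each residual deviance value is a
    # squared residual; the constant dispersion is estimated as the total
    # deviance divided by (n - number of fitted parameters).
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    return sum(r * r for r in residuals) / (len(xs) - 2)

# Hypothetical sample data: covariate xs and response ys.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

intercept, slope = fit_line(xs, ys)
phi = estimate_dispersion(xs, ys, intercept, slope)
print(intercept, slope, phi)
```

In this sketch the dispersion estimate is the same for every observation, which is exactly the constant-dispersion assumption that the joint modeling of mean and dispersion is meant to relax.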
An assumption of constant dispersion implies that the mean and variance of the conditional response distribution have some fixed and unchanging relationship as a function of the covariates, even though in many cases these quantities can plausibly vary quite independently. The sample dispersion is an estimate of a dispersion parameter (φ) of the population from which the sample data are drawn. Although modeling of the dispersion parameter may not be a primary goal of the regression application, one consequence of any variability in the dispersion parameter is that non-constant case weights (i.e., weights for a non-constant dispersion parameter) are required in deviance loss functions used for the mean regression. Therefore, an accurate estimation of dispersion variability will result in tighter confidence bounds for parameter estimates in mean regression, as the data associated with the larger dispersion values will be appropriately down-weighted in terms of their contribution to the deviance loss function. However, the regression models for the mean and dispersion parameters cannot be obtained independently, or even sequentially, since their estimation is intrinsically coupled in likelihood-based formulations used for regression modeling. The deviance loss function refers to a function that not only measures how close dispersion parameters are to their expected values, but also measures how well dispersion parameters correspond with dispersions of residuals. The likelihood-based formulations are functions of parameters of a statistical model that play an important role in statistical inference. Statistical inference or statistical induction comprises the use of statistics and random sampling to make inferences concerning some unknown aspect of a population.
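The effect of non-constant case weights can be illustrated with a minimal Python sketch. It assumes the simplest possible setting, which is not taken from the text: independent Gaussian responses that share one mean, with per-observation dispersion values that are already known. Each case weight is the reciprocal of the dispersion, so observations with larger dispersion contribute less. All names and numbers are hypothetical.

```python
def weighted_deviance(ys, mu, phis):
    # Gaussian deviance with case weights 1/phi_i (assumed setting).
    return sum((y - mu) ** 2 / p for y, p in zip(ys, phis))

def weighted_mean(ys, phis):
    # Minimizer of weighted_deviance over mu.
    weights = [1.0 / p for p in phis]
    return sum(w * y for w, y in zip(weights, ys)) / sum(weights)

def weighted_mean_variance(phis):
    # Variance of the weighted mean when Var(y_i) = phi_i.
    return 1.0 / sum(1.0 / p for p in phis)

def unweighted_mean_variance(phis):
    # Variance of the plain average of the same observations.
    return sum(phis) / len(phis) ** 2

# Hypothetical responses and dispersion values.
ys = [1.0, 3.0, 10.0, 0.0]
phis = [1.0, 1.0, 4.0, 4.0]

mu_hat = weighted_mean(ys, phis)
var_weighted = weighted_mean_variance(phis)
var_unweighted = unweighted_mean_variance(phis)
print(mu_hat, var_weighted, var_unweighted)
```

Under these assumptions the weighted estimate has the smaller variance, which corresponds to the tighter confidence bounds mentioned above. The sketch takes the dispersion values as given; it does not address the coupled estimation of mean and dispersion.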
Traditionally, GLM (Generalized Linear Model), which is widely used for mean regression modeling, has been used for conditional response distributions from the exponential dispersion family. The GLM is a statistical linear model for a suitable transformation of the mean, termed the link transformation. The GLM may be represented as g(Y)=XB+U, where Y is a vector with a series of response measurements, g(•) is the link function that is chosen appropriately for the assumed response distribution, X is a design matrix, B is a vector including parameters to be estimated, and U is a vector including errors and noise. The design matrix refers to a matrix of explanatory variables (ones or zeros, or reals) that represents a specific statistical or experimental model. However, a traditional methodology such as the GLM cannot perform joint modeling of the mean and the dispersion without inventing additional art, particularly in the case when the covariates in the data are complex and must be simplified and grouped in a preprocessing step that considerably detracts from the quality of the resulting model.
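The roles of the design matrix X, the parameter vector B and the link function g can be sketched in Python as follows. This is only an illustration under stated assumptions: a log link (as is common for a Poisson response), the link applied to the mean so that g(mean)=XB, and hypothetical column choices and parameter values. No fitting is performed.

```python
from math import exp, log

def design_row(group, dose):
    # Hypothetical columns: intercept (one), indicator for group "b"
    # (one or zero), and a real-valued covariate.
    return [1.0, 1.0 if group == "b" else 0.0, dose]

def linear_predictor(X, B):
    # Computes XB, one value per row of the design matrix.
    return [sum(x * b for x, b in zip(row, B)) for row in X]

def inverse_log_link(eta):
    # For g = log, the mean is recovered as exp(XB).
    return [exp(e) for e in eta]

# Hypothetical design matrix and parameter vector.
X = [design_row("a", 0.0), design_row("b", 1.0), design_row("b", 2.0)]
B = [0.5, 0.2, 0.1]

eta = linear_predictor(X, B)
mu = inverse_log_link(eta)
print(eta, mu)
```

Here only the mean is modeled through XB; the dispersion does not appear at all, which is the limitation of the traditional formulation noted above.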
Therefore, it is highly desirable to provide a system and method to perform joint modeling of a mean and dispersion suitable for a wide variety of data sets without requiring any preprocessing and grouping of the sample data.