Measured data, whether from detected images, received signals, polls, searches, or any other data collection method, are likely to be subject to noise or uncertainty. Methods are therefore needed to extract as much information as possible from the data without over-interpreting them.
Statistical estimation is used when information needs to be extracted from noisy data, i.e., data with unknown random components that change from one realization of the data to another and can only be characterized statistically. The goal of statistical estimation is to model the signal in the data, i.e., the reproducible components that do not change from one data realization to the next. The signal is modeled as a function of parameters, whose values are determined by a fit of the model to the data.
This process of modeling noisy data finds broad application, including, without limitation, tracking objects, signal processing, imaging (including medical imaging such as CT, SPECT, PET, and X-ray), market research, supply-chain management, inventory control, and financial markets. The information extracted from historical data is often used to predict future behavior and/or the risk associated with it. As long as the underlying model is correct, the quality of the prediction and/or risk analysis is determined by the accuracy of the estimation.
It should be pointed out that statistical estimation is related to the broader field of optimization, which also includes non-statistical methods in which noisy data are not involved and optimization is simply a matter of finding the best parameters.
Statistical estimation dates back two centuries to the method of least squares (Gauss 1809), which later evolved into the maximum-likelihood (ML) method (Fisher 1912, 1922). Given the probability distribution function (pdf) of statistically independent data, the ML method maximizes the conditional probability of the data, given the model, or, equivalently, minimizes the log-likelihood function (LLF)
\[
L \;=\; 2 \int dx \left[\, f(x,\theta) \;-\; \ln\bigl(f(x,\theta)\bigr) \sum_i \delta(x - x_i) \right] \qquad \text{(LLF)} \tag{1}
\]
Here, x is a point in an n-dimensional space, any of whose dimensions may be continuous, discrete, or even categorical; θ are the model parameters; x_i are the positions of the observations; f(x,θ) is the pdf; and δ is the n-dimensional Dirac (1958) delta function. The integral sign is understood to designate an integral over the continuous dimensions and a sum over the discrete and categorical dimensions, although the integral may be approximated by a sum in practice.
In many applications the integral normalization of the pdf is fixed, typically to unity. In that case the first term on the right-hand side of Eq. (1) is constant and may be omitted, yielding
\[
L \;=\; -2 \sum_i \ln\bigl(f(x_i,\theta)\bigr) \qquad \text{(LLF with fixed integral pdf)} \tag{2}
\]
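As a concrete sketch of Eq. (2), the fixed-normalization LLF can be evaluated for a simple Gaussian model and minimized by brute force; for a Gaussian with known width, the minimum falls at the sample mean. The Gaussian model, the grid search, and all numerical values below are illustrative assumptions, not part of the original text:

```python
import math
import random

def gaussian_pdf(x, mu, sigma):
    """Normal pdf f(x; mu, sigma) with fixed (unit) integral normalization."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def llf(data, mu, sigma):
    """Eq. (2): L = -2 * sum_i ln f(x_i; theta)."""
    return -2.0 * sum(math.log(gaussian_pdf(x, mu, sigma)) for x in data)

random.seed(0)
data = [random.gauss(5.0, 2.0) for _ in range(1000)]

# Minimize L over a grid of candidate means (sigma held fixed for simplicity).
candidates = [4.0 + 0.01 * k for k in range(200)]          # mu in [4.0, 6.0)
best_mu = min(candidates, key=lambda mu: llf(data, mu, 2.0))

sample_mean = sum(data) / len(data)
print(best_mu, sample_mean)   # the LLF minimum sits at (about) the sample mean
```

In practice one would use a proper optimizer rather than a grid, but the grid makes the "minimize the LLF" step explicit.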
The extra term is included in Eq. (1) to allow for the case in which the normalization of the observations is itself of interest, e.g., the rate of events observed by a detector or the rate of sale of a product. In that case, Eq. (1) is the unbinned LLF of the Poisson (1837) distribution.
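A minimal sketch of that case: assuming a constant-rate model f(x,θ) = λ on an observation window [0, T], the unbinned Poisson LLF of Eq. (1) reduces to L = 2(λT − N ln λ), whose minimum is the familiar rate estimate λ = N/T. The window, event times, and scan grid are illustrative assumptions:

```python
import math

# Toy event times over a window [0, T]; the model pdf is a constant rate lambda.
T = 10.0
events = [0.3, 1.1, 2.7, 4.0, 5.5, 6.2, 8.8, 9.4]   # N = 8 observed events

def llf_rate(lam):
    """Eq. (1) for f(x) = lam on [0, T]: L = 2*(lam*T - N*ln(lam))."""
    return 2.0 * (lam * T - len(events) * math.log(lam))

# Scan candidate rates; the minimum should fall at lambda = N / T.
grid = [0.01 * k for k in range(1, 300)]
lam_hat = min(grid, key=llf_rate)
print(lam_hat)   # about 0.8 = N / T
```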
The ML method has three distinct advantages:
    1. It directly estimates the probability distribution function (pdf) without requiring the data to be binned.
    2. In the asymptotic limit, in which the number of data points greatly exceeds the number of parameters, the variances of the parameters estimated by the ML method are smaller than or equal to those of competing statistics. In this asymptotic limit, the variance of an estimated parameter is inversely proportional to the number of data points used to estimate it. For a given accuracy, therefore, the ML method allows a parameter to be estimated from a smaller sample than other methods, hence its higher sampling efficiency. The efficiency of an alternative estimator is, in fact, defined as the ratio between the ML variance and the variance of the other estimator. This sets the efficiency of ML estimators to unity, by definition, and those of competitors to fractions smaller than or equal to unity.
    3. The covariance matrix of the uncertainties in the estimated parameters is readily computed in the asymptotic limit from the information matrix (Fisher 1922), the Hessian matrix of second-order partial derivatives of the LLF at its minimum.
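The third advantage can be sketched numerically: for a Gaussian model, the curvature (the 1x1 Hessian) of the LLF of Eq. (2) at its minimum yields the asymptotic variance of the estimated mean, σ²/N. Note the factor-of-2 convention: with L = −2 Σ ln f, the variance is 2/L''. The model and all numbers are illustrative assumptions:

```python
import math
import random

def llf(data, mu, sigma):
    """Eq. (2): L = -2 * sum_i ln f(x_i; mu, sigma) for a Gaussian pdf."""
    norm = math.log(sigma * math.sqrt(2 * math.pi))
    return -2.0 * sum(-0.5 * ((x - mu) / sigma) ** 2 - norm for x in data)

random.seed(1)
sigma = 2.0
data = [random.gauss(0.0, sigma) for _ in range(400)]
mu_hat = sum(data) / len(data)          # ML estimate of the Gaussian mean

# Information matrix (here 1x1): second derivative of the LLF at its minimum,
# estimated by a central finite difference.
h = 1e-3
d2L = (llf(data, mu_hat + h, sigma) - 2 * llf(data, mu_hat, sigma)
       + llf(data, mu_hat - h, sigma)) / h**2

# With the factor-of-2 LLF convention, var(mu_hat) = 2 / L''.
var_numeric = 2.0 / d2L
var_analytic = sigma**2 / len(data)     # asymptotic result: sigma^2 / N
print(var_numeric, var_analytic)
```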
In the non-asymptotic regime, when the number of parameters is comparable to, or larger than, the number of data points, it is necessary to constrain the solution to prevent the model from treating random statistical noise as reproducible signal. (See, e.g., the review by Puetter, Gosnell & Yahil 2005.) A common practice is to represent the signal as a generic linear “nonparametric” combination of basis functions, whose coefficients are to be estimated. (There may be additional nonlinear parameters characterizing the basis functions.) The goal is to have the estimation both provide the values of the significant coefficients and constrain the insignificant coefficients by zeroing, or at least minimizing, them. In that way, it is hoped, signal can be separated from noise.
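The danger described above, a model with many free coefficients treating noise as signal, can be illustrated with a toy linear "nonparametric" fit: the data below are a constant signal plus noise, yet a six-coefficient polynomial basis drives the residuals below the noise level by fitting the noise itself. The basis choice, noise level, and sample size are illustrative assumptions:

```python
import random

def polyfit(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations, solved by
    Gaussian elimination with partial pivoting (fine for small systems)."""
    n = degree + 1
    A = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    for col in range(n):                         # forward elimination
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    coef = [0.0] * n                             # back substitution
    for i in reversed(range(n)):
        coef[i] = (b[i] - sum(A[i][j] * coef[j] for j in range(i + 1, n))) / A[i][i]
    return coef

def rms_residual(xs, ys, coef):
    """Root-mean-square misfit between the data and the fitted polynomial."""
    res = [y - sum(c * x ** i for i, c in enumerate(coef)) for x, y in zip(xs, ys)]
    return (sum(r * r for r in res) / len(res)) ** 0.5

random.seed(2)
xs = [i / 19.0 for i in range(20)]
ys = [1.0 + random.gauss(0.0, 0.5) for _ in xs]    # true signal: a constant

rms1 = rms_residual(xs, ys, polyfit(xs, ys, 0))    # 1 coefficient: right model
rms6 = rms_residual(xs, ys, polyfit(xs, ys, 5))    # 6 coefficients: fit noise
print(rms1, rms6)   # the richer basis "explains" part of the noise
```

The six-coefficient fit always yields smaller residuals, even though the extra structure it finds is pure noise; this is exactly why an unconstrained residual is a misleading measure of model quality in this regime.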
The most reliable parameterization is the most conservative one, which seeks the simplest underlying parameters consistent with the input data, a principle also known as minimum complexity or Ockham's razor. Simplicity is context-dependent, but for most continuous applications the simplest solution is the smoothest one. The PIXON® method achieves this solution by imposing the maximum, spatially adaptive smoothing permitted by the data (Piña & Puetter 1993; Puetter & Yahil 1999; Puetter et al. 2005; U.S. Pat. Nos. 5,912,993, 6,353,688, 6,490,374, 6,895,125, 6,993,204, 8,014,580, 8,086,011, 8,090,179, 8,160,340, 8,396,313; US Patent Publication 2012/0263393, each of which is incorporated herein by reference). The ALGEBRON™ method is an equivalent technique designed for discrete problems that are not anchored in a continuous space, such as prediction and risk assessment in financial systems (U.S. Pat. No. 7,328,182).
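Purely to illustrate the general idea of spatially adaptive smoothing (this toy sketch is NOT the PIXON® method; the window rule, threshold, and data are invented assumptions), each sample below is replaced by the widest centered moving average that the local data still permit, so smooth regions receive heavy smoothing while a sharp edge keeps a narrow window:

```python
import random

def adaptive_smooth(ys, sigma, max_half=8):
    """Toy adaptive smoother: widen the averaging window at each sample until
    the average drifts more than ~2 sigma from the local datum, then stop."""
    out = []
    for i in range(len(ys)):
        best = ys[i]
        for h in range(1, max_half + 1):
            lo, hi = max(0, i - h), min(len(ys), i + h + 1)
            avg = sum(ys[lo:hi]) / (hi - lo)
            if abs(avg - ys[i]) > 2.0 * sigma:   # wider window no longer allowed
                break
            best = avg
        out.append(best)
    return out

random.seed(3)
truth = [0.0] * 50 + [5.0] * 50                   # a step edge: sharp feature
noisy = [t + random.gauss(0.0, 0.5) for t in truth]
smooth = adaptive_smooth(noisy, 0.5)

err_noisy = sum((a - b) ** 2 for a, b in zip(noisy, truth)) / len(truth)
err_smooth = sum((a - b) ** 2 for a, b in zip(smooth, truth)) / len(truth)
print(err_noisy, err_smooth)   # adaptive smoothing lowers the mean-square error
```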
Ever since the pioneering work of Fisher (1912, 1922), the common thread in statistical estimation has been the use of ML and its LLF estimator. However, ML has a number of serious drawbacks, which can limit its usefulness:
    1. ML is only asymptotically efficient. When additional constraints are applied to the solution in the non-asymptotic regime, the efficiency of the ML method is no longer assured.
    2. The covariance matrix of the parameters cannot be estimated from the information matrix in the non-asymptotic regime. The unrestricted ML method, in fact, often creates significant artifacts by amplifying noise treated as signal. Constraining the solution can reduce these artifacts, as discussed above, but the residual accuracy is then largely determined by the constraints and not by the information matrix.
    3. The LLF is not, in general, quadratic in the parameters θ, and the computational effort to determine the parameters may not be worth the extra asymptotic sampling efficiency, particularly for large-scale problems.
    4. The gradient of the LLF with respect to the parameters has a term that is inversely proportional to the pdf f(x,θ):
\[
\nabla_\theta L \;=\; 2 \int dx\, \nabla_\theta f(x,\theta) \left[\, 1 \;-\; f(x,\theta)^{-1} \sum_i \delta(x - x_i) \right] \qquad \text{(LLF gradient)} \tag{3}
\]
Data in regions of low pdf, possibly including outliers (rogue data), can then lead to large biases and/or fluctuations in the estimates of the parameters.
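This sensitivity is easy to demonstrate: a single observation in a region of low pdf carries an enormous 1/f weight in the gradient of Eq. (3) and drags the Gaussian ML estimate of the mean, while a robust statistic such as the median barely moves. The data values below are illustrative assumptions:

```python
import math

def gauss_pdf(x, mu, sigma):
    """Normal pdf, used to show the 1/f weight of Eq. (3) at an outlier."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

data = [4.8, 4.9, 5.0, 5.1, 5.2]
ml_mean = sum(data) / len(data)                 # Gaussian ML estimate = mean

with_outlier = data + [50.0]                    # one rogue observation
ml_mean_out = sum(with_outlier) / len(with_outlier)

sorted_d = sorted(with_outlier)
median = 0.5 * (sorted_d[2] + sorted_d[3])      # a robust alternative

print(ml_mean, ml_mean_out, median)             # ~5.0, ~12.5, ~5.05

# The 1/f weight in the LLF gradient explodes at the outlier:
print(1.0 / gauss_pdf(50.0, 5.0, 2.0))          # astronomically large
```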
Given the above limitations, the ML method does not have a particular advantage over other estimators for nonparametric estimations in the non-asymptotic regime. Accuracy and computational efficiency outweigh the ML “advantages” that no longer exist.