This invention relates to a method of data compression for the estimation of a plurality of parameters on which a dataset is dependent.
There are many instances where objects consist of many data, whose values are determined by a small number of parameters. Often, it is only these parameters which are of interest. An aim of this invention is to find linear combinations of the data which are focussed on estimating the physical parameters with as small an error as possible.
Such a problem is very general, and has been attacked in the case of parameter estimation in large-scale structure and the microwave background (e.g. Tegmark M., Taylor A., Heavens A., 1997 Astrophysical Journal 480, 22. [hereafter TTH], Tegmark M., 1997a. Astrophysical Journal (Letters) 480, L87 and Tegmark M., 1997b Physical Review D, 55, 5895.) The entire disclosure of these three documents is incorporated herein by way of reference. Previous work has concentrated largely on the estimation of a single parameter; the main advance of this invention is that it provides a method for the estimation of multiple parameters. The method provides one projection per parameter, with the consequent possibility of a massive data compression factor. Furthermore, if the noise in the data is independent of the parameters, then the method is entirely lossless. i.e. the compressed dataset contains as much information about the parameters as the full dataset, in the sense that the Fisher information matrix is the same for the compressed dataset as the entire original dataset. An equivalent statement is that the mean likelihood surface is at the peak locally identical when the full or compressed data are used.
The method will be illustrated herein with the case of galaxy spectra, for which there are surveys underway which will provide xcx9c106 objects. In this application, the noise is generally not independent of the parameters, as there is a photon shot-noise component which depends on how many photons are expected. We take a spectrum with poor signal-to-noise, whose noise is approximately from photon counting alone, and investigate how the method fares. In this case, the method is not lossless, but the increase in error bars is shown to be minimal, and superior in this respect to an alternative compression system PCA (Principal Component Analysis).
One advantage such radical compression offers is speed of analysis. One major scientific goal of galaxy spectral surveys is to determine physical parameters of the stellar component of the galaxies, such as the age, star formation history, initial mass function and so on. Such a process can, in principle, be achieved by generating model galaxy spectra by stellar population synthesis techniques, and finding the best-fitting model by maximum-likelihood techniques. This can be very time-consuming, and must inevitably be automated for so many galaxies. In addition, one may have a large parameter space to explore, so any method which can speed up this process is worth investigation. One possible further application of the data compression method is that the handful of numbers might provide the basis of a classification scheme which is based on the physical properties one wants to measure.
The present invention provides a method of compressing a dataset which is dependent on a plurality of parameters, as well as on noise, comprising forming a plurality of sets of weights, one set for each parameter, from an initial guess for the parameters, multiplying all the data in the dataset by each set of weights in turn turn and summing the products in each case to give compressed data for estimating the parameters, the number of compressed data being equal to the number of parameters.
In an embodiment of the invention, the dataset is represented by a vector xi, i=1, . . . N (e.g. a set of fluxes at different wavelengths). These measurements include a signal part, which we denote by xcexc, and noise, n:
x=xcexc+nxe2x80x83xe2x80x83(1)
Assuming the noise has zero mean,  less than x greater than =xcexc. The signal will depend on a set of parameters {xcex8xcex1}, which we wish to determine. For galaxy spectra, the parameters may be, for example, age, magnitude of source, metallicity and some parameters describing the star formation history. Thus, xcexc is a noise-free spectrum of a galaxy with certain age, metallicity etc.
The noise properties are described by the noise covariance matrix, C, with components Cij= less than ninj greater than . If the noise is gaussian, the statistical properties of the data are determined entirely by xcexc and C. In principle, the noise can also depend on the parameters. For example, in galaxy spectra, one component of the noise will come from photon counting statistics, and the contribution of this to the noise will depend on the mean number of photons expected from the source.
The aim is to derive the parameters from the data. If we assume uniform priors for the parameters, then the a posteriori probability for the parameters is the likelihood, which for gaussian noise is                                                                         ℒ                ⁡                                  (                                      θ                    α                                    )                                            =                              xe2x80x83                            ⁢                                                1                                                                                    (                                                  2                          ⁢                          π                                                )                                                                    N                        /                        2                                                              ⁢                                                                  det                        ⁡                                                  (                          C                          )                                                                                                                    xc3x97                                                                                                        xe2x80x83                            ⁢                                                exp                  ⁡                                      [                                                                  -                                                  1                          2                                                                    ⁢                                                                        ∑                                                      i                            ,                            j                                                                          ⁢                                                  xe2x80x83                                                ⁢                                                                              (                                                                                                                            x                                  ⁢                                                                      xe2x80x83                                                                                                  i                                                            -                                                              μ                                i                                                                                      )                                                    ⁢                                                                                    C                              ij                                                              -                                1                                                                                      ⁡                                                          (                                                                                                x                                  j                                                                -                                                                  μ                                  j                                                                                            )                                                                                                                                            ]                                                  .                                                                        (        2        )            
One approach is simply to find the (highest) peak in the likelihood, by exploring all parameter space, and using all N pixels. The position of the peak gives estimates of the parameters which are asymptotically (low noise) the best unbiased estimators (see TTH). This is therefore the best we can do. The maximum-likelihood procedure can, however, be time-consuming if N is large, and the parameter space is large. An aim of this invention is to reduce the N values to a smaller number, without increasing the uncertainties on the derived parameters xcex8xcex1. To be specific, we try to find a number Nxe2x80x2 less than N of linear combinations of the spectral data x which encompass as much as possible of the information about the physical parameters. We have found that this can be done lossless in some circumstances; the spectra can be reduced to a handful of numbers without loss of information. The speed-up in parameter estimation is by a factor xcx9c100.
In general, reducing the dataset in this way will lead to larger error bars in the parameters. To assess how well the compression is doing, consider the behavior of the (logarithm of the) likelihood function near the peak. Performing a Taylor expansion and truncating at the second-order terms,                               ln          ⁢                      xe2x80x83                    ⁢          ℒ                =                              ln            ⁢                          xe2x80x83                        ⁢                          ℒ              peak                                +                                    1              2                        ⁢                                                                                ∂                    2                                    ⁢                  ln                                ⁢                                  xe2x80x83                                ⁢                ℒ                                                              ∂                                      θ                    α                                                  ⁢                                  ∂                                      θ                    β                                                                        ⁢                          Δθ              α                        ⁢                                          Δθ                β                            .                                                          (        3        )            
Truncating here assumes that the likelihood surface itself is adequately approximated by a gaussian everywhere, not just at the maximum-likelihood point. The actual likelihood surface will vary when different data are used; on average, though, the width is set by the (inverse of the) Fisher information matrix:                               F          αβ                ≡                  -                      ⟨                                                                                ∂                    2                                    ⁢                  ln                                ⁢                                  xe2x80x83                                ⁢                ℒ                                                              ∂                                      θ                    α                                                  ⁢                                  ∂                                      θ                    β                                                                        ⟩                                              (        4        )            
where the average is over an ensemble with the same parameters but different noise.
For a single parameter, the Fisher matrix F is a scalar F, and the error on the parameter can be no smaller than Fxc2xd. If the data depend on more than one parameter, and all the parameters have to be estimated from the data, then the error is larger. The error on one parameter a (marginalized over the others) is at least [(Fxe2x88x921)xcex1xcex1]xe2x88x92xc2xd see Kendall M. G., Stuart A., 1969 The Advanced Theory of Statistics, London: Griffin. There is a little more discussion of the Fisher matrix in TTH. The Fisher matrix depends on the signal and noise terms in the following way (TTH, equation 15)                               F          αβ                =                              1            2                    ⁢                                    Tr              ⁡                              [                                                                            C                                              -                        1                                                              ⁢                                          C                                              ,                        α                                                              ⁢                                          C                                              -                        1                                                              ⁢                                          C                                              ,                        β                                                                              +                                                            C                                              -                        1                                                              ⁡                                          (                                                                                                    μ                                                          ,                              α                                                                                ⁢                                                      μ                                                          ,                              β                                                        t                                                                          +                                                                              μ                                                          ,                              β                                                                                ⁢                                                      μ                                                          ,                              α                                                        t                                                                                              )                                                                      ]                                      .                                              (        5        )            
where the comma indicates derivative with respect to the parameter. If we use the full dataset x, then this Fisher matrix represents the best that can possibly be done via likelihood methods with the data.
In practice, some of the data may tell us very little about the parameters, either through being very noisy, or through having little or no sensitivity to the parameters. So in principle we may be able to throw some data away without losing very much information about the parameters. Rather than throwing individual data away, we can do better by forming linear combinations of the data, and then throwing away the combinations which tell us least. To proceed, we first consider a single linear combination of the data:
yxe2x89xa1btxxe2x80x83xe2x80x83(6)
for some weighting vector b (t indicates transpose). We will try to find a weighting which captures as much information about a particular parameter, xcex8xcex1. If we assume we know all the other parameters, this amounts to maximizing Fxcex1xcex1. The dataset (now consisting of a single number) has a Fisher matrix, which is given in TTH (equation 25) by:                               F          αβ                =                                            1              2                        ⁢                          (                                                                    b                    t                                    ⁢                                      C                                          ,                      α                                                        ⁢                  b                                                                      b                    t                                    ⁢                  Cb                                            )                        ⁢                          (                                                                    b                    t                                    ⁢                                      C                                          ,                      β                                                        ⁢                  b                                                                      b                    t                                    ⁢                  Cb                                            )                                +                                                                      (                                                            b                      t                                        ⁢                                          μ                                              ,                        α                                                                              )                                ⁢                                  (                                                            b                      t                                        ⁢                                          μ                                              ,                        β                                                                              )                                                            (                                                      b                    t                                    ⁢                  Cb                                )                                      .                                              (        7        )            
Note that the denominators are simply numbers. It is clear from this expression that if we multiply b by a constant, we get the same F. This makes sense: multiplying the data by a constant factor does not change the information content. We can therefore fix the normalization of b at our convenience. To simplify the denominators, we therefore maximize Fxcex1xcex1 subject to the constraint
Bxe2x80x2Cb=1xe2x80x83xe2x80x83(8)
The most general problem has both the mean xcexc and the covariance matrix C depending on the parameters of the spectrum, and the resulting maximization leads to an eigenvalue problem which is nonlinear in b. We are unable to solve this, so we consider a case for which an analytic solution can be found. TTH showed how to solve for the case of estimation of a single parameter in two special cases: 1) when xcexc is known, and 2) when C is known (i.e. does not depend on the parameters). We will concentrate on the latter case, but generalize to the problem of estimating many parameters at once. For a single parameter, TTH showed that the entire dataset could be reduced to a single number, with no loss of information about the parameter. We show below that, if we have M parameters to estimate, then we can reduce the dataset to M numbers. These M numbers contain just as much information as the original dataset; i.e. the data compression is lossless.
We consider the parameters in turn. With C independent of the parameters, F simplifies, and, maximizing F11 subject to the constraint requires                                           ∂                          ∂                              b                i                                              ⁢                      (                                                            b                  j                                ⁢                                  xe2x80x83                                ⁢                                  μ                                      ,                                          1                      ⁢                      j                                                                      ⁢                                  b                  k                                ⁢                                  xe2x80x83                                ⁢                                  μ                                      ,                                          1                      ⁢                      k                                                                                  -                              λ                ⁢                                  xe2x80x83                                ⁢                                  b                  j                                ⁢                                  C                  jk                                ⁢                                  b                  k                                                      )                          =        0                            (        9        )            
where xcex is a Lagrange multiplier, and we assume the summation convention (j, k xcex5 [1, N]). This leads to
xcexc,1(btxcexc,1)=xcexCbxe2x80x83xe2x80x83(10)
with solution, properly normalized                               b          1                =                                            C                              -                1                                      ⁢                          μ                              ,                1                                                                                        μ                                  ,                  1                                t                            ⁢                              C                                  -                  1                                            ⁢                              μ                                  ,                  1                                                                                        (        11        )            
and our compressed datum is the single number y1=bt1x. This solution makes sensexe2x80x94ignoring the unimportant denominator, the method weights high those data which are parameter-sensitive, and low those data which are noisy.
To see whether the compression is lossless, we compare the Fisher matrix element before and after the compression. Substitution of b1 into (7) gives
F11=xcexc,t1Cxe2x88x921xcexc,1xe2x80x83xe2x80x83(12)
which is identical to the Fisher matrix element using the full data (equation 5) if C is independent of xcex81. Hence, as claimed by TTH, the compression from the entire dataset to the single number y1 loses no information about xcex81. For example, if xcexcxe2x88x9dxcex8, then y1=xcexa3ixt/xcexa3ixcexct is simply an estimate of the parameter itself
Fiducial Model
It is important to note that yi contains as much information about xcex81 only if all other parameters are known, and also provided that the covariance matrix and the derivative of the mean in (11) are those at the maximum likelihood point. We turn to the first of these restrictions in the next section, and discuss the second one here.
In practice, one does not know beforehand what the true solution is, so one has to make an initial guess for the parameters. This guess we refer to as the fiducial model. We compute the covariance matrix C and the gradient of the mean (xcexc,xcex1) for this fiducial model, to construct b1. The Fisher matrix for the compressed datum is (12), but with the fiducial values inserted. In general this is not the same as Fisher matrix at the true solution. In practice one can iterate: choose a fiducial model; use it to estimate the parameters, and then repeat, using the estimated parameters as the new fiducial model. As our example in section 4 shows, such iteration may be completely unnecessary.
Estimation of Many Parameters
The problem of estimating a single parameter from a set of data is unusual in practice. Normally one has several parameters to estimate simultaneously, and this introduces substantial complications into the analysis. The invention generalises the single-parameter estimate above to the case of many parameters. We proceed by finding a second number y2xe2x89xa1bt2x by the following requirements:
y2 is uncorrelated with y1. This demands that bt2Cb1=0.
y2 captures as much information as possible about the second parameter xcex82.
This requires two Lagrange multipliers (we normalize b2 by demanding that bt2Cb1=1 as before). Maximizing and applying the constraints gives the solution                               b          2                =                                                                              C                                      -                    1                                                  ⁢                                  μ                                      ,                    2                                                              -                                                (                                                            μ                                              ,                        2                                            t                                        ⁢                                          b                      1                                                        )                                ⁢                                  b                  1                                                                                                                          μ                                          ,                      2                                                        ⁢                                      C                                          -                      1                                                        ⁢                                      μ                                          ,                      2                                                                      -                                                      (                                                                  μ                                                  ,                          2                                                t                                            ⁢                                              b                        1                                                              )                                    2                                                              .                                    (        13        )            
This is readily generalized to any number M of parameters. There are then M orthogonal vectors bm, m=1, . . . M, each ym capturing as much information about parameter xcex1m which is not already contained in yq; q less than m. The constrained maximization gives                               b          m                =                                                                              C                                      -                    1                                                  ⁢                                  μ                                      ,                    m                                                              -                                                ∑                                      q                    =                    1                                                        m                    -                    1                                                  ⁢                                  xe2x80x83                                ⁢                                                      (                                                                  μ                                                  ,                          m                                                t                                            ⁢                                              b                        q                                                              )                                    ⁢                                      b                    q                                                                                                                                            μ                                          ,                      m                                                        ⁢                                      C                                          -                      1                                                        ⁢                                      μ                                          ,                      m                                                                      -                                                      ∑                                          q                      =                      1                                                              m                      -                      1                                                        ⁢                                      xe2x80x83                                    ⁢                                                            (                                                                        μ                                                      ,                            m                                                    t                                                ⁢                                                  b                          q                                                                    )                                        2                                                                                .                                    (        14        )            
This procedure is analogous to Gram-Schmidt orthogonalization with a curved metric, with C playing the role of the metric tensor. Note that the procedure gives precisely M eigenvectors and hence M numbers, so the dataset has been compressed from the original N data down to the number of parameters M.
Since, by construction, the numbers ym are uncorrelated, the likelihood of the parameters is obtained by multiplication of the likelihoods obtained from each statistic ym. The ym have mean  less than ym greater than =btmxcexc and unit variance, so the likelihood from the compressed data is simply                               ln          ⁢                      xe2x80x83                    ⁢                      ℒ            ⁡                          (                              θ                α                            )                                      =                  constant          -                                    ∑                              m                =                1                            M                        ⁢                          xe2x80x83                        ⁢                                                            (                                                            y                      m                                        -                                          ⟨                                              y                        m                                            ⟩                                                        )                                2                            2                                                          (        15        )            
and the Fisher matrix of the combined numbers is just the sum of the individual Fisher matrices. Note once again the role of the fiducial model in setting the weightings bm: the orthonormality of the new numbers only holds if the fiducial model is correct. Multiplication of the likelihoods is thus only approximately correct, but iteration could be used if desired.
Proof that the Method can be Lossless for Many Parameters
Under the assumption that the covariance matrix is independent of the parameters, reduction of the original data to the M numbers ym results in no loss of information about the M parameters at all. In fact the set {ym} produces, on average, a likelihood surface which is locally identical to that from the entire datasetxe2x80x94no information about the parameters is lost in the compression process. With the restriction that the information is defined locally by the Fisher matrix, the set {ym} is a set of sufficient statistics for the parameters {xcex8xcex1} (e.g. Koch K., 1999. Parameter Estimation and Hypothesis Testing in Linear Models, Springer-Verlag (Berlin). A proof of this for an arbitrary number of parameters is given in the appendix.
The General Case
In general, the covariance matrix does depend on the parameters, and this is the case for galaxy spectra, where at least one component of the noise is parameter-dependent. This is the photon counting noise, for which Cii=xcexci. TTH argued that it is better to treat this case by using the n eigenvectors which arise from assuming the mean is known, rather than the single number (for one parameter) which arises if we assume that the covariance matrix is known, as above. We have found that, on the contrary, the small number of eigenvectors bm allow a much greater degree of compression than the known-mean eigenvectors (which in this case are simply individual pixels, ordered by xcexc,xcex1/xcexc). For data signal-to-noise of around 2, the latter allow a data compression by about a factor of 2 before the errors on the parameters increase substantially, whereas the method of the invention allows drastic compression from thousands of numbers to a handful. To show what can be achieved, we use a set of simulated galaxy spectra to constrain a few parameters characterizing the galaxy star formation history.
Parameter Eigenvectors
In the case when the covariance matrix is independent of the parameters, it does not matter which parameter we choose to form y1, y2 etc, as the likelihood surface from the compressed numbers is, on average, locally identical to that from the full dataset. However, in the general case, the procedure does lose information, and the amount of information lost could depend on the order of assignment of parameters to m. If the parameter estimates are correlated, the error in both parameters is dominated by the length of the likelihood contours along a xe2x80x98ridgexe2x80x99. It makes sense then to diagonalize the matrix of second derivatives of ln L at the fiducial model, and use these as the parameters (temporarily), for galaxy surveys. The parameter eigenvalues would order the importance of the parameter combinations to the likelihood. The procedure would be to take the smallest eigenvalue (with eigenvector lying along the ridge), and make the likelihood surface as narrow as possible in that direction. One then repeats along the parameter eigenvectors in increasing order of eigenvalue.
Specifically, diagonalize Fxcex1xcex2 in (5), to form a diagonal covariance matrix xcex9=Sxe2x80x2FS. The orthogonal parameter combinations are "psgr"=Stxcex8, where S has the normalized eigenvectors of F as its columns. The weighting vectors bm are then computed from (14) by replacing xcexc,xcex1p with Sprxcexc,xcex1r.
The parameters may be ordered in an arbitrary way or replaced by new parameters which are dependent combinations of the original parameters.
Other objects, advantages and features of the present invention will become more readily appreciated and understood when taken together with the following detailed description of an example of the use of the invention in connection with galaxy spectra, in connection with the accompanying drawings, in which: