As used herein, boldface capitals designate two-dimensional data matrices (e.g., $\mathbf{X}_{I \times J}$) and underlined boldface capitals symbolize three-way arrays (e.g., $\underline{\mathbf{X}}_{I \times J \times K}$). Vectors are represented by bold lowercase characters, scalars or elements of matrices by lowercase italics. The superscript T denotes the matrix or vector transposition. $\|\mathbf{X}\|_p$ indicates the entrywise p-norm of matrix X, where p ≥ 1.
The analysis of complex aqueous-based mixtures such as cell culture media, fermentation broths, wastewater, and river or sea water is often difficult when the compositional differences between samples are very small. For example, cell culture broth is an enormously complex mixture consisting of a vast variety of components at low concentrations (e.g., amino acids, peptides, sugars, proteins, vitamins, and metals). The concentration profiles of the components vary widely during the course of the cultivation or fermentation. A key challenge therefore is the rapid characterization of compositional changes, both in the gross sense and for specific analytes. The solution is usually different for each component of the mixture, and therefore requires the use of a range of analytical technologies and techniques. Many current methods, for example for monitoring compositional changes of a process broth or the production of the product (e.g., the biotherapeutic in the biopharmaceutical industry) during manufacture, are slow, typically taking hours or days. Thus, there is a need for the development of rapid analytical methods that can characterize constituents in cell culture media and provide useful information on cell culture progress. This need consists of two elements: monitoring the use or level of constituents of cell culture media, and monitoring product formation. One way in which rapid analyses for gross compositional changes can be achieved is via the use of multidimensional spectroscopy-based methods. However, in many cases the differences in the spectra are very small, and assessing these small changes in the topography of multidimensional spectra can be difficult and time-consuming.
Fluorescence spectroscopy is widely used in biomedical, biopharmaceutical, chemical, environmental, food, and other industrial domains. Fluorescence has proven itself as a highly sensitive and potentially selective analytical technique capable of measuring analyte concentrations down to sub-nanomolar levels in routine use. In fluorescence spectroscopy, the fluorescence properties can be measured by recording measurements spanning both the excitation and emission wavelengths of the mixture. Collating a series of emission scans from a range of excitation wavelengths produces an excitation-emission matrix (EEM). Such an EEM spectrum can be visualized as a two-dimensional contour map or as a three-dimensional landscape. Another possibility is the total synchronous fluorescence scan (TSFS) spectrum, which is obtained by scanning both emission and excitation properties simultaneously. When a group of EEMs (or TSFS data) of equal excitation and emission wavelengths are stacked together, a three-way data array (for example, with samples×excitation×emission dimensions) is generated. There are a great number of applications using fluorescence EEM information for the characterization of analytes, and most of them combine it with multiway multivariate analysis methods such as multiway principal component analysis (MPCA), N-way partial least squares (NPLS), and parallel factor analysis (PARAFAC). However, these methods require deep mathematical knowledge and expertise to undertake the complex procedures involved. This can in turn hinder the capability of an analyst to faithfully conduct these methods using large amounts of data.
The Matrix congruence coefficient (MatrixCC) can be thought of as describing a similarity matrix of all the variables involved in the numerator term in the following formula,
$$\mathrm{MatrixCC} = \frac{\|\mathbf{X}_1^{T}\mathbf{X}_2\|_2}{\|\mathbf{X}_1\|_2\,\|\mathbf{X}_2\|_2} \tag{1}$$

Each score of $\mathbf{X}_1^{T}\mathbf{X}_2$ expresses the similarity between two data points of matrices $\mathbf{X}_1$ and $\mathbf{X}_2$, which is strongly related to the data counterparts. The denominator terms normalize the similarity results obtained. Higher scores are given to more similar characters, and lower or negative scores for dissimilar characters. Formula (1) is based in molecular electrostatic potential comparison and structural alignment. For instance, the similarity/dissimilarity of two compared molecules in terms of their relevant target matrices assigns identical bases a score of +1 and non-identical bases a score of zero or −1 in a simple way.
In essence, the MatrixCC is a congruence angle between two spectral hyperplanes spanned by X1 and X2, which measures their degree of linear correlation. Generally, this calculation tends to give higher similarity values when a common feature predominates in both matrices X1 and X2. However, the MatrixCC method sometimes fails to describe or identify differences of interest in complex mixtures where the concentration of the individual constituents can vary significantly. In other words, the sensitivity of the MatrixCC method may not be sufficient for complex mixture analysis.
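Formula (1) can be sketched numerically as follows. This is a minimal illustration, assuming the 2-norm in Formula (1) is the entrywise 2-norm defined above (i.e., the Frobenius norm, which is what `np.linalg.norm` computes by default for a 2-D array). The function name `matrix_cc` and the synthetic matrices are hypothetical and are not taken from the source.

```python
import numpy as np

def matrix_cc(X1, X2):
    """Matrix congruence coefficient of Formula (1).

    ASSUMPTION: the 2-norm is the entrywise 2-norm (Frobenius norm).
    X1 and X2 must have the same number of rows so that X1.T @ X2 exists.
    """
    X1 = np.asarray(X1, dtype=float)
    X2 = np.asarray(X2, dtype=float)
    numerator = np.linalg.norm(X1.T @ X2)              # ||X1^T X2||_2
    denominator = np.linalg.norm(X1) * np.linalg.norm(X2)
    return numerator / denominator

# Hypothetical stand-ins for two EEMs measured on the same wavelength grid.
rng = np.random.default_rng(0)
eem = rng.random((20, 30))
perturbed = eem + 0.01 * rng.random((20, 30))  # small compositional change

print("self-similarity:       ", matrix_cc(eem, eem))
print("after small change:    ", matrix_cc(eem, perturbed))
print("after pure rescaling:  ", matrix_cc(eem, 3.0 * eem))
```

Because the coefficient is unchanged when either matrix is multiplied by a constant, an overall intensity change is invisible to it, and a small additive change moves the value only slightly. That behaviour is in line with the limited sensitivity noted above.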
Multiway principal component analysis (MPCA) is statistically and algorithmically consistent with standard PCA, and has the same goals and benefits. The philosophy behind both is multivariate statistical projection, in which a large data set of highly correlated variables is projected into a low-dimensional space that summarizes the variables within a set of suboptimal principal components (PCs). Thus, the original data are compressed and the relevant information is extracted at the same time. PCA describes a two-dimensional data matrix X of m observations by n variables in an underlying factor analytic manner,

$$\mathbf{X} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^{T} \tag{2}$$

where U is an m-by-m orthogonal matrix, V is an n-by-n orthogonal matrix, and Σ is an m-by-n matrix containing nonnegative real singular values of decreasing magnitude along its main diagonal ($\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_{\min(m,n)} \ge 0$), and zero off-diagonal elements. This process can be realized by the singular value decomposition (SVD) of X.
In order to optimally capture the variations of the data while minimizing the noise effect corrupting the PCA representation, the loading vectors corresponding to the F largest singular values are typically retained. For each loading vector, the amount of variance explained by the PCA model is calculated via comparing the estimates $\sigma_i^2$ (i = 1, 2, …, F) computed by the model with the true data ($\sum \sigma^2$),

$$\text{Explained variance}\ (\%) = \frac{\sigma_i^2}{\sum \sigma^2} \times 100 \quad (i = 1, 2, \ldots, F) \tag{3}$$
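Eqs. (2) and (3) can be illustrated with a short NumPy sketch. It assumes X has already been preprocessed (the demo mean-centers the columns, which the source does not specify), and it uses the economy-size SVD, so U is m-by-min(m, n) rather than the full m-by-m matrix of Eq. (2). The function name `pca_svd` and the random data are hypothetical.

```python
import numpy as np

def pca_svd(X, n_components):
    """PCA of an (m x n) matrix via SVD, following Eqs. (2) and (3).

    Returns U, the singular values s, V^T, the n-by-F loading matrix P,
    and the explained variance (%) of every component.
    """
    X = np.asarray(X, dtype=float)
    # Economy-size SVD: singular values come back sorted in decreasing order.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    explained = 100.0 * s**2 / np.sum(s**2)   # Eq. (3), one value per component
    P = Vt[:n_components].T                   # loadings for the F largest values
    return U, s, Vt, P, explained

# Hypothetical data: 15 observations of 6 variables.
rng = np.random.default_rng(1)
X = rng.normal(size=(15, 6))
X = X - X.mean(axis=0)   # ASSUMPTION: column mean-centering as preprocessing

U, s, Vt, P, explained = pca_svd(X, n_components=2)
print("explained variance (%) per component:", np.round(explained, 2))
print("captured by the first 2 components  :", round(float(explained[:2].sum()), 2))
```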
Then, these F vectors are stacked into an n-by-F loading matrix P (generally called loadings). For the score space, the Hotelling T² statistic can be calculated from the PCA representation:

$$T^2 = \mathbf{x}^{T}\mathbf{P}\,\boldsymbol{\Sigma}_F^{-2}\,\mathbf{P}^{T}\mathbf{x} \tag{4}$$

where x is an observation of X, whose dimension is (n×1), and $\boldsymbol{\Sigma}_F$ contains the first F rows and columns of Σ. The threshold for the T² statistic is given as,
$$T_\alpha^2 = \frac{F(m-1)(m+1)}{m(m-F)}\,F_\alpha(F,\, m-F) \tag{5}$$

In Eq. (5), α denotes the significance level at which the F-distribution is evaluated, for instance, α = 0.05.
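The following sketch evaluates Eqs. (4) and (5) for every observation in a data matrix. One point is an assumption not stated in the source: the singular values are taken from the SVD of X/√(m−1), so that $\boldsymbol{\Sigma}_F^2$ estimates the variances of the scores and T² is on a scale comparable with the threshold. The critical value $F_\alpha(F, m-F)$ is passed in by the caller. With SciPy it could be obtained as `scipy.stats.f.ppf(1 - alpha, F, m - F)`, but that call is left out here to keep the example NumPy-only. The function names and the critical value used in the demo are hypothetical.

```python
import numpy as np

def hotelling_t2(X, n_components):
    """Hotelling T^2 of each row of X, following Eq. (4)."""
    X = np.asarray(X, dtype=float)
    m = X.shape[0]
    # ASSUMPTION: scale by sqrt(m - 1) so the squared singular values are
    # variance estimates; the source does not state this scaling.
    _, s, Vt = np.linalg.svd(X / np.sqrt(m - 1), full_matrices=False)
    P = Vt[:n_components].T            # n-by-F loadings
    scores = X @ P                     # each row is x^T P
    # x^T P Sigma_F^{-2} P^T x = sum_f (score_f / sigma_f)^2
    return np.sum((scores / s[:n_components]) ** 2, axis=1)

def t2_threshold(n_components, m, f_crit):
    """Threshold of Eq. (5); f_crit is F_alpha(F, m - F), supplied by the caller."""
    F = n_components
    return F * (m - 1) * (m + 1) / (m * (m - F)) * f_crit

# Hypothetical data: 25 observations of 6 mean-centered variables.
rng = np.random.default_rng(2)
X = rng.normal(size=(25, 6))
X = X - X.mean(axis=0)

t2 = hotelling_t2(X, n_components=2)
# Placeholder critical value for illustration only, not a tabulated quantile.
limit = t2_threshold(n_components=2, m=25, f_crit=3.4)
print("observations above the limit:", np.flatnonzero(t2 > limit))
```

With this scaling the T² values of the m observations sum to (m−1)·F, which gives a quick sanity check on an implementation.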
MPCA, by its nature, is an extension of PCA capable of handling multiway data. Here, MPCA amounts to unfolding a three-way array $\underline{\mathbf{X}}_{I \times J \times K}$ (fluorescence EEMs generated from K measurements at I excitation and J emission wavelengths, or TSFS data at I wavelength-offset (Δλ) and J excitation wavelengths) slice by slice, rearranging these slices into a large two-dimensional matrix X, and then performing a standard PCA. For an MPCA analysis of K measurements, a most meaningful approach is to unfold the data array $\underline{\mathbf{X}}$ into a matrix X with dimensions (IJ×K). The similarity and variability among the K measurements can be assessed by the analysis of their scores and Hotelling T² statistics within the model. MPCA has been extensively applied to statistical process monitoring and multivariate quality control.
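The unfolding step can be sketched as follows. The array layout (excitation, emission, measurement) and the function names are assumptions made for illustration. Reading the per-measurement coordinates from the right singular vectors is likewise only one possible interpretation of "scores" for an (IJ×K) unfolding, since the source does not spell this out.

```python
import numpy as np

def unfold_ijk(X3):
    """Unfold an (I, J, K) array into an (I*J, K) matrix.

    Column k is the vectorized EEM of measurement k (row index = i*J + j).
    """
    I, J, K = X3.shape
    return X3.reshape(I * J, K)

def mpca(X3, n_components):
    """Unfold, then run a standard SVD-based PCA on the unfolded matrix."""
    Xu = unfold_ijk(np.asarray(X3, dtype=float))
    _, s, Vt = np.linalg.svd(Xu, full_matrices=False)
    # ASSUMPTION: one F-dimensional coordinate per measurement, taken as the
    # rows of V_F scaled by the singular values.
    coords = (s[:n_components, None] * Vt[:n_components]).T   # shape (K, F)
    explained = 100.0 * s[:n_components] ** 2 / np.sum(s ** 2)
    return Xu, coords, explained

# Hypothetical array: 4 excitation x 5 emission wavelengths, 7 measurements.
rng = np.random.default_rng(3)
X3 = rng.random((4, 5, 7))

Xu, coords, explained = mpca(X3, n_components=2)
print("unfolded shape:", Xu.shape)
print("coordinates of the 7 measurements:\n", np.round(coords, 3))
```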
PARAFAC and its variant PARAFAC2 are among the simplest generalizations of multiway multivariate curve resolution. The trilinear methodology of PARAFAC is based upon the principle of proportional profiles. An in-depth description of PARAFAC can be found in the scientific literature. An F-component (or factor) PARAFAC model on an EEM data array $\underline{\mathbf{X}}_{I \times J \times K}$ can be formulated as,
$$x_{ijk} = \sum_{f=1}^{F} a_{if}\, b_{jf}\, c_{kf} + e_{ijk} \quad (i = 1, 2, \ldots, I;\ j = 1, 2, \ldots, J;\ k = 1, 2, \ldots, K) \tag{6}$$

where $x_{ijk}$ is the intensity of the kth measurement/sample at the ith excitation and jth emission wavelengths, and $a_{if}$, $b_{jf}$, $c_{kf}$ are parameters describing the behaviors of the excitation variables, emission variables, and samples of the F components. The residuals $e_{ijk}$ contain the errors and the part not captured by the model. Analogous to the decomposition form of standard PCA, the PARAFAC model can be written in matrix notation,

$$\mathbf{X}_k = \mathbf{A}_{I \times F}\,\mathrm{diag}(\mathbf{c}_k)\,\mathbf{B}_{J \times F}^{T} + \mathbf{E}_k \quad (k = 1, \ldots, K) \tag{7}$$
Here $\mathbf{X}_k$ is the kth frontal slice of $\underline{\mathbf{X}}_{I \times J \times K}$, which corresponds to the EEM of the kth measurement. The F-vector $\mathbf{c}_k$ is the kth row of matrix C. The matrices A, B, and C hold the estimated excitation, emission, and concentration information of the underlying sample constituents, respectively.
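The equivalence between the element-wise form of Eq. (6) and the slice-wise form of Eq. (7) can be checked numerically. The sketch below builds a noise-free trilinear array (all residuals zero) from randomly generated, purely hypothetical factor matrices and compares each frontal slice with A diag(c_k) Bᵀ.

```python
import numpy as np

# Hypothetical sizes: I excitation, J emission wavelengths, K samples, F factors.
I, J, K, F = 5, 6, 4, 2
rng = np.random.default_rng(4)
A = rng.random((I, F))   # excitation profiles (columns a_f)
B = rng.random((J, F))   # emission profiles (columns b_f)
C = rng.random((K, F))   # relative concentrations (rows c_k)

# Eq. (6) without residuals: x_ijk = sum_f a_if * b_jf * c_kf
X = np.einsum("if,jf,kf->ijk", A, B, C)

# Eq. (7) without residuals: X_k = A diag(c_k) B^T for each frontal slice.
max_diff = 0.0
for k in range(K):
    X_k = A @ np.diag(C[k]) @ B.T
    max_diff = max(max_diff, float(np.max(np.abs(X[:, :, k] - X_k))))

print("largest slice-wise discrepancy:", max_diff)
```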
Whereas MPCA focuses only on the variance of the EEM data, PARAFAC focuses on profiling the contributions of each constituent. If the data are approximately trilinear and the correct number of components (viz. F) is used, the I-vector $\mathbf{a}_f$ (f = 1, 2, …, F) with elements $a_{if}$ (i = 1, 2, …, I) is the estimated excitation spectrum of constituent f in sample k, and likewise $\mathbf{b}_f$ is the estimated emission spectrum of this constituent. The scores $c_{kf}$ of the F-vector $\mathbf{c}_k$ may be interpreted as the relative concentration of constituent f in sample k. The PARAFAC performance is assessed based on the sum of squared errors,
$$\min\left(\sum_{i,j,k=1}^{I,J,K} e_{ijk}^{2}\right) \tag{8}$$
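One common way to minimize the criterion in Eq. (8) is alternating least squares (ALS), in which A, B, and C are updated in turn while the other two are held fixed. The source does not specify a fitting algorithm, so the sketch below is only an illustration under that assumption. It uses random initialization, a fixed iteration count, and no constraints, and the function names and synthetic data are hypothetical.

```python
import numpy as np

def khatri_rao(U, V):
    """Column-wise Kronecker product; row u*V.shape[0] + v holds U[u] * V[v]."""
    return (U[:, None, :] * V[None, :, :]).reshape(-1, U.shape[1])

def parafac_als(X, n_components, n_iter=500, seed=0):
    """Unconstrained PARAFAC fit by ALS; returns A, B, C and the SSE per sweep."""
    I, J, K = X.shape
    rng = np.random.default_rng(seed)
    B = rng.random((J, n_components))
    C = rng.random((K, n_components))
    # Three unfoldings of X, one per mode.
    X1 = X.reshape(I, J * K)                      # columns indexed by j*K + k
    X2 = X.transpose(1, 0, 2).reshape(J, I * K)   # columns indexed by i*K + k
    X3 = X.transpose(2, 0, 1).reshape(K, I * J)   # columns indexed by i*J + j
    sse_history = []
    for _ in range(n_iter):
        # Each update is a linear least-squares problem in one factor matrix.
        A = np.linalg.lstsq(khatri_rao(B, C), X1.T, rcond=None)[0].T
        B = np.linalg.lstsq(khatri_rao(A, C), X2.T, rcond=None)[0].T
        C = np.linalg.lstsq(khatri_rao(A, B), X3.T, rcond=None)[0].T
        E = X - np.einsum("if,jf,kf->ijk", A, B, C)   # residuals e_ijk
        sse_history.append(float(np.sum(E ** 2)))     # criterion of Eq. (8)
    return A, B, C, sse_history

# Hypothetical noise-free trilinear data with two constituents.
rng = np.random.default_rng(5)
A0, B0, C0 = rng.random((8, 2)), rng.random((9, 2)), rng.random((6, 2))
X_true = np.einsum("if,jf,kf->ijk", A0, B0, C0)

A, B, C, hist = parafac_als(X_true, n_components=2)
print("SSE after first sweep:", hist[0])
print("SSE after last sweep :", hist[-1])
```

Each update solves its subproblem exactly, so the sum of squared errors cannot increase from one sweep to the next. The recovered factors match the generating ones only up to scaling and permutation of the components.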
The popularity of PARAFAC for chemical and industrial applications lies in the facts that it generates a unique model, that the model results can be interpreted in a meaningful way to provide information on constituent concentrations, and that the results are robust by virtue of being calculated using least-squares methods. However, in practice, the data arrays are usually contaminated by noise and other factors (e.g., Rayleigh scattering in fluorescence spectroscopy) which induce deviations from the trilinear prerequisite for PARAFAC. As a result of these deviations in the spectral data, the PARAFAC model becomes less precise (it no longer yields a unique solution), and this may generate misleading estimated parameters (the spectral and concentration profiles). Such inaccuracy can be formulated as:

$$\underline{\mathbf{X}} = \underline{\mathbf{X}}(\mathbf{A}, \mathbf{B}, \mathbf{C}) + \mathbf{D}_{\text{independent}} + \mathbf{D}(\mathbf{A}, \mathbf{B}, \mathbf{C}) + \mathbf{E} \tag{9}$$

in which X(A, B, C) contains the linear responses of the constituents, D(A, B, C) denotes deviations with a certain relationship to the three parameter matrices A, B, and C, $\mathbf{D}_{\text{independent}}$ represents deviations independent of A, B, and C, and E holds the residual variation.
For data arrays in which D(A, B, C) ≠ 0, it is not usually possible for PARAFAC to directly estimate a reliable solution to the model parameters unless some pretreatments are used beforehand. If the data arrays satisfy the D(A, B, C) = 0 condition, then the influence of $\mathbf{D}_{\text{independent}}$ on the accuracy of the estimated model parameters can be mitigated by imposing some constraints on the model, such as non-negativity and/or incorporation of a priori information (e.g., the known concentrations) into the decomposition procedure.