The first statistical methods for ordinal point data were developed between 1935 and 1955. These methods have subsequently been extended to cover special cases of interval and approximate data; some of these extensions were based on conventional U-statistics (UStat) and on the well-known Marginal Likelihood (MrgL) principle. The most recent work has pointed to the necessity of estimating information content (IC) for approximate, interval, and multivariate point data. The first results had a limited range of applications and were rarely used, due to deficiencies in the theory and a lack of computationally efficient algorithms.
Shortcomings of Currently Available Statistical Methods when Used with Ordinal Data
Most statistical analysis programs are based on the linear model, mainly because of its computational simplicity. When applied to multivariate data, the linear model relies on linear combinations of the variables (e.g., 10 times the expression of gene A plus 2 times the expression of gene B minus the logarithm of the expression of gene C). In biological, psychological, and genomic applications, however, the relationship between the measurement (body temperature, IQ, gene expression) and its meaning (fever, social status, immunity) is usually merely ordinal. An increase in body temperature (observed variable) by two degrees from 35° C. to 37° C., for instance, is usually an irrelevant change in fever (latent factor), while an increase from 41° C. to 43° C. means that a person dies. One of the problems in dealing with ordinal data is that the magnitude of a difference between the values of variables has no meaning. Thus, “distance” cannot simply be defined as the absolute value of a difference or ratio, as in the linear model. Because it is not clear whether a linear combination is meaningful at all, even after applying some transformations, the nature of which is also unknown, the use of linear models is questionable at best.
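The dependence of a linear combination on the unknown scale can be illustrated with a minimal sketch. The values below are hypothetical, and the unweighted sum and the base-10 logarithm are assumptions chosen only for illustration. The sketch shows that a monotone transformation leaves the order within each variable unchanged but can change the order of the entities under the combined linear score.

```python
import math

def rank_order(scores):
    """Return entity indices sorted from lowest to highest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i])

# Hypothetical measurements of two variables on three entities.
a = [1.0, 10.0, 100.0]
b = [60.0, 40.0, 5.0]

# Linear score on the raw scale (unweighted sum, an illustrative choice).
raw_sum = [x + y for x, y in zip(a, b)]

# The same linear score after a monotone (logarithmic) transformation.
log_a = [math.log10(x) for x in a]
log_b = [math.log10(y) for y in b]
log_sum = [x + y for x, y in zip(log_a, log_b)]

# The order within each variable is invariant under the transformation ...
assert rank_order(a) == rank_order(log_a)
assert rank_order(b) == rank_order(log_b)

# ... but the order implied by the linear combination is not.
print(rank_order(raw_sum))  # [1, 0, 2]
print(rank_order(log_sum))  # [0, 1, 2]
```

Since neither scale is known to be the “right” one for ordinal data, neither ranking of the entities can be preferred on the basis of the linear score alone.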
A different class of approaches comprises the use of models for categorical data, where data are interpreted on a nominal scale, i.e., where any order between categories is ignored. Examples are colors, races, but also disease codes. For these models to be applicable, however, continuous variables need to be discretized, which introduces arbitrariness. Moreover, the loss of information on the order of the categories is clearly undesirable in many applications.
The lack of alternative methods has led to linear model methods also being applied to ordinal data, essentially by combining ordinal outcomes by means of linear combinations (weighted averages). External “validation” is then used to justify an otherwise conceptually invalid approach. For such “validation”, however, one needs an independent population in which the latent factor is known (a “gold standard”), against which the different linear score functions can be compared. (The term “population” is used here to describe classes of entities identified by some common characteristics in general and is not limited to human or animal populations.) External validation also poses several technical problems. The comparison of many possible linear score functions can be very time consuming. The data from entities with similar characteristics and known conditions may first need to be collected. Moreover, if the population considered is relatively “unique”, similar entities can be difficult to find. Finally, there may be no “gold standard” against which the score function(s) can be immediately validated.
Methods for multivariate ordinal data should ideally lie somewhere in the middle between linear models for interval-scaled data and categorical models for nominally scaled data. They should not assume a specific form of the relationship between each of the observed variables and the latent factor, but they should recognize the fact that “more is better” (or worse, for that matter).
The MrgL method is the first approach known to successfully cover this “middle ground”. The MrgL approach was introduced in about 1973 for use with censored data, a special case of inexact data. The Gehan/Prentice/Savage test and the Kaplan-Meier estimate for survival are widely used applications. In 1992, it was shown that this approach could be generalized to more than two variables and to metrics other than those for interval-censored data. Subsequently, early versions of the MrgL approach have been applied to assess side effects, to determine risk factors, to evaluate prevention strategies, and to measure immunogenicity. In addition, the MrgL approach has been demonstrated to allow results to be “augmented” for external, or secondary, variables in cases where information exists that might have some relevance (e.g., cost), although such information should not be allowed to overwrite evidence contained in the primary variables (treatment effectiveness or side effects).
In its present form, however, the MrgL approach is not practically useful. It is crucial to give more weight to observations with higher information content (“precision”). Within the linear model, the Fisher information is generally used to achieve this. For replications (unstructured, exchangeable observations), the Fisher information is 1.0 divided by the variance among the replications. With inexact ordinal data, similar differences in information content exist. Observations are more informative if their ordinal relation to other observations is better defined. Thus, data that are “identical” may be more informative than data that are merely “similar”. While the lack of such differentiation has recently been resolved for the special case of the simplest test for ordinal data (the sign test) and acknowledged for the well-known Wilcoxon-Mann-Whitney test, a more general solution for dealing with inexact data is still lacking. The outline of such a solution has been initially described, allowing inexact observations to be assigned a lower weight. In some cases, however, the proposed estimates underestimate information content, because some ambiguity may not result in a loss of information with regard to the intended method of aggregation. As a result, the method, as it was originally introduced, suffered severe limitations.
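The weighting by Fisher information that is described above for the linear model can be sketched as follows. This is a minimal illustration under stated assumptions: the function names `information` and `weighted_mean` and the data are hypothetical, and the sample variance is used as the variance among the replications. It covers only the linear-model case, not inexact ordinal data, for which the text notes that a general solution is lacking.

```python
from statistics import mean, variance

def information(replications):
    """Information content as 1.0 divided by the sample variance."""
    return 1.0 / variance(replications)

def weighted_mean(groups):
    """Combine group means, weighting each by its information content."""
    weights = [information(g) for g in groups]
    means = [mean(g) for g in groups]
    return sum(w * m for w, m in zip(weights, means)) / sum(weights)

# Hypothetical replications: one precise and one noisy set of measurements.
precise = [10.0, 10.2, 9.8]   # sample variance approx. 0.04
noisy = [8.0, 12.0, 13.0]     # sample variance 7.0

print(information(precise))   # approx. 25
print(information(noisy))     # approx. 0.143
print(weighted_mean([precise, noisy]))  # close to the precise mean of 10
```

The noisy set has the higher mean (11), but because its information content is far lower, the combined estimate stays near the mean of the precise set.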
Furthermore, the MrgL approach may lead to methods of extreme computational complexity. The rate by which this complexity grows when the number of objects increases outpaces by far the advances in computer technology to be expected within the foreseeable future.
A different approach for the analysis of ordinal data, based on U-statistics, has been applied to a special case of inexact ordinal data, namely interval-censored ordinal data. The UStat approach, however, has not been extended to more general multivariate data. Moreover, no UStat method is currently available for estimating information content, even for interval-censored data. Finally, although this approach is less computationally intensive, it is also less efficient, because it does not utilize all information.
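A minimal sketch of pairwise scoring of the kind used with U-statistics for interval-censored data is given below. It is an illustration under stated assumptions rather than a specific published method: each observation is assumed to be a closed interval (low, high), two observations are treated as ordered only if their intervals do not overlap, and the names `pairwise_sign` and `u_scores` are hypothetical.

```python
def pairwise_sign(x, y):
    """Return +1 if interval x lies entirely above y, -1 if entirely
    below, and 0 if the intervals overlap (order cannot be decided)."""
    x_lo, x_hi = x
    y_lo, y_hi = y
    if x_lo > y_hi:
        return 1
    if x_hi < y_lo:
        return -1
    return 0

def u_scores(intervals):
    """Score each observation by summing its pairwise signs against
    all other observations."""
    return [
        sum(pairwise_sign(x, y) for j, y in enumerate(intervals) if j != i)
        for i, x in enumerate(intervals)
    ]

# Hypothetical interval-censored observations; (9, 9) is an exact value.
observations = [(1, 2), (3, 5), (4, 8), (9, 9)]
print(u_scores(observations))  # [-3, 0, 0, 3]
```

The two overlapping intervals (3, 5) and (4, 8) contribute nothing to each other's score, which illustrates why such a scheme is computationally cheap (one comparison per pair) but does not utilize all information.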
Even if a valid information content estimate could be found and the computational difficulties could be overcome, using MrgL, UStat, or other intrinsically valid approaches, several problems resulting from the conceptual complexity of dealing with inexact (multivariate) data would need to be resolved.
First and foremost, trying to decide which objects in a population are most “similar” to a given entity poses additional problems, when “distance” cannot be defined as the absolute value of a difference, because, with ordinal data, “difference” in itself has no straightforward meaning.
Further, when variables are exchangeable (independent, identically distributed measurements, e.g., replications), the conventional methods for ordinal variables, which start out by comparing variables individually, cannot be applied. As sums have no meaning for ordinal data, the “distribution” of interchangeable observations also cannot be characterized by the mean (x̄) and the standard deviation (SD), as in the linear model.
Finally, the majority of the foregoing methods have dealt with comparing two or more populations, or with positioning an entity within a single population, situations where the strategies for analyzing exact data, i.e., univariate ordinal data or multivariate linear data, can be directly generalized. This, however, is not always the case. With the well-known Kruskal-Wallis test, for instance, which compares more than two groups of ordinal data, the results of pair-wise comparisons depend on the observations in other groups. When one tries to determine which of several categories an object belongs to, an even more severe problem arises. With exact data, it is sufficient to compare the object with entities from one population at a time. With inexact data, however, information from the other population(s) could be used to reduce the level of “inexactness” when comparing an object with any of these populations. This problem has never been addressed and, consequently, it has never been suggested how to define the position of an entity in relation to one population by utilizing data from other populations.
Shortcomings of Currently Used Decision Process when Applied to Multivariate Ordinal Data
Situations in which categories need to be ranked with respect to their exigency, based on multivariate results in a test entity, are frequent. One example is the choice of a diagnosis for a patient. Traditionally, such decisions are based on comparing the patient's value in each variable individually against a published “normal range” derived from a “standard” population of “controls” (healthy individuals). Frequently, these ranges are determined as the mean (x̄) ±2 times the standard deviation (SD) of the empirical distribution among the controls. Depending on which observed variables exceed their normal ranges, the decision maker (the physician) determines that the entity (patient) belongs to a specific category (of disease) in which such observations are expected, often employing subjective criteria to pick one of several categories. There are several inherent problems:
(1) Characterizing empirical distributions by ranges x̄±2×SD is valid only if the corresponding theoretical distribution is Gaussian, an assumption which is inappropriate for the majority of variables in fields such as medicine, biology, genetics, and sociology.
(2) A single “standard” reference interval is unlikely to be optimal for all entities.
(3) Addressing specificity only, i.e., ignoring the distribution of a variable among the cases in either category (sensitivity), is not sufficient to even partly automate the decision process.
The above problems are even more relevant in dealing with multivariate data (each variable being point, interval, or distribution). Further:
(4) Looking at a single variable at a time is often not sufficient.
(5) The set of variables that is optimal for determining the relative position of the entity with respect to the reference populations may vary.
(6) As linear combinations cannot be meaningfully utilized to reduce multivariate ordinal data to univariate data, as within the linear model, specific problems exist that have not been addressed.
(7) To determine sensitivity and specificity for a cutoff target, it is not sufficient to compare the test entity with either population (controls and cases) separately, as in the linear model.
Shortcomings of Previously Proposed Decision Support Systems
The complexity of dealing with multivariate data has led to several generations of decision support systems (knowledge based systems, expert systems). Of the first generation, developed in the 1960s, most remained only prototypes. Even the systems of the second generation, developed in the 1970s based on recent results in the field of artificial intelligence, have failed to gain widespread acceptance because they merely tried to mimic the human decision process, rather than striving to overcome its shortcomings by utilizing advances in technology to go beyond the “heuristic” nature of the human decision process. With more information becoming available through “information technology”, the inherent problems of intuitive decision making are likely to become even more apparent. The advent of genetic, genomic, and proteomic information has further complicated the situation by increasing the number of variables relevant to diagnostic decision-making. Simply increasing the computational capacity of conceptually insufficient “expert systems” clearly cannot overcome the underlying obstacles.
In previous “expert systems”, the separation of a general purpose “inference engine” from an unstructured “knowledge base” containing a vast set of “heuristics”, and the iterative application of the former to the latter, resulted in a lack of transparency that could not be overcome with yet another component, an “explanation facility”. Since the decision maker could not understand the decision process, he also could not control it. The need to acquire knowledge as heuristic rules with subjective “certainty factors” attached not only contributed to non-transparent decisions, but also made the knowledge acquisition process difficult.