The first statistical methods for ordinal univariate data were developed between 1935 and 1955 and have subsequently been extended to cover special cases of multivariate data, including, but not limited to, interval and approximate data. Some of these methods were based on conventional u-statistics and on the well-known Marginal Likelihood (MrgL) principle.
The first results had a limited range of applications and were rarely used, due to deficiencies in the theory and lack of computationally efficient algorithms. An earlier application 20030182281 (‘Statistical Methods For Multivariate Ordinal Data Which Are Used For Data Base Driven Decision Support’), which is incorporated herein by reference, provided for a method that allowed for more efficient algorithms, yet the method did not cover an important aspect, namely how to incorporate knowledge about relationships between variables or subsets of variables.
This application extends the earlier application 20030182281 by providing a method to construct partial orderings that a decision maker can use when the variables are structured hierarchically, i.e., when variables can be grouped into subsets, with the variables in each subset known to be related to a common feature of the latent factor. The proposed application allows for recursive subdivisions, i.e., each subset can in turn be subdivided into further subsets.
Shortcomings of Currently Available Statistical Methods When Used with Ordinal Data
Most statistical analysis programs are based on the linear model, not because the assumption of linearity is realistic, but mainly because it leads to computational simplicity. When applied to multivariate data, the linear model comprises the use of linear combinations (i.e., weighted averages) of the variables (e.g., 10 times expression of gene A plus 2 times expression of gene B minus the logarithm of the expression of gene C). In biological, psychological, and genomic applications, however, the relationship between the measurement (body temperature, IQ, gene expression) and its meaning (fever, social status, immunity) is rarely linear, but usually merely ordinal. An increase in body temperature (observed variable) by 2° C. from 35° C. to 37° C., for instance, is usually irrelevant with respect to fever (latent factor), while an increase by the same amount of 2° C. from 41° C. to 43° C. is lethal. Thus, body temperature as an indicator of fever is ‘ordinal’, i.e., the sign (direction) of the difference is important, yet the magnitude has no consistent meaning. In such situations, i.e., when a linear combination may not be meaningful, even after some transformations, the use of linear models is questionable at best.
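The contrast between a linear score and a purely ordinal comparison can be illustrated with a small sketch. The two objects, their values, and the helper names `linear_score` and `componentwise_signs` are hypothetical and chosen only for illustration. The sketch shows that a monotone transformation (here, the logarithm) of one variable can reverse the order implied by a sum, while the direction of each per-variable difference is unaffected.

```python
import math

# Two hypothetical objects measured on two variables (illustrative values only).
A = (1.0, 10.0)
B = (3.0, 3.0)

def identity(v):
    return v

def linear_score(obj, transform=identity):
    """Sum of the first variable and the (optionally transformed) second variable."""
    return obj[0] + transform(obj[1])

def componentwise_signs(a, b, transform=identity):
    """Sign of a - b for each variable; only the direction of the difference is used."""
    def sign(d):
        return (d > 0) - (d < 0)
    return (sign(a[0] - b[0]), sign(transform(a[1]) - transform(b[1])))

# Order implied by the linear score on the raw scale and after a log transform.
raw_order = linear_score(A) > linear_score(B)
log_order = linear_score(A, math.log) > linear_score(B, math.log)

# The ordinal information is the same on both scales.
signs_raw = componentwise_signs(A, B)
signs_log = componentwise_signs(A, B, math.log)

print(raw_order, log_order, signs_raw, signs_log)
```

On the raw scale the sum ranks A above B, after the log transform it ranks B above A, whereas the per-variable signs stay the same under both scales.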
A different class of approaches comprises the use of models for categorical data, where data are interpreted on a ‘nominal’ scale, i.e., where any order between categories is assumed irrelevant. Non-limiting examples are colors, races, but also disease codes. For these models to be applicable, however, continuous variables need to be discretized, which typically introduces arbitrariness and leads to loss of information on the order of the categories, which is clearly undesirable in many applications.
The need for methods capable of handling multivariate monotonic data has been widely acknowledged (see, for instance, US Patent Application 20060149710), yet the lack of practical methods for dealing with ordinal data has led to linear model methods being applied to ordinal data. External ‘validation’ is then used to justify a conceptually invalid approach. For such ‘validation’ to be at least a reasonably good approximation, however, one needs an independent set of entities where the latent factor is known (often called a ‘gold standard’), against which the various linear score functions can be compared. Aside from being conceptually questionable, external validation poses several technical problems. While methods based on the linear model are relatively simple for a given transformation, the evaluation of many possible ‘linearizing’ transformations is computationally challenging. Moreover, collecting data from entities with similar characteristics and known conditions may prove time consuming. If the population considered is relatively ‘unique’, similar entities can be difficult to find.
Methods for multivariate ordinal data should ideally lie somewhere between linear models for interval scaled data and categorical models for nominally scaled data. They should not assume a specific form of the relationship between each of the observed variables and the latent factor, but recognize the fact that ‘more is better’ (or worse, for that matter).
The MrgL approach was introduced about 1973 for use with censored ordinal data and has been applied to assess side effects, determine risk factors, evaluate prevention strategies, and measure immunogenicity. However, the MrgL approach may lead to methods of extreme computational complexity.
The methods of the earlier application 20030182281 generalized a related approach based on u-statistics to multivariate ordinal data. While the earlier application 20030182281 provided a method to measure information content, it did not resolve a common problem which causes loss of information content when u-scores for multivariate data are applied to situations involving many variables. u-scores rely on determining all pairwise orderings between objects. For univariate data, this ordering is ‘complete’, i.e., the order between any two objects (A, B) can be determined as either A<B, A=B, or A>B. For multivariate data this ordering may only be ‘partial’. If one variable is higher in object A, and another variable is higher in object B, the order between the two objects is ambiguous (A˜B). As any two discordant variables cause the ordering between two objects to be ambiguous, information content, i.e., the number of unambiguous orderings, decreases as the number of variables increases.
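The loss of decidable orderings described above can be sketched in a few lines. This is a minimal illustration, not the method of the earlier application: it assumes complete data (no missing values), assumes that a u-score is the number of objects known to be below minus the number known to be above, and uses hypothetical names (`compare`, `u_scores`) and made-up data.

```python
from itertools import combinations

def compare(a, b):
    """Return 1 if a > b, -1 if a < b, 0 if a == b, None if the order is ambiguous.

    a and b are tuples holding the same variables in the same positions.
    """
    signs = {(x > y) - (x < y) for x, y in zip(a, b)}
    if 1 in signs and -1 in signs:
        return None  # discordant variables: order cannot be decided
    if 1 in signs:
        return 1
    if -1 in signs:
        return -1
    return 0

def u_scores(objects):
    """Return (scores, decided): u-score per object and number of decided pairs."""
    scores = dict.fromkeys(objects, 0)
    decided = 0
    for p, q in combinations(objects, 2):
        c = compare(objects[p], objects[q])
        if c is None:
            continue
        decided += 1
        scores[p] += c
        scores[q] -= c
    return scores, decided

# Hypothetical data: four objects, three variables each.
data = {"A": (1, 2, 3), "B": (2, 3, 1), "C": (3, 4, 4), "D": (0, 1, 0)}

# Using only the first variable, the ordering is complete.
uni_scores, uni_decided = u_scores({k: v[:1] for k, v in data.items()})

# Using all three variables, the pair (A, B) becomes ambiguous.
multi_scores, multi_decided = u_scores(data)

print(uni_decided, uni_scores)
print(multi_decided, multi_scores)
```

With one variable all six pairs are decided. With three variables, A and B are discordant, so one ordering is lost and both receive the same score.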
The earlier application 20030182281 resolved this problem for two special cases where dimensionality can be reduced through amalgamation of some variables or through the use of specific multivariate partial orderings. Non-limiting examples of such amalgamating functions comprise the average, the median, and test statistics. Non-limiting examples of specific orderings comprise those specifically designed for intervals (the special case considered by Gehan 1965), interchangeable variables, probe pairs on Affymetrix GeneChips, and diplotypes.
The current application provides for a more general method to ameliorate the loss of information content with increasing numbers of variables. It applies to all situations where variables can be structured in a hierarchical fashion, thereby allowing a decision maker to obtain results from multivariate ordinal data in situations where the method of the earlier application 20030182281 might suffer from or fail due to low information content.
The preferred method addresses this problem by providing a generic method that allows for some of the variables to be grouped in a hierarchical fashion. For instance, one may have collected information about several features of the latent (unobservable) factor. In clinical trials, such features might be safety (adverse effects) or efficacy (desired effects). When assessing quality of life, as another non-limiting example, one may have several variables for each of the following features: ‘quality of sleeping’, ‘sexual desire’, and ‘job satisfaction’ (first hierarchical level). One may then have measured each variable twice, the first time before and the second time after an intervention, aiming at assessing the magnitude of the intervention's effect (second level). Finally, one may have measured each variable at each time point repeatedly to reduce the effect of measurement errors (third level). In genomics, as yet another example, one may have observed changes in the expression of genes known to act along different genomic ‘pathways’ (where the genes acting along the same pathway comprise the first hierarchical level), each gene then is typically represented by several specific sequences of base pairs (which comprise the second hierarchical level), and each sequence is accompanied by a ‘mismatch’ to determine specificity (the pair comprising the third hierarchical level).
The gain of information by using a hierarchical structure among the variables is illustrated in FIG. 7. In this non-limiting example, it is assumed that variables X1 and X2 are related to one feature of the latent factor, while variables Y1 and Y2 are related to another feature of the same latent factor.
Referring to FIG. 7, Hasse diagrams are used to illustrate which pairwise orderings can be decided. The first row of Hasse diagrams illustrates the univariate partial ordering for each of four variables. Pairs of observations whose order can be decided as being either ‘>’, ‘<’, or ‘=’ are connected by lines, with horizontal lines indicating ‘=’. For instance, object B is larger than object A with respect to variable X1 (2>1), but of the same order as object A with respect to variable Y1 (3=3). For object C, the data for variable X1 is missing (‘?’) and, thus, object C is not connected with any other object in the Hasse diagram generated by variable X1.
The second row illustrates the corresponding Hasse diagrams for the case where the variables are grouped as (X1, X2) and (Y1, Y2), with the pairwise orderings being obtained using the method described in the earlier application 20030182281.
The third row illustrates the difference between the non-hierarchical method described in the earlier application 20030182281 (left side) and the preferred hierarchical method (right side). If the hierarchical nature of the variables' relation to the latent factor were ignored, only three pairwise orderings could be decided: A>C, B>C, B>D, resulting in u-scores U′=(1, 2, −2, −1), depicted in the lower left diagram of FIG. 7. However, if the relation of the variables to the two features of the latent factor is accounted for, one obtains the additional pairwise ordering A>D, resulting in a more informative lattice (four vs. three pairwise orderings decided) yielding the u-scores U=(2, 2, −2, −2).
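The u-scores for the two lattices can be tallied directly from the decided orderings listed above. The sketch below assumes that a u-score is the number of objects known to be below an object minus the number known to be above it. The function name `u_scores_from_orderings` is hypothetical, and the orderings are taken as given rather than derived from the data of FIG. 7.

```python
def u_scores_from_orderings(objects, greater_than):
    """Tally u-scores from a list of decided (higher, lower) pairs."""
    scores = {o: 0 for o in objects}
    for higher, lower in greater_than:
        scores[higher] += 1  # one more object known to be below
        scores[lower] -= 1   # one more object known to be above
    return [scores[o] for o in objects]

objects = ["A", "B", "C", "D"]

# Orderings decided when the hierarchy is ignored: A>C, B>C, B>D.
non_hierarchical = [("A", "C"), ("B", "C"), ("B", "D")]

# The hierarchical method additionally decides A>D.
hierarchical = non_hierarchical + [("A", "D")]

u_flat = u_scores_from_orderings(objects, non_hierarchical)
u_hier = u_scores_from_orderings(objects, hierarchical)

print(u_flat)  # [1, 2, -2, -1]
print(u_hier)  # [2, 2, -2, -2]
```

Each decided ordering adds one to the higher object and subtracts one from the lower, so the scores always sum to zero. The single extra ordering A>D changes the scores of A and D by one each.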
As a related, yet also inferior approach, the method of the earlier application 20030182281 could be applied first to score related variables and then applied again to the scores obtained in the first step, possibly using different partial orderings. This process of rescoring subsets of scores could then be repeated until the data is reduced to a single score. Unfortunately, this approach does not necessarily provide the most informative scores. As can be seen in FIG. 7, subject D with (UX+, UY+)=(−0.5, −1.5) (bottom of center row) would be considered larger (‘>’) than subject C with (UX+, UY+)=(−1.5, −1.5), while the diagram provided in the lower right of FIG. 7 demonstrates that the hierarchical order between subjects C and D is undetermined.
With the non-hierarchical method of the earlier application 20030182281, an ambiguous order between two subjects (A˜B) in any group of related variables results in the pairwise order between the two subjects being declared ‘ambiguous’, even if subject A is superior to subject B in all other groups. The preferred approach allows for groups of variables to be defined in a hierarchical fashion, such that some of the ambiguities that would be created with the method of the earlier application 20030182281 can be resolved. As the preferred approach operates on intermediate lattices (which could be depicted as Hasse diagrams), rather than intermediate scores, it does not ignore ambiguities, as merely applying the method of the earlier application 20030182281 in a hierarchical fashion would.
Shortcomings of Currently Used Decision Process When Applied to Multivariate Ordinal Data
Situations where categories need to be ranked with respect to their exigency based on multivariate results in a test entity are frequent. One non-limiting example is the decision of a diagnosis in a patient. Traditionally, such decisions are based on comparing the patient's value in each variable individually against a published ‘normal range’ derived from a ‘standard’ population of ‘controls’ (healthy individuals). Frequently, these ranges are determined as the mean (x̄) ± 2 times the standard deviation (SD) of the empirical distribution among the controls. Depending on which observed variables exceed their normal ranges, the decision maker (the physician) determines that the entity (patient) belongs to a specific category (of disease) in which such observations are expected, often employing subjective criteria to pick one of several categories. There are several inherent problems:
    (1) Characterizing empirical distributions by ranges x̄ ± 2×SD is valid only if the theoretical distribution is Gaussian, an assumption which is inappropriate for the majority of variables in fields including medicine, biology, genetics, and sociology.
    (2) A single ‘standard’ reference interval is unlikely to be optimal for all entities.
    (3) Addressing specificity only, i.e., ignoring the distribution of a variable among the cases in either category (sensitivity), is not sufficient to even partly automate the decision process.
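Problem (1) can be illustrated with a short sketch of the traditional range calculation. The control values below are invented for illustration and deliberately skewed, and the function names `normal_range` and `outside_range` are hypothetical.

```python
import statistics

def normal_range(controls, k=2.0):
    """Traditional 'normal range': mean +/- k sample standard deviations."""
    mean = statistics.mean(controls)
    sd = statistics.stdev(controls)
    return mean - k * sd, mean + k * sd

def outside_range(value, limits):
    """True if a value falls outside the given (lower, upper) limits."""
    lower, upper = limits
    return value < lower or value > upper

# Hypothetical, strictly positive and right-skewed measurements among controls.
controls = [1, 1, 1, 2, 2, 2, 3, 3, 4, 20]

limits = normal_range(controls)
flagged = [v for v in controls if outside_range(v, limits)]

print(limits)
print(flagged)
```

For these skewed values the lower limit is negative, which no strictly positive measurement can reach, so the range cannot flag abnormally low values at all, while a single extreme control inflates the upper limit.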
The above problems are even more relevant in dealing with multivariate data (each variable being point, interval, or distribution). Further:
    (4) Looking at a single variable at a time is often not sufficient.
    (5) The set of variables that is optimal for determining the relative position of the entity with respect to the reference populations may vary.
    (6) As linear combinations cannot be meaningfully utilized to reduce multivariate ordinal data to univariate data, as within the linear model, specific problems exist that have not been addressed.
    (7) To determine sensitivity and specificity for a cutoff target, it is not sufficient to compare the test entity with either population (controls and cases) separately, as in the linear model.
    (8) Often, different subsets of variables are related to different features. In medicine, for instance, some variables may describe clinical efficacy, while others describe side effects or quality of life.
Shortcomings of Previously Proposed Decision Support Systems
The complexity of dealing with multivariate data has led to several generations of decision support systems (also known as knowledge based systems or expert systems). Of the first generation, developed in the 1960s, most remained only prototypes. Even the second generation, developed in the 1970s, has failed to gain widespread acceptance because these systems merely tried to mimic the human decision process, rather than striving to overcome its shortcomings by utilizing advances in technology to go beyond the ‘heuristic’ nature of human decisions. With more information becoming available through ‘information technology’, the inherent problems of intuitive decision making are likely to become even more apparent. The advent of genetic, genomic, and proteomic information has further complicated the situation by increasing the number of variables relevant to diagnostic decision-making. Simply increasing the computational capacity of conceptually insufficient ‘expert systems’ clearly cannot overcome the underlying obstacles.
In previous ‘expert systems’, the separation of a general purpose ‘inference engine’ from an unstructured ‘knowledge base’ containing a vast set of ‘heuristics’, and the iterative application of the former to the latter, resulted in a lack of transparency that could not be overcome with yet a different component, an ‘explanation facility’. Similar criticism regarding the obscure algorithms underlying Google's ‘quality scores’ has recently been expressed at the Search Engine Strategies 2006 conference. When decision makers cannot understand the decision process, they cannot control it. The need to acquire knowledge as heuristic rules with subjective ‘certainty factors’ attached not only contributed to non-transparent decisions, but also made the knowledge acquisition process difficult.