The present invention relates to methods for exploratory analysis of categorical data. More specifically, the invention is a method for generating analyses of categorical data that will allow the application of exploratory multivariate analysis procedures.
A categorical measurement on an object is a measurement that takes one of a set of known, fixed values, but has a discontinuous relationship with a previous or next measurement. For example: an observation as to whether a switch is "on" or "off" is a categorical measurement; the answer to each question in a political poll or other survey is a categorical measurement. Clock, calendar, and angle measure are also categorical data inasmuch as there are discontinuities, for example 60 minutes per hour, leap year, and 60 minutes per degree.
In addition to clinical and survey data [the "multiple choice" parts of a survey (as opposed to the free text)], other sources of categorical data include but are not limited to data mining, patents, warranty cards, and combinations thereof. Much of the data that are often the subject of "data mining" (e.g. for marketing) are categorical (e.g. income level, age bracket, favorite sports and hobbies). However, the data sets to be analyzed in some data mining applications are of a much larger scale than the anticipated size of clinical trials data sets. Patents, thought of as data, contain significant categorical data, and significant data of other types.
Table 1 shows a typical matrix arrangement of categorical data. For definiteness and convenience the data are discussed as though they are obtained as the result of a survey, poll, or questionnaire of multiple choice questions. In the table, each object (or individual) responds to the 4 questions. Possible values are shown for two of the objects; the answers for the first 3 questions are listed in a manner suggesting some character-coded response. The fourth question is listed as though the response is one of a finite list of positive whole numbers. Note that different questions can have different numbers of allowable answers and different coding schemes.
The current general strategy for summarizing categorical data is to model all of the outcomes of a question (e.g. Q1) as representing outcomes from a single probability distribution, for example a multinomial. Previously, categorical data have been difficult to use for exploratory cause-effect analysis. Most often a query or hypothesis is posed, and categorical data are collected and tested statistically to confirm or deny the query or hypothesis. Further treatments of such data (see references [1], [2], and [3]) concentrate largely on describing classes of probabilistic models that might explain or fit the data; the resulting models are then used to confirm whether suspected effects exist. Some methodology for exploratory analysis of categorical data is presented in [4]; these methods focus on calculating optimized encodings of categorical (and other) data.
However, categorical data may contain useful information, supporting a second hypothesis if you will, beyond the data needed to address the first hypothesis, which would not be recognized by methods focused on the first hypothesis. For example, clinical treatments, designed for a particular purpose, sometimes have desirable side effects. Discovering beneficial side effects and the conditions under which they occur can lead to medically and economically significant pharmaceutical products. Isolating detrimental side effects and the conditions under which they occur is also clinically useful. Data relevant to uncovering these side effects arise from clinical trials when a patient's symptoms and associated properties, either elicited or reported to the health care provider, are encoded into standard classes.
Work with similar intent, that is, retrieving objects similar to a specified object, or summarizing the relations among objects (but using data of a different type), has long been underway in the information retrieval community [5], [6]. However, the data in these works are unstructured text.
Hence there is a need for a method of handling categorical data in a manner that permits identification of additional hypotheses and relationships in the data.
Background References
[1] Y. M. M. Bishop, S. E. Fienberg, and P. W. Holland. Discrete Multivariate Analysis: Theory and Practice. MIT Press, 1975.
[2] Alan Agresti. Categorical Data Analysis. John Wiley and Sons. 1990.
[3] N. E. Breslow and N. E. Day. Statistical Methods in Cancer Research. IARC Scientific Publications No. 32. 1980.
[4] George Michailidis and Jan de Leeuw. "GIFI System of Descriptive Multivariate Statistics." Statistical Science, 13(4): 307-336, 1998.
[5] S. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman. "Indexing by latent semantic analysis." Journal of the American Society for Information Science, 41(6): 391-407, 1990.
[6] Howard R. Turtle and W. Bruce Croft. "A comparison of text retrieval models." Computer Journal, 35(3): 279-290, Jun. 1992.
The present invention provides a method of generating analyses of categorical data that will allow the application of exploratory multivariate analysis procedures constructed from inner products, distances, vector additions, and scalar multiplications to said categorical data having a plurality of responses. The method comprises the steps of encoding the categorical data to provide a plurality of probability distribution representations, transforming exploratory multivariate analysis procedures based on inner products, distances, vector additions, and scalar multiplications to work with the probability distribution representations, and applying the transformed exploratory multivariate analysis procedures to the probability distribution representations to allow browsing, retrieving, and viewing of said converted categorical data.
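The procedures named above are built from inner products and distances, so it may help to see how such quantities could be computed on vectors of probability distributions. The following is a minimal sketch under stated assumptions, not the method as claimed: it assumes the ordinary Euclidean inner product and distance, summed over questions, and the names `DistVector`, `inner_product`, and `distance` are hypothetical.

```python
import math
from typing import List

# Hypothetical type: one discrete distribution (list of probabilities)
# per question, in a fixed question order.
DistVector = List[List[float]]

def inner_product(x: DistVector, y: DistVector) -> float:
    """Sum, over questions, of the dot product of the two distributions.

    Assumption: the ordinary Euclidean inner product is used.
    """
    return sum(p * q for dx, dy in zip(x, y) for p, q in zip(dx, dy))

def distance(x: DistVector, y: DistVector) -> float:
    """Euclidean distance between two vectors of distributions (assumed metric)."""
    return math.sqrt(
        sum((p - q) ** 2 for dx, dy in zip(x, y) for p, q in zip(dx, dy))
    )

# Two objects answering a 3-choice question and a 2-choice question.
# They agree on the first question and differ on the second.
a = [[1.0, 0.0, 0.0], [0.0, 1.0]]
b = [[1.0, 0.0, 0.0], [1.0, 0.0]]

similarity = inner_product(a, b)
separation = distance(a, b)
```

Under these assumptions the inner product counts the questions on which the two objects gave the same answer, and the distance grows with the number of questions on which they differ.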
Whereas previously, each response to a question might have been modeled as an outcome from a multinomial probability distribution, according to the present invention each response is represented as a probability distribution. With this encoding or conversion, the vector of measurements for each individual can be viewed as a member of the linear space that includes vectors of probability distributions.
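The encoding described above can be sketched as follows. This is only an illustration under an assumption the text does not state: each observed response is encoded as a degenerate distribution that places all of its mass on the answer given. The question names, answer codes, and the helpers `encode_response` and `encode_object` are hypothetical.

```python
from typing import Dict, List

# Hypothetical survey: each question has its own list of allowable answers,
# and different questions may have different numbers of answers.
ALLOWED: Dict[str, List[str]] = {
    "Q1": ["A", "B", "C"],
    "Q2": ["Y", "N"],
    "Q4": ["1", "2", "3", "4"],
}

def encode_response(question: str, answer: str) -> List[float]:
    """Encode one categorical answer as a distribution over allowed answers.

    Assumption: a degenerate (one-hot) distribution with all probability
    mass on the observed answer.
    """
    choices = ALLOWED[question]
    if answer not in choices:
        raise ValueError(f"{answer!r} is not an allowed answer to {question}")
    return [1.0 if choice == answer else 0.0 for choice in choices]

def encode_object(responses: Dict[str, str]) -> Dict[str, List[float]]:
    """Encode all of one object's responses as a vector of distributions."""
    return {q: encode_response(q, a) for q, a in responses.items()}

# One object's answers to three of the questions.
encoded = encode_object({"Q1": "B", "Q2": "N", "Q4": "3"})
```

Each entry of `encoded` sums to one, so the object's measurements form a vector of probability distributions, one per question.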
An advantage of the present invention is that existing methods for representing and manipulating numerical data can be adapted for the converted categorical data. In other words, the representation of categorical data as vectors of discrete probability distributions allows the use of standard clustering, projection, and/or visualization algorithms. A collection of vectors of probability distributions can be used to create a linear space by the standard method of taking all linear combinations of the vectors of probability distributions. The present invention has the further advantage of permitting identification of more than one hypothesis from a categorical data set.
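As an illustration of taking linear combinations of vectors of probability distributions, the sketch below forms the equally weighted average of a group of encoded objects, which is the kind of centroid a standard clustering algorithm would compute. It is a hedged example only: the helper names `scale`, `add`, and `centroid` are hypothetical, and the equal weighting is an assumption.

```python
from typing import List

# Hypothetical type: one discrete distribution per question.
DistVector = List[List[float]]

def scale(c: float, x: DistVector) -> DistVector:
    """Scalar multiplication, applied to every probability in the vector."""
    return [[c * p for p in dist] for dist in x]

def add(x: DistVector, y: DistVector) -> DistVector:
    """Vector addition, question by question and answer by answer."""
    return [[p + q for p, q in zip(dx, dy)] for dx, dy in zip(x, y)]

def centroid(vectors: List[DistVector]) -> DistVector:
    """Equally weighted linear combination (mean) of the given vectors."""
    total = vectors[0]
    for v in vectors[1:]:
        total = add(total, v)
    return scale(1.0 / len(vectors), total)

# Two objects: they differ on the first question and agree on the second.
group = [
    [[1.0, 0.0], [0.0, 1.0, 0.0]],
    [[0.0, 1.0], [0.0, 1.0, 0.0]],
]
center = centroid(group)
```

Because the weights are non-negative and sum to one, each component of `center` is again a probability distribution; general linear combinations need not be, which is why the linear space is larger than the set of distribution vectors itself.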
The data are represented and treated so that a visual, exploratory analysis of the data becomes possible. The present invention effectively permits the data to suggest hypotheses by virtue of the distribution encoding and adaptation of existing exploratory analysis methods.
One object of the present invention is to provide a method for clustering objects based on categorical measurements taken on objects and/or responses.
Another object is to provide a method for trending objects and/or responses.
It is a further object to provide a method for trending objects and/or responses based on categorical measurements taken on said objects and/or responses.
Yet another object is to provide a method for segmenting a sequence or time series of objects and/or responses.
It is a further object to provide a method for segmenting a sequence or time series of objects and/or responses based on categorical measurements taken on said objects and/or responses.
Another object is to provide a method for classifying objects and/or responses.
It is a further object to provide a method for classifying objects and/or responses based on categorical measurements taken on said objects and/or responses.
Yet another object is to provide a method for detecting relatedness and periodic patterns in sequences of objects and/or responses.
It is a further object to provide a method for detecting relatedness and periodic patterns in sequences of objects and/or responses based on categorical measurements taken on said sequence of objects and/or responses.
Another object is to provide a method for generating continuous, numeric data from cluster relationships.
Yet another object is to provide a method for encoding vectors of categorical data embodied in the Java language.
Another object is to provide a method that allows a novel, holistic view of the objects and/or responses on which the categorical data measurements were taken.
Yet another object is to provide a method for generating categorical data vectors that allow the generalization of standard clustering, projection and ultimately visualization algorithms to said categorical data.
Another object is to permit the use of available methods, including but not limited to any technique that does not depend directly on fitting a probability distribution, such as by maximum likelihood estimation.
Still another object is to provide a method of handling categorical data in a manner that permits identification of additional hypotheses and relationships in the categorical data.
The subject matter of the present invention is particularly pointed out and distinctly claimed in the concluding portion of this specification. However, both the organization and method of operation, together with further advantages and objects thereof, may best be understood by reference to the following description taken in connection with accompanying drawings wherein like reference characters refer to like elements.