The present invention relates to a probabilistic approach to data clustering in a data mining system.
Data mining systems are employed to analyze data sets comprising a plurality of objects each object having a set of attributes. An object could for example represent a survey respondent, and object attributes could be a series of answers the respondent has provided for the survey. Each attribute can be divided into one of three types: continuous, discrete ordinal and categorical (discrete nominal).
Continuous attributes include exact, or quasi exact, answers within an ordered scale for the attribute, for example, where the respondent inserts the exact amount of their salary. Two continuous attributes can be compared for equality, they can be ordered with respect to each other, and a numerical distance measure can be made of the size of their dissimilarity.
Discrete ordinal attributes include answers divided into bins or bands within an ordered range, for example, where a question asks for a respondents age, the respondent answers that their age is in one of the bands 0-25, 25-50, or 50+ years of age, thus the information is termed partially missing in the sense that the exact attribute value cannot be determined from the information provided. Such a question could be put in the form of a question requiring an exact answer to provide a continuous value, but this is often seen as intrusive, and survey respondents may be unwilling to provide such information. Two discrete ordinal attributes can nonetheless be tested for equality, and for ordering, however a distance measure cannot always be made of the size of their dissimilarity. Hence a discrete ordinal attribute is one that is like a continuous attribute but only the ordering of the attributes is known.
It is acknowledged that an example of a discrete ordinal attribute is the number of children in a family. Here the discrete value is an exact representation of the quantity, and a distance measure can be made between two families. However, where the attribute is the price band of a house or a person""s age band as above, the underlying quantity for the discrete value is a continuous one, but the discrete value is an approximation to the real value of the house or the age of the person, and a distance measure cannot be found, because bands made be of different sizes.
Categorical or discrete nominal attributes include answers divided into unordered bands, for example, house location or make or colour of car. Two categorical attributes can only be checked for equality, ordering has no meaning.
Data mining systems are known which analyze data sets to define one or more non-probabilistic clusters. These clusters can then be used to test individual objects or other data sets against the pre-defined clusters. FIG. 1 shows an example where three clusters C1 . . . C3 have been pre-defined for a data set including salary and age attributes. This data set although extremely simple could be used, for example, by a financial institution to decide whether a respondent is a suitable candidate for a loan. The model can be used to test the answers provided by respondents x, y and z regardless of whether they are continuous or discrete. The clusters are non-probabilistic in that respondents are considered to either lie inside or outside a cluster, that is, they have a probability of either 0 or 1 of being in a cluster. Using such a non-probabilistic (NP) analysis, the respondent z would be seen to lie inside cluster 1, whereas the respondents x and y would be seen not to lie in any cluster. The results of the analysis are displayed in FIG. 2, which illustrates that the only information gained from the respondents is that the respondent z is definitely in cluster C1, whereas the other respondents definitely do not fit any cluster. It will be seen, however, that the respondent x is much closer to fitting into the model than the respondent y, yet this information is lost in the analysis.
It is acknowledged that some non-probabilistic data mining systems can define clusters to fill all possible object attribute values, rather than the above example where some of the space if left undefined. In such a case, the points x and y, rather than not lying in any cluster, would be deemed to lie in a cluster for which they are really not appropriate.
Furthermore, because the non-probabilistic approach can not properly take into account xe2x80x9chow farxe2x80x9d a respondent may be from lying in or outside a cluster, it has little possibility of detecting very unusual responses which may be indicative of fraud. Non-probabilistic data analysis systems try to overcome this problem by defining xe2x80x9cfraudulentxe2x80x9d clusters, but because fraudulent activity is relatively rare, definition of such clusters is difficult.
Probabilistic approaches to data analysis are known. In speech recognition information is provided as a continuous data in the form of amplitude or frequency information, and a speech processor may need to allocate a probability of between 0 and 1 that a speech pattern matches one or more phonemes. Similarly, a probabilistic approach has been used in relation to categorical data in for example clinical research, where respondents provide information in relation to types of symptoms experienced.
However, a probabilistic approach has not been employed properly in the field of data mining particularly because of the difficulties and shortcomings in dealing with discrete ordinal attributes.
In a first aspect, the present invention provides a component of a data clusterer adapted to determine a conditional probability density of an object lying in a cluster; said object having a discrete ordinal attribute value within a finite range of attribute values, said conditional probability density for said discrete ordinal attribute being a function of an integral of a conditional probability function across a sub-range of said discrete ordinal attribute range of values, said sub-range comprising an upper bound and a lower bound bounding said discrete ordinal attribute value.
In a second aspect, the present invention provides a data mining system adapted to generate a cluster model from a data set comprising a plurality of objects, each object including a plurality of attributes, said attributes including a set of discrete ordinal attributes, said system including an iterative cluster definition means, the or each cluster having a distribution attribute associated with each of said set of discrete ordinal attributes, said cluster definition means including: means for determining, for each cluster, a conditional probability density (pj(x,z,q)) of an object lying in a cluster; means for determining, for each cluster and for each object, a posterior probability (hij) of an object lying in a cluster, said posterior probability being a function of said conditional probability density of the cluster (pj(x,z,q)), a mixing fraction for said cluster (xcex1j) and an unconditional probability density (p(x,z,q)); and means for determining, for each object attribute and for each cluster, a next cluster distribution attribute (xcexcjk,Vjk; vjk,Wjk; xcfx80jk, cjk), said distribution attribute being a function of said posterior probability, said object attribute value and a sum of said posterior probabilities; wherein said means for determining the conditional probability density of an object lying in a cluster is characterised by means for determining the conditional probability density of an object having a discrete ordinal attribute value within a finite range of attribute values lying in a cluster, said conditional probability density for said discrete ordinal attribute being a function of an integral of a conditional probability function across a sub-range of said discrete ordinal attribute range of values, said sub-range comprising an upper bound and a lower bound bounding said discrete ordinal attribute value.
After the generation of the cluster model, the present invention can be used for further analysis on new data sets:
to generate a cluster membership value, which is the probability of an object having given attributes belonging to a cluster.
to generate confidence or novelty (=1xe2x88x92confidence) values associated with an object, that indicates how likely it is that the object comes from the same distribution as the training data set.