1. Field
The described technology relates to data analysis and predictive systems and related methodologies. In particular, the described technology relates to customised or personalised data analysis and predictive systems and related methodologies.
2. Description of Related Technology
The concept of personalised medicine has been promoted widely in recent years through the collection of personalised databases, the establishment of new journals and societies, and publications in international journals (see, for example, refs. 1-7). Despite the surge of interest in this area, there are at present no adequate data analysis methods and systems that can produce highly accurate and informative personalised models from data.
Contemporary medical and other data analysis and decision support systems predominantly use inductive global models to predict a person's risk or the likely outcome of a disease for an individual. In US20050131847A1, for example, features are pre-processed to minimise classification error in a global Support Vector Machine (SVM) model used to identify patterns in large databases. Pre-processing may be performed to constrain the features used to train the global SVM learning system. Global modelling in general is concerned with deriving a global formula (e.g. via regression, a “black box” neural network, or a support vector machine) from the personal data of many people. The global formula is expected to perform well on any new subject, at any time, anywhere in the world. Based on this expectation, drugs may be designed to target a disease, and these drugs are assumed to be useful for everybody who suffers from this disease. When a global model is created, a set of features (variables) may usually be selected that applies to the whole problem space (e.g., all samples in the available data). However, statistics have shown very clearly that drugs developed using such global models will, on average, be effective for only around 70% of people in need of treatment, leaving a relatively large number of patients who will not benefit at all from treatment with the drug. With aggressive diseases such as cancer, any time wasted, whether because a patient is not treated or is treated with an ineffective treatment, can be the difference between life and death. In particular, it would be useful to determine from a sample taken from a patient (e.g. blood sample, tissue, clinical data and/or DNA) into what category the patient falls. This information can also be used to determine and develop treatments that will be effective at treating the remainder of the population.
It would therefore be useful to provide data analysis methodologies and systems which, based on available population data, are capable of creating models that are more useful and informative for analysing and/or assessing an individual person for a given problem. Such models should also ideally achieve a higher degree of accuracy in predicting an outcome or classification than conventional systems and methodologies.
A step towards personalised medicine and profiling may be the creation of global models that cover a whole population of data but, importantly, comprise many local models, each of them covering a cluster (neighbourhood) of similar data samples (vectors). Such models are called local learning models. Such models may be adaptive to new data. Once created, a person's information can be submitted and a personal profile extracted in terms of the closest local model, which may be based on the neighbourhood of vectors in the dataset closest to that of the subject person. Such models include evolving connectionist systems (ECOS), such as those previously developed, patented and published (Kasabov 2000, 2002 and 2007). These methods identify groups (clusters or neighbourhoods) of similar samples and develop a local model for each cluster through a machine learning algorithm; collectively, the clusters cover the whole problem space. While local learning models have been very useful for adapting to new data and discovering local information and knowledge, these methods do not select the specific subset of features and the precise neighbourhood of samples for a specific individual that would be required for true personalised modelling, for example in personalised medicine.
While inductive modelling results in the incremental creation of a global model onto which new, unlabelled data may be “mapped” through a recall procedure, transductive inference methods (transductive models) estimate the value of a potential model (function) only at a single point of the space (e.g., that of the new data vector) and utilise the information (features) of samples close in space (e.g., related to this point). This approach seems to be more appropriate for clinical and medical applications, where the focus may be not so much on the model as on the individual patient. The focus may be on the accuracy of prediction for any individual patient, as opposed to the global error associated with a global model, which merely highlights the shortcomings of an inductive approach. Thus, with a transductive approach, each individual data vector (e.g. a patient in any given medical area) obtains a customised, local model that best fits the new data, rather than a global model in which new data may be matched to a model (formula) averaged over the whole dataset, which fails to take into account specific information peculiar to individual data samples. A transductive approach therefore seems to be a step in the right direction when looking to devise personalised modelling useful in personalised medicine.
The general principle of transductive modelling can be stated as follows: for every new input vector x that needs to be processed for a classification or a prognostic task, the closest K samples, which form a new sub-data set Dx, may be derived from an existing global data set D. A new model Mx may be dynamically created from these samples. This model may then be used to calculate the output value y for the input vector x (Vapnik 1998).
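The principle above can be illustrated with a minimal Python sketch. It is not taken from the cited work: the function names (`nearest_subset`, `fit_mean_model`, `transductive_predict`), the use of plain Euclidean distance, and the trivial mean-of-outputs local model are all assumptions chosen for brevity. Any other local learner could be substituted for `fit_mean_model`.

```python
import math
from typing import Callable, List, Sequence, Tuple

# A sample is (input vector, output value).
Sample = Tuple[Sequence[float], float]


def nearest_subset(x: Sequence[float], D: List[Sample], K: int) -> List[Sample]:
    """Return D_x: the K samples of D closest to x.

    Plain Euclidean distance is an assumption made for this sketch.
    """
    return sorted(D, key=lambda s: math.dist(x, s[0]))[:K]


def fit_mean_model(Dx: List[Sample]) -> Callable[[Sequence[float]], float]:
    """Hypothetical local model M_x: always predicts the mean output of D_x."""
    mean_y = sum(y for _, y in Dx) / len(Dx)
    return lambda _x: mean_y


def transductive_predict(
    x: Sequence[float],
    D: List[Sample],
    K: int,
    fit: Callable[[List[Sample]], Callable[[Sequence[float]], float]] = fit_mean_model,
) -> float:
    """Build D_x and M_x for this single x, then evaluate M_x at x."""
    Dx = nearest_subset(x, D, K)   # neighbourhood of x
    Mx = fit(Dx)                   # model created only for this x
    return Mx(x)


# Toy data set D, invented for illustration only.
D = [((0.0, 0.0), 0.0), ((1.0, 0.0), 1.0), ((0.0, 1.0), 1.0), ((5.0, 5.0), 10.0)]
y = transductive_predict((0.2, 0.1), D, K=3)
print(y)  # mean of the outputs of the three nearest samples
```

In this sketch the distant sample at (5, 5) never influences the prediction for x = (0.2, 0.1), which is the point of building a fresh model per input vector rather than one model for all of D.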
A simple and classical transductive inference method may be the K-nearest neighbour method (K-NN), where the output value y for a new vector x may be calculated as the average of the output values of the K nearest samples from the data set Dx. In a weighted K-NN method (WKNN) the output y may be calculated based on the weighted distance of the K-NN samples to x:

$$y = \frac{\sum_{j=1}^{K} w_j y_j}{\sum_{j=1}^{K} w_j} \qquad (1)$$

where: $y$ is the output value for the sample $x$ from $D_x$; $y_j$ is the output value for the sample $x_j$ in the neighbourhood of $x$; $w_j$ is the weighted distance between $x$ and $x_j$, measured as:

$$w_j = \max(d) - [d_j - \min(d)]. \qquad (2)$$
In Eq. (2), the distance vector $d = [d_1, d_2, \ldots, d_K]$ may be defined as the distances between the new input vector $x$ and the nearest samples $(x_j, y_j)$ for $j = 1$ to $K$; $\max(d)$ and $\min(d)$ are the maximum and minimum values in $d$ respectively.
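As a worked illustration of Eqs. (1) and (2), the following Python sketch computes the WKNN weights and output from a given list of neighbour distances. The function names `wknn_weights` and `wknn_output` and the numeric values are hypothetical, chosen only to make the arithmetic easy to follow.

```python
from typing import List, Sequence


def wknn_weights(d: Sequence[float]) -> List[float]:
    """Eq. (2): w_j = max(d) - [d_j - min(d)] for each neighbour distance d_j."""
    d_max, d_min = max(d), min(d)
    return [d_max - (dj - d_min) for dj in d]


def wknn_output(d: Sequence[float], y: Sequence[float]) -> float:
    """Eq. (1): weighted average of the neighbour outputs y_j."""
    w = wknn_weights(d)
    return sum(wj * yj for wj, yj in zip(w, y)) / sum(w)


# Three neighbours (K = 3) with invented distances and class labels.
d = [1.0, 2.0, 4.0]
y_neigh = [1.0, 0.0, 1.0]

w = wknn_weights(d)              # [4.0, 3.0, 1.0]
y_hat = wknn_output(d, y_neigh)  # (4*1 + 3*0 + 1*1) / 8 = 0.625
print(w, y_hat)
```

Under Eq. (2) the closest neighbour receives the weight max(d) and the farthest receives min(d), so nearer samples contribute more to y. Note that the denominator of Eq. (1) is zero only when every distance is zero; this sketch does not guard against that case.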
In general, the distance between two q-element vectors $x$ and $z$ of the same variables may be measured as the normalised Euclidean distance, defined as follows:

$$d_{x,z} = \frac{\sqrt{\sum_{l=1}^{q} (x_l - z_l)^2}}{q} \qquad (3)$$
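A short Python sketch of Eq. (3) follows. It reads the equation as printed, with the division by q applied outside the square root; the function name `normalised_euclidean` and the example vectors are assumptions for illustration.

```python
import math
from typing import Sequence


def normalised_euclidean(x: Sequence[float], z: Sequence[float]) -> float:
    """Eq. (3): Euclidean distance between x and z divided by the number of variables q."""
    if len(x) != len(z):
        raise ValueError("x and z must have the same number of variables")
    q = len(x)
    return math.sqrt(sum((xl - zl) ** 2 for xl, zl in zip(x, z))) / q


# q = 2: sqrt(4^2 + 3^2) = 5, divided by q gives 2.5
dist = normalised_euclidean([0.0, 3.0], [4.0, 0.0])
print(dist)
```

Dividing by q keeps distances comparable when different numbers of variables are used, which matters once the number of variables is itself something to be selected.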
In another classification method, called WWKNN, not only may the nearest samples be weighted based on their distance to the new sample x, but the contribution of each of the variables may also be weighted based on its importance for the nearest-neighbour area of x (Kasabov 2007).
The KNN, WKNN and WWKNN methods use a single formula to calculate the output y for the input vector x based on the K nearest neighbours. These methods do not suggest how to select the number K and the most suitable set of K nearest samples, nor do they suggest how to select the number of variables V that would give the best accuracy for each personalised model Mx. Instead, these methods use a fixed number K of nearest neighbours and a fixed number of variables.
Other methods create a machine learning model from the K nearest neighbours, and the model may then be used to calculate the output y. Examples of such methods are the Transductive Neural Fuzzy Inference System (NFI) and the Transductive Neural Fuzzy Inference System with Weighted Data Normalization (TWNFI) (Song and Kasabov 2006). Like the above group of methods, these methods do not suggest how to select the number K of nearest samples, nor do they suggest how to select the number of variables V that would give the best accuracy for the personalised model Mx.
To summarise, in the above transductive methods, there is no efficient method for personalised feature selection (e.g. features such as important genes, clinical and/or other variables) required for personalised prognosis, classification, profiling, and/or treatment selection. These transductive methods also do not rank variables (features) in terms of importance for a person and for an optimal personal model creation based on these variables and a personalised selection of the nearest neighbour samples from the available data set. There is also no methodology to suggest how individual scenarios for personal improvement (e.g. treatment) can be designed.