Classifiers, such as those based on deep learning techniques, can be used to generate relevant insights from large volumes of data, and their use is being explored across a variety of areas. In healthcare specifically, the American Recovery and Reinvestment Act of 2009 and the Precision Medicine Initiative of 2015 have widely endorsed the value of medical data. Owing to several such initiatives, medical big data is expected to grow approximately 50-fold to reach 25,000 petabytes by 2020. See Roots Analysis, Feb. 22, 2017, “Deep Learning in Drug Discovery and Diagnostics, 2017-2035,” available on the Internet at rootsanalysis.com.
Classifiers can be used to generate valuable and meaningful insights using conventional data mining techniques. Examples of applications in which the use of classifiers, such as deep learning based solutions, is being explored include lead identification and optimization in drug discovery, support in patient recruitment for clinical trials, medical image analysis, biomarker identification, drug efficacy analysis, drug adherence evaluation, sequencing data analysis, virtual screening, molecule profiling, metabolomic data analysis, electronic medical record (EMR) analysis, medical device data evaluation, off-target side-effect prediction, toxicity prediction, potency optimization, drug repurposing, drug resistance prediction, personalized medicine, drug trial design, agrochemical design, material science, and simulations.
The likely benefits associated with the use of classifier-based solutions in the above-mentioned areas are estimated to be worth several billion dollars. For example, there are well-known instances in which deep learning models have accelerated the drug discovery process and provided solutions for precision medicine. With applications in drug repurposing and preclinical research, the use of classifiers in drug discovery is likely to present significant opportunity. In diagnostics, an increase in the speed of diagnosis based on classifiers is likely to have a profound impact in regions with large patient-to-physician ratios. The implementation of such solutions will increase the efficiency of physicians, thereby providing relief to the overburdened global healthcare system.
One drawback of classifiers is their error. Two main sources of classifier error are bias and variance. The error due to bias is taken as the difference between the expected (or average) prediction of a classifier and the correct value that the classifier is trying to predict. Because there is typically only one classifier used in an application, the concept of an expected or average prediction value for the classifier is counterintuitive. However, if one were to repeat the classifier training process more than once, each time using new training data and running a new analysis to create a new classifier, then, due to randomness in the underlying data sets, the resulting classifiers would have a range of predictions. Bias measures how far off, in general, these classifiers' predictions are from the correct value. For instance, using a phonebook to select participants in a survey used to train a classifier is a source of bias. By surveying only certain classes of people (those who have a registered phone number), the survey skews the results in a way that would be consistent if the entire classifier building exercise were repeated. Similarly, not following up with respondents is another source of bias, as it consistently changes the mixture of responses obtained.
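The phonebook example above can be illustrated with a minimal simulation. The sketch below is not drawn from any particular classifier implementation: the population model, the response rates, and the function names (`make_population`, `survey_estimate`, `repeated_bias`) are all assumptions made for illustration, and a simple survey estimate stands in for a trained classifier's prediction. It repeats the survey many times and compares the average estimate with the true population value.

```python
import random
import statistics

def make_population(n, rng):
    """Build a hypothetical population of (listed_in_phonebook, answer) pairs.

    Assumption for illustration only: people listed in the phonebook
    answer "yes" (1) more often than people who are not listed.
    """
    population = []
    for _ in range(n):
        listed = rng.random() < 0.6
        p_yes = 0.7 if listed else 0.3
        population.append((listed, 1 if rng.random() < p_yes else 0))
    return population

def survey_estimate(population, sample_size, phonebook_only, rng):
    """Estimate the fraction of "yes" answers from a single survey."""
    pool = [a for listed, a in population if listed or not phonebook_only]
    return statistics.mean(rng.sample(pool, sample_size))

def repeated_bias(population, sample_size, phonebook_only, repeats, rng):
    """Average estimate over many repeated surveys minus the true value."""
    true_value = statistics.mean(a for _, a in population)
    estimates = [survey_estimate(population, sample_size, phonebook_only, rng)
                 for _ in range(repeats)]
    return statistics.mean(estimates) - true_value

rng = random.Random(0)
population = make_population(20_000, rng)
bias_phonebook = repeated_bias(population, 200, True, 500, rng)
bias_everyone = repeated_bias(population, 200, False, 500, rng)
print(f"bias, phonebook only:   {bias_phonebook:+.3f}")
print(f"bias, whole population: {bias_everyone:+.3f}")
```

Under these assumptions, the phonebook-only survey is off by roughly the same amount every time the exercise is repeated, whereas sampling from the whole population gives an average estimate close to the true value.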
Classifier error due to variance is taken as the variability of a classifier prediction for a given data point. Again, if the entire classifier building process is repeated multiple times, the variance is how much the predictions for a given point vary between different realizations of the classifier. A small sample size for a training population is a source of variance. If the sample size were increased, the results would be more consistent each time the survey and prediction are repeated during classifier training. The results might still be highly inaccurate due to large sources of bias, but the variance of the predictions would be reduced.
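The effect of training sample size on variance can be sketched in the same hedged way. The example below assumes a one-dimensional, two-class problem with Gaussian classes and a nearest-centroid classifier. These choices, along with the names `train_threshold` and `prediction_variance`, are illustrative assumptions rather than anything specified above. The classifier is retrained many times on fresh data, and the variance of its predicted label at one fixed query point is measured.

```python
import random
import statistics

def train_threshold(n_per_class, rng):
    """Fit a nearest-centroid classifier on fresh 1-D training data.

    Assumed data model (illustrative only):
    class 0 ~ N(0, 1) and class 1 ~ N(2, 1).
    Returns the decision threshold midway between the two class means.
    """
    class0 = [rng.gauss(0.0, 1.0) for _ in range(n_per_class)]
    class1 = [rng.gauss(2.0, 1.0) for _ in range(n_per_class)]
    return (statistics.mean(class0) + statistics.mean(class1)) / 2

def prediction_variance(query, n_per_class, repeats, rng):
    """Variance of the predicted label at `query` across retrained classifiers."""
    predictions = [1 if query > train_threshold(n_per_class, rng) else 0
                   for _ in range(repeats)]
    return statistics.pvariance(predictions)

rng = random.Random(0)
query_point = 1.3  # a point close to the ideal decision boundary at 1.0
var_small = prediction_variance(query_point, 5, 300, rng)
var_large = prediction_variance(query_point, 500, 300, rng)
print(f"prediction variance, 5 samples per class:   {var_small:.3f}")
print(f"prediction variance, 500 samples per class: {var_large:.3f}")
```

With only a few training samples per class, the fitted threshold moves from one realization to the next, so the label assigned to the query point changes between retrained classifiers. With many samples, the predictions become consistent, although they could still be consistently wrong if the training data were biased.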
In the art, the total error of a model has been minimized by a careful balancing of bias and variance. However, as classifiers, such as deep learning classifiers, become more complex and are applied to more types of data, such as unstructured data and/or data for which very few replicates can be used in the training set, error becomes increasingly difficult to detect, let alone correct. Given the above background, there is a need for solutions that remove error, such as bias, in a classifier in order to provide more accurate results. Removal or reduction of such error will have application in lead identification and optimization in drug discovery, support in patient recruitment for clinical trials, medical image analysis, biomarker identification, drug efficacy analysis, drug adherence evaluation, sequencing data analysis, virtual screening, molecule profiling, metabolomic data analysis, EMR analysis, medical device data evaluation, off-target side-effect prediction, toxicity prediction, potency optimization, drug repurposing, drug resistance prediction, personalized medicine, drug trial design, agrochemical design, material science, and simulations, to name a few practical applications in which the use of improved classifiers has value.