There are many instances in which it is desirable to predict the likelihood that an event will occur (initially occur and/or recur) within a certain amount of time and/or the amount of time until an event is likely to occur. In the medical field, for example, it would be useful to predict whether a disease is likely to recur in a patient who has been treated for it, and if so, when. Mathematical models can be developed to make such time-to-event predictions based on data obtained from actual cases. In the example above, such a predictive model could be developed by studying a cohort of patients who were treated for a particular disease and identifying common characteristics or “features” that distinguish patients whose disease recurred from those whose disease did not. By taking into account the actual time to recurrence for the patients in the cohort, features and values of features can also be identified that correlate with recurrence at particular times. These features can be used to predict the time to recurrence for a future patient based on that patient's individual feature profile. Such time-to-event predictions can help a treating physician assess and plan treatment in anticipation of the event.
A unique characteristic of time-to-event data is that the event of interest (in this example, disease recurrence) may not yet have been observed. This would occur, for example, where a patient in the cohort visits the doctor but the disease has not yet recurred. Data corresponding to such a patient visit is referred to as “right-censored” because, as of that time, some of the data of interest is missing: the event of interest (e.g., disease recurrence) has not yet occurred, so the time to the event is known only to exceed the time of the visit. Although censored data by definition lacks certain information, it can be very useful in developing predictive models, provided its censored nature is accounted for, because it provides more data points for use in adapting the parameters of the models. Indeed, time-to-event data, especially right-censored time-to-event data, is one of the most common types of data used in clinical, pharmaceutical, and biomedical research.
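By way of illustration only, right-censored observations are commonly represented as a follow-up time paired with an indicator of whether the event was observed. The following is a minimal sketch under that assumption; the record type, field names, and patient values are hypothetical and are not drawn from any actual cohort.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    """One patient record (hypothetical structure, for illustration only)."""
    patient_id: str
    time_months: float    # time to recurrence, or time of last visit if censored
    event_observed: bool  # True: recurrence observed; False: right-censored


# Hypothetical cohort; the values are made up for this example.
cohort = [
    Observation("A", 14.0, True),
    Observation("B", 30.0, False),  # no recurrence as of month 30
    Observation("C", 22.0, True),
    Observation("D", 9.0, False),   # no recurrence as of month 9
]


def split_by_censoring(observations):
    """Separate records with an observed event from right-censored records."""
    events = [o for o in observations if o.event_observed]
    censored = [o for o in observations if not o.event_observed]
    return events, censored


def at_risk(observations, t):
    """Records still under observation and event-free just before time t."""
    return [o for o in observations if o.time_months >= t]


events, censored = split_by_censoring(cohort)
at_risk_ids = [o.patient_id for o in at_risk(cohort, 14.0)]
```

In this sketch, censored patient B still counts among the patients at risk at month 14 even though no recurrence was ever recorded for that patient, which is one way in which censored records contribute additional data points.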
In forming or training predictive mathematical models, it is generally desirable to incorporate as much data as possible from as many sources as possible. Thus, for health time-to-event predictions, for example, it is generally desirable to have data from as many patients as possible and as much relevant data from each patient as possible. With these large amounts of diverse data, however, come difficulties in how to process all of the information available. Although various models exist, none is completely satisfactory for handling high dimensional, heterogeneous data sets that include right-censored data. For example, the Cox proportional hazards model is a well-known model used in the analysis of censored data for identifying differences in outcome due to patient features. By its construction, the model assumes that the failure rates of any two patients are proportional and that the independent features of the patients affect the hazard in a multiplicative way. But while the Cox model can properly process right-censored data, it is not ideal for analyzing high dimensional datasets, since it is limited by the total regression degrees of freedom in the model and requires a sufficient number of patients when the model is complex. Support Vector Machines (SVMs), on the other hand, perform well with high dimensional datasets but are not well-suited for use with censored data.
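The proportionality assumption of the Cox model can be illustrated with a short sketch. It assumes the standard form of the model, in which the hazard at time t for a patient with feature vector x is a baseline hazard multiplied by exp(β·x). The coefficients, feature values, and baseline hazard function below are hypothetical and chosen only for illustration; nothing is fitted to data.

```python
import math


def relative_hazard(features, coefficients):
    """Multiplicative effect of the features on the hazard: exp(beta . x)."""
    return math.exp(sum(b * x for b, x in zip(coefficients, features)))


def hazard(t, features, coefficients, baseline_hazard):
    """Cox-form hazard: baseline hazard at time t scaled by the feature effect."""
    return baseline_hazard(t) * relative_hazard(features, coefficients)


def baseline(t):
    """Arbitrary illustrative baseline hazard (an assumption, not estimated)."""
    return 0.01 * t


# Hypothetical coefficients and patient feature vectors.
coefficients = [0.5, -0.2]
patient_1 = [1.0, 2.0]
patient_2 = [0.0, 1.0]

# The ratio of the two patients' hazards is the same at every time point,
# because the baseline hazard cancels out of the ratio.
ratios = [
    hazard(t, patient_1, coefficients, baseline)
    / hazard(t, patient_2, coefficients, baseline)
    for t in (1.0, 5.0, 10.0)
]
```

Under these assumptions, the hazard ratio between the two patients equals exp(0.3) at every time point, and raising a feature by one unit multiplies the hazard by the exponential of that feature's coefficient. Each feature adds a coefficient to be estimated, which is consistent with the degrees-of-freedom limitation noted above for high dimensional data.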