One goal of digital pathology is to produce computerized systems that can detect the presence of cancer or other disease, possibly to be used as prescreening or quality control tools in coordination with human pathologists. To develop such systems, a machine learning classifier may be trained. Training data consists of examples of tissue, together with a grade indicating whether the tissue is cancerous or not. The grade typically describes the entire tissue without indicating the specific region where cancer may be found.
Digital images of biopsy specimens to be tested for the presence of disease, such as cancer, can be overwhelmingly large, possibly containing billions of pixels. While most of a tissue may appear healthy, disease-indicating phenomena may appear in a tiny fraction of the tissue to be examined.
The abundance of healthy tissue even in a tissue graded as cancerous poses a challenge for typical machine learning training methods. A common training approach randomly selects image regions inside cancerous and non-cancerous tissues and imputes the label of the whole tissue to each region. Because many of the random selections in a cancerous tissue may look just like healthy tissue, they receive a misleading “cancerous” label, which may lower the quality of the trained classifier.
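As a rough illustration of the label noise introduced by random selection, the following sketch computes the expected share of mislabeled regions under the simplifying assumption that regions are drawn uniformly from the tissue and each region is either entirely diseased or entirely healthy. The function name and the 2% figure are hypothetical and are not taken from any measured data.

```python
def imputed_label_noise(cancer_fraction: float) -> float:
    """Expected fraction of regions drawn uniformly from a cancer-graded
    tissue that show healthy tissue yet inherit the 'cancerous' label."""
    if not 0.0 <= cancer_fraction <= 1.0:
        raise ValueError("cancer_fraction must lie in [0, 1]")
    return 1.0 - cancer_fraction


# Hypothetical example: only 2% of a cancer-graded tissue shows disease.
noise = imputed_label_noise(0.02)
print(f"{noise:.0%} of the regions labeled cancerous would look healthy")
```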
Multiple-instance learning is a class of machine learning techniques designed to address problems with non-specific labels. In the multiple-instance learning framework, a classifier considers so-called “bags” of examples, where each example is described by the same number of features and a label is given only for the bag as a whole. The features for all the examples together are used to classify the bag.
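The bag structure may be sketched as follows. The names `Bag` and `classify_bag` are hypothetical, and the rule of labeling a bag positive when any one example scores above a threshold is one common multiple-instance convention, shown here only as an assumption rather than as the required way of combining the examples.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Bag:
    """A bag of examples that carries one label for the whole bag."""
    instances: List[Sequence[float]]  # one feature vector per example
    label: int                        # 1 = positive bag, 0 = negative bag

    def __post_init__(self) -> None:
        if not self.instances:
            raise ValueError("a bag needs at least one example")
        if len({len(x) for x in self.instances}) != 1:
            raise ValueError("every example must have the same number of features")


def classify_bag(bag: Bag,
                 score: Callable[[Sequence[float]], float],
                 threshold: float = 0.5) -> int:
    """Label the bag positive if any example scores above the threshold."""
    return int(max(score(x) for x in bag.instances) > threshold)


# Hypothetical scorer: the mean of an example's features.
def mean_score(x: Sequence[float]) -> float:
    return sum(x) / len(x)


bag = Bag(instances=[[0.1, 0.2], [0.9, 0.8], [0.2, 0.1]], label=1)
print(classify_bag(bag, mean_score))
```

Only the second example in this bag scores above the threshold, yet that is enough for the bag to be classified as positive.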
In digital pathology, a multiple-instance learning setting may be constructed by dividing a tissue into so-called “regions of interest” (ROI), from each of which a set of features is measured. The ROI may be selected heuristically and may not cover the entire tissue. The multiple-instance learning task is to classify the entire tissue using the features from the set of ROI.
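A minimal sketch of this construction follows, assuming a small grayscale image held as nested lists, square non-overlapping ROI, a background-intensity heuristic for selecting ROI, and two simple features (mean and variance of intensity). All function names, thresholds, and feature choices are hypothetical and serve only to illustrate the idea.

```python
from statistics import mean, pvariance
from typing import List, Tuple

Image = List[List[float]]  # grayscale in [0, 1]; values near 1.0 are blank background


def tile_rois(image: Image, size: int) -> List[Tuple[int, int]]:
    """Top-left corners of non-overlapping size x size tiles inside the image."""
    rows, cols = len(image), len(image[0])
    return [(r, c)
            for r in range(0, rows - size + 1, size)
            for c in range(0, cols - size + 1, size)]


def roi_pixels(image: Image, corner: Tuple[int, int], size: int) -> List[float]:
    """Flat list of the pixel values inside one tile."""
    r0, c0 = corner
    return [image[r][c]
            for r in range(r0, r0 + size)
            for c in range(c0, c0 + size)]


def select_rois(image: Image, size: int,
                background: float = 0.9,
                min_tissue: float = 0.5) -> List[Tuple[int, int]]:
    """Heuristic: keep tiles in which enough pixels are darker than background."""
    kept = []
    for corner in tile_rois(image, size):
        pixels = roi_pixels(image, corner, size)
        tissue_share = sum(p < background for p in pixels) / len(pixels)
        if tissue_share >= min_tissue:
            kept.append(corner)
    return kept


def roi_features(image: Image, corner: Tuple[int, int], size: int) -> List[float]:
    """Two illustrative features per ROI: mean and variance of intensity."""
    pixels = roi_pixels(image, corner, size)
    return [mean(pixels), pvariance(pixels)]


# Toy 4x4 image: left half is tissue (0.3), right half is background (1.0).
image = [[0.3, 0.3, 1.0, 1.0] for _ in range(4)]
rois = select_rois(image, size=2)
bag = [roi_features(image, corner, 2) for corner in rois]
print(rois, bag)
```

The selected ROI cover only the left half of the toy image, which mirrors the point that the ROI need not cover the entire tissue; the resulting list of feature vectors forms the bag for that tissue.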
This invention separates the training of a tissue classifier into two parts. The first part is the training of an ROI classifier with the objective of minimizing the error of the tissue-level decision given by the maximum decision over all ROI in the tissue. The second part is the training of a tissue classifier based on the actual outputs of the trained ROI classifier. Compared to non-multiple-instance learning approaches, the first part confers the advantage of not assuming that every region of a tissue graded as cancerous is actually cancerous. Compared to using the multiple-instance classifier obtained through the first part alone, introducing the second part may improve the tissue classification result by learning how best to aggregate noisy ROI decisions.
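The two parts may be sketched as follows under several assumptions that are not specified above: both classifiers are logistic models trained by stochastic gradient descent on the log loss, the gradient in the first part is taken through the single ROI that currently gives the maximum output, and the second part aggregates ROI outputs through two hypothetical summary features (their maximum and their mean). The toy data and all names are illustrative only.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def predict(w: List[float], x: Vector) -> float:
    """Logistic output; the last entry of w is the bias."""
    z = w[-1] + sum(wi * xi for wi, xi in zip(w, x))
    z = max(-60.0, min(60.0, z))  # guard against overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def sgd_step(w: List[float], x: Vector, y: int, lr: float) -> None:
    """One gradient step on the log loss for a single (x, y) pair."""
    err = predict(w, x) - y
    for i, xi in enumerate(x):
        w[i] -= lr * err * xi
    w[-1] -= lr * err


def train_roi_classifier(tissues, labels, epochs=200, lr=0.5) -> List[float]:
    """Part 1: fit the ROI model so the maximum ROI output matches the tissue grade."""
    w = [0.0] * (len(tissues[0][0]) + 1)
    for _ in range(epochs):
        for rois, y in zip(tissues, labels):
            top = max(rois, key=lambda x: predict(w, x))  # ROI giving the max decision
            sgd_step(w, top, y, lr)
    return w


def summarize(w_roi: List[float], rois) -> List[float]:
    """Hypothetical aggregation features computed from actual ROI outputs."""
    outs = [predict(w_roi, x) for x in rois]
    return [max(outs), sum(outs) / len(outs)]


def train_tissue_classifier(w_roi, tissues, labels, epochs=200, lr=0.5) -> List[float]:
    """Part 2: fit a second model on summaries of the trained ROI outputs."""
    v = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for rois, y in zip(tissues, labels):
            sgd_step(v, summarize(w_roi, rois), y, lr)
    return v


def classify_tissue(w_roi, v, rois) -> int:
    """Tissue-level decision from the two trained models."""
    return int(predict(v, summarize(w_roi, rois)) > 0.5)


# Toy data: one feature per ROI; cancerous tissues contain a single high-valued ROI.
tissues = [
    [[0.1], [0.2], [0.9]],
    [[0.15], [0.85], [0.1], [0.2]],
    [[0.1], [0.2], [0.15]],
    [[0.2], [0.1]],
]
labels = [1, 1, 0, 0]

w_roi = train_roi_classifier(tissues, labels)
v = train_tissue_classifier(w_roi, tissues, labels)
print([classify_tissue(w_roi, v, rois) for rois in tissues])
```

In this sketch the first part never labels an individual ROI; only the ROI with the highest output in each tissue is pushed toward the tissue grade. The second part then sees the full set of ROI outputs through the summary features, so it can weigh more evidence than the maximum alone.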