The tissue microenvironment is composed of cells of distinct lineage and function. Better classification of the cellular composition and attendant phenotype of individual cells in the tissue microenvironment, in both healthy and disease states, should advance basic, translational, and clinical research and ultimately improve human health. This is especially true in cancer. Immunotherapies are emerging as one of the success stories of cancer treatment. Intense effort is also being expended in designing anti-cancer therapies that target elements of the tumor stroma, including the vasculature, as well as other elements of the microenvironment. That the recent success of immunotherapies is limited to subsets of patients underscores the urgent need to develop new tools for in situ tissue microenvironment analysis and cell type quantification, so as to facilitate the use of these treatments.
Identifying the numbers and kinds of cells present is a critical task in characterizing the immune response in cancer tissues. However, multiple challenges must be overcome to perform this task reliably. Phenotypes are identified based on features computed from the biomarker expression on each cell, and one simple way to identify cells that are positive for a biomarker would be to define a numerical threshold for each biomarker. First, however, the biomarkers used to identify these cell types exhibit a great degree of variability in their expression on the cells of interest, so fixed thresholds might not classify phenotypes reliably in all cases. Second, due to differences in staining protocol and tissue fixation, these thresholds would vary from slide to slide for each biomarker being analyzed. Third, the cells being analyzed are two-dimensional projections of three-dimensional objects, and this, in certain cases, affects the computation of features. The classification methods should be robust to these potential causes of variability.
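The limitation of fixed per-biomarker thresholds can be illustrated with a minimal sketch. The marker name `marker_x`, the intensity values, the 0.5 cutoff, and the helper names are all hypothetical and chosen only for illustration; they are not taken from any particular assay or from the method described herein.

```python
from typing import Dict, List

Cell = Dict[str, float]  # biomarker name -> measured expression (arbitrary units)


def is_positive(cell: Cell, thresholds: Dict[str, float]) -> Dict[str, bool]:
    """Call a cell positive for each biomarker whose expression exceeds its cutoff."""
    return {marker: cell[marker] > cutoff for marker, cutoff in thresholds.items()}


def positive_fraction(cells: List[Cell], marker: str, cutoff: float) -> float:
    """Fraction of cells on a slide that are called positive for one biomarker."""
    return sum(cell[marker] > cutoff for cell in cells) / len(cells)


# Hypothetical data: the same four cells on two slides, where slide B is
# assumed to have stained at half the intensity of slide A.
slide_a = [{"marker_x": v} for v in (0.1, 0.2, 0.8, 0.9)]
slide_b = [{"marker_x": v * 0.5} for v in (0.1, 0.2, 0.8, 0.9)]

thresholds = {"marker_x": 0.5}  # hypothetical fixed cutoff shared by both slides
frac_a = positive_fraction(slide_a, "marker_x", thresholds["marker_x"])
frac_b = positive_fraction(slide_b, "marker_x", thresholds["marker_x"])
print(frac_a, frac_b)
```

Under these assumed values, the same cutoff calls half of the cells positive on slide A and none on slide B, even though the underlying cells are identical, which is the kind of slide-to-slide variability a robust classification method would need to tolerate.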
The final images being analyzed also contain artifacts such as dust particles, bubbles, tissue folding, and fragments, which may be due to poor tissue quality or to sample preparation. These artifacts can be misclassified as cells and can increase the false positive rate. The methods used to process these images should take the incidence of artifacts into account and discard them from analysis.
Large-scale discovery studies involve the analysis of hundreds of slides, which can result in the analysis of millions of cells. Classifying that many cells with a trained algorithm requires efficient and scalable methods of training and classification. One method, described in patent publication 2014/0199704, published Jul. 17, 2014, uses quantile-based thresholding to identify and classify immune cells. The present invention is directed to an improved method in which the algorithm guides the user in selecting the slides from which the training data should be created, and in which the manually annotated training data is used to build models applicable to those or similar slides. The slide selection is done via unsupervised analysis of the slide data, clustering the slides into groups whose members are similar to each other.
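One way such slide selection could work is sketched below. The text above does not specify the slide features or the clustering algorithm, so everything here is an assumption made for illustration: each slide is summarized by three intensity quantiles of a single biomarker, slides are grouped with a small k-means routine, and the slide nearest each cluster center is proposed for manual annotation. All function names, slide names, and values are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def quantile(values: Sequence[float], q: float) -> float:
    """Quantile of a sample using linear interpolation between order statistics."""
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def slide_signature(intensities: Sequence[float]) -> List[float]:
    """Summarize a slide by its 25th, 50th, and 75th intensity percentiles (assumed features)."""
    return [quantile(intensities, q) for q in (0.25, 0.5, 0.75)]


def kmeans(points: List[List[float]], k: int, iters: int = 20):
    """Plain k-means with deterministic farthest-first initialization."""
    centers = [list(points[0])]
    while len(centers) < k:
        # Next center: the point farthest from all centers chosen so far.
        far = max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        centers.append(list(far))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


def pick_training_slides(
    slides: Dict[str, Sequence[float]], k: int
) -> Tuple[Dict[str, int], Dict[int, str]]:
    """Cluster slides and propose the slide nearest each cluster center for annotation."""
    names = list(slides)
    sigs = {n: slide_signature(slides[n]) for n in names}
    labels, centers = kmeans([sigs[n] for n in names], k)
    assignment = dict(zip(names, labels))
    chosen = {}
    for j in range(k):
        members = [n for n in names if assignment[n] == j]
        if members:
            chosen[j] = min(members, key=lambda n: math.dist(sigs[n], centers[j]))
    return assignment, chosen


# Hypothetical per-cell intensities for four slides: two dim, two bright.
slides = {
    "s1": [0.10, 0.20, 0.30, 0.40, 0.50],
    "s2": [0.12, 0.22, 0.32, 0.42, 0.52],
    "s3": [0.60, 0.70, 0.80, 0.90, 1.00],
    "s4": [0.62, 0.72, 0.82, 0.92, 1.02],
}
assignment, chosen = pick_training_slides(slides, k=2)
print(assignment, chosen)
```

In this sketch, a model trained on annotations from the proposed slide of a group would then be applied to the other slides in that group. A practical system would likely use richer per-slide features across many biomarkers and a more careful choice of the number of groups.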