Each of the references cited herein is expressly incorporated herein by reference in its entirety.
The human retina is readily photographed using a variety of commonly available specialist cameras. In the UK most opticians, in particular the major chains, have such cameras in their practices already and a retinal photograph is increasingly offered as part of a standard eye test.
Retinal images may be used to assess many eye diseases (retinopathy in general), and in recent years, as the resolution of digital cameras has increased, retinal imaging has become the preferred way to assess retinal health. There are many types of retinopathy which can be detected on the image, from hemorrhages to tumors, and although many such diseases can lead to serious consequences (including blindness), the early stages usually cause no symptoms a patient can detect. As early intervention is often critical to successful management and/or cure, there are a number of compelling public health reasons to make assessing retinal images more routine.
Type II diabetes is a particularly relevant disease, as one of its consequences is the development of small bleeds on the retina (micro-aneurysms). If left untreated, these can develop rapidly into serious eye disease, possibly before the patient has been diagnosed as a type II diabetic.
Diabetic retinopathy (“DR”) is the leading cause of adult blindness worldwide, including in developed countries like the UK. Consequently, the UK-NHS offers all type II diabetes patients a free retinal screen every year, in which a digital photo is taken and passed to a grading center, where trained graders (humans) examine the image in minute detail for the earliest signs of bleeding. This manual process is costly, but to date is the most reliable way to process these images. Despite years of research, no automated system has yet been able to demonstrate the levels of reliability and accuracy a trained human can achieve.
U.S. 20170039689 discloses “deep learning” technologies for assessing DR from images. See also, U.S. 20170039412, 20150110348, 20150110368, 20150110370, 20150110372, U.S. Pat. Nos. 9,008,391, 9,002,085, 8,885,901, 8,879,813, and 20170046616. Retinal images from a funduscope provide an indication of vascular conditions, and may be useful for diagnosing both eye disease and more generally vascular disease.
“Deep learning” is a refinement of artificial neural networks (“ANN”), consisting of more than one hidden layer, that permits higher levels of abstraction and improved predictions from data [2]. See, Greenspan, Hayit, Bram van Ginneken, and Ronald M. Summers. “Guest editorial deep learning in medical imaging: Overview and future promise of an exciting new technique.” IEEE Transactions on Medical Imaging 35.5 (2016): 1153-1159. Convolutional neural networks (“CNN”) are powerful tools for computer vision tasks. Deep CNNs can be formulated to automatically learn mid-level and high-level abstractions obtained from raw data (e.g., images). Generic descriptors extracted from CNNs are effective in object recognition and localization in natural images.
In machine learning, there are two principal paradigms: supervised learning and unsupervised learning. In supervised learning, the training data is labelled, that is, there is an extrinsic truth, such that statistical learning is tethered to an external standard. In unsupervised learning, the training data is analyzed without reference to a ground truth, and therefore, the features are intrinsic. In semi-supervised learning techniques, aspects of both paradigms are employed, that is, not all samples (or attributes of samples) of the training data set are labelled.
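The distinction between the two paradigms can be illustrated with a minimal sketch. It assumes toy two-dimensional feature vectors; the function names and the nearest-centroid and k-means methods are chosen here for illustration only and are not taken from the cited references. The supervised routine needs the labels to compute class centroids, whereas the unsupervised routine recovers centroids from the data alone.

```python
import numpy as np

def fit_centroids_supervised(x, y):
    """Supervised: class centroids are computed from labelled samples."""
    classes = np.unique(y)
    return classes, np.stack([x[y == c].mean(axis=0) for c in classes])

def fit_centroids_unsupervised(x, k=2, iters=20):
    """Unsupervised: k-means finds k centroids without using any labels."""
    # Deterministic initialisation: pick samples spread over the sorted data.
    order = np.argsort(x[:, 0])
    picks = np.linspace(0, len(x) - 1, k).astype(int)
    centroids = x[order[picks]].astype(float)
    for _ in range(iters):
        dist = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
        assign = dist.argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):
                centroids[j] = x[assign == j].mean(axis=0)
    return centroids

def predict(x, centroids):
    """Index of the nearest centroid for every sample."""
    dist = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
    return dist.argmin(axis=1)
```

On well-separated data the two routines find the same centroids; the difference is that only the supervised one can attach a meaningful class name to each centroid.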
In medical imaging, the accurate diagnosis and/or assessment of a disease depends on both image acquisition and image interpretation. Image acquisition has improved substantially over recent years, with devices acquiring data at faster rates and increased resolution. The present technology therefore addresses the image interpretation process. Most interpretations of medical images are performed by physicians; however, image interpretation by humans is limited due to its subjectivity, large variations across interpreters, and fatigue.
Human examiners may be inconsistent in their interpretations, and prone to error. Further, different humans may have different standards. Therefore, even obtaining training data is not without difficulty.
Many diagnostic tasks require an initial search process to detect abnormalities, and to quantify measurements and changes over time.
CNNs have been applied to medical image processing; see Sahiner et al. [4]. In that work, ROIs containing either biopsy-proven masses or normal tissues were extracted from mammograms. The CNN consisted of an input layer, two hidden layers and an output layer and used backpropagation for training. CNNs have also been applied to lung nodule detection [5], and to the detection of microcalcifications on mammography [6].
The typical CNN architecture for image processing consists of a series of layers of convolution filters, interspersed with a series of data reduction or pooling layers. The convolution filters are applied to small patches of the input image. Like the low-level vision processing in the human brain, the convolution filters detect increasingly more relevant image features, for example lines that may represent straight edges (such as for organ detection) or circles (such as for round objects like colonic polyps), and then higher order features like local and global shape and texture. The output of the CNN is typically one or more probabilities or class labels. The convolution filters are learned from training data. This is desirable because it reduces the necessity of the time-consuming hand-crafting of features that would otherwise be required to pre-process the images with application-specific filters or by calculating computable features. There are other network architecture variants, such as a deep recurrent neural network (“DRNN”) known as long short-term memory [7].
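The layer sequence described above can be sketched as a single forward pass. This is a minimal illustration, not a trained network: the kernels and dense-layer weights are supplied by the caller as stand-ins, whereas in a real CNN they would be learned from training data, and all function names are hypothetical.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide one filter over small patches of the image ('valid' cross-correlation)."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    return np.maximum(x, 0.0)

def max_pool(x, size=2):
    """Data reduction: keep the maximum of each non-overlapping size x size block."""
    h, w = (x.shape[0] // size) * size, (x.shape[1] // size) * size
    blocks = x[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def tiny_cnn_forward(image, kernels, dense_w, dense_b):
    """conv -> ReLU -> pool for each filter, then a dense layer and softmax."""
    maps = [max_pool(relu(conv2d_valid(image, k))) for k in kernels]
    features = np.concatenate([m.ravel() for m in maps])
    return softmax(dense_w @ features + dense_b)
```

The output is a vector of class probabilities, matching the statement that the output of the CNN is typically one or more probabilities or class labels.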
CNNs are parallelizable algorithms, and acceleration of processing (approximately 40 times) is enabled by graphics processing unit (GPU) computer chips, compared to CPU processing alone. See [8]. In medical image processing, GPUs are usable for segmentation, reconstruction, registration, and machine learning [9], [10].
Training a deep CNN (“DCNN”) from scratch (or full training) is a challenge. First, CNNs require a large amount of labeled training data, a requirement that may be difficult to meet in the medical domain where expert annotation is expensive and the diseases (e.g., lesions) are scarce. Second, training a DCNN requires large computational and memory resources, without which, the training process would be extremely time-consuming. Third, training a DCNN is often complicated by overfitting and convergence issues, which often require repetitive adjustments in the architecture or learning parameters of the network to ensure that all layers are learning with comparable speed. Given these difficulties, several new learning schemes, termed “transfer learning” and “fine-tuning”, have been shown to provide solutions and are increasingly gaining popularity. However, empirical adjustments to the process characterize the state of the art, and significant differences in approach, implementation, and results are observed in the art even when using similar algorithms for similar goals. See www.kaggle.com/c/diabetic-retinopathy-detection.
Note that the training of a network cannot be updated, and any change in the algorithm or training data requires re-optimization of the entire network. A new technique provides some mitigation of this constraint; see U.S. Pat. No. 9,053,431, expressly incorporated herein by reference. Thus, the network may be supplemented after definition, based on a noise or error vector output of the network.
Computer-aided detection (CAD) is a well-established area of medical image analysis that is highly amenable to deep learning. In the standard approach to CAD [11] candidate lesions are detected, either by supervised methods or by classical image processing techniques such as filtering and mathematical morphology. Candidate lesions are often segmented, and described by an often large set of hand-crafted features. A classifier is used to map the feature vectors to the probability that the candidate is an actual lesion. The straightforward way to employ deep learning instead of hand-crafted features is to train a CNN operating on a patch of image data centered on the candidate lesion. A combination of different CNNs is used to classify each candidate.
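The patch-based approach can be sketched as follows, under stated assumptions: candidate lesion centers are given as (row, column) pairs, each classifier is any callable that maps a patch to a lesion probability (a stand-in for a trained CNN), and the combination rule is a plain average. None of these choices is taken from the cited work.

```python
import numpy as np

def extract_patch(image, center, size):
    """Return a size x size patch centred on (row, col), reflecting at the borders."""
    half = size // 2
    padded = np.pad(image, half, mode="reflect")
    r, c = center[0] + half, center[1] + half  # coordinates in the padded image
    return padded[r - half:r - half + size, c - half:c - half + size]

def classify_candidates(image, candidates, classifiers, size=5):
    """Average the lesion probabilities from several classifiers for each candidate."""
    scores = []
    for center in candidates:
        patch = extract_patch(image, center, size)
        scores.append(float(np.mean([clf(patch) for clf in classifiers])))
    return scores
```

Padding by reflection is one way to keep patches the same size for candidates near the image border; other border policies would serve equally well.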
Various works focus on supervised CNNs in order to achieve categorization. Such networks are important for many applications, including detection, segmentation and labelling. Other works focus on unsupervised schemes which are mostly shown to be useful in image encoding, efficient image representation schemes and as a pre-processing step for further supervised schemes. Unsupervised representation learning methods such as Restricted Boltzmann Machines (RBM) may outperform standard filter banks because they learn a feature description directly from the training data. The RBM is trained with a generative learning objective; this enables the network to learn representations from unlabeled data, but does not necessarily produce features that are optimal for classification. Van Tulder et al. [18] conducted an investigation to combine the advantages of both generative and discriminative learning objectives in a convolutional classification restricted Boltzmann machine, which learns filters that are good both for describing the training data and for classification. It is shown that a combination of learning objectives outperforms purely discriminative or generative learning.
CNNs enable learning data-driven, highly representative, layered hierarchical image features. These features have been demonstrated to be a very strong and robust representation in many application domains. In order to provide such a rich representation and successful classification, sufficient training data are needed.
When sufficient data are not available, there are several ways to proceed:
Transfer learning: CNN models (supervised) pre-trained from a natural image dataset or from a different medical domain are used for a new medical task at hand. In one scheme, a pre-trained CNN is applied to an input image and then the outputs are extracted from layers of the network. The extracted outputs are considered features and are used to train a separate pattern classifier. For instance, in Bar et al. [25], [26] pre-trained CNNs were used as a feature generator for chest pathology identification. In Ginneken et al. [27] integration of CNN-based features with handcrafted features enabled improved performance in a nodule detection system.
Fine Tuning: When a medium sized dataset does exist for the task at hand, one suggested scheme is to use a pre-trained CNN as initialization of the network, following which further supervised training is conducted, of several (or all) the network layers, using the new data for the task at hand. Transfer learning and fine tuning are key components in the use of DCNNs in medical imaging applications.
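The first scheme (a pre-trained network used as a fixed feature generator for a separate classifier) can be sketched as follows. This is a toy illustration under loud assumptions: the "pre-trained" network is replaced by a small bank of frozen filters passed in by the caller, the separate classifier is a logistic regression trained by gradient descent, and all names are hypothetical. Fine-tuning would differ in that the filters themselves would also be updated during training on the new data.

```python
import numpy as np

def extract_features(images, frozen_filters):
    """Apply frozen (stand-in 'pre-trained') filters and pool each response map."""
    feats = []
    for img in images:
        row = []
        for k in frozen_filters:
            windows = np.lib.stride_tricks.sliding_window_view(img, k.shape)
            response = np.maximum((windows * k).sum(axis=(2, 3)), 0.0)  # conv + ReLU
            row.append(response.mean())  # global average pooling
        feats.append(row)
    return np.asarray(feats)

def train_logistic(features, labels, lr=0.5, epochs=500):
    """Train only the new classifier; the filters above are never updated."""
    w = np.zeros(features.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(features @ w + b)))
        grad = p - labels
        w -= lr * features.T @ grad / len(labels)
        b -= lr * grad.mean()
    return w, b

def predict(features, w, b):
    return (features @ w + b > 0).astype(int)
```

Because only the small classifier is trained, this scheme needs far fewer labelled samples than training the whole network, which is the motivation given above for using it when data are scarce.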
Shin et al. [17] and Tajbakhsh et al. [28] show that using pre-trained CNNs with fine-tuning achieved the strongest results, that deep fine-tuning led to improved performance over shallow fine-tuning, and that the importance of using fine-tuning increases with reduced size training sets, regardless of the specific application domain (Tajbakhsh et al. [28]) and for all network architectures (Shin et al. [17]). The GoogLeNet architecture led to state-of-the-art detection of mediastinal lymph nodes compared to other less deep architectures; see Shin et al. [17].
The lack of publicly available ground-truth data, and the difficulty in collecting such data per medical task, both cost-wise as well as time-wise, is a prohibitively limiting factor in the medical domain. Though crowdsourcing has enabled annotation of large scale databases for real world images, its application for biomedical purposes requires a deeper understanding and hence, more precise definition of the actual annotation task (see Nguyen et al. [29], McKenna et al. [30]). The fact that expert tasks are being outsourced to non-expert users may lead to noisy annotations introducing disagreement between users. Likewise, use of uncontrolled clinical outcome data, or user data as a basis for feedback, may lead to poor outcomes, due to lack of standardization and control. Many issues arise in combining the knowledge of medical experts with non-professionals, such as how to combine the information sources, how to assess and incorporate the inputs weighted by their prior-proved accuracy in performance, and more. Albarqouni et al. [31] present a network that combines an aggregation layer that is integrated into the CNN to enable learning inputs from the crowds as part of the network learning process. Results shown give valuable insights into the functionality of DCNN learning from crowd annotations. Crowdsourcing studies in the medical domain show that a crowd of nonprofessional, inexperienced users can in fact perform as well as the medical experts, which was observed by Nguyen et al. [29] and McKenna et al. [30] for radiology images. This perhaps means that the results do not capture and exploit the skill, knowledge and insight of the experts.
Unsupervised feature learning for mammography risk scoring is presented in Kallenberg et al. [32]. In this work, a method is shown that learns a feature hierarchy from unlabeled data. The learned features are then input to a simple classifier, and two different tasks are addressed: i) breast density segmentation, and ii) scoring of mammographic texture, with state-of-the-art results achieved. To control the model capacity, a sparsity regularizer is introduced that incorporates both lifetime and population sparsity. The convolutional layers in the unsupervised parts are trained as autoencoders; in the supervised part, the (pre-trained) weights and bias terms are fine-tuned using softmax regression.
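One simple way to incorporate inputs weighted by their previously demonstrated accuracy is a weighted vote, sketched below. This is not the aggregation layer of Albarqouni et al. [31]; it is a hypothetical stand-in that assumes annotations arrive as (annotator, item, label) triples and that a small gold-standard set is available for estimating each annotator's accuracy.

```python
from collections import defaultdict

def estimate_accuracy(annotations, gold):
    """Fraction of gold-standard items each annotator labelled correctly."""
    correct, total = defaultdict(int), defaultdict(int)
    for annotator, item, label in annotations:
        if item in gold:
            total[annotator] += 1
            correct[annotator] += int(label == gold[item])
    return {a: correct[a] / total[a] for a in total}

def weighted_vote(annotations, accuracy, default_weight=0.5):
    """Combine labels per item, weighting each vote by the annotator's prior accuracy."""
    scores = defaultdict(lambda: defaultdict(float))
    for annotator, item, label in annotations:
        scores[item][label] += accuracy.get(annotator, default_weight)
    return {item: max(votes, key=votes.get) for item, votes in scores.items()}
```

With this weighting, one consistently accurate annotator can outvote several unreliable ones, which an unweighted majority vote would not allow.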
Yan et al. [33] design a multi-stage deep learning framework for image classification and apply it to body part recognition. In the pre-train stage, a CNN is trained using multi-instance learning to extract the most discriminative and non-informative local patches from the training slices. In the boosting stage, the pre-trained CNN is further boosted by these local patches for image classification. A hallmark of the method was that it automatically discovered the discriminative and non-informative local patches through multi-instance deep learning. Thus, no manual annotation was required.
Regression networks are not very common in the medical imaging domain. In Miao et al. [34], a CNN regression approach is presented for real-time 2-D/3-D registration. Three algorithmic strategies are proposed to simplify the underlying mapping to be regressed, and to design a CNN regression model with strong non-linear modelling. Results show that the deep learning (DL) method is more accurate and robust than two state-of-the-art accelerated intensity-based 2-D/3-D registration methods.
Golkov et al. [35] provide an initial proof-of-concept, applying deep learning to reduce diffusion MRI data processing to a single optimized step. They show that this modification enables one to obtain scalar measures from advanced models at twelve-fold reduced scan time and to detect abnormalities without using diffusion models. The relationship between the diffusion-weighted signal and microstructural tissue properties is non-trivial. Golkov et al. [35] demonstrate that with the use of a deep neural network (DNN) such relationships may in fact be revealed: diffusion-weighted images (DWIs) are directly used as inputs rather than using scalar measures obtained from model fitting. The work shows microstructure prediction on a voxel-by-voxel basis as well as automated model-free segmentation from DWI values, into healthy tissues and MS lesions. Diffusion kurtosis is shown to be measured from only 12 data points and neurite orientation dispersion and density measures from only 8 data points. This may allow for fast and robust protocols facilitating clinical routine and demonstrates how classical data processing can be streamlined by means of deep learning.
Kaggle (www.kaggle.org) organized a competition on detection and staging of DR from color fundus images, and around 80,000 images were made available (www.kaggle.com/c/diabetic-retinopathy-detection). Many proposals used neural networks.
U.S. 2014/0314288 discloses a three stage system that analyzes fundus images with varying illumination and fields of view and generates a severity grade for DR. Image pre-processing includes histogram equalization, contrast enhancement, and dynamic range normalization. In the first stage, bright and red regions are extracted from the fundus image using various combinations of global filters and thresholding techniques. An optic disc (OD) has a similar structural appearance to bright lesions, and the blood vessel regions have similar pixel intensity properties to the red lesions. Hence, the region corresponding to the optic disc is removed from the bright regions using existing optic disc detection algorithms and the regions corresponding to the blood vessels are removed from the red regions using a simple detection technique. This leads to an image containing bright candidate regions and another image containing red candidate regions. Region-based features are computed, including area, perimeter, solidity, min, max, mean, standard deviation, etc.; in all, 30 features, selected by AdaBoost (en.wikipedia.org/wiki/AdaBoost), are used. In the second stage, the bright and red candidate regions are subjected to two-step hierarchical classification. In the first step, bright and red lesion regions are separated from non-lesion regions based on the features. In the second step, the classified bright lesion regions are further classified as hard exudates or cotton-wool spots, while the classified red lesion regions are further classified as hemorrhages and micro-aneurysms. The classifier may take the form of GMM (en.wikipedia.org/wiki/Mixture_model#Gaussian_mixture_model), kNN (en.wikipedia.org/wiki/K-nearest_neighbors_algorithm), SVM (en.wikipedia.org/wiki/Support_vector_machine, en.wikipedia.org/wiki/Supervised_learning), etc. In the third stage, the numbers of bright and red lesions per image are combined to generate a DR severity grade.
Such a system aims to reduce the number of patients requiring manual assessment, and to prioritize eye-care delivery measures for patients with the highest DR severity.
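Two of the building blocks named in the description of U.S. 2014/0314288, region-based features and a kNN classifier, can be sketched as follows. The sketch assumes candidate regions have already been extracted into an integer label map in which 0 means background; it computes only five of the listed features (area, min, max, mean, standard deviation), and the function names are hypothetical rather than taken from the reference.

```python
import numpy as np

def region_features(image, label_map):
    """Per-region features: area, min, max, mean, std of pixel intensity."""
    feats = {}
    for region_id in np.unique(label_map):
        if region_id == 0:  # 0 is assumed to mean background
            continue
        pixels = image[label_map == region_id]
        feats[int(region_id)] = np.array(
            [pixels.size, pixels.min(), pixels.max(), pixels.mean(), pixels.std()],
            dtype=float,
        )
    return feats

def knn_predict(train_x, train_y, query, k=3):
    """Classify one feature vector by majority vote among its k nearest neighbours."""
    dist = np.linalg.norm(train_x - query, axis=1)
    nearest = train_y[np.argsort(dist)[:k]]
    values, counts = np.unique(nearest, return_counts=True)
    return values[np.argmax(counts)]
```

In the two-step hierarchy described above, a classifier of this kind would be applied once to separate lesions from non-lesions and again to assign the lesion subtype.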
U.S. Pat. No. 8,098,907 discloses automatic detection of micro-aneurysms (MAs) by also taking into account information such as location of vessels, optic disc and hard exudates (HEs). The method works by (i) dividing the image into subregions of fixed size, followed by region enhancement/normalization and adaptive analysis, (ii) optic disc, vessel, and HE detection, and (iii) combination of the results of (i) and (ii) above. MA detection is based on top hat filtering and local adaptive thresholding.
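Top-hat filtering followed by local adaptive thresholding can be sketched as below. The sketch assumes a square structuring element of odd size, a white top-hat (which keeps bright details; for dark lesions it could be applied to an inverted image), and a threshold rule of "local mean plus an offset". These specifics are assumptions made for illustration and are not stated in the reference.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def _windows(image, size):
    """All size x size neighbourhoods (size assumed odd), edges replicated."""
    pad = size // 2
    padded = np.pad(image, pad, mode="edge")
    return sliding_window_view(padded, (size, size))

def erode(image, size):
    return _windows(image, size).min(axis=(2, 3))

def dilate(image, size):
    return _windows(image, size).max(axis=(2, 3))

def white_top_hat(image, size):
    """Image minus its morphological opening: keeps bright details smaller than the window."""
    opened = dilate(erode(image, size), size)
    return image - opened

def local_adaptive_threshold(image, size, offset):
    """Mark pixels exceeding their local mean by more than `offset`."""
    local_mean = _windows(image, size).mean(axis=(2, 3))
    return image > local_mean + offset
```

The opening removes structures smaller than the window, so subtracting it leaves only those small structures, while larger objects such as vessels or the optic disc are suppressed.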
U.S. 2015/0104087 discloses automated fundus image side detection (i.e., left or right eye), field detection, and quality assessment. The technology uses physiological characteristics such as location of optic disc and macula and the presence of symmetry in the retinal blood vessel structure. Field and side detection are multi-channel or single channel. The algorithm detects the optic disc using normalized 2D cross-correlation using an optic disc template. For high-resolution images, blood vessel structure density is used to determine side and field. The quality for each image is assessed by analysis of the vessel symmetry in the image. This is done by obtaining a binary segmented image of the blood vessels, extracting features and applying a classification algorithm. Quality is a grade of 1 to 5. For vessel binarization, wavelets and edge location refinement are proposed. This is followed by morphological operations. For feature extraction, the image is divided into 20 rectangular windows, and local vessel density (LVD) is computed as the number of non-zero pixels in the window. The 20 LVDs are normalized by a global vessel density (GVD) computed as the number of non-zero pixels for a segmented binary image of grade 5. A feature vector is formed by the LVDs and GVD of the image. For classification, SVM is proposed. Image registration and similar techniques to the above can be used to assign quality levels to overlapping stitched images.
U.S. Pat. Nos. 8,879,813, 8,885,901, 9,002,085, 9,008,391, U.S. 2015/0110348, U.S. 2015/0110368, U.S. 2015/0110370, U.S. 2015/0110372, U.S. 2017/0039412, and U.S. 2017/0039689 disclose various aspects of a diagnostic system. A retinal image may be enhanced based on median normalization, to locally enhance the image at each pixel location using local background estimation. Active pixels are detected in retina images, based on median filtering/dilation/erosion/etc. Essentially, this detects the retina disc and eliminates pixels close to the border. Regions of interest are detected. Descriptors of local regions are extracted, e.g., by computing two morphologically filtered images with the morphological filter computed over geometric shaped local regions of two different types or sizes, and taking their difference. Image quality is assessed using computer vision techniques to assess appropriateness for grading. Images are automatically screened for diseases. Image-based lesion biomarkers are automatically analyzed, over different visits of a patient, with image registration between visits. Changes in lesions and anatomical structures are computed, and quantified in terms of statistics, wherein the computed statistics represent the image-based biomarker that can be used for monitoring progression, early detection, and/or monitoring effectiveness of treatment therapy.
U.S. Pat. No. 8,879,813 describes automated detection of active pixels in retina images by accessing a retina image, generating two median filtered versions with different window sizes, generating a difference image from the filtered versions, and then generating a binary image.
U.S. Pat. No. 8,885,901 describes enhancing a retina image by accessing a retina image, filtering it with a median filter and modifying the values in the original image based on the values in the original and filtered images, wherein the enhanced image is used for detecting a medical condition.
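The vessel-density feature vector can be sketched as follows. The sketch assumes the 20 windows form a 4 x 5 grid, that the reference GVD (from a grade 5 image) is supplied by the caller, and that the feature vector is the 20 normalized LVDs followed by the image's own GVD; the reference may arrange these differently.

```python
import numpy as np

def vessel_density_features(vessel_mask, reference_gvd, grid=(4, 5)):
    """Local vessel densities over a grid of windows, plus the image's global density."""
    mask = np.asarray(vessel_mask) != 0
    lvds = []
    for band in np.array_split(mask, grid[0], axis=0):
        for window in np.array_split(band, grid[1], axis=1):
            lvds.append(np.count_nonzero(window))  # LVD: non-zero pixels in the window
    gvd = np.count_nonzero(mask)                   # GVD of this image
    normalized = np.array(lvds, dtype=float) / reference_gvd
    return np.append(normalized, gvd)
```

The resulting 21-element vector is the kind of input that the SVM classifier mentioned above would map to a quality grade.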
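The four steps listed for U.S. Pat. No. 8,879,813 can be sketched as below. Only the sequence of steps comes from the description above; the window sizes, the use of an absolute difference, and the fixed threshold used to form the binary image are assumptions made for illustration, and the patent may use a different binarization rule.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def median_filter(image, size):
    """Median over a size x size neighbourhood (size assumed odd), edges replicated."""
    pad = size // 2
    padded = np.pad(image, pad, mode="edge")
    return np.median(sliding_window_view(padded, (size, size)), axis=(2, 3))

def active_pixel_mask(image, small=3, large=7, threshold=0.1):
    """Difference of two median-filtered versions, thresholded to a binary image."""
    diff = np.abs(median_filter(image, small) - median_filter(image, large))
    return diff > threshold
```

Because a median filter removes structures smaller than about half its window, the difference responds where the two window sizes disagree, that is, at structures the small window keeps and the large window suppresses.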
U.S. Pat. No. 9,002,085 and U.S. 2015/0110372 describe generating descriptors of local regions in a retina image by accessing a retina image, generating two morphologically filtered versions, the first with a circular/regular polygon window and the second with an elongated/elliptical window, generating difference values from the filtered images and using them as pixel descriptor values for the retina image.
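The descriptor computation can be sketched as below. The description above does not say which morphological filter is used, so grey-scale dilation is assumed here as a stand-in (passing `np.min` gives erosion instead); a disk approximates the circular window and a one-pixel-high horizontal line stands in for the elongated window. All names and default sizes are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def disk_footprint(radius):
    """Boolean mask approximating a circular window."""
    yy, xx = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    return yy ** 2 + xx ** 2 <= radius ** 2

def line_footprint(length):
    """Elongated horizontal window: 1 pixel high, `length` pixels wide (length odd)."""
    return np.ones((1, length), dtype=bool)

def morph_filter(image, footprint, reduce=np.max):
    """Grey-scale dilation (np.max) or erosion (np.min) over an arbitrary footprint."""
    fh, fw = footprint.shape
    padded = np.pad(image, ((fh // 2, fh // 2), (fw // 2, fw // 2)), mode="edge")
    windows = sliding_window_view(padded, (fh, fw))
    return reduce(windows[..., footprint], axis=-1)

def pixel_descriptors(image, radius=2, length=7):
    """Difference between the round-window and elongated-window responses."""
    round_response = morph_filter(image, disk_footprint(radius))
    elongated_response = morph_filter(image, line_footprint(length))
    return round_response - elongated_response
```

A bank of such differences, taken over several sizes and orientations of the elongated window, would give each pixel a vector of descriptor values rather than the single value computed here.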
U.S. Pat. No. 9,008,391 and U.S. 2015/0110368 describe accessing retina images for a patient, for each of the images designating a subset of pixels as active including regions of interest, computing pixel-level descriptors, providing pixel level classification from the descriptors using supervised learning, computing a second descriptor, and providing a second classification for a plurality of pixels using supervised learning.
U.S. 2015/0110348 and U.S. 2017/0039412 describe automated detection of regions of interest (ROIs) in retina images by accessing a retina image, extracting regions with one or more desired properties using multiscale morphological filterbank analysis, and storing a binary map.
U.S. 2015/0110370 and U.S. 2017/0039689 describe enhancing a retinal image by accessing a funduscopic image, estimating the background at single or multiple scales, and scaling the intensity at a first pixel location adaptively based on the intensity at the same position in the background image.
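Background estimation followed by adaptive intensity scaling can be sketched as below. The sketch assumes intensities in the range 0 to 1, a median filter as the background estimator, a gain of target / background at each pixel, and a simple average of the gains when several scales are given. These choices are illustrative assumptions and are not taken from the cited documents.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def estimate_background(image, size):
    """Local background as the median over a size x size window (size assumed odd)."""
    pad = size // 2
    padded = np.pad(image, pad, mode="reflect")
    return np.median(sliding_window_view(padded, (size, size)), axis=(2, 3))

def enhance(image, sizes=(5,), target=0.5, eps=1e-6):
    """Scale each pixel by target / background at the same location, averaged over scales."""
    image = np.asarray(image, dtype=float)
    gains = [target / (estimate_background(image, s) + eps) for s in sizes]
    return np.clip(image * np.mean(gains, axis=0), 0.0, 1.0)
```

With this rule, regions of different illumination are brought to the same target level, while a small lesion keeps its contrast relative to the surrounding background because the median ignores it when estimating that background.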