In medicine in general, and in radiotherapy in particular, medical imaging is used to obtain information about the interior of the body without having to physically enter it through surgery or other invasive means. In general, the body can be described as a spatial (and possibly also temporal) distribution of a large set of parameters. For example, one point represented by an imaging unit (e.g., a voxel) might have a density of 1.2 g/cm³, contain 80% water and 20% fat, have an ¹⁸F uptake of 5 MBq/mL, have a temperature of 38 °C, contain 5000 glioblastoma cells/mm³, etc. The goal of medical imaging is to estimate one or more of these parameters. However, any particular imaging modality can determine only a very small number of them. For example, CT can only determine the radiological density (which roughly correlates with mass density), PET can only determine radionuclide uptake, etc. It has therefore become increasingly common to combine several imaging modalities during treatment planning and treatment administration.
Most state-of-the-art methods for machine learning in medical imaging can be summarized as function approximation: training data consisting of input-output pairs of some type (e.g. CT images with segmentations) are acquired from, for instance, expert clinicians, and a function is “trained” to approximate this mapping. Popular methods involve, for example, neural networks. In these methods, a family of parametrized functions ƒθ is chosen, where θ is a set of parameters (e.g. convolution kernels and biases) whose values are selected by minimizing the average error over the training data. If the input-output pairs are denoted by (xi, yi), this can be formalized by solving a minimization problem such as
$$\min_{\theta} \sum_{i} \left\| f_{\theta}(x_i) - y_i \right\|_2^2$$
Once the network has been trained (i.e. θ has been selected), the function ƒθ can be applied to any new input. For example, in the above setting of segmentation of CT images, a never-before-seen CT image can be fed into ƒθ with the objective of obtaining a segmentation that matches what an expert clinician would produce.
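The train-then-apply pattern described above can be illustrated with a deliberately small sketch. It is not the method of any particular system: it assumes a hypothetical one-dimensional linear model ƒθ(x) = w·x + b in place of a neural network, synthetic input-output pairs in place of clinical data, and plain gradient descent as the minimizer. The names `f`, `loss`, and `train` are illustrative only.

```python
def f(theta, x):
    """Hypothetical parametrized function f_theta; theta = (w, b)."""
    w, b = theta
    return w * x + b


def loss(theta, pairs):
    """Sum over i of (f_theta(x_i) - y_i)^2, the squared 2-norm for scalar outputs."""
    return sum((f(theta, x) - y) ** 2 for x, y in pairs)


def train(pairs, lr=0.01, steps=5000):
    """Select theta by gradient descent on the sum of squared errors."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        # Gradient of the loss with respect to w and b.
        grad_w = sum(2.0 * (w * x + b - y) * x for x, y in pairs)
        grad_b = sum(2.0 * (w * x + b - y) for x, y in pairs)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Synthetic training pairs (x_i, y_i); an assumption for illustration only.
pairs = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

theta = train(pairs)         # "training": theta is selected
prediction = f(theta, 4.0)   # applying f_theta to a never-before-seen input
```

In a realistic setting the inputs would be images, the outputs segmentations, and ƒθ a network with many parameters, but the structure (minimize the error over the training pairs, then evaluate ƒθ on new input) is the same.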
In classical machine learning methods, however, the function ƒθ only takes input of some fixed type, e.g. CT images or PET images. If a method is to be trained using both CT and PET images as input, the whole training process has to be redone. Given that the current trend is toward an ever-increasing number of imaging modalities being used, in new and original combinations, a combinatorial expansion of data may result. In general, n modalities can be combined in 2^n − 1 ways (every non-empty subset). For instance, with 2 imaging modalities there are 3 ways to combine them (either or both), but with 5 modalities there are 31 combinations, and with 10 modalities there are more than 1000 combinations (1023).
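The combination counts quoted above follow from counting the non-empty subsets of a set of modalities. The short sketch below checks this. The modality names and the helper names `modality_combinations` and `count_combinations` are illustrative assumptions, not part of any described system.

```python
from itertools import combinations


def modality_combinations(modalities):
    """Enumerate every non-empty subset of the given modalities."""
    return [
        subset
        for size in range(1, len(modalities) + 1)
        for subset in combinations(modalities, size)
    ]


def count_combinations(n):
    """Number of non-empty subsets of n modalities: 2^n - 1."""
    return 2 ** n - 1


# Illustrative modality names only.
two = modality_combinations(["CT", "PET"])
# two == [("CT",), ("PET",), ("CT", "PET")]

counts = {n: count_combinations(n) for n in (2, 5, 10)}
# counts == {2: 3, 5: 31, 10: 1023}
```

If a separate fixed-input model had to be trained for each subset, the number of models would grow in the same way.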