3D volumetric images (or “volumes”) are widely used for clinical diagnosis, surgical planning and biomedical research. The 3D context provided by such volumetric images is important for visualizing and analyzing the object of interest. However, because of the added dimension, 3D volumes are more time-consuming, and sometimes harder, for machines to interpret than 2D images. Many conventional systems use convolutional neural networks (CNNs) to extract representations of the structural patterns of interest in human or animal body tissues.
Due to their special imaging settings, many imaging modalities produce anisotropic voxels, meaning that the three dimensions do not all have the same resolution. For example, in the 3D volumes of Digital Breast Tomosynthesis (DBT), and sometimes Computed Tomography (CT), the image resolution in the xy plane/slice (the within-slice resolution) is more than ten times higher than the z resolution (the between-slice resolution). Thus, the xy slices preserve much more information than the z dimension. In DBT images, only the spatial information within the xy plane can be guaranteed. However, the 3D context between xy slices, even with slight misalignment, still carries meaningful information for analysis. Directly applying a 3D CNN to such images remains challenging for the following reasons. First, it may be hard for a small 3×3×3 kernel to learn useful features from anisotropic voxels, because the information density differs along each dimension. Second, in theory, 3D networks need more features than 2D networks. The capacity of 3D networks is bounded by GPU memory, which constrains both the width and the depth of the networks. Third, unlike 2D computer vision tasks, which can now make use of backbone networks pre-trained on millions of 2D images, 3D tasks mostly have to be trained from scratch and hence suffer from the lack of large 3D datasets. In addition, the high variation in the data makes 3D networks harder to train. Also, 3D CNNs trained on such small image datasets with relatively small context generalize poorly to unseen data.
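The first two difficulties can be illustrated with a little arithmetic. The following is a minimal sketch, not part of any described system: the function names, the voxel spacing (a tenfold ratio in arbitrary units) and the channel counts are all assumptions chosen for illustration. It computes how far a 3×3×3 kernel reaches along each axis of an anisotropic volume, and how many weights one 3D convolution layer holds compared with a 2D layer of the same width.

```python
def kernel_extent(kernel_size, spacing):
    """Distance between the centres of the first and last voxel that a
    kernel covers along each axis, in the units of `spacing`."""
    return tuple((k - 1) * s for k, s in zip(kernel_size, spacing))


def conv_weight_count(kernel_size, in_channels, out_channels):
    """Number of weights (bias excluded) in one convolution layer."""
    count = in_channels * out_channels
    for k in kernel_size:
        count *= k
    return count


# Hypothetical voxel spacing in (z, y, x) order, arbitrary units:
# the between-slice spacing is ten times the within-slice spacing.
spacing_zyx = (1.0, 0.1, 0.1)

# A 3x3x3 kernel reaches ten times further along z than along x or y,
# so its 27 weights sample very different physical scales.
extent_zyx = kernel_extent((3, 3, 3), spacing_zyx)

# Hypothetical layer width of 64 input and 64 output channels.
weights_2d = conv_weight_count((3, 3), 64, 64)
weights_3d = conv_weight_count((3, 3, 3), 64, 64)

print(extent_zyx)               # (2.0, 0.2, 0.2)
print(weights_3d / weights_2d)  # 3.0
```

Under these assumed numbers, each 3D layer holds three times the weights of its 2D counterpart before any increase in feature count, and the activations it produces carry an extra dimension as well, which is where the GPU memory bound becomes restrictive.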
Besides the traditional 3D networks built with 1×1×1 and 3×3×3 kernels, there are other conventional methods for learning representations from anisotropic voxels. Some methods process 2D slices separately with 2D networks. To make better use of the 3D context, some of these methods feed more than one image slice to the 2D network as input. The 2D slices can also be viewed sequentially: a fully convolutional network (FCN) architecture is combined with a convolutional LSTM so that adjacent image slices are treated as a time series, distilling the 3D context from a sequence of abstracted 2D contexts. There are also conventional methods that apply anisotropic convolutional kernels to allocate more learning capacity to the xy plane than to the z axis.
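Two of these conventional approaches can be sketched in a few lines. The sketch below is illustrative only and rests on assumptions not stated above: the function names are hypothetical, out-of-range slices at the volume boundary are handled by clamping to the nearest valid slice, and a 1×3×3 kernel (in z, y, x order) is taken as one possible example of an anisotropic kernel. The first function lists, for each slice, the adjacent slices that would be stacked as input channels for a 2D network; the second counts the weights in a single filter so that isotropic and anisotropic kernels can be compared.

```python
def slice_windows(num_slices, context):
    """For each slice index z, return the indices of the 2 * context + 1
    neighbouring slices to stack as input channels of a 2D network.

    Assumption: indices that fall outside the volume are clamped to the
    nearest valid slice, so boundary slices are repeated.
    """
    last = num_slices - 1
    return [
        [min(max(z + offset, 0), last) for offset in range(-context, context + 1)]
        for z in range(num_slices)
    ]


def weights_per_filter(kernel_size_zyx):
    """Number of weights in one single-channel filter of the given size."""
    count = 1
    for k in kernel_size_zyx:
        count *= k
    return count


# A volume of 4 slices with one neighbour on each side.
windows = slice_windows(4, 1)
print(windows)  # [[0, 0, 1], [0, 1, 2], [1, 2, 3], [2, 3, 3]]

# An isotropic 3x3x3 kernel versus an anisotropic 1x3x3 kernel that
# spends all of its weights within the xy plane.
print(weights_per_filter((3, 3, 3)))  # 27
print(weights_per_filter((1, 3, 3)))  # 9
```

The multi-slice input gives a 2D network a limited view of the z context at no extra kernel cost, while the anisotropic kernel keeps a 3D network but reduces the weights spent on the low-resolution axis.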