This section provides background information related to the present disclosure which is not necessarily prior art.
FIG. 1 is a diagram illustrating an example of a conventional method of segmenting a three-dimensional (3D) medical image using deep learning, in which a segmentation system includes n number of segmentation modules S1, . . . Sj, . . . , Sn, corresponding to n number of slices 1, . . . , j, . . . , n, a recurrent neural network (RNN) module R, and a segmentation probability map A for providing an integrated segmentation image. The 3D medical image is voxel data (e.g., CT image) including a plurality of slices, and in order to segment a target region such as a nodule from the 3D medical image, (1) a portion corresponding to a target region is segmented from each slice using a deep learning technique such as fully convolutional network (FCN) and the segmented portions are integrated, or (2) a 3D convolution may directly be used. In FIG. 1, a technique of performing segmentation on n number of slices 1, . . . , n using a variant of U-net and utilizing an RNN (Recurrent Neural Network) such as an LSTM (Long Short-Term Memory network) to utilize spatial information between slices, rather than directly integrating the segmented slices, is illustrated (Combining Fully Convolutional and Recurrent Neural Networks for 3D Biomedical Image Segmentation; Jianxu Chen, Lin Yang, Yizhe Zhang, Mark Alber, Danny Z. Chen (Submitted on 5 Sep. 2016 (v1), last revised 6 Sep. 2016 (this version, v2)); arXiv.org>cs>arXiv:1609.01006).
FIG. 2 is a diagram illustrating an example of a conventional method of extracting features from a plurality of video frames using deep learning, in which a semantic extraction system includes a feature extraction unit F and a sequence learning unit SL. As an input X of the system, a plurality of video frames having time series characteristics may be used. The system may extract features from each of a plurality of video frames through the feature extraction unit F (e.g., convolutional neural network (CNN)) and then allow these features to pass through the sequence learning unit (e.g., LSTM), thereby extracting features or a meaning having time series characteristics as an output Y from the video (Long-term Recurrent Convolutional Networks for Visual Recognition and Description; Jeff Donahue, Lisa Anne Hendricks, Marcus Rohrbach, Subhashini Venugopalan, Sergio Guadarrama, Kate Saenko, Trevor Darrell; (Submitted on 17 Nov. 2014 (v1), last revised 31 May 2016 (this version, v4)); arXiv.org>cs>arXiv:1411.4389).