Deep learning is a technology used to cluster or classify objects or data. For example, a computer cannot by itself distinguish dogs from cats in photographs, whereas a human can do so easily. To address this, a method called “machine learning” was devised. It is a technique that allows a computer to group similar items among large amounts of data inputted into the computer. When a photo of an animal similar to a dog is inputted, the computer may classify it as a dog.
Many machine learning algorithms for classifying data have already been developed, for example, a decision tree, a Bayesian network, a support vector machine (SVM), and an artificial neural network. Deep learning is a descendant of the artificial neural network.
Deep Convolution Neural Networks (Deep CNNs) are at the heart of the remarkable development in deep learning. CNNs were already used in the 1990s to solve character recognition problems, but they owe their current widespread use to recent research. Deep CNNs won the 2012 ImageNet image classification competition, outperforming the other competitors by a wide margin. Since then, the convolution neural network has become a very useful tool in the field of machine learning.
FIG. 1 shows an example of various outputs to be acquired from a photograph using a deep CNN according to prior art.
Classification is a method for identifying the class of an object in a photograph, for example, as shown in FIG. 1, determining whether the object is a person, a lamb, or a dog. Detection is a method for finding every object and displaying each found object enclosed in a bounding box. Segmentation is a method for distinguishing the region of a specific object from other objects in a photograph. As deep learning has recently become popular, classification, detection, and segmentation rely heavily on deep learning.
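The three kinds of output described above may be illustrated with a minimal sketch. The names `BoundingBox` and `box_from_mask`, the toy mask, and the idea of deriving a box from a mask are assumptions made only for illustration; they are not taken from FIG. 1 and do not represent how a deep CNN produces these outputs.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BoundingBox:
    """Detection output for one object: a class label and box corners."""
    label: str
    x_min: int
    y_min: int
    x_max: int
    y_max: int


def box_from_mask(mask: List[List[int]], label: str) -> Optional[BoundingBox]:
    """Return the tightest box around the nonzero pixels of a binary mask."""
    points = [(x, y)
              for y, row in enumerate(mask)
              for x, value in enumerate(row) if value]
    if not points:
        return None  # no object pixels, so there is nothing to enclose
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return BoundingBox(label, min(xs), min(ys), max(xs), max(ys))


# Hand-written outputs for a hypothetical 5x4 photograph of a dog.
classification = "dog"        # classification: a single class label
segmentation = [              # segmentation: the region of the object, per pixel
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0],
]
# detection: each found object enclosed in a bounding box
detection = [box_from_mask(segmentation, classification)]
print(detection[0])
```

In this sketch the segmentation output carries the most detail, the detection output keeps only the enclosing box, and the classification output keeps only the class.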
FIG. 2 is a diagram schematically illustrating a detection method using the CNN.
By referring to FIG. 2, the learning device receives an input image and applies a plurality of convolution operations to the input image through a plurality of convolutional filters (or convolutional layers) to thereby generate at least one feature map. Then, the learning device allows the feature map to pass through a detection layer to thereby generate at least one bounding box, and then allows the bounding box to pass through a filtering layer to thereby generate a final detection result. Thereafter, backpropagation is performed by using a loss value obtained by referring to the detection result and its corresponding ground truth (GT) value, which has been annotated by a person in advance, to thereby allow a detector (i.e., the learning device) to gradually make the detection result get closer to the GT value.
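The last step above, in which a loss value computed against the GT drives the detection result toward the GT, may be sketched as follows. This is a simplified illustration under stated assumptions, not the detector of FIG. 2: the box coordinates themselves stand in for the learned parameters, the loss is assumed to be a squared error, and plain gradient descent replaces backpropagation through convolutional layers. All function names and values are hypothetical.

```python
from typing import List, Tuple

Box = List[float]  # [x_min, y_min, x_max, y_max]


def squared_error_loss(predicted: Box, ground_truth: Box) -> float:
    """Loss value: sum of squared coordinate differences (an assumed loss)."""
    return sum((p - g) ** 2 for p, g in zip(predicted, ground_truth))


def loss_gradient(predicted: Box, ground_truth: Box) -> Box:
    """Gradient of the squared error with respect to each predicted coordinate."""
    return [2.0 * (p - g) for p, g in zip(predicted, ground_truth)]


def train(predicted: Box, ground_truth: Box,
          learning_rate: float = 0.1, steps: int = 50) -> Tuple[Box, List[float]]:
    """Repeatedly move the predicted box against the gradient of the loss."""
    box = list(predicted)
    losses = []
    for _ in range(steps):
        losses.append(squared_error_loss(box, ground_truth))
        gradient = loss_gradient(box, ground_truth)
        box = [b - learning_rate * g for b, g in zip(box, gradient)]
    return box, losses


gt_box = [10.0, 20.0, 50.0, 80.0]      # GT box annotated in advance (made-up values)
initial_box = [0.0, 0.0, 30.0, 30.0]   # initial detection result (made-up values)
final_box, loss_history = train(initial_box, gt_box)
print(loss_history[0], loss_history[-1])
print([round(c, 2) for c in final_box])
```

With these assumed settings, each update shrinks every coordinate error by the same factor, so the loss falls at every step and the box approaches the GT box.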
Herein, the performance of the trained detector may be roughly proportional to the size(s) of the database(s) used for training.
Meanwhile, according to the conventional art, to create an image database for training, a person generates GTs by drawing GT boxes or by annotating classes on each of the training images in the image database as shown in FIG. 3.
However, there is a problem in that the performance of the learning device, e.g., the detector, is not directly proportional to the number of training images included in the image database for training. This is because an effective learning process is achieved only when there are many training images which include one or more objects with a low possibility of being correctly detected by the detector. In general, as the performance of the detector improves during the learning process, it becomes more difficult to improve the performance further by using additional training images.
For example, on condition that the performance of the detector becomes 95% through the learning process, if there are 10,000 images in the image database for training, the useful images which can contribute to the performance of the detector may be only 5%, i.e., 500 images, of the 10,000 images. If the number of training images in the database is increased tenfold, people must manually generate GTs for 90,000 additional images, which incurs considerable cost for establishing such a database, but only about 4,500 of those images would be useful for improving the performance of the detector. Furthermore, if the performance of the detector becomes 98% through the learning process using the 4,500 useful images, the cost of establishing a database for further improving the detector increases rapidly. In this case, in order to secure an additional 4,500 useful images, GTs should be prepared for more than 2,000,000 images.
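The figures for the 95% case may be reproduced with a short calculation. The sketch assumes a simple proportional model in which the share of useful images equals the share of images the detector still handles incorrectly; the function names are hypothetical, and the model is only an approximation of the reasoning above.

```python
def useful_images(total_images: int, accuracy_percent: int) -> int:
    """Images expected to be useful, assuming the useful share equals the miss rate."""
    return total_images * (100 - accuracy_percent) // 100


def images_to_annotate(useful_needed: int, accuracy_percent: int) -> int:
    """Images that need GTs to obtain `useful_needed` useful ones (rounded up)."""
    miss_percent = 100 - accuracy_percent
    if miss_percent <= 0:
        raise ValueError("accuracy_percent must be below 100")
    return -(-useful_needed * 100 // miss_percent)  # ceiling division on integers


print(useful_images(10_000, 95))       # 500 useful images among the first 10,000
print(useful_images(90_000, 95))       # 4500 useful images among the 90,000 added
print(images_to_annotate(4_500, 95))   # 90000 images to annotate for 4,500 useful ones
```

Under this model the annotation effort per useful image grows as the miss rate shrinks, which is the cost problem the example describes.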