Field of the Disclosure
The present disclosure relates to a technique for training a recognizer which recognizes a recognition target from data.
Description of the Related Art
In recent years, services have become available that analyze activity patterns of people and crowds, or detect and report specific events, from moving image data captured by monitoring cameras. To realize such services, recognition techniques based on machine learning are essential. These techniques must be able to detect, from the moving image data captured by monitoring cameras, attributes of objects such as persons or vehicles, types of actions such as walking or running, and types of personal belongings such as bags or baskets. The services are used in various environments, such as nursing-care facilities, ordinary homes, public facilities such as stations and city areas, and stores such as supermarkets and convenience stores. In addition, users have various needs for the services even within the same environment. Therefore, flexible and highly accurate recognition techniques based on machine learning are required that are applicable to various environments and use cases.
A technique for realizing flexible and highly accurate recognition by machine learning is described in Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik, 2014, “Rich feature hierarchies for accurate object detection and semantic segmentation”, IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (hereinbelow, referred to as the non-patent literature 1). According to the technique described in the non-patent literature 1, first, a general-purpose convolutional neural network (hereinbelow, abbreviated as CNN) which is applicable to 1000 categories is trained in advance using large-scale supervised data such as ImageNet. After this training, the number of categories is limited according to the specific needs of a user, and detailed training is performed. The training in advance and the detailed training are respectively referred to as pre-training and fine-tuning. Pre-training the CNN, which has an enormous number of parameters, has the advantage that a highly accurate recognizer corresponding to the specific needs can be obtained in a relatively short time in the fine-tuning. In addition, since large-scale data is used in the pre-training, it is expected to reduce the issue of the enormous number of parameters overfitting a specific recognition target.
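The structural relationship between the pre-training and the fine-tuning can be sketched as follows. This is a minimal illustration only, not the implementation of the non-patent literature 1: the class TinyRecognizer, its method replace_head, and the layer sizes are hypothetical names and values chosen for the example, a single fully connected feature layer stands in for the CNN, and both training loops are omitted. The sketch shows only that the parameters of the feature layer are carried over while the output layer is replaced to match a smaller number of categories.

```python
import random


def make_layer(n_in, n_out, rng):
    """Create an n_out x n_in weight matrix with small random values."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]


def apply_layer(weights, x):
    """Multiply the weight matrix by the input vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


class TinyRecognizer:
    """Hypothetical stand-in for a CNN: one feature layer plus one output layer."""

    def __init__(self, n_inputs, n_features, n_categories, rng):
        self.features = make_layer(n_inputs, n_features, rng)
        self.head = make_layer(n_features, n_categories, rng)

    def predict(self, x):
        # ReLU activation on the feature layer, then one score per category.
        hidden = [max(0.0, h) for h in apply_layer(self.features, x)]
        return apply_layer(self.head, hidden)

    def replace_head(self, n_categories, rng):
        """Keep the feature layer; re-initialize only the output layer."""
        self.head = make_layer(len(self.features), n_categories, rng)


rng = random.Random(0)

# Stage 1 (pre-training; training loop omitted): 1000 general-purpose categories.
model = TinyRecognizer(n_inputs=8, n_features=16, n_categories=1000, rng=rng)
pretrained_features = [row[:] for row in model.features]

# Stage 2 (fine-tuning; training loop omitted): limit the outputs to
# 3 user-specific categories (the value 3 is an assumption for illustration).
model.replace_head(n_categories=3, rng=rng)
scores = model.predict([1.0] * 8)
```

In this sketch, only the output layer changes size between the two stages, which is why the fine-tuning can start from the pre-trained feature parameters rather than from scratch.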
Japanese Patent Application Laid-Open No. 2006-31637 describes a method, used in predicting the impression of a musical piece as determined by human sensitivity, for selecting one of a plurality of pre-trained hierarchical neural networks and performing fine-tuning on the selected network using an input impression degree.
However, the method described in Japanese Patent Application Laid-Open No. 2006-31637 uses a common hierarchical neural network structure in both the pre-training and the fine-tuning. Thus, it is difficult to flexibly change the recognition target according to a user's needs.
On the other hand, according to the technique described in the non-patent literature 1, the number of outputs of the CNN can be changed, so that the recognition target can be flexibly changed between the pre-training and the fine-tuning. However, the 1000 categories of ImageNet, which are the recognition targets of the pre-training, do not necessarily cover the needs of a user who will use the CNN in the future. If the needs are not covered in the pre-training, an enormous number of parameters must be learned again in the fine-tuning, and the benefits of the pre-training, namely a shorter training time and avoidance of overfitting, cannot be obtained. To avoid this issue, the pre-training could be performed on every recognition target by further increasing the number of categories, but an even more enormous number of parameters would then be required to recognize the innumerable recognition targets. Moreover, the set of recognition targets finally required by a user is small in scale in some cases, so that an unnecessarily complicated CNN ends up being trained in many cases. Conversely, a great deal of labor is required to manually select, from the innumerable recognition targets, the recognition targets to be used in the pre-training in consideration of a user's needs.