A number of systems and programs are offered on the market for the design, the engineering and the manufacturing of objects. CAD is an acronym for Computer-Aided Design, i.e. it relates to software solutions for designing an object. CAE is an acronym for Computer-Aided Engineering, i.e. it relates to software solutions for simulating the physical behavior of a future product. CAM is an acronym for Computer-Aided Manufacturing, i.e. it relates to software solutions for defining manufacturing processes and operations. In such computer-aided design systems, the graphical user interface plays an important role with regard to the efficiency of the technique. These techniques may be embedded within Product Lifecycle Management (PLM) systems. PLM refers to a business strategy that helps companies to share product data, apply common processes, and leverage corporate knowledge for the development of products from conception to the end of their life, across the concept of the extended enterprise. The PLM solutions provided by Dassault Systèmes (under the trademarks CATIA, ENOVIA and DELMIA) provide an Engineering Hub, which organizes product engineering knowledge, a Manufacturing Hub, which manages manufacturing engineering knowledge, and an Enterprise Hub, which enables enterprise integrations and connections into both the Engineering and Manufacturing Hubs. Altogether, the system delivers an open object model linking products, processes and resources to enable dynamic, knowledge-based product creation and decision support that drives optimized product definition, manufacturing preparation, production and service.
In this context and other contexts, scene understanding and image captioning are gaining increasing importance. Image captioning is a problem at the intersection of computer vision and natural language processing: given an input image, it consists in generating a caption that describes the input image. Region captioning is a particular kind of image captioning: given an input image and an input region of interest inside the input image, it consists in generating a caption that describes the input region. Dense captioning goes a step further: it consists in automatically finding the different regions of interest in an image and giving a description of each of them. These techniques may be useful in scene understanding applications, for example by providing for automatic generation of 3D experiences from an image.
The following papers relate to image captioning and are referred to hereunder:
[1] R. Krishna et al. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations, arXiv 2016
[2] R. Kiros et al. Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models, ICCV 2015
[3] R. Lebret et al. Phrase-Based Image Captioning, 2015
[4] R. Kiros et al. Multimodal Neural Language Models, ICML 2014
[5] T. Mikolov et al. Distributed Representations of Words and Phrases and their Compositionality, NIPS 2013
[6] S. Venugopalan et al. Long-term Recurrent Convolutional Networks for Visual Recognition and Description, CVPR 2015
[7] O. Vinyals et al. Show and Tell: A Neural Image Caption Generator, IEEE 2015
[8] A. Karpathy et al. Deep Visual-Semantic Alignments for Generating Image Descriptions, IEEE 2015
[9] A. Karpathy et al. DenseCap: Fully Convolutional Localization Networks for Dense Captioning, CVPR 2016
[10] K. Papineni et al. BLEU: a Method for Automatic Evaluation of Machine Translation, ACL 2002
[11] M. Denkowski et al. Meteor Universal: Language Specific Translation Evaluation for Any Target Language, ACL 2014
[12] I. Sutskever et al. Sequence to Sequence Learning with Neural Networks, NIPS 2014
Existing image captioning techniques rely on a database of image/caption pairs used to train a machine learning model (i.e. a function) configured to generate captions. Such a database may be obtained from a crowdsourcing platform where people are asked to write captions describing pictures. Existing databases include MSCOCO for image captioning and Visual Genome [1] for dense captioning. Existing captioning approaches fall into two categories: sentence retrieval from a learned multimodal space, and sentence generation using the encoder/decoder framework. In both approaches, the model encodes the input image to obtain an image signature, and a caption is then obtained by processing that signature. The quality of the generated captions may be evaluated with different language metrics [10, 11].
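As an illustration of the kind of language metric mentioned above, the following minimal sketch computes a clipped (modified) n-gram precision, which is one ingredient of BLEU [10]. It is not the full metric: the brevity penalty and the combination over several n-gram orders are omitted, and the function names `ngrams` and `clipped_precision` are chosen here for illustration only.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) contained in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, references, n=1):
    """Clipped n-gram precision of a candidate caption against references.

    Each candidate n-gram count is clipped to the maximum number of times
    that n-gram occurs in any single reference caption.
    """
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    # Maximum count of each n-gram over all references.
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


# Illustrative captions (invented for this sketch).
references = ["a dog runs on the grass".split()]
candidate = "a dog on the grass".split()
unigram_precision = clipped_precision(candidate, references, n=1)
bigram_precision = clipped_precision(candidate, references, n=2)
```

Clipping prevents a degenerate caption that repeats one frequent word from obtaining a high score, since a word can only be credited as many times as it appears in a reference.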
In the multimodal approach [2, 3, 4], a common space for image and phrase representations is learned. Such a common space acts as an embedding space for two modalities, images and text, and is learned using techniques such as negative sampling, as used in [5] when learning Word2Vec™. Once such a space is learned, sentence generation is performed after retrieving the captions whose signatures are most similar to the signature of the query image in the embedding space. A problem with such an approach is that the captions obtained are strongly biased toward the captions already present in the database. Moreover, retrieving the most similar captions can become very time consuming if the database grows too large.
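The retrieval step of the multimodal approach may be sketched as follows, assuming that image and caption signatures are already available as vectors in the common space and that similarity is measured by cosine similarity. The similarity measure, the function names and the three-dimensional signatures are assumptions made for illustration and are not taken from [2, 3, 4].

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve_captions(image_signature, caption_db, top_k=1):
    """Return the top_k captions closest to the image signature.

    caption_db is a list of (caption, signature) pairs. Every entry is
    scored, so the cost grows linearly with the size of the database.
    """
    scored = [(cosine_similarity(image_signature, signature), caption)
              for caption, signature in caption_db]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [caption for _, caption in scored[:top_k]]


# Hypothetical signatures in a 3-dimensional common space.
caption_db = [
    ("a dog running on grass", [0.9, 0.1, 0.0]),
    ("a red car parked on a street", [0.0, 0.2, 0.9]),
    ("a cat sleeping on a sofa", [0.6, 0.7, 0.1]),
]
query_signature = [0.8, 0.2, 0.1]
best = retrieve_captions(query_signature, caption_db, top_k=2)
```

The sketch makes both drawbacks visible: only captions stored in `caption_db` can ever be returned, and each query scans the whole database.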
In the second approach, an encoder/decoder framework is used for sentence generation [6, 7, 8]. In the first step, the image is encoded: a signature of the image is obtained by passing the image through a convolutional neural network and taking the output of some of the higher fully connected layers. Then, as in the general approach developed in [12], the image signature is decoded by a recurrent neural network that generates the sentence word after word. The task of dense captioning also uses the encoder/decoder framework described above to generate captions for the regions of the image. The state-of-the-art method [9] integrates a localization layer inside the neural network to automatically find the regions of interest in the image. Those approaches work well for describing an entire image, as long as the quality of the database they are trained on is good enough. However, when the same models are used to generate captions for regions inside an image, the results are not as good as for the entire image.
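The word-after-word decoding described above may be sketched as a greedy loop. In this minimal sketch the recurrent neural network is replaced by a hypothetical lookup table, and the image signature is reduced to a two-component vector; `greedy_decode`, `toy_init_state`, `toy_step` and `TOY_TABLE` are illustrative names, not part of any cited method. Greedy selection of the best-scoring word is also only one possible decoding strategy.

```python
START, END = "<start>", "<end>"


def greedy_decode(image_signature, init_state, step, max_len=20):
    """Generate a caption word by word from an image signature.

    init_state(signature) returns the initial decoder state;
    step(state, previous_word) returns (new_state, {word: score}).
    At each step the highest-scoring word is kept (greedy decoding).
    """
    state = init_state(image_signature)
    word, caption = START, []
    for _ in range(max_len):
        state, scores = step(state, word)
        word = max(scores, key=scores.get)
        if word == END:
            break
        caption.append(word)
    return caption


def toy_init_state(signature):
    # Toy stand-in for conditioning the decoder on the image signature:
    # the state is the index of the largest signature component.
    return signature.index(max(signature))


# Toy stand-in for a trained recurrent network: next-word scores
# indexed by (state, previous word).
TOY_TABLE = {
    (0, START): {"dog": 0.9, "car": 0.1},
    (0, "dog"): {"running": 0.8, END: 0.2},
    (0, "running"): {END: 1.0},
    (1, START): {"car": 0.7, "dog": 0.3},
    (1, "car"): {"parked": 0.6, END: 0.4},
    (1, "parked"): {END: 1.0},
}


def toy_step(state, previous_word):
    # A real recurrent network would also update its state here;
    # this toy version keeps the state unchanged.
    return state, TOY_TABLE[(state, previous_word)]


caption = greedy_decode([0.2, 0.9], toy_init_state, toy_step)
```

Unlike the retrieval approach, the caption is assembled from individual words, so it need not match any sentence of the training database.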
Thus, there still exists a need for an improved solution for captioning a region of an image.