Dense captioning is a newly emerging computer vision topic for understanding images with dense language descriptions. The goal of dense captioning is to densely detect visual concepts (e.g., objects, object parts, interactions between objects, scenes, events, etc.) from images, and to label each visual concept with a short descriptive phrase.