Detailed reasoning about structures or objects in images is helpful in numerous computer vision applications. For example, it is often critical in the domain of autonomous driving to localize and outline all cars, pedestrians, and miscellaneous static and dynamic objects. For mapping, there is often a need to obtain detailed footprints of buildings and roads from aerial or satellite imagery, while medical and healthcare domains often require automatic methods to precisely outline cells, tissues and other relevant structures.
Neural networks are sometimes an effective way of inferring semantic and object instance segmentation information in challenging imagery. Often, the amount and variety of data that the networks see during training drastically affects their performance at run time. Collecting ground truth instance masks, however, may be extremely time-consuming; for example, human annotators may need to spend 20-30 seconds per object in an image.
As object instance segmentation may be time-consuming to annotate manually, several approaches seek to speed up this process using interactive techniques. In some approaches, scribbles are used to model the appearance of foreground and background, and segmentation is performed via graph-cuts. Some approaches use multiple scribbles on both the object and background, and have been used to annotate objects in videos.
In some approaches, scribbles are used to train convolutional neural networks (‘CNN’) for semantic image segmentation. In one approach, called GrabCut, 2D bounding boxes provided by an annotator are exploited, and pixel-wise foreground and background labeling is performed using expectation maximization (‘EM’). In some approaches, GrabCut is combined with convnets to annotate structures in imagery. In some approaches, pixel-wise segmentation of cars is performed by exploiting 3D point clouds inside user-provided 3D bounding boxes.
Many approaches to object instance segmentation operate on the pixel level. Many rely on object detection, and use a convnet over a box proposal to perform the labeling. In some works, however, a polygon is produced around an object. Some approaches first detect boundary fragments, followed by finding an optimal cycle linking the boundaries into object regions. Some approaches produce superpixels in the form of small polygons which are further combined into an object.
In some approaches, polygon object representation has been introduced as an alternative to labeling each individual pixel. One benefit of polygon object representation is that it is sparse; only a few vertices of a polygon represent large image regions. This may, for example, allow a user to easily introduce corrections by adjusting incorrect vertices. A recurrent neural network (‘RNN’) may further provide a strong model as it captures a non-linear representation of shape, thus effectively capturing typical shapes of objects. This may be particularly important in ambiguous cases such as imagery containing shadows and saturation.
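The sparsity and ease of correction described above can be illustrated with a minimal Python sketch. It is not taken from any particular annotation tool: the function names (`polygon_area`, `rasterize`, `correct_vertex`), the image size, and the example coordinates are all assumptions chosen for illustration. The sketch shows a four-vertex polygon standing in for several thousand labeled pixels, and a correction applied by moving a single vertex.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def polygon_area(vertices: List[Point]) -> float:
    """Area enclosed by a simple polygon (shoelace formula)."""
    twice_area = 0.0
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]
        twice_area += x0 * y1 - x1 * y0
    return abs(twice_area) / 2.0


def point_in_polygon(x: float, y: float, vertices: List[Point]) -> bool:
    """Even-odd ray casting test for a single point."""
    inside = False
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y0 > y) != (y1 > y):
            x_cross = x0 + (y - y0) * (x1 - x0) / (y1 - y0)
            if x < x_cross:
                inside = not inside
    return inside


def rasterize(vertices: List[Point], width: int, height: int) -> List[List[int]]:
    """Convert a polygon to a pixel-wise mask by testing each pixel center."""
    return [
        [1 if point_in_polygon(col + 0.5, row + 0.5, vertices) else 0
         for col in range(width)]
        for row in range(height)
    ]


def correct_vertex(vertices: List[Point], index: int, new_point: Point) -> List[Point]:
    """Return a copy of the polygon with one vertex moved (a user correction)."""
    corrected = list(vertices)
    corrected[index] = new_point
    return corrected


if __name__ == "__main__":
    # Hypothetical outline: 4 vertices in a 100 x 80 pixel image.
    outline = [(10.0, 10.0), (90.0, 10.0), (90.0, 60.0), (10.0, 60.0)]
    mask = rasterize(outline, width=100, height=80)
    labeled_pixels = sum(sum(row) for row in mask)
    print(len(outline), "vertices describe", labeled_pixels, "pixels")

    # A single vertex edit changes the whole region it bounds.
    fixed = correct_vertex(outline, 2, (90.0, 70.0))
    print("area before:", polygon_area(outline), "after:", polygon_area(fixed))
```

In this sketch, the correction touches one coordinate pair, whereas the equivalent fix on a pixel-wise mask would require relabeling every pixel in the affected region.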
For example, Polygon-RNN is a conceptual model for semi-automatic and interactive labeling to help speed up object annotation. Instead of producing pixel-wise segmentation of an object, as is done in some interactive tools such as GrabCut, Polygon-RNN predicts the vertices of a polygon that outlines the object. Polygon representation may provide several benefits: it is sparse, with only a few vertices representing regions with a large number of pixels; it may be easier for an annotator to interact with; and the model may be able to directly take annotator inputs to re-predict a better polygon that is constrained by the corrections. In some embodiments, polygon representation models have shown high annotation speed-ups on autonomous driving datasets.
Improved polygon representation models may further reduce annotation time, improve neural network learning from polygon representation models, and increase the output resolution of polygons.