Recent years have seen the emergence of machine-learning (ML) models for detection and identification of real-world objects and scenes from still images and video sequences. These models provide the core logic of ubiquitous applications including, for example, robotic industrial systems, autonomous driving systems, and security camera systems.
The state of the art for training such ML models to identify real-world elements relies on obtaining diverse real-world data corresponding to the real-world elements, and on classifying and localizing those elements in a manner that is sufficiently generalized to correctly assign newly presented elements to their appropriate classes. For example, stop road-signs may vary in appearance parameters such as orientation, color, size, and lighting, but must always be identified by an ML model of an autonomous vehicle as stop road-signs.
Data for ML model training may originate from, for example, real-world video footage or discrete images, or from generated synthetic data.
Real-world video footage and discrete images provide wide diversity for training ML models, but require extensive labor, including meticulous tagging of each substantial element in the images. This process is therefore slow and tedious. Moreover, training through real-world imagery does not systematically cover every aspect of the appearance of the real-world element. Pertaining to the above example, a stop road-sign may be designed differently in various geographic territories and may appear differently in various lighting conditions.
In contrast, generated synthetic data may be used to train ML models without the need for human intervention in tagging different elements, so that training may be accomplished more quickly than with real-world imagery. For example, a processor may generate a synthetic image that includes a stop road-sign and present the image to the ML model for training.
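By way of illustration only, the following minimal sketch shows why synthetic data may require no manual tagging: because the processor places the element in the image, the class and location of the element are known by construction. The function name `make_synthetic_sample`, the plain red square standing in for a stop road-sign, the noise background, and the label layout are all hypothetical and are not taken from any particular system.

```python
import random

# Assumed stand-in colour for the sign; a real generator would render a sign model.
SIGN_COLOR = (200, 0, 0)


def make_synthetic_sample(rng, height=64, width=64):
    """Return (image, label); the label is known without human tagging."""
    # Random-noise background stored as rows of (r, g, b) tuples.
    image = [
        [(rng.randrange(256), rng.randrange(256), rng.randrange(256))
         for _ in range(width)]
        for _ in range(height)
    ]
    # Choose a random size and position for the stand-in sign.
    side = rng.randint(8, 24)
    top = rng.randint(0, height - side)
    left = rng.randint(0, width - side)
    for y in range(top, top + side):
        for x in range(left, left + side):
            image[y][x] = SIGN_COLOR
    # The tag is produced together with the image: (x, y, width, height).
    label = {"class": "stop_sign", "bbox": (left, top, side, side)}
    return image, label


rng = random.Random(0)
image, label = make_synthetic_sample(rng)
print(label["class"], label["bbox"])
```

In this sketch every generated sample carries the same single sign design, which also hints at the specificity problem discussed next.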
One disadvantage of training ML models by synthetic images is that it tends to create real-world element classes that are very specific, according to the elements integrated within the generated images. Pertaining to the stop sign example: only specific designs of stop road-signs that have been integrated within the synthetic image will be included in a “stop sign” class of elements, excluding images of real-world stop signs of other designs from this class.
Another disadvantage of training ML models by synthetic images is that it tends to include irrelevant information in the training. Pertaining to the same example: a synthetic, repetitive background may be associated by the ML model with the appearance of a stop sign in the generated synthetic image and may erroneously influence the classification of stop signs.
Therefore, a system and a method are desired for training ML models to identify real-world elements from images or video sequences in a manner that is quick, generalized, and not labor intensive, and that also takes into account the relevance of specific segments within the image or video sequence.