The exemplary embodiment relates to video analysis and finds particular application in connection with a system and method for initialization of virtual worlds based on real-world data. The system and method find application in Multi-Object Tracking (MOT), which entails automatically detecting and tracking objects, such as cars, in real-world and synthetic video streams.
Assessing performance on data not seen during training is often used to validate machine learning models. In computer vision, however, experimentally measuring the actual robustness and generalization performance of high-level recognition methods is difficult in practice, especially in video analysis, due to high data acquisition and labeling costs. Furthermore, it is sometimes difficult to acquire data for some test scenarios of interest, such as in inclement weather or for a scene of an accident.
The acquisition of large, varied, representative, unbiased, and accurately labeled visual data often entails complex and careful data acquisition protocols which can raise privacy concerns. In addition, it is time-consuming and requires expensive annotation efforts. Crowdsourcing may be used to obtain the annotations. See, for example, J. Deng, et al., “ImageNet: A large-scale hierarchical image database,” Computer Vision and Pattern Recognition (CVPR), pp. 248-255 (2009). For accurate results, this process often has to be done for every task and data source of interest, with little potential for re-usability.
Existing methods of video understanding are particularly prone to these challenges, due to the vast amount of data and the complexity of the computer vision tasks of interest (e.g., multi-object tracking or action recognition), which often involve detailed annotations (e.g., tracking of various objects through a sequence of frames). Although cheaper annotation processes, or even omitting annotation altogether, might be used during training via weakly-supervised or even unsupervised learning, experimentally evaluating model performance entails accurate labeling of large datasets.
These limitations of current video analysis methods are illustrated, for example, by a widely-used multi-object tracking dataset known as KITTI, which is a project of Karlsruhe Institute of Technology and Toyota Technological Institute at Chicago. See, Geiger, et al., “Are we ready for autonomous driving? The KITTI vision benchmark suite,” Computer Vision and Pattern Recognition (CVPR), pp. 3354-3361 (2012), hereinafter, “Geiger 2012”. The KITTI vision suite contains only 29 test sequences, captured in similarly good conditions and from a single source. This and other existing datasets in computer vision generally do not contain a sufficient variety of conditions for properly assessing the performance of video analysis algorithms. Varying conditions (e.g., day, night, sun, rain), multiple detailed object class annotations (e.g., persons, cars, license plates), and different camera settings are some of the factors which affect video analysis and should be considered in assessing algorithms.
The use of synthetic data in computer vision has been evaluated in low-level video analysis tasks, such as optical flow estimation, which entail costly pixel-level annotations. One example of using computer-generated videos to benchmark low-level vision algorithms is described in D. J. Butler, et al., “A naturalistic open source movie for optical flow evaluation,” ECCV, Part VI, LNCS 7577, pp. 611-625, 2012, hereinafter, “Butler 2012”. In practice, however, the use of synthetic data in existing computer vision methods faces two major limitations. First, data generation is itself costly, as it entails creating an animated video from scratch. In addition to the financial costs, the expertise and time involved in making such videos do not allow the creation of large amounts of data. This makes it difficult to generate specific scenes at the request of a client to test a particular scenario of interest. Recording scenes from humans playing video games is an alternative approach, but it faces similar time costs and limits the variety of scenes. See, e.g., J. Marín, et al., “Learning appearance in virtual scenarios for pedestrian detection,” Computer Vision and Pattern Recognition (CVPR), pp. 137-144, 2010, hereinafter, “Marín 2010”.
The second limitation of existing approaches for generating and using synthetic data in video analysis lies in its usefulness as a proxy to assess real-world video analysis performance. Existing synthetic datasets often include both a particular training set and test set, thus only allowing the assessment of performance in that particular virtual world. In addition, previous studies have indicated that existing computer graphics techniques do not ensure that low-level computer vision algorithms performing well in virtual worlds would also perform well on real-world sequences. See, e.g., T. Vaudrey, et al., “Differences between stereo and motion behaviour on synthetic and real-world stereo sequences,” Image and Vision Computing New Zealand, (IVCNZ), pp. 1-6, 2008.
Some existing computer vision approaches have used synthetic images for training data augmentation. These existing approaches are mainly limited to using rough synthetic models or synthesized real examples for learning. See, e.g., the following models for pedestrian detection: A. Broggi, et al., “Model-based validation approaches and matching techniques for automotive vision based pedestrian detection,” CVPR Workshops, p. 1, 2005; M. Stark, et al., “Back to the future: Learning shape models from 3D CAD data,” BMVC, Vol. 2, No. 4, p. 5, 2010; Marín 2010; and D. Vázquez, et al., “Unsupervised domain adaptation of virtual and real worlds for pedestrian detection,” ICPR, pp. 3492-3495, 2012.
Recent approaches have attempted to study whether appearance models of pedestrians in a virtual world can be learned and used for detection in the real world. See, e.g., D. Vázquez, et al., “Virtual and real world adaptation for pedestrian detection,” IEEE Trans. on Pattern Analysis and Machine Intelligence (PAMI), 36 (4), 797-809, 2014; and H. Hattori, et al., “Learning scene-specific pedestrian detectors without real data,” CVPR, pp. 3819-3827, 2015, which aimed to learn high-quality pedestrian detectors without real data. However, the learned detectors are scene and scene-location specific. In practice, this method is limited to a fixed camera, and involves knowledge of the scene geometry and camera calibration parameters.
Photo-realistic imagery has been used for evaluation purposes, but in most cases, the end-task was low-level image and video processing. One approach evaluated low-level image features. See, B. Kaneva, et al., “Evaluation of image features using a photorealistic virtual world,” ICCV, pp. 2282-2289, 2011. Another work proposed a synthetic dataset for optical flow estimation. See, Butler 2012. Photo-realistic imagery has also been used for basic building blocks of autonomous driving. See, Chenyi Chen, et al., “DeepDriving: Learning affordance for direct perception in autonomous driving,” Technical Report, 2015. These approaches view photo-realistic imagery as a way of obtaining ground truth that cannot be obtained otherwise (e.g., no sensor can measure optical flow directly). When ground-truth data can be collected, for example, through crowdsourcing, real-world imagery is favored over synthetic data because of the artifacts the latter may introduce.
The ability of virtual worlds to generate endless quantities of varied video sequences on-the-fly would be especially useful for assessing model performance, which is invaluable for real-world deployment of applications relying on computer vision algorithms. Thus, there remains a need for the automatic generation of arbitrary photo-realistic video sequences with ground truth to assess the degree of transferability of experimental conclusions from synthetic data to the real world.