One of the most challenging aspects of generating training data is that the training data should resemble the underlying distribution of “real-world data.” “Real-world data” refers to data that is similar to what a user is trying to match when the user is presented with documents or images on a screen.
The outcome of the service is only as good as the trained model. Better or more comprehensive training data allows for the creation of a better (e.g., more accurate or realistic) model, because the model is only as “smart” as the data used to train it. It is therefore important to improve the training data generation process. Training data should satisfy two requirements: (i) comprehensiveness, i.e., having richly tagged real-world images that are captured in a wide spectrum of uncontrolled environments (e.g., arbitrary poses, textures, backgrounds, occlusion, illumination), so that the model is proficient at handling a diverse array of image requests from customers during production; and (ii) scale, i.e., having large amounts of such tagged real-world images, so that the model is adequately trained. Such training data is in short supply because collecting and tagging real-world images is tedious, time-consuming, and error-prone.