Object detection is an important problem in a variety of engineering and scientific disciplines such as computer vision, artificial intelligence, and biometrics. For example, in many industrial settings today, robots are used for parts assembly and manufacturing. These robots are equipped with one or more cameras, e.g., CCD or CMOS, which give them vision. Often, objects (i.e., parts) are contained in a bin. The robot must recognize the object/part in the bin so it can pick it up to assemble the product. However, the object can be in any number of poses (position, orientation, rotation), under various lighting conditions, etc. So, the robot must be trained to recognize the part regardless of its pose and environment. As is known in the art, robots include software that attempts to identify the object from the camera image. Statistical learning and classification have been successfully used for some such object detection applications.
In a real-world environment, the appearance of the object changes dramatically due to changes in view perspective, illumination, or deformation. As such, a single classifier cannot effectively detect objects whose appearance is subject to many changes. Classifier networks are general solutions based on the divide-and-conquer concept. The classifier networks must be trained to properly classify (detect, recognize) the particular object(s) of interest, such as an assembly line part. Generally, the process starts with an untrained network. A training pattern (e.g., images of the object in various poses and lighting conditions and possibly false target images) is presented to the network. The image signals are passed through the network to produce an output (for example, the result of classification, detection, or measurement). The output results are evaluated and compared to optimal results, and any differences are treated as errors. The error can be expressed as a function of weights that are assigned to features of the object image, for example. Some features are better than others for recognizing the object and may be assigned a greater weight. The weights are iteratively adjusted to reduce the error and thus give greater confidence in the classification network. It is desirable to automatically train a classification network with minimum error, time, and effort.
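By way of illustration only, the iterative weight adjustment described above may be sketched as follows. The sketch assumes a single logistic unit trained by gradient descent, which is merely one simple stand-in for a classifier network and is not taken from any particular system; the function name `train_linear_classifier`, the learning rate, and the epoch count are hypothetical.

```python
import math


def train_linear_classifier(samples, labels, learning_rate=0.5, epochs=200):
    """Iteratively adjust feature weights to reduce classification error.

    samples: list of feature vectors (lists of floats), an assumed input format.
    labels:  list of 0/1 targets (1 = object of interest, 0 = false target).
    Returns the weights, the bias, and the mean squared error after each pass.
    """
    n_features = len(samples[0])
    n_samples = len(samples)
    weights = [0.0] * n_features  # untrained starting point
    bias = 0.0
    history = []

    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        squared_error = 0.0
        for x, target in zip(samples, labels):
            # Pass the features through the unit to produce an output in (0, 1).
            z = bias + sum(w * xi for w, xi in zip(weights, x))
            output = 1.0 / (1.0 + math.exp(-z))
            # Difference between the actual and the desired output.
            diff = output - target
            squared_error += diff * diff
            # Gradient of the cross-entropy loss with respect to each weight.
            for i, xi in enumerate(x):
                grad_w[i] += diff * xi
            grad_b += diff
        # Move each weight a small step in the direction that lowers the loss.
        weights = [w - learning_rate * g / n_samples
                   for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n_samples
        # The squared error is tracked for reporting only.
        history.append(squared_error / n_samples)

    return weights, bias, history
```

In this sketch, a feature that reliably separates the object from false targets ends up with a weight of larger magnitude, which corresponds to the notion above that some features are better than others and may be weighted more heavily.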
In order to recognize an object in the images, the computer vision system should be initially trained on a digital representation of that object. Such a digital representation involves modelling the object and generating a descriptor (or classifier) that can be applied to any image during runtime to find the target object. Creating or selecting the appropriate classifier, as well as tuning that classifier to ensure its robust performance during runtime, is driven by the application scenario, which can be acquired from (1) explicit user input, (2) an existing geometric model (such as a CAD model), and (3) a set of images captured in the target environment.
The images used for training and evaluation of the vision solution should represent possible appearances of the object in a real environment. If the classifier can recognize the target object in the evaluation images, it should be able to successfully find it in any image during runtime. In reality, capturing representative images to ensure that the system will perform reliably in normal operation is a great challenge, and in most cases it is not practical to obtain them. Therefore, it takes a lot of human intuition and multiple interactions with the vision system users to address possible variations of the environment, such as noise, occlusions, and lighting variations, and to create and tune the solution. In many cases, the users are not able to describe the factors affecting the performance of their application in terms that could be effectively used for vision solution development and tuning. As a result, the researcher or image processing engineer has to modify the classifier or tune its parameters a number of times, based on failure cases during system setup or operation.
Similar problems (difficulty in obtaining the user's prior knowledge and in collecting images that represent the problem across various environmental conditions) exist in other computer vision applications. For example, in machine vision inspection systems it is not always necessary to detect and recognize objects, but it is necessary to find abnormalities relative to a nominally good image. Variations of such abnormalities in the product under inspection are compounded by variations in ambient lighting, and, as a result, it is difficult for the users to define the requirements for their system. Consequently, it takes a number of iterations between the user and the vision system developer to create a vision solution and tune it for the required balance between falsely detected defects (false positives) and missed defects (false negatives).
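The balance between false positives and false negatives mentioned above may be illustrated with a minimal sketch. It assumes, purely for illustration, that the inspection system assigns each image a numeric defect score and flags the image when the score meets a threshold; the function names `count_errors` and `sweep_thresholds` and the score scale are hypothetical.

```python
def count_errors(scores, is_defect, threshold):
    """Return (false_positives, false_negatives) for one threshold.

    scores:    assumed defect scores, one per inspected image.
    is_defect: ground truth flags, True where the image shows a real defect.
    An image is flagged as defective when its score is >= threshold.
    """
    false_positives = sum(
        1 for s, d in zip(scores, is_defect) if s >= threshold and not d
    )
    false_negatives = sum(
        1 for s, d in zip(scores, is_defect) if s < threshold and d
    )
    return false_positives, false_negatives


def sweep_thresholds(scores, is_defect, thresholds):
    """Map each candidate threshold to its (false_positives, false_negatives)."""
    return {t: count_errors(scores, is_defect, t) for t in thresholds}
```

Raising the threshold in such a scheme generally reduces falsely detected defects at the cost of more missed defects, and lowering it does the reverse, which is why the acceptable operating point has to be agreed with the user.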
Object detection algorithms typically require large datasets to adequately train the classifier network. In these datasets it is often necessary to have both positive and negative sample images. It is also necessary for samples that include the object to have been labelled with ground truth attributes (e.g., location, orientation, pose, etc.). These visual ground truth annotations are usually entered manually by an operator who observes the object when its image is taken by the camera.
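As a non-limiting sketch, one labelled sample of such a dataset might be represented as shown below. The record layout, the field names (`image_path`, `contains_object`, `location_px`, `orientation_deg`), and the units are assumptions made for illustration and do not reflect any particular dataset format.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class LabelledSample:
    """One training image together with its ground truth annotations."""

    image_path: str
    contains_object: bool  # True for a positive sample, False for a negative one
    location_px: Optional[Tuple[float, float]] = None  # assumed (x, y) in pixels
    orientation_deg: Optional[float] = None  # assumed in-plane rotation in degrees

    def __post_init__(self):
        # Positive samples must carry at least a ground truth location.
        if self.contains_object and self.location_px is None:
            raise ValueError("positive samples need a ground truth location")
        # Negative samples show no object, so they carry no object attributes.
        if not self.contains_object and (
            self.location_px is not None or self.orientation_deg is not None
        ):
            raise ValueError("negative samples must not carry object attributes")


def split_samples(
    samples: List[LabelledSample],
) -> Tuple[List[LabelledSample], List[LabelledSample]]:
    """Separate a dataset into its positive and negative samples."""
    positives = [s for s in samples if s.contains_object]
    negatives = [s for s in samples if not s.contains_object]
    return positives, negatives
```

Each field of a positive record corresponds to an attribute that an operator would otherwise have to enter by hand for every image.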
In general, the larger the dataset, the better the algorithm may be trained, which in turn leads to better detection results. However, large datasets require a long time to gather and are often impractical to assemble manually, since it can take days or weeks of labor to acquire and label the required number of images.