The invention relates to an automatic learning method for the automatic learning of the forms of appearance of objects in images in the form of object features from training images for using the learned object features in an image processing system and to a device for carrying out the method.
Such an image processing system can here comprise an object detection system, an object tracking system or an image recording system.
The purpose of object detection systems is the location and classification of objects (for example, vehicles or persons) in digital images. They are used, for example, in motor vehicles, where the surroundings and particularly the area in front of the motor vehicle need to be investigated for objects such as other vehicles or pedestrians, or in the robotics sector, where a freely movable robot is to search its surroundings for certain objects.
The purpose of object tracking systems is the retrieval of an object (for example, of a vehicle or of a person) in an image of an image sequence, with the prerequisite that its position, dimensions and form of appearance in one or more previous images of the image sequence are known.
The purpose of image recording systems is the determination of image transformations (for example, translations) between two images which, when applied, bring the images into overlap. For example, methods for generating panoramic images bring the overlapping areas of two images into alignment in order to generate a combined image (so-called stitching). The necessary transformation information can be determined from the relative positions of the image contents in the two images.
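As an illustration of determining a translation between two images, the following minimal sketch searches exhaustively over small integer shifts and keeps the one with the lowest mean absolute difference in the overlapping area. The exhaustive search, the function name `estimate_translation`, the parameter `max_shift` and the synthetic test scene are assumptions made for this example; they are not taken from the method described here.

```python
def estimate_translation(ref, other, max_shift):
    """Return (dy, dx) such that other[y][x] best matches ref[y + dy][x + dx].

    ref and other are equally sized 2D lists of gray values. All integer
    shifts in [-max_shift, max_shift] are tried (illustrative brute force).
    """
    rows, cols = len(ref), len(ref[0])
    best = None
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            total, count = 0, 0
            for y in range(rows):
                for x in range(cols):
                    ry, rx = y + dy, x + dx
                    # Only pixels inside the overlap of both images contribute.
                    if 0 <= ry < rows and 0 <= rx < cols:
                        total += abs(other[y][x] - ref[ry][rx])
                        count += 1
            if count == 0:
                continue
            score = total / count
            if best is None or score < best[0]:
                best = (score, dy, dx)
    return best[1], best[2]


# Synthetic scene (hypothetical data): two 8x8 crops offset by 1 row and 2 columns.
scene = [[(7 * y + 13 * x + 3 * y * x) % 17 for x in range(12)] for y in range(12)]
ref = [row[2:10] for row in scene[2:10]]
other = [row[4:12] for row in scene[3:11]]

shift = estimate_translation(ref, other, max_shift=2)
print(shift)  # (1, 2)
```

The recovered shift is the transformation information that would be needed to place the second crop correctly over the first one.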
The methodology of the supervised automatic learning of an object detection system uses a preferably large number of annotated training images which contain both the objects to be learned and their image backgrounds. An image area around an image position at which an object to be learned is located in the training image is referred to as a positive training example, and it is annotated positively. Image areas in the training image in which no objects to be learned are located (that is, in the image background) are referred to as negative training examples (negative annotation).
During the training of the object detection system, positive and negative training examples from the training images are used in order to learn object features that separate object and background as unambiguously as possible. The resulting learned object features are used in the object detection system so that the learned object can be detected in arbitrary images (images not seen during training).
A basic problem here is the required processing of a preferably large number of positive and negative training examples, which is needed for the acquisition of the possibly multifaceted forms of appearance of backgrounds and objects. For example, let us assume a training image of size 1000×1000 pixels, in which an object of size 100×100 pixels is located. While in this case exactly one positive training example is given, (1000−100+1)×(1000−100+1)−1=811,800 usable negative training examples of size 100×100 pixels are contained in the training image, which overlap in the image plane.
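The count in the example above can be checked with a few lines of code. The following is a minimal sketch; the function name `count_training_windows` and its parameters are chosen for illustration, and it assumes, as the example does, that every window position at a stride of one pixel other than the annotated one counts as a negative training example.

```python
def count_training_windows(image_w, image_h, win_w, win_h, num_positive=1):
    """Count window positions at a stride of one pixel.

    Returns (total, negatives), where negatives is the number of positions
    that are not annotated as positive (assumption of this sketch).
    """
    if win_w > image_w or win_h > image_h:
        return 0, 0
    total = (image_w - win_w + 1) * (image_h - win_h + 1)
    return total, total - num_positive


total, negatives = count_training_windows(1000, 1000, 100, 100)
print(total, negatives)  # 811801 811800
```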
The ability to process a large number of training examples is therefore of great interest both from a functional point of view (training on a larger variance of forms of appearance) and from an operational point of view (expense in terms of time and processing technology).
In object tracking systems, the annotated training images consist of the images of an image sequence in which the position, dimensions and form of appearance of the object to be tracked are already known from previous images of the image sequence or have been annotated. An initial annotation can be made, for example, by a user (marking of the object to be tracked), by an object detection system, or by the detection of moving objects. Since, in an object tracking system, positive annotations (positive training examples) are available only from the previous images of the image sequence, and thus only in small numbers, such a system benefits all the more from the rapid learning of many negative annotations (object backgrounds, negative training examples). These negative annotations carry particularly high information content because the object backgrounds differ little from image to image. By comparison, an object detection system must often be trained against negative annotations (object backgrounds) that are not necessarily identical to the object backgrounds occurring in operational use.
For recording two images in an image recording system, one of the two images is interpreted as a training image, and the other as a test image. The number and position of the positive annotations in the training image have to be chosen specifically for the recording task and for the transformation information to be determined. For example, for generating panoramic images, one or more positive annotations at fixed positions in the expected overlap area of two images are selected (for example, at the right image margin). The rest of the image is considered to have a negative annotation. Alternatively, positive annotations can be generated by manual or automatic determination of prominent image areas, i.e., image areas that are particularly suitable for retrieval in the test image (for example, highly structured image areas). If more than two images (for example, an image sequence) are to be recorded together, positive and negative annotations can be selected accordingly in more than one image of the sequence (in the sense of several training images).
Although, in contrast to object detection systems and object tracking systems, the aim of image recording systems is the retrieval of general image contents (not necessarily objects) in different images, the term objects is used below for the sake of simplicity. Accordingly, objects denote image contents that are to be located in images without being confused with other image contents.
The prior art consists of the explicit generation of a large number of positive and negative training examples in the form of feature data vectors and their explicit processing in an automatic learning approach (for example, a support vector machine or a neural network).
The conventional methods solve this problem in a discretized form. Individual training examples are extracted discretely from the areas determined by the annotation images and converted to individual feature data vectors. Since, as a result of overlap in the image plane, a large number of such training data vectors can be obtained from a single feature image, typically only a small subset is selected in this step in order to reduce the computational expense. The general validity of the object feature contributions that can be obtained from a training image in a single process step is consequently limited.
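The conventional discretized procedure can be sketched roughly as follows. This is an illustrative example only: the function name `sample_feature_vectors`, the use of raw pixel values as features, the uniform random choice of negatives and the toy image are all assumptions of this sketch and are not prescribed by the prior art described above.

```python
import random


def sample_feature_vectors(image, annotation, win, num_samples, seed=0):
    """Extract one positive and a random subset of negative training vectors.

    image:       2D list of gray values (stand-in for a feature image)
    annotation:  (row, col) of the top-left corner of the positive window
    win:         side length of the square window
    num_samples: number of negative windows to keep
    Returns a list of (feature_vector, label) with label +1 or -1.
    """
    rows, cols = len(image), len(image[0])
    # Every window position at a stride of one pixel.
    positions = [(r, c) for r in range(rows - win + 1)
                 for c in range(cols - win + 1)]
    negatives = [p for p in positions if p != annotation]

    # Keep only a small subset of negatives to limit the computation.
    rng = random.Random(seed)
    chosen = rng.sample(negatives, min(num_samples, len(negatives)))

    def vec(r, c):
        # Flatten the window row by row into a feature data vector.
        return [image[r + dr][c + dc] for dr in range(win) for dc in range(win)]

    examples = [(vec(*annotation), +1)]
    examples += [(vec(r, c), -1) for r, c in chosen]
    return examples


# Toy 6x6 image (hypothetical data) with a 2x2 object annotated at (2, 2).
image = [[r * 6 + c for c in range(6)] for r in range(6)]
examples = sample_feature_vectors(image, (2, 2), win=2, num_samples=5)
print(len(examples))   # 6
print(examples[0])     # ([14, 15, 20, 21], 1)
```

Of the 24 available negative windows in the toy image, only 5 reach the learning step, which illustrates why the information drawn from a single training image remains limited in this approach.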