The invention relates to a method for observation of a person in an industrial environment.
Present day industrial manufacturing processes in automobile production can generally be divided into fully automatic cycles that are carried out exclusively by machines, and completely manual cycles that are carried out exclusively by individual workers or a number of workers cooperating with one another. To date, the close cooperation between persons and machines, in particular industrial robots, has been greatly limited owing to safety aspects. A plurality of complicated and expensive safety systems such as, for example, metal fences, light barriers, laser scanners or combined systems are required in order to keep workers in the production environment away from potentially hazardous machines. These systems are incapable of detecting the exact location, the body posture or the movement behavior of the human. As soon as a worker approaches the robot, the latter is stopped and the production process is interrupted.
The missing “knowledge” of such safety systems that relates to the monitored production environment is particularly disadvantageous in that the manufacturing processes greatly profit from a close collaboration of human and machine. Whereas the human behaves flexibly and adaptively, but is inclined to make mistakes when carrying out repetitive work operations, machines operate quickly and exactly but in this case are static and not very flexible. For example, in the case of a completely automatic manufacturing unit consisting of a number of cooperating robots the production process must be stopped when a single one of the cooperating robots is defective. It would be desirable here to replace the defective robot temporarily by a human worker who cooperates with the remaining robots such that the production can be continued. Efficiency, flexibility and quality of industrial manufacturing can be raised considerably by close cooperation of humans and machines for the purpose of semi-automated processes.
Present day safety systems in the field of industrial production consist mostly of metal fences, light barriers and/or laser scanners. The first approaches are being made to securing robot protection zones on the basis of image processing, and these are described in detail in [1] and [2]. The method described in [1] uses stereo image analysis to detect whether an object is located in the protection zone of the robot, without in so doing extracting information about the nature of the object (for example human or object) or its movement behavior. In [2] a person is detected exclusively with the aid of the skin color of the hands, something which leads to problems with the reliability of detection in the case of inconstant lighting conditions (variable color temperature); the method described cannot be employed at all when working gloves are used. Just like the prior art set forth in [1], these methods do not extract any information about the type of the object. Again, in the case when a person is involved they do not detect the body parts and the movement behavior of said person. Such systems are therefore certainly capable of shutting down a robot when a person intrudes into its protection zone, but are incapable of detecting whether a collision is being threatened or whether human and machine are cooperating regularly and without any hazard in the case when a person is located in the immediate vicinity of the robot.
In accordance with the review article [3], in the field of the recognition of persons the appropriate approaches are divided into two-dimensional methods with explicit shape models, or no models, and into three-dimensional models. In [4], windows of different size are pushed over the initial image; the corresponding image regions are subjected to a Haar wavelet transformation. The corresponding wavelet coefficients are obtained by applying differential operators of different scaling and orientation to different positions of the image region. A small subset of the coefficients based on their absolute value and their local distribution in the image are selected “by hand” from this set of features, which can be very large in some circumstances. This reduced set of features is fed for classification to a support vector machine (SVM). For detection purposes, windows of different size are pushed over the image, and the corresponding features are extracted from these image regions; the SVM subsequently decides whether the corresponding window contains a person or not. In [5], temporal sequences of two-dimensional Haar wavelet features are combined to form high dimensional feature vectors, and these are classified with the aid of SVMs, thus resulting in a gain in detection performance by comparison with the pure individual image approach. In [6], the method of chamfer matching is applied to the detection of pedestrian contours in the scenario of road traffic using a non-stationary camera. In [7], the technique of chamfer matching is combined with a stereo image processing system and a neural network with local receptive fields in accordance with [8] which is used as a texture classifier in order to attain a reliable and robust classification result.
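By way of illustration only, the principle of chamfer matching mentioned in connection with [6] and [7] can be sketched as follows. This is a minimal sketch and not the implementation of the cited works: the function names `distance_transform`, `chamfer_score` and `best_match`, the use of a city-block distance and the exhaustive search over all offsets are assumptions made for this example.

```python
def distance_transform(edges):
    """Two-pass city-block distance from each pixel to the nearest edge pixel."""
    h, w = len(edges), len(edges[0])
    far = h + w  # larger than any city-block distance inside the image
    dist = [[0 if edges[y][x] else far for x in range(w)] for y in range(h)]
    # Forward pass: propagate distances from the top-left corner.
    for y in range(h):
        for x in range(w):
            if y > 0:
                dist[y][x] = min(dist[y][x], dist[y - 1][x] + 1)
            if x > 0:
                dist[y][x] = min(dist[y][x], dist[y][x - 1] + 1)
    # Backward pass: propagate distances from the bottom-right corner.
    for y in range(h - 1, -1, -1):
        for x in range(w - 1, -1, -1):
            if y < h - 1:
                dist[y][x] = min(dist[y][x], dist[y + 1][x] + 1)
            if x < w - 1:
                dist[y][x] = min(dist[y][x], dist[y][x + 1] + 1)
    return dist


def chamfer_score(dist, template_points, oy, ox):
    """Mean distance of the shifted template contour to the nearest image edge."""
    total = sum(dist[oy + dy][ox + dx] for dy, dx in template_points)
    return total / len(template_points)


def best_match(edges, template_points):
    """Slide the contour template over the edge image; return (score, row, col)."""
    dist = distance_transform(edges)
    h, w = len(edges), len(edges[0])
    th = max(dy for dy, _ in template_points) + 1
    tw = max(dx for _, dx in template_points) + 1
    best = None
    for oy in range(h - th + 1):
        for ox in range(w - tw + 1):
            score = chamfer_score(dist, template_points, oy, ox)
            if best is None or score < best[0]:
                best = (score, oy, ox)
    return best
```

A low score indicates that the template contour lies close to edges in the image; a practical system would additionally search over template scale and shape variants, which this sketch omits.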
Other methods use statistical shape models in order to detect and to track persons. Here, [9] concerns models that are obtained by means of a training phase and in which exemplary contours are described by positions of feature points. The parameter set is reduced by using a principal component analysis (PCA), thus resulting in a certain generalization ability in addition to a reduction in the computational outlay. This is useful when tracking such a deformable contour, for example that of a moving pedestrian, over time, since parameter sets inconsistent with the learning set are excluded from the outset. Not only the contours of whole persons but also those of a hand, together with the corresponding movements, can be detected. However, with this approach all the features must be present at all times, and for this reason no instances of masking are permitted. Furthermore, it cannot be ruled out that the parameterization determined by the training phase permits physically impossible states. The shape representation is given by B splines in [10]. Assuming a stationary camera, the person is segmented out from the background by difference image analysis; the tracking algorithm operates with Kalman filters.
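The tracking with Kalman filters mentioned for [10] can be illustrated, purely by way of example, for a single image coordinate. The sketch below assumes a constant-velocity motion model with state (position, velocity), a measurement of the position only, and simplified diagonal process noise; the function name `kalman_track` and the parameters `q` and `r` are hypothetical and are not taken from [10].

```python
def kalman_track(measurements, dt=1.0, q=1e-3, r=0.25):
    """Track position and velocity from a sequence of position measurements.

    Assumed model: state transition F = [[1, dt], [0, 1]], measurement
    matrix H = [1, 0], process noise q on the diagonal, measurement noise r.
    """
    x = [measurements[0], 0.0]      # initial state: first measurement, zero velocity
    P = [[1.0, 0.0], [0.0, 1.0]]    # initial state covariance
    estimates = []
    for z in measurements[1:]:
        # Prediction: x = F x, P = F P F^T + Q
        x = [x[0] + dt * x[1], x[1]]
        p00 = P[0][0] + dt * (P[1][0] + P[0][1]) + dt * dt * P[1][1] + q
        p01 = P[0][1] + dt * P[1][1]
        p10 = P[1][0] + dt * P[1][1]
        p11 = P[1][1] + q
        # Correction with the measured position z
        s = p00 + r                  # innovation covariance
        k0, k1 = p00 / s, p10 / s    # Kalman gain
        residual = z - x[0]
        x = [x[0] + k0 * residual, x[1] + k1 * residual]
        P = [[(1 - k0) * p00, (1 - k0) * p01],
             [p10 - k1 * p00, p11 - k1 * p01]]
        estimates.append((x[0], x[1]))
    return estimates
```

For a contour described by several parameters, the same prediction and correction steps would be applied to a correspondingly larger state vector.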
Elsewhere, the technique of color cluster flow is used [11] in order to detect persons in image sequences recorded with a moving camera. Even in the event of partial masking of the person, it is therefore possible to detect persons and track them over time very reliably. This detection stage is combined with the TDNN classification approach described in detail in [8].
Recent work on a complete real-time system for detecting pedestrians in road traffic scenes, consisting of a detection stage, a tracking stage and an object classification stage, is described in [12].
Another group of methods for detecting persons are model based techniques in which explicit prior knowledge about the appearance of persons is used in the form of a model. Since instances of masking of parts of the body are problematic in this case, many systems additionally assume prior knowledge about the type of the movements to be detected and the viewing angle of the camera. The persons are segmented out by subtraction of the background, for example, and this presupposes a stationary camera as well as a background which does not change, or changes only slowly. The models used consist, for example, of straight rods (“stick figures”), with individual body parts being approximated by ellipsoids [13-16].
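The segmentation by subtraction of the background presupposed by these methods can be sketched as follows. This is an illustrative sketch only, assuming gray scale frames given as nested lists of equal size; the names `foreground_mask` and `update_background`, the threshold and the adaptation rate `alpha` are assumptions of this example and do not stem from [13-16].

```python
def foreground_mask(frame, background, threshold=30):
    """Mark pixels whose gray value differs strongly from the background model."""
    return [[abs(p - b) > threshold for p, b in zip(frow, brow)]
            for frow, brow in zip(frame, background)]


def update_background(background, frame, mask, alpha=0.05):
    """Adapt the background slowly, but only where no foreground was detected."""
    return [[b if m else (1 - alpha) * b + alpha * p
             for p, b, m in zip(frow, brow, mrow)]
            for frow, brow, mrow in zip(frame, background, mask)]
```

The slow update reflects the requirement that the background does not change, or changes only slowly; a camera that moves would invalidate the background model.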
An example of the simultaneous use of the very different features of intensity, edges, distance and movement for the purpose of a multi-cue approach to the detection of persons standing or moving in a fashion aligned laterally to the camera is described in [17]. This approach is “object oriented” to the effect that for a specific application generic objects are defined (for example person, background, floor, light source) and associated methods are made available for detecting these objects in the image. If a few object properties are extracted from the image, the objects are instantiated such that it is possible subsequently to apply further, specialized methods.
Commercial systems for three-dimensional determination of the posture (location and arrangement of the body parts) of persons are based on the detection of markers applied to the body. A powerful method for marker-less three-dimensional determination of posture is described in [18].
A large portion of the work on detection of the posture of persons is concentrated on the 3D reconstruction of the hands. In [19], the hand is described by an articulated model with kinematic constraints, in particular with regard to physically possible joint angles. These constraints enable determination of the three-dimensional position, posture and movement of the hand. A method for detecting movement cycles of the hands (and gestures) that is based on a contour analysis, a tracking stage and a classifier, based on hidden Markov models (HMMs), for the movements is described in [20]. The GREFIT system described in [21] is capable of classifying the dynamics of hand postures on the basis of gray scale images with the aid of an articulated model of the hand. In a first stage, a hierarchical system of neural networks localizes the 2D position of the finger tips in the images of the sequence. In the second stage, a further neural network transforms these values into the best fitting 3D configuration of the articulated hand model. In [22], hand postures are detected directly by labeling corresponding images by means of a self-organizing map (SOM) and by subsequent training with the aid of a neural network.
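The classification of movement cycles by hidden Markov models, as mentioned for [20], can be illustrated with the forward algorithm. In the sketch below one model is assumed per gesture, and an observed symbol sequence is assigned to the model with the highest likelihood. The two gesture models, their probabilities and the symbol coding (0 for a movement to the left, 1 for a movement to the right) are hypothetical values chosen for this example.

```python
def forward_likelihood(obs, start, trans, emit):
    """Probability of the observation sequence under one HMM (forward algorithm)."""
    n = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[p] * trans[p][s] for p in range(n)) * emit[s][o]
                 for s in range(n)]
    return sum(alpha)


def classify(obs, models):
    """Return the name of the model that explains the observations best."""
    return max(models, key=lambda name: forward_likelihood(obs, *models[name]))


# Hypothetical two-state left-to-right models: (start, transition, emission).
TRANS = [[0.7, 0.3], [0.0, 1.0]]
MODELS = {
    "swipe_right": ([1.0, 0.0], TRANS, [[0.2, 0.8], [0.1, 0.9]]),
    "swipe_left": ([1.0, 0.0], TRANS, [[0.8, 0.2], [0.9, 0.1]]),
}
```

For example, `classify([1, 1, 1, 0, 1], MODELS)` selects the model whose emission probabilities favor symbol 1. For long sequences a practical implementation would work with scaled or logarithmic probabilities to avoid numerical underflow.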
A trajectory analysis that is based on a particle filter and which also includes symbolic object knowledge is used in [23] for the detection of “manipulative gestures” (hand movements that serve for gripping or displacing objects). This approach is extended in [24] in the context of human/robot interaction to the effect that the classification of the hand trajectory by a hidden Markov model is performed in combination with a Bayes network and a particle filter. An approach to the classification of building actions (for example assembly of parts) by an analysis of movement patterns with the aid of the particle filter approach is described in [25]. It is described in [26] how the results of the analysis of the hand movements are integrated with the aim of a more reliable object detection in an approach for detecting components composed from individual elements. In this context, [27] describes a view-based system in which objects are detected by means of neural networks that can be subsequently trained online, that is to say during the operating phase.
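The particle filter on which the trajectory analysis of [23] to [25] is based can be sketched for a one-dimensional hand position. The sketch is a generic bootstrap particle filter under stated assumptions (random-walk motion model, Gaussian measurement likelihood, resampling after every measurement); the function name `particle_filter` and all parameter values are hypothetical, and the symbolic object knowledge used in [23] is not modeled.

```python
import math
import random


def particle_filter(measurements, init_range=(0.0, 10.0), n_particles=500,
                    motion_sigma=0.5, meas_sigma=0.5, seed=0):
    """Estimate a 1D position over time from noisy position measurements."""
    rng = random.Random(seed)  # fixed seed so that the sketch is reproducible
    particles = [rng.uniform(*init_range) for _ in range(n_particles)]
    estimates = []
    for z in measurements:
        # Prediction: each hypothesis moves according to a random walk.
        particles = [p + rng.gauss(0.0, motion_sigma) for p in particles]
        # Weighting: Gaussian likelihood of the measurement for each hypothesis.
        weights = [math.exp(-0.5 * ((z - p) / meas_sigma) ** 2) for p in particles]
        total = sum(weights)
        if total == 0.0:  # all hypotheses implausible: fall back to equal weights
            weights = [1.0] * n_particles
            total = float(n_particles)
        # Estimate: weighted mean of the hypotheses.
        estimates.append(sum(w * p for w, p in zip(weights, particles)) / total)
        # Resampling: draw a new particle set in proportion to the weights.
        particles = rng.choices(particles, weights=weights, k=n_particles)
    return estimates
```

Because the filter maintains many hypotheses at once, it can represent ambiguous situations that a single Gaussian estimate cannot, at the price of a higher computational outlay.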
A method for 3D modeling of a person starting from 2D image data is described in [30]. Here, a multicamera system is used to acquire image data of a person, and body parts of the latter are identified in the 2D image data, in particular by means of template matching. The body parts thus identified are then modeled by dynamic template matching with the aid of 3D templates. The result of this is that the persons can be identified quickly and continuously even if they are partially masked, or temporarily cannot be acquired by the multicamera system. The detected persons are then tracked in the image data with the aid of a kinematic movement model and of Kalman filters.
An identification of persons and their body parts within image data transformed into 3D space is described in [31]. 3D voxel data are generated starting from the image data generated by a multicamera system. Proceeding therefrom, corresponding templates are matched to body parts by means of specific matching algorithms. Here, as well, reference is made to a kinematic body model as previously in the case of [30].
In addition to generation of 3D person models from 2D image data and general movement analysis, the contributions described in [32] additionally indicate a first approach to the analysis of the biometric behavior of the observed persons, in particular their gestures (“hand raising for signaling the desire to ask a question”).
The prior art described above shows that a plurality of methods based on image processing are known for the purpose of detecting persons in different complex environments, for detecting body parts and their movement cycles, and for detecting complex objects composed of individual parts and the corresponding assembly activities. The applicability of these algorithms is, however, frequently described only with the aid of purely academic applications.