1. Field of the Invention
The present invention generally relates to techniques for recognizing a target within an image sequence, and more particularly to a method and an equipment for extracting image features from the image sequence which describes a time sequence of frames of the image.
The image sequence refers to an image which is obtained from a video camera, weather radar equipment, remote sensing or the like, for purposes such as monitoring people, traffic and the like, controlling fabrication processes, and analyzing or predicting natural phenomena such as the weather.
2. Background Art
Local (for example, several tens to several hundreds of km²) and short-term (for example, 5 minutes to several hours) precipitation phenomena such as heavy rain, heavy snow and thunderstorms have yet to be elucidated completely. However, the effects of the local and short-term precipitation phenomena on daily lives and various industrial activities are large, and it is an important task to predict the precipitation phenomena.
Conventionally, in order to forecast such local precipitation phenomena, an expert such as a meteorologist visually specifies the phenomena from an observed weather radar image and creates a weather forecast. In addition, the weather forecast is created by analyzing a motion of an echo pattern within a weather radar image, and referring to a predicted echo image which is obtained by predicting a future echo pattern. The former prediction is based on the regularity of the weather phenomena acquired by the expert from past experiences, and requires years of skill. On the other hand, according to the latter prediction using image analysis, it is assumed in most cases that the phenomenon of the immediately preceding several hours is maintained, and it is thus impossible to follow a rapid change in the phenomenon, even though such a rapid change is what the forecast is most expected to predict. Furthermore, because it is impossible to satisfactorily represent the phenomena such as an accurate moving velocity, appearance, disappearance, deformation and the like of a precipitation region, there is a problem in that the prediction accuracy is insufficient.
Accordingly, as one method of making an improvement with respect to the above described problem, it is conceivable to utilize a repeatability of the weather phenomena that "similar weather phenomena occur repeatedly", and to automatically retrieve past weather radar images with similar phenomena based on the weather radar image, so as to present the similar past weather radar images to the expert. Alternatively, it is conceivable to categorize the weather radar images into categories of the weather phenomena, and to select and apply a prediction technique suited for each specified weather phenomenon. In order to realize such methods, it is necessary to extract an image feature value (hereinafter also simply referred to as an image feature) from the weather radar image, which is image sequence data.
Conventionally, as methods of extracting the image feature of the image sequence, texture analysis techniques which obtain the features of a texture within a still image, and motion estimation techniques which obtain a displacement quantity of the image pattern between frames of the image sequence have been proposed.
For example, Robert M. Haralick, "Statistical and Structural Approaches to Texture", Proceedings of the IEEE, Vol.67, No.5, May 1979 proposes a statistical texture analysis which is one approach of the conventional texture analysis technique. According to this statistical texture analysis, a statistic such as "a frequency of existence of a combination of a certain pixel and another pixel located 3 pixels to the right of the certain pixel having a luminance difference of 1 between the certain pixel and the other pixel" is calculated, and the image features are extracted. This statistical texture analysis is used to detect a difference in two-dimensional image features such as a pattern (called "texture") on the image surface obtained by a repetition of basic graphic elements. More particularly, a set of basic elements called primitives is first obtained from the image of 1 frame of the image sequence by a process such as image binarization. Next, a spatial feature such as directionality is calculated from statistics such as the direction and length of an edge of each primitive. In addition, the spatial feature such as the regularity of the above described repetition of the primitives is calculated from relative position vectors among the primitives.
The image feature proposed by Robert M. Haralick referred to above includes a feature value which is defined from a co-occurrence matrix of the image gray level. The co-occurrence matrix is a matrix having as its element a probability $P_\delta(i, j)$, $(i, j = 0, 1, \ldots, n-1)$, that a point which is separated by a constant displacement $\delta = (r, \theta)$ from a point having a gray level (or brightness or intensity) $i$ in the image has a gray level $j$. For example, feature values such as those described by the following formulas (0.1) and (0.2) can be calculated from the co-occurrence matrix, where $\delta$ is set to $r = 1$, $\theta = 0$ (deg), for example. ##EQU1##
The angular second moment described by the formula (0.1) represents the concentration and distribution of the elements of the co-occurrence matrix, and makes it possible to measure the uniformity of the texture. Such a feature value is used to analyze geographical features from aerial photographs and images of sandstone. However, in general, it is in many cases unclear what the feature value obtained from the co-occurrence matrix physically measures.
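As an illustration only, the co-occurrence matrix and the angular second moment can be sketched in Python with NumPy as follows. The formula images (0.1) and (0.2) are not reproduced in this text, so the sketch assumes the usual definition of the angular second moment as the sum of the squared elements of the co-occurrence matrix. The function names and the `dx`, `dy` and `levels` parameters are hypothetical and are not taken from the cited reference.

```python
import numpy as np


def cooccurrence_matrix(image, dx=1, dy=0, levels=None):
    """Normalized co-occurrence matrix P[i, j] for a pixel offset (dx, dy).

    P[i, j] estimates the probability that a pixel of gray level i has a
    pixel of gray level j at the given offset. dx=1, dy=0 corresponds to
    r = 1, theta = 0 deg. Gray levels must be non-negative integers.
    """
    img = np.asarray(image, dtype=np.intp)
    if levels is None:
        levels = int(img.max()) + 1
    h, w = img.shape
    # Pair every reference pixel with its displaced neighbor, in bounds only.
    ref = img[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
    nbr = img[max(0, dy):h - max(0, -dy), max(0, dx):w - max(0, -dx)]
    counts = np.zeros((levels, levels), dtype=np.float64)
    np.add.at(counts, (ref.ravel(), nbr.ravel()), 1.0)
    return counts / counts.sum()


def angular_second_moment(p):
    """Sum of squared co-occurrence probabilities (assumed form of (0.1))."""
    return float(np.sum(p ** 2))


if __name__ == "__main__":
    flat = np.zeros((4, 4), dtype=int)
    checker = np.indices((4, 4)).sum(axis=0) % 2
    print(angular_second_moment(cooccurrence_matrix(flat, levels=2)))
    print(angular_second_moment(cooccurrence_matrix(checker)))
```

For a uniform image the matrix has a single nonzero element and the angular second moment is 1. The value decreases as the probability mass spreads over more gray level pairs, which is why it can serve as a measure of uniformity.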
According to the conventional technique using the texture analysis, each frame of the image sequence is treated as an independent image. For this reason, no measurement is made with respect to the features related to the motion, although the motion is an essential element in determining the features of the image sequence.
On the other hand, as conventional motion estimation methods, Yoshio Asuma et al., "A Method for Estimating the Advection Velocity of Radar Echoes Using a Simple Weather Radar System", Geophysical Bulletin of Hokkaido University, Sapporo, Japan, Vol.44, October 1984, pp.23-34 or Yoshio Asuma et al., "Short-Term Prediction Experiment (Part 1) of Snow Precipitation Using a Simple Weather Radar System", Geophysical Bulletin of Hokkaido University, Sapporo, Japan, Vol.44, October 1984, pp.35-51 propose methods of obtaining 2 frames of the image sequence, matching each small region within the frames, and measuring the motion (velocity component) of a target included in the small region, for example. These proposed methods use the images of 2 different frames of the image sequence. First, a search is made for the best matching position, where a certain region (normally, a square region) within the image of one frame best matches the image of the other frame. Next, the moving velocity of the object within the target region is estimated from a displacement between the 2 frames and the frame interval of the 2 frames. A cross-correlation coefficient of the image gray level value is used to describe the degree of matching of the 2 image regions. When the gray level distributions within the 2 image regions are respectively denoted by $I_1(i, j)$ and $I_2(i, j)$, the cross-correlation coefficient can be calculated from the following formulas (0.3), (0.4) and (0.5), where $M$ and $N$ indicate the sizes of the 2 image regions. ##EQU2##
The cross-correlation coefficient is calculated while shifting the position of one image region on the image, and a search is made for a displacement $(K, L)$ which makes the cross-correlation coefficient a maximum. Based on the displacement $(K, L)$ which is obtained, moving velocity components can be calculated from the following formulas (0.6) and (0.7), where $V_x$ and $V_y$ respectively denote an x-component and a y-component of the velocity, and $\Delta$ denotes the frame interval. If adjacent frames are used, $\Delta = 1$. In addition, the obtained velocity is expressed in units of pixels/frame.

$$V_x = \frac{K}{\Delta} \tag{0.6}$$

$$V_y = \frac{L}{\Delta} \tag{0.7}$$
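The matching procedure described above can be sketched as follows, as a minimal illustration and not as the implementation of the cited references. Because the formula images (0.3) to (0.5) are not reproduced in this text, the sketch assumes the standard normalized (zero-mean) cross-correlation coefficient. The function names and the parameters `top`, `left`, `size` and `max_shift` are hypothetical.

```python
import numpy as np


def correlation_coefficient(a, b):
    """Normalized cross-correlation coefficient of two equal-size regions."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / denom) if denom > 0 else 0.0


def estimate_velocity(frame1, frame2, top, left, size, max_shift, frame_interval=1):
    """Search for the displacement (K, L) that maximizes the correlation.

    Returns (Vx, Vy) = (K / interval, L / interval) in pixels/frame, where K
    is the shift along columns (x) and L is the shift along rows (y).
    """
    f1 = np.asarray(frame1, dtype=np.float64)
    f2 = np.asarray(frame2, dtype=np.float64)
    block = f1[top:top + size, left:left + size]
    h, w = f2.shape
    best_score, best_k, best_l = -np.inf, 0, 0
    for shift_y in range(-max_shift, max_shift + 1):
        for shift_x in range(-max_shift, max_shift + 1):
            r, c = top + shift_y, left + shift_x
            # Skip candidate positions that fall outside the second frame.
            if r < 0 or c < 0 or r + size > h or c + size > w:
                continue
            score = correlation_coefficient(block, f2[r:r + size, c:c + size])
            if score > best_score:
                best_score, best_k, best_l = score, shift_x, shift_y
    return best_k / frame_interval, best_l / frame_interval


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    first = rng.random((20, 20))
    # Second frame: the same pattern moved 1 row down and 2 columns right.
    second = np.roll(first, shift=(1, 2), axis=(0, 1))
    print(estimate_velocity(first, second, top=6, left=6, size=8, max_shift=3))
```

The exhaustive search returns a single displacement per block, which reflects the assumption that the content of the block translates uniformly without changing shape.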
The above described method calculates the moving velocity using an assumption that the target within the block where the matching is carried out does not change shape with time and translates uniformly. However, the calculated moving velocity does not sufficiently reflect the features of the target non-rigid body which appears and disappears and locally includes various motion components. According to the method of measuring the velocity component from the image sequence, it is only possible to measure the velocity component such as the translation of the target. In addition, it is impossible to measure the spatial features such as the shape and surface texture of the target within the image sequence, and the arrangement of the image elements.
Furthermore, Japanese Laid-Open Patent Applications No.10-197543 and No.10-206443 propose methods of detecting a motion trajectory which has a surface shape and is drawn by the edge or contour of the target within the image plane in a space (hereinafter also referred to as a spatiotemporal space) which is formed when the image sequence is stacked in the time-base direction, and measuring the motion (velocity component) of the target from the directions of intersection lines formed by a plurality of different tangent planes tangent to the motion trajectory.
According to the method of measuring the motion of the target in the spatiotemporal space, the Hough transform (also called voting) is first used, for example, and the spatiotemporal space image is transformed into a parameter space which represents the velocity component (direction and magnitude of the velocity) of the target object. Next, a peak of the distribution within the parameter space is detected, and the velocity component of the target object is obtained from the peak coordinate values. In this method of measuring the motion of the target, it is known that the most dominant translational velocity component within the target region can be acquired robustly with respect to noise and occlusion.
Furthermore, as a conventional method of detecting a dynamic target within the image sequence and measuring the motion of the target, a method based on a gradient of the local gray level value is also known.
According to the conventional texture analysis technique, each frame of the image sequence is treated as an independent image, and thus, it is impossible to measure the features related to the motion which is an essential element of the features of the image sequence. In addition, since this conventional texture analysis technique extracts the features for each frame, it is impossible to distinguish the dynamic target from the background, and the extracted features are easily affected by occlusion (concealment) and noise. As a result, it is difficult to stably extract the spatial features of the dynamic target.
Moreover, according to the above described conventional method of measuring the velocity component from the image sequence, it is only possible to measure the velocity component such as the translation of the target, and it is impossible to measure the features such as the shape and the surface texture of the target within the image sequence. In addition, according to the conventional method of measuring the velocity component, it is assumed that only a single conspicuous motion component exists in the region of the image sequence of interest. For this reason, if a plurality of objects having different motions coexist in the same region, it is impossible to accurately estimate the velocity component included in the image sequence.
On the other hand, in the case of the conventional method of measuring the motion of the dynamic target, it is assumed that the continuity of the target motion and the unchangeability of the target shape are maintained. For this reason, in a situation where an occluding object exists between an observer and the moving target and the target becomes visible and invisible, it is difficult to accurately measure the target motion. In such a situation, which is often referred to as an occlusion state, information such as the existence of the occlusion, the degree of occlusion and the position of the occlusion is required so as to realize a highly accurate measurement of the motion. However, in the situation where the occlusion occurs, the moving target which is to be observed appears, disappears and re-appears, thereby making it difficult to track the target, and from the practical point of view, it is impossible to acquire information related to the occlusion.
An image sequence such as a weather radar image obtained from a weather radar equipment is an example of a target which has an indefinite shape, includes a non-rigid body which appears and disappears, and is characterized by the motion within the image. According to the conventional technique, it is difficult to obtain the features peculiar to such an image sequence. The reason for this difficulty is that, essentially, the features peculiar to the above described image sequence cannot be obtained from the image features obtained from a single image frame or 2 image frames.
Research related to the motion pattern which changes with time, that is, the temporal texture, is introduced in Randal C. Nelson and Ramprasad Polana (Nelson et al.), "Qualitative Recognition of Motion Using Temporal Texture", CVGIP: Image Understanding, Vol.56, No.1, July, pp.78-89, 1992, and Martin Szummer, "Temporal Texture Modeling", M.I.T. Media Laboratory Perceptual Computing Section Technical Report No.346, 1995, for example.
Nelson et al. define feature values such as the non-uniformity of the flow direction using statistics calculated from an optical flow field. For example, these feature values are extracted in the following manner. First, a normal flow, which is the component of the optical flow in the direction of the gray level gradient (that is, perpendicular to the local edge), is obtained for each pixel within the image. Next, one of the following statistics is calculated: a value obtained by dividing an average value of the magnitudes of the normal flows by their standard deviation; values of the positive and negative curl and divergence of the flow; or, after the direction of the flow is made discrete in 8 directions and a histogram is created, the absolute deviation of the histogram from a uniform distribution.
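Two of the statistics mentioned above can be sketched as follows. This is a rough illustration under stated assumptions and not the procedure of Nelson et al.: the normal flow is taken as the flow component along the gray level gradient derived from the brightness constancy equation, the derivatives are plain finite differences, and the function names and the `min_gradient` threshold are hypothetical. The curl and divergence statistics are omitted.

```python
import numpy as np


def normal_flow(frame1, frame2, min_gradient=1e-3):
    """Per-pixel normal flow (u, v) and a mask of pixels where it is defined.

    Assumes brightness constancy, Ix*u + Iy*v + It = 0, which gives
    (u, v) = -It * (Ix, Iy) / (Ix^2 + Iy^2) wherever the gradient magnitude
    exceeds min_gradient.
    """
    f1 = np.asarray(frame1, dtype=np.float64)
    f2 = np.asarray(frame2, dtype=np.float64)
    iy, ix = np.gradient(f1)   # derivatives along rows (y) and columns (x)
    it = f2 - f1               # temporal derivative as a frame difference
    grad_sq = ix ** 2 + iy ** 2
    valid = grad_sq > min_gradient ** 2
    scale = np.zeros_like(grad_sq)
    scale[valid] = -it[valid] / grad_sq[valid]
    return scale * ix, scale * iy, valid


def flow_statistics(u, v, valid, bins=8):
    """Mean/std of flow magnitude and direction-histogram non-uniformity.

    Expects at least one valid pixel. The non-uniformity is the summed
    absolute deviation of the normalized 8-direction histogram from 1/bins.
    """
    mag = np.hypot(u, v)
    std = mag[valid].std()
    mean_over_std = mag[valid].mean() / std if std > 0 else float("inf")
    moving = valid & (mag > 0)   # direction is undefined for zero flow
    angles = np.arctan2(v[moving], u[moving])
    hist, _ = np.histogram(angles, bins=bins, range=(-np.pi, np.pi))
    hist = hist / max(hist.sum(), 1)
    nonuniformity = float(np.abs(hist - 1.0 / bins).sum())
    return mean_over_std, nonuniformity


if __name__ == "__main__":
    ramp = np.tile(np.arange(10.0), (10, 1))   # gray level increases with x
    u, v, valid = normal_flow(ramp, ramp - 1.0)  # pattern shifted right by 1
    print(flow_statistics(u, v, valid))
```

A pattern that moves in a single direction concentrates the histogram in one bin and gives the largest non-uniformity, whereas motion spread evenly over all directions gives a value near zero.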
The feature value which is obtained in this manner has an advantage in that the value does not change with respect to the illumination and color. However, this feature value cannot sufficiently represent information related to the shape, and there is a problem in that the optical flow itself cannot be accurately estimated. The measures taken with respect to the phenomena such as the appearance and disappearance of the target are also insufficient.
On the other hand, Martin Szummer and Rosalind W. Picard, "Temporal Texture Modeling", IEEE International Conference on Image Processing, September 1996 proposes a method of modeling temporal texture using a spatiotemporal auto regressive model.
In the spatiotemporal auto regressive model, the value of each pixel is represented, spatially and time-wise, by a linear combination of the values of a plurality of surrounding pixels, as described by the following formula (0.8), where $s(x, y, t)$ denotes a luminance value of the image sequence, $a(x, y, t)$ denotes a Gaussian white noise, and $\Delta x_i$, $\Delta y_i$ and $\Delta t_i$ denote the offsets to neighboring pixels. ##EQU3##
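The formula image for (0.8) is not reproduced in this text. A form that is consistent with the symbol definitions above, and with the usual statement of a spatiotemporal autoregressive model, would be the following, where the number of neighbors $p$ is a symbol introduced here only for illustration and $\phi_i$ denotes the model parameter:

```latex
s(x, y, t) = \sum_{i=1}^{p} \phi_i \, s(x + \Delta x_i,\; y + \Delta y_i,\; t + \Delta t_i) + a(x, y, t)
```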
A model parameter $\phi_i$ is estimated from the input image sequence using the method of least squares. The estimated model parameter $\phi_i$ may be regarded as representing the temporal and spatial features of the input pattern. Pattern recognition or the like is performed using this model parameter $\phi_i$.
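The least squares estimation can be sketched as follows, as a minimal illustration that is not taken from the cited paper. It assumes the model has the linear form s(x, y, t) = sum over i of phi_i * s(x + dx_i, y + dy_i, t + dt_i) + a(x, y, t), stores the sequence as an array indexed `[t, y, x]`, and uses hypothetical names (`fit_star`, `offsets`).

```python
import numpy as np


def fit_star(sequence, offsets):
    """Least-squares estimate of the parameters phi_i of a linear model.

    sequence: array indexed [t, y, x].
    offsets:  list of neighbor offsets (dx_i, dy_i, dt_i).
    Returns the array of estimated phi_i, one per offset.
    """
    s = np.asarray(sequence, dtype=np.float64)
    t_len, height, width = s.shape
    # Margins that keep every neighbor index inside the volume.
    lo = [max(0, -min(o[k] for o in offsets)) for k in range(3)]
    hi = [max(0, max(o[k] for o in offsets)) for k in range(3)]

    def window(dx, dy, dt):
        """Values of s at the given offset from every usable target pixel."""
        return s[lo[2] + dt:t_len - hi[2] + dt,
                 lo[1] + dy:height - hi[1] + dy,
                 lo[0] + dx:width - hi[0] + dx].ravel()

    target = window(0, 0, 0)
    design = np.column_stack([window(*o) for o in offsets])
    phi, *_ = np.linalg.lstsq(design, target, rcond=None)
    return phi


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames = [rng.random((8, 8))]
    for _ in range(5):
        prev = frames[-1]
        # Noise-free synthetic data: 0.5 * left neighbor + 0.3 * same pixel,
        # both taken from the previous frame.
        frames.append(0.5 * np.roll(prev, 1, axis=1) + 0.3 * prev)
    print(fit_star(np.stack(frames), offsets=[(-1, 0, -1), (0, 0, -1)]))
```

On noise-free synthetic data such as the above, the generating coefficients are recovered to within numerical precision. On real images the estimate depends directly on the gray level values.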
However, since this technique uses the local gray level value of the image, the modeling is easily affected by the change in illumination and noise added to the image. In addition, the physical meaning or significance of the obtained model parameter $\phi_i$ is unclear. Further, because the modeling is based on the image gray level, there is a disadvantage in that the structural features of the image cannot be clearly obtained.
Therefore, the echo pattern included within the weather radar image is a motion pattern of a non-rigid body which repeats appearing and disappearing, and it is difficult to represent the features of such a motion pattern using the conventionally proposed techniques. Accordingly, there are demands to realize a method and an equipment for extracting image features which can represent the features of the motion pattern of the non-rigid body which repeats appearing and disappearing and is included in the image. In addition, it is expected that the image feature of the motion pattern of the non-rigid body is also effective with respect to retrieval, indexing and the like of a general video database or the like.