1. Field of the Invention
The present invention relates to automatic target recognition, digital video processing (compression, frame segmentation, image segmentation, watermarking), sensor fusion and data reduction.
2. Background Discussion
Automatic Target Recognition (ATR), also known as Target Identification (ID), is a well-established method of automatically recognizing true targets and discriminating them from false targets. Targets can be military (tank, artillery gun, Unmanned Aerial Vehicle (UAV), Unmanned Ground Vehicle (UGV)) or civilian (human, animal, automobile, et cetera). Targets of interest are usually mobile, or in motion. The basic problem of ATR is successful target acquisition, that is, identification of Regions of Interest (ROIs), or equivalently successful data reduction, called pre-ATR. Such pre-ATR should be provided in real time, or in Ultra-Real-Time (URT), in order to make the ATR effective in real-world scenarios, both military and civilian. This is a natural objective if we consider that biologically inspired pre-ATR is done on a millisecond (msec) scale. In typical video at 30 frames per second (fps), with a frame duration of roughly 33 msec, effective pre-ATR should be done within a few milliseconds, or even in sub-milliseconds (URT). This is a formidable task, only rarely achievable, and mostly in a research environment. It is a general problem of imagery or video sensors, including sensor fusion (see L. A. Klein, Sensor and Data Fusion, SPIE Press, 2004 and E. Waltz and J. Llinas, Multisensor Data Fusion, Artech House, 1990). Such sensors acquire a tremendous amount of information. For example, for a typical video frame of 740×480 pixels at 24 bits per RGB pixel (24 bpp), the video frame content is 740×480×24 ≈ 8.5 million bits per frame, and the original video bandwidth at 30 fps is approximately 256 Mbps. Therefore, because of the large amount of information acquired by such sensors, any reasonable data reduction is a formidable task, especially if made in real time, or in Ultra-Real-Time (URT). In contrast, for single pointing sensors such as acoustic range sensors, data reduction is simple (T. Jannson, et al., “Mobile Acoustic Sensor System for Road-Edge Detection,” SPIE Proc., vol. 6201-36, 2006), but the amount of information they acquire is very low. This problem is discussed in detail in T. Jannson and A. Kostrzewski, “Real-Time Pre-ATR Video Data Reduction in Wireless Networks,” SPIE Proc., vol. 6234-22, 2006.
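The frame-size and bandwidth figures quoted above can be reproduced with a few lines of arithmetic. The following is only an illustrative sketch; the function names are hypothetical and are not part of the cited references.

```python
def frame_bits(width, height, bits_per_pixel):
    """Uncompressed size of one video frame, in bits."""
    return width * height * bits_per_pixel


def raw_bandwidth_bps(width, height, bits_per_pixel, fps):
    """Uncompressed video bandwidth, in bits per second."""
    return frame_bits(width, height, bits_per_pixel) * fps


# Figures taken from the text: 740x480 pixels, 24 bpp, 30 fps.
bits = frame_bits(740, 480, 24)
bps = raw_bandwidth_bps(740, 480, 24, 30)

print(f"{bits / 1e6:.1f} million bits per frame")  # 8.5 million bits per frame
print(f"{bps / 1e6:.0f} Mbps uncompressed")        # 256 Mbps uncompressed
```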
The literature on ATR is extensive. In the 1960s and 1970s it focused mostly on coherent ATR, i.e., ATR based on objects illuminated by laser (coherent) light beams. Such ATR, based mostly on the Fourier transform and complex wave amplitudes (see, e.g., J. W. Goodman, Introduction to Fourier Optics, 2nd ed., McGraw-Hill, 1988), and more recently on the wavelet transform (WT), has been successfully applied to Synthetic Aperture Radar (SAR) imaging, where optical hardware (lenses, holograms) has been replaced by electronic hardware. Such ATR has very limited application to this invention, since TV or video cameras are mostly passive devices that use ambient (white) light rather than active light sources such as lasers (although some cameras can use laser light).
Many digital video cameras use some kind of digital video processing, including various types of video compression (MPEG, wavelet), frame segmentation, novelty filtering, et cetera. The literature on video compression is very broad and includes many patents, among them Applicant's issued U.S. Pat. Nos. 6,137,912; 6,167,155; and 6,487,312, the content of which is hereby incorporated herein by reference. These techniques provide high-quality video images at relatively low bandwidth, with Compression Ratios (Cs) approaching 4000:1. They are MPEG-based, with a new type of I-frame called the M-frame: a meaningful I-frame that is introduced only when the motion error with respect to a reference I-frame exceeds a predefined threshold value (see “Soft Computing and Soft Communication (SC2) for Synchronized Data” by T. Jannson, D. H. Kim, A. A. Kostrzewski, and V. T. Tarnovskiy, Invited Paper, SPIE Proc., vol. 3812, pp. 55-67, 1999).
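The threshold rule for introducing M-frames can be illustrated with a minimal sketch. The text does not define how the motion error is computed, so the mean absolute pixel difference used here is an assumption, as are the function names, the threshold value, and the choice to treat the first frame as the initial reference.

```python
def mean_abs_error(frame, reference):
    """Mean absolute pixel difference (an assumed stand-in for 'motion error')."""
    return sum(abs(a - b) for a, b in zip(frame, reference)) / len(reference)


def mark_m_frames(frames, threshold):
    """Return the indices at which a new reference (M-) frame would be introduced."""
    if not frames:
        return []
    m_indices = [0]          # assumption: the first frame is the initial reference
    reference = frames[0]
    for i, frame in enumerate(frames[1:], start=1):
        if mean_abs_error(frame, reference) > threshold:
            m_indices.append(i)
            reference = frame  # the new M-frame becomes the reference
    return m_indices


# Toy 4-pixel "frames": small drift, then an abrupt scene change at index 3.
frames = [
    [10, 10, 10, 10],
    [10, 11, 10, 10],
    [10, 12, 11, 10],
    [50, 52, 51, 50],
    [50, 53, 51, 50],
]
print(mark_m_frames(frames, threshold=5.0))  # [0, 3]
```

In this toy sequence only the abrupt change at index 3 exceeds the threshold, so a single additional reference frame is introduced there.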
The difficulties of video data reduction in general, and of pre-ATR in particular, are well described in “Real-Time Pre-ATR Video Data Reduction in Wireless Networks” by T. Jannson and A. Kostrzewski, SPIE Proc., vol. 6234-22, 2006, where the concept of M-frames is also described. The same paper describes an example of primitive pre-ATR, in which a moving object is located by triangulation through a cooperative camera network and the object vector (value and direction) is evaluated.
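As a rough illustration of location by triangulation, the sketch below intersects the bearing rays from two cameras in a plane and then derives a motion vector (value and direction) from two successive positions. It is not the method of the cited paper: the two-dimensional geometry, the use of bearing angles, the reading of the object vector as a velocity vector, and all function names are assumptions made for the example.

```python
import math


def triangulate(cam_a, bearing_a, cam_b, bearing_b):
    """Intersect two bearing rays (angles in radians, measured from the +x axis)."""
    ax, ay = cam_a
    bx, by = cam_b
    dax, day = math.cos(bearing_a), math.sin(bearing_a)
    dbx, dby = math.cos(bearing_b), math.sin(bearing_b)
    denom = dax * dby - day * dbx  # 2-D cross product of the two ray directions
    if abs(denom) < 1e-12:
        raise ValueError("bearings are parallel; no unique intersection")
    # Solve cam_a + t * d_a = cam_b + s * d_b for t.
    t = ((bx - ax) * dby - (by - ay) * dbx) / denom
    return (ax + t * dax, ay + t * day)


def motion_vector(p0, p1, dt):
    """Return (speed, direction in radians) between two positions dt seconds apart."""
    vx = (p1[0] - p0[0]) / dt
    vy = (p1[1] - p0[1]) / dt
    return math.hypot(vx, vy), math.atan2(vy, vx)


cam_a, cam_b = (0.0, 0.0), (10.0, 0.0)

# Bearings toward an object at (5, 5), then at (6, 5) one frame (1/30 s) later.
p0 = triangulate(cam_a, math.atan2(5, 5), cam_b, math.atan2(5, -5))
p1 = triangulate(cam_a, math.atan2(5, 6), cam_b, math.atan2(5, -4))
speed, direction = motion_vector(p0, p1, dt=1 / 30)
print(p0, p1, speed, direction)
```

With more than two cameras, the pairwise intersections could be averaged or fitted by least squares; that extension is omitted here.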
Prior-art computer vision object recognition and scene interpretation strategies are typically applied in two steps: low-level (pre-ATR edge/boundary detection) and high-level (image segmentation). Natural terrestrial landscapes, oblique aerial images, UAV images, and others typically consist of pattern combinations, some of them true targets and some of them false targets, with boundaries created by abrupt changes in features such as specific motion, color, texture, and other signatures. This greatly complicates automatic image processing, or ATR. A reliable algorithm needs to consider all types of image attributes to correctly segment real natural images. There is a large literature on so-called image understanding (see, e.g., Geometric Invariance in Computer Vision by Mundy et al., The MIT Press, 1992). It considers image invariants and geometrical invariants in order to analyze mostly rigid bodies in motion, or their combinations, and formulates an adequate mathematical framework, mostly in the form of so-called affine transforms and covariance matrices, that analyzes the mathematical relations between the movement of a rigid body (3 rotations and 3 translations, or 6 degrees of freedom) and its projections obtained at the camera image plane (see Gerald Sommer, “Applications of Geometric Algebra in Robot Vision”, Computer Algebra and Geometric Algebra with Applications, Volume 3519, 2005). This image understanding is then collapsed to algorithmic image segmentation, which is itself an ill-posed problem. That is, it involves inferring causes (a large pool of events, or actual scenes) from effects (a small pool of sensor readings, or detected images). This is generally called Bayesian inference, and it is a natural cost of any sensor reading (the human organism is itself such a large sensory system).
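The relation between six-degree-of-freedom rigid-body movement and its projection at the camera image plane can be sketched as follows. This is a simplified illustration, not the formulation of the cited works: the Z-Y-X rotation order, the ideal pinhole camera model, and all names are assumptions.

```python
import math


def rotation_matrix(rx, ry, rz):
    """Rotation R = Rz(rz) * Ry(ry) * Rx(rx); angles in radians."""
    cx, sx = math.cos(rx), math.sin(rx)
    cy, sy = math.cos(ry), math.sin(ry)
    cz, sz = math.cos(rz), math.sin(rz)
    return [
        [cz * cy, cz * sy * sx - sz * cx, cz * sy * cx + sz * sx],
        [sz * cy, sz * sy * sx + cz * cx, sz * sy * cx - cz * sx],
        [-sy, cy * sx, cy * cx],
    ]


def move_rigid_point(point, angles, translation):
    """Apply 3 rotations and 3 translations (6 degrees of freedom) to one point."""
    r = rotation_matrix(*angles)
    return tuple(
        sum(r[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


def project_pinhole(point, focal_length=1.0):
    """Project a camera-frame 3-D point onto the image plane of an ideal pinhole camera."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point must lie in front of the camera (z > 0)")
    return (focal_length * x / z, focal_length * y / z)


# A point on the body, rotated 90 degrees about z and moved 5 units along the optical axis.
corner = (1.0, 0.0, 0.0)
moved = move_rigid_point(corner, (0.0, 0.0, math.pi / 2), (0.0, 0.0, 5.0))
u, v = project_pinhole(moved)
print(u, v)
```

The projection discards depth, so many different poses map to the same image point, which is one way to see why recovering the scene from the image is ill-posed.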
One recent solution to this sensory problem is the boundary detection and image segmentation scheme called “edge flow” (see “Edge Flow: A Framework of Boundary Detection and Image Segmentation” by W. Y. Ma and B. S. Manjunath, IEEE Computer Vision and Pattern Recognition, 1997). In that framework, a predictive coding model identifies and integrates the direction of change in image attributes (color, texture, and phase discontinuity) at each image location, and constructs an edge flow vector that points to the closest image boundary. By iteratively propagating the edge flow, the boundaries where two opposite directions of flow meet in a stable state can be located. As a rule, additional expert information is needed to segment the objects or ROIs. Traditionally, in the literature (see, e.g., “A Computational Approach to Edge Detection” by J. Canny, IEEE Trans. Pattern Analysis and Machine Intelligence, 8:679-714, 1986), edges are located at the local maxima of the gradient in intensity/image feature space. In contrast, in “edge flow” (Ma and Manjunath, 1997, cited above), edges (or image boundaries in a more general sense) are detected and localized indirectly. This is done by first identifying a flow direction at each pixel location (a gradient) that points to the closest boundary, and then detecting where edge flows in two opposite directions meet. This is a very effective method that gives excellent results provided there is sufficient time for computation. Unfortunately, the computation time required is typically much too long to permit any real-time operation.
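The traditional approach, locating edges at local maxima of the intensity gradient, can be illustrated on a one-dimensional intensity profile. This is a deliberately reduced sketch, not the full Canny detector (no smoothing, no hysteresis thresholding) and not the edge flow method; the function names and the strength threshold are assumptions.

```python
def gradient_magnitude(signal):
    """Central-difference gradient magnitude; endpoints use one-sided differences."""
    n = len(signal)
    grad = []
    for i in range(n):
        lo = max(i - 1, 0)
        hi = min(i + 1, n - 1)
        grad.append(abs(signal[hi] - signal[lo]) / (hi - lo))
    return grad


def edges_by_local_maxima(signal, min_strength):
    """Indices of interior samples where the gradient magnitude peaks."""
    g = gradient_magnitude(signal)
    return [
        i
        for i in range(1, len(g) - 1)
        if g[i] >= min_strength and g[i] > g[i - 1] and g[i] >= g[i + 1]
    ]


# A blurred step from dark (0) to bright (10); the steepest point is at index 4.
profile = [0, 0, 0, 1, 5, 9, 10, 10, 10]
print(edges_by_local_maxima(profile, min_strength=1.0))  # [4]
```

Even this reduced version touches every sample once; the cost of the iterative propagation used in edge flow comes on top of a per-pixel computation of this kind.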
The same conclusion is true for other prior-art methods of spatial image segmentation, including recent efforts in video surveillance, used in Homeland Security applications.
Patent prior art deemed to be relevant to the present invention includes U.S. Pat. Nos. 7,010,164; 7,088,845; 6,404,920; 5,768,413; 6,687,405; 6,973,213; 5,710,829 and 5,631,975 which all relate to image segmentation; U.S. Pat. Nos. 5,654,771 and 6,983,018 which relate to motion vector image processing; U.S. Pat. No. 6,453,074 which deals with image decimation and filtering (although not in real time and for still images, not video images); U.S. Pat. No. 5,970,173 which relates to affine transformation for image motion between frames; U.S. Pat. No. 6,285,794 which treats compression by morphing; U.S. Pat. No. 6,628,716 which treats wavelet-based video compression; and U.S. Pat. No. 7,027,719 which discloses a catastrophic event recorder including video data compression.