Digital video is seldom transmitted or stored in its raw, original form. Rather, the digital video data is compressed in some fashion. Compression of video is possible because there are, depending on the type of footage, various amounts of redundancy present in the video signal. There exists spatial redundancy because, within the video frames, the signal does not change much between most pixels (picture elements of the video frame); there exists temporal redundancy because the video signal does not change much between most frames. There also exists perceptual redundancy because the pixel value fluctuations within frames and between frames contain more information than can be perceived by the human eye.
Many video compression techniques, such as the MPEG-1 and MPEG-2 standards, try to exploit these redundancies in order to compress a video signal as much as possible while preserving the visual content of the video as well as possible. Spatial redundancy is exploited by transmitting the coefficients of the DCT transform of 8×8 image blocks. Temporal redundancy is exploited by transmitting only differences between subsequent frames, where these differences are expressed using motion compensation vectors. Perceptual redundancy is exploited by limiting the color information in the signal.
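As a rough illustration of how spatial redundancy shows up in the DCT domain, the following sketch computes an orthonormal two-dimensional DCT of a single 8×8 block in plain Python. It is not taken from any MPEG implementation; the function name dct_2d and the sample block are assumptions made for this example. A block with no spatial variation collapses to a single non-zero (DC) coefficient, which is why smooth image areas are cheap to transmit.

```python
import math

N = 8  # block size (8x8, as in the block-based coders described above)


def dct_2d(block):
    """Orthonormal 2D DCT-II of an N x N block given as a list of rows.

    Hypothetical helper for illustration; real coders use fast transforms.
    """
    def alpha(k):
        return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)

    out = [[0.0] * N for _ in range(N)]
    for u in range(N):          # vertical frequency index
        for v in range(N):      # horizontal frequency index
            s = 0.0
            for y in range(N):
                for x in range(N):
                    s += (block[y][x]
                          * math.cos((2 * y + 1) * u * math.pi / (2 * N))
                          * math.cos((2 * x + 1) * v * math.pi / (2 * N)))
            out[u][v] = alpha(u) * alpha(v) * s
    return out


# A perfectly flat block (assumed sample data): all pixels have value 100.
flat_block = [[100] * N for _ in range(N)]
coeffs = dct_2d(flat_block)

# Count coefficients that are not numerically zero.
significant = sum(1 for row in coeffs for c in row if abs(c) > 1e-6)
print("DC coefficient:", round(coeffs[0][0], 3))
print("non-zero coefficients:", significant)
```

For a textured block many more coefficients would be non-zero, and a lossy coder would then quantize the small ones away.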
These compression standards support high resolution and high frame rate video. Lower-bandwidth video compression techniques (like H.263, H.320, and H.323) also exist, but these usually support only low resolution images (QSIF) at low frame rates (2 fps). Such compression schemes are usually designed either as general-purpose systems for any image type, or specifically as video conferencing systems.
A more recent compression standard, which is still under development, is MPEG-4. Where MPEG-1 and MPEG-2 do not take into consideration the visual content of the individual video frames, MPEG-4 does. Rather than basing the compression on image blocks, the compression is based on image regions that may actually correspond to semantically meaningful areas of the 3D scene. For example, a textured region can be compressed as a representation of its boundary plus parameters that describe the texture, possibly with a residual image as well. Although MPEG-4 does not prescribe how the regions are to be extracted, computer vision techniques are often used. MPEG-4 also has provisions for very high-level compression of moving faces. A general geometric face model is predefined with a number of control points. The encoder just has to set the initial location of these points and provide trajectories for them as the video progresses. It is then up to the decoder to reconstruct and display a suitable face based on this parameter set.
A compressor and corresponding decompressor pair that can code a signal into a compressed form and then decode that signal back into its original format is called a codec. The compression can either be lossless, in which case the decoded signal is equal to the original signal, or lossy, in which case the decoded signal is merely a “good” approximation of the original signal. In the latter case, information is lost between the original and reconstructed signal, but a good compression algorithm attempts to ensure the best possible decoded signal (usually from a human perceptual standpoint) within a given bit rate. Lossless techniques could also be applied to an image or video, but generally do not yield enough data reduction to be very useful (typically compression ratios between 1.2 and 2.5, whereas MPEG-1 usually runs at 30 to 50).
The following reference describes examples of the state of the prior art in compression technology:
B. G. Haskell, A. P. Puri, and A. N. Netravali, Digital Video: An Introduction to MPEG-2. Chapman & Hall: New York, 1997.
Chapter 1, pages 1–13, introduces compression, standards for video conferencing (H.320), MPEG1 and MPEG2. The low bit-rate standard, H.263, is handled on pages 370–382. MPEG4 is introduced on pages 387–388. These references are incorporated herein in their entirety.
The compression techniques proposed herein require computer vision techniques. The following computer vision techniques are especially relevant.
Edge detection: These are techniques to identify sharp discontinuities in the intensity profile of images. Edge detectors are operators that compute differences between pairs of neighboring pixels. Pixels that produce a high response to these operators are then identified as edge pixels. Edge maps can be computed in a single scan through the image. Examples of edge detectors are the Gradient- and Laplacian-type edge finders and edge templates such as Sobel.
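A minimal sketch of a Sobel-type edge finder, assuming a grayscale image stored as a list of rows of intensities, is given below. The function name sobel_edges, the threshold value, and the test image are assumptions chosen for illustration; the gradient magnitude is approximated by the sum of absolute horizontal and vertical responses, which is a common simplification.

```python
def sobel_edges(image, threshold):
    """Return a binary edge map computed in one scan over the interior pixels.

    image: list of rows of grayscale intensities (hypothetical input format).
    A pixel is marked as an edge when |gx| + |gy| >= threshold.
    """
    h, w = len(image), len(image[0])
    edges = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # Horizontal gradient: right column minus left column.
            gx = ((image[y - 1][x + 1] + 2 * image[y][x + 1] + image[y + 1][x + 1])
                  - (image[y - 1][x - 1] + 2 * image[y][x - 1] + image[y + 1][x - 1]))
            # Vertical gradient: bottom row minus top row.
            gy = ((image[y + 1][x - 1] + 2 * image[y + 1][x] + image[y + 1][x + 1])
                  - (image[y - 1][x - 1] + 2 * image[y - 1][x] + image[y - 1][x + 1]))
            if abs(gx) + abs(gy) >= threshold:
                edges[y][x] = 1
    return edges


# Assumed test image: dark left half, bright right half (a vertical step edge).
step_image = [[0, 0, 0, 100, 100, 100] for _ in range(6)]
edge_map = sobel_edges(step_image, threshold=200)
for row in edge_map:
    print(row)
```

Only the two columns that straddle the intensity step respond; border pixels are left unmarked because the 3×3 template does not fit there.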
Region finding: This is a class of techniques that identify areas of continuity within an image (in a sense, the opposite of edge detection). The areas that are to be detected are constant in some image property. This property can be intensity, color, texture, or some combination of these. Using connected components techniques, regions can be computed in a single scan. Clustering approaches have also been used successfully. An example here is the detection of hands or faces in frames by finding regions with flesh tone.
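Region finding by connected components can be sketched as follows. For brevity this example uses a breadth-first flood fill with 4-connectivity rather than the single-scan labeling mentioned above; the function name label_regions and the sample mask are assumptions. The input is a binary mask in which 1 marks pixels that satisfy the chosen property (for instance, a flesh-tone test).

```python
from collections import deque


def label_regions(mask):
    """Label 4-connected regions of non-zero pixels in a binary mask.

    Returns (labels, count), where labels has the same shape as mask and
    holds 0 for background and 1..count for each region found.
    """
    h, w = len(mask), len(mask[0])
    labels = [[0] * w for _ in range(h)]
    count = 0
    for y in range(h):
        for x in range(w):
            if mask[y][x] and labels[y][x] == 0:
                # Start a new region and flood-fill it.
                count += 1
                labels[y][x] = count
                queue = deque([(y, x)])
                while queue:
                    cy, cx = queue.popleft()
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                                   (cy, cx - 1), (cy, cx + 1)):
                        if (0 <= ny < h and 0 <= nx < w
                                and mask[ny][nx] and labels[ny][nx] == 0):
                            labels[ny][nx] = count
                            queue.append((ny, nx))
    return labels, count


# Assumed sample mask with two separate regions.
sample_mask = [
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 1, 1],
]
region_labels, region_count = label_regions(sample_mask)
print(region_count)
for row in region_labels:
    print(row)
```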
Background subtraction: This is a method where two images are used to find image regions corresponding to objects. A first image is acquired without the objects present, then a second image with the objects. Subtracting the second image from the first and ignoring regions near zero results in a segmented image of the objects.
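The subtraction and thresholding step can be sketched in a few lines. The function name subtract_background, the threshold, and the two small grayscale images below are assumptions for illustration only.

```python
def subtract_background(background, frame, threshold):
    """Return a binary mask that is 1 where the frame differs from the background.

    Differences whose absolute value is at most `threshold` are treated as
    "near zero" and ignored, which suppresses small noise fluctuations.
    """
    return [
        [1 if abs(f - b) > threshold else 0 for b, f in zip(brow, frow)]
        for brow, frow in zip(background, frame)
    ]


# Assumed data: an empty scene and the same scene with a bright object.
empty_scene = [
    [50, 50, 50, 50],
    [50, 50, 50, 50],
    [50, 50, 50, 50],
]
scene_with_object = [
    [51, 49, 50, 50],
    [50, 200, 210, 50],
    [50, 205, 198, 52],
]
object_mask = subtract_background(empty_scene, scene_with_object, threshold=10)
for row in object_mask:
    print(row)
```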
Normalized correlation: This is a technique for comparing two image patches Q1 and Q2. The normalized correlation at some translation T is defined as:

$$NC = \frac{E(Q_1 Q_2) - E(Q_1)\,E(Q_2)}{\sigma(Q_1)\,\sigma(Q_2)}$$

with E(.) the expectation and σ(.) the standard deviation. High values here indicate that the patches are very similar, despite possible differences in lighting conditions.
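The following sketch evaluates the normalized correlation of two equally sized patches, under the assumption that the normalizing term for each patch is its standard deviation (the square root of the variance), so that the result lies between -1 and 1. The function name normalized_correlation and the sample patches are assumptions. The second patch is the first one with its contrast doubled and its brightness raised, which models a change in lighting.

```python
import math


def normalized_correlation(q1, q2):
    """Normalized correlation of two patches given as lists of rows.

    Computes [E(Q1*Q2) - E(Q1)*E(Q2)] / (std(Q1) * std(Q2)).
    Returns 0.0 when either patch is constant (zero spread).
    """
    a = [p for row in q1 for p in row]
    b = [p for row in q2 for p in row]
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    covariance = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n
    std_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / n)
    std_b = math.sqrt(sum((y - mean_b) ** 2 for y in b) / n)
    if std_a == 0 or std_b == 0:
        return 0.0
    return covariance / (std_a * std_b)


patch = [[10, 20], [30, 40]]
brighter_patch = [[2 * p + 10 for p in row] for row in patch]   # lighting change
inverted_patch = [[50 - p for p in row] for row in patch]        # contrast reversal

nc_same = normalized_correlation(patch, brighter_patch)
nc_inverted = normalized_correlation(patch, inverted_patch)
print(round(nc_same, 6), round(nc_inverted, 6))
```

The centered covariance used in the code is algebraically equal to E(Q1Q2) − E(Q1)E(Q2); it is preferred here only because it is numerically better behaved.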
Normalized correlation and other computer vision techniques are described more fully in:
D. Ballard and C. Brown, Computer Vision, Prentice-Hall: New Jersey, 1982.
Gradient- and Laplacian-type edge finders and edge templates can be found on pages 75–80; pages 149–155 describe region finding and connected components techniques; background subtraction on pages 72–73; and normalized correlation can be found on pages 68–70. These references are incorporated herein in their entirety.
Some of the above techniques are also used to process the frames in order to compute MPEG4 compression. However, MPEG4 (and MPEG1-2) coding techniques are, in general, proprietary and hence descriptions of the actual techniques used are not available. Yet all that is important from a functional standpoint is that it is possible for decoders which adhere to the standard to decode the resulting signal.
Problems with the Prior Art
One of the concerns of this invention is the efficient use of professionals' and experts' time, especially through savings on the time and money spent on travel. Traditional means of dispatching experts to locations that can be remote are expensive and inefficient, mainly because they involve time-consuming, expensive travel. Consider the following scenarios, which are very costly in terms of personnel resources and travel.
A company is building a large hydroelectric dam. Sometimes life-threatening situations arise and an expert must be flown in. Typically, most work at the site stops until this individual arrives and diagnoses the problem.
High-priced service contracts for photocopiers guarantee that a technician will be on site within an hour, when needed. Such field service personnel often spend a large fraction of their time driving from site to site, not using their expertise. Other times, they sit around waiting for a call.
These cases may mean idled manpower and machinery, schedule slippage, the need for a large staff and high travel costs.
There are prior art techniques that address these concerns. For example, telemedicine is the practice of medicine at a distance, e.g., telepresence surgery. A military application of this is where highly qualified surgeons remotely assist battlefield doctors and medics in performing delicate surgery on casualties with life-threatening injuries. This work must be done in the field since the soldiers are often so badly injured that they cannot be easily moved. Civil applications of telemedicine, where the field doctors may be assisted by robot arms remotely controlled by the expert surgeon, may eventually become widespread as well. High-quality, high-resolution cameras that record and transmit the images needed for performing the medical task are essential in telemedicine.
F. Hamit, “To the Virtual Collaborative Clinic: NASA and Telemedicine's Future,” Advanced Imaging, July 1999, pp. 31–33.
This reference is incorporated in its entirety.
Many other types of tele-operations can be envisioned. For instance, in the civil engineering example mentioned above, an expert could remotely diagnose the problem and field personnel could then fix the problems under supervision of the expert. Similar solutions can be used in the copier repair arena. Lower paid field personnel could service and repair the copiers. When problems are difficult to diagnose or repair, the field agents could contact an expert in the office and establish a video link to let the expert direct and guide the field work.
A problem with these remotely controlled diagnosis and repair processes is that video images of high resolution and high quality need to be transmitted. For digital video transmission this means that high bandwidth communications channels are required, even if compression is used. For analog video transmission this means that high power transmitters are required. Furthermore, hard-to-obtain licenses from the FCC need to be secured.
Compression techniques like the MPEG standards are designed to compress and decompress the video such that there is as little information loss as possible. That is, the decoded video is a faithful reconstruction of the original video, at least to the human eye. This is not always possible when there is a lot of motion in the video. For standard resolution, and in particular for HDTV, even such compressed signals require too high a bandwidth. The bandwidth required is in the 1–20 Mbaud range depending on image resolution and quality. High-bandwidth communication channels and high-bandwidth equipment are expensive, so low bandwidth is strongly preferable. Also, in many remote areas such broadband links are not available. Instead there may only be a phone line (28–53 Kbaud) or cell-phone service (10–20 Kbaud).
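To give a feel for the gap between these channel capacities and raw video, the sketch below computes the compression ratio that would be needed to fit an uncompressed stream into a given channel. The frame size (640×480), color depth (24 bits per pixel), and frame rate (30 fps) are assumptions chosen for illustration and do not come from the text; the channel figures are treated as bits per second, which is also a simplifying assumption.

```python
def raw_bit_rate(width, height, bits_per_pixel, frames_per_second):
    """Bit rate of uncompressed video in bits per second."""
    return width * height * bits_per_pixel * frames_per_second


def required_compression_ratio(raw_bps, channel_bps):
    """How many times the raw stream must shrink to fit the channel."""
    return raw_bps / channel_bps


# Assumed video format: 640x480 pixels, 24-bit color, 30 frames per second.
raw = raw_bit_rate(640, 480, 24, 30)

phone_line_ratio = required_compression_ratio(raw, 28_000)        # 28 Kbaud line
broadband_ratio = required_compression_ratio(raw, 20_000_000)     # 20 Mbaud link

print("raw bit rate (bits/s):", raw)
print("ratio needed for phone line:", round(phone_line_ratio))
print("ratio needed for 20 Mbaud link:", round(broadband_ratio, 1))
```

Under these assumptions the phone line would need a ratio in the thousands, far beyond the 30 to 50 quoted for MPEG-1 earlier in this section, while the broadband link needs only a ratio of roughly 11.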
However, while low-bandwidth codecs (like H.263) already exist, they usually support only low resolution images (QSIF) at low frame rates (2 fps or worse) over such channels. A number of tasks require better resolution. Other tasks require high update rate interaction between a field agent and the directing expert. Furthermore, these codecs are usually designed either as video conferencing products (where facial motion is paramount) or as general-purpose systems for any video type and genre. These prior art compression techniques have no notion of what is important to a particular task and hence degrade all information uniformly. That is, prior art compression methods do not have a proper model of what information in the images is important to a given task and what is not. Hence, commonly available low bandwidth channels constrain standard video codecs to operate at too slow a speed to provide real-time feedback. This makes it impossible to direct some tasks remotely; an expert must be located in the field instead. In general, the problem with prior art compression is that it is not possible to transmit high-resolution, high frame rate video over low bandwidth channels because these compression techniques are not designed for low bandwidth telepresence applications.
Much prior art in semantic or content-based compression concentrates on compression for video telephony and conferencing. This type of compression is highly geared to the fact that the video images contain a “talking head.” An instance is the semantic compressor described in U.S. Pat. No. 5,832,115 to J. R. Rosenberg. This codec uses an edge detector to produce an edge map of each frame. A set of pattern templates of different sizes, each having a pair of ellipsoid face-edge contours, is defined off-line. These templates are correlated with the edge map to detect the size and position of the face. Block-based compression (as in the MPEG1-2 standards) is then applied preferentially to the macro blocks (2×2 blocks) within the ellipse. Here, there is strong reliance on a two-dimensional model of a talking head, although presumably other object models might also be used.
A content-based compression technique that is not dependent on object models is disclosed in U.S. Pat. No. 5,974,172 to T. Chen. Here the frames are segmented into subject and non-subject regions. A simple way to do this is to define a color spectrum for the desired subject region, and then declare any areas that do not have suitable pixel colors relative to this spectrum to be non-subject regions. Just the subject regions are then coded and decoded. Video telephony is one application for this compression scheme.
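The general idea of color-based subject segmentation can be sketched as a per-channel range test on RGB pixels. This is only an illustration of the principle and is not the method of the cited patent; the function names, the color bounds SUBJECT_LOW and SUBJECT_HIGH, and the sample frame are all hypothetical.

```python
def segment_subject(frame, low, high):
    """Return a binary mask: 1 where every channel lies within [low, high].

    frame: list of rows of (r, g, b) tuples (assumed input format).
    low, high: per-channel bounds defining the subject's color range.
    """
    def in_range(pixel):
        return all(lo <= c <= hi for c, lo, hi in zip(pixel, low, high))

    return [[1 if in_range(p) else 0 for p in row] for row in frame]


def keep_subject(frame, mask, fill=(0, 0, 0)):
    """Blank out non-subject pixels so that only subject regions get coded."""
    return [
        [p if m else fill for p, m in zip(prow, mrow)]
        for prow, mrow in zip(frame, mask)
    ]


# Hypothetical color range for the subject and a tiny 2x2 sample frame.
SUBJECT_LOW = (150, 90, 70)
SUBJECT_HIGH = (255, 200, 180)
sample_frame = [
    [(200, 150, 120), (10, 200, 30)],
    [(0, 0, 0), (220, 170, 140)],
]
subject_mask = segment_subject(sample_frame, SUBJECT_LOW, SUBJECT_HIGH)
subject_only = keep_subject(sample_frame, subject_mask)
print(subject_mask)
print(subject_only)
```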
U.S. Pat. No. 5,854,856 to Moura and Jasinschi describes a motion-based codec. First, moving figure velocity and background velocity are estimated. Second, the figure velocity is compensated with relation to the background velocity. Third, the figure is segmented using a threshold to detect if a figure moves in relation to the background. Fourth, the segmented figures are tessellated into blocks. A background image is computed using cut-and-paste operations. Compression is then achieved by transmitting the tessellated segmented figures and, only when border updates are needed, appropriate background images.
U.S. Pat. No. 6,026,183 to R. K. Talluri et al. describes a similar content-based compression scheme based on MPEG1-2. Regions of change (moving objects) are detected from reconstructed frame F(N-1) to the frame F(N). The boundaries of these regions, including holes, are encoded and added to the frame. Removal of temporal redundancies is achieved by finding blocks in the previous frame that match blocks in the current frame F(N). The signal is further compressed by synthesizing F(N)′ from the previous frame and comparing F(N)′ to F(N). This is done to find frame regions that still contain significant amounts of information (residual), which is then compressed in the same way. This helps support selective encoding/decoding of objects in the bitstream sequence as well as object scalability.
For all these prior art teachings, the objective is to reconstruct the video at the receiver end as photo-realistic images with as much information as possible, at least in the frame areas of interest. That is, in important areas the decoded video should have all the properties of the original image (such as colors, edges, textures, motion, etc.) and also be visually pleasing. This is achieved by using motion detection or motion segmentation, region (sub/object) detection, or models of the expected objects in the video. None of the above systems describes selectable codecs, in the sense that the receiver has a choice of different codecs to use.
U.S. Pat. No. 6,026,183 to R. K. Talluri et al. describes a codec that allows the operator to choose which objects in the video are to be encoded but, still, the goal is to make these objects look close to their original appearance when decoded. None of the codecs is geared to compressing the video in such a fashion that only the information pertinent to a given task is encoded. In addition, none of the codecs has the capability to transmit high-fidelity frames at the request of the viewer or according to given algorithmic rules. Further, prior art encoding depends heavily on fairly complex image processing and computer vision techniques. The breakdown of these techniques results in abrupt degradation of the decoded video signals. In general, it is preferable for the encoding, and hence the decoded signal, to instead degrade gracefully when the input video degrades in quality.