In contrast to ranging instruments such as radar, video cameras do not directly yield the depth or distance from the measuring device to the observed objects. Only vertical and horizontal positions and movements are observed directly. When optically dense objects move behind each other, the resulting occlusions are observed as loss of signal from the hidden objects.
However, for some video modelling and compression schemes it is important to have compact representations of the depth dimension. Depth is here defined to be the position along the line of sight from the camera or observer.
Depth information is wanted in most object based video compression systems. One usage is efficient handling of overlapping objects. An encoder can use depth order information to assign image parts to correct objects, and a decoder can use this information so that only the frontmost object is displayed in case of overlaps. Such depth modelling could be called ordinal or qualitative depth modelling.
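The decoder-side use of ordinal depth described above can be illustrated with a minimal sketch. It assumes that each object comes with a binary mask and that the depth order is given as a list from frontmost to rearmost; the function name `composite` and the data layout are hypothetical and are not taken from this description.

```python
from typing import Dict, List, Optional

Mask = List[List[bool]]


def composite(masks: Dict[str, Mask], depth_order: List[str]) -> List[List[Optional[str]]]:
    """Return, for each pixel, the name of the frontmost object covering it.

    depth_order lists object names from frontmost to rearmost (an assumed
    convention). Pixels covered by no object are left as None.
    """
    first = next(iter(masks.values()))
    height, width = len(first), len(first[0])
    out: List[List[Optional[str]]] = [[None] * width for _ in range(height)]
    for name in depth_order:  # visit objects front to back
        mask = masks[name]
        for y in range(height):
            for x in range(width):
                # A pixel is claimed only by the first (frontmost) object covering it.
                if mask[y][x] and out[y][x] is None:
                    out[y][x] = name
    return out


# Two objects overlapping in the middle pixel of a 1 x 3 frame.
masks = {
    "car": [[True, True, False]],
    "tree": [[False, True, True]],
}
frame = composite(masks, depth_order=["tree", "car"])
```

In this example the overlapping middle pixel is assigned to "tree" because it is listed in front of "car"; reversing the order would assign it to "car".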
Another group of video modelling systems where depth is important is automatic map construction systems, where a map is automatically made based on terrain photographs taken from different positions, e.g. from a plane travelling over the terrain. Here a depth, or height, model with numeric values for each terrain point can be computed, according to the theory of stereogrammetry. When there are steep mountains in the terrain, this may give rise to occlusions for some of the frames in the sequence. Such occlusions may pose a problem for some of the existing methods, while they will be a source of information in a system according to this invention. Such depth modelling could be called quantitative depth modelling.
For another example of an object based video modelling system, consider a garbage sorting system consisting of a camera and a robot arm mounted next to a conveyor belt. Some types of cameras, especially those operating in parts of the near infrared spectrum, are good at recognizing different types of materials, e.g. plastics, and results from such analysis could then be used to control a robot arm which could grab the objects from the conveyor belt and release them into sorting bins. It could then be important that the robot arm only tries to grab objects that do not lie partly behind other objects. Analysis of the camera images according to the present invention could then be performed, so that a depth graph is obtained, giving information on which objects occlude which other objects. Such depth modelling could be called depth graph modelling.
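As an illustration of how such a depth graph might be used in the sorting example, the following minimal sketch stores the graph as a mapping from each object to the set of objects it occludes and then lists the objects that nothing else occludes. The representation and the names `build_depth_graph` and `unoccluded` are assumptions made for this example only.

```python
from typing import Dict, Iterable, List, Set, Tuple


def build_depth_graph(
    objects: Iterable[str], occlusions: Iterable[Tuple[str, str]]
) -> Dict[str, Set[str]]:
    """Build a directed graph with an edge front -> behind for each observed occlusion."""
    graph: Dict[str, Set[str]] = {obj: set() for obj in objects}
    for front, behind in occlusions:
        graph.setdefault(front, set()).add(behind)
        graph.setdefault(behind, set())  # make sure occluded objects are nodes too
    return graph


def unoccluded(graph: Dict[str, Set[str]]) -> List[str]:
    """Return the objects that no other object occludes (candidates for grabbing)."""
    hidden: Set[str] = set().union(*graph.values())
    return sorted(obj for obj in graph if obj not in hidden)


# Hypothetical observations: the bottle lies on the can, the can on the carton.
graph = build_depth_graph(
    objects=["bottle", "can", "carton", "lid"],
    occlusions=[("bottle", "can"), ("can", "carton")],
)
graspable = unoccluded(graph)
```

Here "bottle" and "lid" are reported as graspable, since no observed occlusion places another object in front of them.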
For yet another example of an object based video modelling system, consider a video-based system for automatically steering a moving car. Other cars, and the surrounding terrain, may occlude each other. It is then of interest to know not only which objects occlude each other, but also how fast they move, how fast they accelerate, and how they will occlude each other in the near future. Such information can, according to this invention, be summarized in a bilinear model consisting of one spatial and one temporal part.
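To make the notion of a bilinear model concrete, the sketch below assumes that the depth of each point in each frame is approximated by a sum over a few factors of a temporal value (here called a score) multiplied by a spatial value (here called a loading). The names, the numbers, and the simple linear extrapolation of the scores are illustrative assumptions and are not the estimation procedure of the invention.

```python
from typing import List

Matrix = List[List[float]]


def reconstruct(scores: Matrix, loadings: Matrix) -> Matrix:
    """depth[n][p] = sum over k of scores[n][k] * loadings[k][p]."""
    n_factors = len(loadings)
    n_points = len(loadings[0])
    return [
        [sum(s[k] * loadings[k][p] for k in range(n_factors)) for p in range(n_points)]
        for s in scores
    ]


def extrapolate_scores(scores: Matrix) -> List[float]:
    """Predict the next frame's scores by continuing the last observed change."""
    last, prev = scores[-1], scores[-2]
    return [2.0 * a - b for a, b in zip(last, prev)]


# Spatial part: factor 0 is a common offset, factor 1 a tilt across three points.
loadings = [[1.0, 1.0, 1.0], [0.0, 1.0, 2.0]]
# Temporal part: one row of scores per frame; the tilt grows over time.
scores = [[5.0, 0.0], [5.0, 0.5], [5.0, 1.0]]

depths = reconstruct(scores, loadings)
next_depth = reconstruct([extrapolate_scores(scores)], loadings)[0]
```

The spatial part is shared by all frames, so only one short row of scores is needed per frame; predicting the near future then amounts to predicting a few scores rather than a whole depth map.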
In some video modelling techniques it is also advantageous to be able to determine and represent temporal changes in the depth dimension. Present video codec (encoder/decoder) systems do not yield sufficient descriptions of systematic temporal changes in the depth dimension.
This invention relies only on occlusions, that is, information about which points or areas can be followed through which frames. It does not need depth clues that demand "understanding" of the scene. Such clues could have been that mountains that appear to be blue (for a camera or observer) are often farther away than mountains that appear to be green, objects that are close to the horizon are often farther away than objects that are far away from the horizon, a face that seems small is often farther away than a face that seems large, a thing that seems to move fast is often closer than a thing that seems to stand still, or that a thing that is in camera focus has another depth than a thing that is outside camera focus. It also does not use stereo vision or other types of parallax. However, if any such side information is available, it can be included to further stabilize the estimation.
Accordingly, it is an object of the present invention to provide a method and apparatus for deriving ordinal or qualitative depth information from spatially resolved signals (the input signals) in general, and from video sequence image data from 1D or from 2D cameras in particular.
It is also an object of the present invention to detect temporal or spatial inconsistencies in depth order, and then either resolve these inconsistencies or model them compactly.
Yet another object is to yield quantitative depth information for different parts of the image and to output the depth information in any form.
Yet a further object is to yield quantitative information about how the depths in different parts of the frames change with time and to output the depth information in any form.
Yet a further object is to represent the depth information in a compact model representation.
Yet a further object is to enable temporal and spatial interpolation and extrapolation on the basis of this compact model representation.
Yet a further object is to estimate depth with sufficient quantitative detail for a sequence such that the frames in the sequence can be well decoded or reconstructed, but without having to find the "true" depth.
Yet a further object is to convert qualitative occlusion data, at the nominal or ordinal measurement levels, to depth predictions, at the ratio or interval measurement level.