Means for merging two or more video signals to provide a single composite video signal are known in the art. An example of such video merging is the presentation of weather forecasts on television, where a weather forecaster in the foreground is superimposed on a weather map in the background.
Such prior-art means normally use a color-key merging technology in which the required foreground scene is recorded using a colored background (usually blue or green). The required background scene is also recorded. In its simplest form, the color-key video merging technique uses the color of each point in the foreground scene to automatically "hard" switch (i.e., binary switch) between the foreground and background video signals. In particular, if a blue pixel is detected in the foreground scene (assuming blue is the color key), then a video switch will direct the video signal from the background scene to the output scene at that point. If a blue pixel is not detected in the foreground scene, then the video switch will direct the video from the foreground scene to the output scene at that point. After all points have been processed in this way, the result is an output scene which is a combination of the input foreground and background scenes.
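The hard switch described above can be illustrated with a minimal sketch. It assumes frames are lists of rows of (R, G, B) tuples and that a pixel counts as "blue" when each channel lies within a tolerance of the key color; the function names, the key value, and the tolerance are hypothetical choices for illustration and are not taken from any cited system.

```python
def is_key_color(pixel, key=(0, 0, 255), tolerance=60):
    """Return True if every channel of an (R, G, B) pixel is within
    `tolerance` of the key color (an assumed, simple detection rule)."""
    return all(abs(c - k) <= tolerance for c, k in zip(pixel, key))


def hard_key_merge(foreground, background, key=(0, 0, 255), tolerance=60):
    """Binary switch per pixel: take the background pixel where the
    foreground pixel matches the key color, otherwise keep the foreground."""
    if len(foreground) != len(background):
        raise ValueError("frames must have the same number of rows")
    return [
        [bg if is_key_color(fg, key, tolerance) else fg
         for fg, bg in zip(fg_row, bg_row)]
        for fg_row, bg_row in zip(foreground, background)
    ]


# One-row example: the first foreground pixel is key blue, the second is not.
foreground = [[(0, 0, 255), (200, 50, 50)]]
background = [[(10, 10, 10), (20, 20, 20)]]
merged = hard_key_merge(foreground, background)
```

In this example the first output pixel comes from the background and the second from the foreground, which is the binary behavior the text describes.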
More complex "soft" forms of the color-key video merging technique are taught in an article by Nakamura et al. in SMPTE Journal, Vol. 90, February 1981, p. 107, and in U.S. Pat. No. 4,409,611. In these more complex forms of the color-key video merging technique, the effects of switching may be hidden and more natural merging may be achieved. For instance, shadows of foreground subjects may be made to appear in the background.
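As a rough illustration of what a "soft" key means, the sketch below replaces the binary switch with a mixing weight that varies smoothly with the distance of the foreground color from the key color. This is a generic linear blend under assumed thresholds, not the method of the Nakamura et al. article or of U.S. Pat. No. 4,409,611; the names `key_weight`, `soft_key_pixel`, `inner`, and `outer` are hypothetical.

```python
def key_weight(pixel, key=(0, 0, 255), inner=40.0, outer=120.0):
    """Foreground weight in [0, 1]: 0.0 at or near the key color,
    1.0 far from it, with a linear ramp between the two thresholds."""
    dist = sum((c - k) ** 2 for c, k in zip(pixel, key)) ** 0.5
    if dist <= inner:
        return 0.0
    if dist >= outer:
        return 1.0
    return (dist - inner) / (outer - inner)


def soft_key_pixel(fg, bg, key=(0, 0, 255), inner=40.0, outer=120.0):
    """Blend one foreground pixel with one background pixel using the weight."""
    w = key_weight(fg, key, inner, outer)
    return tuple(round(w * f + (1.0 - w) * b) for f, b in zip(fg, bg))


# A pixel halfway up the ramp (distance 80 from the key) is mixed 50/50.
mixed = soft_key_pixel((0, 0, 175), (100, 100, 75))
```

Pixels that are only partly key-colored, such as a shadow falling on the blue backing, therefore contribute partially to the output instead of being switched away entirely.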
The color-key merging technique is simple, and cheap hardware for this method has been available for some time. As a result, color-key insertion can be performed on both recorded and live video. It is used widely in live television for such purposes as superimposing sports results or images of reporters on top of background scenes, and in the film industry for such purposes as superimposing foreground objects (like space-ships) onto background scenes (like space-scenes).
However, there are two important limitations of color-key merging technology. First, this technique cannot be used to combine video sources where the separation color (e.g., blue or green) in the scene cannot be controlled by the user of this technology. This has often limited the use of color-key insertion to image sequences recorded in a broadcasting or film studio. Second, it is not currently possible to automatically combine video signals in such a way that patterns inserted from one sequence follow the motion of objects (foreground or background) in the other sequence so that the inserted patterns appear to be part of these objects. While, in the past, synchronization of the motions of background and foreground scenes has been performed manually in a very limited number of film productions, such manual synchronization is highly expensive and tedious and requires that the video material be prerecorded and not "live".
The prior art includes a dynamic pattern recognition method which employs a hierarchical structured search for detecting a pattern within a video scene. An example of the use of this method is described in U.S. Pat. No. 5,063,603, the teachings of which are incorporated herein by reference. Briefly, this dynamic pattern recognition method consists of representing a target pattern within a computer as a set of component patterns in a "pattern tree" structure. Components near the root of the tree typically represent large scale features of the target pattern, while components away from the root represent progressively finer detail. The coarse patterns are represented at reduced resolution, while the detailed patterns are represented at high resolution. The search procedure matches the stored component patterns in the pattern tree to patterns in the scene. A match can be found, for example, by correlating the stored pattern with the image (represented in a pyramid format). Patterns are matched sequentially, starting at the root of the tree. As a candidate match is found for each component pattern, its position in the image is used to guide the search for the next component. In this way a complex pattern can be located with relatively little computation.
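The coarse-to-fine idea behind this search can be sketched as follows. This is a simplified stand-in, not the pattern-tree method of U.S. Pat. No. 5,063,603: it uses a single pattern instead of a tree of components, builds the pyramid by plain 2x2 averaging, and scores matches with a sum of squared differences in place of correlation. Images are assumed to be lists of rows of grayscale values with even side lengths, and all function names are hypothetical.

```python
def downsample(img):
    """Halve the resolution by averaging each 2x2 block."""
    return [
        [(img[y][x] + img[y][x + 1] + img[y + 1][x] + img[y + 1][x + 1]) / 4.0
         for x in range(0, len(img[0]) - 1, 2)]
        for y in range(0, len(img) - 1, 2)
    ]


def ssd(img, pat, top, left):
    """Sum of squared differences between the pattern and an image window."""
    return sum((img[top + dy][left + dx] - pat[dy][dx]) ** 2
               for dy in range(len(pat)) for dx in range(len(pat[0])))


def best_match(img, pat, candidates):
    """Return the in-bounds candidate (top, left) with the lowest SSD."""
    max_top = len(img) - len(pat)
    max_left = len(img[0]) - len(pat[0])
    valid = [(t, l) for t, l in candidates
             if 0 <= t <= max_top and 0 <= l <= max_left]
    return min(valid, key=lambda p: ssd(img, pat, p[0], p[1]))


def coarse_to_fine_search(img, pat, levels=2, radius=1):
    """Locate `pat` in `img`: exhaustive search at the coarsest level,
    then a small local search at each finer level."""
    img_pyr, pat_pyr = [img], [pat]          # index 0 is full resolution
    for _ in range(levels - 1):
        img_pyr.append(downsample(img_pyr[-1]))
        pat_pyr.append(downsample(pat_pyr[-1]))

    coarse_img, coarse_pat = img_pyr[-1], pat_pyr[-1]
    everywhere = [(t, l)
                  for t in range(len(coarse_img) - len(coarse_pat) + 1)
                  for l in range(len(coarse_img[0]) - len(coarse_pat[0]) + 1)]
    top, left = best_match(coarse_img, coarse_pat, everywhere)

    for level in range(levels - 2, -1, -1):
        # The coarse position, doubled, guides the search at the finer level.
        nearby = [(2 * top + dy, 2 * left + dx)
                  for dy in range(-radius, radius + 1)
                  for dx in range(-radius, radius + 1)]
        top, left = best_match(img_pyr[level], pat_pyr[level], nearby)
    return top, left


# Example: a 4x4 pattern embedded at row 2, column 4 of an 8x8 image of zeros.
pattern = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
image = [[0] * 8 for _ in range(8)]
for dy in range(4):
    for dx in range(4):
        image[2 + dy][4 + dx] = pattern[dy][dx]
position = coarse_to_fine_search(image, pattern)
```

Only the coarsest level is searched exhaustively; each finer level examines a handful of positions around the doubled coarse estimate, which is where the computational saving comes from.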
Further, it is known in the prior art how to estimate the orientation (pose) of a flat surface of a given detected pattern in a scene depicted in a video image. The particular parameters that need to be determined are the position of the given detected pattern in the scene, its scale and orientation in the plane of the image, and its tilt into the image plane. Pose is estimated by measuring the geometric distortions of other "landmark" patterns on or near the given detected pattern. Pose may be estimated in two steps.
The first step is to make a rough estimate of pose by locating three or more of such landmark patterns in the scene that are on or near the given detected pattern. The positions of these landmark patterns relative to the given detected pattern are known from training images. However, the positions of these landmark patterns relative to one another change with changes in pose of the given detected pattern. Therefore, the relative positions of the landmark patterns in the observed scene can be used to determine that pose. Landmark patterns can be located using hierarchical structured search, as described above.
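One way to picture this first step is to fit a 2D affine transform to three landmark correspondences and read rough pose parameters from it. The sketch below does so under the assumption that the change in pose is adequately approximated by an affine map between the training-image landmark positions and the observed ones; the source does not state which model the rough estimate uses, and the function names are hypothetical.

```python
import math


def solve_affine(model_pts, scene_pts):
    """Fit x' = a*x + b*y + tx, y' = c*x + d*y + ty to three point pairs
    using Cramer's rule. Returns (a, b, c, d, tx, ty)."""
    (x1, y1), (x2, y2), (x3, y3) = model_pts
    det = x1 * (y2 - y3) - y1 * (x2 - x3) + (x2 * y3 - x3 * y2)
    if abs(det) < 1e-12:
        raise ValueError("landmarks must not be collinear")

    def solve(v1, v2, v3):
        # Solve [x y 1] . [p q r]^T = v for the three landmarks.
        p = (v1 * (y2 - y3) - y1 * (v2 - v3) + (v2 * y3 - v3 * y2)) / det
        q = (x1 * (v2 - v3) - v1 * (x2 - x3) + (x2 * v3 - x3 * v2)) / det
        r = (x1 * (y2 * v3 - y3 * v2) - y1 * (x2 * v3 - x3 * v2)
             + v1 * (x2 * y3 - x3 * y2)) / det
        return p, q, r

    a, b, tx = solve(*(pt[0] for pt in scene_pts))
    c, d, ty = solve(*(pt[1] for pt in scene_pts))
    return a, b, c, d, tx, ty


def rough_pose(model_pts, scene_pts):
    """Summarize the fitted transform as position, scale and in-plane angle.
    The angle is only a rough reading when the surface is strongly tilted."""
    a, b, c, d, tx, ty = solve_affine(model_pts, scene_pts)
    return {
        "translation": (tx, ty),
        "scale": math.sqrt(abs(a * d - b * c)),
        "rotation_deg": math.degrees(math.atan2(c, a)),
    }


# Landmarks from a training image, and the same landmarks as observed after
# a 90 degree rotation, a scaling by 2 and a translation by (5, 7).
model = [(0, 0), (1, 0), (0, 1)]
scene = [(5, 7), (5, 9), (3, 7)]
pose = rough_pose(model, scene)
```

Tilt into the image plane shows up in such a fit as unequal scaling along different directions, which this summary does not separate out; a fuller treatment would decompose the 2x2 part of the transform.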
The second step, which refines the rough estimate from the first step, makes use of "locator patterns" that are on or near the given detected pattern. These "locator patterns" are more extensive patterns than are typically used as landmarks. Stored copies of the pattern are matched to the scene through a process that successively estimates position and orientation, and a process that warps the stored copies into alignment with the observed patterns in the scene. This alignment process, known in the art and called herein "affine precise alignment estimation," can provide a very precise estimate of the pattern positions, and hence of the pose of the given detected pattern. The affine precise alignment estimation process is described in various publications, including "Hierarchical Model-Based Motion Estimation" in the Proc. European Conference on Computer Vision, 1992, pp. 237-252, by Bergen et al., and U.S. Pat. No. 5,067,014 to Bergen et al., assigned to the assignee of this invention.
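The estimate-then-warp loop can be illustrated in a deliberately reduced setting. The sketch below recovers only a sub-sample translation of a one-dimensional signal, whereas the cited process estimates a full affine alignment of two-dimensional patterns; it is not the algorithm of Bergen et al. Each pass warps the stored copy by the current estimate, compares it with the observed signal, and corrects the estimate with a linearized least-squares step. The function names and the test signal are hypothetical.

```python
import math


def sample(signal, pos):
    """Linearly interpolate `signal` at a fractional position (clamped)."""
    pos = min(max(pos, 0.0), len(signal) - 1.0)
    i = min(int(pos), len(signal) - 2)
    t = pos - i
    return (1.0 - t) * signal[i] + t * signal[i + 1]


def refine_shift(stored, observed, iterations=10):
    """Estimate the shift s such that observed[x] ~ stored[x - s]."""
    n = len(stored)
    shift = 0.0
    for _ in range(iterations):
        # Warp the stored copy toward the observed signal.
        warped = [sample(stored, x - shift) for x in range(n)]
        num = den = 0.0
        for x in range(1, n - 1):
            grad = (warped[x + 1] - warped[x - 1]) / 2.0   # central difference
            err = observed[x] - warped[x]
            num += err * grad
            den += grad * grad
        if den == 0.0:
            break
        # Linearized update: err ~ -(s - shift) * grad.
        shift -= num / den
    return shift


# A smooth bump, and the same bump displaced by 1.3 samples.
stored = [math.exp(-((x - 32.0) ** 2) / 50.0) for x in range(64)]
observed = [math.exp(-((x - 33.3) ** 2) / 50.0) for x in range(64)]
estimate = refine_shift(stored, observed)
```

Because each pass starts from an already warped copy, the remaining misalignment shrinks from one iteration to the next, which is why this kind of procedure can reach precision well below one sample.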