1. Field of the Invention
This invention relates to a device for inserting realistic indicia into video images.
2. Description of the Related Art
Electronic devices for inserting electronic images into live video signals, such as described in U.S. Pat. No. 5,264,933 by Rosser, et al., have been developed for the purpose of inserting advertising into broadcast events, especially sports events. These devices are capable of seamlessly and realistically incorporating new logos or other indicia into the original video in real time, even as the original scene is zoomed, panned, or otherwise altered in size or perspective. In addition, in order to use these devices to alter a video feed downstream of the editor's mixing device, electronic insertion devices have to be capable of dealing with scene cuts. This requires recognizing a feature or features reliably and accurately within a very short time, typically a few fields of video or about 1/30th of a second. The need for fast recognition has meant that pyramid processing techniques, as described by Burt, et al., tend to be used. Pyramid processing is a well known technique in which an image is decomposed, sometimes referred to as "decimated," into a series of images, each of which comprises the whole original scene, but each with progressively less detailed information. Typically each successive image will have one quarter of the number of pixels of its predecessor. A level 3 (or third generation) image has 1/64th the number of pixels of the original. A search for a gross feature can thus be done 64 times faster on a level 3 pyramid image and this result quickly related back to the level 0 or original image. Speed is also improvable by searching for a small number of distinct landmarks or features that characterize the target object. This simplification of the search strategy, however, increases the possibility of false alarms or insertions. The enormity of the false alarm problem can be appreciated from the fact that in a typical three hour football game, there are 648,000 fields of video. 
This means that in a single football game there are at least 600,000 opportunities for the insertion device to do something that would be commercially unacceptable, such as inserting an advertisement in the crowd, or on a group of players, just because of a chance juxtaposition of features that fool the computer into thinking the current scene is equivalent to a scene it is looking to find. To avoid this possibility, or at least reduce the risk of it occurring to an acceptable commercial level, it is necessary to have recognition strategies that, on average, would only make one error in at least twice as many attempts at recognition as would occur in the event being covered. For a three hour football game, the computer must therefore make, on average, no more than one false insertion per 1.3 million fields of video. At the same time the search strategy must be kept sufficiently simple and invariant to changes in lighting conditions, video noise and incidental artifacts that may occur in the scene that it is attempting to recognize, that the recognition strategy can be performed by an affordable computing system in no more than 1/30th of a second. The final problem is that the systems capable of meeting these stringent requirements must be developed in a timely and efficient manner. This includes verifying that performance goals are being attained.
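By way of illustration only, the figures quoted above can be checked with the following sketch. The field rate of 60 fields per second is an assumption implied by the 648,000 figure rather than stated in the text, and the function names are hypothetical.

```python
# Assumption: interlaced video at roughly 60 fields per second, which is
# the rate implied by the figure of 648,000 fields in three hours.
FIELDS_PER_SECOND = 60


def fields_in_event(hours, fields_per_second=FIELDS_PER_SECOND):
    """Number of video fields in an event of the given length."""
    return round(hours * 3600 * fields_per_second)


def fields_per_allowed_error(hours, safety_factor=2):
    """Fields that must elapse, on average, per false insertion.

    safety_factor=2 reflects the stated goal of at most one error in at
    least twice as many recognition attempts as occur in the event.
    """
    return safety_factor * fields_in_event(hours)


def pyramid_pixel_fraction(level):
    """Fraction of the original pixels left at a pyramid level, assuming
    each level keeps one quarter of the pixels of its predecessor."""
    return 1 / 4 ** level


if __name__ == "__main__":
    print(fields_in_event(3))            # fields in a three hour game
    print(fields_per_allowed_error(3))   # about 1.3 million
    print(pyramid_pixel_fraction(3))     # level 3 image relative to level 0
```

The three hour case gives 648,000 fields and a target of one false insertion per 1,296,000 fields, which is the "1.3 million" quoted above; a level 3 image retains 1/64th of the original pixels.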
Typically, electronic insertion devices as described in U.S. Pat. No. 5,264,933 have used a dynamic pattern recognition method, as described in detail in U.S. Pat. No. 5,063,603, the teachings of which are incorporated herein by reference. Briefly, as described in PCT WO 93/06691, the preferred prior art dynamic pattern recognition method consists of representing a target pattern within a computer as a set of component patterns in a "pattern tree". Components near the root of the tree typically represent large scale features of the target pattern, while components away from the root represent progressively finer detail. The coarse patterns are represented at reduced resolution, while the detailed patterns are represented at high resolution. The search procedure matches the stored component patterns in the pattern tree to patterns in the scene. A match can be found, for example, by correlating the stored pattern with the image (represented in pyramid format). Patterns are matched sequentially, starting at the root of the tree. As a candidate match is found for each component pattern, its position in the image is used to guide the search for the next component. In this way a complex pattern can be located with relatively little computation. However, such correlation methods, while having the advantage of speed when the search tree is kept to a reasonable size (typically no more than twenty correlations in current hardware implementations), are liable to a significant false turn-on rate. This is caused in part by the need for a simple search tree and in part by a problem fundamental to correlation techniques. The fundamental problem with correlation techniques in image pattern matching is that the stored pattern for each element of the search tree represents a particular pose of the object being looked for, i.e., a particular magnification and orientation. 
Even if the system only requires recognition at the same or a similar orientation, magnification remains a significant problem in a typical broadcast application, such as recognizing football goal posts. The difficulty is that the magnification of the goal post in the initial shot (i.e. the first image of the required goal post in a sequence of images containing it) may vary by a factor of two. This means that the stored pattern is in general of the wrong size, making the correlations weaker than in the case where the search pattern matches the image pattern exactly, and thus more difficult to distinguish from other partially similar features. Traditional attempts to deal with this have been to include search trees containing images of different poses, particularly different magnifications. This results in longer search trees and slower recognition. This is taken to an extreme in the system described in U.S. Pat. No. 5,353,392 by Laquent, in which all attempts to automatically cope with scene cuts are abandoned and the identifying marks are indicated manually on the first image of each sequence. This may be adequate for a non-real-time editing machine, or for a real-time electronic insertion device attached to a single camera in a situation where the recognition landmarks are never fully occluded, but is unacceptable in a standard broadcast environment with the electronic insertion occurring downstream of the editor's switching equipment, or at a remote location.
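For illustration only, the coarse-to-fine correlation search described above might be sketched as follows. This is not the method of the cited patents: the 2x2 block averaging stands in for a properly filtered pyramid reduction, a single template stands in for a full pattern tree, normalized cross-correlation is one possible correlation measure, and all function names are hypothetical.

```python
import math


def reduce_once(image):
    """Halve each dimension by averaging 2x2 blocks (a crude stand-in
    for a filtered pyramid reduction)."""
    rows, cols = len(image), len(image[0])
    return [[(image[r][c] + image[r][c + 1]
              + image[r + 1][c] + image[r + 1][c + 1]) / 4.0
             for c in range(0, cols - 1, 2)]
            for r in range(0, rows - 1, 2)]


def ncc(image, template, top, left):
    """Normalized cross-correlation of the template with the image patch
    whose upper left pixel is at (top, left)."""
    th, tw = len(template), len(template[0])
    patch = [image[top + r][left + c] for r in range(th) for c in range(tw)]
    templ = [template[r][c] for r in range(th) for c in range(tw)]
    pm, tm = sum(patch) / len(patch), sum(templ) / len(templ)
    num = sum((p - pm) * (t - tm) for p, t in zip(patch, templ))
    den = math.sqrt(sum((p - pm) ** 2 for p in patch)
                    * sum((t - tm) ** 2 for t in templ))
    return num / den if den > 0 else 0.0


def best_match(image, template, rows, cols):
    """Return (score, top, left) of the best correlation over the given
    candidate positions."""
    best = (-2.0, 0, 0)
    for top in rows:
        for left in cols:
            score = ncc(image, template, top, left)
            if score > best[0]:
                best = (score, top, left)
    return best


def coarse_to_fine(image, template, level=1):
    """Search exhaustively at a coarse pyramid level, then refine the
    result at level 0 within a small window around the coarse estimate.
    Assumes image and template dimensions are divisible by 2**level."""
    coarse_img, coarse_tpl = image, template
    for _ in range(level):
        coarse_img = reduce_once(coarse_img)
        coarse_tpl = reduce_once(coarse_tpl)
    ch, cw = len(coarse_img), len(coarse_img[0])
    th, tw = len(coarse_tpl), len(coarse_tpl[0])
    _, r, c = best_match(coarse_img, coarse_tpl,
                         range(ch - th + 1), range(cw - tw + 1))
    scale = 2 ** level
    r0, c0 = r * scale, c * scale
    max_top = len(image) - len(template)
    max_left = len(image[0]) - len(template[0])
    rows = range(max(0, r0 - scale), min(max_top, r0 + scale) + 1)
    cols = range(max(0, c0 - scale), min(max_left, c0 + scale) + 1)
    return best_match(image, template, rows, cols)
```

The coarse pass evaluates far fewer positions on far fewer pixels, and the fine pass only visits a small neighborhood. The stored template has a fixed size, however, so a target appearing at a different magnification yields a weaker correlation peak, which is the weakness discussed above.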
In U.S. Pat. No. 4,817,175, Tenenbaum, et al., describe a pattern recognition system which uses parallel processing of the video input to attain speed. This system is directed towards inspection techniques in which the camera is under control of the recognition system and in which real-time performance is not required. The Tenenbaum, et al. system, therefore, uses time averaging of a number of frames of video to obtain high signal-to-noise in the image. The heart of that recognition strategy, which in the preferred embodiment is set up to locate rectangles of varying size, is to look for corners, because of their invariance to magnification, using corner templates and standard correlation techniques. As an example, Tenenbaum, et al. describe a system which has templates representing a corner at all possible orientations. This is used to locate all possible lower left hand corners of possible rectangles. From these, the system detects corners and then looks along the diagonal for the matching upper right hand corner, using only the corner template having the correct pose. Finally, the system uses the predicted location of the other two corners of the rectangle as a means of confirming the existence of the rectangle, again using corner templates in the correct pose. All correlation is done in the traditional manner, using like templates.
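The confirmation logic of such a corner-based rectangle search might be sketched as follows, purely as an illustration. The sketch assumes that corner detection by template correlation has already been performed and that the detected corners are supplied as sets of points keyed by corner type; it simplifies the search "along the diagonal" to any upper right hand corner lying above and to the right of a lower left hand corner. The function name and data layout are hypothetical and are not taken from the Tenenbaum, et al. patent.

```python
def find_rectangles(corners):
    """Pair lower left and upper right corners, then confirm each
    candidate by checking the two predicted remaining corners.

    corners: dict with keys "ll", "lr", "ul", "ur", each mapping to a set
    of (x, y) points, with y increasing upward (an assumed convention).
    Returns a list of (x0, y0, x1, y1) rectangles.
    """
    rectangles = []
    for x0, y0 in sorted(corners["ll"]):
        for x1, y1 in sorted(corners["ur"]):
            # The upper right corner must lie above and to the right.
            if x1 <= x0 or y1 <= y0:
                continue
            # The predicted locations of the other two corners confirm
            # (or reject) the existence of the rectangle.
            if (x1, y0) in corners["lr"] and (x0, y1) in corners["ul"]:
                rectangles.append((x0, y0, x1, y1))
    return rectangles


if __name__ == "__main__":
    detected = {
        "ll": {(1, 1), (5, 0)},   # (5, 0) is a stray corner
        "lr": {(4, 1)},
        "ul": {(1, 3)},
        "ur": {(4, 3)},
    }
    print(find_rectangles(detected))
```

Because a corner looks the same at any magnification, this style of search avoids storing templates at multiple scales, at the cost of requiring all four corners to be visible and cleanly detected.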
The existing methods of structured pattern recognition used in electronic insertion devices face a trade-off. Either they require relatively long and complex search trees, in which case they take too much time with existing hardware to be of use in a real time, multi-camera environment under the range of conditions required by conventional broadcast practice; or, if the search trees are kept sufficiently simple, the search strategies become fragile, making them overly prone to false alarms in complex or noisy images, both of which are part of a real television broadcast.