The present invention relates to improvements in systems that insert selected indicia into live video broadcasts.
Electronic devices for inserting indicia into live video broadcasts have been developed and used for the purpose of inserting advertising, for instance, into sports events. The viability of such devices depends directly on their ability to make the insertion seamless, so that it appears as realistically as possible to be part of the actual scene. The insertion must also be robust enough to handle typical camera manipulations such as panning, tilting, and zooming without compromising the integrity of the broadcast.
A key component in any such indicia insertion system is the ability to track scene motion and background motion from image to image in the broadcast. Reliable tracking data is necessary in order to calculate transform models that adjust an intended insertion to the proper size and perspective prior to insertion of an image in each new image frame.
U.S. Pat. No. 5,264,933 to Rosser notes that standard methods of pattern recognition and image processing can be used to track background and scene motion. The standard methods of pattern recognition and image processing referred to are feature tracking using normalized correlation of previously stored image templates. These methods work well but not under all conditions.
Subsequent methods have incorporated what has been termed “adaptive geographic hierarchical tracking” in which an elastic model is used to extend the domain of image frames that can be tracked adequately. The extended domain includes noisy scenes containing a great deal of occlusion. Occlusion refers to action in the current image obscuring some or most of the pre-selected landmarks utilized by an insertion system to calculate the position and perspective of an insert in the live broadcast. The extended domain also includes images containing rapid variations in overall illumination conditions. Adaptive geographic hierarchical tracking requires that at least three separate landmarks always be visible in the image as it is being tracked. Since precise image conditions cannot be predicted ahead of time, a block matching technique termed “unnormalized correlation” is usually employed.
The present invention further extends the domain of image frames that can be tracked to include frames in which there are no pre-selected landmarks visible. Unlike adaptive geographic hierarchical tracking, which preferably uses predefined synthetic templates, the present invention uses templates taken from the stream of images being broadcast.
There is also prior art concerning motion estimation schemes. Digital video encoders employing motion estimation for data compression purposes extract image templates from video images and calculate motion vectors. A current image is tiled with a set of templates and motion vectors are calculated for each template using a previously transmitted image. The object is to reduce the number of bits needed to encode an image block by transmitting only a motion vector plus an optional correction factor as opposed to transmitting a complete image block. After coding the image the templates are discarded.
Typical block matching criteria for this scheme include the L1 norm, the L2 norm, and normalized correlation. The L1 norm is defined as $D = \sum |d|$ and the L2 norm as $D = \sum d^2$, where d is the difference in pixel values between the image and the template. The summation is carried out over all the pixels in each template. The normalized correlation is defined as:

$$N = \frac{\sum IT}{\sqrt{\sum I^2 \sum T^2}}$$

where T represents the values in the template and I represents the values in the image.
In this description, block matching techniques will be defined so that the best match corresponds to the smallest value of the selected matching criterion. Thus, if normalized correlation were used as the block matching criterion, the mismatch would be defined as:

$$M = 1 - N = 1 - \frac{\sum IT}{\sqrt{\sum I^2 \sum T^2}}$$

As the template is moved over the current image the resulting array of values calculated using the selected block matching criteria is called an error surface and the best match occurs when the error surface has a minimum value.
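The matching criteria and the error surface described above can be illustrated with a minimal sketch. It assumes grayscale images stored as nested lists of numbers and an exhaustive search over every integer position; the function names (`l1_mismatch`, `l2_mismatch`, `normalized_correlation_mismatch`, `error_surface`, `best_match`) are hypothetical and chosen only for illustration. The L1 norm is taken here as the sum of absolute pixel differences.

```python
import math


def _pairs(patch, template):
    """Yield (image pixel, template pixel) pairs over the whole template."""
    for row_i, row_t in zip(patch, template):
        for i, t in zip(row_i, row_t):
            yield i, t


def l1_mismatch(patch, template):
    """L1 norm: sum of absolute pixel differences."""
    return sum(abs(i - t) for i, t in _pairs(patch, template))


def l2_mismatch(patch, template):
    """L2 norm: sum of squared pixel differences."""
    return sum((i - t) ** 2 for i, t in _pairs(patch, template))


def normalized_correlation_mismatch(patch, template):
    """M = 1 - N, so that the best match is the smallest value."""
    num = sum(i * t for i, t in _pairs(patch, template))
    den = math.sqrt(sum(i * i for i, _ in _pairs(patch, template))
                    * sum(t * t for _, t in _pairs(patch, template)))
    # Assumption: an all-zero patch or template is treated as a total mismatch.
    return 1.0 - num / den if den > 0 else 1.0


def error_surface(image, template, criterion):
    """Evaluate the criterion at every integer offset of the template."""
    th, tw = len(template), len(template[0])
    ih, iw = len(image), len(image[0])
    surface = []
    for y in range(ih - th + 1):
        row = []
        for x in range(iw - tw + 1):
            patch = [r[x:x + tw] for r in image[y:y + th]]
            row.append(criterion(patch, template))
        surface.append(row)
    return surface


def best_match(surface):
    """Return the (x, y) offset at which the error surface is smallest."""
    _, x, y = min((v, x, y)
                  for y, row in enumerate(surface)
                  for x, v in enumerate(row))
    return (x, y)
```

Because every criterion is expressed as a mismatch, the same `best_match` routine serves all three; only the function passed to `error_surface` changes.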
Since the average illumination levels in the current image are likely to be similar to those of the matching blocks in the previously transmitted image, block matching is more reliable when it uses methods that include the average illumination information.
The present invention differs from motion estimation used in video encoding in a number of significant ways. In the present invention the templates are a carefully selected subset of the total blocks available rather than all possible positions. Careful choice of a region and template is necessary because, unlike motion estimation in compression algorithms, the result of the present calculation is not a set of motion vectors for the blocks, but rather a single transform model. In a “least square error” sense the single transform model is the best descriptor of the motion of the template ensemble. Moreover, the templates are placed in selected positions in the image rather than tiling the image. Further, the templates are stored in memory and are not discarded after each image is processed.
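To make the idea of a single least-square-error transform model concrete, the following sketch fits one model to the motion of a whole ensemble of templates. The model form is an assumption made for illustration only (a uniform zoom plus a horizontal and vertical translation in image coordinates); the text does not fix a particular parametrisation. The function name `fit_zoom_translation` and the optional per-template `weights` are likewise hypothetical.

```python
def fit_zoom_translation(previous, current, weights=None):
    """Weighted least-squares fit of x' = z*x + tx, y' = z*y + ty.

    previous, current: lists of (x, y) template positions in the previous
    and current images. weights: optional per-template confidences
    (hypothetical; equal weights are used when omitted).
    Returns (zoom, tx, ty).
    """
    if weights is None:
        weights = [1.0] * len(previous)
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one template needs a positive weight")

    # Weighted centroids of the previous and current positions.
    mx = sum(w * p[0] for w, p in zip(weights, previous)) / total
    my = sum(w * p[1] for w, p in zip(weights, previous)) / total
    mu = sum(w * c[0] for w, c in zip(weights, current)) / total
    mv = sum(w * c[1] for w, c in zip(weights, current)) / total

    # Closed-form minimiser of the weighted squared residuals.
    num = sum(w * ((p[0] - mx) * (c[0] - mu) + (p[1] - my) * (c[1] - mv))
              for w, p, c in zip(weights, previous, current))
    den = sum(w * ((p[0] - mx) ** 2 + (p[1] - my) ** 2)
              for w, p in zip(weights, previous))
    if den == 0:
        raise ValueError("templates must not all share one position")

    zoom = num / den
    return zoom, mu - zoom * mx, mv - zoom * my
```

Unlike a set of independent motion vectors, the result is three numbers describing the whole ensemble, so a template whose measured displacement disagrees with the others has only a limited effect on the model, and none at all if its weight is set to zero.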
In the present invention, the current position of a template is determined relative to its previous position, whereas in motion estimation the previous position is determined relative to the current tiled position. Motion estimation in video encoding is directed toward finding the best displacement match, i.e. that with the smallest coding error, to the current image from a previously transmitted image. In contrast, position location of the present invention is directed toward the visual correctness (the viewer's perception of the image) of the motion of the image. In ambiguous cases it is not important how motion estimation in video coding resolves the ambiguity, but it is critical how the position location method of the present invention resolves the ambiguity. Resolution of the ambiguity may involve examination of the model as determined from other nearby blocks. Motion estimation has limited accuracy, often to ½ pixel, due to computational and coding requirements associated with increased accuracy. In position location, however, there are no such limits on accuracy.
The present invention utilizes image templates taken directly from a broadcast video stream. Depending on the intended application, e.g. baseball, football, or soccer, specific capturing criteria are used to select templates from the current image. For long term spatial stability, templates are stored in memory and remain useful so long as they continue to meet certain retention criteria. Retention criteria include a satisfactory match to the current image of the broadcast as well as spatial consistency with other templates. Spatial consistency means that templates to be retained are consistent with other templates with respect to position as opposed to curvature. Templates are updated periodically to purge those no longer capable of giving satisfactory positional data. New templates selected from the current image are then used to replace those discarded. The position of each template is determined by comparing the template against the current image. The preferred comparison method uses an integer position search followed by a two-dimensional interpolation process to obtain positional information accurate to fractions of a pixel. A transform model is then calculated from the derived position data using additional data relating to the shape of the error surface near the matching position. The transform model provides a description of the current image so that indicia may be inserted into the current image in the desired location and correct perspective. The transform model may take various forms. For example, the simplest model defines the pan, tilt, and zoom of the camera recording the event. More complex models may include camera parameters such as roll, mounting offsets, and other camera motion. The transform model may be confirmed by examining pre-defined synthetic templates, and the model can be adjusted if necessary. Changes in mismatch values over time allow video transitions such as scene cuts and fade-outs to be detected.
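The integer position search followed by two-dimensional interpolation can be sketched as follows. The interpolation scheme shown here is an assumption: a separate parabola is fitted through the error-surface minimum and its two neighbours along each axis, which is one common way to obtain a fractional-pixel position but is not stated to be the method used. The names `parabolic_offset` and `refine_position` are hypothetical, and the error surface is assumed to be a nested list indexed as `surface[y][x]`.

```python
def parabolic_offset(e_minus, e_zero, e_plus):
    """Offset of a parabola's minimum from the centre sample.

    The three arguments are error values at positions -1, 0 and +1.
    The result lies within half a pixel of zero when e_zero is the
    smallest of the three.
    """
    curvature = e_minus - 2.0 * e_zero + e_plus
    if curvature <= 0.0:
        # Flat or degenerate neighbourhood: keep the integer position.
        return 0.0
    return 0.5 * (e_minus - e_plus) / curvature


def refine_position(surface):
    """Integer search for the minimum, then per-axis parabolic refinement.

    Returns the (x, y) position of the best match as floats.
    """
    h, w = len(surface), len(surface[0])
    # Step 1: integer position search over the whole error surface.
    _, x0, y0 = min((surface[y][x], x, y)
                    for y in range(h) for x in range(w))
    # Step 2: sub-pixel refinement, skipped on the border where a
    # neighbour is missing.
    dx = dy = 0.0
    if 0 < x0 < w - 1:
        dx = parabolic_offset(surface[y0][x0 - 1], surface[y0][x0],
                              surface[y0][x0 + 1])
    if 0 < y0 < h - 1:
        dy = parabolic_offset(surface[y0 - 1][x0], surface[y0][x0],
                              surface[y0 + 1][x0])
    return (x0 + dx, y0 + dy)
```

The `curvature` term computed along each axis is one simple measure of the shape of the error surface near the matching position: a sharply curved minimum indicates a well-localised match, while a shallow one indicates that the position is poorly constrained along that axis.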
Lastly, the system and method of the present invention are viable so long as there is texture in the current image. The texture need not be stationary, however, over periods longer than several frames of video, e.g. crowds.