1. Field of the Invention
This invention relates to a system and method for tracking image frames for inserting realistic indicia into video images.
2. Description of Related Art
Electronic devices for inserting electronic images into live video signals, such as described in U.S. Pat. No. 5,264,933 by Rosser, et al., have been developed and used for the purpose of inserting advertising and other indicia into broadcast events, primarily sports events. These devices are capable of seamlessly and realistically incorporating logos or other indicia into the original video in real time, even as the original scene is zoomed, panned, or otherwise altered in size or perspective. Other examples include U.S. Pat. No. 5,488,675 issued to Hanna and U.S. Pat. No. 5,491,517 issued to Kreitman, et al.
Making the inserted indicia look as if it is actually in the scene is an important but difficult aspect of implementing the technology. A troublesome aspect is that the eye of the average viewer is very sensitive to small changes in the relative position of objects from field to field. Experimentally, instances have been found where relative motion of an inserted logo by as little as one tenth of one pixel of an NTSC television image is perceptible to a viewer. Placing an inserted indicia, and consistently maintaining it to a high precision, in a broadcast environment is crucial to making video insertion technology commercially viable. A broadcast environment includes image noise; sudden, rapid camera motion; the sporadic occurrence of moving objects which may obscure a considerable fraction of the image; distortions in the image due to lens characteristics; changing light levels, induced either by natural conditions or by operator adjustment; and the vertical interlacing of television signals.
In the prior art, the automatic tracking of image motion has generally been performed by two different methods.
The first method examines the image itself, either following known landmarks in the video scene using correlation or difference techniques, or calculating motion using well-known techniques of optical flow. See Horn, B. K. P. and Schunck, B. G., "Determining Optical Flow," Artificial Intelligence, pp. 185-203 (1981). Landmarks may be transient or permanent, and may be a natural part of the scene or introduced artificially. Changes in the shape and pose of the landmarks are measured and used to insert the required indicia.
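The landmark-following variant of this first method can be illustrated with a minimal sketch. The sum-of-absolute-differences matcher below is an illustrative assumption, not taken from any of the patents cited above; it locates a known template along a single scanline of pixel values:

```python
def sad(template, patch):
    """Sum of absolute differences: 0 for a perfect match."""
    return sum(abs(t - p) for t, p in zip(template, patch))

def locate_landmark(scanline, template):
    """Return the offset in `scanline` where `template` fits best."""
    w = len(template)
    return min(range(len(scanline) - w + 1),
               key=lambda x: sad(template, scanline[x:x + w]))

# A bright landmark (e.g. a field line) embedded at offset 12
# in an otherwise dark scanline:
line = [10] * 12 + [200, 220, 200] + [10] * 15
print(locate_landmark(line, [200, 220, 200]))  # -> 12
```

A real tracker would repeat this search in two dimensions for several landmarks per field and fit the resulting positions to a model of camera pan, tilt, and zoom.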
The second method, described, for instance, in U.S. Pat. No. 4,084,184 issued to D. W. Crain, uses sensors placed on the camera to provide focal distance, bearing and elevation information. From these sensor readings, positional data for landmarks within a given camera's field of view can be derived without reference to the image content itself.
Pattern Recognition Systems
In the pattern recognition type of image insertion systems developed by Rosser et al., for instance, the system has two distinct modes. First is the search mode, wherein each new frame of live video is searched in order to detect and verify a particular target image. Second is the tracking mode, in which the system knows that the target image was present in the previous frame of video. The system further knows the location and orientation of the target image in that previous frame with respect to some pre-defined reference coordinate system. The target image locations are then tracked and updated with respect to that reference coordinate system.
The search mode encompasses pattern recognition techniques to identify certain images. Obtaining positional data via pattern recognition, as opposed to using camera sensors, provides significant system flexibility because it allows live video insertion systems to make an insertion at any point in the video broadcast chain. For instance, actual insertion can be performed at a central site which receives different video feeds from stadiums or arenas around the country or world. The various feeds can be received via satellite or cable or any other means known in the art. Once the insertion is added, the video feed can be sent back via satellite or cable to the broadcast location where it originated, or directly to viewers.
Such pattern recognition search and tracking systems, however, are difficult to implement for some events and are the element most prone to error during live video insertion system operation. The Assignee herein, Princeton Video Image, Inc., has devised and programmed robust searches for many venues and events such as baseball, football, soccer and tennis. However, the time and cost to implement similar search algorithms can be prohibitive for other types of events. Pattern recognition searching is difficult for events in which major changes to the look of the venue are made within hours, or even days, of the event, because a pre-defined common reference image of the venue is difficult to obtain when the look of the venue is not permanently set. In such cases a more robust approach to the search problem is to utilize sensors attached to one or more of the cameras to obtain target positional data.
Camera Sensor Systems
The drawbacks of relying solely upon camera sensor systems are detailed below. In field trials with televised baseball and football games, previous systems encountered the following specific, major problems.
1. Camera Motion
In a typical sport, such as football or baseball, close-up shots are taken with long focal length cameras operating at distances of up to several hundred yards from the action. Both of these sports have sudden action, namely the kicking or hitting of a ball, which results in the game changing abruptly from a tranquil scene to one of fast-moving action. As the long focal length cameras react to this activity, the image they record displays several characteristics which render motion tracking more difficult. For example, the motion of the image may be as fast as ten pixels per field, which falls outside the range of systems that examine pixel windows smaller than 10 by 10 pixels. Additionally, the images may become defocused and suffer severe motion blurring, such that a line which in a static image is a few pixels wide blurs out to be 10 pixels wide. A system tracking a narrow line then suddenly finds no match, or makes assumptions such as that the zoom has changed when in reality only fast panning has occurred. This motion blurring also causes changes in illumination level, color, and pattern texture, all of which can be problems for systems using pattern-based image processing techniques. Finally, camera motion lasting as little as two fields results in abrupt changes in both the local and the large-scale geometry of an image, as well as in its illumination level and color.
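The window-size limitation above can be stated concretely. In the sketch below, the half-window criterion and function names are illustrative assumptions, but they show why motion of ten pixels per field escapes a 10 by 10 pixel search window:

```python
def can_track(displacement_px, window_px):
    """A landmark can only be re-acquired if it remains inside the
    search window centred on its previous position."""
    return abs(displacement_px) <= window_px // 2

print(can_track(5, 10))   # True:  slow pan stays inside the window
print(can_track(10, 10))  # False: 10 px/field escapes a 10x10 window
```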
2. Moving Objects
Sports scenes generally have a number of participants whose general motion follows some degree of predictability, but who may at any time suddenly do something unexpected. Any automatic motion tracking of a real sports event must therefore cope with sudden and unexpected occlusion of various parts of the image. In addition, the variety of uniforms and poses adopted by players in the course of a game means that any attempt to follow a purely geometric pattern in the scene must cope with a large number of occurrences of similar patterns.
3. Lens Distortion
All practical camera lenses exhibit some degree of geometric distortion, which changes the relative position of objects in an image as those objects move toward the edge of the image. When accuracy of 1/10th of a pixel is required, this distortion alone can cause problems.
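The effect can be sketched with a first-order radial (barrel) distortion model. The coefficient k1 and the 720 by 480 image geometry below are illustrative assumptions, not figures from the patent:

```python
def distort(x, y, k1, cx=360.0, cy=240.0):
    """Map an ideal image point to its radially distorted position."""
    dx, dy = x - cx, y - cy
    r2 = dx * dx + dy * dy        # squared distance from image centre
    scale = 1.0 + k1 * r2         # first-order radial model
    return cx + dx * scale, cy + dy * scale

# The same lens shifts a point near the centre by about 0.1 px,
# but a point near the edge by over 4 px -- far beyond a
# 1/10th-pixel tolerance.
near, _ = distort(460, 240, k1=-1e-7)
edge, _ = distort(710, 240, k1=-1e-7)
print(460 - near, 710 - edge)
```

Because the shift grows with the square of the distance from the image centre, a landmark tracked accurately at mid-frame drifts measurably as the camera pans it toward the edge.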
4. Noise in the Signal
Real television signals exhibit noise, especially when the cameras are electronically boosted to cover low light level events, such as nighttime baseball. This noise wreaks havoc with image analysis techniques that rely on standard normalized correlation recognition, because such techniques match pattern shapes irrespective of the strength of the signal. Because noise shapes are random, over the course of several hundred thousand fields of video (a typical three-hour game), mistaking noise patterns for real patterns can become a major problem.
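The amplitude invariance that makes normalized correlation vulnerable to noise can be demonstrated directly. The values below are illustrative only:

```python
def ncc(a, b):
    """Normalized correlation: near 1.0 for identical shapes,
    regardless of signal amplitude."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    da = sum((x - ma) ** 2 for x in a) ** 0.5
    db = sum((y - mb) ** 2 for y in b) ** 0.5
    return num / (da * db) if da and db else 0.0

template = [0, 50, 0]   # the pattern being sought
strong   = [0, 100, 0]  # a genuine high-contrast landmark
faint    = [0, 1, 0]    # a one-count noise blip of the same shape
print(ncc(template, strong), ncc(template, faint))
```

Both candidates score essentially 1.0: the correlator cannot distinguish a faint noise blip from a real landmark of the same shape, which is why random noise occasionally produces false matches over hundreds of thousands of fields.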
5. Field-to-Field Interlace
Television images, in both NTSC and PAL standards, are transmitted in two vertically interlaced fields which together make up a frame. This means that television is not a single stream of images, but two streams of closely related yet subtly different images. The problem is particularly noticeable in looking at narrow horizontal lines, which may be very evident in one field but not the other.
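The field split can be sketched as follows; the frame dimensions are illustrative. A one-row horizontal line lands entirely in one field and vanishes from the other, which is exactly the tracking hazard described above:

```python
def split_fields(frame):
    """Separate an interlaced frame's rows into its two fields."""
    even = frame[0::2]  # field 1: rows 0, 2, 4, ...
    odd  = frame[1::2]  # field 2: rows 1, 3, 5, ...
    return even, odd

frame = [[0] * 4 for _ in range(6)]
frame[3] = [255] * 4  # a narrow horizontal line on row 3

f1, f2 = split_fields(frame)
print(any(255 in row for row in f1))  # False: line absent from field 1
print(any(255 in row for row in f2))  # True:  line present in field 2
```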
6. Illumination and Color Change
Outdoor games are especially prone to illumination and color changes. Typically, a summer night baseball game will start in bright sunlight and end in floodlight darkness. An illumination change of a factor of more than two is typical in such circumstances. In addition the change from natural to artificial lighting changes the color of the objects in view. For instance, at Pro Player Park in Florida, the walls appear blue under natural lighting but green under artificial lighting.
7. Setup Differences
Cameras tend to be set up with small but detectable differences from night to night. For instance, camera tilt typically varies by up to plus or minus 1%, which is not immediately obvious to the viewer. This, however, represents plus or minus 7 pixels, which can be a problem for typical templates measuring 8 pixels by 8 pixels.
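The 7-pixel figure follows from simple arithmetic under one plausible reading: a 1% tilt displaces the image edge by 1% of the active line width. The 720-pixel line width below is an assumption; the text gives only the 1% and 7-pixel figures:

```python
def tilt_offset_px(tilt_fraction, line_width_px=720):
    """Displacement at the image edge caused by a small camera tilt,
    expressed as a fraction of the active line width (assumption)."""
    return tilt_fraction * line_width_px

# A 1% tilt across a 720-pixel line:
print(round(tilt_offset_px(0.01)))  # -> 7 pixels, vs. an 8x8 template
```

A displacement of this size moves a landmark almost entirely out of an 8 by 8 pixel template, so setup drift alone can defeat a fixed-template tracker.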
The advantages of camera sensors include knowing with reasonable certainty which camera is in use, where it is pointing, and at what magnification it is viewing the scene. Although there may be inaccuracies in the camera sensor data due to inherent mechanical uncertainties, such as gear backlash, these inaccuracies will never be large. A camera sensor system will, for instance, never misrecognize an umpire as a goal post, or "think" that a zoomed-out view of a stadium is a close-up view of the back wall. Nor will it ever confuse motion of objects in the foreground with movement of the camera itself.
What is needed is a system that combines the advantages of both pattern recognition systems and camera sensor systems for searching and tracking scene motion while eliminating or minimizing the disadvantages of each. The primary difficulty in implementing a hybrid pattern recognition/camera sensor insertion system is combining, or switching between, data obtained by the two completely different methods. If not done correctly, the combination or switch-over gives unstable results, which show up as the inserted image jerking or vibrating within the overall image. Overcoming this difficulty is crucial to making a hybrid system work at broadcast quality.
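One generic way to avoid such jerking when handing over between two position estimates is to crossfade rather than switch abruptly. The sketch below illustrates that general idea only; it is not the method of the present invention, and all names and values are assumptions:

```python
def blend(sensor_pos, pattern_pos, alpha):
    """Linear crossfade between two (x, y) position estimates:
    alpha=0 -> sensor data only, alpha=1 -> pattern data only."""
    return tuple(s + alpha * (p - s) for s, p in zip(sensor_pos, pattern_pos))

# Hand over from a sensor-derived position (100, 50) to a
# pattern-derived position (103, 52) over five fields instead of
# in a single 3-pixel jump:
steps = [blend((100.0, 50.0), (103.0, 52.0), a / 4) for a in range(5)]
print(steps[0])   # (100.0, 50.0)
print(steps[-1])  # (103.0, 52.0)
```

Spreading a 3-pixel correction over several fields keeps each per-field move well under the 1/10th-pixel-per-field sensitivity threshold discussed earlier would require more fields, but the principle is the same: no single field sees an abrupt jump.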