Augmented reality technologies, as implemented by video insertion systems, have enhanced the experience of viewing many types of programs, including sporting events, by making it possible to insert virtual enhancements (also referred to as elements, inserts, graphics, logos, or indicia) into a particular location of the video image that a viewer is watching on television. For instance, in football, a First Down Line (FDL) is inserted into the real time broadcast of a game to signify the point on the field that the team currently on offense must drive towards in order to be awarded four more downs. In another example, a Down and Distance (DnD) arrow is inserted to denote the play number and the distance left to reach the FDL. While these virtual elements' positions and appearances are determined live based on game progression, other virtual elements may be unrelated to the game events, such as advertising indicia inserted at various areas on the field of play or on various stadium structures like a stadium wall.
An insertion system is a system and method for inserting graphics (virtual elements) into a live video broadcast in a realistic fashion on a real time basis. Generally, the perspective of the camera is being continuously estimated so that graphical elements, either 2D or 3D, may be projected to the video image from the current camera's perspective as if these graphical elements were located at a pre-determined position and orientation in the scene.
Live broadcast Video Insertion Systems (VIS) were developed and are used commercially for the purpose of inserting advertising and other indicia into video sequences, including live broadcasts of sporting events. An example of such a live broadcast VIS is used commercially under the trade name L-VIS®. In further examples, live broadcast VIS are described in U.S. Pat. Nos. 5,264,933 and 5,543,856 to Rosser et al., and U.S. Pat. No. 5,491,517 to Kreitman et al., which are hereby incorporated by reference in their entirety. These VIS, to varying degrees of success, seamlessly and realistically incorporate indicia into the original video in real time. Realism is maintained even as the camera changes its perspective throughout the event coverage, and moving elements in the scene that may occlude the inserted indicia are displayed over them.
FIG. 1 shows a top level block diagram of a typical VIS 100. The main VIS computing component 120 receives a live video feed 110 from a camera and then outputs, possibly with some latency, an enhanced video 115. In addition, the system includes a GUI component 150 with which an operator controls the system before and during an event and an indicia unit 170 where representations of the inserted virtual elements are stored.
Recognition and tracking module 125 performs a recognition process that analyzes the incoming video signal in order to recognize pre-selected landmarks in the image. Such landmarks correspond to prominent, unique features such as lines, conics, junctions, corners, etc. Based on their geometrical structure, appearance, or any other attributes, their correspondence with landmarks in a scene model is determined. In order to facilitate the recognition of these video image landmarks, the frames of the incoming video signal may, prior to being searched, be decimated according to any suitable technique, for example, the Burt pyramid algorithm. This recognition process may be carried out every several frames.
Once the landmarks in the scene are recognized, the recognition and tracking module also tracks these landmarks on a frame-by-frame basis, in order to determine how the recognized landmarks are moving from frame to frame, which provides a measure of how the camera providing the video signal is moving. Typically, at least three landmarks are tracked, although that is not an absolute minimum requirement. By tracking the landmarks, the VIS is able to determine the incremental change in the camera's perspective, which allows VIS 100 to adjust the projection of the logo in the scene onto the video frames.
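As an illustration of frame-by-frame landmark tracking, the sketch below searches a small window around a landmark's previous position for the patch that best matches a stored template, scored by the sum of absolute differences (SAD). The text does not specify the matching method, so SAD block matching, the function names, and the `radius` parameter are assumptions chosen for simplicity.

```python
def sad(frame, top, left, template):
    """Sum of absolute differences between the template and a frame patch."""
    return sum(
        abs(frame[top + r][left + c] - template[r][c])
        for r in range(len(template))
        for c in range(len(template[0]))
    )

def track_landmark(frame, template, prev_top, prev_left, radius):
    """Return the (top, left) position near the previous one with the lowest SAD."""
    th, tw = len(template), len(template[0])
    top_lo = max(0, prev_top - radius)
    top_hi = min(len(frame) - th, prev_top + radius)
    left_lo = max(0, prev_left - radius)
    left_hi = min(len(frame[0]) - tw, prev_left + radius)
    best = None
    for top in range(top_lo, top_hi + 1):
        for left in range(left_lo, left_hi + 1):
            score = sad(frame, top, left, template)
            if best is None or score < best[0]:
                best = (score, top, left)
    if best is None:
        raise ValueError("search window lies outside the frame")
    return best[1], best[2]
```

Running this for each tracked landmark yields a set of per-frame displacements, from which the incremental change in the camera's perspective can be inferred.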
Tracking in such systems involves tracking the background motion from frame-to-frame, which provides an indication of how the camera is moving from frame-to-frame according to, for example, its pan, tilt, zoom, and roll movements. This tracking is typically based on frame-to-frame comparisons of previously determined background features of the image, such as markings on the field (e.g., yard line markers on a football field), stadium walls, sidelines, or any other sharp, bold, and clear vertical, horizontal, diagonal, or corner features. That is, the system obtains movement information of the camera from a current image of a video image sequence by monitoring the motion of such background features. More specifically, the system, prior to the real-time insertion process, selects at least three visually distinctive landmarks (distinctive enough to survive decimation by the Burt pyramid pattern recognition algorithm, for instance), and then recognizes a single reference point in the image that is mathematically defined in relation to the landmarks. The tracking also involves the calculation of a transform model. A transform model defines how a reference 3D world (scene) model (which is independent of the camera's pose) spatially corresponds to the current image. A camera model is a specific type of transform model expressed in terms of camera parameters, e.g., pan, zoom, tilt, and roll. An example of a system that generates such camera models is taught by U.S. Pat. No. 6,741,725 to Astle, which is hereby incorporated by reference.
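To make the idea of a transform model concrete, the following sketch fits a deliberately simplified one: a 2D similarity transform (uniform zoom, roll, and image shift) that maps reference landmark positions to their positions in the current image, in the least-squares sense. This is an assumption made for illustration and is much simpler than a full camera model; the function names are hypothetical. Points are encoded as complex numbers so that the fit q = a*p + b has a closed-form solution.

```python
import cmath

def fit_similarity(ref_pts, img_pts):
    """Least-squares fit of q = a*p + b over matched 2D points (complex encoding)."""
    if len(ref_pts) != len(img_pts) or len(ref_pts) < 2:
        raise ValueError("need at least two matched landmarks")
    p = [complex(x, y) for x, y in ref_pts]
    q = [complex(x, y) for x, y in img_pts]
    p_mean = sum(p) / len(p)
    q_mean = sum(q) / len(q)
    num = sum((qi - q_mean) * (pi - p_mean).conjugate() for pi, qi in zip(p, q))
    den = sum(abs(pi - p_mean) ** 2 for pi in p)
    if den == 0:
        raise ValueError("landmarks must not all coincide")
    a = num / den
    b = q_mean - a * p_mean
    return a, b

def apply_similarity(a, b, pt):
    """Map a reference point into the current image."""
    z = a * complex(*pt) + b
    return (z.real, z.imag)

def describe(a, b):
    """Express the fitted transform as zoom, roll, and image shift."""
    return {"zoom": abs(a), "roll_rad": cmath.phase(a), "shift": (b.real, b.imag)}
```

Once `a` and `b` are known, any reference point defined in relation to the landmarks, such as an insertion location, can be mapped into the current image with `apply_similarity`. Using more landmarks than the minimum averages out measurement noise in the individual landmark positions.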
Next, based on the found landmarks, the current camera's model may be estimated using camera model estimator module 130. A camera's model is a mathematical operator that maps a 3D point from the scene space to its corresponding point in the video image space. The camera's model is composed of intrinsic parameters, such as focal length, and extrinsic parameters, such as the camera's position and orientation (pan, tilt, and rotation).
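A camera model of this kind can be sketched as a pinhole projection. The example below is a minimal illustration, assuming camera axes with x to the right, y down, and z along the optical axis, a single focal length in pixels, and a particular order and sign convention for the pan, tilt, and roll rotations. None of these conventions come from the text, and the `Camera` class and `project` function are hypothetical names.

```python
import math
from dataclasses import dataclass

@dataclass
class Camera:
    position: tuple   # extrinsic: camera location (x, y, z) in scene units
    pan: float        # extrinsic: radians about the vertical (y) axis
    tilt: float       # extrinsic: radians about the horizontal (x) axis
    roll: float       # extrinsic: radians about the optical (z) axis
    focal_px: float   # intrinsic: focal length expressed in pixels
    center: tuple     # intrinsic: principal point (u0, v0) in pixels

def _matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def world_to_camera_rotation(cam):
    """Compose roll * tilt * pan (the order and signs are assumed conventions)."""
    cp, sp = math.cos(cam.pan), math.sin(cam.pan)
    ct, st = math.cos(cam.tilt), math.sin(cam.tilt)
    cr, sr = math.cos(cam.roll), math.sin(cam.roll)
    pan = [[cp, 0, -sp], [0, 1, 0], [sp, 0, cp]]
    tilt = [[1, 0, 0], [0, ct, -st], [0, st, ct]]
    roll = [[cr, -sr, 0], [sr, cr, 0], [0, 0, 1]]
    return _matmul(roll, _matmul(tilt, pan))

def project(cam, point):
    """Map a 3D scene point to (u, v) pixels, or None if it is behind the camera."""
    r = world_to_camera_rotation(cam)
    d = [p - c for p, c in zip(point, cam.position)]
    xc, yc, zc = (sum(r[i][k] * d[k] for k in range(3)) for i in range(3))
    if zc <= 0:
        return None
    return (cam.focal_px * xc / zc + cam.center[0],
            cam.focal_px * yc / zc + cam.center[1])
```

Under these conventions, a point that lies along the direction the camera has panned toward projects to the principal point, and zooming corresponds to changing `focal_px`.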
Having the current camera's model estimate, the warping unit 135 warps (projects) a given virtual element into the current video image space. For instance, a virtual element may be a logo. This logo may be represented in the indicia database 185 by its image (e.g., BMP or GIF format) and its desired location (insertion region) within the scene's 3D space. The warping unit 135 will then warp this logo's image, using the camera's model, into a new indicia image within the current video image space; this new indicia image is then ready to be rendered into the video image by the mixer 145. Note that a virtual element is not limited to a 2D graphic, but may be any 3D structure. In this case, the representative data of a 3D virtual element in the indicia database 185 may be its 3D model (polygonal mesh or point-based representation), texture, and desired position, orientation, and scale in the scene. Similarly, knowledge of the current camera's model may be used to render this 3D element from this camera perspective.
Next, the occlusion mask generator 140 generates a transparency function or mask key that is then applied to the insertion process at the mixer 145 to properly account for any obstacles that may be present in the insertion region. By performing occlusion processing prior to insertion, VIS 100 ensures that the verisimilitude of the logo inserted into the video image is preserved when a physical element like a player steps into the insertion region. Rather than occlude the player with the inserted logo, the transparency function or mask key ensures that at every pixel location where an overlap occurs between the player and the logo, the pixel corresponding to the logo is suppressed in favor of the pixel of the image of the player. Hence, at the mixer 145 the warped indicia images are superimposed with the video image based on the occlusion mask.
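The interaction between the mask key and the mixer can be sketched as per-pixel blending. In the example below, the key is derived by comparing each video pixel with a single background color, which is a strong simplification assumed purely for illustration; the threshold test, the `tolerance` parameter, and the function names are hypothetical. The mixing step then suppresses the logo wherever the key indicates an occluding object.

```python
def occlusion_key(video, background_rgb, tolerance):
    """1.0 where a pixel looks like background, 0.0 where something covers it."""
    tol2 = tolerance ** 2
    return [
        [1.0 if sum((c - b) ** 2 for c, b in zip(px, background_rgb)) <= tol2
         else 0.0
         for px in row]
        for row in video
    ]

def mix(video, warped_logo, logo_alpha, key):
    """Blend the warped logo over the video, weighted by logo alpha times key."""
    out = []
    for vrow, lrow, arow, krow in zip(video, warped_logo, logo_alpha, key):
        orow = []
        for v, l, a, k in zip(vrow, lrow, arow, krow):
            w = a * k  # zero outside the logo or where a foreground object overlaps
            orow.append(tuple(round(w * lc + (1 - w) * vc) for lc, vc in zip(l, v)))
        out.append(orow)
    return out
```

Images are lists of rows of RGB tuples, and `logo_alpha` holds the warped logo's own coverage in [0, 1]. A fractional key value would give soft edges around the occluding object, at the cost of a more elaborate key computation than the hard threshold used here.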
An operator, via a GUI component 150, controls the insertion processing system 120. Before the game, the operator sets and trains the system, preparing it for the live event. Typically, the operator enters data regarding the scene via a graphical interface. The operator defines the 3D coordinates of landmarks in the scene within a 3D coordinate system of the scene modeling unit 155. For example, in a sporting event the field structure will be entered. The operator may also train the system to recognize color characteristics of the dynamic foregrounds (players) and color characteristics of the static background (field) using color modeling unit 160. These data may be used later for occlusion mask generation. The operator also typically enters, using the indicia positioning unit 165, the desired 3D insertion location and orientation of each virtual element stored in the indicia database 185. As will be explained below, depending on the type of indicia, this information may be entered during pre-event setting or during the game.
Another type of augmented reality system involves the tracking of dynamic elements of the video image sequence. For instance, a fan watching a hockey game on television may have trouble visually following a hockey puck during a game, because of its small size and high velocity. Other sports, like football, involve large numbers of players on each team, with each player having his movements restricted according to the rules of the game (e.g., linemen must stay behind the line of scrimmage in a pass play until the quarterback throws the pass). Visually tracking whether or not a player is legally beyond the line of scrimmage may prove too difficult for the typical viewer, given the large number of players involved and their largely unpredictable movements. An exemplary system that tracks players and game objects like pucks, balls, hockey sticks, etc., is taught in U.S. Published Patent Appln. No. 2011/0013836, which is hereby incorporated by reference.
The ability to distinguish between foreground and background elements of a scene is critical for the extraction of an accurate camera model and positional data of dynamic objects in systems like the VIS and dynamic object tracking systems described above. Certain prior systems for tracking dynamic foreground objects have relied on sensors to determine the position and orientation of a dynamic foreground object like a player or ball. For instance, U.S. Pat. No. 7,116,342 to Dengler et al. describes a system for inserting perspective correct content into an image sequence. According to the '342 patent, before an insert is inserted into an image sequence (e.g., inserting a logo onto the jersey of a player moving on the field), it is transformed according to orientation and size so that it appears realistic when inserted as part of a player's uniform. The tracking of the player is performed by sensors and without reference to any content contained in the image sequence.
Another sensor-based system is described in U.S. Pat. No. 5,912,700 to Honey et al. The '700 patent describes a system that includes one or more sensors to determine the location of a dynamic object to have its appearance enhanced. In the preferred embodiment of the '700 patent, sensors embedded in a hockey puck communicate with receivers that are deployed around the arena to track the movement of the puck during a game. Because the video production system determines the location of the puck in a video frame during game play, the system of the '700 patent can graphically enhance the image of the puck in order to make it more visible to a person watching the game on television.
The '342 and '700 patents describe systems that employ sensor-based techniques to track the movement of dynamic foreground objects. However, using sensors attached to the dynamic objects may not be an option, or may not provide the required spatial and temporal resolution of the data. For example, attaching sensors to the players' bodies requires concessions from the team and the league, and is usually not at the discretion of the broadcasting company. In addition, these methods cannot be applied to post-production processing where corresponding sensory data are not available. In contrast, methods that are vision-based offer more flexibility and may be applied to either live or post-production video streams without being dependent on the availability of sensory data. Generally, vision-based processing yields high temporal resolution (e.g., 30 msec) and spatial resolution that is as high as the video image resolution.