Video insertion systems apply augmented reality techniques to enhance the experience of viewing many types of programs, including sporting events, by inserting virtual enhancements (also referred to as inserts, graphics, logos, or indicia) at a particular location in the video image that a viewer is watching on television. For instance, in football, a First Down Line (FDL) is inserted into the real-time broadcast of a game to mark the point on the field that the team currently on offense must reach in order to be awarded four more downs. In another example, a Down and Distance (DnD) arrow denoting the play number and the distance left to reach the FDL is inserted. While the positions and appearances of these virtual elements are determined live based on game progression, other virtual elements may be unrelated to game events, such as advertising indicia inserted at various areas on the field of play or on stadium structures such as a stadium wall.
An insertion system inserts graphics (virtual elements) into a live video broadcast realistically and in real time. Generally, the camera's perspective is continuously estimated so that graphical elements, either 2D or 3D, may be projected into the video image from the current camera's perspective as if they were located at a predetermined position and orientation in the scene.
Live broadcast Video Insertion Systems (VIS) were developed and are used commercially for the purpose of inserting advertising and other indicia into video sequences, including live broadcasts of sporting events. An example of such a live broadcast VIS is used commercially under the trade name L-VIS®. In further examples, live broadcast VIS are described in U.S. Pat. Nos. 5,264,933 and 5,543,856 to Rosser et al. and U.S. Pat. No. 5,491,517 to Kreitman et al., which are hereby incorporated by reference in their entirety. These VIS, with varying degrees of success, seamlessly and realistically incorporate indicia into the original video in real time. Realism is maintained even as the camera changes its perspective throughout the event coverage, and moving elements in the scene that may occlude the inserted indicia are taken into account.
FIG. 1 shows a top-level block diagram of a typical VIS 100. The main VIS computing component 120 receives a live video feed 110 from a camera and then outputs, possibly with some latency, an enhanced video 115. In addition, the system includes a GUI component 150, with which an operator controls the system before and during an event, and an indicia unit 170, where representations of the inserted virtual elements are stored.
At the heart of each insertion system is the capability to associate a point 226 in the scene with its projection 246 in the video image space, as illustrated in FIG. 2. Generally, the scene's model is known. For example, a football field's dimensions are defined within 3D coordinate space 210, and the scene's model includes the 3D location of each distinctive landmark (e.g., lines 225, junction points 226, etc.) in the field. The field's X-Y plane 210 shows an insertion region denoted by the 3D coordinates P1, P2, P3, and P4. This insertion region is associated with a virtual element (e.g., a 2D graphic) that is to be inserted 240 into the current video image 230 from the current camera perspective. A camera projects the scene into its image space 230, with a projection dictated by the camera's parameters (e.g., focal length, position, orientation, etc.). Once the camera's parameters are known, any region 220 within the real-world space 210 may be projected into the camera's image space 240. Estimation of the camera's parameters, in turn, requires knowledge of fiducial points (landmarks in the scene, e.g., 225 and 226, and their corresponding points in the image, e.g., 245 and 246). The way in which a typical VIS, continuously and in real time, estimates the current camera's parameters (referred to herein as the camera's model) and uses them to virtually insert indicia is described in detail below.
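The projection of scene points into the image space can be illustrated with a minimal sketch. It assumes an idealized pinhole camera with square pixels and no lens distortion, which the description above does not specify; the function names `camera_matrix` and `project` and all numeric values are hypothetical and chosen only for illustration.

```python
import numpy as np

def camera_matrix(focal_px, center_px, rotation, position):
    """Build a 3x4 pinhole projection matrix P = K [R | -R c].

    Assumption (not from the source): square pixels, no skew, no distortion.
    focal_px  -- focal length in pixels (intrinsic parameter)
    center_px -- principal point (cx, cy) in pixels (intrinsic parameter)
    rotation  -- 3x3 world-to-camera rotation (extrinsic parameter)
    position  -- camera position in scene coordinates (extrinsic parameter)
    """
    cx, cy = center_px
    K = np.array([[focal_px, 0.0, cx],
                  [0.0, focal_px, cy],
                  [0.0, 0.0, 1.0]])
    R = np.asarray(rotation, dtype=float)
    c = np.asarray(position, dtype=float).reshape(3, 1)
    return K @ np.hstack([R, -R @ c])

def project(P, points_3d):
    """Project N scene points (N x 3) to image pixels (N x 2)."""
    pts = np.asarray(points_3d, dtype=float)
    homog = np.hstack([pts, np.ones((len(pts), 1))])  # homogeneous coordinates
    uvw = homog @ P.T
    return uvw[:, :2] / uvw[:, 2:3]                   # perspective divide

# Hypothetical camera 10 units from the scene origin along its optical axis.
P = camera_matrix(1000.0, (640.0, 360.0), np.eye(3), [0.0, 0.0, -10.0])
landmarks = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
pixels = project(P, landmarks)
```

In this example the scene origin lands on the principal point (640, 360), and the landmark one unit along X lands 100 pixels to its right, at (740, 360).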
Finding pairs of corresponding points, where landmarks in the field are matched with their projections in the current video frame, starts with the recognition process performed by recognition and tracking module 125. By processing the current video image, unique features such as lines, conics, junctions, corners, etc., are detected. Based on their geometrical structure, appearance, or any other attributes, their correspondence with landmarks in the scene model is determined. This recognition process may be carried out every several frames. For the frames between successive recognition passes, tracking of the detected features by the recognition and tracking module 125 may maintain their correspondence with the scene's landmarks. Next, based on the corresponding pairs found, the current camera's model may be estimated using camera model estimator module 130. As mentioned before, a camera's model is a mathematical operator (matrix) that maps a 3D point from the scene space 210 to its corresponding point in the video image space 230. The camera's model is composed of intrinsic parameters, such as focal length, and extrinsic parameters, such as the camera's position and orientation (pan, tilt, and rotation).
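Estimating a camera model from corresponding pairs can be sketched for a simplified case. The sketch below assumes that all landmarks lie on the field plane (Z = 0), in which case the mapping from field coordinates to image pixels reduces to a 3x3 homography; it uses the standard direct linear transform (DLT), which is a swapped-in textbook technique and not necessarily the estimator used by module 130. The function names are hypothetical.

```python
import numpy as np

def estimate_homography(field_pts, image_pts):
    """Estimate H (3x3) with image ~ H @ field from >= 4 point pairs (DLT).

    Assumptions (not from the source): planar landmarks, exact or
    low-noise correspondences, no coordinate normalization.
    """
    field_pts = np.asarray(field_pts, dtype=float)
    image_pts = np.asarray(image_pts, dtype=float)
    if len(field_pts) < 4 or len(field_pts) != len(image_pts):
        raise ValueError("need at least 4 matching point pairs")
    rows = []
    for (x, y), (u, v) in zip(field_pts, image_pts):
        # Each pair contributes two linear constraints on the 9 entries of H.
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # The solution is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(np.array(rows))
    H = vt[-1].reshape(3, 3)
    return H / H[2, 2]  # fix the arbitrary scale (assumes H[2, 2] != 0)

def apply_homography(H, pts):
    """Map N x 2 points through H, including the perspective divide."""
    pts = np.asarray(pts, dtype=float)
    homog = np.hstack([pts, np.ones((len(pts), 1))])
    mapped = homog @ H.T
    return mapped[:, :2] / mapped[:, 2:3]
```

With noisy detections, more than four pairs and a robust fitting scheme would typically be preferable; the sketch omits both for brevity.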
Given the current camera's model estimate, the warping unit 135 warps (projects) a given virtual element at a given 3D pose into the current video image space 230. For instance, a virtual element may be a logo. This logo may be represented in the indicia database 185 by its image (e.g., in BMP or GIF format) and its desired location (insertion region) within the scene's 3D space: P1, P2, P3, and P4. The warping unit 135 then warps this logo's image, using the camera's model, into a new indicium image within the current video image space: C1, C2, C3, and C4. This new indicium image is then ready to be rendered into the video image by the mixer 145. Note that a virtual element is not limited to a 2D graphic but may be any 3D structure. In this case, the data representing the 3D virtual element in the indicia database 185 may be its 3D model (polygonal mesh or point-based representation), texture, and desired position, orientation, and scale in the scene. Similarly, knowledge of the current camera's model may be used to render this 3D element from the camera's perspective.
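The warping of a 2D logo can be sketched as follows. The sketch assumes a planar insertion region shaped as a parallelogram (so that P1, P2, and P4 determine it), a field-to-image homography already available as the camera's model, and nearest-neighbor sampling without half-pixel offsets; none of these choices comes from the description above, and the function names `logo_to_field` and `warp_logo` are hypothetical.

```python
import numpy as np

def logo_to_field(logo_w, logo_h, p1, p2, p4):
    """Affine map taking logo pixel (0, 0) to P1, (w, 0) to P2, (0, h) to P4.

    Assumption: the insertion region is a parallelogram on the field plane.
    """
    p1, p2, p4 = (np.asarray(p, dtype=float) for p in (p1, p2, p4))
    ex = (p2 - p1) / logo_w   # field displacement per logo pixel in x
    ey = (p4 - p1) / logo_h   # field displacement per logo pixel in y
    return np.array([[ex[0], ey[0], p1[0]],
                     [ex[1], ey[1], p1[1]],
                     [0.0, 0.0, 1.0]])

def warp_logo(logo, H_logo_to_image, out_shape):
    """Inverse-map every output pixel into the logo (nearest neighbor).

    Returns the warped image and a boolean coverage mask of the same
    height and width, marking the pixels the logo lands on.
    """
    out_h, out_w = out_shape
    logo_h, logo_w = logo.shape[:2]
    H_inv = np.linalg.inv(H_logo_to_image)
    vs, us = np.mgrid[0:out_h, 0:out_w]
    pix = np.stack([us.ravel(), vs.ravel(), np.ones(us.size)])
    src = H_inv @ pix
    w = src[2]
    valid = np.abs(w) > 1e-12            # skip points at infinity
    sx = np.full(w.shape, -1.0)
    sy = np.full(w.shape, -1.0)
    sx[valid] = src[0, valid] / w[valid]
    sy[valid] = src[1, valid] / w[valid]
    inside = valid & (sx >= 0) & (sx < logo_w) & (sy >= 0) & (sy < logo_h)
    warped = np.zeros((out_h * out_w,) + logo.shape[2:], dtype=logo.dtype)
    # Truncation equals floor here because the coordinates are non-negative.
    warped[inside] = logo[sy[inside].astype(int), sx[inside].astype(int)]
    return (warped.reshape((out_h, out_w) + logo.shape[2:]),
            inside.reshape(out_h, out_w))
```

Under these assumptions, the full logo-to-image mapping would be the product of the field-to-image homography and the matrix returned by `logo_to_field`, and the corners C1 through C4 are the images of the logo's corners under that product.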
Next, the occlusion mask generator 140 generates a transparency function, or mask key, that is then applied to the insertion process at the mixer 145 to properly account for any obstacles that may be present in the insertion region. By performing occlusion processing prior to insertion, VIS 100 ensures that the verisimilitude of the inserted logo in the video image is preserved when a physical element, such as a player, steps into the insertion region. Rather than occluding the player with the inserted logo, the transparency function or mask key ensures that at every pixel location where the player and the logo overlap, the pixel corresponding to the logo is suppressed in favor of the pixel of the image of the player. Hence, at the mixer 145, the warped indicia images are superimposed on the video image based on the occlusion mask.
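The mixing step can be sketched as a per-pixel blend. The sketch assumes an occlusion mask with values in [0, 1], where 1 marks a foreground object such as a player, and a boolean coverage mask marking where the warped logo lands; the function name `mix` and these mask conventions are assumptions made for illustration, not details taken from the description above.

```python
import numpy as np

def mix(video, warped_logo, coverage, occlusion):
    """Blend a warped logo into a video frame under an occlusion mask.

    video, warped_logo -- H x W x 3 arrays of the same dtype
    coverage           -- H x W boolean, True where the logo has pixels
    occlusion          -- H x W floats in [0, 1], 1 where a foreground
                          object (e.g., a player) must stay visible
    """
    # The logo is shown only where it exists and nothing occludes it.
    alpha = coverage.astype(float) * (1.0 - occlusion.astype(float))
    alpha = alpha[..., None]  # broadcast over the color channels
    out = video.astype(float) * (1.0 - alpha) + warped_logo.astype(float) * alpha
    return out.round().astype(video.dtype)
```

Fractional mask values give a soft transition at the boundary between a player and the logo, while a value of 1 suppresses the logo pixel entirely in favor of the player's pixel.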
An operator controls the insertion processing system 120 via a GUI component 150. Before the game, the operator sets up and trains the system, preparing it for the live event. Typically, the operator enters data regarding the scene, usually via a graphical interface. The operator defines the 3D coordinates of landmarks in the scene within a 3D coordinate system using the scene modeling unit 155. For example, in a sporting event, the field structure 210 will be entered. The operator may also train the system to recognize color characteristics of the dynamic foreground (players) and color characteristics of the static background (field) using color modeling unit 160. This data will be used later for occlusion mask generation. The operator also typically enters, using the indicia positioning unit 165, the desired 3D insertion location and orientation of each virtual element stored in the indicia database 185. As will be explained below, depending on the type of indicia, this information may be entered during pre-event setup or during the game.
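How trained color characteristics might feed occlusion mask generation can be sketched with a simple statistical model. The sketch assumes that the background (field) color is modeled as a single Gaussian in RGB space and that a pixel is treated as foreground when its Mahalanobis distance from that model exceeds a threshold; the description above does not state which color model unit 160 uses, so this is only one plausible choice, and the function names and threshold are hypothetical.

```python
import numpy as np

def fit_color_model(samples):
    """Fit a Gaussian color model to background (field) pixel samples.

    samples -- array of RGB values with shape (..., 3), e.g. pixels the
               operator marked as field during pre-event training.
    Returns the mean color and the inverse covariance matrix.
    """
    samples = np.asarray(samples, dtype=float).reshape(-1, 3)
    mean = samples.mean(axis=0)
    # A small ridge keeps the covariance invertible for uniform samples.
    cov = np.cov(samples, rowvar=False) + 1e-3 * np.eye(3)
    return mean, np.linalg.inv(cov)

def occlusion_mask(image, model, threshold=3.0):
    """Mark pixels whose color is unlikely under the background model.

    Returns an H x W boolean mask, True where the pixel is treated as
    foreground (Mahalanobis distance above `threshold`).
    """
    mean, inv_cov = model
    diff = image.reshape(-1, 3).astype(float) - mean
    d2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)  # squared distance
    return (d2 > threshold ** 2).reshape(image.shape[:2])
```

A single Gaussian cannot represent a background with several distinct colors, such as grass with painted lines, so a practical system would likely need a richer model than this sketch.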
In the VIS described above, particularly, though not exclusively, those capable of inserting a virtual element into a dynamically determined insertion region, a problem arises when the element is inserted into an area of the video image that should not be occluded. For instance, in a football game, a virtual logo should not occlude the name of the team in the end zone, and in a baseball game, the logo should not cover an actual (as opposed to virtual) advertisement on the stadium wall. In prior systems, the responsibility for avoiding such occlusion fell on the operator, who had to manually reposition the inserted logo so that it did not interfere with any portion of the image that ought to remain visible during the broadcast. Such manual repositioning has the unfortunate side effect of delaying the insertion of the logo in its desired position to such an extent that the viewer notices the logo appearing suddenly in the image, as if out of nowhere. Such a visually noticeable delay destroys the seamlessness and realism that are the intended hallmarks of VIS.