The present invention is a continuation-in-part of U.S. patent application Ser. No. 10/006,444, filed on Nov. 20, 2001, entitled “Optimizations for Live Event, Real-Time, 3-D Object Tracking” and issued as U.S. Pat. No. 7,483,049.
Furthermore, the present invention incorporates by reference and claims the benefit of priority of U.S. Provisional Application 60/563,091, filed on Apr. 14, 2004, entitled “Automatic Sports Broadcasting System,” with the same named inventors.
By today's standards, a multi-media sporting event broadcast that might typically be viewed through a television includes at least the following information:
        video of the game, preferably spliced together from multiple views;
        replays of key events;
        audio of the game;
        graphic overlays of key statistics such as the score and other basic game metrics;
        ongoing “play-by-play” audio commentary;
        graphic overlays providing game analysis and summaries, and
        advertisements inserted as clips during game breaks or as graphic overlays during play.
Furthermore, after or while this information is collected, generated and assembled, it must also be encoded for transmission to one or more remote viewing devices such as a television or computer; typically in real-time. Once received on the remote viewing device, it must also be decoded and therefore returned to a stream of visual and auditory output for the viewer.
Any manual, semi-automatic or automatic system designed to create this type of multi-media broadcast must at least be able to:
        track official game start/stop times, calls and scoring;
        track participant and game object movement;
        collect game video and audio;
        analyze participant and game object movement;
        create game statistics and commentary based upon the game analysis;
        insert advertisements as separate video/audio clips or graphic overlays, and
        encode and decode a broadcast of the streams of video, audio, and game metric information.
The present inventors are not aware of any fully automatic systems for creating sports broadcasts. There are many drawbacks to the current largely manual systems and methodologies, some of which are identified as follows:
        the cost of creating such broadcasts is significant, both in terms of equipment and labor, and therefore excludes smaller markets such as amateur and youth sports;
        for practical reasons such as equipment and labor costs, the number of filming cameras is limited;
        the typical broadcaster relies upon manually operated filming cameras to anticipate and follow the game action, but in practice it is difficult to consistently capture the more important and interesting events from the most desirable angles;
        there is currently no practical means of creating a complete overhead view of the ongoing game that can be best used for game analysis and explanation;
        current videoing technology is synchronized to broadcast standards, such as NTSC, which regulate the frequency of image capture to be 29.97 frames per second, and is consequently out-of-sync with typical indoor high-wattage lighting systems that fluctuate 120 times per second, thus causing inconsistent lighting conditions from one image frame to the next;
        current filming technology is all based in visible light and does not take advantage of the information that could be collected in the non-visible spectrums;
        while some current systems can follow the game object, such as a puck, they cannot also automatically identify and track all participants, determining their locations and orientations throughout the entire contest;
                    while some systems can automatically film the game centered around the detected location of the game object, they cannot additionally anticipate action based upon the knowledge of tracked participants or direct other cameras to follow these tracked participants;
        current systems cannot automatically track key spectators such as coaches, family members and other VIPs so as to automatically film them during or after key game action;
        game analysis, especially for more dynamic and fast moving sports such as ice hockey, can require hundreds to thousands of ongoing observations which are extremely difficult for manual systems to accurately record, let alone interpret in real-time;
        there are currently no systems capable of creating a flow of tokens to describe game action that can be used to automatically direct synthesized and pre-recorded speech adding commentary to the ongoing game;
        while inserting advertisements as clips into the ongoing game feed is relatively straightforward, adding overlaid graphics to the game action video is more problematic and requires greater forms of automation;
        current practice typically does not automate the interface with the official game start and stop times in order to help automatically regulate the broadcast stream of live action, replays, commentary video and advertisements;
        current practice typically does not automate the interface with the official scorekeeper in order to help automatically determine official game scoring, penalties and other rulings;
        current systems have no way of delineating game events based upon tracked participants and information collected from an interface with the official scoring and ruling system;
        current broadcasts are primarily designed to be output through a television and are therefore limited by the television's display and computational shortcomings as well as its smaller broadcast bandwidths, which constrain the total amount of presentable information;
        while targeted for television output, broadcasts are not designed to take advantage of current computer technology that is now able to generate realistic graphic renderings of both the human form and surrounding environments in real-time;
        current broadcasts are not interactive and so do not allow the viewer to dynamically select between multiple video feeds to be viewed either singly or in combination, and
        current encoding techniques do not take advantage of newer video and audio compression technologies or possibilities, thereby wasting bandwidth that could be used either to provide additional information or to conserve broadcaster capacity.
Traditionally, professional broadcasters have relied upon a team of individuals working on various aspects of this list of tasks. For instance, a crew of cameramen would be responsible for filming a game from various angles using fixed and/or roving cameras. These cameras may also collect audio from the playing area, and/or crew members would use fixed and/or roving microphones. Broadcasters would typically employ professional commentators to watch the game and provide both play-by-play descriptions and ongoing opinions and analysis. These commentators have access to the game scoreboard and can both see and hear the officials and referees as they oversee the game. They are also typically supported by statisticians who create meaningful game analysis summaries and are therefore able to provide both official and unofficial game statistics as audio commentary. Alternatively, this same information may be presented as graphic overlays onto the video stream with or without audible comment. All of this collected and generated information is then presented simultaneously to a production crew that selects the specific camera views and auditory streams to meld into a single presentation. This production crew has access to official game start and stop times and uses this information to control the flow of inserted advertisements and game action replays. The equipment used by the production team automatically encodes the broadcast into a universally accepted form which is then transmitted, or broadcast, to any and all potential viewing devices. The typical device is already built to accept the broadcaster's encoded stream and to decode this into a set of video and audio signals that can be presented to the viewers through appropriate devices such as a television and/or multi-media equipment.
Currently, there are no fully automatic, or even semi-automatic, systems for creating a video and/or audio broadcast of a sporting event. The first major problem that must be solved in order to create such a system is:
How Does an Automated System Become “Aware” of the Game Activities?
Any fully automated broadcast system would have to be predicated on the ability of a tracking system to continuously follow and record the location and orientation of all participants, such as players and game officials, as well as the game object, such as a puck, basketball or football. The present inventors taught a solution for this requirement in their first application entitled “Multiple Object Tracking System.” Additional novel teachings were disclosed in their continuing application entitled “Optimizations for Live Event, Real-Time, 3-D Object Tracking.” Both of these applications specified the use of cameras to collect video images of game activities followed by image analysis directed towards efficiently determining the location and orientation of participants and game objects. Important techniques were taught including the idea of gathering overall object movement from a grid of fixed overhead cameras that would then automatically direct any number of calibrated perspective tracking and filming cameras.
Other tracking systems exist in the market such as those provided by Motion Analysis Corporation. Their system, however, is based on fixed cameras placed at perspective filming angles thereby creating a filled volume of space in which the movements of participants could be adequately detected from two or more angles at all times. This approach has several drawbacks including the difficult nature of uniformly scaling the system in order to encompass the different sizes and shapes of playing areas. Furthermore, the fixed view of the perspective cameras is overly susceptible to occlusions as two or more participants fill the same viewing space. The present inventors prefer first determining location and orientation based upon the overhead view which is almost always un-blocked regardless of the number of participants. While the overhead cameras cannot sufficiently view the entire body, the location and orientation information derived from their images is ideal for automatically directing a multiplicity of calibrated perspective cameras to minimize player occlusions and maximize body views. The Motion Analysis system also relied upon visible, physically intrusive markings including the placement of forty or more retroreflective spheres attached to key body joints and locations. It was neither designed nor intended to be used in a live sporting environment. A further drawback to using this system for automatic sports broadcasting is its filtering of captured images for the purposes of optimizing tracking marker recognition. Hence, the resulting image is insufficient for broadcasting and therefore a complete second set of cameras would be required to collect the game film.
Similarly, companies such as Trakus, Inc. proposed solutions for tracking key body points (in the case of ice hockey, a player's helmet) and did not simultaneously collect meaningful game film. The Trakus system is based upon the use of electronic beacons that emit pulsed signals that are then collected by various receivers placed around the tracking area. Unlike the Motion Analysis solution, the Trakus system could be employed in live events, but it only determines participant location and not orientation. Furthermore, their system does not collect game film, either from the overhead or perspective views.
Another beacon approach was also employed in Honey et al.'s U.S. Pat. No. 5,912,700 assigned to Fox Sports Productions, Inc. Honey teaches the inclusion of infrared emitters in the game object to be tracked, in their example a hockey puck. A series of two or more infrared receivers detects the emissions from the puck and passes the signals to a tracking system that first triangulates the puck's location and second automatically directs a filming camera to follow the puck's movement.
It is conceivable that the Trakus and Fox Sports systems could be combined to form a single system that could continuously determine the location of all participants and the game object. Furthermore, building upon techniques taught in the Honey patent, the combined system could be made to automatically film the game from one or more perspective views. However, this combined system would have several drawbacks. First, this system can only determine the location of each participant and not their orientation, which is critical for game analysis and automated commentary. Second, the beacon based system is expensive to implement in that it requires both specially constructed (and therefore expensive) pucks and transmitters inserted into players' helmets. Both of these requirements are impractical at least at the youth sports levels. Third, the tracking system does not additionally collect overhead game film that can be combined to form a single continuous view. Additionally, because these solutions are not predicated on video collection and analysis, they do not address the problems attendant to a multi-camera, parallel processing image analysis system.
Orad Hi-Tech Systems is assigned U.S. Pat. No. 5,923,365 for a Sports Event Video manipulating system. In this patent by inventor Tamir, a video system is taught that allows an operator to select a game participant for temporary tracking using a video screen and light pen. Once identified, the system uses traditional edge detection and other similar techniques to follow the participant from frame-to-frame. Tamir teaches the use of software based image analysis to track those game participants and objects that are viewable anywhere within the stream of images being captured by the filming camera. There are several difficulties with this approach, at least because the single camera cannot maintain a complete view of the entire playing area at all times throughout the game. Some of these problems are discussed in the application, including knowing when participants enter and exit the current view or when they are occluding each other. The present inventors prefer the use of a matrix of overhead cameras to first track all participants throughout the entire playing area and with this information to then gather and segment perspective film, all without the user intervention required by Tamir.
Orad Hi-Tech Systems is also assigned U.S. Pat. No. 6,380,933 B1 for a Graphical Video System. In the patent, inventor Sharir discloses a system for tracking the three-dimensional position of players and using this information to drive pre-stored graphic animations enabling remote viewers to view the event in three dimensions. Rather than first tracking the players from an overhead or substantially overhead view as preferred by the present inventors, in one embodiment Sharir relies upon a calibrated theodolite that is manually controlled to always follow a given player. The theodolite has been equipped to project a reticle, or pattern, that the operator continues to direct at the moving player. As the player moves, the operator adjusts the angles of the theodolite that are continuously and automatically detected. These detected angles provide measurements that can locate the player in at least the two dimensions of the plane orthogonal to the axis of the theodolite. Essentially, these measurements indicate the player's relative side-to-side location but will not alone indicate how far the player is from the theodolite. Sharir anticipated having one operator/theodolite in operation per player and is therefore relying upon this one-to-one relationship to indicate player identity. This particular embodiment has several drawbacks including imprecise three-dimensional location tracking due to the single line-of-sight, no provision for player orientation tracking, as well as the requirement for significant operator interaction.
In a different embodiment in the same application, Sharir describes what he calls a real-time automatic tracking and identification system that relies upon a thermal imager boresighted on a stadium camera. Similar to the depth-of-field problem attendant to the theodolite embodiment, Sharir is using the detected pitch of the single thermal imaging camera above the playing surface to help triangulate the player's location. While this can work as a rough approximation, unless there is an exact feature detected on the player that has been calibrated to the player's height, then the estimation of distance will vary somewhat based upon how far away the player truly is and what part of the player is assumed to be imaged. Furthermore, this embodiment also requires potentially one manually operated camera per player to continuously track the location of every player at all times throughout the game. Again, the present invention is “fully” automatic especially with respect to participant tracking. In his thermal imaging embodiment, Sharir teaches the use of a laser scanner that “visits” each of the blobs detected by the thermal imager. This requires each participant to wear a device consisting of an “electro-optical receiver and an RF transmitter that transmits the identity of the players to an RF receiver.” There are many drawbacks to the identification via transmitter approach as previously discussed in relation to the Trakus beacon system. The present inventors prefer a totally passive imaging system as taught in prior co-pending and issued applications and further discussed herein.
And finally, in U.S. Pat. Nos. 5,189,630 and 5,526,479 Barstow et al. disclose a system for broadcasting a stream of “computer coded descriptions of the (game) sub-events and events” that is transmitted to a remote system and used to recreate a computer simulation of the game. Barstow anticipates also providing traditional game video and audio essentially indexed to these “sub-events and events,” allowing the viewer to controllably recall video and audio of individual plays. With respect to the current goals of the present application, Barstow's system has at least two major drawbacks. First, these “coded descriptions” are detected and entered into the computer database by an “observer who attends or watches the event and monitors each of the actions which occurs in the course of the event.” The present inventors prefer and teach a fully automated system capable of tracking all of the game participants and objects, thereby creating an ongoing log of all activities which may then be interpreted through analysis to yield distinct events and outcomes. The second drawback is an outgrowth of the first limitation. Specifically, Barstow teaches the pre-establishment of a “set of rules” defining all possible game “events.” He defines an “event” as “a sequence of sub-events constituted by a discrete number of actions selected from a finite set of action types . . . . Each action is definable by its action type and from zero to possibly several parameters associated with that action type.” In essence, the entire set of “observations” allowable to the “observer who attends or watches” the game must conform to this pre-established system of interpretation.
Barstow teaches that “the observer enters associated parameters for each action which takes place during the event.” Of course, as previously stated, human observers are extremely limited in their ability to accurately detect and timely record participant location and orientation data that is of extreme importance to the present inventors' view of game analysis. Barstow's computer simulation system builds into itself these very limitations.
Ultimately, this stream of human observations that has been constrained to a limited set of action types is used to “simulate” the game for a remote viewer.
With respect to an automated system capable of being “aware” of the game activities, only the teachings of the present inventors address an automatic system for:
        collecting overhead film that can be dually used for both tracking and videoing;
        specifying how this mosaic of overlapping, overhead film can be combined into a single contiguous and continuous video stream;
        analyzing the video stream to determine both the location and orientation of the participants and game objects;
                    determining three dimensional information including the height of the game object off of the playing surface;
        analyzing the film to determine the identity of participants who are wearing unique affixed markings such as encoded helmet stickers;
        directing perspective ID cameras to follow detected participants for the purposes of collecting isolated images of their jersey number and other existing identifying marks;
                    alternatively determining participant identification by performing pattern recognition on these key isolated images of participant jersey numbers and other identifying marks;
        directing perspective filming cameras to collect additional video and locate additional body points;
        additionally collecting overhead and perspective video from the non-visible spectrum including ultraviolet and infrared frequencies that can be used to locate non-visible markings specially placed on a given participant's key body locations;
                    dynamically creating a three-dimensional kinetic body model of participants using the tracked locations of the non-visible markings;
        creating separate film and tracking databases from these continuous streams of overhead and perspective images;
        analyzing the tracking database in real-time to detect and classify individual game events;
        directing perspective videoing cameras to follow detected unfolding events of current or potential significance from camera angles anticipated to best reveal the game action, and
                    directing these same perspective cameras that might normally capture images at roughly 30 frames per second to occasionally capture at higher rates of 60, 90, 120 or more frames per second when selected key events are unfolding, thereby supporting slow and super-slow motion replays.
In order to create a complete automatic broadcasting system, additional problems needed to be resolved such as:
How can a System Filming High Speed Motion that Requires Fast Shutter Speeds Synchronize Itself to the Lighting System?
The typical video camera captures images at the NTSC broadcast standard of 29.97 frames per second. Furthermore, such cameras most often use what is referred to as full integration, which means that each frame is “exposed” for the maximum time between frames. In the case of 29.97 frames per second, the shutter speed would be roughly 1/30th of a second. This approach is acceptable for normal continuous viewing but leads to blurred images when a single frame is frozen for “stop action” or “freeze frame” viewing. In order to do accurate image analysis on high-speed action, it is important both to capture at least 30 if not 60 frames per second and to capture each frame with a shutter speed of 1/500th to 1/1000th of a second. Typically, image analysis is more reliable if there is less image blurring.
Coincident with this requirement for faster shutter speeds to support accurate image analysis is the issue of indoor lighting at a sport facility such as an ice hockey rink. A typical rink is illuminated using two separate banks of twenty to thirty metal halide lamps with magnetic ballasts. Both banks, and therefore all lamps, are powered by the same alternating current that typically runs at 60 Hz, causing 120 “on-off” cycles per second. If the image analysis cameras use a shutter speed of 1/120th of a second or faster, for instance 1/500th or 1/1000th of a second, then it is possible that the lamp will essentially be “off” or discharged when the camera's sensor is being exposed. Hence, what is needed is a way to synchronize the camera's shutter with the lighting to be certain that it only captures images when the lamps are discharging, that is, emitting light. The present application teaches the synchronization of the high-shutter-speed tracking and filming cameras with the sports venue lighting to ensure maximum, consistent image lighting.
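The timing relationship described above can be illustrated with a minimal sketch. It assumes an idealized lamp whose output follows a rectified sine of the 60 Hz mains (real metal halide lamps on magnetic ballasts will deviate from this), and the names `lamp_intensity` and `exposure_schedule` are hypothetical, not part of the specification. The sketch places each short exposure on a lamp peak and rejects frame rates, such as 29.97, that do not divide the 120 peaks per second evenly.

```python
import math


def lamp_intensity(t, mains_hz=60.0):
    """Idealized relative lamp output at time t (seconds).

    Assumption: output follows a rectified sine, so it peaks
    2 * mains_hz times per second (120 times on 60 Hz mains).
    """
    return abs(math.sin(2.0 * math.pi * mains_hz * t))


def exposure_schedule(frames_per_second, exposure_s, n_frames, mains_hz=60.0):
    """Return (open, close) shutter times centered on lamp peaks.

    Raises ValueError when the frame rate does not divide the lamp
    peak rate evenly, because such a camera drifts against the lamps.
    """
    peaks_per_second = 2.0 * mains_hz
    ratio = peaks_per_second / frames_per_second
    step = round(ratio)
    if step < 1 or abs(ratio - step) > 1e-9:
        raise ValueError("frame rate is not locked to the lamp cycle")
    peak_period = 1.0 / peaks_per_second
    windows = []
    for frame in range(n_frames):
        # Peaks of |sin| fall halfway through each half-cycle.
        peak_time = (frame * step + 0.5) * peak_period
        windows.append((peak_time - exposure_s / 2.0,
                        peak_time + exposure_s / 2.0))
    return windows


if __name__ == "__main__":
    for open_t, close_t in exposure_schedule(60, 1 / 1000, 3):
        print(f"{open_t:.6f} s to {close_t:.6f} s, "
              f"edge intensity {lamp_intensity(open_t):.3f}")
```

Under this model a 1/1000th of a second exposure centered on a peak sees nearly constant light across the whole window, whereas an unsynchronized exposure of the same length could land anywhere in the lamp cycle.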
How can a Practical, Low-Cost System be Built to Process the Simultaneous Image Flow from Approximately Two Hundred Cameras Capturing Thirty to One Hundred and Twenty Images Per Second?
Current technology, such as that provided by Motion Analysis Corporation, typically supports up to a practical maximum of thirty-two cameras. For an indoor sport such as youth ice hockey, where the ceiling is only twenty-five to thirty feet off the ice surface, the present inventors prefer a system of eighty or more cameras to cover the entire tracking area. Furthermore, as will be taught in the present specification, it is beneficial to create two to three separate and complete overlapping views of the tracking surface so that each object to be located appears in at least two views at all times. The resulting overhead tracking system preferably consists of 175 or more cameras. At 630×630 pixels per image and three bytes per pixel for encoded color information, amounting to roughly 1 MB per frame, the resulting data stream from a single camera is in the range of 30 MB to 60 MB per second. For 175 cameras this stream quickly grows to approximately 12.5 GB per second for a 60 frames per second system. Current PCs can accept around 1 GB per second of data that they may or may not be able to process in real-time.
In any particular sporting event, and especially in ice hockey, the majority of the playing surface will be empty of participants and game objects at any given time, especially when viewed from overhead. For ice hockey, any single player is estimated to take up approximately five square feet of viewing space. If there are on average twenty players per team and three game officials, then the entire team could fit into 5 sq. ft. × 23 players = 115 sq. ft. for all players. A single camera in the present specification is expected to cover 18 ft. by 18 ft. for a total of 324 sq. ft. Hence, all of the players on both teams as well as the game officials could fit into the equivalent of a single camera view, and therefore generate only 30 MB to 60 MB per second of bandwidth. This is a reduction of over 200 times from the maximum data stream and would enable a conventional PC to process the oncoming stream.
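The bandwidth figures in the two preceding paragraphs can be checked with a few lines of arithmetic. The sketch below uses only the numbers stated in the text (630×630 pixels, three bytes per pixel, 175 cameras, 60 frames per second, five square feet per participant, an 18 ft. by 18 ft. camera view); the function names are hypothetical, and decimal megabytes and gigabytes are assumed.

```python
def frame_bytes(width_px=630, height_px=630, bytes_per_pixel=3):
    """Uncompressed size of one color frame in bytes."""
    return width_px * height_px * bytes_per_pixel


def aggregate_rate_bytes(cameras=175, fps=60):
    """Raw bytes per second produced by the whole overhead grid."""
    return cameras * fps * frame_bytes()


def occupied_area_sq_ft(participants=23, sq_ft_each=5):
    """Overhead area taken up by the participants counted in the text."""
    return participants * sq_ft_each


CAMERA_VIEW_SQ_FT = 18 * 18          # footprint of one overhead camera
SINGLE_VIEW_RATE = 60 * 1_000_000    # the text's rounded 60 MB per second

if __name__ == "__main__":
    total = aggregate_rate_bytes()
    print(f"frame size: {frame_bytes() / 1e6:.2f} MB")
    print(f"grid total: {total / 1e9:.1f} GB per second")
    print(f"occupied:   {occupied_area_sq_ft()} of {CAMERA_VIEW_SQ_FT} sq. ft.")
    print(f"reduction:  about {total / SINGLE_VIEW_RATE:.0f} times")
```

The reduction factor printed here divides the full-grid rate by the rounded 60 MB per second that the text allots to a single camera view, which is one reasonable reading of how the figure of over 200 times is reached.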
What is needed is a system capable of extracting the moving foreground objects, such as participants and game objects, in real-time, creating a minimized video image dataset. This minimized dataset is then more easily analyzed in real-time, allowing the creation of digital metrics that symbolically encode participant locations, orientations, shapes and identities. Furthermore, this same minimized dataset of extracted foreground objects may also be reassembled into a complete view of the entire surface as if taken by a single camera. The present invention teaches a processing hierarchy including a first bank of overhead camera assemblies feeding full frame data into a second level of intelligent hubs that extract foreground objects and create corresponding symbolic representations. This second level of hubs then passes the extracted foreground video and symbolic streams into a third level of multiplexing hubs that join the incoming data into two separate streams to be passed off to a video compression system and a tracking analysis system, respectively.
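As a rough illustration of what a second-level hub might do, the following sketch extracts foreground pixels by comparing a frame against a stored empty-surface image and then reduces them to a small symbolic record. This is a simplified stand-in and not the method taught in the specification: grayscale frames held as lists of lists, the fixed threshold, and the names `extract_foreground` and `symbolic_summary` are all assumptions made for the example.

```python
def extract_foreground(frame, background, threshold=20):
    """Return [(row, col, value)] for pixels that differ from the background."""
    pixels = []
    for r, (frame_row, bg_row) in enumerate(zip(frame, background)):
        for c, (value, bg_value) in enumerate(zip(frame_row, bg_row)):
            if abs(value - bg_value) > threshold:
                pixels.append((r, c, value))
    return pixels


def symbolic_summary(pixels):
    """Reduce extracted pixels to a centroid and bounding box, or None."""
    if not pixels:
        return None
    rows = [r for r, _, _ in pixels]
    cols = [c for _, c, _ in pixels]
    return {
        "centroid": (sum(rows) / len(rows), sum(cols) / len(cols)),
        "bbox": (min(rows), min(cols), max(rows), max(cols)),
        "pixel_count": len(pixels),
    }


if __name__ == "__main__":
    # Hypothetical 5x5 view: bright ice (200) with a dark 2x2 object (40).
    background = [[200] * 5 for _ in range(5)]
    frame = [row[:] for row in background]
    for r in (1, 2):
        for c in (2, 3):
            frame[r][c] = 40
    extracted = extract_foreground(frame, background)
    print(len(extracted), "of 25 pixels kept")
    print(symbolic_summary(extracted))
```

Only the extracted pixels and the short summary would need to travel up to the multiplexing level, which is the source of the data reduction discussed above.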
What is the Correct Configuration of Overhead Filming Cameras Necessary to Accurately Locate Participants and Game Objects in Three Dimensions without Significant Image Distortion?
The approach of filming a sporting event from a fixed overhead view has been the starting point for other companies, researchers and patent applications. One such research team is the Machine Vision Group (MVG) based out of the Electrical Engineering Department of the University of Ljubljana, in Slovenia. Their approach, implemented on a handball court, uses two overhead cameras with wide angle lenses to capture a roughly one hour match at 25 frames per second. The processing and resulting analysis is done post-event with the help of an operator, “who supervises the tracking process.” By using only two cameras, both the final processing time and the operator assistance are minimized. However, this saving in total acquired image data necessitated the use of wide angle lenses so that each single camera could cover the larger area of a half court. Furthermore, significant computer processing time is expended to correct for the known distortion created by the use of wide angle lenses. This eventuality hinders the possibility for real-time analysis. Without real-time analysis, the overhead tracking system cannot drive one or more perspective filming cameras in order to follow the game action. What is needed is a layout of cameras that avoids any lens distortion that would require image analysis to correct. The present invention teaches the use of a grid of cameras, each with a smaller field-of-view and therefore no required wide-angle lens. However, as previously mentioned, the significantly larger number of simultaneous video streams quickly exceeds existing computer processing limits and therefore requires novel solutions as herein disclosed. The system proposed by the MVG also appears to be mainly focused on tracking the movements of all the participants. It does not have the additional goal of creating a viable overhead-view video of the contest that can be watched similar to any traditional perspective-view game video.
Hence, while computer processing can correct for the severe distortion caused by the camera arrangement choices, the resulting video images are not equivalent to those familiar to the average sports broadcast viewer. What is needed is an arrangement of cameras that can provide minimally distorted images that can be combined to create an acceptable overhead video. The present invention teaches an overlapping arrangement of two to three grids of cameras where each grid forms a single complete view of the tracking surface. Also taught is the ideal proximity of adjacent cameras in a single grid, based upon factors such as the maximum player's height and the expected viewing area comprised by a realistic contiguous grouping of players. The present specification teaches the need to have significant overlap in adjacent camera views as opposed to no appreciable overlap such as with the MVG system.
Furthermore, because of the limited resolution of each single camera in the MVG system, the resulting pixels per inch of tracking area is insufficient to adequately detect foreground objects the size of a handball or identification markings affixed to the player such as a helmet sticker. What is needed is a layout of cameras that can form a complete view of the entire tracking surface with enough resolution to sufficiently detect the smallest anticipated foreground object, such as the handball or a puck in ice hockey. The present invention teaches just such an arrangement that in combination with the smaller fields of view per individual camera and the overlapping of adjacent fields-of-view, in total provides an overall resolution sufficient for the detection of all expected foreground objects.
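The resolution argument can be made concrete with a short calculation. The sketch below assumes the 630-pixel image width and 18 ft. camera view given elsewhere in this specification apply to the same camera, and it uses the three inch diameter of a regulation ice hockey puck; the function names are hypothetical, and the effects of object height and lens geometry are ignored.

```python
def pixels_per_inch(pixels_across, view_width_ft):
    """Linear resolution on the tracking surface for one overhead camera."""
    return pixels_across / (view_width_ft * 12.0)


def pixels_spanned(object_size_in, pixels_across, view_width_ft):
    """Approximate number of pixels across an object of the given size."""
    return object_size_in * pixels_per_inch(pixels_across, view_width_ft)


if __name__ == "__main__":
    print(f"{pixels_per_inch(630, 18):.2f} pixels per inch")
    print(f"puck spans about {pixels_spanned(3.0, 630, 18):.2f} pixels")
    # Doubling the width of the view halves the pixels on the same object.
    print(f"at 36 ft.: {pixels_spanned(3.0, 630, 36):.2f} pixels")
```

The last line shows why widening each camera's coverage, as a two-camera layout must, works against detecting small objects.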
Similar to the system proposed by MVG, Larson et al. taught a camera based tracking system in U.S. Pat. No. 5,363,297 entitled “Automated Camera-Based Tracking System for Sports Contests.” Larson also proposed a two camera system but in his case one camera was situated directly above the playing surface while the other was on a perspective view. It was also anticipated that an operator would be necessary to assist the image analysis processor, as with the MVG solution. Larson further anticipated using beacons to help track and identify participants so as to minimize the need for the separate operator.
How can Perspective Filming Cameras be Controlled so that as they Pan, Tilt and Zoom their Collected Video can be Efficiently Processed to Extract the Moving Foreground from the Fixed and Moving Background and to Support the Insertion of Graphic Overlays?
As with the overhead cameras, the extraction of moving foreground objects is of significant benefit to image compression of the perspective film. For instance, a single perspective filming camera in color at VGA resolutions would fill up approximately 90% of a single side of a typical DVD. Furthermore, this same data stream would take up to 0.7 MB per second to transmit over the Internet, far exceeding current cable modem capacities. Therefore, the ability to separate the participants moving about in the foreground from the playing venue forming the background is of critical importance for any broadcast intended especially to be presented over the Internet and/or to include multiple simultaneous viewing angles. However, this is a non-trivial problem when considering that the perspective cameras are themselves moving, thus creating the effect that even the fixed aspects of the background are moving, in addition to the moving background and foreground.
As previously mentioned, the present inventors prefer the use of automated perspective filming cameras whose pan and tilt angles as well as zoom depths are automatically controlled based upon information derived in real-time from the overhead tracking system. There are other systems, such as that specified in the Honey patent, that employ controlled pan/tilt and zoom filming cameras to automatically follow the game action. However, the present inventors teach the additional step of limiting individual frame captures to only occur at a restricted set of allowed camera angles and zoom depths. For each of these allowed angles and depths, a background image will be pre-captured while no foreground objects are present; for example, at some time when the facility is essentially empty. These pre-captured background images are then stored for later recall and comparison during the actual game filming. As the game is being filmed by each perspective camera, the overhead system will continue to restrict images to the allowed, pre-determined angles and depths. For each current image captured, the system will look up the appropriate stored background image matching the current pan/tilt and zoom settings. This pre-stored, matched background is then subtracted from the current image, thereby efficiently revealing any foreground objects, regardless of whether or not they are moving. In effect, it is as if the perspective cameras were stationary, similar to the overhead cameras.
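By way of illustration only, the look-up and subtraction just described might be sketched as follows. The step sizes, the grayscale image representation, the difference threshold and the names `pose_key` and `extract_foreground` are assumptions made for this sketch and are not taken from the present specification.

```python
from typing import Dict, List, Tuple

Image = List[List[int]]       # grayscale rows of pixel intensities, 0-255
Pose = Tuple[int, int, int]   # (pan step, tilt step, zoom step) indices

# Hypothetical step sizes defining the restricted set of allowed settings.
PAN_STEP_DEG = 0.5
TILT_STEP_DEG = 0.5
ZOOM_STEP = 0.25


def pose_key(pan_deg: float, tilt_deg: float, zoom: float) -> Pose:
    """Snap a camera setting to the nearest allowed pan/tilt/zoom step."""
    return (round(pan_deg / PAN_STEP_DEG),
            round(tilt_deg / TILT_STEP_DEG),
            round(zoom / ZOOM_STEP))


def extract_foreground(current: Image, background: Image,
                       threshold: int = 20) -> Image:
    """Keep pixels that differ from the stored background; zero the rest."""
    return [[cur if abs(cur - bg) > threshold else 0
             for cur, bg in zip(cur_row, bg_row)]
            for cur_row, bg_row in zip(current, background)]


# One pre-captured background per allowed setting, taken in the empty venue.
backgrounds: Dict[Pose, Image] = {
    pose_key(10.0, 5.0, 1.0): [[50, 50, 50], [52, 52, 52]],
}

# A frame captured during the game at the same allowed setting.
frame: Image = [[50, 200, 51], [52, 53, 180]]
foreground = extract_foreground(frame, backgrounds[pose_key(10.0, 5.0, 1.0)])
print(foreground)
```

Because captures only occur at allowed settings, the look-up is a direct dictionary access rather than a search or an image registration step. A practical implementation would also need to distinguish a genuinely black foreground pixel from a dropped one, which this sketch does not attempt.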
While typical videoing cameras maintain their constant NTSC broadcast rate of 29.97 frames per second, or some multiple thereof, the perspective cameras in the present invention will not follow this standardized rate. In fact, under certain circumstances they will not have consistent, fixed intervals between images such as 1/30th of a second. The actual capture rate is dependent upon the speed of pan, tilt and zoom motions in conjunction with the allowed imaging angles and depths. Hence, the present inventors teach the use of an automatically controlled videoing camera that captures images at an asynchronous rate. In practice, these cameras are designed to maintain an average number of images in the equivalent range such as 30, 60 or 90 frames per second. After capturing at an asynchronous rate, these same images are then synchronized to the desired output standard, such as NTSC. The resulting minimal time variations between frames are anticipated to be imperceptible to the viewer. The present inventors also prefer synchronizing these same cameras to the power lines driving the venue lighting, thereby supporting higher speed image captures. These higher speed captures will result in crisper images, especially during slow or freeze action, and will also support better image analysis.
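As one possible illustration of synchronizing asynchronously captured images to a fixed output rate, the sketch below assigns to each output time slot the captured image nearest in time. The nearest-capture rule, the function name `synchronize` and the sample time stamps are assumptions made for this sketch; the present specification does not prescribe a particular synchronization rule.

```python
from bisect import bisect_left
from typing import List

NTSC_FPS = 30000 / 1001  # approximately 29.97 frames per second


def synchronize(capture_times: List[float], n_output_frames: int,
                fps: float = NTSC_FPS) -> List[int]:
    """For each fixed-rate output slot, return the index of the nearest capture.

    capture_times must be non-empty and sorted ascending, in seconds
    measured from the start of filming.
    """
    if not capture_times:
        raise ValueError("at least one captured image is required")
    chosen = []
    for k in range(n_output_frames):
        t = k / fps
        i = bisect_left(capture_times, t)
        # Compare the captures on either side of the output time stamp.
        neighbors = [j for j in (i - 1, i) if 0 <= j < len(capture_times)]
        chosen.append(min(neighbors, key=lambda j: abs(capture_times[j] - t)))
    return chosen


# Hypothetical capture times with uneven spacing between images.
captures = [0.000, 0.031, 0.070, 0.098, 0.135]
indices = synchronize(captures, 4)
print(indices)
```

The timing error of any output frame is bounded by half of the largest gap between neighboring captures, which is why maintaining an average capture rate at or above the output rate keeps the variation small.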
The present inventors also teach a method for storing the pre-captured backgrounds from the restricted camera angles and zoom depths as a single panoramic. At any given moment, the current camera pan and tilt angles as well as zoom depth can be used to index into the panoramic dataset in order to create a single-frame background image equivalent to the current view. While the panoramic approach is expected to introduce some distortion issues it has the benefit of greatly reducing the required data storage for the pre-captured backgrounds.
In addition to removing the fixed background from every current image of a perspective camera, there will be times when the current view includes a moving background such as spectators in the surrounding stands. Traditional methods for removing this type of background information include processing- and time-intensive intra- and inter-frame image analysis. The present inventors prefer segmenting each captured image from a perspective camera into one or two types of background regions based upon a pre-measured three-dimensional model of the playing venue and the controlled angles and depth of the current image. Essentially, by knowing where each camera is pointed with respect to the three-dimensional model at any given moment, the system can always determine which particular portion of the playing venue is in view. In some cases, this current view will be pointed wholly onto the playing area of the facility as opposed to some portion of the playing area and surrounding stands. In this case, the background is of the fixed type only and simple subtraction between the pre-stored background and the current image will yield the foreground objects. In the alternate case, where at least some portion of the current view includes a region outside of the playing area, the contiguous pixels of the current image corresponding to this second type of region can be effectively determined via the three-dimensional model. Hence, the system will know which portion of each image taken by a perspective filming camera covers a portion of the venue surrounding the playing area. It is in the surrounding areas that moving background objects, such as spectators, may be found.
The present inventors further teach a method for employing the information collected by the overhead cameras to create a topological three-dimensional profile of any and all participants who may happen to be in the same field-of-view of the current image. This profile will serve to essentially cut out each participant's profile as it overlays the surrounding area that may happen to be in view behind them. Once this topological profile is determined, all pixels residing in the surrounding areas that are determined to not overlap a participant (i.e., they are not directly behind the player) are automatically dropped. This “hardware” assisted method of rejecting pixels that are not either a part of the fixed background or a tracked participant offers considerable efficiency over traditional software methods.
After successfully removing, or segmenting, the image foreground from its fixed and moving backgrounds, the present inventors teach the limited encoding and transmission of just the foreground objects. This reduction in overall information to be transmitted and/or stored yields expected Internet transfer rates of less than 50 KB and full film storage of 0.2 GB, or only 5% of today's DVD capacity. Upon decoding, several options are possible including the reinstatement of the fixed background from a panoramic reconstruction pre-stored on the remote viewing system. It is anticipated that the look of this recombined image will be essentially indistinguishable from the original image. All that will be missing is minor background surface variations that are essentially insignificant and images of the moving background such as the spectators. The present inventors prefer the use of state of the art animation techniques to add a simulated crowd to each individual decoded frame. It is further anticipated that these same animation techniques could be both acceptable and preferable for recreating the fixed background as opposed to using the pre-transmitted panoramic.
With respect to the audio coinciding with the game film, the present inventors anticipate either transmitting an authentic capture or alternatively sending a synthetic translation of at least the volume and tonal aspects of the ambient crowd noise. This synthetic translation is expected to be of particular value for broadcasts of youth games where there tend to be smaller crowds on hand. Hence, as the game transpires, the participants are extracted from the playing venue and transmitted along with an audio mapping of the spectator responses. On the remote viewing system, the game may then be reconstructed with the original view of the participants overlaid onto a professional arena, filled with spectators whose synthesized cheering is driven by the original spectators.
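A minimal sketch of such an audio mapping is given below, assuming that the volume aspect is represented by the root-mean-square level of a short window of samples and that the tonal aspect is approximated by a zero-crossing frequency estimate. Both measures, and the name `audio_mapping`, are stand-ins chosen for this sketch; the present specification does not name a particular measure.

```python
import math
from typing import List, Tuple


def audio_mapping(samples: List[float], sample_rate: int) -> Tuple[float, float]:
    """Return (rms_volume, rough_frequency_hz) for one window of crowd audio."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    # Count sign changes; a pure tone crosses zero twice per cycle.
    crossings = sum(1 for a, b in zip(samples, samples[1:])
                    if (a < 0) != (b < 0))
    duration = len(samples) / sample_rate
    return rms, crossings / (2 * duration)


# Hypothetical one-second window: a 440 Hz tone of amplitude 0.5 at 8 kHz.
RATE = 8000
window = [0.5 * math.sin(2 * math.pi * 440 * n / RATE + 0.1)
          for n in range(RATE)]
volume, tone = audio_mapping(window, RATE)
print(round(volume, 3), round(tone, 1))
```

Only the two numbers per window would need to be transmitted, and the remote viewing system could use them to scale and pitch its synthesized crowd. Real crowd noise is broadband, so a practical tonal measure would likely be spectral rather than based on zero crossings.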
With respect to the recreation of the playing venue background on the remote viewing system, both the “real-image” and “graphically-rendered” approaches have the additional advantage of being able to easily overlay advertisements. Essentially, after recreating the background using either actual pre-stored images of the venue or graphic animations, advertisements can be placed in accordance with the pre-known three-dimensional map and transmitted current camera angle being displayed. After this, the transmitted foreground objects are overlaid forming a complete reconstruction. There are several other inventors who have addressed the need for overlaying advertisements onto sports broadcasts. For instance, there are several patents assigned to Orad Hi-Tech Systems, LTD including U.S. Pat. Nos. 5,903,317, 6,191,825 B1, 6,208,386 B1, 6,292,227 B1, 6,297,853 B1 and 6,384,871 B1. They are directed towards “apparatus for automatic electronic replacement of a billboard in a video image.” The general approach taught in these patents limits the inserted advertisements to those areas of the image determined to already contain existing advertising. Furthermore, these systems are designed to embed these replacement advertisements in the locally encoded broadcast that is then transmitted to the remote viewer. This method naturally requires transmission bandwidth for the additional advertisements now forming a portion of the background (which the present inventors do not transmit.)
The present inventors prefer to insert these advertisements post transmission on the remote viewing device as a part of the decoding process. Advertisements can be placed anywhere either in the restored life-like or graphically animated background. If it is necessary to place a specific ad directly on top of an existing ad in the restored life-like image, the present inventors prefer a calibrated three-dimensional venue model that describes the player area and all important objects, hence the location and dimensions of billboards. This calibrated three-dimensional model is synchronized to the same local coordinate system used for the overhead and perspective filming cameras. As such, the camera angle and zoom depth transmitted with each sub-frame of foreground information not only indicates which portion of the background must be reconstructed according to the three-dimensional map, but also indicates whether or not a particular billboard is in view and should be overlaid with a different ad.
Other teachings exist for inserting static or dynamic images into a live video broadcast which covers a portion of the purposes of the present Automated Sports Broadcasting System. For instance, in U.S. Pat. No. 6,100,925 assigned to Princeton Video Image, Inc., Rosser et al. disclose a method that relies upon a plurality of pre-known landmarks within a given venue that have been calibrated to a local coordinate system in which the current view of a filming camera can be sensed and calculated. Hence, as the broadcast camera freely pans, tilts and zooms to film a game, its current orientation and zoom depth are measured and translated via the local coordinate system into an estimate of its field-of-view. By referring to the database of pre-known landmarks, the system is able to predict when and where any given landmark should appear in any given field-of-view. Next, the system employs pattern matching between the pixels in the current image anticipated to represent a landmark and the pre-known shape, color and texture of the landmark. Once the matching of one or more landmarks is confirmed, the system is then able to insert the desired static or dynamic images. In an alternative embodiment, Rosser suggests using transmitters embedded in the game object in order to triangulate position, in essence creating a moving landmark. This transmitter approach for tracking the game object is substantially similar to at least that of Trakus and Honey.
Like the Orad patents for inserting advertisements, the teachings of Rosser differ from the present invention since the inserted images are added to the encoded broadcast prior to transmission, therefore taking up needed bandwidth. Furthermore, like the Trakus and Honey solutions for beacon based object tracking, Rosser's teachings are not sufficient for tracking the location and orientation of multiple participants. At least these, as well as other drawbacks, prohibit the Rosser patent from use as an automatic broadcasting system as defined by the present inventors.
With the similar purpose of inserting a graphic into live video, in U.S. Pat. No. 6,597,406 B2 assigned to Sportvision, Inc., inventor Gloudeman teaches a system for combining a three-dimensional model of the venue with the detected camera angle and zoom depth. An operator could then interact with the three-dimensional model to select a given location for the graphic to be inserted. Using the sensed camera pan and tilt angles as well as zoom depth, the system would then transform the selected three-dimensional location into a two-dimensional position in each current video frame from the camera. Using this two-dimensional position, the desired graphic is then overlaid onto the stream of video images. As with other teachings, Gloudeman's solution inserts the graphic onto the video frame prior to encoding; again taking up transmission bandwidth. The present inventors teach a method for sending this insertion location information along with the extracted foreground and current camera angles and depths associated with each frame or sub-frame. The remote viewing system then decodes these various components with pre-knowledge of both the three-dimensional model as well as the background image of the venue. During this decode step, the background is first reconstructed from a saved background image database or panorama, after which advertisements and/or graphics are either placed onto pre-determined locations or inserted based upon some operator input. And finally, the foreground is overlaid creating a completed image for viewing. Note that the present inventors anticipate that the information derived from participant and game object tracking will be sufficient to indicate where graphics should be inserted thereby eliminating the need for operator input as specified by Gloudeman.
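The transformation of a three-dimensional venue location into a two-dimensional frame position from the camera's pan, tilt and zoom can be illustrated with an idealized pinhole model, as sketched below. The axis conventions, the base focal length in pixels, the image center and the name `project` are assumptions made for this sketch; lens distortion is ignored, and neither Gloudeman's transformation nor that of the present inventors is reproduced here.

```python
import math
from typing import Optional, Sequence, Tuple


def _dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def project(point: Sequence[float], camera: Sequence[float],
            pan_deg: float, tilt_deg: float, zoom: float,
            base_focal_px: float = 1000.0,
            center: Tuple[float, float] = (320.0, 240.0)
            ) -> Optional[Tuple[float, float]]:
    """Map a venue location to (column, row) pixels, or None if behind the camera.

    Assumed axes: x east, y north, z up. At pan = tilt = 0 the camera looks
    along +y; positive pan turns toward +x and positive tilt looks upward.
    """
    pan, tilt = math.radians(pan_deg), math.radians(tilt_deg)
    forward = (math.sin(pan) * math.cos(tilt),
               math.cos(pan) * math.cos(tilt),
               math.sin(tilt))
    right = (math.cos(pan), -math.sin(pan), 0.0)
    up = (-math.sin(pan) * math.sin(tilt),
          -math.cos(pan) * math.sin(tilt),
          math.cos(tilt))

    offset = [p - c for p, c in zip(point, camera)]
    depth = _dot(offset, forward)
    if depth <= 0:
        return None  # the location is behind the image plane
    focal = base_focal_px * zoom  # zoom modeled as a focal length multiplier
    col = center[0] + focal * _dot(offset, right) / depth
    row = center[1] - focal * _dot(offset, up) / depth  # image rows grow downward
    return col, row


# A hypothetical billboard corner 20 m in front of a camera mounted 10 m high.
print(project((1.0, 20.0, 10.0), (0.0, 0.0, 10.0), 0.0, 0.0, 1.0))
```

Because the remote viewing system already holds the three-dimensional model, transmitting only the camera angles and zoom depth with each frame is enough for it to repeat this calculation and place the graphic during decoding.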
How can a System Track and Identify Players Without Using any Special Markings?
The governing bodies of many sports throughout the world, especially at the amateur levels, do not allow any foreign objects, such as electronic beacons, to be placed upon the participants. What is needed is a system that is capable of identifying participants without the use of specially affixed markings or attached beacons. The present inventors are not aware of any systems that are currently able to identify participants using the same visual markings that are available to human spectators, such as a jersey team logo, player number and name. The present application builds upon the prior applications included by reference to show how the location and orientation information determined by the overhead cameras can be used to automatically control perspective view cameras so as to capture images of the visual markings. Once captured, these markings are then compared to a pre-known database thereby allowing for identification via pattern matching. This method will allow for the use of the present invention in sports where participants do not wear full equipment with headgear such as basketball and soccer.
How Can a Single Camera be Constructed to Create Simultaneous Images in the Visible and Non-Visible Spectrums to Facilitate the Extraction of the Foreground Objects Followed by the Efficient Locating of any Non-Visible Markings?
As was first taught in prior applications of the present inventors, it is possible to place marks in the form of coatings onto surfaces such as a player's uniform or game equipment. These coatings can be specially formulated to substantially transmit electromagnetic energy in the visible spectrum from 380 nm to 770 nm while simultaneously reflecting or absorbing energies outside of this range. By transmitting the visible spectrum, these coatings are in effect “not visually apparent” to the human eye. However, by either absorbing or reflecting the non-visible spectrum, such as ultraviolet or infrared, these coatings can become detectable to a machine vision system that operates outside of the visible spectrum. Among other possibilities, the present inventors have anticipated placing these “non-apparent” markings on key spots of a player's uniform such as their shoulders, elbows, waist, knees, ankles, etc. Currently, machine vision systems do exist to detect the continuous movement of body joint markers at least in the infrared spectrum. Two such manufacturers known to the present inventors are Motion Analysis Corporation and Vicon. However, in both companies' systems, the detecting cameras have been filtered to only pass the infrared signal. Hence, the reflected energy from the visible spectrum is considered noise and eliminated before it can reach the camera sensor.
The present inventors prefer a different approach that places what is known as a “hot mirror” in front of the camera lens that acts to reflect the infrared wavelengths above 700 nm off at a 45° angle. The reflected infrared energy is then picked up by a second imaging sensor responsive to the near-infrared wavelengths. The remaining wavelengths below 700 nm pass directly through the “hot mirror” to the first imaging sensor. Such an apparatus would allow the visible images to be captured as game video while simultaneously creating an exactly overlapping stream of infrared images. This non-visible spectrum information can then be separately processed to pinpoint the location of marked body joints in the overlapped visible image. Ultimately, this method is an important tool for creating a three-dimensional kinetic model of each participant. The present inventors anticipate optionally including these motion models in the automated broadcast. This kinetic model dataset will require significantly less bandwidth than the video streams and can be used on the remote system to drive an interactive, three-dimensional graphic animation of the real-life action.
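Since the infrared image exactly overlaps the visible image, a marking located in the former shares its pixel coordinates with the latter. The sketch below locates markings as the centroids of bright, four-connected regions in an infrared frame. The brightness threshold, the connectivity rule and the name `marker_centroids` are assumptions made for this sketch rather than the processing taught in the prior applications.

```python
from typing import List, Tuple


def marker_centroids(ir: List[List[int]],
                     threshold: int = 200) -> List[Tuple[float, float]]:
    """Return (row, col) centroids of 4-connected bright regions in an IR frame."""
    rows, cols = len(ir), len(ir[0])
    seen = [[False] * cols for _ in range(rows)]
    centroids = []
    for r0 in range(rows):
        for c0 in range(cols):
            if seen[r0][c0] or ir[r0][c0] < threshold:
                continue
            # Flood-fill one bright region starting from this pixel.
            stack, pixels = [(r0, c0)], []
            seen[r0][c0] = True
            while stack:
                r, c = stack.pop()
                pixels.append((r, c))
                for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                    if (0 <= nr < rows and 0 <= nc < cols
                            and not seen[nr][nc] and ir[nr][nc] >= threshold):
                        seen[nr][nc] = True
                        stack.append((nr, nc))
            centroids.append((sum(p[0] for p in pixels) / len(pixels),
                              sum(p[1] for p in pixels) / len(pixels)))
    return centroids


# Hypothetical infrared frame containing two reflective joint markings.
ir_frame = [
    [0, 0, 0, 0, 0],
    [0, 255, 255, 0, 0],
    [0, 255, 255, 0, 0],
    [0, 0, 0, 0, 250],
]
print(marker_centroids(ir_frame))
```

The returned coordinates can be used directly as joint positions in the visible frame without any registration step, which is the practical benefit of sharing a single lens between the two sensors.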
How Can Spectators be Tracked and Filmed, and the Playing Venue be Audio Recorded in a Way that Allows this Additional Non-Participant Video and Audio to be Meaningfully Blended into the Game Broadcast?
For many sports, especially at the youth levels where the spectators are mostly parents and friends, the story of a sporting event can be enhanced by recording what is happening around and in support of the game. As mentioned previously, creating a game broadcast is an expensive endeavor that is typically reserved for professional and elite level competition. However, the present inventors anticipate that a relatively low cost automated broadcast system that delivered its content over the Internet could open up the youth sports market. Given the fact that most youth sports are attended by the parents and guardians of the participants, the spectator base for a youth contest represents a potential source of interesting video and audio content. Currently, no system exists that can automatically associate the parent with the participant and subsequently track the parent's location throughout the contest. This tracking information can then be used to optionally video any given parent(s) as the game tracking system becomes aware that their child/participant is currently involved in a significant event.
Several companies have either developed or are working on radio frequency (RF) and ultra-wide band (UWB) wearable tag tracking systems. These RF and UWB tags are self-powered and uniquely encoded and can, for instance, be worn around an individual spectator's neck. As the fan moves about in the stands or area surrounding the game surface, a separate tracking system will direct one or more automatic pan/tilt/zoom filming cameras towards anyone, at any time. The present inventors envision a system where each parent receives a uniquely encoded tag to be worn during the game, allowing images of them to be captured during plays their child is determined to be involved with. This approach could also be used to track coaches or VIPs and is subject to many of the same novel apparatus and methods taught herein for filming the participants.
How Can the Official Indications of Game Clock Start and Stop Times be Detected to Allow for the Automatic Control of the Scoreboard and for Time Stamping of the Filming and Tracking Databases?
The present invention for automatic sports broadcasting is discussed primarily in relation to the sport of ice hockey. In this sport, as in many, the time clock is essentially controlled by the referees. When the puck is dropped on a face-off, the official game clock is started and whenever a whistle is blown or a period ends, the clock is stopped. Traditionally, especially at the youth level, a scorekeeper is present monitoring the game to watch for puck drops and listen for whistles. In most of the youth rinks this scorekeeper is working a console that controls the official scoreboard and clock. The present inventors anticipate interfacing this game clock to the tracking system such that at a minimum, as the operator starts and stops the time, the tracking system receives appropriate signals. This interface also allows the tracking system to confirm official scoring such as shots, goals and penalties. It is further anticipated that this interface will also accept player numbers indicating official scoring on each goal and penalty. The present inventors are aware of at least one patent proposing an automatic interface between a referee's whistle and the game scoreboard. In U.S. Pat. No. 5,293,354, Costabile teaches a system that is essentially tuned to the frequency of the properly blown whistle. This “remotely actuatable sports timing system” includes a device worn by a referee that is capable of detecting the whistle's sound waves and responding by sending off its own RF signal to start/stop the official clock. At least four drawbacks exist to Costabile's solution. First, the referee is required to wear a device which, upon falling, could cause serious injury to the referee. Second, while this device can pick up the whistle sound, it is unable to distinguish which of up to three possible referees actually blew the whistle. Third, if the airflow through the whistle is not adequate to create the target detection frequencies, then Costabile's receiver may “miss” the clock stoppage. And finally, it does not include a method for detecting when a puck is dropped, which is how the clock is started for ice hockey.
The present inventors prefer an alternate solution to Costabile that includes a miniaturized air-flow detector in each referee's whistle. Once air-flow is detected, for instance as it flows across an internal pinwheel, a unique signal is generated and automatically transmitted to the scoreboard interface, thereby stopping the clock. Hence, the stoppage is attributed to only one whistle and therefore to one referee. Furthermore, the system is built into the whistle and carries no additional danger of harm to the referee upon falling. In tandem with the air-flow detecting whistle, the present inventors prefer using a pressure sensitive band worn around two to three fingers of the referee's hand. Once a puck is picked up by the referee and held in his palm, the pressure sensor detects the presence of the puck and lights up a small LED for verification. After the referee sees the lit LED, he is ready and ultimately drops the puck. The pressure on the band is released and a signal is sent to the scoreboard interface, starting the official clock.
By automatically detecting clock start and stop times as well as picking up official game scoring through a scoreboard interface, the present invention uses this information to help index the captured game film.
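The handling of these whistle and finger-band signals at the scoreboard interface can be pictured as a small state machine, sketched below for illustration only. The class name `GameClock`, the method names, the referee identifiers and the form of the event log are hypothetical and are not taken from the present specification.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class GameClock:
    """Starts on a puck release, stops on a whistle, and logs each change."""
    running: bool = False
    elapsed: float = 0.0
    started_at: float = 0.0
    log: List[Tuple[float, str, str]] = field(default_factory=list)

    def puck_released(self, t: float, referee_id: str) -> None:
        """Signal from the pressure band: the puck has left the referee's hand."""
        if not self.running:
            self.running = True
            self.started_at = t
            self.log.append((t, "start", referee_id))

    def whistle_blown(self, t: float, referee_id: str) -> None:
        """Signal from the air-flow detecting whistle of one referee."""
        if self.running:
            self.elapsed += t - self.started_at
            self.running = False
            self.log.append((t, "stop", referee_id))


clock = GameClock()
clock.puck_released(100.0, "ref-1")
clock.whistle_blown(112.5, "ref-2")
clock.whistle_blown(113.0, "ref-1")   # ignored: the clock is already stopped
clock.puck_released(140.0, "ref-2")
clock.whistle_blown(150.0, "ref-1")
print(clock.elapsed, clock.log)
```

Each log entry carries a time stamp and the identity of the signaling referee, so the same entries that drive the scoreboard can serve as index points into the captured game film.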
How can Tracking Data Determined by Video Image Analysis be Used to Create Meaningful Statistics and Performance Metrics that can be Compared to Subjective Observation thereby Providing for Positive Feed-Back to Influence the Entire Process?
In sports, and especially in ice hockey, many of the player movements are too fast and too numerous to quantify by human-based observation. In practice, game observers will look to quantify a small number of well-defined, easily observed events such as “shots” or “hits.” Beyond this, many experienced observers will also make qualitative assessments concerning player and team positioning, game speed and intensity, etc. This latter set of observations comes without verifiable measurement. At least the Trakus and Orad systems have anticipated the benefit of a stream of verifiable, digitally encoded measurements. This stream of digital performance metrics is expected to provide the basis for summarization into a newer class of meaningful statistics. However, not only are there significant drawbacks to the apparatus and methods proposed by Trakus and Orad for collecting these digital metrics, there is at least one key measurement that is missing. Specifically, the present inventors teach the collection of participant orientation in addition to location and identity. Furthermore, the present invention is the only system to teach a method applicable to live sports for collecting continuous body joint location tracking above and beyond participant location tracking.
This continuous accumulation of location and orientation data recorded by participant identity thirty times or more per second yields a significant database for quantifying and qualifying the sporting event. The present inventors anticipate submitting a continuation of the present invention teaching various methods and steps for translating these low level measurements into meaningful higher level game statistics and qualitative assessments. While the majority of these teachings will not be addressed in the present application, what is covered is the method for creating a feed-back loop between a fully automated “objective” game assessment system and a human based “subjective” system. Specifically, the present inventors teach a method of creating “higher level” or “judgment-based” assessments that can be common to both traditional “subjective” methods and newer “objective” based methods. Hence, after viewing a game, both the coaching staff and the tracking system rate several key aspects of team and individual play. Theoretically, both sets of assessments should be relatively similar. The present inventors prefer capturing the coaches' “subjective” assessments and using them as feed-back to automatically adjust the weighting formulas used to drive the underlying “objective” assessment formulas.
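One simple way to picture this feed-back loop is to treat the “objective” assessment as a weighted sum of game metrics and to nudge each weight in proportion to the gap between the coaches' rating and the system's rating. The weighted-sum form, the update rule, the learning rate and the function names below are assumptions made for this sketch; the present specification does not define the weighting formulas.

```python
from typing import List


def objective_score(weights: List[float], metrics: List[float]) -> float:
    """The system's rating: a weighted sum of normalized game metrics."""
    return sum(w * m for w, m in zip(weights, metrics))


def adjust_weights(weights: List[float], metrics: List[float],
                   coach_rating: float,
                   learning_rate: float = 0.1) -> List[float]:
    """Move each weight so the system's rating approaches the coaches' rating."""
    error = coach_rating - objective_score(weights, metrics)
    return [w + learning_rate * error * m for w, m in zip(weights, metrics)]


# Hypothetical normalized metrics for one aspect of play and a coach's rating.
metrics = [1.0, 0.5]
weights = [0.5, 0.5]
coach_rating = 1.0

before = objective_score(weights, metrics)
weights = adjust_weights(weights, metrics, coach_rating)
after = objective_score(weights, metrics)
print(before, after)
```

Repeating the adjustment over many games and many rated aspects would gradually bring the two sets of assessments into agreement, while a small learning rate keeps any single subjective rating from dominating the formulas.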
Most of the above listed references are addressing tasks or portions of tasks that support or help to automate the traditional approach to creating a sports broadcast. Some of the references suggest solutions for gathering new types of performance measurements based upon automatic detection of player and/or game object movements. What is needed is an automatic integrated system combining solutions to the tasks of:
    tracking official game start/stop times, calls and scoring;
    automatically tracking participant and game object movement using a multiplicity of substantially overhead viewing cameras;
    automatically assembling a single composite overhead view of the game based upon the video images captured by the tracking system;
    collecting video from one or more perspective view cameras that are automatically directed to follow the game action based upon the determined participant and game object movement;
    automatically collecting game audio and creating matched volume and tonal mappings;
    analyzing participant and game object movement to create game statistics and performance measurements forming a stream of game metrics;
    automatically creating performance descriptor tokens based upon the game metrics describing the important game activities;
    dynamically assembling combinations of the video, game metrics, performance tokens and audio information into an encoded broadcast based upon remote viewer directives;
    transmitting the broadcast and receiving back interactive viewer directives;
    decoding the broadcast into a stream of video and audio signals capable of being presented on the viewing device, where
        the background may be chosen by the viewer to match either the original or a different facility, in either “natural” or “animated” formats;
        the overhead game view and a multiplicity of perspective views are available under user direction in either video, gradient “colorized line-art” or symbolic formats;
        standard and custom advertisements are inserted, preferably based upon the known profile of the viewer, as separate video/audio clips or graphic overlays;
        statistics, performance measurements and other game analysis are graphically overlaid onto the generated video;
        audio game commentary is automatically synthesized based upon the performance tokens, and
        crowd noise is automatically synthesized based upon the matched volume and tonal mappings as an alternative to the “natural” recorded game audio.
When taken together, the individual sub-systems for performing these tasks become an Automatic Event Videoing, Tracking and Content Generation System.
Given the current state of the art in CMOS image sensors, Digital Signal Processors (DSPs), Field Programmable Gate Arrays (FPGAs) and other digital electronic components as well as general computing processors, image optics, and software algorithms for performing image segmentation and analysis, it is possible to create a massively parallel, reasonably priced machine vision based sports tracking system. Also, given the additional state of the art in mechanical pan/tilt and electronic zoom devices for use with videoing cameras along with algorithms for encoding and decoding highly segmented and compressed video, it is possible to create a sophisticated automatic filming system controlled by the sports tracking system. Furthermore, given state of the art low cost computing systems, it is possible to break down and analyze the collected player and game object tracking information in real-time, forming a game metrics and descriptor database. When combined with advancements in text-to-speech synthesis, it is then possible to create an Automatic Event Videoing, Tracking and Content Generation System capable of recording, measuring, analyzing, and describing in audio the ensuing sporting event in real-time. Using this combination of apparatus and methods provides opportunities for video compression significantly exceeding current standards, thereby providing opportunities for realistically distributing the resulting sports broadcast over non-traditional mediums such as the Internet.
While the present invention will be specified in reference to one particular example of sports broadcasting, as will be described forthwith, this specification should not be construed as a limitation on the scope of the invention, but rather as an exemplification of the preferred embodiments thereof. The inventors envision many related uses of the apparatus and methods herein disclosed, only some of which will be mentioned in the conclusion to this application's specification. For purposes of teaching the novel aspects of the invention, the example of a sport to be automatically broadcast is that of an ice-hockey game. Accordingly, the underlying objects and advantages of the present invention are to provide sub-systems in support of, and comprising, an Automatic Event Videoing, Tracking and Content Generation System with the following capabilities:    1. tracking official game start/stop times, calls and scoring through:            the use of a referee's whistle capable of transmitting a uniquely encoded identification signal upon the detection of airflow;        the use of a band to be worn over the fingers that is capable of transmitting a uniquely encoded identification signal upon the sensing of pressure when the game object, such as a puck, is either picked up or released, and        the interfacing of the official game scoring data collection device that is typically used to control the scoreboard.            2. 
automatically tracking participant and game object movement using a multiplicity of substantially overhead viewing cameras:            by first detecting and following the participant and game object shapes from a substantially overhead, fixed camera matrix capable of both tracking and filming, and:                    synchronizing these tracking and filming cameras to the power cycles of the venue lighting system in order to ensure maximum, consistent image-to-image lighting;            where the fixed overhead filming cameras first capture an image of the background known to be absent of foreground objects, the background image of which can then be used during game filming to support the real-time extraction of any participants and game objects (collectively referred to as foreground objects) that may be traversing the background so that they may be efficiently analyzed;                            where the fixed overhead cameras stream their data into image extracting hubs whose purpose is at least to perform this extraction of the foreground from the background, also referred to as segmentation, in real-time prior to multiplexing the resulting extracted foreground objects into a single minimal stream to be passed on to an analysis computer;                                    so that the larger stream of video data emanating from the multiplicity of overhead cameras can be reduced in total pixel area to a volume of data capable of being received and processed by a typical computer system;                                                where a multiplicity of image extracting hubs stream their data into multiplexing hubs whose purpose is to join together the incoming streams of extracted foreground objects into a single stream for presentation to another multiplexing hub or an analysis computer;                                    so that the analysis computer is capable of receiving the total multiplicity of streams as a reduced number of streams acceptable into 
its typical number of input paths;                                                                    where the tracking information determined for these foreground objects at least includes the continuous location and orientation of each participant and game object while they are within the field of play;            using markings such as uniquely encoded helmet stickers in order to identify individual participants coincident with the tracking of their shapes;            using non-visible coatings to mark selected body points on each participant and by directing the reflected non-visible frequencies entering the overhead filming cameras to a separate sensor;                            analyzing these coincident non-visible images to identify and track specific body points on each participant, and                                    creating a grid of overhead cameras whose views overlap so as to collectively form a single view of the tracking surface below;                            where the area covered by the overlap between any adjacent cameras is enough to ensure that any foreground object that traverses the junction remains within all views for a minimal distance;                                    where this minimal distance at least includes the size of any player identification marks such as a helmet sticker;                    where this minimal distance preferably includes enough area to keep a single participant in view while standing;                                                                    creating an overhead matrix comprising at least two overhead grids, offset to each other, such that any foreground object is always in view of at least two cameras, one from each of the two grids, at all times;                            so that image analysis of these foreground objects from the two separate views can create three dimensional tracking information;                                    preferably adding a third overhead grid to the overhead 
matrix such that any foreground object remains in the view of at least three cameras, one from each of the three grids, at all times;                            so that more than one camera must malfunction before a foreground object is no longer seen by two cameras, and                so that composite images created of the foreground objects may have minimal distortion by always selecting the one view from any of the three viewing cameras that is the most centered;                                                by using the tracking location and orientation information concerning each participant to automatically direct a plurality of ID filming cameras affixed from a perspective view throughout the venue to controllably capture images of selected participants including identifying portions of their uniforms such as their jersey numbers;                    to use the captured images of a selected participant's uniform, preferably including their jersey number, to compare and pattern match against a pre-known database thereby allowing for participant identification without necessitating the use of an added marking such as a helmet sticker, and                        by using a wireless handheld device to allow coaches to indicate, in real-time, game moments for review, where these moments are stored as time markers and cross indexed to both the indicating coach and the plurality of tracked data and collected film.            3. 
automatically assembling a single composite overhead view of the game based upon the video images captured by the tracking system:            where an automatic video content assembly and compression computer system ultimately sorts and combines the video information of the extracted foreground objects contained in all of the incoming streams being received from one or more multiplexing hubs, themselves receiving from other multiplexing hubs or extraction hubs, themselves receiving from all cameras within all the overhead grids comprising the overhead matrix;                    where any foreground object determined to have been touching one or more edges of its capturing camera's view is first combined with any extracted foreground objects from adjacent cameras within the same overhead grid that are overlapping along one or more equivalent physical pixel locations,                            so that a multiplicity of contiguous foreground objects, from a single overhead grid, are first constructed from the pieces captured by adjacent cameras within that grid;                                    where each constructed or otherwise already contiguous foreground object captured within a single grid is then compared to the foreground objects, determined to be occupying the same physical space, that were captured from the one or preferably two other overhead grids;                            where the result of the comparison is to select the one view of the foreground object that contains the least image distortion;                                    where each minimally distorted contiguous foreground object may comprise one or more participants;                            where these foreground objects may be determined to contain more than one participant by detecting the presence of more than one helmet sticker or other identifying mark, or                where the total pixel mass of the contiguous foreground object is determined to be that reasonably expected 
for a given number of participants greater than one;                                    where contiguous foreground objects determined to comprise more than one participant are then preferably broken into separate smaller foreground objects centered about the best estimated location of each detected participant;                            where the separate smaller objects are thought to contain only a single participant and are indexed at least according to the identity of that participant, and                where it is immaterial that body portions of one participant are included in the separated smaller objects of an adjoining participant, if at least the total video information contained in the forcibly separated smaller objects equals the total video information of the original contiguous (larger) foreground object;                                    so that a single collection of the least distorted views of all participants, broken up and indexed by participant and game objects as best as is possible, is created with minimal delay from real-time for each beat of image capture across all cameras in the overhead matrix;                            where the expected beats of image capture might be every 1/30th, 1/60th or 1/120th of a second and faster;                where the same separate participant or game object images are then sorted into distinct streams within the time (or temporal) domain as each successive beat of the capturing cameras creates an additional single collection of least distorted views, and                where any unidentifiable objects from a single collection form their own distinct temporal stream with any other unidentifiable objects, determined to overlap the same physical location, from the next single collection.                                                    4. 
collecting video from one or more perspective view cameras that are automatically directed to follow the game action based upon the determined participant and game object movement;            by using the tracking location and orientation information concerning each participant and the game object to automatically direct a plurality of game filming cameras affixed from distinct perspective views throughout the venue;                    where the pan/tilt and zoom settings of each perspective filming camera are automatically controlled and the capturing of images is restricted to distinct combinations of these settings rather than a particular fixed time beat such as 1/30th or 1/60th of a second;            where for each possible distinct combination of pan/tilt and zoom settings, an image is first captured when the venue background is known to be absent of foreground objects, the background image of which can then be used during game filming to support the real-time extraction of foreground objects as they traverse the background thereby supporting image compression;                            where the total collection of background images for a given perspective camera, covering all possible distinct combinations of pan/tilt and zoom (P/T/Z) settings, are additionally combined to form a single larger background panoramic;                                    where this panoramic can be queried based upon the current P/T/Z settings of the associated filming cameras in order to extract the equivalent original venue background overlapping the current image;                                                                    where the extracted foreground objects from each current frame of each perspective filming camera are broken into separate streams by participant in a manner similar to that taught for the overhead filming system, based upon tracking information determined by the overhead system;            where a table of pre-known color tones is established for all 
participant skin complexions as well as home and away uniforms, such that each pixel in the extracted foreground images can be encoded to represent one of these color tones less a grayscale overlay thereby increasing image compression;            using non-visible coatings to mark selected body points on each participant and directing the reflected non-visible frequencies entering the perspective filming cameras to a separate sensor;                            analyzing these coincident non-visible images to identify and track specific body points on each participant;                                                by using transponders to track the location and orientation of one or more roving, manually operated filming cameras so as to align its captured film with the determined location and orientation of the participants and game objects, and        by using transponders to track the location of selected spectators and to controllably direct spectator filming cameras based upon the determined game actions of the participants and their relationship to the tracked spectators.            5. automatically collecting game audio and creating matched volume and tonal mappings;            by using audio recorders placed throughout the venue to capture a three-dimensional soundscape of the game that is stored both in traditional audio formats, and        by sampling the traditional audio recording in order to create compressed volume and tonal maps that may be used to drive a synthesized rendering of crowd noise.            6. 
analyzing participant and game object movement to create game statistics and performance measurements forming a stream of game metrics:            where the continuum of tracked locations, orientations and identities of the participants and the game object is interpreted as a series of distinct and overlapping events, where each event is categorized and associated at least by time sequence with the tracking and filming databases;                    where any given overhead or perspective filming camera may be operated at some multiple of the standard motion frame rate, typically 30 fps, in order to capture enough video to support slow and super-slow motion playback, and                            where the criticality of a given event determined to be in view of a given filming camera is used to automatically determine if these extra multiples of video frames should be kept or discarded;                                    by using these interpreted events to automatically accumulate basic game statistics;                        including the capturing of subjective assessments of participant performance, typically from the coaching staff after the game has completed, where these assessments are comparable to those made objectively based upon the automatically interpreted events and statistics, thereby forming a feedback loop provided to both the subjective and objective analysis sources in order to help refine their respective assessment methods.            7. 
automatically creating performance descriptor tokens based upon the game metrics describing the important game activities:            by creating a three-dimensional venue model that calibrates the tracking and filming cameras into a single local coordinate system, from which the interpreted events can be translated in combination with predefined game rules into at least the recording of game scoring and other traditional statistics, and        by using participant and game object movements as calibrated to the playing venue along with the interpreted events, scoring and other statistics to generate a continuous output of descriptive tokens that themselves can be used as input into a text-to-speech synthesis module for the automatic creation of game commentary.            8. dynamically assembling combinations of the video, game metrics, performance tokens and audio information into an encoded broadcast based upon remote viewer directives;            where the assembled video stream may compose:                    the single composite overhead view of the game encoded as a traditional stream of current images;            one or more perspective views of the game encoded as a traditional stream of current images;            either or both of the overhead and perspective views alternatively encoded as a derivative of the traditional streams of current images encoded as:                            streams of extracted blocks minimally containing all of the relevant foreground objects,                                    where the pan/tilt and zoom settings associated with each and every image in the current stream, for each perspective view camera, are also transmitted;                                                “localized” sub-streams of extracted blocks further sorted in the spatial domain based upon the identification of the player primarily imaged in the block;                “normalized” sub-streams of “localized” extracted blocks further expanded and rotated 
so as to minimize expected player image motion within the temporal domain;                “localized” and “normalized” sub-streams further separated into face and non-face regions;                separated non-face regions further separated into color underlay and grayscale overlay images, and                color underlay images encoded as color tone regions.                                    any of the derivative forms of the traditional streams alternately encoded as gradient images;            the single composite overhead view represented in a symbolic, rather than video or gradient format;                        where the assembled metrics stream may compose:                    an ongoing accumulation of performance measurements and analysis derived from the continuous stream of participant and game object tracking information created via image analysis of the single composite overhead view;                        where the assembled audio stream may compose:                    the traditional ambient audio recordings of the venue surroundings, or                            compressed volume and tonal maps derived from the ambient audio recordings that may be used to direct the automatic generation of synthesized crowd noise;                                    a stream of tokens encoding a description of the game activities that may be used to direct the automatic generation of synthesized game commentary;                        by using the determined game stop and re-start times along with the interpreted events to selectively alter the contents of the video stream;                    where alternative perspective view angles may be added to the stream based upon the measured game activities in order to serve as replays;            where additional captured images greater than the traditional 30 frames per second may be transmitted and then added to the prior transmitted original 30 frames per second in order to allow for slow motion replays;                
        by receiving user profile and preferences along with direct interactive user feedback in order to change any portion of the video, metrics or audio streams.            9. transmitting the broadcast and receiving back interactive viewer directives;            using current standards such as broadcast video for television and MPEG-4 or H.264 for the Internet, or        using variations of current standards designed to take advantage of the additional information created by the present application that support higher levels of broadcast stream compression.            10. decoding the transmitted broadcast into a stream of video and audio signals capable of being presented on the viewing device, where:            selected information is transmitted, or otherwise provided to the decoding system prior to receiving the transmitted broadcast including:                    a 3-D model of the venue in which the contest is being played;            a database of “natural” background images, one image for each allowed pan/tilt and zoom setting for each perspective view camera;                            a panoramic background for each perspective view camera representing a compressed compilation of the database of “natural” background images;                                    a database of advertisement images mapped to the 3-D venue model;            a color tone table representing the limited number of possible skin tones, uniform and game equipment colors to be used when decoding the video stream;            a database of standard poses of the participants expected to play in the broadcasted game cross-indexed at least by participant identification and also by pose information including orientation and approximate body pose;                            where the standard poses for each participant are pre-captured in the same uniforms and equipment they are expected to be wearing and using during the broadcasted contest;                                    a database 
of translation rules controlling how the stream of tonal and volume map information is to be converted into synthesized crowd noise;            a database of translation rules controlling how the stream of tokens encoding the game activities is to be converted into text for subsequent translation from text-to-speech;                        selected information is accepted locally, on the decoding system, for use in directing what information is included in the broadcast and how this information is presented, such as:                    a viewer profile and preferences database that is established prior to the broadcast and includes information such as:                            the viewer's name, age, address, relationship to the event as well as other traditional demographic data;                the viewer's preferences, at least including indicators for:                                    using natural or animated backgrounds;                    using the background from the actual or a substitute facility;                    using natural or synthesized crowd noise;                    the voices to be used for the synthesized audio game commentary, and                    the style of presentation.                                                                    
the same viewer profile and preferences database that is amended before and during the broadcast to include viewer indications of:                            the distinct overhead and perspective views to be transmitted;                the format of the transmitted overhead stream such as natural, gradient or symbolic;                the format of each of the transmitted perspective streams such as natural or gradient;                the detail of the metrics stream;                the inclusion of the performance tokens necessary to automate the synthesized game commentary, and                the format of the audio stream such as natural or synthesized (and therefore based upon the volume and tonal maps).                                                selected portions of the transmitted broadcast are saved off into a historical database for use in the present and future similar broadcasts, the information including:                    a database of captured game poses of the participants playing in the broadcast event stored and cross-indexed at least by participant identification and also by pose information including orientation and approximate body pose;            a database of accumulated performance information concerning the teams and participants of the current broadcast, and            a database of the automatically chosen translations of descriptive tokens used to drive the synthesized game commentary.                        
decoding is based upon current standards such as broadcast video for television and MPEG-4 or H.264 for the Internet, including additional optional steps for:                    recreating natural and/or animated backgrounds;            overlaying advertisements onto the recreated background;            overlaying graphics of game performance statistics, measurements and analysis onto the recreated background;                            where the above steps of recreating the background and overlaying advertisements and other graphics are based primarily upon information including:                                    the three-dimensional venue layout,                    the relative location of the associated perspective filming camera,                    the transmitted pan/tilt and zoom settings for each current image, and                    the information available in the viewer preferences and profile dataset;                                                                    translating the decoded pixels of the foreground participants via the pre-known color tone table into true color representations to be mixed with the separately decoded grayscale overlay information;            overlaying the decoded extracted blocks of foreground participants and game objects onto the recreated background based upon the transmitted relative location, orientation and/or rotation of the extracted blocks;            adding the actual venue recordings or creating synthesized crowd noise based upon the transmitted volume and tonal maps,            creating synthesized game commentary based upon the transmitted game descriptive tokens derived from the interpretation of tracking data, and            inserting advertisement video/audio clips interwoven with the transmitted game activities based upon the tracked and determined game stop and re-start times.                        
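By way of illustration only, the color tone and grayscale overlay decoding step listed above might be sketched as follows. The tone table entries, the function names `encode_pixel` and `decode_pixel`, and the use of a simple mean-brightness offset as the "grayscale overlay" are all assumptions made for this sketch and are not taken from the specification; an actual embodiment could separate underlay and overlay quite differently.

```python
from typing import List, Tuple

RGB = Tuple[int, int, int]

# Hypothetical pre-known color tone table (assumed values for illustration).
TONE_TABLE: List[RGB] = [
    (200, 30, 30),    # home uniform tone (assumed)
    (240, 240, 240),  # away uniform tone (assumed)
    (224, 172, 105),  # one skin tone (assumed)
]


def brightness(pixel: RGB) -> int:
    """Rounded mean of the three channels, used here as a stand-in for gray level."""
    return round(sum(pixel) / 3)


def encode_pixel(pixel: RGB) -> Tuple[int, int]:
    """Encode a foreground pixel as (tone index, grayscale offset)."""
    # Pick the table entry with the smallest squared RGB distance.
    index = min(
        range(len(TONE_TABLE)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(pixel, TONE_TABLE[i])),
    )
    # The overlay is what remains once the tone's own brightness is removed.
    return index, brightness(pixel) - brightness(TONE_TABLE[index])


def decode_pixel(index: int, offset: int) -> RGB:
    """Look up the tone and mix the grayscale offset back in, clamped to 0..255."""
    r, g, b = TONE_TABLE[index]
    return (
        max(0, min(255, r + offset)),
        max(0, min(255, g + offset)),
        max(0, min(255, b + offset)),
    )
```

In this sketch each pixel is carried as a small table index plus a single gray value rather than three full color channels, which is the source of the compression benefit suggested above; the reconstruction is lossy whenever a pixel's hue differs from its nearest table tone.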
Many of the above stated objects and advantages are directed towards subsystems that have novel and important uses outside of the scope of an Automatic Event Videoing, Tracking and Content Generation System, as will be understood by those skilled in the art. Furthermore, the present invention provides many novel and important teachings that are useful, but not mandatory, for the establishment of an Automatic Event Videoing, Tracking and Content Generation System. As will be understood by a careful reading of the present and referenced applications, any automatic event videoing, tracking and content generation system does not necessarily need to include all of the teachings of the present inventors but preferably includes at least those portions in combinations claimed in this and any subsequent related divisional or continued applications. Still further objects and advantages of the present invention will become apparent from a consideration of the drawings and ensuing descriptions.
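As a further non-limiting illustration, the foreground extraction recited in the objects above (comparing each current image against a background image known to be absent of foreground objects, then keeping only a minimal block) might be sketched as follows. The grayscale list-of-rows image representation, the fixed `threshold` value, and the function names `extract_foreground_box` and `crop` are assumptions made for this sketch only; the specification does not prescribe them.

```python
from typing import List, Optional, Tuple

Image = List[List[int]]          # rows of grayscale values, 0..255 (assumed format)
Box = Tuple[int, int, int, int]  # (top, left, bottom, right), all inclusive


def extract_foreground_box(background: Image, frame: Image,
                           threshold: int = 25) -> Optional[Box]:
    """Return the smallest box holding every pixel that differs from the
    background by more than `threshold`, or None if nothing differs."""
    height, width = len(frame), len(frame[0])
    # Rows and columns that contain at least one changed pixel.
    rows = [r for r in range(height)
            if any(abs(frame[r][c] - background[r][c]) > threshold
                   for c in range(width))]
    cols = [c for c in range(width)
            if any(abs(frame[r][c] - background[r][c]) > threshold
                   for r in range(height))]
    if not rows:
        return None
    return rows[0], cols[0], rows[-1], cols[-1]


def crop(frame: Image, box: Box) -> Image:
    """Cut the extracted block out of the current frame."""
    top, left, bottom, right = box
    return [row[left:right + 1] for row in frame[top:bottom + 1]]
```

Only the cropped block and its box coordinates would need to be passed downstream in this sketch, which mirrors the stated goal of reducing the total pixel area streamed to the analysis computer.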