A live broadcast such as one shown on television includes at least one video camera that is directed toward a scene. Conventional systems provide for a camera operator to manually operate the camera. Accordingly, the field of view of the camera is determined by the camera operator based upon the discretion of the camera operator. However, there are ways of improving a selection of the field of view of the camera in an automated manner. A robotic camera may be used to record live events autonomously with the potential to streamline and improve current broadcasting models. For example, an optimal shooting location may be impractical for manual human operation. Hiring a professional camera operator increases costs, and the benefit of an additional perspective may not justify those costs. In these instances, methods to improve or automate the control of the robotic camera are extremely beneficial. Automatically planning where the camera should look is a key challenge, especially when the source information is noisy sensor data. In practice, the camera must be controlled to ensure that it actually looks at the intended target. Although path planning and camera control are separate tasks, the two are highly correlated. That is, it is futile to plan a motion path that the camera is physically unable to follow.
When automated systems are used to control the manner in which the camera operates, the camera should move smoothly and purposefully to avoid disorienting the viewer. A simple solution for this is to place a setting in a controller for the automated system to limit changes in acceleration of the camera. However, such an approach neglects how well the camera continues to follow the area of interest or whether it oscillates around a target fixation point. Accordingly, the conventional systems do not provide for the autonomous camera to plan a trajectory which balances smooth motion against tracking error. That is, the conventional system for automating camera control does not provide planning for anticipating object motion such that the conventional system is able to predict the manner in which the camera is to react or move.
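The trade-off described above may be illustrated with a minimal sketch. The function names, gains, and limits below are hypothetical and are not taken from any conventional system; the sketch merely assumes a one-dimensional pan axis and a controller setting that clamps the change in acceleration per time step. It reports the resulting tracking error so that the cost of the smoothness setting is visible.

```python
def follow_with_jerk_limit(targets, max_jerk=0.2, dt=1.0, kp=0.5, kd=1.0):
    """Follow a 1-D target while clamping the change in acceleration per step.

    All parameter names and default values are illustrative assumptions.
    Returns the camera positions and the accelerations that were applied.
    """
    pos, vel, acc = float(targets[0]), 0.0, 0.0
    positions, accels = [], []
    max_step = max_jerk * dt  # largest allowed change in acceleration per step
    for target in targets:
        # Acceleration an unconstrained proportional-derivative law would request.
        desired_acc = kp * (target - pos) - kd * vel
        # Clamp the change in acceleration, not the acceleration itself.
        acc += max(-max_step, min(max_step, desired_acc - acc))
        vel += acc * dt
        pos += vel * dt
        positions.append(pos)
        accels.append(acc)
    return positions, accels


def tracking_errors(targets, positions):
    """Absolute distance between the target and the camera at each step."""
    return [abs(t - p) for t, p in zip(targets, positions)]


if __name__ == "__main__":
    # The target jumps from 0 to 10 after five steps.
    targets = [0.0] * 5 + [10.0] * 30
    positions, accels = follow_with_jerk_limit(targets)
    print("largest tracking error:", max(tracking_errors(targets, positions)))
```

In this sketch the motion is smooth by construction, yet nothing in the controller accounts for how far the camera lags behind the target or whether it overshoots, which is the shortcoming noted above.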
In order to anticipate object motion, an online, realtime system may be used, particularly when an event is being broadcast live. Specifically, sensor data is generated and processed in realtime such that a robotic pan-tilt-zoom (PTZ) camera follows the subject of interest and captures high-resolution images using all pixels from its image sensor. However, to capture aesthetic video, the online, realtime system must also be able to anticipate future object locations so that an optimally smooth trajectory may be planned. With regard to an event having high dynamic motion associated with subjects of interest, online autonomous robotic camera systems may be inadequate as they are more practical in environments with limited dynamic motion such as lecture halls and video conference facilities.
An event such as one involving team sports has highly dynamic object motions. Specifically, an object of the sport (e.g., ball) and the players are continuously moving. Since the conventional online realtime system does not satisfy the system requirements for such a highly dynamic event, conventional systems utilize non-realtime offline resampling approaches. In the resampling framework, one or more high-definition stationary cameras initially capture the live action. Subsequently, video from a virtual camera is synthesized after the fact by resampling the pixels from the fixed cameras. For example, the resampling may be a simple cropping of a rectangular subregion within an entire field of view of the camera. FIG. 1 illustrates a conventional cropping process in which an entire field of view 100 of the camera is used to generate a cropped view 105. Those skilled in the art may refer to such a process as a traditional pan and scan technique in which the cropped view 105 is a rectangle of a particular size centered at a particular point of interest determined from the non-realtime, offline approach. Thus, a smaller video is extracted from a larger video (e.g., creating a 4:3 version from 16:9 source material).
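By way of illustration only, the pan and scan cropping may be sketched as follows. The function name and its parameters are hypothetical; the sketch assumes integer pixel coordinates, selects the largest rectangle of the requested aspect ratio that fits in the source frame, centers it on the point of interest, and shifts it as needed so that it stays inside the frame.

```python
def crop_window(frame_w, frame_h, cx, cy, aspect_w=4, aspect_h=3):
    """Return (x, y, width, height) of a crop centered near (cx, cy).

    The crop is the largest rectangle with aspect ratio aspect_w:aspect_h
    that fits inside the frame. Names and defaults are illustrative.
    """
    # Largest crop height that still fits the frame at the requested aspect ratio.
    crop_h = min(frame_h, frame_w * aspect_h // aspect_w)
    crop_w = crop_h * aspect_w // aspect_h
    # Center on the point of interest, then clamp to the frame boundaries.
    x = min(max(cx - crop_w // 2, 0), frame_w - crop_w)
    y = min(max(cy - crop_h // 2, 0), frame_h - crop_h)
    return x, y, crop_w, crop_h


if __name__ == "__main__":
    # A 4:3 crop taken from the middle of a 16:9 (1920x1080) frame.
    print(crop_window(1920, 1080, 960, 540))  # (240, 0, 1440, 1080)
```

In this example the 4:3 crop retains 1440 of the 1920 source columns, so one quarter of the recorded pixels is discarded even before any zooming is applied.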
With an event including highly dynamic object motions, the offline approach is attractive because complex non-realtime algorithms may be used to plan the trajectory of the virtual (resampling) camera. More importantly, the offline aspect in which realtime images are captured but a non-realtime image is broadcast eliminates the need to accurately anticipate future object motions because the true future motion information is readily available. Similarly, there are no control issues because the virtual camera may move anywhere immediately (because it has no mass and can move infinitely fast). Despite these advantages, the non-realtime, offline approach has its own drawbacks. Specifically, the output image that is broadcast uses only a fraction of the resolution of the system. For example, in a sport like basketball, all players typically occupy only half the court at any given time. Therefore, with a set of fixed cameras which cover the entire court, at least half of the recorded pixels are never used in the output video. In addition, it is impossible to obtain high-resolution close-up images as the broadcast image is generated based upon a “zoomed out” realtime image.
Furthermore, the image that is eventually shown in a broadcast using the cropping process of FIG. 1 is the cropped view 105. The cropped view 105 is a sub-area of the entire area comprising the field of view of the camera or of a mosaic of cameras. However, the camera itself is stationary or provides a limited set of angles from which the cropped view 105 may be extracted. Therefore, the cropped view 105 being broadcast may appear skewed or distorted to the viewer. In addition, extracting the cropped view 105 from the entire field of view 100 may entail moving the cropped view 105 across the entire field of view 100. However, this often introduces unintended visual artifacts, such as the impression on the part of the viewer that the camera itself is physically translating.
In order to properly determine the cropped view 105 within the entire field of view 100, conventional systems include autonomous cameras utilizing the above described systems and path planning/control. A conventional autonomous camera system for sports production has employed a common framework in which one or more high-resolution fixed cameras capture the game and features such as player and ball locations are extracted offline. The output broadcast using the cropping approach is then generated afterwards by determining the optimal subregion of the appropriate fixed camera at each time instant. Within conventional autonomous camera systems, there has been significant variety in how the optimal subregion is determined at each time instant. In a first example, one approach augments player features with audience gaze angles. The images of three fixed cameras are stitched together using a cylindrical projection and a rotational offset based on player and audience gaze features. In a second example, another approach considers three different shot sizes depending on the estimated game situations. A smooth path is achieved using a Schmitt trigger which only puts the camera in motion when the ball nears the edge of the frame. In a third example, a further approach generates a virtual camera trajectory for a basketball using a Markov chain to balance smoothness against deviating from the optimal virtual camera state at each time instant.
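The Schmitt trigger behavior of the second example may be sketched, under stated assumptions, as a two-threshold hysteresis on the offset of the ball from the center of the frame. The function name, the threshold fractions, and the fixed panning speed are hypothetical and chosen only for illustration; in particular, the stop threshold is an assumption, since the description above only states when the camera is put in motion.

```python
def schmitt_pan(ball_xs, half_width=10.0, start_frac=0.8, stop_frac=0.2,
                speed=1.0, center=0.0):
    """Pan a virtual camera with hysteresis (a Schmitt trigger).

    The camera stays still until the ball is farther than
    start_frac * half_width from the frame center, then pans toward the
    ball until the ball is closer than stop_frac * half_width.
    All parameters are illustrative assumptions.
    """
    moving = False
    centers = []
    for x in ball_xs:
        offset = x - center
        if not moving and abs(offset) > start_frac * half_width:
            moving = True   # ball nears the frame edge: start panning
        elif moving and abs(offset) < stop_frac * half_width:
            moving = False  # ball is re-centered: stop panning
        if moving:
            # Move toward the ball at a bounded speed.
            center += max(-speed, min(speed, offset))
        centers.append(center)
    return centers


if __name__ == "__main__":
    ball = [0, 3, 6, 9, 9, 9, 9, 9, 9, 9, 9, 9]
    print(schmitt_pan(ball))
```

Because the start and stop thresholds differ, small ball movements near the center of the frame do not cause the camera to start and stop repeatedly.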
In addition to sports, autonomous camera systems have been deployed in lecture halls, video conferences, and television production stages. In these situations, the motion of subjects is quite predictable (possibly scripted and rehearsed) which allows for a range of camera solutions to be employed. In a first example, one approach demonstrates how a user-supervised autonomous camera system automatically frames shots for a cooking show. Various vision algorithms are deployed depending on the type of shot as requested by the human director. In a second example, another approach uses a fixed 1080i camera to record a lecturer. A cropping window is computed from frame differencing, and both bilateral filtering and human-specified control points for a learned acceleration model smooth the noisy input signal. In a third example, a further approach controls a virtual camera to record a lecturer. The motion of the virtual camera is regulated using a Kalman filter augmented with a three-state rule-based post-filtering technique to prefer stationary cameras unless the lecturer is moving significantly. In a fourth example, a still further approach uses a fixed camera to estimate a saliency map of the video conference room to compute an optimal cropping window which balances a loss of information from aperture and resolution effects. Instead of cropping from the wide-angle camera, the desired subregion is used to control a robotic PTZ camera.
With regard to path planning and control, determining where the cameras should look is a key component of any autonomous system. Additionally, the planned trajectory must be smooth such that the process to decide where the camera should look at any given time instant must take into account where the camera should be looking both before and after the current time instant. Camera planning is a relevant issue regarding conventional computer graphics systems. However, computer graphics algorithms rarely consider incomplete and noisy data generated from computer vision and other sensing modalities.
The task of moving a physical camera to keep an object of interest within the field of view is referred to as visual servoing by those skilled in the art. In a first conventional system, a proportional-only feedback control algorithm is employed to adjust the pan-tilt angle of a camera mounted on the end of a human-operated boom to keep a target object in the center of the camera image. When multiple targets are tracked, conventional control algorithms often monitor features derived from the point set such as mean and standard deviation. In a second conventional system, a proportional-only control is used to position the centroid of detected image features near the centers of the images of a stereo camera pair. In a third conventional system, a task-priority kinematic control is used to keep a set of interest points within the camera field of view. In such a system, the mean and standard deviation are independent objectives in which pan-tilt values are modified to keep the mean near the center of the image and zoom is regulated to keep the standard deviation within the image boundary.
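A proportional-only update driven by the mean and standard deviation of a point set may be sketched as follows. The sketch is one-dimensional, and the function name, the gains, and the rule that the standard deviation should equal a fixed fraction of the field of view are assumptions made for illustration; it is not the task-priority kinematic control of the third conventional system.

```python
from statistics import fmean, pstdev


def p_control_step(points_deg, pan, fov, kp_pan=0.5, kp_zoom=0.5, fill=0.25):
    """One proportional-only update of pan angle and field of view.

    points_deg: bearings of the tracked points in degrees.
    pan, fov:   current pan angle and horizontal field of view in degrees.
    fill:       assumed target ratio of standard deviation to field of view.
    All names and default values are illustrative assumptions.
    """
    mean = fmean(points_deg)
    spread = pstdev(points_deg)
    pan_error = mean - pan             # pan objective: center the mean
    fov_error = spread / fill - fov    # zoom objective: frame the spread
    return pan + kp_pan * pan_error, fov + kp_zoom * fov_error


if __name__ == "__main__":
    pan, fov = 0.0, 40.0
    for _ in range(50):
        pan, fov = p_control_step([10.0, 20.0, 30.0], pan, fov)
    print(round(pan, 3), round(fov, 3))
```

Each iteration removes a fixed fraction of the remaining error, so for stationary targets the pan angle settles on the mean bearing and the field of view settles on a value proportional to the spread of the points.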
When applied to generating a video image in a sports environment, conventional systems determine where cameras should look based on player motions. In a first conventional system, a K-nearest-neighbor classifier is used to learn the relationship between features (such as player position) and the PTZ state of cameras operated by professionals. In a second conventional system, individual players are tracked using a particle filter and a global motion vector field is extrapolated on the ground plane using a Gaussian process regression. This system illustrates how convergence regions in the vector field correlate with actual broadcast camera movements.
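The K-nearest-neighbor idea may be sketched as follows. The sketch assumes that the PTZ state has been discretized into a small set of labels and that the feature is a two-dimensional position; both are simplifications introduced here, and the training data shown is invented solely to exercise the code.

```python
import math
from collections import Counter


def knn_camera_state(train, query, k=3):
    """Predict a discretized camera state by majority vote of the k nearest examples.

    train: list of (feature_vector, state_label) pairs, e.g. a player
           centroid paired with the state an operator chose (hypothetical data).
    query: feature vector for the current frame.
    """
    nearest = sorted(train, key=lambda ex: math.dist(ex[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Invented examples: (x, y) player centroid -> operator's camera state.
    examples = [
        ((10.0, 5.0), "left"), ((12.0, 6.0), "left"),
        ((45.0, 5.0), "center"), ((50.0, 7.0), "center"), ((47.0, 4.0), "center"),
        ((85.0, 5.0), "right"), ((88.0, 6.0), "right"),
    ]
    print(knn_camera_state(examples, (48.0, 5.0)))
```

A continuous PTZ state could instead be estimated by averaging the states of the nearest examples, which would make the method a regression rather than a classification.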
Both robotics and computer vision have been used to address the issue of planning smooth trajectories for cameras. In a first conventional system, a probabilistic roadmap is used to generate an initial estimate of linear segments which link the current camera state to the desired future camera state. The path is refined by fitting circular arcs between segments to compute a smooth velocity plan which depends on path curvature. In a second conventional system, a video stabilization technique is used to estimate the trajectory of a handheld camera using inter-frame homographies and to identify segments of constant velocity linked together with ease-in/ease-out curves. In a third conventional system, a noisy trajectory is refined using a linear program which generates a trajectory preferring constant position or constant velocity segments.
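An ease-in/ease-out transition between two camera states may be sketched with a cubic "smoothstep" blend. The choice of this particular curve and the function names are assumptions made for illustration; the conventional systems described above may use different curves.

```python
def ease_in_out(start, end, t):
    """Blend from start to end with zero velocity at both ends.

    t is the normalized time in [0, 1]; values outside are clamped.
    Uses the cubic smoothstep 3t^2 - 2t^3 (an illustrative choice).
    """
    t = max(0.0, min(1.0, t))
    s = t * t * (3.0 - 2.0 * t)
    return start + (end - start) * s


def link_segments(start, end, steps):
    """Sample an eased transition at `steps` evenly spaced instants (steps >= 2)."""
    return [ease_in_out(start, end, i / (steps - 1)) for i in range(steps)]


if __name__ == "__main__":
    # Ease a pan angle from 0 to 10 degrees over five samples.
    print(link_segments(0.0, 10.0, 5))  # [0.0, 1.5625, 5.0, 8.4375, 10.0]
```

The samples are spaced closely near both ends and widely in the middle, so the camera accelerates gradually out of one segment and decelerates gradually into the next.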