The APIDIS (Autonomous Production of Images based on Distributed and Intelligent Sensing) project aims to generate personalized content for improved, low-cost visual representation of controlled scenarios such as sports television, where image quality and perceptual comfort are as essential as the efficient integration of contextual information [1].
In the APIDIS context, multiple cameras are distributed around the action of interest, and the autonomous production of content involves three main technical questions regarding those cameras: (i) how to select optimal viewpoints, i.e. cropping parameters in a given camera, so that they are tailored to limited display resolution, (ii) how to select the right camera to render the action at a given time, and (iii) how to smooth camera/viewpoint sequences to remove production artefacts. Production artefacts consist of both visual artefacts, mainly flickering effects caused by shaking or fast zooming in and out of viewpoints, and story-telling artefacts, such as discontinuity of the story caused by fast camera switching and dramatic viewpoint movements.
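Question (iii) can be illustrated with a minimal sketch. It is not the APIDIS method, only two generic smoothing steps under stated assumptions: a minimum dwell time that limits fast camera switching, and an exponential moving average that damps shaking or fast zooming of a cropping parameter. The function names `enforce_min_dwell` and `smooth`, and the parameters `min_dwell` and `alpha`, are hypothetical.

```python
def enforce_min_dwell(cameras, min_dwell):
    """Keep each selected camera for at least `min_dwell` frames.

    `cameras` is the per-frame camera index requested by some selector
    (assumed given). A requested switch is ignored until the camera
    currently on air has been held for `min_dwell` frames.
    """
    if not cameras:
        return []
    out = [cameras[0]]
    current, held = cameras[0], 1
    for cam in cameras[1:]:
        if cam != current and held >= min_dwell:
            current, held = cam, 1   # switch is allowed
        else:
            held += 1                # keep the current camera
        out.append(current)
    return out


def smooth(values, alpha):
    """Exponential moving average of one cropping parameter
    (e.g. the crop centre or width), with 0 < alpha <= 1."""
    out, state = [], None
    for v in values:
        state = v if state is None else alpha * v + (1 - alpha) * state
        out.append(state)
    return out
```

A smaller `alpha` gives a steadier but slower-reacting viewpoint, and a larger `min_dwell` gives fewer cuts at the cost of reacting later to the action.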
Data fusion of multiple cameras has been widely discussed in the literature. Previous works can be roughly classified into three major categories according to their purposes. Methods in the first category deal with camera calibration and intelligent camera control by integrating contextual information about the multi-camera environment [4]. The reconstruction of 3D scenes [5] and arbitrary-viewpoint video synthesis [2] from multiple cameras form a second, actively studied category. The third category uses multiple cameras to solve specific problems such as occlusion in various applications, e.g. people tracking [6]. All these works focus on extracting important 3D contextual information, but give little consideration to the video production questions mentioned above.
Regarding autonomous video production, some methods have been proposed in the literature for selecting the most representative area of a standalone image. Suh et al. [7] defined the optimal cropping region as the minimum rectangle containing saliency above a given threshold, where the saliency was computed by the visual attention model of [8]. Ref. [9] proposed another method based on an attention model, but it focused more on the optimal shifting path of attention than on the choice of viewpoint. It is also known to exploit a distributed network of cameras to approximate the images that would be captured by a virtual sensor located at an arbitrary position, with arbitrary viewpoint coverage. With few cameras that have quite heterogeneous lenses and scene coverage, however, most state-of-the-art free-viewpoint synthesis methods produce blurred results [2][3].
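The cropping criterion attributed to [7] above can be sketched as follows. This is a brute-force illustration, not the implementation of [7]. It assumes that "saliency above a given threshold" means the rectangle must hold at least a given fraction of the total saliency, and that a saliency map is already available as a 2D list. The names `integral_image`, `min_crop` and `fraction` are hypothetical.

```python
def integral_image(sal):
    """Summed-area table with one extra row and column of zeros."""
    h, w = len(sal), len(sal[0])
    ii = [[0.0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        for x in range(w):
            ii[y + 1][x + 1] = (sal[y][x] + ii[y][x + 1]
                                + ii[y + 1][x] - ii[y][x])
    return ii


def min_crop(sal, fraction=0.9):
    """Smallest-area rectangle (top, left, bottom, right), with bottom
    and right exclusive, holding at least `fraction` of the total
    saliency. Exhaustive search, suitable only for small maps."""
    h, w = len(sal), len(sal[0])
    ii = integral_image(sal)
    target = fraction * ii[h][w]
    best, best_area = (0, 0, h, w), h * w
    for top in range(h):
        for bottom in range(top + 1, h + 1):
            for left in range(w):
                for right in range(left + 1, w + 1):
                    area = (bottom - top) * (right - left)
                    if area >= best_area:
                        break  # widening further cannot improve
                    s = (ii[bottom][right] - ii[top][right]
                         - ii[bottom][left] + ii[top][left])
                    if s >= target:
                        best, best_area = (top, left, bottom, right), area
                        break  # narrowest width for this left edge
    return best
```

Such a criterion operates on one image at a time, so it does not by itself address camera selection or temporal smoothing.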
In Ref. [10], an automatic production system for soccer videos is proposed, and viewpoint selection based on scene understanding is also discussed. However, this system only switches viewpoints among three fixed shot sizes according to several fixed rules, which leads to uncomfortable visual artefacts caused by abrupt changes of shot size. Furthermore, only the single-camera case is discussed.
In addition to the above literature survey, several patent applications have considered (omnidirectional) multi-camera systems to produce and edit video content in a semi-automatic way. Three main categories of systems can be identified.
The first category selects one view (i.e. one video) among those covered by a pre-defined set of cameras, based on some activity detection mechanism. In [15], each camera is activated by an external device, which triggers the video acquisition each time a particular event is detected (e.g. an object entering the field of view). In [16], audio sensors are used to identify the direction in which the video should be captured.
The second category captures a rich visual signal, based either on omnidirectional cameras or on wide-angle multi-camera settings, so as to offer some flexibility in the way the scene is rendered at the receiver end. For example, the systems in [17] and [18] respectively consider multi-camera and omnidirectional viewing systems to capture and broadcast wide-angle video streams. In [17], an interface allows the viewer to monitor the wide-angle video stream(s) and to select which portion of the video to unwrap in real time. Further, the operator can stop the playback and control pan-tilt-zoom effects in a particular frame. In [18], the interface is improved based on the automatic detection of the video areas in which an event participant is present. Hence, the viewer gets the opportunity to choose interactively which event participant (s)he would like to look at.
Similarly, [19-21] detect people of interest in a scene (typically a lecturer or a videoconference participant). However, the improvement over [18] is twofold. Firstly, in [19-21], methods are proposed to define automatically a set of candidate shots based on automatic analysis of the scene. Secondly, mechanisms are defined to select automatically a shot among the candidate shots. In [19], the shot definition relies on detection and tracking of the lecturer, and probabilistic rules are used to pseudo-randomly switch from the audience to the lecturer camera during a lecture. In [20] and [21], a list of candidate shots is also defined based on the detection of some particular object of interest (typically a face), but more sophisticated editing effects are considered to create a dynamic (videoconference) rendering. For example, one shot can pan from one person to another, or several faces can be pasted next to each other in a single shot. The edited output video is then constructed by selecting a best shot among the candidate shots for each scene (in [20] and [21], a scene corresponds to a particular period of time). The best shot is selected based on a pre-defined set of cinematic rules, e.g. to avoid too many of the same shot in a row.
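The rule-based selection described for [20] and [21] can be illustrated by a minimal sketch. The actual rule sets and scoring of those references are not given here, so everything in the code is an assumption: each scene comes with a hypothetical score per candidate shot, and the only cinematic rule is a cap on how many times the same shot may be used in a row. The names `select_shots` and `max_repeat` are hypothetical.

```python
def select_shots(scores, max_repeat=2):
    """Pick one shot per scene.

    `scores` is a list with one dict per scene, mapping a candidate
    shot name to an assumed quality score. The best-scoring shot is
    chosen unless it has already been used `max_repeat` times in a
    row, in which case the runner-up is taken instead.
    """
    chosen = []
    run = 0  # length of the current run of identical shots
    for scene in scores:
        ranked = sorted(scene, key=scene.get, reverse=True)
        pick = ranked[0]
        repeats = bool(chosen) and pick == chosen[-1]
        if repeats and run >= max_repeat and len(ranked) > 1:
            pick = ranked[1]  # break the repetition
        run = run + 1 if chosen and pick == chosen[-1] else 1
        chosen.append(pick)
    return chosen
```

With four scenes that all favour a "wide" shot over a "close" shot and `max_repeat=2`, the third scene falls back to "close" before returning to "wide".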
It is worth noting that the shot parameters (i.e. the cropping parameters in the view at hand) stay fixed until the camera is switched. Moreover, in [19-21] a shot is directly associated with an object, so that, in the end, shot selection amounts to selecting the object(s) to render, which might be difficult and irrelevant in contexts that are more complex than a videoconference or a lecture. Specifically, [19-21] do not select the shot based on the joint processing of the positions of the multiple objects.
The third and last category of semi-automatic video production systems differentiates the cameras that are dedicated to scene analysis from the ones that are used to capture the video sequences. In [22], a grid of cameras is used for sport scene analysis purposes. The outputs of the analysis module are then exploited to compute statistics about the game, but also to control pan-tilt-zoom (PTZ) cameras that collect videos of players of interest (typically the one that holds the puck or the ball). [22] must implement all scene analysis algorithms in real time, since it aims at controlling the PTZ parameters of the camera instantaneously, as a function of the action observed in the scene. More importantly and fundamentally, [22] selects the PTZ parameters to capture a specific detected object and not to offer appropriate rendering of a team action, potentially composed of multiple objects-of-interest. In this it is similar to [19-21]. Also, when multiple videos are collected, [22] does not provide any solution to select one of them. It just forwards all the videos to an interface that presents them in an integrated manner to a human operator. This is the source of a bottleneck when many source cameras are considered.
US2008/0129825 discloses control of a motorized camera to capture images of an individual tracked object, e.g. for individual sports such as athletics competitions. The user selects the camera through a user interface. Location units are attached to the object and are therefore intrusive.
GB2402011 discloses automated camera control using event parameters. Based on player tracking and a set of trigger rules, the field of view of the cameras is adapted and switched between close, mid and far views. A camera is selected based on trigger events. A trigger event typically corresponds to specific movements or actions of sports(wo)men, e.g. the service of a tennis player, or to scoreboard information updates.
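Trigger-based switching of this kind can be pictured with a small sketch. The mapping below is purely illustrative and is not taken from GB2402011: the event names, the `RULES` table and the function `view_for_events` are hypothetical, and a real system would derive the events from player tracking and scoreboard data.

```python
# Hypothetical trigger rules: detected event -> field of view.
RULES = {
    "service": "close",           # e.g. a tennis player serving
    "rally": "far",               # assumed event, for illustration only
    "scoreboard_update": "mid",
}


def view_for_events(events, default="far"):
    """Return the view used at each time step.

    `events` holds one detected event name (or None) per time step.
    The view changes only when an event with a rule occurs; otherwise
    the previous view is kept.
    """
    view = default
    out = []
    for event in events:
        view = RULES.get(event, view)
        out.append(view)
    return out
```

Because the view depends only on the latest trigger, nothing in such a scheme adapts the framing continuously to the positions of several players.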
US2004/0105004A1 relates to rendering talks or meetings. Tracking cameras are exploited to render the presenter or a member of the audience who asks a question. The presenter and the audience members are tracked based on sound source localization, using an array of microphones. Given the position of the tracking camera target, the PTZ parameters of the motorized camera are controlled so as to provide a smooth edited video of the target. The described method and system are only suited to following a single individual. With respect to the selection of the camera, switching is disclosed between a set of very distinct views (one overview of the room, one view of the slides, one close view of the presenter, and one close view of a speaking audience member). The camera selection process is controlled based on event detection (e.g. a new slide appearing, or a member of the audience speaking) and videography rules defined by professionals, to emulate a human video production team.