The present invention relates generally to multi-camera video systems, and more particularly to an automatic multi-camera video composition system and method for its operation.
In the general field of video transmission and recording, it is common to concurrently capture video from multiple viewpoints or locations. One common example is sports broadcasting: a baseball game, for example, may use five or more cameras to capture the action from multiple viewing angles. One or more technicians switch between the cameras to provide a television signal that consists, hopefully, of the best view of whatever is happening in the game at that moment. Another example is a movie. Movie editing, however, takes place long after the events are recorded, with most scenes using a variety of camera shots in a selected composition sequence.
Although perhaps less exciting than a sports contest or a movie, many other applications of multi-camera video data exist. For instance, a selection of camera angles can provide a much richer record of almost any taped or broadcast event, whether that event is a meeting, a presentation, a videoconference, or an electronic classroom, to mention a few examples.
One pair of researchers has proposed an automated camera switching strategy for a videoconferencing application, based on speaker behavioural patterns. See F. Canavesio and G. Castagneri, “Strategies for Automated Camera Switching Versus Behavioural Patterns in Videoconferencing”, in Proc. IEEE Global Telecommunications Conf., pp. 313-18, Nov. 26-29 1984. The system described in this paper has one microphone and one camera for each of six videoconference participants. Two additional cameras provide input for a split-screen overview that shows all participants. A microprocessor periodically performs an “activity talker identification process” that detects who among all of the participants is talking and creates a binary activity pattern consisting of six “talk/no talk” values.
A number of time-based thresholds are entered into the system. The microprocessor implements a voice-switching algorithm that decides which of the seven camera views (six individual plus one overview) will be used for each binary activity pattern. In essence, the algorithm decides which camera view to use for a new evaluation interval based on who is speaking, which camera is currently selected, and whether the currently-selected camera view has been held for a minimum amount of time. If more than one simultaneous speaker is detected or no one speaks, the system will switch to the conference overview after a preset amount of time. And generally, when one speaker is detected, the system will continuously select the close-up view of that speaker as long as they continue to talk or take only short pauses.
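By way of illustration only, the general behaviour described above can be sketched in Python as follows. This is not the algorithm of the cited paper, whose actual thresholds and decision rules are not reproduced here; the class name `VoiceSwitcher`, the parameters `min_hold` and `overview_delay`, their default values, and the use of whole evaluation intervals as the unit of time are all assumptions made for the sketch.

```python
from dataclasses import dataclass
from typing import Sequence

OVERVIEW = 6  # view index of the overview; indices 0-5 are the close-ups


@dataclass
class VoiceSwitcher:
    # All thresholds are hypothetical and counted in evaluation intervals.
    min_hold: int = 3        # intervals a view must be held before a switch
    overview_delay: int = 2  # intervals of zero/multiple talkers before overview
    current: int = OVERVIEW  # currently selected view
    held: int = 0            # intervals the current view has been held
    ambiguous: int = 0       # consecutive intervals without exactly one talker

    def step(self, pattern: Sequence[int]) -> int:
        """Consume one six-value talk/no-talk pattern and return the view."""
        talkers = [i for i, talking in enumerate(pattern) if talking]
        self.held += 1

        if len(talkers) == 1:
            # Exactly one speaker: aim for that speaker's close-up.
            self.ambiguous = 0
            target = talkers[0]
        else:
            # Nobody or several people talking: tolerate a short gap,
            # then fall back to the overview after the preset delay.
            self.ambiguous += 1
            if self.ambiguous >= self.overview_delay:
                target = OVERVIEW
            else:
                target = self.current

        # Only switch once the current view has met its minimum hold time.
        if target != self.current and self.held >= self.min_hold:
            self.current = target
            self.held = 0
        return self.current
```

In this sketch, a caller would invoke `step` once per evaluation interval with the latest binary activity pattern; a single short pause by the current speaker leaves the close-up selected, while a longer silence or overlapping speech eventually selects the overview.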