Transmission of moving pictures in real time is employed in several applications, e.g. video conferencing, net meetings and video telephony. Video conferencing systems allow for simultaneous exchange of audio, video and data information among multiple conferencing sites. Control units, such as Multipoint Control Units (MCUs), perform switching functions to allow endpoints of multiple sites to intercommunicate in a conference.
An endpoint may be defined as any suitable device or apparatus that is configured to provide visual and audio communication to one or more participants at a conference site. For example, FIG. 1 illustrates a video conferencing system 100 that comprises endpoints 120 interconnected via an internet protocol (IP) network. In this example, a control unit 140 is an MCU. As the skilled person will realize, the endpoints 120 may comprise dedicated video communication terminals as well as suitably configured general purpose computers having video and audio communication hardware and software.
The control unit 140 links sites/endpoints/participants together by receiving frames of conference signals from the sites/endpoints, processing the received signals, and retransmitting the processed signals to appropriate sites/endpoints. The conference signals include audio, video, data and control information. In a switched conference, the video signal from one of the conference sites/endpoints, typically that of the loudest speaker, is broadcast to each of the sites/endpoints. In a continuous presence conference, video signals from two or more sites/endpoints are spatially mixed to form a composite video signal for viewing by conference participants at sites/endpoints. When the different video streams have been mixed together into one single video stream, the composed video stream is transmitted to the different sites/endpoints of the video conference, where each transmitted video stream preferably follows a set scheme indicating who will receive what video stream. In general, the different participants prefer to receive different video streams. The continuous presence or composite video stream is a combined picture that may include live video streams, still images, menus or other visual images from participants in the conference. Continuous presence may refer to a special kind of composite image for multi-screen video conferencing.
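The two conference modes described above can be illustrated with a minimal Python sketch. It is not taken from any particular MCU implementation: the function names, the audio-level dictionary and the near-square grid layout are all assumptions made for illustration only. The first function picks the source for a switched conference, and the second assigns each endpoint a tile in a composite picture for a continuous presence conference.

```python
import math


def select_switched_source(audio_levels):
    """Switched conference: return the endpoint id with the highest audio level.

    audio_levels is a hypothetical mapping of endpoint id -> measured level.
    """
    return max(audio_levels, key=audio_levels.get)


def continuous_presence_layout(endpoint_ids, width, height):
    """Continuous presence: assign each endpoint a tile (x, y, w, h).

    Tiles are arranged in a near-square grid inside a width x height picture.
    The grid layout is an assumption; real systems may use other layouts.
    """
    if not endpoint_ids:
        return {}
    cols = math.ceil(math.sqrt(len(endpoint_ids)))
    rows = math.ceil(len(endpoint_ids) / cols)
    tile_w, tile_h = width // cols, height // rows
    return {
        eid: ((i % cols) * tile_w, (i // cols) * tile_h, tile_w, tile_h)
        for i, eid in enumerate(endpoint_ids)
    }
```

For example, four endpoints composed into a 1280x720 picture would each receive a 640x360 tile under this assumed layout.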
As exemplified in FIG. 1, in addition to traditional stationary video conferencing endpoints 120, external devices 130, such as mobile and computer devices, smartphones, tablets, personal devices and PCs, have recently entered the visual communication marketplace and are also used as video conferencing endpoints.
Furthermore, external devices 130 having touch screens have been used as annotation devices in video conferences. A user may annotate on the screen, e.g. on top of a snapshot of a presentation, by moving a finger or a pen over the screen. An annotation application running on the external device captures the movements and transmits them over a dedicated annotation signal channel to an annotation software component of an MCU. The MCU then encodes the received annotation signal and transmits an annotated version of the presentation to all participants of the conference as encoded video streams. The external device 130 may also be provided with a remote-control application that transmits control signals to a control software component of the MCU, in order for a user to control the MCU.
Both the MCU and the external device are required to have additional non-standardized software components installed. A data channel separate from the video signal is also required to transmit the annotation signal or the control signal from the external device to the MCU.
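As a rough illustration of the annotation step described above, the following Python sketch draws a captured stroke on top of a presentation snapshot. It is a simplified assumption, not the annotation component of any actual MCU: the frame is modelled as a list of pixel rows, the stroke as a list of (x, y) touch positions, and the function name and pixel value are hypothetical. Transport over the annotation channel and video encoding are deliberately left out.

```python
def overlay_stroke(frame, stroke, value=255):
    """Return a copy of frame with the stroke drawn on top of it.

    frame  -- list of rows, each a list of pixel values (assumed grayscale)
    stroke -- list of (x, y) positions captured from the touch screen
    value  -- pixel value used for the annotation (hypothetical default)
    """
    out = [row[:] for row in frame]  # leave the original snapshot untouched
    height = len(out)
    width = len(out[0]) if out else 0
    if len(stroke) == 1:
        stroke = [stroke[0], stroke[0]]  # a single tap becomes a single dot
    # Connect consecutive touch positions by linear interpolation.
    for (x0, y0), (x1, y1) in zip(stroke, stroke[1:]):
        steps = max(abs(x1 - x0), abs(y1 - y0), 1)
        for s in range(steps + 1):
            x = round(x0 + (x1 - x0) * s / steps)
            y = round(y0 + (y1 - y0) * s / steps)
            if 0 <= x < width and 0 <= y < height:
                out[y][x] = value
    return out
```

In the arrangement described above, a step of this kind would run on the MCU after the stroke has been received, and the resulting picture would then be encoded and sent to the participants.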
A drawback of such a scenario is that it is not possible to make annotations or control the MCU without being in possession of a device that has an annotation or control application installed thereon.
U.S. Patent Pub. No. 2009/0210491A1 discloses a method and apparatus to annotate frames with identifying information of participants in a multimedia conference event by detecting the participants in multiple input media streams.