Room-based telepresence (TP) environments include systems that are equipped with multiple cameras and displays, where the TP systems are configured to send and receive multiple high-definition (HD) video streams. The video streams can be classified as people streams, which are captured by cameras and contain views of meeting participants, and data streams, which are usually computer-generated graphical content presented by meeting participants. Current TP systems have a number of limitations with regard to receiving and showing multiple video streams, including the following.
Known TP systems typically assume that a receiver of video streams, such as a TP server, has knowledge of whether an incoming video stream is a people stream or a data stream. In current TP systems, the classification of video streams into people streams or data streams is predefined and signaled to a receiver (i.e., the receiver conducts no analysis on received video streams and relies solely on the classification that has been predefined and provided to it). In addition, even though a TP server may perform composition on multiple video streams and send the composed video streams to one or more receiving endpoints, the composition is limited solely to scaling video images and arranging the scaled images according to a pane layout. The server neither analyzes nor uses any content information of the video images in performing the composition, nor does it leave a receiving endpoint any flexibility to do so. Further, at a receiving endpoint, the reception of video streams is limited to one HD stream per display, and each received stream is displayed at full size on one screen. In the case of a single-screen endpoint, one people stream plus one data stream can be received and displayed on the same screen with a simple composition (e.g., a picture-in-picture or PIP display).
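To make the limitation concrete, the content-agnostic composition described above (scaling each image and placing it in a pane, or overlaying one stream as a PIP inset) can be sketched as pure geometry. The following Python sketch is illustrative only and does not represent any particular TP server: the names `Rect`, `fit_into`, `grid_layout`, and `pip_layout`, the near-square grid rule, and the inset fraction and margin are all assumptions. Note that nothing in it inspects pixel content or stream type, which is the limitation at issue.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Rect:
    """A pane on the composed output canvas, in pixels (hypothetical type)."""
    x: int
    y: int
    w: int
    h: int


def fit_into(src_w: int, src_h: int, pane: Rect) -> Rect:
    """Scale a src_w x src_h image to fit inside a pane, keeping its
    aspect ratio and centering it within the pane."""
    scale = min(pane.w / src_w, pane.h / src_h)
    w, h = int(src_w * scale), int(src_h * scale)
    return Rect(pane.x + (pane.w - w) // 2, pane.y + (pane.h - h) // 2, w, h)


def grid_layout(canvas_w: int, canvas_h: int, n_streams: int) -> List[Rect]:
    """Divide the canvas into a near-square grid of equal panes
    (an assumed layout rule, chosen only for illustration)."""
    cols = math.ceil(math.sqrt(n_streams))
    rows = math.ceil(n_streams / cols)
    pane_w, pane_h = canvas_w // cols, canvas_h // rows
    return [
        Rect((i % cols) * pane_w, (i // cols) * pane_h, pane_w, pane_h)
        for i in range(n_streams)
    ]


def pip_layout(canvas_w: int, canvas_h: int,
               inset_fraction: float = 0.25, margin: int = 16) -> List[Rect]:
    """Return [main, inset]: one full-screen pane plus a small inset in the
    bottom-right corner. Fraction and margin are assumed values."""
    main = Rect(0, 0, canvas_w, canvas_h)
    inset_w = int(canvas_w * inset_fraction)
    inset_h = int(canvas_h * inset_fraction)
    inset = Rect(canvas_w - inset_w - margin,
                 canvas_h - inset_h - margin, inset_w, inset_h)
    return [main, inset]


if __name__ == "__main__":
    # Four incoming streams composed onto one 1080p canvas.
    for pane in grid_layout(1920, 1080, 4):
        print(pane, "->", fit_into(1280, 720, pane))
    # Single-screen endpoint: one stream full screen, one as a PIP inset.
    print(pip_layout(1920, 1080))
```

Because the layout depends only on the number of streams and the canvas size, a people stream and a data stream are treated identically unless their classification is supplied externally.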