Videoconferencing enables individuals located remotely from one another to conduct a face-to-face meeting. Videoconferencing may be executed by using audio and video telecommunications. A videoconference may be between as few as two sites (point-to-point), or between several sites (multi-point). A conference site may include a single participant (user, conferee) or several participants (users, conferees). Videoconferencing may also be used to share documents, presentations, information, and the like.
Participants may take part in a videoconference via a videoconferencing endpoint (EP), for example. An endpoint may be a terminal on a network, for example. An endpoint may be capable of providing real-time, two-way, audio/visual/data communication with other terminals and/or with a multipoint control unit (MCU). An endpoint may provide information/data in different forms, including audio; audio and video; data, audio, and video; etc. The terms “terminal,” “site,” and “endpoint” may be used interchangeably. In the present disclosure, the term endpoint may be used as a representative term for the above group.
An endpoint may comprise a display unit (screen), upon which video images from one or more remote sites may be displayed. Example endpoints include POLYCOM® VSX® and HDX® series endpoints, each available from Polycom, Inc. (POLYCOM, VSX, and HDX are registered trademarks of Polycom, Inc.) A videoconferencing endpoint may send audio, video, and/or data from a local site to one or more remote sites, and display video and/or data received from the remote site(s) on its screen (display unit).
Video images displayed on a screen at an endpoint may be displayed in an arranged layout. A layout may include one or more segments for displaying video images. A segment may be a predefined portion of a screen of a receiving endpoint that may be allocated to a video image received from one of the sites participating in the videoconferencing session. In a videoconference between two participants, a segment may cover the entire display area of the screens of the endpoints. In each site, the segment may display the video image received from the other site.
An example of a video display mode in a videoconference between a local site and multiple remote sites may be a switching mode. In switching mode, the video/data from only one of the remote sites may be displayed on the local site's screen at a time. The displayed video may be switched to video received from another site depending on the dynamics of the conference.
In contrast to the switching mode, in a continuous presence (CP) conference, a conferee (participant) at a local endpoint may simultaneously observe several other conferees from different endpoints participating in the videoconference. Each site may be displayed in a different segment of the layout, which is displayed on the local screen. The segments may be the same size or of different sizes. The combinations of the sites displayed on a screen and their association to the segments of the layout may vary among the different sites that participate in the same session. Furthermore, in a continuous presence layout, a received video image from a site may be scaled, up or down, and/or cropped in order to fit its allocated segment size. It should be noted that the terms “conferee,” “user,” and “participant” may be used interchangeably.
An MCU may be used to manage a videoconference. An MCU is a conference-controlling entity, typically located in a node of a network or in a terminal, that receives several channels from endpoints and, according to certain criteria, processes audio and/or visual signals and distributes them to a set of connected channels.
Examples of MCUs include the MGC-100 and RMX 2000®, available from Polycom, Inc. (RMX 2000 is a registered trademark of Polycom, Inc.). Some MCUs may be composed of two logical units: a media controller (MC) and a media processor (MP). A more thorough definition of an endpoint and an MCU may be found in the International Telecommunication Union (“ITU”) standards, including the H.320, H.324, and H.323 standards. Additional information regarding videoconferencing standards and protocols such as ITU standards or Session Initiation Protocol (SIP) may be found on the ITU website or on the Internet Engineering Task Force (IETF) website, respectively.
In a CP videoconferencing session, the association between sites and segments may be dynamically changing according to the activities taking place in the conference. In some layouts, one of the segments may be allocated to a current speaker, for example. The other segments of that layout may be allocated to other sites that were selected as presenter sites or presenter conferees. A current speaker may be selected according to certain criteria, including having the highest audio signal strength during a certain percentage of a monitoring period. The other presenter sites may include the image of the conferee that was the previous speaker; certain conferees required by management decisions to be visible; etc. A predefined number of sites, out of a plurality of sites that participate in the session, whose audio energy is higher than the rest of the conferees can be referred to as speaking conferees, and the audio signals from the speaking conferees can be mixed. The mixed audio can be distributed to all of the conferees, or in some embodiments the audio of a speaking conferee can be removed from the mixed audio that is transmitted to that speaking conferee.
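By way of illustration only, the speaker-selection criterion and the audio mixing described above may be sketched as follows. This is a minimal sketch under stated assumptions, not an MCU implementation: the function names, the per-interval energy lists, and the `min_fraction` threshold are hypothetical and are not taken from any cited product or standard.

```python
from collections import Counter


def select_current_speaker(energy_by_site, min_fraction=0.5):
    """Return the site that was loudest in at least `min_fraction` of the
    monitoring intervals, or None if no site qualifies.

    energy_by_site: dict mapping site -> list of audio energies, one value
    per monitoring interval (all lists are assumed to have equal length).
    The 0.5 default is an illustrative assumption.
    """
    sites = list(energy_by_site)
    n_intervals = len(energy_by_site[sites[0]]) if sites else 0
    if n_intervals == 0:
        return None
    wins = Counter()
    for i in range(n_intervals):
        # Site with the highest audio energy in interval i.
        loudest = max(sites, key=lambda s: energy_by_site[s][i])
        wins[loudest] += 1
    site, count = wins.most_common(1)[0]
    return site if count / n_intervals >= min_fraction else None


def mix_for_each_conferee(samples_by_site, speaking_sites):
    """Mix the audio of the speaking sites for every conferee.

    A speaking site receives the mix with its own audio removed; every
    other site receives the full mix.
    """
    n = len(next(iter(samples_by_site.values())))
    full_mix = [sum(samples_by_site[s][i] for s in speaking_sites)
                for i in range(n)]
    mixes = {}
    for site, own in samples_by_site.items():
        if site in speaking_sites:
            mixes[site] = [m - x for m, x in zip(full_mix, own)]
        else:
            mixes[site] = list(full_mix)
    return mixes
```

In this sketch, a site that is loudest in three of four intervals is selected when `min_fraction` is 0.5, while no site is selected when the threshold is raised above 0.75.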
In a conventional CP videoconference, each layout is associated with a video output port of an MCU. A conventional video output port may include a CP image builder and an encoder. A conventional CP image builder may obtain decoded video images of each one of the presenter sites. The CP image builder may scale and/or crop the decoded video images to a required size of a segment in which the image will be presented. The CP image builder may further write the scaled image in a CP frame memory in a location that is associated with the location of the segment in the layout. When the CP frame memory has all the presenter images located in their associated segments, the CP image may be read from the CP frame memory by the encoder.
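The operation of a CP image builder described above may be sketched, for illustration only, as follows. The sketch assumes that a decoded image is a list of pixel rows, that each segment is given as a hypothetical `(x, y, width, height)` tuple, and that nearest-neighbor scaling is acceptable; cropping is omitted, and an actual output port may use different data structures and scaling filters.

```python
def scale_nearest(image, out_w, out_h):
    """Scale an image (list of pixel rows) to out_w x out_h using
    nearest-neighbor sampling. The filter choice is an assumption."""
    in_h, in_w = len(image), len(image[0])
    return [[image[y * in_h // out_h][x * in_w // out_w]
             for x in range(out_w)]
            for y in range(out_h)]


def build_cp_frame(frame_w, frame_h, segments, decoded_images, background=0):
    """Compose a CP frame in a frame memory.

    segments: dict mapping site -> (x, y, width, height) of its segment.
    decoded_images: dict mapping site -> decoded image of that site.
    """
    # The CP frame memory, initially filled with a background value.
    frame = [[background] * frame_w for _ in range(frame_h)]
    for site, (x0, y0, w, h) in segments.items():
        # Scale the decoded image to the size of its allocated segment.
        scaled = scale_nearest(decoded_images[site], w, h)
        # Write the scaled image at the location of the segment.
        for dy in range(h):
            frame[y0 + dy][x0:x0 + w] = scaled[dy]
    return frame
```

Once every presenter image has been written to its segment, the returned frame corresponds to the CP image that the encoder would read.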
The encoder may encode the CP image. The encoded and/or compressed CP video image may be sent toward the endpoint of the relevant conferee. A frame memory module may employ two or more frame memories, for example, a currently encoded frame memory and a next frame memory. The memory module may alternately store and output video of consecutive frames. Conventional output ports of an MCU are well known in the art and are described in a plurality of patents and patent applications. Additional information on a conventional output port can be found, for example, in U.S. Pat. No. 6,300,973, the contents of which are incorporated herein by reference in their entirety.
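The alternating use of a currently encoded frame memory and a next frame memory may be sketched, for illustration only, as a two-buffer arrangement. The class and method names below are hypothetical, and the sketch ignores synchronization between the image builder and the encoder.

```python
class FrameMemoryModule:
    """Two frame memories used alternately: one is written by the CP image
    builder (the next frame) while the other is read by the encoder (the
    currently encoded frame)."""

    def __init__(self):
        self._memories = [None, None]
        self._write_index = 0  # index of the "next frame" memory

    def write_next(self, frame):
        """Store the frame being built in the next frame memory."""
        self._memories[self._write_index] = frame

    def swap(self):
        """Exchange the roles of the two memories once a frame is complete."""
        self._write_index ^= 1

    def read_current(self):
        """Return the frame held in the currently encoded frame memory."""
        return self._memories[self._write_index ^ 1]
```

In this arrangement the encoder continues to read a complete frame while the following frame is being composed in the other memory.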
Some videoconferencing techniques can include two or more video cameras to deliver video images from the same site. The two or more cameras can be used for 3D simulation, maintaining eye contact with another conferee, a Telepresence videoconferencing system (TPVS), or a simulation of TPVS, etc. The TPVS can include a large conferencing table with a line of chairs along one side of the table. A video zone is located on the other side of the table, in front of the line of chairs. The video zone can include two or more video displays, adjacent to each other, and two or more video cameras. In some TPVSs, the video zone, i.e., the displays and cameras, is adjusted to a certain arrangement of the table and the line of chairs. The video camera setup is adjusted to capture the conferees sitting along the other side of the table. The two or more video images are delivered to the other end or ends of the communication session, to be displayed over a video zone in the other end TPVS. The TPVS gives the impression that the conferees, located at the other side of the communication line and using another TPVS, are sitting in the same room across the conferencing table.
A videoconferencing system that simulates a TPVS may allow video images from two or more cameras shooting at the same site to be displayed as a single panoramic image. Accordingly, a conferencing endpoint having a single monitor can display the panoramic image of the two or more video images from an endpoint having multiple cameras, such as a common TPVS endpoint. In order to stitch two adjacent images received from different cameras, the system needs to identify similar image elements, occurring in both adjacent video images, to be used as reference points. Additional information on simulating TPVS can be found in U.S. patent application Ser. No. 12/581,626, the contents of which are incorporated herein by reference.
Another videoconferencing technique can simulate 3D video. A transmitting endpoint of a video conferencing system that simulates 3D may include two or more video cameras, each of which may record the room of the site from a different angle. The transmitting endpoint may encode each video image and send the encoded streams to an MCU.
At the MCU, each of the received video streams from a plurality of conferees is transferred toward an associated input video port. In addition to the conventional components of an input video port, the input video port may comprise a conferee-point-of-view detector (CPOVD). The CPOVD may detect the angle at which the conferee, at a receiving endpoint, looks at the screen and at which region of the screen the conferee is looking. The CPOVD may send the detected information toward a controller of the MCU. Based on the detected information, the controller of the MCU may select a video stream received from another camera of that transmitting endpoint and use it as the video image that is sent to the receiving endpoint. The selected camera can fit the point of view of the conferee in the receiving endpoint.
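For illustration only, the selection of a camera that fits the detected point of view may be sketched as choosing the camera whose viewing angle is closest to the detected angle. The closest-angle rule, the function name, and the mapping from camera identifiers to angles in degrees are assumptions made for this sketch; the selection criterion actually used by a controller is not specified here.

```python
def select_camera(camera_angles, gaze_angle_deg):
    """Return the identifier of the camera whose viewing angle is closest
    to the detected viewing angle of the conferee.

    camera_angles: dict mapping camera id -> angle in degrees at which that
    camera records the room (hypothetical representation).
    gaze_angle_deg: angle reported by the point-of-view detector.
    """
    return min(camera_angles,
               key=lambda cam: abs(camera_angles[cam] - gaze_angle_deg))
```

With three cameras at -30, 0, and 30 degrees, for example, a detected angle of 20 degrees selects the camera at 30 degrees in this sketch.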
Embodiments of a 3D simulation system may use morphing techniques for smoothing the transition from one video camera to the other. Morphing techniques are well known in the video processing arts and have been used for more than twenty years. To achieve good results with minimum deformation, a morphing algorithm requires a few reference points to be set for each video image. Additional information on simulating 3D video conferencing can be found in U.S. patent application Ser. No. 13/105,290, the contents of which are incorporated herein by reference.
A common technique for searching for reference points involves identifying similar objects or areas in frames received from two or more cameras. However, identifying similar objects in different frames involves high processing costs in terms of time and computing resources. The system not only needs to identify different patterns within each image, but must also compare each identified pattern with all of the identified patterns in the other image. As such, these techniques can prove too expensive or impractical for near real-time applications such as videoconferencing.
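The cost of comparing every identified pattern in one image with every identified pattern in the other may be illustrated by the following sketch of an exhaustive search. The representation of a pattern as a tuple of numbers and the squared-distance similarity measure are hypothetical and are chosen only to make the comparison count visible.

```python
def brute_force_match(patterns_a, patterns_b):
    """Match each pattern of image A to its most similar pattern of image B
    by exhaustive search, and count the comparisons performed.

    Each pattern is a tuple of numbers (a hypothetical descriptor).
    Returns (matches, comparisons), where matches is a list of
    (index_in_a, index_in_b) pairs.
    """
    comparisons = 0
    matches = []
    for i, pa in enumerate(patterns_a):
        best_j, best_dist = None, None
        for j, pb in enumerate(patterns_b):
            comparisons += 1
            # Squared Euclidean distance as an assumed similarity measure.
            dist = sum((p - q) ** 2 for p, q in zip(pa, pb))
            if best_dist is None or dist < best_dist:
                best_j, best_dist = j, dist
        matches.append((i, best_j))
    return matches, comparisons
```

The number of comparisons is the product of the two pattern counts, so it grows quadratically when both images contain similar numbers of patterns, and the search must be repeated for each pair of frames.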