Multimedia conferencing or telephony, which allows remote parties to both see and hear one another, is becoming increasingly popular. Multimedia telephony refers to communications using video, audio, and/or data transmitted over a communications network. Such applications facilitate remote communications by providing a visual image of each conference participant. Accordingly, multimedia conferencing allows parties to communicate audibly and visibly without requiring lengthy and expensive travel.
In a typical multimedia telephony application, a camera is positioned to obtain an image of the participants at each endpoint of the communication. The image of the participants at one endpoint is then provided to the participants at the other endpoints. Accordingly, the multimedia telecommunications interaction can include two or more endpoints, and one or more participants at each endpoint.
The image obtained from an endpoint is displayed to participants at other endpoints. For an endpoint in communication with two or more other endpoints, most commercially available multimedia conferencing systems take one of two approaches: they either use a multi-point conferencing unit (MCU) to mix and redistribute video, audio, and data to all endpoints in a conference, or use a centralized server to switch a video or audio stream from one endpoint to all other endpoints. Almost all switching solutions send multiple streams to all other endpoints in the conference and then rely on special functionality at each receiving endpoint.
Multimedia mixing with an MCU in a conference is computationally expensive and introduces latency, which adversely affects video and/or audio quality. On the other hand, switching a single stream with a centralized server has the limitation of only allowing one speaker to be displayed per site at a time.
Scalable video coding has been developed in an effort to overcome some of the shortcomings associated with utilizing an MCU. With scalable video coding, the traditional MCU is replaced by a stream router. Each endpoint is equipped with a specialized encoder and decoder that implement scalable video coding. The centralized router examines each incoming video packet and routes it to the appropriate destination. The receiving endpoint is responsible for decoding video packets from multiple streams and mixing them together to create a composite image. This approach requires an infrastructure/hardware upgrade for every conference endpoint, so that each endpoint is equipped with the specialized encoder and decoder.
Another solution that has been developed is voice-activated or operator-selected switching. In this solution, either the loudest speaker is identified based on voice energy or a speaker is selected by the conference operator, and the multimedia stream of that identified or selected speaker is switched to all endpoints in the conference. Most multimedia conference equipment vendors support this feature, but it is limited in that only a single multimedia stream can be selected for display to all other participants.
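The selection step in voice-activated or operator-selected switching can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not any vendor's implementation: the function names `voice_energy` and `select_stream`, the use of mean squared amplitude as the energy measure, and the endpoint identifiers are all hypothetical.

```python
from typing import Dict, Optional, Sequence


def voice_energy(samples: Sequence[float]) -> float:
    """Mean squared amplitude of one audio frame (assumed energy measure)."""
    if not samples:
        return 0.0
    return sum(s * s for s in samples) / len(samples)


def select_stream(
    frames: Dict[str, Sequence[float]],
    operator_choice: Optional[str] = None,
) -> Optional[str]:
    """Return the single endpoint whose stream is switched to the conference.

    `frames` maps a hypothetical endpoint identifier to its latest audio
    frame. A valid operator choice overrides the voice-energy decision.
    """
    if operator_choice is not None and operator_choice in frames:
        return operator_choice
    if not frames:
        return None
    # Otherwise pick the endpoint with the highest voice energy.
    return max(frames, key=lambda endpoint: voice_energy(frames[endpoint]))


frames = {"A": [0.1, -0.1], "B": [0.5, -0.6], "C": [0.0, 0.0]}
loudest = select_stream(frames)                          # voice-activated
chosen = select_stream(frames, operator_choice="C")      # operator-selected
```

Whichever path is taken, the function returns exactly one endpoint, which reflects the limitation noted above: only a single multimedia stream is selected for display at any time.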
Yet another solution that has been developed is mesh switching. In mesh switching, the multimedia streams from all endpoints are collected by a centralized server and switched to all other endpoints. Each receiving endpoint is responsible for decoding and mixing the signals received from the centralized server. This method requires specialized functionality in the endpoints to decode, scale, and mix multiple streams, thereby increasing the cost of such endpoints.
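The fan-out performed by a mesh-switching server can be sketched in Python as follows. This is an illustrative sketch only: the function name `mesh_switch` and the representation of each stream as a bytes payload keyed by an endpoint identifier are assumptions made for the example.

```python
from typing import Dict


def mesh_switch(streams: Dict[str, bytes]) -> Dict[str, Dict[str, bytes]]:
    """Forward every endpoint's stream to every other endpoint.

    `streams` maps a hypothetical endpoint identifier to that endpoint's
    current payload. The result maps each receiving endpoint to the
    streams it is sent, excluding its own.
    """
    return {
        receiver: {
            sender: payload
            for sender, payload in streams.items()
            if sender != receiver
        }
        for receiver in streams
    }


conference = {"A": b"a-video", "B": b"b-video", "C": b"c-video"}
delivered = mesh_switch(conference)
# Number of streams each receiver must decode, scale, and mix itself.
streams_per_receiver = {ep: len(received) for ep, received in delivered.items()}
```

With N endpoints, each receiver in this sketch is handed N - 1 streams, which is why the decoding, scaling, and mixing burden falls on the endpoints rather than on the server.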