1. Field of Invention
Embodiments of the invention relate in general to communication techniques for conferencing. More specifically, embodiments of the invention relate to methods and systems for secure communication in conferencing.
2. Description of the Background Art
Modern communication techniques enable multiple endpoints at remote locations to communicate simultaneously over conference calls. The endpoints correspond to the speakers as well as to the listeners of a conference call. A conference call may be organized, for example, as a teleconference or a videoconference. A teleconference may be supported by audio, video and data transmission devices such as a telephone, a radio, a television or a computer. A videoconference may be supported by telephony and video devices such as a web cam and a closed-circuit television.
In conference calls with multiple endpoints, not all of the endpoints are generally considered to be speakers. Endpoints corresponding to speakers are hereinafter referred to as relevant endpoints, which are selected based on predefined parameters. For example, the predefined parameters may include First Come First Serve (FCFS) ordering and a comparison of the noise from an endpoint with a preset noise threshold. Further, the predefined parameters may include a classification of the signals from the endpoints as speech or silence. This classification is performed by a Voice Activity Detector (VAD).
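By way of illustration only, the speech or silence classification mentioned above can be sketched as a simple level comparison. The sketch assumes that a frame is a list of PCM sample values and that a frame is treated as speech when its root-mean-square level exceeds a preset threshold. The function names and the threshold value are hypothetical, and practical Voice Activity Detectors typically rely on more elaborate criteria.

```python
import math


def rms_level(samples):
    """Return the root-mean-square level of one frame of PCM samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def classify_frame(samples, threshold=500.0):
    """Label a frame as 'speech' or 'silence'.

    The threshold is a hypothetical preset value; a frame whose RMS level
    exceeds it is labeled 'speech', and any other frame is labeled 'silence'.
    """
    return "speech" if rms_level(samples) > threshold else "silence"
```

In this sketch a frame of all-zero samples is classified as silence, while a frame whose samples alternate between +1000 and -1000 is classified as speech under the default threshold.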
In a conventional method for secure conferencing, ‘N’ data streams are generated by the N endpoints and are decrypted at the conference. Decryption, dejitter, decoding, and VAD processing are applied to each stream, so that a speaker selection algorithm may select up to ‘M’ data streams as the active speakers in the conference. When more than M data streams are active, the speaker selection algorithm may use additional criteria, such as the relative loudness, to make the selection. The path of a data stream through such a conference, from a source endpoint to a receiver endpoint, is as follows. Initially, Secure Real-time Transport Protocol (SRTP) data streams are generated by all the endpoints, and the generated data streams are then decrypted. Thereafter, the data streams are decoded, and Voice Activity Detector (VAD) processing is applied to select the relevant endpoints. If M endpoints are selected from the N endpoints, then the audio of the M selected endpoints is mixed. Thereafter, the mixed data streams are encoded. Further, the encoded streams are encrypted. Finally, the resulting SRTP streams are received at the endpoints.
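The speaker selection step described above can be sketched as follows. This is a minimal illustration and not a definitive implementation: it assumes that each stream has already been decrypted, dejittered, decoded and classified by the VAD, and that relative loudness is the only additional criterion. The names StreamState, select_speakers, is_speech and loudness are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StreamState:
    endpoint_id: str
    is_speech: bool   # VAD result for the current interval (assumed available)
    loudness: float   # relative loudness, in hypothetical units


def select_speakers(streams, m):
    """Return the identifiers of up to m endpoints chosen as active speakers."""
    # Only streams that the VAD classified as speech are candidates.
    active = [s for s in streams if s.is_speech]
    if len(active) <= m:
        return [s.endpoint_id for s in active]
    # More than m streams are active: fall back to relative loudness.
    loudest = sorted(active, key=lambda s: s.loudness, reverse=True)[:m]
    return [s.endpoint_id for s in loudest]
```

Note that every one of the N streams must be fully processed before this function can be called, even though at most M of them are passed on to the mixer.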
Therefore, in the conventional method, the conference incurs the cost of decryption, dejitter, and VAD processing for all N endpoints, even though N−M of those endpoints are not treated as relevant endpoints in the conference.
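The overhead described above can be made concrete with a short calculation. The sketch below assumes that each stream incurs the same per-stream processing cost. The function name and the example values of N and M are hypothetical and are chosen only for illustration.

```python
def discarded_processing(n, m):
    """Return (discarded_streams, discarded_fraction) for n endpoints and m speakers.

    Assumes every stream costs the same to decrypt, dejitter and analyze,
    so the fraction of streams discarded equals the fraction of work discarded.
    """
    if n <= 0 or not 0 <= m <= n:
        raise ValueError("require n > 0 and 0 <= m <= n")
    discarded = n - m
    return discarded, discarded / n


# Hypothetical conference: 50 endpoints, of which 3 are selected as speakers.
discarded, fraction = discarded_processing(50, 3)
print(discarded, fraction)  # 47 0.94
```

Under these assumed values, 47 of the 50 streams are processed only to be excluded from the mix.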
Conventionally, in a conference that has multiple endpoints, a receiving endpoint cannot monitor the decision of when to switch between the listener endpoint mix stream and one of the speaker endpoint mix streams. Therefore, it is difficult for an endpoint to receive any of the data streams selectively.