As network infrastructure has developed, Video and Voice over IP (VVoIP) has become one of the most popular applications on the Internet, with millions of users worldwide. VVoIP offers three benefits over traditional video phones: 1) cost: VVoIP applications are free, or much cheaper than traditional video phones; 2) quality: a successful VVoIP application delivers quite good call quality; and 3) convenience: a device with an Internet connection is now extraordinarily easy to obtain. The most successful VVoIP applications currently include SKYPE, GTALK, MSN and EYEBEAM.
SKYPE uses a proprietary protocol stack for both signaling and media transmission. It provides a free, high-quality video call service between PCs, and a fairly good voice call service to traditional telephone networks through its Internet-PSTN gateway at a very attractive price. Although some groups have worked to analyze SKYPE's user management and media transmission policies, many details remain unknown. Because all of its protocols are proprietary, there is currently no way for anyone outside SKYPE to study and improve its media transmission.
In contrast, GTALK, Eyebeam and MSN all conform to the SIP/SDP (Session Initiation Protocol/Session Description Protocol) signaling protocol stack proposed by the IETF (Internet Engineering Task Force), which is the most popular signaling protocol stack today. The media transmission components of MSN and Eyebeam are also strictly based on standards proposed by the IETF or the ITU (International Telecommunication Union). These standards work quite well in VoIP systems that carry no video stream.
However, when a real-time video stream is introduced, the quality of the voice stream is seriously degraded. This degradation is mainly caused by the different characteristics of voice and video streams. For example, an encoded video frame can be as large as several kilobytes, while an encoded voice frame is generally no larger than 50 bytes. If video frames are sent without regard to the voice stream, the interval between the voice frames sent before and after a video frame may be lengthened enough to affect voice quality.
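The effect of frame size on voice spacing can be illustrated with a short calculation. The following is a minimal sketch, not part of the described system: it assumes a single first-in, first-out uplink shared by both streams, and the function names, the 512 kbit/s link rate, the 6000-byte video frame and the 20 ms voice interval are all hypothetical values chosen for illustration.

```python
def serialization_delay_ms(size_bytes: int, link_kbps: float) -> float:
    """Time in milliseconds needed to put size_bytes onto a link of link_kbps kbit/s."""
    # bits divided by (kilobits per second) yields milliseconds
    return size_bytes * 8 / link_kbps


def voice_gap_ms(video_bytes: int, voice_bytes: int,
                 link_kbps: float, voice_interval_ms: float) -> float:
    """Spacing on the wire between two consecutive voice frames when one
    video frame is queued directly behind the first voice frame.

    Assumption (illustrative only): one FIFO link, no other traffic.
    """
    # The link stays busy until the first voice frame and the video frame are sent.
    busy_ms = (serialization_delay_ms(voice_bytes, link_kbps)
               + serialization_delay_ms(video_bytes, link_kbps))
    # The second voice frame leaves at its nominal time or when the link
    # becomes free, whichever is later.
    return max(voice_interval_ms, busy_ms)


if __name__ == "__main__":
    gap = voice_gap_ms(video_bytes=6000, voice_bytes=50,
                       link_kbps=512, voice_interval_ms=20)
    print(f"voice frame spacing: {gap:.2f} ms (nominal 20 ms)")
```

Under these assumed values the spacing between the two voice frames grows well beyond the nominal 20 ms, whereas on a much faster link the video frame fits inside the voice interval and the spacing is unchanged.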
Some solutions have been proposed to coordinate the transmission of real-time voice and video streams. However, no published work is known on adaptive transmission of real-time voice and video streams that takes into account the on-off patterns in conversational speech. Conversational speech is a sequence of contiguous segments of speech (on-pattern) and silence (off-pattern). Parameters describing talk duration and rate in conversational speech are given in P.59, an artificial conversational speech standard proposed by the ITU in 1993. A technique called silence suppression identifies silence periods and suppresses their transmission, in order to reduce Internet traffic or to allow per-spurt playout delay adjustment.
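The on-off structure and silence suppression described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the detector used by any particular codec or by the present system: it classifies each frame of PCM samples by mean squared amplitude against a fixed threshold, and the function names and the threshold value are hypothetical. Practical detectors typically add adaptive thresholds and hangover periods.

```python
from typing import List, Sequence, Tuple


def frame_energy(samples: Sequence[int]) -> float:
    """Mean squared amplitude of one frame of PCM samples."""
    if not samples:
        return 0.0
    return sum(s * s for s in samples) / len(samples)


def suppress_silence(frames: List[Sequence[int]],
                     threshold: float = 1.0e4) -> List[int]:
    """Return the indices of frames classified as speech.

    Frames below the (hypothetical) energy threshold are treated as
    silence and would not be transmitted.
    """
    return [i for i, f in enumerate(frames) if frame_energy(f) >= threshold]


def talk_spurts(frames: List[Sequence[int]],
                threshold: float = 1.0e4) -> List[Tuple[int, int]]:
    """Group consecutive speech frames into (start, end) index ranges,
    with end exclusive. Each range is one on-pattern segment."""
    spurts: List[Tuple[int, int]] = []
    start = None
    for i, frame in enumerate(frames):
        is_speech = frame_energy(frame) >= threshold
        if is_speech and start is None:
            start = i                      # a talk spurt begins
        elif not is_speech and start is not None:
            spurts.append((start, i))      # the spurt ends at a silent frame
            start = None
    if start is not None:
        spurts.append((start, len(frames)))
    return spurts


if __name__ == "__main__":
    frames = [[0] * 4, [1000] * 4, [1200, -1200, 1100, -900], [5] * 4, [2000] * 4]
    print("frames to send:", suppress_silence(frames))
    print("talk spurts:", talk_spurts(frames))
```

The gaps between the returned ranges are the off-pattern segments, during which no voice frames would be sent.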
In one aspect of the present invention, a set of strategies for Silence-Based Adaptive Real-Time Voice and Video Transmission (SAVV) is presented. The present invention also describes in detail an SAVV client system that implements the SAVV strategies.