In this context, videotelephony (VT) in general concerns full-duplex, real-time audio-video communication between two or among several end users, where the communication consists of audio (e.g. speech) and video, or a combination of audio, data and video.
In the past, so-called videoconferencing was limited to the H.323 protocol for packet-based multimedia communications systems. H.323 is a protocol suite defined by the International Telecommunication Union (ITU) Telecommunication Standardization Sector (ITU-T) for audio-visual communication sessions on any packet-based data network, such as the internet, where voice transmission using the internet protocol (IP) is also known as Voice over IP (VoIP). In addition to voice applications, H.323 provides mechanisms for video communication and, in combination with the T.120 series standards of the ITU-T, for data collaboration. In short, H.323 specifies how real-time services may be implemented over IP networks in three major steps: signalling under the H.225 protocol for agents to request access to the H.323 domain; signalling under the H.245 protocol for the call setup, including the media streams to be used; and, finally, data transport using the Real-time Transport Protocol (RTP), which is an internet protocol standard defining a way for applications to manage real-time transmission of multimedia data.
The components of the H.323 architecture are terminal(s) (T), gateway(s) (GW), gatekeeper(s) (GK) and multipoint control unit(s) (MCU) for establishing multipoint conferences. Terminals represent the end devices of every communication connection and provide real-time two-way communication with another H.323 terminal, gateway or multipoint control unit. Gateways establish the connection between the terminal(s) in the H.323 network and terminals belonging to networks using a different protocol stack, such as a public switched telephone network (PSTN). Gatekeepers are responsible for translating between telephone numbers and IP addresses, for managing bandwidth, and for providing mechanisms for terminal registration and authentication.
Generally, there are five types of information exchange in the H.323 architecture, namely digitized audio (e.g. speech or voice), digitized video, data, communication control, and control of connections and sessions, where the main focus herein is the combination of audio and video for videotelephony.
Among the protocols contained in the H.323 protocol suite there are specialized protocols for video processing, for instance H.261, which specifies a video codec for audiovisual services at p×64 kbit/s, and H.263, which concerns video coding for low bit rate communication. At the moment, the most commonly used video codecs are H.263 and its successor, H.264/MPEG-4 AVC.
In videotelephony, transmitted video data consists of a sequence of images, where an individual image is known as a “frame”. To reduce the amount of video data to be transmitted, three major types of encoded frames are used. First, an I-frame is basically one encoded still image, which can consequently be decoded on its own to recover the full still image. Secondly, a P-frame is encoded as a difference from one or more preceding I-frame(s) or P-frame(s). Thirdly, a B-frame is also coded as differences, but from preceding and/or following I-frames or P-frames. Since the coding of P-frames and B-frames is based on the coding of differences, it is known as predictive video (en)coding. On the one hand, this provides data compression by removing temporal redundancy in a video image sequence. On the other hand, it is also a weak point for quality in case of disturbances during transmission of the video data, because a corrupted frame also affects the frames that depend on it.
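The dependency between frame types can be illustrated with a small sketch. It assumes a deliberately simplified model that is not taken from any codec specification: a P-frame references only the nearest preceding I- or P-frame, and a B-frame references the nearest preceding and the nearest following I- or P-frame. The function name `decodable_frames` and the frame sequences are hypothetical.

```python
def decodable_frames(frame_types, lost):
    """Return one boolean per frame: True if the frame can be decoded correctly.

    Simplified, assumed model (not a codec specification):
      - "I" frames depend on nothing,
      - "P" frames depend on the nearest preceding I- or P-frame,
      - "B" frames depend on the nearest preceding and following I- or P-frame.
    `lost` is a set of frame indices corrupted during transmission.
    """
    n = len(frame_types)
    ok = [False] * n

    # First pass: resolve the anchor frames (I and P) in transmission order.
    prev_anchor_ok = False
    for i, kind in enumerate(frame_types):
        if kind == "I":
            ok[i] = i not in lost
            prev_anchor_ok = ok[i]
        elif kind == "P":
            ok[i] = (i not in lost) and prev_anchor_ok
            prev_anchor_ok = ok[i]

    # Second pass: a B-frame needs both neighbouring anchors to be intact.
    for i, kind in enumerate(frame_types):
        if kind == "B":
            prev_ok = next(
                (ok[j] for j in range(i - 1, -1, -1) if frame_types[j] in ("I", "P")),
                False,
            )
            next_ok = next(
                (ok[j] for j in range(i + 1, n) if frame_types[j] in ("I", "P")),
                False,
            )
            ok[i] = (i not in lost) and prev_ok and next_ok
    return ok


# Losing a single P-frame corrupts every frame up to the next I-frame.
result = decodable_frames(list("IPPPIPP"), lost={1})
print(result)
```

In this model, the loss of frame 1 makes frames 1 to 3 undecodable, and the picture only recovers at the next I-frame (frame 4). This is the weak point of predictive coding described above.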
Another protocol used for videotelephony by videophones is H.324. A slightly modified version of H.324, which is also known as 3G-324M and which has been defined by the 3rd Generation Partnership Project (3GPP), is used by cell phones that allow video calls. At present, it is typically used in packet based data networks of the Universal Mobile Telecommunications System (UMTS), such as the frequency division duplex (FDD), time division duplex (TDD) and low chip rate time division duplex (LCR-TDD) implementations of the UMTS and beyond. This standard comprises several sub-protocols that handle multiplexing and demultiplexing of speech, video, user, and control data (cf. H.223 protocol) as well as in-band call control (cf. H.245 protocol).
As regards mobile videotelephony, the term “mobile” indicates that there is at least one mobile terminal, which is connected via a radio link or radio connection. Accordingly, interference may induce errors in the video bit streams. As mentioned before, users readily notice audio and video interruptions and/or corruptions. Thus, the video quality experienced by the user can be significantly degraded when corruption lasts for several seconds, which depends on the frequency of transmitted I-frames. However, a higher frequency of I-frames is not desirable, because an I-frame requires more bandwidth than a B- or P-frame.
For example, videotelephony in a UMTS environment relies on a synchronous bearer at 64 kbps with no retransmission at the radio link control (RLC) layer, also called RLC in transparent mode (RLC-TM). The UMTS bearer supports sending and receiving bursts of two times 80 bytes every 20 ms. For each burst, there may be one voice frame, which is independent of the previous bursts, and one part of a video frame. Voice frames are independent of each other at a 20 ms pace because the voice codec is based on a pseudo-stationary voice scheme at 20 ms for adaptive multi-rate (AMR) coding and 30 ms for voice coding according to the G.723.1 protocol. The videotelephony bearer relies on the UMTS protocol stack and the videotelephony session relies on the videotelephony protocol stack of the H.245 protocol; the two protocol stacks are independent of each other. The H.245 stack is normally transparent for the UMTS protocol stack.
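The burst size stated above follows directly from the bearer rate, as the following arithmetic sketch shows. The constant and function names are hypothetical; only the 64 kbps rate and the 20 ms interval are taken from the text.

```python
BEARER_RATE_BPS = 64_000   # synchronous bearer rate from the text (64 kbps)
BURST_INTERVAL_MS = 20     # burst interval from the text


def bytes_in_interval(rate_bps, interval_ms):
    """Number of whole bytes the bearer carries in `interval_ms` milliseconds."""
    bits = rate_bps * interval_ms // 1000
    return bits // 8


# 64 000 bit/s * 0.020 s = 1280 bits = 160 bytes, i.e. two bursts of 80 bytes.
per_interval = bytes_in_interval(BEARER_RATE_BPS, BURST_INTERVAL_MS)

# With no RLC retransmission, data sent during an interruption is simply lost;
# an interruption of 100 ms, for example, corresponds to five such intervals.
lost_in_100_ms = bytes_in_interval(BEARER_RATE_BPS, 100)

print(per_interval, lost_in_100_ms)
```

Under these figures the bearer carries 160 bytes per 20 ms, and an interruption of 100 ms corresponds to 800 bytes of multiplexed voice and video data.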
The H.245 protocol stack serves for control of multimedia communication by means of messages and procedures used for opening and closing logical channels (multiplexed paths between the endpoints used for data transfer) for audio, video and data, and for capability exchange, control and indications. After a connection has been set up via the call signalling procedure, the H.245 call control protocol is used to resolve the call media type and establish the media flow before the call can be established. The H.245 protocol is also used to manage the call after it has been established. The H.245 protocol provides several logical channel procedures, which are used for opening and closing logical channels. Further, H.245 provides, among others, the “Video fast update” command, which corresponds to the above-mentioned “VideoFastUpdate” request and which is used for requesting updates of video frames in case of data loss.
As discussed above, video frames, except for I-frames, may rely on previous video frames. During an intra-RAT (radio access technology) UMTS hard handover, the interruption time of the bearer may be more than 100 ms. In some cases, in particular in case of radio link interruption and radio link failure, the interruption of the bearer may last several seconds. Further, the distant bearer (i.e., that of the distant terminal) is not aware of this interruption as long as the bearer is considered as established by the network. Moreover, none of the communicating parties of an ongoing videotelephony connection is usually aware of this interruption time when the bearer is back.
That is to say, since I-frames are not generated very often, for the above-discussed reason of the bearer bit rate limitation, it may generally take some time to get back a proper video after an interruption of the bearer, e.g. one caused by a handover, has ended.
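The delay until a proper picture is available again can be estimated with a rough sketch. It assumes, purely for illustration, that I-frames are sent periodically at times 0, T, 2T, and so on, that no fast update is requested, and that the transmission time of the I-frame itself is ignored. The function name and the example values are hypothetical.

```python
def wait_until_next_iframe_ms(iframe_period_ms, resume_ms):
    """Time in ms from the bearer resuming until the next periodic I-frame starts.

    Assumed model: I-frames start at 0, T, 2T, ... (T = iframe_period_ms),
    and `resume_ms` is the time at which the bearer is usable again.
    """
    offset = resume_ms % iframe_period_ms
    return 0 if offset == 0 else iframe_period_ms - offset


# Hypothetical values: one I-frame every 5 s, bearer back 1.2 s after an I-frame.
print(wait_until_next_iframe_ms(5000, 1200))
```

With these hypothetical values the receiver waits a further 3800 ms for a decodable picture, even though the bearer itself is already available. In the worst case of this model the wait approaches one full I-frame period.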