Video telephony (VT) involves the real-time communication of packets carrying audio and video data. A VT device may include a video encoder that obtains video from a video capture device, such as a video camera or video archive, and generates video packets. Similarly, an audio encoder in a VT device may obtain audio from an audio capture device, such as a microphone or speech synthesizer, or from an audio archive, and generate audio packets. The video packets and audio packets may be placed in a radio link protocol (RLP) queue. A medium access control (MAC) layer unit may generate MAC layer packets from the contents of the RLP queue. The MAC layer packets may be converted to physical (PHY) layer packets for transmission across a communication channel to another VT device. Video packets used for VT sessions generally conform to the real-time transport protocol (RTP). A video decoder within the VT device decodes the video data for presentation to a user via a display device. An audio decoder within the VT device decodes the audio data for output via an audio speaker. VT sessions are often carried out over an Internet protocol multimedia subsystem (IMS) network. In some instances, a VT device may be a receiver device, in that the receiver device implements video and audio decoding functionalities and outputs the decoded data, but does not encode video and audio data for transmission over the VT call.
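The transmit path described above, from encoder output through the RLP queue to MAC layer packets, can be illustrated with a minimal sketch. The function name `build_mac_packets`, the constant `MAC_PAYLOAD_BYTES`, and the tiny payload size are hypothetical and chosen only for illustration; an actual MAC layer would also add headers and handle scheduling, which are omitted here.

```python
from collections import deque

# Hypothetical payload size, kept tiny so the segmentation is easy to see.
MAC_PAYLOAD_BYTES = 4


def build_mac_packets(rlp_queue, payload_size=MAC_PAYLOAD_BYTES):
    """Drain the RLP queue and cut its contents into MAC-layer payloads.

    Simplifying assumption: queued audio and video packets are treated as one
    byte stream and split into fixed-size chunks, with no MAC headers.
    """
    stream = bytearray()
    while rlp_queue:
        stream += rlp_queue.popleft()
    return [
        bytes(stream[i:i + payload_size])
        for i in range(0, len(stream), payload_size)
    ]


# The encoders place their packets in the RLP queue.
rlp_queue = deque()
rlp_queue.append(b"VIDEO1")  # stand-in for a video packet
rlp_queue.append(b"AUD")     # stand-in for an audio packet

mac_packets = build_mac_packets(rlp_queue)
```

In this sketch, the receiving side could recover the original byte stream by concatenating the MAC payloads in order, which mirrors the reassembly of packet payloads into audio packets and video packets.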
VT devices also may be referred to as user equipment (UE). Two or more UEs may be used in a given VT session or “VT call.” In mobile VT applications, a receiving UE may be a wireless communication device that receives the physical layer packets via a wireless forward link (FL) (or “downlink”) from a base station to the receiving UE as a wireless terminal. A sending UE transmits the PHY layer packets via a wireless reverse link (RL) (or “uplink”) to a base station. Each UE includes PHY and MAC layers to convert the received PHY and MAC layer packets and reassemble the packet payloads into audio packets and video packets.
VT applications generally use a combination of intra-predicted video frames (i-frames), inter-predicted video frames (p-frames), and optionally, bidirectionally-predicted video frames (b-frames) to provide a video stream. I-frames represent full pictures. P-frames represent information that indicates differences (or so-called “delta” information) between two pictures, namely, a current picture and another picture, such as the most recent i-frame in output order. B-frames represent delta information between a current picture and two other pictures, such as the most recent past i-frame in output order and the next i-frame in output order. Because an i-frame carries a full picture rather than delta information, an i-frame generally includes a greater amount of data than a p-frame or a b-frame. In other words, an i-frame is more “data-rich” than a p-frame or a b-frame.
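The notion of delta information can be illustrated with a toy sketch in which a picture is a short list of sample values. The helpers `p_frame_delta` and `b_frame_delta` are hypothetical, and predicting a b-frame from the integer average of its two reference pictures is an assumption made for illustration; actual video codecs use motion-compensated prediction and entropy coding rather than raw per-sample differences.

```python
def p_frame_delta(current, reference):
    """Delta of the current picture with respect to one reference picture."""
    return [c - r for c, r in zip(current, reference)]


def b_frame_delta(current, past_ref, future_ref):
    """Delta of the current picture with respect to two reference pictures.

    Assumption: the prediction is the integer average of the past and
    future references.
    """
    return [
        c - (p + f) // 2
        for c, p, f in zip(current, past_ref, future_ref)
    ]


past_picture = [10, 10, 10, 10]     # e.g., most recent past i-frame
current_picture = [10, 12, 10, 10]  # picture to be coded
future_picture = [10, 14, 10, 10]   # e.g., next i-frame in output order

p_delta = p_frame_delta(current_picture, past_picture)
b_delta = b_frame_delta(current_picture, past_picture, future_picture)

# Count the samples that actually carry a change.
p_nonzero = sum(1 for d in p_delta if d != 0)
b_nonzero = sum(1 for d in b_delta if d != 0)
```

When consecutive pictures are similar, most delta values are zero, which is why a p-frame or b-frame tends to require less data than an i-frame that carries every sample of the picture.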
A sending UE may transmit i-frames at predetermined time intervals during a VT call. During the time interval between transmission of two consecutive i-frames, the sending UE may transmit p-frames, each of which represents delta information with respect to the previously-transmitted i-frame. Each b-frame that follows the i-frame includes delta information with respect to the previously-generated i-frame and the subsequent i-frame that follows the b-frame in output order. In this way, the sending UE may transmit a video stream at a given frame rate, while mitigating bandwidth consumption by using fewer data-rich i-frames. While reference pictures are described as being i-frames as an example, it will be appreciated that, in various instances, a reference picture may be a p-frame or a b-frame. That is, in such examples, the sending UE may further conserve bandwidth by signaling prediction frames (e.g., p-frames or b-frames) that provide delta information with respect to the reconstructed versions of other prediction frames (e.g., reconstructed p-frames or b-frames).
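The bandwidth benefit of sending i-frames only at predetermined intervals can be sketched as follows. The function names and the frame sizes (`I_FRAME_BYTES`, `P_FRAME_BYTES`) are hypothetical placeholders rather than values from any codec or standard, and b-frames are omitted for brevity.

```python
# Hypothetical frame sizes, chosen only to make the arithmetic easy to follow.
I_FRAME_BYTES = 10_000
P_FRAME_BYTES = 1_000


def frame_types(num_frames, i_frame_interval):
    """Return the frame type for each position in the stream.

    An i-frame is sent every `i_frame_interval` frames; p-frames fill the gaps.
    """
    return [
        "I" if n % i_frame_interval == 0 else "P"
        for n in range(num_frames)
    ]


def stream_bytes(types):
    """Total bytes for a sequence of frame types, using the sizes above."""
    return sum(I_FRAME_BYTES if t == "I" else P_FRAME_BYTES for t in types)


schedule = frame_types(num_frames=7, i_frame_interval=3)
mixed_total = stream_bytes(schedule)
all_i_total = stream_bytes(["I"] * 7)
```

Under these assumed sizes, the mixed schedule delivers the same number of frames as an all-i-frame stream while using roughly half the bytes; the actual saving depends on the content and on the i-frame interval.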
A receiving UE may receive a VT packet flow from one or more sending UEs. The video decoder of the receiving UE may reconstruct a received i-frame to render an image at an instant of time. In turn, the video decoder of the receiving UE may reconstruct subsequent frames of the video stream by applying the delta information of the p-frames and/or b-frames that follow the reconstructed i-frame in the received VT packet flow. The video decoder may continue to construct the video stream using progressive p-frames until a new i-frame is received for that particular packet flow. Upon receiving a new i-frame, the video decoder of the receiving UE may restart the process of constructing a subsequent series of pictures by applying the delta information of later-received p-frames and/or b-frames. In this way, VT applications enable UEs to support VT sessions while conserving bandwidth through the combined use of i-frames, p-frames, and b-frames.
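The decoder-side behavior described above can be sketched with the same toy picture model, in which a picture is a list of sample values. The function `decode_stream` and the `(kind, payload)` tuple layout are hypothetical. This sketch assumes each p-frame delta applies to the most recently reconstructed picture, which is one of the arrangements the description allows, and it omits b-frames for brevity.

```python
def decode_stream(packets):
    """Reconstruct pictures from a sequence of (kind, payload) tuples.

    "I" payloads carry a full picture and reset the decoder state.
    "P" payloads carry per-sample deltas applied to the most recently
    reconstructed picture (an assumption of this sketch).
    """
    pictures = []
    current = None
    for kind, payload in packets:
        if kind == "I":
            current = list(payload)
        elif kind == "P":
            if current is None:
                # No reference picture yet, so the delta cannot be applied.
                continue
            current = [sample + delta for sample, delta in zip(current, payload)]
        else:
            raise ValueError(f"unsupported frame type: {kind}")
        pictures.append(list(current))
    return pictures


received = [
    ("I", [10, 10, 10]),
    ("P", [1, 0, -1]),
    ("P", [0, 2, 0]),
    ("I", [50, 50, 50]),  # a new i-frame restarts reconstruction
    ("P", [-5, 0, 5]),
]

decoded = decode_stream(received)
```

The second i-frame replaces the decoder state outright, so the p-frame that follows it is applied to the new full picture rather than to anything reconstructed earlier.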