In modern communications systems a video signal may be sent from one terminal to another over a medium such as a wired and/or wireless network, often a packet-based network such as the Internet. In many situations it is desirable to encode and transmit the video in real time, i.e. video of some event or stream of content continues to be encoded in an ongoing fashion while preceding, previously encoded video data of that same event or stream of content is transmitted (as opposed to a whole video file being encoded in one go and then subsequently transmitted). Hence one frame of the video may be encoded while one of the immediately preceding, already-encoded frames is transmitted (or buffered for transmission), and so forth. Put another way, the video is transmitted “as and when” it is encoded. “Real-time” as used herein does not necessarily imply zero delay. Nonetheless, the user does expect the video to be encoded, transmitted and decoded at least as quickly as the event being captured actually occurs, and at least as quickly as the video is intended to play out (on average over several frames). An example of real-time video communication would be a live video call or other live transmission, where the video is also captured in real time as it is encoded and transmitted.
The frames of the video are encoded by the encoder at the transmitting terminal in order to compress them for transmission over the network or other medium. Compression is particularly relevant for real-time video communication, although other reasons to compress a video signal include reducing the size of a video file for upload, download or storage on a storage medium.
The encoding commonly comprises prediction coding in the form of intra-frame prediction coding, inter-frame prediction coding, or more usually a combination of the two (e.g. a few intra-frame encoded “key” frames interleaved between sequences of inter-frame encoded frames). According to intra-frame encoding, blocks are encoded relative to other blocks in the same frame. In this case a target block is encoded in terms of a difference (the residual) between that block and another block in the same frame, e.g. a neighbouring block. The residual is typically smaller in magnitude than the absolute sample values of the block and so requires fewer bits to encode, and the smaller the residual the fewer bits are incurred in the encoding. According to inter-frame encoding, blocks in the target frame are encoded relative to corresponding portions in a preceding frame, typically based on motion prediction. In this case a target block is encoded in terms of a motion vector identifying an offset between the block and the corresponding portion from which it is to be predicted, and a difference (the residual) between the block and the corresponding portion from which it is predicted. Inter-frame encoding usually results in an even smaller residual than intra-frame encoding, and hence incurs even fewer bits.
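By way of illustration only, the inter-frame case described above may be sketched as follows. This is a minimal sketch under stated assumptions, not the method of any particular codec: frames are taken to be lists of rows of integer samples, the best match is found by an exhaustive search minimising the sum of absolute differences (SAD), and the names `get_block`, `sad`, `predict_inter` and `reconstruct` are hypothetical helpers introduced here.

```python
def get_block(frame, top, left, size):
    """Return the size x size block whose top-left sample is (top, left)."""
    return [row[left:left + size] for row in frame[top:top + size]]


def sad(a, b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def predict_inter(target, reference, top, left, size, search):
    """Encode one target block relative to a reference (preceding) frame.

    Assumption: exhaustive search over offsets within +/- `search` samples,
    choosing the candidate with the lowest SAD. Returns (motion_vector,
    residual), where motion_vector is the (dy, dx) offset from the block
    position to the matching portion of the reference frame.
    """
    block = get_block(target, top, left, size)
    height, width = len(reference), len(reference[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            r, c = top + dy, left + dx
            # Skip candidates that fall partly outside the reference frame.
            if r < 0 or c < 0 or r + size > height or c + size > width:
                continue
            candidate = get_block(reference, r, c, size)
            cost = sad(block, candidate)
            if best is None or cost < best[0]:
                best = (cost, (dy, dx), candidate)
    _, motion_vector, prediction = best
    # The residual is the sample-wise difference between block and prediction.
    residual = [[x - p for x, p in zip(rb, rp)]
                for rb, rp in zip(block, prediction)]
    return motion_vector, residual


def reconstruct(reference, top, left, size, motion_vector, residual):
    """Decoder side: prediction from the reference frame plus the residual."""
    dy, dx = motion_vector
    prediction = get_block(reference, top + dy, left + dx, size)
    return [[p + r for p, r in zip(rp, rr)]
            for rp, rr in zip(prediction, residual)]
```

When the content of the block has simply shifted between frames, the residual found this way is all zeros and only the motion vector carries information, which is why inter-frame encoding tends to incur few bits.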
A corresponding decoder at the receiver decodes the frames of the received video signal based on the appropriate type of prediction, in order to decompress them for output to a screen. A generic term that may be used to refer to an encoder and/or decoder is a codec.
A frame may be captured by the camera on the encoding side in a colour space based on a three-colour model such as RGB (Red, Green, Blue). This means each pixel is represented in terms of an intensity of a red (R) channel, an intensity of a green (G) channel and an intensity of a blue (B) channel. However, it is also possible to consider a pixel in terms of only two colour channels, which may be referred to as chrominance or chroma channels, and one achromatic channel representing the overall light level of the pixel, e.g. in terms of brightness or lightness. For example, the two chrominance channels may be red and blue channels. The achromatic channel may be referred to as the luminance or luma channel. In some contexts the term luminance is used specifically to refer to a non-gamma-corrected level whilst luma is used to refer to a gamma-corrected level. However, in this disclosure luminance may be used as a general term for a gamma-corrected or uncorrected level. Chroma and chrominance may also be used interchangeably with one another. An example of such a colour space is YUV, where Y refers to the luminance channel, U the blue chrominance channel and V the red chrominance channel. Other similar colour space models will be familiar to a person skilled in the art. For example, in HSV the colour channels are hue (H) and saturation (S) and the achromatic light-level channel is brightness value (V). In HSL the colour channels are hue (H) and saturation (S) and the achromatic light-level channel is lightness (L).
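As a purely illustrative sketch of the relationship between an RGB representation and a luminance-chrominance representation, one possible conversion is shown below. The weights are an assumption not taken from the text above: they are the commonly quoted ITU-R BT.601 luma weights together with the classic analogue YUV scale factors, and the function names `rgb_to_yuv` and `yuv_to_rgb` are hypothetical. Other standards use different weights and offsets.

```python
# Assumed weights (ITU-R BT.601 luma, classic analogue YUV scaling).
KR, KG, KB = 0.299, 0.587, 0.114
U_SCALE, V_SCALE = 0.492, 0.877


def rgb_to_yuv(r, g, b):
    """Convert one RGB pixel to (Y, U, V).

    Y is a weighted sum of the three colour channels (overall light level);
    U is a scaled blue-minus-luminance difference and V a scaled
    red-minus-luminance difference.
    """
    y = KR * r + KG * g + KB * b
    u = U_SCALE * (b - y)
    v = V_SCALE * (r - y)
    return y, u, v


def yuv_to_rgb(y, u, v):
    """Invert rgb_to_yuv by solving the same three equations for R, G, B."""
    r = y + v / V_SCALE
    b = y + u / U_SCALE
    g = (y - KR * r - KB * b) / KG
    return r, g, b
```

Under these assumptions a grey pixel (equal R, G and B) has both chrominance channels at approximately zero, so all of its information is carried by the luminance channel.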
Prior to encoding, a frame is often explicitly transformed into a luminance-chrominance type colour-space representation (as well as being transformed from a spatial domain representation in terms of pixel coordinates into a spatial frequency domain representation in terms of a set of frequency coefficients, and being quantized). Alternatively, it is not precluded that the video is captured directly in YUV space, or is converted to YUV or the like after being captured in some colour space other than RGB. Even if not explicitly captured or encoded in a YUV type space, it is still possible to describe or consider an image in an alternative colour space such as YUV.
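The transformation into the spatial frequency domain and the subsequent quantization mentioned above may be sketched, purely by way of example, as follows. The text does not name a particular transform; this sketch assumes an orthonormal type-II discrete cosine transform (DCT) applied separably to a square block, and uniform quantization with a single step size. The helper names are hypothetical.

```python
import math


def dct_1d(x):
    """Orthonormal DCT-II of a list of samples (assumed transform)."""
    n = len(x)
    return [
        math.sqrt((1 if k == 0 else 2) / n)
        * sum(x[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
              for i in range(n))
        for k in range(n)
    ]


def idct_1d(c):
    """Inverse of dct_1d (orthonormal DCT-III)."""
    n = len(c)
    return [
        sum(math.sqrt((1 if k == 0 else 2) / n) * c[k]
            * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
            for k in range(n))
        for i in range(n)
    ]


def transpose(m):
    return [list(col) for col in zip(*m)]


def transform_2d(block, f):
    """Apply the 1-D transform f to every row, then to every column."""
    rows = [f(row) for row in block]
    cols = [f(col) for col in transpose(rows)]
    return transpose(cols)


def quantize(coeffs, step):
    """Map each coefficient to an integer level (uniform step, assumed)."""
    return [[round(c / step) for c in row] for row in coeffs]


def dequantize(levels, step):
    """Approximate the original coefficients from the integer levels."""
    return [[level * step for level in row] for row in levels]


# Usage: a flat 4x4 block keeps only its lowest-frequency (DC) coefficient.
block = [[100] * 4 for _ in range(4)]
levels = quantize(transform_2d(block, dct_1d), 10)
restored = transform_2d(dequantize(levels, 10), idct_1d)
```

Quantization is the lossy step: coefficients smaller than about half the step size are rounded to zero, so a larger step gives fewer non-zero levels to encode at the cost of a coarser reconstruction.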