In digital video systems, such as network camera monitoring systems, video sequences are compressed before transmission using various video encoding methods. In many digital video encoding systems, two main modes are used for compressing the video frames of a sequence: intra mode and inter mode. In intra mode, the luminance and chrominance channels (or in some cases RGB or Bayer data) are encoded by exploiting the spatial redundancy of the pixels in a given channel of a single frame via prediction, transform, and entropy coding. The encoded frames are called intra-frames, and may also be referred to as I-frames. Within an intra-frame, blocks of pixels, also referred to as macroblocks, coding units or coding tree units, are encoded in intra mode, meaning that they are encoded with reference to a similar block within the same image frame, or raw-coded with no reference at all. Inter mode instead exploits the temporal redundancy between separate frames, and relies on a motion-compensated prediction technique that predicts parts of a frame from one or more previous frames by encoding the motion, in pixels, from one frame to another for selected blocks of pixels. The encoded frames are called inter-frames, and may be referred to as P-frames (forward-predicted frames), which can refer to previous frames in decoding order, or B-frames (bi-directionally predicted frames), which can refer to two or more previously decoded frames that may have any display-order relationship to the frame being predicted. Within an inter-frame, blocks of pixels may be encoded either in inter mode, meaning that they are encoded with reference to a similar block in a previously decoded image, or in intra mode, meaning that they are encoded with reference to a similar block within the same image frame, or raw-coded with no reference at all.
The encoded image frames are arranged in groups of pictures, or GOPs for short. Each group of pictures starts with an intra-frame, which does not refer to any other frame. The intra-frame is followed by a number of inter-frames, which do refer to other frames. As mentioned above, there are different kinds of inter-frames. A P-frame uses one or more previously encoded and decoded image frames as reference, each appearing before the P-frame in display order. A B-frame uses two or more reference frames, and one of the reference frames may for instance be displayed before the B-frame, whereas the other reference frame is displayed after the B-frame. Image frames do not necessarily have to be encoded and decoded in the same order as they are captured or displayed. The only inherent limitation is that for a frame to serve as reference frame, it has to be decoded before the frame that is to use it as reference can be encoded. In surveillance or monitoring applications, encoding is generally done in real time, meaning that the most practical approach is to encode and decode the image frames in the same order as they are captured and displayed, as there will otherwise be undesired latency.
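The ordering limitation described above may be illustrated by the following minimal sketch. The function name `is_valid_coding_order`, the frame labels, and the dictionary of references are hypothetical and chosen for illustration only; they are not taken from any particular codec.

```python
def is_valid_coding_order(coding_order, references):
    """Check that every frame is coded after all frames it refers to.

    coding_order: list of frame labels in the order they are encoded/decoded.
    references:   dict mapping a frame label to the labels it uses as reference.
    """
    decoded = set()
    for frame in coding_order:
        # A frame can only be coded once all of its reference frames are decoded.
        if any(ref not in decoded for ref in references.get(frame, ())):
            return False
        decoded.add(frame)
    return True


# Hypothetical three-frame example; the digit in each label is the display position.
references = {
    "I0": [],            # intra-frame, no reference
    "B1": ["I0", "P2"],  # B-frame referring to one earlier and one later displayed frame
    "P2": ["I0"],        # P-frame referring to the intra-frame
}

display_order = ["I0", "B1", "P2"]
reordered = ["I0", "P2", "B1"]

print(is_valid_coding_order(display_order, references))
print(is_valid_coding_order(reordered, references))
```

In this example, the B-frame cannot be coded in display order, because one of its reference frames is displayed after it. The encoder therefore has to wait for the later frame to be captured, which is the source of the latency mentioned above.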
Some codecs also use another kind of inter-frame, which is sometimes referred to as a refresh-frame or an R-frame. In the same way as other inter-frames, a refresh-frame generally uses intra mode encoding for moving parts of the image, whereas static parts or background parts are encoded using inter mode encoding. Unlike a P-frame, an R-frame does not use the nearest preceding decoded P-frame as reference frame, but refers back to the I-frame at the start of the GOP. In this manner, errors or artefacts that propagate as the distance from the I-frame increases are reset. As a result, the next P-frame in the GOP may get a better starting point, leading to a lower number of bits for representing the P-frame. Another advantage of using R-frames is that it gives more flexibility in playback of the encoded video sequence. If a user wants to play back an encoded video sequence, the video sequence has to be decoded. In order to be able to decode a particular frame, its reference frame has to be decoded first. In a video sequence with a GOP structure using only I-frames and P-frames (and possibly B-frames), the I-frame at the start of the GOP has to be decoded before any of the subsequent image frames in the GOP can be decoded. If long GOP lengths are used, the time it takes to decode all preceding image frames in the GOP may be considerable when the user wishes to start playback at a point in time in the video sequence that happens to be close to the end of a GOP. If R-frames are encoded at regular or irregular intervals along the GOP, fewer frames have to be decoded before playback can start, as only the I-frame starting the GOP, the R-frame that most closely precedes the chosen playback start, and the frames from that R-frame up to the playback start have to be decoded, and not the frames between the I-frame and that R-frame.
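The playback advantage described above may be illustrated by the following minimal sketch. It assumes a simplified reference model that is not taken from any particular codec: an I-frame has no reference, a P-frame refers to the immediately preceding frame, and an R-frame refers to the I-frame at the start of its GOP. B-frames are not modelled. The function name `decode_chain` and the single-letter frame types are hypothetical.

```python
def decode_chain(frame_types, target):
    """Return the indices of the frames that must be decoded, in decoding
    order, before the frame at index `target` can be displayed.

    frame_types: sequence of "I", "P" or "R", one entry per frame.
    """
    chain = []
    i = target
    while True:
        if i < 0:
            raise ValueError("no I-frame found before the target frame")
        chain.append(i)
        kind = frame_types[i]
        if kind == "I":
            break  # an I-frame needs no reference
        if kind == "P":
            i -= 1  # assumed: a P-frame refers to the frame just before it
        elif kind == "R":
            # assumed: an R-frame refers to the I-frame that starts its GOP
            i -= 1
            while i >= 0 and frame_types[i] != "I":
                i -= 1
        else:
            raise ValueError(f"unsupported frame type: {kind!r}")
    return chain[::-1]


# Two hypothetical GOPs of ten frames, without and with an R-frame at index 6.
gop_without_r = list("IPPPPPPPPP")
gop_with_r = list("IPPPPPRPPP")

print(len(decode_chain(gop_without_r, 9)), decode_chain(gop_without_r, 9))
print(len(decode_chain(gop_with_r, 9)), decode_chain(gop_with_r, 9))
```

Under these assumptions, starting playback at the last frame of the GOP requires decoding all ten frames when only P-frames are used, but only five frames when the R-frame is present: the I-frame, the R-frame, and the three P-frames following the R-frame.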
A possible disadvantage of using R-frames is that a decoded version of the I-frame at the start of the GOP has to be retained as reference frame for all R-frames in the GOP, whereas if only P-frames are used, only the decoded version of the previous frame has to be retained, and it can be continuously replaced as image frames are encoded. Thus, the use of R-frames requires retaining two possible reference frames, whereas only one reference frame has to be retained if only P-frames are used.
Encoding is often controlled by a rate controller, which may employ a constant bitrate (CBR), a maximum bitrate (MBR), or a variable bitrate (VBR). CBR means that the encoder will strive to always output the same bitrate, regardless of what happens in the captured scene. If bandwidth is limited, this may lead to low quality images when there is motion in the scene, but high quality images when the scene is static. In a surveillance or monitoring situation, this is generally not useful, as a scene with motion is normally of more interest than a static scene. With MBR, the bitrate is allowed to vary, as long as it does not exceed the set bitrate limit. The problems related to this approach are similar to the ones associated with CBR. If the MBR limit is set too low, images of a scene with motion may be of low quality. However, if the limit is set higher, in order to accommodate the motion, the output bitrate may be unnecessarily high when encoding images of a static scene. VBR may also be referred to as constant quality bitrate, meaning that the quality of the encoded images should be kept constant, but the output bitrate is allowed to vary depending on what is happening in the scene. This approach may lead to a high output bitrate when there is motion in the scene. This is particularly problematic if bandwidth is limited, such as when transmitting encoded images over a mobile network. Similarly, it is problematic if storage is limited, such as when storing images on board the camera, e.g., on an SD card. High output bitrates may also be problematic in large systems of cameras if several cameras transmit images of scenes with motion simultaneously.
Regardless of the bitrate control scheme used by the rate controller, one of the parameters that the encoder can adjust in order to comply with the bitrate set by the rate controller is the GOP length. In some applications, the GOP length is set manually, by user input. In others, it is determined dynamically, e.g., based on image analysis. A longer GOP length generally gives a lower output bitrate, since inter-frames generally require fewer bits for representation than intra-frames. However, the inventors of the present disclosure have discovered that this is not always true. In some instances, increasing the GOP length may in fact not give rise to the desired bitrate reduction. The output bitrate may be decreased, but not as much as would have been expected based on the size of the GOP length increase. This is detrimental in that the bandwidth requirement may become unnecessarily high, while at the same time image quality is low. Hence, there is a need for an improved encoding method.
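The expectation against which the above observation is made may be illustrated by the following minimal sketch. It assumes an idealised model in which every GOP contains one I-frame followed by P-frames of constant size; this model, the function name `expected_bitrate`, and the frame sizes and frame rate used in the example are hypothetical and serve only to show how the output bitrate would be expected to depend on GOP length.

```python
def expected_bitrate(gop_length, i_frame_bits, p_frame_bits, fps):
    """Return the expected output bitrate in bits per second for a GOP of
    `gop_length` frames: one I-frame followed by (gop_length - 1) P-frames,
    each P-frame assumed to have the same size.
    """
    if gop_length < 1:
        raise ValueError("gop_length must be at least 1")
    bits_per_gop = i_frame_bits + (gop_length - 1) * p_frame_bits
    # Average bits per frame multiplied by the number of frames per second.
    return bits_per_gop * fps / gop_length


# Hypothetical figures: 400 kbit per I-frame, 20 kbit per P-frame, 30 frames/s.
for gop_length in (30, 60, 120):
    rate = expected_bitrate(gop_length, 400_000, 20_000, 30)
    print(gop_length, rate)
```

In this idealised model, the expected bitrate decreases monotonically with GOP length and approaches the bitrate of the P-frames alone. The model does not account for any change in P-frame size along the GOP, so an actual encoder may deviate from it, in line with the observation above that the bitrate reduction may be smaller than expected.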