The present invention relates to video encoding, and more specifically, to encoding a video stream that includes one or more privacy masks.
Monitoring cameras are used in many different applications, both indoors and outdoors, for monitoring a variety of environments. Images depicting a captured scene may be monitored by, e.g., an operator or a guard. In certain situations there may be a need to treat one part of a captured image differently from another, for example to exclude a part of the image from view in the interest of personal integrity.
In such instances, an operator may define one or more privacy masks during set-up of the surveillance equipment. A privacy mask may be static or dynamic. Static privacy masks typically stay in place until the operator decides to move or remove them. Dynamic privacy masks may change over time, and the operator may also define when the privacy mask should be applied. For instance, the operator could define a dynamic privacy mask such that if a face is detected within the masked area, the face will be masked out, but otherwise no mask will be applied to the area.
Privacy masks are often applied to the image as an overlay. Some privacy masks take the form of an opaque area (e.g., a uniformly black area), while others take the form of blurring, where image data is “smeared” out over the privacy mask area, or pixelation, where the image inside the privacy mask is divided into pixelation blocks and all pixels of a pixelation block are given the same value, such that the image appears blocky inside the privacy mask area. The privacy mask often has a polygonal shape, but other shapes are also possible, in particular shapes that more closely follow the contour of the area to be occluded.
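The pixelation technique described above, in which each pixelation block is replaced by a single value, can be sketched as follows. This is a minimal illustration assuming a rectangular mask region and a block mean as the representative value; the function name and parameters are illustrative, not taken from any particular product.

```python
import numpy as np

def pixelate_region(image, x0, y0, x1, y1, block_size=16):
    """Pixelate a rectangular privacy-mask region in place.

    Each block_size x block_size block inside the region is replaced
    by its mean value, so the masked area appears blocky.
    (Illustrative sketch; real privacy masks may be polygonal and
    may use an opaque fill or blurring instead.)
    """
    region = image[y0:y1, x0:x1]  # NumPy view: edits affect `image`
    h, w = region.shape[:2]
    for by in range(0, h, block_size):
        for bx in range(0, w, block_size):
            block = region[by:by + block_size, bx:bx + block_size]
            # Give every pixel of the pixelation block the same value.
            block[...] = block.mean(axis=(0, 1), keepdims=True).astype(image.dtype)
    return image
```

Because the slicing produces views into the original array, the mask is applied in place without copying the frame.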
Images captured by the camera are normally transmitted to a site of use, such as a control center, where the images may be viewed and/or stored. Alternatively, they can be stored in so-called “edge storage”, i.e., storage at the camera, either on board the camera, such as on an SD-card, or in connection with the camera, such as on a NAS (network attached storage). Before transmission or edge storage, the images are typically encoded to save bandwidth and storage space. Encoding may be performed in many different ways, e.g., in accordance with the H.264 standard or other encoding standards. Most, if not all, video encoding is lossy, meaning that information present in the original images is lost during encoding and cannot be regained in decoding. There is a trade-off between reducing the number of bits required to represent the original images and the resulting image quality. Efforts have been made to develop encoding schemes that make as efficient use of the available bits as possible.
In many digital video encoding systems, two main modes are used for compressing video frames of a sequence of video frames: intra mode and inter mode. In the intra mode, the luminance and chrominance channels (or in some cases RGB or Bayer data) are encoded by exploiting the spatial redundancy of the pixels in a given channel of a single frame via prediction, transform, and entropy coding. The encoded frames are called intra-frames (also referred to as “I-frames”). Within an I-frame, blocks of pixels, also referred to as macroblocks, coding units or coding tree units, are encoded in intra-mode, that is, they are encoded with reference to a similar block within the same image frame, or raw-coded with no reference at all.
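The spatial prediction step of intra mode can be illustrated with DC prediction, one of the simplest intra prediction modes: a block is predicted from the reconstructed pixels directly above and to the left of it, and only the residual then needs to be transformed and entropy-coded. The sketch below is a simplified assumption-laden illustration (it assumes both neighbouring borders are available), not the full prediction machinery of any particular standard.

```python
import numpy as np

def dc_intra_predict(frame, y, x, n=8):
    """DC intra prediction for an n x n block at (y, x).

    Predicts every sample of the block as the mean of the pixels in
    the row directly above and the column directly to the left of the
    block, exploiting spatial redundancy within a single frame.
    (Simplified sketch: assumes both neighbours exist in `frame`.)
    """
    above = frame[y - 1, x:x + n].astype(np.int32)
    left = frame[y:y + n, x - 1].astype(np.int32)
    dc = int(round((above.sum() + left.sum()) / (2 * n)))
    prediction = np.full((n, n), dc, dtype=frame.dtype)
    # The encoder only has to code this (hopefully small) residual.
    residual = frame[y:y + n, x:x + n].astype(np.int16) - dc
    return prediction, residual
```

For a smooth image region the residual is near zero, which is what makes intra prediction effective.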
In contrast, the inter mode exploits the temporal redundancy between separate frames, and relies on a motion-compensation prediction technique that predicts parts of a frame from one or more previous frames by encoding, for selected blocks of pixels, the motion of those pixels from one frame to another. The encoded frames are referred to as inter-frames. They may be P-frames (forward-predicted frames), which can refer to previous frames in decoding order, or B-frames (bi-directionally predicted frames), which can refer to two or more previously decoded frames, with no restriction on the display-order relationship of the frames used for the prediction. Within an inter-frame, blocks of pixels may be encoded either in inter-mode, meaning that they are encoded with reference to a similar block in a previously decoded image, or in intra-mode, meaning that they are encoded with reference to a similar block within the same image frame, or raw-coded with no reference.
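The motion-compensation prediction described above rests on motion estimation: for each block, the encoder searches a reference frame for the displacement that best predicts the block, then transmits that motion vector plus the prediction residual. A brute-force full search using the sum of absolute differences (SAD) can be sketched as follows; real encoders use much faster search patterns, and the function name and parameters here are illustrative.

```python
import numpy as np

def motion_search(ref, cur, y, x, n=16, search=8):
    """Full-search motion estimation for one n x n block of `cur`.

    Returns the motion vector (dy, dx) minimising the SAD between the
    current block at (y, x) and the displaced block in the reference
    frame, together with the minimal SAD. (Illustrative sketch of the
    principle; practical encoders use hierarchical/fast searches.)
    """
    block = cur[y:y + n, x:x + n].astype(np.int32)
    h, w = ref.shape
    best, best_sad = (0, 0), None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            ry, rx = y + dy, x + dx
            if ry < 0 or rx < 0 or ry + n > h or rx + n > w:
                continue  # candidate block falls outside the frame
            cand = ref[ry:ry + n, rx:rx + n].astype(np.int32)
            sad = np.abs(block - cand).sum()
            if best_sad is None or sad < best_sad:
                best_sad, best = sad, (dy, dx)
    return best, best_sad
```

A SAD of zero means the block can be predicted perfectly from the reference frame, so only the motion vector needs to be coded.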
The encoded image frames are arranged in groups of pictures (GOPs). Each GOP is started by an I-frame, which does not refer to any other frame, and is followed by a number of inter-frames (i.e., P-frames or B-frames), which do refer to other frames. Image frames do not necessarily have to be encoded and decoded in the same order as they are captured or displayed. The only inherent limitation is that a frame that serves as a reference frame must be decoded before other frames that use it as a reference can be encoded. In surveillance or monitoring applications, encoding is generally done in real time, meaning that the most practical approach is to encode and decode the image frames in the same order as they are captured and displayed, as there will otherwise be undesired latency.
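The difference between display order and decode order can be illustrated for the classic pattern in which each B-frame refers to the nearest preceding and following I- or P-frame: those two references must be decoded before the B-frame. The reordering sketch below is a simplified illustration of that one pattern, not a general reference-picture management scheme.

```python
def decode_order(display_pattern):
    """Reorder a GOP from display order to decode order.

    Assumes the simple pattern where each B-frame is predicted from
    the nearest preceding and following I/P reference frame, so both
    references must precede it in decode order. (Illustrative sketch;
    real codecs allow far more general reference structures.)
    """
    out = []
    pending_b = []
    for frame in display_pattern:
        if frame == "B":
            pending_b.append(frame)  # wait until its next reference is decoded
        else:
            out.append(frame)        # decode the I/P reference first...
            out.extend(pending_b)    # ...then the B-frames that needed it
            pending_b = []
    out.extend(pending_b)            # any trailing B-frames (open GOP tail)
    return out
```

For a GOP displayed as I B B P B B P, the frames are decoded as I P B B P B B, which is why B-frames add latency and are usually avoided in real-time surveillance encoding.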
Two important techniques that can be used to reduce the bitrate are known as “Dynamic GOP” and “Dynamic Frames Per Second” (FPS), respectively. Dynamic GOP reduces the bitrate by avoiding storage-consuming I-frame updates. Typically, surveillance scenes with limited motion can be compressed into a very small size without any loss of detail. The algorithm adapts the GOP length (i.e., the number of frames in the GOP) of the compressed video in real time according to the amount of motion in the scene. As a general rule, for a static or low-motion scene, a longer GOP length results in a lower output bitrate, since inter-frames generally require fewer bits for representation than intra-frames. An example of dynamic GOP is described in European Application No. EP17160703.9, filed on Mar. 14, 2017.
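The adaptation of GOP length to scene motion can be sketched as a simple control function: a static scene gets a long GOP (few costly I-frames), a busy scene a short one (frequent refresh points). The mapping, thresholds and limits below are illustrative assumptions only, not values from the cited application or any standard.

```python
def next_gop_length(motion_level, min_len=30, max_len=300):
    """Adapt GOP length to scene motion (dynamic GOP sketch).

    motion_level in [0.0, 1.0]: 0 = static scene, 1 = heavy motion.
    A static scene gets a long GOP, lowering the bitrate because
    inter-frames need fewer bits than intra-frames; heavy motion
    shortens the GOP. Limits are illustrative assumptions.
    """
    motion_level = max(0.0, min(1.0, motion_level))  # clamp to [0, 1]
    # Interpolate linearly between the longest and shortest GOP.
    return int(round(max_len - motion_level * (max_len - min_len)))
```

The encoder would re-evaluate this once per GOP (or per frame) and start a new I-frame once the current GOP reaches the computed length.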
Dynamic FPS reduces the bitrate by omitting unnecessary video frames from the encoded stream. The amount of motion in a scene can be used as a control variable for determining the FPS. For example, a static surveillance scene can be encoded with a significantly reduced frame rate, even though the camera may be capturing and analyzing video at full frame rate. Since motion is used as a control variable, a small moving object far away may not be rendered at full frame rate. However, objects approaching the camera will cause the frame rate to increase, so as to capture every important detail of the objects. The number of delivered frames per second is also restricted automatically by the camera, which will save a substantial amount of data in many scenes.
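Using motion as a control variable for the frame rate can be sketched as a per-frame decision: the camera captures at full rate, but only every k-th frame is encoded, with k shrinking to 1 as motion increases. The interpolation and the minimum frame rate below are illustrative assumptions, not parameters of any particular camera.

```python
def should_encode_frame(frame_index, motion_level, full_fps=30, min_fps=5):
    """Decide whether to encode a captured frame (dynamic FPS sketch).

    The camera captures at full_fps, but for a static scene only a
    reduced target frame rate is encoded; high motion restores the
    full frame rate. Rates here are illustrative assumptions.
    """
    motion_level = max(0.0, min(1.0, motion_level))  # clamp to [0, 1]
    target_fps = min_fps + motion_level * (full_fps - min_fps)
    interval = max(1, round(full_fps / target_fps))  # encode every k-th frame
    return frame_index % interval == 0
```

With these illustrative values, a fully static scene keeps one frame in six, while a scene with heavy motion keeps every frame.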