The MPEG-4 visual standard provides technologies to view, access and manipulate objects (rather than pixels, as in the previous MPEG standards) over a wide range of bit rates and in many application areas, such as digital television, streaming video, mobile multimedia and games. Said standard operates on video objects (VOs) defined by temporal and spatial information in the form of shape, motion and texture information, coded separately in the bitstream (these VOs are the entities that the user can access and manipulate).
The MPEG-4 approach relies on a content-based visual data representation of the successive scenes of a sequence, each scene being a composition of VOs, each with its own intrinsic properties: shape, motion and texture. In addition to the concept of VO, the MPEG-4 standard introduces others, such as the Video Object Layer (VOL) (each VO can be encoded in either a scalable or a non-scalable form, depending on the application, and this form is represented by the VOL) and the Video Object Plane (VOP) (an instance of a VO at a given time). It is assumed that each frame of an input video sequence is segmented into a number of arbitrarily shaped image regions (the VOs), and that the shape, motion and texture information of the VOPs belonging to the same VO is encoded and transmitted in separate VOLs corresponding to specific temporal or spatial resolutions (which later allows each VOP to be decoded separately and provides the required flexibility in manipulating the video sequence).
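The VO, VOL and VOP hierarchy described above can be pictured as a small data model. The following Python sketch is only an illustration under stated assumptions: the class names, field names and the helper `vops_at` are hypothetical and do not correspond to the MPEG-4 bitstream syntax or to any existing library.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class VideoObjectPlane:
    """One instance of a video object at a given time (a VOP)."""
    time_index: int
    # Shape, motion and texture are coded separately; they are modelled
    # here as opaque byte strings (an assumption of this sketch).
    shape: bytes = b""
    motion: bytes = b""
    texture: bytes = b""


@dataclass
class VideoObjectLayer:
    """One layer (VOL) of a video object, e.g. one spatial or temporal resolution."""
    layer_id: int
    vops: List[VideoObjectPlane] = field(default_factory=list)


@dataclass
class VideoObject:
    """An arbitrarily shaped region of the scene (a VO), carried by one or more VOLs."""
    name: str
    layers: List[VideoObjectLayer] = field(default_factory=list)

    def vops_at(self, time_index: int) -> List[Tuple[int, VideoObjectPlane]]:
        """Return (layer_id, VOP) pairs for every layer holding a VOP at this time."""
        return [
            (layer.layer_id, vop)
            for layer in self.layers
            for vop in layer.vops
            if vop.time_index == time_index
        ]
```

In this model, a non-scalable VO would hold a single layer, while a scalable VO would hold several layers whose VOPs can be looked up, and hence decoded, independently.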
The three types of frames processed by such a coding structure are the following: the I-VOPs, the P-VOPs and the B-VOPs. An I-VOP is an intra-coded VOP: the coding operation uses information only from the VOP itself (it is the type of VOP that costs the greatest number of bits). A P-VOP is a predictive-coded VOP: the coding operation uses a motion-compensated prediction from a past reference VOP, which can be either an I-VOP or another P-VOP (contrary to an I-VOP, only the difference between the current P-VOP and its motion-compensated reference is coded, so that a P-VOP usually costs fewer bits than an I-VOP). A B-VOP is a VOP that is coded using a motion-compensated prediction from past and future reference VOPs (I-VOPs or P-VOPs), based on so-called forward and backward motion estimations respectively. A B-VOP cannot be a reference VOP and, as for the P-VOP, only the difference between the current B-VOP and its motion-compensated reference VOPs is coded.
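The reference rules above can be illustrated with a short Python sketch. It assumes that the VOP types are given as a string in display order and that each P-VOP or B-VOP refers to the nearest available I-VOP or P-VOP; the function name `reference_vops` and this input format are hypothetical and are not part of the standard.

```python
from typing import List, Optional, Tuple


def reference_vops(types: str) -> List[Tuple[Optional[int], Optional[int]]]:
    """For each VOP, in display order, return (past_ref, future_ref) indices.

    `types` is a string over 'I', 'P' and 'B'. None means "no reference".
    """
    # Only I-VOPs and P-VOPs may serve as references; B-VOPs never do.
    anchors = [i for i, t in enumerate(types) if t in "IP"]
    refs: List[Tuple[Optional[int], Optional[int]]] = []
    for i, t in enumerate(types):
        # Nearest reference candidates before and after the current VOP.
        past = max((a for a in anchors if a < i), default=None)
        future = min((a for a in anchors if a > i), default=None)
        if t == "I":
            refs.append((None, None))    # intra coded: no reference at all
        elif t == "P":
            refs.append((past, None))    # forward prediction from a past I/P-VOP
        elif t == "B":
            refs.append((past, future))  # prediction from past and future I/P-VOPs
        else:
            raise ValueError(f"unknown VOP type: {t!r}")
    return refs
```

For the sequence `"IBBPBP"`, the two B-VOPs at positions 1 and 2 are predicted from the I-VOP at position 0 and the P-VOP at position 3, while that P-VOP is predicted from the I-VOP alone. No B-VOP index ever appears as a reference.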
Unfortunately, using said B-VOP prediction (also called interpolated or bi-directional mode) does not always yield a gain in terms of compression. While the compression can sometimes be improved by about 20%, in other cases it can be drastically degraded.