1. Field of the Invention
The invention relates to video encoding methods with variable video frames.
2. Description of the Related Technology
A video information stream comprises a time sequence of video frames. The time sequence of video frames can be recorded, for instance, by a video camera/recorder. Each of the video frames can be considered as a still image. The video frames are represented in a digital system as an array of pixels. The pixels comprise luminance (light intensity) and chrominance (color) information. The information is stored in a memory of the digital system. For each pixel some bits are reserved. From a programming point of view each video frame can be considered as a two-dimensional data type, although the video frames are not necessarily rectangular. Note that fields from an interlaced video time sequence can also be considered as video frames.
A particular aspect of the considered video frames is that they are variable in size and even location with respect to a fixed reference such as, e.g., the display. Moreover, the considered video frames support the object concept by indicating whether a pixel belongs to an object or not.
In principle when the video information stream must be transmitted between two digital systems, this can be realized by sending the video frames sequentially in time, for instance by sending the pixels of the video frames and thus the bits representing the pixels sequentially in time.
There exist, however, more elaborate transmission schemes enabling faster and more reliable communication between two digital systems. These transmission schemes are based on encoding the video information stream in the transmitting digital system and decoding the encoded video information stream in the receiving digital system. Note that the same principles can be exploited for storage purposes.
During encoding, the original video information stream is transformed into another digital representation, which is then transmitted. During decoding, the original video information stream is reconstructed from the digital representation.
The MPEG-4 standard defines such a transmission (and storage) efficient encoded digital representation of a video information stream.
Encoding requires operations on the video information stream. The operations are performed on a digital system (for instance the transmitting digital system). Each operation performed by a digital system consumes power. The way in which the operations for encoding are performed is called a method. The methods have some characteristics such as encoding speed and the overall power consumption needed for encoding.
The digital system can either be application-specific hardware or a programmable processor architecture. It is well known that most of the power consumed by such digital systems while performing real-time multi-dimensional signal processing, such as video stream encoding, is due to the memory units and the communication paths between them. More precisely, individual read and write operations from and to memory units by processors and/or datapaths, and between memories, become more power expensive as the memory units become larger, and so does the access time or latency from the busses. Naturally, the number of read and write operations also determines the overall power consumption and the bus loading. The larger the communication path, the larger the power consumption of a data transfer operation. Communication here means the communication between memory units and the processors and datapaths found in the digital system, and between the memories themselves. There is also a difference between on- and off-chip memories. Note that the same considerations are valid when considering speed as a performance criterion.
As the power consumption of the digital system is dominated by read and write operations, that is, by manipulations of data types such as video frames, the methods are considered to be data-dominated.
As the algorithm specification, the algorithm choice and its implementation determine the number of operations and the required memory sizes, it is clear that these have a big impact on the overall power consumption and on other performance criteria such as speed and bus loading.
A method for encoding a video information stream, resulting in a minimal power consumption of the digital system on which the method is implemented, and exhibiting excellent performance, e.g., being fast, must be based on optimized data storage, related to memory sizes, and data transfer, related to the amount of read and write operations. Such a method can be developed by transforming an initial less power optimal method by using various code manipulations. Such a transformation approach must be supported by an adequate exploration methodology.
In general a method can be described as an ordered set of operations which are repetitively executed. The repetition is organized in a loop. During execution data is consumed and produced. The code manipulations can be loop and/or data-flow transformations. The transformations change the ordering of the operations in the loop and result in another data consumption-production ordering. Data reuse concepts can also be used in order to obtain a method that is more optimal in power consumption and speed. Data reuse deals with specifying from and to which memory data is read and written. More particularly, applying the data reuse concept means making copies of data in smaller memories and letting the processors and/or datapaths access the data from these smaller memories.
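The data reuse concept can be illustrated with a minimal sketch. It assumes a reference frame held in a large memory modeled by a hypothetical `CountingMemory` class that counts read accesses. The function names, the frame contents and the block and search sizes are illustrative assumptions, not part of the described methods. The first function reads every pixel from the large memory for each candidate position. The second copies the search area once into a small buffer and serves all further reads from that buffer.

```python
class CountingMemory:
    """Wraps a 2-D pixel array and counts read accesses (illustrative only)."""

    def __init__(self, rows):
        self.rows = rows
        self.reads = 0

    def read(self, y, x):
        self.reads += 1
        return self.rows[y][x]


def best_sad_direct(cur, ref_mem, top, left, size, search):
    """Evaluate every candidate by reading the large memory directly."""
    best = None
    for dy in range(search + 1):
        for dx in range(search + 1):
            total = 0
            for y in range(size):
                for x in range(size):
                    total += abs(cur[y][x] - ref_mem.read(top + dy + y, left + dx + x))
            if best is None or total < best:
                best = total
    return best


def best_sad_with_reuse(cur, ref_mem, top, left, size, search):
    """Copy the search area once into a small buffer, then reuse the copy."""
    span = size + search
    buf = [[ref_mem.read(top + y, left + x) for x in range(span)] for y in range(span)]
    best = None
    for dy in range(search + 1):
        for dx in range(search + 1):
            total = sum(abs(cur[y][x] - buf[dy + y][dx + x])
                        for y in range(size) for x in range(size))
            if best is None or total < best:
                best = total
    return best


# Hypothetical 8x8 reference frame and a 4x4 block cut from it at offset (1, 2).
ref = [[(3 * y + 5 * x) % 17 for x in range(8)] for y in range(8)]
cur = [row[2:6] for row in ref[1:5]]
direct_mem, reuse_mem = CountingMemory(ref), CountingMemory(ref)
direct_result = best_sad_direct(cur, direct_mem, 0, 0, 4, 3)
reuse_result = best_sad_with_reuse(cur, reuse_mem, 0, 0, 4, 3)
print(direct_mem.reads, reuse_mem.reads)
```

In this sketch both variants find the same minimum, but the variant with the small buffer reads the large memory 49 times instead of 256 times. The reads from the buffer are not free, but they address a much smaller memory.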
Naturally, when such a power consumption and speed optimal encoding method exists, it can be implemented on a digital system adapted for the method. This adaptation can be done by an efficient programming of programmable (application-specific) processor architectures or by actually designing an application-specific or domain-specific processor with the appropriate memory units.
The fact that the power consumption is heavily dominated by data storage and data transfer of multi-dimensional data types is demonstrated in the publications [F. Catthoor, F. Franssen, S. Wuytack, L. Nachtergaele, H. De Man, “Global communication and memory optimizing transformations for low power signal processing systems”, IEEE workshop on VLSI signal processing, La Jolla Calif., October 1994] and [R. Gonzales, M. Horowitz, “Energy dissipation in general-purpose microprocessors”, IEEE J. Solid-state Circ., Vol. SC-31, No. 9, pp. 1277-1283, September 1996] for custom hardware and programmable processors respectively.
Power consumption in deep submicron CMOS digital devices is dominated by the charging of wires on-chip and off-chip. The technological evolution aims at minimizing the power consumption by lowering the supply voltages, using short thin wires and small devices, and using reduced logic swing. These non-application-specific approaches do not exploit the characteristics of the application in the design of the digital system and/or the implementation on a given digital system.
The following general principles for power consumption reduction are known: match architecture and computation, preserve locality and regularity inherent in the application, exploit signal statistics and data correlations, and deliver energy and performance on demand. These guidelines must, however, be translated and extended for a more memory-related context as found in multi-media applications.
The data storage and transfer exploration methodology, applied for constructing the encoding methods presented in the invention, is discussed in the detailed description of the invention.
The different aspects of the invention will be illustrated for encoding following the MPEG-4 standard, discussed in the detailed description of the invention. The current realizations of MPEG based video coding multi-media applications can be divided into two main classes: customized architectures and programmable architectures.
The disadvantages of the customized approach [P. Pirsch, N. Demassieux, W. Gehrke, “VLSI architectures for video compression - a survey”, Proc. of the IEEE, invited paper, Vol. 83, No. 2, pp. 220-246, February 1995] are that the design is difficult, as only limited design exploration support is available, that it is application-specific, and that it still has a large power consumption due to a rigid memory hierarchy and central bus architecture. Many programmable processor solutions for video and image processing have been proposed, also in the context of MPEG [K. Roenner, J. Kneip, “Architecture and applications of the HiPar video signal processor”, IEEE Trans. on Circuit and Systems for Video Technology, special issue on “VLSI for video signal processors”]. Power consumption management and reduction for such processors is, however, hardly tackled. The disadvantages of an implementation on a programmable processor are (1) the large power consumption, due to expensive data transfers of which many are not really necessary, (2) that most of the chip/board area is taken up by memories and busses, (3) that addressing and control complexity are high, and (4) that the speed is too low, such that parallel processing is necessary, which is difficult to program efficiently due to data communication.
Much work has been published in the past on cache coherence protocols for parallel processors. These approaches are mostly based on load balancing and parallelisation issues for arithmetic operations. Although some work on data localization issues in order to obtain better cache usage exists, it is clear that a more data transfer and storage oriented solution is required for data-dominated applications such as multi-media applications. Data reuse is the basis for traditional caching policies. These policies are, however, not sufficiently application oriented: they do not sufficiently exploit the particular algorithm which must be implemented, and they are not based on global optimization considerations.
The use of global and aggressive system-level data-flow and loop transformations is illustrated for a customized video compression architecture for the H.263 video conferencing decoder standard in [L. Nachtergaele, F. Catthoor, B. Kapoor, D. Moolenaar, S. Janssens, “Low power storage exploration for H.263 video decoder”, IEEE workshop on VLSI signal processing, Monterey Calif., October 1996] and for other realistic multi-media kernels in [F. Catthoor, S. Wuytack, E. De Greef, F. Franssen, L. Nachtergaele, H. De Man, “System-level transformations for low power data transfer and storage”, in paper collection on “Low power CMOS design” (eds. A. Chandrakasan, R. Brodersen), IEEE Press, pp. 609-618, 1998] and [S. Wuytack, F. Catthoor, L. Nachtergaele, H. De Man, “Power Exploration for Data Dominated Video Applications”, Proc. IEEE Intnl. Symp. on Low Power Design, Monterey, pp. 359-364, August 1996].
The invention includes video information stream encoding methods, for application with a data storage and transfer design methodology for data-dominated applications.
The invention relates to video encoding methods with variable video frames, designed such that the digital system on which the methods are implemented consumes a minimal amount of power during execution of the methods while still obtaining excellent performance, such as speed compliance.
The resulting video information stream encoding methods can be mapped on different processor architectures and custom hardware. The methods enable combined low power consumption, reduced bus loading and increased performance to achieve speed compliance.
Methods for encoding a video information stream are disclosed. A video information stream comprises a time-ordered sequence of video frames. Each of the video frames can be considered as a still image. The video frames are represented as an array of pixels. The video frames of a video information stream can have different sizes and locations with respect to a fixed reference. Besides light intensity and color, additional information can be stored for each pixel position. For instance, it can be specified to which object a pixel belongs, or possibly that the pixel does not belong to an object. Pixels not belonging to an object are denoted transparent pixels.
Encoding of the video information stream is done to obtain another digital representation of the video information stream, this digital representation being more efficient for transmission or storage. The encoding is based on the fact that temporally nearby video frames are often quite similar except for some motion. The arrays of pixels of temporally nearby video frames often contain the same luminance and chrominance information, except that the coordinate places or pixel positions of the information in the arrays are shifted by some locations. Shifting in place as a function of time defines a motion. The motion is characterized by a motion vector.
Encoding of the video information stream is done by performing encoding of the video frames of the time sequence with respect to other video frames of the time sequence. The other video frames are denoted reference video frames. Any video frame may be a reference frame. For the presented encoding methods it is important to denote how both the video frames to be encoded and the reference video frames are located in time with respect to each other. As such, a time ordering of the video frames is explicitly stated in the methods. A video frame situated in time before the video frame under consideration is denoted a previous video frame. A video frame situated in time after the video frame under consideration is denoted a next video frame. The video frame under consideration can be denoted the current video frame.
The encoding is in principle based on motion estimation of the motion between a video frame and a reference video frame. The motion estimation defines a motion vector. Motion estimation is based on calculating a norm of the difference between parts of two video frames. Such a norm is a measure of the difference between parts of two video frames. Often the sum of absolute differences is used as the norm. Other norms can also be used. The norm can also be denoted as a mathematical norm, being an operator on two objects, here video frames, measuring the differences between the objects. At least, the norm is minimal when the difference is zero, thus when the objects are the same. When the motion is estimated, a motion compensation is performed. The motion compensation comprises constructing a new motion compensated video frame from the reference video frame by applying the found motion. The motion compensated video frame comprises the pixels of the reference video frame but located at different coordinate places. The motion compensated video frame can then be subtracted from the video frame under consideration. This results in an error video frame. Due to the temporal relation between the video frames the error video frame will contain less information. This error video frame and the motion estimation vectors are then transmitted, optionally after some additional coding. The subtraction and additional coding are further denoted coding. Padding can also be included in the coding.
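The chain of motion estimation, motion compensation and subtraction can be sketched for a single block as follows. This is a minimal illustration under stated assumptions: frames are lists of rows of integer luminance values, the norm is the sum of absolute differences, the search is exhaustive, and the motion vector is taken as the displacement from the block position in the current frame to the matching position in the reference frame. The function names and the sign convention are hypothetical and are not prescribed by the text.

```python
def get_block(frame, top, left, size):
    """Return the size x size block whose upper-left pixel is (top, left)."""
    return [row[left:left + size] for row in frame[top:top + size]]


def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))


def estimate_motion(cur, ref, top, left, size, max_disp):
    """Exhaustive search: return ((dy, dx), cost) with the minimal SAD."""
    height, width = len(ref), len(ref[0])
    cur_block = get_block(cur, top, left, size)
    best_vec, best_cost = None, None
    for dy in range(-max_disp, max_disp + 1):
        for dx in range(-max_disp, max_disp + 1):
            y, x = top + dy, left + dx
            # Only candidates lying completely inside the reference frame.
            if y < 0 or x < 0 or y + size > height or x + size > width:
                continue
            cost = sad(cur_block, get_block(ref, y, x, size))
            if best_cost is None or cost < best_cost:
                best_vec, best_cost = (dy, dx), cost
    return best_vec, best_cost


def error_block(cur, ref, top, left, size, vec):
    """Motion compensate the reference block and subtract it from the current block."""
    dy, dx = vec
    predicted = get_block(ref, top + dy, left + dx, size)
    actual = get_block(cur, top, left, size)
    return [[a - p for a, p in zip(row_a, row_p)]
            for row_a, row_p in zip(actual, predicted)]


# Hypothetical 8x8 frames: the current frame shows the reference content
# displaced by one row and one column; uncovered positions are set to 0.
ref = [[y * 8 + x for x in range(8)] for y in range(8)]
cur = [[ref[y + 1][x - 1] if y + 1 < 8 and x - 1 >= 0 else 0 for x in range(8)]
       for y in range(8)]
vec, cost = estimate_motion(cur, ref, top=2, left=2, size=4, max_disp=2)
residual = error_block(cur, ref, 2, 2, 4, vec)
print(vec, cost)
```

For this synthetic input the search finds a candidate with zero cost, so the error block contains only zeros and carries far less information than the original block.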
The encoding will be limited to a part of a video frame. The encoding is also not performed on the video frame as a whole but on blocks of the video frame. The video frame is divided into non-overlapping or overlapping blocks. The blocks are thus arrays of pixels but of smaller size than the video frame array. Blocks can be considered as arrays of pixels that differ from each other in that they are at least partly spatially separated. Note that different video frames can be characterized as arrays of pixels spaced in time. The encoding operations are then performed on all the blocks of the video frame. As the encoding of a video frame is performed with respect to a reference video frame, a relation is implicitly defined between the blocks of the video frame under consideration and the blocks of the reference video frame. Indeed, the calculation of the sum of absolute differences or any other norm will only be performed for a block of a video frame and blocks of the reference video frame which are located nearby. These locations are defined by the maximum length of the motion estimation vector. These locations define a search area. Blocks of video frames to be encoded are called related when they refer to the same block in the reference video frame. One can also define these blocks as related because they will exploit the same search area in the reference video frame. In the reference video frame a so-called related block is also defined. The related block is the block in the reference video frame used for calculation of a particular norm for a block of the video frame under consideration.
The application presents methods for encoding a video frame with respect to one reference video frame, for encoding a video frame with respect to two reference video frames, and for encoding a time sequence of video frames with respect to two reference video frames, as well as methods for motion estimation. The encoding and motion estimation methods are designed such that, when implemented on a digital system, the power consumption of the digital system while executing the methods is minimal. The encoding and motion estimation methods also exhibit excellent performance with respect to other performance criteria such as speed.
The presented methods comprise operations on data. The operations can be reading from and writing to a memory. The operations can also be arithmetic operations.
The different aspects of the invention are stated below. These aspects can be used independently or combined.
A first aspect of the invention is a method for encoding of at least a part of a video frame with respect to a reference video frame by dividing the video frame under consideration into blocks and performing the basic encoding operations such as motion estimation, motion compensation and block coding (including padding), in the order described above on a block of the considered video frame before considering another block of the considered video frame.
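The ordering of operations in this first aspect can be sketched as follows. The three basic operations are replaced by hypothetical stubs that only record when they are called, so that the two orderings can be compared. The frame-wise variant is included purely as a contrast and is an assumption for illustration; it is not taken from the text.

```python
def make_ops(trace):
    """Create stub operations that log (operation, block) pairs into trace."""
    def estimate(block):
        trace.append(("me", block))
        return (0, 0)  # placeholder motion vector

    def compensate(block, vec):
        trace.append(("mc", block))
        return ("predicted", block)  # placeholder prediction

    def code(block, predicted, vec):
        trace.append(("code", block))
        return ("coded", block)  # placeholder coded block

    return estimate, compensate, code


def encode_blockwise(blocks, estimate, compensate, code):
    """Finish estimation, compensation and coding of a block before the next one."""
    coded = []
    for block in blocks:
        vec = estimate(block)
        predicted = compensate(block, vec)
        coded.append(code(block, predicted, vec))
    return coded


def encode_framewise(blocks, estimate, compensate, code):
    """Contrast: run each operation over the whole frame before the next operation."""
    vectors = [estimate(b) for b in blocks]
    predictions = [compensate(b, v) for b, v in zip(blocks, vectors)]
    return [code(b, p, v) for b, p, v in zip(blocks, predictions, vectors)]


blockwise_trace, framewise_trace = [], []
encode_blockwise([0, 1], *make_ops(blockwise_trace))
encode_framewise([0, 1], *make_ops(framewise_trace))
print(blockwise_trace)
print(framewise_trace)
```

In the block-wise ordering only the intermediate results of one block are alive at any time, whereas the frame-wise ordering keeps the vectors and predictions of all blocks until the coding step, which suggests larger intermediate storage.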
A second aspect of the invention is a method for encoding of at least a part of a video frame with respect to two reference video frames. A time ordering between the video frames to be encoded and the reference video frames is introduced. The encoding method also uses a block based implementation as described above. The basic encoding operations such as motion estimation, compensation and block coding are performed in a particular order.
A third aspect of the invention is the introduction of several methods for encoding of a time sequence of video frames with respect to two reference video frames. As such, a merging of the encoding of the video frames is realized. A time ordering between the video frames to be encoded and the reference video frames is introduced. The encoding methods also use a block based implementation as described above. The basic encoding operations such as motion estimation, compensation and block coding are performed in a particular order. The choice between the proposed methods can be made at run-time.
A fourth aspect of the invention is the introduction of a particular implementation of the above defined methods for encoding of a time sequence of video frames with respect to two reference video frames. In the implementation it is specified that the further encoding of blocks is started as soon as this is technically possible. The implementation is denoted a chasing mode implementation.
A fifth aspect of the invention is the introduction of the concept of a group video frame or video frame group for encoding of a time sequence of video frames. The group video frame contains the video frames of the time sequence for which the encoding is merged. The group video frame is divided in cells. The encoding of blocks of the original video frames is merged when the blocks belong to the same cell of the group video frame.
A sixth aspect of the invention is a method for encoding a time sequence of video frames exploiting the group video frame concept thereby performing the motion estimation such that a pixel is reused for all motion estimations in which it is needed. In this method a check is performed to determine whether a pixel is needed for any of the motion estimations. When it is needed, it is read and used for all motion estimations in which it is needed.
A seventh aspect of the invention is a method for determining a motion estimation vector for a block with respect to a reference video frame based on norm calculations, wherein calculation of this norm is excluded when part of the related block in the reference video frame falls out of the reference video frame.
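A minimal sketch of this exclusion is given below, assuming a square block, the sum of absolute differences as the norm and an exhaustive search. The function names and the returned counters are hypothetical and serve only to show how many norm calculations are avoided.

```python
def sad_at(cur_block, ref, y, x):
    """SAD between cur_block and the reference block with upper-left pixel (y, x)."""
    return sum(abs(c - ref[y + j][x + i])
               for j, row in enumerate(cur_block)
               for i, c in enumerate(row))


def motion_vector_in_frame(cur_block, ref, top, left, max_disp):
    """Skip the norm for every candidate that is not fully inside the frame."""
    size = len(cur_block)
    height, width = len(ref), len(ref[0])
    best, evaluated, skipped = None, 0, 0
    for dy in range(-max_disp, max_disp + 1):
        for dx in range(-max_disp, max_disp + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + size > height or x + size > width:
                skipped += 1  # part of the related block falls out of the frame
                continue
            evaluated += 1
            cost = sad_at(cur_block, ref, y, x)
            if best is None or cost < best[0]:
                best = (cost, (dy, dx))
    return best, evaluated, skipped


# Hypothetical 6x6 reference frame; the 2x2 block sits in the frame corner.
ref = [[y * 6 + x for x in range(6)] for y in range(6)]
cur_block = [[7, 8], [13, 14]]
best, evaluated, skipped = motion_vector_in_frame(cur_block, ref, top=0, left=0, max_disp=1)
print(best, evaluated, skipped)
```

For a block in the corner of the frame, five of the nine candidate positions are excluded in this example, so neither their pixels are read nor their norms computed.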
An eighth aspect of the invention relates to methods for determining a motion estimation vector for a block with respect to a reference video frame based on norm calculations, wherein calculation of the norm is excluded when part of the related block in the reference video frame contains transparent pixels.
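The exclusion based on transparent pixels can be sketched in the same way. The sketch assumes that object membership is available as a boolean mask with the same dimensions as the reference video frame, where `False` marks a transparent pixel. The mask representation and the function names are assumptions made for illustration only.

```python
def block_is_opaque(mask, y, x, size):
    """True when no pixel of the candidate block is transparent."""
    return all(mask[y + j][x + i] for j in range(size) for i in range(size))


def motion_vector_opaque_only(cur_block, ref, mask, top, left, max_disp):
    """Exhaustive SAD search that skips candidates containing transparent pixels."""
    size = len(cur_block)
    height, width = len(ref), len(ref[0])
    best, skipped = None, 0
    for dy in range(-max_disp, max_disp + 1):
        for dx in range(-max_disp, max_disp + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + size > height or x + size > width:
                continue  # candidate outside the reference frame
            if not block_is_opaque(mask, y, x, size):
                skipped += 1  # norm calculation excluded
                continue
            cost = sum(abs(c - ref[y + j][x + i])
                       for j, row in enumerate(cur_block)
                       for i, c in enumerate(row))
            if best is None or cost < best[0]:
                best = (cost, (dy, dx))
    return best, skipped


# Hypothetical 6x6 reference frame with a single transparent pixel at (1, 1).
ref = [[y * 6 + x for x in range(6)] for y in range(6)]
mask = [[True] * 6 for _ in range(6)]
mask[1][1] = False
cur_block = [[7, 8], [13, 14]]  # equals the reference block at (1, 1)
best, skipped = motion_vector_opaque_only(cur_block, ref, mask, top=2, left=2, max_disp=1)
print(best, skipped)
```

In this example the candidate that would otherwise match exactly is excluded because it contains the transparent pixel, and the search returns the best remaining candidate.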
A ninth aspect of the invention is a method for determining a motion estimation vector for a block with respect to a reference video frame based on an interpolated version of that reference video frame. The interpolated version of the reference video frame is not determined in advance but the interpolated pixels are calculated when needed and not stored.
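Computing interpolated pixels on demand can be sketched as follows. The sketch assumes half-pixel accuracy, coordinates expressed in non-negative half-pixel units, and plain averaging of the neighboring pixels with rounding. The actual interpolation filter and rounding rule are not specified in this passage, so these choices and the function names are assumptions.

```python
def half_pel(ref, hy, hx):
    """Interpolated pixel at position (hy / 2, hx / 2), computed when requested.

    Nothing is stored: the value is derived from at most four stored pixels.
    """
    y0, x0 = hy // 2, hx // 2
    y1, x1 = y0 + hy % 2, x0 + hx % 2  # same pixel again when the coordinate is even
    total = ref[y0][x0] + ref[y0][x1] + ref[y1][x0] + ref[y1][x1]
    return (total + 2) // 4


def sad_half_pel(cur_block, ref, hy, hx):
    """SAD of cur_block against the reference block whose upper-left corner
    lies at half-pixel position (hy, hx)."""
    return sum(abs(c - half_pel(ref, hy + 2 * j, hx + 2 * i))
               for j, row in enumerate(cur_block)
               for i, c in enumerate(row))


# Hypothetical 3x3 reference frame and a 2x2 block shifted by half a pixel.
ref = [[0, 10, 20],
       [40, 50, 60],
       [80, 90, 100]]
cur_block = [[5, 15], [45, 55]]
integer_cost = sad_half_pel(cur_block, ref, 0, 0)
half_cost = sad_half_pel(cur_block, ref, 0, 1)
print(integer_cost, half_cost)
```

No interpolated frame is written to memory in this sketch. Each interpolated value costs a few extra additions but avoids storing and later re-reading an interpolated frame that would be several times larger than the reference frame.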
A tenth aspect of the invention relates to methods for determining a motion estimation vector for a block with respect to a reference video frame, wherein a memory hierarchy is exploited.