Any form that exists in a given time and space is a source of information. For example, the propagation of light in space forms dynamic images, the flow of vast numbers of water molecules produces ocean information, and the dynamic movement of air molecules and other suspended particles forms climate information. In terms of dynamic images, humans and other creatures perceive the world by capturing photons with their eyes, and modern cameras record the dynamically changing world by capturing photons with a CCD (Charge-Coupled Device) or CMOS (Complementary Metal Oxide Semiconductor) sensor, which generates large amounts of image and video data.
Traditional methods of representing dynamic images are two-dimensional images and videos formed as sequences of images. The traditional image is a two-dimensional form of information. In the narrow sense, an image is the result of light being projected onto a photographic plane after reflection, diffuse reflection, refraction or scattering in the physical world. In the general sense, an image includes any form of information distributed on a two-dimensional plane. An image in digital form is more convenient to process, transmit and store, so an image in analog signal form needs to be transformed into an image in digital form, i.e. a digital image. The process of image digitization mainly includes three steps: sampling, quantization and coding. Sampling is the process of discretizing the spatial distribution of an image. For a two-dimensional image, the most common approach is to divide the rectangular area covered by the image into a grid of sampling points at equal intervals. The number of rows of sampling points and the number of sampling points per row are usually called the resolution of the digital image (strictly speaking, resolution refers to the number of sampling points per unit of physical size). Quantization is the process of discretizing the color (or other physical quantity) of the image at each sampling point, and its precision is generally expressed as a number of quantization levels. Each sampling point, together with the quantized value of its color (or other physical quantity), forms one pixel of the image, and all pixels arranged in rows and columns form a digital image.
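The sampling and quantization steps described above can be sketched as follows. This is only an illustrative sketch: the function name `sample_and_quantize`, the choice of sampling at the centre of each grid cell, and the test pattern `horizontal_ramp` are assumptions made for the example and do not come from the text.

```python
def sample_and_quantize(intensity, rows, cols, bits):
    """Digitize a continuous image.

    intensity(x, y) is assumed to return a brightness in [0.0, 1.0]
    for normalized coordinates x, y in [0.0, 1.0).
    """
    levels = 2 ** bits  # number of quantization levels
    image = []
    for r in range(rows):
        y = (r + 0.5) / rows  # sampling: equally spaced rows
        row = []
        for c in range(cols):
            x = (c + 0.5) / cols  # sampling: equally spaced columns
            value = min(max(intensity(x, y), 0.0), 1.0)
            # quantization: map the continuous value to an integer level
            row.append(min(int(value * levels), levels - 1))
        image.append(row)
    return image


def horizontal_ramp(x, y):
    """Hypothetical analog image: brightness increases from left to right."""
    return x


# A 2 x 4 digital image with 2 bits (4 levels) per pixel.
pixels = sample_and_quantize(horizontal_ramp, rows=2, cols=4, bits=2)
print(pixels)  # [[0, 1, 2, 3], [0, 1, 2, 3]]
```

Here the resolution is 2 rows by 4 sampling points per row, and each pixel holds one of four quantization levels.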
The traditional concept of video is a sequence of images captured at a certain time interval. An image in the sequence is also called a frame, so a video is also an image sequence. Dividing time into intervals between images is also a part of sampling. Usually, equal intervals are used, and the number of images captured per second is called the frame rate. To ensure that no information is lost in the process of digitization, that is, that the information can be completely restored when converted back to analog form, the sampling theorem requires that the signal be sampled at a rate of at least twice the highest frequency of the image spatial signal.
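The consequence of violating the sampling theorem in the time dimension can be illustrated with a small calculation. The sketch below assumes a purely periodic motion (for example, a wheel turning at a constant rate) and uses the standard frequency-folding relation; the function name `apparent_frequency` is hypothetical.

```python
def apparent_frequency(true_hz, frame_rate):
    """Frequency observed after sampling a periodic motion at frame_rate.

    A negative result means the motion appears to run backwards.
    """
    return true_hz - frame_rate * round(true_hz / frame_rate)


# 10 rev/s is below half of 30 frames per second, so it is captured correctly.
print(apparent_frequency(10, 30))  # 10

# 29 rev/s is above half of 30 frames per second, so it aliases:
# the wheel appears to turn slowly backwards at 1 rev/s.
print(apparent_frequency(29, 30))  # -1
```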
Video captured in the traditional way produces a large amount of data after digitization. Taking a high-definition video as an example, the amount of data per second is 1920×1080×24 bits×30 frames per second=1,492,992,000 bits per second, which is about 1.5 Gbps. It is almost impossible for network and storage technology to transmit such an amount of data through a broadcast communication network, to provide video services for thousands of users on the internet, or to store the video data generated around the clock by millions of cameras in cities. A large amount of redundancy needs to be removed from high-precision digital video data, which is the central goal of digital video coding, so digital video coding is also called digital video compression. Since the research on Huffman coding and differential pulse code modulation in the late 1940s and early 1950s, video coding technology has developed for nearly 60 years. In this process, three types of classical techniques took shape, namely transform coding, predictive coding and entropy coding, which remove the spatial redundancy, temporal redundancy and information entropy redundancy of video signals respectively.
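The raw data rate quoted above can be reproduced with a few lines of arithmetic. The function name `raw_bitrate_bps` is chosen for this sketch only.

```python
def raw_bitrate_bps(width, height, bits_per_pixel, frames_per_second):
    """Uncompressed video data rate in bits per second."""
    return width * height * bits_per_pixel * frames_per_second


hd_bps = raw_bitrate_bps(1920, 1080, 24, 30)
print(hd_bps)                  # 1492992000
print(round(hd_bps / 1e9, 2))  # 1.49 (Gbps)
```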
Driven by more than 30 years of technology accumulation and by the needs of information technology development, the various video coding technologies began to converge in the 1980s and gradually formed a block-based hybrid coding framework combining prediction and transformation. The hybrid coding framework was standardized by standardization organizations and began to be applied on a large scale in industry. There are two major international organizations specializing in the formulation of video coding standards, namely the MPEG (Moving Picture Experts Group) under ISO/IEC and the VCEG (Video Coding Experts Group) of the ITU-T. The MPEG, founded in 1988, is specifically responsible for developing standards in the multimedia field, which are mainly used in storage, broadcast television, and streaming media on the Internet or wireless networks. The ITU (International Telecommunication Union) mainly formulates video coding standards for real-time video communications, such as video telephony and video conferencing. The AVS working group, set up by China in June 2002, is responsible for formulating digital audio and video coding standards for the domestic multimedia industry.
In 1992, the MPEG formulated the MPEG-1 standard (launched in 1988, a superset of ITU H.261) for VCD (Video Compact Disc) applications with a data rate of about 1.5 Mbps. In 1994, it released the MPEG-2 standard (launched in 1990) for applications such as DVD and digital video broadcasting, which is applicable to bit rates of 1.5-60 Mbps or even higher. In 1998, the MPEG formulated the MPEG-4 standard (launched in 1993, based on MPEG-2 and H.263) for low bit rate transmission. The ITU basically kept pace with the development of the MPEG and formulated a series of H.26x standards. The H.261 standard, which was started in 1984 and basically completed in 1989, was a precursor to the MPEG-1 standard and was mainly formulated for videophone and video conferencing over ISDN. On the basis of H.261, the ITU-T formulated the H.263 coding standard (launched in 1992) in 1996, and successively introduced H.263+, H.263++, etc.
In 2001, the ITU-T and the MPEG jointly established the JVT (Joint Video Team) working group to develop a new video coding standard. The first edition was completed in 2003. The standard is called Part 10 of the MPEG-4 standard (MPEG-4 Part 10 AVC) in the ISO, and the H.264 standard in the ITU. Four months later, the Microsoft-led VC-1 video coding standard was promulgated as an industry standard by the Society of Motion Picture and Television Engineers (SMPTE) of America. In 2004, a national standard with independent intellectual property rights was developed in China, and after industrialization verification such as chip implementation, it was promulgated in February 2006 as the national standard “Information Technology Advanced Audio and Video Coding Part II Video” (national standard number GB/T 20090.2-2006, usually referred to as the AVS video coding standard for short). These three standards are usually referred to as the second generation of video coding standards. Their coding efficiency is double that of the first generation, with a compression ratio of up to about 150 times, that is, a high-definition video (provided that the quality meets broadcast requirements) may be compressed to 10 Mbps or less.
In the first half of 2013, ITU-T H.265 and ISO/IEC HEVC (High Efficiency Video Coding) were promulgated as the third generation international video coding standard, with a coding efficiency double that of H.264. In parallel, China formulated the second generation AVS standard, AVS2, which is called “Information Technology Efficient Multimedia Coding”. Compared with the first generation AVS standard, the bit rate of AVS2 is reduced by more than 50%, which means that the coding efficiency is doubled. For scene video such as surveillance video, the compression efficiency of AVS2 is doubled again, reaching up to four times that of AVC/H.264, that is, a compression ratio of about 600 times.
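The compression ratios quoted for the successive generations can be related to bit rates with simple arithmetic. The sketch below assumes the uncompressed high-definition format used earlier in this section (1920×1080, 24 bits per pixel, 30 frames per second); the resulting 600-times figure of roughly 2.5 Mbps is derived here and is not stated in the text.

```python
# Raw data rate of the assumed high-definition format, in bits per second.
RAW_HD_BPS = 1920 * 1080 * 24 * 30


def compressed_mbps(raw_bps, compression_ratio):
    """Bit rate in Mbps after compressing raw_bps by compression_ratio."""
    return raw_bps / compression_ratio / 1e6


second_generation = compressed_mbps(RAW_HD_BPS, 150)  # about 10 Mbps
scene_video_avs2 = compressed_mbps(RAW_HD_BPS, 600)   # about 2.5 Mbps
print(round(second_generation, 2), round(scene_video_avs2, 2))
```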
Although modern video coding technology has achieved remarkable results and has been widely applied, and its compression efficiency has been “doubling every ten years”, it is still far from an ideal level. According to existing research reports, the global data volume reached 2.84 ZB in 2012 and is expected to rise to 40 ZB by 2020, doubling about every two years, with surveillance video accounting for 44% of the total. In other data, such as health data, transaction data, network media and video entertainment data, images and video also account for a large proportion. In China, more than 30 million cameras have been installed in public places, and these cameras have produced nearly 100 EB of video, which requires hundreds of billions of yuan in storage. Therefore, the technological progress of “doubling every ten years” in video coding efficiency falls far short of the “doubling every two years” growth of video big data, and how to improve video coding efficiency has become a major challenge of the information age.
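The gap between the two growth rates can be made concrete with a short calculation. The sketch below assumes simple exponential growth; the helper names are chosen for illustration only.

```python
import math


def doubling_time_years(start, end, years):
    """Doubling time implied by exponential growth from start to end."""
    return years * math.log(2) / math.log(end / start)


def growth_factor(years, doubling_period_years):
    """Multiplicative growth over `years` for a given doubling period."""
    return 2 ** (years / doubling_period_years)


# 2.84 ZB in 2012 growing to 40 ZB in 2020 implies a doubling time
# of roughly two years, consistent with the figure quoted above.
t = doubling_time_years(2.84, 40.0, 2020 - 2012)
print(round(t, 2))  # about 2.1

# Over ten years, data volume grows 32-fold while coding efficiency
# only doubles, so the compressed volume still grows 16-fold.
data_growth = growth_factor(10, 2)
efficiency_growth = growth_factor(10, 10)
print(data_growth / efficiency_growth)  # 16.0
```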
As mentioned above, the concept of video originates from the invention of film, and the technical scheme of representing a video as an image sequence is based on the persistence of vision in human sight. Film uses 24 frames per second and television uses 25 or 30 frames per second, which basically satisfies the human eye's need for a sense of continuity. This technical setting has become fixed as a convention through the wide application of film, television and personal camera equipment. However, the disadvantages of this method of representing dynamic images are also obvious. It cannot record high-speed movements such as a rotating wheel, a fast-moving table tennis ball or even a soccer ball. It also fails to capture the details of movement in video surveillance, and it cannot support scientific research, high-precision detection and other special requirements. New high-definition and ultra-high-definition televisions are also trying to increase the frame rate to 60 frames per second or even higher to better represent high-speed sports such as table tennis. However, such a frame rate still cannot represent faster-changing physical phenomena, which is why high-speed cameras have appeared. Their frame rate can reach 1000 frames per second, or even 10,000 frames per second or higher. The resulting problem is a large-scale growth in data volume, and the corresponding acquisition and processing circuits are expensive or even impossible to design. More importantly, an increase in the frame rate means that the exposure time of a single frame is reduced, so each captured frame is seriously underexposed. One way to compensate for this is to increase the pixel size, which reduces the spatial resolution. In the final analysis, all of these problems are caused by acquiring and representing video with the “space first, then time” equal-time-interval method. This method is only a technological choice made when film appeared, based on the persistence characteristics of human vision, and this does not mean that it is the best solution for representing dynamic images.
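The trade-off between frame rate, exposure time and data volume described above can be illustrated numerically. The sketch assumes that the exposure time of a frame cannot exceed the frame interval and reuses the 1920×1080, 24 bits per pixel format from earlier in this section; real high-speed cameras may use other resolutions and bit depths.

```python
def max_exposure_ms(frames_per_second):
    """Upper bound on single-frame exposure time, in milliseconds."""
    return 1000.0 / frames_per_second


def raw_bitrate_gbps(width, height, bits_per_pixel, frames_per_second):
    """Uncompressed data rate in Gbps."""
    return width * height * bits_per_pixel * frames_per_second / 1e9


# Each row: frame rate, maximum exposure (ms), raw data rate (Gbps).
for fps in (30, 1000, 10000):
    exposure = round(max_exposure_ms(fps), 3)
    rate = round(raw_bitrate_gbps(1920, 1080, 24, fps), 1)
    print(fps, exposure, rate)
```

Going from 30 to 1000 frames per second cuts the available exposure time from about 33 ms to 1 ms while raising the raw data rate from about 1.5 Gbps to about 50 Gbps.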
Therefore, there is an urgent need to develop an effective video coding method that takes temporal information and spatial information into account simultaneously.