The human visual system is capable of comprehending an incredible amount, and a remarkable variety, of visual information presented to it. However, most machine imaging and display systems capture and present us with only a limited two dimensional (2D) window into the real three dimensional (3D) world. In certain specialized applications in defense, medicine, entertainment and others, several attempts have been made to introduce more sophisticated display systems that can portray visual information as it appears from various viewpoints. However, such applications essentially use analog signals, and are mainly standalone systems, rarely needing to interoperate or exchange data with others. Moreover, in many such applications, either graphics or more sophisticated computer generated animations are used, and multi-viewpoint video is not as prevalent. Recently there has been an emergence of more demanding applications in the area of virtual space conferencing, video games and multimedia presentations where not only more sophisticated displays are necessary but networked communication of compressed digital video data may also be necessary.
Multi-view video has potential applications in education, training, 3D movies/entertainment, medical surgery, video conferencing, virtual travel and shopping, multimedia presentations, video games, immersive virtual reality experiences, and others. Although the potential applications of multi-view video are many, several challenges must be overcome before its potential can be truly harnessed and it can become widespread. For example, most practical means of displaying stereoscopic video currently require viewers to use specialized viewing glasses. Although some displays not requiring specialized viewing glasses (autostereoscopic systems) are available, they impose other restrictions, e.g., viewing zones and discreteness of views, and may typically require between 10 and 20 views for realism. Stereoscopic displays, on the other hand, require the use of specialized viewing glasses, but can impart the perception of depth in a scene.
There are other types of 3D displays that can also present a more realistic portrayal of the scene. However, they are sometimes cumbersome and impose other restrictions. For example, it is well known that increasing resolution can impart an increased perception of realism. Thus HDTV systems can be used to enhance depth perception. The development of HDTV has necessitated advances in compact high resolution displays, high speed circuits for compression and decompression, etc. A different technique to impart a high degree of realism is related to employing multiple viewpoints of a scene and immersing the viewer in the scene by virtue of a multiview display arrangement. These multiple viewpoints need to be imaged via suitable arrangements of cameras, encoded, decoded and displayed on a suitable arrangement of displays. Sometimes not all the displayed views need to be transmitted, stored or manipulated, and only some representative views are used. Regardless, the complexity of multiview processing can be substantially high and represents a stumbling block to widespread use of such systems.
With the need for networked visual communication and the need to interchange data across a variety of applications, the need for video coding standards has arisen. Of particular relevance is the second phase of the ISO Moving Picture Experts Group (MPEG-2) video coding standard, which, though only recently completed, is well recognized to offer a good solution to a large variety of applications requiring digital video, including broadcast TV via satellite, cable TV, HDTV, digital VCRs, multipoint video and others. It is desirable to have compatible solutions that can be used for coding of multiview video with low complexity, while providing interoperability with other applications. Towards that end, the video coding algorithm tools of the MPEG-2 video standard are of significant interest due to a large expected installed base of MPEG-2 video decoders with which interoperability of multi-view applications may be necessary.
While efficient digital compression of multiview video is important, the capture of different views and the display of these views are closely related to it, since there should be a significant correlation between views for high coding efficiency. While it is possible to encode each of the views of multi-view video separately (simulcast), it is envisaged that combined coding of these views would be more efficient. Joint coding of multiview video can be achieved by two basic approaches. The first approach results in compatibility with normal video in the sense that one (or more) views may be decoded for normal video display, while all other views could be decoded for a truly multi-view display. The second approach involves joint coding without regard to compatibility and may presumably lead to higher compression but sacrifices interoperability. We adopt the first approach and, although our invention can be used with other coding approaches as well, we use MPEG-2 based video coding to illustrate our technique.
Both the single layer (nonscalable) video coding framework and the layered (scalable) video coding framework of MPEG-2 video coding are exploited and extended by the present invention. Nonscalable video coding in MPEG-2 involves motion-compensated DCT coding of frame or field pictures and is dealt with in detail in Test Model Editing Committee, "MPEG-2 Video Test Model 5," ISO/IEC JTC1/SC29/WG11 Doc. N0400, April 1993; A. Puri, "Video Coding Using the MPEG-2 Compression Standard," Proceedings of SPIE Visual Communications and Image Processing, Boston, Mass., November 1993, pp. 1701-1713; Video Draft Editing Committee, "Generic Coding of Moving Pictures and Associated Audio," Recommendation H.262, ISO/IEC 13818-2, International Standard for Video, Singapore, November 1994; and elsewhere. Among the scalable video coding schemes, our scheme is most closely related to temporal scalability, which we have also used earlier as a basis for compression of stereoscopic video; see A. Puri, R. V. Kollarits and B. G. Haskell, "Stereoscopic Video Compression Using Temporal Scalability," Proceedings of SPIE Visual Communications and Image Processing, Taipei, Taiwan, May 1995.
Temporal scalability involves coding of video as two layers in time, such that the first layer, called the base layer, can be decoded independently of the second layer, called the enhancement layer. A premise behind temporal scalability is to provide a base video stream or layer that has a particular frame rate, which is the baseline or minimum that must be supported to properly display the video. The base layer can be coded with any coder, such as a block-based motion compensated DCT coder of MPEG-1 or nonscalable MPEG-2. To improve the video quality, an "enhancement" video stream is provided to carry intermediate frames so that more sophisticated viewers can display the video at a higher frame rate, i.e., the displayed frame rate can be temporally scaled to the capabilities of the display system. The enhancement layer also uses the motion compensated DCT structure but makes temporal predictions from images in the base layer. Since there are no explicit restrictions on which coders to employ in the base and enhancement layers, other than the use of temporal prediction between layers, the underlying framework of temporal scalability is exploited and extended in our invention.
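The layering described above can be illustrated with a minimal Python sketch. It is not part of the MPEG-2 syntax and is not the coder of this invention; it only shows one possible assignment of frames to layers. The even/odd split, the choice of the nearest base-layer frames as prediction references, and the function names split_temporal_layers and enhancement_references are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def split_temporal_layers(num_frames: int) -> Tuple[List[int], List[int]]:
    """Assign even-indexed frames to the base layer and odd-indexed
    frames to the enhancement layer (an assumed 2:1 split)."""
    base = [i for i in range(num_frames) if i % 2 == 0]
    enhancement = [i for i in range(num_frames) if i % 2 == 1]
    return base, enhancement


def enhancement_references(frame: int, base: List[int]) -> List[int]:
    """Return the nearest base-layer frames before and after an
    enhancement frame; these serve as its temporal prediction references."""
    earlier = [b for b in base if b < frame]
    later = [b for b in base if b > frame]
    refs = []
    if earlier:
        refs.append(max(earlier))
    if later:
        refs.append(min(later))
    return refs


base, enhancement = split_temporal_layers(8)
refs: Dict[int, List[int]] = {
    f: enhancement_references(f, base) for f in enhancement
}
print(base)         # [0, 2, 4, 6]
print(enhancement)  # [1, 3, 5, 7]
print(refs[1])      # [0, 2]
print(refs[7])      # [6]
```

A decoder that handles only the base layer would display frames 0, 2, 4 and 6 at half the full frame rate, while a decoder that also handles the enhancement layer would interleave frames 1, 3, 5 and 7 to restore the full rate.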
Currently, MPEG has set up an ad hoc group to investigate how MPEG-2 can be applied to the coding of multiview video, building on the existing framework of temporal scalability; see Requirement Subgroup, "Status Report on the Study of Multi-view Profile", ISO/IEC JTC1/SC29/WG11 Doc. N0906, March 1995. However, it is also deemed necessary to come up with a practical realization of encoding/decoding that is acceptable in terms of complexity and does not require brand new types of coding/decoding hardware. It is expected that MPEG will specify interoperability between its nonscalable (Main Profile) video coding and the multi-view coding approach that it adopts. This interoperability is usually specified in terms of compliance points which various MPEG compatible systems must meet. The compliance points correspond to a discrete grid of allowed Profile and Level combinations. Profiles limit the syntax and semantics (i.e., algorithms) that MPEG compliant systems must be capable of processing. Levels limit coding parameters such as frame sample rates, frame dimensions, and coded bit rates, thus restricting the values certain parameters are allowed to take in the bitstream. For example, the MPEG-2 video "Main" Profile and "Main" Level were chosen to normalize complexity within feasible limits of 1994 VLSI technology (0.5 micron), yet still meet the needs of the majority of applications. Thus, profiles and levels implicitly specify the expected decoding complexity; for instance, in an anticipated Multi-view profile, a specification of the number of views to be decoded is necessary. There is considerable debate on how many views to support in a Multi-view profile, because decoding complexity rises significantly as more views are supported. We propose to circumvent the problem by allowing only two views, each called a super-view. However, each of the two super-views can include multiple low resolution views.
Thus a decoder compliant with, for example, Main profile at Main level decodes only with a complexity typical of two-layer temporal scalability. However, if higher resolution for these views is necessary, an encoder can choose to code at Main profile at High level, with which decoders compliant with Main profile at Main level do not need to be interoperable.
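The role of levels as upper bounds on coding parameters can be sketched as a simple lookup, shown here in Python. The limits below are the commonly quoted Main Profile bounds for Main and High level and should be verified against ISO/IEC 13818-2 before being relied upon; the check is also simplified, since it omits other constraints such as the luminance sample rate. The names LevelLimits, fits_level and lowest_level_for, and the example picture sizes and bit rates, are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LevelLimits:
    max_width: int            # samples per line
    max_height: int           # lines per frame
    max_frame_rate: float     # frames per second
    max_bit_rate_mbps: float  # coded bit rate, Mbit/s


# Commonly quoted Main Profile bounds (assumption: verify against the standard).
LEVELS = {
    "Main": LevelLimits(720, 576, 30.0, 15.0),
    "High": LevelLimits(1920, 1152, 60.0, 80.0),
}
LEVEL_ORDER = ["Main", "High"]  # lowest complexity first


def fits_level(width: int, height: int, frame_rate: float,
               bit_rate_mbps: float, level: str) -> bool:
    """True if every parameter is within the bounds of the given level."""
    lim = LEVELS[level]
    return (width <= lim.max_width
            and height <= lim.max_height
            and frame_rate <= lim.max_frame_rate
            and bit_rate_mbps <= lim.max_bit_rate_mbps)


def lowest_level_for(width: int, height: int, frame_rate: float,
                     bit_rate_mbps: float) -> Optional[str]:
    """Return the lowest level that accommodates the parameters, if any."""
    for level in LEVEL_ORDER:
        if fits_level(width, height, frame_rate, bit_rate_mbps, level):
            return level
    return None


# A super-view holding four 352x240 views in a 2x2 grid is 704x480.
print(lowest_level_for(704, 480, 30.0, 6.0))    # Main
# Doubling the resolution of each view pushes the super-view to High level.
print(lowest_level_for(1408, 960, 30.0, 20.0))  # High
```

Under these assumed figures, a super-view made of low resolution views stays within Main level, while the same arrangement at higher resolution requires High level, which is the trade-off described above.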
The proposed technique of this invention provides a framework and a mapping that is necessary to address the problem of encoding/decoding of multiple views. Multi-view video comprising two or more image signals is mapped into a pair of "super-views" and coded in a manner similar to our proposals for high-compression coding of stereoscopic video (B. G. Haskell, R. V. Kollarits and A. Puri, "Digital Stereoscopic Video Compression Technique Utilizing Two Disparity Estimates," U.S. patent application Ser. No. 08/452,464 filed May 26, 1995, now U.S. Pat. No. 5,612,735, and B. G. Haskell, R. V. Kollarits and A. Puri, "Digital Stereoscopic Video Compression Technique Utilizing One Disparity and One Motion Estimates," U.S. patent application Ser. No. 08/452,463 filed May 26, 1995, now U.S. Pat. No. 5,619,256, both of which are incorporated by reference). Considerable attention is given to camera arrangements, as they affect the exploitation of redundancy between different views. The encoder determines the mapping for spatial multiplexing of a number of views into two super-views so that the coding technique can maximally exploit correlations. Coding can be done using MPEG-2 or other coding schemes. In addition, the next phase of MPEG (MPEG-4) also intends to support coding of multi-view video, and thus the principles of this invention are also applicable there, as long as a spatial multiplexing arrangement of many views is employed in coding.
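The idea of spatially multiplexing several views into two super-views can be illustrated with a minimal Python sketch. It assumes four equally sized views, a fixed grouping of views 0 and 1 into one super-view and views 2 and 3 into the other, and a side-by-side layout. These choices are for illustration only; in the technique described here the encoder selects the mapping so as to exploit correlations between views. The names tile_side_by_side, map_to_super_views and demultiplex are hypothetical.

```python
from typing import List, Tuple

Image = List[List[int]]  # rows of pixel values


def tile_side_by_side(left: Image, right: Image) -> Image:
    """Concatenate two views of equal height row by row."""
    if len(left) != len(right):
        raise ValueError("views must have the same number of rows")
    return [l_row + r_row for l_row, r_row in zip(left, right)]


def map_to_super_views(views: List[Image]) -> Tuple[Image, Image]:
    """Multiplex four views into two super-views (assumed fixed grouping)."""
    if len(views) != 4:
        raise ValueError("this sketch expects exactly four views")
    return (tile_side_by_side(views[0], views[1]),
            tile_side_by_side(views[2], views[3]))


def demultiplex(super_view: Image, view_width: int) -> Tuple[Image, Image]:
    """Recover the two constituent views from a side-by-side super-view."""
    left = [row[:view_width] for row in super_view]
    right = [row[view_width:] for row in super_view]
    return left, right


# Four tiny 2x3 views, each filled with its own view number.
views = [[[v] * 3 for _ in range(2)] for v in (1, 2, 3, 4)]
super_a, super_b = map_to_super_views(views)
print(super_a)  # [[1, 1, 1, 2, 2, 2], [1, 1, 1, 2, 2, 2]]
print(demultiplex(super_b, 3) == (views[2], views[3]))  # True
```

The two super-views can then be handed to a two-layer coder, one as the base layer and the other as the enhancement layer, and the decoder can apply the inverse mapping to recover the individual views for display.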