1. Field of the Invention
The invention relates to methods and apparatus for multimedia applications, in particular multimedia applications tuned for efficient execution on a graphics processing unit (GPU), on any other “control-limited” architecture, or even on an application-specific integrated circuit (ASIC).
2. Description of the Related Technology
Over the past years, much research effort has been devoted to interactively synthesizing any desired novel view between relatively sparse camera viewpoints using various image-based rendering (IBR) techniques [1]. Among them, rendering a scene with associated depth maps is often favored for a practical system, because this image synthesis paradigm can yield high-quality intermediate images [2, 3] and supports real-time 3D scene acquisition and view synthesis [4]. However, a real-time IBR system that can synthesize any intermediate view at visually plausible quality has not yet been reported. For instance, the advanced disparity estimation methods [2, 3] run entirely offline, while the simple real-time correlation approach [4] severely compromises the synthesized image quality.
Stereo matching, an important early vision topic, has attracted intensive research interest for decades. A substantial amount of work has been done on stereo correspondence, which has been systematically surveyed and evaluated by Scharstein and Szeliski [5]. In general, casting a stereo problem as a global optimization problem usually leads to high-quality disparity estimation results, but most of these global techniques are too computationally expensive for online processing. Real-time stereo applications today still largely rely on local stereo methods together with a winner-takes-all (WTA) strategy.
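By way of illustration only, a local stereo method with a WTA strategy can be sketched as follows. The sketch assumes rectified grayscale images stored as nested lists, an absolute-difference matching cost, and a fixed square support window with clamped borders; the function names `box_aggregate` and `wta_disparity` are hypothetical and are not taken from any cited work.

```python
def box_aggregate(cost, radius):
    """Sum a 2D cost slice over a (2*radius+1)^2 window, clamping at the borders."""
    h, w = len(cost), len(cost[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = 0.0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    total += cost[yy][xx]
            out[y][x] = total
    return out


def wta_disparity(left, right, max_disp, radius=1):
    """Return, per pixel, the disparity with the lowest aggregated cost."""
    h, w = len(left), len(left[0])
    best_cost = [[float("inf")] * w for _ in range(h)]
    disparity = [[0] * w for _ in range(h)]
    for d in range(max_disp + 1):
        # Assumption: left pixel x corresponds to right pixel x - d.
        raw = [[abs(left[y][x] - right[y][max(x - d, 0)]) for x in range(w)]
               for y in range(h)]
        agg = box_aggregate(raw, radius)
        for y in range(h):
            for x in range(w):
                # Winner-takes-all: keep the first disparity with the lowest cost.
                if agg[y][x] < best_cost[y][x]:
                    best_cost[y][x] = agg[y][x]
                    disparity[y][x] = d
    return disparity
```

Each pixel's decision depends only on its own window, which is what makes such methods attractive for real-time use; the quality then hinges on how the support window is chosen.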
Typically, local stereo approaches aggregate the matching cost over a given support window to increase robustness to image noise and to insufficient (or repetitive) texture. The well-known challenge for area-based local stereo methods is that a local support window should be large enough to include enough intensity variation for reliable matching, yet small enough to avoid disparity variation inside the window. Therefore, to obtain accurate disparity results at depth discontinuities as well as in homogeneous regions, an appropriate support window should be decided adaptively for each pixel. Among the previous local stereo methods, Fusiello et al. [6] performed the correlation with nine square windows anchored at different points and retained the disparity with the smallest matching cost. However, this method and its generalization, i.e., shiftable windows [5], fail to produce good disparity estimation results for different image regions, because their windows are of fixed size and constant shape. Veksler [7] instead found a useful range of window sizes and shapes to explore while evaluating the window cost, which works well for comparing windows of different sizes. However, this variable window approach cannot yet achieve real-time speed, due to the large number of candidate support windows and the expensive matching-cost update for each pixel. Recently, Yoon and Kweon [8] proposed a state-of-the-art local window method, albeit at a very demanding computational cost, in which pixel-wise adaptive support-weights are defined using Laplacian kernels to model the unequal importance of each support pixel.
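The idea of pixel-wise adaptive support-weights can be illustrated with the following minimal sketch. It is a simplification under stated assumptions rather than the method of [8]: the weight is taken as a Laplacian (exponential) kernel of the intensity difference and the spatial distance to the window center, only the reference image is used, and the names `support_weight` and `weighted_cost` as well as the constants `gamma_c` and `gamma_p` are hypothetical.

```python
import math


def support_weight(center, neighbor, dy, dx, gamma_c=10.0, gamma_p=5.0):
    """Weight of a support pixel: large when it is similar and close to the center."""
    delta_c = abs(center - neighbor)   # intensity dissimilarity
    delta_g = math.hypot(dy, dx)       # spatial distance in pixels
    return math.exp(-(delta_c / gamma_c + delta_g / gamma_p))


def weighted_cost(image, raw_cost, y, x, radius):
    """Weighted average of raw matching costs around (y, x)."""
    h, w = len(image), len(image[0])
    num = 0.0
    den = 0.0
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            yy, xx = y + dy, x + dx
            if 0 <= yy < h and 0 <= xx < w:   # skip pixels outside the image
                wgt = support_weight(image[y][x], image[yy][xx], dy, dx)
                num += wgt * raw_cost[yy][xx]
                den += wgt
    return num / den
```

Because every pixel evaluates its own set of weights, the cost per pixel grows with the window area and cannot be shared between neighboring pixels, which is consistent with the demanding computational cost noted above.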
To meet the requirement of real-time stereo estimation, resorting to local stereo methods alone offers no guarantee. In fact, only recently have software-only real-time stereo systems begun to emerge. These exploit assembly-level instruction optimization using Intel's MMX extension, but leave few CPU cycles for other tasks, including high-level interpretation of the stereo results. Harnessing some powerful built-in functionalities of the modern graphics processing unit (GPU), Yang et al. first proposed a pyramid-shaped correlation kernel [9] and small-scale adaptive support windows [10]. GPUs are “streaming architectures” through which data should flow regularly with as little control as possible, and “simple” operations (e.g. box filtering) are recommended to fully benefit from the GPU's performance. Having “variable” parameters that constantly change is not recommended on such architectures, because it causes a severe performance penalty. Though very impressive disparity estimation throughput was obtained on GPUs, the techniques of Yang et al. [9, 10] cannot strike an optimal quality balance between homogeneous and heterogeneous regions. Later, Gong and Yang [11] proposed an image-gradient-guided correlation method with improved accuracy, while still maintaining real-time speed on GPUs. Wang et al. [12] recently introduced a computationally expensive cost aggregation scheme in a dynamic-programming stereo framework, and obtained accurate disparity estimation results. However, their real-time speed is only achieved on relatively low-resolution stereo images with very limited disparity levels, while computational resources of both the CPU and the GPU are consumed.
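To illustrate why box filtering counts as a “simple” operation in this sense, the following sketch computes box sums with prefix sums, applied first along rows and then along columns. It is an assumed, CPU-side illustration only (not GPU code); the names `box_filter_1d` and `box_filter_2d` are hypothetical, and border pixels use a window clipped to the image.

```python
def box_filter_1d(row, radius):
    """Box sums of a 1D sequence; every output costs one subtraction."""
    n = len(row)
    prefix = [0.0]
    for value in row:
        prefix.append(prefix[-1] + value)
    # Window [i - radius, i + radius], clipped to the valid index range.
    return [prefix[min(i + radius + 1, n)] - prefix[max(i - radius, 0)]
            for i in range(n)]


def box_filter_2d(image, radius):
    """Separable box filter: filter all rows, then all columns."""
    rows = [box_filter_1d(r, radius) for r in image]
    cols = [box_filter_1d(list(c), radius) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]
```

Every pixel follows the same sequence of operations with the same fixed `radius`, so the work per pixel does not depend on the image content or on the window size. A window whose size or shape changes per pixel breaks this regularity.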
“Algorithmic” prior art describes concepts like “Variable Window Size” and “Variable Window Shape” for best marrying “edge preserving” and “uniform region” processing in stereo matching. These approaches rely on very complex multi-scale and bilateral filtering schemes, which leads to very GPU-unfriendly processing (e.g. dynamic programming that is so control-intensive that GPUs, other “control-limited” architectures, and even ASICs fail to achieve good performance).