Graphics applications based on three-dimensional scenes have become prevalent in many fields, and in computer graphics applications in particular. In order to support fast three-dimensional graphics processing, a number of standards and specifications have been developed. This not only provides faster design and implementation, as it may provide standardized functions and routines for many standard operations, such as view point shifting, but also allows for dedicated hardware graphic engines to be developed and optimized for these routines. Indeed, for many computers, the Graphics Processing Unit (GPU) may nowadays often be at least as powerful and important as the Central Processing Unit (CPU).
An example of a standard for supporting fast graphics processing is the OpenGL specification, which provides an Application Programming Interface (API) with a number of functions supporting graphics processing. The specification is typically used to provide hardware accelerated graphics processing, with the specific routines being implemented by dedicated accelerated hardware in the form of a GPU.
In most such graphic specifications, the representation of the scene is by a combination of a texture map and a three-dimensional mesh. Indeed, a particularly effective approach in many scenarios is to represent image objects, or indeed the scene as a whole, by a polygon mesh where a set of polygons are connected by their common edges or corners (vertices), which are given by three-dimensional positions. The combined three-dimensional polygon mesh accordingly provides an effective model of three-dimensional objects, including possibly a three-dimensional description of an entire image. The polygon mesh is often a triangle mesh formed by triangles having common corners given in 3D space.
As an example, a stereo camera may record an image of a scene from a given view point. For each pixel, a disparity estimation may be performed to estimate the distance to the object represented by the pixel. This may be performed for each pixel thereby providing a three-dimensional position of x,y,z for each pixel. These positions may then be used as vertices for a triangle mesh with two triangles being formed for each group of 2×2 pixels. As this may result in a large number of triangles, the process may include combining some initial triangles into larger triangles (or in some scenarios more generally into larger polygons). This will reduce the number of triangles but also decrease the spatial resolution of the mesh. Accordingly, it is typically dependent on the depth variations and predominantly done in flatter areas.
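As an illustrative sketch only, the construction of such a mesh from per-pixel depth may be expressed as follows. The sketch assumes a rectified stereo pair, for which depth may be obtained from disparity as z = f·B/d, and a pinhole camera with the principal point at the image center. The function names and the parameters `focal_px`, `baseline` and `focal` are hypothetical and are not taken from any particular library. The merging of triangles in flatter areas is not shown.

```python
def depth_from_disparity(disparity, focal_px, baseline):
    """Depth of a pixel from its stereo disparity (assumed rectified pair)."""
    return focal_px * baseline / disparity


def grid_mesh(depth, focal=1.0):
    """Build a triangle mesh from a depth map given as a list of rows.

    Returns (vertices, triangles): vertices are (x, y, z) tuples, one per
    pixel, and triangles are index triples into the vertex list.
    """
    rows, cols = len(depth), len(depth[0])
    # Assumption: principal point at the center of the image.
    cx, cy = (cols - 1) / 2.0, (rows - 1) / 2.0

    vertices = []
    for r in range(rows):
        for c in range(cols):
            z = depth[r][c]
            # Pinhole back-projection of pixel (c, r) at depth z.
            vertices.append(((c - cx) * z / focal, (r - cy) * z / focal, z))

    triangles = []
    for r in range(rows - 1):
        for c in range(cols - 1):
            i = r * cols + c  # top-left pixel of the 2x2 group
            # Two triangles per group of 2x2 pixels.
            triangles.append((i, i + cols, i + 1))
            triangles.append((i + 1, i + cols, i + cols + 1))
    return vertices, triangles
```

For a depth map of R rows and C columns, this yields R·C vertices and 2·(R−1)·(C−1) triangles, which indicates why combining triangles in flat regions may be attractive.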
Each vertex is further associated with a light intensity value of the texture map. The texture map essentially provides the light/color intensity in the scene for the object at the pixel position for the vertex. Typically, a light intensity image/texture map is provided together with the mesh with each vertex containing data representing the x, y, z position of the vertex and u,v data identifying a linked position in the texture map, i.e. it points to the light intensity at the x, y, z position as captured in the texture map.
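A minimal sketch of such a vertex record and of the texture lookup it enables is given below. It assumes u,v coordinates normalized to the range [0, 1] and nearest-neighbour sampling; graphics hardware commonly also offers filtered sampling, which is not shown. The names `Vertex` and `sample_texture` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Vertex:
    x: float  # three-dimensional position
    y: float
    z: float
    u: float  # horizontal texture coordinate, assumed in [0, 1]
    v: float  # vertical texture coordinate, assumed in [0, 1]


def sample_texture(texture, u, v):
    """Nearest-neighbour lookup in a texture map given as a list of rows."""
    rows, cols = len(texture), len(texture[0])
    # Map normalized coordinates to the nearest texel and clamp to the map.
    col = min(cols - 1, max(0, round(u * (cols - 1))))
    row = min(rows - 1, max(0, round(v * (rows - 1))))
    return texture[row][col]


def vertex_color(vertex, texture):
    """Light intensity/color linked to a vertex through its u,v data."""
    return sample_texture(texture, vertex.u, vertex.v)
```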
In such representations, the polygon mesh is used to provide information of the three-dimensional geometry of the objects whereas the texture is typically provided as a separate data structure. Specifically, the texture is often provided as a separate two-dimensional map which by the processing algorithm can be overlaid on the three-dimensional geometry.
The use of triangle meshes is particularly suitable for processing and manipulation by computer graphics algorithms, and many efficient software and hardware solutions have been developed and are available in the market. A substantial computational efficiency is in many of the systems achieved by the algorithm processing the individual vertices commonly for a plurality of polygons rather than processing each polygon separately. For example, for a typical triangle mesh, the individual vertex is often common to several (often 3-8) triangles. The processing of a single vertex may accordingly be applicable to a relatively high number of triangles thereby substantially reducing the number of points in an image or other object that is being processed.
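The saving from processing shared vertices once may be illustrated by the following sketch, which counts how many triangles reference each vertex of an indexed mesh. The regular grid triangulation used here is an assumption made purely for illustration, and the function names are hypothetical.

```python
from collections import Counter


def grid_triangles(rows, cols):
    """Index triples for a regular grid, two triangles per 2x2 group."""
    triangles = []
    for r in range(rows - 1):
        for c in range(cols - 1):
            i = r * cols + c
            triangles.append((i, i + cols, i + 1))
            triangles.append((i + 1, i + cols, i + cols + 1))
    return triangles


def sharing_stats(triangles):
    """Return (uses, corner_count, unique_count) for an indexed mesh.

    uses maps each vertex index to the number of triangles referencing it;
    corner_count is the work if every triangle were processed separately;
    unique_count is the work if each shared vertex is processed once.
    """
    uses = Counter(i for tri in triangles for i in tri)
    return uses, 3 * len(triangles), len(uses)
```

For a 3×3 grid, the eight triangles have 24 corners but only 9 distinct vertices, and the interior vertex is shared by six triangles.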
As a specific example, many current Systems on Chip (SoCs) contain a GPU which is highly optimized for processing of 3D graphics. For instance, the processing of 3D object geometry and 3D object texture is done using two largely separate paths in the so-called OpenGL rendering pipeline (or in many other APIs such as DirectX). The hardware of GPUs on SoCs can deal efficiently with 3D graphics as long as the 3D source is presented to the GPU in the form of vertices (typically of triangles) and textures. The OpenGL application interface then allows setting and control of a virtual perspective camera that determines how 3D objects appear as projected on the 2D screen. Although OpenGL uses 3D objects as input, the output is typically a 2D image suitable for a normal 2D display.
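The effect of such a virtual perspective camera may be sketched with a simple pinhole model, as below. This is not OpenGL code and does not reproduce any OpenGL function. The function `project` and the parameters `camera_pos`, `focal`, `cx` and `cy` are hypothetical, and the camera is assumed to look along the positive z axis without rotation.

```python
def project(point, camera_pos, focal, cx, cy):
    """Project a 3D point onto the 2D image of a translated pinhole camera.

    Returns (px, py) in pixel units, or None if the point lies behind
    (or in the plane of) the camera.
    """
    # Express the point relative to the camera position.
    x = point[0] - camera_pos[0]
    y = point[1] - camera_pos[1]
    z = point[2] - camera_pos[2]
    if z <= 0:
        return None
    # Perspective division: image position scales inversely with depth.
    return (focal * x / z + cx, focal * y / z + cy)
```

Moving `camera_pos` changes where the same 3D point lands in the 2D output, which corresponds to the view point shift discussed in this document.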
However, such approaches require the three-dimensional information to be provided by a polygon mesh and associated texture information. Whereas this may be relatively easy to provide in some applications, such as e.g. games based on fully computer generated virtual scenes and environments, it may be less easy in other applications. In particular, in applications that are based on capturing real scenes, it requires that these are converted into a texture and mesh representation. This may, as previously mentioned, be based on stereo images or on an image and depth representation of the scene. However, although a number of approaches for performing such a conversion are known, it is not trivial and poses a number of complex problems and challenges.
A common operation in graphics processing is view point changes where an image is generated for a different view point than that of the input texture map and mesh. Graphic APIs typically have functions for very efficiently performing such view point transformations. However, as the input mesh typically is not perfect, such view point transformations may result in quality degradation if the shift is too significant. Further, a representation of a scene from a view point will typically include a number of occluded elements where a foreground object occludes elements behind it. These elements may be visible from the new direction, i.e. the view point change may result in de-occlusion. However, the input texture map and mesh will in such a case not comprise any information for these de-occluded parts. Accordingly, they cannot be optimally represented as the required information is not available.
For these reasons, view point transformation is often based on a plurality of texture maps and meshes corresponding to different view directions. Indeed, in order to synthesize a new (unseen) viewpoint, it is typically preferred or even necessary to combine multiple captured meshes with associated camera images (textures) from the different view-points. The main reason for combining data from different view-points is to recover objects that are hidden (occluded) in one view but visible in another view. This problem is often referred to as view-point interpolation.
However, conventional approaches for this still tend to be suboptimal.
For example, one approach for generating a new view-point is to transform the meshes originating from the different view-points to a single world coordinate system and then perform a perspective projection onto a new camera plane. These steps can be done in standard graphics hardware. However, this will typically not correctly show hidden surfaces. Specifically, graphics hardware uses depth testing to select the front-most point when points are combined at a single pixel. This approach is used to address self-occlusion where the view point shifting may result in image objects moving relative to each other such that new occlusions occur, i.e. at the new view point there may be an occlusion for two points that are not occluded from the original view point. However, when applied to different images this may result in errors or degradations. Indeed, since the depth is typically linearly interpolated such that it extends beyond foreground objects (like a halo effect), the front-most point will often correspond to areas that may be occluded due to being next to a foreground object.
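The steps described above may be sketched as follows. To keep the sketch short, individual points are drawn instead of rasterized triangles, and each camera is assumed to differ from the world coordinate system by a translation only. The function `render_points` and its parameters are hypothetical. The depth test keeps the front-most point per pixel regardless of which view it originates from, which is the behaviour that may lead to the degradations described above when the depth of a view is inaccurate.

```python
def render_points(views, new_cam_pos, focal, width, height):
    """Merge points from several views into one image using a depth test.

    views: list of (camera_position, points), where points is a list of
    ((x, y, z), color) given in that camera's own coordinates.
    Returns (image, depth) as lists of rows; empty pixels hold None / inf.
    """
    depth = [[float("inf")] * width for _ in range(height)]
    image = [[None] * width for _ in range(height)]
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0

    for cam_pos, points in views:
        for (x, y, z), color in points:
            # Source camera coordinates -> single world coordinate system.
            wx, wy, wz = x + cam_pos[0], y + cam_pos[1], z + cam_pos[2]
            # World coordinates -> coordinates of the new camera.
            nx = wx - new_cam_pos[0]
            ny = wy - new_cam_pos[1]
            nz = wz - new_cam_pos[2]
            if nz <= 0:
                continue  # behind the new camera
            # Perspective projection onto the new camera plane.
            px = round(focal * nx / nz + cx)
            py = round(focal * ny / nz + cy)
            # Depth test: keep only the front-most point at each pixel.
            if 0 <= px < width and 0 <= py < height and nz < depth[py][px]:
                depth[py][px] = nz
                image[py][px] = color
    return image, depth
```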
An example of a technique for view-interpolation based on depth images is provided in C. L. Zitnick et al., “High-quality video view interpolation using a layered representation”, SIGGRAPH '04 ACM SIGGRAPH 2004, pp. 600-608. To achieve high quality, the technique uses a two-layer representation consisting of a main layer and a boundary layer (around depth transitions). These are constructed using alpha matting (accounting for transparency) and both are warped (and mixed with other views) during the render process. A drawback of this approach is the need to disconnect the mesh to generate the two-layer representation. This process needs to select a threshold for the depth map and erase triangles of the corresponding mesh at depth discontinuities. This is not desirable since using thresholds can potentially decrease temporal stability in the rendering.
Hence, an improved approach for generating images for a different view point would be advantageous. In particular, an approach that allows increased flexibility, increased accuracy, reduced complexity, improved computational efficiency, improved compatibility with existing graphic processing approaches, improved image quality, improved de-occlusion performance, and/or improved performance would be advantageous.
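The threshold-based disconnection of a mesh may be sketched as below. This is not the implementation of the cited technique; it is a simplified illustration that assumes a triangle is erased when the depth range across its three vertices exceeds a threshold. The function name and the `threshold` parameter are hypothetical.

```python
def cut_at_discontinuities(vertices, triangles, threshold):
    """Erase triangles that span a depth discontinuity.

    vertices: list of (x, y, z) tuples; triangles: list of index triples.
    A triangle is kept only if the depth range over its three vertices
    does not exceed the threshold.
    """
    kept = []
    for tri in triangles:
        zs = [vertices[i][2] for i in tri]
        if max(zs) - min(zs) <= threshold:
            kept.append(tri)
    return kept
```

Because the decision is binary, a small frame-to-frame change in estimated depth near the threshold may cause a triangle to appear in one frame and vanish in the next, which illustrates the temporal stability concern noted above.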