Depth estimation is an algorithmic step in a variety of applications such as autonomous navigation, robot and driving systems [1], 3D geographic information systems [2], object detection and tracking [3], medical imaging [4], computer games and advanced graphic applications [5], 3D holography [6], 3D television [7], multiview coding for stereoscopic video compression [8], and disparity-based rendering [9]. These applications require depth estimation that is both highly accurate and fast.
Depth estimation can be performed using three main techniques: a time-of-flight (TOF) camera, a LIDAR sensor, or a stereo camera. A TOF camera measures the distance between the object and the camera directly with a sensor, circumventing the need for intricate digital image processing hardware [10]. However, it does not perform well when the distance between the object and the camera is large. Moreover, the resolution of TOF cameras is usually very low (200×200) [10] compared to the Full HD display standard (1920×1080). Furthermore, their commercial price is much higher than that of CMOS and CCD cameras. LIDAR sensors compute depth using laser scanning mechanisms, but they are also very expensive compared to CMOS and CCD cameras. Because of the laser scanning hardware, LIDAR sensors are heavy and bulky devices; therefore, they can be used mainly for static images. Consequently, in order to compute a depth map, the majority of research focuses on extracting disparity information from two or more synchronized images taken from different viewpoints using CMOS or CCD cameras [11].
Many Disparity Estimation (DE) algorithms have been developed with the goal of providing high-quality disparity results. These are ranked with respect to their performance on the Middlebury benchmarks [11]. Although the top-performing algorithms provide impressive visual and quantitative results [12]-[14], implementing them for real-time High Resolution (HR) stereo video is challenging because of their complex multi-step refinement processes or their global processing requirements, which demand large memory size and bandwidth. For example, the AD-Census algorithm [12], currently the top published performer, provides results that are very close to the ground truth. However, this algorithm consists of multiple disparity enhancement sub-algorithms, and implementing them in a mid-range FPGA is very challenging in terms of both hardware resources and memory limitations.
Various hardware architectures presented in the literature provide real-time DE [15]-[21]. Some of these only target CIF or VGA video [15]-[18]. The hardware proposed in [15] claims real-time operation only for CIF video. It uses the Census transform [22] and currently provides the highest quality disparity results among real-time hardware implementations in ASICs and FPGAs. It uses the low-complexity Mini-Census method to determine the matching cost, and aggregates the Hamming costs following the method in [12]. Because of its high-complexity cost aggregation, the hardware proposed in [15] requires high memory bandwidth and intense hardware resource utilization, even for Low Resolution (LR) video. Consequently, it reaches fewer than 3 frames per second (fps) when its performance is scaled to a 1024×768 video resolution and a 128 pixel disparity range.
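The Census-based matching cost mentioned above can be illustrated with a minimal sketch. It assumes a plain 3×3 Census window in which a neighbour darker than the centre pixel contributes a 1 bit; it is not the Mini-Census variant of [15], and it omits cost aggregation entirely. The function names and the tiny test images are hypothetical and chosen only for illustration.

```python
def census(img, y, x, radius=1):
    """Census signature of the window centred at (y, x), packed into an int."""
    center = img[y][x]
    bits = 0
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            if dy == 0 and dx == 0:
                continue  # the centre pixel is not compared with itself
            # Assumed convention: 1 if the neighbour is darker than the centre.
            bits = (bits << 1) | (1 if img[y + dy][x + dx] < center else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two Census signatures."""
    return bin(a ^ b).count("1")


def best_disparity(left, right, y, x, max_disp, radius=1):
    """Winner-takes-all search along row y of the right image.

    The caller must keep the window at (y, x) inside the left image.
    """
    ref = census(left, y, x, radius)
    best_d, best_cost = 0, None
    for d in range(max_disp + 1):
        if x - d - radius < 0:
            break  # candidate window would leave the right image
        cost = hamming(ref, census(right, y, x - d, radius))
        if best_cost is None or cost < best_cost:
            best_d, best_cost = d, cost
    return best_d, best_cost


# Hypothetical 3x8 images: the right view is the left view shifted by 2 pixels.
left = [
    [10, 20, 30, 90, 15, 60, 25, 70],
    [40, 50, 80, 35, 55, 45, 95, 5],
    [70, 10, 20, 65, 85, 30, 50, 75],
]
right = [row[2:] + [0, 0] for row in left]

print(best_disparity(left, right, y=1, x=4, max_disp=3))  # (2, 0)
```

Because the signature depends only on intensity ordering inside the window, the comparison reduces to an XOR and a bit count, which is one reason the Census transform maps well to hardware.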
Real-time DE for HR images offers some crucial advantages over DE for LR images. First, processing HR stereo images increases the disparity map resolution, which improves the quality of the object definition. Second, DE for HR stereo images is able to define the disparity with sub-pixel precision relative to DE for LR images. Therefore, DE for HR provides more precise depth measurement than DE for LR. Third, disparity values between 0 and 2 can be considered as background for LR images. In HR images, such disparities are spread over a larger disparity range; thus, the depth of far objects can be established more precisely.
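The gain in depth precision can be made concrete with a small worked example. It assumes a rectified pinhole stereo pair, for which depth is Z = f·B/d (focal length f in pixels, baseline B, disparity d); this relation is standard but is not stated in the text above. The baseline and focal lengths below are illustrative values only, with the HR camera taken to have three times the LR resolution over the same field of view.

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Depth in metres for a rectified pinhole stereo pair (assumed model)."""
    if disparity_px <= 0:
        return float("inf")  # zero disparity corresponds to a point at infinity
    return focal_px * baseline_m / disparity_px


def depth_step(disparity_px, focal_px, baseline_m):
    """Depth change caused by a one-pixel increase in disparity."""
    far = depth_from_disparity(disparity_px, focal_px, baseline_m)
    near = depth_from_disparity(disparity_px + 1, focal_px, baseline_m)
    return far - near


# Illustrative values, not taken from the source.
BASELINE_M = 0.1
FOCAL_LR_PX = 500.0   # low-resolution camera
FOCAL_HR_PX = 1500.0  # 3x the resolution, same field of view

# The same object at about 25 m appears at disparity 2 in LR and 6 in HR.
lr_step = depth_step(2, FOCAL_LR_PX, BASELINE_M)
hr_step = depth_step(6, FOCAL_HR_PX, BASELINE_M)
print(f"LR depth step near 25 m: {lr_step:.2f} m")
print(f"HR depth step near 25 m: {hr_step:.2f} m")
```

Under these assumptions, a far object that falls within the 0 to 2 disparity band at LR is resolved with a noticeably smaller depth step at HR, which matches the third advantage listed above.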
Despite these advantages, the use of HR stereo images brings some challenges. For high-quality disparity estimation, a disparity value needs to be assigned pixel by pixel. Pixel-wise operations cause a sharp increase in computational complexity when DE targets HR stereo video. Moreover, DE for HR stereo images requires stereo matching checks against a larger number of candidate pixels than DE for LR images. The large number of candidates makes it harder to reach real-time performance for HR images. Furthermore, high-quality disparity estimation may require multiple reads of input images or intermediate results, which places severe demands on off-chip and on-chip memory size and bandwidth, especially for HR images.
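A rough count shows how quickly the workload grows. The sketch below assumes one matching cost evaluation per pixel per candidate disparity and ignores the window size, aggregation and refinement, so it is only a lower bound on the real effort. The resolutions and disparity ranges are those quoted in this section; the 30 fps rate for the 1024×768 case is an assumption made for comparison.

```python
def cost_evaluations_per_second(width, height, disparity_range, fps):
    """Pixel-wise matching cost evaluations needed per second of video."""
    return width * height * disparity_range * fps


# 30 fps is assumed for both cases so that only resolution and range differ.
xga = cost_evaluations_per_second(1024, 768, 128, 30)
full_hd = cost_evaluations_per_second(1920, 1080, 256, 30)

print(f"1024x768, 128 disparities:  {xga:,} evaluations/s")
print(f"1920x1080, 256 disparities: {full_hd:,} evaluations/s")
print(f"ratio: {full_hd / xga:.1f}x")
```

The pixel count and the disparity range both scale with resolution, so the number of evaluations grows faster than the image size alone would suggest.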
The systems proposed in [19]-[21] claim to reach real-time performance for HR video. Still, their quality results on the HR benchmarks given in [11] are not provided. The system in [19] claims to reach 550 fps for an 80 pixel disparity range at an 800×600 video resolution, but it requires extremely large hardware resources. A simple edge-directed method presented in [20] reaches 50 fps at a 1280×1024 video resolution and a 120 pixel disparity range, but does not provide satisfactory DE results because of its low-complexity architecture. In [21], a hierarchical structure with respect to image resolution is presented that reaches 30 fps at a 1920×1080 video resolution and a 256 pixel disparity range, but it does not provide high-quality DE for HR.
In order to reduce the computational complexity of DE, Patent Publication [27] utilizes the Census transform by sampling pixels in a searched window and achieves parallelism using multiple FPGAs. However, it does not present a dynamically adaptive window size selection algorithm and hardware, and it does not benefit from adaptive and hybrid cost computation. In order to adapt the disparity estimation process to the local texture of the image, Patent Publication [28] utilizes an adaptive-size cost aggregation window method. Patent Publication [28] does not use a dynamic window size for stereo matching during cost computation, but it uses an adaptive window size while aggregating cost values. Cost aggregation requires a large computational load and local memory. Therefore, this technique is not used in the algorithm and implementation presented in this patent; instead, the matching window size is adaptively changed.