Many applications involve two-dimensional data sets. Visual processing applications are one example. The term “visual processing” as used herein refers to a general class of processing of image and video picture elements (“pixels”) and related data. This includes applications such as enhancing images and converting pixel formats, motion detection and tracking, and identifying features or objects in either still-image or video frames. Other applications involving data that can be expressed in two-dimensional sets include modeling of physical quantities such as forces or electromagnetic fields. Three-dimensional data sets can in some cases be represented in two dimensions, such as by projection or transformation into a two-dimensional plane, or as multiple adjacent two-dimensional data sets.
Prior-art solutions have taken one of two forms: 1) programmable solutions using a processing core, or 2) hardwired (or hardware-configurable) solutions using hardware logic gates. The distinction between these options is that programmable solutions are flexible and can be readily improved and adapted to various applications, but have very low performance relative to hardwired solutions, whereas hardwired solutions can readily meet performance requirements, but are very difficult to design, maintain, and improve to meet new requirements.
It is desirable to have solutions that combine the advantages of programmable and hardwired processing without the corresponding disadvantages. Consider, as an example, the case of visual processing applications. The term “visual computing” as used herein refers to a programmable approach, based on a computing architecture that is specifically targeted to visual processing, in contrast to using a general-purpose processor. Though many examples of visual-computing solutions exist, all have a fundamental limitation in capability. This limitation becomes particularly acute with state-of-the-art sensor and display resolutions, and with increasing frame rates.
To understand the nature of this limitation, it is first necessary to understand the typical requirements of visual processing. FIG. 1 illustrates some typical stages for processing a still image in a digital camera. The input from the sensor 100 contains red, green, and blue pixel values, or data elements, in a Bayer mosaic (this format contains twice as much green information as red or blue because of the sensitivity of the eye to green information). Processing performs quality enhancement and format conversion, producing YUV data 108 that can be used to create standard image formats such as JPEG (YUV denotes three components per pixel: one of luminance and two of chrominance information). Though a Bayer pattern is shown in this example, there are many non-standard, proprietary formats that typically are vendor-specific. This is one motivation for programmable solutions, so that various formats can be processed by a common solution.
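By way of illustration only, the proportion of green samples in a Bayer mosaic can be sketched as follows. The RGGB ordering of each 2×2 cell and the function name `split_bayer_rggb` are assumptions made for this example; actual sensors may use a different ordering or a proprietary format.

```python
def split_bayer_rggb(mosaic):
    """Group the samples of a Bayer mosaic by color.

    Assumes (for illustration) an RGGB layout: red at even row/even column,
    blue at odd row/odd column, and green at the two remaining positions
    of every 2x2 cell.
    """
    planes = {"R": [], "G": [], "B": []}
    for r, row in enumerate(mosaic):
        for c, value in enumerate(row):
            if r % 2 == 0 and c % 2 == 0:
                color = "R"
            elif r % 2 == 1 and c % 2 == 1:
                color = "B"
            else:
                color = "G"
            planes[color].append(value)
    return planes


# A 4x4 mosaic of arbitrary sample values.
example_mosaic = [[r * 4 + c for c in range(4)] for r in range(4)]
example_planes = split_bayer_rggb(example_mosaic)
```

For the 4×4 mosaic above, the green group holds 8 samples while the red and blue groups hold 4 each.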
Pixel processing typically produces a pixel value at a given location based on neighboring pixel values. For example, the noise reduction stage 104 is based on comparing the value of a given pixel to the values of pixels, of the same format, in its local neighborhood. If the value differs by more than some threshold from a value predicted by averaging neighboring pixel values, the difference is considered to be due to noise, such as lens impurities or sensor defects. In this case, the pixel value is replaced by the predicted value. Other processing stages, such as white balance stage 106, typically use the same approach of considering neighboring pixel values. Black level adjustment 102 is an exception, because it simply subtracts known offsets from each pixel value to compensate for drift in pixel values from zero for purely dark input.
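A minimal sketch of the neighborhood-based noise reduction just described is given below. It assumes that the image is a single plane of pixels of one format, that the prediction is the plain mean of the surrounding pixels, and that the caller only requests pixels far enough from the image edge; the function name and parameters are hypothetical and are not taken from any particular implementation.

```python
def denoise_pixel(plane, row, col, threshold, radius=2):
    """Return the noise-reduced value of plane[row][col].

    The predicted value is the mean of all other pixels in the
    (2*radius + 1) x (2*radius + 1) neighborhood. If the pixel differs from
    the prediction by more than `threshold`, the prediction replaces it.
    The caller must keep (row, col) at least `radius` pixels from every edge.
    """
    neighbors = [
        plane[r][c]
        for r in range(row - radius, row + radius + 1)
        for c in range(col - radius, col + radius + 1)
        if (r, c) != (row, col)
    ]
    predicted = sum(neighbors) / len(neighbors)
    value = plane[row][col]
    return predicted if abs(value - predicted) > threshold else value


# A flat 5x5 region with one outlier at its center.
noisy = [[10] * 5 for _ in range(5)]
noisy[2][2] = 200
cleaned_center = denoise_pixel(noisy, 2, 2, threshold=20)
```

With the default radius of 2, each output pixel depends on a 5×5 block of input pixels centered on it.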
This approach to processing causes input and output relationships between processing stages such as that shown in FIG. 2. This example assumes that an output pixel depends on a neighboring region of pixels, such that a central pixel depends on two columns of pixels to the right and left and two rows of pixels above and below: the total input region is a block of pixels that is 5 pixels wide and 5 pixels high (5×5), with the output corresponding to the central pixel in this block. In this example, assuming that processing stages shown in FIG. 1 are numbered sequentially, a 9×9 input region 200 to processing stage N results in a 5×5 output region 202, which becomes the input to processing stage N+1, and this 5×5 region in turn generates a single pixel 204 as the output of stage N+1.
Regions of pixels that are required as input, but which do not have any corresponding output, are referred to as aprons. Apron requirements cause the size of the result region of a processing stage to be smaller than the input region, and this reduction is larger the more complex, and therefore higher-quality, the processing. Higher quality requires a larger number of processing stages and a larger number of apron pixels used at each stage to produce output.
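The effect of aprons on region size can be sketched numerically. The example assumes that every stage uses the same square window and that each stage therefore shrinks the region by (window - 1) pixels in each dimension; the function names are hypothetical.

```python
def output_size(input_size, window, stages):
    """Width (or height) of the valid output after `stages` stages,
    each consuming an apron of (window - 1) pixels in that dimension."""
    return input_size - stages * (window - 1)


def required_input_size(desired_output, window, stages):
    """Width (or height) of input needed to obtain `desired_output`
    valid pixels after `stages` stages."""
    return desired_output + stages * (window - 1)


# One output pixel after two 5x5 stages needs a 9-pixel-wide input region.
two_stage_input = required_input_size(1, window=5, stages=2)
# A 9-pixel-wide input yields a 5-pixel-wide output after one 5x5 stage.
one_stage_output = output_size(9, window=5, stages=1)
```

These values reproduce the 9×9, 5×5, and single-pixel regions of the two-stage example of FIG. 2.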
The operation of typical prior-art visual-processing hardware is illustrated in FIG. 3, with processing stages corresponding to those in FIG. 2. The input to stage N is provided by memory buffer 300. This buffer retains sufficient context to satisfy apron requirements, with input being provided one set of pixels at a time, where a set of pixels in this example is a set of four pixels such as the four pixels 302. Each set of input pixels is sufficient to generate one set of output pixels, because input of the set of pixels that fills the buffer satisfies the apron requirements for the pixels in the center of the buffer. For example, input of pixel set 302 fills the buffer to satisfy the apron requirements for pixel set 304 centered in the 5×5 region having 5 sets of 4 pixels in each direction. Scanning hardware 306 accesses pixels in this 5×5 region, providing input to operation hardware 308 for stage N. Scanning hardware is required because each input set is at a different location within the image, and the relative locations in buffer 300 of the pixels required for stage N change with each input set. Correspondingly, the output of stage N is processed by merging hardware 310, which writes the output of stage N for multiple input pixel sets into buffer 312 in a way that preserves results written into this buffer from operations on previously-scanned pixel sets. Buffer 312 is shown offset to clarify the relative positions of the pixels in this buffer with respect to the pixels in buffer 300. The output enabled by pixels 302 updates the pixels 314 in buffer 312, enabling input to stage N+1 for the region of pixels 316 in buffer 312.
The configuration of hardware shown in FIG. 3 cannot be replicated by a software program, because all of the hardware processing stages operate concurrently (upon pixels at different locations along the scan line), whereas software programs execute one instruction at a time and cannot control the concurrent operations. Each hardware stage is hardwired, or has limited hardwired configurations, to operate concurrently with other stages. The throughput of these stages is typically one set of pixels every processing cycle, so, for example, hardware that operates at 450 mega-Hertz can provide a throughput of 450 mega-pixels per second. This throughput is required for state-of-the-art sensors, which provide input of 30 mega-pixels per frame at a rate of 15 frames per second. However, because the stages are hardwired, the solution is inflexible.
Programmable solutions overcome this inflexibility. However, because program execution is serial in nature, programs generate a set of outputs only some number of sequential processing steps, or cycles, after input. The number of cycles is determined by the number of processing stages and the complexity of the operations performed at each stage, and is typically on the order of 1000-2000 cycles. Thus, a programmable solution operating at 450 mega-Hertz would provide throughput of at most 0.45 mega-pixels per second, far below the throughput of hardware.
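The throughput figures quoted above follow from simple arithmetic, sketched below. The function names are hypothetical, and the sketch assumes one output per cycle for hardware and one output per complete program pass for software, as in the text.

```python
def sensor_rate(pixels_per_frame, frames_per_second):
    """Pixel rate, in pixels per second, that the processing must sustain."""
    return pixels_per_frame * frames_per_second


def hardwired_throughput(clock_hz):
    """Outputs per second when one output is produced every cycle."""
    return clock_hz


def programmable_throughput(clock_hz, cycles_per_output):
    """Outputs per second when each output takes a full serial pass."""
    return clock_hz / cycles_per_output


CLOCK_HZ = 450_000_000                              # 450 mega-Hertz
required = sensor_rate(30_000_000, 15)              # 30 mega-pixels at 15 frames/s
hardware = hardwired_throughput(CLOCK_HZ)
software_best = programmable_throughput(CLOCK_HZ, 1000)
software_worst = programmable_throughput(CLOCK_HZ, 2000)
```

Under these assumptions the sensor requires 450 mega-pixels per second, which the hardware matches, while the serial program delivers between 0.225 and 0.45 mega-pixels per second for passes of 2000 and 1000 cycles respectively.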
To improve the throughput of programmable visual processing, the program must process and output a large number of pixels at each processing stage. For example, a program that requires 1000 cycles to execute can match the throughput of hardware processing if it produces 1000 sets of pixels of output for each set of input pixels. However, because of apron requirements, the input required to produce this output is much larger than 1000 sets of inputs, except in a hypothetical ideal case illustrated in FIG. 4. In this conceptual illustration, there are a number of processing datapaths 402 equal to the width in pixels of the entire image, operating on data in a memory 400 that is also of this width. Each datapath can access a horizontal region of this memory sufficient for apron access in the horizontal direction, for example two pixels to the right and left for a 5×5 region, and the data is organized into buffers whose depth, in number of lines, is sufficient for apron access in the vertical direction, for example 5 lines deep for a 5×5 region.
Input is provided to this ideal solution an entire scan-line at a time, with each new scan-line replacing the oldest scan-line. These lines are represented by horizontal lines 404 in memory 400, labeled in this example with corresponding pixel colors in a Bayer format. Pixels of each format (color) must be kept in a separate buffer, because the program performs the same operations on all pixels at the same time, and these must be of the same format for correct results. The program executes, taking the required number of cycles, and produces a number of pixels proportional to the number of datapaths, which must be on the order of a few thousand to match the throughput of hardware. It should be understood that this example is for illustration only. Operating this many datapaths simultaneously is physically impossible, and would be prohibitively expensive in any case.
Physical limitations restrict the number of datapaths in a typical implementation, with a resulting typical organization shown in FIG. 5. The conceptual memory and datapath bank of FIG. 4 is partitioned into datapath banks 510-516 and memory segments 500-506 corresponding to a number of datapaths that can reasonably be implemented, typically either 32 or 64. However, operating these datapath banks over the entire span of a program, from input to output, causes an unacceptable loss in the effectiveness of execution, represented by the shaded grey regions marked “x” that represent the loss of effective output resulting from the apron requirements at each processing stage. For example, if there are 10 processing stages, each requiring a 5×5 region of input, the datapath banks each lose 4 pixels of effective context, in the horizontal direction, at each stage. This causes output at the last stage to be 40 pixels narrower than the input, which, in the case of 64 datapaths, causes the output to be only 24 pixels wide, meaning that the parallel datapaths are only 38% (24/64) effective. To avoid this, it would be required to share data in memories 500-506 between datapath banks 510-516. However, this is not possible because it cannot be guaranteed that the datapaths perform the same operations at the same time, and thus it is not possible to guarantee that data is valid when accessed by a datapath from another memory that is not local to the datapath.
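The loss of effectiveness in this example can be checked with a short calculation. The sketch assumes that every stage uses the same window width and that a bank cannot obtain apron pixels from its neighbors; the function names are hypothetical.

```python
def effective_width(datapaths, window, stages):
    """Number of datapaths still producing valid output after `stages`
    stages, each losing (window - 1) pixels of horizontal context."""
    return max(0, datapaths - stages * (window - 1))


def effectiveness(datapaths, window, stages):
    """Fraction of the datapaths in a bank that produce valid output."""
    return effective_width(datapaths, window, stages) / datapaths


all_stages_width = effective_width(64, window=5, stages=10)
all_stages_fraction = effectiveness(64, window=5, stages=10)
```

For 64 datapaths and 10 stages of 5×5 processing, this gives an effective width of 24 pixels and a fraction of 0.375, which rounds to the 38% stated above. The same calculation shows that a bank of 32 datapaths would produce no valid output at all after 10 such stages.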
The prior art addresses the aforementioned problem of reduction in output data in partitioned datapaths by performing only a single processing stage at a time, rather than all stages in series, as illustrated in FIG. 6. After a given stage N, the output of stage N is written in a system data-movement operation to a global shared memory 600. Following this, data is read in a system data-movement operation back into the memories local to the datapaths (500-506 in FIG. 5). This read-back operation can be used to minimize data loss for the interior data banks by overlapping the portions of a scan line that are read into the data banks. Repeating the data elements near the end of one data bank at the beginning of the adjacent bank provides the apron needed for each data bank and eliminates data loss at the boundary. The loss of data per processing stage is only that caused by the outer ends of the scan line, for example 4 pixels. This loss is a much smaller proportion of the datapath width than if all processing stages are executed, so the effectiveness of parallel execution is 60/64=94%. However, additional cycles are required to copy the outputs of stage N and the inputs of stage N+1, and these cycles add to the effective execution time and also limit the throughput.
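The overlapped read-back described above can be sketched as a partitioning of one scan line into bank-sized segments that repeat apron pixels at their boundaries. The function names and the half-open (start, end) representation are assumptions made for this illustration and do not describe any particular prior-art system.

```python
def overlapped_segments(line_width, bank_width, radius):
    """Split a scan line into segments of at most `bank_width` pixels.

    Adjacent segments overlap by 2 * radius pixels, so that each bank holds
    the apron it needs for one stage. Returns half-open (start, end) ranges.
    """
    step = bank_width - 2 * radius
    if step <= 0:
        raise ValueError("bank_width must exceed 2 * radius")
    segments = []
    start = 0
    # A segment is useful only if it holds at least one non-apron pixel.
    while start + 2 * radius < line_width:
        segments.append((start, min(start + bank_width, line_width)))
        start += step
    return segments


def valid_outputs(segment, radius):
    """Half-open range of output positions a bank can produce in one stage."""
    start, end = segment
    return (start + radius, end - radius)


banks = overlapped_segments(line_width=184, bank_width=64, radius=2)
outputs = [valid_outputs(s, radius=2) for s in banks]
```

In this example each full bank reads 64 pixels and produces 60 valid outputs, the output ranges of adjacent banks abut without gaps, and only the 2 pixels at each outer end of the scan line have no output.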
In both of the examples just presented, there is a fundamental throughput limitation for a programmable visual-computing solution. This results either from the ineffectiveness of parallel operations due to the inability to share data to meet apron requirements, or from the additional cycles required to re-form the shared context in a global memory.
The above discussion illustrates the difficulty of efficient software processing when an operation to be performed on one element of a data set requires access to one or more other elements of the data set—i.e., when the elements in the data set are interrelated, at least with respect to a given operation. Hardware implementations of the operation can handle this situation efficiently by repeatedly acquiring the required input elements through a scanning process. It would be desirable to have an efficient approach to software implementation of operations on interrelated elements in two-dimensional data sets.