There are many different types of data processing systems which include functionality for determining the median value from a group of data values. Examples of such data processing systems include image processing systems, audio processing systems and signal processing systems. For example, an image processing system may be used in a camera pipeline to process pixel values originating from image sensors in a camera to provide a set of processed pixel values representing a captured image. A median determination may be performed for many different purposes, e.g. to implement a median filter for attenuating impulsive noise (i.e. denoising), for defective pixel detection, or for defective pixel correction.
The number (n) of data values from which a median value is to be found may be different in different implementations. If an input set of data values from which the median is to be found is small, e.g. where n=3 or n=5, then finding the median value is trivial. In a general solution, the n data values are sorted into order and then the middle value is chosen as the median value.
Algorithms for sorting a set of data values tend to use recursion, which is suitable for being implemented in software. As the number of data values in the set increases, the complexity involved in sorting the data values in software typically scales on the order of n log n. However, algorithms using recursion are not well suited for being implemented in hardware. Modules of some data processing systems are implemented in hardware (e.g. fixed function circuitry) rather than being implemented in software running on general purpose hardware, because hardware implementations can provide a more optimised system (e.g. in terms of processing latency, power consumption and the physical size of the processing system, e.g. when implemented in silicon) for performing a specific function. Therefore, if a data processing system is intended to operate quickly (e.g. for processing and outputting data in real-time), and/or in a device with limited battery life (e.g. a mobile device such as a smartphone, tablet, camera, laptop, etc.), and/or on a System on Chip (SoC) which has constraints on its physical size (e.g. for use in mobile devices), then the data processing system is often implemented in hardware. One example of a data processing system which tends to be implemented in hardware is an image processing system used in a camera pipeline for processing pixel values received from an image sensor for providing processed pixel values to be captured and/or displayed to a user in real-time, e.g. on the screen of a tablet, smartphone or handheld camera etc.
One method for implementing a median determining unit in hardware is to use a Bubble sort algorithm. According to the Bubble sort algorithm (which may be referred to as a “sinking sort”) comparisons between two of the data values are repeatedly performed to compare each pair of adjacent data values in turn and swap them if they are in the wrong order. The pass through the data values is repeated until no swaps are needed, which indicates that the data values are sorted into the correct order.
FIG. 1 shows an example in which five data values are sorted using the Bubble sort algorithm. The horizontal axis in FIG. 1 represents units of time, e.g. clock cycles or processing cycles. For example, a unit of time may be the time taken to perform a compare and swap operation. The time to do a compare increases with the number of bits in the operands being compared, e.g. it takes longer to compare two 12-bit values than to compare two 5-bit values. The horizontal lines A to E represent positions between which data values may move as the sort process progresses. The way in which data values are represented at particular positions and times depends on how the sorter is implemented. For example, data values may be stored in registers, or may exist as signals in logic units. At the beginning of the sort process the values in positions A to E are unsorted, and at the end of the sort process the values will be sorted, e.g. with the largest value in position A and the smallest value in position E. In FIG. 1, thick vertical lines between two positions, e.g. as represented with reference numeral 102, indicate compare and swap operations, which may be implemented by a piece of dedicated hardware connected between the two positions at which data values are to be compared and optionally swapped. At time instance 1, the data values in positions A and B are compared, and if the data value in position B is greater than the data value in position A then the data values are swapped in positions A and B, otherwise the data values in positions A and B are not swapped in time instance 1. At time instance 2 the data values in positions B and C are compared and if the data value in position C is greater than the data value in position B then the data values are swapped in positions B and C, otherwise the data values in positions B and C are not swapped in time instance 2. 
At time instance 3 the data values in positions C and D are compared and if the data value in position D is greater than the data value in position C then the data values are swapped in positions C and D, otherwise the data values in positions C and D are not swapped in time instance 3. At time instance 4 the data values in positions D and E are compared and if the data value in position E is greater than the data value in position D then the data values are swapped in positions D and E, otherwise the data values in positions D and E are not swapped in time instance 4. Therefore, following time instance 4, the smallest data value will be in position E. The compare and swap operations are repeated as illustrated in FIG. 1, such that following time instance 7, the second smallest data value will be in position D; following time instance 9, the third smallest data value will be in position C; and following time instance 10, the largest data value will be in position A and the second largest data value will be in position B. Therefore, after time instance 10, the data values are sorted into the correct order in positions A to E. After the sorting process, the median value is the data value stored in the middle position, i.e. in position C. In the simple example shown in FIG. 1, with five inputs (i.e. n=5), there are ten comparisons and the sort takes ten units of time to complete. In general, if the approach shown in FIG. 1 is used for n inputs, the number of comparisons that are performed is given by ½n(n−1), and the number of units of time that the sort takes is also given by ½n(n−1).
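The sequential schedule described for FIG. 1 can be sketched in software as follows. This is a minimal illustrative model under stated assumptions, not a description of the hardware: the function names `bubble_schedule` and `apply_schedule` are hypothetical, positions A to E are modelled as list indices 0 to 4, and each entry of the schedule stands for one unit of time.

```python
def bubble_schedule(n):
    """List the position pairs compared, one pair per unit of time."""
    # Pass p compares adjacent positions 0..n-2-p, so each pass pushes the
    # smallest remaining value towards the last unsorted position.
    return [(i, i + 1) for p in range(n - 1) for i in range(n - 1 - p)]


def apply_schedule(schedule, values):
    """Run the compare and swap operations; larger values move to lower positions."""
    v = list(values)
    for a, b in schedule:
        if v[b] > v[a]:  # swap when the later position holds the greater value
            v[a], v[b] = v[b], v[a]
    return v


schedule = bubble_schedule(5)                       # illustrative n = 5
result = apply_schedule(schedule, [7, 2, 9, 4, 5])  # illustrative input values
median = result[len(result) // 2]                   # middle position, i.e. position C
```

For n=5 the schedule contains ten entries, in agreement with ½n(n−1), and the first four entries are the comparisons of positions A/B, B/C, C/D and D/E described above.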
The example shown in FIG. 1 is conceptually simple to understand, but the efficiency of the sorting algorithm can be improved, in terms of the time taken to perform the sort. FIG. 2 shows an example in which multiple comparisons can be performed at the same time instance on different pairings of positions. For example, at time instance 3, the data values in positions A and B are compared and optionally swapped, at the same time that the data values in positions C and D are compared and optionally swapped. In the example shown in FIG. 1, the data values in positions A and B do not change at time instances 3 and 4, so the comparison that is performed at time instance 5 in FIG. 1 can be implemented at time instance 3 in the example shown in FIG. 2, without affecting the outcome of the sorting process. The same reasoning explains how the other compare operations shown in FIG. 1 can be compressed into the seven time units shown in FIG. 2. In the example shown in FIG. 2, with five inputs (i.e. n=5), there are ten comparisons and the sort takes seven units of time to complete. In general, if the approach shown in FIG. 2 is used for n inputs, the number of comparisons that are performed is given by ½n(n−1), and the number of units of time that the sort takes is given by 2n−3.
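The overlapping of passes described for FIG. 2 can be modelled by assigning each comparison of the sequential bubble sort to a time slot. The sketch below assumes that each pass starts two time units after the previous one, which reproduces the stated figures (A/B and C/D compared together at time instance 3, and 2n−3 time units overall); the names `overlapped_schedule` and `apply_slots` are hypothetical and the exact arrangement in FIG. 2 may differ.

```python
from collections import defaultdict


def overlapped_schedule(n):
    """Group bubble sort comparisons into time slots that can run in parallel."""
    slots = defaultdict(list)
    for p in range(n - 1):          # pass number
        for i in range(n - 1 - p):  # position pair (i, i + 1) within the pass
            # Assumption: pass p starts two time units after pass p - 1, so it
            # only reads positions that the earlier pass has finished with.
            slots[2 * p + i + 1].append((i, i + 1))
    return [slots[t] for t in sorted(slots)]


def apply_slots(slots, values):
    """Apply every compare and swap, slot by slot; larger values move to lower positions."""
    v = list(values)
    for pairs in slots:
        for a, b in pairs:  # pairs within one slot touch disjoint positions
            if v[b] > v[a]:
                v[a], v[b] = v[b], v[a]
    return v


slots = overlapped_schedule(5)
time_units = len(slots)                             # 2n - 3 slots
comparisons = sum(len(pairs) for pairs in slots)    # still n(n - 1)/2 comparisons
```

Because the order of comparisons on any given position is unchanged, the overlapped schedule produces the same sorted result as the sequential one.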
It can be shown that the compare operations can be compressed even further, as shown in FIG. 3, such that with five inputs (i.e. n=5), the sort takes five units of time to complete. There are still ten comparisons in the example shown in FIG. 3. In general, if the approach shown in FIG. 3 is used for n inputs, the number of comparisons that are performed is given by ½n(n−1), and the sort takes n units of time to complete.
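One well-known arrangement with exactly these figures (n time units and ½n(n−1) comparisons) is an odd-even transposition network, in which alternate time units compare the pairs starting at even and at odd positions. Whether FIG. 3 depicts precisely this arrangement is an assumption; the sketch below, with the hypothetical names `transposition_rounds` and `sort_descending`, is offered only to illustrate how such a schedule could be constructed and checked.

```python
def transposition_rounds(n):
    """Rounds of an odd-even transposition network on n positions (assumed layout)."""
    # Even rounds pair positions (0,1), (2,3), ...; odd rounds pair (1,2), (3,4), ...
    return [[(i, i + 1) for i in range(r % 2, n - 1, 2)] for r in range(n)]


def sort_descending(values):
    """Sort with the largest value in position 0, one round per unit of time."""
    v = list(values)
    for pairs in transposition_rounds(len(v)):
        for a, b in pairs:  # all pairs in a round are disjoint, so they may run together
            if v[b] > v[a]:
                v[a], v[b] = v[b], v[a]
    return v


for n in (5, 7, 9):
    rounds = transposition_rounds(n)
    # Prints n, the number of time units and the number of comparisons.
    print(n, len(rounds), sum(len(pairs) for pairs in rounds))
```

Under this assumption the counts come out as 5 time units and 10 comparisons for n=5, 7 and 21 for n=7, and 9 and 36 for n=9.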
FIG. 4 shows how the same approach as that shown in FIG. 3 can be applied to the case of 7 inputs. In this example, 21 comparisons are performed and the sort takes 7 units of time to complete. The example shown in FIG. 4 is a simplification, and in a real system one or more retiming stages may be required between some of the time instances at which comparisons are performed, so that the signals can be safely swapped between positions before the values in those positions are subsequently used in further comparisons. In other words, there may be propagation delays when data values are swapped between positions, so latency may be added to the sort process to account for the propagation delays. For example, the number of sequential transistors on the worst case path through the logic (i.e. the “logic depth”) determines the minimum amount of time that must be allowed for the circuit to operate correctly. A compare of two k-bit operands takes O(k) transistor times, although this can be reduced by using faster but less area-efficient logic. In current technology, the logic depth should not exceed approximately 30, otherwise it becomes very difficult to achieve layout. This corresponds to approximately three compare and swap operations before it becomes necessary to add registers and stall the result by a clock cycle. Therefore, the sort shown in FIG. 4 may take longer than seven units of time to complete in a real system. It is noted that additional registers cost both area and power.
FIG. 5 shows how the same approach as that shown in FIG. 3 can be applied to the case of 9 inputs. In this example, 36 comparisons are performed and the sort takes 9 units of time to complete, plus some time to allow for the propagation delays, as described above.
Each comparison and swap that is performed consumes power. Furthermore, when the algorithm is implemented in fixed function hardware, each comparison is implemented with a block of hardware providing the comparison and optional swap functionality. The routing of the correct signals to the different comparison blocks can become complicated as the number of comparisons increases. Therefore, for a number of reasons (e.g. to reduce the size of the hardware and to reduce the power consumption of the hardware), it can be beneficial to reduce the number of comparisons that are performed. If the hardware is used to determine a median value, but not used to perform a full sort of all of the input values, then some of the comparisons might not need to be implemented in some of the examples described above. In FIGS. 3, 4 and 5, comparisons shown with dashed lines, rather than solid lines, do not need to be implemented in order to determine the median value: one comparison in FIG. 3, three comparisons in FIG. 4 and six comparisons in FIG. 5. These comparisons do not affect the data value which is found to be the median value, so they may be omitted in order to reduce the number of comparisons.
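The comparisons that cannot influence the median can be identified by tracing backwards from the middle output position and keeping only the comparisons whose outputs are still needed. The sketch below applies this to an odd-even transposition network, which is an assumed layout (the figures are not reproduced here); the names `transposition_rounds`, `prune_for_median` and `median_via_pruned_network` are hypothetical. Under this assumption the number of removable comparisons comes out as one, three and six for n=5, 7 and 9, which agrees with the dashed-line counts stated above.

```python
def transposition_rounds(n):
    """Rounds of an odd-even transposition network on n positions (assumed layout)."""
    return [[(i, i + 1) for i in range(r % 2, n - 1, 2)] for r in range(n)]


def prune_for_median(n):
    """Drop comparisons whose outputs can never reach the middle position."""
    needed = {n // 2}  # after the last round only the middle position matters
    kept = []
    for pairs in reversed(transposition_rounds(n)):
        # A comparison is kept if either of its outputs is still needed later on.
        kept_round = [(a, b) for a, b in pairs if a in needed or b in needed]
        for a, b in kept_round:
            needed.update((a, b))  # both inputs of a kept comparison are needed
        kept.append(kept_round)
    kept.reverse()
    return kept


def median_via_pruned_network(values):
    """Median of an odd number of values using only the kept comparisons."""
    v = list(values)
    for pairs in prune_for_median(len(v)):
        for a, b in pairs:
            if v[b] > v[a]:  # larger value moves to the lower position
                v[a], v[b] = v[b], v[a]
    return v[len(v) // 2]


def removed_count(n):
    """Number of comparisons dropped relative to the full n(n - 1)/2."""
    return n * (n - 1) // 2 - sum(len(pairs) for pairs in prune_for_median(n))
```

For example, `median_via_pruned_network([7, 2, 9, 4, 5])` evaluates to 5, the same value that the full sort leaves in the middle position.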
The Bubble sort algorithm is simple, but it is slow to perform and involves the implementation of a large number of comparisons, particularly when the number of inputs increases, e.g. above n=9. The same issues apply with other known sorting techniques such as an insertion sort. In both a bubble sort technique and an insertion sort technique the number of compare and swap units scales on the order of n², and the time taken to perform the sort scales on the order of n, plus extra retiming stages which are required approximately n/3 times.
Typically in a data processing system, such as an image processing system for use in a camera pipeline, the number of inputs to a median determining unit can be greater than nine. For example, a typical operation in a camera pipeline (e.g. denoising or defective pixel detection/correction) may be performed for each particular pixel within an image being processed, and may involve finding the median of the pixel values within a block of pixel values including (e.g. centred on) the particular pixel. The block of pixel values may for example be a 3×3 block, a 5×5 block, a 7×7 block, a 3×5 block, a 5×7 block, a 7×9 block or a 9×9 block, to give just some examples. For some functions, a 3×3 block of pixel values is simply too small to provide acceptable image processing results. A 5×5 block of pixel values includes 25 pixel values and a 7×7 block of pixel values includes 49 pixel values. Algorithms such as the bubble sort and the insertion sort are not suitable to be implemented in hardware for use in finding the median of such a large number of inputs. For example, with 25 inputs (i.e. n=25), a bubble sort algorithm would include 300 comparisons and the sort would take 25 units of time to complete, plus additional time and logic to allow for the propagation delays, as described above. With 49 inputs (i.e. n=49), a bubble sort algorithm would include 1176 comparisons and the sort would take 49 units of time to complete, plus additional time and logic to allow for the propagation delays, as described above.
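The figures quoted for 25 and 49 inputs follow directly from ½n(n−1) comparisons and n units of time. The short sketch below tabulates them for several square block sizes; the function name `bubble_network_cost` and the parameter `swaps_per_stage` are hypothetical, and the retiming estimate is only the rough rule of thumb of one register stage per three sequential compare and swap operations mentioned earlier.

```python
def bubble_network_cost(n, swaps_per_stage=3):
    """Comparison count, time units and a rough retiming estimate for n inputs."""
    comparisons = n * (n - 1) // 2
    time_units = n
    # Assumption: one retiming stage after roughly every three sequential
    # compare and swap operations, i.e. about n/3 stages in total.
    retiming_stages = round(n / swaps_per_stage)
    return comparisons, time_units, retiming_stages


for rows, cols in [(3, 3), (5, 5), (7, 7), (9, 9)]:
    n = rows * cols  # number of pixel values in the block
    print(f"{rows}x{cols}: n={n}", bubble_network_cost(n))
```

For n=25 this gives 300 comparisons in 25 units of time, and for n=49 it gives 1176 comparisons in 49 units of time, before any allowance for retiming.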
Therefore, the bubble sort algorithm and the insertion sort algorithm are not suitable for use in an image processing system (e.g. for use in a camera pipeline) which must process data values for output in real-time and for which the size and power consumption of the hardware are important considerations. With the current state of the art, it is difficult to implement a median determining unit in hardware that can provide results for use in real-time processing with acceptable power consumption and silicon area for a set of more than eleven data values.