1. Field of the Invention
This invention relates generally to a Pipelined Image Processing Engine and more particularly to a system and method for processing a plurality of image frames through an effect pipeline on a multi-core processing system.
2. Description of the Related Art
As manufacturing technology continues to improve, the physical limits of semiconductor-based microelectronics expand to meet the demand for more capable microprocessors. Limitations in processing speed have led to adaptations that leverage existing processor speeds and capabilities through parallelism by employing multi-core or distributed architectures to distribute workloads and thereby increase processing capacity and reduce processing time.
Applications in the film/video/imaging space often chain together multiple image processing effects in a pipelined fashion. These image processing pipelines can benefit from parallelism offered through multi-core or distributed architectures. Applications performing such image processing share common characteristics, including the need to both process and move vast amounts of data at real-time rates.
Conventional implementations of such systems suffer from a combination of problems, including poor throughput, high latency, and increased complexity that reduces extensibility and scalability. These limitations are compounded when chaining together multiple discrete effects that depend on consecutive execution, as in a pipelined system.
When processing video, an ‘effect pipeline’ refers to the application of visual effects in a defined order to an image. Similarly, an ‘effect’ refers to each stage of the effect pipeline.
Prior approaches to distributing hardware resources among the various effects in effect pipelines generally suffer from a number of limitations with respect to parallelism.
One approach defines each image frame of data as the minimum quantum of work, allowing each effect to operate on a given frame independently of other frames. While this approach allows multiple effects to coexist in a shared system without enforcing tight integration amongst them, it also results in high latency for the overall system. This latency correlates with the number of effects in the effect pipeline.
‘Pipeline performance’ is measured as a combination of latency and computation time. ‘Latency’ refers to the time required by the pipelined system to emit a given unit of data. With respect to an effect pipeline, ‘latency’ describes the time spent by each image frame in the pipeline, from the moment it enters the first effect in the pipeline to the moment it exits the last effect in the pipeline.
‘Computation time’ refers to the time required to process a standard unit of data, e.g., an image frame. Furthermore, computation time may be represented as a function of frame rate for a video system, or a function of actual time to process a frame.
FIG. 11 illustrates a 6-stage frame-based effect pipeline on a multi-core system. In FIG. 11, each effect is assigned to a processor (or core), on the best-case assumption that the number of processors equals or exceeds the number of effects in the pipeline. Each processor processes an effect on one complete image frame at a time. The image frame may originate from a video stream or another source of image frames. Incoming frames are processed sequentially. For simplicity, each effect is assumed to take the same amount of time to process a frame, which represents the optimal time required to output each frame so as to maintain a consistent frame rate.
At time t1, frame 1 enters the pipeline, and the first effect is applied to frame 1 by processor 1. At time t2, frame 2 is loaded and processed by processor 1, and frame 1 is loaded and processed by processor 2. At time t3, frame 3 is loaded and processed by processor 1, frame 2 is loaded and processed by processor 2, and frame 1 is loaded and processed by processor 3. From time t4 to t6, frames 4 to 6 are introduced into the pipeline, and frames 1-3 proceed along the pipeline. At the conclusion of time t6, frame 1 emerges from the pipeline.
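The timing walk-through above can be reproduced with a short simulation. The following is a minimal sketch, not part of the described system: it assumes one effect per processor and one time step per effect per frame, as in the text, and the function names `frame_pipeline_schedule` and `exit_step` are hypothetical.

```python
def frame_pipeline_schedule(num_stages, num_frames):
    """Map each time step to {processor: frame} for a frame-based pipeline.

    Assumption: one effect per processor, and every effect takes exactly
    one time step per frame. Time steps, processors and frames are 1-based.
    """
    schedule = {}
    # The last frame leaves the last stage at step num_frames + num_stages - 1.
    for t in range(1, num_frames + num_stages):
        active = {}
        for processor in range(1, num_stages + 1):
            # A frame that entered at step `frame` reaches this processor
            # (processor - 1) steps later.
            frame = t - processor + 1
            if 1 <= frame <= num_frames:
                active[processor] = frame
        schedule[t] = active
    return schedule


def exit_step(schedule, frame, num_stages):
    """Return the time step during which `frame` is in the last stage."""
    for t in sorted(schedule):
        if schedule[t].get(num_stages) == frame:
            return t
    return None


if __name__ == "__main__":
    schedule = frame_pipeline_schedule(num_stages=6, num_frames=6)
    for t in (1, 2, 3, 6):
        print(f"t{t}: {schedule[t]}")
    print("frame 1 is in the last stage during step",
          exit_step(schedule, 1, 6))
```

Under these assumptions, frame 1 occupies the sixth processor during step t6, which matches the statement that it emerges from the pipeline at the conclusion of t6.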
Pipeline latency can be measured as a function of the time required to process a frame through all of the stages in the pipeline, or in ‘frame time,’ i.e., the number of frames processed by the first effect in the pipeline before frame N exits the effect pipeline. As a function of frame time, pipeline latency is computed as:
$PL_{FT} = M - N$  (Eq. 1)
where:
$PL_{FT}$ is the Pipeline Latency, in frame time.
$M$ is the frame entering the pipeline at time instant T.
$N$ is the frame exiting the pipeline at time instant T, and $M > N$.
For a frame-based architecture, pipeline latency can also be defined as the product of the processing time for the slowest effect in the chain and the number of effects (or stages) in the pipeline. This is computed as:
$PL_{CT} = N \times T_{FM}$  (Eq. 2)
where:
$PL_{CT}$ is the Pipeline Latency, as computation time.
$N$ is the number of stages in the pipeline.
$T_{FM}$ is the time required to process a frame by the slowest effect M in the pipeline.
Assuming that a new frame is fed into the pipeline every 120 ms, that the pipeline contains 6 stages or effects, and that the slowest effect in the pipeline also takes 120 ms to process one frame of data, the Pipeline Latency as computation time (Eq. 2) for the frame-based system in FIG. 11 is 6*120=720 ms.
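As a quick check of the arithmetic, Eq. 2 can be evaluated directly. This is an illustrative sketch only; the function name `pipeline_latency_ms` and the list-of-stage-times representation are assumptions made for the example.

```python
def pipeline_latency_ms(stage_times_ms):
    """Eq. 2: number of stages times the slowest stage's per-frame time.

    `stage_times_ms` holds one per-frame processing time (in ms) per effect.
    """
    if not stage_times_ms:
        raise ValueError("pipeline must contain at least one stage")
    return len(stage_times_ms) * max(stage_times_ms)


if __name__ == "__main__":
    # Six effects, each taking 120 ms per frame, as in the example above.
    print(pipeline_latency_ms([120] * 6))  # 720
```

Because Eq. 2 uses the slowest effect, speeding up any other stage leaves the result unchanged: `pipeline_latency_ms([60, 60, 120, 60, 60, 60])` is still 720 ms.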
Furthermore, since the pipeline latency is measured from the time the first frame, i.e., frame 1, enters the pipeline until it exits, no benefit is gained by having the processors run the effects in parallel on consecutive frames. That is, even if every frame were immediately available for processing, instead of a new frame being fed into the pipeline every 120 ms, the pipeline would still have a 6 frame latency and a 720 ms computation time latency, based on the time from when the first frame entered the pipeline until it exited.
The above system is limited by the hardware resources available, i.e. processing becomes considerably slower if the number of effects exceeds the number of available independent processing components, or if the cache memory available at each processor is incapable of storing an entire frame.