Data processing by multiple processing engines or processors, as encountered in, for example, video data processing (e.g., multipass compositing), requires synchronization of the multiple processing engines. If the processing engines are not operating at the same speed (i.e., are not in lock step), then parallel processing generally, and video data processing in particular, requires extensive software overhead to coordinate processing activities and/or maintain data integrity.
For example, video data processing typically entails sending the video data being processed through multiple processing engines, particularly during multipass compositing. Separate data streams may be simultaneously processed by distinct processing engines, and effective coordination of the different processing engines is required to prevent processing engines from idling and undesirably reducing the overall data processing throughput.
Video data processing frequently involves sequential processing of video data by different processing engines that are optimized for different processes. For example, one or more processing engines may be optimized for video input/output, a plurality of other processing engines may be optimized for image compositing, and one or more other processing engines may be optimized for interfacing with main memory. In order to composite images, such as for a cross fade of two images, it is necessary to receive the two images, perform any necessary color space conversion, execute the compositing pass (the cross fade), and output the resultant image. Depending upon the number of available processing engines and the capabilities of the available processing engines, the process can be highly parallelized. However, there are significant data dependencies that must be coordinated for optimal execution.
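As an illustration of the compositing pass mentioned above, the following minimal sketch blends two images of equal extent. It assumes the common linear cross fade, out = (1 - alpha) * a + alpha * b, which the text does not itself specify; the function name `crossfade`, the `alpha` parameter, and the representation of an image as rows of grayscale values are hypothetical choices made for this example only.

```python
def crossfade(image_a, image_b, alpha):
    """Linearly blend two images of equal extent.

    Assumption: each image is a list of rows of grayscale values, and
    alpha = 0.0 yields image_a while alpha = 1.0 yields image_b.
    """
    if len(image_a) != len(image_b):
        raise ValueError("images must have the same number of rows")
    blended = []
    for row_a, row_b in zip(image_a, image_b):
        if len(row_a) != len(row_b):
            raise ValueError("rows must have the same length")
        # The compositing pass needs BOTH inputs: this is the data
        # dependency that has to be satisfied before the pass can start.
        blended.append([(1 - alpha) * a + alpha * b
                        for a, b in zip(row_a, row_b)])
    return blended


if __name__ == "__main__":
    print(crossfade([[0.0, 100.0]], [[100.0, 0.0]], 0.25))
```

Because every output pixel depends on the corresponding pixel of both inputs, the pass cannot begin until both images have been received and converted, which is the kind of dependency that must be coordinated.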
For example, if processing engines optimized for image compositing only have access to local memory, then the images to be composited must be written to local memory. If the subsystem has two processing engines optimized for memory input/output, then these processing engines should concurrently execute the required main memory access. Because the two images may have different extents and the separate memory accesses may not take the same time, one image may be available for compositing several timing intervals before the second image is available. In typical prior art systems, however, the processing engine optimized for image compositing will remain idle, waiting for the second image to be input and for the system software to instruct the processing engine to begin execution of the image compositing process. Providing a processing engine with multiple processing queues allows the otherwise idle processing engine to execute background tasks during this idle time, as background task operations may be queued in a lower priority queue. Moreover, relying upon the system software to initiate processing may unnecessarily postpone execution of one of the processes and requires substantial overhead processing, which can introduce substantial delays. Accordingly, it is desirable to provide a mechanism by which the activities of the processing engines may be optimally coordinated without requiring software overhead.
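The benefit of multiple processing queues described above can be sketched as a small simulation. This is only an illustrative model under stated assumptions, not the mechanism of any particular system: the `Engine` class, its `high` and `low` queues, the `simulate` helper, and the one-task-per-timing-interval model are all hypothetical.

```python
from collections import deque


class Engine:
    """Hypothetical compositing engine with two command queues."""

    def __init__(self):
        self.high = deque()  # entries: (task name, set of required inputs)
        self.low = deque()   # entries: background task names
        self.log = []        # what the engine did in each timing interval

    def step(self, available):
        """Run one timing interval; `available` holds inputs in local memory."""
        if self.high and self.high[0][1] <= available:
            # All inputs of the high-priority task are present: run it.
            name, _ = self.high.popleft()
            self.log.append(name)
        elif self.low:
            # High-priority task is blocked (or absent): do background work.
            self.log.append(self.low.popleft())
        else:
            self.log.append("idle")


def simulate(arrival, background, intervals):
    """arrival maps each input name to the interval at which it is ready."""
    engine = Engine()
    engine.high.append(("composite", {"image_a", "image_b"}))
    engine.low.extend(background)
    for t in range(intervals):
        available = {name for name, ready in arrival.items() if ready <= t}
        engine.step(available)
    return engine.log


if __name__ == "__main__":
    print(simulate({"image_a": 0, "image_b": 3}, ["bg0", "bg1"], 5))
```

In this model the second image arrives three intervals after the first. With an empty low priority queue the engine is idle for all three of those intervals; with two queued background tasks it is idle for only one of them.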
This need for efficient coordination of data processing engines is particularly acute in the case of a video processing subsystem wherein the processing engines may have access to two separate memory systems: main memory and subsystem memory. Typically, access to the subsystem memory (also known as local memory) is accomplished with substantially less latency and at a higher throughput rate. Also, any intermediate results can be kept in subsystem memory, desirably saving a substantial amount of main memory bandwidth, as only the final results have to be sent back to main memory. Coordinating the optimal use of the local memory space typically requires continuous monitoring of the processing engines to ensure that commands (and processing data) which will not be executed immediately are not occupying memory space which could be used for commands and processing data that could be used by an idled processing engine.
Various techniques have been used in the past to try to provide data to processing engines in a video subsystem in a timely fashion. Fixed pipelines have been implemented using specialized hardware to pass video image data through processing engines in a predetermined sequence. The fixed pipeline technique is generally disadvantageous because it is inherently inflexible and therefore cannot be adapted as video processing requirements change or in response to differing user needs and priorities.
Another technique used to coordinate processing engines is the implementation of expensive parallel processing systems with several microprocessors and an operating system modified (or constructed) for parallel processing. This technique, while flexible, is frequently costly, requiring an inordinate investment of resources. Furthermore, it may be unsuitable for a processing subsystem such as a video processing subsystem that may or may not operate in a parallel processing environment. In general, software-based processor coordination is less efficient than hardware-based processor coordination because of the processing overhead (both in terms of time and complexity) that is necessarily added by software coordination.
This overhead is particularly problematic when highly optimized processing engines are used, as in many digital signal processing applications. There, the software overhead and latency necessarily incurred by interrupt processing prevent efficient task overlapping, because the time required for software execution becomes a significant portion of the total time required for processing. Video processing, especially real-time and faster-than-real-time video processing, is rapidly reaching the point where prompt and economically feasible coordination of processing engines is required to fully realize the capabilities of the processing engines.