1. Field of the Invention
The present invention relates to a parallel processing processor system that is provided with multiple processors and that processes data in parallel using the multiple processors. The present invention particularly relates to a parallel processing processor system capable of reducing the capacity of the instruction cache inherent to each of the processors while maintaining the performance of the processors.
2. Description of the Related Art
In the controllers of MFPs (Multifunction Peripherals), individual hardware logic is provided for processes such as image reading, recording, printing, communication, fax, and so on, thereby realizing functions requested of the MFP. However, preparing circuits for each function makes it difficult to reduce the cost of the controller while also maintaining its functionality.
Reducing costs while maintaining functionality is possible by executing non-simultaneous image processes using programmable hardware. DSPs (Digital Signal Processors), reconfigurable processors, and configurable processors can be given as examples of programmable hardware. Here, reducing costs by switching firmware using multiple DSPs shall be considered as an example.
A configuration in which multiple DSPs that are each assigned to different image processes are connected and a series of multiple types of image processes are executed sequentially on the same image region is called a “pipeline architecture”. If a pipeline architecture is employed, differences in processing times among the DSPs cause the slowest DSPs to act as bottlenecks, making sufficient throughput difficult to achieve.
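The bottleneck effect described above can be illustrated with a minimal sketch. It assumes a steady-state pipeline in which each DSP handles one image region at a time; the function names and the per-region processing times are hypothetical and are not taken from any actual controller.

```python
def pipeline_throughput(stage_times):
    """Regions completed per unit time in steady state.

    Every region must pass through every stage, so the slowest stage
    sets the pace for the whole pipeline.
    """
    return 1.0 / max(stage_times)


def stage_utilization(stage_times):
    """Fraction of time each stage is busy, relative to the slowest stage."""
    bottleneck = max(stage_times)
    return [t / bottleneck for t in stage_times]


# Hypothetical per-region processing times (arbitrary units) for three DSPs.
stage_times = [2.0, 5.0, 3.0]
throughput = pipeline_throughput(stage_times)
utilization = stage_utilization(stage_times)
print(throughput, utilization)
```

In this sketch the second DSP is the bottleneck; the other two DSPs sit idle for part of each cycle, which is the situation in which sufficient throughput is difficult to achieve.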
In order to avoid this problem, the DSPs can be customized so that the processing times of the individual DSPs are equal.
However, if a DSP is customized for a certain process, it is difficult to customize that DSP in the same manner for a different piece of firmware when switching to and executing that different piece of firmware.
Meanwhile, although techniques for regulating loads among the DSPs exist (for example, see Japanese Patent Laid-Open No. 2006-133839), such regulation requires overhead; furthermore, improving the throughput is difficult and the control involved is complex, and thus such a technique is not necessarily desirable. Moreover, pipeline architecture has a problem in that it is difficult to implement a changeable configuration that has scalability, where costs are reduced by reducing the number of DSPs, performance is improved by increasing the number of DSPs, and so on.
Based on this, a data parallel processing architecture is preferable to a pipeline architecture. In a data parallel processing architecture, the image data to be processed is divided, each piece of image data obtained through the division is assigned to a different DSP, and each of those DSPs executes all of the multiple processes that were executed by different DSPs in the pipeline architecture. In the present specification, an architecture in which multiple DSPs are used, the image data to be processed is divided, and a series of processes is performed in parallel by the DSPs on the pieces of image data obtained through the division shall be called a data parallel processing architecture.
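The division described above can be sketched as follows. This is an illustration only: it assumes the image is divided into horizontal bands of rows and that every process operates on single pixels, so that no band needs data from its neighbors. The names `split_rows`, `run_series`, `data_parallel`, `invert`, and `threshold` are hypothetical.

```python
def split_rows(height, num_dsps):
    """Divide `height` image rows into `num_dsps` contiguous bands."""
    base, extra = divmod(height, num_dsps)
    bands, start = [], 0
    for i in range(num_dsps):
        size = base + (1 if i < extra else 0)  # spread the remainder over the first bands
        bands.append((start, start + size))
        start += size
    return bands


def run_series(rows, processes):
    """Apply every process in order to the given rows (the work of one DSP)."""
    for process in processes:
        rows = [[process(pixel) for pixel in row] for row in rows]
    return rows


def data_parallel(image, processes, num_dsps):
    """Run the whole series of processes on each band of the image."""
    result = []
    for start, end in split_rows(len(image), num_dsps):
        # In hardware each band would be handled by a different DSP at the
        # same time; here the bands are processed one after another.
        result.extend(run_series(image[start:end], processes))
    return result


# Hypothetical per-pixel stand-ins for image processes.
def invert(pixel):
    return 255 - pixel


def threshold(pixel):
    return 255 if pixel >= 128 else 0


image = [[(r * 16 + c) % 256 for c in range(8)] for r in range(10)]
processes = [invert, threshold]
output = data_parallel(image, processes, num_dsps=4)
```

Because every DSP runs the same complete series of processes, changing `num_dsps` changes only how the rows are divided, not what any DSP must be able to execute. This is also why each DSP must hold the program for every process.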
When a structure is employed in which the image data to be processed is divided and data parallel processing is executed thereon by multiple DSPs, the size of the programs executed by each DSP increases, and thus, for an instruction cache of the same capacity, the cache miss rate is higher than when a pipeline architecture is employed. When a cache miss occurs, the DSP accesses a main memory. The main memory is a DRAM (Dynamic Random Access Memory) or the like located off of the chip that implements the DSP.
With an off-chip DRAM, 20 to 30 clock cycles are necessary for a one-word read/write, and thus the latency at the time of a cache miss is extremely high, which greatly influences the processing capabilities of the DSP. Meanwhile, if an instruction cache having a capacity capable of storing all the processes assigned to each DSP is employed, the size of the instruction cache increases, thereby increasing the surface area of the circuit.
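The influence of the miss latency can be illustrated with the commonly used average access time relation, hit time plus miss rate times miss penalty. In the sketch below, the 20 to 30 clock penalty comes from the figures above; the one-clock hit time and the miss rates are assumptions chosen only for illustration, and the function name is hypothetical.

```python
def average_fetch_cycles(miss_rate, miss_penalty, hit_cycles=1):
    """Average clocks per instruction fetch.

    hit_cycles is an assumed cost for a cache hit; miss_penalty is the
    additional cost of going to off-chip DRAM on a miss.
    """
    return hit_cycles + miss_rate * miss_penalty


# Assumed miss rates; penalties of 20 and 30 clocks bound the stated range.
for miss_rate in (0.01, 0.05, 0.10):
    low = average_fetch_cycles(miss_rate, 20)
    high = average_fetch_cycles(miss_rate, 30)
    print(f"miss rate {miss_rate:.0%}: {low:.2f} to {high:.2f} clocks per fetch")
```

Under these assumptions, even a modest rise in the miss rate multiplies the average cost of an instruction fetch, which is why the higher miss rate of the data parallel processing architecture matters.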
A method that uses a secondary cache can be employed in order to reduce the latency at the time of a cache miss. A “secondary cache” is a processor-specific storage device with a higher latency than a primary cache and a lower latency than a DRAM.
Although using a secondary cache can solve the aforementioned problem, doing so also leads to the following problems:
because a cache requires a circuit called a “tag” in addition to a circuit for storing data, the circuit scale increases; and
cache transfer is executed in units called “cache lines” and thus the efficiency is poor.
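Both problems can be put into rough numbers with a minimal sketch. It assumes a direct-mapped cache with power-of-two capacity and line size, a 32-bit address, and one valid bit per line; the function names and all parameter values are hypothetical and are not taken from any particular device.

```python
def tag_overhead(capacity_bytes, line_bytes, address_bits=32):
    """Tag width and total tag storage for a direct-mapped cache.

    Assumes power-of-two sizes and one valid bit per cache line.
    """
    num_lines = capacity_bytes // line_bytes
    offset_bits = line_bytes.bit_length() - 1   # log2 of the line size
    index_bits = num_lines.bit_length() - 1     # log2 of the number of lines
    tag_bits = address_bits - index_bits - offset_bits
    tag_storage_bits = num_lines * (tag_bits + 1)  # +1 for the valid bit
    return tag_bits, tag_storage_bits


def line_transfer_efficiency(words_used, words_per_line):
    """Fraction of a transferred cache line that is actually used."""
    return words_used / words_per_line


# Assumed example: 64 KiB cache with 32-byte lines.
tag_bits, tag_storage_bits = tag_overhead(64 * 1024, 32)
# Assumed example: only 2 of the 8 words in a fetched line are executed.
efficiency = line_transfer_efficiency(words_used=2, words_per_line=8)
print(tag_bits, tag_storage_bits, efficiency)
```

The tag storage is circuitry that holds no program data, and a line that is fetched but only partly used consumes transfer bandwidth for words that are never needed.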