This invention relates to an accelerator and a processor system including the accelerator.
To improve the throughput of a specific function of a computer that includes a processor system, in particular a multi-core processor system, as a component thereof, while keeping its power consumption low, a hardware module called an “accelerator” (hereinafter also referred to as “ACC”) is incorporated into the processor system. Examples of the ACC include a graphic accelerator for speeding up graphic display and a 3D accelerator for speeding up three-dimensional display.
Up to now, a tight coupling method and a loose coupling method are known as methods of coupling such an ACC to the processor system.
The tight coupling method represents a method of integrating the ACC with a CPU (central processing unit or general-purpose processor; hereinafter also referred to simply as “processor”) or coupling the ACC to the CPU in an almost integrated manner. In the tight coupling method, the ACC and the processor function in close cooperation with each other, which provides an advantage that the overhead for activating and controlling the ACC is low. This also provides an advantage that the ACC can be used efficiently even for processing that takes only a short time on the ACC, such as short vector processing (acceleration processing whose processing data amount is relatively small).
However, the tight coupling method poses a problem that, when the ACC is newly coupled to a processor, an instruction set of the processor needs to be extended in accordance with the ACC to be coupled. Examples of the extended instruction set include streaming SIMD extensions (SSE) disclosed in S. Thakkar, T. Huff, “The Internet Streaming SIMD Extensions”, Intel Technology Journal Q2, 1999.
Further, as a technology in the category of the tight coupling method, a technology for coupling the ACC directly to the processor, such as a co-processor, is disclosed in, for example, M. Awaga, H. Takahashi, “The uVP 64-Bit Vector Coprocessor: A New Implementation of High Performance Numerical Computation”, IEEE Micro, Vol. 13, No. 5, October 1993. In this method, there is no need to extend the instruction set, but the co-processor needs to be called for each processing unit (accelerator instruction). This incurs higher overhead, thereby posing a problem that improvement in arithmetic operation speed is impaired as a whole.
On the other hand, the loose coupling method represents, for example, a method of coupling the ACC to an external bus of the processor, as with a graphics processing unit (GPU) disclosed in “NVIDIA CUDA C Programming Guide Version 3.2”, 2010, or a method of coupling the ACC to an internal bus of the processor, as with an open multimedia application platform (OMAP) disclosed in “OMAP-L137 Application Processor System Reference Guide”, Texas Instruments, March, 2010. Although the external bus coupling method and the internal bus coupling method differ from each other, in both methods the ACC and the processor are separately provided. It is therefore possible to reserve an abundance of arithmetic units and memories for the ACC, which provides an advantage that the loose coupling method is suitable for regular arithmetic processing of a huge amount of data.
Further, in the loose coupling method, there is no need to extend the instruction set.
However, in the loose coupling method, it is necessary to call the ACC and transfer data for each processing unit (accelerator instruction). This incurs higher overhead, thereby posing a problem that the improvement in the arithmetic operation speed is impaired as a whole. Therefore, the loose coupling method is not suitable for irregular arithmetic processing.
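The trade-off described above between the tight coupling method and the loose coupling method can be illustrated with a simple cost model. The following is a minimal sketch only: the names `Coupling`, `total_time`, and `overhead_fraction` are hypothetical, and all cost values are illustrative assumptions rather than measurements of any actual ACC.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Coupling:
    """Hypothetical cost parameters for one way of attaching an ACC."""
    name: str
    call_overhead: float      # fixed cost paid once per accelerator call
    transfer_per_elem: float  # cost to move one data element to/from the ACC
    compute_per_elem: float   # cost to process one data element on the ACC


def total_time(c: Coupling, n_calls: int, elems_per_call: int) -> float:
    """Total cost when each call pays the fixed overhead plus per-element costs."""
    per_call = c.call_overhead + elems_per_call * (
        c.transfer_per_elem + c.compute_per_elem
    )
    return n_calls * per_call


def overhead_fraction(c: Coupling, n_calls: int, elems_per_call: int) -> float:
    """Share of the total cost spent on call overhead alone."""
    return n_calls * c.call_overhead / total_time(c, n_calls, elems_per_call)


# Assumed values: the tightly coupled ACC is cheap to call but has few
# arithmetic units; the loosely coupled ACC is expensive to call and needs
# data transfer, but processes each element faster.
TIGHT = Coupling("tight", call_overhead=1.0, transfer_per_elem=0.0, compute_per_elem=1.0)
LOOSE = Coupling("loose", call_overhead=1000.0, transfer_per_elem=0.5, compute_per_elem=0.1)

if __name__ == "__main__":
    workloads = {
        "many short vectors": (1000, 8),
        "one huge regular array": (1, 1_000_000),
    }
    for label, (n_calls, elems) in workloads.items():
        for c in (TIGHT, LOOSE):
            t = total_time(c, n_calls, elems)
            f = overhead_fraction(c, n_calls, elems)
            print(f"{label:>24} | {c.name:5} | time={t:12.1f} | overhead={f:.1%}")
```

Under these assumed values, the per-call overhead dominates the loosely coupled case when many short vectors are processed, whereas the loosely coupled case is faster for one large, regular data set. The model ignores any overlap of transfer and computation.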
Further, the above-mentioned multi-core processor system represents a processor system formed of a plurality of processor cores, and each of the processor cores includes the processor and, as necessary, the above-mentioned ACC.
In the multi-core processor system, processing is parallelized by the plurality of processor cores, thereby reducing the power consumption and improving the arithmetic throughput. For this purpose, a parallelizing compiler converts a serial processing program, which can operate only on a processor system formed of one processor, into a parallel processing program that can operate in parallel on a so-called multi-core processor formed of a plurality of processor cores. Specifically, the parallelizing compiler analyzes an input program of the serial processing, extracts portions that can operate in parallel from the input program, and allocates the arithmetic processing for those portions to a plurality of processors, thereby improving the throughput compared to the processor system formed of one processor.
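The kind of transformation performed by such a parallelizing compiler can be sketched by hand as follows. This is an illustration under stated assumptions, not the output of any compiler described in the documents cited here: the function names are hypothetical, the parallel portion is a simple loop whose iterations are independent, and worker threads stand in for processor cores.

```python
from concurrent.futures import ThreadPoolExecutor


def serial_sum_of_squares(data):
    """Serial input program: one processor walks the whole data set."""
    total = 0
    for x in data:
        total += x * x
    return total


def split_into_chunks(data, n_workers):
    """Split data into at most n_workers contiguous, near-equal chunks."""
    size, extra = divmod(len(data), n_workers)
    chunks, start = [], 0
    for i in range(n_workers):
        end = start + size + (1 if i < extra else 0)
        if end > start:  # skip empty chunks when data is shorter than n_workers
            chunks.append(data[start:end])
        start = end
    return chunks


def parallel_sum_of_squares(data, n_workers=4):
    """Parallel form: independent chunks are allocated to separate workers,
    and the partial results are combined at the end."""
    chunks = split_into_chunks(data, n_workers)
    if not chunks:
        return 0
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(serial_sum_of_squares, chunks))
    return sum(partials)


if __name__ == "__main__":
    values = list(range(1000))
    assert parallel_sum_of_squares(values) == serial_sum_of_squares(values)
    print(parallel_sum_of_squares(values))
```

The sketch shows only the structure of the result, namely extraction of an independent portion, allocation to workers, and combination of partial results. In standard CPython builds, threads do not speed up this kind of arithmetic, so the example should not be read as a performance demonstration.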
Technologies and the like disclosed in the following Patent Documents are known as technologies relating to: an architecture of such a multi-core processor system, in particular, the multi-core processor system having a plurality of processor cores including a general-purpose processor and an application-specific processor (such as ACC); and the parallelizing compiler for generating a parallel processing program that can operate in parallel on the multi-core processor.
JP 2006-293768 A discloses a technology relating to: a compiler for, in a multi-core processor system in which a variety of processor cores are mounted, efficiently operating each processor core by automatically extracting tasks having parallelism from an input program of serial processing to be processed and arranging the tasks in accordance with characteristics of the respective processor cores, and further generating a code for optimizing an operating frequency and a power supply voltage by estimating a processing amount of the processor core before adding the code to a target program; and a multiprocessor system that enables optimization thereof.
JP 2007-328415 A discloses a technology for preventing, in a heterogeneous multiprocessor system including a plurality of processor elements (such as processors) which differ in instruction set and configuration, a shortage of resources in a specific processor element, thereby improving the throughput of the whole multiprocessor system.
JP 2007-328416 A discloses a technology that allows efficient processing at low power while making maximum use of performance of a multiprocessor system, in which a variety of processor cores are integrated, by using a method of parallelizing a program by cooperation of a plurality of compilers for dividing the program, arranging portions thereof, and generating a control code therefor in such a manner as to efficiently operate the processor core.
JP 4476267 B discloses a technology for reducing, in a multi-core processor in which a data transfer mechanism is provided to each of a plurality of processor cores, overhead for data transfer between the processor cores, while using a compiler to facilitate optimization of the data transfer, thereby improving the throughput of the whole processor.