The present invention relates to a data processing device having multiple processors, and more particularly to a data processing device having a processor capable of computing variable-length bits and a processor adapted to mainly compute fixed-length bits and a data processing method thereof.
In recent years, digital signal processing, which rapidly processes large amounts of audio, video, and other data, has grown in importance. A DSP (Digital Signal Processor) is commonly used as a dedicated semiconductor device for such processing. However, for signal processing applications, and image processing applications in particular, the processing capacity of a DSP is insufficient because an extremely large amount of data must be processed.
Meanwhile, a parallel processor technology, which enables multiple arithmetic units to operate in a parallel manner to deliver high signal processing performance, has been increasingly developed. When a dedicated processor derived from the parallel processor technology is used as an accelerator attached to a CPU (Central Processing Unit), high signal processing performance can be delivered even in a situation where low power consumption and low cost are demanded as in the case of an LSI incorporated in an embedded device.
An SIMD (Single Instruction Multiple Data stream) processor, which performs computations in accordance with an SIMD method, can be cited as an example of the above-described parallel processor.
The SIMD processor includes a fine-grained arithmetic core and is suitable for integer arithmetic operations and fixed-point arithmetic operations. Here, it is assumed that the fine-grained arithmetic core is an arithmetic core capable of computing variable-length bits by performing an arithmetic operation multiple times.
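The idea of computing variable-length bits by repeating a narrow operation can be illustrated with a bit-serial adder. The following is only a minimal sketch under stated assumptions: the function name `bit_serial_add`, the 1-bit step size, and the unsigned wrap-around behavior are chosen for illustration and are not taken from the described processor.

```python
def bit_serial_add(a: int, b: int, width: int) -> int:
    """Add two unsigned `width`-bit integers one bit per step.

    Hypothetical model: each loop iteration stands for one pass of a
    1-bit arithmetic core, so a `width`-bit addition takes `width` passes.
    """
    carry = 0
    result = 0
    for i in range(width):
        x = (a >> i) & 1
        y = (b >> i) & 1
        s = x ^ y ^ carry                      # full-adder sum bit
        carry = (x & y) | (carry & (x ^ y))    # full-adder carry out
        result |= s << i
    # The carry out of the top bit is dropped, so the result wraps
    # modulo 2**width.
    return result
```

Because the loop count equals `width`, an 8-bit addition costs 8 passes and a 16-bit addition costs 16, which is one way to read "performing an arithmetic operation multiple times".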
A massively parallel processor, which is an SIMD processor incorporating 1024 fine-grained arithmetic units (hereinafter may be referred to as the PEs (Processor Elements)) that are tightly coupled with a memory and capable of performing computations in units of 1 to 2 bits, can perform a large number of integer arithmetic operations and fixed-point arithmetic operations within a short period of time. The massively parallel processor may be hereinafter referred to as the matrix-type massively parallel processor (MX).
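How 1024 processor elements can execute the same narrow operation in lockstep can be sketched with bit planes. This is an illustrative model only: the bit-plane layout, the helper names, and the use of Python integers as 1024-bit vectors are assumptions, not the memory organization of the described processor.

```python
NUM_PES = 1024  # number of processor elements stated in the text


def to_bit_planes(values, width):
    """Plane i holds bit i of every PE's operand; bit p of a plane belongs to PE p."""
    planes = []
    for i in range(width):
        plane = 0
        for pe, v in enumerate(values):
            plane |= ((v >> i) & 1) << pe
        planes.append(plane)
    return planes


def from_bit_planes(planes, count):
    """Rebuild one integer per PE from the bit planes."""
    return [
        sum(((plane >> pe) & 1) << i for i, plane in enumerate(planes))
        for pe in range(count)
    ]


def simd_add(a_planes, b_planes):
    """One instruction stream: each step is a 1-bit full add applied to all PEs at once."""
    carry = 0
    out = []
    for x, y in zip(a_planes, b_planes):
        out.append(x ^ y ^ carry)
        carry = (x & y) | (carry & (x ^ y))
    return out


# Usage: every PE adds its own pair of 12-bit operands in 12 shared steps.
a = list(range(NUM_PES))
b = [(3 * i) % 256 for i in range(NUM_PES)]
sums = from_bit_planes(simd_add(to_bit_planes(a, 12), to_bit_planes(b, 12)), NUM_PES)
```

In this model the number of steps depends only on the operand width, not on the number of PEs, which is the property that lets many integer operations finish within a short period of time.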
Further, because the matrix-type massively parallel processor uses fine-grained arithmetic units, it can perform computations at only the bit length that is actually required. Its power consumption can therefore be reduced, giving it a higher performance-to-power-consumption ratio than general-purpose DSPs and the like.
Furthermore, as the matrix-type massively parallel processor can load and execute a prepared program, it can perform parallel computations simultaneously with a CPU that controls it. Moreover, the matrix-type massively parallel processor incorporates an entry communicator (ECM) to move data between the arithmetic units as described later so that data exchange can be made simultaneously with computations with the aid of a controller supporting a VLIW (Very Long Instruction Word) instruction. Therefore, the matrix-type massively parallel processor can supply data with higher efficiency than a processor in which arithmetic units are simply arrayed in a parallel manner.
Meanwhile, a coarse-grained arithmetic core, such as a floating-point arithmetic unit (FPU), is an arithmetic unit specifically designed for fixed-length floating-point arithmetic operations and used while it is coupled to a CPU. Here, it is assumed that the coarse-grained arithmetic core is an arithmetic core capable of computing fixed-length bits by performing a single arithmetic operation.
The floating-point arithmetic unit includes a floating-point arithmetic register. The data to be subjected to an arithmetic operation is supplied from the CPU or a memory through this register. The CPU interprets an execution instruction and issues a computation request to the floating-point arithmetic unit. The floating-point arithmetic unit has a pipeline configuration. Even when a single arithmetic process is not completed in one cycle, the floating-point arithmetic unit effectively performs one arithmetic operation per cycle as long as data is continuously supplied. Relevant technologies are described in connection with inventions disclosed in Japanese Unexamined Patent Publications Nos. 2001-027945 and 2001-167058.
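The pipeline behavior of the floating-point arithmetic unit described above, where a multi-cycle operation still yields roughly one result per cycle, can be sketched with a small simulation. The function name `simulate_pipeline` and the three-stage latency used in the usage lines are assumptions for illustration; the text does not state a stage count.

```python
from collections import deque


def simulate_pipeline(operands, latency, op):
    """Feed one operand pair per cycle into a `latency`-stage pipeline.

    Returns (results, cycles). Hypothetical model: an operation issued in
    cycle t produces its result in cycle t + latency - 1.
    """
    stages = deque([None] * (latency - 1))  # operations currently in flight
    pending = deque(operands)
    results = []
    cycles = 0
    while pending or any(s is not None for s in stages):
        cycles += 1
        # Issue a new operation this cycle if data is available.
        stages.appendleft(pending.popleft() if pending else None)
        done = stages.pop()  # whatever reaches the last stage completes
        if done is not None:
            results.append(op(*done))
    return results, cycles


# Usage: four multiplications through an assumed 3-stage pipeline.
pairs = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0), (7.0, 8.0)]
products, total_cycles = simulate_pipeline(pairs, 3, lambda x, y: x * y)
```

With continuous supply, `n` operations take `latency + n - 1` cycles in this model, so the per-operation cost approaches one cycle as `n` grows. If the supply stalls, the empty slots pass through the pipeline and the throughput drops accordingly.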
The invention disclosed in Japanese Unexamined Patent Publication No. 2001-027945 aims to provide a floating-point unit that does not require dedicated hardware for each of different data type formats. A device described in Japanese Unexamined Patent Publication No. 2001-027945 includes a floating-point unit having a standard multiply-accumulate (MAC) unit capable of performing a multiply-accumulate operation on those data type formats. The standard MAC unit is configured to compute both a conventional data type format and a single-instruction multiple-data (SIMD) type format. As this eliminates the need for a dedicated SIMD MAC unit, the die area is considerably reduced. When an SIMD instruction is computed by the MAC unit, data is given to high-order and low-order MAC units as a 64-bit word. The MAC units each receive one or more bits selecting the upper half or the lower half of the 64-bit word. The MAC units each compute their respective 32-bit words. The results of the computations are combined into a 64-bit word by bypass blocks of the floating-point unit.
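The half-word selection and recombination described above can be sketched as follows. This is an illustrative model only: the function names, the little-endian packing with the low half first, and the use of a single `upper` flag as the selection bit are assumptions, and the arithmetic is done in Python double precision and then rounded to single precision, which is not necessarily how the cited hardware rounds.

```python
import struct


def pack2(lo: float, hi: float) -> int:
    """Pack two single-precision values into one 64-bit word (low half first)."""
    return int.from_bytes(struct.pack("<ff", lo, hi), "little")


def unpack_half(word: int, upper: bool) -> float:
    """Select the upper or lower 32-bit half of a 64-bit word as a float."""
    half = (word >> 32) & 0xFFFFFFFF if upper else word & 0xFFFFFFFF
    return struct.unpack("<f", half.to_bytes(4, "little"))[0]


def mac_unit(a_word: int, b_word: int, c_word: int, upper: bool) -> int:
    """One MAC unit: pick its half of each operand and compute a*b + c."""
    a, b, c = (unpack_half(w, upper) for w in (a_word, b_word, c_word))
    result = a * b + c
    # Round to single precision and return the raw 32-bit pattern.
    return struct.unpack("<I", struct.pack("<f", result))[0]


def simd_mac(a_word: int, b_word: int, c_word: int) -> int:
    """Run the low-order and high-order MAC units and combine their 32-bit results."""
    lo = mac_unit(a_word, b_word, c_word, upper=False)
    hi = mac_unit(a_word, b_word, c_word, upper=True)
    return (hi << 32) | lo  # stands in for the bypass blocks


# Usage: two multiply-accumulates carried in one 64-bit word per operand.
packed = simd_mac(pack2(1.5, 2.0), pack2(2.0, 4.0), pack2(0.5, 1.0))
low_result = unpack_half(packed, upper=False)
high_result = unpack_half(packed, upper=True)
```

The same `mac_unit` routine serves both halves, which mirrors the point of the cited publication that no separate SIMD-only MAC hardware is needed.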
The invention disclosed in Japanese Unexamined Patent Publication No. 2001-167058 provides an information processing device capable of permitting a CPU or other similar microprocessor and an FPU (floating-point arithmetic unit) or other similar dedicated processor to perform processing operations in a parallel manner, and aims to provide an increased processing capacity by reducing the wait time of the microprocessor. The information processing device has a multi-FPU configuration. An FPU status register in an FPU coupling controller monitors the status of each of multiple FPUs. When any one of multiple CPUs issues a request concerning an assistance-requesting instruction to an FPU status decoder in the FPU coupling controller, an FPU selector is controlled so as to couple the requesting CPU to a nonoperating, unoccupied FPU in accordance with information stored in the FPU status register. Further, a temporary storage register selection controller controls a temporary storage register selector to prevent damage to data in an area used by a temporary storage register.
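The coupling of a requesting CPU to an unoccupied FPU on the basis of a status register can be sketched as a simple allocator. The class and method names below are hypothetical, the first-free selection policy is an assumption, and the temporary storage register handling described in the publication is not modeled.

```python
class FpuCouplingController:
    """Hypothetical model of an FPU status register plus selector."""

    def __init__(self, num_fpus: int):
        # Status register: None means the FPU is free, otherwise it
        # holds the id of the CPU currently coupled to that FPU.
        self.status = [None] * num_fpus

    def request(self, cpu_id: int):
        """Couple the requesting CPU to the first free FPU; return its index or None."""
        for fpu_id, owner in enumerate(self.status):
            if owner is None:
                self.status[fpu_id] = cpu_id
                return fpu_id
        return None  # every FPU is occupied, so the CPU has to wait

    def release(self, fpu_id: int) -> None:
        """Mark an FPU as free again once its operation has finished."""
        self.status[fpu_id] = None
```

For example, with two FPUs and three requesting CPUs, the third request returns `None` until one of the first two FPUs is released. Keeping the occupancy information in one place is what lets a CPU be routed to any idle FPU rather than waiting on a fixed one.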