In recent years, process micropatterning has advanced, and the scale of circuits incorporated into LSIs has reached several tens of millions of gates.
Meanwhile, as devices are downsized, various problems have become significant. Among them, the increase in wiring delay inside an LSI poses a particularly serious problem.
A typical example that is strongly affected by this wiring delay is a system bus, which connects modules by interconnections.
Also, the number of IP (Intellectual Property) cores requiring high throughput has recently been increasing rapidly. For example, USB1.1 is being replaced with USB2.0, and PCI is being replaced with PCI-Express. Accordingly, demand for on-chip buses capable of high-throughput transfer is increasing (e.g., U.S. Pat. No. 6,857,037).
In June 2003, ARM announced the AXI (Advanced eXtensible Interface) protocol of AMBA3.0 as its standard for next-generation on-chip buses, and this protocol is attracting attention. AMBA is an abbreviation of “Advanced Microcontroller Bus Architecture”. The AXI introduces the concept of a “channel”, which the conventional AHB (Advanced High-Performance Bus) did not have, and uses channels to improve the throughput of data transfer. More specifically, the AXI supports an independent transfer system in which the address phase and data phase are separated, and an out-of-order transfer system in which the result of a cycle issued later can be returned first.
As shown in FIG. 1, the “channel” is defined as a series of transfer paths over which a master 101 and slave 102 transfer data 105 by a two-line handshake using the signals valid 103 and ready 104; the transfer source of the data 105 outputs the valid 103, and the transfer destination outputs the ready 104.
Referring to FIG. 1, one transfer is established in a cycle in which the valid 103 output by the master 101 and the ready 104 output by the slave 102 are both asserted.
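The handshake rule can be illustrated with a small Python sketch. It is only a cycle-level model, not an AXI implementation; the function name `handshake_cycles` and the sample waveforms are illustrative assumptions that do not come from the AXI definition.

```python
def handshake_cycles(valid, ready):
    """Return the cycle indices in which a transfer is established.

    valid and ready are per-cycle samples (0 or 1) of the two handshake
    signals; both sequences are hypothetical inputs for illustration.
    """
    return [cycle for cycle, (v, r) in enumerate(zip(valid, ready)) if v and r]


# Hypothetical waveforms: the source raises valid in cycle 1,
# but the destination is not ready until cycle 2.
valid = [0, 1, 1, 1, 0]
ready = [1, 0, 1, 1, 1]
transfers = handshake_cycles(valid, ready)
print(transfers)  # [2, 3]
```

In this model, cycle 1 produces no transfer because only valid is asserted; the source simply holds its data until the destination asserts ready.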
As shown in FIG. 2, this AXI transfers data between a master 201 and slave 202 by using four channels, i.e., address 203, write data 205, read data 204, and write response 206. Because data can therefore be transferred independently in the address channel and the data channels, addresses can be issued one after another, so the bus can be used effectively.
In addition, since out-of-order transfer is supported, data from a slave having a small latency can be returned first. For these reasons, the AXI can be expected to increase the bus utilization efficiency.
Note that the AXI is the definition of a protocol, and hence does not define the implementation of the network connecting the bus. Normally, the network is presumably implemented by a crossbar structure or multilayer structure. Also, ARM supplies PrimeCell PL300 having a multilayer structure as its own IP (Intellectual Property) core.
On the other hand, systems having a plurality of CPUs, memories, and I/Os inside a system LSI are becoming popular, so these devices must be interconnected. However, a plurality of bus masters and bus slaves are not always evenly laid out inside the system LSI. Therefore, it may be found at the layout stage that some connecting networks, particularly high-load buses such as address buses and data buses, can no longer operate at the expected operating frequency.
In this case, time and labor are wasted in laying out the paths that do not meet timing again and again in order to ensure timing at the expected operating frequency. In the worst case, it is necessary to lower the operating frequency or increase the chip size.
Accordingly, these problems are addressed by inserting a register into the signals on the critical path between a master and slave, thereby dividing the path and reducing the register-to-register delay. In a point-to-point connection like the AXI described above, the protocol is defined by a two-line handshake. Since information transferred by the handshake flows in one direction, a register can be inserted relatively easily. This method is called a “register slice” in the AXI.
Assume, for example, that the path delay between the master and slave is 2 ns and the operating frequency is 800 MHz (one period=1.25 ns). In this case, the delay can be divided into segments of about 1 ns each by inserting a register slice between the two points. This makes it possible to achieve the necessary operating frequency.
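The arithmetic of this example can be checked with a short Python sketch. It is a deliberately simplified model: it assumes the path can be split evenly and ignores register setup time, clock-to-output delay, and clock skew. The function names are hypothetical.

```python
import math


def period_ns(freq_mhz):
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1000.0 / freq_mhz


def register_slices_needed(path_delay_ns, freq_mhz):
    """Minimum number of register slices so that every path segment fits
    in one clock period (idealized: even split, no register overhead)."""
    segments = math.ceil(path_delay_ns / period_ns(freq_mhz))
    return max(segments - 1, 0)


print(period_ns(800))                     # 1.25
print(register_slices_needed(2.0, 800))   # 1 (two segments of about 1 ns)
print(register_slices_needed(1.0, 800))   # 0 (path already fits in one period)
```

A real timing analysis would subtract the register overhead from the period before dividing, so this estimate is a lower bound on the number of register slices.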
FIG. 3 is a view showing a connection example in which a register slice is inserted in the point-to-point connection shown in FIG. 2. In the example shown in FIG. 3, register slices 301 and 302 are respectively inserted in the address channel and read data channel, thereby dividing the path. Consequently, the latency from the issue of an address by the master 201 to the reception of read data increases by two cycles, but the system can operate at up to double the operating frequency.
As described above, a high operating frequency can be assured by dividing a path by inserting registers in the point-to-point connection. However, a higher operating frequency does not unconditionally improve the performance; the performance may even deteriorate depending on the relationship between the throughput and the increased latency.
Also, it should be noted that signals propagate not only in the direction from the master to the slave, but also in the response signal (ready signal) direction from the slave to the master. No handshake timing is established if each signal is simply latched by one flip-flop (FF).
This is because both the master and slave would then receive the signals that the other side output in the immediately preceding cycle.
To solve this problem, it is possible to form a register slice by using two FFs. However, the use of two FFs has the demerits that the circuit scale increases, and the latency also increases.
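One common way to build a fully registered slice is a two-entry buffer whose outputs depend only on registered state. The Python sketch below models that idea at cycle level. It is an assumption chosen for illustration, not the specific two-FF circuit referred to above, and the names `RegisterSlice` and `run` are hypothetical.

```python
from collections import deque


class RegisterSlice:
    """Cycle-level model of a two-entry registered buffer (assumed design)."""

    def __init__(self):
        self.buf = deque()

    def step(self, in_valid, in_data, out_ready):
        # Outputs are derived only from the state registered before this cycle.
        in_ready = len(self.buf) < 2
        out_valid = len(self.buf) > 0
        out_data = self.buf[0] if out_valid else None
        # State update at the clock edge.
        if out_valid and out_ready:
            self.buf.popleft()
        if in_valid and in_ready:
            self.buf.append(in_data)
        return in_ready, out_valid, out_data


def run(data, ready_pattern):
    """Drive the slice from a source that holds its data until accepted."""
    slice_ = RegisterSlice()
    pending = deque(data)
    received = []
    for out_ready in ready_pattern:
        in_valid = bool(pending)
        in_data = pending[0] if in_valid else None
        in_ready, out_valid, out_data = slice_.step(in_valid, in_data, out_ready)
        if in_valid and in_ready:
            pending.popleft()
        if out_valid and out_ready:
            received.append(out_data)
    return received


# The destination stalls for two cycles; no data is lost or reordered.
print(run([1, 2, 3, 4], [True, False, False, True, True, True, True, True]))
# [1, 2, 3, 4]
```

The second storage entry is what absorbs the data item that arrives in the cycle before the registered ready signal reaches the source, which is the reason a single register per signal is not enough.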
These demerits are eliminated if the register slice is formed by using not two FFs but two half-latches using the normal phase and reverse phase of a clock.
FIG. 16 shows an example of a register slice formed by half-latches. Referring to FIG. 16, the left side shows master (initiator)-side signals, and the right side shows slave (target)-side signals. The half-latches shown in FIG. 16 become transparent when the EN input is 1.
FIG. 17 shows a timing chart of the register slice. Referring to FIG. 17, each hatched portion represents a half-latch holding operation, i.e., the timing of EN=0.
As shown in FIG. 17, when the ready signal from the target is deasserted, a handshake timing is generated by masking the data holding operation.
The average transfer rate with respect to the operating frequency when burst transfer is performed by a 32-bit data bus connected by the point-to-point connection as shown in FIG. 2 will be explained below with reference to FIGS. 4 and 5.
FIG. 4 is a graph showing the relationship between the operating frequency and transfer rate when 16-burst transfer is performed. FIG. 5 is a graph showing the relationship between the operating frequency and transfer rate when 4-burst transfer is performed.
As shown in FIGS. 4 and 5, the latency difference between the cases in which the register slice is inserted and is not inserted appears as a difference in the average transfer rate. For example, when the 16-burst transfer shown in FIG. 4 is performed, a transfer rate of 480 MBytes/s is obtained in a 120-MHz operation without the register slice. By contrast, when the register slice is inserted, the performance of the 120-MHz operation without the register slice cannot be exceeded unless the system is operated at an operating frequency of at least about 135 MHz.
Note that the number of cycles necessary for 16-burst transfer is 16 when the register slice is not inserted, and 18 when the register slice is inserted because the latency increases by “+2 cycles”.
When the 4-burst transfer shown in FIG. 5 is performed, a transfer rate of 480 MBytes/s is obtained in a 120-MHz operation without the register slice. By contrast, when the register slice is inserted, the number of cycles increases from 4 to 6, so the performance of the 120-MHz operation without the register slice cannot be exceeded unless the system is operated at an operating frequency of at least about 180 MHz.
That is, the smaller the burst length to be supported, the larger the demand for the operating frequency when the register slice is inserted.
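The relationship between burst length and the required frequency can be reproduced with a minimal Python sketch. It assumes the cycle counts stated above (N cycles for an N-burst transfer without the register slice, N + 2 with it) and a 32-bit (4-byte) data bus; the function names and default parameters are hypothetical.

```python
def transfer_rate_mbytes(freq_mhz, burst_len, extra_cycles=0, bus_bytes=4):
    """Average transfer rate in MBytes/s for one burst.

    Assumes burst_len cycles of data plus extra_cycles of added latency.
    """
    return freq_mhz * bus_bytes * burst_len / (burst_len + extra_cycles)


def breakeven_freq_mhz(base_freq_mhz, burst_len, extra_cycles=2):
    """Frequency at which the sliced bus matches the unsliced bus
    running at base_freq_mhz (same simplified cycle model)."""
    return base_freq_mhz * (burst_len + extra_cycles) / burst_len


print(transfer_rate_mbytes(120, 16))        # 480.0
print(breakeven_freq_mhz(120, 16))          # 135.0
print(breakeven_freq_mhz(120, 4))           # 180.0
print(transfer_rate_mbytes(135, 16, 2))     # 480.0
```

In this model the break-even frequency scales with (N + 2)/N, so the penalty of the two added cycles grows as the burst length N shrinks.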
The above results indicate that if it is found in the layout stage that timing convergence at the target frequency is difficult, the following problem arises even when the target frequency can be achieved by inserting the register slice. That is, the target frequency corresponds to the performance before the register slice is inserted, so it is sometimes impossible to satisfy the performance any longer after the register slice is inserted because the latency increases. In this case, the performance must be met by selecting a higher frequency from selectable frequency candidates.
When the platform of developed system LSIs is to be expanded from the low-end product to the high-end product in accordance with the product specifications, the power consumption of a product having a low required performance can be made smaller when it is operated at a low operating frequency than when it is operated at a high operating frequency. However, if the operating frequency of a system LSI in which the register slice is simply inserted is lowered, it is sometimes impossible to satisfy the performance because the latency increases.