Field of the Invention
The present invention relates to a processing technology, particularly to a speculative lookahead processing device and method.
Description of the Related Art
IC fabrication technology advances continuously, and the transistors on a chip keep growing smaller and faster. As more and faster transistors are packed into a smaller chip, IC performance improves, which benefits applications in high-speed computation, consumer electronics, automotive electronics, medicine and healthcare. However, an IC densely packed with fast transistors suffers from problems of power consumption and heat dissipation, which have become bottlenecks in IC design. The dynamic power consumption of a chip is proportional to the square of the voltage supplied to the chip, so decreasing the operating voltage is one of the most effective methods to save power. In order to increase the working time of battery-powered devices, the industry tends to design and fabricate ICs operating at ultralow voltage, and many ultralow-voltage IC technologies have been developed. In advanced process technologies, especially in sub-40 nanometer IC design, transistor variation is severe: differences in trace width and length resulting from optical diffraction and chemical etching cause circuits to operate at different speeds, especially in ultralow-voltage ICs. This transistor variation is likely to affect the performance of the IC. Refer to FIG. 1 for the data proposed by B. H. Calhoun, which shows the latency distributions of ICs operating at different voltages. FIG. 1 shows that the operation of an IC is considerably slower at ultralow voltages (200 mV and 300 mV). In comparison with operation at a normal voltage (1 V), the slowest case at 200 mV is 1000 times slower than the slowest case at 1 V, and the slowest case at 300 mV is still 100 times slower than the slowest case at 1 V. FIG. 1 also shows that the lower the operating voltage, the wider the latency distribution curve.
The abovementioned phenomenon indicates that both the latency distribution and its variation spread out as the operating voltage of an IC is lowered. Refer to FIG. 2 and FIG. 3, which respectively show the normalized latencies of ICs operating at 300 mV and 1.2 V. FIG. 2 and FIG. 3 show that lowering the operating voltage not only decreases the overall speed of the circuit but also increases the range of speed variation. The latency of the slowest case is 1.4 times the relative latency. Thus, the overall performance of the circuit is degraded. Synchronous circuit technology normally adopts a single clock, uses the STA (Static Timing Analysis) of EDA (Electronic Design Automation) tools to analyze the latency of the circuit, and designs the circuit according to the worst case: slow PMOS/slow NMOS operating at a voltage 10% lower than the rated voltage and at a temperature of 125° C. This guarantees that a circuit fabricated by any possible process and receiving inputs of different variations can operate correctly at any possible voltage and temperature. Directly applying this traditional worst-case (over-designed) rule to ultralow-voltage ICs is too pessimistic an approach and seriously degrades performance. Asynchronous IC technology, in turn, lacks an EDA tool to verify the IC design. The variable-latency datapath is a circuit technology that is effectively exempt from satisfying the worst case: instead of over-designing the IC to deal with the worst case, it designs the IC according to the normal case. Refer to FIG. 4, a diagram schematically showing a variable-latency datapath. The block in the center denotes the variable-latency datapath. The input is triggered by the clock. The variable-latency datapath processes the input x[n] and outputs y[n]. The datapath contains an error detection circuit 10, which can be realized by various error detection methods. Different variable-latency datapath technologies are characterized by their respective error detection methods.
When the latency is long, the error detection circuit 10 emits a waiting signal to indicate that the datapath is still operating.
The double latching mechanism is one of the variable-latency datapath technologies. In addition to data variation, the double latching mechanism can also dynamically deal with variations in fabrication process, voltage and temperature. Refer to FIG. 5, which schematically shows a circuit-level double latching mechanism for latency speculation. Traditional synchronous circuit technology sets the clock period to equal or exceed the worst-case path latency of the computing core so as to guarantee that the calculation is completed correctly and in time in all cases. The double latching mechanism instead aggressively sets the clock period shorter than the worst-case path latency so as to achieve a faster operation speed, and it adds a latency speculation circuit to support this. The double latching mechanism needs two latches respectively storing a speculation value and a correct value, a comparator comparing the speculation value with the correct value, and a clock-delay latch latching the correct value. The double latching mechanism stores the speculation result in a second latch 12 beforehand. This action rests on the premise that most calculations can be completed within the current clock cycle. Otherwise, the double latching mechanism achieves only very limited benefit or even no benefit. After the extra clock delay, the double latching mechanism stores the correct result in a third latch 14. The sum of the clock delay and the original clock cycle should equal or exceed the worst-case path latency so as to guarantee that the calculation achieves a correct result. If the result stored in the second latch 12 is consistent with the result stored in the third latch 14, the speculation is successful, and the calculation process continues.
If the result stored in the second latch 12 is inconsistent with the result stored in the third latch 14, all calculation results based on the incorrect speculation result are discarded. Then, the correct result stored in the third latch 14 is fed back to the second latch 12 in the next clock cycle, and the calculation is performed once again. The system behavior of an incorrect speculation is equivalent to stalling the pipeline for one clock cycle, or to using two clock cycles to complete the calculation of the data.
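The trade-off described above can be illustrated with a minimal sketch. It assumes a clock period φ and an extra clock delay δ (the symbols used in the FIG. 6 example), assumes that a conventional synchronous design would be clocked at the worst-case period φ+δ, and assumes that every failed speculation costs exactly one extra clock cycle. The function names and the numeric values are hypothetical and serve only as an illustration.

```python
from fractions import Fraction


def avg_time_per_result(phi, miss_rate):
    """Average time per result when each failed speculation stalls one cycle.

    A successful speculation takes one period (phi); a failed one takes two,
    so the average is phi * (1 + miss_rate).
    """
    return phi * (1 + miss_rate)


def break_even_miss_rate(phi, delta):
    """Miss rate at which speculation only ties a worst-case clock of phi + delta.

    Solving phi * (1 + p) = phi + delta for p gives p = delta / phi.
    """
    return Fraction(delta, phi)


# Hypothetical values: 3 ns clock period, 1 ns extra clock delay.
phi, delta = 3, 1
worst_case_period = phi + delta

for miss in (Fraction(1, 10), Fraction(1, 3), Fraction(1, 2)):
    t = avg_time_per_result(phi, miss)
    print(f"miss rate {miss}: {t} ns per result, "
          f"faster than worst-case clock: {t < worst_case_period}")
```

Under these assumptions, speculation pays off only while the fraction of failed speculations stays below δ/φ, which is consistent with the premise that most calculations must complete within the current clock cycle.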
Refer to FIG. 6 for a timing diagram describing an example of the double latching mechanism, wherein φ is a clock cycle of 3 ns and δ is a clock delay of 1 ns. In the example, the worst-case delay of the datapath is 4 ns (φ+δ). The first piece of data enters the system at the 0th ns. The second latch 12 latches the speculation output at the 3rd ns. After the elapse of 1 ns, i.e. at the 4th ns (the first clock cycle φ + the delay δ), the third latch 14 latches the correct output. If the output at the 3rd ns is consistent with the output at the 4th ns, the speculation is successful. The speculation output of the input received at the 3rd ns is latched at the 6th ns, and the correct output of that input is latched at the 7th ns. As the output at the 6th ns is inconsistent with the output at the 7th ns, the pipeline is stalled during this clock cycle. The third latch 14 feeds back the correct value to the second latch 12, which latches the correct value and outputs it at the 9th ns. As the pipeline is stalled during this period and the calculation is undertaken once again, the first latch 16 does not receive any new input until the 12th ns. At the 12th ns, the second latch 12 latches the speculation output of the datapath. At the 13th ns, the third latch 14 latches the correct output. As that speculation is successful, the pipeline is not stalled: the second latch 12 directly latches the next speculation output at the 15th ns, and the third latch 14 then latches the corresponding correct output. As the value of the second latch 12 is consistent with the value of the third latch 14, this speculation is also successful.
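The latch times in the timing example above can be reproduced with a small behavioral sketch. It is a simplified model rather than the circuit of FIG. 5: it tracks only when the second latch 12 and the third latch 14 capture values, and it assumes that a speculation fails exactly when the datapath delay of an input exceeds φ. The per-input path delays are hypothetical values chosen so that only the second input is slow.

```python
def double_latch_times(path_delays_ns, phi=3, delta=1):
    """Return (spec_time, check_time, final_time, ok) for each input.

    spec_time  : second latch captures the speculation output
    check_time : third latch captures the correct output (delta later)
    final_time : second latch holds a value known to be correct
    ok         : True if the speculation matched (assumed: delay <= phi)
    """
    events = []
    edge = 0  # clock edge that starts the current computation window
    for delay in path_delays_ns:
        if delay > phi + delta:
            raise ValueError("path delay exceeds phi + delta")
        spec_time = edge + phi
        check_time = spec_time + delta
        ok = delay <= phi
        # A failed speculation stalls one cycle while the third latch
        # feeds the correct value back to the second latch.
        final_time = spec_time if ok else spec_time + phi
        events.append((spec_time, check_time, final_time, ok))
        edge = final_time
    return events


# Hypothetical path delays in ns; only the second input exceeds phi.
for event in double_latch_times([2, 4, 3, 3]):
    print(event)
```

With these assumed delays, the second latch holds correct values at the 3rd, 9th, 12th and 15th ns, and the only mismatch is the one detected at the 7th ns.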
In order to protect the correct data, which the third latch 14 latches during the delay δ, from a race condition, the computing core has to guarantee that the shortest datapath delay is not smaller than δ; otherwise, the data stored in the first latch 16 for the succeeding calculation would interfere with the current calculation. In practice, the circuit-level latency speculation of the double latching mechanism can be incorporated into the IC design process: the double latching mechanism is triggered by the reverse clock pulse; the duty cycle is controlled to adjust δ; and the minimum datapath delay is enforced by constraining the hold time, for which extra transistors can be inserted. The double latching mechanism has a frequently criticized drawback: it detects timing violations at a considerable cost. Especially in processor design, the extra third latch 14 functioning as the shadow register, the extra transistors for avoiding the race condition, and the comparator for verifying the speculation value together have a size almost identical to that of the arithmetic unit whose outputs are speculated, occupying extra chip area and consuming extra power. Further, the delay required by the shadow register complicates the design of the clock tree. Furthermore, the duty cycle ratio is hard to keep constant in various modes and at different operating voltages, even when dual edge triggering is adopted. The clock tree is also very likely to be influenced by variation. Therefore, it is hard for the system to generate an accurate clock delay. The uncertainty of the clock signals then increases, and more over-design is required to overcome the additional problems. Thus, the gain does not offset the loss.
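The minimum-delay requirement stated above can be expressed as a simple check. The following sketch assumes that each datapath is summarized by its shortest delay and that any shortfall relative to δ is made up by inserted delay elements; the function name and the delay values are hypothetical.

```python
def hold_padding_ns(min_path_delays_ns, delta_ns):
    """Extra delay each path needs so that its shortest delay is >= delta.

    A path that is already slow enough needs no padding (0).
    """
    return [max(0, delta_ns - d) for d in min_path_delays_ns]


# Hypothetical shortest delays of three paths, with delta = 1 ns.
padding = hold_padding_ns([0.25, 1.0, 2.5], delta_ns=1.0)
print(padding)
```

Only the first path in this example is shorter than δ, so it alone would need added delay, which corresponds to the extra transistors and extra power that the passage counts among the costs of the mechanism.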
Accordingly, the present invention proposes a speculative lookahead processing device and method to overcome the abovementioned problems.