The present application generally relates to arbitrating a shared resource in a computing environment. More particularly, the present application relates to detecting and/or correcting soft error(s) in an arbitration logic device in a digital circuit while the arbitration logic device continues to work correctly under the soft error(s).
In a digital circuit, it is common that multiple modules compete for a single shared resource (e.g. bus, cache memory, etc.). Thus, an arbitration logic device is often used to resolve shared resource conflicts. An arbitration logic device selects one of a winner requestor among the multiple requestors (i.e., the competing multiple modules). Then, the winning requestor accesses the shared resource. In very large scale integrated (VLSI) circuits, a large number of requestors are subject to competing each other. For example, there are hundreds or even thousands of candidate requestors for such competition.
An arbitration logic device memorizes the state of each requestor (e.g. whether each requestor has a pending request), e.g., by storing the state of each requestor in storage elements, e.g., latches, registers, flip-flops, etc. However, these storage elements can flip their values due to soft errors. Soft error refers to an error on data stored in a computing system that does not damage hardware of the computing system but corrupts the data. Because of a trend of high-density and low-power consumption in semiconductor designing/manufacturing technology (e.g., 20-nm CMOS technology), a soft error may occur more frequently in recent VLSI circuits. A soft error occurs not only in a memory device (e.g., SRAM, DRAM, SDRAM, etc.), but also in a register, for example, in a processor (core). Therefore, a soft error becomes more significant problem as the digital circuits are designed based on nanotechnology (e.g., 30-nm CMOS technology).
Traditionally, a duplication method has been used to detect and correct soft errors in a digital circuit. Duplication method uses multiple instances of storage elements to store same data. Using two copies of data, it is possible for the digital circuit to detect a single bit error. For example, if the two copies have different values, there exists a soft error on the data. Similarly, using three copies of data, the digital circuit can correct a single bit error, e.g., considering two copies that store same data as valid copies. Although this duplication method is simple and easy to implement, it increases the number of storage elements in the digital circuits unacceptably in terms of hardware size and power consumption.
ECC (Error Correcting Code) has also been a popular method to correct soft errors in digital circuits. Adding a small number of extra information (e.g., additional 10% data) to original information, hardware logic implementing an ECC scheme (e.g., multiple parity bits) can correct soft errors as long as the number of flipped bits is small enough (e.g., the number of bits being corrupted is one).
Protecting memory cells (i.e., cells in a memory device) using ECC is widely used in current digital systems. However, a naïve ECC method is not efficient for the arbitration logic device. For example, because the arbitration logic device needs to know the states of all the requestors, the arbitration logic device looks up all the memorized information at once. Therefore, all the memorized information has to be corrected at the same time. As a result, significant amount of ECC correction logic device is necessary: traditionally, one ECC correction logic device is required per one ECC word. The number of ECC correction logic devices increases as the number of ECC words increases. However, this increase becomes not acceptable both in hardware size and in power consumption as digital circuits become dense and operate in a low-power environment (e.g., Vdd=1.6V). Furthermore, an ECC correction delay (i.e., the time that an ECC correction logic device takes to fix a soft error) is added to the critical path in the arbitration logic device, thus increasing latency for the arbitration. A critical path in a digital circuit refers to a path that takes the longest time to operate in the digital circuit.
There have been other methods proposed to solve soft errors that include, but are not limited to: 1. Exploiting time redundancy to tolerate soft errors, 2. Using a known Delay-Assignment-Variation (DAV) methodology to mitigate soft errors, 3. Optimizing internal structures of latches to make them tolerant to soft errors, etc. Though these methods have some effect on reducing the impact of soft errors, they depend on semiconductor devices or development tools. Thus, they are lack of generality because these proposed methods rely on semiconductor device technologies (e.g., 40 nm CMOS technology) and synthesis tools (e.g., synthesis tools from Cadence®, etc.) through which these method are implemented on semiconductor devices. Sometimes, they are difficult or even impossible to be implemented.
There has been a method for fixing soft errors at a system level. For example, there is a method for microprocessors to recover from soft errors by an additional system-level logic or process for soft error handling, e.g., adding check points. However, depending on a design of a digital circuit, it may not be easy to add such mechanism in the digital circuit.