A computer system can be broken into three basic blocks: a central processing unit (CPU), memory, and input/output (I/O) units. These blocks are interconnected by means of a bus. An input device such as a keyboard, mouse, disk drive, analog-to-digital converter, etc., is used to input instructions and data to the computer system via the I/O unit. These instructions and data can be stored in memory. The CPU retrieves the data stored in the memory and processes the data as directed by the stored instructions. The results can be stored back into memory or outputted via the I/O unit to an output device such as a printer, cathode-ray tube (CRT) display, digital-to-analog converter, LCD, etc.
In one instance, the CPU consisted of a single semiconductor chip known as a microprocessor. This microprocessor executed the programs stored in the main memory by fetching their instructions, examining them, and then executing them one after another. Due to rapid advances in semiconductor technology, faster, more powerful and flexible microprocessors were developed to meet the demands imposed by ever more sophisticated and complex software.
In some applications multiple processors are utilized. A singularly complex task can be broken into sub-tasks. Each sub-task is processed individually by a separate processor. For example, in a multiprocessor computer system, word processing can be performed as follows. One processor can be used to handle the background task of printing a document, while a different processor handles the foreground task of interfacing with a user typing on another document. Thereby, both tasks are handled in a fast, efficient manner. This use of multiple processors allows various tasks or functions to be handled by processors other than a single CPU, so that the computing power of the overall system is enhanced. And depending on the complexity of a particular job, additional processors may be added. Furthermore, utilizing multiple processors has the added advantage that two or more processors may share the same data stored within the system.
These processors often contain a small amount of dedicated memory, known as a cache. Caches are used to increase the speed of operation. In a processor having a cache, as information is called from main memory and used, it is also stored, along with its address, in a small portion of especially fast memory, usually static random access memory (SRAM). As each new read or write command is issued, the system looks to the fast SRAM (cache) to see if the information exists. A comparison of the desired address and the addresses in the cache memory is made. If an address in the cache memory matches the address sought, then there is a hit (i.e., the information is available in the cache). The information is then accessed in the cache so that access to main memory is not required. Thereby, the command is processed much more rapidly. If the information is not available in the cache, the new data is copied from the main memory and stored in the cache for future use.
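The lookup sequence described above (compare the desired address against cached addresses; serve a hit from the cache, otherwise copy the line in from main memory) can be sketched as follows. This is a minimal illustrative model, not any particular processor's design; the `Cache` class and its fields are hypothetical names.

```python
# Minimal sketch of the cache lookup described above: each access compares
# the desired address against the tags held in the cache; a match is a hit,
# otherwise the data is fetched from main memory and stored for future use.
NUM_LINES = 4  # deliberately tiny cache capacity, for illustration only

class Cache:
    def __init__(self, main_memory):
        self.main_memory = main_memory  # backing store: address -> data
        self.lines = {}                 # cached lines: address -> data
        self.hits = 0
        self.misses = 0

    def read(self, address):
        if address in self.lines:       # address comparison: hit
            self.hits += 1
            return self.lines[address]
        self.misses += 1                # miss: copy from main memory
        if len(self.lines) >= NUM_LINES:
            # evict an arbitrary line to make room (placeholder policy)
            self.lines.pop(next(iter(self.lines)))
        self.lines[address] = self.main_memory[address]
        return self.lines[address]

memory = {0x10: 'A', 0x20: 'B'}
cache = Cache(memory)
cache.read(0x10)  # miss: fetched from main memory and cached
cache.read(0x10)  # hit: served directly from the cache
print(cache.hits, cache.misses)  # -> 1 1
```

A real cache would index by set and compare tags in parallel hardware rather than probing a dictionary, but the hit/miss decision is the same comparison.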
Because these caches are typically localized, these multiple memory elements in a multiprocessor computer system can (and usually do) contain multiple copies of a given data item. It is important that any processor or other agent accessing any copy of this data receives a valid data value. In other words, cache coherency in hardware must be maintained. One way to implement cache coherency involves having all caches "snoop" the memory bus traffic. Snooping refers to the act of monitoring data and address traffic for values of interest. If a processor writes memory for an address that is in the local cache, that cache will have been snooping the memory bus and will notice that it now has a stale copy of that data. That cache entry will then be invalidated. The next time that cache entry is accessed, instead of retrieving outdated data, it will incur a cache miss, and the new data will be forwarded from memory.
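The snooping behavior described above can be sketched as a small simulation: every cache watches write traffic on a shared bus, and when another agent writes an address the cache holds, it invalidates its now-stale copy. All class and method names here are illustrative assumptions.

```python
# Sketch of bus snooping: each cache monitors write traffic on the shared
# bus and invalidates its local copy when another agent writes that address.
class SnoopingCache:
    def __init__(self, name, bus):
        self.name = name
        self.lines = {}        # locally cached copies: address -> data
        bus.caches.append(self)

    def snoop_write(self, writer, address):
        # Another agent wrote this address: our copy is stale, so invalidate.
        if writer is not self and address in self.lines:
            del self.lines[address]

class Bus:
    def __init__(self):
        self.caches = []
        self.memory = {}       # main memory behind the bus

    def write(self, writer, address, data):
        self.memory[address] = data
        for cache in self.caches:      # all caches snoop the transaction
            cache.snoop_write(writer, address)

bus = Bus()
a = SnoopingCache('A', bus)
b = SnoopingCache('B', bus)
b.lines[0x40] = 'old'       # B holds a copy of address 0x40
bus.write(a, 0x40, 'new')   # A writes 0x40 on the bus
print(0x40 in b.lines)      # -> False: B's stale copy was invalidated
```

The next time B accesses 0x40 it misses and fetches the fresh value, exactly as the paragraph above describes.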
However, a problem can arise when multiple writeback processors perform write transactions to the same cache line. If all of the processors perform a write-through, two or more copies of the line containing different data can exist in their internal caches. Only the copy in main memory contains the valid data. Hence, the cache lines within the respective caches must be invalidated.
The other option available for writeback processors is using a write allocate policy to obtain the exclusive ownership of the cache line prior to updating the data. Thus a processor in Shared state will issue a Bus Write Invalidate Line operation to invalidate other caches and make a state transition to Exclusive state. This is followed by the actual data update and a state transition to Modified State.
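The ownership sequence described above (Shared state, Bus Write Invalidate Line, transition to Exclusive, data update, transition to Modified) can be sketched with a simplified MESI-style state machine. This is a hedged illustration of the protocol steps named in the text; the function and dictionary layout are assumptions for the example.

```python
# Sketch of the write-allocate ownership sequence: a writer in Shared state
# issues a Bus Write Invalidate Line, moves to Exclusive, performs the
# actual data update, and finishes in Modified state.
SHARED, EXCLUSIVE, MODIFIED, INVALID = 'S', 'E', 'M', 'I'

def write_allocate(writer, peers, address, data):
    if writer['state'][address] == SHARED:
        for peer in peers:                    # Bus Write Invalidate Line:
            peer['state'][address] = INVALID  # other caches are invalidated
        writer['state'][address] = EXCLUSIVE  # writer is now sole owner
    writer['data'][address] = data            # actual data update
    writer['state'][address] = MODIFIED       # final transition to Modified

p0 = {'state': {0x80: SHARED}, 'data': {}}
p1 = {'state': {0x80: SHARED}, 'data': {0x80: 'old'}}
write_allocate(p0, [p1], 0x80, 'new')
print(p0['state'][0x80], p1['state'][0x80])  # -> M I
```

In hardware these transitions are driven by snooped bus operations rather than direct calls, but the state changes follow the same order.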
However, this approach also creates an opportunity for a race condition between two processors that may simultaneously try to make a transition from Shared State to Exclusive State. Clearly, only one processor can be allowed to successfully complete the transition. One prior art method for resolving this race condition involved giving a negative acknowledgment (NACK) response to the second processor. However, the disadvantage with this approach is that it fails to address the issue of temporary live-lock scenarios. A live-lock scenario might occur when the same processor gets NACKed multiple times over in its attempts to gain ownership of the cache line. This may cause a temporary stall and a lack of forward progress. The possibility of the processor stalling increases as more processors are added and share the same bus. Clearly, from a performance standpoint, this is a highly undesirable situation.
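The live-lock risk described above can be illustrated with a small simulation: two processors race for the same line every cycle, a fixed-priority arbiter grants one and NACKs the other, and the loser is NACKed repeatedly, making no forward progress. The arbiter policy and all names here are assumptions chosen purely to exhibit the problem.

```python
# Illustration of the NACK live-lock scenario: under a fixed-priority
# arbiter, the same low-priority processor is NACKed on every retry as
# long as contention persists, stalling its forward progress.
def arbitrate(requesters):
    # Fixed-priority arbiter: the lowest-numbered requester always wins;
    # every other requester receives a NACK and must retry.
    winner = min(requesters)
    nacked = [r for r in requesters if r != winner]
    return winner, nacked

nack_counts = {0: 0, 1: 0}
for _ in range(10):                      # both processors race every cycle
    winner, nacked = arbitrate([0, 1])
    for loser in nacked:
        nack_counts[loser] += 1          # loser retries next cycle

print(nack_counts)  # -> {0: 0, 1: 10}: processor 1 is NACKed every time
```

With more processors sharing the bus, the lowest-priority requester can be starved even longer, which is the scaling concern the paragraph raises.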
Thus, there is a need in the prior art for a mechanism for resolving race conditions attributed to multiple processors writing to the same cache line. It would be preferable if such a mechanism could eliminate live-lock situations while providing a simple, uniform process to maintain cache coherency in a multi-cluster system environment. It would also be preferable if such a mechanism allowed for the use of a deeply pipelined bus in a single cluster containing multiple processors. Furthermore, it would be beneficial if such a mechanism allocated cache lines on a write-miss condition.