Many different types of computing systems have attained widespread use around the world. These computing systems include personal computers, servers, mainframes and a wide variety of stand-alone and embedded computing devices. Sprawling client-server systems exist, with applications and information spread across many PC networks, mainframes and minicomputers. In a distributed system connected by networks, a user may access many application programs, databases, network systems, operating systems and mainframe applications. Computers provide individuals and businesses with a host of software applications including word processing, spreadsheet, accounting, e-mail, voice over Internet protocol telecommunications, and facsimile.
Users of digital processors such as computers continue to demand greater and greater performance from such systems for handling increasingly complex and difficult tasks. In addition, processor speed has increased much more quickly than the speed of main memory accesses. As a result, cache memories, or caches, are often used in many such systems to increase performance in a relatively cost-effective manner. Many modern computers also support “multi-tasking” or “multi-threading” in which two or more programs, or threads of programs, are run in alternation in the execution pipeline of the digital processor. Thus, multiple program actions can be processed concurrently using multi-threading.
At present, general-purpose computers, from servers to low-power embedded processors, include at least a first-level cache, L1, and often second and third levels of cache, L2 and L3. This cache memory system stores frequently accessed data and instructions close to the execution units of the processor to minimize the time required to transmit data to and from a higher-latency memory. The L1 cache is typically located within each processor so that it is close to that processor's execution units. L2 and L3 caches are typically external to the processor chip but physically close to it. Accessing the L1 cache is faster than accessing the more distant system memory. Ideally, as the time for execution of an instruction nears, instructions and data are moved to the L2 cache from a more distant, higher-latency memory.
Moving the instructions and data from a more distant memory generally involves retrieving a copy of a memory line from the more distant, higher latency memory and storing the copy of the memory line in a fill buffer for that L2 cache. The fill buffer temporarily stores the memory line until the memory line can be written into the cache.
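The role of the fill buffer described above can be illustrated with a minimal sketch. The class names `FillBuffer` and `Cache`, the function `fill_line`, and the 64-byte line size are hypothetical and chosen only for illustration; they do not describe any particular hardware.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

LINE_BYTES = 64  # assumed line size; the text does not specify one


@dataclass
class FillBuffer:
    """Temporary holding slot for one memory line in flight (hypothetical model)."""
    address: int
    data: Optional[bytes] = None  # None until the line arrives from memory

    def receive(self, line: bytes) -> None:
        self.data = line


@dataclass
class Cache:
    """Toy cache array mapping a line address to the bytes of that line."""
    lines: Dict[int, bytes] = field(default_factory=dict)

    def write_from(self, buf: FillBuffer) -> None:
        if buf.data is None:
            raise RuntimeError("fill buffer has not received the line yet")
        self.lines[buf.address] = buf.data


def fill_line(cache: Cache, memory: Dict[int, bytes], address: int) -> None:
    buf = FillBuffer(address)      # allocate a fill buffer for the request
    buf.receive(memory[address])   # copy arrives from the higher-latency memory
    cache.write_from(buf)          # the line is then written into the cache


memory = {0x1000: bytes(range(LINE_BYTES))}
cache = Cache()
fill_line(cache, memory, 0x1000)
```

In this sketch the buffer exists only between the arrival of the copy and the write into the cache, which mirrors the temporary role the fill buffer plays in the description.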
In a multiprocessor environment, the state and ownership of a line must be properly communicated to every processor to maintain cache coherency. When a line of data is read into a processor's cache and there is no intention of modifying the line, that line can be read in and stored in the cache in what is known as a ‘shared’ state. If the processor then wants to modify the data contained in that line while the line is still being filled from memory, the processor must first obtain ‘exclusive’ ownership of that line.
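The ownership rule can be expressed as a small state check. The following is a sketch under stated assumptions: only the ‘shared’ and ‘exclusive’ states come from the description, while the `INVALID` state, the class `CacheLine`, and its method names are hypothetical additions for illustration.

```python
from enum import Enum


class LineState(Enum):
    INVALID = "invalid"      # assumed extra state: line not present in the cache
    SHARED = "shared"        # read in with no intention of modifying
    EXCLUSIVE = "exclusive"  # ownership required before modifying


class CacheLine:
    def __init__(self, address: int) -> None:
        self.address = address
        self.state = LineState.INVALID

    def fill_shared(self) -> None:
        self.state = LineState.SHARED

    def obtain_exclusive(self) -> None:
        # Stands in for whatever bus transaction grants exclusive ownership.
        self.state = LineState.EXCLUSIVE

    def write(self) -> None:
        if self.state is not LineState.EXCLUSIVE:
            raise PermissionError("exclusive ownership required before modifying the line")


line = CacheLine(0x1000)
line.fill_shared()
```

Calling `line.write()` at this point raises `PermissionError`; the write succeeds only after `line.obtain_exclusive()` has been called.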
One solution allocates a second fill buffer for a new request for the memory line in an exclusive state, via, e.g., an address-only kill request. This solution also allows the previously allocated fill buffer to continue to receive the memory line in a shared state. Thus, two fill buffers contain the memory line, one in a shared state and one in an exclusive state. There are two significant drawbacks to allowing the same cache line to occupy two fill buffers. The first drawback is that the same cache line is now contained in two fill buffers of the same cache with conflicting states, and as a result the logic must include additional functionality to handle control and data hazards. The second drawback is that multiple resources are consumed to manage the same cache line, resources that could otherwise be utilized for other cache requests.
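Both drawbacks can be made concrete with a toy model of a fill-buffer pool. This is only a sketch: the representation of each occupied buffer as an `(address, state)` tuple, the function `find_conflicts`, and the example addresses are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def find_conflicts(fill_buffers: List[Tuple[int, str]]) -> Dict[int, Set[str]]:
    """Return the addresses that occupy buffers in more than one state."""
    states_by_address: Dict[int, Set[str]] = defaultdict(set)
    for address, state in fill_buffers:
        states_by_address[address].add(state)
    return {a: s for a, s in states_by_address.items() if len(s) > 1}


# Line 0x1000 occupies two buffers, one per state, as in the first solution.
pool = [(0x1000, "shared"), (0x1000, "exclusive"), (0x2000, "shared")]
conflicts = find_conflicts(pool)
buffers_used_by_line = sum(1 for address, _ in pool if address == 0x1000)
```

Here `conflicts` identifies line `0x1000` as held in conflicting states, which is the hazard the added control logic would have to resolve, and `buffers_used_by_line` shows that one line consumes two buffers.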
Another solution allows the original fill buffer to receive the memory line in shared state and writes the line into the cache array while stalling the address-only kill request. Once the shared line is written into the cache, then the address-only kill is allowed to occupy a fill buffer and proceed to the bus to obtain the memory line in exclusive state. Unfortunately, the stall induced by this solution is as long as the memory latency needed to fill the cache line.
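The cost of the stall can be sketched with simple arithmetic. The function `stall_cycles` and the cycle counts below are hypothetical; the sketch assumes the kill request must wait until the shared fill completes, as described above, and that the fill takes a fixed number of cycles.

```python
def stall_cycles(fill_latency: int, cycles_since_fill_issued: int) -> int:
    """Cycles the address-only kill request waits for the shared fill to finish."""
    return max(0, fill_latency - cycles_since_fill_issued)


FILL_LATENCY = 100  # assumed memory latency in cycles, for illustration only

# Worst case: the store arrives just as the shared fill is issued.
worst_case = stall_cycles(FILL_LATENCY, 0)

# A store arriving later in the fill waits only for the remainder.
later_store = stall_cycles(FILL_LATENCY, 60)
```

Under these assumptions the worst-case stall equals the full fill latency, which corresponds to the drawback noted for this solution.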
Therefore, there is a need for systems and arrangements to promote a cache line in a single fill buffer from shared state to exclusive state without causing a significant data hazard and without adding a latency that is as long as the latency needed to fill the cache line from a higher level of memory.