Processors nowadays are more powerful and faster than ever, so much so that even memory access time, typically in the tens of nanoseconds, is seen as an impediment to a processor running at its full speed. The CPU time of a processor is typically the sum of the clock cycles used for executing instructions and the clock cycles used for memory access. While modern processors have improved greatly in instruction execution time, the access times of reasonably priced memory devices have not similarly improved. Moreover, in a modern computer that requires ever greater I/O bandwidth, these memory latencies severely limit system performance.
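The CPU time relationship described above can be illustrated with a short calculation. The following is a minimal sketch only; the function name and all workload figures (cycle time, cycle counts) are hypothetical values chosen for illustration and do not come from any particular processor.

```python
def cpu_time_ns(exec_cycles, mem_cycles, cycle_time_ns):
    """CPU time as the sum of execution cycles and memory-access cycles."""
    return (exec_cycles + mem_cycles) * cycle_time_ns


# Hypothetical figures: a 0.5 ns clock cycle and a 50 ns memory access,
# so each uncached memory access costs 100 clock cycles.
CYCLE_TIME_NS = 0.5
CYCLES_PER_ACCESS = 100

exec_cycles = 1_000_000                 # cycles spent executing instructions
mem_cycles = 20_000 * CYCLES_PER_ACCESS  # 20,000 uncached memory accesses

total_ns = cpu_time_ns(exec_cycles, mem_cycles, CYCLE_TIME_NS)
memory_share = mem_cycles / (exec_cycles + mem_cycles)
```

With these assumed figures, memory access accounts for two thirds of the total cycles, which is the kind of imbalance that motivates hiding memory latency.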
A common method of hiding memory access latency is memory caching. Caching takes advantage of the inverse relationship between the capacity and the speed of a memory device. That is, a memory of larger storage capacity is generally slower than a smaller memory. Slower memories are also less costly, and are thus more suitable for use as a portion of mass storage than are the more expensive, smaller and faster memories.
In a caching system, memory is arranged in a hierarchical order of different speeds, sizes and costs. For example, a smaller and faster memory, usually referred to as a cache memory, is placed between a processor and a larger, slower main memory. The cache memory holds a small subset of the data stored in the main memory, because the processor needs only a small amount of the data from the main memory to execute the individual instructions of a particular application. The subset is chosen based on immediate relevance, i.e., the data is likely to be used in the near future according to the well known “locality” theories of temporal and spatial locality. This is much like borrowing only a few books at a time from a large library collection to carry out a large research project. Just as research may be as effective, and even more efficient, if only a few books are borrowed at a time, processing of an application program is efficient if a small portion of the data is selected and stored in the cache memory at any one time.
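The effect of spatial locality can be sketched with a toy model of a cache. The following is an illustrative direct-mapped cache only; the class name, the line size, and the number of lines are assumptions made for the example, and real caches differ in organization and replacement policy.

```python
class DirectMappedCache:
    """Toy direct-mapped cache that tracks only tags, hits and misses."""

    def __init__(self, num_lines, line_size):
        self.num_lines = num_lines
        self.line_size = line_size
        self.tags = [None] * num_lines  # None marks an empty line
        self.hits = 0
        self.misses = 0

    def access(self, address):
        """Look up one byte address; return True on a hit."""
        block = address // self.line_size   # which memory block
        index = block % self.num_lines      # which cache line it maps to
        tag = block // self.num_lines       # identifies the block in that line
        if self.tags[index] == tag:
            self.hits += 1
            return True
        self.tags[index] = tag              # fetch the whole line on a miss
        self.misses += 1
        return False


# Hypothetical configuration: 4 lines of 16 bytes each.
cache = DirectMappedCache(num_lines=4, line_size=16)
for address in range(64):   # sequential walk over 64 consecutive bytes
    cache.access(address)
```

In this sketch the sequential walk misses only on the first byte of each 16-byte line (4 misses) and hits on the remaining 60 accesses, because neighboring addresses share a cache line.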
A cache generally includes status bits with each line of data (hereinafter referred to as a “cache line”), most commonly a valid bit that indicates whether the cache line is currently in use or is empty, and a dirty bit that indicates whether the data has been modified. An input/output (I/O) cache memory may store more status information for each cache line than a processor cache does, e.g., it may keep track of the identity of the I/O device requesting access to and/or having ownership of a cache line. In an I/O cache memory, these status bits are changed by transactions such as DMA writes to the cache line, snoops, new fetches being issued using the cache line, fetches returning from memory with data and/or ownership, or the like.
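The status information described above can be pictured as a small record attached to each cache line. The following is a minimal sketch under stated assumptions: the `CacheLine` class, its field names, its method names, and the device identifier `"dev0"` are all hypothetical, and the state changes shown are a simplification rather than a description of any particular I/O cache.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CacheLine:
    valid: bool = False           # line holds usable data
    dirty: bool = False           # data was modified since it was fetched
    owner: Optional[str] = None   # hypothetical: I/O device owning the line
    data: bytes = b""

    def fetch_return(self, data, device):
        """A fetch returns from memory with data and ownership."""
        self.valid, self.dirty = True, False
        self.owner, self.data = device, data

    def dma_write(self, data, device):
        """A DMA write modifies the line, so the dirty bit is set."""
        if not self.valid or self.owner != device:
            raise ValueError("device does not own a valid line")
        self.data, self.dirty = data, True

    def snoop(self):
        """A snoop takes the line away; return its data and dirty state."""
        result = (self.data, self.dirty)
        self.valid, self.dirty, self.owner, self.data = False, False, None, b""
        return result


line = CacheLine()
line.fetch_return(b"old", "dev0")
line.dma_write(b"new", "dev0")
snooped_data, was_dirty = line.snoop()
```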
When more than one event, e.g., multiple requests, targets the same cache line, the correct order in which the events are allowed to occur must be ensured to prevent an erroneous result. For example, if a cache line is being modified by a write operation from one cache user and, at the same time, is being snooped out by another cache user, the data must be written fully before the snoop can be performed.
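The write-before-snoop requirement can be illustrated with a toy model in which a write takes more than one step. This is a sketch only; the generator `write_in_beats`, the 4-byte beat size, and the data values are hypothetical and merely model a write that is not atomic.

```python
def write_in_beats(line, new_data, beat_size):
    """Write new_data into line one beat at a time, pausing after each beat."""
    for start in range(0, len(new_data), beat_size):
        line[start:start + beat_size] = new_data[start:start + beat_size]
        yield


line = bytearray(b"AAAAAAAA")
writer = write_in_beats(line, b"BBBBBBBB", beat_size=4)

next(writer)                # only the first beat has been written
torn_snoop = bytes(line)    # a snoop now would see half old, half new data

for _ in writer:            # let the write run to completion
    pass
ordered_snoop = bytes(line)  # a snoop after the write sees consistent data
```

In this sketch the premature snoop observes `b"BBBBAAAA"`, a value that was never intended to exist, while the snoop performed after the write completes observes `b"BBBBBBBB"`.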
Prior attempts to ensure the above correct order of events include an arbitration of accesses to the cache memory, in which only one of the events is allowed to access the cache memory at a time, regardless of whether the events are attempting to access the same cache line.
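The cost of such cache-wide arbitration can be sketched by counting grant cycles. The following is a simplified model with assumed names and an assumed cost of one cycle per granted request; the second function is only a hypothetical point of comparison showing how few cycles would be needed if requests had to wait solely on other requests to the same cache line.

```python
from collections import Counter


def cycles_with_global_arbitration(requests):
    """One request is granted per cycle, whichever line it targets."""
    return len(requests)


def cycles_if_only_same_line_waits(requests):
    """Hypothetical bound: only requests to the same line are serialized."""
    if not requests:
        return 0
    per_line = Counter(line for _, line in requests)
    return max(per_line.values())


# Hypothetical request mix: (event, cache line index).
requests = [("dma_write", 3), ("snoop", 7), ("fetch", 3), ("snoop", 9)]

global_cycles = cycles_with_global_arbitration(requests)
same_line_cycles = cycles_if_only_same_line_waits(requests)
```

Here only the two requests to line 3 actually conflict, yet the cache-wide arbiter spends four cycles where two would suffice in this model.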
Another attempt to ensure the above correct order of events is to design the system with timing requirements that prevent overlap of critical events that may interfere with each other if allowed access to the cache at the same time. In these systems, delays may be added to some events, e.g., a snoop operation, so that they do not occur before another event, e.g., a write operation.
Unfortunately, these prior attempted solutions are inefficient and severely limit performance, e.g., of a multi-ported cache memory (with multiple TAG lookup ports and/or multiple data ports), because they allow only one transaction to occur at a time, i.e., they serialize the transactions.
Moreover, the non-overlapping timing system requires considerable complexity and time in designing and testing and, because all possible events must be accounted for and evaluated, is prone to unexpected failures, i.e., bugs. Typically, a unique timing solution, e.g., an amount of delay and the like, is required for each possible overlapping pair of events. Thus, there can be no uniform approach in dealing with the various combinations of events, and it is very difficult to develop design rules that can be applied without having an adverse effect on at least some aspect of the system.
Thus, there is a need for a more efficient method and device for per-cache-line arbitration between multiple cache access requests that permits multiple concurrent accesses to cache lines.
There is a further need for a more efficient and faster method and device for arbitration between multiple cache access requests that provides a uniform approach in dealing with the various combinations of cache access events.