The present invention is directed to computer memory architectures and, more specifically, to a prefetching technique for reducing delays associated with accessing a cache memory of a network switch.
Computer architecture generally defines the functional operation, including the flow of information and control, among individual hardware units of a computer. One such hardware unit is the processor or processing engine which contains arithmetic and logic processing circuits organized as a set of data paths. In some implementations, the data path circuits may be configured as a processor having internal registers for use with operations that are defined by a set of instructions. The instructions are typically stored in an instruction memory and specify a set of hardware functions that are available on the processor. When implementing these functions, the CPU generally processes “transient” data residing in a data memory in accordance with the instructions.
A high-performance processing engine configured for use in, e.g., a network switch, may be realized by using a number of identical processors to perform certain tasks in parallel. For a purely parallel multiprocessor architecture, each processor may have shared or private access to non-transient data (such as “table” data contained in forwarding and routing tables, statistics, access filters, encryption keys and/or queuing information) stored in an external memory coupled to the processor. Access to the external memory is generally inefficient because the execution capability of each processor is substantially faster than its external interface capability; as a result, the processor often idles while waiting for the accessed data. Moreover, scheduling of external accesses to a shared memory is cumbersome because the processors may be processing different portions of the data.
A cache memory may be used in these situations to improve the memory efficiency of each processor of the processing engine. The cache memory access speed is closer to the operational speed of the processor and thus, use of the cache memory increases the speed of data processing by providing information to the processor at a rapid rate. The cache memory holds a number of blocks (i.e., cache lines) of information, with each cache line containing information from one or more contiguous external memory locations. Each cache line is identified by a cache address that includes memory address bits that identify corresponding memory locations. These bits are collectively called an index. In addition to information from the memory, each cache line is also associated with a remainder of the memory address bits identifying the specific location in external memory from which the information in the cache line was obtained. These latter bits are collectively called a tag. When the processor requires information, the index is used to select a tag in the cache to determine whether a cache line contains the requested information.
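The index and tag lookup described above can be illustrated with a minimal sketch. It assumes a direct-mapped organization, a 16-byte cache line and 8 lines; these sizes, and the names split_address and DirectMappedCache, are hypothetical and chosen only for illustration.

```python
LINE_BYTES = 16  # assumed cache line size in bytes (power of two)
NUM_LINES = 8    # assumed number of cache lines (power of two)

OFFSET_BITS = LINE_BYTES.bit_length() - 1  # bits selecting a byte within a line
INDEX_BITS = NUM_LINES.bit_length() - 1    # bits selecting a cache line


def split_address(addr):
    """Split a memory address into (tag, index, offset)."""
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (NUM_LINES - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


class DirectMappedCache:
    """Tracks only the tag stored at each index (no data payload)."""

    def __init__(self):
        self.tags = [None] * NUM_LINES

    def lookup(self, addr):
        """Return True on a hit: the tag selected by the index matches."""
        tag, index, _ = split_address(addr)
        return self.tags[index] == tag

    def fill(self, addr):
        """Record that the line containing addr is now cached."""
        tag, index, _ = split_address(addr)
        self.tags[index] = tag
```

In this sketch, two addresses that share an index but differ in tag compete for the same line, so filling one causes a lookup of the other to miss.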
When the address in the cache does not match the address in the external memory specified by the processor (i.e., a cache miss), a read request is issued to acquire a new cache line of information from the memory. The acquired cache line is provided to the processor which may update the data and then overwrite the cache line location in the cache. If the cache is a store-through cache, the updated data is immediately copied onto external memory to ensure that the memory always has a valid copy of all data. An advantage of the store-through cache is that overwrite operations can occur immediately in the cache without losing data. However, the processor generally writes data to the cache much faster than the data is forwarded onto external memory. Such a speed mismatch may result in outstanding write requests to the external memory which may, in turn, adversely impact the performance of the processor if it stalls waiting for the requests to complete.
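The speed mismatch described above can be modeled with a short sketch. It assumes a bounded queue of outstanding writes and counts a stall whenever the processor writes while that queue is full; the class name StoreThroughCache, the max_pending limit and the drain_one method are hypothetical and not taken from any particular implementation.

```python
from collections import deque


class StoreThroughCache:
    """Toy store-through cache with a bounded queue of outstanding writes."""

    def __init__(self, max_pending=2):
        self.lines = {}          # cached copies, updated immediately
        self.memory = {}         # stand-in for external memory
        self.pending = deque()   # writes not yet completed in memory
        self.max_pending = max_pending
        self.stalls = 0

    def write(self, addr, value):
        """Update the cache at once and queue the copy to external memory."""
        self.lines[addr] = value
        if len(self.pending) == self.max_pending:
            # Queue full: the processor waits for the oldest write to finish.
            self.stalls += 1
            self.drain_one()
        self.pending.append((addr, value))

    def drain_one(self):
        """Complete the oldest outstanding write, if any."""
        if self.pending:
            addr, value = self.pending.popleft()
            self.memory[addr] = value
```

Issuing writes faster than drain_one is called makes the stall counter grow, which mirrors the performance impact the paragraph describes.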
The cache memory generally operates in accordance with the property of “locality of reference”, whereby references to memory locations at any given time tend to be confined within a localized area in memory. Based on this property, attempts to control the internal activities (e.g., the aging policy) of a cache relied heavily on previous accesses to the cache and on conventional algorithms, such as a least recently used (LRU) algorithm. However, random accesses to non-transient table data used by a processor of a network switch can adversely affect the locality of references and, thus, control of the cache memory's internal activities.
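To make the effect of access pattern on a conventional LRU policy concrete, the following sketch compares a localized pattern with a scattered one. The capacity of four lines, the two patterns and the names LRUCache and hit_rate are assumptions made for illustration only.

```python
from collections import OrderedDict


class LRUCache:
    """Minimal LRU cache that records hits and misses by address."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()  # oldest entry first
        self.hits = 0
        self.misses = 0

    def access(self, addr):
        if addr in self.lines:
            self.lines.move_to_end(addr)  # mark as most recently used
            self.hits += 1
        else:
            self.misses += 1
            if len(self.lines) >= self.capacity:
                self.lines.popitem(last=False)  # evict least recently used
            self.lines[addr] = True


def hit_rate(pattern, capacity=4):
    """Fraction of accesses in pattern that hit an initially empty cache."""
    cache = LRUCache(capacity)
    for addr in pattern:
        cache.access(addr)
    return cache.hits / len(pattern)
```

With these assumed values, a pattern that loops over four addresses five times hits on 16 of its 20 accesses, whereas 20 accesses to 20 distinct addresses never hit.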
Therefore, an object of the present invention is to provide a caching technique that improves the overall memory efficiency of a processing engine of an intermediate network station.
Another object of the present invention is to provide a technique that provides efficient control of a cache memory associated with a processor of a programmable arrayed processing engine.
The present invention comprises an enhanced prefetching technique that enables control of internal activities of a cache memory by a processor without relying on conventional algorithms. The cache memory is preferably a level zero (L0) cache contained within a processor complex of a programmable arrayed processing engine used to efficiently process data within an intermediate network station of a computer network. Notably, the L0 cache may assume various functions while providing an interface to, inter alia, an external memory of the station via a memory controller. That is, the L0 cache may function as a read buffer, a write buffer and/or a buffer for pending atomic commands, each of which is merged into a single memory bank that can be partitioned in any manner to enable efficient utilization.
According to an aspect of the invention, the technique provides an attribute-based enhancement to prefetching that comprises, inter alia, various aging policies. For example, a specific aging policy may be associated with each cache line entry of the L0 cache to indicate when and under what circumstances the cache line should be invalidated. Each entry of the cache has its own attributes that are software-controlled using a novel prefetch attribute register to specify the nature of a memory access involving each cache line entry. The prefetch attribute register includes a lock group field, a no write back field, an exclusive field, an aging policy field and a length of data prefetched field. The attribute contents of these fields may be dynamically altered on a per-phase basis and/or depending upon use of the data at a particular processor complex of the engine.
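One way software might compose such a register value is sketched below. The five field names follow the text, but the field widths, their bit positions and the helper names pack and unpack are hypothetical; this summary does not specify a register layout.

```python
# (name, width in bits), listed from the least significant bit upward.
# Widths and ordering are assumptions for illustration only.
FIELDS = [
    ("length", 4),         # length of data prefetched
    ("aging_policy", 2),   # selects one of several aging policies
    ("exclusive", 1),
    ("no_write_back", 1),
    ("lock_group", 3),
]


def pack(**values):
    """Pack named field values into a single register word."""
    word, shift = 0, 0
    for name, width in FIELDS:
        value = values.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word |= value << shift
        shift += width
    return word


def unpack(word):
    """Recover the named field values from a register word."""
    out, shift = {}, 0
    for name, width in FIELDS:
        out[name] = (word >> shift) & ((1 << width) - 1)
        shift += width
    return out
```

Because the fields are independent, software can rewrite one attribute (for example, the aging policy) without disturbing the others by unpacking, changing a single entry and packing again.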
Another aspect of the invention involves a deferred store-through implementation of the L0 cache and, in particular, the pace of data written back through the memory controller to external memory. In this latter case, the L0 cache may function as a write buffer for writing data back to the external memory on a deferred basis and at a predefined rate. The prefetch attribute register may be extended to include a write back attribute specifying a write back priority/rate of data from the cache to external memory. The predefined rate may be determined by the usable arbitration cycles available to each cache when arbitrating for access to the common external memory.
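The deferred write back can be sketched as a write buffer that drains only when its cache is granted an arbitration cycle. The round-robin grant order, one write per grant, and the names DeferredWriteBuffer, on_grant and run_arbitration are assumptions made for illustration; the text states only that the rate may depend on the usable arbitration cycles.

```python
from collections import deque


class DeferredWriteBuffer:
    """Holds writes until the cache is granted access to external memory."""

    def __init__(self):
        self.pending = deque()

    def write(self, addr, value):
        """Accept a write immediately; the copy to memory is deferred."""
        self.pending.append((addr, value))

    def on_grant(self, memory):
        """Use one granted cycle to write back the oldest pending entry."""
        if self.pending:
            addr, value = self.pending.popleft()
            memory[addr] = value
            return True
        return False  # the granted cycle went unused


def run_arbitration(buffers, memory, cycles):
    """Grant cycles to the buffers in round-robin order (assumed policy)."""
    for cycle in range(cycles):
        buffers[cycle % len(buffers)].on_grant(memory)
```

In this model, each buffer's write back rate is bounded by its share of grants: with two buffers, each can retire at most one write every two cycles, regardless of how quickly its processor produces data.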