1. Field of the Invention
This invention relates to the field of translation tables and, more particularly, to skewed hashing functions employed within translation tables of multiprocessor computer systems.
2. Description of the Relevant Art
Multiprocessing computer systems include two or more processors which may be employed to perform computing tasks. A particular computing task may be performed upon one processor while other processors perform unrelated computing tasks. Alternatively, components of a particular computing task may be distributed among multiple processors to decrease the time required to perform the computing task as a whole. Generally speaking, a processor is a device configured to perform an operation upon one or more operands to produce a result. The operation is performed in response to an instruction executed by the processor.
A popular architecture in commercial multiprocessing computer systems is the symmetric multiprocessor (SMP) architecture. Typically, an SMP computer system comprises multiple processors connected through a cache hierarchy to a shared bus. Additionally connected to the bus is a memory, which is shared among the processors in the system. Access to any particular memory location within the memory occurs in a similar amount of time as access to any other particular memory location. Since each location in the memory may be accessed in a uniform manner, this structure is often referred to as a uniform memory architecture (UMA).
Processors are often configured with internal caches, and one or more caches are typically included in the cache hierarchy between the processors and the shared bus in an SMP computer system. Multiple copies of data residing at a particular main memory address may be stored in these caches. In order to maintain the shared memory model, in which a particular address stores exactly one data value at any given time, shared bus computer systems employ cache coherency. Generally speaking, an operation is coherent if the effects of the operation upon data stored at a particular memory address are reflected in each copy of the data within the cache hierarchy. For example, when data stored at a particular memory address is updated, the update may be supplied to the caches which are storing copies of the previous data. Alternatively, the copies of the previous data may be invalidated in the caches such that a subsequent access to the particular memory address causes the updated copy to be transferred from main memory. For shared bus systems, a snoop bus protocol is typically employed. Each coherent transaction performed upon the shared bus is examined (or "snooped") against data in the caches. If a copy of the affected data is found, the state of the cache line containing the data may be updated in response to the coherent transaction.
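By way of illustration only, the invalidation variant of the snoop protocol described above may be sketched as follows. The class names `SnoopingCache` and `SharedBus`, the write-through behavior, and the two-state (present or absent) cache line are assumptions made for this sketch and are not taken from any particular system.

```python
class SnoopingCache:
    """Hypothetical cache that holds copies of memory locations."""

    def __init__(self):
        self.lines = {}  # address -> cached value

    def read(self, address, memory):
        # On a miss, transfer the current copy from main memory.
        if address not in self.lines:
            self.lines[address] = memory[address]
        return self.lines[address]

    def snoop_write(self, address):
        # A coherent write was observed on the bus: invalidate any stale copy.
        self.lines.pop(address, None)


class SharedBus:
    """Hypothetical shared bus; every attached cache snoops every write."""

    def __init__(self, memory):
        self.memory = memory
        self.caches = []

    def attach(self, cache):
        self.caches.append(cache)

    def write(self, writer, address, value):
        # Write-through is assumed here so that main memory is always current.
        self.memory[address] = value
        for cache in self.caches:
            if cache is not writer:
                cache.snoop_write(address)
        writer.lines[address] = value


memory = {0x40: 1}
bus = SharedBus(memory)
cache_a, cache_b = SnoopingCache(), SnoopingCache()
bus.attach(cache_a)
bus.attach(cache_b)
```

In this sketch, a write issued through one cache removes the copy held by the other, so the next read by the other cache misses and fetches the updated value from main memory.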
Unfortunately, shared bus architectures suffer from several drawbacks which limit their usefulness in multiprocessing computer systems. A bus is capable of a peak bandwidth (e.g. a number of bytes/second which may be transferred across the bus). As additional processors are attached to the bus, the bandwidth required to supply the processors with data and instructions may exceed the peak bus bandwidth. Since some processors are forced to wait for available bus bandwidth, performance of the computer system suffers when the bandwidth requirements of the processors exceed the available bus bandwidth.
Additionally, adding more processors to a shared bus increases the capacitive loading on the bus and may even cause the physical length of the bus to be increased. The increased capacitive loading and extended bus length increase the delay in propagating a signal across the bus. Due to the increased propagation delay, transactions may take longer to perform. Therefore, the peak bandwidth of the bus may decrease as more processors are added.
These problems are further magnified by the continued increase in operating frequency and performance of processors. The increased performance enabled by the higher frequencies and more advanced processor microarchitectures results in higher bandwidth requirements than previous processor generations, even for the same number of processors. Therefore, buses which previously provided sufficient bandwidth for a multiprocessing computer system may be insufficient for a similar computer system employing the higher performance processors.
Another structure for multiprocessing computer systems is a distributed shared memory architecture. A distributed shared memory architecture includes multiple nodes within which processors and memory reside. The multiple nodes communicate via a network coupled therebetween. When considered as a whole, the memory included within the multiple nodes forms the shared memory for the computer system. Typically, directories are used to identify which nodes have cached copies of data corresponding to a particular address. Coherency activities may be generated via examination of the directories.
Distributed shared memory systems are scalable, overcoming the limitations of the shared bus architecture. Since many of the processor accesses are completed within a node, nodes typically have much lower bandwidth requirements upon the network than a shared bus architecture must provide upon its shared bus. The nodes may operate at high clock frequency and bandwidth, accessing the network when needed. Additional nodes may be added to the network without affecting the local bandwidth of the nodes. Instead, only the network bandwidth is affected.
Distributed shared memory systems may employ local and global address spaces. A portion of the global address space may be assigned to each node within the distributed shared memory system. In some distributed shared memory systems, data corresponding to the addresses of remote nodes may be copied to a requesting node's shared memory such that future accesses to that data may be performed via local transactions rather than global transactions. The copied data is referred to as shadow pages. In such systems, CPUs local to the node may use the local physical addresses assigned to the shadow pages. Address translation tables are provided to translate between the global address and the local physical address assigned to the shadow pages. In distributed shared memory systems with large address spaces, the translation tables used to translate between global addresses and local physical addresses can become very large. For example, in a distributed shared memory system with four nodes with 1M pages per node, a global address to local physical address translation table may include 4M entries. In some systems, the access time of such a large translation table may add unacceptable delay to a memory transaction.
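By way of illustration only, the table size in the example above and a global address to local physical address translation may be sketched as follows. The 8 KB page size (`PAGE_SHIFT = 13`), the use of a dictionary as the translation table, and the function name `translate` are assumptions made for this sketch.

```python
NODES = 4
PAGES_PER_NODE = 1 << 20              # 1M pages per node, as in the example above
TABLE_ENTRIES = NODES * PAGES_PER_NODE  # one entry per global page: 4M entries

PAGE_SHIFT = 13                       # assumed 8 KB pages (hypothetical)
OFFSET_MASK = (1 << PAGE_SHIFT) - 1


def translate(table, global_address):
    """Translate a global address to the local physical address of its
    shadow page, or return None if no shadow page has been assigned."""
    global_page = global_address >> PAGE_SHIFT
    offset = global_address & OFFSET_MASK
    local_page = table.get(global_page)
    if local_page is None:
        return None
    return (local_page << PAGE_SHIFT) | offset


# Hypothetical mapping: global page 0x12345 is shadowed by local page 0x7.
shadow_table = {0x12345: 0x7}
local_address = translate(shadow_table, (0x12345 << PAGE_SHIFT) | 0x10)
```

Only the page number is translated; the offset within the page is carried through unchanged.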
To reduce the latency and the implementation cost associated with a global address to local physical address translation, some distributed shared memory systems employ a cache for storing the most recently accessed translations. The cache reduces the propagation delay for translations stored in the cache. Cache misses, however, add significant latency and the cache adds significant complexity to the translation table.
To decrease the number of cache misses, the size of the cache may be increased or the cache may be made set associative. Associative caches trade access time for utilization. In other words, the higher the associativity of a cache, the longer the access time. For example, a fully associative cache may approach 100% utilization. However, the access time of a fully associative cache is relatively long because each entry in the cache may be queried for the desired data. Alternatively, a direct mapped cache has a relatively short access time (only one entry is accessed), but the utilization of a direct mapped cache may be relatively low. A look-up table with high utilization and short access times is desirable.
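By way of illustration only, the trade-off between utilization and access time may be sketched with two small tables of eight entries each. The modulo index function, the sequential probe model of a fully associative search, and the choice to reject (rather than evict on) a colliding key are assumptions made for this sketch.

```python
TABLE_SIZE = 8


def direct_mapped_insert(table, key):
    """Store key in the single entry selected by its index; reject on conflict."""
    index = key % len(table)
    if table[index] is None:
        table[index] = key
        return True
    return False


def fully_associative_insert(table, key):
    """Store key in any free entry."""
    for i, slot in enumerate(table):
        if slot is None:
            table[i] = key
            return True
    return False


def fully_associative_probes(table, key):
    """Count entries examined before key is found (sequential model)."""
    probes = 0
    for slot in table:
        probes += 1
        if slot == key:
            break
    return probes


keys = [0, 8, 16, 24]  # all four keys select index 0 in the direct mapped table
direct = [None] * TABLE_SIZE
full = [None] * TABLE_SIZE
direct_stored = sum(direct_mapped_insert(direct, k) for k in keys)
full_stored = sum(fully_associative_insert(full, k) for k in keys)
```

With these keys the direct mapped table holds only one of the four keys but finds it with a single access, whereas the fully associative table holds all four but may examine several entries per look-up.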
The problems outlined above are in large part solved by a skewed-associative table that implements an insertion algorithm to maximize the utilization of the table. In one embodiment, an input address is converted to two look-up addresses using one or more index functions. The look-up addresses address a primary entry associated with the input address and a secondary entry associated with the input address. Only the primary entry and the secondary entry need to be accessed during a table look-up. An insertion algorithm maximizes the utilization of the table by realigning the data stored in the table to make an entry available for new data. For example, if the primary entry and the secondary entry associated with an input address are occupied by other data, the insertion algorithm will move the data stored in either the primary entry or the secondary entry to an alternative entry for that data. By moving the data to an alternative entry, the entry is made available to store the new data. If the alternative entries for the data stored in the primary entry and the secondary entry are unavailable, the data stored in the alternative entries are first moved to an alternative entry for that data. The data in the primary entry or secondary entry is then stored to its alternative entry and the entry is made available to store the new data. Accordingly, the insertion algorithm increases the utilization of the table to approach the utilization of a fully associative table while the access time of the table is similar to that of a two-way set-associative table. It is noted that the present invention applies to caches as well as tables.
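By way of illustration only, the insertion algorithm summarized above may be sketched as follows. The organization of the table as two banks of eight entries, the particular index functions in `index`, the recursion limit `max_depth`, and all function names are assumptions made for this sketch; the sketch also assumes that a key being inserted is not already present.

```python
SIZE = 8  # entries per bank (hypothetical)


def index(key, way):
    """Hypothetical index functions: way 0 uses the low bits, way 1 skews them."""
    if way == 0:
        return key % SIZE
    return (key // SIZE + key) % SIZE


def make_table():
    return [[None] * SIZE, [None] * SIZE]


def lookup(table, key):
    """Access only the primary and the secondary entry of the key."""
    for way in (0, 1):
        entry = table[way][index(key, way)]
        if entry is not None and entry[0] == key:
            return entry[1]
    return None


def free_entry(table, way, slot, depth):
    """Vacate table[way][slot] by moving its occupant to the occupant's
    alternative entry, first vacating that entry if necessary."""
    occupant = table[way][slot]
    if occupant is None:
        return True
    if depth == 0:
        return False
    other = 1 - way
    alt = index(occupant[0], other)
    if not free_entry(table, other, alt, depth - 1):
        return False  # nothing has been moved along a failed chain
    table[other][alt] = occupant
    table[way][slot] = None
    return True


def insert(table, key, value, max_depth=2):
    # Use the primary entry, then the secondary entry, if either is available.
    for way in (0, 1):
        slot = index(key, way)
        if table[way][slot] is None:
            table[way][slot] = (key, value)
            return True
    # Otherwise realign existing data to make one of the two entries available.
    for way in (0, 1):
        slot = index(key, way)
        if free_entry(table, way, slot, max_depth):
            table[way][slot] = (key, value)
            return True
    return False


table = make_table()
results = [insert(table, k, "v%d" % k) for k in (1, 9, 59, 123)]
```

In this usage, the primary and secondary entries of key 123 are both occupied, and the alternative entry of the datum in the secondary entry is occupied as well, so two data are moved before key 123 is stored. A look-up still accesses at most two entries regardless of how many moves an insertion required.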
Broadly speaking, the present invention contemplates a look-up table configured to store and output data corresponding to input addresses. The look-up table includes a plurality of entries for storing the data and a look-up address circuit. The look-up address circuit is configured to receive the input address and includes a first index function circuit and a second index function circuit. The first index function circuit is configured to convert a first input address to a primary look-up address that corresponds to the first input address, wherein a primary entry of the plurality of entries is addressed by the primary look-up address. The second index function circuit is configured to convert the first input address to a secondary look-up address that corresponds to the first input address, wherein a secondary entry of the plurality of entries is addressed by the secondary look-up address. The look-up table is configured to store a first datum to the primary entry if the primary entry is available and to store the first datum to the secondary entry if the primary entry is unavailable. If the primary entry and the secondary entry are unavailable, the look-up table is configured to move a second datum stored in the primary entry (or secondary entry) to an alternate entry for the second datum and to store the first datum to the primary entry (or secondary entry).
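By way of illustration only, the look-up address circuit and the first two storage cases described above may be sketched as follows. The 4-bit index width, the XOR-based secondary index function, the two-bank organization, and the function names are assumptions made for this sketch; the case in which both entries are unavailable is left out.

```python
INDEX_BITS = 4                 # hypothetical: 16 entries per bank
MASK = (1 << INDEX_BITS) - 1


def primary_index(address):
    """First index function: the low-order address bits."""
    return address & MASK


def secondary_index(address):
    """Second index function: low-order bits XORed with the next bit field."""
    return (address ^ (address >> INDEX_BITS)) & MASK


def store(banks, address, datum):
    """Store to the primary entry if available, else to the secondary entry."""
    p, s = primary_index(address), secondary_index(address)
    if banks[0][p] is None:
        banks[0][p] = (address, datum)
        return "primary"
    if banks[1][s] is None:
        banks[1][s] = (address, datum)
        return "secondary"
    return None  # both unavailable; the move to an alternate entry is not shown


def load(banks, address):
    """Access exactly two entries: the primary and the secondary."""
    for bank, idx in ((0, primary_index(address)), (1, secondary_index(address))):
        entry = banks[bank][idx]
        if entry is not None and entry[0] == address:
            return entry[1]
    return None


banks = [[None] * (1 << INDEX_BITS), [None] * (1 << INDEX_BITS)]
first = store(banks, 0x12, "A")   # primary entry 2 is available
second = store(banks, 0x22, "B")  # primary entry 2 is taken, so secondary is used
```

The two input addresses in this usage share a primary look-up address but have different secondary look-up addresses, which is the property the second index function is intended to provide.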
The present invention further contemplates a method of storing and retrieving data in a look-up table, wherein the data corresponds to input addresses and each input address corresponds to a primary entry and a secondary entry of the look-up table, the method comprising: if a primary entry corresponding to a first input address is available, storing a first datum to the primary entry; if the primary entry is unavailable, storing the first datum to a secondary entry corresponding to the first input address; if the primary entry and the secondary entry are unavailable, moving a second datum stored in the primary entry (or secondary entry) to an alternate entry for the second datum and storing the first datum to the primary entry (or secondary entry).