With the continuous advances in semiconductor technology and architecture design, computer central processing unit (CPU) clock rates have increased dramatically in recent years, in a trend that is expected to continue in the near future. Memory chips, on the other hand, have seen significant but less dramatic increases in speed. In order to take full advantage of a powerful CPU, a computer memory system must be able to sustain data transfer rates matched to the CPU processing capabilities. One solution to this problem is offered by memory interleaving. An interleaved memory includes a number of memory modules. Each memory module is capable of servicing a memory request independent of the other modules. Hence, more than one module can be processing requests at any given time. This parallel processing capability may be used to attain effective memory speeds which are greater than the speed of any single memory module.
To be effective, such an interleaved memory system must operate in an environment in which successive memory requests generated by a program running on the CPU are, on average, processed by different memory modules. If two successive requests must be processed by the same memory module, then the beginning of the processing of the second request will be delayed until the processing of the first request has been completed. If, however, the successive requests are processed by different memory modules, the processing of the second request may begin before the end of the processing of the first request.
In principle, a programmer with a knowledge of the specific computer on which his or her program is to run can allocate the storage for that program such that successive memory requests will be processed by separate memory modules. Unfortunately, such a scheme is impractical for at least two reasons. First, the labor needed to arrange for such an allocation is excessive. Second, the particular storage allocation would be advantageous for only those computers having exactly the same memory module structure. A different storage allocation would need to be generated for each computer on which the program was to be run.
Consequently, hardware schemes referred to as hashing schemes have been employed to accomplish the storage allocation on such systems. In hashed systems, the programmer treats the main memory as consisting of a single large block of contiguous memory addresses. The actual memory consists of M memory modules, where M is a positive integer (usually, a power of 2). Each main memory address is mapped to a module, and to an address within that module by special purpose hardware associated with the computer memory. In principle, one memory request can arrive on each clock cycle; hence, the hashing hardware must be capable of generating the required mappings at one per CPU cycle.
In addition, it is advantageous to have as small a latency time as possible in the hardware. In principle, the speed requirement may be met by employing a pipe-lined hash processor which receives one memory address to be converted each cycle and calculates one converted address each cycle. The calculated address will, in general, correspond to a main memory address received several cycles earlier. The delay in question is referred to as the latency of the processor. The CPU may compensate for this latency by sending addresses in advance of the time at which the result is needed. However, the CPU cannot always predict the required sequence of main memory addresses sufficiently in advance. When a mistake is made, a delay is introduced into the processing. The delay in question is typically of a magnitude equal to the latency time of the pipe-line. Hence, pipe-lines with minimum latency time are advantageous.
One prior art system for hashing utilizes a mapping in which the least significant bits of the main memory address determine which memory module is associated with that address. For example, consider a computer memory having 8 memory modules. If the least significant 3 bits of the main memory address are interpreted as the module address, successive memory addresses will be mapped to different memory modules. Those addresses ending in 000 will be mapped to the first memory module, those ending in 001 to the second memory module, and so on. Hence, if the program sequentially addressed each location in a block of main memory addresses, the successive requests would always be processed by different memory modules.
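The low-order-bit mapping described above may be illustrated by the following minimal sketch. The names NUM_MODULES, MODULE_BITS and map_address are hypothetical and are introduced only for illustration. The use of the remaining high-order bits as the address within the module is an assumption; the description above specifies only how the module is selected.

```python
# Illustrative sketch only; all names here are hypothetical.
NUM_MODULES = 8                              # power of 2, as in the example above
MODULE_BITS = NUM_MODULES.bit_length() - 1   # 3 bits select one of 8 modules


def map_address(addr):
    """Map a main memory address to (module, address within module)."""
    module = addr & (NUM_MODULES - 1)   # least significant 3 bits pick the module
    offset = addr >> MODULE_BITS        # assumption: remaining bits index the module
    return module, offset


# Sequential addresses fall on successive modules and wrap after the eighth.
sequential_modules = [map_address(a)[0] for a in range(16)]
```

In this sketch, module index 0 corresponds to the first memory module, index 1 to the second, and so on.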
In practice, the sequence of memory addresses accessed by a computer program is unpredictable at computer design time, and it varies widely among different computer programs. There will always be sequences of memory requests that must be processed by the same memory module. Hence, the best that any hashing scheme can accomplish is to guarantee that the storage is allocated such that the most common sequences of memory requests are not required to be processed by the same memory module. One of the most common sequences of memory address requests is an arithmetic progression of the form a, a+s, a+2s, ... . Such requests are typically generated by a program which is sequentially accessing the elements of an array of fixed size data objects. The base address, a, is the address of the beginning of the array, and the stride, s, is the size of the data objects.
The simple prior art hashing scheme described above will function adequately in this environment if the number of memory modules is not an integer multiple of a non-trivial divisor of the stride s, that is, if the stride and the number of memory modules have no common factor greater than one. For example, if the number of memory modules is equal to s, successive requests must be processed by the same memory module, and the hashing scheme fails. Since the stride is different for different programs, there will always be pathological cases in which the stride is equal to the number of memory modules and the hashing scheme will fail. In this regard, it should also be noted that memory speed degradation may also occur if the stride is an integer fraction of the number of memory modules. Consider a memory with 8 modules in which the processing time required by each module is 8 clock cycles and the CPU sends requests each clock cycle. Assume that the stride is equal to 4, and the first request is made to the first memory module. The second request will be made to the fifth memory module and the third request will be made to the first memory module. However, the first memory module will still be processing the first request. Hence, the third request will be delayed by 6 clock cycles. In addition, the delay will accumulate with each additional request.
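The delay in the foregoing example may be reproduced with the following minimal sketch. The function simulate_delays and its parameters are hypothetical. The sketch assumes that requests are issued strictly in order, that at most one request is issued per clock cycle, and that a request which finds its module busy holds up all later requests; these assumptions are not stated above and are made only so that the accumulation of delay can be shown.

```python
# Illustrative sketch only; names and the in-order issue model are assumptions.
def simulate_delays(stride, num_requests, num_modules=8, busy_cycles=8, base=0):
    """Return, for each request, the number of cycles it is delayed.

    Request i would ideally be issued on clock cycle i. The module is chosen
    by the low-order bits of the address (address modulo num_modules).
    """
    free_at = [0] * num_modules   # cycle at which each module becomes idle
    delays = []
    earliest = 0                  # earliest cycle the next request may issue
    for i in range(num_requests):
        module = (base + i * stride) % num_modules
        start = max(earliest, free_at[module])   # wait if the module is busy
        delays.append(start - i)
        free_at[module] = start + busy_cycles
        earliest = start + 1                     # one request per cycle, in order
    return delays


stride_4 = simulate_delays(stride=4, num_requests=6)   # [0, 0, 6, 6, 12, 12]
stride_1 = simulate_delays(stride=1, num_requests=16)  # no delays at all
```

With a stride of 4, the third request is delayed by 6 clock cycles and the delay grows by a further 6 cycles for every two additional requests. With a stride of 1, each module has completed its previous request by the time it is addressed again, and no delay occurs.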
Since data objects tend to be allocated in sizes which are powers of 2 and the number of memory modules in a memory is also often a power of 2, the stride and the number of memory modules frequently share a common factor. Hence, pathological cases are sufficiently common to result in performance degradation in the above described prior art hashing schemes.
For the purposes of this discussion, a hashing scheme will be said to have "no pathologies" if, on the average, the distribution of memory accesses to the different modules is very close to uniform for any stride less than some predetermined maximum. If this condition is satisfied, each memory module will be accessed roughly the same number of times in a long sequence of accesses.
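One possible way of checking this condition for a given mapping is sketched below for the low-order-bit scheme discussed above. The functions module_histogram and pathological_strides are hypothetical, and the tolerance of 10 percent is an arbitrary assumption, since the definition above does not quantify how close to uniform the distribution must be.

```python
# Illustrative sketch only; names and the tolerance value are assumptions.
from collections import Counter


def module_histogram(stride, num_accesses, num_modules=8, base=0):
    """Count how many accesses of a strided sequence land on each module."""
    counts = Counter((base + i * stride) % num_modules
                     for i in range(num_accesses))
    return [counts.get(m, 0) for m in range(num_modules)]


def pathological_strides(max_stride, num_accesses=1024, num_modules=8,
                         tolerance=0.1):
    """List strides below max_stride whose access distribution is not uniform."""
    expected = num_accesses / num_modules
    bad = []
    for stride in range(1, max_stride):
        hist = module_histogram(stride, num_accesses, num_modules)
        if max(abs(count - expected) for count in hist) > tolerance * expected:
            bad.append(stride)
    return bad


bad_strides = pathological_strides(max_stride=9)   # [2, 4, 6, 8]
```

Under these assumptions, every even stride is flagged for an 8-module memory using the low-order-bit mapping, while every odd stride distributes the accesses evenly over all 8 modules.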
Broadly, it is an object of the present invention to provide an improved hashing scheme.
It is a further object of the present invention to provide a hashing scheme that lacks pathological cases for all stride values less than some predetermined stride value.
It is yet another object of the present invention to provide a hashing scheme which is implementable in a pipe-lined architecture with a small latency time.
These and other objects of the present invention will become apparent to those skilled in the art from the following detailed description of the present invention and the accompanying drawings.