1. Field of the Invention
The present invention generally relates to a method and apparatus for addressing multidimensional numerical arrays using fixed overhead resources and more specifically to a method and apparatus for accessing very large, general banded numerical arrays with reduced data storage and formatting requirements. Moreover, the invention is for use as a data cache address generator that decreases the processing time of general banded numerical arrays.
2. Description of the Related Art
The workstation and its microprocessor data cache memory are used in a variety of scientific and engineering applications involving very large general banded numerical arrays. For purposes of this application, "very large general banded numerical arrays" are defined as linear systems of equations with hundreds, or more, unknowns where non-zero coefficients are clustered around the diagonal. Current storage methods for these arrays require relatively large amounts of memory capacity.
A microprocessor typically includes one or more fixed- and floating-point data processors and one or more instruction and data caches, all of which are implemented using one or more integrated circuits (ICs), as shown in FIG. 10.
FIG. 10 shows the five major subassemblies of a microprocessor according to the prior art: the central processing unit 10 (CPU), the data cache 11, the instruction cache 12, the system bus 13, and the memory 14. Addresses are transferred along the CPU address bus 19 and data is transferred along the CPU data bus 20. The CPU 10 has one or more fixed point arithmetic logic units (ALUs, not shown) and one or more floating-point arithmetic units (not shown). The instruction address and data address are maintained by the fixed-point processor.
In high performance computers, caches serve to reduce the observed latency to memory. The cache provides a relatively small but high performance memory close to the processor. The data cache is a local microprocessor memory that retains relatively recent operands (e.g., a predetermined number of the most recent operands) and improves microprocessor performance by reducing the need to access (i.e., read or write) the more remote system memory (which typically consumes many microprocessor cycles). Instead of accessing the remote system memory, access can simply be made to the data cache (e.g., typically within one or at most several machine cycles), thereby reducing processing time, as shown in FIGS. 11(a) and 11(c).
The data cache 11 provides copies of data within the memory 14 to the CPU 10. The data cache 11 includes the data cache directory and control 15 (DIR and CNTRL). Both the data cache 11 and the data cache directory and control 15 are random access memories (RAMs). The data cache 11 holds the data. The directory and control 15 holds the address of the data currently in the data cache 11. Variables currently used by the CPU 10 are located in the data cache 11. The contents of the data cache 11 can be stored back to system memory 14 after use.
The instruction cache 12 provides copies of instructions from memory 14 to the CPU 10. The instruction cache 12 includes the instruction cache directory and control 16. The contents of the instruction cache 12 usually are not stored back to system memory 14 since the instructions are not modified by the CPU 10.
The system bus 13 is the electrical connection to the memory 14. The system bus is divided into a data bus 13a, an address bus 13b and a control bus 13c, all having multiple signal lines. Unidirectional buffers 17 and bi-directional buffers 18 prevent interference between the data cache 11 and the instruction cache 12 when either is accessing memory 14. The buffers 17, 18 are controlled by the cache directory and control units 15, 16.
FIG. 11(a) shows a "read-hit" cache operation, which occurs when the CPU 10 executes a load register instruction:
$$R \leftarrow \mathrm{MEM}[(\mathrm{ADDR})] \qquad \text{(EQ 1)}$$
where the left-hand side $R$ is a CPU register, and $\mathrm{MEM}[(\mathrm{ADDR})]$ is the data from memory as pointed to by the CPU supplied address (ADDR).
FIG. 12 shows the format of a typical address request that may be issued by the CPU 10. The address request appears on the CPU address bus 19 and includes "set bits" and "tag bits" as shown in FIG. 12. The directory and control unit 15 monitors the set bits that appear in the address on the address bus 19 to determine whether they match the set bits held in the directory and control unit 15. The tag bits contain address information of the associated data that is stored in the cache. If the directory and control unit 15 matches the set bits on the address bus to those in the directory, then the data associated with these set bits is stored in the cache and the tag bits provide a cache address of that data. If the set bits on the address bus do not match those in the directory and control unit 15, then the related data is not in the cache.
In the case of a read hit, the set bits in a load request from the CPU 10 match set bits within the directory and control unit 15. After such a match, the directory and control 15 disables the address buffers for the system bus, and enables the data cache 11 to load a register within the CPU 10. The register load operation for a read hit is
$$R \leftarrow \mathrm{MEM}[(\mathrm{ADDR})]_{\mathrm{CACHE}} \qquad \text{(EQ 2)}$$
where $R$ is a CPU register and $\mathrm{MEM}[(\mathrm{ADDR})]_{\mathrm{CACHE}}$ is the data available in the cache at the CPU supplied address (ADDR).
FIG. 11(b) shows a "read miss", where set bit entries in the directory and control 15 do not match the address of the load request shown in Equation 1. In this case, the system bus address buffers are enabled, and the memory provides the data for the CPU register.
Normally, several data elements are read from memory 14 or the data cache 11, starting from the address provided by the CPU 10. This set of data elements is referred to as a "line" of data. The directory and control 15 configures the bi-directional data buffers to receive the incoming data, and signals the CPU 10 to load the register with the first incoming data element; that element, along with the remainder of the line, is loaded into the data cache 11. The directory and control 15 also records the set bits and tag bits for the first data element in the line. Thus, the register load operation for a read miss is
$$R \leftarrow \mathrm{MEM}[(\mathrm{ADDR})]_{\mathrm{MEMORY}} \qquad \text{(EQ 3)}$$
FIG. 11(c) shows a "write hit", where the set bits in a store request from the CPU 10 match set bits within the directory and control unit 15. After such a match, the directory and control 15 enables the address buffers for the system bus, and enables the data cache and memory to store the contents of the register of the CPU 10. This process is referred to as a "write-through" operation, where both the data cache memory and main memory are updated. This operation can be denoted as
$$\mathrm{MEM}[(\mathrm{ADDR})]_{\mathrm{MEMORY}} \leftarrow R$$
and
$$\mathrm{MEM}[(\mathrm{ADDR})]_{\mathrm{CACHE}} \leftarrow R \qquad \text{(EQ 3.1)}$$
FIG. 11(d) shows a "write miss". Since the directory and control 15 does not have set bits correlating to the register store operation, the data is written only to the system memory 14. This is denoted as
$$\mathrm{MEM}[(\mathrm{ADDR})]_{\mathrm{MEMORY}} \leftarrow R \qquad \text{(EQ 3.2)}$$
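The four cases of FIGS. 11(a) through 11(d) can be sketched as a toy software model. This is only an illustration under stated assumptions: the class name `ToyWriteThroughCache`, the `LINE_SIZE` value, and the use of the cache dictionary's keys in place of the set-bit and tag-bit directory of FIG. 12 are all hypothetical and do not come from the figures. Eviction and buffer control are not modeled.

```python
LINE_SIZE = 4  # hypothetical number of data elements per line


class ToyWriteThroughCache:
    """Toy model of read hit, read miss, write hit, and write miss."""

    def __init__(self, memory):
        self.memory = memory  # stands in for system memory 14 (address -> value)
        self.cache = {}       # stands in for data cache 11; its keys play the directory role

    def load(self, addr):
        """Return (value, outcome) for a register load from addr."""
        if addr in self.cache:
            # Read hit: the register is loaded from the cache copy.
            return self.cache[addr], "read hit"
        # Read miss: memory supplies the register value, and the line
        # starting at addr is copied into the cache.
        for a in range(addr, addr + LINE_SIZE):
            if a in self.memory:
                self.cache[a] = self.memory[a]
        return self.memory[addr], "read miss"

    def store(self, addr, value):
        """Store a register value to addr and return the outcome."""
        self.memory[addr] = value  # memory is updated on hit and miss alike
        if addr in self.cache:
            self.cache[addr] = value  # write-through: cache copy updated too
            return "write hit"
        return "write miss"  # the cache is left untouched


memory = {a: 10 * a for a in range(16)}
cache = ToyWriteThroughCache(memory)
print(cache.load(4))       # first touch of address 4
print(cache.load(5))       # address 5 arrived with the line fetched for address 4
print(cache.store(5, -1))  # address 5 is cached
print(cache.store(12, 7))  # address 12 was never loaded
```

In this model the second load is served from the cache because the first load staged a whole line, which is the behavior that makes sequential access to array elements inexpensive.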
Data from the much larger, and slower, main memory is automatically staged into the cache on a demand basis, as shown in FIG. 11(b). If the program running on the computer exhibits good locality of reference in the main memory address space, most of the accesses by the processor are satisfied from the cache, and the average memory access time seen by the processor is close to that of the cache (e.g., on the order of one to two cycles). When the processor does not find the requested data in the cache, it incurs a "cache miss penalty", which is a longer access time to the main memory.
For a given cache structure, a program can be characterized by its "cache hit ratio" (CHR) which is the fraction of the accesses that are satisfied from the cache and hence do not suffer the longer latency to main memory. Compressing data improves the CHR and overall microprocessor performance by allowing more of the needed data to fit in the data cache.
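As a rough illustration of why the CHR matters, the average access time can be estimated with the standard weighted average of hit and miss latencies. The function names and the cycle counts below are hypothetical values chosen for the example, not measurements from any particular microprocessor.

```python
def cache_hit_ratio(hits, accesses):
    """Fraction of accesses satisfied from the cache."""
    if accesses <= 0:
        raise ValueError("accesses must be positive")
    return hits / accesses


def average_access_cycles(hit_ratio, cache_cycles, memory_cycles):
    """Expected cycles per access: hits cost cache_cycles, misses cost memory_cycles."""
    return hit_ratio * cache_cycles + (1.0 - hit_ratio) * memory_cycles


# Hypothetical workload: 950 of 1000 accesses hit a 1-cycle cache,
# and each miss costs 20 cycles to main memory.
ratio = cache_hit_ratio(hits=950, accesses=1000)
print(ratio)
print(average_access_cycles(ratio, cache_cycles=1, memory_cycles=20))  # roughly 1.95
```

Even a small drop in the hit ratio raises the average noticeably, because each miss is weighted by the much longer main memory latency.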
Turning now to arrays, a sparse matrix is a two dimensional array that has few non-zero entries relative to the number of zero entries. The zero entries represent the zero coefficients of the linear system, whereas the non-zero entries in the matrix represent coefficients that have a non-zero value. A "band array" is a special type of sparse array where all non-zero elements are contained between two symmetric lines drawn parallel to the diagonal.
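The band property can be stated as a simple test: an entry at row i and column j may be non-zero only when |i - j| does not exceed some half-bandwidth k. A minimal sketch follows; the function name `half_bandwidth` and the list-of-lists representation are assumptions made for illustration.

```python
def half_bandwidth(matrix):
    """Smallest k such that matrix[i][j] == 0 whenever |i - j| > k."""
    k = 0
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value != 0:
                k = max(k, abs(i - j))
    return k


# A 4x4 tridiagonal example: non-zeros lie on the diagonal and its two neighbors.
a = [
    [1, 2, 0, 0],
    [3, 4, 5, 0],
    [0, 6, 7, 8],
    [0, 0, 9, 1],
]
print(half_bandwidth(a))  # 1
```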
FIG. 6 illustrates a conventional band array. The band array contains zero entries that perform no useful function but consume memory space that could be used for other purposes. Storing these entries unnecessarily increases memory use and thereby reduces the overall speed of the microprocessor.
To solve problems represented by band arrays efficiently, space and time requirements must be minimized. Current storage methods for general band arrays consume a relatively large amount of memory.
The BLAS, or Basic Linear Algebra Subprograms, are a set of industry standard Fortran subroutines that perform linear algebra operations. The BLAS storage method is illustrated in FIG. 7. In this method, each diagonal in the original matrix becomes a row in the transformed matrix. The asterisks shown in FIG. 7 are unused positions that must be stored in order to maintain the regularity of the matrix structure. Therefore, the storage overhead for this method is proportional to the size of the array. The BLAS storage method does not scale to an arbitrary number of dimensions, N.
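The diagonal-to-row mapping can be sketched as follows, using the general band layout documented for BLAS routines such as DGBMV, in which entry (i, j) of the original matrix is stored at row ku + i - j of the band array (zero-based indices here). The function name `to_band_storage`, the square list-of-lists input, and the `"*"` marker for unused positions are assumptions for illustration; the exact layout of FIG. 7 may differ.

```python
def to_band_storage(a, kl, ku, fill="*"):
    """Pack a square matrix with kl sub- and ku super-diagonals into band form.

    Row ku + i - j, column j of the result holds a[i][j]; positions that
    correspond to no matrix entry keep the fill marker.
    """
    n = len(a)
    ab = [[fill] * n for _ in range(kl + ku + 1)]
    for j in range(n):
        # Only rows i with j - ku <= i <= j + kl lie inside the band.
        for i in range(max(0, j - ku), min(n, j + kl + 1)):
            ab[ku + i - j][j] = a[i][j]
    return ab


a = [
    [11, 12, 0, 0],
    [21, 22, 23, 0],
    [0, 32, 33, 34],
    [0, 0, 43, 44],
]
for row in to_band_storage(a, kl=1, ku=1):
    print(row)
```

For this tridiagonal example the result has three rows (superdiagonal, diagonal, subdiagonal), and two positions remain marked as unused: one at the start of the superdiagonal row and one at the end of the subdiagonal row.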
Different storage representations are needed for different types of band arrays. For example, a dense array represented in the BLAS scheme requires almost twice the memory actually needed to represent all the non-zero array elements. Therefore, different storage schemes are used for various types of arrays (i.e., banded, triangular, dense), and different programs must be written to handle each different type of array. Further, prior to this invention, there was no general, scalable, uniform method to operate on any N-dimensional band array.
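The "almost twice" figure can be checked with a short calculation, assuming the diagonal-per-row band layout with kl sub-diagonals and ku super-diagonals, so that an n-by-n array occupies (kl + ku + 1) * n positions. A dense array has kl = ku = n - 1. The function names below are hypothetical.

```python
def band_storage_slots(n, kl, ku):
    """Positions occupied by an n-by-n array stored one diagonal per row."""
    return (kl + ku + 1) * n


def dense_overhead_ratio(n):
    """Band-format positions divided by the n*n entries of a dense array."""
    return band_storage_slots(n, n - 1, n - 1) / (n * n)


for n in (10, 100, 1000):
    print(n, dense_overhead_ratio(n))
```

The ratio equals (2n - 1) / n, which approaches 2 as n grows.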
The conventional systems do not provide a formally-stated patterned sparse array indexing method that promotes automated handling of array operations (i.e., array operations which are optimized by the compiler for either local or distributed processing).
Moreover, prior to this invention, there was no efficient hardware address generator method that allowed high-speed patterned sparse array processing. Also, prior to this invention, there was no general, scalable method for addressing compressed multi-dimensional patterned sparse arrays.