This invention relates to vector processing, and particularly to processors for transferring vector-elements of vectors having a constant address stride between main memory and the architectural registers without a cache.
General purpose processors transfer vectors between main memory and the architectural registers using LOAD and STORE instructions, scalar instructions and/or PUSH and POP instructions. LOAD and STORE instructions access memory via a cache to load operands from memory and store results into memory. Scalar instructions process operands accessed (inputted) from architectural registers to produce results accessed (outputted) into the architectural registers. PUSH and POP instructions access a processor-maintained stack in memory via cache to push (store) operands and results in architectural registers onto the stack and to pop (load) them off the stack back into the architectural registers.
Each processor has a limited number of general purpose architectural registers (ARs), each capable of holding the same amount of data. At any one time each AR can hold a value which can be variously characterized as an operand, a result, an address or any combination of these for different purposes; a result of a scalar instruction in an AR can become a memory-address in the AR for a LOAD instruction, and so on.
Typically, two ARs are needed to access memory with a LOAD or STORE instruction. A typical LOAD instruction designates the AR to receive loaded data from memory via cache and designates the AR having the memory-address used to access (load) the data from memory; a typical STORE instruction designates the AR from which to store data to memory via cache and designates the AR having the address used to access (store) the data into memory. The typical scalar instruction designates the two ARs holding its two operands and designates the AR into which it outputs its result. The PUSH instruction designates the AR from which it pushes (stores) the AR's current data onto the top of the processor's current stack in memory, thereby making the stack larger by one AR's amount of data. The POP instruction designates the AR into which it pops (loads) data from off the top of the stack, thereby making the stack one AR's amount of data smaller.
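By way of illustration only, the register designations described above can be sketched as a toy model. The class name ToyProcessor, the helper sum_strided, the word-addressed memory (one AR's amount of data per address), the downward-growing stack and the instruction counter are all assumptions made for this sketch and are not taken from any particular processor; the cache is not modeled.

```python
class ToyProcessor:
    """Toy model (hypothetical): ARs are a list, memory is a dict of address -> value."""

    def __init__(self, num_ars=8, stack_top=1000):
        self.ar = [0] * num_ars   # general purpose architectural registers
        self.mem = {}             # word-addressed memory; cache is not modeled
        self.sp = stack_top       # assumed stack pointer; stack grows downward
        self.executed = 0         # number of instructions executed

    def load(self, dest, addr_reg):
        # LOAD: AR[dest] receives the data; AR[addr_reg] holds the memory-address.
        self.ar[dest] = self.mem.get(self.ar[addr_reg], 0)
        self.executed += 1

    def store(self, src, addr_reg):
        # STORE: AR[src] supplies the data; AR[addr_reg] holds the memory-address.
        self.mem[self.ar[addr_reg]] = self.ar[src]
        self.executed += 1

    def add(self, dest, a, b):
        # Scalar instruction: two operand ARs and one result AR.
        self.ar[dest] = self.ar[a] + self.ar[b]
        self.executed += 1

    def push(self, src):
        # PUSH: the stack grows by one AR's amount of data.
        self.sp -= 1
        self.mem[self.sp] = self.ar[src]
        self.executed += 1

    def pop(self, dest):
        # POP: the stack shrinks by one AR's amount of data.
        self.ar[dest] = self.mem[self.sp]
        self.sp += 1
        self.executed += 1


def sum_strided(cpu, first_addr, stride, count):
    """Sum `count` elements spaced `stride` addresses apart using LOAD and ADD."""
    cpu.ar[0] = 0             # running sum
    cpu.ar[1] = first_addr    # address of the current element
    cpu.ar[2] = stride        # address increment
    for _ in range(count):
        cpu.load(3, 1)        # AR3 <- mem[AR1]
        cpu.add(0, 0, 3)      # AR0 <- AR0 + AR3
        cpu.add(1, 1, 2)      # address-support instruction: AR1 <- AR1 + AR2
    return cpu.ar[0]
```

In this sketch each vector-element costs one LOAD, one scalar instruction that does the useful work, and one additional address-support instruction, so a three-element vector executes nine instructions.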
The LOAD and POP instructions are each executed by one of the functional units; the unit's data-input is received from memory via the cache and delivered into the designated AR. Likewise, the STORE and PUSH instructions are each executed by one of the functional units; the data read from the AR the instruction designates is sent to memory via the cache as the unit's result. Each instruction executed by a given functional unit executes in the same fixed amount of time, generally beginning immediately after all operands are in the inputs of the functional unit designated by the instruction. The functional unit thereby produces the instruction's result at its output for delivery to the instruction's designated AR.
If and when the inputs of the needed functional unit are available, each instruction is enabled to execute (issued) by the issue unit immediately after the instruction is assured of having all of its operands before the next interrupt. If the required operands are all currently residing in ARs, there need be no delay between issuing the instruction and actually having the operands. However, if one or more needed operands of a next-to-be-issued scalar instruction are still being generated by respective functional units as respective results of previously issued scalar instructions, then those operands are assured of arriving but are not yet actually present. Any already present operand from an AR must wait for the other operand to also arrive at its AR so that together they cause execution of the instruction to begin. A waiting operand waits in its input of the functional unit, and the instruction begins execution when both operands are in their respective inputs. Usually, each functional unit is pipelined, so its inputs become available to a following instruction as the former begins execution.
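By way of illustration only, the timing rule described above (execution begins when both operands are present and then takes a fixed time for the given functional unit) can be sketched as follows. The function name result_times, the tuple layout of an instruction and the latency values are assumptions made for this sketch; contention for functional-unit inputs and the issue order are deliberately ignored.

```python
def result_times(instructions, latency, ready=None):
    """Return the time at which each AR's value becomes available.

    instructions: list of (dest, src_a, src_b, unit) tuples in program order.
    latency:      dict mapping a functional-unit name to its fixed execution time.
    ready:        optional dict of AR -> time its current value is available;
                  ARs not listed are assumed available at time 0.
    """
    ready = dict(ready or {})
    for dest, src_a, src_b, unit in instructions:
        # An operand that arrives early waits in its input of the functional
        # unit; execution begins only when both operands are present.
        start = max(ready.get(src_a, 0), ready.get(src_b, 0))
        ready[dest] = start + latency[unit]
    return ready
```

For example, with an assumed adder latency of 2 and multiplier latency of 5, a multiply that consumes the adder's result cannot begin before time 2 and delivers its own result at time 7.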
Somewhat differently from scalar instructions, a LOAD, STORE, POP or PUSH instruction, before being issued, receives assurance from the cache that the cache holds the memory-data being accessed or from the associative that the cache will assuredly soon hold the data. Thus, the address to be accessed is sent to cache from its AR before the instruction issues, and each of these instructions can take a long time to issue after reaching the point of issuing. Additionally, after issuing, the instruction can take a long time to begin execution. Again, once issued, an instruction will eventually execute to completion, but if issuing is delayed because necessary memory-data (LOAD or POP) or necessary memory-space (STORE or PUSH) is not currently in physical memory, the instruction will not issue for execution until an interrupt allows the missing data or memory space to be supplied. While interrupted, missing memory-data or missing memory-space is supplied from mass storage into physical memory. After the missing data or memory space is supplied and after a respective return-from-interrupt is executed, the LOAD, STORE, POP or PUSH instruction which caused the interrupt is the first to again try to issue.
The processor's interrupt unit gathers conditions requiring an interrupt, then waits for activity to settle (for example, waiting for all issued but not yet executed instructions to execute and deliver their results to ARs, memory, etc.). After all is settled, the interrupt unit initiates the interrupt and facilitates timing of execution of the processor-state-exchange for resolving the condition that caused the interrupt.
Caching techniques are used by LOAD, STORE and related instructions (like POP and PUSH) to access immediately needed data in memory and also to acquire extra amounts of contiguous data from the main memory in anticipation of possible future needs, so that the data can be more quickly accessed by future LOAD, STORE and related instructions of the processor. If a data element has not previously been acquired into cache memory, a LOAD or STORE instruction causes the cache to acquire a fixed-sized aligned block of contiguous data (a cache-line) that includes the sought-for data. That is, if no existing cache-line has a needed data-element, the instruction causes the cache to acquire a new cache-line from memory and then accesses the needed element from the newly acquired cache-line. Only a limited number of cache-lines can exist in the cache at one time; when the cache reaches its limit, each new line replaces the least-recently-accessed line.
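By way of illustration only, the cache-line behavior described above (fixed-sized aligned lines, a limited number of lines, and replacement of the least-recently-accessed line) can be sketched as follows. The class name LineCache and its parameters are assumptions made for this sketch; only line residency is tracked, not the cached data itself.

```python
from collections import OrderedDict


class LineCache:
    """Hypothetical model of a cache holding a limited number of aligned lines."""

    def __init__(self, line_size, max_lines):
        self.line_size = line_size
        self.max_lines = max_lines
        self.lines = OrderedDict()  # line base address -> True, least recent first
        self.misses = 0

    def access(self, addr):
        """Access one address; return True on a hit, False on a miss."""
        base = addr - (addr % self.line_size)   # aligned start of the line
        if base in self.lines:
            self.lines.move_to_end(base)        # now the most recently accessed
            return True
        self.misses += 1
        if len(self.lines) >= self.max_lines:
            self.lines.popitem(last=False)      # evict least-recently-accessed line
        self.lines[base] = True                 # acquire the new cache-line
        return False
```

With an assumed line size of 8 and room for two lines, accessing addresses 0, 3, 8, 16 and then 0 again gives one hit (address 3) and four misses, because acquiring the line for address 16 replaces the line that held address 0.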
A cache-line contains more data than immediately needed by any one instruction, and each cache-line can potentially meet the data needs of any number of future LOAD or STORE instructions. In the case of vector processing, a future instruction to access a vector-element will first search any existing cache-lines for the element. If the element is not found, the instruction will cause the cache to acquire a new cache-line containing the sought-for element and then access the element.
A vector in memory comprises orderly located data-elements (herein, vector-elements) having different addresses and possibly having different sizes (data-bits per element). If in memory the amount of non-element-data between a pair of successive vector-elements is small, it is likely that a cache-line containing the first element will also contain the second element. However, caching large aligned blocks of contiguous data in order to possibly acquire more than one vector-element often requires also acquiring data that contain no elements of the vector. The only exception is where successive vector-elements in memory abut or overlap so that each cache-line consists entirely of vector-elements.
A stride is a count of the number of same-sized memory-data-amounts from one vector-element to the next, measured from a point in one vector-element to the same respective point in the successive vector-element, with positive values in the direction of increasing address values. If the number and direction of memory-data-amounts between each of all pairs of successively accessed vector-elements are the same, the stride is constant over the entire vector. In the particular case of a vector whose successive elements abut in memory and are all of one size or data-amount, the vector's stride is constant and either +1 or −1, and except possibly at the beginning and end of the vector, each cache-line thereof consists entirely of vector-elements. By contrast, a vector having a constant stride of +2 or −2 between all successive same-sized vector-elements results in the inefficiency that at least half the memory-data of each so-acquired cache-line are not sought-for vector-elements. With each increment of stride value farther from zero beyond ±1, an additional amount of cached data are never vector-data, the extreme condition being a large stride value resulting in cache-lines each containing only one vector-element.
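By way of illustration only, the effect of stride on cache-line usage can be computed directly. The function name line_utilization and its parameters are assumptions made for this sketch, which further assumes byte addresses, elements aligned to their size, and an element size that divides the line size so that no element spans two cache-lines.

```python
def line_utilization(first_addr, stride, elem_size, count, line_size):
    """Fraction of all acquired cache-line bytes that are vector-elements.

    stride is counted in units of elem_size and may be negative.
    """
    addrs = [first_addr + i * stride * elem_size for i in range(count)]
    lines_touched = {addr // line_size for addr in addrs}  # distinct lines acquired
    return (count * elem_size) / (len(lines_touched) * line_size)
```

For 16 elements of 8 bytes each and 64-byte cache-lines, a stride of +1 starting at address 0 gives a utilization of 1.0, a stride of +2 (or a stride of −2 starting at address 240) gives 0.5, and a stride of +8 places one element in each cache-line for a utilization of 0.125.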
A vector having a constant element size can have a constant or a varying stride. For example, a vector having a varying stride might have vector-elements consisting solely of full-words (64-bits), or solely of half-words (32-bits), etc., but different numbers of like memory-data-amounts and/or different directions of address increments between successive pairs of vector-elements in memory (e.g., +1, −2, +3, −4, +5, . . . ). In a vector having a varying element size, each element's address locates the element in memory aligned with respect to that element's size. For purposes of the present invention a vector must have parameters that pre-define its elements' addresses and sizes, and thereby define the vector's location in memory, before accessing begins. For purposes of an embodiment of the present invention, a vector having a constant stride and a constant element size is called a “regular vector”. Thus, a regular vector can be pre-defined as having a first element address, a constant difference respective to element-size (stride) between memory-addresses of successive vector-elements, and a constant number of bits (size) per element. Different regular vectors may have different strides and/or different element sizes as long as the stride is constant and the element size is constant over each entire vector. Also for purposes of this invention, an irregular vector is a vector that is not entirely regular; namely, the element size varies (such as a mixture of half-words and full-words) or, if the element size is constant (such as all half-words), then the stride value is not always the same between each of all pairs of successive vector-elements.
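By way of illustration only, the three parameters that pre-define a regular vector, and the distinction between regular and irregular vectors, can be sketched as follows. The names RegularVector and classify are assumptions made for this sketch, as are the use of byte addresses, element sizes given in bytes, and the choice to treat an address difference that is not a whole multiple of the element size as irregular.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegularVector:
    first_addr: int  # address of the first vector-element
    stride: int      # constant stride in units of elem_size; may be negative
    elem_size: int   # constant element size in bytes

    def address(self, i):
        """Address of the i-th vector-element (i counted from 0)."""
        return self.first_addr + i * self.stride * self.elem_size


def classify(elements):
    """Return a RegularVector for a list of (address, size) pairs, or None.

    None means the vector is irregular: the element size varies, or the
    stride is not the same between all pairs of successive elements.
    """
    if len(elements) < 2:
        raise ValueError("need at least two elements to determine a stride")
    first_addr, size = elements[0]
    if any(s != size for _, s in elements):
        return None                      # element size varies
    deltas = {b[0] - a[0] for a, b in zip(elements, elements[1:])}
    if len(deltas) != 1:
        return None                      # stride varies
    delta = deltas.pop()
    if delta % size != 0:
        return None                      # assumption: not a whole-element stride
    return RegularVector(first_addr, delta // size, size)
```

For example, half-word (4-byte) elements at addresses 0, 8 and 16 form a regular vector with a stride of +2, whereas elements at addresses 100, 104, 96 and 108 (strides +1, −2, +3) or a mixture of 4-byte and 8-byte elements are classified as irregular.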
There is a need for techniques to speed vector processing, and particularly for processing certain vectors in memory with scalar instructions but without a cache, so that the number of encodings and executions of LOAD, STORE and address-support instructions and the delays associated with cache processing can be reduced and, where the certain vectors are present, elements of all vectors can be more efficiently read from memory and stored to memory.