This invention is related to computer processors, and in particular to computer processors having operands substantially wider than the data path width between the processor and memory.
We have previously described a programmable processor and method for improving the performance of programmable processors by enabling the use of operands that are larger than the internal data path width of the processor. See, e.g., U.S. Pat. No. 8,269,784 entitled “Processor Architecture for Executing Wide Transform Slice Instructions,” which is incorporated by reference herein. This prior invention uses the contents of general-purpose registers to specify operands stored in memory that are wider than the processor internal data path. The operands are typically a multiple of the data path width and are preferably stored in adjacent rows of a memory. The registers also specify the memory address at which a plurality of data-path widths of data can be read (or written).
FIG. 1, taken from our earlier patent, illustrates a sample FFT slice instruction with five fields: ‘wminor,’ ‘*data,’ ‘*twiddl,’ ‘fftpar,’ and ‘wfftslic’. The instruction causes the processor to perform an operation that, when repeated a sufficient number of times, computes a Fast Fourier Transform (FFT). Each time the instruction is executed, a ‘slice’ of the FFT is computed. The register field *data specifies an address for a region in memory containing data, and the register field *twiddl specifies an address for another region in memory containing “twiddle factor” coefficients (complex roots of unity).
The first time this operation is executed, twiddle factors are loaded, at the rate the available processor memory bus permits, into a “coefficient RAM” embedded in the execution unit (or, in an alternative embodiment, are already present in embedded ROM), and data are loaded into embedded “wide cache” memories. Successive operations reuse the twiddle factors to perform successive slices of the FFT on data buffered in the embedded cache.
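The slice-by-slice computation described above can be modeled in software. The following is a minimal sketch only, not the instruction's actual semantics. It assumes a radix-4 decimation-in-frequency ordering, a forward-transform sign convention of exp(-2*pi*i*k/n), and output in base-4 digit-reversed order, none of which is specified here. The names make_twiddles, fft_slice, digit_reverse4, and fft_by_slices are hypothetical. The twiddle table is built once and reused by every slice, loosely mirroring the role of the coefficient RAM.

```python
import cmath

def make_twiddles(n):
    """Complex roots of unity W_n^k, built once and reused by every slice."""
    # Assumed sign convention: W_n^k = exp(-2*pi*i*k/n).
    return [cmath.exp(-2j * cmath.pi * k / n) for k in range(n)]

def fft_slice(x, tw, block):
    """Apply one radix-4 pass (a 'slice') to x in place.

    Assumption: decimation-in-frequency ordering; the source does not say
    which ordering the hardware uses.
    """
    n = len(x)
    quarter = block // 4
    stride = n // block  # maps W_block^k onto the shared W_n table
    for base in range(0, n, block):
        for j in range(quarter):
            i0 = base + j
            a0 = x[i0]
            a1 = x[i0 + quarter]
            a2 = x[i0 + 2 * quarter]
            a3 = x[i0 + 3 * quarter]
            # Radix-4 butterfly, followed by the twiddle multiplications.
            x[i0] = a0 + a1 + a2 + a3
            x[i0 + quarter] = (a0 - 1j * a1 - a2 + 1j * a3) * tw[stride * j]
            x[i0 + 2 * quarter] = (a0 - a1 + a2 - a3) * tw[stride * 2 * j]
            x[i0 + 3 * quarter] = (a0 + 1j * a1 - a2 - 1j * a3) * tw[stride * 3 * j]

def digit_reverse4(i, digits):
    """Reverse the base-4 digits of i (the passes leave results in this order)."""
    r = 0
    for _ in range(digits):
        r = r * 4 + i % 4
        i //= 4
    return r

def fft_by_slices(data):
    """Return (spectrum, number_of_slices) for an input whose length is a power of 4."""
    x = [complex(v) for v in data]
    n = len(x)
    m = n
    while m > 1 and m % 4 == 0:
        m //= 4
    if m != 1:
        raise ValueError("length must be a power of 4")
    tw = make_twiddles(n)  # loaded once, then reused by each slice
    slices = 0
    block = n
    while block >= 4:
        fft_slice(x, tw, block)
        block //= 4
        slices += 1
    spectrum = [x[digit_reverse4(k, slices)] for k in range(n)]
    return spectrum, slices
```

Under these assumptions, calling fft_by_slices on a 256-element input runs fft_slice four times over the same buffer and the same twiddle table, and the reordered result can be checked against a direct evaluation of the discrete Fourier transform.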
The exemplary hardware for performing this operation is also shown in FIG. 1. The hardware in FIG. 1 contains 16 complex multipliers and 4 radix-4 “butterfly/mux strips,” capable, for example, of performing a single slice of a 256-point radix-4 complex FFT in 16 cycles. The register field fftpar (“FFT parameters”) names a processor register that holds the size and shape of the data and twiddle operands and the nature of the FFT slice performed, along with status information, such as a scaling or shift amount needed to avoid overflow during execution of the next FFT slice. Four successive WFFTSLICE operations compute an entire 256-point complex FFT in 64 cycles, depositing the result in a memory-mapped wide cache.
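The cycle counts quoted above can be reproduced with simple arithmetic. The sketch below is one reading that is consistent with those figures, not a statement of how the hardware is actually scheduled. It assumes that each butterfly strip completes one radix-4 butterfly per cycle and that each butterfly input passes through one complex multiplier; neither assumption is stated in the text. The function name fft_slice_budget and its parameters are hypothetical.

```python
def fft_slice_budget(points=256, radix=4, butterfly_strips=4):
    """Estimate slice and cycle counts for a radix-`radix` FFT.

    Assumptions (not stated in the source): one butterfly per strip per
    cycle, and one complex multiply per butterfly input.
    """
    slices = 0
    n = points
    while n > 1:
        if n % radix:
            raise ValueError("points must be a power of the radix")
        n //= radix
        slices += 1
    butterflies_per_slice = points // radix
    # Ceiling division, in case the strips do not divide the work evenly.
    cycles_per_slice = -(-butterflies_per_slice // butterfly_strips)
    return {
        "slices": slices,
        "butterflies_per_slice": butterflies_per_slice,
        "cycles_per_slice": cycles_per_slice,
        "total_cycles": slices * cycles_per_slice,
        "multiplies_per_cycle": radix * butterfly_strips,
    }
```

With the default arguments this yields 4 slices of 64 butterflies each, 16 cycles per slice, 64 cycles in total, and 16 complex multiplications per cycle, which matches the 16 multipliers, the 16 cycles per slice, and the 64-cycle total given above.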
The invention described below provides additional functionality to enable expandably wide operations with improved efficiency on a broader range of algorithms than our prior technology.