1. Technical Field
The present principles generally relate to register files, and more particularly, to methods and apparatus for spatial register partitioning in a multi-bit cell register file. The methods and apparatus balance timing and area between register file slices for a register file supporting scalar and vector execution.
2. Description of the Related Art
Modern microprocessor systems derive significant efficiency from data-parallel single instruction multiple data (SIMD) execution, particularly for data-intensive floating-point computations. In addition to data-parallel SIMD computation, scalar computation is necessary for code that is not data parallel. In modern Instruction Set Architectures (ISAs), such as that of the Cell SPE, scalar computation can be executed from a SIMD register file.
In ISAs with legacy support, separate scalar register files are necessary. To reduce the overhead of supporting both scalar and SIMD computation, it is desirable to share data paths. Turning to FIG. 1, an exemplary register architecture having two separate register files is indicated generally by the reference numeral 100. One register file stores scalar data and the other stores vector data, so the two types of data must be kept in separate register files. The exemplary architecture includes a scalar register file 110 and a vector register file 120, as noted above, as well as a multiplexer 130 and FMA units 140. This prior art approach undesirably consumes routing resources and requires multiplexers (such as multiplexer 130) in order to select from among multiple data sources. Moreover, this prior art approach undesirably has a higher fan-out to drive data into one of the multiple data destinations.
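The arrangement of FIG. 1 can be illustrated with a small behavioral sketch. This is only an illustrative model, not the circuit of FIG. 1: the four-lane vector width, the 32-entry register files, the per-lane multiply-add behavior of the FMA units, and all names (`ScalarRegisterFile`, `VectorRegisterFile`, `multiplexer`, `fma_units`, `execute`) are assumptions introduced here. The sketch shows that every operand passes through a selection step between the two register files before it reaches the FMA units.

```python
from dataclasses import dataclass, field

VECTOR_LANES = 4     # assumed SIMD width; not specified in the text
NUM_REGISTERS = 32   # assumed number of registers per file


@dataclass
class ScalarRegisterFile:
    """Hypothetical stand-in for scalar register file 110."""
    regs: list = field(default_factory=lambda: [0.0] * NUM_REGISTERS)

    def read(self, index):
        # A scalar read delivers a single element.
        return [self.regs[index]]


@dataclass
class VectorRegisterFile:
    """Hypothetical stand-in for vector register file 120."""
    regs: list = field(
        default_factory=lambda: [[0.0] * VECTOR_LANES for _ in range(NUM_REGISTERS)]
    )

    def read(self, index):
        # A vector read delivers one element per lane.
        return list(self.regs[index])


def multiplexer(select_vector, scalar_operand, vector_operand):
    """Stand-in for multiplexer 130: pick one of the two data sources."""
    return vector_operand if select_vector else scalar_operand


def fma_units(a, b, c):
    """Stand-in for FMA units 140, assumed to compute a*b + c per lane."""
    return [x * y + z for x, y, z in zip(a, b, c)]


def execute(scalar_rf, vector_rf, select_vector, ra, rb, rc):
    # Both register files are read for each operand, and the multiplexer
    # then selects which source feeds the FMA units.
    operands = [
        multiplexer(select_vector, scalar_rf.read(r), vector_rf.read(r))
        for r in (ra, rb, rc)
    ]
    return fma_units(*operands)
```

In this model, each of the three operands needs its own selection between the two sources, which reflects the routing and multiplexing overhead described above.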
To reduce the chip area used for such implementations, it would be desirable to implement a single register file that stores both scalar and SIMD data.
In one prior art implementation, a narrow register file is implemented, and each wide architected register is formed by allocating multiple scalar physical registers. However, this prior art implementation results in either low performance, when the slices are operated upon in sequence, or high area cost, when the data for multiple slices is read in parallel by increasing the number of read ports.
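The trade-off in this narrow-register-file approach can be made concrete with a short sketch. The consecutive mapping from an architected register to physical registers and the one-read-per-port-per-cycle cost model are simplifying assumptions made here for illustration, and the function names `physical_registers` and `read_cycles` are hypothetical.

```python
def physical_registers(arch_reg, slices):
    """Physical registers backing one wide architected register.

    Assumes (for illustration only) that the slices of architected
    register `arch_reg` occupy consecutive narrow physical registers.
    """
    first = arch_reg * slices
    return list(range(first, first + slices))


def read_cycles(slices, read_ports):
    """Cycles needed to read every slice of one wide register.

    Assumes each read port delivers one narrow physical register per cycle.
    """
    if read_ports < 1:
        raise ValueError("at least one read port is required")
    # Ceiling division: a partially used final cycle still counts as a cycle.
    return -(-slices // read_ports)
```

Under these assumptions, a four-slice register read through one port takes `read_cycles(4, 1)` = 4 cycles (the low-performance case), while reading it in a single cycle requires four read ports (the high-area case).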