I. Field of the Disclosure
The field of the disclosure relates to vector processors and related systems for processing vector and scalar operations, including single instruction, multiple data (SIMD) processors and multiple instruction, multiple data (MIMD) processors.
II. Background
Wireless computing systems are fast becoming one of the most prevalent technologies in the digital information arena. Advances in technology have resulted in smaller and more powerful wireless communications devices. For example, wireless computing devices commonly include portable wireless telephones, personal digital assistants (PDAs), and paging devices that are small, lightweight, and easily carried by users. More specifically, portable wireless telephones, such as cellular telephones and Internet Protocol (IP) telephones, can communicate voice and data packets over wireless networks. Further, many such wireless communications devices incorporate other types of devices. For example, a wireless telephone may include a digital still camera, a digital video camera, a digital recorder, and/or an audio file player. Also, wireless telephones can include a web interface that can be used to access the Internet. Further, wireless communications devices may include complex processing resources for processing high-speed wireless communications data according to wireless communications technology standards (e.g., code division multiple access (CDMA), wideband CDMA (WCDMA), and long term evolution (LTE)). As such, these wireless communications devices include significant computing capabilities.
As wireless computing devices become smaller and more powerful, they become increasingly resource constrained. For example, screen size, amount of available memory and file system space, and amount of input and output capabilities may be limited by the small size of the device. Further, battery size, amount of power provided by the battery, and life of the battery are also limited. One way to increase the battery life of the device is to design processors that consume less power.
In this regard, wireless communications devices may employ baseband processors that include vector processors. Vector processors have a vector architecture that provides high-level operations that work on vectors, i.e., arrays of data. Vector processing involves fetching a vector instruction once and then executing it multiple times across an entire array of data elements, as opposed to executing the instruction on one set of data and then re-fetching and re-decoding it for each subsequent element within the vector. This process reduces the energy required to execute a program because, among other factors, each vector instruction needs to be fetched fewer times. Because vector instructions operate on long vectors over multiple clock cycles, processing many data elements at the same time, a high degree of parallelism is achievable with simple in-order vector instruction dispatch.
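The fetch-once, execute-many behavior described above can be illustrated with a toy model that only counts instruction fetch/decode events. This is a minimal sketch under stated assumptions: `FetchCounter`, `run_scalar`, and `run_vector` are hypothetical names, and the model ignores pipelining, clock cycles, and energy per fetch; it only shows that the scalar style pays one fetch per element while the vector style pays one fetch per array.

```python
import operator
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class FetchCounter:
    """Counts instruction fetch/decode events in this toy model (hypothetical helper)."""
    fetches: int = 0


def run_scalar(op: Callable[[int, int], int], a: List[int], b: List[int],
               counter: FetchCounter) -> List[int]:
    # Scalar style: the instruction is re-fetched and re-decoded for every element.
    out = []
    for x, y in zip(a, b):
        counter.fetches += 1
        out.append(op(x, y))
    return out


def run_vector(op: Callable[[int, int], int], a: List[int], b: List[int],
               counter: FetchCounter) -> List[int]:
    # Vector style: fetch and decode once, then apply the operation across the whole array.
    counter.fetches += 1
    return [op(x, y) for x, y in zip(a, b)]


a = list(range(32))
b = [1] * 32
scalar_counter = FetchCounter()
vector_counter = FetchCounter()
# Both styles compute the same result; only the number of fetches differs.
assert run_scalar(operator.add, a, b, scalar_counter) == run_vector(operator.add, a, b, vector_counter)
print(scalar_counter.fetches, vector_counter.fetches)  # 32 1
```

The 32-element array is an arbitrary example; the gap in fetch count grows linearly with the array length.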
FIG. 1 illustrates an exemplary baseband processor 10 that may be employed in a computing device, such as a wireless computing device. The baseband processor 10 includes multiple processing engines (PEs) 12, each dedicated to providing function-specific vector processing for specific applications. In this example, six (6) separate PEs 12(0)-12(5) are provided in the baseband processor 10. The PEs 12(0)-12(5) are each configured to provide vector processing for fixed X-bit wide vector data 14 provided from a shared memory 16 to the PEs 12(0)-12(5). For example, the vector data 14 could be 512 bits wide. The vector data 14 can be divided into smaller vector data sample sets 18(0)-18(Y), each narrower than the X-bit width (e.g., 16-bit and 32-bit sample sets). In this manner, the PEs 12(0)-12(5) are capable of providing vector processing on multiple vector data sample sets 18 provided in parallel to the PEs 12(0)-12(5) to achieve a high degree of parallelism. Each PE 12(0)-12(5) may include a vector register file (VR) for storing the results of a vector instruction processed on the vector data 14.
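The relationship between the X-bit vector data 14 and its sample sets 18 can be sketched by splitting a 512-bit word into equal-width lanes: 16-bit samples give 32 lanes and 32-bit samples give 16 lanes. This is an illustrative sketch only; the function name `split_into_samples` is hypothetical, and placing lane 0 in the least significant bits is an assumption, not something FIG. 1 specifies.

```python
from typing import List

VECTOR_WIDTH_BITS = 512  # example value of X given in the text


def split_into_samples(vector: int, sample_bits: int,
                       width_bits: int = VECTOR_WIDTH_BITS) -> List[int]:
    """Split an X-bit vector word into equal-width samples (assumed: lane 0 = least significant bits)."""
    if width_bits % sample_bits != 0:
        raise ValueError("sample width must evenly divide the vector width")
    if vector < 0 or vector >> width_bits:
        raise ValueError("vector value does not fit in width_bits")
    mask = (1 << sample_bits) - 1
    return [(vector >> (i * sample_bits)) & mask
            for i in range(width_bits // sample_bits)]


word = (2 << 16) | 1  # viewed as 16-bit samples: lane 0 holds 1, lane 1 holds 2
print(len(split_into_samples(word, 16)), len(split_into_samples(word, 32)))  # 32 16
print(split_into_samples(word, 16)[:3])  # [1, 2, 0]
print(split_into_samples(word, 32)[:2])  # [131073, 0]
```

The same 512 bits yield a different number of parallel samples depending on the chosen sample width, which is what lets a fixed-width PE datapath process several sample sets at once.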
Each PE 12(0)-12(5) in the baseband processor 10 in FIG. 1 includes dedicated circuitry and hardware designed to efficiently perform specific types of fixed operations. For example, the baseband processor 10 in FIG. 1 includes separate WCDMA PEs 12(0), 12(1) and LTE PEs 12(4), 12(5), because WCDMA and LTE involve different types of specialized operations. Thus, by providing separate WCDMA-specific PEs 12(0), 12(1) and LTE-specific PEs 12(4), 12(5), each of the PEs 12(0), 12(1), 12(4), 12(5) can be designed to include specialized, dedicated circuitry for the functions most frequently performed in WCDMA and LTE, yielding highly efficient operation. This design is in contrast to scalar processing engines, which include more general circuitry and hardware designed to be flexible enough to support a larger number of unrelated operations, but in a less efficient manner.
Certain wireless baseband operations require merging of data samples produced by previous processing operations. For example, it may be desired to accumulate vector data samples of varying widths that are wider than the data paths of the execution units. As another example, it may be desired to perform a dot product multiplication of output vector data samples from different execution units to merge output vector data in vector processing operations. Merging the vector data samples in these operations can require complex routing that provides data paths crossing vector data lanes. However, such routing increases complexity and can reduce the efficiency of a vector processing engine (VPE), because output vector data that must be merged across different vector data lanes is difficult to parallelize. Vector processors can also include circuitry that performs post-processing merging of output vector data that the execution units have stored in vector data memory. The stored output vector data samples are fetched from vector data memory, merged as desired, and stored back in vector data memory. However, this post-processing can delay subsequent vector processing operations of the VPE and cause computational components in the execution units to be underutilized.
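Why merging crosses vector data lanes can be seen in a toy dot product: the per-lane multiplies stay within their own lanes, but summing the products forces data to move between lanes. The sketch below assumes a power-of-two lane count and uses a pairwise tree reduction, a common technique chosen here for illustration and not necessarily how any particular VPE merges data; the names `lane_products`, `tree_reduce`, and `dot_product` are hypothetical.

```python
from typing import List, Tuple


def lane_products(a: List[int], b: List[int]) -> List[int]:
    # Per-lane multiply: each lane uses only its own operands, so no cross-lane routing is needed.
    if len(a) != len(b):
        raise ValueError("operand vectors must have the same number of lanes")
    return [x * y for x, y in zip(a, b)]


def tree_reduce(lanes: List[int]) -> Tuple[int, int]:
    """Sum lane values with a pairwise tree; return (sum, number of cross-lane steps)."""
    n = len(lanes)
    if n == 0 or n & (n - 1):
        raise ValueError("lane count must be a power of two")
    values = list(lanes)
    steps = 0
    stride = n // 2
    while stride >= 1:
        # Lane i adds the value held by lane i + stride: this is where data crosses lanes.
        for i in range(stride):
            values[i] += values[i + stride]
        stride //= 2
        steps += 1
    return values[0], steps


def dot_product(a: List[int], b: List[int]) -> Tuple[int, int]:
    return tree_reduce(lane_products(a, b))


print(dot_product([1, 2, 3, 4, 5, 6, 7, 8], [1] * 8))  # (36, 3)

# 32 lanes of maximal 16-bit samples: the merged result needs a 37-bit accumulator,
# wider than both the 16-bit samples and their 32-bit per-lane products.
total, steps = dot_product([0xFFFF] * 32, [0xFFFF] * 32)
print(steps, total.bit_length())  # 5 37
```

The second example also hints at the accumulation case in the text: the merged value outgrows the width of any single lane, so the accumulator must be wider than the execution-unit data path.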