I. Field of the Disclosure
The field of the disclosure relates to vector processors and related systems for processing vector and scalar operations, including single instruction, multiple data (SIMD) processors and multiple instruction, multiple data (MIMD) processors.
II. Background
Wireless computing systems are fast becoming one of the most prevalent technologies in the digital information arena. Advances in technology have resulted in smaller and more powerful wireless communications devices. For example, wireless computing devices commonly include portable wireless telephones, personal digital assistants (PDAs), and paging devices that are small, lightweight, and easily carried by users. More specifically, portable wireless telephones, such as cellular telephones and Internet Protocol (IP) telephones, can communicate voice and data packets over wireless networks. Further, many such wireless communications devices include other types of devices. For example, a wireless telephone may include a digital still camera, a digital video camera, a digital recorder, and/or an audio file player. Also, wireless telephones can include a web interface that can be used to access the Internet. Further, wireless communications devices may include complex processing resources for processing high speed wireless communications data according to designed wireless communications technology standards (e.g., code division multiple access (CDMA), wideband CDMA (WCDMA), and long term evolution (LTE)). As such, these wireless communications devices include significant computing capabilities.
As wireless computing devices become smaller and more powerful, they become increasingly resource constrained. For example, screen size, amount of available memory and file system space, and amount of input and output capabilities may be limited by the small size of the device. Further, battery size, amount of power provided by the battery, and life of the battery are also limited. One way to increase the battery life of the device is to design processors that consume less power.
In this regard, baseband processors may be employed for wireless communications devices that include vector processors. Vector processors have a vector architecture that provides high-level operations that work on vectors, i.e. arrays of data. Vector processing involves fetching a vector instruction once and then executing the vector instruction multiple times across an entire array of data elements, as opposed to executing the vector instruction on one set of data and then re-fetching and decoding the vector instruction for subsequent elements within the vector. This process allows for a reduction in the energy required to execute a program, because among other factors, each vector instruction needs to be fetched fewer times. Since vector instructions operate on long vectors over multiple clock cycles at the same time, a high degree of parallelism is achievable with simple in-order vector instruction dispatch.
FIG. 1 illustrates an exemplary baseband processor 10 that may be employed in a computing device, such as a wireless computer device. The baseband processor 10 includes multiple processing engines (PEs) 12, each dedicated to providing function-specific vector processing for specific applications. In this example, six (6) separate PEs 12(0)-12(5) are provided in the baseband processor 10. The PEs 12(0)-12(5) are each configured to provide vector processing for fixed X-bit wide vector data 14 provided from a shared memory 16 to the PEs 12(0)-12(5). For example, the vector data 14 could be 512 bits wide. The vector data 14 can be defined in smaller multiples of X-bit width vector data sample sets 18(0)-18(Y) (e.g., 16-bit and 32-bit sample sets). In this manner, the PEs 12(0)-12(5) are capable of providing vector processing on multiple vector data sample sets 18 provided in parallel to the PEs 12(0)-12(5) to achieve a high degree of parallelism. Each PE 12(0)-12(5) may include a vector register file (VR) for storing the results of a vector instruction processed on the vector data 14.
Each PE 12(0)-12(5) in the baseband processor 10 in FIG. 1 includes specific, dedicated circuitry and hardware specifically designed to efficiently perform specific types of fixed operations. For example, the baseband processor 10 in FIG. 1 includes separate WCDMA PEs 12(0), 12(1) and LTE PEs 12(4), 12(5), because WCDMA and LTE involve different types of specialized operations. Thus, by providing separate WCDMA-specific PEs 12(0), 12(1) and LTE-specific PEs 12(4), 12(5), each of the PEs 12(0), 12(1), 12(4), 12(5) can be designed to include specialized, dedicated circuitry that is specific to frequently performed functions for WCDMA and LTE for highly efficient operation. This design is in contrast to scalar processing engines that include more general circuitry and hardware designed to be flexible to support a larger number of unrelated operations, but in a less efficient manner.
As an example, it may be desired to employ vector processing for performing CDMA processing. In this regard, it may be desired to employ vector processors to provide correlation operations to choose the direct spread spectrum code (DSSC) (i.e., chip sequence) for demodulating a user signal in a CDMA system to provide good separation between the user signal and signals of other users in the CDMA system. The separation of the signals is made by correlating the received signal with the locally generated chip sequence of the desired user. If the signal matches the desired user's chip sequence, the correlation function will be high and the CDMA system can extract that signal. If the desired user's chip sequence has little or nothing in common with the signal, the correlation should be as close to zero as possible (thus eliminating the signal), which is referred to as cross-correlation. If the chip sequence is correlated with the signal at any time offset other than zero, the correlation should be as close to zero as possible. This is referred to as auto-correlation, and is used to reject multi-path interference.
However, correlation operations may be difficult to parallelize in vector processors due to the specialized data flow paths provided in vector processors. When the input vector data sample set representing the signal to be correlated is shifted between delay taps, the input vector data sample set is re-fetched from a vector data file, thus increasing power consumption and reducing throughput. To minimize re-fetching of the input vector data sample set from memory, the data flow path could be configured to provide the same number of multipliers as delay taps for efficient parallelized processing. However, other vector processing operations may require fewer multipliers thereby providing inefficient scaling and underutilization of the multipliers in the data flow path. If the number of multipliers is reduced to be less than the number of delay taps to provide scalability, parallelism is limited by more re-fetches being required to memory to obtain the same input vector data sample set for different phases of the correlation processing.