In the field of signal processing, system-on-a-chip controllers are becoming more common and more powerful. Typically, they combine a central processing unit (CPU) core (e.g., a microprocessor core) to provide control functionality with a digital signal processing (DSP) core to provide signal processing. Often, the CPU and DSP functionalities are tightly coupled. That is, the DSP processing is armed and executed, and the CPU core waits for the DSP core to finish.
The DSP core of the system-on-a-chip controller can include a number (e.g., four) of parallel pipeline execution paths. The pipeline execution paths are also referred to as data processing paths. The parallel paths facilitate the processing of greater amounts of data than is possible with a single path. Data from a source external to the DSP core (e.g., an external memory) can be read into a local memory of the DSP core by a direct memory access (DMA) module. The data is stored in the local memory in a manner determined by the DMA module. Usually, the data is stored according to either a per-data-path model or a globally accessible scheme. When stored according to a per-data-path model, the data can be accessed by the respective data paths in parallel. When stored according to a globally accessible scheme, each data path can access the entire memory, but the processing times are increased due to the increased number of reads needed to get data to each data processing path.
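By way of illustration only, the difference between the two storage schemes can be sketched with a simple read-count model. The function names, the round-robin split, and the assumption that a globally accessible memory serves one read per cycle are hypothetical and are not taken from any particular DSP core.

```python
# Illustrative model only. The names and the cost model are assumptions.
NUM_PATHS = 4  # e.g., four parallel data processing paths


def per_path_layout(samples, num_paths=NUM_PATHS):
    """Split the samples into one local bank per data path (round-robin)."""
    return [samples[p::num_paths] for p in range(num_paths)]


def cycles_per_path_model(samples, num_paths=NUM_PATHS):
    """Each path reads its own bank in parallel, so the cost is the largest bank."""
    banks = per_path_layout(samples, num_paths)
    return max(len(bank) for bank in banks)


def cycles_global_model(samples):
    """Assumed single shared port: reads are serialized, one read per cycle."""
    return len(samples)


samples = list(range(16))
banks = per_path_layout(samples)
parallel_cycles = cycles_per_path_model(samples)
global_cycles = cycles_global_model(samples)
```

Under these assumptions, sixteen samples take four read cycles in the per-data-path model and sixteen in the globally accessible scheme.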
Data processing instructions that are executed by the data processing paths are retrieved (i.e., fetched) from an instruction cache, decoded, and issued on a per-clock-cycle basis. Various known forms of data processing instructions are used to increase the speed and the amount of data that can be processed by the DSP core.
For example, it is known that super-scalar processing architectures allow for multiple simultaneous instruction fetches and decodes. However, often not all the fetched instructions are allowed to be issued to the respective execution units. Hardware performs a series of checks to determine which instructions can be issued simultaneously. These checks adversely impact the clock speed and processing of the DSP core.
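As a minimal sketch of the kind of check involved, the following models an in-order issue stage that stops at the first fetched instruction depending on a result produced earlier in the same group. The `Instr` type, the register names, and the restriction to read-after-write and write-after-write hazards are assumptions made for illustration; real hardware checks are more extensive.

```python
from typing import NamedTuple, Tuple


class Instr(NamedTuple):
    """Hypothetical instruction: one destination and some source registers."""
    dest: str
    srcs: Tuple[str, ...]


def issuable_prefix(fetched):
    """Return the leading instructions that can issue in the same cycle."""
    issued = []
    written = set()  # destinations of instructions already in this group
    for ins in fetched:
        reads_pending = any(s in written for s in ins.srcs)  # read-after-write
        rewrites = ins.dest in written                       # write-after-write
        if reads_pending or rewrites:
            break  # in-order issue: stop at the first conflict
        issued.append(ins)
        written.add(ins.dest)
    return issued


fetched = [
    Instr("r1", ("r2", "r3")),
    Instr("r4", ("r5", "r6")),
    Instr("r7", ("r1", "r4")),   # needs the results of the first two
    Instr("r8", ("r9", "r10")),
]
group = issuable_prefix(fetched)
```

In this example four instructions are fetched but only the first two are issued together, which reflects the situation described above.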
Another processing architecture, known as the Very Long Instruction Word (VLIW) architecture, extends the number of instructions that can be fetched and issued simultaneously beyond what the super-scalar architecture allows. Similar to super-scalar processing, multiple simultaneous instruction fetches and decodes are performed. However, in a VLIW architecture, the checks to determine whether instructions can be issued simultaneously are typically performed by the compiler rather than by hardware.
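For illustration, the compile-time grouping can be sketched as a greedy, in-order packer that closes a bundle whenever the next instruction conflicts with one already in it or the bundle is full. The `pack_bundles` name, the bundle width of four, and the simplified hazard rule are assumptions; production compilers use considerably more elaborate scheduling.

```python
def pack_bundles(program, width=4):
    """Greedily pack (dest, srcs) instructions into bundles of independent ops."""
    bundles = []
    current = []
    written = set()  # destinations written inside the current bundle
    for dest, srcs in program:
        conflict = dest in written or any(s in written for s in srcs)
        if conflict or len(current) == width:
            bundles.append(current)      # close the bundle and start a new one
            current, written = [], set()
        current.append((dest, srcs))
        written.add(dest)
    if current:
        bundles.append(current)
    return bundles


program = [
    ("r1", ("r2", "r3")),
    ("r4", ("r5", "r6")),
    ("r7", ("r1", "r4")),   # depends on the first two instructions
    ("r8", ("r9", "r10")),
]
bundles = pack_bundles(program)
```

Because the grouping is fixed before execution, hardware in this sketch would issue each bundle without repeating the dependency checks.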
Still another known approach to improving the throughput of DSPs is the use of the Dense Instruction Word (DIW) architecture. In contrast to super-scalar and VLIW architectures, in which the execution units in the data processing path are in parallel, the execution units in the DIW architecture are ordered in a sequential pipeline. A single instruction word defines an operation for each of the (e.g., four) sequential processing stages as the data progresses through the pipeline. As such, up to four operations per instruction are performed on the data. However, a separate instruction is needed to load new data to be processed by the data processing path, thus slowing the overall processing speed of the DSP core.
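A minimal sketch of this arrangement, viewed from a single data item, is given below. The opcode names, the `(opcode, operand)` encoding of each stage, and the `execute_diw` function are hypothetical, and the sketch does not model the overlap in time of successive data items in the pipeline.

```python
import operator

# Hypothetical per-stage operations. "nop" passes the value through unchanged.
OPS = {
    "add": operator.add,
    "sub": operator.sub,
    "mul": operator.mul,
    "nop": lambda value, operand: value,
}


def execute_diw(word, value):
    """Apply the stage operations of one instruction word, in pipeline order."""
    for opcode, operand in word:
        value = OPS[opcode](value, operand)
    return value


# One instruction word naming an operation for each of four sequential stages.
word = [("add", 3), ("mul", 2), ("sub", 1), ("nop", 0)]
result = execute_diw(word, 5)  # ((5 + 3) * 2) - 1, then passed through
```

The input value passed to `execute_diw` stands in for data that, as noted above, would have to be loaded by a separate instruction.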