As the tasks that computing systems are required to perform increase in complexity, the burdens on the central processing unit (CPU), the size of the system memory, and the traffic on the system address and data buses all correspondingly increase. In particular, many types of tasks associated with matrix mathematics, speech synthesis, image signal processing, and digital signal processing are computationally intensive, often requiring the execution of a large number of basic arithmetic operations before a final result is obtained. For example, calculation of dot products is often required in digital signal processing applications. The calculation of a dot product requires a number of multiplications and additions, each of which in conventional processing systems must be performed by the system CPU. In addition, the resulting intermediate sums and products must be stored in and retrieved from memory as the operations proceed. Thus, the CPU becomes burdened not only with the task of performing all arithmetic operations but also with the task of controlling the transfer of data to and from memory. The memory in turn must be large enough to hold the initial raw data and all the intermediate results. Finally, even if multiple CPUs are used, the traffic on the associated address and data buses is substantial as addresses, data and results are exchanged.
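By way of illustration only, the burden described above can be sketched in Python for a dot product of two n-element vectors. The function name `dot_product_with_counts` and the cost model are assumptions made for this example and are not taken from any particular system: the sketch assumes that the running sum is kept in memory, so that each step reads two operands and the prior sum and writes the updated sum back.

```python
def dot_product_with_counts(a, b):
    """Compute a dot product the conventional way and tally the work done.

    Illustrative cost model (an assumption, not a measurement): the running
    sum lives in memory, so every step reads it and writes it back.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")

    counts = {"multiplies": 0, "adds": 0, "memory_reads": 0, "memory_writes": 0}
    running_sum = 0  # intermediate result, assumed to be held in memory

    for x, y in zip(a, b):
        counts["memory_reads"] += 2   # fetch one operand from each vector
        product = x * y
        counts["multiplies"] += 1

        counts["memory_reads"] += 1   # fetch the prior intermediate sum
        running_sum = running_sum + product
        counts["adds"] += 1
        counts["memory_writes"] += 1  # store the updated intermediate sum

    return running_sum, counts


result, counts = dot_product_with_counts([1, 2, 3, 4], [5, 6, 7, 8])
print(result, counts)
```

Under this assumed model, an n-element dot product costs n multiplications, n additions, 3n memory reads and n memory writes, all of which pass through the CPU and across the address and data buses.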
Thus, the need has arisen for apparatus, systems and methods which more efficiently handle computationally intensive applications. Such apparatus, systems and methods should ease CPU task burdens, minimize the amount of memory required, and use bus bandwidth efficiently. Further, such apparatus, systems and methods should be compatible with currently available device and system configurations.