Evolved High-Speed Packet Access (HSPA+) is a wireless standard defined in 3GPP Release 7. This standard is set for wide adoption across existing infrastructure currently based on High-Speed Downlink Packet Access (HSDPA) to offer users extended bandwidth prior to the rollout of 3GPP Long Term Evolution (LTE). The HSPA+ standard potentially increases bandwidth more than 10-fold, from 14.4 Mbit/s up to 168 Mbit/s, which will significantly increase the associated processing burden within the system. This burden is most pronounced in Turbo Convolution Code (TCC) decoders, which are among the most computationally intensive parts of the HSPA+ system.
LTE uses a specially designed contention-free interleaver. This allows a large block of data to be partitioned into smaller segments, which can be processed in parallel without multiple processing units contending for access to the same data segment. In contrast, the interleaver used in TCC decoding for the 3GPP HSPA+ standard is not contention-free. Dividing data blocks across multiple processing units is extremely difficult because multiple processing units may require access to data from the same memory at the same time, creating contention. As a result, it is difficult to implement TCC decoding using a parallel architecture.
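The contention problem can be illustrated with a minimal sketch. It assumes a block of K = 40 bits split into equal windows across four processing units, with each window's results stored in its own memory bank, and it counts the steps in which two units address the same bank. The LTE case uses the quadratic permutation polynomial (QPP) interleaver with the parameters listed for K = 40 in 3GPP TS 36.212 (f1 = 3, f2 = 10). The second permutation is a plain row-column interleaver chosen only as a stand-in for an interleaver that is not contention-free; it is not the HSPA+ interleaver, and all function names are hypothetical.

```python
def qpp_interleaver(k, f1, f2):
    """LTE-style QPP interleaver: pi(i) = (f1*i + f2*i^2) mod k."""
    return [(f1 * i + f2 * i * i) % k for i in range(k)]


def row_column_interleaver(rows, cols):
    """Illustrative stand-in (NOT the HSPA+ interleaver):
    input index i is mapped to row (i mod rows), column (i // rows)."""
    return [(i % rows) * cols + i // rows for i in range(rows * cols)]


def count_conflict_steps(perm, num_units):
    """Count steps at which two or more units hit the same memory bank.

    Unit u processes positions u*W .. u*W + W - 1 in order, where
    W = len(perm) // num_units. The interleaved address perm[j] is
    assumed to live in bank perm[j] // W.
    """
    window = len(perm) // num_units
    conflicts = 0
    for step in range(window):
        banks = [perm[step + u * window] // window for u in range(num_units)]
        if len(set(banks)) < num_units:
            conflicts += 1
    return conflicts


lte = qpp_interleaver(40, 3, 10)
other = row_column_interleaver(4, 10)

print("QPP conflict steps:       ", count_conflict_steps(lte, 4))
print("row-column conflict steps:", count_conflict_steps(other, 4))
```

Under these assumptions the QPP permutation never sends two units to the same bank, whereas the stand-in permutation does so at every step, which is the situation that forces stalls or extra memory ports in a parallel decoder.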
In algorithms such as TCC decoding, the burden placed on the processing unit (PU) is great in terms of computational intensity and resource requirements. To produce an efficient solution, the throughput of the processing unit must be maximized, and utilization of the processing unit should therefore be as near to 100% as possible. These algorithms process each block for a certain number of iterations, typically 5 to 7. Each iteration consists of two half-iterations, which are referred to as SISO (Soft-Input Soft-Output) operations. One of the half-iterations performs SISO operations in an interleaved order while the other half-iteration performs SISO operations in a non-interleaved order. While the half-iterations may be performed in either order, for ease of explanation, the examples and embodiments are primarily described herein with reference to an iteration in which interleaved SISO operations are performed followed by non-interleaved SISO operations.
At the end of each SISO operation, a significant number of clock cycles are required to empty pipelines and finish write back of the so-called extrinsic data, which is exchanged between consecutive SISO operations. The extrinsic data from the preceding SISO operation is combined with the original input data for the next SISO operation. The next SISO operation is therefore unable to start until the write back of the extrinsic data has completed.
Consequently, if one processing unit is used to perform consecutive SISO operations on a data block, there will be a delay between the operations, during which the processing unit is not utilized. For example, FIG. 1 illustrates latency incurred in processing successive SISO operations on a data block by the same processing unit. The idle cycles 130 (shown with hatched lines) are incurred between each SISO operation of data block B1 during write back of extrinsic data. To fully utilize a processing unit, two data blocks, which can be iteratively processed in turn, are required to avoid the delay incurred during write back. FIG. 2 illustrates alternating processing of two data blocks B1 and B2 on a processing unit for two decoding iterations (four SISO operations). After SISO processing of data block B1 by the processing unit is completed, extrinsic data must be written back to memory. During the write back time period, because data block B2 is not dependent on the extrinsic data of data block B1, the processing unit can begin a SISO operation of data block B2. In this manner, the idle processing time 130 depicted in FIG. 1 can be avoided and throughput can be increased.
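The difference between the schedules of FIG. 1 and FIG. 2 can be approximated with a simple timing model. The sketch below assumes that every SISO operation occupies the processing unit for a fixed number of cycles, that write back overlaps with processing of a different block, and that blocks are served in round-robin order. The cycle counts (1000 per SISO operation, 100 per write back) and the function name are hypothetical values chosen for illustration, not figures from the text.

```python
def pu_utilization(num_blocks, sisos_per_block, siso_cycles, writeback_cycles):
    """Fraction of time the processing unit is busy, measured up to the
    end of the last SISO operation.

    A block's next SISO operation may not start until the write back of
    its previous SISO operation has finished.
    """
    ready_at = [0] * num_blocks  # earliest start time for each block
    now = 0                      # time at which the PU becomes free
    busy = 0                     # cycles spent doing SISO work
    for _ in range(sisos_per_block):
        for block in range(num_blocks):
            start = max(now, ready_at[block])  # wait for write back if needed
            end = start + siso_cycles
            ready_at[block] = end + writeback_cycles
            busy += siso_cycles
            now = end
    return busy / now


# Two decoding iterations = four SISO operations per block.
single = pu_utilization(1, 4, siso_cycles=1000, writeback_cycles=100)
paired = pu_utilization(2, 4, siso_cycles=1000, writeback_cycles=100)

print(f"one block  (as in FIG. 1): {single:.3f}")
print(f"two blocks (as in FIG. 2): {paired:.3f}")
```

In this model a single block leaves the unit idle for every write back between its SISO operations, while alternating two blocks hides the write back entirely as long as it is no longer than one SISO operation.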
To accelerate block-based processing algorithms, such as TCC decoding, a plurality of processing units may be used to process data blocks in parallel. One possible architecture avoids idle cycles, as shown in FIG. 2, by buffering two blocks for each one of the plurality of processing units. However, this architecture does not scale well as processing units are added because the overall storage requirement is equal to 2 times the maximum block size for each processing unit in the system.
The memory requirement is further increased because an input buffer must be capable of supplying multiple blocks to all processing units in parallel. This situation may occur when processing of data blocks completes simultaneously on all processing units. Because the size of blocks typically varies from 40 to 5114 bits in HSPA+, this solution may also result in large discrepancies in the processing time required. For example, where two smaller blocks are being processed alternately on a processing unit, each processing stage will be completed in far less time than where two large blocks are being processed. As demands on throughput for algorithms such as TCC decoding for the 3GPP HSPA+ standard extend to 100-200 Mbit/s, such solutions become impractical: the number of processing units required will typically be in the range of 4 to 8, creating memory demands that most platforms and devices cannot meet.
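The scaling of the two-blocks-per-unit architecture described above can be made concrete with a short calculation. The sketch below counts buffered block positions only. The amount of memory in bits additionally depends on how many soft values are stored per position and on their word widths, which are not specified here, so they are left out. The processing-time ratio assumes that SISO time grows in proportion to block length, which is a simplifying assumption. Names are hypothetical.

```python
MIN_BLOCK_BITS = 40    # smallest HSPA+ block size cited in the text
MAX_BLOCK_BITS = 5114  # largest HSPA+ block size cited in the text


def buffered_positions(num_units, blocks_per_unit=2, max_block=MAX_BLOCK_BITS):
    """Block positions that must be buffered when every processing unit
    holds `blocks_per_unit` blocks of the maximum size."""
    return num_units * blocks_per_unit * max_block


for units in (1, 4, 8):
    print(f"{units} unit(s): {buffered_positions(units)} positions")

# Rough spread in per-block processing time, assuming time ~ block length.
time_ratio = MAX_BLOCK_BITS / MIN_BLOCK_BITS
print(f"largest/smallest block ratio: {time_ratio:.1f}")
```

The buffer grows linearly with the number of processing units, so moving from one unit to eight multiplies the storage by eight before the additional input buffering is taken into account.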