Programmable logic devices (“PLDs”) are a well-known type of integrated circuit that can be programmed to perform specified logic functions. One type of PLD, the field programmable gate array (“FPGA”), typically includes an array of programmable tiles. These programmable tiles can include, for example, input/output blocks (“IOBs”), configurable logic blocks (“CLBs”), dedicated random access memory blocks (“BRAMs”), multipliers, digital signal processing blocks (“DSPs”), processors, clock managers, delay lock loops (“DLLs”), and so forth. As used herein, “include” and “including” mean including without limitation.
Each programmable tile typically includes both programmable interconnect and programmable logic. The programmable interconnect typically includes a large number of interconnect lines of varying lengths interconnected by programmable interconnect points (“PIPs”). The programmable logic implements the logic of a user design using programmable elements that can include, for example, function generators, registers, arithmetic logic, and so forth.
The programmable interconnect and programmable logic are typically programmed by loading a stream of configuration data into internal configuration memory cells that define how the programmable elements are configured. The configuration data can be read from memory (e.g., from an external PROM) or written into the FPGA by an external device. The collective states of the individual memory cells then determine the function of the FPGA.
Another type of PLD is the Complex Programmable Logic Device, or CPLD. A CPLD includes two or more “function blocks” connected together and to input/output (“I/O”) resources by an interconnect switch matrix. Each function block of the CPLD includes a two-level AND/OR structure similar to those used in Programmable Logic Arrays (“PLAs”) and Programmable Array Logic (“PAL”) devices. In CPLDs, configuration data is typically stored on-chip in non-volatile memory. In some CPLDs, configuration data is stored on-chip in non-volatile memory, then downloaded to volatile memory as part of an initial configuration (programming) sequence.
For all of these PLDs, the functionality of the device is controlled by data bits provided to the device for that purpose. The data bits can be stored in volatile memory (e.g., static memory cells, as in FPGAs and some CPLDs), in non-volatile memory (e.g., FLASH memory, as in some CPLDs), or in any other type of memory cell.
Other PLDs are programmed by applying a processing layer, such as a metal layer, that programmably interconnects the various elements on the device. These PLDs are known as mask programmable devices. PLDs can also be implemented in other ways, e.g., using fuse or antifuse technology. The terms “PLD” and “programmable logic device” include but are not limited to these exemplary devices, as well as encompassing devices that are only partially programmable. For example, one type of PLD includes a combination of hard-coded transistor logic and a programmable switch fabric that programmably interconnects the hard-coded transistor logic.
A convolutional code decoder, such as a Turbo Code decoder, may employ a maximum a posteriori (“MAP”) algorithm. Additionally, Viterbi decoders may use a MAP algorithm. In a MAP algorithm, a calculation of the form shown in Equation (1) is repeatedly performed. The form of the calculation is:
sx(n+1) = sa(n)·ga(n) + sb(n)·gb(n),  (1)
where sx, sa, and sb represent state metrics, and where ga and gb represent branch metrics. Equation (1) is in the linear domain; however, signals may be represented in the log domain as described below in additional detail.
In the linear domain, there are two multiplications and an addition. Two previous state metrics, sa and sb, are respectively multiplied by two branch metrics, ga and gb, and the results of such multiplications are added to obtain a new state output sx(n+1).
Generally, MAP-based turbo decoders use an addition-compare-select-offset unit (“ACSO”) or an addition-compare-select unit (“ACS”) in alpha, beta, and LLR calculations. The number of such units depends on the constraint length of the convolutional code and how parallel the turbo decoder is. For example, in a Third Generation Partnership Project Long-Term Evolution (“3GPP LTE”) convolutional code decoder, there are eight states, namely a three “soft bit” convolutional code used to represent eight states. For purposes of clarity by way of example and not limitation, only a single ACSO is illustratively shown in several of the figures, as implementation of multiple ACSOs shall be understood by one of skill in the art from the description.
The probability for each of the states, such as for example eight states, is determined to evaluate bit state probability, e.g., the probability that a soft bit signal at an instant in time (“bit state”) represents a binary one or zero at that state. Generally, a previous state probability, e.g., a state metric, is multiplied by a probability of going from one state to another, e.g., a branch metric, to obtain a partial probability of being at a next state, namely the state immediately following the previous state, for a bit state. The partial probabilities are then added to obtain a probability of being at the next state for such bit state.
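The linear-domain update of Equation (1) can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `linear_state_update` and the numeric probabilities are hypothetical and do not come from any particular decoder.

```python
def linear_state_update(sa, ga, sb, gb):
    """Equation (1): two multiplications followed by one addition.

    sa, sb are previous state metrics (probabilities); ga, gb are the
    branch metrics (transition probabilities) into the next state.
    """
    partial_a = sa * ga  # partial probability via the first branch
    partial_b = sb * gb  # partial probability via the second branch
    return partial_a + partial_b

# Hypothetical values chosen only for illustration.
sx_next = linear_state_update(sa=0.25, ga=0.5, sb=0.5, gb=0.25)
print(sx_next)
```

In a decoder, this update would be repeated for every state and every trellis step, which is why the two multiplications per update are worth removing.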
Conventionally, the MAP algorithm is transformed into the log domain to remove the multiplications, yielding the log-MAP algorithm as represented in Equation (2):
Sx(n+1) = f(Sa(n)+Ga(n), Sb(n)+Gb(n)),  (2)
where Sx, Sa, and Sb represent log domain state metrics, and where Ga and Gb represent log domain branch metrics. Additionally, f( ) represents a function to implement addition in the log domain, and this addition is of the form:
f(d1,d2) = max(d1,d2) + log(1 + e^(−|d2−d1|)).  (3)
Thus, the multiplications in the linear domain become additions in the log domain, and the addition of partial probabilities, namely d1 and d2, is of the form in Equation (3), namely a maximum with a log term. The log term in Equation (3) is a correction factor that depends only on |d2−d1|, so it may be precomputed for various values of that difference, and such precomputed correction factors may be selectively obtained from a table of correction factors.
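As a minimal sketch of Equations (2) and (3), the following Python computes the log-domain update and checks it against the linear-domain form of Equation (1). The names `max_star` and `log_map_update` and the numeric values are assumptions made for illustration only.

```python
import math

def max_star(d1, d2):
    """Equation (3): log-domain addition, equal to log(e**d1 + e**d2)."""
    return max(d1, d2) + math.log1p(math.exp(-abs(d2 - d1)))

def log_map_update(Sa, Ga, Sb, Gb):
    """Equation (2): the multiplications of Equation (1) become additions."""
    return max_star(Sa + Ga, Sb + Gb)

# Hypothetical linear-domain metrics, used only to cross-check the result.
sa, ga, sb, gb = 0.25, 0.5, 0.5, 0.25
linear = sa * ga + sb * gb
log_domain = log_map_update(math.log(sa), math.log(ga),
                            math.log(sb), math.log(gb))

# Converting back from the log domain recovers the linear-domain result.
print(math.isclose(math.exp(log_domain), linear))
```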
In hardware, Equation (3) may be implemented as an ACSO 200 as illustratively shown in FIG. 2. ACSO 200 includes an add stage 210, a compare stage 220, a select stage 230, and an offset stage 240. The log term in Equation (3) may be implemented with a fixed lookup table (“LUT”) 201 having stored therein correction factors as part of select stage 230. The limited precision offered by LUT 201 slightly degrades error-correcting performance, and the resulting approximation is referred to as a max-star log-MAP algorithm result, namely Sx 202, which may be obtained from register 221.
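The four stages can be modeled behaviorally as below. This is a sketch under stated assumptions, not the circuit of FIG. 2: the table depth `LUT_SIZE`, the quantization step `LUT_STEP`, and the choice to precompute each entry at the center of its bin are all hypothetical.

```python
import math

LUT_STEP = 0.5  # hypothetical quantization step for |d1 - d2|
LUT_SIZE = 8    # hypothetical number of table entries

# Correction factors precomputed at the center of each bin (an assumption).
CORRECTION_LUT = [math.log1p(math.exp(-(i + 0.5) * LUT_STEP))
                  for i in range(LUT_SIZE)]

def acso(Sa, Ga, Sb, Gb):
    """Behavioral model of add, compare, select, and offset stages."""
    # Add stage: form the two candidate sums.
    d1 = Sa + Ga
    d2 = Sb + Gb
    # Compare stage: the sign and magnitude of the difference.
    diff = d1 - d2
    # Select stage: pick the larger sum and look up the correction factor.
    selected = d1 if diff >= 0 else d2
    index = int(abs(diff) / LUT_STEP)
    correction = CORRECTION_LUT[index] if index < LUT_SIZE else 0.0
    # Offset stage: add the correction factor to the selected sum.
    return selected + correction

print(acso(0.0, 1.0, 0.0, 0.8))
```

With these particular table parameters, the result stays within 0.125 of the exact value of Equation (3), which illustrates the limited precision mentioned above. A different table depth or step would give a different bound.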
A further approximation to this function is to drop the log correction term altogether, resulting in the max log-MAP algorithm of Equation (4):
Sx(n+1) = max(Sa(n)+Ga(n), Sb(n)+Gb(n)).  (4)
In hardware, the max log-MAP algorithm may be implemented as an add-compare-select unit (“ACS”) 300 as illustratively shown in FIG. 3. ACS 300 is the same as ACSO 200 of FIG. 2, except that offset stage 240 is effectively removed and a register 341 is added to and LUT 201 is dropped from select stage 230 for forming select stage 330. From an output port of register 341, a max log-MAP algorithm result, namely Sx 302, may be obtained. This hardware simplification results in an ACS unit that is smaller and faster than a similar ACSO unit. However, such an ACS does sacrifice some error-correcting performance in comparison to a similar ACSO unit.
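To make the trade-off concrete, the sketch below compares Equation (4) with the exact log-domain addition of Equation (3). The function names and input values are hypothetical. The term that is dropped always lies between zero and log(2): it is largest when the two sums are equal and shrinks as they move apart.

```python
import math

def acs(Sa, Ga, Sb, Gb):
    """Equation (4): add, compare, select, with no offset."""
    return max(Sa + Ga, Sb + Gb)

def max_star(d1, d2):
    """Equation (3), kept here as the exact reference."""
    return max(d1, d2) + math.log1p(math.exp(-abs(d2 - d1)))

def dropped_term(Sa, Ga, Sb, Gb):
    """Correction that the max log-MAP approximation leaves out."""
    return max_star(Sa + Ga, Sb + Gb) - acs(Sa, Ga, Sb, Gb)

print(dropped_term(0.0, 1.0, 0.0, 1.0))  # equal sums: the largest gap
print(dropped_term(0.0, 9.0, 0.0, 1.0))  # sums far apart: a small gap
```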
As versions of the log-MAP algorithm are iterative, each state metric output is iteratively fed back as an input for a next calculation. Here the use of the terms “iterative” and “iteration” refers to the iterative nature of the calculations. However, it should be understood that state metric outputs are fed back to form a series of calculations. A block of state metric calculations is known as “an iteration.” Thus, “an iteration” for turbo decoder calculations means a series or block of state metric calculations. ACS and ACSO latency is an aspect of performance. In some applications, ACS/ACSO latency may be part of a “critical” or “speed-limiting” path of a single cycle of a convolutional code decoder, such as a Turbo Decoder, and therefore such latency may dictate the maximum operating speed of such decoder. Conventionally, an ACSO unit has a higher latency than a similar ACS unit.
To increase operating speed, a pipelined ACS or ACSO unit may be employed. A conventional pipelined ACS 400 is illustratively shown in FIG. 4. ACS 400 includes add stage 410, compare stage 420 and select stage 430. Add stage 410 and compare stage 420 are respectively the same as add stage 210 and compare stage 220 of FIG. 3 for example, except outputs of each of the blocks in stages 410 and 420 are registered with respective registers for pipelining. Furthermore, select stage 430 is the same as select stage 330 of FIG. 3, except for the addition of another register stage, namely register 441.
ACS 400 has a latency of four clock cycles because of the addition of register 441 in select stage 430 coupled to receive output from register 341. Register 441 is added to aid feedback routing and allow injection of initialization values (not shown). ACS 400 having four register stages is capable of processing up to four independent state metric calculations at a time, namely four independent state metric calculations may be occurring within ACS 400 at a time in a pipelined multi-threaded manner. Furthermore, with the addition of a scheduler (not shown), each thread can be operated on separate code blocks where a code block is a convolutional code decoder and such code block may be implemented using a series of ACS 400 or ACS 400 coupled in parallel, or a combination of both.
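The pipelined multi-threaded operation can be illustrated with a toy behavioral model. It is a sketch under simplifying assumptions, not a model of FIG. 4: a single ACS feeds its own output back to both state metric inputs, initial state metrics are zero, and the names `run_interleaved` and `run_sequential` and the branch metric values are hypothetical. The four register stages are modeled as a four-slot queue, so a result re-emerges four clocks after it enters.

```python
from collections import deque

LATENCY = 4  # register stages in the pipelined ACS described above

def acs(Sa, Ga, Sb, Gb):
    """Max log-MAP update of Equation (4)."""
    return max(Sa + Ga, Sb + Gb)

def run_sequential(metrics):
    """Reference: one thread computed alone, one step after another."""
    S, out = 0.0, []
    for Ga, Gb in metrics:
        S = acs(S, Ga, S, Gb)
        out.append(S)
    return out

def run_interleaved(branch_metrics):
    """branch_metrics[t] lists the (Ga, Gb) pairs for thread t."""
    assert len(branch_metrics) == LATENCY
    steps = len(branch_metrics[0])
    # One slot per register stage, each holding (thread id, state metric).
    pipeline = deque((t, 0.0) for t in range(LATENCY))
    history = [[] for _ in range(LATENCY)]
    for _clock in range(steps * LATENCY):
        thread, S = pipeline.popleft()  # metric leaving the last register
        Ga, Gb = branch_metrics[thread][len(history[thread])]
        S_next = acs(S, Ga, S, Gb)
        history[thread].append(S_next)
        pipeline.append((thread, S_next))  # re-emerges LATENCY clocks later
    return history

# Hypothetical branch metrics for four independent threads, three steps each.
branch_metrics = [[(float(t + n), float(t - n)) for n in range(3)]
                  for t in range(LATENCY)]
interleaved = run_interleaved(branch_metrics)
print(interleaved == [run_sequential(m) for m in branch_metrics])
```

Each clock advances a different thread, so every thread progresses one step per four clocks while the unit itself stays fully occupied.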
A conventional pipelined ACSO 500 is illustratively shown in FIG. 5. ACSO 500 includes add stage 410, compare stage 420, select stage 530, and offset stage 540. Add stage 410 and compare stage 420 are respectively the same as those stages in FIG. 4. Select stage 530 is the same as select stage 230 of FIG. 2, except outputs of each of the blocks in stage 530 are registered with respective registers for pipelining. Offset stage 540 is the same as offset stage 240 of FIG. 2, except for the addition of another register stage, namely register 541.
A first stage of registers, namely first register stage 411, is located in add stage 410. A second register stage, namely register stage 412, is located in compare stage 420. A third register stage, namely register stage 413, is located in select stage 530. A fourth register stage, namely register stage 414, is provided by register 221. An additional register stage, namely a fifth register stage 415, is provided by register 541. Registers 221 and 541 are coupled in series and located in offset stage 540.
ACSO 500 has a latency of five clock cycles because of the addition of register 541 in offset stage 540 coupled to receive output from register 221. Register 541 is added to aid feedback routing and allow injection of initialization values (not shown). ACSO 500 having five register stages is capable of processing up to five independent state metric calculations at a time; in other words, five independent state metric calculations may be occurring within ACSO 500 at a time in a pipelined multi-threaded manner. Furthermore, with the addition of a scheduler (not shown), each thread can be operated on separate code blocks where a code block is a convolutional code decoder and such code block may be implemented using a series of ACSO 500 or ACSO 500 coupled in parallel, or a combination of both.
It should be understood that, in addition to ACSO 500 having a latency of five clock cycles, the odd number of clock cycles makes control and scheduling logic more complex. Complexity of such control and scheduling logic is conventionally reduced when the number of threads that may be processed is a binary number, namely a power of two. For example, 3GPP LTE code blocks all divide evenly into 4 or 8 threads, but do not divide evenly into 5 threads. A pipelined convolutional code decoder, such as a Turbo Code decoder, is therefore often difficult to implement with the error-correcting performance of a max-star log-MAP algorithm due to latency constraints.
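The divisibility point can be checked with a short script. The list of code block sizes below is an assumption: it is reconstructed from the turbo code internal interleaver sizes in 3GPP TS 36.212 as commonly summarized (40 to 6144 bits in steps of 8, 16, 32, and 64) and should be verified against the specification before being relied on. The function name `lte_code_block_sizes` is hypothetical.

```python
def lte_code_block_sizes():
    """Assumed LTE turbo code block sizes in bits (verify against TS 36.212)."""
    sizes = list(range(40, 512 + 1, 8))
    sizes += range(528, 1024 + 1, 16)
    sizes += range(1056, 2048 + 1, 32)
    sizes += range(2112, 6144 + 1, 64)
    return sizes

sizes = lte_code_block_sizes()
for threads in (4, 5, 8):
    divides_all = all(k % threads == 0 for k in sizes)
    print(threads, "threads divide every block size:", divides_all)
```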
Accordingly, it would be desirable and useful to provide an ACSO unit with reduced latency.