1. Technical Field
The present disclosure relates to the field of computers, and specifically to vector processing. Still more particularly, the present disclosure relates to scaling vector dot products, including, but not limited to, trigonometric-based vector dot products.
2. Description of the Related Art
In many areas of computing, a common calculation requires summing several results of trigonometric operations. Such applications include real-time physics simulations in games and obtaining a relatively accurate numerical approximation of the integral of a trigonometric function by numerical integration. The following equation performs numerical integration using the rectangle rule:
$$\int_a^b f(x)\,dx \approx \sum_{i=1}^{n} f(a + i\,\Delta x)\,\Delta x, \qquad \Delta x = \frac{b - a}{n}$$
For a sin( ) function, this equation becomes:
$$\int_a^b \sin(x)\,dx \approx \sum_{i=1}^{n} \sin(a + i\,\Delta x)\,\Delta x, \qquad \Delta x = \frac{b - a}{n}$$
The graph of this sine function is shown in FIG. 1 as graph 102.
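As an illustration only, the rectangle rule above can be sketched in Python. The function name `rectangle_rule` and its parameters are assumptions introduced for this sketch and do not appear in the disclosure. The loop samples the function at a + iΔx for i = 1 to n, scales each sample by Δx, and adds it to a running sum.

```python
import math


def rectangle_rule(f, a, b, n):
    """Approximate the integral of f over [a, b] with n right-endpoint rectangles.

    Hypothetical helper for illustration; mirrors sum_{i=1..n} f(a + i*dx) * dx.
    """
    dx = (b - a) / n          # width of each rectangle
    x = a
    total = 0.0
    for _ in range(n):
        x += dx               # advance to the next sample point a + i*dx
        y = f(x)              # evaluate the function at x
        total += dx * y       # scale by dx and add to the running sum
    return total


if __name__ == "__main__":
    # n = 16 samples of sin(x) over one full period, a = 0 to b = 2*pi
    print(rectangle_rule(math.sin, 0.0, 2.0 * math.pi, 16))
```

Over a full period the positive and negative samples of sin(x) cancel, so the result is zero up to floating-point rounding.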
With current scalar instructions and a numerical integration operation with n=16, integrating from a=0 to b=2π results in the following instructions being issued 16 times, as shown in the following assembly language pseudocode:
a: fadd  x, x, dx         # get the next x
b: fsin  y, x             # obtain the result of the function at x
c: fmadd sum, sum, dx, y  # scale and add to the running sum
For simplicity, this sequence is assumed not to be in a loop; instead, it is simply repeated 16 times. If the sequence were in a loop, the performance would be worse than shown. Assuming a floating-point pipeline latency of four cycles for each of the above dependent instructions, the example would take (9*16)+4=148 cycles to complete.
In the previous example, due to the inter-instruction dependency between the add instruction (An) and the sine instruction (Bn), and then between the sine instruction and the multiply add instruction (Cn), one iteration of the summation consumes nine cycles of latency. This is because the fadd for the next iteration (An+1) can start down the pipeline in the next cycle after the previous fmadd is issued, as seen in the chart 202 in FIG. 2. Then, the last multiply add instruction in the summation must be allowed to complete, which accounts for the additional four cycles. In addition, note that valuable temporary registers (such as y) must be used in this process.
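The cycle count above can be reproduced with a small model. This is a sketch under the stated assumptions only: a four-cycle latency, three dependent instructions per iteration, and the next iteration issuing one cycle after the previous fmadd. The function names and the generalization to other latencies or instruction counts are assumptions made for illustration and are not taken from the disclosure.

```python
def cycles_per_iteration(latency=4, dependent_instrs=3):
    """Cycles consumed by one iteration of the dependent chain.

    Each instruction after the first waits `latency` cycles for its
    predecessor's result; the next iteration's first instruction then
    issues one cycle after the last instruction of this iteration.
    """
    return (dependent_instrs - 1) * latency + 1


def scalar_cycles(iterations, latency=4, dependent_instrs=3):
    """Total cycles, following the accounting used in the text:
    per-iteration cost times the iteration count, plus one final
    pipeline latency for the last instruction to complete."""
    return cycles_per_iteration(latency, dependent_instrs) * iterations + latency


if __name__ == "__main__":
    print(cycles_per_iteration())   # per-iteration cost for the a/b/c sequence
    print(scalar_cycles(16))        # total for n = 16
```

With the default values, the model yields nine cycles per iteration and (9*16)+4=148 cycles in total, matching the figures given above.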