1. Field of the Invention
This invention relates to digital signal processing. More particularly, this invention relates to memory-, CPU-, and power-efficient execution of fast transforms.
2. Description of the Related Art
Many fast transforms (fast Fourier transform, fast cosine transform, etc.) are known in the literature. The fast Fourier transform is discussed below as an example, but the invention is not limited thereto.
Traditionally, as shown in FIG. 7, FFT stages are calculated sequentially on a programmable processor or in a time-multiplexed hardware solution, usually followed by a sequential iteration over each block and the butterflies involved. Three nested loops are considered: stage, block, butterfly. The access graph shown in FIG. 8 shows which addresses are used for calculation at each moment. The end points of each line represent which elements are used; the length of a line determines the current stage. FIG. 7 is the traditional representation of an FFT. FIG. 8 shows the access sequence in time. For example, the contents of addresses 0 and 16 are used first, followed by addresses 1 and 17, etc. In the second stage, addresses 0 and 8 are used, then 1 and 9, etc. The presented scheduling of the butterflies is not optimal when the power consumption of the hardware device on which the transform is executed is important.
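The conventional stage, block, butterfly loop nest described above can be illustrated by the following minimal Python sketch. It only lists which address pairs are touched and in what order; it computes no transform. The function name `conventional_schedule` and the tuple layout are illustrative assumptions and do not appear in the figures.

```python
def conventional_schedule(n):
    """List (stage, low, high) address pairs in stage-block-butterfly order.

    Assumes n is a power of two and a radix-2 transform; names are
    illustrative only.
    """
    pairs = []
    span = n // 2      # index difference between the two accessed elements
    stage = 0
    while span >= 1:                                  # stage loop
        for block_start in range(0, n, 2 * span):     # block loop
            for offset in range(span):                # butterfly loop
                low = block_start + offset
                pairs.append((stage, low, low + span))
        span //= 2
        stage += 1
    return pairs


if __name__ == "__main__":
    for entry in conventional_schedule(32)[:4]:
        print(entry)
```

For a 32-point transform the list starts with addresses 0 and 16, then 1 and 17, and the second stage starts with addresses 0 and 8, matching the access sequence described for FIG. 8.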
The invention presents memory access orderings, also denoted schedules, for fast transforms which are optimal with respect to power consumption and bus loading of the hardware device on which said transforms are executed.
The invention presents a method for minimizing memory space and power consumption in a signal processor when transforming a first m-dimensional indexed array with N elements into a second m-dimensional indexed array with M elements, said second array being a transform of said first array, said method comprising the steps of executing a plurality of butterfly codes, also denoted calculation method steps, each butterfly code being characterized by the elements of the indexed arrays accessed by said butterfly code, said method being characterized in that at least part of said butterfly codes are assigned to be part of at least one group of butterfly codes such that butterfly codes within one group are executed sequentially, meaning substantially close after each other in time, and said grouping of butterfly codes enables the use of storage spaces being substantially smaller than the storage space needed to store an array with N elements. The storage spaces used are either registers or a distributed memory configuration or a combination thereof. Note that M is often equal to N.
In a first aspect of the invention data locality improvement schedules are presented. In a first embodiment a full data locality improvement schedule is disclosed while in a second embodiment a partial data locality improvement schedule is shown.
The full data locality improvement schedule can be described as a method for transforming a first m-dimensional array into a second m-dimensional array, wherein said second array is a transform of said first array. Said transform can be a Fourier transform, a cosine transform, a Karhunen-Loève transform, or a Hartley transform, but is not limited thereto. Said method comprises the steps of executing a plurality of codes, also denoted butterfly codes, each code being characterized by the array variables or array elements it accesses. Said arrays are indexed, meaning that each element of said arrays can be referred to by its index number. Note that code or butterfly code means a calculation method which reads elements of an array and produces new elements, more particularly via operations such as complex additions, subtractions, multiplications and/or divisions. The multiplication factor is a parameter which can vary from execution to execution of said codes, while the operations are the same within said codes. Said execution is characterized in that said codes are scheduled by grouping, in pairs or sets, the codes with a maximal distance between the array variables accessed by said codes. Said pairs are placed in a predetermined ordering. In a particular example said predetermined ordering is address bit-reversed. Note that distance means the index difference of the elements of the array under consideration. Distance can also be denoted window. Maximal means having the largest index difference when considering all codes needed to describe the transformation under consideration. For each pair of said ordered codes, codes which access at least one array variable also accessed by one of the codes of said pair are determined. Note that not necessarily all such codes are selected.
Thus part of the codes which have a common access with one of the codes of said pair are selected. The selected codes can be denoted as codes being assigned to such a pair. The selected codes are ordered in a binary tree, with one of said pair of codes as top node, according to their related maximal distance, wherein a higher distance of such a code implies closer placement to the top node of said tree. For each pair of codes a binary tree can be constructed. The execution order of said ordered codes is determined by traversing said binary tree in a depth-first manner. Note that this is done for each pair. The improved data locality obtained by the above described scheduling is exploited because, during said execution of said scheduled codes, data is shared between at least part of said pairs of codes scheduled subsequently after one another. This data sharing is done via small memories, often denoted registers or register files, capable of storing a few elements, possibly a single element. Said small memories, also denoted foreground memory, are local to the datapath of the hardware device executing said application. Note that subsequently scheduled means scheduled soon after each other in time. Some intermediate operations are possible, depending on the number of elements that said data sharing memories can store. During said execution of said scheduled butterfly codes at least part of the accesses of elements accessed by said scheduled butterfly codes are accesses to a storage space capable of storing a few elements.
A first characteristic of the best mode realisation of the full data locality improvement ordering is that the butterflies with the largest index difference, also denoted top butterflies, are selected in address bit-reversed order. Suppose that the selected butterfly is characterized by the lowest address among the elements it accesses. Suppose one writes down the sequence 0, 1, 2, . . . in binary format with log2(N)−1 bits, with N the size of the transform, thus 0000, 0001, 0010, . . . for a 32-point Fourier transform as an example. This binary sequence is now transformed into another binary sequence by reversing the bit ordering, thus one obtains 0000, 1000, 0100, . . . or in normal format 0, 8, 4, . . . This sequence gives the lowest address number of the butterfly being selected, as can be observed in FIG. 4. A second characteristic of the best mode realisation of the full data locality improvement ordering is the tree-depth, meaning the number of levels in said binary trees, each level being characterized by an index difference which is the same for all butterflies represented by the nodes of that level. The tree-depth is given by log2(N)−[log2(N/2−x)], wherein x is the lowest address number of the butterfly defining the top node of the tree, hence the wording top butterflies, and [ ] means rounding the number between the brackets up to the next integer. An example is given in the table below, showing the actual location of the top butterflies, their characterizing address (lowest address) and the depth of their related tree.
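The two characteristics above can be checked numerically with the following minimal Python sketch. It assumes N is a power of two and reads the bracket operator [ ] as the mathematical ceiling (a value that is already an integer is left unchanged); that reading is an assumption. The function names are illustrative and do not come from the specification.

```python
def bit_reverse(value, bits):
    """Reverse the lowest `bits` bits of `value`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def top_butterfly_order(n):
    """Lowest addresses of the top butterflies in bit-reversed order."""
    bits = n.bit_length() - 2          # log2(n) - 1 for a power-of-two n
    return [bit_reverse(i, bits) for i in range(n // 2)]


def tree_depth(n, x):
    """Tree depth log2(n) - ceil(log2(n/2 - x)) using integer arithmetic.

    Assumption: [ ] in the text is a ceiling, so ceil(log2(k)) is
    computed as (k - 1).bit_length() for k >= 1.
    """
    log2_n = n.bit_length() - 1
    remaining = n // 2 - x
    return log2_n - (remaining - 1).bit_length()


if __name__ == "__main__":
    for x in top_butterfly_order(32):
        print(x, tree_depth(32, x))
```

For N = 32 the order begins 0, 8, 4, 12, and under the ceiling assumption the depths for those four addresses are 1, 2, 1 and 3, with the last top butterfly (address 15) reaching the full depth of 5.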
A partial data locality improvement schedule is described as a method for transforming a first m-dimensional array into a second m-dimensional array, said second array being a fast transform of said first array, said method comprising the steps of executing a plurality of butterfly codes, each butterfly code being characterized by the elements of the array accessed by said butterfly code, said execution being characterized in that said butterfly codes are scheduled by performing a step 1 and a repetition of step 2. In step 1, from the butterfly codes with a maximal index difference between the elements of the array accessed by said codes, half are selected and executed in a predetermined ordering, e.g. bit-reversed ordering. The other half of these butterfly codes are denoted non-executed butterfly codes. In step 2, an operation is performed for half of these non-executed butterfly codes, for each of them separately. For such a non-executed butterfly code, other butterfly codes are selected which have a minimal index difference of half of said maximal index difference and which access at least one element of an array also accessed by the non-executed butterfly code under consideration. Said non-executed butterfly code under consideration and said selected codes are then executed. This operation is thus performed for half of the non-executed butterfly codes. Step 2 is then repeated on the remaining half of the non-executed butterfly codes, with the minimal index difference decreasing by a factor of 2 for each repetition of step 2. In a further embodiment a binary tree scheduling approach is used for said selected codes. Naturally, data sharing via registers or foreground memory can again be used due to the data locality improvement.
The first embodiment using a full data locality improvement can be described as follows:
A method for transforming a first m-dimensional array into a second m-dimensional array, said second array being the Fourier transform of said first array, said method comprising the steps of executing a plurality of butterfly codes, each butterfly code being characterized by the elements of the array accessed by said butterfly code, said execution is characterized in that said butterfly codes are scheduled as follows:
From said butterfly codes these butterfly codes with a maximal index difference between the elements of the array accessed by said codes are selected and grouped in pairs and placed in a predetermined ordering;
For each pair of said selected and ordered butterfly codes, butterfly codes which access at least one element of said arrays being accessed also by one of the butterfly codes of a pair are determined and assigned to said pair.
For each pair of said selected and ordered butterfly codes, the assigned butterfly codes are ordered in a binary tree, with one butterfly code of said pair as top node, according to the index difference between the elements of the array accessed by said butterfly code, wherein a higher index difference of such butterfly code implies closer placement to the top node of said tree, and the execution order of said assigned butterfly codes is determined by traversing said binary tree in a depth-first manner.
The method described above wherein during said execution of said scheduled butterfly codes at least part of the accesses of elements accessed by said scheduled butterfly codes are accesses to storage spaces being capable of storing a single element or a few elements.
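The scheduling idea of the first embodiment can be approximated by the following Python sketch. It is not the exact pairing and tree construction recited above, and it is not claimed to reproduce FIG. 4: it is a simplified, dependency-driven stand-in that visits the top butterflies in bit-reversed order and, after each one, descends depth-first into every smaller-distance butterfly whose two inputs have already been produced. It assumes a radix-2 transform with N a power of two; all names are illustrative.

```python
def bit_reverse(value, bits):
    """Reverse the lowest `bits` bits of `value`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def full_locality_schedule(n):
    """Return (stage, low, high) triples in a locality-improved order.

    Simplified sketch: a greedy depth-first traversal driven by data
    dependencies, assumed here as a stand-in for the binary trees.
    """
    stages = n.bit_length() - 1
    done = set()       # (stage, low) of butterflies already executed
    order = []

    def low_address(stage, element):
        # Lowest address of the stage-`stage` butterfly touching `element`.
        span = (n // 2) >> stage
        return element & ~span

    def try_execute(stage, low):
        span = (n // 2) >> stage
        if (stage, low) in done:
            return
        if stage > 0:
            # Both inputs must have been produced by the previous stage.
            for element in (low, low + span):
                if (stage - 1, low_address(stage - 1, element)) not in done:
                    return
        done.add((stage, low))
        order.append((stage, low, low + span))
        if stage + 1 < stages:
            # Depth-first: continue with butterflies sharing an element.
            for element in (low, low + span):
                try_execute(stage + 1, low_address(stage + 1, element))

    bits = stages - 1
    for i in range(n // 2):
        try_execute(0, bit_reverse(i, bits))
    return order


if __name__ == "__main__":
    for entry in full_locality_schedule(8):
        print(entry)
```

In this sketch a butterfly of a later stage is issued immediately after the second of its two producers, so the shared elements could be kept in a small local storage instead of being re-read from the full array.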
The second embodiment partially using said data locality improvement can be described as follows:
A method for transforming a first m-dimensional array into a second m-dimensional array, said second array being a fast transform of said first array, said method comprising the steps of executing a plurality of butterfly codes, each butterfly code being characterized by the elements of the array accessed by said butterfly code, said execution is characterized in that said butterfly codes are scheduled as follows:
Step 1 From said butterfly codes half of these butterfly codes with a maximal index difference between the elements of the array accessed by said butterfly codes are selected and executed in a predetermined ordering;
Step 2 For each of half of the non-executed butterfly codes with a maximal index difference between the elements of the array accessed by said codes, butterfly codes are selected with a minimal index difference of half of said maximal index difference and which access at least one element of an array being accessed also by the non-executed butterfly code under consideration, said non-executed butterfly code under consideration and said selected codes are executed; and
Step 2 is repeated for half of the non-executed codes, but the minimal index difference decreases by a factor of 2 for each repetition of step 2.
The method described above wherein selected codes are ordered in a binary tree with as top node the last of the non-executed codes, said ordering in said binary tree being according to the index difference of said selected codes, wherein a higher index difference of such code implies closer placement to the top node of said tree, and the execution order of said selected codes is determined by traversing said binary tree in a depth-first manner.
The method described above wherein during said execution of said scheduled butterfly codes at least part of the accesses of elements accessed by said scheduled butterfly codes are accesses to storage spaces being capable of storing a single or a few elements.
In a second aspect of the invention improved in-place mapping schedules for performing a fast transform are presented. The invention can be described as a method for transforming a first m-dimensional array into a second m-dimensional array, said second array being a fast transform of said first array, said method comprising the steps of executing a plurality of codes, also denoted butterfly codes or calculation method steps, each code being characterized by the array variables or elements it accesses. Said execution is characterized in that said codes are scheduled such that codes accessing nearby array variables are grouped, codes within one group are executed sequentially, and said execution of codes exploits a memory being smaller than said array sizes. Nearby array variables or elements means elements of said arrays with an index difference between said elements smaller than a threshold value being at most (N/p), with N the number of elements in said arrays and p an integer larger than 1. A memory being smaller than said array sizes means a memory with a storage space for less than N elements. Note that both a sequential execution, wherein the groups are executed one after another, and a parallel execution, wherein groups are executed substantially simultaneously, are possible. Combinations of sequential and parallel execution are also possible. One can further specify for the sequential set-up that after finishing execution of a first group of codes, data within said memory is transferred before execution of a second group of codes is started. The minimal memory size is determined by the maximal distance between the array variables accessed while executing a group of codes. Said maximal distance is characterized by the index difference between said array variables or elements. The data transfer is between said memory and another memory having a size larger than said memory.
Said second aspect of the invention is not limited to grouping of said butterfly codes in one group of a single size. One can assign each individual butterfly code to a plurality of groups with different sizes. The size of a group means the maximal index difference of elements being accessed by butterfly codes within a group. Said size is determined by the so-called group-specific threshold value. With each group a group-specific memory is associated, with a minimal size determined by said threshold value. Indeed, execution of codes within such a group is done by accessing said group-specific memory. One can state that a group of codes is executed sequentially before executing another group of codes, and said groups of codes which, while executing, have the same maximal distance between the array variables accessed can exploit the same memory, meaning the group-specific memory. Thus at least part of said butterfly codes are assigned to be part of at least one group of butterfly codes such that butterfly codes within a group are butterfly codes accessing elements of said arrays with an index difference between said elements smaller than a group-specific threshold value being at most N/p. Then butterfly codes within one group are executed sequentially before executing butterfly codes of another group, and said execution of said groups of butterfly codes exploits a group-specific memory with a minimal storage space determined by said group-specific threshold value. The data transfer in between execution of groups of codes is then between memories of different sizes, more in particular the memory with its group-specific size and a memory with a larger size, possibly another group-specific memory or the main memory. The memory configuration is denoted a distributed memory configuration and can be, but is not limited to, a hierarchical memory architecture.
Note that said in-place execution is characterized in that at least part of said codes are scheduled such that codes accessing nearby array variables are executed nearby in time.
An embodiment of said in-place mapping approach can be formalized as follows:
A method for transforming a first m-dimensional indexed array with N elements into a second m-dimensional indexed array with N elements, said second array being a fast transform of said first array, said method comprising the steps of executing a plurality of butterfly codes, each butterfly code being characterized by the elements of the indexed arrays accessed by said butterfly code, said execution is characterized in that at least part of said butterfly codes are scheduled such that butterfly codes accessing elements of said arrays with an index difference between said elements smaller than a threshold value being at most N/p with p an integer being larger than 1 are assigned to a group, butterfly codes within one group are executed sequentially before executing butterfly codes of another group and said execution of said groups of butterfly codes accesses only a first memory with a storage space for less than N elements.
The method described above, wherein after finishing execution of a first group of butterfly codes a data transfer between said first memory and a second memory being larger than said first memory is performed before execution of a second group of butterfly codes is started.
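The in-place mapping approach formalized above can be sketched in Python as follows. The sketch assumes a radix-2 decimation-in-frequency FFT with N and p powers of two, a single group size N/p, and sequential execution of the groups; these choices, the list `local` standing in for the first (small) memory, the list `main` standing in for the second (larger) memory, and all function names are illustrative assumptions rather than details taken from the specification.

```python
import cmath


def dif_stage(buf, span):
    """Apply one radix-2 DIF stage with index difference `span`, in place."""
    for block_start in range(0, len(buf), 2 * span):
        for j in range(span):
            twiddle = cmath.exp(-1j * cmath.pi * j / span)
            low = block_start + j
            a, b = buf[low], buf[low + span]
            buf[low] = a + b
            buf[low + span] = (a - b) * twiddle


def fft_grouped(data, p):
    """FFT whose small-distance butterflies run group by group.

    Returns the spectrum in bit-reversed order, as a DIF FFT does.
    """
    n = len(data)
    group_size = n // p            # threshold value and size of `local`
    main = list(data)              # second, larger memory
    span = n // 2
    while span >= group_size:      # butterflies not below the threshold
        dif_stage(main, span)
        span //= 2
    for start in range(0, n, group_size):
        local = main[start:start + group_size]    # transfer in
        s = group_size // 2
        while s >= 1:              # all butterflies of this group
            dif_stage(local, s)
            s //= 2
        main[start:start + group_size] = local    # transfer back
    return main


def to_natural_order(values):
    """Undo the bit-reversed output order of the DIF FFT."""
    n = len(values)
    bits = n.bit_length() - 1
    out = [0j] * n
    for i, v in enumerate(values):
        r, x = 0, i
        for _ in range(bits):
            r = (r << 1) | (x & 1)
            x >>= 1
        out[r] = v
    return out


if __name__ == "__main__":
    spectrum = to_natural_order(fft_grouped([1, 0, 0, 0, 0, 0, 0, 0], 2))
    print(spectrum)
```

With N = 32 and p = 4, for example, the stages with index differences 4, 2 and 1 are executed entirely inside an 8-element buffer, one group after another, with a data transfer to and from the larger memory between groups.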