1. Field of the Invention
The present invention relates to processors and methods for processing signals that can be implemented using said processors.
2. Description of the Related Art
In cell-phone systems of the second generation (for example GSM) or of a more advanced type (GPRS, EDGE, UMTS), the most widely used architecture consists of a system made up of two processors. The first processor, which is specialized in handling the part with the largest computational burden, typically consists of a Digital Signal Processor or DSP. The other processor, with tasks of control, synchronization and execution of high-level applications, is typically configured as a CPU.
An example of architecture of this sort is illustrated in FIG. 1, where the aforesaid processors, designated respectively by DSP and CPU 1, are illustrated together with the cache memories associated thereto, namely together with instruction cache memories I$ and data cache memories D$, respectively.
Designated by CMC are the interface modules, referred to as Core Memory Controllers, which enable the two sub-systems coming under the two processors DSP and CPU 1 to interface, by means of a main bus B, with one another, with the main system memory MEM and with the various peripheral units P1, P2, P3, P4, . . . associated with the system.
The specific application in the telephony sector is, on the other hand, referred to herein purely in order to provide an example and consequently does not imply, even indirectly, any limitation of the altogether general character of the invention described in what follows. The said invention may, in fact, be applied in all those fields in which it may be useful or advantageous to employ a microprocessor.
With reference to the diagram of FIG. 1, the CPU 1 is typically a 32-bit pipelined scalar microprocessor. By “pipelined scalar” is meant that its internal architecture is made up of different logic stages, each of which contains an instruction in a very specific state. The said state may be that of:
fetching of the instruction from the memory,
decoding of the instruction,
addressing of a register file,
execution,
writing/reading of data from the memory.
The number of bits on which the CPU 1 operates is related to the width of the data on which the machine is operating. The instructions are generated and executed in turn, in a specific order defined by compiling.
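Purely by way of illustration, the occupancy of the stages of a pipelined scalar machine may be sketched as follows. The sketch assumes an ideal pipeline with no stalls in which one instruction enters per clock cycle; the function name pipeline_occupancy and the instruction labels are hypothetical and are not taken from any actual processor.

```python
# Stage names follow the five states listed above; the pipeline is assumed
# ideal (no stalls, one new instruction fetched per cycle).
STAGES = ["fetch", "decode", "register-file addressing", "execute", "memory read/write"]

def pipeline_occupancy(instructions, stages=STAGES):
    """Return one dict per clock cycle mapping each stage to the instruction
    it currently holds, or None if the stage is empty."""
    n_cycles = len(instructions) + len(stages) - 1
    timeline = []
    for cycle in range(n_cycles):
        row = {}
        for depth, stage in enumerate(stages):
            index = cycle - depth  # instruction that entered 'depth' cycles ago
            row[stage] = instructions[index] if 0 <= index < len(instructions) else None
        timeline.append(row)
    return timeline

timeline = pipeline_occupancy(["i0", "i1", "i2"])
for cycle, row in enumerate(timeline):
    print(cycle, row)
```

In this sketch each stage holds at most one instruction per cycle, and the instructions leave the pipeline in the same order in which they entered it.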
The other processor, designated by DSP, is typically a superscalar microprocessor or 128-bit pipelined VLIW (acronym for Very Long Instruction Word) microprocessor.
“Pipelined superscalar” means that its internal architecture is made up of different logic stages, some of which are able to execute instructions in parallel, for example in the execution step. Typically, the parallelism is of four instructions each (equal to 128 bits) whilst the data are expressed in 32 bits.
The processor is said to be superscalar if the instructions are re-ordered dynamically in the execution step in order to supply the execution stages which can potentially work in parallel, also altering the order generated statically by compiling of the source code, if the instructions do not present any mutual dependence. The main disadvantage of this approach lies in the complexity of the resulting machine, in which the logic of scheduling of the instructions may prove one of the most important parts in terms of number of gates.
The term VLIW processor is used if the instructions are re-ordered statically in the compiling step and executed in the pre-set order, which is not modifiable in the execution step. The advantage of the said approach is that it eliminates all the logic of management of the scheduling since this task is performed during compiling.
The main disadvantage lies in the fact that the compiled code is strictly dependent upon the implementation of the machine on which it is executed. For example, given the same instruction-set architecture (ISA), a machine with N execution units cannot execute a compiled code for a machine with K execution units if K is not equal to N. From this it follows that there is no “binary compatibility” between different generations of processors with the same ISA.
It is to be recalled that by “binary compatibility” is meant the property existing between a group of processors each of which is able to execute one and the same binary machine-code datum.
Likewise, it is not possible to create multiprocessor systems (each with a different number of execution units), which can change processes in the course of execution.
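The dependence of statically scheduled code upon the number of execution units may be illustrated by the following minimal sketch. It assumes a greedy, in-order packing of instructions into bundles; the function pack_bundles, the tuple layout (name, destination register, source registers) and the sample program are hypothetical, and the packing rule is a simplification rather than the scheduling algorithm of any particular compiler.

```python
def pack_bundles(instructions, width):
    """Pack instructions, in program order, into bundles of at most 'width'
    instructions. A new bundle is started when the current one is full or when
    the next instruction reads or writes a register already written in the
    current bundle."""
    bundles = []
    current, written = [], set()
    for name, dest, sources in instructions:
        depends = dest in written or any(src in written for src in sources)
        if len(current) == width or depends:
            bundles.append(current)
            current, written = [], set()
        current.append(name)
        written.add(dest)
    if current:
        bundles.append(current)
    return bundles

program = [
    ("add", "r1", ("r2", "r3")),
    ("mul", "r4", ("r5", "r6")),
    ("sub", "r7", ("r1", "r4")),   # reads the results of add and mul
    ("ld",  "r8", ("r9",)),
    ("or",  "r10", ("r11", "r12")),
]

print(pack_bundles(program, 4))  # [['add', 'mul'], ['sub', 'ld', 'or']]
print(pack_bundles(program, 2))  # [['add', 'mul'], ['sub', 'ld'], ['or']]
```

The same source program yields different bundle sequences for a width of four and a width of two, which is why, in this simplified picture, code compiled for one width cannot be issued unchanged on a machine of another width.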
In the diagram of FIG. 1, each processor possesses its own data cache D$ and its own instruction cache I$, so as to be able to load from the main memory MEM both the data on which to operate and the instructions to be executed in parallel. Since the two processors CPU 1 and DSP are connected to the main memory MEM through the system bus B, the two processors are typically found competing for access to said memory when an instruction and/or the data on which they are to operate must be located in the main memory, the said instruction or data not being available in their own caches.
A system based upon the architecture represented in FIG. 1 has a sharing of work and of processes that is rigid and not modifiable, such as to render asymmetrical the workload and the software programs to be executed.
By way of reference, a processor such as the CPU 1 usually possesses 16 Kbytes of data cache and 16 Kbytes of instruction cache, whereas the DSP usually possesses 32 Kbytes of data cache and 32 Kbytes of instruction cache.
The flowchart of FIG. 2 illustrates the logic diagram of the CPU described from top to bottom. The first stage, designated by 10, generates the memory address to which the instruction to be executed is associated, the said address being referred to as program counter. The stage 10 is hence configured typically as a fetch stage, whilst the instruction thus loaded is decoded in the stage 12 separating the bit field which defines its function (for example, addition of 2 values contained in two registers located in the register file) with respect to the bit fields which address the operands. The said addresses are sent to a register file from which (in a stage designated by 14) are read the operands of the instruction. The operands and the bits which define the function to be executed are sent to the execution unit which, in a stage 16, performs the desired operation, for example the operation of addition referred to previously. The result can thus be re-stored in the register file in a stage 18 currently called write-back stage.
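The sequence of stages 10 to 18 described above may be sketched as follows. The 32-bit instruction layout used here (8-bit function field followed by three 8-bit register addresses), the opcode values and the function names are assumptions made solely for illustration and do not correspond to the instruction set of any actual processor.

```python
# Hypothetical encoding: [31:24] function, [23:16] destination register,
# [15:8] first operand register, [7:0] second operand register.
OP_ADD, OP_SUB = 0x01, 0x02

def encode(op, rd, rs1, rs2):
    return (op << 24) | (rd << 16) | (rs1 << 8) | rs2

def decode(word):
    """Separate the function field from the fields addressing the operands."""
    return (word >> 24) & 0xFF, (word >> 16) & 0xFF, (word >> 8) & 0xFF, word & 0xFF

def step(pc, program, regs):
    word = program[pc]                    # stage 10: fetch at the program counter
    op, rd, rs1, rs2 = decode(word)       # stage 12: decode
    a, b = regs[rs1], regs[rs2]           # stage 14: read operands from the register file
    if op == OP_ADD:                      # stage 16: execute
        result = (a + b) & 0xFFFFFFFF
    elif op == OP_SUB:
        result = (a - b) & 0xFFFFFFFF
    else:
        raise ValueError(f"unknown function field {op:#x}")
    regs[rd] = result                     # stage 18: write-back
    return pc + 1                         # next program counter

regs = [0] * 8
regs[1], regs[2] = 5, 7
program = [encode(OP_ADD, 3, 1, 2), encode(OP_SUB, 4, 3, 1)]
pc = 0
while pc < len(program):
    pc = step(pc, program, regs)
print(regs[3], regs[4])  # 12 7
```

For simplicity the sketch completes one instruction before fetching the next, whereas in a pipelined machine the stages operate on different instructions at the same time.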
The process schematically represented in FIG. 2 operates in combination with a load/store unit which enables reading/writing of any possible data in memory with the aid of specific instructions dedicated to the purpose.
It may be readily appreciated that the set of instructions is in biunique correspondence with a given microprocessing CPU architecture.
The flowchart of FIG. 3 shows, instead, the logic diagram of the DSP. Also in this case, there is envisaged an initial fetch stage 20, associated to which there is logically cascaded a stage 20a for issuing instructions. The reference number 22 designates, instead, a decoding stage whilst the reference number 24 designates a register file (see the stages 14 and 16 of FIG. 2). The reference number 28 designates a stage for re-storage in the register file, which is in general akin to the stage 18 of FIG. 2. In the diagram of FIG. 3 the reference number 26 designates collectively a plurality of execution stages that can be executed in parallel.
Both in FIG. 2 and in FIG. 3 the reference CW designates the branching lines of the control words.
It will be appreciated that the main difference between the diagram of FIG. 2 and the diagram of FIG. 3 is provided by the fact that the diagram of FIG. 3 envisages the possibility of working in parallel on different sets of instructions. Another difference lies in the fact that the diagram of FIG. 3 envisages the use of a greater number of execution units available, which can operate in parallel in a superscalar and VLIW processor. In both cases, the set of instructions is in biunique correspondence with a given microprocessing architecture.
Assuming that the two sets of instructions designed to be executed by the processors CPU 1 and DSP are different from one another (as is commonly the case with the architecture of wireless processors), it is understandable that instructions (and hence tasks to be executed) which can be executed by the processor CPU 1 cannot be executed by the DSP, and vice versa.
For the above to be possible, it is necessary to compile each process for each processor, thus increasing the memory of the program. Whenever a process is to be executed by a specific processor, it is then necessary to load and execute the code of the particular task that has been compiled for that processor. There is moreover encountered the problem linked to the fact of having to correlate the different points of partial execution of the programs when they are to be shifted from one processor to another (i.e., re-map the program counters correctly) and of having to convert all the processing data from the representation system of one processor to the representation system of another (for example, the contents of the state and general-purpose registers).
The above problems are difficult to solve, so that in general a process is compiled and executed on a single processor.
With reference to FIGS. 4 and 5, it is possible to consider a sequence of sets of instructions of said processes.
In general, two types of processes are distinguished, namely:
those corresponding to the operating system and to applications that use calls to functions of the operating system, and
those regarding the processing of multimedia (audio/video/graphic) contents.
Specifically, in the diagram of FIG. 4 the references OsTask 1.1, 1.2, etc. illustrate processes which can be executed by the processor CPU 1. The processes designated by MmTask2.1, MmTask2.2, MmTask2.3, identify, instead, processes compiled so as to be executed by the DSP.
Starting from the diagram of FIG. 4, which illustrates a possible assignment of the tasks to the two processors, it is immediately possible to return to the diagram of FIG. 5, which illustrates the corresponding flow of instructions.
Setting equal to one hundred the total time of execution of the processes, it is noted that the first processes typically last 10% of the time, whilst the second occupy a much greater part, corresponding to 90%.
Again, the first processes contain instructions generated by the compiler of the processor CPU 1 and hence can be executed by the latter, but not by the DSP. For the latter processes the situation is exactly complementary, in the sense that they contain instructions generated by the compiler of the DSP and can hence be executed by the said processor, but not by the other processor CPU 1.
It is moreover to be noted that the processor CPU 1 is characterized by a compiling flow of its own, which is independent of and distinct from that of the DSP.
Given the modest workload, it may be appreciated that the processor CPU 1 could even be turned off when not in use, so enabling a considerable energy saving.
The above hypothetical solution (switching-off of the processor CPU 1 when it is not being used) comes up, however, against the fact that the corresponding switching-off or powering-down procedures introduce additional processing latencies and these are added to the value of 10% mentioned previously. The aforesaid procedures envisage in fact:
switching off the processor CPU 1, except for the respective register file by gating the clock signal which supplies all the internal registers;
switching off the processor CPU 1 completely, except that power supply is maintained for the cache memories; and
switching off the CPU as a whole, including the data and instructions caches.
However, given that the state of the individual processor must be restored when the latter is turned back on following one of the operations referred to previously, the latencies introduced vary from tens of microseconds to tens or hundreds of milliseconds. The above latencies prove particularly costly, both from the energy standpoint and from the computational standpoint.
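The way in which the restore latencies add to the 10% share mentioned previously may be illustrated with the following arithmetic sketch. The function name, the number of wake-ups and the observation interval are hypothetical values chosen only for the example; the 10 ms latency is simply one value within the range indicated above.

```python
def cpu_time_share(busy_share, wakeups, restore_latency_s, total_time_s):
    """Fraction of the total time attributable to the CPU once the latency of
    each power-down/restore cycle is added to its useful workload."""
    overhead_share = wakeups * restore_latency_s / total_time_s
    return busy_share + overhead_share

# Assumed example: 10% useful workload, 5 wake-ups of 10 ms each in 1 s.
share = cpu_time_share(busy_share=0.10, wakeups=5,
                       restore_latency_s=0.010, total_time_s=1.0)
print(f"{share:.2%}")
```

Under these assumed figures the restore overhead alone amounts to half of the useful workload of the processor, which indicates why frequent power-down cycles may cancel out the expected saving.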
Finally, the DSP is forced to work at approximately 90% of its computational capacity. This implies an evident asymmetry in the workload of the processor CPU 1 as compared to the workload of the DSP, an asymmetry which is revealed also in the power-management algorithms, which are distinct for the two processors.