Currently, systems of comparable speed are custom-built with Application Specific Integrated Circuits (ASICs) that implement fixed algorithms, rendering them inflexible.
Several ASICs have been developed for front-end electronics. Until recently, front-end electronics were built with analog techniques using discrete components. Later, with rapid advances in digital technology, Digital Signal Processors (DSPs) replaced analog circuitry up to certain speeds. However, in many applications the user still had to design dedicated hardware to implement an algorithm on the front-end signals from a detector (or sensors), because DSPs were either not fast enough or not feasible.
2.1 Existing ASICs for front-end electronics
Several examples of different ASICs already built or currently under development can be found in the literature. For medical instruments, large companies such as Siemens, Philips, General Electric, Picker, and Positron have their own specific front-end circuits. A large variety of front-end ASICs are also under development in the HEP community, where there is a high demand for performance in speed and discernment of particular signals, coincidences, and pattern recognition among a large number of channels. These ASICs are built by several institutes, universities, and national and international laboratories. A partial list of experiments using ASICs at the front end includes:
At the European Organization for Nuclear Research (CERN), ASICs have been developed or are under development for the DELPHI, OPAL, L3, ALEPH, NA48, CMS, and ATLAS experiments. In the context of the research and development program at CERN, several ASICs are under development, such as RD27 and RD16 (digital front-end readout microsystem for calorimetry at LHC, Fermi, etc.).
At Fermilab, for the D0, CDF, and other experiments.
At Brookhaven National Laboratory, for the experiments at RHIC, i.e., STAR and PHENIX.
Most of these experiments have built or are building ASICs for first-level trigger or data reduction from several sub-detectors. Not all the circuits or ASICs provided in the references could be replaced by the 3D-Flow system.
Quick and flexible acquisition and exchange of data, but not necessarily in a fully bi-directional manner.
Possibility of dedicating a small area to program memory in favor of multiple processors per chip and multiple execution units per processor, data-driven components (FIFOs, buffers), and internal data memory. (Most algorithms that this system aims to solve are short and highly repetitive, thus requiring little program memory.)
Balance of data processing and data movement with very few external components.
Programmability and flexibility provided by enabling downloading of different algorithms into a program RAM memory.
High priority of modularity and scalability, permitting solutions for many different types and sizes of applications using regular connections and repeated components.
i) Several applications are described, ranging from medical imaging (PET/SPECT), to high energy physics (LHC-B electron and hadron identification from preshowers, electromagnetic, hadronic and pads detector compartment, and identification of muons from five pad-projective chambers), to industrial control in applications using video cameras, such as the example of the iterative search algorithm in an area of 5×5 pixels for photon counting.
ii) Three different algorithms (LHC-B electrons, LHC-B electrons and hadrons, and iterative search on a 5×5 pixel area), for which no programmable solution currently exists, have been simulated on the 3D-Flow simulator system; the details are reported herein at Sections 5.9.2, 5.9.3, and 5.9.4.
iii) Functional simulation at the transistor level providing to the input of the VHDL (the VHDL V-System Windows simulation system purchased from Model Technologies, provides a full VHDL environment on IBM PC (or compatible) running Windows '95 or Windows NT) processor model compiled in
2.2 Parallel processing in general
Some applications require concurrent processing because no available processor has sufficient speed to sustain the high demand of computing power in the allowed time using a sequential approach.
Parallelism increases the execution speed of a task and is in some cases more cost-effective; however, it raises a new set of complex and challenging problems.
Parallel processing comprises algorithms, computer architecture, programming, and performance analysis. There is a strong interaction between these aspects, and only global understanding allows designers to make the proper trade-offs in order to increase overall efficiency.
2.3 Pipelined systems in general, and well-known techniques
Pipelining is an implementation technique for building faster CPUs in which the execution of multiple instructions is overlapped.
An instruction can be divided into small steps, each taking a fraction of the time needed to complete the entire instruction. Each of these steps is called a pipe stage or a pipe segment. The stages are connected to one another to form a pipe. The instruction enters one end of the pipe and exits from the other. The throughput of a pipeline is determined by how often an instruction exits the pipeline. At each step, every stage executes its fraction of the task, passes its result on to the next stage, and receives new input from the previous stage. Because the stages are connected, they must all advance at the same time: each one sends data to and receives data from its neighboring stages simultaneously.
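The stage-by-stage behavior described above can be illustrated with a small simulation. This is only a minimal sketch under idealized assumptions (every stage takes exactly one clock cycle, there are no stalls, and a new instruction is available every cycle); the function name `simulate_pipeline` and its parameters are hypothetical and not part of any system described here.

```python
def simulate_pipeline(num_instructions, num_stages):
    """Return one snapshot per clock cycle of an idealized pipeline.

    Each snapshot lists which instruction index occupies each stage
    (None means the stage is empty). Assumes one cycle per stage, no stalls.
    """
    stages = [None] * num_stages
    timeline = []
    completed = 0
    next_instr = 0
    while completed < num_instructions:
        # A new instruction enters the first stage if any remain.
        incoming = next_instr if next_instr < num_instructions else None
        if incoming is not None:
            next_instr += 1
        # All stages advance together: each passes its work to the next stage.
        stages = [incoming] + stages[:-1]
        timeline.append(list(stages))
        # An instruction occupying the last stage exits at the end of this cycle.
        if stages[-1] is not None:
            completed += 1
    return timeline


if __name__ == "__main__":
    for cycle, snapshot in enumerate(simulate_pipeline(5, 4), start=1):
        print(cycle, snapshot)
```

With 5 instructions and 4 stages the simulation runs for 8 cycles (4 + 5 - 1), whereas executing the same instructions one after another without overlap would take 20 stage times. Once the pipe is full, one instruction exits per cycle, which is the throughput the text refers to.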
2.4 Existing combination of parallel processing and pipelining
The combination of parallel processing and pipeline implementation techniques increases the throughput performance of a system when the algorithm to be executed is divisible into several tasks that can be executed concurrently.
This technique is used in commercially available systems, but it is limited in its capacity to distribute processes to several processors while keeping the communication protocol efficient and minimizing overall task execution time.
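The throughput gain from combining the two techniques can be estimated with a back-of-the-envelope model. The sketch below assumes, purely for illustration, that the work items are independent, that they are dealt out evenly across identical pipelines, and that distributing them costs nothing; the communication overhead mentioned above is deliberately ignored, so real systems will do worse. The name `total_cycles` is hypothetical.

```python
import math


def total_cycles(num_items, num_stages, num_pipelines=1):
    """Idealized cycle count for num_items spread over identical pipelines.

    Assumes one cycle per stage, independent items, even distribution,
    and zero communication cost between processors.
    """
    if num_items == 0:
        return 0
    # The busiest pipeline determines when the whole job finishes.
    items_per_pipeline = math.ceil(num_items / num_pipelines)
    # Fill time (num_stages) plus one cycle per additional item.
    return num_stages + items_per_pipeline - 1


if __name__ == "__main__":
    for p in (1, 2, 4):
        print(p, total_cycles(1000, 4, p))
```

For 1000 items and 4-stage pipelines, this model gives 1003 cycles with one pipeline, 503 with two, and 253 with four. The near-linear gain holds only while the algorithm really can be split into concurrent tasks.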
Commercial systems such as Hypercube are suitable for solving general-purpose problems using a large number of standard microprocessors. These systems certainly have advantages in the execution of some algorithms that can be programmed for concurrent operations. However, they are limited in speed by the system protocol overhead and by the fact that they address general-purpose problems, which have obligatory serial sections.
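The effect of obligatory serial sections can be quantified with the bound commonly known as Amdahl's law. The text does not name this law; it is brought in here only as a standard way to illustrate the point. The sketch assumes a fixed workload in which a fraction `serial_fraction` cannot be parallelized and the remainder scales perfectly, and it ignores protocol overhead entirely. The function name `speedup` is hypothetical.

```python
def speedup(serial_fraction, num_processors):
    """Upper bound on speedup under Amdahl's law.

    serial_fraction: share of the run time that must execute serially (0..1).
    num_processors:  number of processors working on the parallel share.
    """
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if num_processors < 1:
        raise ValueError("num_processors must be at least 1")
    parallel_fraction = 1.0 - serial_fraction
    # The serial part is unchanged; only the parallel part shrinks.
    return 1.0 / (serial_fraction + parallel_fraction / num_processors)


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, round(speedup(0.1, n), 2))
```

With a 10% serial section, the bound can never exceed 10 no matter how many processors are added, which is why a large general-purpose multiprocessor may still fall short on problems of this kind.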