1. Field of the Invention
The present invention relates generally to processing systems, and more particularly to configuring a multi-processor system.
2. Description of the Prior Art
Computationally intensive applications, such as modeling nuclear weaponry, simulating pharmaceutical drug interactions, predicting weather patterns, and other scientific applications, require a large amount of processing power. General computing platforms or engines have been implemented to provide the computational power to perform those applications. Such general computing platforms typically include multiple single-chip processors (i.e., central processing units, or CPUs) arranged in a variety of different configurations. The number of CPUs and the interconnection topology typically define those general computing platforms.
To improve the functionality, reduce the cost, and increase the speed, among other goals, of general computing platforms, multiprocessors and their architectures are migrating onto a system-on-a-chip (SOC). However, conventional approaches to designing multiprocessor architectures focus on either the general programming environment or a particular application. Approaches that focus on the general programming environment cannot make many assumptions about (i.e., predict) the user's application, nor can they adapt their resources to optimize computations and communications in accordance with that application. This deficiency exists because applications vary widely and each often has requirements that vary dynamically over time, depending on the amount of resources required. Approaches that focus on one particular application, in turn, often provide high performance for only that application and are therefore inflexible to a user's changing needs. Further, the traditional approaches do not allow a user to optimize the amount of hardware for the user's specific application, resulting in a multiprocessor architecture with superfluous resources, among other deleterious effects.
Additionally, conventional approaches do not optimize communications among processors of a multiprocessor architecture for increased speeds and/or do not easily allow scalability of the processors of such an architecture. For example, one approach provides for “cache coherency,” which allows for creation of a programming model that is easier to use. With cache coherency, the programming model is similar to programming a uniprocessor. However, cache coherency is expensive in terms of hardware and does not scale well as the number of nodes increases. Scaling cache coherency beyond four nodes usually requires significant hardware complexity. In contrast, another approach provides for “message passing” to obtain a more scalable solution. But message passing typically requires users to learn a new programming model. Furthermore, message-passing machines and architectures often have additional hardware overhead because each processor element must have its own copy of the program for execution.
Some multiprocessor systems have used interface protocols, such as HyperTransport from the HyperTransport Technology Consortium of Sunnyvale, Calif., for communications between processors. Other examples of interface protocols used are Peripheral Component Interconnect (PCI) Express and RapidIO from the RapidIO Trade Association of Austin, Tex. These interface protocols have been primarily used in high-performance processing systems such as supercomputers, which are very expensive. The interface protocols have also been used in general purpose processing systems. In one example, a system used HyperTransport channels in an array of processors from Advanced Micro Devices, Inc. (AMD) of Sunnyvale, Calif. These general purpose processing systems are more expensive than embedded systems because the general purpose processing systems have to include additional functionality to run a variety of applications that may change dynamically.
Another prior communication solution is called Message Passing Interface (MPI). MPI is a standard for message passing in a parallel computing environment. In MPI, communications must first be set up between a source and destination. Then, the source sends messages to the destination, where every message specifies the source and destination. The cost of setting up the communications between the source and the destination is relatively small in terms of performance and processor cycles as compared with sending the message. However, one problem with MPI is that the communications between the source and destination are not guaranteed. Thus, some packets of data sent under MPI can get lost. Another problem with MPI is that there is no mechanism for a reservation of bandwidth.
Another prior art solution is called sockets. A socket is an application programming interface between a user application program and Transmission Control Protocol/Internet Protocol (TCP/IP). In TCP/IP, a current host initiates a connection to a specified host through a port. The specified host then accepts the connection from the current host through another port. Once the connection is established, the connection is bidirectional, and either host may read from or write to the other. Multiple hosts may also connect to a single host, which then queues the connections. One problem with this queuing is the delay that results from having many connections, which decreases overall application performance.
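The connect, accept, and bidirectional read/write sequence described above can be illustrated with a minimal sketch using Python's standard socket module on a single machine. The function names serve_once and roundtrip, the loopback address, and the upper-casing reply are assumptions made only for illustration and are not part of any prior art system discussed here.

```python
import socket
import threading


def serve_once(listener: socket.socket) -> None:
    """Accept one queued connection and reply with the upper-cased request."""
    conn, _peer = listener.accept()
    with conn:
        data = conn.recv(1024)
        conn.sendall(data.upper())   # the accepting host writes back on the same connection


def roundtrip(message: bytes) -> bytes:
    """Connect to a local listener, send a short message, and return the reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
        listener.bind(("127.0.0.1", 0))   # port 0: let the operating system pick a free port
        listener.listen(5)                # pending connections wait in a queue of this depth
        port = listener.getsockname()[1]

        worker = threading.Thread(target=serve_once, args=(listener,))
        worker.start()

        with socket.create_connection(("127.0.0.1", port)) as client:
            client.sendall(message)       # the connecting host writes ...
            reply = client.recv(1024)     # ... and reads over the same connection
        worker.join()
    return reply


if __name__ == "__main__":
    print(roundtrip(b"ping"))
```

The listen backlog in this sketch corresponds to the connection queue mentioned above: connections beyond the one currently being served wait in that queue, which is where the described delay arises.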
The combination of TCP and IP provides reliability over an unreliable network. If packets of data were lost in the IP layer, then TCP would require that the packets be resent. However, one problem with sockets is that in order to provide this reliability, large amounts of buffering are required. Another problem is that the operation of sockets is expensive in terms of performance and processing cycles. For example, the processor running sockets has to perform many communication functions that cost processor cycles.
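The cost of this reliability can be illustrated with a deliberately simplified sketch. It models a sender that keeps each packet buffered and retransmits it until delivery succeeds. Actual TCP uses sliding windows, acknowledgements, and timers, none of which are modeled here. The function send_reliably and the is_lost callback are hypothetical names introduced only for this illustration.

```python
from typing import Callable, List, Tuple


def send_reliably(
    packets: List[bytes], is_lost: Callable[[int], bool]
) -> Tuple[List[bytes], int]:
    """Deliver packets in order over a lossy channel.

    is_lost(n) reports whether the n-th transmission (counting from 0) is dropped.
    Returns the delivered packets and the total number of transmissions made.
    """
    delivered: List[bytes] = []
    transmissions = 0
    for packet in packets:
        # The sender must hold `packet` in a buffer until delivery succeeds.
        while True:
            attempt = transmissions
            transmissions += 1
            if not is_lost(attempt):
                delivered.append(packet)  # assumed: the acknowledgement itself is not lost
                break
    return delivered, transmissions


if __name__ == "__main__":
    # Assumed loss pattern: every even-numbered transmission is dropped.
    data, sent = send_reliably([b"a", b"b", b"c"], lambda n: n % 2 == 0)
    print(data, sent)
```

Under the assumed loss pattern, each packet is transmitted twice, so the sender performs twice the transmissions of an unreliable send and must retain every packet until it is known to have arrived.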
When running applications in a multiple processor environment, the applications need to be compiled into the executables that each processor will execute. Standard C and C++ compilers do not provide the required functionality for a multi-processor environment. One prior solution, VxWorks from Wind River, is an embedded operating system. In VxWorks, certain attributes of the multi-processor system can be specified before compilation. These attributes include the number of processors and which tasks are executed on which processors. Upon compilation, the operating system, boot code, and user application are all combined into a single executable. However, one problem is that VxWorks has only limited functionality for a multi-processor system. Specifically, the linker, debugger, and system description framework do not support multi-processor systems. For example, when a debugger is attached to a chip running VxWorks, the state of multiple processors cannot be seen.
After compilation, the multi-processor system needs to boot up from an inactive or reset state. In most multi-processor systems, each processor has a FLASH memory associated with the processor. For the boot process, the processor reads boot code from the FLASH memory and begins executing the boot code. The processor then configures itself based on the boot code. The processor then determines its processor number or identification and detects neighboring processors. Once configured, the processor transmits a message to a root processor indicating that the processor has completed the booting process. One problem is that not all multi-processor systems have FLASH memory associated with each processor. In such systems, there is no FLASH memory to store the boot code needed to begin the boot process.
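The conventional boot steps described above can be modeled in a short sketch, which also shows where the sequence fails when a processor has no FLASH memory. The Processor class, the topology table, and the root_inbox list are hypothetical stand-ins for hardware and are assumptions made only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Processor:
    proc_id: int
    flash: Optional[bytes]                 # boot code image, or None if no FLASH is attached
    neighbors: List[int] = field(default_factory=list)
    configured: bool = False


def boot(proc: Processor, topology: Dict[int, List[int]], root_inbox: List[str]) -> bool:
    """Run the conventional boot steps; return False if there is no boot code to read."""
    if proc.flash is None:
        return False                       # no FLASH, so there is nothing to boot from
    boot_code = proc.flash                 # 1. read boot code from FLASH
    proc.configured = len(boot_code) > 0   # 2. configure the processor from the boot code
    my_id = proc.proc_id                   # 3. determine the processor number
    proc.neighbors = list(topology.get(my_id, []))   # 4. detect neighboring processors
    root_inbox.append(f"processor {my_id} booted")   # 5. report completion to the root
    return True


if __name__ == "__main__":
    topology = {0: [1], 1: [0]}            # assumed two-processor layout
    inbox: List[str] = []
    print(boot(Processor(1, b"\x01"), topology, inbox), inbox)
    print(boot(Processor(2, None), topology, inbox))
```

In this model, a processor without FLASH memory never reaches the first step, so it is never configured and never reports to the root processor.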