Databases are among the most complex and widely used applications of modern computer systems. Databases allow efficient sharing of large amounts of data among users and applications. Because the data are shared by many users, providing efficient access to the database requires concurrent processing of user queries. Depending on the type of database application, the spectrum of operations required to execute user queries may include statistical calculations on raw data, inferencing on rules, full-text searches, and relational operations on tabulated records. Because the size of a database is usually very large (gigabytes to hundreds of gigabytes), concurrent query processing requires significant CPU and I/O time.
Presently, many databases reside in parallel processing computer systems, and queries on such databases are therefore executed in parallel. A parallel processing system typically comprises a plurality of processing nodes or processing sites interconnected by an interconnection network (see, e.g., C. L. Wu et al., "Tutorial: Interconnection Networks For Parallel and Distributed Processing", IEEE Computer Society Press, 1984).
A parallel processing system for dynamic query processing supports concurrent query processing through dynamic creation and coordination of tasks that operate on the shared data in a database. Tasks may comprise multiple smaller tasks accessing shared data concurrently. Tasks communicate with each other in a parallel processing system via message passing or shared memory referencing according to particular data dependency relationships.
The space and time complexity for each query is non-deterministic and the arrival rate of the queries is also unpredictable. A complex join operation may involve a large amount of data and take a long time to process; however, a simple selection operation may only involve a small amount of data and take little time to process. Because multiple queries from multiple users need to be processed concurrently, dynamic allocation of processing elements in a parallel processing system and scheduling of tasks and placement of data for each query make parallel query processing a complex problem.
An important characteristic of database applications is that queries are often data intensive rather than computation intensive. This means that the communication of data to appropriate processing nodes in a parallel processing system requires more resources than the actual processing of the data at the processing nodes. Accordingly, many frequently occurring data intensive operations such as file transfer, aggregation, projection, and process migration, require large bandwidth and low control overhead in the interconnection network of the parallel processing system to reduce communication costs. For this reason, the bandwidth of the interconnection network has a significant effect on the execution speed of more complex queries that require the sorting or joining of large relations.
In a dynamic environment, it is not possible to have a priori knowledge of data distributions and task execution times. Tasks are scheduled to run on dynamically allocated processing nodes, and data also must be dynamically distributed among the processing nodes. Therefore, it is highly unlikely that the data associated with specific tasks are always uniformly distributed over the processors where those tasks are scheduled to run. Accordingly, data must be moved to the scheduled processors dynamically. These data movement operations are often bursty and occur in only some of the links in the network. Additionally, since data are often distributed non-uniformly over the domain of the attributes, non-uniform communication patterns can also occur during the data collection and load balancing phases of operation. Thus, in order to obtain a high speed-up factor for parallel execution, the interconnection network of a parallel system not only needs to have low average communication costs but also needs to be robust to non-uniform traffic.
A qualitative analysis of the speed-up factor of parallel join algorithms achieved by a parallel processing system indicates that if a network cannot reduce communication costs for non-uniform traffic, the speed-up factor will be limited to only logn, where n is the number of data items in the largest relation involved. The main reason for this limitation is that non-uniform communication traffic can be as large as n when the data are to be distributed to or from one hot-spot node. As a result, a typical data intensive operation which has nlogn time complexity on a single processor can only be improved by a factor of logn on an arbitrarily large number of processors if the communication time for load distribution reaches n. Therefore, robustness to non-uniform traffic is a significant feature of an interconnection network that is designed for dynamic processing.
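The saturation of the speed-up factor can be illustrated with a simple numerical model (an illustrative sketch only, not part of the original disclosure; the cost model and function names are assumptions): an operation of serial cost nlogn run on p processors still pays an O(n) communication cost when data funnel through one hot-spot node, so the achievable speed-up approaches logn no matter how large p becomes.

```python
import math

# Illustrative model (assumed, not from the original text): speed-up of an
# n*log(n) operation on p processors when a hot spot forces O(n) serial
# communication time for load distribution.
def speedup(n, p):
    single = n * math.log2(n)      # single-processor time: n log n
    parallel = n + single / p      # hot-spot communication n, plus work shared by p
    return single / parallel

n = 1 << 20
# Even with an effectively unlimited number of processors, the speed-up
# saturates near log2(n) = 20.
print(round(speedup(n, 10**9), 2))  # → 20.0
```

With n = 2.sup.20 the speed-up never exceeds about 20, regardless of the processor count, which is the logn ceiling described above.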
In view of the foregoing, an efficient interconnection network for a parallel processing system for processing database queries has the following characteristics:
1. The interconnection network should have a scalable architecture with high bandwidth, low latency and low blocking probability. In particular, a large system comprises thousands of nodes and requires terabits/sec switching capacity.
2. The interconnection network should offer sharing of bandwidth among tasks to reduce the communication delays and performance degradation resulting from bursty or other non-uniform traffic patterns. An interconnection network which dynamically shares bandwidth will exhibit smaller communication delays than a network with fixed end-to-end bandwidths. Illustratively, the interconnection network has a bursty transfer bandwidth on the order of one or more gigabytes per second.
3. The interconnection network should have a uniform topology, with equal bandwidth, latency and blocking probability between any given pair of nodes. In a large database system the software design is a significant portion of the total system cost. A uniform topology increases the portability of the software so that it can be used in different nodes, thereby reducing software costs.
4. The interconnection network should offer prioritized communication services for multi-user, multi-tasking system support. Executing high priority tasks on multiple processing nodes without giving high priority to their corresponding communication operations may result in a high priority task being blocked by a low priority task in the interconnection network.
It is an object of the present invention to provide an interconnection network with the foregoing characteristics.
Presently available interconnection networks generally do not satisfy the foregoing requirements.
In particular, presently available interconnection networks have disadvantageous scaling characteristics: communication time complexity is always traded for space complexity and blocking probability. Crosspoint switches (see, e.g., E. A. Arnould et al., "The Design of Nectar: A Network Backplane for Heterogeneous Multicomputers," ASPLOS-III Proceedings, The Third International Conference on Architectural Support for Programming Languages and Operating Systems, Boston, Mass., Apr. 3-6, 1989) provide non-blocking routing and simple control with time complexity O(logN) for selecting a crosspoint. However, the space complexity is O(N.sup.2), where N is the number of I/O nodes and O( . . . ) indicates "on the order of." Other non-blocking multistage switches such as the Batcher-Banyan switch (see, e.g., A. Huang et al., "Starlite: A Wideband Digital Switch," Proceedings of Globecom 84, pp. 121-125; and C. Day et al., "Applications of Self-Routing Switches to LATA Fiber Optic Networks," International Switching Symposium, Phoenix, Ariz., March, 1987) and 2logN-1 stage networks (see, e.g., Tse-Yun Feng et al., "An O((logN).sup.2) Control Algorithm," Proceedings of Conf. on Parallel Processing, 1985, pp. 334-340) can have less than O(N.sup.2) space complexity, but the control of these networks takes on the order of O((logN).sup.2) time steps with O(N(logN).sup.2) switching nodes and O((logN).sup.2) time steps with N fully connected processors, respectively (see, e.g., D. Nassimi et al., "A Self-Routing Benes Network and Parallel Permutation Algorithms," IEEE Transactions on Computers, Vol. C-30, No. 5, May, 1981, pp. 332-340). The logN stage interconnection networks (see, e.g., W. Crowther et al., "Performance Measurements on a 128-node Butterfly Parallel Processor," in Proc. 1985 Int. Conf. Parallel Processing, August, 1985, pp. 531-540; G. F. Pfister et al., "The IBM Research Parallel Processor Prototype (RP3): Introduction and Architecture," in Proceedings of Int'l Conf. on Parallel Processing, pp. 764-771, 1985; A. G. Ranade, "How to Emulate Shared Memory," IEEE Symposium on Foundations of Computer Science, pp. 185-195, 1987) only require O(logN) time steps with O(NlogN) switching nodes, but these networks are blocking, and to reduce the blocking probability and prevent starvation of blocked requests, buffers and arbitration circuits must be added to each switching node. This extra circuitry increases the control complexity and the delay in the network when the size of the network is large.
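The trade-off among these designs can be seen numerically. The following sketch (illustrative only; the growth orders come from the discussion above, while the constant factors, such as N/2 2.times.2 switches per stage, are assumptions) compares the switching-element counts of an N.times.N crosspoint switch, O(N.sup.2), with a logN-stage multistage network, O(NlogN).

```python
import math

# Illustrative growth comparison (constants assumed, orders from the text):
# switching-element counts for an N x N crosspoint switch versus a
# logN-stage multistage network built from 2x2 switches.
def crossbar_elements(N):
    return N * N                           # O(N^2) crosspoints

def multistage_elements(N):
    return (N // 2) * int(math.log2(N))    # N/2 switches per stage, log2(N) stages

for N in (64, 1024, 4096):
    print(N, crossbar_elements(N), multistage_elements(N))
# The O(N^2) design dominates quickly: at N = 4096 the crossbar needs
# 16,777,216 crosspoints versus 24,576 multistage switching elements.
```

The gap widens with N, which is why crosspoint switches, despite their non-blocking routing and simple O(logN) control, do not scale to systems of thousands of nodes.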
As can be seen from the foregoing, available prior art interconnection networks do not have desirable scaling characteristics along with low latencies and low blocking probabilities. Accordingly, it is an object of the present invention to provide an interconnection network with desirable scaling characteristics, a low latency and a low blocking probability.
In addition, the available interconnection networks do not offer dynamic bandwidth sharing in a satisfactory manner. Space domain interconnection networks include the tree (see, e.g., DBC/1012 Data Base Computer "Concepts and Facilities", C02-0001-05 Release 3.2; and B. K. Hillyer et al., "Non-Von's Performance on Certain Data Base Benchmarks", IEEE Transactions on Software Engineering, Vol. 12, No. 4, April, 1986, pp. 577-582), the mesh, the hypercube (see, e.g., W. C. Athas et al., "Multicomputers: Message-Passing Concurrent Computers," IEEE Computer, August, 1988, pp. 9-24) and the above-described multistage interconnection networks. These space domain interconnection networks are not capable of allocating aggregate bandwidth dynamically to non-uniform and bursty traffic patterns. As a result, some communication links may have low utilization while others suffer excessive queuing delays.
In contrast, time domain interconnection networks such as buses and local area networks can fully share the communications bandwidth among all the nodes. However, they are not scalable because of their limited bandwidth and the increased latency of access control. In particular, parallel systems using time domain switches such as hierarchical buses and local area networks (see, e.g., D. R. Cheriton et al., "Multi-Level Shared Caching Techniques for Scalability in VMP-MC", ACM Symposium on Computer Architecture, 1989; E. Gehringer et al., "Parallel Processing: The Cm* Experience," Digital Press, 1987; D. DeWitt et al., "GAMMA--A High Performance Dataflow Database Machine," Proc. of the VLDB Conf., Japan, August, 1986; J. Goodman et al., "The Wisconsin Multicube: A New Large-Scale Cache-Coherent Multiprocessor", IEEE International Symposium on Computer Architecture, pp. 422-431, 1988) can satisfy the bandwidth sharing and uniformity requirements. However, due to the bandwidth limitation and the propagation delay caused by capacitive loads, the number of loads that can be connected to these time domain switches is limited. To build larger systems, multiple buses must be connected via different topologies such as a mesh, hyperbus, hierarchical bus, or multiple bus. However, the resulting interconnection networks are not sufficient for data intensive applications where large amounts of data are frequently moved between nodes for load balancing and data reorganization.
Thus, time domain switching is efficient for bandwidth sharing but is not scalable. To alleviate the scalability problem, in particular for data intensive applications, it is a further object of the present invention to provide a high bandwidth interconnection network capable of providing high bandwidth interconnections between multiple time domain switches.
In addition, presently available interconnection networks do not offer a uniform topology. For a parallel processing system with a non-uniform interconnection network, topology information such as distance of communication, channel bandwidth, and blocking probability is exposed to the software to allow efficient execution of tasks. The topology dependency will increase the complexity of the software algorithms and will further complicate portability problems.
For example, in a distributed network such as a mesh or hypercube, to support efficient processing of topology information for dynamic routing, a complex VLSI routing chip is required for each node. Since packets can be delivered via different routes, extra buffers and control logic are required for maintaining the order of delivery at the higher protocol layer. Furthermore, algorithms developed on these systems are often mapped directly to the network topology to improve efficiency; thus, portability, dynamic partition and assignment, and fault tolerance problems become very complex.
A simpler solution would be to utilize a centralized switch that has uniform bandwidth, latency, and blocking probability between any given pair of nodes, such that the topology of the machine is transparent to the compilers, operating systems, and user programs. Furthermore, if the nodes that connect to the switch are grouped into bus-connected clusters, then, in the case of a fault, only a simple renaming procedure is required to either replace or mask off the faulty nodes within the cluster.
In short, it is an object of the invention to provide an interconnection network which overcomes the shortcomings of the prior art networks described above. More particularly, it is an object of the invention to provide an interconnection network which has advantageous scaling characteristics, bandwidth sharing, and uniformity. It is a further object to provide an interconnection network based on a time-space-time structure which has a terabit per second total transport bandwidth, a gigabyte per second bursty access bandwidth, and robustness to non-uniform traffic, and which supports multiple priority services.