This invention relates to distributed data processing systems that use multiple processing unit groups, and in particular to an asymmetric architecture that allows for autonomous and asynchronous operation of processing units in at least one group.
With continued development of low cost computing systems and proliferation of computer networks, the world continues to see an exponential growth in the amount and availability of information. Indeed, the Massachusetts-based Enterprise Storage Group has observed a doubling of information every few months. Demand for easy and efficient access to this ever-growing amount of digital information is another certainty. For example, World Wide Web traffic increased 300% in 2001 according to Forrester Research. Included among the applications that continue to make the greatest demands are systems for processing: financial transactions; “click stream” data that encapsulates the behavior of visitors to web sites; data relating to the operational status of public utilities such as electric power networks, communications networks, transportation systems and the like; and scientific data supporting drug discovery and space exploration.
Greg Papadopoulos, the Chief Technology Officer of Sun Microsystems, Inc., has observed that the demand for access to decision support databases, referred to as the Input/Output (I/O) demand growth, doubles every nine months. To put this in context, Moore's Law predicts that Central Processing Unit (CPU) power doubles only about every 18 months. In other words, the demand for access to information is growing at least twice as fast as the ability of a single CPU to process and deliver it.
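As a rough worked illustration, assume both trends are exact exponentials, which is an idealization. The symbols D_0 and C_0 are hypothetical starting levels for demand and CPU power, and t is measured in months:

```latex
\begin{align*}
D(t) &= D_0 \cdot 2^{t/9}  && \text{(I/O demand, doubling every 9 months)} \\
C(t) &= C_0 \cdot 2^{t/18} && \text{(CPU power, doubling every 18 months)} \\
\frac{D(t)}{C(t)} &= \frac{D_0}{C_0} \cdot 2^{\,t/9 - t/18} = \frac{D_0}{C_0} \cdot 2^{t/18}
\end{align*}
```

Under these assumptions the gap between demand and single-CPU capability itself doubles every 18 months. After 36 months, for example, demand has grown 16-fold while CPU power has grown only 4-fold.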
In a typical general purpose data processing system, data is stored on one or more mass storage devices, such as hard disk drives. One or more computers are then programmed to read data from the disks and analyze it—the programs may include special database software written for this purpose. The problem with a general purpose system architecture, however, is that all the data must be retrieved from the disk and placed in a computer's memory, prior to actually being able to perform any operations on it. If any portion of the data retrieved is not actually needed, the time spent fetching it is wasted. Valuable time is thus lost in the process of retrieval and storage of unnecessary data.
The speed at which the data analysis can be performed is typically limited to the speed at which the entire set of data can be transferred into a computer's memory and then examined by the CPU(s). Usually, the aggregate data transfer rate of the disks does not govern the speed at which the analysis can be performed. Disks are inexpensive, and as such, data can be spread across a large number of disks arranged to be accessed in parallel. The effective data transfer rate of a set of disks, collectively, can therefore be almost arbitrarily fast.
The bandwidth of an interface or communications network between the disks and the CPUs is also typically less than the aggregate data transfer rate of the disks. The bottleneck is thus in the communications network or in the CPUs, but not in the disks themselves.
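The bottleneck argument above can be sketched as a simple pipeline model in which the sustained scan rate is set by the slowest stage. This is a minimal illustration only: the function name and all throughput figures are hypothetical and are not taken from any particular system.

```python
def effective_scan_rate(n_disks, per_disk_mb_s, interconnect_mb_s, cpu_mb_s):
    """Return (rate_mb_s, limiting_stage) for disks -> interconnect -> CPU."""
    stages = {
        # Disks read in parallel, so their rates add up.
        "disks": n_disks * per_disk_mb_s,
        # The interconnect and the CPU are shared, so their rates do not scale.
        "interconnect": interconnect_mb_s,
        "cpu": cpu_mb_s,
    }
    limiting = min(stages, key=stages.get)
    return stages[limiting], limiting


if __name__ == "__main__":
    # Hypothetical figures: 50 MB/s per disk, 400 MB/s interconnect, 600 MB/s CPU.
    for n in (1, 4, 16, 64):
        rate, stage = effective_scan_rate(n, 50, 400, 600)
        print(f"{n:3d} disks: {rate} MB/s, limited by {stage}")
```

With these assumed numbers, adding disks beyond a handful no longer helps, because the shared interconnect becomes the limiting stage.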
It has been recognized for some time that achieving adequate performance and scalability in the face of vast and rapidly growing data thus requires some kind of system architecture that employs multiple CPUs. The three most prevalent classes of so-called multiprocessing systems today include Symmetric Multiprocessing (SMP), Asymmetric Multiprocessing (ASMP), and Massively Parallel Processing (MPP). But even these approaches have weaknesses that limit their ability to efficiently process vast amounts of data.
SMP systems consist of several CPUs, each with its own memory cache. Resources such as memory and the I/O system are shared by and are equally accessible to each of the processors. The processors in an SMP system thus constitute a pool of computation resources on which the operating system can schedule “threads” of code for execution.
Two weaknesses of the SMP approach impair its performance and scalability when processing very large amounts of data. The first problem results from a limited ability to actually provide information to the processors. With this architecture, the I/O subsystem and the memory bus are shared among all processors, yet they have a limited bandwidth. Thus, when the volume of data is too high, the processors sit idle waiting for data to arrive. A second problem with the SMP approach is cache coherence. Within each processor is typically a cache memory for storing records so that they may be accessed faster. However, as more processors are added to an SMP system, more time must be spent synchronizing all of the individual caches when changes are made to the database. In practice, it is rare for SMP machines to scale linearly beyond about 64 processors.
Asymmetric Multiprocessing (ASMP) systems assign specific tasks to specific processors, with a master processor controlling the system. This specialization has a number of benefits. Resources can be dedicated to specific tasks, avoiding the overhead of coordinating shared access. Scheduling is also easier in an ASMP system, where there are fewer choices about which processor to assign to a task. ASMP systems thus tend to be more scalable than SMP systems. One basic problem with asymmetry is that it can result in one processor being overloaded while others sit idle.
Massively Parallel Processing (MPP) systems consist of very large numbers of processors that are loosely coupled. Each processor has its own memory and devices and runs its own operating system. Communication between the processors of an MPP system is accomplished by sending messages over network connections. With no shared resources, MPP systems require much less synchronization than SMP and ASMP systems.
One weakness of the MPP model is that communication among processors occurs by passing messages over a network connection, which is a much slower technique than communication through shared memory. If frequent inter-processor communication is required, then the advantages of parallelism are negated by communication latency. Another problem with the MPP approach is that traditional programming models do not map cleanly onto message passing architectures. Approaches such as Common Object Request Broker Architecture (CORBA), which are designed to handle message passing, are considered awkward by some designers.
In the late 1970s and early 1980s, several database machine architectures were proposed to move database processing closer to the data, and to offload other work from CPUs. A common theme of these machines was special hardware for performing database functions, integrated tightly with a disk. David DeWitt of the University of Wisconsin proposed a categorization scheme for these machines according to whether they had a processor per track (PPT), a processor per head (PPH), or a processor per disk (PPD).
The Content-Addressable Segment Sequential Memory (CASSM) machine was designed as a complete database engine, with a proprietary query language, and a processor per track (PPT). The CASSM processors could perform aggregation functions and had a bit-mapping scheme to process field-based searches.
The Rotating Associative Relational Store (RARES) was a selection and sorting machine, in which record data was laid out across disk tracks (rather than within disk tracks), which allowed it to examine the different fields of a record in parallel.
The DIRECT machine used a symmetric two-tier MPP system. The first tier compiled queries into packets, which could be distributed to processors in the second tier to balance the load. The processors in the second tier used a crossbar switch that allowed them to access data from any storage unit. The present invention also uses a two-tier MPP approach, but unlike the DIRECT machine, the present invention assumes an asymmetric approach, in which each back-end processor is the exclusive “owner” of its data. This avoids time spent in load balancing and in coordinating access to data among multiple processors. By tightly coupling processing with a set of data, the cost and complexity of the DIRECT crossbar switch can be avoided.
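The exclusive-ownership idea described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the class names `FrontEnd` and `BackEnd`, the hash-based placement rule, and the sample rows are all hypothetical.

```python
from collections import Counter


class BackEnd:
    """Second-tier unit that is the exclusive owner of one partition."""

    def __init__(self, rows):
        # No other unit reads or writes these rows, so no locking is needed.
        self._rows = list(rows)

    def count_by(self, key, predicate):
        # Filter and aggregate locally; only the small partial result leaves.
        return Counter(row[key] for row in self._rows if predicate(row))


class FrontEnd:
    """First-tier host: sends the same request to every owner and merges."""

    def __init__(self, n_units):
        self.n_units = n_units

    def partition(self, rows, dist_key):
        # Assumed placement rule: a hash of the distribution key picks the owner.
        parts = [[] for _ in range(self.n_units)]
        for row in rows:
            parts[hash(row[dist_key]) % self.n_units].append(row)
        return [BackEnd(p) for p in parts]

    def count_by(self, units, key, predicate):
        total = Counter()
        for unit in units:  # no load balancing: each unit scans only its own data
            total.update(unit.count_by(key, predicate))
        return total


rows = [{"id": i, "region": "west" if i % 3 == 0 else "east", "amount": i * 10}
        for i in range(12)]
host = FrontEnd(n_units=4)
units = host.partition(rows, "id")
result = host.count_by(units, "region", lambda r: r["amount"] >= 50)
```

Because each row has exactly one owner, the merged result does not depend on how many units there are, and no unit ever needs access to another unit's storage.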
So-called Intelligent Disks (“IDISKs”) are packages that each include a disk drive, power, CPU, memory, and network interface, and that run a full and standard operating system. They are intended to offload computation from desktop machines. Intelligent disks are typically positioned as commodities for use in various applications, supporting file systems and directories, database, Internet, and mail. One goal of Intelligent Disks is to allow the incremental addition of capacity by plugging in new slices and choosing a personality (such as “file system” or “mail”) to extend.
In the Magi system (Notre Dame), Kendall and Freeh describe a programming model in which applications apply modular stream-oriented operations to data. They discuss the use of IDISKs for intelligent caching, data compression, and database search.
Bill Bridge (Oracle) explains that, for purposes of database processing, Intelligent Disks can help allocate storage at a block level, but space within a block must be controlled by the application access method, and Intelligent Disks cannot get involved at this low level.
Like Intelligent Disks, so-called Active Disks bundle processing power, memory, and network connectivity with a disk to offer performance and functionality enhancements. Erik Riedel (CMU), a leading proponent of Active Disks, cites the following advantages of the technology: the ability to execute application code on disks (disklets); parallel processing across lots of disks, which yields performance and scalability; less saturation of I/O bandwidth; and scheduling (“a little bit of computation goes a long way”).
Mustafa Uysal (University of Maryland) stresses the importance of a restricted execution environment for application disklets, controlled by a thin disk operating system (DiskOS), which manages scheduling, memory management, and communications. Along with Acharya (UCSB) and Saltz (UMD), Uysal notes that, while disklets can be written in any language, they should not be allowed to initiate I/O, allocate or free memory, or determine the source or destination of a stream of data. These characteristics differentiate Active Disks from Intelligent Disks. Where the former has a thin special-purpose operating system, the latter uses a general OS like Linux. Where the former allows disklets to be written in any programming language, with execution restrictions, the latter provides special stream-processing primitives, and does not impose execution restrictions.
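The division of responsibility between a disklet and the DiskOS might be sketched as below. The names `DiskOS`, `run`, and `select_over` are hypothetical and do not come from the cited work. Python cannot itself enforce the restrictions, so the sketch only illustrates the contract: the disklet sees an iterator of records and yields results, while the runtime alone chooses the source and the destination of the stream.

```python
def select_over(threshold):
    """Build a disklet: a pure record-in, record-out filter.

    The disklet never opens a file or socket and never decides where its
    input comes from or where its output goes.
    """
    def disklet(records):
        for rec in records:
            if rec["value"] > threshold:
                yield rec
    return disklet


class DiskOS:
    """Hypothetical thin runtime that owns I/O and stream routing."""

    def __init__(self, blocks):
        self._blocks = blocks  # stands in for on-disk blocks

    def _read_stream(self):
        # Only the runtime initiates reads.
        for block in self._blocks:
            yield from block

    def run(self, disklet, sink):
        # The runtime, not the disklet, picks the source and the destination.
        for out in disklet(self._read_stream()):
            sink.append(out)


blocks = [[{"value": 1}, {"value": 5}], [{"value": 9}]]
sink = []
DiskOS(blocks).run(select_over(4), sink)
```

Keeping the disklet to a pure stream transformation is what allows a thin runtime to schedule it and bound its resource use.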
In a performance simulation, Uysal found limited benefit to having more than 32 megabytes of memory available to the Active Disk. Instead, Uysal's simulation indicated the value of processing intermediate results as they arrive (to reduce memory usage and synchronization), and of forwarding partial results on to other processors when out of memory.
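The strategy of processing intermediate results as they arrive, and forwarding partial results when memory runs out, can be illustrated with a bounded-memory aggregation. This is a sketch under assumptions and not Uysal's simulation code: the function names are hypothetical, and the memory limit is expressed as a maximum number of group keys.

```python
def bounded_sum(records, max_groups, forward):
    """Sum values per key while holding at most max_groups keys in memory."""
    partial = {}
    for key, value in records:
        if key not in partial and len(partial) >= max_groups:
            # Out of memory: hand the partial results to another processor.
            forward(partial)
            partial = {}
        # Fold each record in as it arrives instead of buffering raw input.
        partial[key] = partial.get(key, 0) + value
    if partial:
        forward(partial)


def merge(partials):
    """Downstream step: combine forwarded partial sums into final totals."""
    total = {}
    for part in partials:
        for key, value in part.items():
            total[key] = total.get(key, 0) + value
    return total


received = []
bounded_sum([("a", 1), ("b", 2), ("c", 3), ("a", 4)], max_groups=2,
            forward=received.append)
totals = merge(received)
```

Because addition is associative, the same key may appear in more than one forwarded partial result and still merge to the correct total downstream.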
Several researchers focus on the use of Active Disks for non-database applications. For example, Lim (University of Minnesota), Kapoor (Valicert), and Wighe (Wind River Systems) use Active Disks to implement a scalable file system.