The invention described herein is concerned with the implementation of distributed computing processes using a new disk file architecture such that distributed processing can be applied to tasks on a much broader scale than has previously been available. In general, the term distributed processing as used herein includes parallel processing as a special case of the simultaneous use of multiple computer resources to solve a computational problem. The resources can range from a single “computer” with multiple processors to an arbitrary number of computers connected to a network, or some combination of those. Parallel computing is advantageous for those tasks that can be broken apart into discrete subtasks that can be solved simultaneously. Systems with more than one computer can execute multiple programs at the same moment and reach a result faster than would be possible with a single computer resource. Currently, parallel computing has been applied to very large problems such as planetary orbits, weather prediction, ocean current patterns, tectonic plate drift, chemical and nuclear reactions, biological prediction such as human genome activity, and the like, which can only be solved with very large (and very expensive) parallel computers. As the cost of parallel computing comes down, less spectacular problems can also be solved faster and more cheaply with parallel processing systems. These applications include rush hour traffic, assembly line control, daily business operations, mechanical simulation such as bending of complex plates and thermal distribution, complex circuit simulation, parallel databases, data mining, oil exploration, web search engines, computer-aided medical diagnosis, advanced graphics and virtual reality, networked video, and the like.
The fundamental reason for utilizing distributed and parallel computing is to solve problems faster and more cheaply. Some applications must be done in real time (such as missile defense and guidance systems); other tasks are so complex that the only currently practical solution is parallel computing. Included in these applications are high-speed wide area network applications where speed is the principal objective. Some applications, such as wide-area weather forecasting, have non-local resources, and parallel computing is the only practical solution. Of course, many of these applications can be done, given sufficient time, with a single, expensive supercomputer. But utilizing multiple slower (and cheaper) systems is a much more economical solution. Since communication speed is limited by the speed of light, supercomputers are forced to move their components closer and closer together to achieve higher speeds. Not only are there limits to affordable miniaturization, but the resulting problems of cooling and electrical interference are difficult to overcome.
Currently, most supercomputers and networked parallel computers use a multiple-instruction, multiple-data computing architecture. Each processor may be executing a different instruction stream and may be operating on a different data stream. An example of this application is the computation of the temperature distribution in a complex machine. The physical machine can be conceptually subdivided into a multiplicity of small, nearly infinitesimal cubes in which the boundary conditions in each cube are initially known. The physical characteristics of the cubes may differ; some contain air (and the movement of the air in the cube), while others may contain metal or plastic parts in which the contents do not change appreciably, but where the thermal characteristics are quite different from those of cubes containing a gas or a fluid. Once the initial conditions are known, the predicted conditions at the next increment of time can be computed, and the boundary conditions of each cube are computed. At this point the boundary conditions on the interface between two small cubes may be different. Suppose, for example, the boundary between cube C1 and C2 is B1 and the initial temperature of the boundary B1 is T1. Based on all of the boundaries of C1, a new temperature at B1 is established. At the same time, the conditions in C2 result in a new temperature at B1 which, in the general case, is not the same as that found by solving the boundary problem in C1. A simple algorithm is to average the two predicted conditions at either side of B1. Once this is completed for all the interfaces of all the cubes in the problem, the task is restarted. This continues until the temperature at B1 reaches a stable value. This same process is repeated for all six boundaries of each of the cubes that describe the complex machine. Once a stable condition is arrived at, the process restarts for the next increment of time.
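The averaging procedure described above can be illustrated with a minimal one-dimensional sketch. It is not the method of any particular system: the function name `relax_boundaries`, the tolerance, and the local estimate used for each cell (the mean of the cell's two boundary temperatures) are assumptions introduced purely for illustration. Each interior boundary is set to the average of the two estimates made by the cells on either side of it, and the sweep repeats until the values stop changing.

```python
def relax_boundaries(boundary_temps, tol=1e-6, max_iters=10_000):
    """Iterate boundary averaging on a 1-D chain of cells.

    boundary_temps[i] is the temperature at boundary i; cell i lies between
    boundaries i and i+1. The two outermost boundaries are held fixed.
    Returns (final_temperatures, iterations_used).
    """
    temps = list(boundary_temps)
    for iteration in range(1, max_iters + 1):
        # Hypothetical local solver: each cell estimates its boundary
        # temperature as the mean of its two current boundary values.
        cell_estimates = [
            (temps[i] + temps[i + 1]) / 2 for i in range(len(temps) - 1)
        ]
        new_temps = temps[:]
        for i in range(1, len(temps) - 1):
            # Average the two predictions made on either side of boundary i.
            new_temps[i] = (cell_estimates[i - 1] + cell_estimates[i]) / 2
        change = max(abs(a - b) for a, b in zip(new_temps, temps))
        temps = new_temps
        if change < tol:  # boundary values have reached a stable condition
            return temps, iteration
    return temps, max_iters


if __name__ == "__main__":
    final, sweeps = relax_boundaries([100.0, 0.0, 0.0, 0.0, 0.0])
    print(final, sweeps)
```

With the outer boundaries fixed at 100 and 0, the interior values settle toward a linear profile between the two. In a parallel implementation, each cell estimate could be computed on a different processor, with only the boundary values exchanged between sweeps.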
In an architecture corresponding to the preceding taxonomy, a set of computers, with their associated memories, is connected through some network. In the multiple-instruction, single-data taxonomy, only one computer would have an associated memory, and possibly that computer would be used only to communicate with the other computers. In the multiple-instruction, multiple-data system, each computer would process the data in its associated memory (possibly using the instructions in its memory) and communicate the results of those computations over the network to other computers in the array.
In a few applications, high-speed volatile memory (such as RAM) is sufficient for the task. But in most complex applications a mass storage system (such as a disk drive or similar technology) is required. Each processor in the array has not only the associated RAM, but also an associated mass storage. The interface between the CPU and the mass storage can be any of the large set of computer interfaces such as SCSI, IDE, or Fibre Channel. Now the array can process a large problem, but the addition of the mass storage also contributes to overhead, the known limitation of parallel processing.
In an early work (1967), Amdahl is credited with having shown that the principal limitation to a parallel processing system is overhead. This overhead increases with the number of effective parallel processors. In addition, the time required to store and retrieve data from mass storage such as a disk drive is a real-world factor. Most of the computational time (the so-called “observed speed up time”) is taken by fetching instructions and data from the mass storage. In the case of a SCSI interface, the CPU sends a command through the interface to the storage unit to fetch a particular block of data. The disk drive interprets the command and, if it is processing other requests, inserts the command in its associated command stack. When the command finally reaches the top of the stack, the disk drive positions the head over the appropriate physical address on the disk and reads the requested data into its buffer. The data is sent back to the CPU through the interface. This process continues until the request is complete, and the disk drive then executes the next request in the stack.
Many have voiced the contention that the organization of a single computer has reached its limits and that truly significant advances can be made only by interconnection of a multiplicity of computers in such a manner as to permit cooperative solution. The nature of this overhead (in parallelism) appears to be sequential so that it is unlikely to be amenable to parallel processing techniques. Overhead alone would then place an upper limit on throughput as a small multiple of the sequential processing rate, even if the housekeeping were done in a separate processor. So the fundamental limitation on a parallel computing system is not the speed of the actual computations but the overhead.
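The upper limit described above is commonly written in the form now known as Amdahl's law. The symbols below are not used in the original text and are introduced here only for illustration: $p$ is the fraction of the work that can be parallelized, $N$ is the number of processors, and $S(N)$ is the resulting speedup over a single processor.

```latex
S(N) = \frac{1}{(1 - p) + \dfrac{p}{N}},
\qquad
\lim_{N \to \infty} S(N) = \frac{1}{1 - p}
```

As a worked example under an assumed value of $p = 0.95$, one hundred processors give $S(100) = 1/(0.05 + 0.0095) \approx 16.8$, and no number of processors can exceed $1/0.05 = 20$. The sequential overhead, not the speed of the individual computations, sets the ceiling.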
In many applications the speed required to arrive at a decision is extremely critical. The literature is replete with applications that require computational speeds faster than those attainable with currently available processors and storage systems. For example, one could consider a missile tracking system where a decision must be made very quickly on whether to respond to a perceived attack. Another application is high-speed packet switching where a decision must be made on where to send a packet before another one arrives.
One approach to this bottleneck is to partition the application into several parts, compute each part independently, and assemble the overall solution later from the collection of partial solutions. The problem with this approach is that the speed of an array of computers is limited by the overhead inherent in each of the computers. Rajkumar Buyya (in “High Performance Cluster Computing, Architectures and Systems, Volume 1”, 1999) has written that: “Improvements in disk access time have not kept pace with microprocessor performance, which has been improving by 50 percent or more per year.” Overall improvement in disk access times, which involve mechanical systems, has been less than 10 percent per year. So-called “grand challenge” applications usually need to process large amounts of data. The system throughput is limited by the slowest system component, which is often the disk drive. Clearly a fundamental limitation with high-speed parallel processing is the speed at which information can be retrieved from the storage system. For small applications, the storage may be a simple electronic memory (RAM and the like), but for large applications a mass storage system is necessary.
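The effect of the two improvement rates quoted above can be made concrete by compounding them. The following is a small sketch only; the function name `relative_gap` is hypothetical, and the rates are treated as constant year over year, which is a simplifying assumption rather than a claim made in the text.

```python
def relative_gap(cpu_rate, disk_rate, years):
    """Factor by which processor performance outgrows disk access
    performance after `years`, assuming constant annual improvement
    rates (e.g. 0.5 for 50 percent per year)."""
    return ((1.0 + cpu_rate) / (1.0 + disk_rate)) ** years


if __name__ == "__main__":
    # Rates from the text: 50 percent per year for processors and
    # (at most) 10 percent per year for disk access time.
    for years in (1, 5, 10):
        print(years, round(relative_gap(0.5, 0.1, years), 2))
```

Because the disk figure is an upper bound ("less than 10 percent"), the computed factor is a lower bound on how far disk access falls behind the processor over the same period.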
Generally, random access disk files have been used for mass storage, but they are limited by some of the physical characteristics of the mechanical system. As recording heads and magnetic read heads have become smaller, the amount of data that can be stored on a rotating disk has increased. Consequently, the data rate of disk drives has increased with improvements in the associated electronic components. However, neither the rotational speed nor the time required to access the appropriate track has significantly improved. The principal bottleneck in using a parallel processing architecture in an application where the volume of data requires disk drives is the overhead of accessing the data.
The layout of a typical set of disks is illustrated in FIG. 7. Three disks 45A, 45B, 45C are shown, but any number of disks can conceivably be used. The disks are rigidly mounted on a spindle (not shown) so that the relative position of each disk to the other disks does not change as the entire disk pack is rotated. The upper and lower surfaces of each disk can be used to record data. Typically the heads (not shown) are mechanically connected to a single actuator, so that the heads all move in unison. Between the outer diameter 51A and inner diameter 52A of disk 45A are a plurality of concentric tracks, such as the exemplary track 53A which is shown. Each track is conventionally subdivided into blocks, commonly 512 or 528 bytes each. The blocks are typically integral units which include ECC bytes which cover the entire block and, therefore, each block is read or written as a unit. In current disk drives, the number of blocks in a track ranges from over 800 in the outer track to fewer than 600 on the inner track. There are commonly over 41,000 tracks on each surface. For some of the more advanced disk drives, the rotational speed is 15,000 RPM (or 4 ms per revolution). The time required to move the head from one track to another can range from a low of 0.3 ms for a single track seek, up to 6 ms for a maximum seek from the outer track to the inner track.
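The figures above can be combined into a rough estimate of the time to reach and read one block. The sketch below is illustrative only: the function names are hypothetical, and the default of half a revolution for the expected rotational delay is an assumption (the usual average when the target block is equally likely to be anywhere on the track), not a figure taken from the text.

```python
def revolution_ms(rpm):
    """Time for one full revolution, in milliseconds."""
    return 60_000.0 / rpm


def access_time_ms(seek_ms, rpm, blocks_per_track, rotation_fraction=0.5):
    """Seek time + rotational delay + time to read one block.

    rotation_fraction is the assumed fraction of a revolution spent
    waiting for the block to arrive under the head (0.5 on average,
    1.0 in the worst case).
    """
    rev = revolution_ms(rpm)
    transfer = rev / blocks_per_track  # one block passes under the head
    return seek_ms + rotation_fraction * rev + transfer


def track_data_rate_mb_s(rpm, blocks_per_track, block_bytes=512):
    """Sustained rate while reading a single track, in 10^6 bytes/s."""
    bytes_per_rev = blocks_per_track * block_bytes
    return bytes_per_rev / (revolution_ms(rpm) / 1000.0) / 1e6


if __name__ == "__main__":
    print(revolution_ms(15_000))               # revolution time at 15,000 RPM
    print(access_time_ms(0.3, 15_000, 800))    # single-track seek, outer track
    print(access_time_ms(6.0, 15_000, 600, rotation_fraction=1.0))  # worst case
    print(track_data_rate_mb_s(15_000, 800))   # outer-track transfer rate
```

Even with the shortest seek, the mechanical terms (seek and rotation) are hundreds of times larger than the time to transfer the block itself, which is the access overhead the preceding paragraphs identify as the bottleneck.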
Prior art data storage devices such as disk drives have drive control systems that include means for accepting commands from a host computer, including commands related to self-testing, calibration and power management. Each drive has programming code (microcode) in nonvolatile memory for execution by a controller, which is a special purpose processor, to enable it to perform essential functions. The microcode can be updated through the drive's communication interface as described, for example, in U.S. Pat. No. 5,659,801 to Kopsaftis. Various standard communication interfaces with both hardware components and command protocols are commonly used, such as IDE, SCSI, Serial ATA, and Fibre Channel Arbitrated Loop (FC-AL).
Commercially available disk drives cannot perform any operation on the data other than those directly related to its storage such as, for example, performing error correction computations. There have been some suggestions for enhancing hard disk drives (HDDs) by adding an application-specific component for receiving and executing application-specific instructions. Published U.S. patent application 20030227703 by F. D. Hinshaw describes application-specific components which can perform application-specific tasks, such as database operations, using data on the HDDs and return results to a local host or even a remote host via a network. Examples of the application-specific components include a circuit board or an integrated circuit such as an Application Specific Integrated Circuit (ASIC), Field Programmable Gate Array (FPGA), Central Processing Unit (CPU), or Digital Signal Processor (DSP). The additional component is capable of receiving application-specific instructions via a standard network connection and executing these application-specific instructions on the data contained on the disks. The application-specific component may be in the form of an add-on board which couples to the controller circuit or, alternatively, in the form of an application-specific chip which mates with a socket provided on the controller circuit.
In U.S. patent application 20020162075 by Talagala, et al. a storage system is described which includes a storage array controller and a storage array. The storage array controller issues scrubbing operation commands to one or more of the disk drive controllers. In response, each disk drive controller that receives a scrubbing operation command reads data from within a data range from at least one of the disk drives, calculates a new checksum for the data, and compares the new checksum to a preexisting checksum for the data. If the new checksum doesn't equal the preexisting checksum, the data within the data range is determined to be erroneous.
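The compare step of such a scrubbing operation can be sketched as follows. This is a minimal illustration and not the implementation described in the cited application: the function names are hypothetical, the data range is modeled as a list of in-memory byte blocks, and CRC-32 from Python's standard `zlib` module is used as a stand-in because the checksum algorithm is not specified here.

```python
import zlib


def checksum(data: bytes) -> int:
    """Stand-in checksum (CRC-32); the real algorithm is an assumption."""
    return zlib.crc32(data)


def scrub(blocks, stored_checksums):
    """Recompute the checksum of each block in the range and return the
    indices of blocks whose new checksum differs from the stored one."""
    erroneous = []
    for index, (data, expected) in enumerate(zip(blocks, stored_checksums)):
        if checksum(data) != expected:
            erroneous.append(index)
    return erroneous


if __name__ == "__main__":
    blocks = [b"alpha", b"beta", b"gamma"]
    stored = [checksum(b) for b in blocks]  # checksums saved at write time
    blocks[1] = b"bet4"                     # simulate silent corruption
    print(scrub(blocks, stored))
```

Note that in this arrangement the drive controller performs the read and the comparison locally, so only the scrubbing command and the result cross the interface to the storage array controller.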