1. Field of the Invention
This invention relates to computer processing systems, and more particularly to a parallel virtual file system for parallel processing systems including single-processor systems having multiple storage devices.
2. Description of Related Art
Computational speeds of single processor computers have advanced tremendously over the past three decades. However, many fields require computational capacity that exceeds even the fastest single processor computer. An example is transactional processing, where multiple users access computer resources concurrently, and where response times must be low for the system to be commercially acceptable. Another example is database mining, where hundreds of gigabytes of information must be processed, and where processing data on a serial computer might take days or weeks. Accordingly, a variety of "parallel processing" systems have been developed to handle such problems. For purposes of this discussion, parallel processing systems include any configuration of computer systems using multiple central processing units (CPUs), either local (e.g., multiprocessor systems such as SMP computers), or locally distributed (e.g., multiple processors coupled as clusters or MPPs), or remotely distributed (e.g., multiple processors coupled via LAN or WAN networks), or any combination thereof. Further, in the context of this invention, parallel processing systems include single-processor systems having multiple storage devices.
It is common practice to achieve high performance on parallel processing systems by the following means:
  The data sets which the computer is to access are divided ("partitioned") into several disjoint sets of records and stored on a collection of storage devices. A data set which has been divided in this fashion is called a "partitioned file", and each subdivision is called a "partition."
  Several instances of the programs comprising an application and which process the data are run on a collection of processors, or as independent tasks on one processor.
  Each program-instance accesses one partition of each partitioned file being processed.
The benefits of this practice are that simultaneous use of multiple storage devices enables high aggregate data transfer rates, and simultaneous use of multiple processors enables high aggregate processing rates.
The developer of such applications is then faced with the problem of managing and using partitioned files. Among the problems which must be faced are the following:
  Each partitioned file must be created and, eventually, deleted.
  Each partitioned file must be accessed by applications.
  Partitioned files must be organized, e.g., into file systems.
  Partitioned files must be administered, e.g., repaired when damaged.
One solution to these problems in the prior art is to arrange for some ad hoc variant on the following scheme:
  Each processor in the system has its own set of storage devices.
  Each storage device is set up with an "isomorphic" file system, i.e., a file system having the same arrangements of files and directories, with only the contents of the files differing from system to system.
  When an application wishes to effect some change in the partitioned files (e.g., create a file), the application invokes an instance of a program on every processor in the system, with each program instance being given the same parameters (e.g., file name).
  It is necessary that each instance of the program behave identically from the viewpoint of structural alterations to the file system. For example, if one program instance creates a file, then all program instances must create the same file (i.e., a file having the same name, but within a different file system).
As long as these principles are adhered to, the file systems should remain isomorphic.
This prior art approach has several useful properties:
  Programs which access the data in the partitions may do so using standard file-system interfaces.
  The fact that all the file systems have isomorphic structures makes them no harder to administer than a single file system. This is because file system administration is primarily a matter of establishing, monitoring, and altering the structure of directories and files in a system. Since the structure of a collection of isomorphic file systems is no more complex than that of a single file system, administering a collection of isomorphic file systems is generally no more complex than administering a single file system.
However, there are several difficulties with this approach:
  Every application wishing to use such a strategy must include a "hand-crafted" implementation of this approach.
  If any user or application departs from the conventions noted above, then the various file systems may gradually acquire different structures. Once isomorphism has been lost, managing the system becomes much more complex, since now the administrator must understand and manage the different structures which have appeared in the various constituent file systems.
  If the system crashes at an inopportune time, it is possible that certain structural changes will have been made to some of the file systems, but not to others. In order to restore isomorphism, an administrator will have to inspect each file system, looking for inconsistencies, and repair them according to the administrator's best understanding of the intended file system structure. Furthermore, if isomorphism has not been rigidly enforced, the administrator will have no way of knowing which departures from isomorphism are intentional, and which are an artifact of a system failure.
  If two applications simultaneously access the "isomorphic" file systems, they may interfere with each other. For example, if application program A renames a file and application program B tries to delete the same file, it is possible that, on some processors, application program A will run first (the file will be renamed), but that on other processors application program B will run first (the file will be deleted). This will result in a loss of isomorphism, leading to great administrative difficulties (the two applications should probably not have been run simultaneously, but such errors are common; the loss of isomorphism is an unacceptable result for such a common error).
Thus, there is a need for a better method of managing partitioned files in parallel processing systems. The present invention provides such a method.
Partitions of a partitioned file are stored in a set of isomorphic "data trees".
An additional directory tree, called the "control tree", is used to build a model of the intended structure of the data trees. The control tree allows the computer system to "locate" (generate a path name for) data within the data trees.
The combination of a control tree and a collection of data trees is referred to as a "multifile system". Files in the multifile system are referred to as "multifiles." Directories in the multifile system are referred to as "multidirectories." Data elements of a multifile or multidirectory are referred to as "data plies." Control elements of a multifile or multidirectory are referred to as "control plies."
A set of multifile subroutines is provided for accessing and modifying the multifile system, e.g., making structural changes to the data trees or obtaining identifiers for data plies of a multifile or multidirectory. Where practical, the multifile subroutines have an interface mimicking that of the native file system.
The multifile subroutines use a distributed computing environment which provides for remote procedure calls (RPCs) and, in one of the embodiments of the invention, for a distributed transaction processing protocol (e.g., two-phase commit) to ensure atomicity of structural changes to the multifile system. The distributed computing environment must also provide for "distributed file operations" (fileops).
Interference of concurrent file system operations is prevented by creating a "transactional" lock for each file system.
A central "driver program" uses the subroutine interface of the invention to effect structural changes in the multifile system.
The driver program may launch multiple program instances on various processors. Typically, there will be one program instance for each data tree in the multifile system, and each program instance typically operates on the contents of one distinct data tree.
The above steps may be repeated multiple times within any application.
As part of an application. In this case, the multifile subroutines are included as part of the application driver. The application driver and the multifile subroutines share access to a single distributed computing system, which will invoke distributed programs on behalf of the driver, and invoke fileops and RPCs on behalf of the multifile software.
As a distributed service. In this case, the owner of the computers containing the control and data trees would need to ensure that a "multifile server" was running at all times. The subroutines explained below would then be embedded as part of the server, which would also incorporate a distributed computing environment. Applications wishing to use these services would contain "stub subroutines" which would contact the appropriate servers to perform multifile operations. Preferably, these stub subroutines would mimic the interface of the native file system, in order to minimize the work needed to adopt the multifile system. Communications between the driver and the program instances might then be governed by a second distributed computing environment or, alternatively, by the same distributed computing environment as is used by the multifile server.
As a part of an operating system. Typically, the operating system's file system interface may be extended to recognize multifile commands and dispatch them to a multifile server via a "file system driver" interface. Many operating systems (e.g., Microsoft Corporation's NT operating system) provide interfaces which facilitate this sort of integration. Once the multifile server has been engaged, the implementation is much like the case of a distributed service.
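By way of illustration only, the following Python sketch shows how a control tree might serve as a model of the intended structure of a set of isomorphic data trees: it generates one path name per data tree for a given multifile, and it reports any data tree whose structure departs from the control tree (for example, after a crash). The function names (locate, structure, departures, demo), the use of ordinary local directories to stand in for data trees on separate storage devices, and the sample names "sales" and "q1.dat" are all assumptions made for this sketch and are not taken from the specification.

```python
import os
import tempfile


def locate(data_roots, rel_path):
    """Generate one path name per data tree for a path given relative to the control tree root."""
    return [os.path.join(root, rel_path) for root in data_roots]


def structure(root):
    """Return the set of relative paths (directories and files) found under root."""
    entries = set()
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            entries.add(os.path.relpath(os.path.join(dirpath, name), root))
    return entries


def departures(control_root, data_roots):
    """Map each non-conforming data tree root to (missing, unexpected) relative paths."""
    intended = structure(control_root)  # the control tree is the model
    report = {}
    for root in data_roots:
        actual = structure(root)
        if actual != intended:
            report[root] = (intended - actual, actual - intended)
    return report


def demo():
    """Build a hypothetical control tree and three data trees, then damage one data tree."""
    with tempfile.TemporaryDirectory() as base:
        control = os.path.join(base, "control")
        data = [os.path.join(base, f"data{i}") for i in range(3)]
        # Every tree receives the same directory and file names (isomorphic structure).
        for root in [control] + data:
            os.makedirs(os.path.join(root, "sales"))
            open(os.path.join(root, "sales", "q1.dat"), "w").close()
        plies = locate(data, os.path.join("sales", "q1.dat"))
        before = departures(control, data)
        # Simulate a crash that left one data tree without its partition file.
        os.remove(plies[1])
        after = departures(control, data)
        return plies, before, after, data
```

Calling demo() returns the generated path names together with the departure reports taken before and after the simulated failure. Because the control tree records the intended structure, the second report identifies which data tree lost isomorphism and which entry is missing, which is the information an administrator lacks under the ad hoc scheme described above. The sketch does not address atomicity or locking.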