All of the material in this patent application is subject to copyright protection under the copyright laws of the United States and of other countries. As of the first effective filing date of the present application, this material is protected as unpublished material. However, permission to copy this material is hereby granted to the extent that the copyright owner has no objection to the facsimile reproduction by anyone of the patent documentation or patent disclosure, as it appears in the United States Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.
Not Applicable
This invention relates to distributed computer systems and more particularly to an improved system and method for allocation of vacant memory space.
Distributed database computer systems store enormous amounts of information that can be accessed by users for identification and retrieval of valuable documents that contain data, text, audio, video and other multimedia information. A typical example of a distributed system (100) is shown in FIG. 1. A distributed database computer system consists of computer nodes (104a to 104n) and a communication network (108) that allows the exchange of messages between computer nodes. Multiple storage devices (106a to 106n and 110a to 110z) store data files for the multiple nodes of the distributed system. Storage devices (106a to 106n) are local disks for the nodes (104a to 104n); storage devices (110a to 110z) are global databases accessible by nodes (104a to 104n) via a storage network (102). These nodes work together to achieve a common goal, e.g., a parallel scientific computation, a distributed database, or a parallel file system.
Shared disk file systems allow access to data contained on disks attached by some form of Storage Area Network (SAN). The SAN provides some physical level access to the data on the disk to a number of systems. These shared disks can be split into partitions which provide a shared pool of physical storage without common access; alternatively, with the aid of a shared disk file system or database manager, the SAN can provide coherent access to all the data from all of the systems. IBM's General Parallel File System (GPFS) is a file system which manages a pool of disks across a number of systems, allowing high speed direct access from any system and aggregate performance across a single file system which exceeds that available from any file system managed from a single system. This disclosure addresses an aspect of bringing that multi-system power to bear on an aspect of file system operation.
In general, the operation of the file system is a compromise between keeping data placed together on the disks for rapid access and also allowing small files to be stored without great space overhead. A common way of doing that is to store larger files in blocks which effectively use disk bandwidth and also in sub-blocks which are some fraction of a file system block. The normal operation of creation and deletion of data creates unused sub-blocks within file system blocks as shown in FIG. 2. FIG. 2 is a file structure (200) overview illustrating the block and sub-block utilization for a storage device (202a, 202b) in the prior art within which the invention may be practiced.
To take advantage of large block sizes, while a file is open for writing from a database or disk storage (202a, 202b), only full blocks (Blocks N−1, N, N+1, N+2, N+3) are allocated (204a) from a number of contiguous sub-blocks (206). After the last close of the file, the last logical block of the file is shrunk down to the number of sub-blocks that are actually needed (204b). This approach requires that some full blocks be available before any file can be written. During normal file system operation, after many allocations and de-allocations of files, the disks end up fragmented (208) with many free sub-blocks (210) that cannot be used for full block allocation. In such systems, there is a need for a mechanism that consolidates holes (unused portions of a disk block, 210) into free full blocks.
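The allocation behavior described above can be illustrated with a minimal sketch. The sub-block ratio, the sub-block size, and all function names below are assumptions chosen for illustration; the text does not specify them.

```python
from math import ceil

SUBBLOCKS_PER_BLOCK = 8   # assumed ratio; the text does not fix one
SUBBLOCK_BYTES = 1024     # assumed sub-block size
BLOCK_BYTES = SUBBLOCKS_PER_BLOCK * SUBBLOCK_BYTES


def blocks_while_open(file_bytes):
    """While a file is open for writing, only full blocks are allocated."""
    return ceil(file_bytes / BLOCK_BYTES)


def tail_subblocks_after_close(file_bytes):
    """After the last close, the last logical block shrinks to the
    number of sub-blocks actually needed."""
    tail_bytes = file_bytes % BLOCK_BYTES
    if file_bytes > 0 and tail_bytes == 0:
        return SUBBLOCKS_PER_BLOCK  # the last block is exactly full
    return ceil(tail_bytes / SUBBLOCK_BYTES)


def free_space_summary(used_per_block):
    """Given the sub-blocks in use in each disk block, return
    (free full blocks, free sub-blocks trapped in partly used blocks)."""
    free_full = sum(1 for used in used_per_block if used == 0)
    trapped = sum(SUBBLOCKS_PER_BLOCK - used
                  for used in used_per_block
                  if 0 < used < SUBBLOCKS_PER_BLOCK)
    return free_full, trapped
```

Under these assumed sizes, a 20,000-byte file holds three full blocks while open and keeps four sub-blocks in its last block after the final close. A disk whose blocks have 8, 3, 0, 5 and 1 sub-blocks in use has one free full block and fifteen free sub-blocks that cannot serve a full block allocation.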
To avoid fragmentation of data blocks, defragmentation utilities combine and migrate, in a directed way, fragments occupying part of a block to form fully occupied blocks, and then free the sub-blocks previously used by the migrated fragments. The goal is to increase the number of free full blocks available for allocation. A naive method is to statically gather all information about the fragments in the file system and then combine them together, in the best way, to form entire blocks. However, this approach entails severe performance degradation and space penalties. The performance degradation arises because this method freezes file system usage during defragmentation. The space penalties are due to the large amount of space needed to store the fragment information that is used for migrating fragments.
Therefore, there is a need for an improved method and system that overcomes the deficiencies of the prior art methods so that shared disk parallel file systems can be defragmented easily and without the performance and space penalties of the prior art.
This invention provides a defragmentation utility that works on-line in parallel with other file system activities across multiple computer systems which share direct access to the storage devices which comprise the file system. Thus, it avoids making the file system unavailable for periods of time which would, if not for this invention, slow down data communication exchange and the execution of other tasks dependent upon the data. In order to accomplish the defragmentation process, the utility presented herein locks the file system structures for only short periods of time; system disturbance is thereby minimized and necessary data may be transferred without noticeable effects to the overall processing functions performed within a distributed computer architecture. Also, the defragmentation utility is memory efficient and does not require fully free blocks to perform its defragmentation function; rather, it operates upon sub-blocks of the data blocks that are fragmented. Finally, this utility minimizes the number of data transfers since each of these data movements implies more memory accesses with a corresponding memory access time; reducing the number of memory accesses reduces the total amount of time spent accessing a memory.
In particular, this invention steps through all of the valid inodes, finding each of the fragments. The defragmentation engine decides which fragments must remain in their current location and which fragments should migrate to another disk block sub-block location. Since the data blocks span across multiple disks, for each valid disk of the file system a set of disk blocks is constructed that are chosen to be filled, herein called plates. When the plates become full or reach a certain fullness, they are removed from the set and replaced by other disk blocks. When a disk block is removed from the plate set, it is moved to a "done" list as it is considered "full". While a disk block is in the done list, the fragments that belong to that block are not allowed to migrate. The defragmentation protocol as practiced in this invention includes: 1) if a current fragment belongs to a fully populated disk block, or to a plate, or to a "done" block, then do nothing; 2) if a current fragment belongs to an almost full block, that is, the block occupation is higher than a preestablished threshold, move the block to the done list; 3) attempt to find, in the plate list, a suitable hole at least as large as the current fragment; and 4) if the search for a suitable hole is successful, migrate the fragment into that hole and free the previously occupied sub-blocks.
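The four-step protocol above can be sketched for a single disk as follows. This is an illustrative sketch under stated assumptions, not the implementation of the invention: the names `Block`, `defragment`, `SUBBLOCKS_PER_BLOCK`, `ALMOST_FULL` and `plate_count` are hypothetical, the sub-block ratio and threshold values are assumed, the inode scan is replaced by a simple walk over in-memory blocks, the free sub-blocks of a block are treated as one hole, and the same threshold is used both for retiring plates and for step 2.

```python
from dataclasses import dataclass, field

SUBBLOCKS_PER_BLOCK = 8  # assumed ratio; not given in the text
ALMOST_FULL = 5          # assumed threshold: more than 5 used sub-blocks is "almost full"


@dataclass
class Block:
    block_id: int
    fragments: dict = field(default_factory=dict)  # fragment name -> sub-blocks occupied

    @property
    def used(self):
        return sum(self.fragments.values())

    @property
    def hole(self):
        # Simplification: all free sub-blocks of a block count as one hole.
        return SUBBLOCKS_PER_BLOCK - self.used


def defragment(blocks, plate_count=1):
    """Run one pass over the fragments of a single disk; return the moves made."""
    plates, done, moves = {}, set(), []
    # Partially filled blocks, fullest first, are candidates to become plates.
    candidates = sorted((b for b in blocks if 0 < b.used <= ALMOST_FULL),
                        key=lambda b: -b.used)

    def refresh_plates():
        # Retire plates that crossed the threshold to the done list...
        for block_id in [i for i, p in plates.items() if p.used > ALMOST_FULL]:
            done.add(block_id)
            del plates[block_id]
        # ...and replace them with other partially filled blocks.
        while len(plates) < plate_count and candidates:
            nxt = candidates.pop(0)
            if 0 < nxt.used <= ALMOST_FULL and nxt.block_id not in done:
                plates[nxt.block_id] = nxt

    refresh_plates()
    # Stand-in for the inode scan: visit every fragment once.
    for name, src in [(n, b) for b in blocks for n in b.fragments]:
        size = src.fragments[name]
        # Step 1: fragments in full, plate, or done blocks stay where they are.
        if src.hole == 0 or src.block_id in plates or src.block_id in done:
            continue
        # Step 2: an almost full block is moved to the done list.
        if src.used > ALMOST_FULL:
            done.add(src.block_id)
            continue
        # Step 3: look for a plate hole at least as large as the fragment.
        dst = next((p for p in plates.values() if p.hole >= size), None)
        if dst is None:
            continue
        # Step 4: migrate the fragment and free its previous sub-blocks.
        dst.fragments[name] = size
        del src.fragments[name]
        moves.append((name, src.block_id, dst.block_id))
        refresh_plates()
    return moves
```

For example, with five blocks holding fragments of 5, 2, 3 and 1 sub-blocks (one fragment per block) and one empty block, a pass with a single plate moves the 2-sub-block fragment into the block holding 5 and the 1-sub-block fragment into the block holding 3. The number of free full blocks rises from one to three with only two data movements, and no fully free block is needed as scratch space.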
Thus, the defragmentation utility summarized above works on-line, avoids locking data structures for long periods of time, is memory efficient, uses sub-blocks for fragment analysis and migration, and minimizes data movements. This utility thereby provides a transparent defragmentation function that operates in the background seamlessly with other file system operations.