This invention was not developed in conjunction with any Federally sponsored contract.
Not applicable.
The related applications, Ser. Nos. 09/561,184 and 09/616,131, filed on Apr. 27, 2000, and Jul. 13, 2000, respectively, by Benedict Michael Rafanello, et al., are incorporated herein by reference in their entireties, including drawings, and are hereby made a part of this application.
1. Field of the Invention
This invention relates to the arts of computer disk media, formatting of computer disks, organization of computer readable media by operating systems and device drivers, and the management of logical volumes of computer disks. In particular, this invention relates to improvements to the control of data and boot records stored in logical volumes when the logical volumes comprise multiple layers of aggregation.
2. Description of the Related Art
Persistent and mass data storage devices for computer systems, especially those employed in personal computers, are well known within the art. Many are disk-based, such as floppy disks, removable hard disk drives ("HDD"), and compact-disk read only memories ("CD-ROM"). FIG. 1 shows a typical personal computer system (1) architecture, wherein a CPU (2) interfaces to a variety of I/O devices such as a keyboard (3), monitor or display (5) and a mouse (4). The CPU (2) also may interface to a number of storage peripherals including CD-ROM drives (7), hard disk drives (6), and floppy drives (5). Typically, floppy disk drives interface to the CPU via Integrated Drive Electronics ("IDE") (8), but this interface may alternately be one of several other standard interfaces or a proprietary interface. The hard disk drives (6) and CD-ROM drives (7) may interface to the CPU (2) via an IDE or Small Computer System Interface ("SCSI"), as shown (9).
FIG. 2 shows a generalization of the hardware, firmware and software organization of a personal computer system (20). The hardware group (21) includes the persistent storage devices discussed supra, as well as other system hardware components such as a real-time clock, keyboard controller, display adapter, etc. A basic input/output system ("BIOS") (22) typically provides direct firmware control of these system components. An operating system (24) such as the IBM OS/2 operating system provides high-level management of the system resources, including the multi-tasking or multi-threaded scheduling and prioritization of the system application programs (25). Drivers (23) provide specific high-level interface and control functions for specific hardware, such as a manufacturer- and model-specific LAN interface card driver or CD-Rewritable ("CD-RW") driver. This generalized view of the system also applies to systems on alternate, non-IBM-compatible platforms, such as workstations, which employ a variety of operating systems such as Microsoft Windows, UNIX or LINUX. This general organization of computer system resources and software functionality is well understood in the art.
Turning to FIG. 3, disk-based mass storage devices such as hard disk drives, floppy disks and CD-ROMs are based physically on a rotating storage platter (30). This platter may be made of flexible mylar, as in floppy disks, or of more rigid material such as aluminum, glass or plastic, as in hard disk drives and CD-ROMs. For magnetic media, one or both sides of the platter are coated with a magnetic layer capable of recording magnetic pulses from a read/write head. For optical media, data recording is made using changes in reflectivity of a band of light, which is then read by a laser-based head. Writable and re-writable CD-ROM drives combine the technologies of magnetic disks and optical disks. In general, though, the organization of data on the disk is similar. The disk surfaces are divided into multiple concentric rings, or tracks (31). Some disk drives, such as hard disk drives, consist of multiple platters, in which case corresponding tracks on each platter are grouped into cylinders. Each track is divided into multiple sectors (32) in which data can be stored.
Turning to FIG. 4, a computer disk drive (41) is represented as an ordered collection of sectors numbered 0 through "n". The very first sector on the hard drive, sector zero, contains the Master Boot Record ("MBR"). The MBR contains partition definitions for the rest of the disk. TABLE 1 shows a sample partial MBR.
For the disk partitioning shown in TABLE 1, the MBR is located in the first sector, on side 0 at cylinder 0, sector 1. The MBR requires only one sector, but the entire track of 63 sectors is "blocked" for the use of the MBR, leaving 62 sectors of side 0, cylinder 0 unused.
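The side/cylinder/sector addressing described above can be related to a linear sector number with the conventional CHS-to-LBA conversion, sketched below. The 63 sectors per track follow the geometry described above; the head count of 255 is an illustrative assumption, not a value taken from TABLE 1.

```python
# Conventional CHS-to-LBA conversion for the 63-sectors-per-track
# geometry described above. The head count (255) is illustrative.
HEADS = 255
SECTORS_PER_TRACK = 63

def chs_to_lba(cylinder, head, sector):
    """CHS sector numbers are 1-based; LBA sector numbers are 0-based."""
    return (cylinder * HEADS + head) * SECTORS_PER_TRACK + (sector - 1)

# The MBR at side (head) 0, cylinder 0, sector 1 maps to LBA 0, the very
# first sector; the remainder of that track (LBA 1 through 62) is the
# 62-sector "blocked" but unused region.
```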
The partition table has entries in it defining two types of partitions: primary and extended. Conventional disk formatting schemes allow only one extended partition (411) to be defined. P1 (43) and P2 (44) are primary partitions. The order and locations of the primary and extended partitions may vary, but invariably there are entries in the partition table of the MBR which define them.
The extended partition (411) is defined in the partition table in the MBR as a single partition, using a single entry in the MBR partition table. Essentially, this entry in the MBR indicates to the computer operating system that other partitions and partition definitions can be found inside the extended partition. The operating system typically assigns logical drive letters and/or logical volumes to these partitions, or groups of partitions.
In order to determine the size and location of the partitions within the extended partition, the operating system accesses the first sector of the extended partition, which typically contains another boot record, known as an Extended Boot Record ("EBR"). The format of the EBR is similar to that of the MBR, and is also well known in the art.
FIG. 4 shows a first EBR (45), a second EBR (47), and a third EBR (49) within the extended partition (411). In practice, there may be fewer or more EBRs within an extended partition.
Each EBR contains a partition table similar to an MBR partition table. Conventionally, for computer drives commonly used in personal computers and workstations, only two entries may be in use in each EBR. One entry defines a logical partition, and the second entry acts as a link, or pointer, to the next EBR. FIG. 4 shows a pointer (412) from the second entry of the first EBR (45) to the beginning of the second EBR (47), and a similar pointer (413) from the second entry of the second EBR (47) to the beginning of the third EBR (49). The last EBR in the extended partition does not contain a pointer to a subsequent EBR, which indicates to the operating system that it is the last EBR in the extended partition. In this manner, the operating system can locate the definitions for an unlimited number of partitions or logical drives within the extended partition on a deterministic basis.
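The EBR chain described above is, in effect, a singly linked list, and walking it can be sketched as follows. The class and field names are illustrative, not taken from any actual on-disk structure.

```python
# A minimal sketch of walking the EBR chain described above.
# Class and field names are illustrative.

class EBR:
    """An Extended Boot Record with up to two entries in use: one
    defining a logical partition, one linking to the next EBR."""
    def __init__(self, logical_partition, next_ebr=None):
        self.logical_partition = logical_partition  # e.g. (name, size)
        self.next_ebr = next_ebr                    # link entry, or None

def enumerate_logical_partitions(first_ebr):
    """Follow the EBR link entries; the last EBR has no link, which
    deterministically terminates the walk."""
    partitions = []
    ebr = first_ebr
    while ebr is not None:
        partitions.append(ebr.logical_partition)
        ebr = ebr.next_ebr
    return partitions
```

This mirrors how an operating system locates every logical drive in the extended partition: it reads the first EBR, then repeatedly follows the second (link) entry until an EBR without a link is reached.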
In each partition table entry, whether it be in an EBR or an MBR, there are certain fields which indicate to the operating system the format, or file system, employed on the disk. For example, for DOS ("disk operating system") systems, the field may indicate that the file system is File Allocation Table ("FAT") formatted. Or, for systems which are running IBM's OS/2 operating system, the entry may indicate that the file system is High Performance File System ("HPFS") formatted. There are a number of well-known file system formats in the industry, usually associated with the common operating systems for computers such as Microsoft's Windows, IBM's OS/2 and AIX, variants of UNIX, and LINUX. Using this field, the operating system may determine how to find and access data files stored within the partitions of the primary and extended partitions on the computer disk. The file system format indicator is sometimes called the "system indicator".
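The system indicator is a single byte in the partition table entry, and interpreting it amounts to a table lookup, as sketched below. The byte values shown are conventional industry values; the function name is illustrative.

```python
# Illustrative lookup of the system indicator (file system format
# indicator) byte. These codes are conventional industry values.
SYSTEM_INDICATORS = {
    0x01: "FAT12",
    0x06: "FAT16",
    0x07: "HPFS/NTFS",
    0x0B: "FAT32",
    0x83: "Linux native",
}

def filesystem_for(indicator_byte):
    """Map a system indicator byte to a file system name."""
    return SYSTEM_INDICATORS.get(indicator_byte, "unknown")
```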
IBM's OS/2 operating system includes a function referred to as the Logical Volume Manager, or "LVM". For systems without an LVM, each of the partitions that is usable by the operating system is assigned a drive letter, such as "C:" or "F:", producing a corresponding drive letter for each partition on a disk in the computer system. The process which assigns these letters is commonly known. For systems with an LVM, a drive letter may be mapped instead to a logical volume, which may contain one or more partitions. The process by which partitions are combined into a single entity is known generically as "aggregation." Given the highly modular design of the OS/2 LVM, the functionality which performs aggregation is contained completely within a single module of the LVM program. The LVM calls any module which performs aggregation an "aggregator".
There are various forms of aggregation, such as drive linking, mirroring, and software Redundant Array of Independent Disks ("RAID"). The OS/2 LVM allows a single level of aggregation through the use of drive linking. Internally, the OS/2 LVM uses a layered model. Each feature offered by the LVM for use on a volume is a layer in the LVM. The input to a layer has the same form and structure as the output from a layer. The layers being used on a volume form a stack, and I/O requests are processed from the topmost layer down the stack to the bottommost layer. Currently, the bottommost layer is a special layer called the pass-through layer. The topmost layer is always the aggregator, which, in the current implementation, is always the drive linking layer. All of the layers in the middle of the stack represent non-aggregation features, such as Bad Block Relocation.
FIG. 5 illustrates the relationship of the layered model of the LVM and the aggregation of physical partitions into a logical volume (90). On the left, the "feature stack" is shown, having a "pass through" layer (97) at the bottom which interfaces directly to the disk devices or device drivers. Above the "pass through" layer (97) may be a feature (96), such as Bad Block Relocation ("BBR"). Above the feature may be a layer of aggregation, such as drive linking (95). From the view of the feature stack model, an I/O request (98) is received at the top of the stack and propagated downwards to the pass-through layer. Comparing that to a tree model of a logical volume (90), the aggregator A1 (91) corresponds to the aggregation layer (95), the feature layer (96) corresponds to the three interfaces between the aggregator A1 (91) and its partition definitions P1, P2, and P3 (92, 93, and 94, respectively), and the pass-through layer (97) corresponds to the interfaces between the partition definitions and the actual devices or device drivers. These types of LVM structures, feature stack models, and tree models are well understood in the art, and the models can be equally well applied to logical volume management systems in other operating systems such as Hewlett-Packard's HP-UX and IBM's AIX.
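The top-to-bottom propagation of an I/O request through the feature stack can be sketched as a chain of layer objects, each handing the request to the layer below it. The class structure and the trace mechanism are illustrative, not part of the actual LVM implementation.

```python
# A minimal sketch of the feature stack: each layer receives an I/O
# request and passes it to the layer below; the pass-through layer sits
# at the bottom. Names are illustrative.

class Layer:
    def __init__(self, name, below=None):
        self.name = name
        self.below = below
    def handle(self, request, trace):
        trace.append(self.name)          # record that this layer saw the request
        if self.below is not None:
            self.below.handle(request, trace)

# The stack of FIG. 5: drive linking over bad block relocation over pass-through
pass_through = Layer("pass through")
bbr = Layer("bad block relocation", below=pass_through)
drive_linking = Layer("drive linking", below=bbr)

trace = []
drive_linking.handle({"op": "read", "sector": 0}, trace)
```

A key property of this layering, noted above, is that the input to a layer has the same form as the output from a layer, which is what allows layers to be stacked in any order between the aggregator and the pass-through layer.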
Partitions which are part of a logical volume have a special filesystem format indicator. This indicator does not correspond to any existing filesystem, and it serves to identify the partitions as belonging to a logical volume. The actual filesystem format indicator for a logical volume is stored elsewhere. Furthermore, partitions belonging to a volume have an LVM Data Area at the end of each partition in the volume. The data stored in the LVM Data Area allows the LVM to re-create the volume every time the system is booted. Thus, the LVM allows groupings of partitions to appear to the operating system as a single entity with a single drive letter assignment.
In previous versions of the OS/2 operating system, a file system utility such as the FORMAT disk utility would access the partition table for the partition that was being formatted through low-level Input/Output Control ("IOCTL") functions. The system provides IOCTLs to allow a software application to read and write directly to the computer disk, bypassing the file system, rather than using file-based operations.
Using the IOCTL functions, an application program can actually access everything from the EBR that defines the partition being processed to the end of the partition itself. This allows disk utilities to find the partition table entry that corresponds to the partition they are processing, and alter it. For example, FORMAT will update the filesystem format indicator in the partition table entry for each partition that it formats successfully. While this method works fine for processing individual partitions, it creates problems when dealing with logical volumes. Logical volumes appear to the system as a single entity, which means that they will look just like a partition to older disk utilities, which will naturally try to treat them as such. However, since a logical volume may contain more than one partition, there is no EBR or partition table entry which describes it. If the older disk utilities are allowed to access the EBR or partition table entry for one of the partitions contained within the logical volume, the partition described in the partition table entry will not agree with what the disk utility sees as the partition. Furthermore, if the disk utility alters the partition table entry, such as when FORMAT updates the filesystem format indicator, the resulting partition table entry will not be correct. Thus, older disk utilities must not be allowed to access the EBR or partition table entry for a partition contained within a logical volume, yet they need an EBR and partition table entry in order to function correctly.
In the first version of the OS/2 LVM, this problem was solved by creating a "fake" EBR which contained a "fake" partition table entry that described the entire logical volume as if it were a single partition. This "fake" EBR was stored inside of the logical volume, on the first partition in the logical volume. The IOCTL functions were intercepted, and any requests for an EBR were redirected to the "fake" EBR. This allowed logical volumes to be treated as partitions by older disk utilities, thereby allowing them to function.
The currently available OS/2 LVM design supports only a single layer of aggregation. This places some limitations on what can be done. For example, if software RAID is used as the aggregator, then there is a limit on the number of partitions that can be aggregated into a single volume. However, if multiple levels of aggregation are allowed, then drive linking could be used to aggregate several software RAID aggregates into a volume, thereby providing a volume with all the benefits of software RAID without the limitations of software RAID.
The enhanced LVM described in the first related application provides for multiple layers of aggregation. However, the location of the fake EBR is found by the system software using a broadcast method. According to the broadcast method, when an I/O request to the EBR is detected by the multilevel LVM, each aggregator which does not find the "fake" EBR among its children duplicates the I/O request, flags it as an EBR I/O request, and issues the I/O request to each of its children in parallel. This parallel duplication and issuance of I/O requests may descend multiple levels of aggregation. Of all the parallel requests, only one will succeed, and the others will be discarded. When an aggregator finds the "fake" EBR among its children, it will redirect the I/O request to the "fake" EBR and turn off the EBR I/O request flag. When an I/O request reaches the pass-through layer, if the EBR I/O request flag is set, the pass-through layer will discard that I/O request. Thus, only one I/O request will succeed in reaching the "fake" EBR, and all of the duplicate I/O requests generated along the way will be discarded.
The broadcast method disclosed in the first related application is relatively simple to implement, and, since I/O requests to the EBR are rare, it is reasonably efficient in many applications. An alternative to issuing the duplicate EBR I/O requests in parallel is to issue them serially, stopping with the first one to succeed. In this case, the pass-through layer will fail any I/O request which has the EBR I/O flag set instead of discarding such requests.
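The broadcast method described above can be sketched as follows. All class names and the dictionary-based request representation are illustrative; the sketch models an aggregator duplicating a flagged request to every child, an aggregator that finds the fake EBR clearing the flag and redirecting, and the pass-through behavior of discarding any request whose flag is still set.

```python
# A minimal sketch of the broadcast method. Names are illustrative.

class Partition:
    """A leaf of the volume tree; the pass-through behavior is modeled
    here: a request still flagged as EBR I/O is discarded."""
    def __init__(self, fake_ebr=None):
        self.fake_ebr = fake_ebr
    def ebr_io(self, request):
        if request.get("ebr_flag"):
            return None               # pass-through layer discards flagged request
        return self.fake_ebr          # redirected request is fulfilled

class Aggregator:
    def __init__(self, children):
        self.children = children
    def ebr_io(self, request):
        results = []
        for child in self.children:
            if isinstance(child, Partition) and child.fake_ebr is not None:
                # fake EBR found among the children: clear the flag and redirect
                results.append(child.ebr_io(dict(request, ebr_flag=False)))
            else:
                # duplicate the request, flag it, and issue it to this child
                results.append(child.ebr_io(dict(request, ebr_flag=True)))
        # of all the duplicates, only one succeeds; the rest are discarded
        return next((r for r in results if r is not None), None)
```

As the sketch shows, a request fans out at every aggregation level, which is why heavy access to the fake EBR makes this method costly.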
However, the broadcast method may not meet the system requirements for systems in which the logical volumes are managed heavily, i.e., the fake EBR is accessed often. Because the broadcast method causes many replications of the I/O request from parents to children, the processing time or bandwidth required to process all of the replicated requests may become detrimental to system performance.
A "bottom-to-top" method for locating, managing and controlling the fake EBR is disclosed in the second related application. According to the "bottom-to-top" method, the fake EBR is stored in the LVM data area of a partition belonging to the volume that the fake EBR describes. This LVM data area will have a flag set (the EBR_PRESENT flag) to indicate that it contains the fake EBR. Aggregators will check for this flag among the partitions being aggregated and, if it is found, will set the EBR_PRESENT flag in the LVM data area of the aggregate being created. When an I/O request to the EBR is detected by the topmost aggregator, it will scan the children of the topmost aggregate to see which one has the EBR_PRESENT flag set in its LVM data area, mark the I/O request as an EBR I/O request, and then direct the I/O request to that child. Any other aggregators which may be encountered will see that the I/O request is an EBR I/O request, and they will automatically direct the I/O request to whichever of their children has the EBR_PRESENT flag set. Thus, the I/O request is propagated down through the multiple aggregation layers of the volume until it reaches the partition containing the fake EBR, at which point the I/O request will be fulfilled using the fake EBR instead of the real EBR.
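The EBR_PRESENT mechanism described above can be sketched as follows: the flag is propagated upward when each aggregate is created, and EBR I/O is then routed downward along the single flagged path. Class names and the flag representation are illustrative.

```python
# A minimal sketch of the "bottom-to-top" method. Names are illustrative.

class Partition:
    def __init__(self, fake_ebr=None):
        self.fake_ebr = fake_ebr
        # EBR_PRESENT flag in this partition's LVM data area
        self.ebr_present = fake_ebr is not None
    def ebr_io(self, request):
        return self.fake_ebr

class Aggregator:
    def __init__(self, children):
        self.children = children
        # at aggregation time, set EBR_PRESENT in the aggregate's LVM
        # data area if any child has it set (the upward propagation)
        self.ebr_present = any(c.ebr_present for c in children)
    def ebr_io(self, request):
        # direct the EBR I/O request to whichever child has the flag set
        for child in self.children:
            if child.ebr_present:
                return child.ebr_io(request)
        return None
```

Unlike the broadcast method, no duplicate requests are generated: the flags trace a single path from the topmost aggregator to the partition holding the fake EBR, but every aggregator must implement the flag checking and routing.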
However, this "bottom-to-top" method requires each aggregator to implement this flag-finding and flag-setting functionality, including modifications to their application programming interfaces ("APIs") as needed to support the new functionality. This solution, while an improvement over the broadcast method of the first related application, is still less than optimal due to complexity and inefficiencies. Under the current method, the topmost aggregator is responsible for maintenance of the fake EBR (e.g., updating the fake EBR after the volume is resized), while all of the aggregators on the path from the topmost aggregator to the partition containing the fake EBR, as well as the partition containing the fake EBR, are responsible for controlling access to the fake EBR. This division of responsibility is unnecessary, and it adds complexity to the input/output path.
Thus, there exists a need in the art for a multi-layer logical volume management system and method which allows for multiple levels of aggregation and allows for deterministic, efficient location and access of the LVM data area containing the fake EBR.
The foregoing and other objects, features and advantages of the invention will be apparent from the following more particular description of a preferred embodiment of the invention, as illustrated in the accompanying drawings wherein like reference numbers represent like parts of the invention.
The system and method for control of data and boot records associated with multi-layer logical volumes allows the logical volume data area containing the fake EBR to be deterministically and efficiently accessed by centralizing the maintenance and control of the fake EBR in the topmost aggregator. In previous designs, the maintenance of the fake EBR was the responsibility of the topmost aggregator or of a component above the topmost aggregator. However, responsibility for controlling the access to the fake EBR was distributed among multiple components. Furthermore, the fake EBR was stored in the LVM data area of a partition, not the LVM data area of the topmost aggregator. This resulted in inefficiencies as the topmost aggregator did not have direct access to, or control of, the fake EBR, yet was responsible for maintaining it and redirecting EBR I/O to it. Under this invention, the fake EBR is stored in the LVM data area of the topmost aggregator, and the topmost aggregator now has complete responsibility for the fake EBR, as well as direct access to and control of the fake EBR.
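The centralization described above can be sketched as follows: the fake EBR lives in the topmost aggregator's own LVM data area, so the topmost aggregator both maintains it and services all EBR I/O directly, without routing requests down through the aggregation layers. All names and structures are illustrative, not the actual implementation.

```python
# A minimal sketch of centralizing the fake EBR in the topmost
# aggregator. Names are illustrative.

class TopmostAggregator:
    def __init__(self, children):
        self.children = children                 # lower aggregates/partitions
        self.lvm_data_area = {"fake_ebr": None}  # fake EBR stored here

    def create_fake_ebr(self, volume_size):
        # maintenance: describe the entire volume as one partition;
        # after a resize, only this method needs to run again
        self.lvm_data_area["fake_ebr"] = {"start": 0, "size": volume_size}

    def ebr_io(self, request):
        # access control: EBR I/O is fulfilled here directly and is
        # never propagated to lower layers
        return self.lvm_data_area["fake_ebr"]
```

Compared with the earlier sketches, no flags, duplicated requests, or per-aggregator routing logic remain on the I/O path: one component holds, maintains, and serves the fake EBR.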