This invention relates generally to storage and retrieval of large databases, and more particularly to the use of temporary storage for moving large files.
As large computing enterprises continue their migration from centralized mainframes and volumes of data on direct access storage devices (DASD) and serial tape drives, the reliance upon open systems database technology has increased. The ability to quickly adapt mass storage systems to new platforms, while retaining high performance and reliability, will remain a key element for system integrators.
In earlier times, rooms full of DASD and tape drives were maintained by legions of storage operators who physically mounted and unmounted tapes and disk packs and moved them to and from libraries according to daily schedules and batch job instructions. Technology improvements allowed the use of self-contained "mass storage" units, using robotic arms to move archived storage media to and from the drive mechanisms in a matter of seconds. Further developments of storage media have enabled a cache model in which large masses of data are held in offline resources and smaller portions can be uploaded to the high-speed cache as necessary. Data availability has also been increased through the use of arrays of mirrored databases, either single or multi-threaded, for multiple simultaneous access capabilities.
Even with mirrored systems, operational concerns often require that the magnetic or other storage media be archived or "backed up." There are also occasional system reorganizations or restructurings during which a database may be converted by copying it out in one format ("export") and copying it back in ("import") with a different format, or into a different structure. Back-up issues will also arise when converting from one database management system (DBMS) to another, or when sharing databases. Application programs themselves may also request the operating system to make a large "save" of data files, usually after a modification is made to the file. Backing up large volumes of data can be a time-consuming and resource-intensive operation. Disadvantageously, mass storage systems may be unavailable while large back-up operations are performed.
FIG. 1 illustrates a typical system in which a host computer 10 is connected to a backup system 12 and a storage system 14. Creating a physical backup of the entire database 24 of the storage system 14 often requires a large investment of time and resources. In an open, networked array of storage devices, a physical backup of a database may be handled by arranging for a DBMS 22, such as provided by Oracle Corporation, to communicate with a dedicated back-up system 12, such as the EMC Data Manager, from EMC Corporation of Hopkinton, Mass. The DBMS vendor often supplies an Application Programming Interface (API) 20 that can be installed in the host computer to handle the scheduling of the regular backups. The DBMS typically reads the data from the database via the local DASD interface (such as a SCSI bus), and delivers a buffer of data through the API. The application running the backup may be customized or optimized for the particular mass storage system selected, such as the EMC Data Manager (EDM), which is optimized to run with EMC's Symmetrix storage system(s). The EMC backup application, or something similar, takes the necessary steps to send the data over the network using a connection-oriented protocol such as Transmission Control Protocol (TCP). The receiving backup system then sends the data to a mass storage unit 18, such as to write the data to an archive tape 18A. The major drawbacks of physical backup include that logical structures, such as tables of data, cannot be backed up. Further, data cannot be transferred between machines of differing operating systems. Additionally, data blocks cannot be reorganized into a more efficient layout.
Many of the major DBMS companies also provide a more generalized facility in which the data is exported as a standardized file, such as in ASCII format, as part of a so-called "logical backup." The ASCII format permits the file to be imported into most other systems, without insurmountable compatibility problems. However, DBMS companies presently do not generally provide the API necessary for a customer to properly handle the data stream generated by the logical backup. The result is that many of the DBMSs generate very large backup files that have to be stored locally until they can be written to an archive device.
To overcome this disadvantage, some customers create their own primitive solution by attaching a physical tape drive to the machine. The logical backup data stream is then directed into a mechanism that Unix calls a "pipe," buffered, and then directed ("piped") to another Unix command, such as one that writes the data to the local tape drive 16, a DASD, or other demountable, writable media. A Unix pipe can be thought of as a FIFO (first-in first-out) data file having one process writing data into it serially and another process serially reading data out. When the pipe is "empty," the reading process simply waits for more data to be written by the other process. Other non-Unix operating systems such as DOS and Windows NT emulate the Unix pipe in various ways with similar results. Logical data streams are thus directed from a database export into another process that disposes of the data to the physical storage media, thus freeing up storage resources.
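The FIFO behavior described above can be illustrated with a minimal Python sketch. It is not part of the described system: the function name `stream_through_pipe` is hypothetical, and a writer thread stands in for the exporting process. One side writes serially into an anonymous pipe while the other side reads serially and simply waits whenever the pipe is empty.

```python
import os
import threading


def stream_through_pipe(chunks):
    """Push chunks through an anonymous pipe and return what the reader saw.

    Hypothetical helper for illustration only: the writer thread plays the
    role of the process producing the logical backup stream.
    """
    read_fd, write_fd = os.pipe()

    def writer():
        # Closing the write end (on leaving the with-block) signals
        # end-of-file to the reader.
        with os.fdopen(write_fd, "wb") as w:
            for chunk in chunks:
                w.write(chunk)

    t = threading.Thread(target=writer)
    t.start()

    received = bytearray()
    with os.fdopen(read_fd, "rb") as r:
        while True:
            # Blocks while the pipe is empty; returns b"" only at end-of-file.
            block = r.read(4096)
            if not block:
                break
            received.extend(block)
    t.join()
    return bytes(received)
```

The bytes come out in the order in which they were written, which is the property that lets a downstream command consume an export stream without an intermediate file.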
This primitive solution has several disadvantages. For one thing, it requires a physical tape drive 16 to be attached to the computer host 10 generating the backup. Alternatively, the logical backup could be piped to a command that writes the data onto disk or equivalent. However, this solution would require each such machine to have huge amounts of excess storage capacity. In either case, additional operations personnel must be assigned to handling the tapes and disks, and maintaining the drives. Extra storage devices, media libraries, and personnel also take up extra space in the facility. Another alternative would be to pipe the logical backup data stream into the network interface and send it to a different machine having a DASD or tape. When dealing with very large databases, these solutions could break down entirely, due to the operational difficulties of maintaining the necessary physical media, or open network connections.
The named pipe provides a standard mechanism that can be used by processes that do not have to share a common process origin for process-to-process-to-device transfers of large amounts of data. The data sent to the named pipe can be read by any authorized process that knows the name of the named pipe. In particular, named pipes are used in conjunction with the Oracle DBMS import/export utility to perform the logical backups and restores necessary to restructure or reorganize very large databases (VLDBs). Typically, the user creates a named pipe and runs an export utility specifying the named pipe as the output device. The DBMS sees the pipe as a regular file. Another process, for example a standard Unix command such as dd (convert and copy a file), cpio (copy files in and out), or rsh (execute command on a remote system), then reads from the other end of the pipe and writes the data to actual media or the network. This technique is used to write export data to local disk/tape or over the network to available disk/tape on another machine.
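The named-pipe technique just described might be sketched as follows on a Unix-like system. This is an illustration under stated assumptions, not the Oracle utility itself: the function `export_via_named_pipe`, the pipe name `export.pipe`, and the writer thread that stands in for the export utility are all hypothetical. The reader side plays the role of a command such as dd, copying from the pipe to an archive file.

```python
import os
import shutil
import tempfile
import threading


def export_via_named_pipe(records, archive_path):
    """Stream records through a named pipe into archive_path.

    Hypothetical illustration: the exporter thread stands in for a DBMS
    export utility that was told to use the pipe as its output file.
    Requires a platform that supports os.mkfifo (Unix-like systems).
    """
    workdir = tempfile.mkdtemp()
    fifo_path = os.path.join(workdir, "export.pipe")  # assumed name
    os.mkfifo(fifo_path)

    def exporter():
        # The exporter treats the pipe like a regular file. The open()
        # call blocks until a reader has opened the other end.
        with open(fifo_path, "wb") as out:
            for record in records:
                out.write(record)

    t = threading.Thread(target=exporter)
    t.start()
    try:
        # Reader side: copy everything from the pipe to the archive.
        with open(fifo_path, "rb") as src, open(archive_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
    finally:
        t.join()
        shutil.rmtree(workdir)
    return os.path.getsize(archive_path)
```

Because the two ends only share the name of the pipe, the reader could equally be an unrelated process started from another shell.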
As mentioned, a disadvantage of the existing methods is the large amount of time it takes to perform backups, during which the database may be partially or completely offline due to read/write interlocks. Some of this delay can be reduced by segmenting export/backup files, and running several processes in parallel. Even though the logical backup process can be segmented into parallel streams by some DBMSs, the implementations may be proprietary and not necessarily adaptable for import to another DBMS. Also, disadvantageously, a dedicated disk or dedicated tape is required. Further, in known implementations, there is an inability to catalog multiple versions.
VLDB reorganization and restructuring are major operations for which there is no known highly efficient solution. Current solutions do not use the data management services of an Enterprise Data Management Tool (such as EDM). A VLDB administrator has the ability to divide large tables into separate physical partitions. The partitions can be backed up, stored, exported and imported separately from the rest of the table. Customers consider this a critical feature that should be heavily used. Existing backup systems and APIs do not have the ability to export or import partitions in parallel using all available tape drives. Any API solution would be necessarily proprietary to the DBMS vendor, and not generalizable to other systems.
The present invention provides a method and apparatus for directly connecting very large data streams from an archive command into a backup data system using an "intelligent process."
According to the invention, a pipe interface process supervises backup of each filled data store, while the remaining output stream continues to be piped into another available data store. The backup system completes archiving of the datastream, keeping a catalog of the datastream storage locations. To retrieve the data, the intelligent process is run in reverse as a pipe-writing process, requesting data from the backup system. Retrieved data traverses the data stores from the backup system and is pumped into the pipe-writing process for delivery to the pipe output file identified by the retrieve or import command.
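The flow summarized above can be sketched in simplified form. The following Python is an illustrative assumption rather than the claimed implementation: `BackupSystem`, `backup_stream`, `restore_stream`, and the location naming scheme are hypothetical, the data stores are modeled as in-memory byte segments of a fixed capacity, and the segments are handed over one at a time, whereas the text describes archiving a filled store while the stream continues into another store.

```python
import io


class BackupSystem:
    """Hypothetical stand-in for the backup system and its catalog."""

    def __init__(self):
        self.archive = {}   # location -> archived bytes
        self.catalog = {}   # stream name -> ordered list of locations

    def store(self, stream_name, segment):
        # Archive one filled data store and record where it went.
        locations = self.catalog.setdefault(stream_name, [])
        location = f"{stream_name}/segment-{len(locations):04d}"
        self.archive[location] = bytes(segment)
        locations.append(location)
        return location

    def retrieve(self, stream_name):
        # Yield archived segments in the order recorded in the catalog.
        for location in self.catalog[stream_name]:
            yield self.archive[location]


def backup_stream(pipe, backup, stream_name, store_capacity):
    """Read the pipe in store-sized segments and hand each to the backup."""
    while True:
        segment = pipe.read(store_capacity)
        if not segment:
            break
        backup.store(stream_name, segment)
    return list(backup.catalog.get(stream_name, []))


def restore_stream(backup, stream_name, pipe):
    """Run in reverse: write the cataloged segments back into a pipe."""
    for segment in backup.retrieve(stream_name):
        pipe.write(segment)
```

As a usage example, a 2,560-byte stream backed up with a store capacity of 1,000 bytes produces three catalog entries, and restoring the stream writes back the original bytes in order.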
An intelligent pipe process according to the present invention is a configurable client process that runs on each system that requires the enhanced backup or export services. The intelligent pipe is configured to manage the distribution of the piped data stream to and from pre-selected data stores, and to trigger the backup system for further data handling on output. The receiving process in the backup system is also configured to wait for storage requests from the intelligent pipe, schedule the storage, update the catalog, and report completion status back to the intelligent pipe.
In an alternative embodiment of the invention, many of the data-store management tasks and backup scheduling can be handled by a self-contained backup system adapted for the purpose, leaving less activity for the intelligent pipe process, thus freeing up additional resources in the host computer.
Features of the invention include a method and apparatus wherein temporary storage, communication, and archiving for output of a generic database export command pipe is managed in a manner transparent to the customer operating the host computer. The customer is relieved of the need to maintain additional large local storage devices for physical or logical backups. The customer does not have to construct and maintain specialized pipes or scripts for network backups, or for different physical tape devices or DASDs.
Implementing a pipe process rather than a specific export/import function reduces the number of proprietary restrictions and configuration difficulties that depend upon choice of database implementation. The backup and archive methodology remains independent of the database structure or selected DBMS. Similarly, there is no need to work with a different API from each DBMS vendor, nor any customization of the Unix operating system.
A plurality of parallel pipes according to the invention can be implemented for faster transfers of segmented database backups, where that feature is implemented in the DBMS. Furthermore, the invention enables any output-generating Unix-like process (application save, non-Oracle backup/restore) to have its results sent to the existing backup system, thus freeing up local resources automatically. Similarly, the same invention can be operated in reverse, enabling any Unix-like input-receiving process to obtain its data directly from a mass storage backup system. The temporary disk space needed to implement the data stores can also be designed as part of a backup system rather than the host computer, freeing up computer resources.
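The idea of several pipes operating in parallel on a segmented backup can be illustrated with a short Python sketch. It is only a sketch under stated assumptions: `pump_partition`, `backup_partitions`, and the `max_pipes` parameter are hypothetical names, producer threads stand in for the DBMS writing each segment, and the "backup" is simply the bytes collected from each pipe.

```python
import os
import threading
from concurrent.futures import ThreadPoolExecutor


def pump_partition(name, payload):
    """Move one partition's data through its own anonymous pipe."""
    read_fd, write_fd = os.pipe()

    def producer():
        # Stand-in for the DBMS exporting one segment into the pipe.
        with os.fdopen(write_fd, "wb") as w:
            w.write(payload)

    t = threading.Thread(target=producer)
    t.start()
    with os.fdopen(read_fd, "rb") as r:
        data = r.read()  # reads until the producer closes its end
    t.join()
    return name, data


def backup_partitions(partitions, max_pipes=4):
    """Drain one pipe per partition, running up to max_pipes at a time."""
    with ThreadPoolExecutor(max_workers=max_pipes) as pool:
        futures = [
            pool.submit(pump_partition, name, payload)
            for name, payload in partitions.items()
        ]
        return dict(f.result() for f in futures)
```

Each partition has its own pipe, so a slow segment does not hold up the others beyond the limit set by `max_pipes`.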