The present invention relates generally to database management systems for the storage of data objects, and particularly for the efficient management of access and control over data linked to a database system and stored remotely in a file system or another object repository. More specifically, the present invention relates to a system and associated method for parallelizing file or data archival and retrieval in an extended database management system.
Data is typically maintained for storage and retrieval in computer file systems, wherein a file comprises a named collection of data. A file management system provides a means for accessing the data files, for managing such files and the storage space in which they are kept, and for ensuring data integrity so that files are kept intact and separate. Applications (software programs) access the data files through a file system interface, also referred to as the application program interface (API). However, management of computer data using file management systems can be difficult since such systems do not typically provide sufficient information on the characteristics of the files (information called metadata).
A database management system (DBMS) is a type of computerized record-keeping system that stores data according to a predetermined schema, such as the well-known relational database model that stores information as a collection of tables having interrelated columns and rows. A relational database management system (RDBMS) provides a user interface to store and retrieve the data, and provides a query methodology that permits table operations to be performed on the data. One widely used query interface for an RDBMS is the Structured Query Language (SQL).
In general, a DBMS performs well at managing data in terms of data record (table) definition, organization, and access control. A DBMS performs well at data management because a DBMS associates its data records with metadata that includes information about the storage location of a record, the configuration of the data in the record, and the contents of the record. A file management system or file system is used to store data on computer systems. In general, file systems store data in a hierarchical name space. Files are accessed, located, and referenced by their unique name in this hierarchical name space.
As part of its data management function, a DBMS performs many automatic backup and copying operations on its tables and records to ensure data integrity and recoverability. Backing up data has become an integral part of safe computing, and is not merely reserved for mission critical applications.
Current computer users rely heavily on sophisticated backup and recovery solutions to ensure data access and integrity. For desktop systems, backup can be implemented on numerous data storage systems including diskettes, hard drives, magnetic tapes, optical drives, CDRs (writable compact disks), CDRWs (re-writable compact disks), or high capacity removable magnetic media. For networked computers, backup can span the network to larger drives on a file server, tape, or optical backup systems.
The following definitions and explanations provide background information pertaining to the technical field of the present invention, and are intended to facilitate the understanding of the present invention without limiting its scope:
A "Daemon" is an acronym for "Disk And Execution MONitor". It is a program that is not invoked explicitly, but lies dormant waiting for some condition(s) to occur. In other words, it is a process that is constantly running on a computer system to service a specific set of requests. In UNIX, for example, lpd is a daemon that manages printing requests. Daemons are self-governing functions. Although they are not part of an application program, daemons may service application requests.
An "agent" is an independent program or process that executes one or more tasks (such as information gathering from networks, databases, or the Internet), providing services for application programs or acting as a principal. In general, the term "Daemon" refers to a persistent agent that has a very long life, whereas an agent refers to a process that has either a short life or a long life. However, for the purpose of simplification, the following description uses the terms agent and Daemon interchangeably.
A "Copy Daemon" is also referred to herein as "copy agent", and represents a process that performs the task of archiving a file.
A "Retrieve Daemon" is also referred to herein as "retrieve agent", and represents a process that performs the task of retrieving or recovering a file.
"Hashing" is a method for delivering high-speed, direct access to a particular stored data record based on a given value for some field. Usually, but not necessarily, the field is a key. The following is a brief description of a typical hashing operation:
Each data record in a database is stored at a location determined by its hash value, which is calculated by applying a hash function to a selected field of that record (called the hash field). In order to store a record, the DBMS computes the hash value and instructs a file manager to place the record at the specific location corresponding to the calculated hash value. Given a hash field value, the DBMS can retrieve the record by repeating the same computation on the hash field and reading the location that corresponds to the result.
The hashing operation presents certain characteristics, among which are the following:
1. Multiple distinct records may be mapped to a single hash value; and
2. As the number of hash table entries increases, the number of records mapped to the same hash value decreases. Conversely, as the number of records increases, more records are mapped to each hash value (entry).
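The hashing operation and its two characteristics can be illustrated with a minimal sketch. The names `hash_slot` and `HashFile` are hypothetical, and CRC-32 is used only as a convenient deterministic hash function; the description above does not prescribe any particular function.

```python
import zlib
from collections import defaultdict


def hash_slot(key: str, table_size: int) -> int:
    """Map a hash-field value to a slot in the range [0, table_size)."""
    # CRC-32 is an assumption made for this sketch, not a requirement.
    return zlib.crc32(key.encode("utf-8")) % table_size


class HashFile:
    """Toy record store that places each record in the slot given by its hash."""

    def __init__(self, table_size: int):
        self.table_size = table_size
        self.slots = defaultdict(list)  # slot number -> list of (key, record)

    def store(self, key: str, record: object) -> None:
        # Distinct keys may land in the same slot (characteristic 1).
        self.slots[hash_slot(key, self.table_size)].append((key, record))

    def fetch(self, key: str):
        # Retrieval repeats the same hash computation, then scans one slot only.
        for stored_key, record in self.slots.get(hash_slot(key, self.table_size), []):
            if stored_key == key:
                return record
        return None

    def longest_chain(self) -> int:
        """Largest number of records sharing one slot."""
        return max((len(chain) for chain in self.slots.values()), default=0)
```

Storing the same set of keys in tables of increasing `table_size` and comparing `longest_chain()` is one way to observe characteristic 2: with more slots, fewer records tend to share a slot.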
Current technology, such as DataLinks, backs up files sequentially, one file at a time, which might not meet the demand of a large database, especially with the occurrence of a large number of concurrent transactions/users and/or a large number of files being updated per transaction. Typically, an updated file is not accessible by the users (other than the user updating the file) for further update or processing until the backup operation of the file is completed. In addition, a database or table space level backup operation cannot be completed until all the file backup operations are completed. Hence, serializing the file backup operation could adversely affect the overall DBMS performance.
It would therefore be desirable to effectively parallelize the backup operations while avoiding contentions between backup/copy agents, and to further enable the read back operation without searching all the backup targets where the files are stored.
It is one feature of the present invention to present a system and associated method for parallelizing file or data archival and retrieval in an extended database management system that satisfy this need. More specifically, the system includes a set of agents that selectively acquire the backup tasks from a queue. The chance of any two agents acquiring the same task is significantly reduced.
Once a specific copy agent is assigned the backup task, a backup process is implemented to determine the optimal way to write the backup file to a target, while avoiding write contention between two copy agents. This is in contrast to conventional backup methods according to which a single copy agent implements the backup operation sequentially, one file at a time.
In addition, subsequent to the backup operation, a need may arise to restore or retrieve the stored file. While in conventional systems a restore agent searches all the targets to find the desired file, the present invention enables an efficient and expeditious retrieval of the desired file without having to search all the targets.
To this end, the system and method of the present invention parallelize the file copying or backup operations with no additional latch or lock overhead and with no or minimal disk I/O contention. In addition, they provide a mechanism for efficiently locating the backup copy of a file when recovery or restore of the file is needed.
As an exemplary specific implementation, at a database manager or Datalink File Manager (DLFM) startup time, n Copy Daemons (or copy agents) are activated, where n is a user configurable parameter. The n Copy Daemons acquire their tasks from a common queue. To avoid the need to latch and unlatch for every access to the common queue, the present invention assigns work to the Copy Daemons using a hash function. The hash function generates a hash value based on a file name. The hash value ranges from 0 to m-1, where m is much greater than n (m >> n).
The m hash values are grouped into K bins, where K is greater than or equal to n (K >= n), in a round robin manner. Each of the K bins is assigned to a Copy Daemon. When a Copy Daemon reads a file name from the common queue, it applies the hash function to the file name to obtain a hash value, and then maps the hash value to a bin. The Copy Daemon will back up the file only if the hash value maps to a bin that is assigned to it. If so, the file is archived and its name is removed from the common queue. Otherwise, the Copy Daemon skips the file and moves to the next file in the queue.
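The task selection just described might be sketched as follows. This is an illustration under stated assumptions, not the actual DLFM implementation: the values of `M`, `K`, and `N`, the use of CRC-32, the modulo-based round robin grouping, and the function names are all hypothetical.

```python
import zlib

M = 1024  # number of hash values m, chosen so that m >> n (assumed value)
K = 8     # number of bins, K >= n (assumed value)
N = 3     # number of Copy Daemons n (assumed value)


def hash_value(file_name: str, m: int = M) -> int:
    """Hash a file name to a value in 0 .. m-1 (CRC-32 is an assumption)."""
    return zlib.crc32(file_name.encode("utf-8")) % m


def bin_of(h: int, k: int = K) -> int:
    """Group hash values into bins round robin: 0->0, 1->1, ..., k->0, ..."""
    return h % k


def owner_of(b: int, n: int = N) -> int:
    """Assign each bin to exactly one daemon; a daemon may own several bins."""
    return b % n


def copy_daemon_pass(daemon_id: int, queue: list, archive) -> list:
    """One scan of the common queue by a single Copy Daemon.

    The daemon archives and dequeues only files whose bin it owns,
    and skips every other entry without taking any latch.
    """
    claimed = []
    for file_name in list(queue):  # iterate over a snapshot of the queue
        if owner_of(bin_of(hash_value(file_name))) == daemon_id:
            archive(file_name)          # back up the file
            queue.remove(file_name)     # then remove its queue entry
            claimed.append(file_name)
    return claimed
```

Because the owner of a file is a pure function of its name, two daemons scanning the same queue never select the same file, which is why no per-entry latch is needed in this sketch.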
According to the present invention, files are first hashed to generate hash values that are then grouped into "bins". A Copy Daemon is responsible for one or more bins, but a bin is always assigned to exactly one Copy Daemon. This enables multiple Copy Daemons to implement file backups concurrently without any contention on the bins.
In addition, to achieve optimal I/O parallelism with no disk contention, bins could be mapped to disk arms. By mapping Copy Daemons to bins and bins to disk arms, I/O contention from different Copy Daemons at the disks is also avoided.
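One possible way to picture the mapping of bins to disk arms, and how it lets a backup copy be located without searching every target, is sketched below. The one-target-per-bin layout, the `/backup/armN` paths, the parameter values, and the function names are assumptions made for illustration only.

```python
import zlib

M = 1024  # number of hash values (assumed value)
K = 8     # number of bins (assumed value)
N = 3     # number of Copy Daemons (assumed value)

# Hypothetical layout: one backup target (disk arm) per bin.
TARGETS = [f"/backup/arm{b}" for b in range(K)]


def bin_of_file(file_name: str, m: int = M, k: int = K) -> int:
    """Hash the file name to 0 .. m-1, then group into one of k bins."""
    return (zlib.crc32(file_name.encode("utf-8")) % m) % k


def daemon_of_bin(b: int, n: int = N) -> int:
    """Each bin belongs to exactly one Copy Daemon."""
    return b % n


def target_of_bin(b: int) -> str:
    """Each bin writes to its own target, so no two daemons share a disk arm."""
    return TARGETS[b]


def locate_backup(file_name: str) -> str:
    """Recompute the bin from the file name to find the backup target directly."""
    return target_of_bin(bin_of_file(file_name))
```

Under this layout a retrieve agent calls `locate_backup` with the file name and goes straight to a single target, rather than probing all of them.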
The action of bringing a file into database control is termed "linking the file". Linking results from either an SQL insert operation or the database Load utility. When a file is "linked", a referential constraint is maintained between the file and the database record that references the file. An SQL insert statement could insert multiple records into a database table, which could result in linking of multiple files. An entry is also inserted into the archive table, which acts as a persistent common queue for all Copy Daemons. The common queue is sorted by the time at which the file is linked to the database. Copy Daemons do an uncommitted read from the archive table to avoid any latch or lock contentions.
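The role of the archive table as a persistent queue ordered by link time can be sketched as follows. SQLite (through Python's standard `sqlite3` module) stands in for the DBMS purely for illustration; the table layout, column names, and function names are hypothetical, and the uncommitted-read isolation used by the Copy Daemons is not modeled here.

```python
import sqlite3


def open_archive_queue() -> sqlite3.Connection:
    """Create an in-memory stand-in for the archive table (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE archive ("
        " file_name TEXT PRIMARY KEY,"
        " link_time REAL NOT NULL)"
    )
    return conn


def link_file(conn: sqlite3.Connection, file_name: str, link_time: float) -> None:
    """Linking a file also inserts an entry into the archive table."""
    conn.execute(
        "INSERT INTO archive (file_name, link_time) VALUES (?, ?)",
        (file_name, link_time),
    )
    conn.commit()


def pending_files(conn: sqlite3.Connection) -> list:
    """Read the queue in the order in which files were linked."""
    rows = conn.execute(
        "SELECT file_name FROM archive ORDER BY link_time, file_name"
    )
    return [row[0] for row in rows]


def mark_archived(conn: sqlite3.Connection, file_name: str) -> None:
    """Remove the entry once the file has been backed up."""
    conn.execute("DELETE FROM archive WHERE file_name = ?", (file_name,))
    conn.commit()
```

A Copy Daemon in this sketch would call `pending_files`, keep only the names whose bin it owns, back those files up, and call `mark_archived` for each.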
Though the present invention has been summarized with reference to a specific exemplary implementation, e.g. DataLinks technology, it should be clear that the present invention is similarly applicable to other systems that perform data archival and/or retrieval.
Briefly, the present invention provides a method to maximize throughput and to minimize contention (i.e., conflict) among agents as they store data into targets.