The database management system (DBMS) is a facility for storing large volumes of data and allowing multiple users to access and manipulate the data in an efficient and controlled fashion. Databases are traditionally considered as a large collection of (mainly disk-resident) shared data, managed and accessed by the DBMS.
In this application the following notions are used:
A database management system (DBMS) is an entity, which comprises one or more databases and/or data management systems, whereby the system is responsible for reading the data structures contained in the database and/or data management systems and for changing these data structures.
A database is an information structure, which comprises one or more data objects, and the use of which is controlled by the DBMS.
A data object is an information structure, which can comprise other data objects or such data objects, which can be construed as atomic data objects. For instance, in a relational database data objects represent tables comprising rows. The rows comprise fields, which are typically atomic data objects. A tuple is a data object that may contain other objects as elements, e.g. a tuple may be one row containing a single customer's data in a table.
A database operation is an event, during which data objects of the database are read from the database, during which data objects are modified, during which data objects are removed from the database, and/or during which data objects are added to the database. A set of database operations acting on the data objects is called a transaction. The transaction may comprise one or multiple operations. The transaction can also comprise other transactions.
A page is a collection of data objects. A page may contain zero, one or multiple data objects. At maximum, the page may contain all data objects of the storage.
A database table is a collection of zero or more data objects referred to as table rows.
A checkpoint is a process where altered pages are written from one storage unit, such as RAM, to another storage unit, such as a disk. Typically, the end result of a checkpoint is a snapshot of a database on the disk.
Referring to FIG. 1a there is depicted a common relational DBMS arrangement comprising a database server 12 and a database file unit 16. The database server comprises a primary storage unit 10 and a CPU unit 13. The database file unit 16 is a disk-based system where the persistent database data resides, and it is called a secondary storage. At the transactional level the application software unit 11 communicates with the database server 12 using an appropriate programming interface, e.g. Structured Query Language (SQL), and transactions (a in FIG. 1a) are able to access the primary storage. Once a transaction is successfully finished, i.e. committed (c in FIG. 1a), a transaction log 18 in the secondary storage is updated accordingly, provided that the transaction logging feature has been switched on. The database file 16 and the transaction log file 18 may reside on the same disk device or on different disk devices. Preferably, the data must be persistent, which means that the data is recoverable after a system shutdown. To ensure that the data in the database is persistent, checkpointing (b in FIG. 1a) is used to periodically flush changed data from the primary storage unit 10 to the database files 16. The purpose of the checkpointing is to provide a snapshot of the data of the database in the database files within the database file unit 16. According to the prior art both the checkpointing and the transaction logging are used to recover the data in the database in case of an uncontrolled DBMS shutdown, which is often referred to as a crash and can be caused by an application failure, an operating system failure, a hardware failure, or other such failure. The database is considered persistent if, after a single fault, i.e. a failure in any of the components of the DBMS, the data is recoverable from the secondary storage.
In traditional disk-based DBMSs the database server 12 comprises a disk unit as a primary storage unit and a random access memory (RAM) unit as a cache memory unit. The RAM unit is used as a buffer cache to the actual data on the disk of the database file unit 16. If the accessed data is not in the cache, it has to be fetched from the disk and it may take several milliseconds for the disk to seek and fetch the data. These disk-based relational DBMSs are also called disk-resident DBMSs, abbreviated as DRDBMS. To generalize, in DRDBMSs the data resides on the disk and is cached into RAM.
Today main-memory DBMSs, abbreviated as MMDBMS, are strengthening their position. The terms main-memory and in-memory are both widely used and mean the same thing in the context of DBMSs. In MMDBMSs the database server 12 comprises a random access memory (RAM) unit as a primary storage unit 10 where all data of the database is stored. The database files contained in the database file unit 16 and the transaction logs 18 provide a persistent backup of the data of the database. To generalize, in MMDBMSs the data resides in RAM and is backed up to the disk.
The present application concerns MMDBMSs. With ever increasing RAM sizes in modern computers, it has become increasingly common for the database to reside entirely in the main-memory RAM instead of on disks. Compared to the disk, the RAM offers superior performance, with much better access times on the order of a hundred nanoseconds on average. Also the maximum access time for the RAM is easy to define, whereas for the disk, which has a physically moving read/write head, this is difficult to accomplish. Disks are block-oriented, meaning that reading or writing a relatively large amount of data has the same high cost as reading or writing a single byte. For RAM the optimum access patterns are determined by cache memory units, but a typical cache line size is very small, from tens of bytes to a couple of hundred bytes.
In this application the term RAM means the same as the main-memory, because RAM is the method used to implement the main-memory, i.e. the primary storage unit 10. The secondary storage provided in the database file unit 16 is referred to as the disk, even though the disk is only one way to implement it among other block-oriented means having similar properties as disks, e.g. a flash-RAM. The transaction log 18 also resides in the secondary storage.
The checkpoint, in general, is any identifier or other reference that identifies a point in time or a state of the database. The checkpointing can be divided into two major classes, namely transaction consistent and non-consistent checkpointing. In transaction consistent checkpoints, the actions of each transaction are included either completely or not at all. In non-consistent checkpoints, actions and transactions can be partially included. Because read actions do not modify the data, they can mostly be ignored when considering checkpoints. The checkpointing process is typically a special thread or process that periodically performs the checkpointing of the database. There are different ways of triggering the beginning of the checkpoint, e.g. it can start whenever the transaction log has accumulated a predetermined number of records since the previous checkpoint. The term backup is often used as a synonym for the checkpointing, especially for main-memory databases.
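The log-size trigger mentioned above can be illustrated with a minimal sketch. It is not taken from any particular DBMS; the class name `CheckpointTrigger`, the `threshold` parameter and the `start_checkpoint` callback are hypothetical names introduced only for illustration.

```python
class CheckpointTrigger:
    """Starts a checkpoint once enough log records have accumulated.

    All names here are hypothetical; the text only states that a checkpoint
    may start after a predetermined number of log records.
    """

    def __init__(self, threshold, start_checkpoint):
        self.threshold = threshold                # records allowed between checkpoints
        self.start_checkpoint = start_checkpoint  # callback that begins a checkpoint
        self.records_since_checkpoint = 0

    def on_log_record(self):
        """Call once for every record appended to the transaction log."""
        self.records_since_checkpoint += 1
        if self.records_since_checkpoint >= self.threshold:
            self.start_checkpoint()
            self.records_since_checkpoint = 0


started = []
trigger = CheckpointTrigger(threshold=3,
                            start_checkpoint=lambda: started.append("checkpoint"))
for _ in range(7):
    trigger.on_log_record()
# Seven records with a threshold of three start two checkpoints,
# leaving one record counted towards the next one.
```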
Referring to FIGS. 2a-2c there is depicted a common database structure for modifying and checkpointing data in the relational MMDBMS according to the prior art. As shown in FIG. 2a the primary storage contains a page A 101 comprising a page header 102 and data objects 103, e.g. in this case DO1, DO2 and DO3. The page A may contain a plurality of data objects, but in this example a set of three data objects is used for simplicity. The page A may not actually be a physically contiguous area of the main-memory; instead it is a logical page, i.e. a linked list of data objects scattered around the main-memory. When the checkpointing process begins, this means at the storage level that a number of pages are initially included in the current checkpoint but are not yet written to the secondary storage. The writing of the pages of the checkpoint is a time-consuming process during which there may be transactions that need to modify the data of the pages of the checkpoint, for example by updating one or multiple data objects on the page A. The modification of data objects of the page A at the transaction level is omitted from FIGS. 2a-2c for simplicity.
FIG. 2b shows the situation at the storage level in the first step of a typical checkpointing method for a main-memory database. When the checkpointing process starts processing the page A, the page is copied in the primary storage so that the page A is presented as a physical page 105 forming a physically contiguous area of the main-memory in the primary storage unit 10. The page A 105 comprises data objects DO1, DO2 and DO3 arranged contiguously in a sequential order. Subsequently, the checkpointing process writes the page A to the secondary storage within the database file unit 16. As shown in FIG. 2c the page A 101 is finally written as a physical page A 109 to the memory space 107 of the disk file, where the page A 109 contains rows and each row contains one of the data objects DO1, DO2 and DO3. When the page A 101 is written as the page A 109 to the secondary storage, the page A 109 becomes a backup copy of the page A 101. In the simplest checkpointing methods, if a transaction requests a modification of a data object on the page, for example an update of a data object 103 on the page A, while checkpointing is active and the page has not yet been written to the secondary storage, the transactional modification is quiesced until the checkpoint has completed writing the page(s) that the transaction needs to modify. After the checkpointing process has moved on to process another page in the primary storage, the page A 101 can be modified, e.g. by a transactional update, in the primary storage.
To ensure the consistent checkpointing of the page 101, the modification of the page A is deferred during the checkpointing, and the modification has to wait until the checkpointing of the page A is completed. The checkpointing is not consistent if the page A is written (the terms copied and dumped are also used) to the disk while transactions are allowed to modify any data on the page A during the checkpointing. In this case, the checkpointing may be partially consistent, e.g. action consistent, whereas transaction consistent checkpointing requires all actions of each transaction to be consistent. Otherwise the checkpointing as a whole is considered non-consistent. Consequently, if consistent checkpointing is a requirement, then during the consistent checkpointing the data to be modified is locked in the main memory (primary storage) for writing to the disk (secondary storage). Thus, transactions are not able to perform write operations without waiting for the disk access, which slows down database operations, and a constraint for real-time operation is not met.
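The blocking behaviour of this first method can be sketched as follows. This is only an illustration under simplifying assumptions: a page is modelled as a Python list guarded by a lock, the disk as a dictionary, and the names `Page`, `checkpoint_page` and `try_update` are hypothetical. The update attempt is non-blocking so that the example stays deterministic; a real transaction would wait at that point.

```python
import threading


class Page:
    """Hypothetical in-memory page: a list of data objects plus a lock."""

    def __init__(self, objects):
        self.objects = list(objects)
        self.lock = threading.Lock()   # held while the page is written to disk


def checkpoint_page(page, disk, name, during_write=None):
    """Write a page to `disk` while holding its lock."""
    with page.lock:
        snapshot = list(page.objects)  # contiguous physical copy of the page
        if during_write is not None:
            during_write()             # simulates a request arriving mid-checkpoint
        disk[name] = snapshot          # stands in for the slow disk write


def try_update(page, index, value):
    """Non-blocking update attempt; returns False if the page is locked."""
    if not page.lock.acquire(blocking=False):
        return False                   # a real transaction would wait here
    try:
        page.objects[index] = value
        return True
    finally:
        page.lock.release()


page_a = Page(["DO1", "DO2", "DO3"])
disk = {}
attempts = []
# An update requested while the checkpoint holds the page is refused...
checkpoint_page(page_a, disk, "A",
                during_write=lambda: attempts.append(try_update(page_a, 0, "DO1'")))
# ...and succeeds only after the page has been written.
attempts.append(try_update(page_a, 0, "DO1'"))
```

The disk image stays consistent because no write can interleave with the checkpoint, but the writer pays for that with a delay bounded only by the disk access time.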
Referring now to FIGS. 3a-3d there is depicted another way of the prior art to make a consistent checkpointing while modifying data during checkpointing the relational MMDBMS. As shown in FIG. 3a the primary storage contains a page B 101 comprising a page header 102 and a number of data objects 103, e.g. in this case DO4, DO5 and DO6. The page B is in a form of a logical page which is a list of data objects DO4, DO5, DO6 floating around the main-memory. When the checkpointing of the database begins, this means in the storage level that the page B 101 is included in the current checkpoint but it is not yet written to the secondary storage. FIG. 3b shows page B 105 with data objects arranged contiguously. Meanwhile there is a transactional request for page modification, for example updating the page B or a data object 103 on the page B, in the transaction level.
FIG. 3a also shows the situation at the storage level when the first transactional modification to the page B occurs. The page B 101 is copied in the primary storage to a page B′ 101a. The page B 101 comprises data objects DO4, DO5 and DO6 that need to be written to the secondary storage in the checkpoint. Meanwhile the transactional request for page modification, for example adding or removing a data object 103 on the page B′ 101a, is accepted, and consequently the transactional modification of the page B is allowed during the checkpointing. When the first transactional modification to the page B during checkpointing occurs, the current page B is copied in the main memory (primary storage) as a page B′ 101a, which is initially an identical copy of the page B. Now the page B′ comprising data objects DO4, DO5, DO6 may be altered by transactional operations such as adding or removing a data object. Let's presume that in the meanwhile a transactional request for page modification is allowed, for example a request to update the data object DO4 on the page B′ by replacing it with a new data object DO4′. After the copy of the page B, i.e. the page B′, is ready in the primary storage, the transactional modification is performed on the page B′ 101a, i.e. the data object DO4 will be replaced by the new data object DO4′ in this exemplary case. At first the page B′ comprises data objects DO4, DO4′, DO5, DO6, as shown in the dash-lined box of page B′ 101a in FIG. 3a. After the transactional modification is committed at the transaction level during the current checkpointing, the data object DO4 has been replaced by the new data object DO4′ and removed, and finally the page B′ comprises data objects DO4′, DO5, DO6, as shown in the block of page B′ 101b in FIG. 3a. This means that both the page B 101 and the page B′ 101b are in the main memory (primary storage) at the same time: as shown in FIG. 3a, the page B′ 101b comprises data objects DO4′, DO5, DO6 and the page B 101 comprises data objects DO4, DO5, DO6. As a consequence, main memory resources are spent on both of these page copies for a period of time, until the checkpointing process has written the page B 101 to the secondary storage.
FIG. 3b shows a situation in the storage level for the first step of a typical checkpointing method for a main-memory database. When the checkpointing process starts processing the page B 101 the page is copied in the primary storage so that the page B 101 is presented as a physical page 105 forming a physical area of the main-memory in the primary storage.
When the page B 101 is being checkpointed, the checkpointing process writes the page B 101 to the secondary storage within the database file unit 16. As shown in FIG. 3c the original page B 101 is written as a physical page B 109 to the memory space 107 of the disk file, where the page B 109 resides containing data objects DO4, DO5 and DO6, i.e. it contains the data of the original page B. When the page B 101 is written as the page B 109 to the secondary storage, the page B 109 becomes a backup copy of the page B 101 of the primary storage. In this case, when the checkpointing process moves on to checkpoint another page in the primary storage, the page B′ 101a, 101b either has already been modified or can be modified, e.g. by a transactional update, in the primary storage. At the transaction level the transactions are free to perform any update operations, e.g. insert, update and/or delete, on the page B′ that is a copy of the page B. An optional transaction log 18 as shown in FIG. 3d lists information on all transactional modifications that have been committed during the database processing.
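The page-copying method of FIGS. 3a-3d can be summarized in a short sketch. It is a simplified model rather than the prior-art implementation itself: pages are Python lists, the disk is a dictionary, and the class `CowPage` and its method names are hypothetical.

```python
class CowPage:
    """Hypothetical page that is copied on the first write during a checkpoint."""

    def __init__(self, objects):
        self.live = list(objects)      # the page transactions read and write
        self.checkpoint_image = None   # original page B, frozen for the checkpointer
        self.in_checkpoint = False

    def begin_checkpoint(self):
        self.in_checkpoint = True

    def update(self, index, value):
        if self.in_checkpoint and self.checkpoint_image is None:
            # First write during the checkpoint: keep page B intact and
            # continue working on the copy, page B'.
            self.checkpoint_image = self.live
            self.live = list(self.live)
        self.live[index] = value

    def write_checkpoint(self, disk, name):
        image = self.checkpoint_image if self.checkpoint_image is not None else self.live
        disk[name] = list(image)       # stands in for the write to the disk file
        self.checkpoint_image = None   # the extra copy can now be released
        self.in_checkpoint = False


disk = {}
page_b = CowPage(["DO4", "DO5", "DO6"])
page_b.begin_checkpoint()
page_b.update(0, "DO4'")               # accepted immediately, applied to page B'
both_copies_held = page_b.checkpoint_image is not None
page_b.write_checkpoint(disk, "B")     # the disk receives the original page B
```

The transaction is never blocked, but between the first write and `write_checkpoint` both the original page and its copy occupy main memory.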
As a conclusion, according to the prior art the consistent checkpointing of the page 101, while a request for modification of the page occurs during the checkpointing, is guaranteed by using those two methods described above. The first method for ensuring the consistent checkpointing is depicted in FIGS. 2a-2c, where the modification of the page A is deferred during the checkpointing and the modification has to wait until the checkpointing is completed. The second method for ensuring the consistent checkpointing is depicted in FIGS. 3a-3d, where the request for modification of the page B involves the page B to be copied to the main memory (primary storage) as a page B′ which is initially an identical copy to the page B. After the copy of the page B, as the page B′, is ready in the primary storage, the transactional modification is performed to page B′.
FIGS. 4a-4d show a way of the prior art to make a so-called non-consistent checkpointing while modifying data during checkpointing of the relational MMDBMS. As shown in FIG. 4a the primary storage contains a page C 101 comprising a page header 102 and a number of data objects 103, e.g. in this case DO7 and DO8, in the form of a logical page as described earlier. When the checkpointing of the database begins, this means at the storage level that the page C is included in the list of pages to be checkpointed but is not yet written to the secondary storage. Meanwhile there is a transactional request for page modification, for example updating a data object 103 on the page C. The request is accepted and the transactional modification of the page C is allowed during the checkpointing. Let's presume that a transactional request for page modification is allowed, for example a request to update the data object DO7 on the page C by replacing it with a new data object DO7′. When the page C is copied in the primary storage, the transactional modification is performed on the page C 101, i.e. as shown in the block of page C 101a the data object DO7 is replaced by the new data object DO7′, and as part of this replacement DO7 is also removed from the page C. There is no guarantee that the transactional modification is also committed at the transaction level. Now the page C 101 comprises data objects DO7′, DO8 as shown in the block of page C 101b in FIG. 4a. FIG. 4b shows the situation at the storage level when the first transactional modification to the page C occurs and the physical page C 105 as described earlier now comprises data objects DO7′ and DO8. The checkpointing process writes the page C 105 to the secondary storage within the database file unit 16 as shown in the memory space 107 of the disk file of FIG. 4c. The backup copy of the page C 109 is not consistent with the original page C 101.
If the database needs to be recovered from the checkpoint, the inconsistent pages of the database must be “repaired” with information about transactions that occurred during the checkpoint. An exemplary transaction log 18 as shown in FIG. 4d lists information on all transactional modifications made during the checkpointing to the secondary storage. Each row 118 of the transaction log contains the following information concerning one transaction: a link to page C, an old version of the modified data object and a new version of the modified data object. This kind of prior art transaction log is a physical undo-redo log, which must be processed to be able to recover the database to the latest checkpoint. FIG. 4c shows a memory space 107 with page 109 and data object 103.
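How such a physical undo-redo log might be used to repair a non-consistent page can be sketched as follows. The record layout mirrors the row described above (page link, old version, new version); the `committed` flag, the names `LogRecord` and `repair`, and the representation of the disk as a dictionary of lists are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class LogRecord:
    page: str        # link to the page that was modified
    old: str         # old version of the data object (undo information)
    new: str         # new version of the data object (redo information)
    committed: bool  # assumption: the commit outcome is known at recovery time


def repair(pages, log):
    """Bring checkpointed pages to a transaction consistent state."""
    for rec in log:
        objects = pages[rec.page]
        if rec.committed and rec.old in objects:
            # Redo: the checkpoint missed a committed change.
            objects[objects.index(rec.old)] = rec.new
        elif not rec.committed and rec.new in objects:
            # Undo: the checkpoint captured an uncommitted change.
            objects[objects.index(rec.new)] = rec.old
    return pages


# The checkpoint captured DO7' although its transaction never committed.
undone = repair({"C": ["DO7'", "DO8"]},
                [LogRecord("C", "DO7", "DO7'", committed=False)])

# The checkpoint captured DO7, but the update to DO7' committed afterwards.
redone = repair({"C": ["DO7", "DO8"]},
                [LogRecord("C", "DO7", "DO7'", committed=True)])
```

Either way the checkpoint alone is not enough: recovery depends on the log having been written to the secondary storage.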
There are several disadvantages in the methods for making a consistent checkpointing of a relational MMDBMS described above. One of the main requirements for the MMDBMSs, as well as for any DBMSs, is that the data must be accessible and mutable with atomic, consistent, isolated and durable (ACID) transactions. For the transactions to meet real-time constraints, they must be able to perform read and write operations without waiting for a disk access. Even if the data is in the buffer cache, it is not necessarily mutable immediately if the data is locked for writing to the disk as part of the DBMS persistency mechanism. The aforesaid method of the prior art does not fulfill these requirements, because, to ensure the consistent checkpointing of the page, the modification of the page is stopped during the checkpointing and has to wait until the checkpointing is completed. The problem is that at the transactional level the modification operations are blocked while the checkpointing proceeds at the storage level, and consequently the real-time response for all database operations, especially write operations, is not guaranteed. This causes considerable delays to transaction level processing.
Another disadvantage is the main memory usage overhead caused by copying of pages during the checkpointing. The volatile RAM memory usage, on top of the user data, should be kept to a bare minimum compared to the disk space, which is usually available in large quantities. The aforesaid method of the prior art copies the current page in the main memory (primary storage) as an identical copy of the page when a transactional modification to the page B occurs during the checkpointing. Both the copy of the page and the original page are retained in the main memory until the checkpoint has been completed, and as a consequence main memory resources are spent on this extra page. Because each page to be checkpointed is copied upon the first write to the page, the memory consumption may double during the checkpointing. Furthermore, copying the whole page upon the first write to it causes all the data on the page, not only the data that is written, to be copied, causing extraneous CPU usage.
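The doubling of memory consumption follows from simple arithmetic, sketched below. The function name `peak_memory_bytes` and the page count and page size used in the example are hypothetical figures chosen only to illustrate the worst case.

```python
def peak_memory_bytes(page_count, page_size, pages_written_during_checkpoint):
    """Peak memory when every written page is copied once during a checkpoint."""
    # Each page is copied at most once, upon the first write to it.
    copies = min(pages_written_during_checkpoint, page_count)
    return (page_count + copies) * page_size


# Hypothetical database of 1000 pages of 8192 bytes each.
baseline = peak_memory_bytes(1000, 8192, 0)       # no writes during the checkpoint
worst_case = peak_memory_bytes(1000, 8192, 1000)  # every page written at least once
```

In the worst case the peak is exactly twice the baseline, even if each write touched only a single data object on its page.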
Still another disadvantage in prior art checkpointing is the need to use a transaction log to recover the database. The traditional approach to persistency in DBMSs is checkpointing and transaction logging. The checkpointing of the prior art is tightly coupled to the transaction logging. The transaction log, in particular a physical undo-redo log, which is written to the secondary storage, must be processed to be able to recover from the latest checkpoint. The requirement of always using transaction logging to ensure database consistency is not always acceptable from the application's point of view, primarily because transaction logging causes significant performance degradation of write transactions, as all transactions must be successfully written to the disk upon transaction commit.
The problems set forth above are overcome by providing a consistent checkpointing of a main-memory storage, preferably a main-memory database, without disturbing the transaction level processing.