By way of introduction, two basic database structures, namely a disk memory database and a primary memory database, will be described together with a few associated properties.
In a disk memory database, part of the information, namely the most recently used information, lies in a primary memory while the remainder lies in a disk memory. In this case, if the desired information does not happen to be already stored in the primary memory, it must be taken from the disk memory and entered in the primary memory. The new information taken from the disk memory is written over information that is present in the primary memory, and if this written-over information is to be used at a later stage, it must be fetched again from the disk memory. The memory content is divided into pages and the pages are handled independently of one another. For instance, new information can be taken into one page, while another page remains unaffected.
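The paging behaviour described above can be illustrated with a minimal sketch. It assumes a least-recently-used replacement policy, which the text does not prescribe, and it ignores the writing of modified pages back to the disk memory. The names `PageCache`, `disk` and `disk_reads` are hypothetical and chosen for illustration only.

```python
from collections import OrderedDict


class PageCache:
    """Toy primary memory in front of a dict that stands in for the disk memory."""

    def __init__(self, disk, capacity):
        self.disk = disk            # page number -> page content ("disk memory")
        self.capacity = capacity    # number of pages that fit in primary memory
        self.pages = OrderedDict()  # cached pages, least recently used first
        self.disk_reads = 0         # how often a page had to be fetched from disk

    def read(self, page_no):
        if page_no in self.pages:
            self.pages.move_to_end(page_no)  # mark as most recently used
            return self.pages[page_no]
        # Miss: fetch the page from disk. If primary memory is full, the least
        # recently used page is written over (assumed policy, not from the text).
        self.disk_reads += 1
        if len(self.pages) >= self.capacity:
            self.pages.popitem(last=False)
        self.pages[page_no] = self.disk[page_no]
        return self.pages[page_no]


disk = {0: "page-0", 1: "page-1", 2: "page-2"}
cache = PageCache(disk, capacity=2)
cache.read(0)  # miss
cache.read(1)  # miss
cache.read(0)  # hit, page 0 becomes the most recently used page
cache.read(2)  # miss, page 1 is written over
cache.read(1)  # miss again: page 1 has to be fetched from disk once more
```

Each page is loaded and replaced independently of the others, which is the property that later makes consistent check-points difficult in this kind of database.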
In the case of a disk memory database, the data-structure in the primary memory is normally a 1:1 mapping of the structure in the disk memory.
Since the memory content of the primary memory of a disk memory database is updated to the disk memory one page at a time, the links between different pages are inconsistent upon the restart of a system subsequent to a crash. Consistent check-points are thus often difficult to implement in conjunction with disk memory databases.
Consistent check-points are easier to achieve in a primary memory database. For instance, this consistency can be readily achieved by having two replicas of the database in disk memory, that is, a replica of the latest or current version of the database and a replica of the older version. This is possible because the whole of the database is found in the primary memory at once, so that different versions of the database, for which it is known that consistency prevails between all pages according to a given check-point, can be readily stored in the disk memory.
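The two-replica arrangement can be sketched as follows. This is a simplified illustration under the assumption that a check-point copies the whole database and that the switch to the new replica takes place only after the copy is complete. The in-memory list `replicas` stands in for the disk memory, and all names are hypothetical.

```python
import copy


class TwoReplicaCheckpointer:
    """Keeps the current and the previous check-point of a primary memory database."""

    def __init__(self):
        self.replicas = [None, None]  # two check-point slots in "disk memory"
        self.current = None           # index of the latest complete check-point

    def checkpoint(self, database):
        # Always write over the older slot, never the latest complete one.
        target = 1 if self.current == 0 else 0
        self.replicas[target] = copy.deepcopy(database)  # write all pages
        self.current = target  # switch only once the whole copy is in place

    def restore(self):
        if self.current is None:
            raise RuntimeError("no check-point has been taken yet")
        return copy.deepcopy(self.replicas[self.current])


db = {"page1": [1, 2], "page2": [3]}
cp = TwoReplicaCheckpointer()
cp.checkpoint(db)          # older version ends up in slot 0
db["page1"].append(99)
cp.checkpoint(db)          # current version ends up in slot 1
db["page2"].append(7)      # change made after the last check-point
restored = cp.restore()    # state as of the second check-point
```

If a crash were to interrupt a check-point halfway through, the other slot would still hold a version in which all pages are mutually consistent.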
It may be that certain attributes belonging to an object in a database are primary-memory based, while other attributes within the same object are disk-memory based. Tables that constitute a mixture between a disk-memory database and a primary memory database may also be found.
A database includes data tables that are formatted in different ways. A table is comprised of a number of columns, where each column contains a certain kind of information. Each column is allocated a specific attribute and the information in said column is formatted in accordance with this attribute.
Examples of attributes are that the information stored is formatted as a signed or unsigned integer, a floating-point number, a decimal number, a date, text, and so on.
Other factors determined by the attribute include:
when the attribute is a fixed attribute, i.e. when the content takes up a fixed memory size;
when the attribute is a variable attribute, i.e. the attribute can vary in size; or
when the attribute is a dynamic attribute, i.e. when the attribute is present or absent, such as a selectable or optional attribute.
A dynamic attribute may either be variable or fixed.
An object in a table corresponds to a row or line in the table and includes the attributes (columns) present in the table. An object can thus include a mixture of the aforesaid different attributes.
A table listing personal information concerning a group of persons is one example in this regard. Different attributes may include Christian name(s), surname, street address, postal address, telephone number, date of birth and other optional comments.
Date of birth is a fixed attribute (if it is entered on a predetermined format), while name, address and telephone number are variable attributes and comments is a dynamic variable attribute.
An object is then a person and information relating to that person.
There are many ways in which a table according to this example can be stored. The table can be stored in the fixed memory spaces in a consecutive sequence in a memory. This requires a large amount of memory space, the greater part of which will be empty in order to be able to prepare space for variable and dynamic attributes. The variable attributes will then only be variable in a limited way according to the size of the allocated memory space.
Alternatively, each object can be given a header that discloses the size of the variable attributes and also whether a dynamic attribute is present or not, and also the size of such attributes if present. Thus, when creating an object the memory space required for respective attributes is allocated and the attributes are combined to form an object. The header discloses the size of respective attributes, thereby enabling the information in the object to be interpreted correctly. The header may also include an index which includes a pointer that points to respective attributes.
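A header of this kind can be sketched for the personal-information example. The layout below is an assumption made for illustration: a date of birth stored as eight ASCII characters (fixed attribute), a name (variable attribute) and an optional comment (dynamic variable attribute). The header records the name length, a presence flag for the comment and the comment length. The names `HEADER`, `pack_person` and `unpack_person` are hypothetical.

```python
import struct

# Header: name length (uint16), comment-present flag (uint8), comment length (uint16)
HEADER = struct.Struct("<HBH")
DOB_SIZE = 8  # fixed attribute: date of birth as YYYYMMDD


def pack_person(dob, name, comment=None):
    """Combine the attributes into one object preceded by a header."""
    dob_b = dob.encode("ascii")
    if len(dob_b) != DOB_SIZE:
        raise ValueError("date of birth must have the form YYYYMMDD")
    name_b = name.encode("utf-8")
    comment_b = b"" if comment is None else comment.encode("utf-8")
    header = HEADER.pack(len(name_b), int(comment is not None), len(comment_b))
    return header + dob_b + name_b + comment_b


def unpack_person(buf):
    """Use the header to interpret the bytes of the object correctly."""
    name_len, has_comment, comment_len = HEADER.unpack_from(buf, 0)
    pos = HEADER.size
    dob = buf[pos:pos + DOB_SIZE].decode("ascii")
    pos += DOB_SIZE
    name = buf[pos:pos + name_len].decode("utf-8")
    pos += name_len
    comment = buf[pos:pos + comment_len].decode("utf-8") if has_comment else None
    return dob, name, comment


record = pack_person("19700101", "Kalle", "prefers e-mail")
short_record = pack_person("19700101", "Kalle")  # dynamic attribute absent
```

Only the space actually needed by each attribute is allocated, in contrast to the fixed-size layout in which most of the reserved space stays empty.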
When storing a table in a memory in practice, it is seldom that the memory has continuous memory space for storing the entire table, since a memory is often fragmented to a greater or lesser extent. In the majority of cases, a table is too large to be accommodated on one page. The various objects in a table are thus spread out within a page, and a table is divided up over several such pages.
A fragment is a part of a table and also includes several different pages. In distributed databases, a table can be distributed over several processor nodes in the distributed database. A fragment is then a part of a table found in a node. A fragment also includes all replicas of the same part of the table. Thus, a fragment can include a primary replica of a part of a table stored in a node and a secondary replica of the same part of the same table stored in another node. Several different replicas of different fragments of the same or different tables may be stored in one node.
In the case of large objects, it is not unusual to find it necessary to divide-up the various objects and store these object-divisions at different locations in a page, or to divide the object up between different pages. For instance, a large object may be one where one or more attributes is/are comprised of a text file that can include several thousand characters.
In the case of large objects, it is known in connection with disk memory databases to use advanced data-structures that build a tree-structure, such as a B-tree, for a table and for the objects in a table.
An object is then divided-up into different parts and these parts stored in available memory spaces in accordance with the tree-structure. The header for respective objects will then include a pointer which points down in a B-tree, thereby enabling the different parts of the object to be found. The parts need not necessarily constitute a whole attribute and the object can be handled as one single, continuous character string and divided-up in any desired way according to the memory spaces that are available.
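The division of an object into parts according to the available memory spaces can be sketched as below. For brevity the sketch replaces the B-tree with a flat, ordered list of parts, so it illustrates only the splitting and reassembly and not the tree-structure itself. The names `split_into_parts`, `reassemble` and the slot identifiers are hypothetical.

```python
def split_into_parts(data, free_slots):
    """Cut one continuous byte string into parts that fit the given free slots.

    free_slots is a list of (slot_id, size) pairs describing available memory
    spaces. The returned list of (slot_id, part) pairs plays the role of the
    index that the object header would point to.
    """
    parts = []
    pos = 0
    for slot_id, size in free_slots:
        if pos >= len(data):
            break  # the whole object has been placed
        parts.append((slot_id, data[pos:pos + size]))
        pos += size
    if pos < len(data):
        raise MemoryError("not enough free memory space for the object")
    return parts


def reassemble(parts):
    """Follow the parts in order to rebuild the original object."""
    return b"".join(part for _, part in parts)


data = b"0123456789"
parts = split_into_parts(data, [("p1", 4), ("p2", 3), ("p3", 8)])
```

The cuts fall wherever the free spaces end, so a part does not have to coincide with a whole attribute.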
This procedure is normally applied when an object is so large as to cover a full page and where a change is written directly into the disk memory. If the information is not written directly into the disk memory, some type of log is required.
In such cases where the object is very large, very comprehensive log files are required.
When changes are made to a fragment, such as when adding or removing an object or when changing, updating, an attribute in an object, it will sooner or later be necessary to store the change in a lasting manner, regardless of whether the change concerns a disk memory database or a primary memory database.
By long-lasting it is meant that the information will remain stored even if parts of the system or the whole of the system crashes, for instance. However, long-lasting storage is not obtained by storing in the primary memory, but only by storing in the disk memory. It takes time to write into and read from a disk memory, and writing is therefore done only at certain time points. The disk memory is updated from the primary memory one full page at a time.
If a processor crashes, it is necessary to carry out all committed transactions while aborting all non-committed transactions. Two types of information are stored to handle this event.
REDO-information is used to enable committed changes that have been carried out in the primary memory but still not stored in the disk memory to be redone in conjunction with restoring the information that has been lost when a processor crashes.
UNDO-information is used to undo changes that are still not committed but have been written in the disk memory, in conjunction with restoring information that is lost when a processor crashes.
This information is stored as log-information and is usually stored in a so-called fragment log where all changes that affect a fragment are stored in a log. Log-information can be collected and stored in many different ways.
A physical log stores all changes that take place. The log works on a bit level and stores the information as it was prior to the change and as it is after the change. The physical log logs all changes, even a defragmentation of a fragmented file or a fragmented memory space, which is not a true information change but merely a redistribution of memory space. It will be understood that such a log is very capacity-demanding, both with respect to memory space and processor time.
A physiological log stores only changes of data within a part of a memory space, and not memory space redistributions. An internal address of a relevant part, such as an index to a page index of said relevant part is saved in the log, and also the change that took place in said part.
A logical log stores only a change of an attribute within an object.
For instance, if an object having the key "Kalle" and an attribute having the value "12" are changed, so that the attribute becomes "14", "Kalle" and the attribute "12" are stored in the UNDO-log and "Kalle" and the attribute "14" are stored in the REDO-log, simply put.
This enables the value of the attribute to be reset to "12" through the UNDO-log and set to "14" through the REDO-log. This information is not tied to the object with the key "Kalle" being stored in a specific position in a memory; instead, the object having the key "Kalle" is sought when the log is used, and the necessary correction is then carried out.
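The "Kalle" example can be written out as a minimal sketch of a logical log. A Python dictionary stands in for the table, and the function names `update`, `apply_undo` and `apply_redo` are hypothetical. Commit handling and crash recovery are deliberately left out.

```python
table = {"Kalle": 12}  # key -> attribute value
undo_log = []          # (key, value before the change)
redo_log = []          # (key, value after the change)


def update(key, new_value):
    """Change an attribute and log the change logically, by key."""
    undo_log.append((key, table[key]))
    redo_log.append((key, new_value))
    table[key] = new_value


def apply_undo(tbl):
    """Reset attributes to their old values, latest change first."""
    for key, old_value in reversed(undo_log):
        tbl[key] = old_value  # the object is looked up by key, not by position


def apply_redo(tbl):
    """Set attributes to their new values in the original order."""
    for key, new_value in redo_log:
        tbl[key] = new_value


update("Kalle", 14)
```

Because each log record names only the key and the attribute value, the same log can be applied to any copy of the table regardless of where the object happens to be stored.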
When an object has been divided into several different pages, it is necessary to download all pages into the disk memory simultaneously, in order for consistency to prevail between the pages. Thus, a check-point, a frozen time point, is created at the start of downloading to the disk memory, and all pages within a fragment are then written down into the disk memory.
There are two types of consistent check-points, namely action consistent check-points and transaction consistent check-points.
In a transaction consistent check-point, all transactions that affect an object within a fragment that is being downloaded into the disk memory shall be stopped and all ongoing transactions completed. The actual process of writing into the disk memory is then carried out, which may take several minutes in the case of large fragments. This means corresponding waiting times for stopped transactions.
Corresponding action consistent check-points are carried out in the same way, although with the difference that a stop is permitted midway in a transaction while waiting for the finalisation of all ongoing actions, parts of a transaction. Thus, waiting times in conjunction with action consistent check-points are shorter than the waiting times involved in transaction consistent check-points.
It is known to use a so-called fuzzy check-point in combination with a local physiological log.
In co-operation with the writing of all pages belonging to a fragment into a disk memory, the local physiological log enables the use of a fuzzy check-point, which eliminates waiting times in respect of actions that affect objects to be written into the disk memory.
It is also known to create in conjunction with primary memory databases a new replica of an object in the event of an object change, where the new replica includes the object after the change and the old replica includes the object before the change, or vice versa, with a link between the two replicas.
Since the two replicas include both the old and the new information, no logical UNDO-information is needed in this case, but solely a logical REDO-log. This is difficult to use in disk memory databases, since it is probable that the new replica will land on a page different to that of the old replica. It is possible in primary memory databases, however, since these databases often use transaction consistent check-points, which are seldom found in disk memory databases.
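The reason why only a REDO-log is needed can be sketched as follows. The sketch assumes one pending change per object at a time and uses hypothetical names (`VersionedObject`, `committed`, `pending`). The attribute `pending` plays the role of the link from the old replica to the new one.

```python
class VersionedObject:
    """Keeps the old replica and, while a change is pending, a linked new replica."""

    def __init__(self, value):
        self.committed = value    # old replica: the object before the change
        self.pending = None       # new replica: the object after the change
        self.has_pending = False

    def write(self, value):
        # The change creates a new replica; the old one is left untouched.
        self.pending = value
        self.has_pending = True

    def abort(self):
        # Undoing needs no log: the new replica is simply dropped.
        self.pending = None
        self.has_pending = False

    def commit(self, key, redo_log):
        # Only the new value is logged, so that the change can be redone.
        if self.has_pending:
            redo_log.append((key, self.pending))
            self.committed = self.pending
            self.abort()


redo_log = []
obj = VersionedObject(12)
obj.write(14)
obj.abort()                    # the old replica still holds 12
value_after_abort = obj.committed
obj.write(14)
obj.commit("Kalle", redo_log)  # the new replica becomes the committed one
```

The old replica itself serves as the UNDO-information, which is why it has to remain reachable together with the new replica until the transaction ends.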
The term "location and replica independent log" will also be described. By this is meant a log that can be used in the recreation of information that was lost in conjunction with the crash of a fragment or in the total crash of an entire system, where this recreation of said information can be effected in any processor and associated memories (location independent) and with a starting point from any replica, such as a primary or secondary replica, of the log (replica independent).
The following publications describe the handling of log-information and different data-structures. These publications can be considered to teach part of the known technology.
"ARIES: A Transaction Recovery Method Supporting Fine Granularity Locking and Partial Rollbacks Using Write-Ahead Logging", C. Mohan, Don Haderle, Bruce Lindsay, Hamid Pirahesh, Peter Schwarz, ACM Transactions on Database Systems, March, 1992, Vol. 17, No. 1, p. 94.
"The Architecture of the Dali Main-Memory Storage Manager", P. Bohannon, D. Lieuwen, R. Rastogi, S. Seshadri, A. Silberschatz, S. Sudarshan, Memoranda from Lucent Technologies, http://www.bell-labs.com/project/dali/papers.html
"Transaction Processing: Concepts and Techniques", J. Gray, A. Reuter, Morgan Kaufmann, 1993.
"The Lorel Query Language for Semistructured Data", S. Abiteboul, D. Quass, J. McHugh, J. Widom, J. Wiener, Technical Report from Department of Computer Science, Stanford University.
"Main Memory Database Systems: An Overview", H. Garcia-Molina, K. Salem, IEEE Transactions on Knowledge and Data Engineering, Vol. 4, No. 6, December, 1992.
"INFORMIX-Online Dynamic Server, Database Server", Informix Software, Inc., December, 1994.
"Recovery in Parallel Database Systems", S-O. Hvasshovd, Vieweg, ISBN 3-528-05411-5.
"An Evaluation of Starburst's Memory Resident Storage Component", T. J. Lehman, E. J. Shekita, L-F. Cabrera, IEEE Transactions on Knowledge and Data Engineering, Vol. 4, No. 6, December, 1992.
"System Support for Software Fault Tolerance in Highly Available Database Management Systems", M. P. Sullivan, Ph.D. Report.