With the advent of databases, it is incumbent on the user or operator of a database to INSERT user-supplied data into the database in a form that is consistent with the internal storage form used by the database. A Database Management System (DBMS) or a LOAD utility is used to convert the user-supplied data into the internal database form. Parsing, converting, and formatting data into this internal form are CPU-intensive operations. LOAD utilities that read user-supplied data from source media, such as disks or tapes, incur the overhead of these conversions, as do INSERTs issued to the DBMS. Because of these CPU-intensive operations, insertion of user data may become a CPU-bound activity.
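By way of illustration only, the three steps of parsing, conversion, and formatting may be sketched as follows. The row layout (a 4-byte integer, an 8-byte float, and a 10-byte name), the comma delimiter, and all function names are assumptions made for this sketch and are not taken from any particular DBMS or LOAD utility.

```python
import struct

# Hypothetical internal row layout (an assumption for this sketch):
# 4-byte signed int, 8-byte float, 10-byte NUL-padded name, little-endian.
ROW_FORMAT = "<id10s"


def parse(line):
    """Parse: split one comma-delimited source line into text fields."""
    return [field.strip() for field in line.rstrip("\n").split(",")]


def convert(fields):
    """Convert: turn the text fields into typed values."""
    row_id, amount, name = fields
    return int(row_id), float(amount), name.encode("ascii")


def format_row(values):
    """Format: pack the typed values into the fixed-width internal form."""
    return struct.pack(ROW_FORMAT, *values)


def load_line(line):
    """Run all three steps on a single source line."""
    return format_row(convert(parse(line)))
```

Each source line passes through all three steps before it can be stored, which is why the work grows with the volume of user-supplied data.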
It is well known in the art that database records are grouped into pages of predetermined size. A number of pages are grouped together into a table, and each page in the table contains a number that denotes its sequence in the table. Pages of a table are typically written to persistent storage (such as disk).
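The grouping of records into sequenced pages may be pictured with the following minimal sketch. The `Page` class, the `build_table` helper, and the choice of counting records rather than bytes per page are assumptions made for illustration; an actual DBMS sizes pages in bytes and stores additional control information on each page.

```python
from dataclasses import dataclass, field

RECORDS_PER_PAGE = 4  # hypothetical capacity; real pages are sized in bytes


@dataclass
class Page:
    sequence: int  # position of this page within its table
    records: list = field(default_factory=list)


def build_table(records, per_page=RECORDS_PER_PAGE):
    """Group records into pages of fixed capacity, numbering pages in order."""
    pages = []
    for start in range(0, len(records), per_page):
        chunk = list(records[start:start + per_page])
        pages.append(Page(sequence=len(pages), records=chunk))
    return pages
```

For example, ten records with a capacity of four records per page yield three pages numbered 0, 1, and 2, the last of which is only partly filled.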
Symmetric multiprocessors (SMPs) comprise a class of computers containing multiple CPUs. Operating systems running on SMPs dispatch processes and threads to different CPUs in order to distribute the assigned workload across the available processors. For a given program to exploit the power of an SMP, it is advantageous for the program to be designed so that it performs portions of its workload in separate dispatchable units of work, which the operating system of the SMP can distribute to the various CPUs in the SMP.
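A minimal sketch of dividing a workload into separately dispatchable units follows. The helper names and the sum-of-squares stand-in for real work are assumptions made for illustration. Threads are used here for brevity; in standard CPython, threads do not execute Python bytecode on several CPUs at once, so a CPU-bound program would more likely use `ProcessPoolExecutor`, which offers the same interface.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def split_into_units(items, unit_count):
    """Split items into at most unit_count contiguous, near-equal chunks."""
    unit_count = max(1, min(unit_count, len(items)))
    base, extra = divmod(len(items), unit_count)
    units, start = [], 0
    for i in range(unit_count):
        end = start + base + (1 if i < extra else 0)
        units.append(items[start:end])
        start = end
    return units


def process_unit(unit):
    """Stand-in for CPU-intensive work on one unit: sum of squares."""
    return sum(x * x for x in unit)


def run_parallel(items, workers=None):
    """Dispatch each unit to a worker and combine the partial results."""
    workers = workers or os.cpu_count() or 1
    units = split_into_units(items, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(process_unit, units))
```

The operating system is free to schedule each worker on any available CPU; the program only has to present its workload as independent units.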
In many cases it would be a definite advantage, when data is loaded into the database of an SMP system, to be able to perform the parsing, conversion, and formatting processes in concurrent dispatchable operating system units in order to exploit the advantages of the SMP system hardware and operating system.
The difficulty is that designing a method and apparatus to load data by processing it in concurrent dispatchable operating system units is nontrivial, because of the numerous items of state information that must be maintained as part of the database table meta-data (stored data that describes the database table concerned), such as free space control records, a table descriptor record, etc., as will be appreciated by those skilled in the database art. Despite the complexity of the problem, solutions to it have been attempted in the past.
However, in the normal case, when true parallel processing (i.e., truly decoupled concurrent processing) is added, the data is processed by each CPU in the system independently, resulting in the data being loaded into the database table in an arbitrary sequence. This means that the data is stored in an arbitrary logical sequence in the table and, as well, in an arbitrary physical sequence on the database storage device used by the data processing system.
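The loss of ordering can be seen in the following minimal sketch, in which each record is formatted by an independent worker and appended to the table in whatever order the workers happen to finish. The `format_record` stand-in and the `load_unordered` helper are assumptions made for illustration and do not represent any particular prior art loader.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def format_record(record):
    """Stand-in for parsing, converting, and formatting one record."""
    return record.upper()


def load_unordered(records, workers=4):
    """Append formatted records in completion order, not source order."""
    table = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(format_record, r) for r in records]
        # as_completed yields futures as they finish, so the resulting
        # sequence depends on scheduling rather than on the source order.
        for future in as_completed(futures):
            table.append(future.result())
    return table
```

Every record is loaded exactly once, but nothing in this arrangement guarantees that the table preserves the sequence in which the records were supplied.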
As will be appreciated by those skilled in the art reviewing this application, the arbitrary sequence of data, both logically in the database table and physically on the storage device used by the data processing system to store database information, can pose a problem if the source data was intentionally supplied by the user in a significant order (such as cluster order). It can also result in poor exploitation of the buffer pool and I/O prefetchers used by the data processing system to accelerate or optimize data retrieval. The net result is that corruption of the sequence of the records in the source data may negatively impact subsequent query performance. As will be appreciated, query performance is one of the primary criteria on which database products compete.
The requirement for data order and the advantages of parallel processing thus appear to be opposed to each other. To take advantage of parallel processing, it appears that the prior art would require sacrifice of the required data order, while corrupting the data order appears to negatively impact processing performance, including query performance.
Performance results are major indicators of product quality, and are heavily used by customers in deciding which database products to buy.
The Transaction Processing Performance Council (TPC) establishes guidelines for transaction processing and database benchmarks against which database vendors regularly compete. Database vendors regularly publish their TPC-compliant performance results. The official TPC benchmarks measure both query performance and database creation time (of which LOADing data is a major component). Thus, both the creation of the database and the subsequent query performance are major factors that customers consider, and on which producers of database products aim to excel.
The term transaction is often applied to a wide variety of business and computer functions. From the point of view of a computer function, a transaction could refer to a set of operations including disk reads and writes, operating system calls, or a type of data transfer from one system or subsystem to another.
While TPC benchmarks involve the measurement and evaluation of computer functions and operations, the TPC regards a transaction as it is commonly understood in the business world: a commercial exchange of goods, services, or money. A typical transaction, as defined by the TPC, would include updating a database system for such things as inventory control (goods), airline reservations (services), or banking (money).