The present invention relates to data processing, and more specifically, to data migration.
Data migration is an event that may occur frequently as computer systems that support big data environments become more prevalent. FIG. 2 shows a typical data migration process applicable to various commercial database systems, in which all data from a source data table (table 1) of a source database (DB 1) is migrated to a target data table (table 2) of a target database (DB 2). To this end, generally, the following steps are used: first, data of the source data table (table 1) is exported to an intermediate file to form intermediate data, where the intermediate file may take various file forms. For example, the intermediate data may be in the form of a database table having data information consistent with that of the source data table, e.g. with the contents of the rows of the intermediate data corresponding to the rows of the source data table. Next, the intermediate data may be imported to the target data table (table 2) from the intermediate file, and a target index may be regenerated from the target data table for retrieving the target data table.
In mainstream relational databases, the index structure is generally a B+ tree or a B− tree, where the leaf nodes of the tree are formed by index entries having a specific structure of a tuple of Key Value and a set of row IDs: <KeyValue, {RID1, . . . RIDn}>, where n is a natural number, and where KeyValue represents values in an indexed column (i.e., one or more columns) of the data table. For example, given a “province” column to be indexed in the table, key values of corresponding index leaf nodes may be Hebei, Henan, Hubei, Hunan, generally in ascending order or descending order. RID (row ID) indicates ID information of various data rows corresponding to a key value in the data table (e.g., one row for a unique index, and several rows for a non-unique index). Because RIDs generally correspond to particular physical storage information of data rows, they are not reusable during a data migration process. Still taking the index of “Province” column as an example, typical leaf nodes in the “Province” index are generally shown as follows:
Index leaf node 1: <“Hebei”, {RID1, RID5, RID7, . . . , RIDj}>, representing that the key value “Hebei” appears on the 1, 5, 7, . . . , and j rows respectively, where j is a natural number;
Index leaf node 2: <“Henan”, {RID2, RID3, RID6, . . . , RIDk}>, representing that the key value “Henan” appears on the 2, 3, 6, . . . , and k rows respectively, where k is a natural number.
Currently, the process of generating an index for a relational database includes the following.
(1) A database engine may scan various data tables, extract index key values from various data rows (i.e., records) in sequence and construct corresponding “index leaf entries.”
(2) The above “index leaf entries” are sorted in the unit of the sort space of the database, and the sorted results are written onto the disk as an “intermediate sort sequence.”
(3) The above (1) and (2) are repeated until all data records have been traversed and corresponding index leaf entries have been written to a “intermediate sort sequence.”
(4) In memory, a merging sort is performed on the “intermediate sort sequences” to generate a single ordered sequence, from which an index is constructed. This process may be as follows: (a) find an index entry with the minimum (or maximum) key value from a first index entry of each of the intermediate sort sequences; perform a merge step if there is more than one index entry that satisfies this condition; use the index entry with the minimum (or maximum) key value to construct the first index leaf node; (b) process a next index entry with the minimum (or maximum) key value in sequence; when a first index leaf block is full with processed index entries, write that first index leaf block to the disk and start the construction of a next index leaf block; (c) construct intermediate index nodes while writing the index leaf block, and write an index leaf block to the disk once it is completely populated; (d) repeat (a), (b), (c) until the whole index construction is finished.
There is a problem with the above traditional data migration process. Because the target data table generally has a huge amount of data, reconstructing an index directly for the target data table may take a lot of time, so that regular services may be affected severely.