The present invention relates generally to the field of information storage and retrieval and, more particularly, to loading data into large information warehouses.
Today's information warehouses are becoming increasingly large, e.g., holding hundreds of gigabytes (GB) or even terabytes (TB) of structured and unstructured information. Such information warehouses are often built from one or more data sources. It is not uncommon for the lengthy process of loading data into the information warehouse to run into various issues (e.g., data loading failures) that may leave the data loads incomplete or erroneous. Typically, data loading failures and errors can be classified into two classes: 1) failures and errors that are caused by unexpected system problems, e.g., machine crashes and broken network connections; and 2) failures and errors that are caused by data source content that contains “dirty” data, i.e., data that is faulty for whatever reason, e.g., incorrect linkages between data tables. A failure in the first class may cause the data loading to be incomplete. An ideal recovery process should be able to resume the data loading from where it left off rather than restarting the data load from scratch, i.e., from the beginning, as current information warehouse solutions typically do after a data loading failure. For failures in the second class, data that has already been loaded may require cleaning up or reloading, or both, if the data source content contained dirty data.
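By way of illustration only, recovery from a failure of the first class may be sketched as a checkpointed load that records its progress and resumes from the last recorded position. The sketch below is a minimal example under stated assumptions and is not a description of any particular information warehouse product: the function name `load_with_checkpoint`, the JSON checkpoint file, the `fail_at` parameter used to simulate a crash, and the use of an in-memory list as the "warehouse" are all hypothetical.

```python
import json
import os


def load_with_checkpoint(records, warehouse, checkpoint_path, fail_at=None):
    """Load records into warehouse, resuming from the last checkpoint.

    records:         list of source records (hypothetical data source)
    warehouse:       list standing in for the information warehouse
    checkpoint_path: file that stores the index of the next record to load
    fail_at:         optional index at which to simulate a system failure
    Returns the number of records loaded by this call.
    """
    # Resume from the saved position if an earlier load was interrupted.
    start = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            start = json.load(f)["next_index"]

    for i in range(start, len(records)):
        if fail_at is not None and i == fail_at:
            # Stand-in for a machine crash or broken network connection.
            raise RuntimeError("simulated system failure")
        warehouse.append(records[i])
        # Record progress after each record. A real system would need to
        # commit the data and the checkpoint atomically (assumption: this
        # sketch ignores a crash between the two steps).
        with open(checkpoint_path, "w") as f:
            json.dump({"next_index": i + 1}, f)

    return len(records) - start
```

In this sketch, a second call with the same checkpoint file after an interrupted load skips the records already loaded and continues with the remainder, rather than starting from the beginning. Failures of the second class (dirty data) are not addressed by this example.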
Better methodologies and tools are needed for coping with the lengthy data loading required for maintaining increasingly large information warehouses.