1. Field of the Invention
Generally, the present disclosure relates to the field of computer systems, and, more particularly, to computer systems and methods carried out by computer systems wherein snapshots of data are created.
2. Description of the Related Art
In factories for the manufacturing of complex products, such as, for example, semiconductor devices, computerized systems may be employed for optimizing manufacturing processes, and for providing real-time feedback on current conditions of the factory. Such computerized systems may include manufacturing execution systems that are connected to distributed systems of process automation, and allow monitoring and control of the production in the factory in real-time.
Computerized systems used in manufacturing may be combined with a central repository of data wherein data from different source systems are integrated in a standardized data format. Such a central repository is denoted as a “data warehouse.” Data integrated into a data warehouse need not be provided by a manufacturing execution system. Additionally or alternatively, data integrated into a data warehouse may be provided by other sources of data.
For initializing a data warehouse, source data stored in the source systems may be retrieved. The source data may be provided in the form of source database tables that may include fact tables and dimension tables. Fact tables may store measures that are referred to as facts, and typically include numerical values that may be aggregated. Dimension tables may contain textual descriptions of entities.
The source data retrieved from the source systems may be processed in accordance with an extract-transform-load process, wherein relevant data are extracted from the retrieved source data, are transformed to fit operational needs of the data warehouse and are loaded into the data warehouse.
After the initialization of the data warehouse, updates of the data warehouse may be performed, so that changes of the source data at the source systems are introduced into the data warehouse for keeping the data warehouse up to date. The updates may be performed by means of incremental updates, wherein changes of the source data that have occurred since the last update are retrieved from the source systems, transformed into the data format of the data warehouse and stored in the data warehouse. In the case of the first update performed after the initialization of the data warehouse, the changes retrieved are those that have occurred since the point in time at which the initialization was performed.
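By way of illustration only, an incremental update of the kind described above may be sketched as follows. The sketch assumes that each change carries a logical timestamp assigned by the source system; the names `Change`, `transform` and `incremental_update`, as well as the gram-to-kilogram conversion, are hypothetical and are not taken from any particular data warehouse product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    key: str         # identifier of the changed source record
    value: float     # new raw value in the source system
    changed_at: int  # logical timestamp of the change (assumed to be available)


def transform(value):
    # Hypothetical transformation step: convert grams to kilograms.
    return value / 1000.0


def incremental_update(warehouse, source_changes, last_update):
    """Apply all changes made after last_update and return the new update time."""
    newest = last_update
    # Apply changes in timestamp order so that the latest change to a key wins.
    for change in sorted(source_changes, key=lambda c: c.changed_at):
        if change.changed_at > last_update:
            warehouse[change.key] = transform(change.value)
            newest = max(newest, change.changed_at)
    return newest
```

The value returned by `incremental_update` would be passed as `last_update` to the next update, or, for the first update, replaced by the point in time at which the initialization was performed.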
For purposes of initializing a data warehouse, it would be advantageous to prevent a source system from which source data are retrieved from making further changes to the source data, for example, during a downtime of the source system. During this downtime, the data could be retrieved from the source system and then processed in accordance with an extract-transform-load scheme. However, in computerized systems employed in a production environment wherein it is intended to manufacture products 24 hours a day, every day, a downtime of the computerized system may cause an interruption of production, which is very expensive and, accordingly, is allowed only in exceptional cases.
Therefore, data are frequently retrieved from a source system by creating snapshots of the source data while the source system continues to perform changes on the data. Typically, this is performed automatically by a prewritten script. The source system may employ concurrency control techniques for providing a consistent snapshot of each source database table stored in the source system. However, the data retrieved from the source system may include multiple source database tables, wherein the snapshots of the individual source database tables are created at different times.
Thus, for updating the data in the data warehouse by means of an incremental update, it is desirable to know the point in time at which the snapshot of each table was created, so that it is known from what time onward the incremental maintenance of the data in the data warehouse corresponding to the table is to be continued.
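The role of the individual snapshot times may be illustrated by the following minimal sketch, which assumes that a log of changes with logical timestamps is available and that the snapshot time of each table is known. The function name `pending_changes` and the table names used in the example are hypothetical.

```python
def pending_changes(change_log, snapshot_times):
    """Select, for each table, the logged changes made after that table's snapshot.

    change_log     -- iterable of (table, changed_at, payload) tuples
    snapshot_times -- dict mapping each table name to the time of its snapshot
    """
    pending = {table: [] for table in snapshot_times}
    for table, changed_at, payload in change_log:
        # Only changes newer than the table's own snapshot are still missing.
        if table in snapshot_times and changed_at > snapshot_times[table]:
            pending[table].append(payload)
    return pending
```

If, for example, a snapshot of a table `lots` was created at time 100 and a snapshot of a table `wafers` at time 130, a change to `wafers` made at time 110 is already contained in the snapshot of `wafers` and is not to be applied again, whereas a change to `lots` made at the same time is still to be applied. Using a single point in time for both tables would either miss changes or apply changes twice.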
U.S. Pat. No. 7,257,257 discloses a method and an apparatus for providing differential, bandwidth-efficient and storage-efficient backups and restoration. The method and apparatus employ differential contours that include differences between some given reference contour and a new contour, wherein a “contour” includes a snapshot of the state of every object to be stored or manipulated within a designated collection of such objects, and supplementary annotations or metadata at a given time. For providing the differential contours, content identifiers, which may, for example, be generated by using cryptographic hash algorithms, may be employed.
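The general idea of comparing snapshots by means of content identifiers may be illustrated by the following simplified sketch. It is not the method of the cited patent; the use of SHA-256 and the function names `content_id`, `contour` and `differential` are assumptions made for illustration only.

```python
import hashlib


def content_id(data: bytes) -> str:
    # Content identifier derived from a cryptographic hash of the object's bytes.
    return hashlib.sha256(data).hexdigest()


def contour(objects):
    """Map each object name to the content identifier of its current state."""
    return {name: content_id(data) for name, data in objects.items()}


def differential(reference, new):
    """Describe the contour 'new' relative to the contour 'reference'.

    Returns the entries that were added or changed, and the names that were removed.
    """
    changed = {name: cid for name, cid in new.items() if reference.get(name) != cid}
    removed = sorted(name for name in reference if name not in new)
    return changed, removed
```

Only objects whose content identifier differs from the reference need to be transferred or stored, which is what makes such a differential representation compact.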
U.S. Pat. No. 6,618,794 discloses a system for generating a virtual point-in-time snapshot of a selected volume or logical unit of a storage system. The system operates by using a bitmap in a cache memory to indicate blocks of memory in the selected volume that have been overwritten since the snapshot was initiated. When a write to the selected volume is requested, the cache bitmap is checked to determine whether the original data has already been copied from the selected volume to a temporary volume. If the original data was previously copied, then the write proceeds to the selected volume. If, however, the original data would be overwritten by the presently requested write operation, then an area containing the original data is copied from the selected volume to the temporary volume. Reads from the temporary volume first check the bitmap to determine if the requested data has already been copied from the selected volume to the temporary volume. If so, the data is read from the temporary volume. Otherwise, the data is read from the selected volume.
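The copy-on-write scheme described above may be sketched, in strongly simplified form, as follows. The sketch models the volumes as in-memory lists of blocks and keeps one bitmap entry per block; the class name `CopyOnWriteSnapshot` and its methods are hypothetical and do not reproduce the implementation of the cited patent.

```python
class CopyOnWriteSnapshot:
    """Virtual point-in-time snapshot of a block volume (illustrative only)."""

    def __init__(self, volume):
        self.volume = volume                   # selected (live) volume, a list of blocks
        self.copied = [False] * len(volume)    # bitmap: True once the original block was preserved
        self.temporary = [None] * len(volume)  # temporary volume holding preserved originals

    def write(self, index, data):
        # Preserve the original block before it is overwritten for the first time.
        if not self.copied[index]:
            self.temporary[index] = self.volume[index]
            self.copied[index] = True
        self.volume[index] = data

    def read_snapshot(self, index):
        # Preserved blocks are read from the temporary volume,
        # unmodified blocks are read from the selected volume.
        if self.copied[index]:
            return self.temporary[index]
        return self.volume[index]
```

In this sketch, reading all blocks through `read_snapshot` yields the state of the volume at the time the snapshot object was created, irrespective of writes performed afterwards through `write`.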
Jörg and Dessloch, “Formalizing ETL Jobs for Incremental Loading of Data Warehouses,” Proceedings of BTW, 327-346, ISBN 978-3-88579-238-3, 2009, discloses an automated creation of incremental load jobs for data warehouses.
Further techniques for obtaining backups including snapshots of data from a source system are disclosed in U.S. Pat. Nos. 6,078,932, 6,061,770, 5,857,208, 5,778,165 and 5,381,543.
Techniques wherein snapshots of source database tables are obtained at a particular point in time for each source database table and wherein changes of the source database are made during the creation of the snapshots may have particular issues associated therewith. Information concerning the time of creation of the snapshot of a source database table, if obtainable from the source system, may be insufficiently accurate, in particular if changes of the source database are made at a high frequency. This may lead to inaccurate and potentially false data when the information is used for performing incremental updates. Moreover, the source database system may be provided by a different organizational or business unit than the data warehouse, and access to information from the source database system concerning the exact point in time at which snapshots of source database tables were made may be restricted.
In view of the situation described above, the present disclosure provides methods, computer readable storage media and computer systems that allow determining a point in time at which a snapshot of a source database table in a source database system was made with a relatively high accuracy.