Computer systems have been utilized to gather, store, transfer, and analyze data obtained from a variety of sources. Such data can represent financial and commercial information, scientific information, sales transactions, customer and personnel information, internet activity, or any of a variety of information regarding activities, people, parameters, or events.
In this regard, relational database management systems (RDBMS) have been developed which allow users to access, store, and operate on data using a computer language. For example, the TERADATA RDBMS developed by NCR Corporation provides a single data store for any number of client architectures. Such a shared information architecture eliminates the need for maintaining duplicate databases on multiple platforms. For example, this system could allow mainframe clients, workstations, and personal computers to access and manipulate the same database simultaneously. One of many ways to put data into such a data store is to use a load utility, such as the FASTLOAD, MULTILOAD, or TPUMP load utilities. Structured Query Language (SQL) can then be used to access the data store and to query and/or manipulate the data therein. Large amounts of data can be stored or “warehoused” using such a system.
In transferring data from the various sources to the data store, a number of methods can be utilized. For instance, a process could access the source data and transfer the desired data to a file. Once all of the desired data is transferred, the file could be closed and a second process could then open the file and load the data into the data store. However, such a system can suffer from inefficiencies in transfer speed as the file must be closed by the first process before it can be accessed by the second process. Moreover, such systems are limited by file size and disk storage restrictions.
Accordingly, for data warehousing as well as for many other data transfer applications, it is often desirable to utilize more efficient data transfer mechanisms, such as those which operate using volatile mechanisms for transferring the data. One such volatile data transfer mechanism is known as a “pipe.” Pipes are areas of shared memory set aside for data transfer and can be used as an efficient conduit of data from one process to another process. Pipes speed the data transfer function because multiple processes can access the pipe concurrently, so that the receiving process can begin reading while the sending process is still writing. Such pipe conduits are available in both the UNIX and WINDOWS operating systems, and are available as “named pipes” (where the pipe has a name that is accessible by any other operating process that knows the name) and as “unnamed pipes” (where the pipe is given a private identification number used only by two processes).
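By way of illustration only, the following minimal Python sketch shows an unnamed pipe being written and read concurrently. It is not taken from any particular load utility. The function name transfer_through_pipe and the newline-terminated record format are assumptions made for the example, and a writer thread stands in for what would ordinarily be a separate first process.

```python
import os
import threading


def transfer_through_pipe(records):
    """Send records through an unnamed pipe while a reader consumes them.

    Hypothetical helper for illustration: a thread plays the role of the
    first (sending) process, and the caller plays the second (receiving)
    process.
    """
    # os.pipe() returns two file descriptors: a read end and a write end.
    read_fd, write_fd = os.pipe()

    def producer():
        # Closing the write end (on leaving the with-block) signals
        # end-of-data to the reader.
        with os.fdopen(write_fd, "w") as sink:
            for record in records:
                sink.write(record + "\n")

    writer = threading.Thread(target=producer)
    writer.start()

    # The reader does not wait for the writer to finish; it consumes
    # records as they become available in the pipe.
    with os.fdopen(read_fd, "r") as source:
        received = [line.rstrip("\n") for line in source]

    writer.join()
    return received
```

In contrast to the file-based approach, no intermediate file has to be completed and closed before the receiving side can begin its work.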
However, if there is an error in transferring data via pipes, the user typically is forced to restart the data transfer process from the beginning. In other words, mid-transfer restart capability is conventionally not permitted with pipes due to their volatile nature. Once data is read from a pipe by a second process, it is removed from the pipe. If there is then an error in loading that data from the second process, or any other power or transfer error, the entire transfer has to start again from the beginning, thereby introducing inefficiencies due to the redundancy in retransmitting data. Some transfers can take hours or days, causing valuable time to be lost when the entire transfer must be carried out again.
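The volatile behavior described above can be observed with a short Python sketch, offered only as an illustration. The function name read_once_then_again and the sample row contents are hypothetical. Once the bytes have been read, a second read returns nothing, so the pipe itself offers no copy from which a failed load could be resumed.

```python
import os


def read_once_then_again():
    """Show that data read from a pipe is no longer held by the pipe."""
    read_fd, write_fd = os.pipe()

    # The sending side writes two small records and closes its end.
    os.write(write_fd, b"row-1\nrow-2\n")
    os.close(write_fd)

    # The first read drains the data that was written.
    first = os.read(read_fd, 1024)

    # The second read finds nothing left: the pipe did not retain the
    # data, and the closed write end yields an empty result.
    second = os.read(read_fd, 1024)

    os.close(read_fd)
    return first, second
```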
Accordingly, methods and systems are desired which allow for transfer of data from a first process to a second process using pipes but which do not require data to be re-transmitted over the pipe if an error or failure is encountered. At least one embodiment described herein relates to such methods and systems.
Furthermore, after some data transfer failures, it may not be possible to resume the transfer of the data without starting again from the beginning. Yet, as can be understood, simply re-transferring the data from the first process through the pipe and to the second process does not ensure that the data being re-transferred is actually the same data which had been transferred previously. Accordingly, corrupted or erroneous data might be transferred through the pipe when re-starting the process from the beginning. Moreover, it would be redundant for the second process to operate on this data again if it had already done so during the original transfer. Accordingly, methods and systems are also desired which allow the re-transferred data from the first process to be validated against data which had already passed through the pipe and been received by the second process prior to the failure. Thus, assurance can be provided that the data being re-transferred is correct and valid as compared with the data that had already been transferred. If the data is verified as correct, the second process need not operate on that data again, thereby providing efficiency. At least one embodiment described herein relates to such methods and systems.
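One conceivable way to perform such a validation, sketched here in Python purely as an assumption and not as the method of any embodiment, is for the second process to retain a count and a digest of the records it received before the failure. On restart, the leading portion of the re-transferred data is compared against that digest, and only the remaining records are passed on for processing. The names digest_of and resume, the use of SHA-256, and the newline-delimited record format are all hypothetical choices made for the example.

```python
import hashlib


def digest_of(records):
    """Return a SHA-256 digest over newline-terminated records."""
    hasher = hashlib.sha256()
    for record in records:
        hasher.update(record.encode("utf-8") + b"\n")
    return hasher.hexdigest()


def resume(retransferred, records_done, digest_done):
    """Validate re-transferred data against a pre-failure checkpoint.

    records_done and digest_done describe the records the second process
    had already received before the failure (an assumed checkpoint
    format). Returns only the records that still need to be operated on.
    """
    head = retransferred[:records_done]
    if len(head) < records_done or digest_of(head) != digest_done:
        raise ValueError(
            "re-transferred data differs from data received before the failure"
        )
    # The validated records were already operated on during the original
    # transfer, so they are skipped.
    return retransferred[records_done:]
```

For example, if two of four records had been received before a failure, calling resume with the full re-transferred list, a count of 2, and the stored digest would return only the last two records, and would raise an error if either of the first two had changed.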