The amount of data in databases and other data stores continues to grow over time. There are many reasons for this continued expansion. One reason is that more and more business transactions are taking place every day. Internet, intranet, and/or other systems have greatly increased the number of transactions, as has merger and acquisition activity in many business sectors. The end result is that more data needs to be archived in order to, for example, remain compliant with legislation (e.g., legislation that requires maintaining such information), keep the main “production” part of business systems at maximum levels of efficiency, etc. Details regarding transactions including, for example, the parties, terms, conditions, relevant products or data, etc., often are archived. In addition, there recently has been a move towards archiving metadata (which may sometimes be thought of as “data about data”) concerning each of these and/or other aspects of a business transaction.
With the need for more archiving of data comes the concomitant need for better ways to accomplish it. For instance, the inventor of the instant application has realized that it would be desirable to archive business and/or other data in a way that allows for central control of the service but, at the same time, maximizes the potential scope for distribution across the network.
Current archiving approaches are not designed to cope with a proliferation of archiving. Present systems typically funnel data through one or only a very few specific computers in the network. This approach limits capacity and reduces fault tolerance, as there is a “funnel” through which all data must pass. This funnel approach is limited by the resources dedicated to the funnel. It also presents a single point of failure.
A product that once was commercially available from Neon purported to operate in a distributed manner. To an extent, this claim is true. That is, Neon's product allowed “extraction” of data to take place on multiple computers. However, the “accumulation” of this extracted data into the archive had to go through a specific computer that acted as the archive control center. Thus, even though data extraction may have taken place in or at various computers across a network, the system disadvantageously involved a “funnel” through which all extracted data was archived.
FIG. 1 is a simplified view of a network system that helps demonstrate certain disadvantages of the funneling approach to data archiving. The network 100 in FIG. 1 includes a plurality of computers 102. Each computer is connected to a data store 104. A computer 102 is configured to extract data from an associated data store 104, and then send all or part of it to an accumulator computer 106 which, in turn, archives the extracted data in the archive storage location 108. As can be seen from the FIG. 1 example, all data is funneled through the accumulator 106 before it can be stored to the archive storage location 108. Thus, if anything happens to the accumulator 106 (e.g., it is damaged, destroyed, temporarily inaccessible, etc.), the archiving operation will fail. As such, even though the extraction of data may be somewhat distributed, the central funnel provided by the accumulator 106 is still disadvantageous.
Thus, it will be appreciated that there is a need in the art for improved archiving systems and/or methods. For example, it will be appreciated that there is a need in the art for truly distributed approaches to both data extraction and data accumulation.
One aspect of certain example embodiments involves archiving of data in a way that allows for central control of the service but, at the same time, maximizes the potential scope for distribution across the network. According to certain example embodiments, any number of data “extractors” may be running on any number and/or type of computers. At the same time, according to certain example embodiments, any number of data “accumulators” may be running on any number of computers. The accumulator computers may be the same computers as, or different computers from, the extractor computers. In certain example embodiments, some (but not all) of the accumulator computers may also be extractor computers, and vice versa. The distributed design disclosed herein advantageously increases capacity and improves fault tolerance.
Another aspect of certain example embodiments relates to techniques for automatically determining whether data has been lost or damaged. In such cases, data integrity checks may be performed along with, or in place of, security-enhancing encryption-based techniques. An example scenario is that archive data is stored for a very long time but generally is untouched. Accordingly, an incident may occur, e.g., at the hardware level, where storage media is lost—even though relevant personnel (e.g., an information technology department, data owner, etc.) are not informed, there is no apparent loss, and/or no apparent interruption to service. Some amount of time later (e.g., days, months, or years), a search of the archive may need to access what was lost. Unfortunately, because so much time has passed since the incident, there now may be no chance to correct the problem. However, using the techniques of certain example embodiments, it may be possible to help ensure that all parts of the archive are regularly accessed (to varying degrees) within a controlled amount of time. This example approach may, in turn, improve the ability to identify, address, and rectify any loss, as discovery of the problem may be closer to the time when the incident occurs.
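By way of illustration only, the idea of ensuring that every part of the archive is accessed within a controlled amount of time may be sketched as a simple selection of the parts whose last check is older than a chosen limit. The function name `parts_due_for_check`, the part identifiers, and the 30-day limit are hypothetical and are not taken from the embodiments described herein.

```python
from datetime import datetime, timedelta


def parts_due_for_check(last_checked, now, max_age):
    """Return IDs of archive parts not checked within max_age, oldest first."""
    due = [(ts, part) for part, ts in last_checked.items() if now - ts >= max_age]
    return [part for ts, part in sorted(due)]


# Hypothetical bookkeeping: when each part of the archive was last accessed.
now = datetime(2024, 1, 31)
last_checked = {
    "part-a": datetime(2024, 1, 30),
    "part-b": datetime(2023, 12, 1),
    "part-c": datetime(2024, 1, 1),
}

due = parts_due_for_check(last_checked, now, timedelta(days=30))
print(due)
```

Checking the oldest parts first is one possible policy; it bounds the time between an incident and its discovery by roughly the chosen limit plus the time needed to work through the backlog.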
Certain example embodiments of this invention relate to an archival system. A plurality of computers is connected to a network. At least one source data store and at least one target data store are connected to the network. At least one archive service is configured to coordinate a plurality of extract operations and a plurality of accumulate operations, with each said extract operation being executed on one said computer in the plurality of computers to read data from one said source data store and with each said accumulate operation being executed on one said computer in the plurality of computers to write data to one said target data store. Each said extract operation is configured to run on any one predetermined computer in the plurality of computers and is paired with one said accumulate operation that is configured to run on any one predetermined computer in the plurality of computers.
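The pairing of extract and accumulate operations described above may be sketched, purely as a non-limiting example, with the following in-process simulation. The class names `ExtractOp`, `AccumulateOp`, and `ArchiveService`, as well as the host and store names, are assumptions made for illustration; an actual system would dispatch each operation to its designated computer over the network rather than run everything in one process.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractOp:
    name: str
    computer: str      # predetermined computer on which the extract runs
    source_store: str  # source data store to read from


@dataclass(frozen=True)
class AccumulateOp:
    name: str
    computer: str      # predetermined computer on which the accumulate runs
    target_store: str  # target data store to write to


class ArchiveService:
    """Coordinates extract/accumulate pairs; no data passes through the service."""

    def __init__(self, computers):
        self.computers = set(computers)
        self.pairs = []

    def pair(self, extract, accumulate):
        # Both operations must be assigned to computers known to the network.
        for op in (extract, accumulate):
            if op.computer not in self.computers:
                raise ValueError(f"{op.name}: unknown computer {op.computer!r}")
        self.pairs.append((extract, accumulate))

    def run(self, sources, targets):
        # Each pair moves data directly from its source to its own target.
        for extract, accumulate in self.pairs:
            records = list(sources[extract.source_store])
            targets.setdefault(accumulate.target_store, []).extend(records)


service = ArchiveService(["host1", "host2", "host3"])
service.pair(ExtractOp("e1", "host1", "orders_db"), AccumulateOp("a1", "host2", "vault_1"))
service.pair(ExtractOp("e2", "host3", "hr_db"), AccumulateOp("a2", "host3", "vault_2"))

sources = {"orders_db": ["o1", "o2"], "hr_db": ["h1"]}
targets = {}
service.run(sources, targets)
print(targets)
```

In this sketch the second pair runs its extract and accumulate operations on the same computer, while the first pair uses two different computers, reflecting that either arrangement is possible.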
In certain example embodiments, the at least one archive service may be further configured to access rules (a) mapping the extract and accumulate operations to respective computers on which the operations are to be executed, and (b) storing pairings between peer extract and accumulate operations. In certain example embodiments, the rules may further identify at least one said source data store for each said extract operation and at least one said target data store for each said accumulate operation. In certain example embodiments, the rules may further include extract rules identifying data to be read from the at least one source data store and whether any further data is to be attached to the data to be read, and/or accumulate rules indicating how duplicate entries are to be handled and/or how long data is to be retained in each said target data store.
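One possible, non-limiting way to represent such extract and accumulate rules is sketched below. The field names (`select`, `attach`, `on_duplicate`, `retention_days`), the `"skip"` and `"overwrite"` policies, and the record layout are hypothetical choices made for this example only.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ExtractRule:
    select: Callable[[dict], bool]              # which records to read
    attach: dict = field(default_factory=dict)  # further data to attach


@dataclass
class AccumulateRule:
    on_duplicate: str = "skip"  # "skip" keeps the archived entry, "overwrite" replaces it
    retention_days: int = 365   # how long data is retained in the target store


def extract(records, rule):
    """Read the selected records and attach any further data named by the rule."""
    return [{**r, **rule.attach} for r in records if rule.select(r)]


def accumulate(target, records, rule, key="id"):
    """Write records into the target store, applying the duplicate policy."""
    for r in records:
        if r[key] in target and rule.on_duplicate == "skip":
            continue
        target[r[key]] = {**r, "retention_days": rule.retention_days}
    return target


records = [
    {"id": 1, "status": "closed"},
    {"id": 2, "status": "open"},
    {"id": 3, "status": "closed"},
]
extract_rule = ExtractRule(select=lambda r: r["status"] == "closed",
                           attach={"origin": "orders_db"})
extracted = extract(records, extract_rule)

# The target store already holds an entry with id 1.
target = {1: {"id": 1, "status": "closed", "origin": "legacy", "retention_days": 30}}
accumulate(target, extracted, AccumulateRule(on_duplicate="skip", retention_days=2555))
print(sorted(target))
```

With the `"skip"` policy the previously archived entry for id 1 is left untouched, whereas `"overwrite"` would replace it with the newly extracted record.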
The system of certain example embodiments may be arranged such that a central hub or funnel is not used when data is written to the at least one target data store.
According to certain example embodiments, the at least one archive service may be further configured to coordinate at least one recall operation to retrieve data from the at least one target data store for access on one said computer. In the at least one recall operation, the retrieved data from the at least one target data store may be placed into at least one said source data store.
According to certain example embodiments, the at least one archive service may be further configured to coordinate at least one validation operation to verify integrity of data in the at least one target data store. The at least one validation operation may run continuously for a predetermined amount of time, periodically such that the at least one validation operation begins execution at predetermined times or time intervals, etc. The at least one validation operation may be configured to determine whether data exists, matches a checksum, and/or is recallable, based on predetermined rules accessible by the at least one validation operation. The at least one validation operation may be further configured to raise an alarm if the at least one validation operation encounters a fault during execution.
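A validation operation of this kind may be sketched as follows, for illustrative purposes only. The example assumes a SHA-256 checksum recorded in a manifest at archive time and treats an entry as “recallable” if it decodes as UTF-8 text; both are assumptions of this sketch, as are the function names `validate_entry` and `run_validation`. The alarm is modeled as a caller-supplied callback.

```python
import hashlib


def validate_entry(store, key, expected_sha256, checks):
    """Return a list of fault descriptions for one archived entry (empty if healthy)."""
    data = store.get(key)
    if data is None:
        # Existence is always checked; the other checks need the data.
        return [f"{key}: missing"]
    faults = []
    if "checksum" in checks and hashlib.sha256(data).hexdigest() != expected_sha256:
        faults.append(f"{key}: checksum mismatch")
    if "recallable" in checks:
        try:
            data.decode("utf-8")  # stand-in for a real recall attempt
        except UnicodeDecodeError:
            faults.append(f"{key}: not recallable")
    return faults


def run_validation(store, manifest, checks, raise_alarm):
    """Validate every entry listed in the manifest and raise an alarm per fault."""
    for key, digest in manifest.items():
        for fault in validate_entry(store, key, digest, checks):
            raise_alarm(fault)


def sha(data):
    return hashlib.sha256(data).hexdigest()


# Hypothetical target store: doc2 was altered and doc3 was lost after archiving.
store = {"doc1": b"hello", "doc2": b"corrupted"}
manifest = {"doc1": sha(b"hello"), "doc2": sha(b"original"), "doc3": sha(b"gone")}

alarms = []
run_validation(store, manifest, ("checksum", "recallable"), alarms.append)
print(alarms)
```

The `checks` argument stands in for the predetermined rules that select which checks apply, so that a lighter pass (existence only) or a deeper pass (checksum and recall) can be run depending on how thoroughly each part of the archive is to be examined.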
According to certain example embodiments, the at least one archive service may be further configured to coordinate at least one importing operation to incorporate data from an otherwise non-consumable backup location into the at least one source data store and/or the at least one target data store. The at least one importing operation may be configured to determine (a) what data exists in the otherwise non-consumable backup location, and (b) how the data from the otherwise non-consumable backup location was backed up. The at least one importing operation may implement rules for reacquiring the data from the otherwise non-consumable backup location, with the rules for reacquiring the data being one or more user-programmed rules, one or more predefined algorithms, and/or automatically generated by the at least one importing operation.
In certain example embodiments of this invention, corresponding methods for providing, configuring, and/or executing the same also may be provided.
In certain example embodiments of this invention, corresponding computer readable storage media tangibly storing instructions for executing such methods also may be provided.
These aspects and example embodiments may be used separately and/or applied in various combinations to achieve yet further embodiments of this invention.