Along with the rise of massively distributed data storage and data availability on the Internet has come the necessity for various organizations to download the data in its various forms (“Content”) for offline analysis and processing. These applications include indexing content, analyzing data, archiving content, and many others. The only effective way to download these numerous content sets, some of which are very large, is to use automated content pullers connected to high-bandwidth connections.
Unfortunately, there are often adverse consequences to downloading content with automated content pullers. One such consequence is the potential to deny service to other content pullers. For example, if one puller downloading all the content from any one source is connected to a high-bandwidth connection, it could effectively prevent that source from serving its content to other applications trying to access the data. This situation should be avoided because it can create ill will with the provider of the content, stop other users from accessing the content, and temporarily or permanently harm the equipment from which the content is served.
Another adverse consequence is that operators of sites containing content frequently implement Intrusion Detection Systems (IDSs) to detect when a server containing the content is under attack. While an IDS is a valuable tool in determining whether an attack is under way, an incorrectly configured IDS can report any kind of automated content collection as an attack. This stems from the fact that the IDS attempts to detect and report patterns of activity. For example, repeatedly accessing a server containing content, even if spread over large time periods and for non-disruptive purposes, will trip most baseline IDSs and trigger an alarm. Responding to these false alarms is time-consuming for the administrators of the servers containing the content, and can result in ill will toward organizations attempting to download the content, regardless of the intentions of such organizations.
For these reasons, a system for downloading high volumes of content without adversely affecting the source of the content or being detected is desired to address one or more of these and other disadvantages.