1. Technical Field
The present invention generally relates to collectors for collection of data from nodes in distributed networks and in particular to collectors providing asynchronous collection of large blocks of data from distributed network nodes. Still more particularly, the present invention relates to collectors implementing a scalable, distributed data collection mechanism.
2. Description of the Related Art
Distributed applications which operate across a plurality of systems frequently require collection of data from the member systems. A distributed inventory management application, for example, must periodically collect inventory data for compilation from constituent systems tracking local inventory in order to accurately serve inventory requests.
Large deployments of distributed applications may include very large numbers of systems (e.g., more than 10,000) generating data. Even if the amount of data collected from each system is relatively small, this may result in large return data flows. For instance, if each system within a 20,000 node distributed application generates only 50 KB of data for collection, the total data size is still approximately 1,000 MB.
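The aggregate figure above can be checked with a short calculation. The following is a minimal sketch only; the function name `total_return_data_mb` is hypothetical, and decimal units (1 MB = 1,000 KB) are assumed to match the approximation in the text.

```python
# Back-of-the-envelope estimate of aggregate return data.
# Assumption: decimal units (1 MB = 1,000 KB), as in the approximation above.

def total_return_data_mb(node_count: int, kb_per_node: float) -> float:
    """Return the aggregate collected data size in megabytes."""
    return node_count * kb_per_node / 1000


# 20,000 nodes, each returning 50 KB
print(total_return_data_mb(20_000, 50))  # 1000.0
```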
Current synchronous approaches to data collection in distributed applications typically follow a "scan" methodology illustrated in FIG. 6. In this approach, a centralized data collector (or "scan initiator") 602 initiates the data collection by transmitting a set of instructions to each node or member system 604a-604n through one or more intermediate systems 606, which are typically little more than a relay providing communications between the central data collector 602 and the member systems 604a-604n. The central data collector 602 must determine hardware and software configuration information for the member systems 604a-604n, request the desired data from the member systems 604a-604n, and receive return data via the intermediate system(s) 606. The data received from the member systems 604a-604n is then collated and converted, if necessary, and forwarded to a relational interface module (RIM) 608, which serves as an interface for a relational database management system (RDBMS).
In addition to not being readily scalable, this approach generates substantial serial bottlenecks on both the scan and return sides. Even with batching, the number of member systems which may be concurrently scanned must be limited to approximately 100 in order to limit memory usage. The approach also limits exploitable parallelism. Where a five-minute scan is required, 20,000 nodes could all be scanned in just five minutes if the scans could be performed fully in parallel. In batches of 100, however, the 20,000 nodes form 200 sequential batches, so the five-minute scans would require 1,000 minutes to complete. The combination of the return data flow bottleneck and the loss of scan parallelism creates a very large latency, which is highly visible to the user(s) of the member systems.
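The latency comparison above follows from simple arithmetic, sketched below. The function name `batched_scan_minutes` is hypothetical, and the sketch assumes that batches run strictly one after another and that every scan in a batch takes the full scan time.

```python
import math


def batched_scan_minutes(node_count: int, batch_size: int, scan_minutes: float) -> float:
    """Total wall-clock minutes when nodes are scanned in sequential batches.

    Assumption: each batch runs its scans concurrently, and the next batch
    starts only after the previous one finishes.
    """
    batches = math.ceil(node_count / batch_size)
    return batches * scan_minutes


# Batches of 100 across 20,000 nodes: 200 batches of five minutes each
print(batched_scan_minutes(20_000, 100, 5))     # 1000

# Fully parallel case: a single batch containing every node
print(batched_scan_minutes(20_000, 20_000, 5))  # 5
```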
Current approaches to data collection in distributed applications also employ Common Object Request Broker Architecture (CORBA) method parameters for returning results to the scan initiator 602. This is inefficient for larger data sizes, which are likely to be required in data collection for certain information types such as inventory or retail customer point-of-sale data.
Still another problem with the existing approach to data collection is that nodes from which data must be collected may be mobile systems or systems which may be shut down by the user. As a result, certain nodes may not be accessible to the scan initiator 602 when data collection is initiated.
It would be desirable, therefore, to provide a collector which may be utilized to implement a scalable, efficient data collection mechanism for a distributed environment. It would further be advantageous for the collectors to provide priority-based queuing for collection requests, data rate matching to available bandwidth, and collection transfer control cooperating with other distributed applications for optimization of bandwidth utilization.
It is therefore one object of the present invention to provide collectors for collection of data from nodes in distributed networks.
It is another object of the present invention to provide collectors providing asynchronous collection of large blocks of data from distributed network nodes.
It is yet another object of the present invention to provide collectors implementing a scalable, distributed data collection mechanism.
The foregoing objects are achieved as is now described. A collector for distributed data collection includes input and output queues employed for priority-based queuing and dispatch of data received from endpoints and downstream collector nodes. Collection Table of Contents (CTOC) data structures for collection data are received by the collector from the endpoints or downstream collectors and are placed in the input queue, then sorted by the priority within the CTOC. Within a given priority level, collection of the data is scheduled based on the activation time window within the CTOC, which specifies the period during which the endpoint or downstream collector node will be available to service data transfer requests. The collected data, in the form of data packs and constituent data segments, is stored in persistent storage (depot). A CTOC is then transmitted to the next upstream collector node. Network bandwidth utilization is managed by adjusting the activation time window specified within a CTOC and the route employed between source and recipient.
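The input-queue behavior described above may be illustrated with a minimal sketch. It is not the claimed implementation: the `CTOC` fields, the `InputQueue` class, and the `next_ready` method are hypothetical names. The sketch also assumes that a smaller priority number means a more urgent collection and that, within one priority level, CTOCs are ordered by the start of their activation time window.

```python
import heapq
from dataclasses import dataclass
from itertools import count
from typing import Optional


@dataclass(frozen=True)
class CTOC:
    """Hypothetical, simplified Collection Table of Contents."""
    source: str          # endpoint or downstream collector offering the data
    priority: int        # assumption: smaller value = more urgent
    window_start: float  # activation time window opens (seconds)
    window_end: float    # activation time window closes (seconds)


class InputQueue:
    """Orders CTOCs by priority, then by activation window start."""

    def __init__(self) -> None:
        self._heap = []
        self._seq = count()  # tie-breaker so CTOC objects are never compared

    def put(self, ctoc: CTOC) -> None:
        key = (ctoc.priority, ctoc.window_start, next(self._seq), ctoc)
        heapq.heappush(self._heap, key)

    def next_ready(self, now: float) -> Optional[CTOC]:
        """Pop the best-ranked CTOC whose activation window is open at `now`."""
        deferred = []
        chosen = None
        while self._heap:
            entry = heapq.heappop(self._heap)
            ctoc = entry[3]
            if ctoc.window_start <= now <= ctoc.window_end:
                chosen = ctoc
                break
            deferred.append(entry)  # source not available at this time
        for entry in deferred:
            heapq.heappush(self._heap, entry)
        return chosen


queue = InputQueue()
queue.put(CTOC("endpoint-a", priority=2, window_start=0, window_end=100))
queue.put(CTOC("endpoint-b", priority=1, window_start=50, window_end=150))
queue.put(CTOC("endpoint-c", priority=1, window_start=0, window_end=100))

print(queue.next_ready(now=10).source)  # endpoint-c
print(queue.next_ready(now=10).source)  # endpoint-a (endpoint-b not yet open)
print(queue.next_ready(now=60).source)  # endpoint-b
```

In this sketch a CTOC whose window is not open simply stays queued. Retrieval of the data packs, storage in the depot, and forwarding of a CTOC to the upstream collector are left out.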