Most present-day computer clusters consist of a large number of individual machines or nodes and various types of resources associated with the nodes. These resources include storage devices of various types. These are usually embodied by non-transitory storage media such as hard disk drives (HDDs) including arrays of such drives (e.g., RAIDs), optical disks, solid-state devices (SSDs) and still other types of media known to those skilled in the art.
It is important that clusters maintain a high level of availability of important data files that are stored in the non-transitory storage media belonging to them. In particular, certain data files residing on such clusters need to be kept highly available to users who wish to access these data files and/or execute jobs on these data files.
When nodes belonging to a high availability cluster are tasked with performing a batch job, e.g., a Map-Reduce job, it is crucial to ensure high availability of data files on which the job is to be performed to guarantee rapid execution. This applies in particular to data files that are very popular and hence most likely to be involved in an important batch job. Additionally, preliminary or intermediate results obtained during a Map-Reduce job that are based on the most popular data can be a useful indicator or gauge of the likely final result. Results obtained from processing popular data should thus be made available quickly, whenever practicable.
Published Patent Appl. No. 2013/0117225 to Dalton teaches how heterogeneous storage media in high availability clusters can be deployed to ensure high availability of certain popular data files. A high availability cluster as taught by Dalton has nodes provisioned with high-access-rate and low-access-rate non-transitory storage media. A distributed storage medium management protocol determines whether to place data files in the high-access-rate or in the low-access-rate storage media depending on the data file's popularity. High availability of popular data files is ensured by placing or migrating them to high-access-rate storage media, such as solid-state memory or RAM disks. While there, the popular data files are rapidly accessible to users for various purposes, ranging from simple read requests to their deployment in user-defined batch jobs.
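By way of illustration only, the popularity-based placement described above can be sketched as follows. This is a minimal sketch and not the protocol of the cited reference: the access-count popularity metric, the threshold value, and the names DataFile, choose_tier and place_files are all assumptions introduced here for clarity.

```python
from dataclasses import dataclass

# Tier labels follow the terminology used in the text above.
HIGH = "high-access-rate"
LOW = "low-access-rate"


@dataclass
class DataFile:
    name: str
    access_count: int  # hypothetical popularity metric (assumption)


def choose_tier(data_file, popularity_threshold=100):
    """Pick a storage tier for one file from its popularity.

    The threshold of 100 accesses is an arbitrary illustrative value.
    """
    if data_file.access_count >= popularity_threshold:
        return HIGH
    return LOW


def place_files(files, popularity_threshold=100):
    """Map each file name to the tier it would be placed in or migrated to."""
    return {f.name: choose_tier(f, popularity_threshold) for f in files}
```

Under these assumptions, a file whose access count meets the threshold is assigned to the high-access-rate media, while all other files remain on the low-access-rate media. Re-running place_files after the access counts change models migration between the two tiers.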
In some cases, however, the data files, whether popular or not, are not yet even stored on cluster storage resources. For example, the data of interest has not yet been uploaded to the storage resources of the high availability cluster. This may be the case when the data of interest flows through an external network that is not part of the high availability cluster. In such situations, the data must first be captured from the external network and brought into the high availability cluster.
The prior art teaches many devices for intercepting packets of data propagating through networks. There are two broad types of devices used for accessing or interacting with data flow or traffic on a network. The first group includes traffic flow or monitoring devices. These are sometimes referred to as “packet sniffers” by those skilled in the art. Such devices do not capture the full contents of the data packets. Instead, they monitor the flow of data packets and are thus used in the analysis of network traffic flow. An excellent review of network monitoring with various types of flow tools is provided by Chakchai So-In, “A Survey of Network Traffic Monitoring and Analysis Tools”, IEEE.org, 2006, pp. 1-24 at: http://www.cse.wustl.edu/~jain/cse567-06/net_traffic_monitors3.htm.
It should be noted that monitoring of data packet flows is practiced extensively in many areas. This includes Internet traffic analysis using the Hadoop Distributed File System (HDFS) and batch jobs, such as Map-Reduce, deployed in high availability clusters. For more information the reader is referred to Lee Y. et al., “An Internet Traffic Analysis Method with MapReduce”, IEEE/IFIP, Network Operations and Management Symposium Workshops, 2010, pp. 357-361.
The actual capture of data packets propagating through an external network, rather than just flow monitoring, requires the second type of device, namely in-path interceptors or packet capture devices. These are typically mounted in routers of the external network and capture entire packets of data propagating through the router. An overview of suitable packet capture devices (also referred to as PCAP devices by some skilled artisans) is provided by Scott M., “A Wire-speed Packet Classification and Capture Module for NetFPGA”, First European NetFPGA Developers' Workshop, 2010, Cambridge, UK. Some specific devices are discussed in practical references as well as product guides and descriptions. These include, among many others, the “GL Announces PacketShark”, GL Communications Website (www.gl.com) Newsletter 3 Sep. 2012, pp. 1-2, as well as product literature for devices such as GigaStor™ by Network Instruments (operating at 10 Gb Wire Speed).
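The distinction between the two device types can be illustrated with a simplified sketch. It is not a model of any of the cited products: the Packet, FlowMonitor and PacketCapture classes and their fields are hypothetical, and a real device operates on wire-format frames rather than in-memory objects. The sketch only shows that a flow monitor retains per-flow statistics, whereas a capture device retains the entire packet.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    src: str        # source address (illustrative field)
    dst: str        # destination address (illustrative field)
    payload: bytes  # full packet contents


class FlowMonitor:
    """First device type: keeps per-flow counters and discards payloads."""

    def __init__(self):
        self.stats = defaultdict(lambda: {"packets": 0, "bytes": 0})

    def observe(self, pkt):
        flow = self.stats[(pkt.src, pkt.dst)]
        flow["packets"] += 1
        flow["bytes"] += len(pkt.payload)


class PacketCapture:
    """Second device type: retains every packet in its entirety."""

    def __init__(self):
        self.captured = []

    def observe(self, pkt):
        self.captured.append(pkt)
```

Feeding the same packets to both objects leaves the monitor with only counts per source and destination pair, while the capture object holds the payloads that can later be brought into a cluster for analysis.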
Once data packets are captured by a suitable packet capture device from an external network into a cluster, they can be analyzed and operated on. Thus, captured data packets can be submitted for any jobs that are supported by the cluster and its file system. Corresponding teachings are once again provided by Lee Y. et al., “Toward Scalable Internet Traffic Measurement and Analysis with Hadoop”, ACM SIGCOMM Computer Communication Review, Vol. 43 No. 1, January 2013, pp. 6-13. This reference addresses the capture of packets into clusters that deploy the Hadoop Distributed File System (HDFS) and are capable of running batch jobs.
Although the above prior art teachings provide solutions to data capture and analysis in high availability clusters, they do not address several important issues. First of all, the prior art does not teach suitable approaches to ensuring high availability of captured data packets that are popular. In other words, there are at present no suitable methods and/or protocols to ensure that popular data packets captured into high availability clusters are highly available to cluster users immediately upon capture. Furthermore, the prior art does not address how to manage the packet capture process in cases where the high availability cluster supports different file systems. For example, it would be desirable to capture packets propagating in an external network into a high availability cluster that is configured to run both POSIX-compliant and non-POSIX-compliant (e.g., HDFS) file systems.