“Big data” is a commonly used term that refers to very large data sets that are a byproduct of rapid advances in data collection and storage. While the data itself is important, it is also important to be able to identify relevant data within the data sets and to use analytics to create value from that relevant data. The ability of a business to analyze large data sets and to understand and apply the results can be key to increasing competitiveness, productivity, and innovation. However, as the amount of data continues to grow, current systems may find it difficult to keep pace.
A distributed or shared storage system (e.g., a network-attached storage (NAS) system or cluster) typically includes a number of NAS devices that provide file-based data storage services to other devices (clients, such as application servers) in the network. An NAS device is, generally speaking, a specialized computer system configured to store and serve files. Accordingly, an NAS device typically has less capability than a general purpose computer system such as an application server. For example, an NAS device may have a simpler operating system and a less powerful processor, and may also lack other components such as a keyboard and display.
An application server can retrieve data stored by an NAS system over a network using a data-sharing or file-sharing protocol such as Network File System (NFS) or Common Internet File System (CIFS). After retrieving the data, the application server can analyze it as mentioned above.
The conventional approach, in which an NAS system supplies data to an application server that then analyzes the data, is problematic for very large data sets (big data). Generally speaking, it can take a relatively long time and can consume a relatively large amount of bandwidth and other resources to deliver big data from NAS devices to application servers. For example, multiple remote procedure calls are defined for the NFS protocol, including read, lookup, readdir, and remove. In the NFS protocol, file or directory objects are addressed through opaque file handles. Any read call is preceded by a lookup to locate the object to be read. The read call is then invoked iteratively; the number of times it is invoked depends on the NFS configuration and the size of the object to be fetched. With the larger data sets associated with big data, the read call will have to be invoked many times to fetch the data for a single object. Thus, operations such as a read call over a distributed storage system can be very expensive in terms of the amount of computational resources and bandwidth that are consumed.
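The lookup-then-iterative-read pattern described above can be illustrated with a minimal sketch. It is a toy in-memory model, not an NFS client: the class `ToyFileServer`, the function `fetch_object`, and the parameter `rsize` (the maximum number of bytes returned per read call) are hypothetical names introduced only for illustration, and the end-of-file flag returned by `read` is a simplifying assumption about how the client learns that the object has been fully fetched.

```python
class ToyFileServer:
    """Hypothetical in-memory stand-in for an NAS device (illustration only)."""

    def __init__(self, files):
        # files maps a file name to its contents as bytes.
        self._names = list(files)
        self._data = [files[name] for name in self._names]
        self.rpc_count = 0  # counts every simulated remote procedure call

    def lookup(self, name):
        """Return a handle for the named object; the client treats it as opaque."""
        self.rpc_count += 1
        return self._names.index(name)

    def read(self, handle, offset, count):
        """Return up to `count` bytes starting at `offset`, plus an end-of-file flag."""
        self.rpc_count += 1
        data = self._data[handle]
        chunk = data[offset:offset + count]
        eof = offset + len(chunk) >= len(data)
        return chunk, eof


def fetch_object(server, name, rsize):
    """Fetch a whole object: one lookup, then repeated reads of at most rsize bytes."""
    handle = server.lookup(name)
    chunks = []
    offset = 0
    while True:
        chunk, eof = server.read(handle, offset, rsize)
        chunks.append(chunk)
        offset += len(chunk)
        if eof:
            break
    return b"".join(chunks)


if __name__ == "__main__":
    server = ToyFileServer({"big.bin": bytes(10)})
    payload = fetch_object(server, "big.bin", rsize=4)
    print(len(payload), "bytes fetched in", server.rpc_count, "calls")
```

In this model, a non-empty object of S bytes costs one lookup plus ceil(S / rsize) read calls. Under an assumed, purely illustrative transfer size of 64 KiB per read, a single 1 GiB object would require 16,384 read calls, which shows how the call count grows with object size.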