Embodiments of the invention relate to file systems, and in particular, to a switch-aware parallel file system.
A file system is a management structure for storing and organizing files and data. File systems are software components that use storage subsystems to maintain files and data. File systems impose a logical structure on a storage subsystem to allow client computers to create, store, and access data on the storage subsystem. A distributed file system is a file system that supports sharing of files and storage resources for multiple clients over a network. An internet-scale file system is a distributed file system designed to run on low-cost commodity hardware, which is suitable for applications with large data sets. A cluster file system is a type of distributed file system that allows multiple compute nodes in a computing cluster to simultaneously access the same data stored within the computing cluster. A parallel file system is a type of distributed file system that distributes file system data across multiple servers and provides concurrent access for multiple tasks of a parallel application.
A computing cluster includes multiple systems that interact with each other to provide client systems with data, applications, and other system resources as a single entity. Computing clusters typically have a file system that manages data storage within the computing cluster. Computing clusters increase scalability by allowing servers and shared storage devices to be incrementally added. Computing clusters use redundancy to increase system availability and withstand hardware failures.
Supercomputers use parallel file systems (e.g., IBM General Parallel File System) to transfer large amounts of data at high speeds. Parallel file systems stripe data across multiple storage nodes, which reduces the likelihood of any one storage node becoming a performance bottleneck. However, the use of parallel file systems in commodity data centers is limited because data striping creates a many-to-many architecture of storage nodes to compute nodes, which requires expensive networking hardware to achieve acceptable performance.
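The many-to-many pattern created by data striping can be illustrated with a minimal sketch. The sketch assumes simple round-robin placement of fixed-size stripes; the function name stripe_targets, the stripe size, and the node counts are hypothetical and chosen only for illustration, and they do not describe the placement policy of any particular parallel file system.

```python
def stripe_targets(file_size, stripe_size, num_storage_nodes):
    """Return the storage node index that holds each stripe of a file.

    Assumes round-robin placement (an illustrative assumption):
    stripe i is stored on node i modulo num_storage_nodes.
    """
    if stripe_size <= 0 or num_storage_nodes <= 0:
        raise ValueError("stripe_size and num_storage_nodes must be positive")
    num_stripes = -(-file_size // stripe_size)  # ceiling division
    return [i % num_storage_nodes for i in range(num_stripes)]


# Hypothetical example: a 1 MiB file striped in 256 KiB units over 4 nodes.
targets = stripe_targets(1024 * 1024, 256 * 1024, 4)
nodes_contacted = sorted(set(targets))
```

Under these assumptions, a compute node reading the whole file contacts every storage node, and every other compute node reading a similarly striped file does the same, which is the many-to-many traffic pattern described above.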
If parallel file systems were used in modern data centers that use smaller commodity switches, performance bottlenecks would arise. Commodity switches lack sufficient buffer space for each port, which causes packets to be dropped if too many packets are directed towards a single port. Commodity switches also have a limited number of ports, necessitating a hierarchy of switches between compute nodes and storage nodes. Consequently, with each successive level in the hierarchy, more nodes must share a decreasing amount of available bandwidth to the parallel file system.
Cheaper commodity-based computing clusters do not match the performance of supercomputers due to inherent limitations of low-end hardware. Cloud computing and software frameworks (e.g., MapReduce) for processing and generating large data sets enable use of inexpensive commodity-based computing clusters in data centers. Many data centers use internet-scale file systems that rely on co-locating compute processing and required data. The internet-scale file systems avoid bottlenecks created by parallel file systems by striping data in very large chunks (e.g., 64 MB) directly on compute nodes, with each job performing local data access. However, co-locating compute and data creates other limitations in system architecture. For example, data needs to be replicated on multiple nodes to prevent data loss and alleviate I/O bottlenecks, which increases availability and integrity while proportionally reducing available disk space. In addition, general or traditional applications cannot utilize these file systems because of their lack of Portable Operating System Interface for Unix (POSIX) support and data sharing semantics and their limited support for remote data access using Network File System (NFS) or Common Internet File System (CIFS) protocols.
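The proportional reduction in available disk space caused by replication can be shown with a short worked example. This is a sketch under stated assumptions: every block is assumed to be stored the same number of times, and the function name usable_capacity and the figures used are hypothetical.

```python
def usable_capacity(raw_capacity, replication_factor):
    """Return the capacity left for unique data when every block is
    stored replication_factor times (uniform replication is assumed)."""
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    return raw_capacity / replication_factor


# Hypothetical figures: 300 TB of raw disk with three copies of each block.
unique_data_tb = usable_capacity(300, 3)
```

With three copies of each block, only one third of the raw disk space holds unique data; the remainder is consumed by replicas.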
Typical data centers use multi-tier trees of network switches to create computing clusters. Servers are connected directly into a lower tier consisting of smaller switches, with an upper tier that aggregates the lower tier. The network infrastructure is oversubscribed where large switches are used in the upper tiers. The oversubscription is due to cost limitations for typical data centers. Consequently, the oversubscription creates an inter-switch bottleneck that constrains data access in data centers.
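The effect of oversubscription on a single lower-tier switch can be estimated with simple arithmetic. The following sketch assumes that all servers on the switch send traffic through the uplink at the same time and share it equally; the function names and the port counts and link speeds are hypothetical values chosen for illustration.

```python
def oversubscription_ratio(num_servers, server_link_gbps, uplink_gbps):
    """Ratio of total server-facing bandwidth to uplink bandwidth
    on one lower-tier switch."""
    return (num_servers * server_link_gbps) / uplink_gbps


def per_server_uplink_share(num_servers, uplink_gbps):
    """Uplink bandwidth available to each server when all servers
    cross the uplink at once (equal sharing is assumed)."""
    return uplink_gbps / num_servers


# Hypothetical switch: 40 servers at 1 Gb/s each, one 10 Gb/s uplink.
ratio = oversubscription_ratio(40, 1.0, 10.0)
share = per_server_uplink_share(40, 10.0)
```

For these assumed values the switch is oversubscribed 4:1, so a server that can reach a neighbor on the same switch at 1 Gb/s obtains only 0.25 Gb/s to nodes behind the upper tier when all servers are active.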