The advent and maturing of Internet technology over the last few decades have fundamentally changed the landscape of the information technology (IT) industry. The success and popularity of (mostly Ethernet based) Internet Protocol (IP) networks have made this technology the prime architectural choice in most IT environments. Central mainframe computers have in most cases been replaced by distributed client-server architectures connected by very powerful IP networks.
This technology has steadily found its way into the media industry as well. IP based architectures are now fully accepted as the standard solution for file based media production and have drastically changed the way broadcasters operate and function internally. Applying an ICT based infrastructure, with IP networks as the means of transport, in video/media production in particular introduces a number of substantial potential benefits, facilitating the fundamental shift from traditional tape-based video manipulation to a file-based production paradigm. This technology leap enables video to be treated, processed, stored and transported as ordinary files independent of the video format, instead of as the continuous streams used by the classical media technology of today. This evolution has drastically shifted media infrastructure technology towards central disk-based media storage. Many broadcasters have since adopted a tape-less TV production vision. This idea is further supported by the appearance of camera devices with storage facilities other than the traditional videotapes, e.g. optical disks (Sony) or Solid State memory cards (Panasonic).
Typically, camera crews now enter the facilities with their video stored as ordinary files on memory cards instead of on video tape. The memory cards are put into ingest stations, e.g. ordinary PCs, and the files are transferred as fast as possible, preferably faster than real time, into a central disk based storage system. Once stored in the central system, the material can be accessed by everybody simultaneously.
Storage is one of the most important media services in a file-based media environment. As with IP networking in media, media storage requirements differ largely from those of classical IT storage solutions. Although architectures based on generic IT storage components are preferred over very expensive, less reliable proprietary media solutions, mainly because of economies of scale, reliability and cost, media poses very high requirements on the file system. These special file system requirements stem from extreme characteristics in terms of (parallel) throughput, storage capacity, scalability, redundancy, availability and reliability.
IBM's General Parallel File System (GPFS) is one of the most powerful media file systems available on the market today. It is a file management infrastructure providing high performance and reliability with scalable access to critical file data. In addition to providing file storage capabilities, GPFS provides storage management, information life cycle management tools and centralized administration, and allows for shared access to file systems from remote GPFS clusters. GPFS provides scalable high-performance data access for clusters ranging from a single node, through a two node cluster providing, for example, a high availability platform supporting a database application, up to 2000 nodes or more. GPFS was designed from the beginning to support high performance parallel applications and has since proven very effective for a variety of applications.
Traditionally, the storage industry builds very scalable storage clusters (2) (i.e. groups of nodes for storage purposes) based on a classical Fibre Channel (FC) Storage Area Network (SAN) architecture (see FIG. 1) using the FC protocol for transport of storage traffic. This architecture typically scales throughput together with storage capacity, i.e. if one requires more throughput, more disks are required, and vice versa for capacity increases. Since it is based on high end storage systems (10), the solution is quite expensive. In the end, the Fibre Channel network typically becomes the scalability bottleneck of this classical IT architecture when heavily loaded with media traffic.
However, many essential media services, such as transcoding and rewrapping services, require a processing power platform close to the storage service. This leads to the definition of many distinct storage services, each with its particular characteristics. Hence, one can distinguish e.g. primary capacity storage suited for HD (high definition), SD (standard definition), low resolution video and audio, secondary disk-based and tape-based storage with high volume but low cost for backup and recovery, ingest storage, central editing storage, temporary storage, distribution storage, etc. Hence, there is a need for a cost-effective fit-for-purpose storage cluster architecture that can provide independent and tuneable scaling for processing power, throughput, storage capacity and availability, preferably using low-end cheap commodity components.
A GPFS cluster based on a ‘network attached node’ (NAN) model perfectly fulfils these requirements (see the example of FIG. 2). A GPFS cluster based on the NAN node model consists of storage cluster nodes (4) and network attached cluster nodes (6). The storage nodes (4) have local storage or are directly connected to an external storage system (10), either locally attached or via a SAN architecture. NAN nodes are connected to all storage nodes via a cluster network, but are not directly attached to the underlying storage system (10). Each storage node is a primary server for part of the total storage. The NAN node stripes its data requests over all storage nodes, thereby aggregating the available bandwidth of each individual storage node and connected storage subsystems.
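The striping behaviour described above can be illustrated with a minimal sketch. It assumes a simple round-robin placement of consecutive blocks over the storage nodes; the actual GPFS placement policy is not described here, and the function names and node labels are hypothetical.

```python
def stripe_blocks(num_blocks, storage_nodes):
    """Assign consecutive block indices to storage nodes round-robin.

    Round-robin placement is an illustrative assumption, not the
    documented GPFS allocation policy.
    """
    assignment = {node: [] for node in storage_nodes}
    for block in range(num_blocks):
        node = storage_nodes[block % len(storage_nodes)]
        assignment[node].append(block)
    return assignment


def aggregate_bandwidth(per_node_gbps):
    """Upper bound on the bandwidth seen by one NAN node when every
    storage node serves its share of the stripe in parallel."""
    return sum(per_node_gbps)


if __name__ == "__main__":
    nodes = ["s1", "s2", "s3", "s4"]  # hypothetical storage node names
    print(stripe_blocks(8, nodes))
    print(aggregate_bandwidth([8.0, 8.0, 8.0, 8.0]), "Gb/s aggregate")
```

In this model each storage node serves every fourth block, so a single read by a NAN node draws on all four nodes at once, which is what aggregates the bandwidth.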
Initially, the network protocol and architecture for the cluster network was TCP/IP. It has been demonstrated that the same traffic could be passed unaltered over Infiniband (IB) via IPoIB. In a later version, IB verbs, also sometimes called native IB, became supported. The cluster depicted in FIG. 2 uses IB (5) as cluster network.
The cluster can be independently scaled for processing power by reinforcing the NAN nodes in terms of CPU or increasing the number of NAN nodes in the cluster. If more throughput is required, the cluster network can be scaled to higher bandwidths, e.g. from single data rate Infiniband (SDR) to double data rate (DDR) and in the future to quadruple data rate (QDR). The throughput to the clients can be augmented by adding NAN nodes. The storage throughput can be optimised by using faster disks or increasing the number of storage nodes. The pure storage capacity can be scaled by using larger hard disks, putting more storage under each storage node or increasing the number of storage nodes. Every component in the cluster can be made redundant to avoid any single point of failure. On top of that, GPFS has the concept of failure groups to further protect the storage system.
The data storage traffic over the cluster network (i.e. the network interconnecting the storage nodes with the NAN nodes) can be considered a particular case of media traffic. In order to optimise the hard disks for media use, the segment size of the disks should be set as large as possible. The disks are combined in a Redundant Array of Independent Disks (RAID) protection configuration. This results in very large I/O block sizes, typically 4 MB, being transported by the file system over the cluster network. Hence, this is a case of extremely bursty traffic.
In both reading and writing operations the cluster network displays a many-to-one traffic pattern. When a NAN node reads from the storage nodes, all the storage nodes respond to the NAN node at the same time with large bursts of traffic. If, on the other hand, multiple NAN nodes write data to the storage, the receiving storage nodes are simultaneously addressed by the bursts of all the writing NAN nodes. Both cases result in heavy oversubscription of the cluster network. Since high efficiency is of extreme importance in a media storage architecture, packet loss and the resulting retransmissions by TCP are heavily penalised and should be avoided at all costs. Some media file systems using an IP network as cluster network try to remedy this by using UDP. They count on very large switch buffers to overcome the packet loss due to the oversubscription. This is only effective if the number of devices actively participating in such a cluster architecture is relatively small and pre-fetching is not used aggressively. However, this puts heavy constraints on the maximum throughput and fails when too many traffic requests interfere with each other.
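To make the many-to-one oversubscription concrete, the following sketch estimates the oversubscription ratio at the receiving port and the switch buffer that would be needed to absorb one simultaneous burst without loss. It is a simplified model under stated assumptions: all senders start at the same instant, all transmit at full line rate, and the function names and example figures (four senders, 4 MiB bursts, 8 Gb/s links) are hypothetical.

```python
def oversubscription_ratio(num_senders, sender_gbps, receiver_gbps):
    """Offered load at the receiving port divided by its drain rate."""
    return num_senders * sender_gbps / receiver_gbps


def buffer_needed_bytes(num_senders, burst_bytes, sender_gbps, receiver_gbps):
    """Buffer required to absorb one simultaneous burst from every sender.

    While the bursts arrive (duration = burst_bytes / sender rate), the
    receiver port drains burst_bytes * receiver_gbps / sender_gbps bytes.
    Whatever arrives beyond that has to be buffered or is dropped.
    """
    arrived = num_senders * burst_bytes
    drained = burst_bytes * receiver_gbps / sender_gbps
    return max(0.0, arrived - drained)


if __name__ == "__main__":
    burst = 4 * 2**20  # one 4 MiB I/O block per sender (illustrative)
    print(oversubscription_ratio(4, 8, 8))
    print(buffer_needed_bytes(4, burst, 8, 8) / 2**20, "MiB of buffer")
```

In this model the buffer requirement grows linearly with the number of simultaneous senders, which is consistent with the observation that relying on large switch buffers only works for a small number of participating devices.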
In the case of GPFS, TCP/IP was originally used as the protocol stack. This allows for flow control, but at the cost of limiting maximum throughput. As the prevailing network technology for TCP/IP is Ethernet, packet loss in the Ethernet network leads to retransmissions and further limits throughput efficiency. For this particular type of traffic the network technology used by Fibre Channel or Infiniband is very effective. Their flow control mechanism is based on buffer-to-buffer credits to eliminate packet loss completely in case of oversubscription. Credits of available buffers are continuously exchanged between ports on the same link. When no buffer credits are available, no packets are transmitted until the network has processed its congestion and buffers become available again.
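The buffer-to-buffer credit principle can be sketched as a toy model of a single link. This is an illustration of the idea only, not of the Fibre Channel or Infiniband wire protocols; the class and method names are hypothetical.

```python
from collections import deque


class CreditLink:
    """Toy model of one link with buffer-to-buffer credit flow control."""

    def __init__(self, receiver_buffers):
        # The sender starts with one credit per free receive buffer.
        self.credits = receiver_buffers
        self.rx_queue = deque()

    def try_send(self, frame):
        """Transmit only if a credit is available; otherwise hold the frame."""
        if self.credits == 0:
            return False  # the sender waits; nothing is dropped
        self.credits -= 1
        self.rx_queue.append(frame)
        return True

    def receiver_drain(self):
        """The receiver frees one buffer and returns a credit to the sender."""
        if not self.rx_queue:
            return None
        frame = self.rx_queue.popleft()
        self.credits += 1
        return frame


if __name__ == "__main__":
    link = CreditLink(receiver_buffers=2)
    print([link.try_send(f) for f in ("f1", "f2", "f3")])  # third is held back
    link.receiver_drain()                                  # one buffer freed
    print(link.try_send("f3"))                             # now it can go
```

Because the sender never transmits without a credit, the receive queue can never overflow; congestion shows up as waiting at the sender rather than as lost packets.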
Hence, the cluster depicted above uses Infiniband as cluster network technology. This is a very cheap technology with high bandwidth. The net data bandwidth is 8 Gb/s for SDR-IB and 16 Gb/s for DDR-IB. The capability of the PCI express bus becomes the next bottleneck. Also, because the buffer-to-buffer credit flow control mechanism acts on all traffic on the link at the same time, it poses a constraint on the linear scalability of such a solution.
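The net bandwidth figures quoted above can be reproduced with a short calculation. The sketch assumes a 4x link width, per-lane signalling rates of 2.5 Gb/s (SDR) and 5 Gb/s (DDR), and 8b/10b line encoding; these parameters are background knowledge about Infiniband rather than values stated in this text. The transfer-time helper additionally assumes a 4 MiB block and ignores protocol overhead.

```python
def net_data_rate_gbps(lane_signal_gbps, lanes=4, payload_bits=8, line_bits=10):
    """Net data rate of a link after line-encoding overhead.

    Defaults assume a 4x link with 8b/10b encoding (an assumption,
    not a value taken from the text).
    """
    return lane_signal_gbps * lanes * payload_bits / line_bits


def block_transfer_ms(block_bytes, net_gbps):
    """Time to move one block at the net data rate, ignoring protocol overhead."""
    return block_bytes * 8 / (net_gbps * 1e9) * 1e3


if __name__ == "__main__":
    sdr = net_data_rate_gbps(2.5)  # single data rate
    ddr = net_data_rate_gbps(5.0)  # double data rate
    print(sdr, ddr)
    print(round(block_transfer_ms(4 * 2**20, sdr), 2), "ms per 4 MiB block at SDR")
```

Under these assumptions a single 4 MiB block occupies an SDR link for roughly four milliseconds, which gives a sense of the burst length the cluster network has to handle.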
The Infiniband stack is extremely efficient for Linux based servers, reaching the full physical limits of the underlying bus-technology. The processing of the protocol stack is fully offloaded in the Host Channel Adapter (HCA), the IB network cards. Even remote direct memory access (RDMA) is fully supported and exploited. This leads to a very powerful cluster architecture, extremely well adapted for the file-based media production environment.
However, many media client applications require a Microsoft Windows operating system. This is the case both for Windows applications that have to run on the NAN cluster nodes and for applications that require a mount of the central file system via the Common Internet File System (CIFS) protocol. Recently, IBM added a GPFS on Windows client to its NAN node configuration. This allows a Microsoft Windows 2003 or 2008 Server to participate as a NAN node in the GPFS cluster. The state of the art Infiniband stack for Windows machines, however, presently performs much worse than the Linux flavour. The cluster protocol stack has to fall back to using IPoIB without any offloading, since not all GPFS commands are yet supported in the native IB stack for Windows. This decreases the performance of the cluster network by a factor of five.
Recently, new developments supported by some of the leading IP networking companies have led to the definition and implementation of Data Centre Ethernet (DCE). Data Centre Ethernet is a term that refers to enhancements to Ethernet bridges that enable lossless LAN and SAN connectivity over an Ethernet network. The term ‘lossless’ means that the Ethernet bridges (i.e. switches) do not lose frames under congestion. DCE, also known as CEE (Converged or Convergence Enhanced Ethernet) or as DCB (Data Centre Bridging), describes an enhanced Ethernet that enables convergence of various applications in data centres (LAN, SAN, high-performance computing) onto a single interconnect technology.
DCE is known in the art, for example from Cisco patent application US2006/251067, wherein a Data Centre Ethernet network and related methods and device are provided in the context of Fibre Channel over Ethernet. A DCE network simplifies the connectivity of data centres and provides a high bandwidth, low latency network for carrying Ethernet, storage and other traffic.
In the Cisco whitepaper “Data Center Ethernet: Cisco's Innovation for Data Center Networks” an overview of DCE is provided. In the past, separate physical network infrastructures were deployed beside each other to support different traffic types, such as a Fibre Channel network for storage traffic, a classical lossy Ethernet network for IP data traffic or iSCSI storage traffic, and an IB network for cluster traffic. Each network technology has different characteristics matching its application. DCE supports multi-protocol transport over one and the same Ethernet network fabric to consolidate these different applications and protocols on the same physical network infrastructure. It does so by defining a PFC (Priority-based Flow Control) mechanism to distinguish different traffic classes and to selectively pause a certain traffic class while continuing the transmission of other traffic classes on the same link. A “no-drop” service class can be defined for FC traffic over the Ethernet link (FCoE), providing a lossless Ethernet fabric for the FC protocol based storage traffic, while other Ethernet traffic is transported in the normal lossy way. Linked to the PFC classification, bandwidth provisioning per priority can be introduced. Optimal use of the available physical bandwidth can be enhanced by the introduction of Layer 2 Multipathing, increasing the throughput and scalability of Layer 2 Ethernet network topologies. Layer 2 is the well known data link layer of the seven layer OSI model.
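The selective pausing performed by PFC can be illustrated with a small model of one egress port. It assumes eight priority classes, as in IEEE 802.1 priority tagging, and simplifies the pause signalling to direct method calls; real PFC uses timed pause frames sent by the congested receiver. The class name, method names and the choice of priority 3 for storage traffic are hypothetical.

```python
class PfcPort:
    """Toy model of an egress port with per-priority pause."""

    NUM_PRIORITIES = 8  # assumption: eight traffic classes per link

    def __init__(self):
        self.paused = [False] * self.NUM_PRIORITIES
        self.queues = [[] for _ in range(self.NUM_PRIORITIES)]

    def enqueue(self, priority, frame):
        self.queues[priority].append(frame)

    def pause(self, priority):
        """Model the link peer asking this port to stop sending one class."""
        self.paused[priority] = True

    def resume(self, priority):
        self.paused[priority] = False

    def transmit_round(self):
        """Send at most one frame from every class that is not paused."""
        sent = []
        for prio in range(self.NUM_PRIORITIES):
            if not self.paused[prio] and self.queues[prio]:
                sent.append((prio, self.queues[prio].pop(0)))
        return sent


if __name__ == "__main__":
    port = PfcPort()
    port.enqueue(3, "storage-frame")  # hypothetical "no-drop" class
    port.enqueue(0, "ip-frame")       # ordinary lossy class
    port.pause(3)
    print(port.transmit_round())      # only the priority 0 frame is sent
    port.resume(3)
    print(port.transmit_round())      # the held storage frame follows
```

The paused class keeps its frame queued instead of dropping it, while the other class continues to use the link, which is the distinction drawn above between the “no-drop” service class and normal lossy Ethernet traffic.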
Cisco application US2006/087989 provides methods and devices for implementing a Low Latency Ethernet solution, also referred to as a Data Centre Ethernet solution, which simplifies the connectivity of data centres and provides a high bandwidth, low latency network for carrying Ethernet and storage traffic. Some preferred implementations of the disclosure implement multiple virtual lanes (VLs) in a single physical connection of a data centre or similar network. Some VLs are “drop” VLs, with Ethernet-like behaviour, and others are “no-drop” lanes with FC-like behaviour. Active buffer management allows for both high reliability and low latency while using small frame buffers.
Cisco's whitepaper “Fibre Channel over Ethernet Storage Networking Evolution” describes the different evolutionary phases for introducing Fibre Channel over Ethernet (FCoE), replacing classical FC network environments with a unified Ethernet fabric. First, FCoE is implemented at the server end to enable standalone servers to communicate with FC-attached storage systems using the traditional FC protocol but over a single unified Ethernet interface and cabling, thereby preserving the existing SAN operating and management model. In a second phase, Cisco predicts that blade servers will also make use of that same principle and that FC switches will offer support for FCoE. In a third phase, storage arrays and tape libraries will support native FCoE interfaces. This enables the LAN and FCoE SAN traffic to converge onto a single, unified network fabric in the future.