In the past, data centers typically implemented two completely separate network infrastructures: a data communication network (typically based on Ethernet) and a separate “storage” network for storage access. A typical storage network implemented the conventional Fibre Channel protocol. The expressions “data communications network” and “data network” are used herein as synonyms to denote a network in a class distinct from the class of “storage networks,” in the sense that a storage network is configured and employed to carry primarily “storage data” traffic (where “storage data” denotes data retrieved from, or to be stored on, at least one storage device), and a data network is configured and employed to carry primarily other data traffic (i.e., data which is not storage data).
Undesirably, however, implementation of multiple network types (e.g., separate data and storage networks) increases the capital and operational costs of running a data center.
Recently, many data centers have begun to investigate use of (and some have begun to use) a single network which carries both storage data traffic and other (non-storage data) traffic. Such a single network will be referred to herein as a “converged network.” An example of a converged network is an Ethernet based network on which all traffic is sent between servers coupled to the network and storage devices coupled (via adapters) to the network. Unfortunately, the two types of network traffic (storage data traffic and other data traffic) to be sent over a converged network have different characteristics.
Data networks (e.g., those implementing Ethernet with the Internet Protocol), in order to carry traffic other than storage data traffic, can be (and thus are typically) implemented as un-managed or minimally managed networks. This makes it simple to add and remove computers and other hardware to or from a data network. For example, the DHCP protocol can typically provide (without human intervention) to new devices all the information they need to operate on a data network.
However, network loops can cause serious problems in data networks (e.g., continuous forwarding of packets that should be dropped). For this reason, data networks often implement a protocol (e.g., the Spanning Tree Protocol) to ensure that only one path is known between any two devices on the data network. Redundant data paths are rarely set up explicitly on data networks. Further, traffic on data networks is relatively unpredictable, and applications are usually written to tolerate whatever bandwidth is available on data networks.
In contrast, storage networks are usually managed networks. A network administrator typically manually assigns what computers can communicate with which storage devices on a storage network (i.e., there is usually no self-configuration). There has been little development in making the network connections (in a storage network which is implemented to be separate from a data network) adaptable to changing conditions. Further, in order to provide the high level of availability and fault tolerance typically required for low level data storage, there are typically fully redundant paths between a storage device (coupled to a storage network) and a computer.
As a result of the differences between storage networks (and the storage data traffic thereof) and data networks (and the non-storage data traffic thereof), combining both storage data traffic and other traffic in a converged network can lead to imbalances in network utilization, which can reduce the overall performance of applications in a data center. Typical embodiments of the present invention address such imbalances in utilization of a converged network, e.g., to allow a data center's applications to approach the maximum performance available.
The following definitions apply throughout this specification, including in the claims:
“storage device” denotes a device which is configured to store and retrieve data (e.g., a traditional rotating disk drive). Typically, a storage device is accessed using a Logical Block Address (LBA) and a number of blocks. A logical block is a fixed-size chunk of the total storage capacity (e.g., 512 or 4096 bytes);
“server” denotes a computing device configured to access and use a storage device across a network (e.g., a converged network) to store and retrieve data (e.g., files and/or applications);
“adapter” denotes a device configured to connect a storage device, or a storage system (e.g., a JBOD) comprising two or more storage devices, to a network (e.g., a converged network). In typical embodiments of the invention, each storage device is normally accessible to a server via two or more adapters in order to provide failure tolerant access to data stored on the storage device;
“interface” denotes a component of a server or adapter that connects the device (the server or adapter) to a network (e.g., a converged network). Examples of an interface are a physical device (i.e., a Network Interface Controller (NIC)) and a software-defined wrapper of multiple NICs (as for link aggregation). In typical embodiments of the invention, an interface is a hardware or software element that has its own Internet Protocol (IP) address in a converged network;
“agent” denotes a software or hardware component or subsystem, of a server (or an adapter), configured to run on the server (or adapter) during operation of the server (or adapter) to exchange (or prepare for the exchange of) storage data traffic on a network (e.g., a converged network). In some embodiments of the invention, not all servers and adapters on a converged network have agents. However, coupling of non-participating servers and/or adapters (servers and/or adapters without agents) to a network may limit the degree of balancing that can be achieved (in accordance with embodiments of the invention); and
“data path” denotes a path along which data is sent between a storage device and a server via an adapter, using one interface on each of the adapter and the server (i.e., a path from the storage device to the server through the adapter interface and through the server interface, or a path from the server to the storage device through the server interface and the adapter interface). In an IP network, a data path can typically be denoted by the combination of the IP address of the server's interface and the IP address of the adapter's interface, and, optionally, also by the port number to be used at the adapter. However, in the case of link aggregation, the full path would depend on the actual interface used for the path within the group of interfaces bonded into one IP address.
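The “data path” definition above can be illustrated with a minimal sketch, assuming an IP network without link aggregation. The class name `DataPath`, the helper `enumerate_paths`, the example addresses, and the port value are all hypothetical and chosen only for illustration; they are not taken from this specification.

```python
from dataclasses import dataclass
from ipaddress import ip_address
from typing import List, Optional


@dataclass(frozen=True)
class DataPath:
    """Hypothetical record: one server interface paired with one adapter interface."""
    server_ip: str                       # IP address of the server's interface
    adapter_ip: str                      # IP address of the adapter's interface
    adapter_port: Optional[int] = None   # optional port number used at the adapter

    def __post_init__(self) -> None:
        # ip_address raises ValueError if either string is not a valid IP address.
        ip_address(self.server_ip)
        ip_address(self.adapter_ip)


def enumerate_paths(server_ips: List[str], adapter_ips: List[str],
                    port: Optional[int] = None) -> List[DataPath]:
    """List every candidate data path between the given interfaces."""
    return [DataPath(s, a, port) for s in server_ips for a in adapter_ips]


# Example: a server with two interfaces and a storage device reachable via
# two adapter interfaces yields four candidate data paths.
paths = enumerate_paths(["192.0.2.1", "192.0.2.2"],
                        ["192.0.2.50", "192.0.2.51"], port=5000)
```

Because each record is immutable and hashable, a set of such records could serve as the pool of candidate paths from which one is selected at a time.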
When a storage system (e.g., a JBOD) comprising two or more storage devices is coupled to an adapter, and both the adapter and a server are coupled to a converged network, we contemplate that a server (in order to access a storage device of the storage system) will typically specify (i.e., be configured to use) a specific storage device of the storage system (e.g., one disk drive of a JBOD) and a data path between the server and the storage device. In accordance with typical embodiments of the present invention, the data path may be changed from time to time in order to balance storage data traffic on the network. In accordance with some embodiments of the present invention, the data path (between the server and the storage system) may be changed from time to time in order to balance storage data traffic on the network (also, the adapter's selection of the specific device of the storage system to be accessed by the server may change from time to time but such changes would not necessarily be determined in accordance with the invention).
In general, when storage data traffic is combined with other data traffic on a converged network, the attributes of the different types of traffic can combine to result in inefficient use of the network's overall bandwidth, limiting the performance of the data communications traffic and/or the storage traffic.
For example, it is common for a modern server computer to include two or more 1 Gbps or 10 Gbps network interfaces (referred to herein as “interfaces” when the server is connected to a converged network). Many such servers run a software package (e.g., the Hadoop open source software package) that allows a large number of servers to work together to solve problems involving massive amounts of data. However, such software (e.g., Hadoop) typically requires each server to have a unique name and address. Therefore, the data communications traffic between servers running the software (e.g., Hadoop) will typically use only one of the two (or more) network connections available on each server.
In contrast, storage data traffic is usually configured to have redundant paths between servers and disk drives in order to survive failures of any of the components. These redundant paths can be used to redirect storage data traffic (e.g., spread storage data traffic among network interfaces) to avoid network interfaces which are made busy by data communications traffic (non-storage traffic). However, the standard mechanisms (e.g., Multipath I/O or “MPIO” methods) for implementing this redirection create a severe performance penalty in the storage data traffic on a converged network. Specifically, the normal storage data load spreading mechanisms are based on sending storage commands across all available interfaces in round-robin fashion, or on determining some measure of how much work is outstanding on each link (e.g., number of commands outstanding, or total number of bytes outstanding, or some other measure) and sending commands to the ‘least busy’ interface. The reason that these mechanisms cause a large performance penalty for storage data traffic between servers and disk drives is that, to obtain maximum performance, the commands executed by a disk drive must access consecutive locations on a disk. If commands are not sent to access consecutive locations, then a ‘seek’ operation is required to move the disk drive's read/write heads to a new location. Each such seek operation will typically reduce the overall performance by approximately 1% or more. Conventional spreading mechanisms (round-robin or ‘least-busy’ spreading mechanisms) increase the number of seeks required to execute a sequence of disk access commands, because they frequently cause consecutive commands in the sequence to take different paths from the server to the disk drive. The different paths will have different processing times and latencies (due to other operations on each path), so the commands issued in one order will often be executed in a different order. Each reordering will cause a seek, and thereby reduce the overall data carrying capacity. It has been observed that these conventional spreading mechanisms, when applied to Hadoop storage operations, reduce the total performance of the storage data traffic by approximately 75% (i.e., the amount of storage data that can be transferred is about 25% of the amount that is possible without using round-robin or least-busy mechanisms).
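The reordering effect described above can be illustrated with a toy simulation. This is a sketch under simplifying assumptions that are not part of this specification: commands access strictly consecutive block ranges, each path adds a fixed latency, the disk executes commands in arrival order, and any non-consecutive access counts as one seek. The function name `count_seeks` and all parameter values are hypothetical.

```python
from typing import List


def count_seeks(num_commands: int, path_latencies: List[float],
                blocks_per_command: int = 8, issue_interval: float = 1.0) -> int:
    """Count seeks when sequential commands are spread round-robin over paths."""
    arrivals = []
    for i in range(num_commands):
        path = i % len(path_latencies)            # round-robin path choice
        arrival = i * issue_interval + path_latencies[path]
        lba = i * blocks_per_command              # consecutive block ranges
        arrivals.append((arrival, i, lba))

    # Assumption: the disk executes commands in order of arrival
    # (ties are broken by issue order).
    arrivals.sort()

    seeks = 0
    expected_lba = 0                              # head starts at LBA 0
    for _, _, lba in arrivals:
        if lba != expected_lba:                   # non-consecutive access
            seeks += 1
        expected_lba = lba + blocks_per_command
    return seeks


single_path_seeks = count_seeks(8, [5.0])         # one path: order is preserved
two_path_seeks = count_seeks(8, [1.0, 4.5])       # unequal latencies: reordering
```

In this model a single path preserves the issue order regardless of its latency, whereas two paths with unequal latencies interleave the commands out of order, so most commands incur a seek.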
Another conventional technology, known as ‘link aggregation,’ is sometimes applied to split traffic between a first device (typically, a server) and a second device (typically, another server), each having multiple interfaces available to couple it to a network, across the set of all interfaces which are available to couple the devices to the network. In accordance with link aggregation, to achieve a kind of load balancing, a new choice of one of the first device's interfaces and one of the second device's interfaces is made (e.g., in a random or pseudorandom manner) before each new flow of data values (i.e., each new sequence of data values which are not to be transmitted out of sequence) is transmitted from the chosen interface of one of the devices over the network to the chosen interface of the other device. This allows data communication traffic (averaged over many flows) to use all the available interfaces and keeps a rough balance between the amount of data sent on each interface (unless one interface fails).
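The per-flow interface choice described above can be sketched as follows. This is a simplified illustration, assuming a pseudorandom choice that is made once per flow and then remembered so that data values within a flow stay in sequence; real link aggregation implementations commonly derive the choice from packet header fields instead. The class name `LinkAggregator`, the interface names, and the flow identifiers are hypothetical.

```python
import random
from typing import Dict, Hashable, List, Tuple


class LinkAggregator:
    """Hypothetical sketch: pick one interface pair per flow, pseudorandomly."""

    def __init__(self, first_ifaces: List[str], second_ifaces: List[str],
                 seed: int = 0) -> None:
        self._first = list(first_ifaces)      # interfaces of the first device
        self._second = list(second_ifaces)    # interfaces of the second device
        self._rng = random.Random(seed)       # seeded for repeatable behavior
        self._flows: Dict[Hashable, Tuple[str, str]] = {}

    def interfaces_for(self, flow_id: Hashable) -> Tuple[str, str]:
        """Return the interface pair for a flow, choosing it on first use."""
        if flow_id not in self._flows:
            self._flows[flow_id] = (self._rng.choice(self._first),
                                    self._rng.choice(self._second))
        return self._flows[flow_id]


lag = LinkAggregator(["eth0", "eth1"], ["eth0", "eth1"])
pair = lag.interfaces_for("flow-1")           # stable for the life of the flow
used = {lag.interfaces_for(f"flow-{n}")[0] for n in range(200)}
```

Averaged over many flows, every interface of the first device is expected to carry some traffic, while any single flow stays on one interface pair.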
Conventionally, it is not recommended to perform link aggregation to transmit storage data over a network. However, even if a form of link aggregation were used (contrary to conventional recommended practice) in an effort to balance storage data traffic over a converged network between multiple interfaces of a server and multiple interfaces of an adapter, such use of link aggregation would not prevent significant imbalances in storage data traffic in the converged network. Significant imbalances would result from the design decisions necessary to maintain the fault tolerance of the storage traffic. That is, the need for a fully redundant path to each storage device (via at least one adapter) from a server requires that each storage device (or storage subsystem comprising multiple storage devices) be attached to the network by two completely separate network-connected devices (i.e., two separate adapters), each coupled between the storage device (or storage subsystem) and the network. Otherwise, if there were only one adapter, the failure of the adapter would render the storage device (or subsystem) unusable. Since each such adapter must be a separate device, link aggregation cannot balance the network load between two adapters providing redundant data paths to the same storage device (or storage subsystem), and cannot prevent significant imbalances in storage data traffic through one adapter relative to storage data traffic through another adapter providing a redundant data path to the same storage device (or storage subsystem). Because the adapters are separate devices, one can be busier, and therefore slower, than the other one(s) that can access the same storage device. In contrast, typical embodiments of the present invention can alleviate storage data traffic imbalances (and prevent significant storage traffic imbalances) in a converged network, even when link aggregation is in use.