Managing Distributed Computing
The connectivity requirements of large, multi-sited customers depend on various parameters: the number and location of sites, traffic volume, quality of service for specific applications, etc. It is hard for a customer to plan the exact amount of network resources needed for a specific location. At the same time, for an operator, the cost of access networks depends on customer location, and it is difficult to control these costs before winning customers. We focus our effort on a distributed management system that aims to improve customer flexibility and reduce operator costs.
This work is motivated by the way distributed systems are evolving in the current Internet market place. In the past, distributed computing was mainly represented by server farms managed by a single organisation with an integrated set of applications and specific network requirements.
The current trend is to deploy new services and applications on third party hosted platform across the Internet. The success of Service Oriented Architecture (SOA) has motivated the development of services and functions accessible over the networks. SOA is based on the concept of loose coupling among applications/services and physical resources.
In this way software developers can combine and reuse these functions to develop new business applications. Amazon Elastic Compute Cloud (EC2) is an example of how SOA is changing the distributed computing world. The solution provides a grid computing model where several servers can be deployed in clusters to provide scalability and high availability. The aim is to provide an effectively unlimited amount of computing resource to any customer that is willing to pay for it.
The economic concept behind this is the need for agile corporations to sell their underutilised computing assets and hire additional computing when the demand for new services increases. What is important in the context of this work is the ability to provide a dynamic provision of resources scaling up and down based on application requirements. The aim is to enable not only optimal usage of infrastructure but also enable major cost savings in terms of energy consumption and better power management.
Managing network usage in a distributed environment presents significant technical hurdles. Customers and service providers cannot plan in advance the requirements for each distributed component.
Developments Relating to Distributed Computing
Various developments relating to distributed computing are considered to be of relevance to the specific technology to which the present invention relates, and will therefore be discussed briefly.
1) Service Oriented Architecture and Grid Computing: Service Oriented Architecture (SOA) has evolved as a form of service design where modular components can be assembled to design distributed services. The style of distribution can range from a vertically integrated co-located system to global-scale grid computing made up of a vast number of systems operated by different organisations. Today major Internet-based organisations (Google, Amazon, Yahoo) exploit these concepts to design and implement scalable services.
SOA is discussed further in the book: “Understanding SOA with Web Services”, Eric Newcomer & Greg Lomow, Addison Wesley (2005). ISBN 0-321-18086-0.
2) Content Distribution Networks and Cloud-Based Services: Content Distribution Networks (CDNs) provide a mechanism capable of providing an improved Internet experience for end-users. CDN servers may be distributed among geographical locations and may thus be physically closer to end-users. In this way they may provide a faster and more reliable Internet experience. With popular content services such as the video sharing and downloading website “YouTube” and the video and audio streaming service known as the “BBC i-Player”, CDN operators may need to limit the bandwidth that a user can consume. As will later be understood, however, even where CDNs are used, particularly popular or high-volume content-providing users can still create congestion to a problematic degree on those networks.
Cloud Computing is further discussed in the article: “Market-Oriented Cloud Computing: Vision, Hype, and Reality for Delivering IT Services as Computing Utilities”, Rajkumar Buyya, Chee Shin Yeo, Srikumar Venugopal, Department of Computer Science and Software Engineering, The University of Melbourne, Australia. Retrieved on 2008-07-31.
FIG. 1 illustrates a CDN scenario where a multi-sited content provider 10 has or makes use of several CDN servers 12 able to provide content to its end-users via a shared network resource 14, for example a cloud infrastructure. (It will be understood that the CDN servers may be under the control of the content-provider organisation, or may be under the control of a separate organisation of which the content-provider organisation is a client. For the purposes of this explanation, it is sufficient to regard the organisation which controls the CDN servers as the content provider, even if the initial content provider is in fact one step removed from this role.) The content provider 10 and its one or more end-user customers 16, 18 are themselves customers of a network provider responsible for providing the shared network resource 14. The curved dashed lines 13 in FIG. 1 (and later in FIG. 4) symbolise the data flow or traffic from CDN servers 12 (belonging to content provider 10) to content-receiving end-users. The end-users may include one or more “retail customers” 16, one of which is shown symbolically as having a desktop computer 161 and a laptop computer 162, and/or one or more “corporate customers” 18, one of which is shown symbolically as having a desktop computer 181, a laptop computer 182 and a mobile phone device 183. Corporate customers in particular are in fact likely to have several individual users and/or individual access points, each of which may have one or more such associated devices, possibly all forming part of a Virtual Private Network (VPN). This figure is intended to illustrate the types of entities that may be involved in an example scenario for which embodiments of the present invention may be applicable.
Developments Relating to Rate Control, Congestion Signalling and Policing in Data Networks
Various developments relating to rate control, congestion signalling and policing in data networks are considered to be of relevance to the specific technology to which the present invention relates, and will therefore be discussed briefly.
It will be understood that data traversing a network such as the Internet follows a path between a series of routers, controlled by various routing protocols. Each router seeks to move packets closer to their final destination. If too much traffic traverses the same router in the network, the router can become congested and packets start to experience excessive delays whilst using that network path. If sources persist in sending traffic through that router it could become seriously overloaded (congested) and even drop traffic (when its buffers overflow). If sources still persist in sending traffic around this bottleneck it could force more routers to become congested, and if the phenomenon keeps spreading, that can lead to a congestion collapse for the whole Internet—which occurred regularly in the mid-eighties.
1) Rate Control: A solution to that problem has been to ensure that sources take responsibility for the rate at which they send data over the Internet by implementing congestion control mechanisms. According to these mechanisms, sources are required to monitor path characterisation metrics to detect when the path their data is following is getting congested, in which case they react by reducing their throughput. In the absence of such congestion indications, they may slowly increase their throughput. The congestion level is one of the parameters controlling the rate adaptation of a source sending data over a congested path.
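The rate adaptation described above can be illustrated with a minimal additive-increase, multiplicative-decrease sketch. The function name, the increase step, the decrease factor and the rate floor are all illustrative assumptions rather than values taken from any particular protocol specification.

```python
def next_rate(rate, congested, increase=1.0, decrease=0.5, floor=1.0):
    """Return the sending rate for the next round (hypothetical units)."""
    if congested:
        # A congestion indication was seen: cut the rate sharply.
        return max(floor, rate * decrease)
    # No congestion indication: probe for spare capacity slowly.
    return rate + increase


rate = 10.0
# Hypothetical sequence of per-round congestion indications.
for congested in [False, False, True, False]:
    rate = next_rate(rate, congested)
```

In this sketch the rate climbs by one unit per uncongested round and is halved in the round where congestion is reported.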
2) Implicit Congestion Signalling: The congestion level can be signalled either implicitly (through congested routers dropping packets when their buffers overflow or to protect themselves) or explicitly (through mechanisms such as explicit congestion notification—see next subsection). Currently the most common option is implicit signalling. Historically, routers would drop packets when they became completely saturated (which happens when a traffic burst cannot be accommodated in the buffer of the router)—this policy is called “Droptail”. Random Early Detection (RED) (see reference below) is an improvement where routers monitor the average queue length in their buffer and when this is higher than a given threshold, start to drop packets with a probability which increases with the excess length of the queue over the threshold. It is widely used in today's internet because it allows sources to react more promptly to incipient congestion. Sources using Transmission Control Protocol (TCP) are able to detect losses, because a packet loss causes a gap in the sequence; whenever a TCP source detects a loss, it is meant to halve its data transmission rate, which should alleviate the congestion on the router at the bottleneck.
RED is discussed further in the article: S Floyd & V Jacobson: “Random Early Detection Gateways for Congestion Avoidance”, IEEE/ACM Transactions on Networking, Vol 1, No 4, pp 397-413, August 1993.
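The RED behaviour described above, in which the drop probability grows with the excess of the average queue length over a threshold, might be sketched as follows. The threshold values, the maximum probability and the averaging weight are illustrative assumptions, and the linear ramp is only one possible shape for the increasing probability.

```python
def update_average(avg_queue, current_queue, weight=0.002):
    """Exponentially weighted moving average of the queue length."""
    return (1.0 - weight) * avg_queue + weight * current_queue


def red_drop_probability(avg_queue, min_th=5.0, max_th=15.0, max_p=0.1):
    """Drop probability as a function of the average queue length."""
    if avg_queue < min_th:
        return 0.0      # below the threshold: never drop
    if avg_queue >= max_th:
        return 1.0      # queue persistently long: drop every packet
    # In between: probability rises linearly with the excess queue length.
    return max_p * (avg_queue - min_th) / (max_th - min_th)
```

Because the decision uses the averaged queue length rather than the instantaneous one, a short burst does not immediately trigger drops, whereas a persistently long queue does.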
3) Explicit Congestion Notification: Explicit Congestion Notification (ECN) (see reference below) further improves on RED by using a two-bit ECN field in the Internet Protocol (IP) header to signal congestion. It runs the same algorithm as RED, but instead of dropping a packet, it sets its ECN field to the Congestion Experienced (CE) codepoint. The ECN standard requires a receiver to echo any congestion mark signalled in the data back to the sender; for instance, a TCP receiver sets the ECN-Echo (ECE) flag in the TCP header, which the TCP source interprets, for the purpose of its rate control, as if a packet had been dropped. The source then reacts to the congestion by halving its transmission rate and notifies the receiver of this using the Congestion Window Reduced (CWR) flag.
ECN thus allows routers to signal network congestion. This may be used to reduce TCP re-transmission and to increase overall network throughput.
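The exchange of ECE and CWR between a TCP receiver and a TCP source can be illustrated with a simplified sketch. The class names are hypothetical, the congestion window is counted in whole segments, and the model assumes one acknowledgement per data segment; it is not a full TCP implementation.

```python
class EcnReceiver:
    def __init__(self):
        self.echoing = False   # True while ECE must be set on ACKs

    def on_data(self, ce_marked, cwr):
        """Process a data segment; return the ECE flag for its ACK."""
        if cwr:
            self.echoing = False   # sender has confirmed its reaction
        if ce_marked:
            self.echoing = True    # a new congestion mark must be echoed
        return self.echoing


class EcnSender:
    def __init__(self, cwnd):
        self.cwnd = cwnd           # congestion window, in segments

    def on_ack(self, ece):
        """Process an ACK; return the CWR flag for the next data segment."""
        if ece:
            self.cwnd = max(1, self.cwnd // 2)   # react as if to a loss
            return True
        return False


sender, receiver = EcnSender(cwnd=10), EcnReceiver()
ece = receiver.on_data(ce_marked=True, cwr=False)   # router marked CE
cwr = sender.on_ack(ece)                            # sender halves, sets CWR
ece_after = receiver.on_data(ce_marked=False, cwr=cwr)
```

After the segment carrying CWR arrives, the receiver in this sketch stops echoing, so a single congestion mark leads to a single halving of the window.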
The four values of the two-bit ECN field in the IP header are:
- Non ECT, which signifies that the packet belongs to a flow that doesn't support ECN.
- ECT(0) and ECT(1), which signify that the packet belongs to a flow that supports ECN but that upstream routers haven't had cause to mark the packet.
- Congestion Experienced (CE), which signals that a packet has experienced incipient congestion.
ECN is discussed further in the following article: K Ramakrishnan, S Floyd & D Black: “The Addition of Explicit Congestion Notification (ECN) to IP”, RFC 3168, September 2001.
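The four values of the ECN field can be written out as a small enumeration using the bit patterns given in RFC 3168, where the value described above as Non ECT is named Not-ECT. The helper names `ecn_of` and `mark_ce` are hypothetical, and the sketch assumes the ECN field occupies the two least significant bits of the octet passed in.

```python
from enum import IntEnum


class ECN(IntEnum):
    NOT_ECT = 0b00   # flow does not support ECN
    ECT_1 = 0b01     # ECN-capable transport, not yet marked
    ECT_0 = 0b10     # ECN-capable transport, not yet marked
    CE = 0b11        # congestion experienced


def ecn_of(tos_octet):
    """Extract the ECN codepoint from the octet that carries it."""
    return ECN(tos_octet & 0b11)


def mark_ce(tos_octet):
    """Set CE on an ECN-capable packet; leave other packets untouched."""
    if ecn_of(tos_octet) is ECN.NOT_ECT:
        # Marking is not possible here; a router would have to drop instead.
        return tos_octet
    return tos_octet | ECN.CE
```

Setting both bits turns either ECT(0) or ECT(1) into CE, which is why a single bitwise OR is enough in this sketch.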
4) Re-Feedback: The re-feedback framework has been developed to allow for network users' usage to be accounted for based on the congestion externality they cause to other users. It will be understood that one of the functions of the IP header is to carry path information from a sender to a receiver. This path information allows downstream nodes (nodes nearer the receiver) to learn about the upstream state of the path. Mechanisms exist which allow the receiver to feed this information back to the sender. The re-feedback proposal (see reference below, for example) provides a mechanism whereby path information that a receiver feeds back to a sender can be re-inserted into the forward data path, thus allowing nodes along the path to learn information relating to the downstream state of the path as well as information about the upstream state of the path.
The re-feedback proposal is further discussed in the article: “Policing Congestion Response in an Internetwork using Re-Feedback”, Bob Briscoe, Arnaud Jacquet, Carla di Cairano Gilfedder, Alessandro Salvatori, Andrea Soppera and Martin Koyabe, ACM Sigcomm 2005
International patent applications WO 2005/096566 and WO 2005/096567 relate to data networks, and to nodes making up parts of data networks, arranged to derive information relating to the characterisation of paths taken by data travelling between nodes in the networks according to the re-feedback proposal.
Mechanisms based on the re-feedback approach can be used to enable or cause multiple users to share resources relating to Internet capacity in a fair manner. Some such mechanisms may enable light users to increase their usage of network resources even in periods of network congestion while heavy users may be provided with an incentive to improve resource management control. In particular the re-feedback approach may be used to enable network service providers to obtain information about the congestion volume that each user creates.
5) Re-ECN: Re-ECN is an example of a system based on the ECN mechanism that utilises the re-feedback concept, whereby path information that a receiver feeds back to a sender can be “re-inserted” into the forward data path, in order to provide upstream and downstream congestion information throughout the network. With re-ECN, the information “re-inserted” is based on ECN marks in previously transmitted packets. It is similar to ECN but uses an extra bit in the packet header. This bit enables a number of new codepoints to be used. A simple way to understand the re-ECN protocol is to think of each packet as having a different colour flag (corresponding to a codepoint). At the start of a flow, a green flag (FNE or “feedback not established”) is used to indicate that a sender doesn't have existing knowledge of the path. Green flags are also used when the sender is unsure about the current state of the path. By default packets are marked with grey flags. If they encounter congestion during their progress through the network they are marked with a red flag. The destination will send back a count of the number of red flags it has seen. For every red flag it is informed of, the sender should send a packet with a black flag (re-echo). These black flags cannot be modified once they are set by the sender, so signal to a node at any point on the path what the total end-to-end congestion is expected to be (based on the fact that the number of black flags signals the total end-to-end congestion level actually experienced by the immediate predecessors of the current packets). At any intermediate node the upstream congestion is given by the number of red flags seen, and the downstream congestion may therefore be derived by the difference between the number of red flags and the number of black flags.
By “re-inserting” ECN information on the forward path, the re-ECN mechanism provides information which may be used to allow policing of network traffic to be performed in dependence on the contribution to network congestion being caused by the traffic, rather than simply on the volume of the traffic, thereby allowing a limit to be set and policed based on the amount of congestion a specific user may cause.
Re-ECN is further discussed in the article: “Re-ECN: Adding Accountability for Causing Congestion to TCP/IP”; Bob Briscoe, Arnaud Jacquet, Toby Moncaster & Alan Smith, IETF Internet-Draft <draft-briscoe-tsvwg-re-ecn-tcp-07.txt> (March 2009).
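The flag accounting described above for re-ECN, in which an intermediate node derives downstream congestion from the red and black flags it sees, can be illustrated with a small sketch. It assumes, as a simplification, that each observed packet carries exactly one flag represented by a colour string; the function name and the sample counts are hypothetical.

```python
from collections import Counter


def congestion_seen_at_node(flags):
    """Return (upstream, downstream) congestion as packet counts."""
    counts = Counter(flags)
    upstream = counts["red"]                 # marks applied before this node
    expected_end_to_end = counts["black"]    # re-echoed by the sender
    downstream = expected_end_to_end - upstream
    return upstream, downstream


# Hypothetical observation window of 100 packets at an intermediate node.
observed = ["grey"] * 95 + ["black"] * 3 + ["red"] * 2
upstream, downstream = congestion_seen_at_node(observed)
```

With these sample counts the node has seen two marks applied upstream and can infer that one further mark is expected downstream.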
Various mechanisms have been proposed based on the concept referred to above as “re-ECN”. One such proposed mechanism, which will be explained with reference to FIG. 2 (and which is discussed in further detail in the reference above) consists of taking what can be regarded as a “classic” token bucket policer (which would react to the volume of traffic generated by a user) and adapting this such that it reacts based on the amount of congestion a user creates in the network, rather than simply the volume of traffic the user generates. Such a mechanism is therefore referred to as a “Congestion Policer”, rather than a “Rate Policer”, and will be discussed in the next section.
Further discussion of why congestion policing is believed to be particularly effective in relation to policing the usage of pooled resources is given in the following article: “Policing Freedom to Use the Internet Resource Pool”, Arnaud Jacquet, Bob Briscoe & Toby Moncaster, Workshop on Re-Architecting the Internet (ReArch'08) (December 2008).
6) A Basic Congestion Policer: As illustrated in FIG. 2, token bucket 21 is filled at a constant rate, and emptied in proportion to the contribution of the user's traffic to network congestion. First, when a packet 25 arrives at the policing node, the token reserve r is updated (step s210). This updating involves two factors: the token reserve r is updated by adding tokens in proportion to a predetermined congestion allowance w of the user (step s210a). The token reserve r is also updated by removing tokens (step s210b) according to a function g( ) whose value depends on information retrieved from the packet header, in particular the size s_i and the re-ECN field (which reflects a congestion level p_i). The function g( ) could be defined as:

$$g(\text{packet}) = \begin{cases} s_i & \text{if the re-ECN codepoint signals a mark} \\ 0 & \text{otherwise} \end{cases}$$
Subsequently, the packet may be subjected to a sanction (step s220) according to a relevant policy (indicated by graph 22) with a probability f(r) where the sanction curve f( ) is null so long as the value of the token reserve r remains positive.
Such a mechanism may be used to put an upper bound on the amount of congestion a user can cause.
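The steps s210a, s210b and s220 described above might be sketched as follows. This is a minimal illustration under stated assumptions: the class and field names are hypothetical, the bucket is given a finite depth, tokens are counted in bytes, and the default sanction curve simply sanctions every packet once the reserve is no longer positive.

```python
import random
from dataclasses import dataclass


@dataclass
class Packet:
    size: int      # packet size in bytes
    marked: bool   # True if the re-ECN codepoint signals a congestion mark


def g(packet):
    """Tokens to remove: the packet size if it is marked, otherwise none."""
    return packet.size if packet.marked else 0


def stepped_curve(reserve):
    """Assumed sanction curve f(): null while the reserve is positive."""
    return 0.0 if reserve > 0 else 1.0


class CongestionPolicer:
    def __init__(self, allowance_w, depth, f=stepped_curve, rng=None):
        self.w = allowance_w       # tokens added per second
        self.depth = depth         # bucket capacity (assumption)
        self.r = depth             # token reserve r, starts full
        self.last_seen = 0.0       # arrival time of the previous packet
        self.f = f
        self.rng = rng or random.Random()

    def on_packet(self, packet, now):
        """Update the reserve; return True if the packet is sanctioned."""
        # Step s210a: add tokens in proportion to the allowance w.
        self.r = min(self.depth, self.r + self.w * (now - self.last_seen))
        self.last_seen = now
        # Step s210b: remove tokens according to g().
        self.r -= g(packet)
        # Step s220: sanction with probability f(r).
        return self.rng.random() < self.f(self.r)
```

Unmarked packets never reduce the reserve in this sketch, so a user whose traffic causes no congestion is never sanctioned regardless of the volume sent.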
Congestion Policing
In the light of the explanations given above, it will be understood that a congestion policer may be used to police traffic being sent by a data-providing entity (such as one of the CDN servers 12 in FIG. 1) to a data-receiving entity (such as one of the end-users 16 and 18 in FIG. 1). Such policing would be reasonably simple to implement in a scenario in which a single data-providing entity is providing data via a network to a single data-receiving entity. This could be achieved by locating a suitable policing node either at the access point via which the data-providing entity is connected to the network, or at the access point via which the data-receiving entity is connected to the network. In a more complex scenario in which a single data-providing entity is providing data via a network to more than one data-receiving entity, each having its own access point via which it is connected to the network, it would again be reasonably simple to implement congestion policing in respect of the data-providing entity by locating a suitable policing node at the access point via which the data-providing entity is connected to the network. Such policing would effectively concentrate on the behaviour of the data-providing entity. The function of a suitable policing node based on a token bucket congestion policer in this scenario will be explained below with reference to FIG. 3. Likewise, in a reverse scenario in which a single data-receiving entity is receiving data via a network from more than one data-providing entity, it would again be reasonably simple to implement congestion policing in respect of the data-receiving entity by locating a suitable policing node at the access point via which the data-receiving entity is connected to the network. Such policing would effectively concentrate on the behaviour of the data-receiving entity.
In FIG. 3 a token bucket congestion policer 30 is illustrated. This is shown as policing traffic 32 flowing from a data providing entity 34 (for example a digital media content providing organisation having one or more CDN servers 12 such as those shown in FIG. 1) to one or more data receiving entities 36 (such as customer 16 in FIG. 1, for example) via a network 14. According to the “token bucket” model, tokens are added to the bucket 301 at a constant rate w, but unlike policing using a “classic” token bucket policer (in which tokens are consumed simply in proportion to the volume of traffic passing through the policer), tokens are instead consumed in proportion to the congestion caused or expected to be caused by the traffic passing through the policer. As will be understood, an appropriate measure of the congestion caused or expected to be caused by packets in a flow can be obtained from congestion indications such as ECN or re-ECN marks carried by the packets.
In FIG. 3, the traffic 32 is shown within policer 30 as comprising a number “N” of flows 302 traversing a path across the network via a policing node 303. In abstract terms, that means that the rate at which tokens are consumed from the bucket 301 is $\sum_{i=1}^{N} p_i x_i$, where $x_i$ is the throughput of flow $i$ and $p_i$ is the amount of congestion on its path. In practice this means that every time a packet is forwarded, tokens are consumed in proportion to the amount of congestion declared in the packet. In the case of re-ECN, this may mean a token is consumed every time a packet carrying a re-ECN mark is forwarded.
Alternatively, the nominal token size may be defined as one byte, for example, and the number of tokens removed for forwarding a congestion-marked packet could be in proportion to the size of the packet. At any point in time the amount of tokens left in the bucket represents the outstanding reserve available to the user for future use.
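As a worked illustration of the token consumption rate, the sum of congestion level times throughput over all flows can be computed directly. The flow values, the reserve and the fill rate used here are hypothetical, tokens are taken to be one byte each, and the helper names are assumptions.

```python
def token_drain_rate(flows):
    """Sum of p * x over all flows, given (p, x) pairs."""
    return sum(p * x for p, x in flows)


def seconds_until_empty(reserve, fill_rate_w, drain_rate):
    """Time until the bucket empties, or None if it never does."""
    net = drain_rate - fill_rate_w
    return reserve / net if net > 0 else None


# Hypothetical flows: (congestion level p, throughput x in bytes per second).
flows = [(0.01, 2_000_000), (0.001, 8_000_000)]
drain = token_drain_rate(flows)   # tokens consumed per second
```

In this example the second flow carries four times the volume of the first but consumes fewer tokens, because the path it follows is ten times less congested.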
As is usually the case with the “classic token bucket” model, tokens may be discarded when the bucket is full, and sanctions (such as dropping packets, imposing penalties on users etc.) may start to be applied when the bucket is empty. As will be explained in detail later, the transition to sanctioning behaviour may be progressive, or alternatively it may be stepped-up immediately on a threshold (“empty”) being passed.
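The two styles of transition to sanctioning behaviour can be contrasted in a short sketch. Both functions are illustrative assumptions: the progressive variant is modelled here as a probability that rises linearly as the reserve goes into deficit, and the `ramp` parameter is hypothetical.

```python
def stepped_sanction(reserve):
    """Full sanction as soon as the 'empty' threshold is passed."""
    return 1.0 if reserve <= 0 else 0.0


def progressive_sanction(reserve, ramp=10_000):
    """Sanction probability grows with the size of the token deficit."""
    if reserve > 0:
        return 0.0
    return min(1.0, -reserve / ramp)
```

A progressive curve gives a user some warning through occasional sanctions before every packet is affected, whereas the stepped curve changes behaviour abruptly at the threshold.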
It should be noted that if a customer's usage (in the case of FIG. 3, the usage of data provider 34 as measured at congestion policer 30) stays below an agreed congestion allowance, the policer 30 merely monitors the traffic passively. However, as soon as the congestion rate empties the bucket the policer 30 may take policing action, such as applying a penalty to the traffic, imposing some other sanction, marking traffic (with additional marks in packet headers, for example) or issuing reports in respect of the transgression. By imposing a policing sanction such as dropping some traffic that was received marked with a congestion indication, for example, the policer can keep the customer within the predefined congestion allowance.
As outlined earlier, developments in relation to content provision have resulted in some customers of a network provider being “multi-homed”. This may be because the customer of the network provider is an organisation such as the multi-sited content provider 10 in FIG. 1, or because the customer is an organisation such as the corporate customer 18 in FIG. 1. In either case, mechanisms such as those above would only allow the behaviour of individual users at separate sites to be monitored separately, with a policer operating autonomously in respect of each site. In a distributed network environment, this may be easily abused, or may fail to have the required effect on the behaviour of the customer or its individual users. Proposals to deal with the control of resource allocations in a distributed network environment will be outlined in the next section.
Distributed Resource Allocations
The research proposals outlined below discuss mechanisms to control resource allocations in a distributed network environment.
An article by Raghavan et al (see below) includes a discussion of the problem of distributed rate limiting as a mechanism to control the aggregate bandwidth that a customer is generating in the network. The approach suggested can be seen as a continuous form of admission control where policers placed at the edges of the network admit traffic until the aggregate bandwidth consumed by a customer has reached a certain volume or rate level.
This approach coordinates a set of distributed traffic rate limiters while retaining the behaviour of a centralised limiter. The solution chooses a token bucket as a reference model to monitor the traffic rate at the different distributed locations. The solution assumes that broadcast communication exists among the different limiters. A “gossip protocol” is used to enable a resilient and robust communication framework. At the end of each estimation interval, the limiters exchange updates so that the global demand estimate held at a set of limiters is refreshed.
The mechanism proposed is appropriate for rate limiting a large number of flows across distributed locations. However, it assumes that resources are distributed fairly provided each user is TCP-friendly. Because the mechanism allocates a similar share of bandwidth to each TCP flow, a user that opens a large number of TCP sessions can gain an advantage over another user that is using only one session or a smaller number of sessions.
See: “Cloud Control with Distributed Rate Limiting”, Barath Raghavan, Kashi Vishwanath, Sriram Ramabhadran, Kenneth Yocum & Alex C. Snoeren, UCSD, ACM Sigcomm 2007.
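The advantage gained by opening many TCP sessions can be seen in a short arithmetic sketch. It assumes, as an idealisation, that a limited capacity is divided exactly equally among all active flows; the function name, the capacity figure and the user labels are hypothetical.

```python
def per_user_share(capacity, flows_per_user):
    """Split capacity equally per flow, then total it per user."""
    total_flows = sum(flows_per_user.values())
    return {
        user: capacity * n / total_flows
        for user, n in flows_per_user.items()
    }


# User A opens nine sessions, user B opens one, sharing 100 units.
shares = per_user_share(100, {"A": 9, "B": 1})
```

Under per-flow sharing, user A obtains nine times the bandwidth of user B in this example, even though each individual flow is treated identically.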
Similar mechanisms to this are suggested in European patent application EP1705851 and patent application US2008/008090. Both describe mechanisms to manage a capacity constraint that is shared between different users, and rely on token bucket or leaky bucket mechanisms. However, as with the approach proposed in the Raghavan reference above, the policer doesn't take into consideration the congestion impact of the traffic.