In a conventional centralized cloud environment, all computing is typically executed in a single large centralized data center. In contrast, a distributed cloud comprises a potentially high number of geographically dispersed data centers instead of only one central data center. These geographically dispersed data centers have different capabilities; some of the data centers may be relatively small and be located at the edge of a network comprising the distributed cloud environment, whereas others may be located at the core of the network and have a very high capacity.
Traditionally, Unified Communications (UC) services, such as multiparty audio and video conferencing, have been provided using dedicated server hardware and Digital Signal Processors (DSPs). Today, there is an increasing trend to migrate hardware-based UC solutions to a fully software-based cloud environment. The first step in this migration is to provide software-based UC services in a centralized cloud environment. The next foreseen step is to provide them in a distributed cloud environment.
FIG. 1 illustrates a simple example of media processing in a distributed cloud environment, in the following also referred to as network 1. In the figure, a distributed cloud 2 provides a video conference service for four users A, B, C and D. Media processing is distributed in the cloud 2 in such a way that there are local Media Server (MS) instances 3A, 3B, 3C located close to the users at the edge of the network 1. Further, processing such as audio mixing and switching for the video conference is handled by a Media Server 3 in a large data center at the core of the network 1. Each Media Server instance runs in one Virtual Machine (VM) within a respective data center. A reason for distributing media processing to several virtual machines (i.e. a chain of virtual machines) is that the capacity of a single virtual machine is typically not sufficient for handling the media processing for all the users in a conference. This is particularly true in, for example, a high definition video conference in which users are using different codecs and transcoding is therefore required.
It is beneficial to distribute the media processing to virtual machines in different data centers since latencies can be minimized and responsiveness maximized when media processing occurs as close to the conference participants as possible. Latencies need to be minimized to improve the quality of the service as experienced by the users. An example of such maximized responsiveness is the ability to adapt the video streams being sent towards the user using feedback from a local Radio Access Network (RAN). However, distribution of media processing also results in some challenges as will be described next.
An important challenge introduced when media processing of a multimedia session (e.g. a video conference) is distributed from one media server to a chain of media servers is increased latency. Although latencies are typically short when users behind the same distributed media server instance are communicating with each other, media streams between users at different ends of the media processing chain may experience long delays. This is because, when the media streams from e.g. user A to user B go via multiple media servers (3A, 3, 3B in the example of FIG. 1), the processing done by each individual media server 3A, 3, 3B adds to the end-to-end delay experienced by the Real-time Transport Protocol (RTP) packets carrying the multimedia session. As an example, if the multimedia session goes through a chain of three media servers, the delay introduced by processing on the media servers might, in a worst case scenario, be threefold compared to a scenario wherein a single high-capacity central media server is being used.
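The accumulation of processing delay along a chain of media servers can be illustrated with a minimal sketch. The function name processing_delay_ms and the per-server delay of 40 ms are hypothetical and chosen only for illustration; the path 3A, 3, 3B is the one named for FIG. 1. The sketch assumes that every server on the path adds the same delay, which corresponds to the worst case mentioned above.

```python
def processing_delay_ms(path, per_server_delay_ms):
    """Sum the processing delay contributed by each media server on the path."""
    return sum(per_server_delay_ms[server] for server in path)


# Path of a media stream from user A to user B in FIG. 1:
# edge media server 3A, core media server 3, edge media server 3B.
chain_path = ["3A", "3", "3B"]

# Reference scenario: a single high-capacity central media server.
central_path = ["3"]

# Assumed, illustrative value: every server adds the same processing delay.
assumed_delay_ms = 40.0
per_server = {server: assumed_delay_ms for server in chain_path}

chain_delay = processing_delay_ms(chain_path, per_server)
central_delay = processing_delay_ms(central_path, per_server)
ratio = chain_delay / central_delay

print(f"chain: {chain_delay} ms, central: {central_delay} ms, ratio: {ratio}")
```

Under the equal-delay assumption the ratio equals the number of media servers in the chain, regardless of the per-server value chosen.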
According to ITU Telecommunication Standardization Sector (ITU-T) Recommendation G.114 [ITU-T G.114], in order to keep users satisfied, the one-directional (i.e., mouth-to-ear) media delay between users should be no more than 225 ms. If the delay exceeds 300 ms, some of the users will already start becoming dissatisfied. It is not uncommon for a single software-based media server performing media decoding and encoding to add on the order of 100 ms to the end-to-end delay that RTP packets experience. Thus, the presence of just three media servers that encode and decode the media is already enough to make some of the users dissatisfied. This introduces the need for mechanisms that can keep the delay acceptable even when multiple media servers are involved in the media path.
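The delay budget described above can be worked through as a rough sketch. The thresholds of 225 ms and 300 ms and the figure of about 100 ms per media server come from the text; the function names, the label used for delays between the two thresholds, and the 20 ms allowance for all other delay sources (network transfer, buffering) are assumptions made only for illustration.

```python
SATISFIED_MAX_MS = 225.0      # delay at or below this keeps users satisfied
DISSATISFIED_OVER_MS = 300.0  # delay above this makes some users dissatisfied
PER_SERVER_MS = 100.0         # order-of-magnitude delay added by one media server


def one_way_delay_ms(num_servers, other_delay_ms):
    """Estimate mouth-to-ear delay for a media path with num_servers servers."""
    return num_servers * PER_SERVER_MS + other_delay_ms


def rate_delay(delay_ms):
    """Classify a one-way delay against the two thresholds."""
    if delay_ms <= SATISFIED_MAX_MS:
        return "satisfied"
    if delay_ms > DISSATISFIED_OVER_MS:
        return "some users dissatisfied"
    # The text does not characterize this range; the label is an assumption.
    return "between thresholds"


# Assumed allowance for everything except media server processing.
assumed_other_delay_ms = 20.0

ratings = {
    n: rate_delay(one_way_delay_ms(n, assumed_other_delay_ms))
    for n in (1, 2, 3)
}

for n, rating in ratings.items():
    print(n, one_way_delay_ms(n, assumed_other_delay_ms), rating)
```

With three media servers, processing alone already amounts to 300 ms, so any additional delay on the path pushes the total past the upper threshold.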