1. Field of the Invention
Embodiments of the invention relate generally to the field of electronic data transmission. More particularly, an embodiment of the invention relates to a buffer and a clock in a packet-based network, and methods of buffering incoming data and synchronizing clocks in such networks.
2. Discussion of the Related Art
With the advent of Internet Protocol (“IP”), packet-based transmission and routing schemes are becoming ever more popular. It is well accepted that Next Generation Networks (“NGN”s) will be built upon these principles. However, several services, such as real-time voice and voice-band communication, that are well suited for circuit-switched (“TDM”) transmission and switching, have to be supported by this new architecture. VoIP (“voice over IP”) is one such example. The underlying premise of VoIP is that speech, after conversion from analog to digital format, can be packetized. Several protocols, such as RTP and RTCP (see Ref. [1,2]), have been developed to support the ability of IP networks to provide such real-time services.
One of the premises of NGNs is that the Quality of Experience (QoE) should be at least as good as, or even better than, that provided by the legacy circuit-switched network or PSTN (Public Switched Telephone Network). It is clear that delay is an important parameter in determining the QoE. It is well known that one-way delays that are very large (of the order of 400 ms or larger) are extremely detrimental from the view of subjective quality, making regular full-duplex conversation difficult. At lower one-way delays, the impact of echo is important. The Quality of Experience, for a given level of Echo Return Loss (ERL), drops rapidly with increasing delay.
The overall delay has four principal components. The process of packetization involves buffering information to fill the packet payload and thus introduces delay. The encoding and decoding algorithms, especially in the case of source codecs, require buffering as well. These two delays are often known quantities. The third component is the delay through the network. This delay is difficult to predict a priori since it depends on the physical distance, the number of intermediate packet switches involved in the end-to-end transport of a packet, and the bandwidth of the links between switches (routers). However, for two given end-points there is, in principle, a minimal network delay corresponding to the transit time of the fastest possible packet transmission. Considering that in a pure IP network the transmission path could be different for different packets, and the queuing delay in intermediate nodes is a function of congestion, the delay experienced by packets will be variable, ranging from the minimal delay to infinity (a packet lost in the network is construed as an instance of infinite delay). Obviously, some maximum delay threshold must be determined, and packets with delay greater than this maximum are discarded. Received packets are stored in a buffer whose size corresponds to the difference between minimum and maximum delays and so, practically speaking, fast packets are delayed so that the packets can be decoded and converted back to analog signals in a smooth fashion. The notion of play-out, or dejittering, whereby some delay is introduced via a jitter buffer, constitutes the fourth delay component. Clearly, in order to maximize the subjective quality of the call, the play-out buffer, also referred to as the jitter buffer, should be as small as possible.
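The relationship between the delay spread and the play-out buffer described above can be illustrated with a minimal sketch. The function name, the sample delay values, and the 60 ms discard threshold are hypothetical and chosen only for illustration; a lost packet is represented as an infinite delay, as in the text.

```python
import math

def playout_plan(delays_ms, max_delay_ms):
    """Size a jitter buffer and compute per-packet hold times.

    delays_ms: one-way network delay of each packet (math.inf marks a lost packet).
    max_delay_ms: assumed threshold beyond which a packet is discarded.
    """
    accepted = [d for d in delays_ms if d <= max_delay_ms]
    if not accepted:
        raise ValueError("no packet arrived within the delay threshold")
    discarded = len(delays_ms) - len(accepted)
    # Buffer depth corresponds to the spread between minimum and maximum delay.
    buffer_ms = max_delay_ms - min(accepted)
    # Fast packets are held longer so that every packet plays out at the same total delay.
    hold_ms = [max_delay_ms - d for d in accepted]
    return buffer_ms, hold_ms, discarded

# Hypothetical delays: four usable packets, one lost, one too late.
delays = [20.0, 35.0, 22.0, math.inf, 58.0, 75.0]
print(playout_plan(delays, max_delay_ms=60.0))
```

Lowering the threshold shrinks the buffer, and hence the added delay, at the cost of discarding more packets.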
For specificity, consider the situation where a DS1 (1.544 Mbps) is carried over a packet network as depicted in FIG. 1a. The scenario involves two end-user locations with legacy DS1 (T1) terminations and the intent is to provide a private-line connection. In today's (yesterday's) network the DS1 is transported across the network as a bearer channel embedded in a higher-rate assembly such as a DS3 or SONET signal in a “circuit-switched” arrangement. The challenge then is to replace the circuit-switched transport network with a packet-switched network in a manner that is transparent to the end-user. This is achieved by placing an inter-working-function (IWF) at the circuit-packet boundaries. For simplicity FIG. 1a shows one direction of transmission. The “T-IWF” 102a receives the incoming serial data signal from the end-user terminal 101a as a conventional DS1 signal and assembles the bits into packets for delivery across the packet cloud 103a. The “R-IWF” 104a receives the packets and recreates the serial data signal for delivery to the end-user terminal 105a over a conventional T1 (DS1) facility. We assume, again for simplicity, that the bit-stream must be delivered intact and the network does not attempt to extract any framing or channelization information or features such as “flags” or “cells” or “packets” in the data stream. Interfacing with legacy terminal equipment implies that existing standards, such as [1,2], must be adhered to.
The primary functions of the IWF devices are, first, to reassemble the recovered serial bit-stream into octets; second, to assemble these octets into packets where each packet contains N octets of information and launch these packets over the network; third, to receive packets from the network and reassemble the bit-stream; and fourth, to transmit the bit-stream to the end-user equipment utilizing an appropriate clock. Since the delay through the network is not constant, there will be time-delay variation (TDV), and the IWF requires an adequate “elastic” buffer to store received packets and absorb this TDV. Current approaches fail to adequately synchronize the clock for the fourth function.
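The first three IWF functions can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, the bit-stream is modeled as a Python list of 0/1 values, and the choice of N = 193 octets in the usage note is an assumption picked because it corresponds to 1 ms of DS1 payload. The fourth function, clocking the bit-stream out, is not modeled here.

```python
DS1_BIT_RATE = 1_544_000  # bits per second (DS1)

def packetize(bits, octets_per_packet):
    """Group bits into octets and octets into N-octet packets.

    Returns the complete packets plus any leftover bits that do not yet
    fill a packet (these would wait for more incoming data).
    """
    packet_bits = 8 * octets_per_packet
    n_packets = len(bits) // packet_bits
    packets = []
    for p in range(n_packets):
        chunk = bits[p * packet_bits:(p + 1) * packet_bits]
        # Most significant bit first within each octet (an assumed convention).
        octets = bytes(
            sum(bit << (7 - k) for k, bit in enumerate(chunk[i:i + 8]))
            for i in range(0, packet_bits, 8)
        )
        packets.append(octets)
    return packets, bits[n_packets * packet_bits:]

def depacketize(packets):
    """Recreate the serial bit-stream from received packets."""
    bits = []
    for packet in packets:
        for octet in packet:
            bits.extend((octet >> shift) & 1 for shift in range(7, -1, -1))
    return bits

def packet_interval_s(octets_per_packet, bit_rate=DS1_BIT_RATE):
    """Nominal time between packet launches for a constant-bit-rate source."""
    return 8 * octets_per_packet / bit_rate
```

With N = 193, `packet_interval_s(193)` gives a nominal launch interval of 1 ms; the elastic buffer at the receiving IWF must then absorb however many such intervals of TDV the network introduces.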
Strictly speaking, the term synchronization applies to alignment of time and the term syntonization applies to alignment of frequency, but in the telecommunication environment we often use the term synchronization to refer to either time-alignment, or frequency-alignment, or both. It is generally clear from the context which meaning is appropriate. All real-time communication carried over a digital network requires synchronization to some degree. This can be illustrated by considering the example of delivering a real-time voice signal between two geographically disparate points across a network.
The situation is depicted in FIG. 1b, which shows a conventional VoIP network. The analog source is converted into digital format by an analog-to-digital converter (ADC or A/D) 101b operating at a sampling clock rate of nominally 8 kHz. Each sample is, conventionally, quantized to 8 bits so that the digital stream carrying the voice information is 8 kilo-octets-per-second or 64 kbps (see ITU-T Rec. G.711, Ref. [3], and Ref. [4]). This is regarded as a DS0 and represents “uncompressed” voice. In a conventional circuit-switched or TDM (Time Division Multiplexed) architecture, this DS0 is delivered “as is” to the destination for conversion back to analog format. In a packet-switched environment, exemplified by Voice-over-IP (VoIP), the DS0 is, possibly, compressed and organized into packets (102b). These packets are delivered to the destination where the expansion (103b) to DS0 format is performed prior to conversion back to analog (104b). Whereas the schemes described here are applicable regardless of the word-length employed for A/D conversion or D/A conversion, we shall henceforth assume here that these are done with a word-length of 8 bits (1 octet) (representative of μ-law and A-law formats provided in ITU-T Recommendation G.711) for specificity.
It is important to recognize that at each end the digital-to-analog converter (DAC or D/A) and analog-to-digital converter (ADC or A/D) are usually in the same integrated circuit chip and thus the same clock is used for both functions at any one end. In the event that the (digital) signal processing includes echo cancellation, it is mandatory that the same clock be used for both functions else the echo canceller will exhibit sub-par performance and there will be instances of echo leakage and other phenomena that negatively impact the quality of experience. In FIG. 1b we show a single direction of transmission solely for convenience in representation and explanation.
The rate at which packets are generated (in the encoder) is determined by the A/D clock, shown as fA in FIG. 1b. In most VoIP schemes, one packet is generated for every 160 samples from the A/D converter. That is, using the conventional sampling rate of 8 kHz (nominal), each packet represents 20 ms (ms=millisecond) of speech (there are variants that use block sizes other than 20 ms, such as 10 ms, 30 ms, etc.). The nominal word-length associated with each sample is 8 bits, following G.711 (see Ref. [3]), so the “uncompressed” signal represents a bit-rate of 64 kbps (or DS0). Compression algorithms are employed to reduce the effective bit-rate. For example, ADPCM (adaptive differential pulse code modulation) following ITU-T Recommendation G.726 (see Ref. [5]) reduces the word-length associated with each sample to 4, effectively reducing the data rate to 32 kbps. ITU-T Recommendation G.727 (see Ref. [5]) describes methods for reducing the bits/sample from 8 down to 5, 4, 3, or 2, corresponding to bit-rates of 40, 32, 24, and 16 kbps, respectively. More sophisticated schemes, such as those described in ITU-T Recommendation G.723 and G.729 (see Ref. [5]), are even more effective in reducing the bit-rate. The notion of a “20-msec-packet” is the collection of information produced by the coder that permits the decoder at the far end to synthesize a 20-msec block of speech. Depending on the coding algorithm it is possible that information from previous packets is necessary as well. At the receiving end the decoder recreates the appropriate digital signal (DS0) for conversion back into analog format. The D/A clock is shown as fD in FIG. 1b.
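The arithmetic behind these figures is simple enough to check directly. The helper names below are hypothetical; the sketch assumes the nominal 8 kHz sampling rate and a fixed number of bits per sample, which holds for G.711 and the ADPCM variants mentioned but not for the more sophisticated coders.

```python
SAMPLE_RATE_HZ = 8000  # nominal A/D sampling rate

def bit_rate_kbps(bits_per_sample, sample_rate_hz=SAMPLE_RATE_HZ):
    """Bit-rate of a fixed word-length stream, in kbps."""
    return bits_per_sample * sample_rate_hz / 1000

def packet_duration_ms(samples_per_packet, sample_rate_hz=SAMPLE_RATE_HZ):
    """Duration of speech represented by one packet, in milliseconds."""
    return 1000 * samples_per_packet / sample_rate_hz

def payload_octets(samples_per_packet, bits_per_sample):
    """Payload size of one packet, in octets."""
    return samples_per_packet * bits_per_sample // 8

print(bit_rate_kbps(8))                           # uncompressed DS0
print([bit_rate_kbps(b) for b in (5, 4, 3, 2)])   # ADPCM word-lengths
print(packet_duration_ms(160), payload_octets(160, 8))
```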
It is immediately obvious that if the frequencies of the A/D clock (fA) and the D/A clock (fD) are not equal, then slips will occur. The notion of a slip is simple. If fA>fD then the DAC will experience a surfeit of samples; if fA<fD then the DAC will experience a shortage of samples. Rate-adaptation then requires that samples be deleted or inserted. In the circuit-switched architecture of the legacy PSTN, every transmission boundary element is required to extract DS0s from an incoming digital signal (typically a DS1) and reinsert the information into an outgoing digital signal (typically a DS1) that may, potentially, have a different time-base. Therefore slip buffers are very common. To minimize the occurrence of slips, the circuit-switched network is well synchronized and this approach to network synchronization has the derivative benefit that the clock offset between the end points is minimized. In an NGN, where asynchronous transport is employed, there is no guarantee that the clock offset between the end points is negligible.
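A minimal sketch of this surfeit/shortage bookkeeping follows. It assumes both clocks are expressed as integer parts-per-million offsets from the nominal 8 kHz rate and that every surplus or missing sample is resolved by one deletion or insertion; the function name and parameters are hypothetical.

```python
NOMINAL_HZ = 8000  # nominal sampling rate

def count_slips(ppm_a, ppm_d, duration_s):
    """Return (deleted, inserted) sample counts at the DAC over duration_s.

    ppm_a, ppm_d: integer offsets of the A/D and D/A clocks from nominal, in ppm.
    Integer arithmetic is used so the sample counts are exact.
    """
    produced = NOMINAL_HZ * (1_000_000 + ppm_a) * duration_s // 1_000_000
    consumed = NOMINAL_HZ * (1_000_000 + ppm_d) * duration_s // 1_000_000
    surplus = produced - consumed
    deleted = max(surplus, 0)    # fA > fD: surfeit of samples
    inserted = max(-surplus, 0)  # fA < fD: shortage of samples
    return deleted, inserted

# A/D clock 50 ppm fast relative to the D/A clock, observed for 10 seconds.
print(count_slips(50, 0, 10))
```

In this example the DAC must delete four samples in ten seconds, that is, one slip every 2.5 seconds.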
This phenomenon is not necessarily catastrophic, but the DAC has to either insert or delete a sample to account for the difference in sampling rates. This insertion or deletion of a block of information, such as a sample, is referred to as a slip. Note that a slip is the result of the difference in sampling rates and is independent of the word length associated with the quantization and compression. The degradation of perceptual quality caused by slips is in addition to any degradation caused by other factors. In conventional circuit-switched telephony, the unit of information inserted or deleted is one sample (or octet). Considering the nominal sampling rate is 8 kHz (one sample every 125 μs), a slip occurs when the accumulated phase difference, expressed in time units, caused by the aforementioned frequency difference, crosses 125 μs. In a packetized scenario, the unit could be as large as a block of speech, typically of duration 20 ms, and thus slips would have an impact similar to packet loss. Note that 20-ms slips occur much less frequently than 125-μs slips but have a greater impact each time they occur. The thrust of the current invention is to get the benefits of single-octet (single-sample) slips in a packet environment.
A similar effect will be observed in real-time video. A typical block size used in video compression is 8×8. Assuming a “standard” sampling arrangement comprising 352 pixels per line, 240 lines per frame, and 30 frames per second, the duration of a block is 25.25 μs. When the accumulated phase difference between the A/D clock and D/A clock crosses 25.25 μs, a slip occurs. The current invention does not specifically apply to video, but video is a good example of real-time communications and is included to show the importance of having minimal frequency offsets between the end-points.
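The block duration quoted above follows from the pixel rate. The sketch below is a simple check under the assumption that pixels are sampled at a uniform rate with no blanking intervals; the function name is hypothetical.

```python
def block_duration_us(pixels_per_line, lines_per_frame, frames_per_s,
                      block_width=8, block_height=8):
    """Time occupied by one block of pixels, in microseconds."""
    pixel_rate = pixels_per_line * lines_per_frame * frames_per_s  # pixels per second
    return 1e6 * block_width * block_height / pixel_rate

print(round(block_duration_us(352, 240, 30), 2))
```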
In the following table we provide the slip rate assuming that the D/A conversion clock uses a free-running oscillator and that the A/D clock is accurate (relative to a Primary Reference Source). Also provided is the typical technology used for that accuracy and a budgetary estimate (order of magnitude) of the cost of the oscillator. The last three columns provide an approximate time between slip occurrences for different block sizes. In generating this table it was assumed that the transmission link between the A/D and D/A is equivalent to a “null” link that adds no impairments such as excessive time-delay variation or transmission errors. The intent is to lay the baseline for the minimum impairment that is introduced by the lack of synchronization between the end-points.
TABLE 1. Relationship between frequency offset and interval between buffer overflow/underflow events

Accuracy | Technology | Cost | 125-μsec slip | 20-msec slip | 25.25-μsec slip
1 × 10^-10 | Rubidium | ~$1000 | 1.25 × 10^6 sec (14.5 days) | 2 × 10^8 sec (6.4 years) | 0.25 × 10^6 sec (0.3 days)
50 × 10^-9 (50 ppb) | Hi-Quality OCXO | ~$500 | 25 × 10^3 sec (41.7 min) | 4 × 10^5 sec (4.6 days) | 0.5 × 10^3 sec (8 min)
5 × 10^-6 (5 ppm) | OCXO | ~$50 | 25 sec | 4 × 10^3 sec (66.7 min) | 5 sec
50 × 10^-6 (50 ppm) | TCXO | ~$10 | 2.5 sec | 20 sec | 0.5 sec
1 × 10^-3 (0.1%) | XO | ~$1 | 0.125 sec (8 per sec) | 1 sec | 0.025 sec (40 per sec)
1 × 10^-2 (1%) | XO | ~$0.1 | 12.5 msec (80 per sec) | 0.1 sec | 2.5 msec (400 per sec)
The perceptual degradation in quality caused by slips is very subjective. The impact of an isolated slip in conventional telephony using uncompressed signals (G.711) is typically a “click” that could well be imperceptible, especially if it occurs during a silent interval. However, the perceived quality degrades rapidly as the slip-rate increases. The various digital switches in the PSTN are all provided a PRS (Primary Reference Source) traceable reference and thus have an absolute accuracy of better than 1×10^-11. A call traversing two distinct timing domains may experience slips corresponding to a worst-case frequency difference of 2×10^-11. Considering that this equates to one slip every 72 days, we can, for all practical purposes, ignore the phenomenon of slips in the traditional circuit-switched network. In VoIP applications, the end points are quite cost sensitive and therefore it is likely that the quality of oscillator deployed will be represented by one of the last three rows of Table 1 and clearly slips may play an important role in determining the quality of experience (or lack thereof).
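Under the “null” link assumption, the time between slips is the block duration divided by the fractional frequency offset. The sketch below, with a hypothetical function name, applies that relation to the rubidium row of Table 1 and to the worst-case PSTN offset quoted above.

```python
SECONDS_PER_DAY = 86_400

def slip_interval_s(block_duration_s, fractional_offset):
    """Time for the accumulated phase difference to reach one block duration."""
    return block_duration_s / fractional_offset

# Rubidium-class accuracy (1e-10), single-sample (125 microsecond) slips.
print(slip_interval_s(125e-6, 1e-10) / SECONDS_PER_DAY)  # roughly 14.5 days
# Two PRS-traceable timing domains, worst-case difference of 2e-11.
print(slip_interval_s(125e-6, 2e-11) / SECONDS_PER_DAY)  # roughly 72 days
```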
Most studies for evaluating the perceptual quality of compressed voice are done in a controlled environment and consider only a single compression/expansion. Additional study is required to assess the impact of tandem connections wherein there may be multiple conversions of format. Furthermore, the impact of an isolated slip may have a different perceptual effect on synthetic speech, such as that inherent in CELP (Code Excited Linear Prediction) methods for compression, such as G.729 (see Ref. [5]). However, it is quite well accepted that the controlled slip method, where one sample (octet) is deleted/inserted in an “uncompressed” stream, works very well provided that slips do not manifest themselves too often.
It is obvious that if the size of the buffer is large, then the relative frequency of occurrence of buffer overflow/underflow events will be small. However, large buffers imply the introduction of delay and the decrease in quality of experience. Nevertheless, even with large buffers deployed to mitigate the occurrence of buffer overflow/underflow, there are other impairments that arise because of a difference in clock between the end-points. These include the pitch modification effect and wow and flutter. These are not adequately addressed by present technology.
Delivery of constant-bit-rate services, such as DS1, over a packet network mandates that proper care be taken to ensure both bit-integrity and bit-time-integrity. The principles of clocking in circuit emulation applications are provided generically in ITU-T Recommendation Y.1413 in the form of four “architecture” options. In architectures #1 and #2, it is assumed that PRS-traceable clocks are available at the appropriate boundaries and the service clocks are derived therefrom, and therefore the packet network is relieved of the responsibility of delivering timing information across the network.
Architecture #4 is a technique referred to as adaptive clock recovery. A theoretical analysis of adaptive clock recovery is provided to indicate the performance limitations of this technique. The conclusion is that adaptive clock recovery should not be used as the primary clock transfer mechanism unless there is no alternative available. However, the method has merit when used as an adjunct to architecture #2 or architecture #3.
Architecture #3 is the collection of methods that can be generically referred to as “encoding methods”. It is assumed that a PRS-traceable clock is available at the ingress and egress inter-working functions where the “circuit-to-packet” and “packet-to-circuit” conversions take place. Information based on the behavior of the service clock relative to this “common” clock is encoded as a message at the ingress IWF and sent across the network to the egress IWF. The egress IWF can regenerate a “replica” of the service clock using this information and the “common” clock available. One example of encoding methods is SRTS (Synchronous Residual Time Stamp) as described in U.S. Pat. No. 5,260,978 (see Ref. [15]). In U.S. Pat. No. 6,111,878 (see Ref. [16]) a method for utilizing adaptive clock recovery as an adjunct to SRTS is described.
Whereas ITU-T Recommendation Y.1413 (see Ref. [14]) covers various aspects of circuit emulation, the intent here is to summarize the requirements related to synchronization and clocking. In particular, four strategies or “architectures” for delivering service clock are presented in Y.1413. These are described here.
Architecture #1: Service Clock Generated by Terminal Equipment
There are situations where clock information does not have to traverse the network. For example, as pointed out in Y.1413, the terminal (i.e. end-user) equipment may have access to “equivalent” clocks (time-base) at both ends. In this scenario, the IWF loop-times, utilizing the recovered clock from the incoming DS1 to generate its transmit clock for the return DS1 signal. Essentially, the network is relieved of the responsibility to carry clocking information over the packet cloud. The end-user clocks do not have to satisfy any stringent frequency accuracy criteria other than they must be equal at the end-points. The mechanism for achieving such equivalent clocks is not specified in Y.1413. FIG. 2a depicts Architecture #1. In this configuration the TDM network elements 201a are assumed to have independent sources of timing 202a that are coordinated such that the TDM clocks are synchronized (or plesiochronous). As shown in FIG. 2a, the most effective way to achieve this is to have G.811-traceable timing references available for the TDM network elements. The Inter-working Function (IWF) generates its TDM transmit clock from its incoming (receive) signal. That is, the IWFs “loop-time”. Note that the packet network is relieved of the responsibility of transporting timing information across the network between the two IWFs. This is one of the recommended methods for providing circuit emulation service across a packet network. The size of the jitter buffer in the IWF must be commensurate with the expected time-delay-variation across the network to avoid data loss.
This architecture is appropriate when the packet network is interspersed between two TDM networks that are known to have good timing. If the end-user terminal is essentially customer-premises equipment, such as a PBX or T1 multiplexer, it is highly unlikely that PRS-traceability is available to the terminal equipment other than via the TDM link into the network.
Architecture #2: Service Clock Generated by Network
Another situation considered in Y.1413 where clock information does not have to traverse the network is similar to the one described earlier but has some subtle differences. This is when both the IWF devices have access to “equivalent” clocks (time-base) at both ends, generally a network clock traceable to a stratum-1 source. In this scenario, the “equivalence” is achieved by making both clocks accurate, typically to 1 part in 10^11. In this scenario, the end-user equipment operates in a loop-time mode, utilizing the recovered clock from the incoming DS1 to generate its transmit clock for the return DS1 signal. The IWF uses the network clock for its outbound DS1. Here too, the network is relieved of the responsibility to carry clocking information over the packet cloud. This scenario is most appropriate when the end-user equipment is relying on the network for a time-base reference and is analogous to legacy schemes where the network end-points were devices, such as 1/0 digital cross-connects, that use a network timing reference for all transmit DS1s. There are other advantages of having an accurate, stable, reference at the IWF devices. It has been postulated that TDV across the network is minimized when the end-points have good synchronization. A low TDV allows a good compromise (trade-off) between latency and packet loss.
In this configuration the TDM network elements 201b are provided sources of timing 202b that are coordinated such that the TDM clocks are synchronized (or plesiochronous). As shown in FIG. 2b, the most effective way to achieve this is to have G.811-traceable timing references available for the IWF elements. The Inter-working Function (IWF) generates its TDM transmit clock from its local clock that is locked to a network timing reference. That is, the IWFs insert timing in a manner consistent with ITU-T Recommendation G.703 (the “centralized clock interface”). Note that the packet network is relieved of the responsibility of transporting timing information across the network between the two IWFs. This is the most highly recommended method for providing circuit emulation service across a packet network. The size of the jitter buffer in the IWF must be commensurate with the expected time-delay-variation across the network to avoid data loss.
Architecture #3: Encoded Methods
Then there are situations where the service clock is independent of the network clock. In these situations there is no alternative but to transfer the service clock over the packet infrastructure. However, even in this situation it is advantageous to have a network reference available at the IWF devices. An encoded version of the service clock, most easily visualized as the difference between the service clock and the network clock, at the T-IWF is transferred across the network as part of the information, allowing the R-IWF to recreate the service clock at the destination packet-circuit boundary. One example of this is the Synchronous Residual Time Stamp (SRTS) method suggested for ATM networks and described in [4].
The notion of SRTS that has been standardized as one means for transporting service clock over an ATM network (as in ATM Adaptation Layer 1 or AAL1) may well be extended to general packet networks as well. Encoding methods, such as SRTS, are considered for Architecture #3 in ITU-T Recommendation Y.1413 and shown in FIG. 2c. The principle of encoded methods is to transport a measure of the difference in service clock 202c and network reference, as established in the transmitting IWF 201c, across the network (as a message 203c appended to a packet or as part of the packet itself). The receiving IWF 204c can reconstruct the service clock using this measure of frequency difference in conjunction with its own network timing reference. Note that this mandates that both IWFs have a “common” timing reference, most advantageously obtained by providing each IWF with a G.811-traceable timing reference. The difference in service clock between the ingress and egress points will be directly related to the difference in network timing references at the two IWFs.
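The principle of encoded methods can be illustrated with a simplified sketch. This is not the SRTS algorithm itself: the ingress simply counts reference-clock cycles over a fixed number of service-clock cycles and sends the whole count, and the egress inverts the relation using its own reference. The function names, the 2.43 MHz reference, the measurement window, and the 20 ppm service-clock offset are all assumptions made for illustration.

```python
import math
from fractions import Fraction

def encode(service_hz, reference_hz, service_cycles):
    """Ingress IWF: count reference-clock cycles during service_cycles service-clock cycles."""
    window_s = Fraction(service_cycles) / Fraction(service_hz)
    return math.floor(window_s * Fraction(reference_hz))

def decode(reference_count, local_reference_hz, service_cycles):
    """Egress IWF: rebuild the service-clock frequency from the count and its own reference."""
    return Fraction(service_cycles) * Fraction(local_reference_hz) / reference_count

SERVICE_HZ = Fraction(154403088, 100)  # DS1 service clock, 20 ppm above nominal (assumed)
REFERENCE_HZ = 2_430_000               # assumed common network-derived reference
WINDOW = 1_544_000                     # service-clock cycles per measurement (about 1 s)

count = encode(SERVICE_HZ, REFERENCE_HZ, WINDOW)
recovered = decode(count, REFERENCE_HZ, WINDOW)
print(float(recovered), float((recovered - SERVICE_HZ) / SERVICE_HZ))

# If the egress reference is 1 ppm high, the recovered clock is 1 ppm high as well.
skewed = decode(count, Fraction(243000243, 100), WINDOW)
print(float(skewed / recovered))
```

The second result shows why both IWFs need a “common” reference: any offset between the two network references appears one-for-one in the regenerated service clock.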
Architecture #4: Adaptive Clock Recovery
The fourth architecture described in ITU-T Y.1413 is the use of Adaptive Clock Recovery (or ACR). This is a best effort method and is unsuitable for transporting a network quality timing reference across the packet network. In Y.1413 adaptive clock recovery (ACR) is allowed for situations where there is no alternative but to transfer the service clock over the packet infrastructure. ACR is depicted in FIG. 2d, below. As shown in FIG. 2d, the service clock 202d for the transmit-out of the IWF on the right hand side is generated by adaptive clock recovery 203d. Just for simplicity, it is assumed that the TDM network element 201d on the left hand side is the “master” for the service clock.
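Y.1413 does not prescribe a particular ACR algorithm. One common approach, sketched below purely as an assumption, steers the local transmit clock from the fill level of the jitter buffer: the buffer fills when the local clock is slow and drains when it is fast. The sketch uses a proportional controller with hypothetical gain, step size, and target fill, and it deliberately omits time-delay variation, which is the factor that limits ACR performance in practice.

```python
def adaptive_clock_recovery(source_hz, initial_local_hz, steps,
                            dt_s=0.1, target_fill_bits=1000.0, gain_hz_per_bit=1.0):
    """Steer a local clock toward the source rate using jitter-buffer fill level.

    Returns the final local clock frequency and the final buffer fill (in bits).
    """
    fill = target_fill_bits
    local_hz = initial_local_hz
    for _ in range(steps):
        # Bits arrive at the source rate and are clocked out at the local rate.
        fill += (source_hz - local_hz) * dt_s
        # Proportional control: run faster when the buffer is above target.
        local_hz = initial_local_hz + gain_hz_per_bit * (fill - target_fill_bits)
    return local_hz, fill

recovered_hz, final_fill = adaptive_clock_recovery(1_544_030.88, 1_544_000.0, steps=400)
print(recovered_hz, final_fill)
```

In this idealized setting the local clock converges to the source rate, with the buffer settling slightly above its target. With real packet arrivals, the fill level also reflects network TDV, so the recovered clock inherits noise from the network.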
With the exception of architecture #1, where the clock information does not have to traverse the network, offsets in the recovered clock at the receiving end will ensue. Adjustment of these offsets will require some kind of adaptive clock control.
Heretofore, the requirements of an adaptive play-out buffer and adaptive clock control referred to above have not been fully met. What is needed is a solution that solves all of these problems.