Early voice communications were based on the transmission of analog signals over comparatively short distances. However, the telephone soon became an indispensable part of modern life, for both personal and commercial use. As the amount of voice traffic grew and long distance connections became more common, it became necessary to adopt a fundamentally different method of transmitting voice signals. The reason for this is simple. A communications network in which every 2-way conversation is allotted its own line works well enough for a small number of users, separated by short distances. But if the number of users increases by a factor of 10, the telephone company must install 10 times as much wire into the network. And, if many of these conversations occur between users at remote locations, the amount of wire can become very large. In fact, the material and labor demanded quickly become prohibitive.
A simple example illustrating the technique of time division multiplexing (TDM) is presented in FIG. 1. In this example, four different voice signals from sources A-D are to be transmitted across a single wire to a remote destination. In the first stage 24 of this process, the voice signals are digitized by analog-to-digital (A/D) converters 10A-D. In other words, each of the continuous signals A-D is periodically sampled and represented by a binary number denoting the approximate voltage of the sample. In FIG. 1, the samples for waveform A are represented by solid circles, while those for waveforms B, C and D are represented by hollow circles, hollow squares and solid squares, respectively. The individual samples in each sequence may be denoted by the letter associated with the source, with a subscript for the sample number. For example, the samples in the sequence derived from source B would be denoted B0, B1 . . . Bn.
The resulting sample sequences 26 must contain sufficient information to reconstruct the original waveforms at the destination. According to the Nyquist Theorem, this requires that each waveform be sampled at a rate greater than twice the highest frequency present in the waveform. For example, a signal containing frequencies of up to 1 KHz must be sampled at a rate greater than 2 KHz, to permit the signal to be reconstructed from its discrete samples. In the case of standard voice communications, signals are assumed to be band-limited to about 3 KHz, so a sampling rate of 8 KHz is often used. This implies that the sample interval (i.e., the time interval between any two adjacent samples) in the sequences 26 can be 125 μs.
A multiplexer 12 combines the four sample sequences 26 into a multiplexed sequence 28. Two characteristics of this multiplexed sequence are particularly noteworthy. First, the original four sample sequences are interleaved to create the multiplexed sequence. Thus, the sample order in the multiplexed sequence is:
A0, B0, C0, D0, A1, B1, C1, D1, . . . An, Bn, Cn, Dn
Note that this preserves the original order of the samples. Second, the effective sample rate in the multiplexed sequence is four times that of the original sequences. Within each 125 μs sample interval, the multiplexer 12 must collect a new sample from each of the four sources and transmit all four samples. Consequently, the samples in the multiplexed sequence 28 can be separated by 31.25 μs, for an effective sample rate of 32 KHz.
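The interleaving performed by the multiplexer, and its reversal at the destination, can be illustrated with a minimal Python sketch. The function names multiplex and demultiplex are hypothetical and chosen only for illustration; the samples are symbolic labels rather than digitized voltages.

```python
def multiplex(sequences):
    """Interleave equal-length sample sequences, one sample per source per interval."""
    return [sample for frame in zip(*sequences) for sample in frame]


def demultiplex(stream, num_sources):
    """Recover each source's sequence by taking every num_sources-th sample."""
    return [stream[i::num_sources] for i in range(num_sources)]


# Symbolic samples A0, A1, ... for the four sources A-D.
sources = [[f"{name}{n}" for n in range(3)] for name in "ABCD"]
muxed = multiplex(sources)
print(muxed)

# Timing for the four-source example: 8 KHz per source.
sample_interval_us = 1e6 / 8000                       # 125 us between samples of one source
slot_spacing_us = sample_interval_us / len(sources)   # spacing in the multiplexed sequence
print(sample_interval_us, slot_spacing_us)
```

Because the order of sources within each interval is fixed, the de-multiplexer needs no per-sample labels to separate the channels; position alone identifies the source.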
The multiplexed sample sequence 28 is typically buffered by a high-speed amplifier, which drives the impedance of the wire, cable, transmission line 16, etc. used to convey the sequence to the desired remote destination. At the destination, another amplifier receives the signal from the transmission line 16 and conditions (filtering, glitch suppression, etc.) it before presenting it to the input of a de-multiplexer 20. The de-multiplexer 20 reverses the operations performed by multiplexer 12, to extract the original four sample sequences 26 from the multiplexed sequence 28. Each of the resulting sample sequences may then be acted upon by a digital-to-analog (D/A) converter 22A-D to reconstruct the respective voice signals 30.
In the preceding example, only four signals were multiplexed. However, the TDM principle can clearly be extended to transmit greater numbers of voice signals over a single line. In fact, the upper limit on the number of voice channels that can be carried is related to the amount of available bandwidth, commonly stated in terms of the maximum bits per second (bps) sustainable by the hardware. Along with the number of signal sources (or, channels) and the sample rate, the bandwidth required for a TDM transmission depends on the number of bits per sample. For voice communications, signals are usually digitized to 8 bits. Thus, the bandwidth required can be expressed as:
bandwidth (bps) = no. of channels × no. of bits per sample × sample rate
The original T-carrier system developed in the 1960's allows for 24 voice channels to be multiplexed onto a single line, using the techniques described above. If each channel is sampled with 8-bit resolution at a rate of 8 KHz, the TDM bandwidth required is:
24 × 8 × 8000 = 1.536 Mbps
The original T1 standard defines a data structure known as a D4 frame for the transport of TDM data. A D4 frame consists of 24 consecutive 8-bit samples (one from each voice channel), preceded by a framing bit. Note that the addition of the framing bit alters the previous TDM bandwidth calculation. Since each frame consists of 24×8+1=193 bits, and frames are transmitted at 8000 frames per second, the bandwidth becomes:
(24 × 8 + 1) × 8000 = 1.544 Mbps
The framing bit follows a special pattern called the frame alignment signal, which repeats every 12 frames. The group of 12 consecutive frames bounded by this frame alignment signal is known as a superframe.
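The bandwidth arithmetic and the 193-bit frame layout can be checked with a short Python sketch. The helper names tdm_bandwidth_bps and d4_frame_bits are hypothetical, and the bit ordering within each sample (most significant bit first) is an assumption made for illustration.

```python
def tdm_bandwidth_bps(channels, bits_per_sample, sample_rate_hz, framing_bits=0):
    """Bits per frame multiplied by frames per second."""
    bits_per_frame = channels * bits_per_sample + framing_bits
    return bits_per_frame * sample_rate_hz


def d4_frame_bits(framing_bit, samples):
    """Lay out one frame: a framing bit followed by 24 eight-bit samples."""
    if len(samples) != 24:
        raise ValueError("a D4 frame carries exactly 24 samples")
    bits = [framing_bit & 1]
    for sample in samples:
        # Assumed bit order: most significant bit first.
        bits.extend((sample >> shift) & 1 for shift in range(7, -1, -1))
    return bits


payload_bps = tdm_bandwidth_bps(24, 8, 8000)               # voice samples only
t1_bps = tdm_bandwidth_bps(24, 8, 8000, framing_bits=1)    # with the framing bit
frame = d4_frame_bits(1, [0x55] * 24)
print(payload_bps, t1_bps, len(frame))
```

The difference between the two rates, 8000 bps, is exactly the cost of one framing bit per frame.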
T1 performance is easily achieved with today's technology, and the demand for greater bandwidth soon led to the introduction of other standards, embodied in the following digital signal hierarchy (DSH):
DS Level    North American Bandwidth    Voice Channels    T-Carrier
DS0         64 Kbps                     1
DS1         1.544 Mbps                  24                T1
DS2         6.312 Mbps                  96
DS3         44.73 Mbps                  672               T-3

Thus, for example, a single T-3 line supports 672 DS0 voice channels.
Since data in a frame is multiplexed, it is possible to reroute data by rearranging the time slots between incoming and outgoing channels. This is accomplished by a device known as a time-slot interchanger (TSI). FIG. 2 illustrates the operation of a TSI. As described above, a multiplexer 100 collects one sample from each of 24 incoming voice channels, at a sample rate of 8 KHz. These samples are placed in a memory buffer 102; their location in the buffer is based on the channel from which they originated. The TSI 104 rearranges the order of the samples and places the re-ordered samples in an outgoing buffer 106 (while another incoming frame is being entered into the first buffer 102). A de-multiplexer 108 then scans the outgoing buffer and assigns the samples to voice channels in a different sequence. A significant amount of memory is required for the TSI to re-sequence the time slots. If two entire frames of data must be buffered, a total of 384 bits of memory is needed. Furthermore, complex support circuitry is necessary to control the flow of data. Since both the memory and ancillary circuitry must operate at relatively high speeds, TSI modules can be costly, especially as channel capacities increase beyond T1 rates through the TSI.
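The re-ordering step can be sketched in a few lines of Python. This is a minimal illustration, not a description of any particular TSI hardware: the function name time_slot_interchange and the slot_map convention (outgoing slot index mapped to incoming slot index) are assumptions.

```python
SLOTS_PER_FRAME = 24
BITS_PER_SAMPLE = 8


def time_slot_interchange(incoming_frame, slot_map):
    """Copy samples from the incoming buffer to the outgoing buffer.

    slot_map[out_slot] gives the incoming slot whose sample is placed
    in outgoing slot out_slot (a hypothetical convention).
    """
    outgoing_frame = [0] * len(incoming_frame)
    for out_slot, in_slot in enumerate(slot_map):
        outgoing_frame[out_slot] = incoming_frame[in_slot]
    return outgoing_frame


incoming = list(range(100, 100 + SLOTS_PER_FRAME))   # sample from channel n is 100 + n
reverse_map = list(range(SLOTS_PER_FRAME - 1, -1, -1))
outgoing = time_slot_interchange(incoming, reverse_map)

# Two buffered frames, as in the text.
buffer_bits = 2 * SLOTS_PER_FRAME * BITS_PER_SAMPLE
print(outgoing[:3], buffer_bits)
```

The double buffering is what lets a new incoming frame be written while the previous one is being re-ordered, and it is also why the memory requirement scales directly with the number of channels.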
In addition to voice data, line status information may be sent over a telephone connection. Voice band signaling is a method of placing line status bits within the voice data. In the simple case, two signaling bits per channel are carried in each D4 superframe on a T1 connection to indicate the on-hook/off-hook status of a call. The so-called A-bit and B-bit used for this purpose are inserted in the least significant bit of each of the 24 time slots in the 6th and 12th frames, respectively, of the superframe. Since the signaling bits overwrite voice data, this technique is referred to as “robbed bit” signaling. An extended superframe (ESF), consisting of 24 frames, allows the addition of a C-bit (in the 18th frame) and D-bit (in the 24th frame). Alternatively, the line status information can be sent on a separate connection, by a technique known as “clear channel” signaling.
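Robbed bit signaling on a 12-frame superframe can be sketched as follows. The function names are hypothetical, and the superframe is modeled simply as a list of 12 frames of 24 eight-bit samples, indexed from zero (so the 6th and 12th frames are at indices 5 and 11).

```python
def rob_bit(sample, signaling_bit):
    """Overwrite the least significant bit of an 8-bit sample."""
    return (sample & 0xFE) | (signaling_bit & 1)


def apply_robbed_bit_signaling(superframe, a_bits, b_bits):
    """Insert A-bits in the 6th frame and B-bits in the 12th frame."""
    out = [list(frame) for frame in superframe]
    for slot in range(24):
        out[5][slot] = rob_bit(out[5][slot], a_bits[slot])
        out[11][slot] = rob_bit(out[11][slot], b_bits[slot])
    return out


def extract_signaling(superframe):
    """Read the A-bits and B-bits back out at the receiver."""
    a_bits = [sample & 1 for sample in superframe[5]]
    b_bits = [sample & 1 for sample in superframe[11]]
    return a_bits, b_bits


voice = [[0xFF] * 24 for _ in range(12)]
signaled = apply_robbed_bit_signaling(voice, a_bits=[0] * 24, b_bits=[1] * 24)
print(hex(signaled[5][0]), hex(signaled[11][0]), hex(signaled[0][0]))
```

Only one bit in one sample out of every six is altered per channel, which is why the effect on voice quality is small.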
As described above, voice signals are typically encoded using time division multiplexing (TDM) for transmission over the telephone network. However, there is an undesirable characteristic of TDM, which potentially reduces its efficiency. Under TDM, the mapping of time slots to voice channels within a frame is fixed. Consequently, a time slot allotted for a particular voice channel may go unused, if the signal source for that channel is inactive during its time slot. This typically occurs with “bursty” signals, which consist of active signal intervals separated by periods of inactivity. Significantly, normal speech is a bursty signal. With such signals, the frame may be transmitted with less than its full capacity, since many of its timeslots may contain samples collected during a period of inactivity.
An approach that overcomes this limitation is asynchronous transfer mode (ATM). ATM is a switching technology that organizes digital data into 53-byte cells for transmission over a physical medium. Each cell consists of a 5-byte header and a 48-byte payload, containing the actual data to be transmitted. Individually, a cell is processed asynchronously relative to other related cells and is queued before being multiplexed over the transmission path. ATM presents the cells (containing the voice samples) to the network whenever there is enough bandwidth available to handle them. In this sense, the voice data transfer is asynchronous relative to the generation of the original voice signal. In addition to voice, ATM supports various other types of signals and data, including video and multimedia applications. In an ATM network, data must be divided into cells before transmission and reconstituted from cells upon reception. This is known as segmentation and reassembly (SAR), and is typically handled by a hardware device (i.e., electronic circuitry).
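Segmentation and reassembly can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the same 5-byte header is attached to every cell, the final payload is padded with zero bytes, and the original length is supplied to the receiver separately. Real adaptation layers carry additional information that is not modeled here, and the function names are hypothetical.

```python
HEADER_LEN = 5
PAYLOAD_LEN = 48


def segment(data, header):
    """Split data into 53-byte cells, zero-padding the last payload."""
    if len(header) != HEADER_LEN:
        raise ValueError("header must be 5 bytes")
    cells = []
    for offset in range(0, len(data), PAYLOAD_LEN):
        payload = data[offset:offset + PAYLOAD_LEN].ljust(PAYLOAD_LEN, b"\x00")
        cells.append(header + payload)
    return cells


def reassemble(cells, original_length):
    """Strip the headers, join the payloads and discard the padding."""
    payload = b"".join(cell[HEADER_LEN:] for cell in cells)
    return payload[:original_length]


data = bytes(range(100))                      # 100 bytes of sample data
cells = segment(data, header=b"\x00" * HEADER_LEN)
print(len(cells), [len(cell) for cell in cells])
restored = reassemble(cells, len(data))
```

Because every cell has the same size, the receiver can locate each header and payload by offset alone.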
A T1 connection can directly route frames from a source to a designated destination. In contrast, ATM allows flexibility in the choice of a connection path. The 5-byte header within each 53-byte ATM cell contains a virtual path identifier (VPI) and virtual channel identifier (VCI). The VPI and VCI are used to route the cell to its intended destination. This allows the ATM switching hardware to efficiently allocate connection paths based on the level of activity in the voice channels. Because the cells are always the same size, dedicated hardware designs for high-performance ATM switches are relatively straightforward. As a result, ATM networks can operate at speeds greater than 155 Mbps.
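The following Python sketch shows how a VPI and VCI might be packed into, and read back from, a 5-byte header. The field layout used here (4-bit generic flow control, 8-bit VPI, 16-bit VCI, 3-bit payload type, 1-bit cell loss priority, 8-bit header error control) is the user-network interface layout and is not stated in the text above, so it should be read as an assumption. The header error control byte is left as a placeholder rather than computed, and the routing table is hypothetical.

```python
import struct


def pack_header(vpi, vci, gfc=0, pt=0, clp=0, hec=0):
    """Pack the header fields into 5 bytes (HEC is a placeholder, not computed)."""
    if not (0 <= vpi <= 0xFF and 0 <= vci <= 0xFFFF):
        raise ValueError("VPI or VCI out of range")
    word = (gfc << 28) | (vpi << 20) | (vci << 4) | (pt << 1) | clp
    return struct.pack(">IB", word, hec)


def unpack_routing(header):
    """Return the (VPI, VCI) pair used to route the cell."""
    word, _hec = struct.unpack(">IB", header)
    return (word >> 20) & 0xFF, (word >> 4) & 0xFFFF


# Hypothetical switch table: (VPI, VCI) on input -> output port name.
routing_table = {(5, 100): "port 3"}

header = pack_header(vpi=5, vci=100)
print(len(header), unpack_routing(header), routing_table[unpack_routing(header)])
```

A switch only needs to examine these fixed-position fields to forward a cell, which is part of what makes hardware implementations straightforward.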
Voice data formatted as ATM cells can be transmitted over a T1 connection by using a network adaptor. The network adaptor converts the 53-byte ATM cells into a sequence of samples, which are assigned to the timeslots within three frames (since each frame contains 24 bytes of data, the 53 bytes of a cell must be spread over three frames). This process can also be reversed to generate ATM cells from T1 frames. The conversion between ATM and T1 data formats can be employed to efficiently route voice traffic through the telephone network.
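The spreading of one cell across T1 frames can be sketched as below. How an actual network adaptor fills the time slots left over in the third frame is not stated in the text; this sketch assumes they are filled with an idle value, and the function name cell_to_frames is hypothetical.

```python
SLOTS_PER_FRAME = 24
CELL_BYTES = 53


def cell_to_frames(cell, idle=0x00):
    """Assign the bytes of one cell to consecutive time slots, 24 per frame."""
    frames = []
    for offset in range(0, len(cell), SLOTS_PER_FRAME):
        frame = list(cell[offset:offset + SLOTS_PER_FRAME])
        frame += [idle] * (SLOTS_PER_FRAME - len(frame))   # assumed idle fill
        frames.append(frame)
    return frames


cell = bytes(range(1, CELL_BYTES + 1))
frames = cell_to_frames(cell)
used_in_last_frame = CELL_BYTES - 2 * SLOTS_PER_FRAME
print(len(frames), used_in_last_frame)
```

Two full frames carry 48 bytes, so the remaining 5 bytes occupy part of a third frame.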
Within the telephone system network, a central office (CO) is an office local to a group of subscribers (i.e., telephone system users). Home and business lines are connected to a CO by what is called a local loop. The local loop connection is usually on a pair of copper wires called twisted pair. The voice signals from each subscriber are typically in analog form (i.e., continuous) over the local loop, but are transformed into digital data at the CO. The CO also has switching equipment that can switch calls locally or to long-distance carrier phone offices. The conversion from T1 to ATM is useful for combining a large number of voice channels to be transmitted over a long distance by a high-bandwidth link (such as optical fiber) connecting one central office to another within the telephone network.
Normal voice communications is connection-oriented. That is, a connection between the talker and the listener must be established before voice data is transmitted. In contrast, data communication networks, such as the Internet, or a local area network (LAN) in an office, are inherently connectionless. The model for such networks is that of a single communications line, shared by several nodes. Connectionless network service does not predetermine the path from the source to the destination system. Messages are sent out on the shared line in the form of packets (also known as datagrams). Each packet is directed to a particular node through the inclusion of the recipient's address in header information associated with the message. Packets must be completely addressed because different paths through the network might be selected (by routers) for different packets, based on a variety of influences. Each packet is transmitted independently by the source system and is handled independently by intermediate network devices. The connectionless mode of operation is more appropriate for many types of data communication. For example, when sending an email message out over the Internet, it would be inconvenient to require the intended recipient of the email to have previously established a connection channel through which to receive the email.
Voice data may be formatted to allow transmission over a connectionless network by segmenting the data into appropriate-sized frames, prefixed with the required header information. This conversion is termed data encapsulation. Data encapsulation could be necessary, for example, at the interface between the public switched telephone network (PSTN) and an optical fiber-based LAN. The Transmission Control Protocol/Internet Protocol (TCP/IP) suite, described below, may be used for the encapsulation and delivery of voice data over a connectionless network. The function of the various protocols in the TCP/IP suite may be understood with reference to the following open systems interconnect (OSI) 7-layer model.
OSI LAYER                  EXAMPLES
(7) APPLICATION LAYER      Email, File Transfer, Web Applications
(6) PRESENTATION LAYER     HTTP, FTP, Telnet
(5) SESSION LAYER          POP3, IMAP
(4) TRANSPORT LAYER        TCP, UDP, RTP
(3) NETWORK LAYER          IP, ATM
(2) DATA LINK LAYER        Ethernet, SLIP, MAC
(1) PHYSICAL LAYER         ADSL, Coaxial Cable
In the OSI model, the process of communication between two computers connected by a telecommunication network is divided into layers. When a message is transmitted from one computer to the other it passes down through the various layers on the sender's side of the network, and back up through the protocol layers when it is received at the receiver's side.
(1) The Physical Layer is the lowest level of the OSI model, and the protocols here define the actual physical medium for the transport of a bit stream from one point in the network to another.
(2) The Data-Link Layer defines the access strategy for the physical medium, and pertains to hardware devices such as network interface cards (NICs), routers and bridges.
(3) The Network Layer governs the routing and forwarding of data through the network.
(4) The Transport Layer provides error-checking and ensures that all the data sent have been received at the destination.
(5) The Session Layer coordinates exchanges between two computers over the network to ensure that the connection is preserved until the transaction is completed.
(6) The Presentation Layer, usually part of an operating system, is the point at which data sent is rendered into a format usable by the recipient (e.g., transformation of a byte stream into a displayable image).
(7) The Application Layer is the layer at which network-oriented applications programs reside. These applications are the ultimate target of the message transmitted by the sender.
The IP is a Layer 3 protocol, most familiar as the protocol by which data is sent from one computer to another on the Internet. Each computer (known as a host) on the Internet has at least one IP address that uniquely identifies it from all other computers on the Internet. When data is sent or received (for example, an e-mail note or a Web page), the message gets divided into packets, each of which contains both the sender's and the receiver's Internet address. Packets are first sent to a gateway computer that directly accesses a small neighborhood of Internet addresses. If the destination address is not directly accessible to the gateway computer, it forwards the packet to an adjacent gateway. This process continues until one gateway recognizes the packet as belonging to a computer within its immediate neighborhood or domain. That gateway then delivers the packet directly to the computer whose Internet address is specified.
IP is a connectionless protocol, which means that there is no continuing connection between the end points that are communicating. Each packet that travels through the Internet is treated as an independent unit of data without any relation to any other unit of data. Consequently, the packets comprising a message may take different routes across the Internet. Furthermore, packets can arrive in a different order than that in which they were sent. The IP accounts for their delivery to the correct recipient, but does not manage the delivery sequence. In the context of the Internet, the Layer 4 Transmission Control Protocol (TCP) is generally relied upon to arrange the packets in the right order, and the two protocols are often jointly referred to as TCP/IP. An alternative to TCP (also at Layer 4) is the User Datagram Protocol (UDP), which offers a limited amount of service when messages are exchanged between computers in an IP-based network. Like TCP, UDP uses the IP to actually get a packet from one computer to another. Unlike TCP, however, UDP does not provide the service of dividing a message into packets and reassembling it at the other end. However, UDP does provide port numbers to help distinguish different user requests and, optionally, checksum capability to verify that the data arrived intact. UDP is used by applications that do not require the level of service of TCP or that wish to use communications services not available from TCP.
Realtime transport protocol (RTP) is an IP-based protocol providing support for the transport of real-time data such as video and audio streams. A Layer 4 protocol, RTP provides time-stamping, sequence numbering and other mechanisms related to managing timing issues in such data. The sender creates a timestamp when the first voice signal sample in a packet is collected, and this timestamp is then attached to the data packet before sending it out. The receiver may use this information to assemble the packets in their correct sequence, or to synchronize one packetized data stream with another—for example, in the case of transmitted audio and video data from a movie. RTP also provides other services, such as source identification. Using the source identifier in the RTP header of an audio packet exchanged during a video conference, for example, a user can identify who is speaking.
Information required by each protocol is contained in a header attached to a data packet as it makes its way through the network. Header information associated with the protocols at different OSI layers can be nested. For example, data sent from an application may begin as an RTP packet:
[ RTP header | voice data ]
As the packet moves down through the OSI layers to be transmitted over the physical medium, a UDP header is prepended, followed by an IP header:
[ IP header | UDP header | RTP header | voice data ]
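The nesting of headers can be illustrated with a Python sketch that builds each header in turn and prepends it to the data from the layer above. The field layouts are the standard fixed RTP, UDP and IPv4 header formats, which the text above does not spell out. As simplifications, both checksums are left at zero and no IP options are used; the port numbers, addresses and source identifier are arbitrary example values.

```python
import struct


def rtp_header(seq, timestamp, ssrc, payload_type=0):
    """12-byte RTP header: version 2, no padding, no extension, no CSRC list."""
    return struct.pack(">BBHII", 2 << 6, payload_type & 0x7F, seq, timestamp, ssrc)


def udp_header(src_port, dst_port, payload_len):
    """8-byte UDP header; a checksum of 0 means it was not computed."""
    return struct.pack(">HHHH", src_port, dst_port, 8 + payload_len, 0)


def ipv4_header(src, dst, payload_len):
    """20-byte IPv4 header carrying UDP (protocol 17); checksum left at 0 here."""
    version_ihl = (4 << 4) | 5
    return struct.pack(">BBHHHBBH4s4s", version_ihl, 0, 20 + payload_len,
                       0, 0, 64, 17, 0, src, dst)


voice = bytes(160)   # 160 eight-bit samples, i.e. 20 ms of voice at 8 KHz
rtp_packet = rtp_header(seq=1, timestamp=160, ssrc=0x1234) + voice
udp_datagram = udp_header(5004, 5004, len(rtp_packet)) + rtp_packet
ip_packet = ipv4_header(bytes([192, 0, 2, 1]), bytes([192, 0, 2, 2]),
                        len(udp_datagram)) + udp_datagram
print(len(rtp_packet), len(udp_datagram), len(ip_packet))
```

In this example 40 bytes of header accompany 160 bytes of voice data, which gives a sense of the overhead that header preparation adds to each packet.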
It is often necessary to transform voice from the connection-oriented frame-based TDM format used by the PSTN to a connectionless cell-based format, such as that used by ATM, or a packetized format such as used by an Ethernet network. A significant effort in such transformations is devoted to the preparation and attachment of header information to the data.
In addition to the various formatting operations described above, which are required to prepare voice data for transmission over the telephone system or a network, there is considerable signal processing involved in voice communications.
To reduce the amount of bandwidth required for their transmission over the telephone network, the dynamic range of voice signals is generally reduced, using one of various standard compression algorithms. At the receiving end, the original dynamic range of the signal is restored by a complementary expansion algorithm. The dynamic range of an audio signal is the difference between the loudest part of the signal and the quietest part. When an analog signal is translated into digital samples (a process known as quantizing, or digitizing the analog signal), the continuous range of values comprising the signal is approximated by a finite set of discrete values. The resolution of the sampling process refers to the number of discrete values used in this approximation, and the discrete value closest to the actual value of the analog signal is always used to approximate the signal. For example, assume that an analog signal always has a value VS within the range of 0.0-4.0 Volts, and that 4 equally-spaced discrete values (0.5, 1.5, 2.5 and 3.5) are used to represent the signal. The following chart illustrates how closely the discrete values approximate the analog signal:
ACTUAL ANALOG VOLTAGE      DISCRETE VALUE APPROXIMATION
0 ≦ Vs < 1.0 Volts         0.5 Volts
1.0 ≦ Vs < 2.0 Volts       1.5 Volts
2.0 ≦ Vs < 3.0 Volts       2.5 Volts
3.0 ≦ Vs < 4.0 Volts       3.5 Volts
The worst-case error with this approximation is ±0.5 Volts, which is equivalent to 12.5% of the full dynamic range of the signal. However, if the same dynamic range is divided into a larger number of discrete values, the accuracy of the approximation can be greatly improved. For example, the following chart illustrates the improvement in worst-case error if 256 discrete values are used instead of 4:
ACTUAL ANALOG VOLTAGE               DISCRETE VALUE APPROXIMATION
0 ≦ Vs < 0.015625 Volts             0.0078125 Volts
0.015625 ≦ Vs < 0.03125 Volts       0.0234375 Volts
. . .                               . . .
3.984375 ≦ Vs < 4.0 Volts           3.9921875 Volts
In this case, the worst-case error becomes ±0.0078125 Volts, which is half of the 0.015625 Volt spacing between adjacent discrete values. Thus, using a large number of distinct values results in a more accurate approximation. However, it also requires that more bits be used to represent each sample. By reducing the dynamic range, fewer bits can be used to obtain a given worst-case error. Uncompressed audio is commonly digitized with 16-bit resolution. At a sampling rate of 8 KHz this results in a required bandwidth of 128 Kbps. By compressing the dynamic range so that 8-bit samples can be used, the bandwidth requirement is only 64 Kbps.
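The relationship between the number of discrete values, the spacing between them and the worst-case error can be checked with a short Python sketch. The function name make_quantizer is hypothetical; the sketch assumes equally spaced values placed at the midpoint of each interval, as in the charts above, so that the worst-case error is half the spacing.

```python
def make_quantizer(v_min, v_max, levels):
    """Return the spacing between discrete values and a quantizing function."""
    step = (v_max - v_min) / levels

    def quantize(v):
        # Find the interval containing v; clamp so v_max maps to the top interval.
        index = min(int((v - v_min) / step), levels - 1)
        return v_min + (index + 0.5) * step       # midpoint of that interval

    return step, quantize


step4, quantize4 = make_quantizer(0.0, 4.0, 4)
step256, quantize256 = make_quantizer(0.0, 4.0, 256)

print(step4, step4 / 2, quantize4(2.7))            # spacing, worst-case error, example
print(step256, step256 / 2, quantize256(0.01))

# Bandwidth of a single voice channel at 8 KHz for two sample sizes.
print(16 * 8000, 8 * 8000)
```

Each additional bit of resolution doubles the number of discrete values and halves the worst-case error, which is the trade-off that dynamic range compression is intended to ease.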
Dynamic range compression results in a loss of information. That is to say, a signal that has been compressed and restored will not exactly match the original signal. However, the standard compression algorithms for voice communications are designed so that these losses are not noticeable. Two of the more widely used compression techniques are the μ-law and adaptive differential pulse code modulation (ADPCM) compression algorithms. μ-law compression is based on sampling a logarithm of the analog signal, rather than the signal itself. The logarithm has a narrower range of values than the raw signal. For example, if the voltage of the uncompressed voice signal has a maximum value of 10.0, its base-10 logarithm has a maximum value of only 1.0. Routines exist for highly efficient calculation of logarithms by digital logic (e.g., a computer or dedicated signal processor). Advantageously, the original signal can be recovered from the logarithm by computing the complementary antilogarithm.
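Logarithmic compression and its complementary expansion can be sketched in Python using the continuous μ-law formula. The value μ = 255 and the normalization of the signal to the range -1.0 to 1.0 are assumptions not stated above, and the 8-bit encoding shown (a signed code from -127 to 127) is a simplification for illustration rather than the encoding used by any particular standard.

```python
import math

MU = 255.0   # assumed compression parameter


def compress(x):
    """Map a sample in [-1, 1] onto a logarithmic scale in [-1, 1]."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def expand(y):
    """Complementary antilogarithm: recover the sample from its compressed value."""
    return math.copysign(((1.0 + MU) ** abs(y) - 1.0) / MU, y)


def encode(x):
    """Quantize the compressed value to a signed code in -127..127."""
    return round(compress(x) * 127)


def decode(code):
    return expand(code / 127)


quiet = 0.01                                   # a low-amplitude sample
companded_error = abs(decode(encode(quiet)) - quiet)
linear_error = abs(round(quiet * 127) / 127 - quiet)
print(companded_error, linear_error)
```

For a quiet sample, quantizing the logarithm gives a much smaller error than quantizing the raw value with the same number of codes, at the cost of coarser resolution for loud samples.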
ADPCM is an enhanced form of differential pulse code modulation (DPCM). Instead of quantizing the signal itself, DPCM quantizes the difference between successive samples of the signal, hence the use of the term “differential”. ADPCM is an adaptive version of this technique, in which assumptions based on the previous quantized sample are used to restrict the presumed range of values for the next sample. This algorithm uses only 4 bits to represent each sample of a voice signal.
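The differential idea can be sketched in Python with a plain DPCM coder. Note that this is not ADPCM: the step size here is fixed, whereas an adaptive coder would adjust it from sample to sample. The function names, the step size of 16 and the example samples are all assumptions made for illustration.

```python
def dpcm_encode(samples, step=16, bits=4):
    """Encode each sample as a small signed code for its change from the prediction."""
    low, high = -(1 << (bits - 1)), (1 << (bits - 1)) - 1    # -8..7 for 4 bits
    predicted = 0
    codes = []
    for sample in samples:
        code = max(low, min(high, round((sample - predicted) / step)))
        codes.append(code)
        # Track what the decoder will reconstruct, so errors do not accumulate.
        predicted += code * step
    return codes


def dpcm_decode(codes, step=16):
    """Rebuild the signal by accumulating the decoded differences."""
    value = 0
    decoded = []
    for code in codes:
        value += code * step
        decoded.append(value)
    return decoded


samples = [0, 10, 40, 90, 120, 100, 60, 0]
codes = dpcm_encode(samples)
decoded = dpcm_decode(codes)
print(codes)
print(decoded)
```

Because neighboring voice samples tend to be similar, the differences span a much smaller range than the samples themselves, which is why so few bits per sample can suffice.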
In addition to compression, voice signals over the telephone network may also require echo cancellation. Echoes result from the reflection of electrical signals (usually arising from mismatched impedances) back to the sender from the receiving end of the line. The severity of the echo is related to the transit time for the signal to travel from the sender to the receiver, which in turn depends on the electrical path length between sender and receiver. If the distance is short, the echo returns so quickly that it is not perceptible. On the other hand, if the path length is sufficient to create a delay in the echo on the order of 10 ms or more, the effect can be disruptive to normal telephone conversations. In fact, the “threshold of annoyance” for echo is also related to the loudness of the echo in relation to the primary voice signal. Consequently, even if echo cannot be completely eliminated, by reducing its amplitude sufficiently it can be rendered unobtrusive.
Echo canceling algorithms are typically implemented using a digital signal processor (DSP), which is a high-speed microprocessor specially adapted for numeric computation. Such algorithms are typically adaptive—that is, they are able to quickly “learn” the transmit/receive characteristics of the line during the first few seconds after a connection is made, and also respond to any changes in those characteristics while the line is active. Over the course of a conversation, the DSP monitors the digitized voice signals being transmitted and predicts the corresponding echo signal. The predicted echo is simply subtracted from the actual return signal.
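The predict-and-subtract operation can be sketched in Python with a normalized least-mean-squares (NLMS) adaptive filter. NLMS is one common choice of adaptive algorithm and is used here as an assumption; the text above does not name a specific algorithm. The echo path, the filter length and the step size are hypothetical, and for simplicity the return signal contains only echo, with no speech from the other party.

```python
import random


def nlms_echo_canceller(far_end, returned, taps=8, mu=0.5, eps=1e-8):
    """Adaptively predict the echo of far_end and subtract it from returned."""
    weights = [0.0] * taps
    history = [0.0] * taps            # recent transmitted samples, newest first
    residual = []
    for x, d in zip(far_end, returned):
        history = [x] + history[:-1]
        predicted_echo = sum(w * h for w, h in zip(weights, history))
        error = d - predicted_echo    # what remains after subtraction
        residual.append(error)
        # Adjust the weights in proportion to the remaining error ("learning").
        norm = sum(h * h for h in history) + eps
        weights = [w + mu * error * h / norm for w, h in zip(weights, history)]
    return residual, weights


random.seed(0)
far_end = [random.uniform(-1.0, 1.0) for _ in range(3000)]
echo_path = [0.0, 0.0, 0.5, -0.3, 0.1]        # hypothetical line response
returned = [
    sum(echo_path[k] * far_end[n - k] for k in range(len(echo_path)) if n - k >= 0)
    for n in range(len(far_end))
]

residual, weights = nlms_echo_canceller(far_end, returned)
echo_energy = sum(v * v for v in returned[-500:])
residual_energy = sum(v * v for v in residual[-500:])
print(echo_energy, residual_energy)
```

After the initial learning period the filter weights approximate the echo path, so the residual echo is far weaker than the uncancelled echo. Since the weights continue to be updated on every sample, the same loop also follows changes in the line characteristics during a call.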
The various switching, formatting and signal processing operations performed on voice data have necessitated the use of extensive electronic circuitry in the central offices and other nodes within the telephone network. Furthermore, because of their specialized nature, these operations are generally handled independently by discrete rack-mount circuit cards and modules. Unfortunately, this has led to a proliferation of electronic devices to deal with large numbers of incoming and outgoing lines. The consumption of power and space attributable to these devices is a serious problem. Excessive heat generation and its impact on system reliability are a further concern.
In view of these problems, it would be desirable to have a single device integrating many of the functions described above. The device should support compression and echo canceling signal processing functions. It should also be capable of segmenting data and providing headers to allow translation of frame-based and/or cell-based data formats, such as RTP packets or ATM cells. In addition, the device should be able to perform time slot interchange on incoming and outgoing TDM data, and should allow detection of both voice band and clear channel signaling. Furthermore, the device power consumption should be low, to mitigate heat dissipation problems associated with multi-device installations.