This invention relates to distributed modem pooling techniques for maximizing resource utilization in a packet-switched communication network with integrated telephony services, and more particularly, for maximizing resource utilization of the network during speech/voice processing.
Voice-over-IP (VoIP) is an emerging technology where packet-switched networks employ the Internet Protocol (IP) to offer telephony services. Examples of such networks are the Internet and private IP-based corporate local area and wide area networks (LANs and WANs). The main advantage of using this technology over Plain Old Telephone Service (POTS) is the cost savings obtained from its reduced bandwidth requirements. Unlike POTS/PSTN, where a continuously available circuit-switched DS0 (64 kbps) connection is dedicated to the call for its entire life, VoIP shares the network bandwidth with other information types like data and video. This is possible because speech in a VoIP network typically travels (is processed) at the low rate of either 6.3 kbps or 5.3 kbps. Therefore, for example, a corporation can use a portion of the bandwidth of its existing IP network to offer VoIP to its employees and, consequently, do away with their traditional phone services and their associated costs.
VoIP achieves its bandwidth efficiency through the use of two techniques: Speech Compression and Discontinuous Transmission (DTX). The former employs source encoding methods to represent the sampled voice signal in compressed form that can be decoded and decompressed later at the receiving end. The compression ratio achieved by this technique can be as high as 24:1, resulting in tremendous bandwidth reduction. In Discontinuous Transmission, bandwidth reductions are achieved by detecting silence in the phone conversation and, in response, either shutting down the transmitter on the non-speaking end or sending smaller frames than the regular speech ones. With DTX, the bandwidth can theoretically be halved since, ideally, in conversations one person is talking and the other is listening.
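The bandwidth figures above can be illustrated with a back-of-the-envelope calculation. The following is a sketch only; the 128 kbps PCM input rate and the 5.3 kbps coder rate are the figures used later in this description, and the halving reflects the ideal DTX case in which exactly one party is speaking at a time:

```python
# Back-of-the-envelope bandwidth savings from compression and DTX.
PCM_RATE_KBPS = 128        # 8 kHz x 16-bit linear PCM input
CODER_RATE_KBPS = 5.3      # low-rate VoIP coder output

# Speech compression: input rate over output rate.
compression_ratio = PCM_RATE_KBPS / CODER_RATE_KBPS
print(round(compression_ratio))   # roughly 24, i.e., a 24:1 ratio

# Ideal DTX: one party talks while the other listens, so each
# direction transmits only about half the time.
dtx_rate_kbps = CODER_RATE_KBPS / 2
print(dtx_rate_kbps)
```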
FIG. 1 conceptually illustrates the basic architecture of a typical VoIP WAN network 10 like the Internet. An IP network 15 forms the backbone of the WAN network. Telephone internet gateways 20, 25, 30 connected to the IP network provide the telecommunications interface infrastructure which allows the exchange of information (such as video, text, and/or speech) in the form of one or more recognized protocols between telephony-capable peripheral devices attached thereto. For example, one or more personal computers 35 connected to an associated telephone gateway 20 (using a modem 36 or an equivalent device) are coupled to exchange information (including speech/voice information) therebetween or with other telephony devices (41-44) connected to the network 10, over, for example, the Public Switched Telephone Network (PSTN) 40, via their associated telephone gateways 25, 30. (Telephony devices 41-44 can include both wireless as well as wireline devices interfaced thereto in a conventional manner over the PSTN or a dedicated telephony-based network.) The PSTN or like interface network receives and converts speech data into digital samples which are then communicated to the associated internet telephony gateway.
All digital processing of speech samples and associated control tasks are done by special hardware in each telephone gateway in response to appropriate analog (e.g., from modem 36) or digital (e.g., DS0 signals from the PSTN) speech samples originating in the form of spoken speech from such devices as PC 35 and telephony devices 41-44. The spoken speech samples are received via actual physical layer links 37 connecting the outside world to the telephone gateway's internal speech processing hardware. A diagrammatic depiction of such hardware is shown in FIG. 2. The hardware includes a pulse code modulated (PCM) sample I/O handler 50, auto gain control (AGC) and echo cancellation devices 55, a voice coder/decoder (CODEC) 60, a line coder 65 and an IP network interface device 70 configured to operate in a known fashion. A typical input to voice CODEC 60, for example, might be an 8 kHz 16-bit linear PCM signal which corresponds to a data rate of 128 kbps (8000×16). In one implementation, CODEC 60 operates on blocks of 240 input samples 56 called frames, each frame having a duration of 30 msec (240/8000). The output of CODEC 60 is at one of two bit-rates: 6.3 kbps or 5.3 kbps. The higher rate has greater quality and the lower rate is, obviously, more bandwidth efficient. At 6.3 kbps, the length of the output codeword is 189 bits (0.03 sec×6300 bps), and at 5.3 kbps it is 159 bits. The highest compression ratio is, therefore, achieved by the 5.3 kbps coder, where the input 128 kbps PCM signal is compressed 24 fold.
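The frame and codeword arithmetic above can be verified with a short sketch; the constants are the values given in the text, and the variable and function names are illustrative:

```python
# Frame and codeword arithmetic for the CODEC described above.
SAMPLE_RATE_HZ = 8000        # 8 kHz linear PCM input
BITS_PER_SAMPLE = 16
FRAME_SAMPLES = 240          # one frame = 240 input samples

input_rate_bps = SAMPLE_RATE_HZ * BITS_PER_SAMPLE      # 128 kbps input
frame_duration_s = FRAME_SAMPLES / SAMPLE_RATE_HZ      # 30 msec per frame

def codeword_bits(output_rate_bps):
    """Bits in one output codeword at a given coder bit-rate."""
    return round(frame_duration_s * output_rate_bps)

print(input_rate_bps)        # 128000
print(frame_duration_s)      # 0.03
print(codeword_bits(6300))   # 189 bits at the 6.3 kbps rate
print(codeword_bits(5300))   # 159 bits at the 5.3 kbps rate
```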
A block diagram of the operational processing logic of CODEC 60 is shown in FIG. 3. Discontinuous transmission (DTX) and silence compression of input samples 56 are handled by a Voice Activity Detector (VAD) 61 and by a Comfort Noise Generator (CNG) 62. The VAD 61 reliably detects the presence or absence of speech and conveys that information to the CNG 62. Although this information is passed on a frame by frame basis, the determination of the presence or absence of speech is made over multiple successive frames.
The CNG 62 creates a noise signal that matches the actual background noise. It essentially computes and encodes parameters that can be used at the receiving end to synthesize this artificial noise. These parameters constitute the Silence Descriptor (SID) frames 63, which use fewer bits (40) than the normal speech frames and are transmitted during inactive periods. This transmission, however, is not periodic. That is, for each inactive (non-speech) frame the CNG 62 decides whether or not to send a SID frame 63 based on variations of the power spectrum of the background noise. As long as this spectrum remains relatively unchanged, no SID frames 63 are sent and the system's transmitting modules remain idle. At the receiving end, on the other hand, the decoder always uses the last SID frame 63 received to generate the silence comfort noise.
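The SID transmission decision described above can be sketched as follows. The spectral-distance measure and the threshold are illustrative assumptions only; the text does not specify how the CNG measures "variations of the power spectrum":

```python
def should_send_sid(current_spectrum, last_sid_spectrum, threshold=0.1):
    """Send a new SID frame only if the background-noise power
    spectrum has changed appreciably since the last SID was sent.
    The Euclidean distance and threshold value are illustrative."""
    distance = sum((c - p) ** 2
                   for c, p in zip(current_spectrum, last_sid_spectrum)) ** 0.5
    return distance > threshold

# During inactive frames, a SID is transmitted only on change;
# otherwise the transmitter stays idle and the receiver keeps
# synthesizing comfort noise from the last SID frame received.
print(should_send_sid([0.2, 0.3], [0.2, 0.3]))   # False: spectrum unchanged
print(should_send_sid([0.9, 0.1], [0.2, 0.3]))   # True: spectrum changed
```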
In a typical CODEC, a speech coder 64 processes the speech portion of the PCM samples output from VAD 61 to generate, using appropriate coding algorithms, encoded speech frames 63′. This processing dominates the horsepower requirements of a CODEC. A typical processing load for coding algorithms might be 20 million instructions per second (MIPS). Comparatively, the requirements of the comfort noise generator 62 are negligible and do not exceed 1 MIPS. (The VAD 61 is immaterial to this discussion since it is common to the generation of both SID frames 63 and encoded speech frames 63′.)
A CNG 1 MIPS estimation represents normal operations where SID frames 63 are not being continuously built and sent. For peak processing estimations, however, one might assume that the background noise's power spectrum is constantly changing and new SID frames must be computed and sent continuously. Exact numbers for this peak condition are not available but, nonetheless, one can safely estimate that they do not exceed a very generous 5 MIPS. To verify this, one can use the ratio of the number of bits in a SID frame 63 to the number of bits in a regular speech frame as a rough indicator. The minimum such ratio is for a 5.3 kbps coder, which is 40/159=0.251. Multiplying this number by 20 MIPS gives 5.03 MIPS as the horsepower needed to compute a SID frame 63. The 20 and 5 MIPS estimates produce a 4:1 ratio between full-time and idle-time frame generation processing. (This very rough estimation assumes that SID frame generation is as complex as encoded speech frame generation and that the relationship between number of bits and MIPS is linear, which is not remotely the case. Nonetheless, given that the generation of comfort noise parameters is far less complex than coding, 5 MIPS can be taken as an adequate estimate to show the difference in bandwidth resource utilization between full-time speech processing (when both encoded speech frames and SID frames are being generated) and idle-time processing (when only SID frames are being generated).)
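The rough estimate above reduces to two lines of arithmetic, reproduced here as a sketch using only the figures given in the text:

```python
# Rough peak-load estimate for SID-frame generation, per the text.
SPEECH_MIPS = 20          # horsepower of the speech coding algorithms
SID_BITS = 40             # bits in a SID frame
SPEECH_BITS = 159         # bits in a 5.3 kbps speech frame (minimum ratio)

# Scale coder MIPS by the SID-to-speech bit ratio (40/159 = 0.251).
sid_mips_estimate = SPEECH_MIPS * SID_BITS / SPEECH_BITS
print(round(sid_mips_estimate, 2))   # 5.03 MIPS

# Full-time vs idle-time processing, using the 20 and 5 MIPS figures.
print(SPEECH_MIPS / 5)               # 4.0, i.e., a 4:1 ratio
```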
The inventors have determined that what is needed are ways to better manage processing resources during multiple VoIP call sessions by freeing up (or reassigning) resources during idle-time processing. Of course, paramount to any efficient alternative solution is that the VoIP call session's physical link remain substantially uninterrupted and that no significant impairment of the call is noticeable by the participants.
The invention provides a system and method for managing resources in a distributed Voice-over-IP (VoIP) speech coder pool system arrangement including multiple speech coders, each comprised of a first number of front-end modules and a second number of back-end modules. Each call session has assigned thereto a front-end module and a back-end module cooperatively functioning as a speech coder. In the preferred embodiment, voice samples are passed through the assigned front-end module to the back-end module, where they are encoded and placed as speech frames on the IP network. As soon as absence of speech is detected by a Voice Activity Detector (which may or may not be shared by multiple front-end modules), processing is handed over to the front-end module, freeing the back-end module to sit idle and thereby reducing power consumption. In an alternate embodiment, the freed-up back-end module is reassigned to a new VoIP call session for maximum resource utilization.
Each front-end module is configured for keeping alive the physical layer link between the telephony user and the telephony gateway during non-speech portions of the call, while at the same time allowing speech processing to be concentrated in hardware strategically located in the back-end modules. In this manner, the call session is allowed to be kept active in a reduced rate mode throughout the duration of the idle call session. The reduced rate idle-mode requires less processing capability than the high rate data-mode and can be maintained by front-end modules of lower capability and lower cost. Because the processing of the idle data is now handled by the front-end module, the back-end module may be reassigned to another call session at that point; or, to the extent the front-end to back-end link is maintained, the back-end processing demands are reduced, which may allow the back-end module to provide service to another call session. When speech data appear on the subscriber link, a negotiation may be performed to bring any of the available high-capability back-end modules back on line.
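The handoff and reassignment scheme described above can be sketched in software, with the caveat that the class and method names here are hypothetical and the actual modules are hardware; the sketch shows only the pooling logic, assuming one back-end module is attached per active talker:

```python
class BackEndPool:
    """Pool of high-capability back-end coder modules."""
    def __init__(self, size):
        self.free = list(range(size))

    def acquire(self):
        # Return an available back-end module id, or None if all busy.
        return self.free.pop() if self.free else None

    def release(self, module_id):
        self.free.append(module_id)

class CallSession:
    """One VoIP call: the front-end keeps the physical link alive;
    a back-end module is attached only while speech is present."""
    def __init__(self, pool):
        self.pool = pool
        self.back_end = pool.acquire()      # speech mode at call setup

    def on_silence(self):
        # VAD reports absence of speech: the front-end takes over
        # idle processing and the back-end returns to the pool.
        if self.back_end is not None:
            self.pool.release(self.back_end)
            self.back_end = None

    def on_speech(self):
        # Speech reappears: negotiate an available back-end on line.
        if self.back_end is None:
            self.back_end = self.pool.acquire()

pool = BackEndPool(size=1)
a = CallSession(pool)          # takes the only back-end module
b = CallSession(pool)          # none left: b idles on its front-end
a.on_silence()                 # a's back-end is freed...
b.on_speech()                  # ...and can now serve session b
print(b.back_end is not None)  # True
```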
In another preferred embodiment, the front-end modules are provided with sufficient processing power, memory and programmed functionality to maintain more than one subscriber link active, at least during idle processing of such links.
In yet another embodiment, the front-end modules and back-end modules are each fully functional encoder devices capable of operating as either front-end devices or back-end devices, on an as needed basis.
In another preferred embodiment, the front-end modules are connected to the public switched telephone system via DS0 circuits, which may be delivered physically as DS1, DS3, or any other commonly provisioned organization of circuits.
In another preferred embodiment, the front-end modules are connected via an appropriate voice-band or broadband type modem link to the call participant.
The details of one or more embodiments of the invention are set forth in the accompanying drawings and the description below. Other features, objects, and advantages of the invention will be apparent from the description and drawings, and from the claims.